Ticket #420 (closed defect: wontfix)

Opened 4 years ago

Last modified 4 years ago

Varnish Downtime

Reported by: chris
Owned by: chris
Priority: major
Milestone:
Component: Live server
Keywords:
Cc: ed, laura, jim
Estimated Number of Hours: 0.0
Add Hours to Ticket: 0
Billable?: yes
Total Hours: 1.3

Description (last modified by chris) (diff)

Varnish fell over early this morning, this is the entry in /var/log/messages:

Oct  9 02:55:30 quince varnishd[8635]: Child (2879) died signal=3
Oct  9 02:55:54 quince varnishd[8635]: child (11098) Started
Oct  9 02:56:24 quince varnishd[8635]: Child (11098) said Child starts
Oct  9 02:56:27 quince varnishd[8635]: Child (11098) said SMF.s0 mmap'ed 536870912 bytes of 536870912
Oct  9 02:56:27 quince varnishd[8635]: Child (11098) said Child dies
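
When the child dies like this it sometimes leaves a panic message behind; a quick way to check is via the management CLI (a minimal sketch -- the -T address below is a placeholder, use whatever varnishd was actually started with):

varnishadm -T 127.0.0.1:6082 panic.show    # prints the last child panic, if there was one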

Change History

comment:1 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.4
  • Total Hours changed from 0.0 to 0.4

comment:2 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.3
  • Total Hours changed from 0.4 to 0.7

Looking at https://www.varnish-cache.org/docs/3.0/tutorial/troubleshooting.html#varnish-is-crashing, this could be the cause, and a fix:

Specifically if you see the "Error in munmap" error on Linux 
you might want to increase the amount of maps available. 
Linux is limited to a maximum of 64k maps. Setting 
vm.max_max_count i sysctl.conf will enable you to increase 
this limit. 

I tried adding this to /etc/sysctl.conf:

vm.max_max_count = 256000

And ran:

sysctl -p

But it returned:

error: "vm.max_max_count" is an unknown key

More research required...

comment:3 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.2
  • Status changed from new to accepted
  • Total Hours changed from 0.7 to 0.9

The problem was that it should be:

vm.max_map_count = 256000

Have tried this, also this ticket could be related:

https://www.varnish-cache.org/trac/ticket/1119

I think we should probably leave this for now and see if it happens again.
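
For the record, a quick way to confirm the corrected key actually took effect (the value is the one used above):

sysctl -p                  # reload /etc/sysctl.conf
sysctl vm.max_map_count    # should now report vm.max_map_count = 256000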

comment:4 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.4
  • Total Hours changed from 0.9 to 1.3

I had a good chat with people in the #varnish IRC channel and everybody thought the load spike before Varnish stopped was the key, and that more RAM would help. Some comments (a sketch of the suggested checks follows the log):

14:08 < s> it must have been thrashing like crazy at the time
14:08 < s> chrisc: a full 24 seconds from dead child to restart? Buy a bigger server
14:10  * chrisc there was a load spike https://kiwi.transitionnetwork.org/munin/webarch.net/quince.webarch.net/load.html
14:11 < s> people have been reporting weird stuff with transparent hugepages
14:11 < s> I assume your syslog has "Child not responding to CLI, killing it" too?
14:11 < s> https://www.varnish-cache.org/trac/ticket/1054
14:13 < chrisc> no it doesn't, varnish died at about 3am and i restarted it manually at around 8:15 am
14:13 < s> chrisc: no lines before 2:55:30?
14:13 < chrisc> no
14:14 < chrisc> Oct  9 01:52:02 quince varnishd[8635]: CLI telnet 127.0.0.1 39594 127.0.0.1 81 Wr 200
14:14 < chrisc> that's the one before
14:14 < s> "steal" CPU...something on the physical host starved your varnish vm
14:14 < s> maybe the syslog about child not responding got lost due to the same load
14:15 < chrisc> i think a lot of rsync backup jobs would have been running at that time and disk io will have been an issue
14:15 < chrisc> between the line i just pasted and the ones on the ticket it's just drupal messages
14:17 < s> maybe somebody else can chime in, but to me this looks like the varnishd master process killing the child
14:18 < s> because the child doesn't ping
14:18 < chrisc> the server was getting hammered https://kiwi.transitionnetwork.org/munin/webarch.net/quince.webarch.net/cpu.html
14:18 < chrisc> by the looks of the munin cpu stats
14:19 < s> and then varnishd tries to restart the child, but isn't able to communicate with it during startup because of the load on the machine
14:19  * chrisc nods
14:20 < chrisc> s: thanks, i guess i need to track down the cause of the load spike :-)
14:20 < s> so buy bigger hardware, or don't virtualize varnish with something that steals the resources
14:28 < D> considering my 3 year old laptop has more than 3GB of RAM
14:29 < D> (it has 8GB)
14:29 < D> and a decent beast of a server these days has 256GB of RAM (still only 2U and well under 20K total)
14:29 < D> I'd say get bigger hardware :)
14:31 < D> I'm not a big fan of solving all problems with more hardware
14:31 < U> we just upped a bunch of servers with 64G more RAM each, because it is the cheapest option
14:31 < D> but this is just a no-brainer
14:31 < U> even server-ram is cheap cheap cheap now
14:31 < D> yup
14:31 < D> I was surprised at how little the 256G costs us
14:33 < D> mysql & apache with ssl, running php and drupal?
14:33 < D> large site, plenty of pictures?
14:33 < chrisc> yeah with varnish on port 80 doing reverse proxy
14:33 < chrisc> yeah, large site
14:33 < D> all that on a single 3GB VM?
14:33 < D> they're daft
14:33 < chrisc> authenticated users on 443 only
14:34 < D> I mean, you can run it with that little memory
14:34 < D> but it's not going to be fast
14:34 < D> ever
14:37 < D> but honestly, if were hardware instead of VM, I'd just tell them to slam 4x4GB sticks in the server
14:37 < D> should cost less than a half a day of your time
14:37 < D> :P
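
To follow up on the transparent hugepages and CPU steal points raised above, a couple of quick checks on the guest (a sketch, assuming a reasonably recent Linux kernel):

cat /sys/kernel/mm/transparent_hugepage/enabled    # [always] means THP is enabled
vmstat 1 5                                         # the "st" column is CPU stolen by the hypervisor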

comment:5 follow-up: ↓ 6 Changed 4 years ago by jim

So it looks like my cache clear set up the resource stress situation, even if it didn't directly cause Varnish to fall over.

And the #varnish chaps have a point... We either need Varnish on a separate box (with a few other non-critical services/sites), or a load more memory... Or both.

comment:6 in reply to: ↑ 5 Changed 4 years ago by chris

Replying to jim:

And the #varnish chaps have a point... We either need Varnish on a separate box (with a few other non-critical services/sites), or a load more memory... Or both.

I have just ordered some extra RAM for our newest server; it works out at £249.11 for 48GB -- £5.19 per GB inc. postage.

The site's files are 2.3G and the plain-text dump of the database is 170M -- everything could be run from RAM for about £15 worth of hardware. I think the TN should consider colocating its own server so that it becomes really cheap to throw RAM at things :-)
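
For reference, the arithmetic behind that estimate: 2.3 GB of files + ~0.17 GB of database is roughly 2.5 GB, and at £5.19/GB that comes to about £13, so £15 covers it with a little headroom.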

comment:7 Changed 4 years ago by ed

Well we've got around £700 in the PSE budget...

comment:8 Changed 4 years ago by chris

  • Status changed from accepted to closed
  • Resolution set to wontfix
  • Description modified (diff)

We are switching to Nginx.
