Ticket #420 (closed defect: wontfix)
Varnish Downtime
Reported by: | chris | Owned by: | chris |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | Live server | Keywords: | |
Cc: | ed, laura, jim | Estimated Number of Hours: | 0.0 |
Add Hours to Ticket: | 0 | Billable?: | yes |
Total Hours: | 1.3 |
Description (last modified by chris)
Varnish fell over early this morning; these are the entries in /var/log/messages:
Oct 9 02:55:30 quince varnishd[8635]: Child (2879) died signal=3
Oct 9 02:55:54 quince varnishd[8635]: child (11098) Started
Oct 9 02:56:24 quince varnishd[8635]: Child (11098) said Child starts
Oct 9 02:56:27 quince varnishd[8635]: Child (11098) said SMF.s0 mmap'ed 536870912 bytes of 536870912
Oct 9 02:56:27 quince varnishd[8635]: Child (11098) said Child dies
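For context, the 536870912 bytes in the last line is the 512MB storage arena (SMF.s0 is the file-backed storage) that the child mmaps as it starts. A quick way to confirm the storage options varnishd is running with -- the /etc/default/varnish path is a Debian/Ubuntu convention and may differ on this box:
ps -o args= -C varnishd                  # full command line of the running daemon, including the -s storage argument
grep DAEMON_OPTS /etc/default/varnish    # startup options on a Debian-style install (path is an assumption)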
Change History
comment:1 Changed 4 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.4
- Total Hours changed from 0.0 to 0.4
comment:2 Changed 4 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.3
- Total Hours changed from 0.4 to 0.7
Looking at https://www.varnish-cache.org/docs/3.0/tutorial/troubleshooting.html#varnish-is-crashing this could be the cause, and a fix:
Specifically if you see the "Error in munmap" error on Linux you might want to increase the amount of maps available. Linux is limited to a maximum of 64k maps. Setting vm.max_max_count in sysctl.conf will enable you to increase this limit.
I tried adding this to /etc/sysctl.conf:
vm.max_max_count = 256000
And ran:
sysctl -p
But it returned:
error: "vm.max_max_count" is an unknown key
More research required...
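A quick way to check what map-related keys the kernel actually exposes before editing /etc/sysctl.conf (standard /proc layout assumed):
sysctl -a 2>/dev/null | grep -i map      # list the available vm.* map keys and their current values
ls /proc/sys/vm/                         # the same keys appear as files under /proc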
comment:3 Changed 4 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.2
- Status changed from new to accepted
- Total Hours changed from 0.7 to 0.9
The problem was that it should be:
vm.max_map_count = 256000
I have tried this; also, this ticket could be related:
https://www.varnish-cache.org/trac/ticket/1119
I think we should probably leave this for now and see if it happens again.
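For the record, a couple of commands to confirm the corrected setting has taken effect (using the 256000 value from above):
sysctl -p                          # reload /etc/sysctl.conf; should now accept the key without errors
sysctl vm.max_map_count            # print the running value, expected to be 256000
cat /proc/sys/vm/max_map_count     # same check via /proc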
comment:4 Changed 4 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.4
- Total Hours changed from 0.9 to 1.3
I had a good chat with people in the #varnish IRC channel and everybody thought the load spike before Varnish stopped was the key and that more RAM would help. Some comments:
14:08 < s> chrisc: a full 24 seconds from dead child to restart? Buy a bigger server
14:08 < s> it must have been thrashing like crazy at the time
14:10 * chrisc there was a load spike https://kiwi.transitionnetwork.org/munin/webarch.net/quince.webarch.net/load.html
14:11 < s> people have been reporting weird stuff with transparent hugepages
14:11 < s> I assume your syslog has "Child not responding to CLI, killing it" too?
14:11 < s> https://www.varnish-cache.org/trac/ticket/1054
14:13 < chrisc> no it doesn't, varnish died at about 3am and i restarted it manually at around 8:15 am
14:13 < s> chrisc: no lines before 2:55:30?
14:13 < chrisc> no
14:14 < chrisc> Oct 9 01:52:02 quince varnishd[8635]: CLI telnet 127.0.0.1 39594 127.0.0.1 81 Wr 200
14:14 < chrisc> that's the one before
14:14 < s> "steal" CPU...something on the physical host starved your varnish vm
14:14 < s> maybe the syslog about child not responding got lost due to the same load
14:15 < chrisc> i think a lot of rsync backup jobs would have been running at that time and disk io will have been an issue
14:15 < chrisc> between the line i just pasted and the ones on the ticket it's just drupal messages
14:17 < s> maybe somebody else can chime in, but to me this looks like the varnishd master process killing the child
14:18 < s> because the child doesn't ping
14:18 < chrisc> the server was getting hammered https://kiwi.transitionnetwork.org/munin/webarch.net/quince.webarch.net/cpu.html
14:18 < chrisc> by the looks of the munin cpu stats
14:19 < s> and then varnishd tries to restart the child, but isn't able to communicate with it during startup because of the load on the machine
14:19 * chrisc nods
14:20 < chrisc> s: thanks, i guess i need to track down the cause of the load spike :-)
14:20 < s> so buy bigger hardware, or don't virtualize varnish with something that steals the resources
14:28 < D> considering my 3 year old laptop has more than 3GB of RAM
14:29 < D> (it has 8GB)
14:29 < D> and a decent beast of a server these days has 256GB of RAM (still only 2U and well under 20K total)
14:29 < D> I'd say get bigger hardware :)
14:31 < D> I'm not a big fan of solving all problems with more hardware
14:31 < U> we just upped a bunch of servers with 64G more RAM each, because it is the cheapest option
14:31 < D> but this is just a no-brainer
14:31 < U> even server-ram is cheap cheap cheap now
14:31 < D> yup
14:31 < D> I was surprised at how little the 256G costs us
14:33 < D> mysql & apache with ssl, running php and drupal?
14:33 < D> large site, plenty of pictures?
14:33 < chrisc> yeah with varnish on port 80 doing reverse proxy
14:33 < chrisc> yeah, large site
14:33 < D> all that on a single 3GB VM?
14:33 < D> they're daft
14:33 < chrisc> authenticated users on 443 only
14:34 < D> I mean, you can run it with that little memory
14:34 < D> but it's not going to be fast
14:34 < D> ever
14:37 < D> but honestly, if were hardware instead of VM, I'd just tell them to slam 4x4GB sticks in the server
14:37 < D> should cost less than a half a day of your time
14:37 < D> :P
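A couple of quick checks prompted by the comments above -- the transparent hugepage path assumes a stock kernel (RHEL-derived kernels use a redhat_transparent_hugepage directory instead) and vmstat comes from procps:
cat /sys/kernel/mm/transparent_hugepage/enabled    # "[always]" means THP is on, the "weird stuff" mentioned alongside ticket 1054
vmstat 1 5                                         # the "st" column shows CPU time stolen by the hypervisor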
comment:5 follow-up: ↓ 6 Changed 4 years ago by jim
So it looks like my cache clear set up the resource stress situation even if it didn't directly cause Varnish to fall over.
And the #varnish chaps have a point... We either need Varnish on a separate box (with a few other non-critical services/sites), or a load more memory... Or both.
comment:6 in reply to: ↑ 5 Changed 4 years ago by chris
Replying to jim:
And the #varnish chaps have a point... We either need Varnish on a separate box (with a few other non-critical services/sites), or a load more memory... Or both.
I have just ordered some extra RAM for our newest server; it works out at £249.11 for 48GB -- £5.19 per GB including postage.
The site's files are 2.3G and the plain text dump of the database is 170M -- everything could be run from RAM for about £15 worth of hardware (roughly 2.5GB at £5.19/GB). I think the TN should consider colocating its own server so that it becomes really cheap to throw RAM at things :-)
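For reference, a sketch of how those sizes can be checked -- the paths below are illustrative, not the real ones on quince:
du -sh /var/www/site               # total size of the site's files (2.3G in this case)
du -sh /var/backups/site.sql       # size of the plain text database dump (170M here)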