Ticket #420 (closed defect: wontfix)

Opened 4 years ago

Last modified 4 years ago

Varnish Downtime

Reported by: chris
Owned by: chris
Priority: major
Milestone:
Component: Live server
Keywords:
Cc: ed, laura, jim
Estimated Number of Hours: 0.0
Add Hours to Ticket: 0
Billable?: yes
Total Hours: 1.3

Description (last modified by chris) (diff)

Varnish fell over early this morning, this is the entry in /var/log/messages:

Oct  9 02:55:30 quince varnishd[8635]: Child (2879) died signal=3
Oct  9 02:55:54 quince varnishd[8635]: child (11098) Started
Oct  9 02:56:24 quince varnishd[8635]: Child (11098) said Child starts
Oct  9 02:56:27 quince varnishd[8635]: Child (11098) said SMF.s0 mmap'ed 536870912 bytes of 536870912
Oct  9 02:56:27 quince varnishd[8635]: Child (11098) said Child dies
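
When the child dies like this it sometimes leaves a panic message behind; a quick way to check is via the management CLI (a minimal sketch -- the -T address below is a placeholder, use whatever varnishd was actually started with):

varnishadm -T 127.0.0.1:6082 panic.show    # prints the last child panic, if there was one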

Change History

comment:1 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.4
  • Total Hours changed from 0.0 to 0.4

comment:2 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.3
  • Total Hours changed from 0.4 to 0.7

Looking at https://www.varnish-cache.org/docs/3.0/tutorial/troubleshooting.html#varnish-is-crashing, this could be the cause, and a fix:

Specifically if you see the "Error in munmap" error on Linux 
you might want to increase the amount of maps available. 
Linux is limited to a maximum of 64k maps. Setting 
vm.max_max_count i sysctl.conf will enable you to increase 
this limit. 

I tried adding this to /etc/sysctl.conf:

vm.max_max_count = 256000

And ran:

sysctl -p

But it returned:

error: "vm.max_max_count" is an unknown key

More research required...

comment:3 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.2
  • Status changed from new to accepted
  • Total Hours changed from 0.7 to 0.9

The problem was that it should be:

vm.max_map_count = 256000

Have tried this, also this ticket could be related:

https://www.varnish-cache.org/trac/ticket/1119

I think we should probably leave this for now and see if it happens again.
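
For the record, a quick way to confirm the corrected key actually took effect (the value is the one used above):

sysctl -p                  # reload /etc/sysctl.conf
sysctl vm.max_map_count    # should now report vm.max_map_count = 256000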

comment:4 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.4
  • Total Hours changed from 0.9 to 1.3

I had a good chat with people in the #varnish IRC channel and everybody thought the load spike before Varnish stopped was the key, and that more RAM would help. Some comments (a sketch of the suggested checks follows the log):

14:08 < s> it must have been thrashing like crazy at the time
14:08 < s> chrisc: a full 24 seconds from dead child to restart? Buy a bigger server
14:10  * chrisc there was a load spike https://kiwi.transitionnetwork.org/munin/webarch.net/quince.webarch.net/load.html
14:11 < s> people have been reporting weird stuff with transparent hugepages
14:11 < s> I assume your syslog has "Child not responding to CLI, killing it" too?
14:11 < s> https://www.varnish-cache.org/trac/ticket/1054
14:13 < chrisc> no it doesn't, varnish died at about 3am and i restarted it manually at around 8:15 am
14:13 < s> chrisc: no lines before 2:55:30?
14:13 < chrisc> no
14:14 < chrisc> Oct  9 01:52:02 quince varnishd[8635]: CLI telnet 127.0.0.1 39594 127.0.0.1 81 Wr 200
14:14 < chrisc> that's the one before
14:14 < s> "steal" CPU...something on the physical host starved your varnish vm
14:14 < s> maybe the syslog about child not responding got lost due to the same load
14:15 < chrisc> i think a lot of rsync backup jobs would have been running at that time and disk io will have been an issue
14:15 < chrisc> between the line i just pasted and the ones on the ticket it's just drupal messages
14:17 < s> maybe somebody else can chime in, but to me this looks like the varnishd master process killing the child
14:18 < s> because the child doesn't ping
14:18 < chrisc> the server was getting hammered https://kiwi.transitionnetwork.org/munin/webarch.net/quince.webarch.net/cpu.html
14:18 < chrisc> by the looks of the munin cpu stats
14:19 < s> and then varnishd tries to restart the child, but isn't able to communicate with it during startup because of the load on the machine
14:19  * chrisc nods
14:20 < chrisc> s: thanks, i guess i need to track down the cause of the load spike :-)
14:20 < s> so buy bigger hardware, or don't virtualize varnish with something that steals the resources
14:28 < D> considering my 3 year old laptop has more than 3GB of RAM
14:29 < D> (it has 8GB)
14:29 < D> and a decent beast of a server these days has 256GB of RAM (still only 2U and well under 20K total)
14:29 < D> I'd say get bigger hardware :)
14:31 < D> I'm not a big fan of solving all problems with more hardware
14:31 < U> we just upped a bunch of servers with 64G more RAM each, because it is the cheapest option
14:31 < D> but this is just a no-brainer
14:31 < U> even server-ram is cheap cheap cheap now
14:31 < D> yup
14:31 < D> I was surprised at how little the 256G costs us
14:33 < D> mysql & apache with ssl, running php and drupal?
14:33 < D> large site, plenty of pictures?
14:33 < chrisc> yeah with varnish on port 80 doing reverse proxy
14:33 < chrisc> yeah, large site
14:33 < D> all that on a single 3GB VM?
14:33 < D> they're daft
14:33 < chrisc> authenticated users on 443 only
14:34 < D> I mean, you can run it with that little memory
14:34 < D> but it's not going to be fast
14:34 < D> ever
14:37 < D> but honestly, if were hardware instead of VM, I'd just tell them to slam 4x4GB sticks in the server
14:37 < D> should cost less than a half a day of your time
14:37 < D> :P
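
To follow up on the transparent hugepages and CPU steal points raised above, a couple of quick checks on the guest (a sketch, assuming a reasonably recent Linux kernel):

cat /sys/kernel/mm/transparent_hugepage/enabled    # [always] means THP is enabled
vmstat 1 5                                         # the "st" column is CPU stolen by the hypervisor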

comment:5 follow-up: ↓ 6 Changed 4 years ago by jim

So it looks like my cache clear set up the resource stress situation, even if it didn't directly cause Varnish to fall over.

And the #varnish chaps have a point... We either need Varnish on a separate box (with a few other non-critical services/sites), or a load more memory... Or both.

comment:6 in reply to: ↑ 5 Changed 4 years ago by chris

Replying to jim:

And the #varnish chaps have a point... We either need Varnish on a separate box (with a few other non-critical services/sites), or a load more memory... Or both.

I have just ordered some extra RAM for our newest server; it works out at £249.11 for 48GB -- £5.19 per GB inc. postage.

The site's files are 2.3G and the plain-text dump of the database is 170M -- everything could be run from RAM for about £15 worth of hardware. I think the TN should consider colocating its own server so that it becomes really cheap to throw RAM at things :-)
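
For reference, the arithmetic behind that estimate: 2.3 GB of files + ~0.17 GB of database is roughly 2.5 GB, and at £5.19/GB that comes to about £13, so £15 covers it with a little headroom.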

comment:7 Changed 4 years ago by ed

Well we've got around £700 in the PSE budget...

comment:8 Changed 4 years ago by chris

  • Status changed from accepted to closed
  • Resolution set to wontfix
  • Description modified (diff)

We are switching to Nginx.
