<?xml version="1.0"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>Transition Technology: Ticket #420: Varnish Downtime</title>
    <link>http://localhost:8080/trac/ticket/420</link>
    <description>&lt;p&gt;
Varnish fell over early this morning; this is the entry in &lt;tt&gt;/var/log/messages&lt;/tt&gt;:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Oct  9 02:55:30 quince varnishd[8635]: Child (2879) died signal=3
Oct  9 02:55:54 quince varnishd[8635]: child (11098) Started
Oct  9 02:56:24 quince varnishd[8635]: Child (11098) said Child starts
Oct  9 02:56:27 quince varnishd[8635]: Child (11098) said SMF.s0 mmap'ed 536870912 bytes of 536870912
Oct  9 02:56:27 quince varnishd[8635]: Child (11098) said Child dies
&lt;/pre&gt;</description>
    <language>en-us</language>
    <image>
      <title>Transition Technology</title>
      <url>/trac/chrome/site/TransitionNetwork-Logo-Web-Small.jpg</url>
      <link>http://localhost:8080/trac/ticket/420</link>
    </image>
    <generator>Trac 0.12.5</generator>
    <item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Tue, 09 Oct 2012 11:49:56 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/420#comment:1</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/420#comment:1</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.4&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.4&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Tue, 09 Oct 2012 12:09:50 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/420#comment:2</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/420#comment:2</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.3&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;0.4&lt;/em&gt; to &lt;em&gt;0.7&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Looking at &lt;a class="ext-link" href="https://www.varnish-cache.org/docs/3.0/tutorial/troubleshooting.html#varnish-is-crashing"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://www.varnish-cache.org/docs/3.0/tutorial/troubleshooting.html#varnish-is-crashing&lt;/a&gt; this could be the cause and a fix:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Specifically if you see the "Error in munmap" error on Linux
you might want to increase the amount of maps available.
Linux is limited to a maximum of 64k maps. Setting
vm.max_max_count i sysctl.conf will enable you to increase
this limit.
&lt;/pre&gt;&lt;p&gt;
I tried adding this to &lt;tt&gt;/etc/sysctl.conf&lt;/tt&gt;:
&lt;/p&gt;
&lt;pre class="wiki"&gt;vm.max_max_count = 256000
&lt;/pre&gt;&lt;p&gt;
And ran:
&lt;/p&gt;
&lt;pre class="wiki"&gt;sysctl -p
&lt;/pre&gt;&lt;p&gt;
But it returned:
&lt;/p&gt;
&lt;pre class="wiki"&gt;error: "vm.max_max_count" is an unknown key
&lt;/pre&gt;&lt;p&gt;
More research required...
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Tue, 09 Oct 2012 12:20:41 GMT</pubDate>
      <title>hours, status, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/420#comment:3</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/420#comment:3</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.2&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;new&lt;/em&gt; to &lt;em&gt;accepted&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;0.7&lt;/em&gt; to &lt;em&gt;0.9&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
The problem was that it should be:
&lt;/p&gt;
&lt;pre class="wiki"&gt;vm.max_map_count = 256000
&lt;/pre&gt;&lt;p&gt;
I have tried this now. This ticket could also be related:
&lt;/p&gt;
&lt;p&gt;
&lt;a class="ext-link" href="https://www.varnish-cache.org/trac/ticket/1119"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://www.varnish-cache.org/trac/ticket/1119&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
I think we should probably leave this for now and see if it happens again.
&lt;/p&gt;
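&lt;p&gt;
A mistyped key like the one above can be caught before running sysctl -p. The following is a minimal sketch, not part of the original fix: it assumes the usual Linux convention that a sysctl key maps to a file under /proc/sys with the dots read as path separators, and the function names are made up for illustration. Keys that themselves contain dots (some network interface names) are not handled.
&lt;/p&gt;
&lt;pre class="wiki"&gt;
```python
import os


def sysctl_key_path(key, proc_sys_root="/proc/sys"):
    # Assumed convention: vm.max_map_count lives at /proc/sys/vm/max_map_count
    return os.path.join(proc_sys_root, *key.split("."))


def unknown_keys(conf_text, proc_sys_root="/proc/sys"):
    """Return the keys in sysctl.conf-style text that have no matching file."""
    missing = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (sysctl.conf accepts '#' and ';')
        if not line or line[0] in "#;":
            continue
        key = line.split("=", 1)[0].strip()
        if not os.path.exists(sysctl_key_path(key, proc_sys_root)):
            missing.append(key)
    return missing


if __name__ == "__main__":
    conf = "vm.max_max_count = 256000\nvm.max_map_count = 256000\n"
    # On a Linux host this should list only the mistyped key
    print(unknown_keys(conf))
```
&lt;/pre&gt;
&lt;p&gt;
The proc_sys_root parameter exists only so the check can be pointed at a scratch directory when testing away from the server.
&lt;/p&gt;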
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Tue, 09 Oct 2012 13:47:28 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/420#comment:4</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/420#comment:4</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.4&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;0.9&lt;/em&gt; to &lt;em&gt;1.3&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
I had a good chat with people in the #varnish IRC channel, and everybody thought the load spike before Varnish stopped was the key and that more RAM would help. Some comments:
&lt;/p&gt;
&lt;pre class="wiki"&gt;14:08 &amp;lt; s&amp;gt; chrisc: a full 24 seconds from dead child to restart? Buy a bigger server
14:08 &amp;lt; s&amp;gt; it must have been thrashing like crazy at the time
14:10  * chrisc there was a load spike https://kiwi.transitionnetwork.org/munin/webarch.net/quince.webarch.net/load.html
14:11 &amp;lt; s&amp;gt; people have been reporting weird stuff with transparent hugepages
14:11 &amp;lt; s&amp;gt; I assume your syslog has "Child not responding to CLI, killing it" too?
14:11 &amp;lt; s&amp;gt; https://www.varnish-cache.org/trac/ticket/1054
14:13 &amp;lt; chrisc&amp;gt; no it doesn't, varnish died at about 3am and i restarted it manually at around 8:15 am
14:13 &amp;lt; s&amp;gt; chrisc: no lines before 2:55:30?
14:13 &amp;lt; chrisc&amp;gt; no
14:14 &amp;lt; chrisc&amp;gt; Oct  9 01:52:02 quince varnishd[8635]: CLI telnet 127.0.0.1 39594 127.0.0.1 81 Wr 200
14:14 &amp;lt; chrisc&amp;gt; that's the one before
14:14 &amp;lt; s&amp;gt; "steal" CPU...something on the physical host starved your varnish vm
14:14 &amp;lt; s&amp;gt; maybe the syslog about child not responding got lost due to the same load
14:15 &amp;lt; chrisc&amp;gt; i think a lot of rsync backup jobs would have been running at that time and disk io will have been an issue
14:15 &amp;lt; chrisc&amp;gt; between the line i just pasted and the ones on the ticket it's just drupal messages
14:17 &amp;lt; s&amp;gt; maybe somebody else can chime in, but to me this looks like the varnishd master process killing the child
14:18 &amp;lt; s&amp;gt; because the child doesn't ping
14:18 &amp;lt; chrisc&amp;gt; the server was getting hammered https://kiwi.transitionnetwork.org/munin/webarch.net/quince.webarch.net/cpu.html
14:18 &amp;lt; chrisc&amp;gt; by the looks of the munin cpu stats
14:19 &amp;lt; s&amp;gt; and then varnishd tries to restart the child, but isn't able to communicate with it during startup because of the load on the machine
14:19  * chrisc nods
14:20 &amp;lt; chrisc&amp;gt; s: thanks, i guess i need to track down the cause of the load spike :-)
14:20 &amp;lt; s&amp;gt; so buy bigger hardware, or don't virtualize varnish with something that steals the resources
14:28 &amp;lt; D&amp;gt; considering my 3 year old laptop has more than 3GB of RAM
14:29 &amp;lt; D&amp;gt; (it has 8GB)
14:29 &amp;lt; D&amp;gt; and a decent beast of a server these days has 256GB of RAM (still only 2U and well under 20K total)
14:29 &amp;lt; D&amp;gt; I'd say get bigger hardware :)
14:31 &amp;lt; D&amp;gt; I'm not a big fan of solving all problems with more hardware
14:31 &amp;lt; U&amp;gt; we just upped a bunch of servers with 64G more RAM each, because it is the cheapest option
14:31 &amp;lt; D&amp;gt; but this is just a no-brainer
14:31 &amp;lt; U&amp;gt; even server-ram is cheap cheap cheap now
14:31 &amp;lt; D&amp;gt; yup
14:31 &amp;lt; D&amp;gt; I was surprised at how little the 256G costs us
14:33 &amp;lt; D&amp;gt; mysql &amp;amp; apache with ssl, running php and drupal?
14:33 &amp;lt; D&amp;gt; large site, plenty of pictures?
14:33 &amp;lt; chrisc&amp;gt; yeah with varnish on port 80 doing reverse proxy
14:33 &amp;lt; chrisc&amp;gt; yeah, large site
14:33 &amp;lt; D&amp;gt; all that on a single 3GB VM?
14:33 &amp;lt; D&amp;gt; they're daft
14:33 &amp;lt; chrisc&amp;gt; authenticated users on 443 only
14:34 &amp;lt; D&amp;gt; I mean, you can run it with that little memory
14:34 &amp;lt; D&amp;gt; but it's not going to be fast
14:34 &amp;lt; D&amp;gt; ever
14:37 &amp;lt; D&amp;gt; but honestly, if were hardware instead of VM, I'd just tell them to slam 4x4GB sticks in the server
14:37 &amp;lt; D&amp;gt; should cost less than a half a day of your time
14:37 &amp;lt; D&amp;gt; :P
&lt;/pre&gt;
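&lt;p&gt;
The "full 24 seconds from dead child to restart" figure can be recomputed from the two syslog lines in the ticket description. This is only a rough sketch: the function names are hypothetical, and the year is an assumption, because classic syslog timestamps do not carry one (2012 is taken from the ticket dates).
&lt;/p&gt;
&lt;pre class="wiki"&gt;
```python
from datetime import datetime

# The two relevant lines, copied from the ticket description
LOG = """\
Oct  9 02:55:30 quince varnishd[8635]: Child (2879) died signal=3
Oct  9 02:55:54 quince varnishd[8635]: child (11098) Started
"""


def parse_stamp(line, year=2012):
    # The first 15 characters hold the "Mon dd HH:MM:SS" stamp; the year is
    # not in the log, so it is supplied by the caller (an assumption).
    month, day, clock = line[:15].split()
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def restart_gap_seconds(log_text):
    """Seconds between the first 'died' line and the next 'Started' line."""
    died = started = None
    for line in log_text.splitlines():
        if died is None and "died signal=" in line:
            died = parse_stamp(line)
        elif died is not None and started is None and "Started" in line:
            started = parse_stamp(line)
    if died is None or started is None:
        return None
    return (started - died).total_seconds()


print(restart_gap_seconds(LOG))  # 24.0
```
&lt;/pre&gt;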
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>jim</dc:creator>

      <pubDate>Wed, 10 Oct 2012 18:56:02 GMT</pubDate>
      <title></title>
      <link>http://localhost:8080/trac/ticket/420#comment:5</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/420#comment:5</guid>
      <description>
        &lt;p&gt;
So it looks like my cache clear set up the resource stress situation, even if it didn't directly cause Varnish to fall over.
&lt;/p&gt;
&lt;p&gt;
And the #varnish chaps have a point... We either need Varnish on a separate box (with a few other non-critical services/sites), or a load more memory... Or both.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Tue, 16 Oct 2012 13:16:12 GMT</pubDate>
      <title></title>
      <link>http://localhost:8080/trac/ticket/420#comment:6</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/420#comment:6</guid>
      <description>
        &lt;p&gt;
Replying to &lt;a href="http://localhost:8080/trac/ticket/420#comment:5" title="Comment 5 for Ticket #420"&gt;jim&lt;/a&gt;:
&lt;/p&gt;
&lt;blockquote class="citation"&gt;
&lt;p&gt;
And the #varnish chaps have a point... We either need Varnish on a separate box (with a few other non-critical services/sites), or a load more memory... Or both.
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;
I have just ordered some extra RAM for our newest server; it works out at £249.11 for 48GB -- £5.19 per GB inc postage.
&lt;/p&gt;
&lt;p&gt;
The site's files are 2.3G and the plain text dump of the database is 170M -- everything could be run from RAM for about £15 worth of hardware. I think the TN should consider colocating its own server so that it becomes really cheap to throw RAM at things :-)
&lt;/p&gt;
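&lt;p&gt;
The arithmetic behind those figures, as a quick sketch. The variable and function names are made up for illustration, and 170M is taken as 0.17 GB. A loaded database usually needs more RAM than its plain text dump, so this is a rough lower bound that sits comfortably under the "about £15" estimate.
&lt;/p&gt;
&lt;pre class="wiki"&gt;
```python
PRICE_GBP = 249.11    # order total including postage, from the comment above
RAM_GB = 48

SITE_FILES_GB = 2.3   # "2.3G" of site files
DB_DUMP_GB = 0.17     # "170M" plain text dump, taken as 0.17 GB (assumption)


def price_per_gb(total, gigabytes):
    return total / gigabytes


def cost_to_hold(data_gb, per_gb):
    # Cost of just enough RAM to hold data_gb at the given price per GB
    return data_gb * per_gb


per_gb = price_per_gb(PRICE_GBP, RAM_GB)
data_gb = SITE_FILES_GB + DB_DUMP_GB

print(round(per_gb, 2))                         # 5.19
print(round(cost_to_hold(data_gb, per_gb), 2))  # 12.82
```
&lt;/pre&gt;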
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>ed</dc:creator>

      <pubDate>Tue, 23 Oct 2012 07:57:44 GMT</pubDate>
      <title></title>
      <link>http://localhost:8080/trac/ticket/420#comment:7</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/420#comment:7</guid>
      <description>
        &lt;p&gt;
Well we've got around £700 in the PSE budget... ????
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Tue, 29 Jan 2013 17:36:55 GMT</pubDate>
      <title>status, description changed; resolution set</title>
      <link>http://localhost:8080/trac/ticket/420#comment:8</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/420#comment:8</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;accepted&lt;/em&gt; to &lt;em&gt;closed&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;resolution&lt;/strong&gt;
                set to &lt;em&gt;wontfix&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/trac/ticket/420?action=diff&amp;amp;version=8"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
We are switching to Nginx.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item>
 </channel>
</rss>