<?xml version="1.0"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>Transition Technology: Ticket #563: 503 Errors</title>
    <link>http://localhost:8080/trac/ticket/563</link>
    <description>&lt;p&gt;
The BOA &lt;tt&gt;/var/xdrago/second.sh&lt;/tt&gt; script is run every minute via the root crontab and if it detects a certain load level it changes the nginx config to a "high load" config which results in bots being served 503 errors when they spider the site. When the load goes higher and hits another threshold the &lt;tt&gt;second.sh&lt;/tt&gt; script stops the site, see &lt;a class="closed ticket" href="http://localhost:8080/trac/ticket/555" title="maintenance: Load spikes causing the TN site to be stopped for 15 min at a time (closed: fixed)"&gt;ticket:555&lt;/a&gt;.
&lt;/p&gt;
&lt;h2 id="OriginalDescription"&gt;Original Description&lt;/h2&gt;
&lt;p&gt;
The site is generating a lot of 503 errors, 83 since 6:30am today and there were around 750 yesterday.
&lt;/p&gt;
</description>
    <language>en-us</language>
    <image>
      <title>Transition Technology</title>
      <url>/trac/chrome/site/TransitionNetwork-Logo-Web-Small.jpg</url>
      <link>http://localhost:8080/trac/ticket/563</link>
    </image>
    <generator>Trac 0.12.5</generator>
    <item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Wed, 19 Jun 2013 11:16:27 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:1</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:1</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.3&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.3&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
I have copied the script we have to find 502 errors in the nginx logs and email them out one a day, /usr/local/bin/find-502s to /usr/local/bin/50x-errors and changed it to search for 502, 503 and 504.
&lt;/p&gt;
&lt;p&gt;
So far today:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Total 502 errors: 7
Total 503 errors: 152
Total 504 errors: 0
&lt;/pre&gt;&lt;p&gt;
I'm on the phone with Jim now.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Wed, 19 Jun 2013 11:25:05 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:2</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:2</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.15&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;0.3&lt;/em&gt; to &lt;em&gt;0.45&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Following our phone call, ideas we are going to try:
&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Jim - firewall
&lt;/li&gt;&lt;li&gt;Chris - php-fpm restarts, try disabling them
&lt;/li&gt;&lt;li&gt;Chris - logging - can we add any more?
&lt;/li&gt;&lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Wed, 19 Jun 2013 13:51:35 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:3</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:3</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;1.4&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;0.45&lt;/em&gt; to &lt;em&gt;1.85&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
I have changed these things in the &lt;tt&gt;/opt/local/etc/php53-fpm.conf&lt;/tt&gt; file to prevent php-fpm restarts to see if this helps the problems:
&lt;/p&gt;
&lt;pre class="wiki"&gt;; If this number of child processes exit with SIGSEGV or SIGBUS within the time
; interval set by emergency_restart_interval then FPM will restart. A value
; of '0' means 'Off'.
; Default Value: 0
# chris
#emergency_restart_threshold = 5
emergency_restart_threshold = 0
; Interval of time used by emergency_restart_interval to determine when
; a graceful restart will be initiated.  This can be useful to work around
; accidental corruptions in an accelerator's shared memory.
; Available Units: s(econds), m(inutes), h(ours), or d(ays)
; Default Unit: seconds
; Default Value: 0
# chris
#emergency_restart_interval = 1m
emergency_restart_interval = 0
; The number of requests each child process should execute before respawning.
; This can be useful to work around memory leaks in 3rd party libraries. For
; endless request processing specify '0'. Equivalent to PHP_FCGI_MAX_REQUESTS.
; Default Value: 0
# chris
#pm.max_requests = 500
pm.max_requests = 0
&lt;/pre&gt;&lt;p&gt;
I have restarted php53-fpm.
&lt;/p&gt;
&lt;p&gt;
Updated result of the script to find 50x errors, these totals are since 6:30am:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Total 502 errors: 7
Total 503 errors: 173
Total 504 errors: 0
&lt;/pre&gt;&lt;p&gt;
The increase from earlier in the day is due to the site going down again at noon, I haven't looked at all the logs for this event but it looks much like the previous ones documented on &lt;a class="closed ticket" href="http://localhost:8080/trac/ticket/555" title="maintenance: Load spikes causing the TN site to be stopped for 15 min at a time (closed: fixed)"&gt;ticket:555&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
I have set this script to run just before the Nginx logs are rotated, it can also be run on the command line, Jim, try this:
&lt;/p&gt;
&lt;pre class="wiki"&gt;/usr/local/bin/50x-errors | less
&lt;/pre&gt;&lt;p&gt;
Looking at the cron jobs which kill and restart critical processes I think we should consider stopping the second.sh one to see if this helps, following is a (hopefully) human readable version of what they appear to do.
&lt;/p&gt;
&lt;h2 id="second.sh"&gt;second.sh&lt;/h2&gt;
&lt;p&gt;
This runs every minute:
&lt;/p&gt;
&lt;pre class="wiki"&gt;* * * * * bash /var/xdrago/second.sh &amp;gt;/dev/null 2&amp;gt;&amp;amp;1
&lt;/pre&gt;&lt;p&gt;
It does things based on the load average, specifically, if:
&lt;/p&gt;
&lt;pre class="wiki"&gt;awk '{print $1*100}' /proc/loadavg
&lt;/pre&gt;&lt;p&gt;
is greater than 388 and less than 1444 and /data/conf/nginx_high_load_off.conf exists then nginx is restarted.
&lt;/p&gt;
&lt;p&gt;
else if this:
&lt;/p&gt;
&lt;pre class="wiki"&gt;awk '{print $2*100}' /proc/loadavg
&lt;/pre&gt;&lt;p&gt;
is greater than 388 and less than 888 and /data/conf/nginx_high_load_off.conf exists then nginx is restarted.
&lt;/p&gt;
&lt;p&gt;
else the result of the awk commands above are less than 388 and /data/conf/nginx_high_load.conf exists then /data/conf/nginx_high_load.conf is moved to /data/conf/nginx_high_load_off.conf and nginx is reloaded.
&lt;/p&gt;
&lt;p&gt;
If the values of the awk commands are greater than 1888 or 1555 then these command line operations are killed:
&lt;/p&gt;
&lt;pre class="wiki"&gt;killall -9 php drush.php wget
&lt;/pre&gt;&lt;p&gt;
If the first awk command is greater than 1444 or the second is greater than 888 then the folowing is done:
&lt;/p&gt;
&lt;pre class="wiki"&gt;  /etc/init.d/nginx stop
  killall -9 nginx
  sleep 1
  killall -9 nginx
  /etc/init.d/php-fpm stop
  /etc/init.d/php53-fpm stop
  killall -9 php-fpm php-cgi
&lt;/pre&gt;&lt;h2 id="minute.sh"&gt;minute.sh&lt;/h2&gt;
&lt;p&gt;
This runs every minute:
&lt;/p&gt;
&lt;pre class="wiki"&gt;* * * * * bash /var/xdrago/minute.sh &amp;gt;/dev/null 2&amp;gt;&amp;amp;1
&lt;/pre&gt;&lt;p&gt;
And checks for segfaults
&lt;/p&gt;
&lt;h2 id="runner.sh"&gt;runner.sh&lt;/h2&gt;
&lt;p&gt;
This runs every minute:
&lt;/p&gt;
&lt;pre class="wiki"&gt;* * * * * bash /var/xdrago/runner.sh &amp;gt;/dev/null 2&amp;gt;&amp;amp;1
&lt;/pre&gt;&lt;p&gt;
And checks that the load is less than 388 and if it is then /var/xdrago/run-tn is run.
&lt;/p&gt;
&lt;h2 id="clear.sh"&gt;clear.sh&lt;/h2&gt;
&lt;p&gt;
This was clobbering log files but all the clobbering has been commented out.
&lt;/p&gt;
&lt;h2 id="graceful.sh"&gt;graceful.sh&lt;/h2&gt;
&lt;p&gt;
This runs at 6pm each day and does things like updating the GeoIP database and it restarts nginx and php53-fpm.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Wed, 19 Jun 2013 19:54:57 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:4</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:4</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.5&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;1.85&lt;/em&gt; to &lt;em&gt;2.35&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Latest running total of 50x errors for today:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Total 502 errors: 11
Total 503 errors: 261
Total 504 errors: 0
&lt;/pre&gt;&lt;p&gt;
One thought, the server has 14 CPUs, this means that a load of 14 is equivalent to a load of 1 on a server with 1 CPU. If the BOA scripts, with all the killing of processes that happen when the load gets to various levels, are based on servers with a different hardware configuration, less CPUs, less RAM and perhaps SSDs (for example the log clobbering was justified on the basis that busy servers with, presumably very fast but also very small, SSDs required it) then perhaps this could be part of the problem?
&lt;/p&gt;
&lt;p&gt;
Looking at the User-Agent strings for the 503 errors, most are bots:
&lt;/p&gt;
&lt;pre class="wiki"&gt;50x-errors  | grep "Feedfetcher-Google" | wc -l
26
50x-errors  | grep "Baiduspider" | wc -l
10
50x-errors  | grep "Googlebot" | wc -l
117
50x-errors  | grep "YandexBot" | wc -l
14
50x-errors  | grep "bingbot" | wc -l
46
50x-errors  | grep "Sogou web spider" | wc -l
8
50x-errors  | grep "magpie-crawler" | wc -l
16
50x-errors  | grep "Feedly" | wc -l
5
50x-errors  | grep "Sosospider" | wc -l
4
&lt;/pre&gt;&lt;p&gt;
In fact out of the total of 272 502 and 503 errors only 26 were not to bots, and out of these 26 around half were RSS readers:
&lt;/p&gt;
&lt;pre class="wiki"&gt;50x-errors  | grep -v "Googlebot" | grep -v "Feedfetcher-Google" | grep -v "Baiduspider" | grep -v "YandexBot" | grep -v "bingbot" | grep -v "Sogou web spider" | grep -v "magpie-crawler" | grep -v "Feedly" | grep -v "Sosospider" | grep transitionnetwork | wc -l
26
&lt;/pre&gt;&lt;p&gt;
I'm not sure what the significance of the large number of bot hits is other than the observation that the site gets lots more traffic from bots than people and that this traffic will never get recorded by Piwik -- perhaps I should generate some stats from the Nginx log files so we get a better picture of how much total traffic the server is getting, not just the human traffic?
&lt;/p&gt;
&lt;p&gt;
I suggest we try:
&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Disabling the second.sh script which does a lot of process killing based on load.
&lt;/li&gt;&lt;li&gt;Increasing the number of php53-fpm processes.
&lt;/li&gt;&lt;li&gt;See what happens.
&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;
Jim -- would you be happy with trying that?
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Wed, 19 Jun 2013 20:27:03 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:5</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:5</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.5&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;2.35&lt;/em&gt; to &lt;em&gt;2.85&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Looking at &lt;tt&gt;/opt/local/etc/php53-fpm.conf&lt;/tt&gt;:
&lt;/p&gt;
&lt;pre class="wiki"&gt;; Set open file descriptor rlimit for the master process.
; Default Value: system defined value
rlimit_files = 1024
&lt;/pre&gt;&lt;p&gt;
This seems quite low, but there is no evidence in the logs that we have hit this limit. There are some notes here on increasing it if we did want to increase it &lt;a class="ext-link" href="http://www.cyberciti.biz/faq/linux-unix-nginx-too-many-open-files/"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://www.cyberciti.biz/faq/linux-unix-nginx-too-many-open-files/&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
I have edited these values to dramatically increase the number of php-fpm processes:
&lt;/p&gt;
&lt;pre class="wiki"&gt;process.max = 100
pm.max_children = 90
pm.start_servers = 20
pm.min_spare_servers = 10
pm.max_spare_servers = 20
pm.process_idle_timeout = 30s;
request_slowlog_timeout = 60s
&lt;/pre&gt;&lt;p&gt;
The thinking (hunch / guess) being that if when demand peaks lots of processes need to be spawned this might be increasing the load such that the process killing is triggered.
&lt;/p&gt;
&lt;p&gt;
There are now 21 php-fpm processes by default:
&lt;/p&gt;
&lt;pre class="wiki"&gt;ps -lA | grep php | wc -l
21
&lt;/pre&gt;&lt;p&gt;
I haven't commented out the second.sh script -- waiting for Jims thoughts if we should try that.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Wed, 19 Jun 2013 21:08:58 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:6</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:6</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.6&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;2.85&lt;/em&gt; to &lt;em&gt;3.45&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Regarding getting a better idea about what Nginx by processing the logs, how about doing it like this:
&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Just before the Nginx access log is rotated (this happens daily) copy it to penguin (using ssh and rsync).
&lt;/li&gt;&lt;li&gt;Set up AWstats on penguin to analyse the log files, &lt;a class="ext-link" href="http://awstats.sourceforge.net/"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://awstats.sourceforge.net/&lt;/a&gt;
&lt;/li&gt;&lt;li&gt;Perhaps generate some Munin stats from the awstats, &lt;a class="ext-link" href="http://www.gho.no/2009/03/munin-website-traffic-graph-plugin-for-awstats/"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://www.gho.no/2009/03/munin-website-traffic-graph-plugin-for-awstats/&lt;/a&gt;
&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;
The thinking being that penguin has spare resources to do log processing and it will save setting up awstats on parrot.
&lt;/p&gt;
&lt;p&gt;
I have been keeping an eye on the munin stats since increasing the number of php-fpm processes and more RAM is being used, but not more than we have available, see:
&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;&lt;a class="ext-link" href="https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/phpfpm_memory.html"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/phpfpm_memory.html&lt;/a&gt;
&lt;/li&gt;&lt;li&gt;&lt;a class="ext-link" href="https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/phpfpm_processes.html"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/phpfpm_processes.html&lt;/a&gt;
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
We just had another load spike:
&lt;/p&gt;
&lt;pre class="wiki"&gt;uptime
 22:01:44 up 34 days, 22:40,  2 users,  load average: 31.81, 12.29, 4.81
&lt;/pre&gt;&lt;p&gt;
A load of 28 is like a load of 2 on a 1 CPU machine -- a load of 31 isn't enough on this server to start killing processes but I saw php-fpm killed so I manually restarted it.
&lt;/p&gt;
&lt;p&gt;
IMHO second.sh is killing too many processes so I have disabled it by commenting it from the root crontab:
&lt;/p&gt;
&lt;pre class="wiki"&gt;#* * * * * bash /var/xdrago/second.sh &amp;gt;/dev/null 2&amp;gt;&amp;amp;1
&lt;/pre&gt;&lt;p&gt;
Let's see what happens now.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Wed, 19 Jun 2013 22:25:01 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:7</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:7</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.7&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;3.45&lt;/em&gt; to &lt;em&gt;4.15&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
These are the times php-fpm was stopped/killed in the last few days:
&lt;/p&gt;
&lt;pre class="wiki"&gt;grep bye-bye php53-fpm-error.log
[19-Jun-2013 12:02:12] NOTICE: exiting, bye-bye!
[19-Jun-2013 14:33:57] NOTICE: exiting, bye-bye!
[19-Jun-2013 17:00:45] NOTICE: exiting, bye-bye!
[19-Jun-2013 21:26:17] NOTICE: exiting, bye-bye!
[19-Jun-2013 22:01:48] NOTICE: exiting, bye-bye!
[19-Jun-2013 22:02:22] NOTICE: exiting, bye-bye!
&lt;/pre&gt;&lt;pre class="wiki"&gt;grep bye-bye php53-fpm-error.log.1
[18-Jun-2013 08:14:30] NOTICE: exiting, bye-bye!
[18-Jun-2013 10:02:57] NOTICE: exiting, bye-bye!
[18-Jun-2013 11:00:54] NOTICE: exiting, bye-bye!
[18-Jun-2013 14:20:58] NOTICE: exiting, bye-bye!
[18-Jun-2013 14:36:25] NOTICE: exiting, bye-bye!
[18-Jun-2013 15:00:36] NOTICE: exiting, bye-bye!
[18-Jun-2013 16:00:49] NOTICE: exiting, bye-bye!
[18-Jun-2013 16:30:00] NOTICE: exiting, bye-bye!
[18-Jun-2013 16:50:54] NOTICE: exiting, bye-bye!
[18-Jun-2013 17:00:53] NOTICE: exiting, bye-bye!
[18-Jun-2013 18:00:30] NOTICE: exiting, bye-bye!
[18-Jun-2013 20:36:30] NOTICE: exiting, bye-bye!
[18-Jun-2013 21:57:53] NOTICE: exiting, bye-bye!
[18-Jun-2013 23:20:41] NOTICE: exiting, bye-bye!
[19-Jun-2013 04:00:27] NOTICE: exiting, bye-bye!
[19-Jun-2013 06:29:00] NOTICE: exiting, bye-bye!
&lt;/pre&gt;&lt;pre class="wiki"&gt;zgrep bye-bye php53-fpm-error.log.2.gz
[17-Jun-2013 08:35:39] NOTICE: exiting, bye-bye!
[17-Jun-2013 10:16:08] NOTICE: exiting, bye-bye!
[17-Jun-2013 10:16:42] NOTICE: exiting, bye-bye!
[17-Jun-2013 12:48:29] NOTICE: exiting, bye-bye!
[17-Jun-2013 13:21:45] NOTICE: exiting, bye-bye!
[17-Jun-2013 14:00:39] NOTICE: exiting, bye-bye!
[17-Jun-2013 15:00:42] NOTICE: exiting, bye-bye!
[17-Jun-2013 15:30:22] NOTICE: exiting, bye-bye!
[17-Jun-2013 15:51:11] NOTICE: exiting, bye-bye!
[17-Jun-2013 17:06:52] NOTICE: exiting, bye-bye!
[17-Jun-2013 23:02:25] NOTICE: exiting, bye-bye!
[18-Jun-2013 01:01:03] NOTICE: exiting, bye-bye!
[18-Jun-2013 05:01:16] NOTICE: exiting, bye-bye!
[18-Jun-2013 06:26:18] NOTICE: exiting, bye-bye!
&lt;/pre&gt;&lt;pre class="wiki"&gt;zgrep bye-bye php53-fpm-error.log.3.gz
[16-Jun-2013 17:22:56] NOTICE: exiting, bye-bye!
[16-Jun-2013 22:03:49] NOTICE: exiting, bye-bye!
[17-Jun-2013 05:00:40] NOTICE: exiting, bye-bye!
[17-Jun-2013 06:00:39] NOTICE: exiting, bye-bye!
[17-Jun-2013 06:26:02] NOTICE: exiting, bye-bye!
&lt;/pre&gt;&lt;pre class="wiki"&gt;zgrep bye-bye php53-fpm-error.log.4.gz
[15-Jun-2013 08:15:51] NOTICE: exiting, bye-bye!
[15-Jun-2013 08:50:41] NOTICE: exiting, bye-bye!
[15-Jun-2013 20:00:26] NOTICE: exiting, bye-bye!
[16-Jun-2013 06:26:31] NOTICE: exiting, bye-bye!
&lt;/pre&gt;&lt;pre class="wiki"&gt;zgrep bye-bye php53-fpm-error.log.5.gz
[14-Jun-2013 13:50:57] NOTICE: exiting, bye-bye!
[14-Jun-2013 14:01:40] NOTICE: exiting, bye-bye!
[14-Jun-2013 14:35:59] NOTICE: exiting, bye-bye!
[15-Jun-2013 01:29:13] NOTICE: exiting, bye-bye!
[15-Jun-2013 01:35:29] NOTICE: exiting, bye-bye!
[15-Jun-2013 01:35:40] NOTICE: exiting, bye-bye!
[15-Jun-2013 01:39:17] NOTICE: exiting, bye-bye!
[15-Jun-2013 01:52:54] NOTICE: exiting, bye-bye!
[15-Jun-2013 02:55:20] NOTICE: exiting, bye-bye!
[15-Jun-2013 02:56:40] NOTICE: exiting, bye-bye!
[15-Jun-2013 05:20:34] NOTICE: exiting, bye-bye!
&lt;/pre&gt;&lt;p&gt;
When chatting with Jim on the phone today he wondered about the /var/xdrago/guest-fire.sh script, this runs every minute and checks logs and blocks hosts using the firewall. I haven't yet found where this script is called.
&lt;/p&gt;
&lt;p&gt;
We haven't yet solved the issue with the 503 errors, the last one was 1 min ago, these to the latest totals for today:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Total 502 errors: 12
Total 503 errors: 522
Total 504 errors: 0
&lt;/pre&gt;&lt;p&gt;
More tomorrow...
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Wed, 19 Jun 2013 23:12:07 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:8</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:8</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.6&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;4.15&lt;/em&gt; to &lt;em&gt;4.75&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
We are getting lots more 503 errors, the total for today is now:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Total 502 errors: 12
Total 503 errors: 823
Total 504 errors: 0
&lt;/pre&gt;&lt;p&gt;
The load is very low:
&lt;/p&gt;
&lt;pre class="wiki"&gt;uptime
 23:44:04 up 35 days, 22 min,  2 users,  load average: 0.13, 0.21, 0.25
uptime
 23:54:10 up 35 days, 32 min,  2 users,  load average: 0.03, 0.22, 0.28
uptime
 23:59:23 up 35 days, 37 min,  2 users,  load average: 0.24, 0.27, 0.27
uptime
 00:02:10 up 35 days, 40 min,  2 users,  load average: 0.36, 0.31, 0.28
&lt;/pre&gt;&lt;p&gt;
I have set this variable back to where it was:
&lt;/p&gt;
&lt;pre class="wiki"&gt;; The timeout for serving a single request after which a PHP backtrace will be
; dumped to the 'slowlog' file. A value of '0s' means 'off'.
; Available units: s(econds)(default), m(inutes), h(ours), or d(ays)
; Default Value: 0
; chris
request_slowlog_timeout = 30s
;request_slowlog_timeout = 60s
&lt;/pre&gt;&lt;p&gt;
As we had nothing in the php error log since I increased it to 60 seconds.
&lt;/p&gt;
&lt;p&gt;
Commenting out the seconds.sh script hasn't solved the 503 problem, I was tempted to leave it commented out over night and reassess it in the morning as I'm not sure what will happen when there is a spike in the load with it not running, however as the 503s are really stacking up faster than ever I'm going to uncomment it again to see if that reduces them again.
&lt;/p&gt;
&lt;p&gt;
Again the clients getting the 503 are mostly bots -- perhaps this is due to some BOA script(s) intentionally denying access to bots?
&lt;/p&gt;
&lt;p&gt;
I'm not sure if we should also be considering tweaking some mysql settings:
&lt;/p&gt;
&lt;pre class="wiki"&gt;/usr/local/bin/mysqltuner.pl
 &amp;gt;&amp;gt;  MySQLTuner 1.2.0 - Major Hayden &amp;lt;major@mhtx.net&amp;gt;
 &amp;gt;&amp;gt;  Bug reports, feature requests, and downloads at http://mysqltuner.com/
 &amp;gt;&amp;gt;  Run with '--help' for additional options and output filtering
[OK] Logged in using credentials from debian maintenance account.
-------- General Statistics --------------------------------------------------
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.5.31-MariaDB-1~squeeze-log
[OK] Operating on 64-bit architecture
-------- Storage Engine Statistics -------------------------------------------
[--] Status: +Archive -BDB +Federated +InnoDB -ISAM -NDBCluster
[--] Data in MyISAM tables: 104M (Tables: 2)
[--] Data in InnoDB tables: 452M (Tables: 1031)
[--] Data in PERFORMANCE_SCHEMA tables: 0B (Tables: 17)
[!!] Total fragmented tables: 100
-------- Security Recommendations  -------------------------------------------
[OK] All database users have passwords assigned
-------- Performance Metrics -------------------------------------------------
[--] Up for: 1d 12h 22m 0s (8M q [68.424 qps], 222K conn, TX: 16B, RX: 1B)
[--] Reads / Writes: 88% / 12%
[--] Total buffers: 1.1G global + 13.4M per thread (75 max threads)
[OK] Maximum possible memory usage: 2.1G (52% of installed RAM)
[OK] Slow queries: 0% (44/8M)
[OK] Highest usage of available connections: 74% (56/75)
[OK] Key buffer size / total MyISAM indexes: 509.0M/92.3M
[OK] Key buffer hit rate: 98.1% (15M cached / 281K reads)
[OK] Query cache efficiency: 73.0% (5M cached / 8M selects)
[!!] Query cache prunes per day: 937872
[OK] Sorts requiring temporary tables: 1% (4K temp sorts / 243K sorts)
[!!] Joins performed without indexes: 10338
[!!] Temporary tables created on disk: 29% (83K on disk / 285K total)
[OK] Thread cache hit rate: 99% (56 created / 222K connections)
[!!] Table cache hit rate: 0% (128 open / 60K opened)
[OK] Open file limit used: 0% (4/196K)
[OK] Table locks acquired immediately: 99% (3M immediate / 3M locks)
[OK] InnoDB data size / buffer pool: 452.9M/509.0M
-------- Recommendations -----------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
    Adjust your join queries to always utilize indexes
    When making adjustments, make tmp_table_size/max_heap_table_size equal
    Reduce your SELECT DISTINCT queries without LIMIT clauses
    Increase table_cache gradually to avoid file descriptor limits
Variables to adjust:
    query_cache_size (&amp;gt; 64M)
    join_buffer_size (&amp;gt; 1.0M, or always use indexes with joins)
    tmp_table_size (&amp;gt; 64M)
    max_heap_table_size (&amp;gt; 128M)
    table_cache (&amp;gt; 128)
&lt;/pre&gt;&lt;p&gt;
Really off to bed now...
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Wed, 19 Jun 2013 23:44:11 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:9</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:9</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.5&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;4.75&lt;/em&gt; to &lt;em&gt;5.25&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
One final thought, the 503 errors and the periods of downtime reported in &lt;a class="closed ticket" href="http://localhost:8080/trac/ticket/555" title="maintenance: Load spikes causing the TN site to be stopped for 15 min at a time (closed: fixed)"&gt;ticket:555&lt;/a&gt; might be unrelated -- I suspect that the down times are caused by second.sh killing almost everything when the load gets to 14.44 -- it does the following:
&lt;/p&gt;
&lt;pre class="wiki"&gt;  /etc/init.d/nginx stop
  killall -9 nginx
  sleep 1
  killall -9 nginx
  /etc/init.d/php-fpm stop
  /etc/init.d/php53-fpm stop
  killall -9 php-fpm php-cgi
&lt;/pre&gt;&lt;p&gt;
Which seems totally crazy to me -- just because the load reaches 14 we shouldn't brutally kill all the key processes that run the site should we?
&lt;/p&gt;
&lt;p&gt;
Running total for the 503s:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Total 502 errors: 12
Total 503 errors: 1570
Total 504 errors: 0
&lt;/pre&gt;&lt;p&gt;
I have commented second.sh out again but I think we should experiment with the thresholds of things being killed -- I think perhaps they are set for servers with a lot less than 14 CPUs -- Jim what values do you have on your server for these in /var/xdrago/second.sh -- are these values hard coded irrespective of the number of CPUs or are they generated by BOA based on the number of CPUs?
&lt;/p&gt;
&lt;pre class="wiki"&gt;CTL_ONEX_SPIDER_LOAD=388
CTL_FIVX_SPIDER_LOAD=388
CTL_ONEX_LOAD=1444
CTL_FIVX_LOAD=888
CTL_ONEX_LOAD_CRIT=1888
CTL_FIVX_LOAD_CRIT=1555
&lt;/pre&gt;&lt;p&gt;
Note how they are multiplied:
&lt;/p&gt;
&lt;pre class="wiki"&gt;ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg`
FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg`
&lt;/pre&gt;&lt;p&gt;
So for example:
&lt;/p&gt;
&lt;pre class="wiki"&gt;uptime ; awk '{print $1*100}' /proc/loadavg
 00:31:03 up 35 days,  1:09,  2 users,  load average: 0.50, 0.42, 0.36
 50
&lt;/pre&gt;&lt;p&gt;
A load of 0.5 is 50 for the CTL_ONEX_ variables, see &lt;a class="closed ticket" href="http://localhost:8080/trac/ticket/563#comment:3" title="maintenance: 503 Errors (closed: fixed)"&gt;ticket:563#comment:3&lt;/a&gt; for some more detail on what second.sh does.
&lt;/p&gt;
&lt;p&gt;
Do we also need to look at the redis settings? This is all we get each day in the redis log, from when it's restarted:
&lt;/p&gt;
&lt;pre class="wiki"&gt;[52978] 19 Jun 01:26:29.201 # User requested shutdown...
[52978] 19 Jun 01:26:30.366 # Redis is now ready to exit, bye bye...
[24363] 19 Jun 01:26:31.524 # Server started, Redis version 2.6.13
[24363] 19 Jun 01:26:31.524 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
&lt;/pre&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>jim</dc:creator>

      <pubDate>Thu, 20 Jun 2013 07:48:50 GMT</pubDate>
      <title></title>
      <link>http://localhost:8080/trac/ticket/563#comment:10</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:10</guid>
      <description>
        &lt;p&gt;
503s are a red herring I think... The rest of the stuff on this ticket is good though Chris.
&lt;/p&gt;
&lt;p&gt;
Bots/503s come from &lt;a class="ext-link" href="https://github.com/omega8cc/nginx-for-drupal/blob/master/aegir/conf/nginx_modern_include.conf#L17"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://github.com/omega8cc/nginx-for-drupal/blob/master/aegir/conf/nginx_modern_include.conf#L17&lt;/a&gt; and &lt;a class="ext-link" href="https://github.com/omega8cc/nginx-for-drupal/blob/master/aegir/conf/nginx_high_load_off.conf"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://github.com/omega8cc/nginx-for-drupal/blob/master/aegir/conf/nginx_high_load_off.conf&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
It's deliberate and a good idea IMHO to prevent bots getting to the site when it's really busy. BUT! clearly the detection of the high load is squiffy, hence the issues in &lt;a class="closed ticket" href="http://localhost:8080/trac/ticket/555" title="maintenance: Load spikes causing the TN site to be stopped for 15 min at a time (closed: fixed)"&gt;#555&lt;/a&gt; and your investigation of second.sh.
&lt;/p&gt;
&lt;p&gt;
So the strategy should be to make the BOA scripts work with the correct thresholds. There's a lot of good stuff (with the occasional headscratcher/WFT moment I agree!) wrapped up in these scripts from a team that hosts a LOT of fast Drupal sites. Clearly they don't have 14 cores, so the detection is off.
&lt;/p&gt;
&lt;p&gt;
We need to either find the variable that sets this, or manually change the thresholds ourselves.
&lt;/p&gt;
&lt;p&gt;
To get going faster, I suggest you search in &lt;a class="missing wiki"&gt;GitHub?&lt;/a&gt; rather than manually looking at the scripts... e.g. two searches:
&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;&lt;a class="ext-link" href="https://github.com/omega8cc/nginx-for-drupal/search?q=cpu&amp;amp;ref=cmdform"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://github.com/omega8cc/nginx-for-drupal/search?q=cpu&amp;amp;ref=cmdform&lt;/a&gt;
&lt;/li&gt;&lt;li&gt;&lt;a class="ext-link" href="https://github.com/omega8cc/nginx-for-drupal/search?q=load&amp;amp;ref=cmdform"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://github.com/omega8cc/nginx-for-drupal/search?q=load&amp;amp;ref=cmdform&lt;/a&gt;
&lt;/li&gt;&lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>jim</dc:creator>

      <pubDate>Thu, 20 Jun 2013 07:48:56 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:11</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:11</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.5&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;5.25&lt;/em&gt; to &lt;em&gt;5.75&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>jim</dc:creator>

      <pubDate>Thu, 20 Jun 2013 07:49:38 GMT</pubDate>
      <title></title>
      <link>http://localhost:8080/trac/ticket/563#comment:12</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:12</guid>
      <description>
        &lt;p&gt;
I'll look at the firewall late tonight... per comment 2.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Mon, 24 Jun 2013 09:53:00 GMT</pubDate>
      <title>hours, priority, status, description, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:13</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:13</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.5&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;priority&lt;/strong&gt;
                changed from &lt;em&gt;critical&lt;/em&gt; to &lt;em&gt;major&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;new&lt;/em&gt; to &lt;em&gt;accepted&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/trac/ticket/563?action=diff&amp;amp;version=13"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;5.75&lt;/em&gt; to &lt;em&gt;6.25&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
These are the recent summary of results from the &lt;tt&gt;/usr/local/bin/50x-errors&lt;/tt&gt; script which is run every day just before the nginx logs are rotated:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Date: Wed, 19 Jun 2013 14:53:27 +0100
Subject: 7 502, 173 503 and 0 504 errors from puffin.webarch.net
Date: Thu, 20 Jun 2013 06:25:29 +0100
Subject: 12 502, 1570 503 and 0 504 errors from puffin.webarch.net
Date: Fri, 21 Jun 2013 06:25:20 +0100
Subject: 464 502, 65 503 and 1 504 errors from puffin.webarch.net
Date: Sat, 22 Jun 2013 06:25:24 +0100
Subject: 0 502, 149 503 and 1 504 errors from puffin.webarch.net
Date: Sun, 23 Jun 2013 06:25:47 +0100
Subject: 1 502, 212 503 and 0 504 errors from puffin.webarch.net
Date: Mon, 24 Jun 2013 06:25:16 +0100
Subject: 2 502, 103 503 and 1 504 errors from puffin.webarch.net
&lt;/pre&gt;&lt;p&gt;
It's worth noting that since the downtime on 20th June, which is in the stats above for 21st June as the stats are generated at 6:25am, there have been very few 502 or 504 errors. The 503 errors for the last few days have been generated when the site is in high load mode and this has been happening quite often, see the spikes on the graphs here: &lt;a class="closed ticket" href="http://localhost:8080/trac/ticket/555#comment:54" title="maintenance: Load spikes causing the TN site to be stopped for 15 min at a time (closed: fixed)"&gt;ticket:555#comment:54&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
The thresholds in &lt;tt&gt;second.sh&lt;/tt&gt; were initially multiplied by four, see &lt;a class="closed ticket" href="http://localhost:8080/trac/ticket/555#comment:43" title="maintenance: Load spikes causing the TN site to be stopped for 15 min at a time (closed: fixed)"&gt;ticket:555#comment:43&lt;/a&gt;, and then by five, see &lt;a class="closed ticket" href="http://localhost:8080/trac/ticket/555#comment:52" title="maintenance: Load spikes causing the TN site to be stopped for 15 min at a time (closed: fixed)"&gt;ticket:555#comment:52&lt;/a&gt;, and the RAM was doubled after the 20th June downtime, from 4GB to 8GB.
&lt;/p&gt;
&lt;p&gt;
I think it would perhaps be worth doing a bit more work on the &lt;tt&gt;50x-errors&lt;/tt&gt; script -- if it counted, sorted and listed the user agents that were triggering the 503 errors this would be a good check that it is an error that is just served to bots.
&lt;/p&gt;
&lt;p&gt;
Apart from that this ticket is probably ready to be closed.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Wed, 10 Jul 2013 09:16:08 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/563#comment:14</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:14</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.5&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;6.25&lt;/em&gt; to &lt;em&gt;6.75&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
I have done some work on the script which checks for web server errors codes each day and emails a summary, it's now documented here: &lt;a class="wiki" href="http://localhost:8080/trac/wiki/ErrorCodeCheck"&gt;wiki:ErrorCodeCheck&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
Yesterday was quite a bad day, see &lt;a class="closed ticket" href="http://localhost:8080/trac/ticket/569" title="defect: 403s served to editors, admin very slow (closed: fixed)"&gt;ticket:569&lt;/a&gt;, the script reported:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Date: Wed, 10 Jul 2013 06:25:28 +0100
Subject: 770 403, 4219 404, 12 502, 2271 503 and 0 504 errors from puffin.webarch.net
&lt;/pre&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Fri, 15 Nov 2013 12:44:11 GMT</pubDate>
      <title>hours, status, totalhours changed; resolution set</title>
      <link>http://localhost:8080/trac/ticket/563#comment:15</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/563#comment:15</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.2&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;accepted&lt;/em&gt; to &lt;em&gt;closed&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;resolution&lt;/strong&gt;
                set to &lt;em&gt;fixed&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;6.75&lt;/em&gt; to &lt;em&gt;6.95&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Since early August there has only been one day on which there were more than 100 502 errors however there have been days when the 503 errors have been over 6k, see the list posted on &lt;a class="closed ticket" href="http://localhost:8080/trac/ticket/483#comment:63" title="defect: Nginx 502 Bad Gateway Errors with BOA (closed: fixed)"&gt;ticket:483#comment:63&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
I have added a new section to the Puffin wiki page, &lt;a class="wiki" href="http://localhost:8080/trac/wiki/PuffinServer#HTTPStats"&gt;wiki:PuffinServer#HTTPStats&lt;/a&gt; and the whole server load issue has been followed up on &lt;a class="closed ticket" href="http://localhost:8080/trac/ticket/555" title="maintenance: Load spikes causing the TN site to be stopped for 15 min at a time (closed: fixed)"&gt;ticket:555&lt;/a&gt; and it is documented at &lt;a class="wiki" href="http://localhost:8080/trac/wiki/PuffinServer#LoadSpikes"&gt;wiki:PuffinServer#LoadSpikes&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
I think this ticket can now be closed.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item>
 </channel>
</rss>