Changes between Version 130 and Version 131 of PuffinServer

Timestamp: 06/02/14 11:50:22
Author: chris
Comment: Load Spike suicide notes archived to PuffinServerBoaLoadSpikes

PuffinServer

-                      v130
+                      v131
 == Load Spikes ==
+The server has been suffering from load spikes which make the site unresponsive for clients. The current status can be seen on the [https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/load.html puffin Munin load graph]; note the Max values for the last day, week, month and year.
+'''When the load hits 23.28, robots are served 403 Forbidden responses; when the load hits 86.64, maintenance tasks are killed; and when the load hits 113.28, the server terminates until the 5 min load average falls below 93.30.'''
+The [ticket:563#second.sh default thresholds] were changed because they were causing [ticket:555 the site to shut down for 15 min at a time] far too often. The [ticket:555#comment:124 current values] were applied on [ticket:555#comment:126 23rd October 2013].
+The server has 14 CPU cores (see [https://en.wikipedia.org/wiki/Load_average#Unix-style_load_calculation Unix-style load calculation]). The current thresholds are generated from these variables in {{{/root/.barracuda.cnf}}}; the commented-out values are the defaults:
+{{{
+#_LOAD_LIMIT_ONE=1444
+#_LOAD_LIMIT_TWO=888
+_LOAD_LIMIT_ONE=8664
+_LOAD_LIMIT_TWO=5328
+}}}
+These variables are used by the {{{/var/xdrago/second.sh}}} script, which is run every minute via cron and has an internal loop that makes it run 5 times, waiting 10 seconds between each run. It contains the following variables (these have been [ticket:555#comment:126 edited from their default values]):
+{{{
+ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg`
+FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg`
+CTL_ONEX_SPIDER_LOAD=2328
+CTL_FIVX_SPIDER_LOAD=2328
+CTL_ONEX_LOAD=8664
+CTL_FIVX_LOAD=5328
+CTL_ONEX_LOAD_CRIT=11328
+CTL_FIVX_LOAD_CRIT=9330
+}}}
+These values translate to the following loads for comparison to the Munin graphs:
+* ONEX_LOAD: load average over the last minute times 100
+* FIVX_LOAD: load average over the last 5 minutes times 100
+* CTL_ONEX_SPIDER_LOAD: 23.28
+* CTL_FIVX_SPIDER_LOAD: 23.28
+* CTL_ONEX_LOAD: 86.64
+* CTL_FIVX_LOAD: 53.28
+* CTL_ONEX_LOAD_CRIT: 113.28
+* CTL_FIVX_LOAD_CRIT: 93.30
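+The conversion between the scaled integers in the script and the load averages shown on the Munin graphs is a division by 100. A minimal sketch of that arithmetic follows; the {{{to_load}}} helper is hypothetical and is not part of {{{second.sh}}}, while the threshold names and values are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper, not part of second.sh: convert a threshold stored as
# "load average times 100" back into a load average with two decimals.
to_load() {
  awk -v scaled="$1" 'BEGIN { printf "%.2f\n", scaled / 100 }'
}

# Threshold names and values as listed in the variables above.
for pair in CTL_ONEX_SPIDER_LOAD=2328 CTL_FIVX_SPIDER_LOAD=2328 \
            CTL_ONEX_LOAD=8664 CTL_FIVX_LOAD=5328 \
            CTL_ONEX_LOAD_CRIT=11328 CTL_FIVX_LOAD_CRIT=9330; do
  name=${pair%%=*}    # text before the first "="
  value=${pair#*=}    # text after the first "="
  echo "$name: $(to_load "$value")"
done
```

+Working in integers scaled by 100 lets the script compare loads with the shell's integer tests, which cannot handle decimal fractions.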
+And the logic, translated into English, is:
+. If the load average over the last minute is greater than 23.28 and less than 86.64 and the nginx high load config isn't in use then start to use it.
+. Else if the load average over the last 5 mins is greater than 23.28 and less than 53.28 and the nginx high load config isn't in use then start to use it.
+. Else if the load average over the last minute is less than 23.28 and the load average over the last 5 mins is less than 23.28 and the nginx high load config is in use then stop using it.
+. If the load average over the last minute is greater than 113.28 then, if {{{/var/run/boa_run.pid}}} exists, wait a second, otherwise kill some maintenance jobs: {{{killall -9 php drush.php wget}}}
+. Else if the load average over the last 5 mins is greater than 93.30 then, if {{{/var/run/boa_run.pid}}} exists, wait a second, otherwise kill some maintenance jobs: {{{killall -9 php drush.php wget}}}
+. If the load average over the last minute is greater than 86.64 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}}
+. Else if the load average over the last 5 mins is greater than 53.28 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}}
+. Else restart all the services via {{{/var/xdrago/proc_num_ctrl.cgi}}}
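+The three groups of checks above can be sketched as a shell function that only reports what would be done and kills nothing. This is an illustration, not the real {{{/var/xdrago/second.sh}}}: the function name {{{decide_actions}}}, its arguments and its output strings are hypothetical, and it assumes that the maintenance job check uses the {{{*_CRIT}}} thresholds while the web server check uses {{{CTL_ONEX_LOAD}}} and {{{CTL_FIVX_LOAD}}}.

```shell
#!/bin/sh
# Illustrative sketch only, not the real /var/xdrago/second.sh.
# Thresholds are load averages times 100, as in the variables listed earlier.
CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328
CTL_ONEX_LOAD_CRIT=11328
CTL_FIVX_LOAD_CRIT=9330

# decide_actions ONEX FIVX HIGH_LOAD_IN_USE
#   ONEX, FIVX        1 and 5 minute load averages times 100 (integers)
#   HIGH_LOAD_IN_USE  "yes" or "no": is the nginx high load config active?
# Prints one line per group of checks.
decide_actions() {
  onex=$1
  fivx=$2
  high=$3

  # Group 1: switch the nginx high load config on or off.
  if [ "$onex" -gt "$CTL_ONEX_SPIDER_LOAD" ] && \
     [ "$onex" -lt "$CTL_ONEX_LOAD" ] && [ "$high" = "no" ]; then
    echo "high-load-config: on"
  elif [ "$fivx" -gt "$CTL_FIVX_SPIDER_LOAD" ] && \
       [ "$fivx" -lt "$CTL_FIVX_LOAD" ] && [ "$high" = "no" ]; then
    echo "high-load-config: on"
  elif [ "$onex" -lt "$CTL_ONEX_SPIDER_LOAD" ] && \
       [ "$fivx" -lt "$CTL_FIVX_SPIDER_LOAD" ] && [ "$high" = "yes" ]; then
    echo "high-load-config: off"
  else
    echo "high-load-config: unchanged"
  fi

  # Group 2: critical load (assumed to use the *_CRIT thresholds).
  if [ "$onex" -gt "$CTL_ONEX_LOAD_CRIT" ] || \
     [ "$fivx" -gt "$CTL_FIVX_LOAD_CRIT" ]; then
    echo "maintenance-jobs: kill unless /var/run/boa_run.pid exists"
  else
    echo "maintenance-jobs: leave"
  fi

  # Group 3: kill the web server, or fall through to the service check.
  if [ "$onex" -gt "$CTL_ONEX_LOAD" ] || \
     [ "$fivx" -gt "$CTL_FIVX_LOAD" ]; then
    echo "web-server: kill nginx and php"
  else
    echo "web-server: run /var/xdrago/proc_num_ctrl.cgi"
  fi
}

# Example: 1 min load 30.00, 5 min load 10.00, high load config not in use.
decide_actions 3000 1000 no
```

+Keeping the decision separate from the {{{killall}}} calls makes it possible to try threshold changes against loads read off the Munin graphs without touching the live server.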
+Tickets generated in relation to these issues include:
+* ticket:483    Nginx 502 Bad Gateway Errors with BOA
+* ticket:543    Puffin Load Spike
+* ticket:552    Puffin Downtime 23rd May 2013
+* ticket:554    Site slow down and MySQL load increase
+* ticket:555    Load spikes causing the TN site to be stopped for 15 min at a time
+* ticket:563    503 Errors
+* ticket:569    403s served to editors, admin very slow
+* ticket:576    Site down
+A total of 77.5 hours was spent on the tickets listed above. The final one was closed on 15th December 2013, after which the time was added up; see ticket:555#comment:132.
+The documentation of the load spike suicides that the server suffered from in 2013 has been moved to wiki:PuffinServerBoaLoadSpikes as that documentation is outdated.
+When the server was updated to BOA-2.2.3 on ticket:721 the scripts in {{{/var/xdrago/}}} were changed. However, the load spike issue has not been definitively resolved; see ticket:670#comment:22.
 == Tickets ==