
Changes between Version 61 and Version 62 of PuffinServer

Timestamp: 10/14/13 12:47:14 (3 years ago)
Author: chris
Comment: --

PuffinServer

-                      v61
+                      v62
 See ticket:555#comment:13 for the notes regarding the installation of the mysql munin stats package.
+== Load Spikes ==
+The server has been suffering from load spikes that leave the site unresponsive for clients. The current status is shown on the [https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/load.html puffin Munin load graph].
+When the load hits 3.88, robots are served 403 Forbidden responses, and when the load hits 72.2 the server shuts down until the 5 min load average falls below 44.4.
+The [ticket:563#second.sh default thresholds] have been changed because they were causing [trac:ticket/555 the site to shut down for 15 min at a time] far too often.
+The current thresholds are generated from these variables in {{{/root/.barracuda.cnf}}}; the commented-out lines are the defaults:
+{{{
+#_LOAD_LIMIT_ONE=1444
+#_LOAD_LIMIT_TWO=888
+_LOAD_LIMIT_ONE=7220
+_LOAD_LIMIT_TWO=4440
+}}}
+These variables are used by the {{{/var/xdrago/second.sh}}} script, which is run every minute via cron and contains the following variables:
+{{{
+ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg`
+FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg`
+CTL_ONEX_SPIDER_LOAD=388
+CTL_FIVX_SPIDER_LOAD=388
+CTL_ONEX_LOAD=7220
+CTL_FIVX_LOAD=4440
+CTL_ONEX_LOAD_CRIT=1888
+CTL_FIVX_LOAD_CRIT=1555
+}}}
+These values translate to the following loads for comparison to the Munin graphs:
+* ONEX_LOAD: load average over the last minute times 100
+* FIVX_LOAD: load average over the last 5 minutes times 100
+* CTL_ONEX_SPIDER_LOAD: 3.88
+* CTL_FIVX_SPIDER_LOAD: 3.88
+* CTL_ONEX_LOAD: 72.20
+* CTL_FIVX_LOAD: 44.40
+* CTL_ONEX_LOAD_CRIT: 18.88
+* CTL_FIVX_LOAD_CRIT: 15.55
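The conversion in both directions can be checked with a small sketch like the one below. The function names {{{scaled_to_load}}} and {{{load_to_scaled}}} are hypothetical helpers written for this page and are not part of {{{second.sh}}}; only the {{{awk}}} multiplication by 100 is taken from the script.

```shell
#!/bin/sh
# Hypothetical helpers (not part of second.sh) for comparing the scaled
# thresholds with the load figures shown on the Munin graphs.

# Convert a scaled threshold (load average times 100) to a plain load figure.
# $1 = scaled integer, e.g. 7220
scaled_to_load() {
    awk -v scaled="$1" 'BEGIN { printf "%.2f\n", scaled / 100 }'
}

# Scale one field of a /proc/loadavg style line the same way second.sh does.
# $1 = field number (1 = one minute, 2 = five minutes)
# $2 = the line of text to read the field from
load_to_scaled() {
    printf '%s\n' "$2" | awk -v field="$1" '{ print $field * 100 }'
}
```

For example, {{{scaled_to_load 1888}}} prints the figure to look for on the graph, and {{{load_to_scaled 1 "$(cat /proc/loadavg)"}}} prints the value that would be compared against {{{CTL_ONEX_LOAD}}}.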
+And the logic, translated into English, is:
+. If the load average over the last minute is greater than 3.88 and less than 72.20 and the nginx high load config isn't in use then start to use it.
+. Else if the load average over the last 5 mins is greater than 3.88 and less than 44.40 and the nginx high load config isn't in use then start to use it.
+. Else if the load average over the last minute is less than 3.88 and the load average over the last 5 mins is less than 3.88 and the nginx high load config is in use then stop using it.
+. If the load average over the last minute is greater than 18.88 then, if {{{/var/run/boa_run.pid}}} exists, wait a second; otherwise kill some jobs: {{{killall -9 php drush.php wget}}}
+. Else if the load average over the last 5 mins is greater than 15.55 then, if {{{/var/run/boa_run.pid}}} exists, wait a second; otherwise kill some jobs: {{{killall -9 php drush.php wget}}}
+. If the load average over the last minute is greater than 72.20 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}}
+. Else if the load average over the last 5 mins is greater than 44.40 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}}
+. Else restart all the services via {{{/var/xdrago/proc_num_ctrl.cgi}}}
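The steps above form three separate if/else chains, which can be sketched as a dry run. This is only a reading of the English description, not the code from {{{second.sh}}}: the function {{{decide_actions}}} and the action names it prints are hypothetical, and nothing is killed or restarted. The thresholds are the scaled values listed earlier.

```shell
#!/bin/sh
# Dry-run sketch of the decision logic described above (assumed structure,
# not copied from second.sh). Thresholds are load averages times 100.
CTL_ONEX_SPIDER_LOAD=388
CTL_FIVX_SPIDER_LOAD=388
CTL_ONEX_LOAD=7220
CTL_FIVX_LOAD=4440
CTL_ONEX_LOAD_CRIT=1888
CTL_FIVX_LOAD_CRIT=1555

# decide_actions ONE FIVE HIGHLOAD BOA   (hypothetical helper)
#   ONE, FIVE : scaled one minute and five minute loads (integers)
#   HIGHLOAD  : 1 if the nginx high load config is in use, else 0
#   BOA       : 1 if /var/run/boa_run.pid exists, else 0
# Prints one hypothetical action name per line instead of running anything.
decide_actions() {
    one=$1 five=$2 highload=$3 boa=$4

    # Chain 1: switch the nginx high load config on or off.
    if [ "$one" -gt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$one" -lt "$CTL_ONEX_LOAD" ] && [ "$highload" -eq 0 ]; then
        echo enable_high_load_config
    elif [ "$five" -gt "$CTL_FIVX_SPIDER_LOAD" ] && [ "$five" -lt "$CTL_FIVX_LOAD" ] && [ "$highload" -eq 0 ]; then
        echo enable_high_load_config
    elif [ "$one" -lt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$five" -lt "$CTL_FIVX_SPIDER_LOAD" ] && [ "$highload" -eq 1 ]; then
        echo disable_high_load_config
    fi

    # Chain 2: above the critical thresholds, wait if boa is running,
    # otherwise kill php, drush.php and wget.
    if [ "$one" -gt "$CTL_ONEX_LOAD_CRIT" ] || [ "$five" -gt "$CTL_FIVX_LOAD_CRIT" ]; then
        if [ "$boa" -eq 1 ]; then echo wait; else echo kill_jobs; fi
    fi

    # Chain 3: above the top thresholds kill the web server,
    # otherwise restart services via proc_num_ctrl.cgi.
    if [ "$one" -gt "$CTL_ONEX_LOAD" ] || [ "$five" -gt "$CTL_FIVX_LOAD" ]; then
        echo kill_web_server
    else
        echo restart_services
    fi
}
```

Calling {{{decide_actions 500 300 0 0}}}, for instance, models a one minute load of 5.00 and a five minute load of 3.00 with the high load config off and no boa run in progress. In chains 2 and 3 the "if ... else if" pairs from the list are collapsed into a single "or" test, since both branches take the same action.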
 == Tickets ==