wiki:PuffinServerBoaLoadSpikes

Version 1 (modified by chris, 2 years ago)

Created as an archive of the BOA Load Spike documentation

The following content is now out of date: an update in early 2014 replaced the scripts in /var/xdrago/ (see Upgrade to BOA-2.2.3 Stable Edition, trac:ticket/721).

For up-to-date documentation see wiki:PuffinServer.

BOA Load Spikes

The server has been suffering from load spikes which cause the site to be unresponsive for clients. The current status is shown on the puffin Munin load graph; note the Max values for the last day, week, month and year.

When the load hits 23.28, robots are served 403 Forbidden responses. When it hits 86.64, maintenance tasks are killed. When it hits 113.28, the server terminates until the 5 minute load average falls below 93.30.

The default thresholds have been changed because they were causing the site to shut down for 15 minutes at a time far too often. The current values were applied on 23rd October 2013.

The server has 14 CPU cores (see Unix-style load calculation). The current thresholds are generated from these variables in /root/.barracuda.cnf; the commented-out values are the defaults:

#_LOAD_LIMIT_ONE=1444
#_LOAD_LIMIT_TWO=888
_LOAD_LIMIT_ONE=8664
_LOAD_LIMIT_TWO=5328

These variables are used by the /var/xdrago/second.sh script, which is run every minute via cron and has an internal loop which causes it to run 5 times, waiting 10 seconds between each run. The script contains the following variables (these have been edited from their default values):

ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg`
FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg`
CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328
CTL_ONEX_LOAD_CRIT=11328
CTL_FIVX_LOAD_CRIT=9330

These values translate to the following loads for comparison to the Munin graphs:

  • ONEX_LOAD: load average over the last minute times 100
  • FIVX_LOAD: load average over the last 5 minutes times 100
  • CTL_ONEX_SPIDER_LOAD: 23.28
  • CTL_FIVX_SPIDER_LOAD: 23.28
  • CTL_ONEX_LOAD: 86.64
  • CTL_FIVX_LOAD: 53.28
  • CTL_ONEX_LOAD_CRIT: 113.28
  • CTL_FIVX_LOAD_CRIT: 93.30
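As a rough illustration of the scaling and the repeated sampling described above, the following sketch shows how the "times 100" values relate to the loads on the Munin graphs, and how a script started once a minute might run a check several times with a pause in between. The function names (scaled_load, to_load, run_checks) are hypothetical and are not taken from second.sh; only the awk scaling, the 5 runs and the 10 second wait come from this page.

```shell
# Sketch only; the helper names below are assumptions, not part of second.sh.

# Scale one field of a loadavg-style line (read from stdin) by 100.
# $1 is the field number: 1 for the 1 minute load, 2 for the 5 minute load.
scaled_load() {
  awk -v f="$1" '{ print $f * 100 }'
}

# Convert a threshold stored as "load times 100" back to a load average.
to_load() {
  awk -v v="$1" 'BEGIN { printf "%.2f\n", v / 100 }'
}

# Run a command a fixed number of times, pausing between runs.
# $1: number of runs (default 5), $2: pause in seconds (default 10),
# $3: command to run on each pass (default: true, which does nothing).
run_checks() {
  rc_runs="${1:-5}"
  rc_pause="${2:-10}"
  rc_check="${3:-true}"
  rc_i=1
  while [ "$rc_i" -le "$rc_runs" ]; do
    "$rc_check"
    # Skip the wait after the final pass.
    if [ "$rc_i" -lt "$rc_runs" ]; then
      sleep "$rc_pause"
    fi
    rc_i=$((rc_i + 1))
  done
}

to_load 8664   # prints 86.64
```

On a Linux host, `scaled_load 1 < /proc/loadavg` gives a value comparable with CTL_ONEX_LOAD, and `run_checks 5 10 some_check` would call a (hypothetical) some_check function five times within one cron invocation.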

And the logic, translated into English, is:

  1. If the load average over the last minute is greater than 23.28 and less than 86.64 and the nginx high load config isn't in use, then start to use it.
  2. Else if the load average over the last 5 mins is greater than 23.28 and less than 53.28 and the nginx high load config isn't in use, then start to use it.
  3. Else if the load average over the last minute is less than 23.28 and the load average over the last 5 mins is less than 23.28 and the nginx high load config is in use, then stop using it.

  1. If the load average over the last minute is greater than 132.16 then, if /var/run/boa_run.pid exists, wait a second; if not, kill some maintenance jobs: killall -9 php drush.php wget
  2. Else if the load average over the last 5 mins is greater than 108.85 then, if /var/run/boa_run.pid exists, wait a second; if not, kill some maintenance jobs: killall -9 php drush.php wget

  1. If the load average over the last minute is greater than 101.08 then kill the web server: killall -9 nginx and killall -9 php-fpm php-cgi
  2. Else if the load average over the last 5 mins is greater than 62.16 then kill the web server: killall -9 nginx and killall -9 php-fpm php-cgi
  3. Else restart all the services via /var/xdrago/proc_num_ctrl.cgi
Tickets generated in relation to these issues include:

A total of 77.5 hours was spent on the tickets listed above. The final one was closed on 15th December 2013 and the total time was added up, see ticket:555#comment:132.