Version 1 (modified by chris, 2 years ago)
The following content is now out of date: an update in early 2014 replaced the scripts in /var/xdrago/. See Upgrade to BOA-2.2.3 Stable Edition, trac:ticket/721.
For up-to-date documentation see wiki:PuffinServer.
BOA Load Spikes
The server has been suffering from load spikes that make the site unresponsive for clients. The current status is shown on the puffin Munin load graph; note the Max values for the last day, week, month and year.
When the load hits 23.28, robots are served 403 Forbidden responses; when it hits 86.64, maintenance tasks are killed; and when it hits 113.28, the server terminates until the 5 minute load average falls below 93.30.
The default thresholds have been changed because they were causing the site to shut down for 15 minutes at a time far too often. The current values were applied on 23rd October 2013.
The server has 14 CPU cores (see Unix-style load calculation). The current thresholds are generated from these variables in /root/.barracuda.cnf; the commented-out values are the defaults:
#_LOAD_LIMIT_ONE=1444
#_LOAD_LIMIT_TWO=888
_LOAD_LIMIT_ONE=8664
_LOAD_LIMIT_TWO=5328
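As a quick check on how far the limits were raised, the following sketch compares the current values with the commented-out defaults. The helper name limit_ratio is hypothetical and is not part of BOA; only the four numbers come from the configuration quoted above.

```shell
# Default and current limits, copied from the /root/.barracuda.cnf excerpt.
DEFAULT_LIMIT_ONE=1444
DEFAULT_LIMIT_TWO=888
CURRENT_LIMIT_ONE=8664
CURRENT_LIMIT_TWO=5328

# Hypothetical helper: integer ratio of a current limit to its default.
limit_ratio() {
  echo $(( $1 / $2 ))
}

echo "_LOAD_LIMIT_ONE raised by a factor of $(limit_ratio "$CURRENT_LIMIT_ONE" "$DEFAULT_LIMIT_ONE")"
echo "_LOAD_LIMIT_TWO raised by a factor of $(limit_ratio "$CURRENT_LIMIT_TWO" "$DEFAULT_LIMIT_TWO")"
```

Both limits come out at six times their default value.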
These variables are used by the /var/xdrago/second.sh script, which is run every minute via cron. It has an internal loop that makes it run 5 times, waiting 10 seconds between each run, and it contains the following variables (these have been edited from their default values):
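The run pattern described above (five passes per cron invocation, ten seconds apart) might look roughly like the sketch below. This is not the contents of second.sh: RUNS, INTERVAL, check_load and run_passes are hypothetical names used only for illustration, and whether the real script also sleeps after the final pass is an assumption.

```shell
RUNS=${RUNS:-5}           # passes per cron invocation, as stated in the text
INTERVAL=${INTERVAL:-10}  # seconds to wait between passes, as stated in the text

# Hypothetical placeholder for the actual load checks.
check_load() {
  echo "pass $1: checking load"
}

run_passes() {
  pass=1
  while [ "$pass" -le "$RUNS" ]; do
    check_load "$pass"
    # Wait between passes, but not after the last one (assumption).
    if [ "$pass" -lt "$RUNS" ]; then
      sleep "$INTERVAL"
    fi
    pass=$((pass + 1))
  done
}
```

Calling run_passes once a minute from cron would give a load check roughly every ten seconds for most of each minute, which is finer than cron's one minute granularity allows on its own.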
ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg`
FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg`
CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328
CTL_ONEX_LOAD_CRIT=11328
CTL_FIVX_LOAD_CRIT=9330
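To see what the two awk expressions produce without touching a live server, the sketch below applies the same multiplication to a made-up line in the /proc/loadavg format. The helper scaled_load and the sample values are assumptions for illustration only.

```shell
# Hypothetical helper: scale the Nth field of a loadavg-style line by 100.
# $1 = field number (1 = one minute average, 2 = five minute average)
# stdin = a line in /proc/loadavg format
scaled_load() {
  awk -v field="$1" '{ print $field * 100 }'
}

# Made-up sample line; a real one would come from /proc/loadavg.
SAMPLE="1.50 0.75 0.25 2/345 6789"

ONEX_LOAD=$(echo "$SAMPLE" | scaled_load 1)
FIVX_LOAD=$(echo "$SAMPLE" | scaled_load 2)
echo "ONEX_LOAD=$ONEX_LOAD FIVX_LOAD=$FIVX_LOAD"
# prints: ONEX_LOAD=150 FIVX_LOAD=75
```

Scaling by 100 turns the two-decimal load averages into whole numbers, which the shell can then compare against the CTL_* thresholds using integer tests.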
These values translate to the following loads for comparison to the Munin graphs:
- ONEX_LOAD: load average over the last minute times 100
- FIVX_LOAD: load average over the last 5 minutes times 100
- CTL_ONEX_SPIDER_LOAD: 23.28
- CTL_FIVX_SPIDER_LOAD: 23.28
- CTL_ONEX_LOAD: 86.64
- CTL_FIVX_LOAD: 53.28
- CTL_ONEX_LOAD_CRIT: 113.28
- CTL_FIVX_LOAD_CRIT: 93.30
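The conversion in the list above is a division by 100. The sketch below reproduces it and, as a rough aid for reading the figures against the 14 CPU cores mentioned earlier, also prints the load per core. The helper names to_load and per_core are hypothetical.

```shell
CORES=14  # CPU core count stated in the text

# Hypothetical helper: convert a threshold (load times 100) back to a load average.
to_load() {
  awk -v v="$1" 'BEGIN { printf "%.2f\n", v / 100 }'
}

# Hypothetical helper: the same load divided by the number of cores.
per_core() {
  awk -v v="$1" -v c="$CORES" 'BEGIN { printf "%.2f\n", v / 100 / c }'
}

for pair in CTL_ONEX_SPIDER_LOAD=2328 CTL_FIVX_SPIDER_LOAD=2328 \
            CTL_ONEX_LOAD=8664 CTL_FIVX_LOAD=5328 \
            CTL_ONEX_LOAD_CRIT=11328 CTL_FIVX_LOAD_CRIT=9330; do
  name=${pair%%=*}   # text before the first "="
  value=${pair#*=}   # text after the first "="
  echo "$name: $(to_load "$value") total, $(per_core "$value") per core"
done
```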
And the logic, translated into English, is:
- If the load average over the last minute is greater than 23.28 and less than 86.64 and the nginx high load config isn't in use then start to use it.
- Else if the load average over the last 5 mins is greater than 23.28 and less than 53.28 and the nginx high load config isn't in use then start to use it.
- Else if the load average over the last minute is less than 23.28 and the load average over the last 5 mins is less than 23.28 and the nginx high load config is in use then stop using it.
- If the load average over the last minute is greater than 132.16 then, if /var/run/boa_run.pid exists, wait a second; otherwise kill some maintenance jobs: killall -9 php drush.php wget
- Else if the load average over the last 5 mins is greater than 108.85 then, if /var/run/boa_run.pid exists, wait a second; otherwise kill some maintenance jobs: killall -9 php drush.php wget
- If the load average over the last minute is greater than 101.08 then kill the web server: killall -9 nginx and killall -9 php-fpm php-cgi
- Else if the load average over the last 5 mins is greater than 62.16 then kill the web server: killall -9 nginx and killall -9 php-fpm php-cgi
- Else restart all the services via /var/xdrago/proc_num_ctrl.cgi
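The three decisions in the list above can be sketched as shell functions that only print the action they would take, so they are safe to run anywhere. This is an illustration, not the real second.sh: all function and variable names are hypothetical, loads are passed in as integers (load times 100), and the thresholds are the figures quoted in the bullets, which differ from the CTL_* values listed earlier.

```shell
# Thresholds as quoted in the bullets (load average times 100).
SPIDER=2328; ONE_MAX=8664; FIVE_MAX=5328
MAINT_ONE=13216; MAINT_FIVE=10885
WEB_ONE=10108; WEB_FIVE=6216

# $1 = one minute load, $2 = five minute load,
# $3 = "on" if the nginx high load config is already in use
high_load_action() {
  if [ "$1" -gt "$SPIDER" ] && [ "$1" -lt "$ONE_MAX" ] && [ "$3" != "on" ]; then
    echo "enable high load config"
  elif [ "$2" -gt "$SPIDER" ] && [ "$2" -lt "$FIVE_MAX" ] && [ "$3" != "on" ]; then
    echo "enable high load config"
  elif [ "$1" -lt "$SPIDER" ] && [ "$2" -lt "$SPIDER" ] && [ "$3" = "on" ]; then
    echo "disable high load config"
  else
    echo "no change"
  fi
}

# $1 = one minute load, $2 = five minute load,
# $3 = "yes" if /var/run/boa_run.pid exists
maintenance_action() {
  if [ "$1" -gt "$MAINT_ONE" ] || [ "$2" -gt "$MAINT_FIVE" ]; then
    if [ "$3" = "yes" ]; then
      echo "wait"
    else
      echo "kill maintenance jobs"
    fi
  else
    echo "no change"
  fi
}

# $1 = one minute load, $2 = five minute load
web_action() {
  if [ "$1" -gt "$WEB_ONE" ] || [ "$2" -gt "$WEB_FIVE" ]; then
    echo "kill web server"
  else
    echo "restart services"
  fi
}
```

For example, high_load_action 3000 1000 off reports that the high load config would be enabled, and web_action 100 100 reports a service restart. Printing the action instead of calling killall keeps the sketch harmless while still showing the order in which the conditions are tested.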
Tickets generated in relation to these issues include:
- ticket:483 Nginx 502 Bad Gateway Errors with BOA
- ticket:543 Puffin Load Spike
- ticket:552 Puffin Downtime 23rd May 2013
- ticket:554 Site slow down and MySQL load increase
- ticket:555 Load spikes causing the TN site to be stopped for 15 min at a time
- ticket:563 503 Errors
- ticket:569 403s served to editors, admin very slow
- ticket:576 Site down
A total of 77.5 hours was spent on the tickets listed above. The final one was closed on 15th December 2013, when the total time was added up; see ticket:555#comment:132.