Changes between Initial Version and Version 1 of PuffinServerBoaLoadSpikes


Timestamp:
06/02/14 11:43:55 (2 years ago)
Author:
chris
Comment:

Created as an archive of the BOA Load Spike documentation

  • PuffinServerBoaLoadSpikes

     1The following content is now out of date: an update in early 2014 replaced the scripts in {{{/var/xdrago/}}}, see Upgrade to BOA-2.2.3 Stable Edition, trac:ticket/721. 
     2 
     3For up-to-date documentation see wiki:PuffinServer 
     4 
     5== BOA Load Spikes == 
     6 
     7The server has been suffering from load spikes which cause the site to be unresponsive for clients. You can see the current status via the [https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/load.html puffin Munin load graph]; note the Max values for the last day, week, month and year. 
     8 
     9'''When the load hits 23.28, robots are served 403 Forbidden responses; when the load hits 86.64, maintenance tasks are killed; and when the load hits 113.28, the server terminates until the 5 min load average falls below 93.30.''' 
     10 
     11The [ticket:563#second.sh default thresholds] have been changed as they were causing [ticket:555 the site to shut down for 15 min at a time] far too often. The [ticket:555#comment:124 current values] were applied on [ticket:555#comment:126 23rd October 2013]. 
     12 
     13The server has 14 CPU cores, see [https://en.wikipedia.org/wiki/Load_average#Unix-style_load_calculation Unix-style load calculation]. The current thresholds are generated from these variables in {{{/root/.barracuda.cnf}}}; the commented-out values are the defaults: 
     14 
     15{{{ 
     16#_LOAD_LIMIT_ONE=1444 
     17#_LOAD_LIMIT_TWO=888 
     18_LOAD_LIMIT_ONE=8664 
     19_LOAD_LIMIT_TWO=5328 
     20}}} 
     21 
     22These variables are used by the {{{/var/xdrago/second.sh}}} script, which is run every minute via cron and has an internal loop which causes it to run 5 times, waiting 10 seconds between each run. It contains the following variables (these have been [ticket:555#comment:126 edited from their default values]): 
     23 
     24{{{ 
     25ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg` 
     26FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg` 
     27CTL_ONEX_SPIDER_LOAD=2328 
     28CTL_FIVX_SPIDER_LOAD=2328 
     29CTL_ONEX_LOAD=8664 
     30CTL_FIVX_LOAD=5328 
     31CTL_ONEX_LOAD_CRIT=11328 
     32CTL_FIVX_LOAD_CRIT=9330 
     33}}} 
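
Multiplying by 100 turns the two-decimal values in {{{/proc/loadavg}}} into whole numbers, so they can be compared with shell integer tests. A minimal sketch of that conversion, run against a made-up sample line rather than the live file; the {{{scale_load}}} helper is hypothetical and is not part of {{{second.sh}}}:

```shell
#!/bin/sh
# Hypothetical helper, not taken from second.sh.
# $1 = field number (1 = one minute, 2 = five minutes)
# $2 = a line in /proc/loadavg format
scale_load() {
    printf '%s\n' "$2" | awk -v f="$1" '{ print $f * 100 }'
}

# Made-up sample line in /proc/loadavg format.
sample="23.28 12.50 8.01 3/512 12345"

ONEX_LOAD=$(scale_load 1 "$sample")
FIVX_LOAD=$(scale_load 2 "$sample")
echo "ONEX_LOAD=$ONEX_LOAD FIVX_LOAD=$FIVX_LOAD"
```

On the live server the same awk expression would read {{{/proc/loadavg}}} directly, as in the listing.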
     34 
     35These values translate to the following loads for comparison to the Munin graphs: 
     36 
     37* ONEX_LOAD: load average over the last minute times 100 
     38* FIVX_LOAD: load average over the last 5 minutes times 100 
     39* CTL_ONEX_SPIDER_LOAD: 23.28 
     40* CTL_FIVX_SPIDER_LOAD: 23.28 
     41* CTL_ONEX_LOAD: 86.64 
     42* CTL_FIVX_LOAD: 53.28 
     43* CTL_ONEX_LOAD_CRIT: 113.28 
     44* CTL_FIVX_LOAD_CRIT: 93.30 
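
The conversion in the list is a division by 100. As a rough sketch, the following prints a few of the thresholds in the form shown on the Munin graphs; the {{{to_load}}} helper is a hypothetical name used only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: convert a threshold stored as load x 100
# back into a load average with two decimal places.
to_load() {
    awk -v v="$1" 'BEGIN { printf "%.2f\n", v / 100 }'
}

# Values copied from the second.sh variable listing.
for pair in CTL_ONEX_SPIDER_LOAD=2328 CTL_ONEX_LOAD=8664 CTL_FIVX_LOAD_CRIT=9330; do
    echo "${pair%%=*}: $(to_load "${pair#*=}")"
done
```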
     45 
     46And the logic, translated into English, is: 
     47 
     481. If the load average over the last minute is greater than 23.28 and less than 86.64 and the nginx high load config isn't in use then start to use it. 
     492. Else if the load average over the last 5 mins is greater than 23.28 and less than 53.28 and the nginx high load config isn't in use then start to use it. 
     503. Else if the load average over the last minute is less than 23.28 and the load average over the last 5 mins is less than 23.28 and the nginx high load config is in use then stop using it. 
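
The three rules for the nginx high load config can be sketched as a dry-run shell function. This is an illustration based only on the description here, not the code from {{{second.sh}}}; the function name {{{spider_action}}} and its printed results ({{{enable}}}, {{{disable}}}, {{{keep}}}) are hypothetical, and the thresholds are copied from the variable listing:

```shell
#!/bin/sh
# Thresholds copied from the second.sh variable listing (load x 100).
CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328

# Hypothetical dry-run function: prints the action instead of performing it.
# $1 = one minute load x 100, $2 = five minute load x 100,
# $3 = 1 if the nginx high load config is in use, otherwise 0
spider_action() {
    if [ "$1" -gt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$1" -lt "$CTL_ONEX_LOAD" ] && [ "$3" -eq 0 ]; then
        echo enable
    elif [ "$2" -gt "$CTL_FIVX_SPIDER_LOAD" ] && [ "$2" -lt "$CTL_FIVX_LOAD" ] && [ "$3" -eq 0 ]; then
        echo enable
    elif [ "$1" -lt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$2" -lt "$CTL_FIVX_SPIDER_LOAD" ] && [ "$3" -eq 1 ]; then
        echo disable
    else
        echo keep
    fi
}

spider_action 3000 1500 0   # one minute load 30.00, config not in use
spider_action 1000 1200 1   # both loads under 23.28, config in use
```

The two calls print {{{enable}}} and {{{disable}}} respectively.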
     51 
     521. If the load average over the last minute is greater than 86.64 then, if {{{/var/run/boa_run.pid}}} exists, wait a second; if not, kill some maintenance jobs: {{{killall -9 php drush.php wget}}} 
     532. Else if the load average over the last 5 mins is greater than 53.28 then, if {{{/var/run/boa_run.pid}}} exists, wait a second; if not, kill some maintenance jobs: {{{killall -9 php drush.php wget}}} 
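
A dry-run sketch of the maintenance job rule, assuming it compares against the {{{CTL_ONEX_LOAD}}} and {{{CTL_FIVX_LOAD}}} values from the variable listing. Both branches lead to the same action, so they are collapsed into one test here. The function name {{{maintenance_action}}} is hypothetical, and it prints the decision rather than calling {{{killall}}}:

```shell
#!/bin/sh
# Thresholds copied from the second.sh variable listing (load x 100).
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328

# Hypothetical dry-run function: prints "wait", "kill" or "none".
# $1 = one minute load x 100, $2 = five minute load x 100,
# $3 = path of the PID file to check (/var/run/boa_run.pid on the server)
maintenance_action() {
    if [ "$1" -gt "$CTL_ONEX_LOAD" ] || [ "$2" -gt "$CTL_FIVX_LOAD" ]; then
        if [ -e "$3" ]; then
            echo wait    # the PID file exists, so only pause for a second
        else
            echo kill    # the real script would run: killall -9 php drush.php wget
        fi
    else
        echo none
    fi
}

maintenance_action 9000 1000 /var/run/boa_run.pid
```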
     54 
     551. If the load average over the last minute is greater than 113.28 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}} 
     562. Else if the load average over the last 5 mins is greater than 93.30 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}} 
     573. Else restart all the services via {{{/var/xdrago/proc_num_ctrl.cgi}}} 
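
The critical rule can be sketched the same way, assuming it uses the {{{CTL_ONEX_LOAD_CRIT}}} and {{{CTL_FIVX_LOAD_CRIT}}} values from the variable listing. The function name {{{critical_action}}} and the printed labels are hypothetical; nothing is killed or restarted by this sketch:

```shell
#!/bin/sh
# Thresholds copied from the second.sh variable listing (load x 100).
CTL_ONEX_LOAD_CRIT=11328
CTL_FIVX_LOAD_CRIT=9330

# Hypothetical dry-run function: prints the action instead of performing it.
# $1 = one minute load x 100, $2 = five minute load x 100
critical_action() {
    if [ "$1" -gt "$CTL_ONEX_LOAD_CRIT" ]; then
        echo kill-web            # killall -9 nginx; killall -9 php-fpm php-cgi
    elif [ "$2" -gt "$CTL_FIVX_LOAD_CRIT" ]; then
        echo kill-web
    else
        echo restart-services    # /var/xdrago/proc_num_ctrl.cgi
    fi
}

critical_action 12000 5000
critical_action 5000 5000
```

Because the comparison is strictly greater than, a load exactly equal to a threshold falls through to the restart branch.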
     58 
     59Tickets generated in relation to these issues include: 
     60 
     61* ticket:483    Nginx 502 Bad Gateway Errors with BOA 
     62* ticket:543    Puffin Load Spike 
     63* ticket:552    Puffin Downtime 23rd May 2013 
     64* ticket:554    Site slow down and MySQL load increase 
     65* ticket:555    Load spikes causing the TN site to be stopped for 15 min at a time 
     66* ticket:563    503 Errors 
     67* ticket:569    403s served to editors, admin very slow 
     68* ticket:576    Site down 
     69 
     70A total of 77.5 hours was spent on the tickets listed above. The final one was closed on 15th December 2013 and the total time was added up, see ticket:555#comment:132.