Changes between Version 130 and Version 131 of PuffinServer


Timestamp:
06/02/14 11:50:22 (2 years ago)
Author:
chris
Comment:

Load Spike suicide notes archived to PuffinServerBoaLoadSpikes

  • PuffinServer

    v130 v131  
    3131== Load Spikes == 
    3232 
    33 The server has been suffering from load spikes which cause the site to be unresponsive for clients. You can see the current status via the [https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/load.html puffin Munin load graph]; note the Max values for the last day, week, month and year. 
    34  
    35 '''When the load hits 23.28, robots are served 403 Forbidden responses; when it hits 86.64, maintenance tasks are killed; and when it hits 113.28, the server terminates until the 5 min load average falls below 93.30.''' 
    36  
    37 The [ticket:563#second.sh default thresholds] have been changed as they were causing [ticket:555 the site to shut down for 15 min at a time] far too often; the [ticket:555#comment:124 current values] were applied on [ticket:555#comment:126 23rd October 2013]. 
    38  
    39 The server has 14 CPU cores (see [https://en.wikipedia.org/wiki/Load_average#Unix-style_load_calculation Unix-style load calculation]). The current thresholds are generated from these variables in {{{/root/.barracuda.cnf}}}; the commented-out values are the defaults: 
    40  
    41 {{{ 
    42 #_LOAD_LIMIT_ONE=1444 
    43 #_LOAD_LIMIT_TWO=888 
    44 _LOAD_LIMIT_ONE=8664 
    45 _LOAD_LIMIT_TWO=5328 
    46 }}} 
    47  
    48 These variables are used by the {{{/var/xdrago/second.sh}}} script, which is run every minute via cron and has an internal loop which causes it to run 5 times, waiting 10 seconds between each run. It has the following variables in it (these have been [ticket:555#comment:126 edited from their default values]): 
    49  
    50 {{{ 
    51 ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg` 
    52 FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg` 
    53 CTL_ONEX_SPIDER_LOAD=2328 
    54 CTL_FIVX_SPIDER_LOAD=2328 
    55 CTL_ONEX_LOAD=8664 
    56 CTL_FIVX_LOAD=5328 
    57 CTL_ONEX_LOAD_CRIT=11328 
    58 CTL_FIVX_LOAD_CRIT=9330 
    59 }}} 
    60  
    61 These values translate to the following loads for comparison to the Munin graphs: 
    62  
    63 * ONEX_LOAD: load average over the last minute times 100 
    64 * FIVX_LOAD: load average over the last 5 minutes times 100 
    65 * CTL_ONEX_SPIDER_LOAD: 23.28 
    66 * CTL_FIVX_SPIDER_LOAD: 23.28 
    67 * CTL_ONEX_LOAD: 86.64 
    68 * CTL_FIVX_LOAD: 53.28 
    69 * CTL_ONEX_LOAD_CRIT: 113.28 
    70 * CTL_FIVX_LOAD_CRIT: 93.30 
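The conversion in the list above is a division by 100. A minimal sketch of it follows, assuming only the values quoted on this page; the helper name {{{to_load}}} is hypothetical and is not part of {{{second.sh}}}.

```shell
#!/bin/sh
# Hypothetical helper (not part of second.sh): convert a threshold stored as
# "load average times 100" back to a plain load average, so it can be
# compared with the values shown on the Munin load graph.
to_load() {
    awk -v v="$1" 'BEGIN { printf "%.2f\n", v / 100 }'
}

# Threshold values as quoted on this page.
for name_value in CTL_ONEX_SPIDER_LOAD=2328 CTL_FIVX_SPIDER_LOAD=2328 \
                  CTL_ONEX_LOAD=8664 CTL_FIVX_LOAD=5328 \
                  CTL_ONEX_LOAD_CRIT=11328 CTL_FIVX_LOAD_CRIT=9330; do
    name=${name_value%%=*}     # text before the first "="
    value=${name_value#*=}     # text after the first "="
    echo "$name: $(to_load "$value")"
done
```

The same helper can be pointed at a live reading, for example {{{to_load "$(awk '{print $1*100}' /proc/loadavg)"}}}, to check a current value against the thresholds.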
    71  
    72 And the logic, translated into English, is: 
    73  
    74 1. If the load average over the last minute is greater than 23.28 and less than 86.64 and the nginx high load config isn't in use then start to use it. 
    75 2. Else if the load average over the last 5 mins is greater than 23.28 and less than 53.28 and the nginx high load config isn't in use then start to use it. 
    76 3. Else if the load average over the last minute is less than 23.28 and the load average over the last 5 mins is less than 23.28 and the nginx high load config is in use then stop using it. 
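The three rules above can be sketched as a dry-run shell function. This is an illustration of the English description only, not the code in {{{second.sh}}}: the function name {{{spider_action}}}, its arguments and the {{{enable}}}/{{{disable}}}/{{{none}}} results are assumptions made for the example. Loads are passed as integers (load average times 100), matching {{{ONEX_LOAD}}} and {{{FIVX_LOAD}}}.

```shell
#!/bin/sh
# Thresholds as quoted on this page (load average times 100).
CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328

# Hypothetical dry-run helper: prints what would be done with the nginx
# high load config, without touching nginx.
#   $1 = 1 min load x 100, $2 = 5 min load x 100,
#   $3 = "yes" if the high load config is already in use, otherwise "no"
spider_action() {
    one=$1
    five=$2
    in_use=$3
    if [ "$one" -gt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$one" -lt "$CTL_ONEX_LOAD" ] \
        && [ "$in_use" = "no" ]; then
        echo enable     # rule 1: 1 min load is in the high load band
    elif [ "$five" -gt "$CTL_FIVX_SPIDER_LOAD" ] && [ "$five" -lt "$CTL_FIVX_LOAD" ] \
        && [ "$in_use" = "no" ]; then
        echo enable     # rule 2: 5 min load is in the high load band
    elif [ "$one" -lt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$five" -lt "$CTL_FIVX_SPIDER_LOAD" ] \
        && [ "$in_use" = "yes" ]; then
        echo disable    # rule 3: both averages are back under the threshold
    else
        echo none       # no rule matched, leave the config as it is
    fi
}
```

For example, {{{spider_action 3000 1000 no}}} prints {{{enable}}} and {{{spider_action 1000 1000 yes}}} prints {{{disable}}}.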
    77  
    78 1. If the load average over the last minute is greater than 132.16: if {{{/var/run/boa_run.pid}}} exists, wait a second; if not, kill some maintenance jobs: {{{killall -9 php drush.php wget}}} 
    79 2. Else if the load average over the last 5 mins is greater than 108.85: if {{{/var/run/boa_run.pid}}} exists, wait a second; if not, kill some maintenance jobs: {{{killall -9 php drush.php wget}}} 
    80  
    81 1. If the load average over the last minute is greater than 101.08 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}} 
    82 2. Else if the load average over the last 5 mins is greater than 62.16 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}} 
    83 3. Else restart all the services via {{{/var/xdrago/proc_num_ctrl.cgi}}} 
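The last three rules can be sketched in the same dry-run style. The limits below are the figures stated in the list above (101.08 and 62.16, times 100); the function name {{{web_server_action}}} and the strings it prints are assumptions for illustration, and nothing is actually killed or restarted.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the action the rules above describe.
#   $1 = 1 min load x 100, $2 = 5 min load x 100
web_server_action() {
    one=$1
    five=$2
    if [ "$one" -gt 10108 ]; then
        echo kill-web-server     # 1 min load above 101.08
    elif [ "$five" -gt 6216 ]; then
        echo kill-web-server     # 5 min load above 62.16
    else
        echo restart-services    # below both limits
    fi
}
```

For example, {{{web_server_action 12000 1000}}} prints {{{kill-web-server}}} and {{{web_server_action 5000 5000}}} prints {{{restart-services}}}.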
    84  
    85 Tickets generated in relation to these issues include: 
    86  
    87 * ticket:483    Nginx 502 Bad Gateway Errors with BOA 
    88 * ticket:543    Puffin Load Spike 
    89 * ticket:552    Puffin Downtime 23rd May 2013 
    90 * ticket:554    Site slow down and MySQL load increase 
    91 * ticket:555    Load spikes causing the TN site to be stopped for 15 min at a time 
    92 * ticket:563    503 Errors 
    93 * ticket:569    403s served to editors, admin very slow 
    94 * ticket:576    Site down 
    95  
    96 A total of 77.5 hours was spent on the tickets listed above; the final one was closed on 15th December 2013 and the total time was added up, see ticket:555#comment:132. 
     33The documentation of the load spike suicides that the server suffered from in 2013 has been moved to wiki:PuffinServerBoaLoadSpikes as that documentation is outdated. 
     34 
     35When the server was updated to BOA-2.2.3 on ticket:721 the scripts in {{{/var/xdrago/}}} were changed; however, the load spike issue hasn't been fully resolved, see ticket:670#comment:22. 
    9736 
    9837== Tickets ==