| 18 | |
| 19 | == Load Spikes == |
| 20 | |
| 21 | The server has been suffering from load spikes that make the site unresponsive for clients. The current status is shown on the [https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/load.html puffin Munin load graph]. |
| 22 | |
| 23 | When the load hits 3.88, robots are served 403 Forbidden responses, and when the 1 minute load average hits 72.20 the web server is shut down until the 5 minute load average falls below 44.40. |
| 24 | |
| 25 | The [ticket:563#second.sh default thresholds] have been changed as they were causing [trac:ticket/555 the site to shut down for 15 min at a time] far too often. |
| 26 | |
| 27 | The current thresholds are generated from the following variables in {{{/root/.barracuda.cnf}}}; the defaults are shown commented out: |
| 28 | |
| 29 | {{{ |
| 30 | #_LOAD_LIMIT_ONE=1444 |
| 31 | #_LOAD_LIMIT_TWO=888 |
| 32 | _LOAD_LIMIT_ONE=7220 |
| 33 | _LOAD_LIMIT_TWO=4440 |
| 34 | }}} |
| 35 | |
| 36 | These variables are used by the {{{/var/xdrago/second.sh}}} script, which is run every minute via cron and contains the following variables: |
| 37 | |
| 38 | {{{ |
| 39 | ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg` |
| 40 | FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg` |
| 41 | CTL_ONEX_SPIDER_LOAD=388 |
| 42 | CTL_FIVX_SPIDER_LOAD=388 |
| 43 | CTL_ONEX_LOAD=7220 |
| 44 | CTL_FIVX_LOAD=4440 |
| 45 | CTL_ONEX_LOAD_CRIT=1888 |
| 46 | CTL_FIVX_LOAD_CRIT=1555 |
| 47 | }}} |
| 48 | |
| 49 | These values translate to the following loads for comparison to the Munin graphs: |
| 50 | |
| 51 | * ONEX_LOAD: load average over the last minute times 100 |
| 52 | * FIVX_LOAD: load average over the last 5 minutes times 100 |
| 53 | * CTL_ONEX_SPIDER_LOAD: 3.88 |
| 54 | * CTL_FIVX_SPIDER_LOAD: 3.88 |
| 55 | * CTL_ONEX_LOAD: 72.20 |
| 56 | * CTL_FIVX_LOAD: 44.40 |
| 57 | * CTL_ONEX_LOAD_CRIT: 18.88 |
| 58 | * CTL_FIVX_LOAD_CRIT: 15.55 |
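The conversion between the scaled integers in the script and the load averages shown in Munin is a division by 100. A minimal sketch, assuming a POSIX shell with {{{awk}}} available; the helper name {{{to_load}}} is hypothetical and is not part of {{{second.sh}}}:

```shell
# Hypothetical helper: convert a scaled threshold (load average x 100)
# back to a load average with two decimal places.
to_load() {
  awk -v v="$1" 'BEGIN { printf "%.2f\n", v / 100 }'
}

to_load 7220   # CTL_ONEX_LOAD as a load average
to_load 388    # CTL_ONEX_SPIDER_LOAD as a load average
```

The reverse direction is what the script itself does with {{{awk '{print $1*100}' /proc/loadavg}}}.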
| 59 | |
| 60 | And the logic, translated into English, is: |
| 61 | |
| 62 | 1. If the load average over the last minute is greater than 3.88 and less than 72.20 and the nginx high load config isn't in use then start to use it. |
| 63 | 2. Else if the load average over the last 5 mins is greater than 3.88 and less than 44.40 and the nginx high load config isn't in use then start to use it. |
| 64 | 3. Else if the load average over the last minute is less than 3.88 and the load average over the last 5 mins is less than 3.88 and the nginx high load config is in use then stop using it. |
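The three rules above can be sketched as a small shell function. This is an illustration only, not the code in {{{second.sh}}}: the function name {{{spider_action}}}, its arguments and the words it prints are all assumptions, and the thresholds are the scaled values listed earlier (388, 7220, 4440).

```shell
# Hypothetical sketch of the nginx high load config decision.
# $1: 1 minute load x 100, $2: 5 minute load x 100,
# $3: "yes" if the high load config is currently in use, else "no".
spider_action() {
  one=$1
  five=$2
  in_use=$3
  if [ "$one" -gt 388 ] && [ "$one" -lt 7220 ] && [ "$in_use" = "no" ]; then
    echo enable      # rule 1: start using the high load config
  elif [ "$five" -gt 388 ] && [ "$five" -lt 4440 ] && [ "$in_use" = "no" ]; then
    echo enable      # rule 2: same action, triggered by the 5 minute load
  elif [ "$one" -lt 388 ] && [ "$five" -lt 388 ] && [ "$in_use" = "yes" ]; then
    echo disable     # rule 3: load has dropped, stop using it
  else
    echo none        # no change
  fi
}

spider_action 500 100 no
```

The function only prints a decision, so it can be run safely without touching nginx.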
| 65 | |
| 66 | 1. If the load average over the last minute is greater than 18.88 then, if {{{/var/run/boa_run.pid}}} exists, wait a second; otherwise kill some jobs: {{{killall -9 php drush.php wget}}} |
| 67 | 2. Else if the load average over the last 5 mins is greater than 15.55 then, if {{{/var/run/boa_run.pid}}} exists, wait a second; otherwise kill some jobs: {{{killall -9 php drush.php wget}}} |
| 68 | |
| 69 | 1. If the load average over the last minute is greater than 72.20 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}} |
| 70 | 2. Else if the load average over the last 5 mins is greater than 44.40 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}} |
| 71 | 3. Else restart all the services via {{{/var/xdrago/proc_num_ctrl.cgi}}} |
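The last two lists can be sketched in the same way. Again this is a hedged illustration rather than the contents of {{{second.sh}}}: the function names {{{crit_action}}} and {{{load_action}}} and the words they print are hypothetical, and the functions only report which branch would be taken instead of running {{{killall}}}.

```shell
# Hypothetical sketch of the critical load rules (18.88 / 15.55).
# $1: 1 minute load x 100, $2: 5 minute load x 100,
# $3: "yes" if /var/run/boa_run.pid exists, else "no".
crit_action() {
  if [ "$1" -gt 1888 ] || [ "$2" -gt 1555 ]; then
    if [ "$3" = "yes" ]; then
      echo wait        # a BOA run is in progress: wait a second
    else
      echo kill-jobs   # killall -9 php drush.php wget
    fi
  else
    echo none
  fi
}

# Hypothetical sketch of the shutdown rules (72.20 / 44.40).
# $1: 1 minute load x 100, $2: 5 minute load x 100.
load_action() {
  if [ "$1" -gt 7220 ] || [ "$2" -gt 4440 ]; then
    echo kill-web-server    # killall -9 nginx; killall -9 php-fpm php-cgi
  else
    echo restart-services   # /var/xdrago/proc_num_ctrl.cgi
  fi
}

crit_action 2000 100 no
load_action 100 100
```

Feeding in values read from the Munin graph (multiplied by 100) shows which branch a given spike would have triggered.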
| 72 | |
| 73 | |
| 74 | |