| | 18 | |
| | 19 | == Load Spikes == |
| | 20 | |
| | 21 | The server has been suffering from load spikes which make the site unresponsive for clients. The current status is shown on the [https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/load.html puffin Munin load graph]. |
| | 22 | |
| | 23 | When the load hits 3.88, robots are served 403 Forbidden responses, and when the load hits 72.2, the web server is shut down until the 5 min load average falls below 44.4. |
| | 24 | |
| | 25 | The [ticket:563#second.sh default thresholds] have been changed because they were causing [trac:ticket/555 the site to shut down for 15 min at a time] far too often. |
| | 26 | |
| | 27 | The current thresholds are generated from these variables in {{{/root/.barracuda.cnf}}}, shown here with the default values commented out: |
| | 28 | |
| | 29 | {{{ |
| | 30 | #_LOAD_LIMIT_ONE=1444 |
| | 31 | #_LOAD_LIMIT_TWO=888 |
| | 32 | _LOAD_LIMIT_ONE=7220 |
| | 33 | _LOAD_LIMIT_TWO=4440 |
| | 34 | }}} |
| | 35 | |
| | 36 | These variables are used by the {{{/var/xdrago/second.sh}}} script, which is run every minute via cron and contains the following variables: |
| | 37 | |
| | 38 | {{{ |
| | 39 | ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg` |
| | 40 | FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg` |
| | 41 | CTL_ONEX_SPIDER_LOAD=388 |
| | 42 | CTL_FIVX_SPIDER_LOAD=388 |
| | 43 | CTL_ONEX_LOAD=7220 |
| | 44 | CTL_FIVX_LOAD=4440 |
| | 45 | CTL_ONEX_LOAD_CRIT=1888 |
| | 46 | CTL_FIVX_LOAD_CRIT=1555 |
| | 47 | }}} |
| | 48 | |
| | 49 | These values translate to the following loads for comparison to the Munin graphs: |
| | 50 | |
| | 51 | * ONEX_LOAD: load average over the last minute times 100 |
| | 52 | * FIVX_LOAD: load average over the last 5 minutes times 100 |
| | 53 | * CTL_ONEX_SPIDER_LOAD: 3.88 |
| | 54 | * CTL_FIVX_SPIDER_LOAD: 3.88 |
| | 55 | * CTL_ONEX_LOAD: 72.20 |
| | 56 | * CTL_FIVX_LOAD: 44.40 |
| | 57 | * CTL_ONEX_LOAD_CRIT: 18.88 |
| | 58 | * CTL_FIVX_LOAD_CRIT: 15.55 |
| | 59 | |
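
The conversion between the scaled values in the script and the load averages shown on the Munin graphs is a division by 100. A minimal sketch of that arithmetic follows; {{{scaled_to_load}}} is a hypothetical helper name used only for illustration and is not part of {{{second.sh}}}:

```shell
# Hypothetical helper (not part of second.sh): convert a threshold stored
# as "load average x 100" back to a load average with two decimals.
scaled_to_load() {
  awk -v v="$1" 'BEGIN { printf "%.2f\n", v / 100 }'
}

scaled_to_load 388    # CTL_ONEX_SPIDER_LOAD -> 3.88
scaled_to_load 7220   # CTL_ONEX_LOAD        -> 72.20
scaled_to_load 1888   # CTL_ONEX_LOAD_CRIT   -> 18.88
```

Storing the values multiplied by 100 lets the script compare loads with integer tests, since shell arithmetic has no native floating point.
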
| | 60 | And the logic, translated into English, is: |
| | 61 | |
| | 62 | 1. If the load average over the last minute is greater than 3.88 and less than 72.20 and the nginx high load config isn't in use then start to use it. |
| | 63 | 2. Else if the load average over the last 5 mins is greater than 3.88 and less than 44.40 and the nginx high load config isn't in use then start to use it. |
| | 64 | 3. Else if the load average over the last minute is less than 3.88 and the load average over the last 5 mins is less than 3.88 and the nginx high load config is in use then stop using it. |
| | 65 | |
| | 66 | 1. If the load average over the last minute is greater than 18.88 then, if {{{/var/run/boa_run.pid}}} exists, wait a second; otherwise kill some jobs: {{{killall -9 php drush.php wget}}} |
| | 67 | 2. Else if the load average over the last 5 mins is greater than 15.55 then, if {{{/var/run/boa_run.pid}}} exists, wait a second; otherwise kill some jobs: {{{killall -9 php drush.php wget}}} |
| | 68 | |
| | 69 | 1. If the load average over the last minute is greater than 72.20 then kill the web server: {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}} |
| | 70 | 2. Else if the load average over the last 5 mins is greater than 44.40 then kill the web server: {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}} |
| | 71 | 3. Else restart all the services via {{{/var/xdrago/proc_num_ctrl.cgi}}} |
| | 72 | |
| | 73 | |
| | 74 | |
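
The three decision chains above can be sketched as shell functions. This is an illustration based only on the English description on this page, not a copy of {{{second.sh}}}: the function names and the action strings they print are hypothetical, and the functions only report which action would be taken rather than running {{{killall}}} or touching nginx.

```shell
# Thresholds as listed above (load average x 100).
CTL_ONEX_SPIDER_LOAD=388
CTL_FIVX_SPIDER_LOAD=388
CTL_ONEX_LOAD=7220
CTL_FIVX_LOAD=4440
CTL_ONEX_LOAD_CRIT=1888
CTL_FIVX_LOAD_CRIT=1555

# Chain 1 (hypothetical name): nginx high load config.
# $1 = 1 min load x 100, $2 = 5 min load x 100, $3 = "yes" if config in use.
spider_action() {
  if [ "$1" -gt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$1" -lt "$CTL_ONEX_LOAD" ] \
      && [ "$3" != "yes" ]; then
    echo "enable-high-load-config"
  elif [ "$2" -gt "$CTL_FIVX_SPIDER_LOAD" ] && [ "$2" -lt "$CTL_FIVX_LOAD" ] \
      && [ "$3" != "yes" ]; then
    echo "enable-high-load-config"
  elif [ "$1" -lt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$2" -lt "$CTL_FIVX_SPIDER_LOAD" ] \
      && [ "$3" = "yes" ]; then
    echo "disable-high-load-config"
  else
    echo "none"
  fi
}

# Chain 2 (hypothetical name): critical load. Both branches in the
# description take the same action, so they are combined with "or" here.
# $3 = "yes" if /var/run/boa_run.pid exists.
crit_action() {
  if [ "$1" -gt "$CTL_ONEX_LOAD_CRIT" ] || [ "$2" -gt "$CTL_FIVX_LOAD_CRIT" ]; then
    if [ "$3" = "yes" ]; then echo "wait"; else echo "kill-jobs"; fi
  else
    echo "none"
  fi
}

# Chain 3 (hypothetical name): kill the web server or restart services.
web_action() {
  if [ "$1" -gt "$CTL_ONEX_LOAD" ] || [ "$2" -gt "$CTL_FIVX_LOAD" ]; then
    echo "kill-web-server"
  else
    echo "restart-services"
  fi
}

# Example: 1 min load 4.50, 5 min load 3.00, high load config not in use.
spider_action 450 300 no   # prints enable-high-load-config
crit_action 450 300 no     # prints none
web_action 450 300         # prints restart-services
```

Keeping the decisions separate from the actions makes it possible to check how a change to the thresholds would behave against loads read off the Munin graph before editing {{{/root/.barracuda.cnf}}}.
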