Ticket #555 (closed maintenance: fixed)

Opened 3 years ago

Last modified 3 years ago

Load spikes causing the TN site to be stopped for 15 min at a time

Reported by:	chris	Owned by:	chris
Priority:	major	Milestone:	Maintenance
Component:	Live server	Keywords:
Cc:	ed, jim, aland	Estimated Number of Hours:	0.25
Add Hours to Ticket:	0	Billable?:	yes
Total Hours:	50.18

Description (last modified by chris) (diff)

The BOA /var/xdrago/second.sh script is run every minute via the root crontab. If it detects a certain load level it switches nginx to a "high load" config, which results in bots being served 503 errors when they spider the site, see ticket:563. When the load rises further and hits a second threshold, second.sh kills the webserver applications, nginx and php-fpm, and waits until the load has dropped before starting them up again. This was happening once or twice a day following the increase in traffic around the launch of The Power of Just Doing Stuff. It has been addressed by multiplying the thresholds in second.sh by 5.
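
The two-stage behaviour described above can be sketched roughly as follows. This is not the BOA second.sh code: the function name load_action and all threshold values are hypothetical and only illustrate the logic of comparing the load against a "high load" threshold and a higher "stop" threshold.

```shell
#!/bin/sh
# Hypothetical sketch of a two-threshold load check; NOT the actual BOA second.sh.
# load_action LOAD HIGH CRITICAL
#   prints "stop" if LOAD >= CRITICAL, "high-load" if LOAD >= HIGH, else "normal".
load_action() {
    # awk is used because POSIX shell arithmetic cannot compare decimals.
    awk -v load="$1" -v high="$2" -v crit="$3" 'BEGIN {
        if (load + 0 >= crit + 0)      print "stop"
        else if (load + 0 >= high + 0) print "high-load"
        else                           print "normal"
    }'
}

# Example (Linux only): check the current 1 minute load average.
# load_action "$(cut -d ' ' -f 1 /proc/loadavg)" 20 40
```

With made-up original thresholds of 5 and 10, multiplying by 5 would correspond to calling load_action with 25 and 50, so a brief spike to 23 would no longer trigger either action.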

Original Description

This morning at 10:19:24 I received the following alert from puffin:

Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 6.59 

Time:                    Wed May 29 10:17:02 2013 +0100
1 Min Load Avg:          23.39
5 Min Load Avg:          6.59
15 Min Load Avg:         2.57
Running/Total Processes: 44/326

At 10:21:57 I got an alert regarding ssh:

Service: SSH 
Host: puffin
Address: puffin.webarch.net
State: CRITICAL

Date/Time: Wed May 29 10:21:57 BST 2013

Additional Info:

CRITICAL - Socket timeout after 10 seconds

Then at 10:26:47 ssh appeared to have recovered:

Service: SSH
Host: puffin
Address: puffin.webarch.net
State: OK

Date/Time: Wed May 29 10:26:47 BST 2013

Additional Info:

SSH OK - OpenSSH_5.5p1 Debian-6+squeeze3 (protocol 2.0)

But then pingdom reported at 10:29:07:

www.transitionnetwork.org is down since 29/05/2013  10:24:57.

There was then a report regarding Nginx at 10:32:07:

Notification Type: PROBLEM

Service: HTTP
Host: puffin
Address: puffin.webarch.net
State: CRITICAL

Date/Time: Wed May 29 10:32:07 BST 2013

Additional Info:

Connection refused

So at 10:33:47 I ssh'd in and found that php53-fpm and nginx were not running and it took several attempts to get them running again.

The up email from pingdom reported:

www.transitionnetwork.org is UP again at 29/05/2013  10:36:57, after 12m of downtime.

I can't find anything in the logs to indicate what caused the load spike or why php-fpm and nginx stopped running.
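
As a sanity check on the timeline, the downtime reported by pingdom can be recomputed from the down and up times. This is a minimal sketch that assumes both times fall on the same day; the helper names to_seconds and downtime_minutes are made up for illustration.

```shell
#!/bin/sh
# to_seconds HH:MM:SS -> seconds since midnight.
# awk is used so that values such as "08" are treated as decimal, not octal.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# downtime_minutes DOWN UP -> whole minutes between two times on the same day.
downtime_minutes() {
    down=$(to_seconds "$1")
    up=$(to_seconds "$2")
    echo $(( (up - down) / 60 ))
}
```

For the outage above, downtime_minutes 10:24:57 10:36:57 gives 12, which matches the pingdom report.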

Attachments

puffin-load-week-2013-03.png (31.3 KB) - added by chris 3 years ago.: Puffin load from March 2013
puffin-multips_memory-month-2013-05-31.png (31.1 KB) - added by chris 3 years ago.: Puffin memory usage by selected application
puffin-cpu-day-2013-06-22.png (26.1 KB) - added by chris 3 years ago.: Puffin CPU Spikes 2013-06-22
puffin-load-day-2013-06-22.png (21.4 KB) - added by chris 3 years ago.: Puffin Load Spikes 2013-06-22
puffin_2013-07-19_mysql_connections-month.png (30.4 KB) - added by chris 3 years ago.: Puffin MySQL Connections by Month for 2013-07-19
puffin_2013-07-19_mysql_qcache_mem-day.png (21.1 KB) - added by chris 3 years ago.: Puffin MySQL Query Cache by Day 2013-07-19
puffin_2013-07-19_phpfpm_status-day.png (25.4 KB) - added by chris 3 years ago.: Puffin 2013-07-19 PHP-FPM Status
puffin_2013-07-19_2_phpfpm_status-day.png (26.3 KB) - added by chris 3 years ago.: Puffin PHP-FPM 2013-07-19
puffin_2013-07-10_multips_memory-day.png (26.6 KB) - added by chris 3 years ago.: Puffin 2013-07-19 Memory Usage
puffin_2013-07-19_fw_packets-day.png (22.8 KB) - added by chris 3 years ago.: Puffin 2013-07-19 Firewall Packets
puffin_2013-07-19-2_mysql_queries-day.png (39.6 KB) - added by chris 3 years ago.: Puffin 2013-07-19 Mysql Query Cache
puffin_daily_usage_201307.png (3.4 KB) - added by chris 3 years ago.: Puffin Webalizer 2013-07-19
puffin-2013-07-26-load-day.png (18.5 KB) - added by chris 3 years ago.: Puffin Load 2013-07-26
mem-9sept2013-after-extra-drupal-caching.png (47.5 KB) - added by jim 3 years ago.: 9 sept 2013 - mem usage after more Drupal caching enabled
puffin_load-week_2013-10-03.png (24.1 KB) - added by chris 3 years ago.
puffin_load-month_2013-10-03.png (23.3 KB) - added by chris 3 years ago.
puffin-2013-10-09-load-day.png (31.8 KB) - added by chris 3 years ago.
puffin-2013-10-09-load-week.png (28.2 KB) - added by chris 3 years ago.
puffin-2013-10-15-load-day.png (17.4 KB) - added by chris 3 years ago.
phpfpm_status-day.png (19.7 KB) - added by jim 3 years ago.
piwik-visitors-and-hits-month-2013-10-20.png (20.0 KB) - added by chris 3 years ago.
piwik-visitors-and-hits-year-2013-10-20.png (54.6 KB) - added by chris 3 years ago.
puffin-2013-10-15-goaccess.png (20.9 KB) - added by chris 3 years ago.
puffin-2013-10-15-goaccess-41.189.xxx.xxx.png (16.4 KB) - added by chris 3 years ago.
30.png (22.4 KB) - added by chris 3 years ago.
puffin-2013-10-20-load-day.png (20.5 KB) - added by chris 3 years ago.
puffin-2013-10-20-cpu-day.png (24.1 KB) - added by chris 3 years ago.
puffin-2013-10-20-phpfpm_status-day.png (18.3 KB) - added by chris 3 years ago.
puffin-2013-10-20-phpfpm_connections-day.png (24.5 KB) - added by chris 3 years ago.
puffin-2013-10-20-phpfpm_connections-day.2.png (24.5 KB) - added by chris 3 years ago.
puffin-2013-10-20-nginx_request-day.png (35.5 KB) - added by chris 3 years ago.
puffin-2013-10-20-fw_packets-day.png (26.1 KB) - added by chris 3 years ago.
puffin-2013-10-20-if_eth0-day.png (35.2 KB) - added by chris 3 years ago.
puffin-2013-10-20-fw_conntrack-day.png (49.4 KB) - added by chris 3 years ago.
puffin-2013-10-20-mysql_qcache-day.png (45.4 KB) - added by chris 3 years ago.
puffin-2013-10-20-mysql_queries-day.png (42.0 KB) - added by chris 3 years ago.
puffin-2013-10-20_2-load-day.png (21.4 KB) - added by chris 3 years ago.
puffin-2013-10-20_2-phpfpm_connections-day.png (24.8 KB) - added by chris 3 years ago.
puffin-2013-10-20_2-fw_conntrack-day.png (50.0 KB) - added by chris 3 years ago.
puffin-2013-10-20_2-fw_conntrack-day.2.png (50.0 KB) - added by chris 3 years ago.
puffin-2013-10-20_2-mysql_qcache-day.png (46.8 KB) - added by chris 3 years ago.
puffin-2013-10-20_2-mysql_qcache-day.2.png (46.8 KB) - added by chris 3 years ago.
puffin-2013-10-20_2-mysql_queries-day.png (42.7 KB) - added by chris 3 years ago.
puffin-2013-10-20_3_load-day.png (22.9 KB) - added by chris 3 years ago.
tn-piwik-2013-11-03.png (44.3 KB) - added by chris 3 years ago.
puffin_2013-12-15_load-pinpoint_1363530313_1387203913.png (63.7 KB) - added by chris 3 years ago.: Puffin Load Spikes 2013

Change History

comment:1 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 0.0 to 0.25

I have taken another look through all the log files I can find to try to see the cause of this problem.

Recent load spikes from the lfd log:

zgrep \*LOAD\* /var/log/lfd.log.1.gz 
May 20 08:01:11 puffin lfd[54576]: *LOAD* 5 minute load average is 16.78, threshold is 6 - email sent
May 21 01:00:51 puffin lfd[41103]: *LOAD* 5 minute load average is 6.89, threshold is 6 - email sent
May 22 03:00:55 puffin lfd[53028]: *LOAD* 5 minute load average is 12.08, threshold is 6 - email sent
May 23 06:26:15 puffin lfd[63570]: *LOAD* 5 minute load average is 12.14, threshold is 6 - email sent
May 23 12:01:00 puffin lfd[17263]: *LOAD* 5 minute load average is 6.51, threshold is 6 - email sent
May 23 17:01:06 puffin lfd[36796]: *LOAD* 5 minute load average is 7.20, threshold is 6 - email sent
May 24 00:31:05 puffin lfd[65325]: *LOAD* 5 minute load average is 9.52, threshold is 6 - email sent
May 24 04:00:37 puffin lfd[36216]: *LOAD* 5 minute load average is 8.13, threshold is 6 - email sent
May 24 06:25:57 puffin lfd[24168]: *LOAD* 5 minute load average is 6.18, threshold is 6 - email sent
May 24 14:30:52 puffin lfd[57785]: *LOAD* 5 minute load average is 6.16, threshold is 6 - email sent
May 24 23:06:12 puffin lfd[32645]: *LOAD* 5 minute load average is 7.13, threshold is 6 - email sent
May 25 00:30:38 puffin lfd[42590]: *LOAD* 5 minute load average is 8.23, threshold is 6 - email sent
May 25 01:41:28 puffin lfd[42106]: *LOAD* 5 minute load average is 7.03, threshold is 6 - email sent
May 25 06:31:15 puffin lfd[56919]: *LOAD* 5 minute load average is 6.70, threshold is 6 - email sent
May 25 11:01:07 puffin lfd[45083]: *LOAD* 5 minute load average is 7.16, threshold is 6 - email sent
May 25 13:06:43 puffin lfd[37685]: *LOAD* 5 minute load average is 10.29, threshold is 6 - email sent
May 25 14:06:57 puffin lfd[26794]: *LOAD* 5 minute load average is 10.04, threshold is 6 - email sent
May 25 15:25:49 puffin lfd[44323]: *LOAD* 5 minute load average is 8.82, threshold is 6 - email sent
May 25 22:01:46 puffin lfd[29675]: *LOAD* 5 minute load average is 9.22, threshold is 6 - email sent
May 25 23:13:43 puffin lfd[29903]: *LOAD* 5 minute load average is 7.43, threshold is 6 - email sent
May 26 06:25:57 puffin lfd[62219]: *LOAD* 5 minute load average is 8.12, threshold is 6 - email sent

grep \*LOAD\* /var/log/lfd.log
May 27 01:01:17 puffin lfd[15660]: *LOAD* 5 minute load average is 6.43, threshold is 6 - email sent
May 27 06:26:06 puffin lfd[13991]: *LOAD* 5 minute load average is 7.20, threshold is 6 - email sent
May 27 08:55:33 puffin lfd[33207]: *LOAD* 5 minute load average is 6.25, threshold is 6 - email sent
May 27 11:55:39 puffin lfd[35278]: *LOAD* 5 minute load average is 6.67, threshold is 6 - email sent
May 27 14:01:14 puffin lfd[29845]: *LOAD* 5 minute load average is 9.81, threshold is 6 - email sent
May 27 16:01:08 puffin lfd[29159]: *LOAD* 5 minute load average is 6.47, threshold is 6 - email sent
May 28 00:31:49 puffin lfd[10358]: *LOAD* 5 minute load average is 10.78, threshold is 6 - email sent
May 28 01:32:04 puffin lfd[893]: *LOAD* 5 minute load average is 6.83, threshold is 6 - email sent
May 28 12:00:50 puffin lfd[13534]: *LOAD* 5 minute load average is 6.58, threshold is 6 - email sent
May 28 21:07:31 puffin lfd[52724]: *LOAD* 5 minute load average is 6.01, threshold is 6 - email sent
May 29 10:16:09 puffin lfd[46640]: *LOAD* 5 minute load average is 6.59, threshold is 6 - email sent

It's worth noting that these are 5 minute load averages -- the highest load during these spikes will have been higher still, like this morning when the 1 Min Load Avg was 23 when the 5 minute load average email was sent, see above.

I can't see a pattern here, but it's an issue we need to keep an eye on as there are several of these spikes a day. I still don't know what the cause was or why nginx and php53-fpm stopped running.
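
One way to look for a pattern is to tally the lfd alerts by hour of day. This is a minimal sketch that assumes the log lines have the format shown above, with the time as the third whitespace-separated field; the function name spikes_by_hour is made up for illustration.

```shell
#!/bin/sh
# spikes_by_hour: read lfd log lines on stdin and print "HH count",
# one line per hour of day in which a *LOAD* alert was logged.
spikes_by_hour() {
    awk '/\*LOAD\*/ {
        split($3, t, ":")      # $3 is the HH:MM:SS part of the timestamp
        count[t[1]]++
    }
    END { for (h in count) print h, count[h] }' | sort
}
```

It could be fed with both the rotated and the current log, for example: { zcat /var/log/lfd.log.1.gz; cat /var/log/lfd.log; } | spikes_by_hour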

comment:2 follow-up: ↓ 6 Changed 3 years ago by jim

Ran the same on my VPS and got:

May 21 04:20:47 babylon lfd[8020]: *LOAD* 5 minute load average is 11.32, threshold is 6 - email sent
May 22 15:46:04 babylon lfd[18283]: *LOAD* 5 minute load average is 6.13, threshold is 6 - email sent
May 23 04:18:57 babylon lfd[1874]: *LOAD* 5 minute load average is 14.04, threshold is 6 - email sent
May 25 04:20:28 babylon lfd[17059]: *LOAD* 5 minute load average is 22.00, threshold is 6 - email sent
May 27 04:19:07 babylon lfd[1056]: *LOAD* 5 minute load average is 8.93, threshold is 6 - email sent
May 28 04:20:48 babylon lfd[1050]: *LOAD* 5 minute load average is 8.50, threshold is 6 - email sent
May 29 04:20:20 babylon lfd[29153]: *LOAD* 5 minute load average is 6.95, threshold is 6 - email sent

Gmail had been auto-archiving my alerts so I missed these, but they are in my mailbox. Tweaked my filter to promote these now.

On Babylon (my system) it started on the 21st -- that does coincide with a system update for me (I left these a few weeks). What follows is a condensed set of entries from late on the 20th when I did a barracuda up-stable system, minus the stuff I've got that's extra like NewRelic and Webmin:

grep "status installed" /var/log/dpkg.log
...
2013-05-20 22:23:02 status installed man-db 2.5.7-8
2013-05-20 22:23:02 status installed php5-common 5.3.25-1~dotdeb.0
2013-05-20 22:23:04 status installed php5-cli 5.3.25-1~dotdeb.0
2013-05-20 22:23:06 status installed php5-fpm 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-mysql 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-imap 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-ldap 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-geoip 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-xsl 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-mcrypt 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-curl 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-xmlrpc 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-sqlite 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-gd 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-apc 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-imagick 5.3.25-1~dotdeb.0
2013-05-20 22:23:07 status installed php5-gmp 5.3.25-1~dotdeb.0
2013-05-20 22:23:22 status installed linux-libc-dev 2.6.32-48squeeze3
2013-05-20 22:23:22 status installed php-pear 5.3.25-1~dotdeb.0
2013-05-20 22:23:22 status installed php5-dev 5.3.25-1~dotdeb.0
2013-05-20 22:23:23 status installed libxenstore3.0 4.0.1-5.11
...

I'd bet my ass that one of the above is doing this.

Chris, when did the load spikes start for Puffin? And do you have any matching entries around that time?

My money is on an issue in PHP 5.3.25-1.

comment:3 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.5
Total Hours changed from 0.25 to 0.75

comment:4 Changed 3 years ago by jim

The earlier version was php5-common 5.3.24-1~dotdeb.0, and that gave me no issues... Looking on https://bugs.php.net now.

comment:5 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 0.75 to 1.0

These are all the ones I have in my inbox:

Apr 22 lfd on puffin.webarch.net: High 5 minute load average alert - 6.01
Apr 22 lfd on puffin.webarch.net: High 5 minute load average alert - 6.02
Apr 23 lfd on puffin.webarch.net: High 5 minute load average alert - 7.22
Apr 24 lfd on puffin.webarch.net: High 5 minute load average alert - 8.59
Apr 25 lfd on puffin.webarch.net: High 5 minute load average alert - 6.35
Apr 26 lfd on puffin.webarch.net: High 5 minute load average alert - 9.26
Apr 28 lfd on puffin.webarch.net: High 5 minute load average alert - 7.39
Apr 28 lfd on puffin.webarch.net: High 5 minute load average alert - 16.03
Apr 28 lfd on puffin.webarch.net: High 5 minute load average alert - 6.97
Apr 29 lfd on puffin.webarch.net: High 5 minute load average alert - 7.67
Apr 29 lfd on puffin.webarch.net: High 5 minute load average alert - 64.21
Apr 30 lfd on puffin.webarch.net: High 5 minute load average alert - 7.49
May 01 lfd on puffin.webarch.net: High 5 minute load average alert - 6.69
May 03 lfd on puffin.webarch.net: High 5 minute load average alert - 6.09
May 03 lfd on puffin.webarch.net: High 5 minute load average alert - 7.62
May 04 lfd on puffin.webarch.net: High 5 minute load average alert - 6.04
May 05 lfd on puffin.webarch.net: High 5 minute load average alert - 6.04
May 06 lfd on puffin.webarch.net: High 5 minute load average alert - 6.67
May 07 lfd on puffin.webarch.net: High 5 minute load average alert - 6.75
May 07 lfd on puffin.webarch.net: High 5 minute load average alert - 7.40
May 08 lfd on puffin.webarch.net: High 5 minute load average alert - 7.21
May 10 lfd on puffin.webarch.net: High 5 minute load average alert - 9.86
May 10 lfd on puffin.webarch.net: High 5 minute load average alert - 6.02
May 11 lfd on puffin.webarch.net: High 5 minute load average alert - 12.52
May 11 lfd on puffin.webarch.net: High 5 minute load average alert - 7.30
May 11 lfd on puffin.webarch.net: High 5 minute load average alert - 6.60
May 11 lfd on puffin.webarch.net: High 5 minute load average alert - 9.22
May 12 lfd on puffin.webarch.net: High 5 minute load average alert - 10.70
May 12 lfd on puffin.webarch.net: High 5 minute load average alert - 7.26
May 13 lfd on puffin.webarch.net: High 5 minute load average alert - 6.54
May 13 lfd on puffin.webarch.net: High 5 minute load average alert - 10.77
May 14 lfd on puffin.webarch.net: High 5 minute load average alert - 8.79
May 14 lfd on puffin.webarch.net: High 5 minute load average alert - 7.96
May 14 lfd on puffin.webarch.net: High 5 minute load average alert - 9.26
May 16 lfd on puffin.webarch.net: High 5 minute load average alert - 10.61
May 17 lfd on puffin.webarch.net: High 5 minute load average alert - 6.02
May 17 lfd on puffin.webarch.net: High 5 minute load average alert - 6.16
May 18 lfd on puffin.webarch.net: High 5 minute load average alert - 7.40
May 20 lfd on puffin.webarch.net: High 5 minute load average alert - 16.78
May 20 lfd on puffin.webarch.net: High 5 minute load average alert - 16.78
May 20 lfd on puffin.webarch.net: High 5 minute load average alert - 16.78
May 20 lfd on puffin.webarch.net: High 5 minute load average alert - 16.78
May 21 lfd on puffin.webarch.net: High 5 minute load average alert - 6.89
May 22 lfd on puffin.webarch.net: High 5 minute load average alert - 12.08
May 23 lfd on puffin.webarch.net: High 5 minute load average alert - 12.14
May 23 lfd on puffin.webarch.net: High 5 minute load average alert - 6.51
May 23 lfd on puffin.webarch.net: High 5 minute load average alert - 7.20
May 24 lfd on puffin.webarch.net: High 5 minute load average alert - 9.52
May 24 lfd on puffin.webarch.net: High 5 minute load average alert - 8.13
May 24 lfd on puffin.webarch.net: High 5 minute load average alert - 6.18
May 24 lfd on puffin.webarch.net: High 5 minute load average alert - 6.16
May 24 lfd on puffin.webarch.net: High 5 minute load average alert - 7.13
May 25 lfd on puffin.webarch.net: High 5 minute load average alert - 8.23
May 25 lfd on puffin.webarch.net: High 5 minute load average alert - 7.03
May 25 lfd on puffin.webarch.net: High 5 minute load average alert - 6.70
May 25 lfd on puffin.webarch.net: High 5 minute load average alert - 7.16
May 25 lfd on puffin.webarch.net: High 5 minute load average alert - 10.29
May 25 lfd on puffin.webarch.net: High 5 minute load average alert - 10.04
May 25 lfd on puffin.webarch.net: High 5 minute load average alert - 8.82
May 25 lfd on puffin.webarch.net: High 5 minute load average alert - 9.22
May 25 lfd on puffin.webarch.net: High 5 minute load average alert - 7.43
May 26 lfd on puffin.webarch.net: High 5 minute load average alert - 8.12
May 27 lfd on puffin.webarch.net: High 5 minute load average alert - 6.43
May 27 lfd on puffin.webarch.net: High 5 minute load average alert - 7.20
May 27 lfd on puffin.webarch.net: High 5 minute load average alert - 6.25
May 27 lfd on puffin.webarch.net: High 5 minute load average alert - 6.67
May 27 lfd on puffin.webarch.net: High 5 minute load average alert - 9.81
May 27 lfd on puffin.webarch.net: High 5 minute load average alert - 6.47
May 28 lfd on puffin.webarch.net: High 5 minute load average alert - 10.78
May 28 lfd on puffin.webarch.net: High 5 minute load average alert - 6.83
May 28 lfd on puffin.webarch.net: High 5 minute load average alert - 6.58
May 28 lfd on puffin.webarch.net: High 5 minute load average alert - 6.01
May 29 lfd on puffin.webarch.net: High 5 minute load average alert - 6.59

I'm not sure if there were any prior to this: there might have been and I deleted them, or there might not have been any. In any case it's been happening since late April at least.

There are 5 days' worth of Munin stats from March, attached, which show it wasn't an issue then (max load 1.88).

I have edited /etc/logrotate.d/lfd so we will keep a year's worth of lfd logs rather than a week's worth.

Changed 3 years ago by chris

Attachment puffin-load-week-2013-03.png added

Puffin load from March 2013

comment:6 in reply to: ↑ 2 ; follow-up: ↓ 8 Changed 3 years ago by chris

Replying to jim:

On Babylon (my system) it started on the 21st

How much further back than that do you have logs for?

comment:7 Changed 3 years ago by jim

Nothing on bugs.php.net, so it must be a related sub-package...

I've now scanned my other logs and nothing jumps out... And I note the high load for me usually happens around 4.20am except for one entry.

I now think that Puffin is not necessarily suffering the same issue as Babylon...

I note one thing though -- on the update for the 20th I did, the Barracuda email sent telling me it was successful said this near the bottom:

Barracuda [Mon May 20 22:27:27 BST 2013] ==> ALRT: Your OS kernel has been upgraded!
Barracuda [Mon May 20 22:27:27 BST 2013] ==> ALRT: You *must* reboot immediately to make it active and stay secure!

I did reboot after that...

Anyway, that's all the analysis of Babylon I can/will do for now... I'll do some Googling for similar symptoms around:

NginX 1.5.0
PHP 5.3.25 (and related)
MariaDB 5.5.31.

comment:8 in reply to: ↑ 6 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 1.0 to 1.25

Replying to chris:

How much further back than that do you have logs for?

I have all the alert emails since the dawn of (Babylon) time, but there's nothing before the 21st that's suspect. Got GZipped logs that go back too, but again nothing in them before (or after TBH) the 20th that looks interesting.

Looking at the Puffin email list, I'd chalk the occasional ~6 load down to a burst of traffic, but there's definitely a cluster around the 11th that's very suspect -- does that coincide with updates?

Last word on Babylon: in case you want to compare, my CGP is here cgp.aegir.i-jk.co.uk - and it's definitely not a kernel thing, as I'm on 3.9...

Got to go out now, will look at this again tonight.

comment:9 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 1.25 to 1.5

There was just another load spike, but this time nginx and php53-fpm didn't stop running, lfd email:

Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 34.49

Time:                    Wed May 29 15:43:38 2013 +0100
1 Min Load Avg:          80.66
5 Min Load Avg:          34.49
15 Min Load Avg:         13.19
Running/Total Processes: 90/311

Pingdom reported:

www.transitionnetwork.org is down since 29/05/2013  15:40:57.

Nagios alert (these are the ones that go direct to my phone):

Notification Type: PROBLEM

Service: HTTP
Host: puffin
Address: puffin.webarch.net
State: CRITICAL

Date/Time: Wed May 29 15:45:07 BST 2013

Additional Info:

Connection refused

Pingdom:

www.transitionnetwork.org is UP again at 29/05/2013  15:51:57, after 11m of downtime.

And nagios:

Notification Type: RECOVERY

Service: HTTP
Host: puffin
Address: puffin.webarch.net
State: OK

Date/Time: Wed May 29 15:55:07 BST 2013

Additional Info:

HTTP OK: HTTP/1.1 200 OK - 692 bytes in 0.005 second response time

comment:10 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 1.5 to 1.75

The site was down again for 5 mins last night. I have set Munin to send me an email if the load goes over 4, and at 30 May 2013 04:05:13 I was sent:

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average                                                                 
        CRITICALs: load is 10.97 (outside range [:4]).

Then from pingdom at 30 May 2013 04:06:02 +0100:

www.transitionnetwork.org is down since 30/05/2013  04:01:57.

And from pingdom when it came back up, 30 May 2013 04:07:04 +0100:

www.transitionnetwork.org is UP again at 30/05/2013  04:06:57, after 5m of downtime.

From munin, 30 May 2013 04:10:15 +0100:

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average                                                                 
        CRITICALs: load is 4.08 (outside range [:4]).

And munin again, 30 May 2013 04:15:12 +0100:

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        OKs: load is 1.81.

Again I can't see anything in the logs to indicate the cause of this.

comment:11 Changed 3 years ago by chris

Actually 5 mins before the above happened there was this email from lfd:

Date: Thu, 30 May 2013 04:01:42 +0100 (BST)
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 21.26

Time:                    Thu May 30 04:01:41 2013 +0100
1 Min Load Avg:          65.70
5 Min Load Avg:          21.26
15 Min Load Avg:         7.69
Running/Total Processes: 25/331

And this coincides with these errors in /var/log/syslog:

May 30 04:01:43 puffin mysqld: 130530  4:01:43 [Warning] Aborted connection 1011234 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
May 30 04:01:43 puffin mysqld: 130530  4:01:43 [Warning] Aborted connection 1011237 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
May 30 04:01:43 puffin mysqld: 130530  4:01:43 [Warning] Aborted connection 1011247 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
May 30 04:01:43 puffin mysqld: 130530  4:01:43 [Warning] Aborted connection 1011260 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
May 30 04:01:43 puffin mysqld: 130530  4:01:43 [Warning] Aborted connection 1011239 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
May 30 04:01:43 puffin mysqld: 130530  4:01:43 [Warning] Aborted connection 1011229 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
May 30 04:01:43 puffin mysqld: 130530  4:01:43 [Warning] Aborted connection 1011240 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
May 30 04:01:43 puffin mysqld: 130530  4:01:43 [Warning] Aborted connection 1011231 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
May 30 04:01:43 puffin mysqld: 130530  4:01:43 [Warning] Aborted connection 1011261 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
May 30 04:01:43 puffin mysqld: 130530  4:01:43 [Warning] Aborted connection 1011238 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
May 30 04:01:43 puffin mysqld: 130530  4:01:43 [Warning] Aborted connection 1011232 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

I don't know if the high load caused the mysql problem or if the mysql problem caused the high load.
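
To help judge which came first, the aborted connections can be counted per minute and compared with the lfd alert times. This is a minimal sketch that assumes syslog lines in the format shown above; the function name aborted_per_minute is made up for illustration.

```shell
#!/bin/sh
# aborted_per_minute: read syslog lines on stdin and print
# "Mon DD HH:MM count" for each minute with mysqld "Aborted connection" warnings.
aborted_per_minute() {
    awk '/mysqld:.*Aborted connection/ {
        key = $1 " " $2 " " substr($3, 1, 5)   # month, day, HH:MM
        count[key]++
    }
    END { for (k in count) print k, count[k] }' | sort
}
```

For example: aborted_per_minute < /var/log/syslog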

Changed 3 years ago by chris

Attachment puffin-multips_memory-month-2013-05-31.png added

Puffin memory usage by selected application

comment:12 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 1.75 to 2.0

The memory usage of MariaDB/MySQL is still going up and has now hit 2G, half the physical RAM (see the attached memory usage graph). I don't know if this is related to the load spikes and downtime.

I have installed some additional MySQL munin plugins to get some better stats:

cd /etc/munin/plugins
ln -s /usr/share/munin/plugins/mysql_ mysql_bin_relay_log
ln -s /usr/share/munin/plugins/mysql_ mysql_commands
ln -s /usr/share/munin/plugins/mysql_ mysql_connections
ln -s /usr/share/munin/plugins/mysql_ mysql_files_tables
ln -s /usr/share/munin/plugins/mysql_ mysql_innodb_bpool
ln -s /usr/share/munin/plugins/mysql_ mysql_innodb_bpool_act
ln -s /usr/share/munin/plugins/mysql_ mysql_innodb_insert_buf
ln -s /usr/share/munin/plugins/mysql_ mysql_innodb_io
ln -s /usr/share/munin/plugins/mysql_ mysql_innodb_io_pend
ln -s /usr/share/munin/plugins/mysql_ mysql_innodb_log
ln -s /usr/share/munin/plugins/mysql_ mysql_innodb_rows
ln -s /usr/share/munin/plugins/mysql_ mysql_innodb_semaphores
ln -s /usr/share/munin/plugins/mysql_ mysql_innodb_tnx
ln -s /usr/share/munin/plugins/mysql_ mysql_myisam_indexes
ln -s /usr/share/munin/plugins/mysql_ mysql_network_traffic
ln -s /usr/share/munin/plugins/mysql_ mysql_qcache
ln -s /usr/share/munin/plugins/mysql_ mysql_qcache_mem
ln -s /usr/share/munin/plugins/mysql_ mysql_select_types
ln -s /usr/share/munin/plugins/mysql_ mysql_slow
ln -s /usr/share/munin/plugins/mysql_ mysql_sorts
ln -s /usr/share/munin/plugins/mysql_ mysql_table_locks
ln -s /usr/share/munin/plugins/mysql_ mysql_tmp_tables
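
The same set of symlinks could also be created with a loop. This is a minimal sketch only: the helper name link_plugins is made up for illustration, and the example call lists just a few of the graph names used above.

```shell
#!/bin/sh
# link_plugins SRC DEST NAME...
#   symlink the wildcard plugin SRC as DEST/mysql_NAME for each NAME given.
link_plugins() {
    src=$1
    dest=$2
    shift 2
    for name in "$@"; do
        ln -s "$src" "$dest/mysql_$name"
    done
}

# Example (as root), with a subset of the names used above:
# link_plugins /usr/share/munin/plugins/mysql_ /etc/munin/plugins connections qcache slow
```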

comment:13 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 2.0 to 2.25

Those munin plugins didn't work due to this bug: http://munin-monitoring.org/ticket/1302

So I have installed this one: https://github.com/kjellm/munin-mysql

comment:14 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 2.25 to 2.5

These are the install steps for the munin plugin which is generating the graphs at https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/index.html#mysql:

cd /usr/local/src
wget https://github.com/kjellm/munin-mysql/archive/master.zip
unzip master.zip
cd munin-mysql-master

I then needed to add this to /etc/munin/plugin-conf.d/munin-node as I couldn't get it to work using the debian-sys-maint account:

[mysql]
env.mysqlconnection DBI:mysql:mysql;host=127.0.0.1;port=3306
env.mysqluser root
env.mysqlpassword XXX

comment:15 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 2.5 to 2.6

We could use a little more debugging too, so I've also set MySQL to log slow queries by uncommenting these lines in /etc/mysql/my.cnf @ line 57:

slow_query_log          = 1
long_query_time         = 5
slow_query_log_file     = /var/log/mysql/sql-slow-query.log

I also set long_query_time to 5 seconds from 10.

Chris, please restart MySQL at your leisure to enable this logging to /var/log/mysql/sql-slow-query.log. It'll be interesting to see if there's a pattern of table locks etc that cause this.
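
Once the log has some entries, a quick summary may help to spot long lock waits. This is a minimal sketch that assumes each entry has the usual "# Query_time: ... Lock_time: ..." header line of the MySQL/MariaDB slow query log; the function name slow_summary is made up for illustration.

```shell
#!/bin/sh
# slow_summary: read a slow query log on stdin and print the number of logged
# queries plus the longest Query_time and the longest Lock_time, in seconds.
slow_summary() {
    awk '/^# Query_time:/ {
        n++
        # Look the values up by label rather than by fixed field position.
        for (i = 1; i < NF; i++) {
            if ($i == "Query_time:" && $(i + 1) + 0 > maxq) maxq = $(i + 1) + 0
            if ($i == "Lock_time:"  && $(i + 1) + 0 > maxl) maxl = $(i + 1) + 0
        }
    }
    END { printf "queries=%d max_query_time=%.2f max_lock_time=%.2f\n", n, maxq, maxl }'
}
```

For example: slow_summary < /var/log/mysql/sql-slow-query.log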

comment:16 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.05
Total Hours changed from 2.6 to 2.65

FYI the only differences in my.cnf between Babylon and Puffin are that Puffin has higher values for innodb_buffer_pool_size and key_buffer_size, and Babylon has skip-name-resolve commented out, while Puffin does not.

comment:17 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 2.65 to 2.9

The old mysql munin plugins had stopped working after I fixed the new ones last night. I have now got them all working; this config in /etc/munin/plugin-conf.d/munin-node did the trick:

[mysql*]
user root
env.mysqlopts --defaults-file=/etc/mysql/debian.cnf
env.mysqluser debian-sys-maint
env.mysqlconnection DBI:mysql:mysql;mysql_read_default_file=/etc/mysql/debian.cnf

Graphs: https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/index.html#mysql

comment:18 Changed 3 years ago by chris

Seems like I spoke too soon regarding the old munin mysql plugins. They work on the command line:

sudo -i
cd /etc/munin/plugins
munin-run mysql_bytes 
 recv.value 33653453213
 sent.value 687336777447
munin-run mysql_queries 
 delete.value 464002
 insert.value 451873
 replace.value 0
 select.value 11268953
 update.value 902061
 cache_hits.value 210825250
munin-run mysql_slowqueries 
 queries.value 59
munin-run mysql_threads 
 threads.value 1

But they are not producing graphs and I don't understand why.

comment:19 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 2.9 to 3.15

Munin stats are working now. It looks like all I forgot to do this morning was restart munin-node. We now have stats here again:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/mysql_queries.html

I have also restarted MariaDB as requested by Jim on ticket:555#comment:15

comment:20 follow-up: ↓ 21 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 3.15 to 3.25

FYI /var/log/mysql/sql-slow-query.log is rotated every few hours too... So I've logged in and done this:

screen
tail -F /var/log/mysql/sql-slow-query.log > ~/jk_screen_sql_slow.log

So this way we won't miss anything... please ignore the screen session, I'll log in and kill it in a few days. (there might well be a more efficient way?)

comment:21 in reply to: ↑ 20 ; follow-up: ↓ 25 Changed 3 years ago by chris

Replying to jim:

there might well be a more efficient way?

The clobbering of logs by BOA is, IMHO, horrible and for us totally unnecessary.

There are tools for log rotation and compression in debian and I'd much rather use these.

BOA should at least give users an option to switch their log clobbering off via a configuration variable.

I'd be happy to find all the BOA scripts that do the log clobbering and comment out these parts and document which scripts need amending on each BOA upgrade and also to raise a ticket with BOA to ask them to allow users to switch off the clobbering.

comment:22 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.67
Total Hours changed from 3.25 to 3.92

These are all the logs in /var/log/ which have been clobbered:

mysql/sql-slow-query.log
nginx/speed_purge.log.1
php/error_log_53
php/error_log_52
php/php53-fpm-error.log
php/php53-fpm-error.log
php/php-fpm-slow.log
php/php-fpm-error.log
php/error_log_cli_52
php/php53-fpm-slow.log
php/error_log_cli_53
redis/redis-server.log

While looking at these I noticed some lines in php/php53-fpm-error.log which potentially need some action (data is still written to the log files between the clobberings). These entries show requests to index.php taking over 30 seconds to complete:

[06-Jun-2013 10:21:32] WARNING: [pool www] child 6037, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (35.152610 sec), logging
[06-Jun-2013 10:23:04] WARNING: [pool www] child 15643, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (33.938562 sec), logging
[06-Jun-2013 11:00:33] WARNING: [pool www] child 62121, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "POST /index.php") executing too slow (30.960707 sec), logging

And this indicating that we need more php-fpm processes:

[06-Jun-2013 11:00:39] WARNING: [pool www] server reached pm.max_children setting (10), consider raising it

The two scripts doing the clobbering are /var/xdrago/clear.sh and /var/xdrago/graceful.sh, and these are the logs they are set to clobber:

clear.sh:echo rotate > /var/log/php/php-fpm-error.log
clear.sh:echo rotate > /var/log/php/php-fpm-slow.log
clear.sh:echo rotate > /var/log/php/php53-fpm-error.log
clear.sh:echo rotate > /var/log/php/php53-fpm-slow.log
clear.sh:echo rotate > /var/log/php/error_log_52
clear.sh:echo rotate > /var/log/php/error_log_53
clear.sh:echo rotate > /var/log/php/error_log_cli_52
clear.sh:echo rotate > /var/log/php/error_log_cli_53
clear.sh:echo rotate > /var/log/redis/redis-server.log
clear.sh:echo rotate > /var/log/mysql/sql-slow-query.log
clear.sh:  echo rotate > /var/log/nginx/access.log
graceful.sh:  echo rotate > /var/log/nginx/speed_purge.log
graceful.sh:    echo rotate > /var/log/newrelic/nrsysmond.log
graceful.sh:    echo rotate > /var/log/newrelic/php_agent.log
graceful.sh:    echo rotate > /var/log/newrelic/newrelic-daemon.log

I have edited them in vim:

vim /var/xdrago/graceful.sh /var/xdrago/clear.sh

And ran this substitution on them (confirming each match) to comment out the lines doing the clobbering:

:1,$s/echo rotate/# echo rotate/gc

I have updated the BOA update notes to include this step, wiki:PuffinServer#UpgradingBOA.
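
For future BOA upgrades the same edit could be scripted rather than done interactively in vim. The following is a minimal sketch, not what was actually run on puffin; the function name comment_out_clobbering is made up for illustration. It only touches lines that begin (after optional indentation) with "echo rotate", so running it twice is harmless:

```shell
# Sketch only: comment out BOA "echo rotate > logfile" lines in a script.
# comment_out_clobbering is a hypothetical helper name, not part of BOA.
comment_out_clobbering() {
    file="$1"
    tmp="$(mktemp)" || return 1
    # Keep the original indentation and put "# " in front of "echo rotate".
    # Lines that are already commented out start with "#" and do not match.
    sed -E 's/^([[:space:]]*)echo rotate/\1# echo rotate/' "$file" > "$tmp" \
        || { rm -f "$tmp"; return 1; }
    # Write the result back into the same file so its ownership and
    # permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

It would be called as comment_out_clobbering /var/xdrago/clear.sh and again for graceful.sh, ideally after taking a copy of each script.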

I'm not sure where BOA is setting the max number of php-fpm processes to 10, in /etc/php5 it is set to 12:

grep -r "pm.max_children" /etc/php5/ | grep -v ";"
/etc/php5/fpm/pool.d/www.conf:pm.max_children = 12

I guess this is overridden somewhere so I'm doing some grepping to try to find out where.

comment:23 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.2
Total Hours changed from 3.92 to 4.12

According to this BOA ticket, https://drupal.org/node/1711596 pm.max_children is set in /etc/php5

These are the key settings in /etc/php5/fpm/pool.d/www.conf:

pm.max_children = 12
pm.max_spare_servers = 3
pm.min_spare_servers = 2
pm.start_servers = 3

Perhaps min_spare_servers is deducted from max_children to get the number 10?

Each php-fpm process uses around 100MB of RAM, see:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/phpfpm_average.html

So I think it's probably safe to increase this to 16, or more, as it only spikes above 7 a few times a day, see:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/phpfpm_processes.html

I have changed it to 16 and restarted php53-fpm.
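
As a rough sanity check on that change, the worst-case php-fpm memory is the child limit multiplied by the per-process size. A minimal sketch, assuming the roughly 100MB per process read off the munin graph also holds under load (the function name is made up for illustration):

```shell
# Sketch only: worst-case php-fpm memory in MB if every child is running.
# Arguments: pm.max_children, approximate MB per php-fpm process.
# fpm_worst_case_mb is a hypothetical helper name.
fpm_worst_case_mb() {
    echo $(( $1 * $2 ))
}

fpm_worst_case_mb 12 100   # limit found in www.conf
fpm_worst_case_mb 16 100   # new limit
```

This prints 1200 and 1600, so raising the limit from 12 to 16 adds at most about 400MB if every child is busy at once.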

The grep processes are still running, though I expect they won't return anything. I'll keep an eye on /var/log/php/php53-fpm-error.log to see what is reported.

comment:24 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 4.12 to 4.22

Interesting comment here on php-fpm settings:

The number of fpm children should be the number of children that you need. As a starting point
you want generally at least as many as CPUs as you have, so maybe 1 or 2 or 4 depending on
your computer, plus 2 or 3 more for when a child is waiting on something like a database
backend. But that is only a general rule. If your child processes are blocking for long
periods of time for something, like your php script is retrieving something offsite, you might
want more. With just Drupal accessing a database, you don't need that many extra.

https://groups.drupal.org/node/218179#comment-715829

We have 14 CPUs so setting it to 16 seems reasonable.

The grepping has finished and it's clear that the pm.max_children variable is set in /etc/php5/fpm/pool.d/www.conf

comment:25 in reply to: ↑ 21 ; follow-up: ↓ 26 Changed 3 years ago by jim

Replying to chris:

The clobbering of logs by BOA is, IMHO, horrible and for us totally unnecessary. There are tools for log rotation and compression in debian and I'd much rather use these.

BOA should at least give users an option to switch their log clobbering off via a configuration variable.

I completely agree.

You should consider raising a ticket on the Barracuda issue queue, since we're all part of the OS project now... The log rotation is only really useful for servers getting dozens+ hits per second. And as you say, there are better ways.
---
Regarding PHP-FPM workers etc, this landed a few days ago: http://drupalcode.org/project/barracuda.git/commit/65da24cf162f588932a8b9ee140a028a2a7ea869

Also FYI NginX 1.5.1 is included now, so the next up-stable system should grab this.

comment:26 in reply to: ↑ 25 ; follow-up: ↓ 35 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.8
Total Hours changed from 4.22 to 5.02

Replying to jim:

Replying to chris:

BOA should at least give users an option to switch their log clobbering off via a configuration variable.

I completely agree.

You should consider raising a ticket on the Barracuda issue queue

Thanks, I have posted the following at https://drupal.org/node/2013631

BOA Log Clobbering

By default BOA clobbers several log files, this cron job:

11 * * * * bash /var/xdrago/clear.sh >/dev/null 2>&1

Clobbers these logs:

grep "echo rotate" /var/xdrago/clear.sh
  echo rotate > /var/log/php/php-fpm-error.log
  echo rotate > /var/log/php/php-fpm-slow.log
  echo rotate > /var/log/php/php53-fpm-error.log
  echo rotate > /var/log/php/php53-fpm-slow.log
  echo rotate > /var/log/php/error_log_52
  echo rotate > /var/log/php/error_log_53
  echo rotate > /var/log/php/error_log_cli_52
  echo rotate > /var/log/php/error_log_cli_53
  echo rotate > /var/log/redis/redis-server.log
  echo rotate > /var/log/mysql/sql-slow-query.log
  echo rotate > /var/log/nginx/access.log

And this cron job:

18 0 * * * bash /var/xdrago/graceful.sh >/dev/null 2>&1

Clobbers these logs:

grep "echo rotate" /var/xdrago/graceful.sh
  echo rotate > /var/log/nginx/speed_purge.log
  echo rotate > /var/log/newrelic/nrsysmond.log
  echo rotate > /var/log/newrelic/php_agent.log
  echo rotate > /var/log/newrelic/newrelic-daemon.log

On servers where there isn't a problem with disk space it would be nice if there was an option to disable this clobbering and rely on the distribution log rotation scripts as there is potentially useful information that is lost when the logs are clobbered.

---
Regarding PHP-FPM workers etc, this landed a few days ago: http://drupalcode.org/project/barracuda.git/commit/65da24cf162f588932a8b9ee140a028a2a7ea869

Thanks for that, I have now found the config file with 10 in it:

grep -r "pm.max_children" /opt
  /opt/local/etc/php53-fpm.conf:;   static  - a fixed number (pm.max_children) of child processes;
  /opt/local/etc/php53-fpm.conf:;             pm.max_children      - the maximum number of children that can
  /opt/local/etc/php53-fpm.conf:;             pm.max_children           - the maximum number of children that
  /opt/local/etc/php53-fpm.conf:pm.max_children = 10

That file contains:

process.max = 12
pm.max_children = 10
pm.start_servers = 6
pm.min_spare_servers = 1
pm.max_spare_servers = 6

I changed these values:

process.max = 20
pm.max_children = 16

And restarted php53-fpm, and soon after got this in the error logs:

[06-Jun-2013 14:00:28] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 13 total children

So I have also changed this:

pm.min_spare_servers = 4

And I have updated the BOA update notes to mention these edits, wiki:PuffinServer#UpgradingBOA, and we should keep an eye on the graphs to see what the result is.

Also FYI NginX 1.5.1 is included now, so the next up-stable system should grab this.

Interesting, I wonder where BOA gets Nginx from these days. dotdeb only has 1.4 and I thought we were getting it from there:

http://www.dotdeb.org/category/nginx/

But we can't be:

nginx -v
nginx version: nginx/1.5.0

In fact it doesn't appear to be installed using aptitude at all:

aptitude search nginx
  p   nginx                                                       - small, powerful, scalable web/proxy server                            
  c   nginx-common                                                - small, powerful, scalable web/proxy server - common files             
  p   nginx-dbg                                                   - Debugging symbols for nginx                                           
  p   nginx-doc                                                   - small, powerful, scalable web/proxy server - documentation            
  p   nginx-extras                                                - nginx web/proxy server (extended version)                             
  p   nginx-extras-dbg                                            - nginx web/proxy server (extended version) - debugging symbols         
  p   nginx-full                                                  - nginx web/proxy server (standard version)                             
  p   nginx-full-dbg                                              - nginx web/proxy server (standard version) - debugging symbols         
  p   nginx-light                                                 - nginx web/proxy server (basic version)                                
  p   nginx-light-dbg                                             - nginx web/proxy server (basic version) - debugging symbols            
  p   nginx-naxsi                                                 - nginx web/proxy server (version with naxsi)                           
  p   nginx-naxsi-dbg                                             - nginx web/proxy server (version with naxsi) - debugging symbols       
  p   nginx-naxsi-ui                                              - nginx web/proxy server - naxsi configuration front-end                
  p   nginx-passenger                                             - nginx web/proxy server (Passenger version)                            
  p   nginx-passenger-dbg                                         - nginx web/proxy server (Passenger version) - debugging symbols

comment:27 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 5.02 to 5.27

Still getting some errors in /var/log/php/php53-fpm-error.log:

[06-Jun-2013 16:32:18] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 3 idle, and 13 total children
[06-Jun-2013 19:00:06] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 2 idle, and 10 total children
[06-Jun-2013 19:40:09] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 3 idle, and 11 total children
[06-Jun-2013 20:00:32] WARNING: [pool www] server reached pm.max_children setting (16), consider raising it

Looking at the graphs, the average memory usage is now lower even though the peaks are higher, so I have edited /opt/local/etc/php53-fpm.conf and changed:

process.max = 30
pm.max_children = 24
pm.start_servers = 8
pm.min_spare_servers = 6
pm.max_spare_servers = 12

Not sure if these are optimal, will check the log again tomorrow.

comment:28 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 5.27 to 5.52

We hit the php-fpm limit again this morning, perhaps related to the fact that the newsletter went out this morning as well. The Nginx requests per second are higher than average at the moment (2.42/sec compared with 1.38/sec):

[07-Jun-2013 08:00:50] WARNING: [pool www] server reached pm.max_children setting (24), consider raising it

I have edited /opt/local/etc/php53-fpm.conf with a view to reducing the general php-fpm memory usage while allowing it to spike higher:

process.max = 40
pm.max_children = 36
pm.start_servers = 6
pm.min_spare_servers = 4
pm.max_spare_servers = 10

Since we only have one pool, process.max doesn't need to be greater than pm.max_children.

I'm not convinced that pm.max_requests (below) needs to be set so low, as we don't have any evidence of memory leaks. We should perhaps try increasing it by a factor of 10 to reduce the number of times php-fpm processes have to be killed and restarted:

; The number of requests each child process should execute before respawning.
; This can be useful to work around memory leaks in 3rd party libraries. For
; endless request processing specify '0'. Equivalent to PHP_FCGI_MAX_REQUESTS.
; Default Value: 0
pm.max_requests = 500

comment:29 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 5.52 to 5.77

The max php-fpm limit was hit again:

[09-Jun-2013 08:05:24] WARNING: [pool www] server reached pm.max_children setting (36), consider raising it

The context for this limit being hit:

[09-Jun-2013 08:05:06] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 20 total children
[09-Jun-2013 08:05:07] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 0 idle, and 24 total children
[09-Jun-2013 08:05:08] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 28 total children
[09-Jun-2013 08:05:09] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 3 idle, and 32 total children
[09-Jun-2013 08:05:24] WARNING: [pool www] server reached pm.max_children setting (36), consider raising it
[09-Jun-2013 08:05:33] WARNING: [pool www] child 26743, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (32.894680 sec), logging
[09-Jun-2013 08:05:33] WARNING: [pool www] child 26330, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (32.119437 sec), logging
[09-Jun-2013 08:05:33] WARNING: [pool www] child 26329, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (31.277767 sec), logging
[09-Jun-2013 08:05:33] WARNING: [pool www] child 26328, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (32.266466 sec), logging
[09-Jun-2013 08:05:33] WARNING: [pool www] child 26082, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (30.955183 sec), logging
[09-Jun-2013 08:05:33] WARNING: [pool www] child 26066, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (30.320056 sec), logging
[09-Jun-2013 08:05:33] WARNING: [pool www] child 11592, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (30.421503 sec), logging
[09-Jun-2013 08:05:33] WARNING: [pool www] child 11590, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (32.509934 sec), logging
[09-Jun-2013 08:05:33] WARNING: [pool www] child 11589, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (30.739996 sec), logging
[09-Jun-2013 08:05:33] WARNING: [pool www] child 11586, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (32.881669 sec), logging
[09-Jun-2013 08:05:33] WARNING: [pool www] child 10656, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (32.704514 sec), logging

Looking at this graph it looks like we might have also hit the max number of mysql connections:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/mysql_connections.html

So these values have been edited in /etc/mysql/my.cnf, they were set at 30:

max_connections         = 40
max_user_connections    = 40

And mysql was restarted.

And /opt/local/etc/php53-fpm.conf was edited:

process.max = 50
pm.max_children = 42
pm.max_spare_servers = 8

And php53-fpm restarted.
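
These two limits interact: if every php-fpm child holds a MySQL connection at the same moment, 42 children can ask for more connections than the 40 allowed. Whether that actually happens here is an assumption (it presumes at most one connection per child and ignores other MySQL clients such as cron jobs or munin), but it is cheap to check. A minimal sketch, with made-up helper names, that reads both values from the config files:

```shell
# Sketch only: print the last uncommented "key = value" setting from an
# ini-style file such as my.cnf or php53-fpm.conf.
# ini_value and check_db_headroom are hypothetical helper names.
ini_value() {
    awk -F'=' -v key="$1" '
        /^[ \t]*[;#]/ { next }                  # skip comment lines
        {
            name = $1; gsub(/[ \t]/, "", name)
            if (name == key) { val = $2; gsub(/[ \t]/, "", val); found = val }
        }
        END { if (found != "") print found }
    ' "$2"
}

# Arguments: php-fpm config file, MySQL config file.
# Warn when php-fpm could open more connections than MySQL allows.
check_db_headroom() {
    children="$(ini_value pm.max_children "$1")"
    connections="$(ini_value max_connections "$2")"
    if [ -z "$children" ] || [ -z "$connections" ]; then
        echo "could not read one of the settings" >&2
        return 1
    fi
    if [ "$children" -gt "$connections" ]; then
        echo "WARNING: pm.max_children ($children) > max_connections ($connections)"
    else
        echo "OK: pm.max_children ($children) <= max_connections ($connections)"
    fi
}
```

On puffin this would be run as check_db_headroom /opt/local/etc/php53-fpm.conf /etc/mysql/my.cnf after each round of tuning.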

comment:30 Changed 3 years ago by ed

My understanding of the BOA rig was that it could withstand a slashdotting. This ticket is making me increasingly nervous, particularly as we are about to have articles in the Guardian and Daily Mail, and there is about to be a sustained online PR campaign around Rob's book from now all across July.

comment:31 follow-up: ↓ 32 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 5.77 to 6.02

Hi Ed, this is not a question of a 'slashdotting' -- the server can handle that, that workload is proven to be ok.

This is about a busy site on a server, which - due to either a misconfiguration, a database issue or something wrong with our current versions of the web stack packages (PHP, MySQL etc) - has less capacity than is expected.

It's clear from Chris's and my examinations that the PHP-FPM processes are being used up because something is making them hang for 30s+. I have some suspicions but nothing more, and the same goes for Chris.

My current theory is this: something is holding/blocking some requests, and they stack up behind this. I see this on my machine ONLY at 4.20 in the morning when some backup scripts/daily tasks are running. Puffin is getting this more often and at other times, so it's either not the same issue, or not the same trigger.

My hunch is either:

Our versions of PHP, MySQL, NGINX or other related stuff have a bug that's blocking/locking things
Our tn.org Drupal database has some issue around the mixed MyISAM/INNODB setup that is causing locks (this is true on my sites, which could explain my 4.20am load warning)
Puffin has some other issue not present on Babylon.

The danger is that if we have DB locking, and therefore lots of PHP processes queued up/blocking on that IO, that simply adding more workers over and above the BOA standard quantities risks eating all the memory and knocking the server over.

Anyway, I will continue to look into this on both Puffin and Babylon. It's telling that there are no issues like this raised on the Barracuda issue list, and it's used by thousands. So for now I think it could be a system- or Drupal-level issue.

More as it happens...

comment:32 in reply to: ↑ 31 Changed 3 years ago by chris

Replying to jim:

The danger is that if we have DB locking, and therefore lots of PHP processes queued up/blocking on that IO, that simply adding more workers over and above the BOA standard quantities risks eating all the memory and knocking the server over.

Yes, I have also been very concerned about this and have been keeping a very close eye on the memory usage and swap.

The php-fpm and mysql process limits have been increased slowly and with this concern very much in mind.

There is also an emergency plan if the shit really does hit the fan in a massive way, there is 48GB of RAM sitting on my desk which can be added to the server ;-) In other words Ed -- don't worry too much, the site isn't going to go down if we can help it :-)

comment:33 Changed 3 years ago by ed

nice, ta

comment:34 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.5
Total Hours changed from 6.02 to 6.52

Jim, have you taken a look at /var/log/php/php53-fpm-slow.log? Now that it is no longer being clobbered there is quite a lot of info there which might help. Every time a request takes more than 30 seconds it is logged in /var/log/php/php53-fpm-error.log, e.g. this is the last one:

[10-Jun-2013 10:39:30] WARNING: [pool www] child 47463, script '/data/disk/tn/static/transition-network-d6-004/index.php' (request: "GET /index.php") executing too slow (38.153063 sec), logging
[10-Jun-2013 10:39:30] NOTICE: child 47463 stopped for tracing
[10-Jun-2013 10:39:30] NOTICE: about to trace 47463
[10-Jun-2013 10:39:30] NOTICE: finished trace of 47463

There is a corresponding entry in the slowlog:

[10-Jun-2013 10:39:30]  [pool www] pid 47463
script_filename = /data/disk/tn/static/transition-network-d6-004/index.php
[0x0000000002686fb8] fsockopen() /data/disk/tn/static/transition-network-d6-004/includes/common.inc:475
[0x0000000002684190] drupal_http_request() /data/disk/tn/static/transition-network-d6-004/sites/all/modules/contrib/image_resize_filter/image_resize_filter.module:366
[0x0000000002683c38] image_resize_filter_get_images() /data/disk/tn/static/transition-network-d6-004/sites/all/modules/contrib/image_resize_filter/image_resize_filter.module:59
[0x00007fff91780a00] image_resize_filter_filter() unknown:0
[0x0000000002683910] call_user_func_array() /data/disk/tn/static/transition-network-d6-004/includes/module.inc:532
[0x0000000002683178] module_invoke() /data/disk/tn/static/transition-network-d6-004/modules/filter/filter.module:455
[0x0000000002682d80] check_markup() /data/disk/tn/static/transition-network-d6-004/modules/node/node.module:1058
[0x0000000002682a48] node_prepare() /data/disk/tn/static/transition-network-d6-004/modules/node/node.module:1102
[0x0000000002682698] node_build_content() /data/disk/tn/static/transition-network-d6-004/modules/node/node.module:1023
[0x0000000002682348] node_view() /data/disk/tn/static/transition-network-d6-004/modules/node/node.module:1118
[0x00000000026821d8] node_show() /data/disk/tn/static/transition-network-d6-004/modules/node/node.module:1814
[0x0000000002681c00] node_page_view() /data/disk/tn/static/transition-network-d6-004/sites/all/modules/contrib/ctools/page_manager/plugins/tasks/node_view.inc:107
[0x00007fff91781510] page_manager_node_view() unknown:0
[0x0000000002681828] call_user_func_array() /data/disk/tn/static/transition-network-d6-004/includes/menu.inc:360
[0x00000000026814d8] menu_execute_active_handler() /data/disk/tn/static/transition-network-d6-004/index.php:17
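
To look for patterns across many such traces, one option is to count which function each trace was stopped in, i.e. the first frame after the script_filename line. A minimal sketch, assuming every slowlog entry follows the layout shown above (the function name is made up for illustration):

```shell
# Sketch only: count the innermost function of each php-fpm slowlog trace.
# slowlog_top_frames is a hypothetical helper name.
slowlog_top_frames() {
    awk '
        /^script_filename/ { want = 1; next }   # the next frame is the innermost call
        want && /^\[0x/    { count[$2]++; want = 0 }
        END { for (fn in count) print count[fn], fn }
    ' "$1" | sort -rn
}
```

Run as slowlog_top_frames /var/log/php/php53-fpm-slow.log, it prints one line per function with the most frequent first; the trace above would be counted under fsockopen().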

However, reading through this log I can't see any really obvious patterns. We might be getting fewer than we were, but Sunday is always a slow day for the site: there were 17 yesterday (Sunday), 84 the day before and 52 the day before that, which was the first full day we have logs for:

grep "stopped for tracing" /var/log/php/php53-fpm-error.log | grep 09-Jun-2013 | wc -l
17
grep "stopped for tracing" /var/log/php/php53-fpm-error.log | grep 08-Jun-2013 | wc -l
84
grep "stopped for tracing" /var/log/php/php53-fpm-error.log | grep 07-Jun-2013 | wc -l
52

There have been 6 so far today.
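
The per-day counts can also be produced in one pass rather than with one grep per date. A minimal sketch (the function name is made up for illustration), assuming each log line starts with a [DD-Mon-YYYY timestamp as in the excerpts above:

```shell
# Sketch only: count "stopped for tracing" notices per day in a php-fpm
# error log. traces_per_day is a hypothetical helper name.
traces_per_day() {
    awk '
        /stopped for tracing/ {
            day = substr($1, 2)      # "[10-Jun-2013" becomes "10-Jun-2013"
            count[day]++
        }
        END { for (day in count) print day, count[day] }
    ' "$1" | sort
}
```

Run as traces_per_day /var/log/php/php53-fpm-error.log. The plain sort orders by day of the month, which is only chronological within a single month.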

comment:35 in reply to: ↑ 26 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.71
Total Hours changed from 6.52 to 7.23

Replying to chris:

Replying to jim:

Replying to chris:

BOA should at least give users an option to switch their log clobbering off via a configuration variable.

I completely agree.

You should consider raising a ticket on the Barracuda issue queue

Thanks, I have posted the following at https://drupal.org/node/2013631

The ticket has been closed (won't fix):

These logs have no configured logrotate scripts, so we just wipe them out. We
do this also because on fast enough system with SSD it is possible to quickly
fill the disk with logs if there is something which keeps generating errors. We
have seen servers crashed because of this, hence this aggressive procedure.

I don't consider it worth adding extra logrotate scripts since any really
useful errors you can find in the syslog anyway, but feel free to disagree and
submit patch for review and re-open.

Also, you seems to use really old BOA version, because we don't purge Nginx
access log, unless there is /root/.high_traffic.cnf control file.

https://drupal.org/node/2013631#comment-7511009

So, it doesn't appear to be worth following this up further with the BOA people.

These are the logs which are not rotated and which are of a size that makes them worth rotating:

80K  /var/log/mysql/sql-slow-query.log
120K /var/log/php/php53-fpm-error.log
218K /var/log/php/php53-fpm-slow.log
113M /var/log/php/www.access.log

I have edited /etc/logrotate.d/mysql-server, commenting out the rotation of logs that MariaDB doesn't create and replacing them with the slow query log, and also increased the number of days to keep logs from 7 to 30:

#/var/log/mysql.log /var/log/mysql/mysql.log /var/log/mysql/mysql-slow.log {
/var/log/mysql/sql-slow-query.log {

rotate 30

I have copied the nginx logrotate script to /etc/logrotate.d/php-fpm and edited it to:

/var/log/php/*.log {
        daily
        missingok
        rotate 30
        compress
        delaycompress
        notifempty
        create 0640 www-data adm
        sharedscripts
        postrotate
                [ ! -f /var/run/php53-fpm.pid ] || kill -USR1 `cat /var/run/php53-fpm.pid`
        endscript
}

The scripts were then manually run to test them:

logrotate -vf /etc/logrotate.d/mysql-server
logrotate -vf /etc/logrotate.d/php-fpm

comment:36 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.7
Total Hours changed from 7.23 to 7.93

There was another period of downtime last night at around 10pm, lasting 8 mins, with a load peak of over 80. I can't see any indication in the logs of what caused this; there is a corresponding gap in some of the munin stats but no clues there either. Some detail follows, but it's not very illuminating.

There were around 180 hits recorded from the Guardian article yesterday and around 80 from the Alternet article -- usually Sunday isn't very busy, perhaps this is related, but this wasn't a massive spike in traffic so it shouldn't have caused this effect.

Email Alerts

These are the email alerts I got:

Date: Sun, 16 Jun 2013 22:03:50 +0100 (BST)
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 39.16

Time:                    Sun Jun 16 22:03:49 2013 +0100
1 Min Load Avg:          86.33
5 Min Load Avg:          39.16
15 Min Load Avg:         15.39
Running/Total Processes: 88/417

Date: Sun, 16 Jun 2013 22:05:15 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        CRITICALs: load is 30.48 (outside range [:8]).

Date: Sun, 16 Jun 2013 22:07:07 +0100
Subject: ** PROBLEM Service Alert: puffin/HTTP is CRITICAL **

***** Nagios *****

Notification Type: PROBLEM

Service: HTTP
Host: puffin
Address: puffin.webarch.net
State: CRITICAL

Date/Time: Sun Jun 16 22:07:07 BST 2013

Additional Info:

Connection refused

Date: Sun, 16 Jun 2013 22:08:04 +0100
Subject: DOWN alert: www.transitionnetwork.org (www.transitionnetwork.org) is DOWN

PingdomAlert DOWN:
 www.transitionnetwork.org (www.transitionnetwork.org) is down since 16/06/2013  22:03:57.

Date: Sun, 16 Jun 2013 22:10:15 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        CRITICALs: load is 11.31 (outside range [:8]).

And then it came back up:

Date: Sun, 16 Jun 2013 22:12:01 +0100
Subject: UP alert: www.transitionnetwork.org (www.transitionnetwork.org) is UP


PingdomAlert UP:
 www.transitionnetwork.org (www.transitionnetwork.org) is UP again at 16/06/2013  22:11:57, after 8m of downtime.

Date: Sun, 16 Jun 2013 22:12:07 +0100
Subject: ** RECOVERY Service Alert: puffin/HTTP is OK **

***** Nagios *****

Notification Type: RECOVERY

Service: HTTP
Host: puffin
Address: puffin.webarch.net
State: OK

Date/Time: Sun Jun 16 22:12:07 BST 2013

Additional Info:

HTTP OK: HTTP/1.1 200 OK - 692 bytes in 0.006 second response time

Date: Sun, 16 Jun 2013 22:15:14 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        WARNINGs: load is 4.44 (outside range [:4]).

Date: Sun, 16 Jun 2013 22:20:17 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        OKs: load is 1.91.

Log Entries

There are these in the php log:

php53-fpm-error.log.1:[16-Jun-2013 17:22:32] WARNING: [pool www] server reached pm.max_children setting (42), consider raising it
php53-fpm-error.log.1:[16-Jun-2013 22:02:55] WARNING: [pool www] server reached pm.max_children setting (42), consider raising it

But I can't find anything much else to indicate what happened.

Settings Changed

I realise that this isn't the answer but have further tweaked these values in /etc/mysql/my.cnf, they were set at 40:

max_connections         = 50
max_user_connections    = 50

And mysql was restarted.

And /opt/local/etc/php53-fpm.conf was edited:

process.max = 60
pm.max_children = 50

And php53-fpm restarted.

comment:37 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.3
Total Hours changed from 7.93 to 8.23

The site went down again yesterday for about 9 mins at 3:30pm. The load peaked at 44.

Email Alerts

These are the emails I got as it went down:

Date: Mon, 17 Jun 2013 15:00:50 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        WARNINGs: load is 4.44 (outside range [:4]).

Date: Mon, 17 Jun 2013 15:05:15 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        OKs: load is 2.20.

Date: Mon, 17 Jun 2013 15:28:30 +0100 (BST)
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 8.00

Time:                    Mon Jun 17 15:26:55 2013 +0100
1 Min Load Avg:          25.40
5 Min Load Avg:          8.00
15 Min Load Avg:         3.40
Running/Total Processes: 47/331

Date: Mon, 17 Jun 2013 15:30:41 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        CRITICALs: load is 44.16 (outside range [:8]).

Date: Mon, 17 Jun 2013 15:31:07 +0100
Subject: ** PROBLEM Service Alert: puffin/HTTP is CRITICAL **

***** Nagios *****

Notification Type: PROBLEM

Service: HTTP
Host: puffin
Address: puffin.webarch.net
State: CRITICAL

Date/Time: Mon Jun 17 15:31:07 BST 2013

Additional Info:

Connection refused

Date: Mon, 17 Jun 2013 15:34:07 +0100
Subject: DOWN alert: www.transitionnetwork.org (www.transitionnetwork.org) is DOWN

PingdomAlert DOWN:
 www.transitionnetwork.org (www.transitionnetwork.org) is down since 17/06/2013  15:29:57.

Date: Mon, 17 Jun 2013 15:35:13 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        CRITICALs: load is 17.69 (outside range [:8]).

And then it recovered:

Date: Mon, 17 Jun 2013 15:39:05 +0100
Subject: UP alert: www.transitionnetwork.org (www.transitionnetwork.org) is UP

PingdomAlert UP:
 www.transitionnetwork.org (www.transitionnetwork.org) is UP again at 17/06/2013  15:38:57, after 9m of downtime.

Date: Mon, 17 Jun 2013 15:40:18 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        WARNINGs: load is 6.71 (outside range [:4]).

Date: Mon, 17 Jun 2013 15:41:07 +0100
Subject: ** RECOVERY Service Alert: puffin/HTTP is OK **

***** Nagios *****

Notification Type: RECOVERY

Service: HTTP
Host: puffin
Address: puffin.webarch.net
State: OK

Date/Time: Mon Jun 17 15:41:07 BST 2013

Additional Info:

HTTP OK: HTTP/1.1 200 OK - 692 bytes in 0.004 second response time

Date: Mon, 17 Jun 2013 15:45:15 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        OKs: load is 2.80.

Log Entries

There are a lot of these in the php-fpm error log just before the server went down:

[17-Jun-2013 15:25:40] WARNING: [pool www] child 43600, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.757526 sec), logging
[17-Jun-2013 15:25:40] WARNING: [pool www] child 35447, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.827107 sec), logging
[17-Jun-2013 15:25:40] WARNING: [pool www] child 29997, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.862649 sec), logging
[17-Jun-2013 15:25:40] WARNING: [pool www] child 29468, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.815989 sec), logging
[17-Jun-2013 15:25:40] WARNING: [pool www] child 29153, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.050583 sec), logging
[17-Jun-2013 15:25:50] WARNING: [pool www] child 45198, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.144049 sec), logging
[17-Jun-2013 15:25:50] WARNING: [pool www] child 35536, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.534339 sec), logging
[17-Jun-2013 15:25:50] WARNING: [pool www] child 35173, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.196559 sec), logging
[17-Jun-2013 15:25:50] WARNING: [pool www] child 29155, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.026623 sec), logging
[17-Jun-2013 15:26:10] WARNING: [pool www] child 45210, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.262363 sec), logging
[17-Jun-2013 15:26:10] WARNING: [pool www] child 45208, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (40.206509 sec), logging
[17-Jun-2013 15:26:20] WARNING: [pool www] child 45241, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.576699 sec), logging
[17-Jun-2013 15:26:20] WARNING: [pool www] child 45240, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.708073 sec), logging
[17-Jun-2013 15:26:20] WARNING: [pool www] child 45238, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.866814 sec), logging
[17-Jun-2013 15:26:20] WARNING: [pool www] child 45228, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.811486 sec), logging
[17-Jun-2013 15:26:20] WARNING: [pool www] child 45217, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.173565 sec), logging
[17-Jun-2013 15:26:20] WARNING: [pool www] child 45212, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.873535 sec), logging
[17-Jun-2013 15:26:20] WARNING: [pool www] child 29468, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.069448 sec), logging
[17-Jun-2013 15:26:30] WARNING: [pool www] child 45259, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.301978 sec), logging
[17-Jun-2013 15:26:30] WARNING: [pool www] child 45257, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.574328 sec), logging
[17-Jun-2013 15:26:30] WARNING: [pool www] child 45253, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.682662 sec), logging
[17-Jun-2013 15:26:30] WARNING: [pool www] child 45250, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.273894 sec), logging
[17-Jun-2013 15:26:30] WARNING: [pool www] child 45248, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.214245 sec), logging
[17-Jun-2013 15:26:30] WARNING: [pool www] child 45246, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.041068 sec), logging
[17-Jun-2013 15:26:30] WARNING: [pool www] child 45245, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.248792 sec), logging
[17-Jun-2013 15:27:00] WARNING: [pool www] child 45283, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.404809 sec), logging
[17-Jun-2013 15:27:00] WARNING: [pool www] child 45279, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.632685 sec), logging
[17-Jun-2013 15:27:00] WARNING: [pool www] child 45258, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.613425 sec), logging
[17-Jun-2013 15:27:00] WARNING: [pool www] child 45244, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.865239 sec), logging
[17-Jun-2013 15:27:11] WARNING: [pool www] child 45243, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.187872 sec), logging
[17-Jun-2013 15:27:21] WARNING: [pool www] child 45313, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.217324 sec), logging
[17-Jun-2013 15:27:21] WARNING: [pool www] child 45292, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.632868 sec), logging
[17-Jun-2013 15:27:21] WARNING: [pool www] child 45289, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.336141 sec), logging
[17-Jun-2013 15:27:21] WARNING: [pool www] child 45263, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.367704 sec), logging
[17-Jun-2013 15:27:51] WARNING: [pool www] child 45324, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.800232 sec), logging
[17-Jun-2013 15:27:51] WARNING: [pool www] child 45317, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (32.487625 sec), logging
[17-Jun-2013 15:27:51] WARNING: [pool www] child 45277, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.859741 sec), logging
[17-Jun-2013 15:28:11] WARNING: [pool www] child 45337, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.740100 sec), logging
[17-Jun-2013 15:28:11] WARNING: [pool www] child 45336, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.493527 sec), logging
[17-Jun-2013 15:28:11] WARNING: [pool www] child 45321, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.162045 sec), logging
[17-Jun-2013 15:28:21] WARNING: [pool www] child 45342, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.866735 sec), logging
[17-Jun-2013 15:28:31] WARNING: [pool www] child 45353, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.079867 sec), logging
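
To see how quickly the slow requests built up, the warnings above can be tallied per minute. This is a minimal sketch, assuming the php-fpm error log format shown above; the function name slow_per_minute is illustrative and not part of BOA:

```shell
# Count php-fpm "executing too slow" warnings per minute.
# Reads log lines on stdin and prints "HH:MM count" in the order read.
# Assumes lines start with "[DD-Mon-YYYY HH:MM:SS]" as in the excerpt above.
slow_per_minute() {
  grep 'executing too slow' |
    awk '{ print substr($2, 1, 5) }' |   # keep only HH:MM from the timestamp
    uniq -c |                            # count consecutive identical minutes
    awk '{ print $2, $1 }'               # reorder to "HH:MM count"
}
```

It could be run as `slow_per_minute < php53-fpm-error.log` (the path is an assumption). Over the 42 lines quoted above the tally is 9, 16, 12 and 5 warnings for 15:25 through 15:28.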

And a lot of these in the daemon.log:

Jun 17 15:29:54 puffin mysqld: 130617 15:29:54 [Warning] Aborted connection 18018 to db: 'masterpuffinwe_0' user: 'masterpuffinwe_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18055 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18024 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18028 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18025 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18057 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18039 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18026 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18019 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18022 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18062 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18043 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18011 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18040 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18036 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18021 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18030 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18044 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18045 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18047 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18046 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18033 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:55 puffin mysqld: 130617 15:29:55 [Warning] Aborted connection 18069 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:29:58 puffin mysqld: 130617 15:29:58 [Warning] Aborted connection 18010 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:14 puffin mysqld: 130617 15:30:13 [Warning] Aborted connection 18075 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:14 puffin mysqld: 130617 15:30:14 [Warning] Aborted connection 18058 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:17 puffin mysqld: 130617 15:30:17 [Warning] Aborted connection 18023 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18067 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18037 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18029 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18035 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18038 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18054 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18032 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18017 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18064 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18034 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18060 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18063 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18065 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18050 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18013 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18016 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18041 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18027 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18009 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18008 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 17 15:30:22 puffin mysqld: 130617 15:30:22 [Warning] Aborted connection 18020 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)

But there are also other times with lots of the above MySQL errors which haven't resulted in the server going down -- between June 16 at 5pm and June 18 at 9am there are 328:

grep "Aborted connection " daemon.log | wc -l
328
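
The total alone doesn't show whether the errors cluster around the outages. A per-hour breakdown could be compared against the alert times; this is a rough sketch, assuming the syslog format shown above, with an illustrative function name:

```shell
# Tally mysqld "Aborted connection" warnings per day and hour.
# Reads syslog-style lines on stdin and prints "Mon DD HH count".
# Assumes lines start with "Mon DD HH:MM:SS" as in the daemon.log excerpt.
aborted_per_hour() {
  grep 'Aborted connection ' |
    awk '{ print $1, $2, substr($3, 1, 2) }' |   # month, day, hour
    uniq -c |                                    # count consecutive identical hours
    awk '{ print $2, $3, $4, $1 }'               # reorder to "Mon DD HH count"
}
```

It could be run as `aborted_per_hour < daemon.log`; hours with a high count but no matching Pingdom or Nagios alert would be the ones that didn't take the site down.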

I'm afraid I still don't know what is causing these outages.

comment:38 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.3
Total Hours changed from 8.23 to 8.53

It just happened again at 10am: the load peaked at around 30 and the site went down for around 5 mins.

Email Alerts

Date: Tue, 18 Jun 2013 10:01:11 +0100 (BST)
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 8.93

Time:                    Tue Jun 18 10:01:11 2013 +0100
1 Min Load Avg:          30.61
5 Min Load Avg:          8.93
15 Min Load Avg:         3.45
Running/Total Processes: 45/371

Date: Tue, 18 Jun 2013 10:05:07 +0100
Subject: ** PROBLEM Service Alert: puffin/HTTP is CRITICAL **

***** Nagios *****

Notification Type: PROBLEM

Service: HTTP
Host: puffin
Address: puffin.webarch.net
State: CRITICAL

Date/Time: Tue Jun 18 10:05:07 BST 2013

Additional Info:

Connection refused

Date: Tue, 18 Jun 2013 10:05:18 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        CRITICALs: load is 14.37 (outside range [:8]).

Date: Tue, 18 Jun 2013 10:06:04 +0100
Subject: DOWN alert: www.transitionnetwork.org (www.transitionnetwork.org) is DOWN

PingdomAlert DOWN:
 www.transitionnetwork.org (www.transitionnetwork.org) is down since 18/06/2013  10:01:57.

And Pingdom reported it back up after 6 mins down:

Date: Tue, 18 Jun 2013 10:08:03 +0100
Subject: UP alert: www.transitionnetwork.org (www.transitionnetwork.org) is UP

PingdomAlert UP:
 www.transitionnetwork.org (www.transitionnetwork.org) is UP again at 18/06/2013  10:07:59, after 6m of downtime.

Date: Tue, 18 Jun 2013 10:10:07 +0100
Subject: ** RECOVERY Service Alert: puffin/HTTP is OK **

***** Nagios *****

Notification Type: RECOVERY

Service: HTTP
Host: puffin
Address: puffin.webarch.net
State: OK

Date/Time: Tue Jun 18 10:10:07 BST 2013

Additional Info:

HTTP OK: HTTP/1.1 200 OK - 692 bytes in 0.004 second response time

Date: Tue, 18 Jun 2013 10:10:32 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        WARNINGs: load is 5.98 (outside range [:4]).

Log Entries

daemon.log:

Jun 18 10:01:11 puffin mysqld: 130618 10:01:11 [Warning] Aborted connection 137674 to db: 'masterpuffinwe_0' user: 'masterpuffinwe_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137691 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137680 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137695 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137681 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137692 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137688 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137678 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137696 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137666 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137686 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137704 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137693 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137684 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137677 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137668 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137683 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137689 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137670 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137685 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137669 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137671 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 18 10:02:57 puffin mysqld: 130618 10:02:57 [Warning] Aborted connection 137682 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)

The php-fpm error log has nothing in it. It has been clobbered again, but I'm not sure what did this, as the clobbering in the scripts in /var/xdrago is still commented out.

Munin Graphs

The max number of MySQL connections seems to be reached soon after each increase. I don't know if it should be increased further; see the connections by month graph:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/mysql_connections.html

Again I can't see any indication of the cause of the downtime in the Munin graphs.

comment:39 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 8.53 to 8.78

Yesterday did have the highest number of visits in a day for the whole of the last year:

1742 visits, 1491 unique visitors
3 min 4s average visit duration
55% visits have bounced (left the website after one page)
3.5 actions (page views, downloads, outlinks and internal site searches) per visit
1.46s average generation time
5698 pageviews, 4380 unique pageviews
33 total searches on your website, 33 unique keywords
81 downloads, 77 unique downloads
202 outlinks, 192 unique outlinks
184 max actions in one visit

These are the results from the mysqltuner.pl script:

 perl mysqltuner.pl

 >>  MySQLTuner 1.2.0 - Major Hayden <major@mhtx.net>
 >>  Bug reports, feature requests, and downloads at http://mysqltuner.com/
 >>  Run with '--help' for additional options and output filtering
[OK] Logged in using credentials from debian maintenance account.

-------- General Statistics --------------------------------------------------
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.5.31-MariaDB-1~squeeze-log
[OK] Operating on 64-bit architecture

-------- Storage Engine Statistics -------------------------------------------
[--] Status: +Archive -BDB +Federated +InnoDB -ISAM -NDBCluster 
[--] Data in MyISAM tables: 104M (Tables: 2)
[--] Data in InnoDB tables: 437M (Tables: 782)
[--] Data in PERFORMANCE_SCHEMA tables: 0B (Tables: 17)
[!!] Total fragmented tables: 94

-------- Security Recommendations  -------------------------------------------
[OK] All database users have passwords assigned

-------- Performance Metrics -------------------------------------------------
[--] Up for: 22h 54m 12s (6M q [73.364 qps], 148K conn, TX: 11B, RX: 937M)
[--] Reads / Writes: 91% / 9%
[--] Total buffers: 1.1G global + 13.4M per thread (50 max threads)
[OK] Maximum possible memory usage: 1.8G (44% of installed RAM)
[OK] Slow queries: 0% (23/6M)
[!!] Highest connection usage: 100%  (51/50)
[OK] Key buffer size / total MyISAM indexes: 509.0M/91.1M
[OK] Key buffer hit rate: 98.6% (9M cached / 138K reads)
[OK] Query cache efficiency: 74.2% (4M cached / 5M selects)
[!!] Query cache prunes per day: 1104157
[OK] Sorts requiring temporary tables: 1% (2K temp sorts / 174K sorts)
[!!] Joins performed without indexes: 7006
[!!] Temporary tables created on disk: 30% (64K on disk / 211K total)
[OK] Thread cache hit rate: 99% (51 created / 148K connections)
[!!] Table cache hit rate: 0% (128 open / 28K opened)
[OK] Open file limit used: 0% (4/196K)
[OK] Table locks acquired immediately: 99% (2M immediate / 2M locks)
[OK] InnoDB data size / buffer pool: 437.1M/509.0M

-------- Recommendations -----------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
    MySQL started within last 24 hours - recommendations may be inaccurate
    Reduce or eliminate persistent connections to reduce connection usage
    Adjust your join queries to always utilize indexes
    When making adjustments, make tmp_table_size/max_heap_table_size equal
    Reduce your SELECT DISTINCT queries without LIMIT clauses
    Increase table_cache gradually to avoid file descriptor limits
Variables to adjust:
    max_connections (> 50)
    wait_timeout (< 3600)
    interactive_timeout (< 28800)
    query_cache_size (> 64M)
    join_buffer_size (> 1.0M, or always use indexes with joins)
    tmp_table_size (> 64M)
    max_heap_table_size (> 128M)
    table_cache (> 128)

I have increased these from 50:

max_connections         = 75
max_user_connections    = 75

And restarted MySQL -- there is enough RAM for this:

[--] Total buffers: 1.1G global + 13.4M per thread (75 max threads)
[OK] Maximum possible memory usage: 2.1G (52% of installed RAM)

See:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/memory.html
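
The mysqltuner.pl memory figures above appear to be the global buffers plus the per-thread buffers multiplied by max_connections. A rough check of that arithmetic, using the rounded values from the output (so the result is approximate) and an illustrative function name:

```shell
# Estimate MySQL's maximum memory use in GiB:
#   global buffers (GiB) + per-thread buffers (MiB) * max connections / 1024
# Arguments: global_gib per_thread_mib max_connections
mysql_max_mem_gib() {
  awk -v g="$1" -v t="$2" -v n="$3" \
    'BEGIN { printf "%.1f\n", g + (t * n) / 1024 }'
}

mysql_max_mem_gib 1.1 13.4 50   # prints 1.8, as reported before the change
mysql_max_mem_gib 1.1 13.4 75   # prints 2.1, as reported after the change
```

On these figures each extra 25 connections adds roughly 335M to the worst case.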

I realise Jim isn't keen on BOA settings being tweaked, but I don't know what else to do.

comment:40 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.0
Total Hours changed from 8.78 to 9.78

The site went down again this morning for 5 mins.

Email Alerts

Date: Wed, 19 Jun 2013 06:27:46 +0100 (BST)
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 8.48

Time:                    Wed Jun 19 06:27:11 2013 +0100
1 Min Load Avg:          25.74
5 Min Load Avg:          8.48
15 Min Load Avg:         3.22
Running/Total Processes: 36/381

Date: Wed, 19 Jun 2013 06:30:14 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        CRITICALs: load is 20.06 (outside range [:8]).

Date: Wed, 19 Jun 2013 06:32:07 +0100
Subject: ** PROBLEM Service Alert: puffin/HTTP is CRITICAL **

***** Nagios *****

Notification Type: PROBLEM

Service: HTTP
Host: puffin
Address: puffin.webarch.net
State: CRITICAL

Date/Time: Wed Jun 19 06:32:07 BST 2013

Additional Info:

Connection refused

Date: Wed, 19 Jun 2013 06:33:07 +0100
Subject: DOWN alert: www.transitionnetwork.org (www.transitionnetwork.org) is DOWN

PingdomAlert DOWN:
 www.transitionnetwork.org (www.transitionnetwork.org) is down since 19/06/2013  06:28:57.

Date: Wed, 19 Jun 2013 06:34:35 +0100
Subject: UP alert: www.transitionnetwork.org (www.transitionnetwork.org) is UP

PingdomAlert UP:
 www.transitionnetwork.org (www.transitionnetwork.org) is UP again at 19/06/2013  06:33:57, after 5m of downtime.

Date: Wed, 19 Jun 2013 06:35:13 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        WARNINGs: load is 7.41 (outside range [:4]).

Logs

Again there are a lot of aborted database connection messages in the daemon.log, but these appear to come after the site went down, so they seem to be a symptom rather than a cause:

Jun 19 06:28:55 puffin mysqld: 130619  6:28:55 [Warning] Aborted connection 116403 to db: 'tnpuffinwebarchn' user: 'tnpuffinwebarchn' host: 'localhost' (Unknown error)
Jun 19 06:28:55 puffin mysqld: 130619  6:28:55 [Warning] Aborted connection 116391 to db: 'masterpuffinwe_0' user: 'masterpuffinwe_0' host: 'localhost' (Unknown error)
Jun 19 06:28:56 puffin mysqld: 130619  6:28:56 [Warning] Aborted connection 116355 to db: 'masterpuffinwe_0' user: 'masterpuffinwe_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116370 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116377 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116394 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116392 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116395 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116374 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116378 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116406 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116373 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116375 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116369 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116386 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116367 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116413 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:57 puffin mysqld: 130619  6:28:57 [Warning] Aborted connection 116407 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:59 puffin mysqld: 130619  6:28:59 [Warning] Aborted connection 116371 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:59 puffin mysqld: 130619  6:28:59 [Warning] Aborted connection 116356 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:28:59 puffin mysqld: 130619  6:28:59 [Warning] Aborted connection 116405 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:29:00 puffin mysqld: 130619  6:29:00 [Warning] Aborted connection 116372 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:29:00 puffin mysqld: 130619  6:29:00 [Warning] Aborted connection 116411 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:29:00 puffin mysqld: 130619  6:29:00 [Warning] Aborted connection 116363 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:29:00 puffin mysqld: 130619  6:29:00 [Warning] Aborted connection 116360 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:29:00 puffin mysqld: 130619  6:29:00 [Warning] Aborted connection 116357 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:29:00 puffin mysqld: 130619  6:29:00 [Warning] Aborted connection 116361 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 19 06:29:00 puffin mysqld: 130619  6:29:00 [Warning] Aborted connection 116362 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
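
To compare the timing of these warnings with the outage, the entries above can be tallied per second and per database. The following is a minimal sketch only: the function name count_aborted is hypothetical, and the log path in the usage comment (/var/log/daemon.log) is an assumption based on the Debian default rather than something confirmed here.

```shell
#!/bin/sh
# Sketch: tally mysqld "Aborted connection" warnings by time and database.
# Reads syslog-style lines on stdin, prints "HH:MM:SS dbname count".
count_aborted() {
  awk -v q="'" '
    /mysqld: .*Aborted connection/ {
      # The database name is the quoted field that follows "db:".
      for (i = 1; i <= NF; i++) if ($i == "db:") db = $(i + 1)
      gsub(q, "", db)          # strip the surrounding single quotes
      n[$3 " " db]++           # field 3 is the syslog HH:MM:SS timestamp
    }
    END { for (k in n) print k, n[k] }' | sort
}

# Usage (path is an assumption):
#   count_aborted < /var/log/daemon.log
```

If the counts only start at or after the moment the site went down, that would support reading them as a symptom.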

Following are all the entries in the php-fpm error log, starting from just before the site went down:

[19-Jun-2013 06:25:31] WARNING: [pool www] child 15382, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.833145 sec), logging
[19-Jun-2013 06:25:32] NOTICE: child 15382 stopped for tracing
[19-Jun-2013 06:25:32] NOTICE: about to trace 15382
[19-Jun-2013 06:25:32] NOTICE: finished trace of 15382
[19-Jun-2013 06:25:42] WARNING: [pool www] child 65342, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.993295 sec), logging
[19-Jun-2013 06:25:42] WARNING: [pool www] child 58525, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.797258 sec), logging
[19-Jun-2013 06:25:42] NOTICE: child 58525 stopped for tracing
[19-Jun-2013 06:25:42] NOTICE: about to trace 58525
[19-Jun-2013 06:25:42] ERROR: failed to ptrace(PEEKDATA) pid 58525: Input/output error (5)
[19-Jun-2013 06:25:42] NOTICE: finished trace of 58525
[19-Jun-2013 06:25:42] NOTICE: child 65342 stopped for tracing
[19-Jun-2013 06:25:42] NOTICE: about to trace 65342
[19-Jun-2013 06:25:42] NOTICE: finished trace of 65342
[19-Jun-2013 06:25:52] WARNING: [pool www] child 65341, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.574860 sec), logging
[19-Jun-2013 06:25:52] WARNING: [pool www] child 60136, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.518012 sec), logging
[19-Jun-2013 06:25:52] WARNING: [pool www] child 60039, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.343026 sec), logging
[19-Jun-2013 06:25:52] WARNING: [pool www] child 58526, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.164298 sec), logging
[19-Jun-2013 06:25:52] NOTICE: child 60136 stopped for tracing
[19-Jun-2013 06:25:52] NOTICE: about to trace 60136
[19-Jun-2013 06:25:52] NOTICE: finished trace of 60136
[19-Jun-2013 06:25:52] NOTICE: child 58526 stopped for tracing
[19-Jun-2013 06:25:52] NOTICE: about to trace 58526
[19-Jun-2013 06:25:52] NOTICE: finished trace of 58526
[19-Jun-2013 06:25:52] NOTICE: child 60039 stopped for tracing
[19-Jun-2013 06:25:52] NOTICE: about to trace 60039
[19-Jun-2013 06:25:53] NOTICE: finished trace of 60039
[19-Jun-2013 06:25:53] NOTICE: child 65341 stopped for tracing
[19-Jun-2013 06:25:53] NOTICE: about to trace 65341
[19-Jun-2013 06:25:53] NOTICE: finished trace of 65341
[19-Jun-2013 06:25:57] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 2 idle, and 19 total children
[19-Jun-2013 06:25:58] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 3 idle, and 21 total children
[19-Jun-2013 06:26:22] WARNING: [pool www] child 23986, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.334257 sec), logging
[19-Jun-2013 06:26:22] NOTICE: child 23986 stopped for tracing
[19-Jun-2013 06:26:22] NOTICE: about to trace 23986
[19-Jun-2013 06:26:22] NOTICE: finished trace of 23986
[19-Jun-2013 06:26:32] WARNING: [pool www] child 24074, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.565718 sec), logging
[19-Jun-2013 06:26:32] WARNING: [pool www] child 24072, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.141893 sec), logging
[19-Jun-2013 06:26:32] WARNING: [pool www] child 24071, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.893850 sec), logging
[19-Jun-2013 06:26:32] WARNING: [pool www] child 24070, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.088002 sec), logging
[19-Jun-2013 06:26:32] WARNING: [pool www] child 24069, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.859334 sec), logging
[19-Jun-2013 06:26:32] WARNING: [pool www] child 23991, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.090436 sec), logging
[19-Jun-2013 06:26:32] WARNING: [pool www] child 23989, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.128572 sec), logging
[19-Jun-2013 06:26:32] WARNING: [pool www] child 23988, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.120426 sec), logging
[19-Jun-2013 06:26:32] WARNING: [pool www] child 58524, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.118462 sec), logging
[19-Jun-2013 06:26:32] NOTICE: child 58524 stopped for tracing
[19-Jun-2013 06:26:32] NOTICE: about to trace 58524
[19-Jun-2013 06:26:32] NOTICE: finished trace of 58524
[19-Jun-2013 06:26:32] NOTICE: child 23988 stopped for tracing
[19-Jun-2013 06:26:32] NOTICE: about to trace 23988
[19-Jun-2013 06:26:32] NOTICE: finished trace of 23988
[19-Jun-2013 06:26:32] NOTICE: child 23989 stopped for tracing
[19-Jun-2013 06:26:32] NOTICE: about to trace 23989
[19-Jun-2013 06:26:32] NOTICE: finished trace of 23989
[19-Jun-2013 06:26:32] NOTICE: child 23991 stopped for tracing
[19-Jun-2013 06:26:32] NOTICE: about to trace 23991
[19-Jun-2013 06:26:32] NOTICE: finished trace of 23991
[19-Jun-2013 06:26:32] NOTICE: child 24069 stopped for tracing
[19-Jun-2013 06:26:32] NOTICE: about to trace 24069
[19-Jun-2013 06:26:33] NOTICE: finished trace of 24069
[19-Jun-2013 06:26:33] NOTICE: child 24070 stopped for tracing
[19-Jun-2013 06:26:33] NOTICE: about to trace 24070
[19-Jun-2013 06:26:33] NOTICE: finished trace of 24070
[19-Jun-2013 06:26:33] NOTICE: child 24071 stopped for tracing
[19-Jun-2013 06:26:33] NOTICE: about to trace 24071
[19-Jun-2013 06:26:33] NOTICE: finished trace of 24071
[19-Jun-2013 06:26:33] NOTICE: child 24072 stopped for tracing
[19-Jun-2013 06:26:33] NOTICE: about to trace 24072
[19-Jun-2013 06:26:33] NOTICE: finished trace of 24072
[19-Jun-2013 06:26:33] NOTICE: child 24074 stopped for tracing
[19-Jun-2013 06:26:33] NOTICE: about to trace 24074
[19-Jun-2013 06:26:33] NOTICE: finished trace of 24074
[19-Jun-2013 06:27:02] WARNING: [pool www] child 24075, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.738509 sec), logging
[19-Jun-2013 06:27:02] NOTICE: child 24075 stopped for tracing
[19-Jun-2013 06:27:02] NOTICE: about to trace 24075
[19-Jun-2013 06:27:02] NOTICE: finished trace of 24075
[19-Jun-2013 06:27:12] WARNING: [pool www] child 24078, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.576278 sec), logging
[19-Jun-2013 06:27:12] WARNING: [pool www] child 24073, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.021376 sec), logging
[19-Jun-2013 06:27:12] NOTICE: child 24073 stopped for tracing
[19-Jun-2013 06:27:12] NOTICE: about to trace 24073
[19-Jun-2013 06:27:12] ERROR: failed to ptrace(PEEKDATA) pid 24073: Input/output error (5)
[19-Jun-2013 06:27:12] NOTICE: finished trace of 24073
[19-Jun-2013 06:27:12] NOTICE: child 24078 stopped for tracing
[19-Jun-2013 06:27:12] NOTICE: about to trace 24078
[19-Jun-2013 06:27:12] NOTICE: finished trace of 24078
[19-Jun-2013 06:27:22] WARNING: [pool www] child 24080, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.363864 sec), logging
[19-Jun-2013 06:27:22] WARNING: [pool www] child 24079, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.426180 sec), logging
[19-Jun-2013 06:27:22] NOTICE: child 24080 stopped for tracing
[19-Jun-2013 06:27:22] NOTICE: about to trace 24080
[19-Jun-2013 06:27:23] NOTICE: finished trace of 24080
[19-Jun-2013 06:27:23] NOTICE: child 24079 stopped for tracing
[19-Jun-2013 06:27:23] NOTICE: about to trace 24079
[19-Jun-2013 06:27:23] ERROR: failed to ptrace(PEEKDATA) pid 24079: Input/output error (5)
[19-Jun-2013 06:27:23] NOTICE: finished trace of 24079
[19-Jun-2013 06:27:42] WARNING: [pool www] child 24209, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.571959 sec), logging
[19-Jun-2013 06:27:42] WARNING: [pool www] child 24207, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.672834 sec), logging
[19-Jun-2013 06:27:42] NOTICE: child 24209 stopped for tracing
[19-Jun-2013 06:27:42] NOTICE: about to trace 24209
[19-Jun-2013 06:27:42] NOTICE: finished trace of 24209
[19-Jun-2013 06:27:42] NOTICE: child 24207 stopped for tracing
[19-Jun-2013 06:27:42] NOTICE: about to trace 24207
[19-Jun-2013 06:27:43] NOTICE: finished trace of 24207
[19-Jun-2013 06:28:32] WARNING: [pool www] child 24216, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.870764 sec), logging
[19-Jun-2013 06:28:32] WARNING: [pool www] child 24212, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.754472 sec), logging
[19-Jun-2013 06:28:32] NOTICE: child 24212 stopped for tracing
[19-Jun-2013 06:28:33] NOTICE: about to trace 24212
[19-Jun-2013 06:28:33] ERROR: failed to ptrace(PEEKDATA) pid 24212: Input/output error (5)
[19-Jun-2013 06:28:33] NOTICE: finished trace of 24212
[19-Jun-2013 06:28:33] NOTICE: child 24216 stopped for tracing
[19-Jun-2013 06:28:33] NOTICE: about to trace 24216
[19-Jun-2013 06:28:33] ERROR: failed to ptrace(PEEKDATA) pid 24216: Input/output error (5)
[19-Jun-2013 06:28:33] NOTICE: finished trace of 24216
[19-Jun-2013 06:28:52] WARNING: [pool www] child 24256, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.653180 sec), logging
[19-Jun-2013 06:28:53] NOTICE: child 24256 stopped for tracing
[19-Jun-2013 06:28:53] NOTICE: about to trace 24256
[19-Jun-2013 06:28:53] ERROR: failed to ptrace(PEEKDATA) pid 24256: Input/output error (5)
[19-Jun-2013 06:28:53] NOTICE: finished trace of 24256
[19-Jun-2013 06:28:57] NOTICE: Finishing ...
[19-Jun-2013 06:28:58] NOTICE: Finishing ...
[19-Jun-2013 06:28:59] NOTICE: Finishing ...
[19-Jun-2013 06:29:00] NOTICE: exiting, bye-bye!

The logs were rotated at this point; these are the first couple of lines in the next log file:

[19-Jun-2013 06:34:24] NOTICE: fpm is running, pid 29480
[19-Jun-2013 06:34:24] NOTICE: ready to handle connections
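
The php-fpm error log excerpt above is long, so a per-minute count of the "executing too slow" warnings may make the build-up easier to see. This is a rough sketch: the function name slow_per_minute is hypothetical, and the log file name in the usage comment is a placeholder, since the path is not given in this ticket.

```shell
#!/bin/sh
# Sketch: count php-fpm "executing too slow" warnings per minute.
# Reads php-fpm error log lines on stdin, prints "HH:MM count".
slow_per_minute() {
  awk '
    /executing too slow/ {
      # Field 2 looks like "06:25:31]"; keep only the HH:MM part.
      n[substr($2, 1, 5)]++
    }
    END { for (m in n) print m, n[m] }' | sort
}

# Usage (file name is a placeholder):
#   slow_per_minute < php-fpm-error.log
```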

I have found lots of 503 errors in the nginx logs; I'll follow that up on another ticket.

comment:41 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.75
Total Hours changed from 9.78 to 10.53

Adding hours of investigation.

Also noting I don't think the PHP-FPM logs are telling us much other than 'something stopped responding'.

And I've enabled syslog on all main Drupal sites, and commented out the bit of /var/xdrago/daily.sh that disables syslog for performance reasons.

I'm also now convinced this is nothing to do with Drupal, since no Drupal errors happen before or after these events.

comment:42 Changed 3 years ago by chris

Description modified (diff)

comment:43 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.5
Total Hours changed from 10.53 to 11.03

On ticket:563#comment:3 it has been noted that second.sh restarts nginx when the load hits 3.8 and if the load reaches 14.4 then nginx is killed and php-fpm stopped.

These values seem totally off for a 14 CPU server and the suspicion is that this is the cause of the downtime when there is a load spike.

So these values in /var/xdrago/second.sh:

CTL_ONEX_SPIDER_LOAD=388
CTL_FIVX_SPIDER_LOAD=388
CTL_ONEX_LOAD=1444
CTL_FIVX_LOAD=888
CTL_ONEX_LOAD_CRIT=1888
CTL_FIVX_LOAD_CRIT=1555

Have been multiplied by 4:

CTL_ONEX_SPIDER_LOAD=1552
CTL_FIVX_SPIDER_LOAD=1552
CTL_ONEX_LOAD=5776
CTL_FIVX_LOAD=3552
CTL_ONEX_LOAD_CRIT=7552
CTL_FIVX_LOAD_CRIT=6220

And the crontab has been re-enabled:

* * * * * bash /var/xdrago/second.sh >/dev/null 2>&1
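
For reference, the arithmetic behind the new second.sh values can be reproduced with a short sketch. It assumes each threshold is a load average multiplied by 100, which matches the 3.8 and 14.4 figures quoted from ticket:563 for 388 and 1444; the helper names scale_threshold and as_load are hypothetical and are not part of second.sh.

```shell
#!/bin/sh
# Sketch only: reproduces the x4 scaling applied to the second.sh thresholds.
# Assumption: each value is a load average multiplied by 100 (388 -> 3.88).

FACTOR=4

# scale_threshold VALUE: print VALUE multiplied by FACTOR.
scale_threshold() {
  echo $(( $1 * FACTOR ))
}

# as_load VALUE: print the threshold as a load average with two decimals.
as_load() {
  printf '%d.%02d\n' $(( $1 / 100 )) $(( $1 % 100 ))
}

for pair in CTL_ONEX_SPIDER_LOAD=388 CTL_FIVX_SPIDER_LOAD=388 \
            CTL_ONEX_LOAD=1444 CTL_FIVX_LOAD=888 \
            CTL_ONEX_LOAD_CRIT=1888 CTL_FIVX_LOAD_CRIT=1555; do
  name=${pair%%=*}    # text before the "="
  old=${pair#*=}      # text after the "="
  new=$(scale_threshold "$old")
  echo "$name=$new   # load $(as_load "$old") -> $(as_load "$new")"
done
```

Changing FACTOR is enough to try a different multiplier before editing second.sh by hand.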

comment:44 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.7
Priority changed from major to critical
Total Hours changed from 11.03 to 12.73
Summary changed from 12 mins of downtime on 29th May 2013 to Load spikes, ksoftirqd using all the CPU and services stopping for 15 min at a time

I have changed the Munin server on Penguin to update every 3 mins rather than every 5 mins to get better resolution on the stats. The file that needed editing to do this is /etc/cron.d/munin:

#*/5 * * * *     munin if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi
*/3 * * * *     munin if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi

When the last spike happened it appears that ksoftirqd was using almost all the CPU. It looked like this in top:

top - 10:20:24 up 47 min,  2 users,  load average: 8.27, 3.22, 1.69
Tasks: 248 total,  25 running, 223 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.6%us,  1.5%sy,  0.0%ni, 18.5%id,  0.0%wa,  0.0%hi,  4.4%si, 74.9%st
Mem:   8372060k total,  4995628k used,  3376432k free,  2847284k buffers
Swap:  1048568k total,        0k used,  1048568k free,   576780k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                               
   13 root      20   0     0    0    0 R  103  0.0   0:59.73 ksoftirqd/3                                                            
   19 root      20   0     0    0    0 R  103  0.0   1:03.89 ksoftirqd/5                                                            
   10 root      20   0     0    0    0 R  102  0.0   1:18.58 ksoftirqd/2                                                            
   16 root      20   0     0    0    0 R  102  0.0   0:59.98 ksoftirqd/4                                                            
   40 root      20   0     0    0    0 R  101  0.0   1:09.77 ksoftirqd/12                                                           
   22 root      20   0     0    0    0 R  100  0.0   1:06.77 ksoftirqd/6                                                            
   25 root      20   0     0    0    0 R   99  0.0   1:10.52 ksoftirqd/7                                                            
   34 root      20   0     0    0    0 R   99  0.0   0:52.93 ksoftirqd/10                                                           
    4 root      20   0     0    0    0 R   98  0.0   0:41.95 ksoftirqd/0                                                            
    7 root      20   0     0    0    0 R   98  0.0   0:50.85 ksoftirqd/1                                                            
   28 root      20   0     0    0    0 R   98  0.0   1:13.32 ksoftirqd/8                                                            
   31 root      20   0     0    0    0 R   70  0.0   0:58.65 ksoftirqd/9                                                            
   37 root      20   0     0    0    0 R   62  0.0   0:47.97 ksoftirqd/11                                                           
   30 root      RT   0     0    0    0 S   27  0.0   0:03.66 migration/9                                                            
29492 www-data  20   0  771m  92m  50m R    2  1.1   0:03.26 php-fpm                                                                
29493 www-data  20   0  762m  66m  33m S    1  0.8   0:00.88 php-fpm                                                                
 3356 mysql     20   0 1647m 414m 9.9m S    1  5.1   1:24.31 mysqld

Other people have had problems like this, for example:

More investigation is needed, but I think we have finally found a cause or perhaps a symptom that might lead us to a cause...

I have updated the puffin documentation, adding a section on php-fpm, wiki:PuffinServer#php-fpm, and also documenting the my.cnf, php53-fpm.conf and second.sh tweaks from the default BOA settings.

I have also spent a fair amount of time watching top and the munin stats.

comment:45 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 2.0
Total Hours changed from 12.73 to 14.73

Investigating the ksoftirqd issue...

If ksoftirqd is taking more than a tiny percentage of CPU time,
this indicates the machine is under heavy soft interrupt load.

http://www.tin.org/bin/man.cgi?section=9&topic=ksoftirqd

For reference, following is the result of cat /proc/interrupts; we could do with the output from this when there is next a load spike.

           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      
565:     402997          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     eth0
566:         60          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
567:     484313          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
568:       1765          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     hvc_console
569:        434          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     xenbus
570:          0          0          0          0          0          0          0          0          0          0          0          0          0       6043  xen-percpu-ipi       callfuncsingle13
571:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug13
572:          0          0          0          0          0          0          0          0          0          0          0          0          0       1568  xen-percpu-ipi       callfunc13
573:          0          0          0          0          0          0          0          0          0          0          0          0          0      49895  xen-percpu-ipi       resched13
574:          0          0          0          0          0          0          0          0          0          0          0          0          0     302441  xen-percpu-virq      timer13
575:          0          0          0          0          0          0          0          0          0          0          0          0       5890          0  xen-percpu-ipi       callfuncsingle12
576:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug12
577:          0          0          0          0          0          0          0          0          0          0          0          0       1546          0  xen-percpu-ipi       callfunc12
578:          0          0          0          0          0          0          0          0          0          0          0          0      46341          0  xen-percpu-ipi       resched12
579:          0          0          0          0          0          0          0          0          0          0          0          0     236553          0  xen-percpu-virq      timer12
580:          0          0          0          0          0          0          0          0          0          0          0       6018          0          0  xen-percpu-ipi       callfuncsingle11
581:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug11
582:          0          0          0          0          0          0          0          0          0          0          0       1672          0          0  xen-percpu-ipi       callfunc11
583:          0          0          0          0          0          0          0          0          0          0          0      41650          0          0  xen-percpu-ipi       resched11
584:          0          0          0          0          0          0          0          0          0          0          0     218108          0          0  xen-percpu-virq      timer11
585:          0          0          0          0          0          0          0          0          0          0       5640          0          0          0  xen-percpu-ipi       callfuncsingle10
586:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug10
587:          0          0          0          0          0          0          0          0          0          0       1683          0          0          0  xen-percpu-ipi       callfunc10
588:          0          0          0          0          0          0          0          0          0          0      47145          0          0          0  xen-percpu-ipi       resched10
589:          0          0          0          0          0          0          0          0          0          0     242891          0          0          0  xen-percpu-virq      timer10
590:          0          0          0          0          0          0          0          0          0       6235          0          0          0          0  xen-percpu-ipi       callfuncsingle9
591:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug9
592:          0          0          0          0          0          0          0          0          0       1689          0          0          0          0  xen-percpu-ipi       callfunc9
593:          0          0          0          0          0          0          0          0          0      46975          0          0          0          0  xen-percpu-ipi       resched9
594:          0          0          0          0          0          0          0          0          0     249278          0          0          0          0  xen-percpu-virq      timer9
595:          0          0          0          0          0          0          0          0       5937          0          0          0          0          0  xen-percpu-ipi       callfuncsingle8
596:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug8
597:          0          0          0          0          0          0          0          0       1807          0          0          0          0          0  xen-percpu-ipi       callfunc8
598:          0          0          0          0          0          0          0          0      53368          0          0          0          0          0  xen-percpu-ipi       resched8
599:          0          0          0          0          0          0          0          0     267242          0          0          0          0          0  xen-percpu-virq      timer8
600:          0          0          0          0          0          0          0       5972          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle7
601:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug7
602:          0          0          0          0          0          0          0       1985          0          0          0          0          0          0  xen-percpu-ipi       callfunc7
603:          0          0          0          0          0          0          0      51311          0          0          0          0          0          0  xen-percpu-ipi       resched7
604:          0          0          0          0          0          0          0     291006          0          0          0          0          0          0  xen-percpu-virq      timer7
605:          0          0          0          0          0          0       6333          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle6
606:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug6
607:          0          0          0          0          0          0       2057          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc6
608:          0          0          0          0          0          0      56102          0          0          0          0          0          0          0  xen-percpu-ipi       resched6
609:          0          0          0          0          0          0     283555          0          0          0          0          0          0          0  xen-percpu-virq      timer6
610:          0          0          0          0          0       6743          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle5
611:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug5
612:          0          0          0          0          0       2335          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc5
613:          0          0          0          0          0      60353          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched5
614:          0          0          0          0          0     323767          0          0          0          0          0          0          0          0  xen-percpu-virq      timer5
615:          0          0          0          0       7381          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle4
616:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug4
617:          0          0          0          0       4111          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc4
618:          0          0          0          0      75912          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched4
619:          0          0          0          0     398036          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer4
620:          0          0          0       8603          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle3
621:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug3
622:          0          0          0       3801          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc3
623:          0          0          0      82767          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched3
624:          0          0          0     451333          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer3
625:          0          0       6532          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle2
626:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug2
627:          0          0       1550          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc2
628:          0          0     108076          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched2
629:          0          0     641910          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer2
630:          0      10103          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle1
631:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug1
632:          0       1259          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc1
633:          0     127537          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched1
634:          0     914718          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer1
635:       5077          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle0
636:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug0
637:       1120          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc0
638:     161171          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched0
639:    1055621          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer0
NMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance monitoring interrupts
PND:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance pending work
RES:     161171     127537     108076      82767      75912      60353      56102      51311      53368      46975      47145      41650      46341      49895   Rescheduling interrupts
CAL:       6197      11362       8082      12404      11492       9078       8390       7957       7744       7924       7323       7690       7436       7611   Function call interrupts
TLB:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
MCP:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check polls
ERR:          0
MIS:          0
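
To make it easier to compare two dumps like the one above, the per-CPU columns can be summed into one total per row. This is only a minimal sketch: the function name irq_totals is made up for illustration, and it assumes the usual /proc/interrupts layout (a label ending in a colon, followed by one numeric column per CPU).

```shell
#!/bin/bash
# Hypothetical helper (not part of BOA or second.sh): print one total per
# interrupt row, largest first. Rows whose counters are all zero are skipped.
irq_totals() {
  awk '
    $1 ~ /:$/ {
      total = 0
      # Sum the per-CPU counters; stop at the first non-numeric field
      for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
      if (total > 0) print $1, total
    }
  ' "${1:-/proc/interrupts}" | sort -k2,2nr
}
```

Running irq_totals before and after a spike and diffing the two outputs should show which counters moved the most.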

We could assign different CPUs to different things; there is a report here of 100% CPU usage by ksoftirqd on a CPU assigned to an ethernet interface:

I have installed:

libelf1{a} linux-tools-2.6 linux-tools-2.6.32{a}

So we can run perf top to see what it reports when there is another load spike. More on this tool here:

https://perf.wiki.kernel.org/index.php/Tutorial

Here is a thread relating the ksoftirqd issue to iptables:

http://en.it-usenet.org/thread/16263/17508/

Which leads here:

http://www.linuxforums.org/forum/kernel/184652-high-softirq-cpu-usage-while-ipsec-active.html

Which says:

setting "/proc/sys/net/ipv4/xfrm4_gc_thresh" to a relatively
small (0-100 instead of 3276) solves the issue.

We have:

cat /proc/sys/net/ipv4/xfrm4_gc_thresh
2097152
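
A minimal sketch of how that value could be compared against the 0-100 range from the forum post. The function name check_gc_thresh is made up for illustration, and the limit of 100 is just the upper end of the quoted range, not a tested recommendation for this server.

```shell
#!/bin/bash
# Hypothetical helper: report whether the gc_thresh value is above a limit.
# The path is a parameter so the function can be pointed at a test file.
check_gc_thresh() {
  local path="${1:-/proc/sys/net/ipv4/xfrm4_gc_thresh}"
  local limit="${2:-100}"   # assumption: upper end of the range in the forum post
  local value
  value=$(cat "$path") || return 2
  if [ "$value" -gt "$limit" ]; then
    echo "above: $value > $limit"
  else
    echo "within: $value <= $limit"
  fi
}
```

If we wanted to try the suggestion, the value could be changed as root with echo 100 > /proc/sys/net/ipv4/xfrm4_gc_thresh, but the forum thread is about IPsec, so it may not apply here.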

Perhaps there is something in Jim's hunch that this is a firewall issue...

This post:

http://lists.graemef.net/pipermail/lvs-users/2012-May/049429.html

Suggests:

the /ksoftirqd/ issue could actually be a kernel versioning issue.
And it was introduced in 2.6.28 and looks like it was only fixed in
2.6.37.rc1. We run Debian Squeeze, which is on 2.6.32+29.

For you own viewing:
Start by reading all 'Kernel 2.6.35 and 100% S.I. CPU Time' in
http://lists.graemef.net/pipermail/lvs-users/2010-September/subject.html#start
Then move on to
http://lists.graemef.net/pipermail/lvs-users/2010-October/subject.html#start

Our kernel version:

uname -a
  Linux puffin.webarch.net 2.6.32-5-xen-amd64 #1 SMP Fri May 10 11:48:05 UTC 2013 x86_64 GNU/Linux

We could consider upgrading Debian to Wheezy to get a more recent kernel; see ticket:535
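
A rough sketch for checking whether a kernel version falls in the range mentioned in the quoted post (introduced in 2.6.28, fixed around 2.6.37). The function name kernel_in_range is made up for illustration, the range itself is only the claim from that mailing list post, and sort -V assumes GNU coreutils.

```shell
#!/bin/bash
# Hypothetical helper: succeed if LOW <= VERSION < HIGH.
# Anything after the first "-" in VERSION (e.g. "-5-xen-amd64") is ignored.
kernel_in_range() {
  local ver="${1%%-*}"
  local low="$2" high="$3"
  if [ "$ver" = "$high" ]; then
    return 1
  fi
  # sort -V orders version strings; the smaller one comes out first
  [ "$(printf '%s\n%s\n' "$low" "$ver" | sort -V | head -n1)" = "$low" ] &&
    [ "$(printf '%s\n%s\n' "$ver" "$high" | sort -V | head -n1)" = "$ver" ]
}
```

For example, kernel_in_range "$(uname -r)" 2.6.28 2.6.37 && echo "possibly affected" would print the message on the current 2.6.32 kernel.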

For now, doubling the RAM from 4GB to 8GB and the tweaks to second.sh might have resolved the issue (the load just went up to just over 4 and the task killing wasn't triggered), but it is probably too soon to tell.

comment:46 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.4
Total Hours changed from 14.73 to 15.13

There was just another load spike, which the server recovered from; I didn't get to a console in time to run cat /proc/interrupts. I don't think the server stopped responding to regular users, but bots were served 503s, so this is an improvement! Looking at this graph, it appears that php53-fpm might have been restarted:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/multips_memory.html

I think we could do with adding some debugging output to second.sh so we know when the "high load" actions are triggered and what the values of the variables are when they are. I'll look at adding some things to the script later tonight or tomorrow.

Info from the latest load spike:

Email Alerts

Date: Thu, 20 Jun 2013 16:01:48 +0100 (BST)
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 9.33

Time:                    Thu Jun 20 16:01:48 2013 +0100
1 Min Load Avg:          32.35
5 Min Load Avg:          9.33
15 Min Load Avg:         3.39
Running/Total Processes: 31/309

Date: Thu, 20 Jun 2013 16:03:14 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        CRITICALs: load is 17.35 (outside range [:8]).

Date: Thu, 20 Jun 2013 16:06:14 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        CRITICALs: load is 9.74 (outside range [:8]).

Date: Thu, 20 Jun 2013 16:09:14 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        WARNINGs: load is 5.66 (outside range [:4]).

Date: Thu, 20 Jun 2013 16:12:14 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        OKs: load is 3.37.
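
For reference, the Munin ranges in these alerts ([:4] for warning, [:8] for critical) can be mirrored in a small function. This is only a sketch; classify_load is a made-up name and the defaults of 4 and 8 are simply taken from the alert emails above.

```shell
#!/bin/bash
# Hypothetical helper: classify a load value against warning/critical limits.
# awk does the comparison because bash's [ ] cannot compare decimals.
classify_load() {
  awk -v l="$1" -v w="${2:-4}" -v c="${3:-8}" 'BEGIN {
    if (l > c) print "CRITICAL"
    else if (l > w) print "WARNING"
    else print "OK"
  }'
}
```

With the values above, classify_load 17.35 gives CRITICAL, classify_load 5.66 gives WARNING and classify_load 3.37 gives OK, matching the alerts.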

Logs

Entries in the /var/log/php/php53-fpm-error.log:

[20-Jun-2013 16:00:40] WARNING: [pool www] child 54783, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.367681 sec), logging
[20-Jun-2013 16:00:40] WARNING: [pool www] child 33601, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "POST /index.php") executing too slow (31.116963 sec), logging
[20-Jun-2013 16:00:40] WARNING: [pool www] child 33589, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.251771 sec), logging
[20-Jun-2013 16:00:40] WARNING: [pool www] child 33588, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.443108 sec), logging
[20-Jun-2013 16:00:40] WARNING: [pool www] child 33587, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.739212 sec), logging
[20-Jun-2013 16:00:40] WARNING: [pool www] child 33579, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "POST /index.php") executing too slow (35.045901 sec), logging
[20-Jun-2013 16:00:40] NOTICE: child 33579 stopped for tracing
[20-Jun-2013 16:00:40] NOTICE: about to trace 33579
[20-Jun-2013 16:00:40] NOTICE: finished trace of 33579
[20-Jun-2013 16:00:40] NOTICE: child 33587 stopped for tracing
[20-Jun-2013 16:00:40] NOTICE: about to trace 33587
[20-Jun-2013 16:00:40] ERROR: failed to ptrace(PEEKDATA) pid 33587: Input/output error (5)
[20-Jun-2013 16:00:45] NOTICE: finished trace of 33587
[20-Jun-2013 16:00:45] NOTICE: child 33589 stopped for tracing
[20-Jun-2013 16:00:45] NOTICE: about to trace 33589
[20-Jun-2013 16:00:45] NOTICE: finished trace of 33589
[20-Jun-2013 16:00:45] NOTICE: child 33601 stopped for tracing
[20-Jun-2013 16:00:45] NOTICE: about to trace 33601
[20-Jun-2013 16:00:45] ERROR: failed to ptrace(PEEKDATA) pid 33601: Input/output error (5)
[20-Jun-2013 16:00:46] NOTICE: finished trace of 33601
[20-Jun-2013 16:00:46] NOTICE: child 33588 stopped for tracing
[20-Jun-2013 16:00:46] NOTICE: about to trace 33588
[20-Jun-2013 16:00:46] NOTICE: finished trace of 33588
[20-Jun-2013 16:00:46] NOTICE: child 54783 stopped for tracing
[20-Jun-2013 16:00:46] NOTICE: about to trace 54783
[20-Jun-2013 16:00:47] NOTICE: finished trace of 54783
[20-Jun-2013 16:00:51] WARNING: [pool www] child 54787, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.603425 sec), logging
[20-Jun-2013 16:00:51] WARNING: [pool www] child 33596, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.672290 sec), logging
[20-Jun-2013 16:00:51] WARNING: [pool www] child 33593, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.293697 sec), logging
[20-Jun-2013 16:00:51] WARNING: [pool www] child 33580, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.488432 sec), logging
[20-Jun-2013 16:00:51] NOTICE: child 33580 stopped for tracing
[20-Jun-2013 16:00:51] NOTICE: about to trace 33580
[20-Jun-2013 16:00:51] ERROR: failed to ptrace(PEEKDATA) pid 33580: Input/output error (5)
[20-Jun-2013 16:00:52] NOTICE: finished trace of 33580
[20-Jun-2013 16:00:52] NOTICE: child 33593 stopped for tracing
[20-Jun-2013 16:00:52] NOTICE: about to trace 33593
[20-Jun-2013 16:00:52] ERROR: failed to ptrace(PEEKDATA) pid 33593: Input/output error (5)
[20-Jun-2013 16:00:53] NOTICE: finished trace of 33593
[20-Jun-2013 16:00:53] NOTICE: child 33596 stopped for tracing
[20-Jun-2013 16:00:53] NOTICE: about to trace 33596
[20-Jun-2013 16:00:54] NOTICE: finished trace of 33596
[20-Jun-2013 16:00:54] NOTICE: child 54787 stopped for tracing
[20-Jun-2013 16:00:54] NOTICE: about to trace 54787
[20-Jun-2013 16:00:54] ERROR: failed to ptrace(PEEKDATA) pid 54787: Input/output error (5)
[20-Jun-2013 16:00:54] NOTICE: finished trace of 54787
[20-Jun-2013 16:01:11] WARNING: [pool www] child 33592, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.579042 sec), logging
[20-Jun-2013 16:01:12] NOTICE: child 33592 stopped for tracing
[20-Jun-2013 16:01:12] NOTICE: about to trace 33592
[20-Jun-2013 16:01:12] ERROR: failed to ptrace(PEEKDATA) pid 33592: Input/output error (5)
[20-Jun-2013 16:01:12] NOTICE: finished trace of 33592
[20-Jun-2013 16:01:21] WARNING: [pool www] child 33594, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.143000 sec), logging
[20-Jun-2013 16:01:21] NOTICE: child 33594 stopped for tracing
[20-Jun-2013 16:01:21] NOTICE: about to trace 33594
[20-Jun-2013 16:01:21] ERROR: failed to ptrace(PEEKDATA) pid 33594: Input/output error (5)
[20-Jun-2013 16:01:22] NOTICE: finished trace of 33594
[20-Jun-2013 16:01:52] WARNING: [pool www] child 33605, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.814794 sec), logging
[20-Jun-2013 16:01:52] NOTICE: child 33605 stopped for tracing
[20-Jun-2013 16:01:52] NOTICE: about to trace 33605
[20-Jun-2013 16:01:52] NOTICE: finished trace of 33605
[20-Jun-2013 16:02:12] WARNING: [pool www] child 6915, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.627892 sec), logging
[20-Jun-2013 16:02:12] WARNING: [pool www] child 6907, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.777419 sec), logging
[20-Jun-2013 16:02:12] NOTICE: child 6915 stopped for tracing
[20-Jun-2013 16:02:12] NOTICE: about to trace 6915
[20-Jun-2013 16:02:13] NOTICE: finished trace of 6915
[20-Jun-2013 16:02:13] NOTICE: child 6907 stopped for tracing
[20-Jun-2013 16:02:13] NOTICE: about to trace 6907
[20-Jun-2013 16:02:13] NOTICE: finished trace of 6907
[20-Jun-2013 16:02:22] WARNING: [pool www] child 6946, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.695327 sec), logging
[20-Jun-2013 16:02:22] WARNING: [pool www] child 6945, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.695105 sec), logging
[20-Jun-2013 16:02:22] WARNING: [pool www] child 6944, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.730524 sec), logging
[20-Jun-2013 16:02:22] WARNING: [pool www] child 33615, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.763802 sec), logging
[20-Jun-2013 16:02:22] WARNING: [pool www] child 33613, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.695476 sec), logging
[20-Jun-2013 16:02:22] WARNING: [pool www] child 33610, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.671556 sec), logging
[20-Jun-2013 16:02:22] WARNING: [pool www] child 33608, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.599949 sec), logging
[20-Jun-2013 16:02:22] WARNING: [pool www] child 33607, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.528014 sec), logging
[20-Jun-2013 16:02:22] WARNING: [pool www] child 33604, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.114966 sec), logging
[20-Jun-2013 16:02:22] NOTICE: child 33608 stopped for tracing
[20-Jun-2013 16:02:22] NOTICE: about to trace 33608
[20-Jun-2013 16:02:22] NOTICE: finished trace of 33608
[20-Jun-2013 16:02:22] NOTICE: child 33615 stopped for tracing
[20-Jun-2013 16:02:22] NOTICE: about to trace 33615
[20-Jun-2013 16:02:22] ERROR: failed to ptrace(PEEKDATA) pid 33615: Input/output error (5)
[20-Jun-2013 16:02:23] NOTICE: finished trace of 33615
[20-Jun-2013 16:02:23] NOTICE: child 33604 stopped for tracing
[20-Jun-2013 16:02:23] NOTICE: about to trace 33604
[20-Jun-2013 16:02:23] NOTICE: finished trace of 33604
[20-Jun-2013 16:02:23] NOTICE: child 33607 stopped for tracing
[20-Jun-2013 16:02:23] NOTICE: about to trace 33607
[20-Jun-2013 16:02:24] NOTICE: finished trace of 33607
[20-Jun-2013 16:02:24] NOTICE: child 33610 stopped for tracing
[20-Jun-2013 16:02:24] NOTICE: about to trace 33610
[20-Jun-2013 16:02:25] NOTICE: finished trace of 33610
[20-Jun-2013 16:02:25] NOTICE: child 33613 stopped for tracing
[20-Jun-2013 16:02:25] NOTICE: about to trace 33613
[20-Jun-2013 16:02:25] NOTICE: finished trace of 33613
[20-Jun-2013 16:02:25] NOTICE: child 6944 stopped for tracing
[20-Jun-2013 16:02:25] NOTICE: about to trace 6944
[20-Jun-2013 16:02:25] NOTICE: finished trace of 6944
[20-Jun-2013 16:02:25] NOTICE: child 6945 stopped for tracing
[20-Jun-2013 16:02:25] NOTICE: about to trace 6945
[20-Jun-2013 16:02:25] NOTICE: finished trace of 6945
[20-Jun-2013 16:02:25] NOTICE: child 6946 stopped for tracing
[20-Jun-2013 16:02:25] NOTICE: about to trace 6946
[20-Jun-2013 16:02:25] NOTICE: finished trace of 6946
[20-Jun-2013 16:02:32] WARNING: [pool www] child 6938, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.932866 sec), logging
[20-Jun-2013 16:02:32] WARNING: [pool www] child 6937, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.457173 sec), logging
[20-Jun-2013 16:02:32] NOTICE: child 6938 stopped for tracing
[20-Jun-2013 16:02:32] NOTICE: about to trace 6938
[20-Jun-2013 16:02:32] NOTICE: finished trace of 6938
[20-Jun-2013 16:02:32] NOTICE: child 6937 stopped for tracing
[20-Jun-2013 16:02:32] NOTICE: about to trace 6937
[20-Jun-2013 16:02:32] NOTICE: finished trace of 6937

There is nothing in the daemon.log and nothing of note in the syslog.
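
To get a quick picture of a spike from the php-fpm error log, the "executing too slow" warnings can be counted per minute. A minimal sketch, assuming the timestamp format shown above; slow_per_minute is a made-up name and the log path has to be passed in.

```shell
#!/bin/bash
# Hypothetical helper: print "HH:MM count" for each minute that has
# "executing too slow" warnings in the given php-fpm error log.
slow_per_minute() {
  grep 'executing too slow' "$1" |
    sed -E 's/^\[[^ ]+ ([0-9]{2}:[0-9]{2}):[0-9]{2}\].*/\1/' |
    sort | uniq -c | awk '{print $2, $1}'
}
```

For example: slow_per_minute /var/log/php/php53-fpm-error.log. The ptrace failures could be counted in a similar way with grep -c 'failed to ptrace'.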

comment:47 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 15.13 to 15.38

There was another load spike; again, I didn't get a chance to dump /proc/interrupts:

Email Alerts

Date: Thu, 20 Jun 2013 19:01:46 +0100 (BST)
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 10.05

Time:                    Thu Jun 20 19:01:46 2013 +0100
1 Min Load Avg:          33.55
5 Min Load Avg:          10.05
15 Min Load Avg:         3.69
Running/Total Processes: 62/368

Date: Thu, 20 Jun 2013 19:02:10 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        CRITICALs: load is 13.22 (outside range [:8]).

Date: Thu, 20 Jun 2013 19:03:14 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        CRITICALs: load is 11.15 (outside range [:8]).

Date: Thu, 20 Jun 2013 19:06:14 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        WARNINGs: load is 6.11 (outside range [:4]).

Date: Thu, 20 Jun 2013 19:09:14 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        OKs: load is 3.54.

Logs

php-fpm error log:

[20-Jun-2013 19:00:31] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 7 idle, and 26 total children
[20-Jun-2013 19:00:32] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 9 idle, and 29 total children
[20-Jun-2013 19:00:39] WARNING: [pool www] child 7304, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.312525 sec), logging
[20-Jun-2013 19:00:39] WARNING: [pool www] child 33087, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.971690 sec), logging
[20-Jun-2013 19:00:39] WARNING: [pool www] child 8066, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.179033 sec), logging
[20-Jun-2013 19:00:39] WARNING: [pool www] child 7543, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.357803 sec), logging
[20-Jun-2013 19:00:39] NOTICE: child 8066 stopped for tracing
[20-Jun-2013 19:00:39] NOTICE: about to trace 8066
[20-Jun-2013 19:00:40] NOTICE: finished trace of 8066
[20-Jun-2013 19:00:40] NOTICE: child 7543 stopped for tracing
[20-Jun-2013 19:00:40] NOTICE: about to trace 7543
[20-Jun-2013 19:00:42] NOTICE: finished trace of 7543
[20-Jun-2013 19:00:42] NOTICE: child 33087 stopped for tracing
[20-Jun-2013 19:00:42] NOTICE: about to trace 33087
[20-Jun-2013 19:00:43] NOTICE: finished trace of 33087
[20-Jun-2013 19:00:43] NOTICE: child 7304 stopped for tracing
[20-Jun-2013 19:00:43] NOTICE: about to trace 7304
[20-Jun-2013 19:00:43] NOTICE: finished trace of 7304
[20-Jun-2013 19:00:49] WARNING: [pool www] child 8074, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.430897 sec), logging
[20-Jun-2013 19:00:49] WARNING: [pool www] child 7259, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.990784 sec), logging
[20-Jun-2013 19:00:49] WARNING: [pool www] child 6946, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.829723 sec), logging
[20-Jun-2013 19:00:49] WARNING: [pool www] child 6944, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.942049 sec), logging
[20-Jun-2013 19:00:49] NOTICE: child 8074 stopped for tracing
[20-Jun-2013 19:00:49] NOTICE: about to trace 8074
[20-Jun-2013 19:00:50] ERROR: failed to ptrace(PEEKDATA) pid 8074: Input/output error (5)
[20-Jun-2013 19:00:51] NOTICE: finished trace of 8074
[20-Jun-2013 19:00:51] NOTICE: child 6944 stopped for tracing
[20-Jun-2013 19:00:51] NOTICE: about to trace 6944
[20-Jun-2013 19:00:52] NOTICE: finished trace of 6944
[20-Jun-2013 19:00:52] NOTICE: child 6946 stopped for tracing
[20-Jun-2013 19:00:52] NOTICE: about to trace 6946
[20-Jun-2013 19:00:52] NOTICE: finished trace of 6946
[20-Jun-2013 19:00:52] NOTICE: child 7259 stopped for tracing
[20-Jun-2013 19:00:52] NOTICE: about to trace 7259
[20-Jun-2013 19:00:53] ERROR: failed to ptrace(PEEKDATA) pid 7259: Input/output error (5)
[20-Jun-2013 19:00:54] NOTICE: finished trace of 7259
[20-Jun-2013 19:01:00] WARNING: [pool www] child 4053, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.609099 sec), logging
[20-Jun-2013 19:01:00] WARNING: [pool www] child 33098, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.525960 sec), logging
[20-Jun-2013 19:01:00] WARNING: [pool www] child 8073, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.638407 sec), logging
[20-Jun-2013 19:01:00] WARNING: [pool www] child 7158, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.034242 sec), logging
[20-Jun-2013 19:01:00] WARNING: [pool www] child 6961, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.711724 sec), logging
[20-Jun-2013 19:01:00] WARNING: [pool www] child 6945, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.942187 sec), logging
[20-Jun-2013 19:01:00] NOTICE: child 6945 stopped for tracing
[20-Jun-2013 19:01:00] NOTICE: about to trace 6945
[20-Jun-2013 19:01:01] NOTICE: finished trace of 6945
[20-Jun-2013 19:01:01] NOTICE: child 6961 stopped for tracing
[20-Jun-2013 19:01:01] NOTICE: about to trace 6961
[20-Jun-2013 19:01:03] NOTICE: child 7158 stopped for tracing
[20-Jun-2013 19:01:03] NOTICE: about to trace 7158
[20-Jun-2013 19:01:04] NOTICE: finished trace of 7158
[20-Jun-2013 19:01:04] NOTICE: child 8073 stopped for tracing
[20-Jun-2013 19:01:04] NOTICE: about to trace 8073
[20-Jun-2013 19:01:06] NOTICE: finished trace of 8073
[20-Jun-2013 19:01:06] NOTICE: child 33098 stopped for tracing
[20-Jun-2013 19:01:06] NOTICE: about to trace 33098
[20-Jun-2013 19:01:07] NOTICE: finished trace of 33098
[20-Jun-2013 19:01:07] NOTICE: child 4053 stopped for tracing
[20-Jun-2013 19:01:07] NOTICE: about to trace 4053
[20-Jun-2013 19:01:08] NOTICE: finished trace of 4053
[20-Jun-2013 19:01:10] WARNING: [pool www] child 6962, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.560128 sec), logging
[20-Jun-2013 19:01:10] NOTICE: child 6962 stopped for tracing
[20-Jun-2013 19:01:10] NOTICE: about to trace 6962
[20-Jun-2013 19:01:12] NOTICE: finished trace of 6962
[20-Jun-2013 19:01:13] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 8 idle, and 36 total children
[20-Jun-2013 19:01:14] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 9 idle, and 38 total children
[20-Jun-2013 19:01:20] WARNING: [pool www] child 33086, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.627510 sec), logging
[20-Jun-2013 19:01:20] WARNING: [pool www] child 33083, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.743111 sec), logging
[20-Jun-2013 19:01:20] WARNING: [pool www] child 7307, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.171790 sec), logging
[20-Jun-2013 19:01:20] NOTICE: child 33083 stopped for tracing
[20-Jun-2013 19:01:20] NOTICE: about to trace 33083
[20-Jun-2013 19:01:22] NOTICE: finished trace of 33083
[20-Jun-2013 19:01:22] NOTICE: child 7307 stopped for tracing
[20-Jun-2013 19:01:22] NOTICE: about to trace 7307
[20-Jun-2013 19:01:24] NOTICE: finished trace of 7307
[20-Jun-2013 19:01:24] NOTICE: child 33086 stopped for tracing
[20-Jun-2013 19:01:24] NOTICE: about to trace 33086
[20-Jun-2013 19:01:26] NOTICE: finished trace of 33086
[20-Jun-2013 19:01:30] WARNING: [pool www] child 33100, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.927719 sec), logging
[20-Jun-2013 19:01:30] WARNING: [pool www] child 33085, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.855957 sec), logging
[20-Jun-2013 19:01:30] NOTICE: child 33085 stopped for tracing
[20-Jun-2013 19:01:30] NOTICE: about to trace 33085
[20-Jun-2013 19:01:31] NOTICE: finished trace of 33085
[20-Jun-2013 19:01:31] NOTICE: child 33100 stopped for tracing
[20-Jun-2013 19:01:31] NOTICE: about to trace 33100
[20-Jun-2013 19:01:32] ERROR: failed to ptrace(PEEKDATA) pid 33100: Input/output error (5)
[20-Jun-2013 19:01:33] NOTICE: finished trace of 33100
[20-Jun-2013 19:01:40] WARNING: [pool www] child 4059, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.344221 sec), logging
[20-Jun-2013 19:01:40] WARNING: [pool www] child 4055, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.783235 sec), logging
[20-Jun-2013 19:01:40] WARNING: [pool www] child 4054, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.046578 sec), logging
[20-Jun-2013 19:01:40] WARNING: [pool www] child 7260, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.125317 sec), logging
[20-Jun-2013 19:01:40] NOTICE: child 4059 stopped for tracing
[20-Jun-2013 19:01:40] NOTICE: about to trace 4059
[20-Jun-2013 19:01:42] NOTICE: finished trace of 4059
[20-Jun-2013 19:01:42] NOTICE: child 7260 stopped for tracing
[20-Jun-2013 19:01:42] NOTICE: about to trace 7260
[20-Jun-2013 19:01:44] NOTICE: finished trace of 7260
[20-Jun-2013 19:01:44] NOTICE: child 4054 stopped for tracing
[20-Jun-2013 19:01:44] NOTICE: about to trace 4054
[20-Jun-2013 19:01:45] NOTICE: finished trace of 4054
[20-Jun-2013 19:01:45] NOTICE: child 4055 stopped for tracing
[20-Jun-2013 19:01:45] NOTICE: about to trace 4055
[20-Jun-2013 19:01:45] ERROR: failed to ptrace(PEEKDATA) pid 4055: Input/output error (5)
[20-Jun-2013 19:01:46] NOTICE: finished trace of 4055
[20-Jun-2013 19:01:50] WARNING: [pool www] child 4110, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.004370 sec), logging
[20-Jun-2013 19:01:50] WARNING: [pool www] child 4063, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.482875 sec), logging
[20-Jun-2013 19:01:50] WARNING: [pool www] child 4060, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.355311 sec), logging
[20-Jun-2013 19:01:50] NOTICE: child 4063 stopped for tracing
[20-Jun-2013 19:01:50] NOTICE: about to trace 4063
[20-Jun-2013 19:01:50] ERROR: failed to ptrace(PEEKDATA) pid 4063: Input/output error (5)
[20-Jun-2013 19:01:51] NOTICE: finished trace of 4063
[20-Jun-2013 19:01:51] NOTICE: child 4060 stopped for tracing
[20-Jun-2013 19:01:51] NOTICE: about to trace 4060
[20-Jun-2013 19:01:51] NOTICE: finished trace of 4060
[20-Jun-2013 19:01:51] NOTICE: child 4110 stopped for tracing
[20-Jun-2013 19:01:51] NOTICE: about to trace 4110
[20-Jun-2013 19:01:51] NOTICE: finished trace of 4110
[20-Jun-2013 19:02:00] WARNING: [pool www] child 4065, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.068884 sec), logging
[20-Jun-2013 19:02:00] NOTICE: child 4065 stopped for tracing
[20-Jun-2013 19:02:00] NOTICE: about to trace 4065
[20-Jun-2013 19:02:00] ERROR: failed to ptrace(PEEKDATA) pid 4065: Input/output error (5)
[20-Jun-2013 19:02:00] NOTICE: finished trace of 4065

I'm now going to look at adding some extra things to second.sh to give us some more data about the load spikes.

comment:48 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 15.38 to 15.63

That last comment should have been submitted 15 mins ago.

I have made some additions to second.sh to dump some extra info to /var/log/high-load.log, specifically the bits between # start additions and # end additions:

#!/bin/bash

SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/opt/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

hold()
{
  # start additions
  echo "====================" >> /var/log/high-load.log
  echo "php-fpm and nginx about to be killed" >> /var/log/high-load.log
  echo "ONEX_LOAD = $ONEX_LOAD" >> /var/log/high-load.log
  echo "FIVX_LOAD = $FIVX_LOAD" >> /var/log/high-load.log
  echo "uptime : " >> /var/log/high-load.log
  uptime >> /var/log/high-load.log
  echo "cat /proc/interrupts : " >> /var/log/high-load.log
  cat /proc/interrupts >> /var/log/high-load.log
  echo "====================" >> /var/log/high-load.log
  # end additions
  /etc/init.d/nginx stop
  killall -9 nginx
  sleep 1
  killall -9 nginx
  /etc/init.d/php-fpm stop
  /etc/init.d/php53-fpm stop
  killall -9 php-fpm php-cgi
  echo load is $ONEX_LOAD:$FIVX_LOAD while maxload is $CTL_ONEX_LOAD:$CTL_FIVX_LOAD

}

terminate()
{
  if test -f /var/run/boa_run.pid ; then
    sleep 1
  else
    killall -9 php drush.php wget
  fi
}

nginx_high_load_on()
{
  # start additions
  echo "====================" >> /var/log/high-load.log
  echo "nginx high load on" >> /var/log/high-load.log
  echo "ONEX_LOAD = $ONEX_LOAD" >> /var/log/high-load.log
  echo "FIVX_LOAD = $FIVX_LOAD" >> /var/log/high-load.log
  echo "uptime : " >> /var/log/high-load.log
  uptime >> /var/log/high-load.log
  echo "cat /proc/interrupts : " >> /var/log/high-load.log
  cat /proc/interrupts >> /var/log/high-load.log
  echo "====================" >> /var/log/high-load.log
  # end additions
  mv -f /data/conf/nginx_high_load_off.conf /data/conf/nginx_high_load.conf
  /etc/init.d/nginx reload
}

nginx_high_load_off()
{
  # start additions
  echo "====================" >> /var/log/high-load.log
  echo "nginx high load off" >> /var/log/high-load.log
  echo "ONEX_LOAD = $ONEX_LOAD" >> /var/log/high-load.log
  echo "FIVX_LOAD = $FIVX_LOAD" >> /var/log/high-load.log
  echo "uptime : " >> /var/log/high-load.log
  uptime >> /var/log/high-load.log
  echo "====================" >> /var/log/high-load.log
  # end additions
  mv -f /data/conf/nginx_high_load.conf /data/conf/nginx_high_load_off.conf
  /etc/init.d/nginx reload
  echo "nginx_high_load_off" >> /var/log/high-load.log
}

control()
{
ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg`
FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg`
#CTL_ONEX_SPIDER_LOAD=388
#CTL_FIVX_SPIDER_LOAD=388
#CTL_ONEX_LOAD=1444
#CTL_FIVX_LOAD=888
#CTL_ONEX_LOAD_CRIT=1888
#CTL_FIVX_LOAD_CRIT=1555
CTL_ONEX_SPIDER_LOAD=1552
CTL_FIVX_SPIDER_LOAD=1552
CTL_ONEX_LOAD=5776
CTL_FIVX_LOAD=3552
CTL_ONEX_LOAD_CRIT=7552
CTL_FIVX_LOAD_CRIT=6220
if [ $ONEX_LOAD -ge $CTL_ONEX_SPIDER_LOAD ] && [ $ONEX_LOAD -lt $CTL_ONEX_LOAD ] && [ -e "/data/conf/nginx_high_load_off.conf" ] ; then
  nginx_high_load_on
elif [ $FIVX_LOAD -ge $CTL_FIVX_SPIDER_LOAD ] && [ $FIVX_LOAD -lt $CTL_FIVX_LOAD ] && [ -e "/data/conf/nginx_high_load_off.conf" ] ; then
  nginx_high_load_on
elif [ $ONEX_LOAD -lt $CTL_ONEX_SPIDER_LOAD ] && [ $FIVX_LOAD -lt $CTL_FIVX_SPIDER_LOAD ] && [ -e "/data/conf/nginx_high_load.conf" ] ; then
  nginx_high_load_off
fi
if [ $ONEX_LOAD -ge $CTL_ONEX_LOAD_CRIT ] ; then
  terminate
elif [ $FIVX_LOAD -ge $CTL_FIVX_LOAD_CRIT ] ; then
  terminate
fi
if [ $ONEX_LOAD -ge $CTL_ONEX_LOAD ] ; then
  hold
elif [ $FIVX_LOAD -ge $CTL_FIVX_LOAD ] ; then
  hold
else
  echo load is $ONEX_LOAD:$FIVX_LOAD while maxload is $CTL_ONEX_LOAD:$CTL_FIVX_LOAD
  echo ...OK now doing CTL...
  perl /var/xdrago/proc_num_ctrl.cgi
  touch /var/xdrago/log/proc_num_ctrl.done
  echo CTL done
fi
}

control
sleep 10
control
sleep 10
control
sleep 10
control
sleep 10
control
sleep 10
control
echo Done !
###EOF2013###

Next time there is a load spike we should get a better idea what happened.
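
The three logging blocks above repeat the same lines, so they could be folded into one helper. The following is only a sketch of that idea, not what is running on the server: log_snapshot and HIGH_LOAD_LOG are hypothetical names, and the fallback message when uptime is missing is an addition of this sketch.

```shell
#!/bin/bash
# Sketch only: log_snapshot and HIGH_LOAD_LOG are hypothetical names,
# they are not part of the BOA second.sh script.
HIGH_LOAD_LOG="${HIGH_LOAD_LOG:-/var/log/high-load.log}"

# Append one delimited snapshot to the log file.
#   $1 = label, for example "nginx high load on"
#   $2 = "interrupts" to also dump /proc/interrupts (optional)
log_snapshot()
{
  {
    echo "===================="
    echo "$1"
    echo "ONEX_LOAD = $ONEX_LOAD"
    echo "FIVX_LOAD = $FIVX_LOAD"
    echo "uptime : "
    # Do not let a missing uptime binary break the snapshot.
    uptime 2>/dev/null || echo "uptime unavailable"
    if [ "${2:-}" = "interrupts" ] && [ -r /proc/interrupts ] ; then
      echo "cat /proc/interrupts : "
      cat /proc/interrupts
    fi
    echo "===================="
  } >> "$HIGH_LOAD_LOG"
}

# Possible calls from the existing functions:
#   hold():                log_snapshot "php-fpm and nginx about to be killed" interrupts
#   nginx_high_load_on():  log_snapshot "nginx high load on" interrupts
#   nginx_high_load_off(): log_snapshot "nginx high load off"
```

Grouping the echo lines in braces means the log file is opened once per snapshot rather than once per line.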

comment:49 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.25
Total Hours changed from 15.63 to 16.88

There were several times overnight when the load was high enough for /var/xdrago/second.sh to switch the nginx high load configuration on. The good news is that these load spikes didn't result in second.sh killing php-fpm and nginx and the site going down for 10 to 15 mins:

23:07:25 up 13:34,  2 users,  load average: 18.60, 4.61, 1.72

01:24:22 up 15:51,  0 users,  load average: 15.55, 8.88, 4.12

02:00:39 up 16:27,  0 users,  load average: 25.90, 6.44, 3.43

02:01:23 up 16:28,  0 users,  load average: 17.93, 6.78, 3.67

05:09:48 up 19:36,  0 users,  load average: 16.50, 4.67, 1.81

At these points the site will have started serving 503 errors to bots. Based on past experience it probably is bots causing the load spikes, so this is reasonable. It also seems that the thresholds in second.sh (the original ones were all multiplied by 4, see ticket:555#comment:43) are perhaps on the low side. I expect the server could cope with somewhat higher loads, so I have considered multiplying the original values by 6 rather than 4, which would give:

CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328
CTL_ONEX_LOAD_CRIT=11328
CTL_FIVX_LOAD_CRIT=9330
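
The values above are the original second.sh thresholds (the commented-out ones in the script) multiplied by 6. As a sketch of the arithmetic, the multiplier could be kept in one place; scale_threshold and FACTOR are hypothetical names and are not part of second.sh.

```shell
#!/bin/bash
# Sketch only: scale_threshold and FACTOR are hypothetical, not part of BOA.
# The base numbers are the original (commented-out) values from second.sh,
# which are load averages multiplied by 100.

scale_threshold()
{
  # $1 = original threshold, $2 = integer multiplier
  echo $(( $1 * $2 ))
}

FACTOR=6
CTL_ONEX_SPIDER_LOAD=$(scale_threshold 388 "$FACTOR")
CTL_FIVX_SPIDER_LOAD=$(scale_threshold 388 "$FACTOR")
CTL_ONEX_LOAD=$(scale_threshold 1444 "$FACTOR")
CTL_FIVX_LOAD=$(scale_threshold 888 "$FACTOR")
CTL_ONEX_LOAD_CRIT=$(scale_threshold 1888 "$FACTOR")
CTL_FIVX_LOAD_CRIT=$(scale_threshold 1555 "$FACTOR")
```

With FACTOR=4 the same helper gives the values currently in second.sh (1552, 1552, 5776, 3552, 7552 and 6220).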

But I haven't made this change live, since with the current values we are seeing significant numbers of php-fpm errors at most of the times when these thresholds are hit -- last night the php-fpm errors started at these times:

[20-Jun-2013 23:07:00] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 27 total children
...

[21-Jun-2013 01:20:51] WARNING: [pool www] child 18789, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.134680 sec), logging
...

[21-Jun-2013 05:09:12] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 27 total children
...

There were no errors for the 2am load spikes -- this is why I think the thresholds could possibly go up somewhat.
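
To check how a given pair of load averages compares with the thresholds, the decision in control() can be reduced to a small function. This is a simplified sketch: classify_load is a hypothetical name, it ignores the check for the nginx_high_load conf files and the separate "crit" thresholds that trigger terminate, and it takes the multiplier of the original thresholds as an argument.

```shell
#!/bin/bash
# Sketch only: classify_load is hypothetical and simplifies control() in
# second.sh (no conf file checks, no CRIT / terminate handling).

classify_load()
{
  # $1 = 1 min load x 100, $2 = 5 min load x 100,
  # $3 = multiplier applied to the original thresholds
  local onex=$1 fivx=$2 factor=$3
  local spider_onex=$(( 388 * factor ))
  local spider_fivx=$(( 388 * factor ))
  local hold_onex=$(( 1444 * factor ))
  local hold_fivx=$(( 888 * factor ))
  if [ "$onex" -ge "$hold_onex" ] || [ "$fivx" -ge "$hold_fivx" ] ; then
    echo "hold"        # nginx and php-fpm would be killed
  elif [ "$onex" -ge "$spider_onex" ] || [ "$fivx" -ge "$spider_fivx" ] ; then
    echo "high-load"   # bots would be served 503 errors
  else
    echo "normal"
  fi
}

# Overnight values from the log, thresholds multiplied by 4 and by 6:
classify_load 1860 461 4
classify_load 2590 644 4
classify_load 1860 461 6
classify_load 2590 644 6
```

With the multiplier at 4 both spikes fall in the high load band. At 6 the 23:07 spike (18.60) would stay below the spider threshold of 2328, while the 02:00 spike (25.90) would still switch the high load config on.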

The raw data dumped into the high-load.log file is included below. I'm not sure how much use the dump of /proc/interrupts is -- I don't know how to interpret it -- so I have added a dump of the output of top and ps -lA to the second.sh script like this:

  echo "top : " >> /var/log/high-load.log
  top -n 1 -b >> /var/log/high-load.log
  echo "processes : " >> /var/log/high-load.log
  ps -lA >> /var/log/high-load.log
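
If the /proc/interrupts dump is kept, one option that might make it easier to read is to log a per-interrupt total instead of the full per-CPU table. The following is a sketch under that assumption; summarise_interrupts is a hypothetical helper and has not been added to second.sh.

```shell
#!/bin/bash
# Sketch only: summarise_interrupts is hypothetical, not part of second.sh.
# It sums the per-CPU counters on each line of a /proc/interrupts style
# file and prints "label total description" for non-zero interrupts.

summarise_interrupts()
{
  # $1 = file to read, defaults to /proc/interrupts
  awk 'NR > 1 {
    total = 0
    name = ""
    for (i = 2; i <= NF; i++) {
      if (name == "" && $i ~ /^[0-9]+$/) {
        total += $i            # still in the per-CPU counter columns
      } else {
        name = name (name == "" ? "" : " ") $i   # description columns
      }
    }
    if (total > 0) print $1, total, name
  }' "${1:-/proc/interrupts}"
}
```

Comparing the totals from two snapshots taken a known time apart would show which interrupt sources (for example eth0 or blkif) were busiest during a spike, since the counters are cumulative since boot.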

Raw high-load.log file:

nginx_high_load_off
====================
nginx high load on
ONEX_LOAD = 1860
FIVX_LOAD = 461
uptime : 
 23:07:25 up 13:34,  2 users,  load average: 18.60, 4.61, 1.72
cat /proc/interrupts : 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      
565:    1784162          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     eth0
566:         60          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
567:     905180          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
568:       1854          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     hvc_console
569:        434          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     xenbus
570:          0          0          0          0          0          0          0          0          0          0          0          0          0      26005  xen-percpu-ipi       callfuncsingle13
571:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug13
572:          0          0          0          0          0          0          0          0          0          0          0          0          0       6767  xen-percpu-ipi       callfunc13
573:          0          0          0          0          0          0          0          0          0          0          0          0          0     227951  xen-percpu-ipi       resched13
574:          0          0          0          0          0          0          0          0          0          0          0          0          0    1291581  xen-percpu-virq      timer13
575:          0          0          0          0          0          0          0          0          0          0          0          0      26350          0  xen-percpu-ipi       callfuncsingle12
576:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug12
577:          0          0          0          0          0          0          0          0          0          0          0          0       6948          0  xen-percpu-ipi       callfunc12
578:          0          0          0          0          0          0          0          0          0          0          0          0     219634          0  xen-percpu-ipi       resched12
579:          0          0          0          0          0          0          0          0          0          0          0          0    1174174          0  xen-percpu-virq      timer12
580:          0          0          0          0          0          0          0          0          0          0          0      26401          0          0  xen-percpu-ipi       callfuncsingle11
581:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug11
582:          0          0          0          0          0          0          0          0          0          0          0       7247          0          0  xen-percpu-ipi       callfunc11
583:          0          0          0          0          0          0          0          0          0          0          0     219060          0          0  xen-percpu-ipi       resched11
584:          0          0          0          0          0          0          0          0          0          0          0    1197614          0          0  xen-percpu-virq      timer11
585:          0          0          0          0          0          0          0          0          0          0      26611          0          0          0  xen-percpu-ipi       callfuncsingle10
586:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug10
587:          0          0          0          0          0          0          0          0          0          0       7423          0          0          0  xen-percpu-ipi       callfunc10
588:          0          0          0          0          0          0          0          0          0          0     224488          0          0          0  xen-percpu-ipi       resched10
589:          0          0          0          0          0          0          0          0          0          0    1183295          0          0          0  xen-percpu-virq      timer10
590:          0          0          0          0          0          0          0          0          0      27486          0          0          0          0  xen-percpu-ipi       callfuncsingle9
591:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug9
592:          0          0          0          0          0          0          0          0          0       7726          0          0          0          0  xen-percpu-ipi       callfunc9
593:          0          0          0          0          0          0          0          0          0     230742          0          0          0          0  xen-percpu-ipi       resched9
594:          0          0          0          0          0          0          0          0          0    1251364          0          0          0          0  xen-percpu-virq      timer9
595:          0          0          0          0          0          0          0          0      28062          0          0          0          0          0  xen-percpu-ipi       callfuncsingle8
596:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug8
597:          0          0          0          0          0          0          0          0       8165          0          0          0          0          0  xen-percpu-ipi       callfunc8
598:          0          0          0          0          0          0          0          0     240833          0          0          0          0          0  xen-percpu-ipi       resched8
599:          0          0          0          0          0          0          0          0    1293059          0          0          0          0          0  xen-percpu-virq      timer8
600:          0          0          0          0          0          0          0      28893          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle7
601:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug7
602:          0          0          0          0          0          0          0       9295          0          0          0          0          0          0  xen-percpu-ipi       callfunc7
603:          0          0          0          0          0          0          0     248633          0          0          0          0          0          0  xen-percpu-ipi       resched7
604:          0          0          0          0          0          0          0    1373937          0          0          0          0          0          0  xen-percpu-virq      timer7
605:          0          0          0          0          0          0      30234          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle6
606:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug6
607:          0          0          0          0          0          0       9368          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc6
608:          0          0          0          0          0          0     274551          0          0          0          0          0          0          0  xen-percpu-ipi       resched6
609:          0          0          0          0          0          0    1462225          0          0          0          0          0          0          0  xen-percpu-virq      timer6
610:          0          0          0          0          0      31654          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle5
611:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug5
612:          0          0          0          0          0      10537          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc5
613:          0          0          0          0          0     283690          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched5
614:          0          0          0          0          0    1559405          0          0          0          0          0          0          0          0  xen-percpu-virq      timer5
615:          0          0          0          0      34942          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle4
616:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug4
617:          0          0          0          0      19428          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc4
618:          0          0          0          0     345960          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched4
619:          0          0          0          0    1848094          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer4
620:          0          0          0      39952          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle3
621:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug3
622:          0          0          0      18020          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc3
623:          0          0          0     400301          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched3
624:          0          0          0    2227256          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer3
625:          0          0      28683          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle2
626:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug2
627:          0          0       6529          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc2
628:          0          0     495137          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched2
629:          0          0    3081831          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer2
630:          0      44953          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle1
631:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug1
632:          0       5252          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc1
633:          0     572621          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched1
634:          0    4354146          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer1
635:      22470          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle0
636:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug0
637:       4754          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc0
638:     685627          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched0
639:    4955618          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer0
NMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance monitoring interrupts
PND:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance pending work
RES:     685627     572621     495137     400301     345960     283690     274551     248633     240833     230742     224488     219060     219634     227951   Rescheduling interrupts
CAL:      27224      50205      35212      57972      54370      42191      39602      38188      36227      35212      34034      33648      33298      32772   Function call interrupts
TLB:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
MCP:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check polls
ERR:          0
MIS:          0
====================
====================
nginx high load off
ONEX_LOAD = 1376
FIVX_LOAD = 680
uptime : 
 23:08:31 up 13:35,  2 users,  load average: 13.76, 6.80, 2.72
====================
nginx_high_load_off
====================
nginx high load on
ONEX_LOAD = 1555
FIVX_LOAD = 888
uptime : 
 01:24:22 up 15:51,  0 users,  load average: 15.55, 8.88, 4.12
cat /proc/interrupts : 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      
565:    1995500          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     eth0
566:         63          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
567:    1069795          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
568:       2044          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     hvc_console
569:        434          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     xenbus
570:          0          0          0          0          0          0          0          0          0          0          0          0          0      30449  xen-percpu-ipi       callfuncsingle13
571:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug13
572:          0          0          0          0          0          0          0          0          0          0          0          0          0       7869  xen-percpu-ipi       callfunc13
573:          0          0          0          0          0          0          0          0          0          0          0          0          0     260721  xen-percpu-ipi       resched13
574:          0          0          0          0          0          0          0          0          0          0          0          0          0    1478165  xen-percpu-virq      timer13
575:          0          0          0          0          0          0          0          0          0          0          0          0      30774          0  xen-percpu-ipi       callfuncsingle12
576:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug12
577:          0          0          0          0          0          0          0          0          0          0          0          0       8121          0  xen-percpu-ipi       callfunc12
578:          0          0          0          0          0          0          0          0          0          0          0          0     254737          0  xen-percpu-ipi       resched12
579:          0          0          0          0          0          0          0          0          0          0          0          0    1376728          0  xen-percpu-virq      timer12
580:          0          0          0          0          0          0          0          0          0          0          0      30853          0          0  xen-percpu-ipi       callfuncsingle11
581:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug11
582:          0          0          0          0          0          0          0          0          0          0          0       8436          0          0  xen-percpu-ipi       callfunc11
583:          0          0          0          0          0          0          0          0          0          0          0     251721          0          0  xen-percpu-ipi       resched11
584:          0          0          0          0          0          0          0          0          0          0          0    1394406          0          0  xen-percpu-virq      timer11
585:          0          0          0          0          0          0          0          0          0          0      31173          0          0          0  xen-percpu-ipi       callfuncsingle10
586:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug10
587:          0          0          0          0          0          0          0          0          0          0       8703          0          0          0  xen-percpu-ipi       callfunc10
588:          0          0          0          0          0          0          0          0          0          0     256183          0          0          0  xen-percpu-ipi       resched10
589:          0          0          0          0          0          0          0          0          0          0    1368331          0          0          0  xen-percpu-virq      timer10
590:          0          0          0          0          0          0          0          0          0      32216          0          0          0          0  xen-percpu-ipi       callfuncsingle9
591:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug9
592:          0          0          0          0          0          0          0          0          0       9006          0          0          0          0  xen-percpu-ipi       callfunc9
593:          0          0          0          0          0          0          0          0          0     262299          0          0          0          0  xen-percpu-ipi       resched9
594:          0          0          0          0          0          0          0          0          0    1446665          0          0          0          0  xen-percpu-virq      timer9
595:          0          0          0          0          0          0          0          0      32687          0          0          0          0          0  xen-percpu-ipi       callfuncsingle8
596:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug8
597:          0          0          0          0          0          0          0          0       9629          0          0          0          0          0  xen-percpu-ipi       callfunc8
598:          0          0          0          0          0          0          0          0     276355          0          0          0          0          0  xen-percpu-ipi       resched8
599:          0          0          0          0          0          0          0          0    1500997          0          0          0          0          0  xen-percpu-virq      timer8
600:          0          0          0          0          0          0          0      33784          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle7
601:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug7
602:          0          0          0          0          0          0          0      10826          0          0          0          0          0          0  xen-percpu-ipi       callfunc7
603:          0          0          0          0          0          0          0     285922          0          0          0          0          0          0  xen-percpu-ipi       resched7
604:          0          0          0          0          0          0          0    1594024          0          0          0          0          0          0  xen-percpu-virq      timer7
605:          0          0          0          0          0          0      35401          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle6
606:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug6
607:          0          0          0          0          0          0      11034          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc6
608:          0          0          0          0          0          0     310967          0          0          0          0          0          0          0  xen-percpu-ipi       resched6
609:          0          0          0          0          0          0    1672939          0          0          0          0          0          0          0  xen-percpu-virq      timer6
610:          0          0          0          0          0      37500          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle5
611:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug5
612:          0          0          0          0          0      12318          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc5
613:          0          0          0          0          0     328004          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched5
614:          0          0          0          0          0    1805389          0          0          0          0          0          0          0          0  xen-percpu-virq      timer5
615:          0          0          0          0      40776          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle4
616:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug4
617:          0          0          0          0      22784          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc4
618:          0          0          0          0     392575          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched4
619:          0          0          0          0    2135207          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer4
620:          0          0          0      47012          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle3
621:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug3
622:          0          0          0      21038          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc3
623:          0          0          0     456108          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched3
624:          0          0          0    2577954          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer3
625:          0          0      33514          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle2
626:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug2
627:          0          0       7611          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc2
628:          0          0     566137          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched2
629:          0          0    3564833          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer2
630:          0      52771          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle1
631:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug1
632:          0       6126          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc1
633:          0     657234          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched1
634:          0    5016190          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer1
635:      26484          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle0
636:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug0
637:       5594          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc0
638:     780994          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched0
639:    5721689          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer0
NMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance monitoring interrupts
PND:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance pending work
RES:     780994     657234     566137     456108     392575     328004     310967     285922     276355     262299     256183     251721     254737     260721   Rescheduling interrupts
CAL:      32078      58897      41125      68050      63560      49818      46435      44610      42316      41222      39876      39289      38895      38318   Function call interrupts
TLB:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
MCP:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check polls
ERR:          0
MIS:          0
====================
====================
nginx high load off
ONEX_LOAD = 1546
FIVX_LOAD = 1097
uptime : 
 01:26:01 up 15:52,  0 users,  load average: 15.46, 10.97, 5.37
====================
nginx_high_load_off
====================
nginx high load on
ONEX_LOAD = 2590
FIVX_LOAD = 644
uptime : 
 02:00:39 up 16:27,  0 users,  load average: 25.90, 6.44, 3.43
cat /proc/interrupts : 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      
565:    2046192          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     eth0
566:         63          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
567:    1283062          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
568:       2048          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     hvc_console
569:        434          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     xenbus
570:          0          0          0          0          0          0          0          0          0          0          0          0          0      31586  xen-percpu-ipi       callfuncsingle13
571:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug13
572:          0          0          0          0          0          0          0          0          0          0          0          0          0       8142  xen-percpu-ipi       callfunc13
573:          0          0          0          0          0          0          0          0          0          0          0          0          0     269663  xen-percpu-ipi       resched13
574:          0          0          0          0          0          0          0          0          0          0          0          0          0    1530188  xen-percpu-virq      timer13
575:          0          0          0          0          0          0          0          0          0          0          0          0      31960          0  xen-percpu-ipi       callfuncsingle12
576:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug12
577:          0          0          0          0          0          0          0          0          0          0          0          0       8442          0  xen-percpu-ipi       callfunc12
578:          0          0          0          0          0          0          0          0          0          0          0          0     266405          0  xen-percpu-ipi       resched12
579:          0          0          0          0          0          0          0          0          0          0          0          0    1430246          0  xen-percpu-virq      timer12
580:          0          0          0          0          0          0          0          0          0          0          0      32015          0          0  xen-percpu-ipi       callfuncsingle11
581:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug11
582:          0          0          0          0          0          0          0          0          0          0          0       8790          0          0  xen-percpu-ipi       callfunc11
583:          0          0          0          0          0          0          0          0          0          0          0     261470          0          0  xen-percpu-ipi       resched11
584:          0          0          0          0          0          0          0          0          0          0          0    1450235          0          0  xen-percpu-virq      timer11
585:          0          0          0          0          0          0          0          0          0          0      32368          0          0          0  xen-percpu-ipi       callfuncsingle10
586:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug10
587:          0          0          0          0          0          0          0          0          0          0       9095          0          0          0  xen-percpu-ipi       callfunc10
588:          0          0          0          0          0          0          0          0          0          0     265116          0          0          0  xen-percpu-ipi       resched10
589:          0          0          0          0          0          0          0          0          0          0    1413617          0          0          0  xen-percpu-virq      timer10
590:          0          0          0          0          0          0          0          0          0      33418          0          0          0          0  xen-percpu-ipi       callfuncsingle9
591:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug9
592:          0          0          0          0          0          0          0          0          0       9399          0          0          0          0  xen-percpu-ipi       callfunc9
593:          0          0          0          0          0          0          0          0          0     272425          0          0          0          0  xen-percpu-ipi       resched9
594:          0          0          0          0          0          0          0          0          0    1491680          0          0          0          0  xen-percpu-virq      timer9
595:          0          0          0          0          0          0          0          0      33999          0          0          0          0          0  xen-percpu-ipi       callfuncsingle8
596:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug8
597:          0          0          0          0          0          0          0          0       9977          0          0          0          0          0  xen-percpu-ipi       callfunc8
598:          0          0          0          0          0          0          0          0     287828          0          0          0          0          0  xen-percpu-ipi       resched8
599:          0          0          0          0          0          0          0          0    1560125          0          0          0          0          0  xen-percpu-virq      timer8
600:          0          0          0          0          0          0          0      35195          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle7
601:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug7
602:          0          0          0          0          0          0          0      11249          0          0          0          0          0          0  xen-percpu-ipi       callfunc7
603:          0          0          0          0          0          0          0     297309          0          0          0          0          0          0  xen-percpu-ipi       resched7
604:          0          0          0          0          0          0          0    1652758          0          0          0          0          0          0  xen-percpu-virq      timer7
605:          0          0          0          0          0          0      36831          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle6
606:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug6
607:          0          0          0          0          0          0      11503          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc6
608:          0          0          0          0          0          0     323277          0          0          0          0          0          0          0  xen-percpu-ipi       resched6
609:          0          0          0          0          0          0    1735526          0          0          0          0          0          0          0  xen-percpu-virq      timer6
610:          0          0          0          0          0      38914          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle5
611:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug5
612:          0          0          0          0          0      12793          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc5
613:          0          0          0          0          0     341022          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched5
614:          0          0          0          0          0    1874834          0          0          0          0          0          0          0          0  xen-percpu-virq      timer5
615:          0          0          0          0      42273          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle4
616:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug4
617:          0          0          0          0      23587          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc4
618:          0          0          0          0     408052          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched4
619:          0          0          0          0    2210084          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer4
620:          0          0          0      48748          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle3
621:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug3
622:          0          0          0      21775          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc3
623:          0          0          0     472951          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched3
624:          0          0          0    2674154          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer3
625:          0          0      34753          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle2
626:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug2
627:          0          0       7902          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc2
628:          0          0     588019          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched2
629:          0          0    3689556          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer2
630:          0      54763          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle1
631:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug1
632:          0       6348          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc1
633:          0     682282          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched1
634:          0    5198436          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer1
635:      27410          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle0
636:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug0
637:       5820          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc0
638:     807251          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched0
639:    5922415          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer0
NMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance monitoring interrupts
PND:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance pending work
RES:     807251     682283     588019     472951     408052     341022     323277     297309     287828     272426     265116     261470     266405     269663   Rescheduling interrupts
CAL:      33230      61111      42655      70523      65860      51707      48334      46444      43976      42817      41463      40805      40402      39728   Function call interrupts
TLB:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
MCP:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check polls
ERR:          0
MIS:          0
====================
====================
nginx high load off
ONEX_LOAD = 1496
FIVX_LOAD = 588
uptime : 
 02:01:10 up 16:28,  0 users,  load average: 14.96, 5.88, 3.35
====================
nginx_high_load_off
====================
nginx high load on
ONEX_LOAD = 1793
FIVX_LOAD = 678
uptime : 
 02:01:23 up 16:28,  0 users,  load average: 17.93, 6.78, 3.67
cat /proc/interrupts : 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      
565:    2047161          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     eth0
566:         63          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
567:    1283453          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
568:       2048          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     hvc_console
569:        434          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     xenbus
570:          0          0          0          0          0          0          0          0          0          0          0          0          0      31627  xen-percpu-ipi       callfuncsingle13
571:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug13
572:          0          0          0          0          0          0          0          0          0          0          0          0          0       8147  xen-percpu-ipi       callfunc13
573:          0          0          0          0          0          0          0          0          0          0          0          0          0     269998  xen-percpu-ipi       resched13
574:          0          0          0          0          0          0          0          0          0          0          0          0          0    1531826  xen-percpu-virq      timer13
575:          0          0          0          0          0          0          0          0          0          0          0          0      32001          0  xen-percpu-ipi       callfuncsingle12
576:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug12
577:          0          0          0          0          0          0          0          0          0          0          0          0       8443          0  xen-percpu-ipi       callfunc12
578:          0          0          0          0          0          0          0          0          0          0          0          0     266989          0  xen-percpu-ipi       resched12
579:          0          0          0          0          0          0          0          0          0          0          0          0    1432919          0  xen-percpu-virq      timer12
580:          0          0          0          0          0          0          0          0          0          0          0      32314          0          0  xen-percpu-ipi       callfuncsingle11
581:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug11
582:          0          0          0          0          0          0          0          0          0          0          0       8795          0          0  xen-percpu-ipi       callfunc11
583:          0          0          0          0          0          0          0          0          0          0          0     261913          0          0  xen-percpu-ipi       resched11
584:          0          0          0          0          0          0          0          0          0          0          0    1451460          0          0  xen-percpu-virq      timer11
585:          0          0          0          0          0          0          0          0          0          0      32392          0          0          0  xen-percpu-ipi       callfuncsingle10
586:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug10
587:          0          0          0          0          0          0          0          0          0          0       9100          0          0          0  xen-percpu-ipi       callfunc10
588:          0          0          0          0          0          0          0          0          0          0     265999          0          0          0  xen-percpu-ipi       resched10
589:          0          0          0          0          0          0          0          0          0          0    1416401          0          0          0  xen-percpu-virq      timer10
590:          0          0          0          0          0          0          0          0          0      33436          0          0          0          0  xen-percpu-ipi       callfuncsingle9
591:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug9
592:          0          0          0          0          0          0          0          0          0       9405          0          0          0          0  xen-percpu-ipi       callfunc9
593:          0          0          0          0          0          0          0          0          0     272650          0          0          0          0  xen-percpu-ipi       resched9
594:          0          0          0          0          0          0          0          0          0    1492937          0          0          0          0  xen-percpu-virq      timer9
595:          0          0          0          0          0          0          0          0      34046          0          0          0          0          0  xen-percpu-ipi       callfuncsingle8
596:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug8
597:          0          0          0          0          0          0          0          0       9984          0          0          0          0          0  xen-percpu-ipi       callfunc8
598:          0          0          0          0          0          0          0          0     288262          0          0          0          0          0  xen-percpu-ipi       resched8
599:          0          0          0          0          0          0          0          0    1561403          0          0          0          0          0  xen-percpu-virq      timer8
600:          0          0          0          0          0          0          0      35258          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle7
601:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug7
602:          0          0          0          0          0          0          0      11259          0          0          0          0          0          0  xen-percpu-ipi       callfunc7
603:          0          0          0          0          0          0          0     297715          0          0          0          0          0          0  xen-percpu-ipi       resched7
604:          0          0          0          0          0          0          0    1654240          0          0          0          0          0          0  xen-percpu-virq      timer7
605:          0          0          0          0          0          0      36901          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle6
606:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug6
607:          0          0          0          0          0          0      11511          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc6
608:          0          0          0          0          0          0     324057          0          0          0          0          0          0          0  xen-percpu-ipi       resched6
609:          0          0          0          0          0          0    1737127          0          0          0          0          0          0          0  xen-percpu-virq      timer6
610:          0          0          0          0          0      38938          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle5
611:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug5
612:          0          0          0          0          0      12799          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc5
613:          0          0          0          0          0     341531          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched5
614:          0          0          0          0          0    1876790          0          0          0          0          0          0          0          0  xen-percpu-virq      timer5
615:          0          0          0          0      42326          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle4
616:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug4
617:          0          0          0          0      23594          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc4
618:          0          0          0          0     408579          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched4
619:          0          0          0          0    2211654          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer4
620:          0          0          0      48791          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle3
621:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug3
622:          0          0          0      21778          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc3
623:          0          0          0     473447          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched3
624:          0          0          0    2677567          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer3
625:          0          0      34768          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle2
626:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug2
627:          0          0       7903          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc2
628:          0          0     588486          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched2
629:          0          0    3691186          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer2
630:          0      54799          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle1
631:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug1
632:          0       6351          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc1
633:          0     682680          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched1
634:          0    5200423          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer1
635:      27442          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle0
636:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug0
637:       5823          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc0
638:     807678          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched0
639:    5925119          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer0
NMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance monitoring interrupts
PND:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance pending work
RES:     807678     682680     588486     473447     408579     341531     324058     297715     288262     272650     265999     261913     266989     269998   Rescheduling interrupts
CAL:      33265      61150      42671      70569      65920      51737      48412      46517      44030      42841      41492      41109      40444      39774   Function call interrupts
TLB:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
MCP:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check polls
ERR:          0
MIS:          0
====================
====================
nginx high load off
ONEX_LOAD = 1526
FIVX_LOAD = 657
uptime : 
 02:01:33 up 16:28,  0 users,  load average: 15.26, 6.57, 3.63
====================
nginx_high_load_off
====================
nginx high load on
ONEX_LOAD = 1650
FIVX_LOAD = 467
uptime : 
 05:09:48 up 19:36,  0 users,  load average: 16.50, 4.67, 1.81
cat /proc/interrupts : 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      
565:    2380435          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     eth0
566:         63          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
567:    1410789          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     blkif
568:       2082          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     hvc_console
569:        434          0          0          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     xenbus
570:          0          0          0          0          0          0          0          0          0          0          0          0          0      42456  xen-percpu-ipi       callfuncsingle13
571:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug13
572:          0          0          0          0          0          0          0          0          0          0          0          0          0       9744  xen-percpu-ipi       callfunc13
573:          0          0          0          0          0          0          0          0          0          0          0          0          0     331645  xen-percpu-ipi       resched13
574:          0          0          0          0          0          0          0          0          0          0          0          0          0    1860502  xen-percpu-virq      timer13
575:          0          0          0          0          0          0          0          0          0          0          0          0      41972          0  xen-percpu-ipi       callfuncsingle12
576:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug12
577:          0          0          0          0          0          0          0          0          0          0          0          0      10145          0  xen-percpu-ipi       callfunc12
578:          0          0          0          0          0          0          0          0          0          0          0          0     325265          0  xen-percpu-ipi       resched12
579:          0          0          0          0          0          0          0          0          0          0          0          0    1760822          0  xen-percpu-virq      timer12
580:          0          0          0          0          0          0          0          0          0          0          0      43312          0          0  xen-percpu-ipi       callfuncsingle11
581:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug11
582:          0          0          0          0          0          0          0          0          0          0          0      10444          0          0  xen-percpu-ipi       callfunc11
583:          0          0          0          0          0          0          0          0          0          0          0     323204          0          0  xen-percpu-ipi       resched11
584:          0          0          0          0          0          0          0          0          0          0          0    1781544          0          0  xen-percpu-virq      timer11
585:          0          0          0          0          0          0          0          0          0          0      43417          0          0          0  xen-percpu-ipi       callfuncsingle10
586:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug10
587:          0          0          0          0          0          0          0          0          0          0      10869          0          0          0  xen-percpu-ipi       callfunc10
588:          0          0          0          0          0          0          0          0          0          0     323502          0          0          0  xen-percpu-ipi       resched10
589:          0          0          0          0          0          0          0          0          0          0    1740947          0          0          0  xen-percpu-virq      timer10
590:          0          0          0          0          0          0          0          0          0      46241          0          0          0          0  xen-percpu-ipi       callfuncsingle9
591:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug9
592:          0          0          0          0          0          0          0          0          0      11085          0          0          0          0  xen-percpu-ipi       callfunc9
593:          0          0          0          0          0          0          0          0          0     341915          0          0          0          0  xen-percpu-ipi       resched9
594:          0          0          0          0          0          0          0          0          0    1873149          0          0          0          0  xen-percpu-virq      timer9
595:          0          0          0          0          0          0          0          0      44888          0          0          0          0          0  xen-percpu-ipi       callfuncsingle8
596:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug8
597:          0          0          0          0          0          0          0          0      11936          0          0          0          0          0  xen-percpu-ipi       callfunc8
598:          0          0          0          0          0          0          0          0     359589          0          0          0          0          0  xen-percpu-ipi       resched8
599:          0          0          0          0          0          0          0          0    1949204          0          0          0          0          0  xen-percpu-virq      timer8
600:          0          0          0          0          0          0          0      50257          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle7
601:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug7
602:          0          0          0          0          0          0          0      13336          0          0          0          0          0          0  xen-percpu-ipi       callfunc7
603:          0          0          0          0          0          0          0     368282          0          0          0          0          0          0  xen-percpu-ipi       resched7
604:          0          0          0          0          0          0          0    2040586          0          0          0          0          0          0  xen-percpu-virq      timer7
605:          0          0          0          0          0          0      47261          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle6
606:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug6
607:          0          0          0          0          0          0      13757          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc6
608:          0          0          0          0          0          0     396099          0          0          0          0          0          0          0  xen-percpu-ipi       resched6
609:          0          0          0          0          0          0    2120277          0          0          0          0          0          0          0  xen-percpu-virq      timer6
610:          0          0          0          0          0      50513          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle5
611:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug5
612:          0          0          0          0          0      15254          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc5
613:          0          0          0          0          0     419430          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched5
614:          0          0          0          0          0    2309372          0          0          0          0          0          0          0          0  xen-percpu-virq      timer5
615:          0          0          0          0      56524          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle4
616:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug4
617:          0          0          0          0      28146          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc4
618:          0          0          0          0     499245          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched4
619:          0          0          0          0    2720161          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer4
620:          0          0          0      66270          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle3
621:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug3
622:          0          0          0      25841          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc3
623:          0          0          0     578674          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched3
624:          0          0          0    3251638          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer3
625:          0          0      45891          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle2
626:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug2
627:          0          0       9375          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc2
628:          0          0     712928          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched2
629:          0          0    4438415          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer2
630:          0      68953          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle1
631:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug1
632:          0       7530          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc1
633:          0     817419          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched1
634:          0    6221346          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer1
635:      38004          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfuncsingle0
636:          0          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      debug0
637:       6948          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       callfunc0
638:     961089          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-ipi       resched0
639:    7058770          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-percpu-virq      timer0
NMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance monitoring interrupts
PND:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance pending work
RES:     961089     817419     712928     578674     499245     419430     396099     368282     359589     341915     323502     323204     325265     331645   Rescheduling interrupts
CAL:      44952      76483      55266      92111      84670      65767      61018      63593      56824      57326      54286      53756      52117      52200   Function call interrupts
TLB:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
MCP:          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check polls
ERR:          0
MIS:          0
====================
====================
nginx high load off
ONEX_LOAD = 1345
FIVX_LOAD = 737
uptime : 
 05:10:59 up 19:37,  0 users,  load average: 13.45, 7.37, 3.04
====================
nginx_high_load_off
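
A note for reading the entries above: ONEX_LOAD and FIVX_LOAD are the 1 and 5 minute load averages multiplied by 100, so each pair can be checked against the uptime line in the same entry. second.sh reads them from /proc/loadavg with awk. The following is a minimal sketch of that conversion, not the script's own code: the scaled_load helper and the sample file are made up for illustration, and the explicit rounding is an addition that the script does not have.

```shell
#!/bin/bash
# Illustrative only: scaled_load is a hypothetical helper, not part of second.sh.
# Usage: scaled_load FIELD [FILE]
#   FIELD 1 = 1 minute average, FIELD 2 = 5 minute average.
#   FILE defaults to /proc/loadavg.
scaled_load() {
  local field="$1" file="${2:-/proc/loadavg}"
  # Multiply by 100 and round to the nearest integer.
  awk -v f="$field" '{ printf "%d\n", $f * 100 + 0.5 }' "$file"
}

# Reproduce the last log entry above from a sample loadavg line.
# The first three fields are taken from that entry's uptime output;
# the trailing two fields are filler.
sample=$(mktemp)
echo "13.45 7.37 3.04 2/326 12345" > "$sample"
echo "ONEX_LOAD = $(scaled_load 1 "$sample")"   # ONEX_LOAD = 1345
echo "FIVX_LOAD = $(scaled_load 2 "$sample")"   # FIVX_LOAD = 737
rm -f "$sample"
```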

This is the revised /var/xdrago/second.sh script. The revisions could be more elegant, and I might clean them up at some point:

#!/bin/bash

SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/opt/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

hold()
{
  # start additions
  echo "====================" >> /var/log/high-load.log
  echo "php-fpm and nginx about to be killed" >> /var/log/high-load.log
  echo "ONEX_LOAD = $ONEX_LOAD" >> /var/log/high-load.log
  echo "FIVX_LOAD = $FIVX_LOAD" >> /var/log/high-load.log
  echo "uptime : " >> /var/log/high-load.log
  uptime >> /var/log/high-load.log
  echo "top : " >> /var/log/high-load.log
  top -n 1 -b >> /var/log/high-load.log
  echo "processes : " >> /var/log/high-load.log
  ps -lA >> /var/log/high-load.log
  echo "cat /proc/interrupts : " >> /var/log/high-load.log
  cat /proc/interrupts >> /var/log/high-load.log
  echo "====================" >> /var/log/high-load.log
  # end additions
  /etc/init.d/nginx stop
  killall -9 nginx
  sleep 1
  killall -9 nginx
  /etc/init.d/php-fpm stop
  /etc/init.d/php53-fpm stop
  killall -9 php-fpm php-cgi
  echo load is $ONEX_LOAD:$FIVX_LOAD while maxload is $CTL_ONEX_LOAD:$CTL_FIVX_LOAD

}

terminate()
{
  if test -f /var/run/boa_run.pid ; then
    sleep 1
  else
    killall -9 php drush.php wget
  fi
}

nginx_high_load_on()
{
  # start additions
  echo "====================" >> /var/log/high-load.log
  echo "nginx high load on" >> /var/log/high-load.log
  echo "ONEX_LOAD = $ONEX_LOAD" >> /var/log/high-load.log
  echo "FIVX_LOAD = $FIVX_LOAD" >> /var/log/high-load.log
  echo "uptime : " >> /var/log/high-load.log
  uptime >> /var/log/high-load.log
  echo "top : " >> /var/log/high-load.log
  top -n 1 -b >> /var/log/high-load.log
  echo "processes : " >> /var/log/high-load.log
  ps -lA >> /var/log/high-load.log
  echo "cat /proc/interrupts : " >> /var/log/high-load.log
  cat /proc/interrupts >> /var/log/high-load.log
  echo "====================" >> /var/log/high-load.log
  # end additions
  mv -f /data/conf/nginx_high_load_off.conf /data/conf/nginx_high_load.conf
  /etc/init.d/nginx reload
}

nginx_high_load_off()
{
  # start additions
  echo "====================" >> /var/log/high-load.log
  echo "nginx high load off" >> /var/log/high-load.log
  echo "ONEX_LOAD = $ONEX_LOAD" >> /var/log/high-load.log
  echo "FIVX_LOAD = $FIVX_LOAD" >> /var/log/high-load.log
  echo "uptime : " >> /var/log/high-load.log
  uptime >> /var/log/high-load.log
  echo "top : " >> /var/log/high-load.log
  top -n 1 -b >> /var/log/high-load.log
  echo "processes : " >> /var/log/high-load.log
  ps -lA >> /var/log/high-load.log
  echo "====================" >> /var/log/high-load.log
  # end additions
  mv -f /data/conf/nginx_high_load.conf /data/conf/nginx_high_load_off.conf
  /etc/init.d/nginx reload
  echo "nginx_high_load_off" >> /var/log/high-load.log
}

control()
{
# Load averages are multiplied by 100 so they can be compared as integers below
ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg`
FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg`
# Original values:
#CTL_ONEX_SPIDER_LOAD=388
#CTL_FIVX_SPIDER_LOAD=388
#CTL_ONEX_LOAD=1444
#CTL_FIVX_LOAD=888
#CTL_ONEX_LOAD_CRIT=1888
#CTL_FIVX_LOAD_CRIT=1555
# x4 of original:
CTL_ONEX_SPIDER_LOAD=1552
CTL_FIVX_SPIDER_LOAD=1552
CTL_ONEX_LOAD=5776
CTL_FIVX_LOAD=3552
CTL_ONEX_LOAD_CRIT=7552
CTL_FIVX_LOAD_CRIT=6220
# x6 of original:
#CTL_ONEX_SPIDER_LOAD=2328
#CTL_FIVX_SPIDER_LOAD=2328
#CTL_ONEX_LOAD=8664
#CTL_FIVX_LOAD=5328
#CTL_ONEX_LOAD_CRIT=11328
#CTL_FIVX_LOAD_CRIT=9330
if [ $ONEX_LOAD -ge $CTL_ONEX_SPIDER_LOAD ] && [ $ONEX_LOAD -lt $CTL_ONEX_LOAD ] && [ -e "/data/conf/nginx_high_load_off.conf" ] ; then
  nginx_high_load_on
elif [ $FIVX_LOAD -ge $CTL_FIVX_SPIDER_LOAD ] && [ $FIVX_LOAD -lt $CTL_FIVX_LOAD ] && [ -e "/data/conf/nginx_high_load_off.conf" ] ; then
  nginx_high_load_on
elif [ $ONEX_LOAD -lt $CTL_ONEX_SPIDER_LOAD ] && [ $FIVX_LOAD -lt $CTL_FIVX_SPIDER_LOAD ] && [ -e "/data/conf/nginx_high_load.conf" ] ; then
  nginx_high_load_off
fi
if [ $ONEX_LOAD -ge $CTL_ONEX_LOAD_CRIT ] ; then
  terminate
elif [ $FIVX_LOAD -ge $CTL_FIVX_LOAD_CRIT ] ; then
  terminate
fi
if [ $ONEX_LOAD -ge $CTL_ONEX_LOAD ] ; then
  hold
elif [ $FIVX_LOAD -ge $CTL_FIVX_LOAD ] ; then
  hold
else
  echo load is $ONEX_LOAD:$FIVX_LOAD while maxload is $CTL_ONEX_LOAD:$CTL_FIVX_LOAD
  echo ...OK now doing CTL...
  perl /var/xdrago/proc_num_ctrl.cgi
  touch /var/xdrago/log/proc_num_ctrl.done
  echo CTL done
fi
}

control
sleep 10
control
sleep 10
control
sleep 10
control
sleep 10
control
sleep 10
control
echo Done !
###EOF2013###
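
To make it easier to reason about which multiplier to pick, the threshold arithmetic in control() can be reproduced outside the script. The following is only a sketch: scale and classify are hypothetical helper names that are not part of BOA, and classify deliberately ignores the nginx_high_load*.conf file checks and the boa_run.pid test, reporting only the most severe state that a pair of load values (already multiplied by 100) would reach.

```shell
#!/bin/bash
# Hypothetical helpers for reasoning about the thresholds; not part of BOA.

# Original thresholds as quoted in the script above (load average x 100).
ORIG_ONEX_SPIDER_LOAD=388
ORIG_FIVX_SPIDER_LOAD=388
ORIG_ONEX_LOAD=1444
ORIG_FIVX_LOAD=888
ORIG_ONEX_LOAD_CRIT=1888
ORIG_FIVX_LOAD_CRIT=1555

# scale VALUE FACTOR: print VALUE multiplied by FACTOR.
scale() {
  echo $(( $1 * $2 ))
}

# classify ONEX FIVX FACTOR: print the most severe state reached when the
# original thresholds are multiplied by FACTOR.
#   critical = terminate and hold would both run
#   hold     = nginx and php-fpm would be stopped
#   spider   = the "high load" nginx config would be switched on
#   ok       = none of the thresholds reached
classify() {
  c_onex=$1; c_fivx=$2; c_factor=$3
  if [ "$c_onex" -ge "$(scale $ORIG_ONEX_LOAD_CRIT "$c_factor")" ] ||
     [ "$c_fivx" -ge "$(scale $ORIG_FIVX_LOAD_CRIT "$c_factor")" ]; then
    echo critical
  elif [ "$c_onex" -ge "$(scale $ORIG_ONEX_LOAD "$c_factor")" ] ||
       [ "$c_fivx" -ge "$(scale $ORIG_FIVX_LOAD "$c_factor")" ]; then
    echo hold
  elif [ "$c_onex" -ge "$(scale $ORIG_ONEX_SPIDER_LOAD "$c_factor")" ] ||
       [ "$c_fivx" -ge "$(scale $ORIG_FIVX_SPIDER_LOAD "$c_factor")" ]; then
    echo spider
  else
    echo ok
  fi
}

# Example use (load 23.39 / 6.59 becomes 2339 / 659):
#   classify 2339 659 1
#   classify 2339 659 4
```

With the functions sourced into a shell, classify can be run against the load figures from any alert to see what second.sh would have done at a given multiplier, before editing the live script.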

The errors in the php-fpm log are much like those pasted in other comments on this thread, so I haven't added them here.

For reference, the following are the results of perl /usr/local/bin/mysqltuner.pl:

 >>  MySQLTuner 1.2.0 - Major Hayden <major@mhtx.net>
 >>  Bug reports, feature requests, and downloads at http://mysqltuner.com/
 >>  Run with '--help' for additional options and output filtering
[OK] Logged in using credentials from debian maintenance account.

-------- General Statistics --------------------------------------------------
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.5.31-MariaDB-1~squeeze-log
[OK] Operating on 64-bit architecture

-------- Storage Engine Statistics -------------------------------------------
[--] Status: +Archive -BDB +Federated +InnoDB -ISAM -NDBCluster 
[--] Data in MyISAM tables: 104M (Tables: 2)
[--] Data in InnoDB tables: 444M (Tables: 1037)
[--] Data in PERFORMANCE_SCHEMA tables: 0B (Tables: 17)
[!!] Total fragmented tables: 97

-------- Security Recommendations  -------------------------------------------
[OK] All database users have passwords assigned

-------- Performance Metrics -------------------------------------------------
[--] Up for: 1d 1h 6m 42s (5M q [60.782 qps], 148K conn, TX: 10B, RX: 880M)
[--] Reads / Writes: 86% / 14%
[--] Total buffers: 1.1G global + 13.4M per thread (75 max threads)
[OK] Maximum possible memory usage: 2.1G (26% of installed RAM)
[OK] Slow queries: 0% (35/5M)
[OK] Highest usage of available connections: 74% (56/75)
[OK] Key buffer size / total MyISAM indexes: 509.0M/93.0M
[OK] Key buffer hit rate: 98.3% (11M cached / 201K reads)
[OK] Query cache efficiency: 74.7% (3M cached / 4M selects)
[!!] Query cache prunes per day: 789637
[OK] Sorts requiring temporary tables: 2% (3K temp sorts / 148K sorts)
[!!] Joins performed without indexes: 5225
[!!] Temporary tables created on disk: 29% (53K on disk / 179K total)
[OK] Thread cache hit rate: 99% (56 created / 148K connections)
[!!] Table cache hit rate: 0% (128 open / 41K opened)
[OK] Open file limit used: 0% (4/196K)
[OK] Table locks acquired immediately: 99% (1M immediate / 1M locks)
[OK] InnoDB data size / buffer pool: 444.7M/509.0M

-------- Recommendations -----------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
    Adjust your join queries to always utilize indexes
    When making adjustments, make tmp_table_size/max_heap_table_size equal
    Reduce your SELECT DISTINCT queries without LIMIT clauses
    Increase table_cache gradually to avoid file descriptor limits
Variables to adjust:
    query_cache_size (> 64M)
    join_buffer_size (> 1.0M, or always use indexes with joins)
    tmp_table_size (> 64M)
    max_heap_table_size (> 128M)
    table_cache (> 128)

comment:50 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 16.88 to 17.13

The top command usually displays one line for all the CPU activity:

top - 11:47:11 up 1 day,  2:14,  1 user,  load average: 0.22, 0.29, 0.37
Tasks: 249 total,   5 running, 244 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.7%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8372060k total,  7650408k used,   721652k free,  2503992k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2375212k cached

But we want to know what all the CPUs are doing; in interactive mode you can toggle this behaviour by pressing 1.

To get the batch mode text dump to include this data, start top and press 1 so that all the CPUs show up, then write a config file by typing W. This saves the current configuration to $HOME/.toprc, and batch mode will then output info on all the CPUs, for example:

top - 11:35:57 up 1 day,  2:02,  1 user,  load average: 0.45, 0.45, 0.46
Tasks: 238 total,   3 running, 235 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.5%us,  2.3%sy,  0.0%ni, 91.3%id,  3.3%wa,  0.0%hi,  0.1%si,  0.5%st
Cpu1  :  1.9%us,  2.2%sy,  0.0%ni, 95.2%id,  0.2%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu2  :  1.9%us,  1.9%sy,  0.0%ni, 95.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu3  :  1.3%us,  1.5%sy,  0.0%ni, 96.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu4  :  1.1%us,  1.3%sy,  0.0%ni, 97.0%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu5  :  0.9%us,  1.2%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu6  :  0.8%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu7  :  0.7%us,  1.0%sy,  0.0%ni, 97.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu8  :  0.7%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu9  :  0.7%us,  1.0%sy,  0.0%ni, 97.7%id,  0.2%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu10 :  0.6%us,  0.9%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu11 :  0.6%us,  0.9%sy,  0.0%ni, 97.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu12 :  0.6%us,  0.9%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu13 :  0.7%us,  1.0%sy,  0.0%ni, 97.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Mem:   8372060k total,  7629848k used,   742212k free,  2503360k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2367916k cached
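
Since the per-CPU lines only appear when top finds the saved .toprc, it may be worth checking the batch output before relying on it in high-load.log. This is a minimal sketch: count_cpu_lines is a hypothetical helper name, and it assumes the "CpuN :" line format shown above, which other versions of top may not use.

```shell
#!/bin/bash
# Hypothetical helper: reads top batch output on stdin and prints how many
# per-CPU summary lines ("Cpu0  :", "Cpu1  :", ...) it contains.
# The combined "Cpu(s):" line is not counted.
count_cpu_lines() {
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
  grep -c -E '^Cpu[0-9]+ *:' || true
}

# Example use (commented out because it needs a live system):
#   n=$(top -b -n 1 | count_cpu_lines)
#   [ "$n" -gt 0 ] || echo "no per-CPU lines, check \$HOME/.toprc" >&2
```

If the count is 0 when run from the same environment as the logging script, top is probably not reading the .toprc that was written interactively, for example because HOME differs.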

I spent some time reading the top man page so that, when we next have a load spike, we should get to see the state of all the CPUs in the high-load.log file.

comment:51 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.35
Total Hours changed from 17.13 to 17.48

I have scripted copying the nginx access logs to penguin just before they are rotated every day; I'll document this later.

The next task, for next week now, will be to process the logs on penguin using awstats; by then we will have a few days of logs to start with.

comment:52 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 2.0
Total Hours changed from 17.48 to 19.48

Overnight there were several load spikes. For the first few, switching to the high load config resulted in the server recovering quite fast; the last one was more critical, as the load spike was higher and nginx and php-fpm were killed, resulting in around 5 mins of downtime.

Following is the first part of the output from top that was logged each time.

16:31:02

 16:31:02 up 1 day,  6:57,  2 users,  load average: 16.52, 4.73, 1.85
top :
top - 16:31:03 up 1 day,  6:57,  2 users,  load average: 16.52, 4.73, 1.85
Tasks: 274 total,  33 running, 239 sleeping,   0 stopped,   2 zombie
Cpu0  :  2.5%us,  2.3%sy,  0.0%ni, 91.5%id,  3.1%wa,  0.0%hi,  0.1%si,  0.5%st
Cpu1  :  2.0%us,  2.2%sy,  0.0%ni, 95.2%id,  0.2%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu2  :  1.9%us,  1.9%sy,  0.0%ni, 95.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu3  :  1.3%us,  1.5%sy,  0.0%ni, 96.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu4  :  1.2%us,  1.4%sy,  0.0%ni, 97.0%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu5  :  0.9%us,  1.2%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu6  :  0.8%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu7  :  0.8%us,  1.0%sy,  0.0%ni, 97.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu8  :  0.7%us,  1.0%sy,  0.0%ni, 97.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu9  :  0.7%us,  1.0%sy,  0.0%ni, 97.7%id,  0.2%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu10 :  0.6%us,  0.9%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu11 :  0.7%us,  1.0%sy,  0.0%ni, 97.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu12 :  0.7%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu13 :  0.7%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Mem:   8372060k total,  7887204k used,   484856k free,  2519720k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2539488k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
45571 www-data  20   0  774m 137m  94m R   97  1.7   5:22.81 php-fpm
17991 www-data  20   0  758m  66m  39m R   96  0.8   0:39.38 php-fpm
45603 www-data  20   0  766m 107m  72m R   91  1.3   5:52.13 php-fpm
45604 www-data  20   0  777m 110m  63m R   91  1.4   5:50.46 php-fpm
18490 www-data  20   0  739m  16m 7460 R   87  0.2   0:15.91 php-fpm
45606 www-data  20   0  772m  99m  57m R   86  1.2   5:22.92 php-fpm
45891 www-data  20   0  775m 136m  91m R   86  1.7   6:26.29 php-fpm
18965 www-data  20   0  739m  16m 7468 R   74  0.2   0:16.27 php-fpm
45574 www-data  20   0  769m  95m  56m R   64  1.2   5:57.58 php-fpm
18540 www-data  20   0  735m 8000 2640 R   56  0.1   0:02.59 php-fpm
45645 www-data  20   0  776m 119m  74m R   56  1.5   6:12.16 php-fpm
45607 www-data  20   0  779m 119m  71m R   54  1.5   5:38.10 php-fpm
45597 www-data  20   0  773m 105m  63m R   53  1.3   5:56.12 php-fpm
45906 www-data  20   0  769m 128m  89m R   53  1.6   5:13.50 php-fpm
18507 www-data  20   0  739m  16m 7460 R   51  0.2   0:12.39 php-fpm
45572 www-data  20   0  774m 101m  58m R   49  1.2   5:40.97 php-fpm
45599 www-data  20   0  779m 105m  57m R   48  1.3   6:05.59 php-fpm
45573 www-data  20   0  783m 119m  67m R   47  1.5   5:48.89 php-fpm
18495 www-data  20   0  739m  16m 7460 R   44  0.2   0:14.88 php-fpm
45646 www-data  20   0  759m  91m  62m R   43  1.1   5:28.63 php-fpm
19210 aegir     20   0  234m  25m 8748 R   12  0.3   0:00.28 drush.php
18445 root      20   0 43396 8028 1084 S    4  0.1   0:00.94 munin-node
19256 root      20   0 19200 1372  912 R    2  0.0   0:00.04 top
19278 nobody    20   0 43396 8132 1188 R    2  0.1   0:00.02 munin-node
19287 root      20   0 10376  912  768 S    2  0.0   0:00.02 awk
52877 www-data  20   0 75092  11m 1940 S    1  0.1   0:16.75 nginx
52894 www-data  20   0 75092  11m 1948 S    1  0.1   0:19.34 nginx
    1 root      20   0  8356  780  648 S    0  0.0   0:05.11 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
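
Dumps like the one above are easier to compare when reduced to a couple of numbers. The following is a rough sketch only: summarise_php_fpm is a hypothetical helper name, and it assumes the column layout shown in this ticket (state in column 8, %CPU in column 9, command in the last column) with whole-number %CPU values.

```shell
#!/bin/bash
# Hypothetical helper: reads top batch output on stdin and prints
# "<count> <total %CPU>" for php-fpm processes in the running (R) state.
summarise_php_fpm() {
  awk '$1 ~ /^[0-9]+$/ && $8 == "R" && $NF == "php-fpm" { n++; cpu += $9 }
       END { printf "%d %d\n", n, cpu }'
}

# Example use (the file name is an assumption; high-load.log holds several
# dumps, so one dump would need to be extracted first):
#   summarise_php_fpm < one-top-dump.txt
```

Header lines, the per-CPU lines and non php-fpm processes are skipped because they fail at least one of the three conditions in the awk pattern.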

21:57:29

 21:57:29 up 1 day, 12:24,  1 user,  load average: 18.16, 5.39, 2.02
top :
top - 21:57:30 up 1 day, 12:24,  1 user,  load average: 18.16, 5.39, 2.02
Tasks: 259 total,  27 running, 230 sleeping,   0 stopped,   2 zombie
Cpu0  :  2.5%us,  2.3%sy,  0.0%ni, 91.8%id,  2.8%wa,  0.0%hi,  0.1%si,  0.5%st
Cpu1  :  2.0%us,  2.2%sy,  0.0%ni, 95.2%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu2  :  1.9%us,  1.9%sy,  0.0%ni, 95.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu3  :  1.3%us,  1.5%sy,  0.0%ni, 96.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu4  :  1.1%us,  1.4%sy,  0.0%ni, 97.0%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu5  :  0.9%us,  1.2%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu6  :  0.8%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu7  :  0.8%us,  1.1%sy,  0.0%ni, 97.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu8  :  0.7%us,  1.0%sy,  0.0%ni, 97.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu9  :  0.7%us,  1.0%sy,  0.0%ni, 97.8%id,  0.2%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu10 :  0.6%us,  0.9%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu11 :  0.6%us,  0.9%sy,  0.0%ni, 97.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu12 :  0.7%us,  1.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu13 :  0.7%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Mem:   8372060k total,  8077252k used,   294808k free,  2468380k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2611272k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
50092 www-data  20   0  739m  16m 7476 R  100  0.2   0:26.70 php-fpm
49775 www-data  20   0  770m  74m  34m R   93  0.9   0:28.96 php-fpm
50127 www-data  20   0  740m  19m 9.8m R   85  0.2   0:25.75 php-fpm
50145 www-data  20   0  740m  22m  11m R   75  0.3   0:24.47 php-fpm
49766 www-data  20   0  774m  80m  37m R   73  1.0   0:32.37 php-fpm
49770 www-data  20   0  754m  55m  31m R   67  0.7   0:26.50 php-fpm
55399 www-data  20   0  766m  87m  51m R   67  1.1   1:20.79 php-fpm
49773 www-data  20   0  773m  75m  31m R   66  0.9   0:22.04 php-fpm
49780 www-data  20   0  760m  57m  27m R   66  0.7   0:29.57 php-fpm
49779 www-data  20   0  738m  15m 5752 R   64  0.2   0:10.90 php-fpm
50126 www-data  20   0  739m  16m 7476 R   62  0.2   0:19.51 php-fpm
56147 www-data  20   0  769m 147m 108m R   60  1.8   1:37.23 php-fpm
49771 www-data  20   0  770m  74m  34m R   58  0.9   0:28.96 php-fpm
50093 www-data  20   0  739m  16m 7460 R   56  0.2   0:25.06 php-fpm
55366 www-data  20   0  767m 144m 107m R   52  1.8   1:43.76 php-fpm
49778 www-data  20   0  770m  74m  34m R   48  0.9   0:31.67 php-fpm
55402 www-data  20   0  766m 102m  66m R   42  1.3   1:33.75 php-fpm
20343 www-data  20   0 75060  11m 1900 S   39  0.1   0:06.93 nginx
49776 www-data  20   0  772m  79m  36m R   37  1.0   0:25.08 php-fpm
49963 www-data  20   0  760m  59m  29m R   37  0.7   0:27.17 php-fpm
55404 www-data  20   0  767m  98m  61m R   33  1.2   1:57.49 php-fpm
50053 www-data  20   0  744m  39m  25m R   29  0.5   0:29.79 php-fpm
50155 www-data  20   0  735m 9192 3648 R   27  0.1   0:14.61 php-fpm
 3356 mysql     20   0 1908m 1.0g  10m S    6 13.0  67:13.42 mysqld
20313 www-data  20   0 75060  11m 1900 S    4  0.1   0:10.47 nginx
20334 www-data  20   0 75060  11m 1900 S    4  0.1   0:08.10 nginx
50262 root      20   0 43396 7976 1032 R    4  0.1   0:00.24 munin-node
50835 root      20   0 19200 1384  912 R    2  0.0   0:00.02 top
    1 root      20   0  8356  780  648 S    0  0.0   0:05.71 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd

23:01:03

 23:01:03 up 1 day, 13:27,  0 users,  load average: 28.25, 8.73, 3.37
top :
top - 23:01:04 up 1 day, 13:27,  0 users,  load average: 28.25, 8.73, 3.37
Tasks: 298 total,  37 running, 257 sleeping,   0 stopped,   4 zombie
Cpu0  :  2.5%us,  2.4%sy,  0.0%ni, 91.8%id,  2.8%wa,  0.0%hi,  0.1%si,  0.5%st
Cpu1  :  2.0%us,  2.3%sy,  0.0%ni, 95.2%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu2  :  1.9%us,  2.0%sy,  0.0%ni, 95.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu3  :  1.3%us,  1.6%sy,  0.0%ni, 96.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu4  :  1.1%us,  1.4%sy,  0.0%ni, 97.0%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu5  :  0.9%us,  1.2%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu6  :  0.8%us,  1.1%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu7  :  0.8%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu8  :  0.7%us,  1.1%sy,  0.0%ni, 97.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu9  :  0.7%us,  1.0%sy,  0.0%ni, 97.7%id,  0.2%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu10 :  0.6%us,  1.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu11 :  0.6%us,  1.0%sy,  0.0%ni, 97.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu12 :  0.7%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu13 :  0.7%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Mem:   8372060k total,  7889596k used,   482464k free,  2471192k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2653560k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4481 www-data  20   0  760m  77m  47m R   55  0.9   1:02.28 php-fpm
 4485 www-data  20   0  764m  93m  60m R   49  1.1   1:07.65 php-fpm
 4470 www-data  20   0  769m  88m  49m R   35  1.1   1:14.21 php-fpm
 4473 www-data  20   0  774m  92m  49m R   35  1.1   1:06.79 php-fpm
 4474 www-data  20   0  768m  86m  48m R   35  1.1   0:57.78 php-fpm
31358 www-data  20   0  736m  13m 7196 R   35  0.2   0:12.86 php-fpm
31373 www-data  20   0  735m 8336 2916 R   33  0.1   0:03.32 php-fpm
50093 www-data  20   0  759m  85m  56m R   31  1.0   2:22.98 php-fpm
31354 www-data  20   0  739m  16m 7456 R   30  0.2   0:14.51 php-fpm
31371 www-data  20   0  738m  12m 3900 R   30  0.2   0:06.13 php-fpm
50145 www-data  20   0  769m  97m  58m R   30  1.2   2:28.35 php-fpm
 4478 www-data  20   0  752m  69m  48m R   29  0.9   0:58.57 php-fpm
31370 www-data  20   0  735m 8372 2948 R   29  0.1   0:03.91 php-fpm
50126 www-data  20   0  769m  87m  49m R   29  1.1   2:31.44 php-fpm
31363 www-data  20   0  739m  16m 7048 R   27  0.2   0:11.86 php-fpm
31374 www-data  20   0  735m 8292 2876 R   27  0.1   0:03.14 php-fpm
31378 www-data  20   0  735m 8120 2728 R   27  0.1   0:02.17 php-fpm
 4480 www-data  20   0  771m  89m  49m R   26  1.1   1:16.16 php-fpm
31360 www-data  20   0  736m  14m 7636 R   26  0.2   0:13.45 php-fpm
31362 www-data  20   0  738m  12m 4116 R   26  0.2   0:07.34 php-fpm
31372 www-data  20   0  735m 9128 3608 R   26  0.1   0:06.26 php-fpm
31377 www-data  20   0  735m 7884 2560 R   25  0.1   0:02.14 php-fpm
 4468 www-data  20   0  759m  76m  47m R   24  0.9   1:07.90 php-fpm
 4483 www-data  20   0  770m  98m  58m R   24  1.2   1:08.15 php-fpm
 4488 www-data  20   0  759m  77m  48m R   24  0.9   1:02.41 php-fpm
 4479 www-data  20   0  753m  89m  67m R   22  1.1   0:54.15 php-fpm
50127 www-data  20   0  767m  95m  58m R   22  1.2   2:16.62 php-fpm
52791 www-data  20   0 75052  11m 1904 R   18  0.1   0:04.02 nginx
31361 www-data  20   0  738m  12m 4108 R   17  0.2   0:06.97 php-fpm
31273 aegir     20   0  235m  25m 8768 R   14  0.3   0:21.13 drush.php
31342 www-data  20   0  736m  14m 7772 R   12  0.2   0:14.19 php-fpm
31469 aegir     20   0  224m  18m 8632 R    8  0.2   0:00.25 drush.php
 2340 ntp       20   0 38340 2168 1592 R    7  0.0   0:26.49 ntpd
31523 root      20   0 19200 1412  912 R    5  0.0   0:00.14 top
31567 root      20   0 16852 1068  868 R    2  0.0   0:00.02 tar
 2322 pdnsd     20   0  206m 1656  632 S    1  0.0   0:25.85 pdnsd
 3868 root      20   0 37176 2400 1884 S    1  0.0   0:17.73 master
31264 root      20   0 10796 1572 1180 S    1  0.0   0:00.39 backupninja
31272 root      20   0 10836 1624 1192 S    1  0.0   0:00.13 metche
31319 root      20   0 43396 8032 1088 S    1  0.1   0:00.14 munin-node
31470 root      20   0 10660 1384 1124 S    1  0.0   0:00.01 bash
31484 root      20   0 10684 1424 1144 S    1  0.0   0:00.02 bash
31576 root      20   0 13288  712  416 R    1  0.0   0:00.01 bzip2
31596 root      20   0     0    0    0 Z    1  0.0   0:00.01 awk <defunct>
31597 root      20   0  3956  592  496 S    1  0.0   0:00.01 mysql_slowqueri
    1 root      20   0  8356  780  648 S    0  0.0   0:05.81 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:29.00 migration/0

23:07:32

 23:07:32 up 1 day, 13:34,  0 users,  load average: 27.50, 18.11, 10.43
top :
top - 23:07:33 up 1 day, 13:34,  0 users,  load average: 27.50, 18.11, 10.43
Tasks: 289 total,  56 running, 233 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.5%us,  2.4%sy,  0.0%ni, 91.7%id,  2.8%wa,  0.0%hi,  0.1%si,  0.5%st
Cpu1  :  2.0%us,  2.3%sy,  0.0%ni, 95.1%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu2  :  1.9%us,  2.0%sy,  0.0%ni, 95.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu3  :  1.3%us,  1.6%sy,  0.0%ni, 96.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.4%sy,  0.0%ni, 96.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.3%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu6  :  0.8%us,  1.2%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu7  :  0.8%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu8  :  0.7%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu9  :  0.7%us,  1.0%sy,  0.0%ni, 97.7%id,  0.2%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu10 :  0.6%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu11 :  0.6%us,  1.0%sy,  0.0%ni, 97.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu12 :  0.7%us,  1.0%sy,  0.0%ni, 97.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu13 :  0.7%us,  1.1%sy,  0.0%ni, 97.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Mem:   8372060k total,  7559140k used,   812920k free,  2471720k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2408416k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
37023 www-data  20   0  788m 112m  54m R   40  1.4   0:25.56 php-fpm
37039 www-data  20   0  761m  71m  39m R   36  0.9   0:24.71 php-fpm
37040 www-data  20   0  768m  78m  41m R   36  1.0   0:18.50 php-fpm
40099 www-data  20   0  735m 6748 1540 R   36  0.1   0:00.41 php-fpm
39694 www-data  20   0  735m 6956 1740 R   33  0.1   0:07.21 php-fpm
39706 www-data  20   0  735m 6928 1716 R   31  0.1   0:05.90 php-fpm
39691 www-data  20   0  735m 7788 2484 R   29  0.1   0:10.36 php-fpm
37022 www-data  20   0  758m  69m  41m R   28  0.8   0:24.13 php-fpm
37032 www-data  20   0  759m  67m  39m R   28  0.8   0:23.29 php-fpm
39421 www-data  20   0 74420  10m  992 R   28  0.1   0:09.38 nginx
37035 www-data  20   0  765m  72m  36m R   27  0.9   0:29.01 php-fpm
39700 www-data  20   0  735m 6928 1716 R   27  0.1   0:08.08 php-fpm
37026 www-data  20   0  774m  88m  42m R   25  1.1   0:22.94 php-fpm
37036 www-data  20   0  758m  62m  35m R   25  0.8   0:21.28 php-fpm
37038 www-data  20   0  768m  76m  39m R   25  0.9   0:22.01 php-fpm
39695 www-data  20   0  735m 6928 1716 R   25  0.1   0:07.05 php-fpm
39697 www-data  20   0  735m 6956 1740 R   25  0.1   0:06.80 php-fpm
37030 www-data  20   0  759m  68m  40m R   24  0.8   0:24.42 php-fpm
37041 www-data  20   0  758m  67m  40m R   24  0.8   0:22.59 php-fpm
39698 www-data  20   0  735m 6912 1708 R   24  0.1   0:06.30 php-fpm
39956 www-data  20   0  735m 6960 1740 R   24  0.1   0:05.96 php-fpm
39966 www-data  20   0  735m 6932 1716 R   24  0.1   0:07.24 php-fpm
37031 www-data  20   0  756m  64m  37m R   23  0.8   0:26.85 php-fpm
37044 www-data  20   0  750m  59m  38m R   23  0.7   0:18.09 php-fpm
37045 www-data  20   0  758m  67m  40m R   23  0.8   0:18.56 php-fpm
37047 www-data  20   0  766m  76m  40m R   23  0.9   0:17.97 php-fpm
39696 www-data  20   0  735m 6928 1716 R   23  0.1   0:06.79 php-fpm
39996 www-data  20   0  735m 6916 1704 R   23  0.1   0:02.76 php-fpm
40018 www-data  20   0  735m 6920 1708 R   23  0.1   0:02.33 php-fpm
37042 www-data  20   0  768m  78m  41m R   21  1.0   0:20.97 php-fpm
39749 www-data  20   0  735m 6928 1716 R   21  0.1   0:05.78 php-fpm
39985 www-data  20   0  735m 6908 1700 R   21  0.1   0:01.85 php-fpm
39983 www-data  20   0  735m 6916 1708 R   20  0.1   0:03.29 php-fpm
40017 www-data  20   0  735m 6936 1716 R   20  0.1   0:03.86 php-fpm
39882 www-data  20   0  735m 6932 1716 R   19  0.1   0:05.45 php-fpm
39965 www-data  20   0  735m 6908 1700 R   19  0.1   0:02.07 php-fpm
39999 www-data  20   0  735m 6920 1708 R   19  0.1   0:04.01 php-fpm
37024 www-data  20   0  766m  72m  36m R   17  0.9   0:22.42 php-fpm
37029 www-data  20   0  756m  63m  36m R   17  0.8   0:21.47 php-fpm
37046 www-data  20   0  759m  67m  39m R   17  0.8   0:26.86 php-fpm
39693 www-data  20   0  735m 6932 1716 R   17  0.1   0:09.51 php-fpm
40013 www-data  20   0  735m 6920 1708 R   17  0.1   0:02.85 php-fpm
39995 www-data  20   0  735m 6936 1716 R   16  0.1   0:01.65 php-fpm
40012 www-data  20   0  735m 6920 1708 R   16  0.1   0:03.52 php-fpm
39986 www-data  20   0  735m 6736 1532 R   13  0.1   0:00.86 php-fpm
39787 www-data  20   0  735m 6908 1700 R   11  0.1   0:04.14 php-fpm
37033 www-data  20   0  752m  59m  37m R    9  0.7   0:21.88 php-fpm
39982 www-data  20   0  735m 6860 1652 R    9  0.1   0:03.13 php-fpm
39984 www-data  20   0  735m 6788 1584 R    9  0.1   0:02.38 php-fpm
39936 www-data  20   0  735m 6932 1716 R    8  0.1   0:06.02 php-fpm
39968 www-data  20   0  735m 6932 1716 R    8  0.1   0:05.48 php-fpm
39998 www-data  20   0  735m 6920 1708 R    8  0.1   0:03.16 php-fpm
39417 www-data  20   0 74420 9.8m  800 S    7  0.1   0:04.31 nginx
39964 www-data  20   0  735m 6932 1716 R    5  0.1   0:06.88 php-fpm
 1817 root      20   0  6468  600  480 S    1  0.0   2:45.31 vnstatd
 2524 root      20   0 53372  15m 1532 S    1  0.2   6:13.17 lfd
 3356 mysql     20   0 1921m 1.0g  10m S    1 13.0  69:10.85 mysqld
40039 aegir     20   0  235m  26m 8896 S    1  0.3   0:03.03 drush.php
40196 root      20   0 19200 1404  912 R    1  0.0   0:00.03 top
    1 root      20   0  8356  780  648 S    0  0.0   0:05.82 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd

08:00:56

 08:00:56 up 1 day, 22:27,  0 users,  load average: 21.87, 6.02, 2.21
top :
top - 08:00:57 up 1 day, 22:27,  0 users,  load average: 21.87, 6.02, 2.21
Tasks: 269 total,  57 running, 210 sleeping,   0 stopped,   2 zombie
Cpu0  :  2.5%us,  2.4%sy,  0.0%ni, 91.2%id,  3.3%wa,  0.0%hi,  0.1%si,  0.5%st
Cpu1  :  2.0%us,  2.3%sy,  0.0%ni, 94.8%id,  0.4%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu2  :  1.9%us,  2.0%sy,  0.0%ni, 95.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu3  :  1.3%us,  1.6%sy,  0.0%ni, 96.4%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.4%sy,  0.0%ni, 96.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.3%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.2%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu7  :  0.8%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu8  :  0.7%us,  1.1%sy,  0.0%ni, 97.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu9  :  0.7%us,  1.1%sy,  0.0%ni, 97.7%id,  0.2%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu10 :  0.6%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu11 :  0.7%us,  1.0%sy,  0.0%ni, 97.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu12 :  0.7%us,  1.0%sy,  0.0%ni, 97.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu13 :  0.7%us,  1.1%sy,  0.0%ni, 97.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Mem:   8372060k total,  7356528k used,  1015532k free,  1851100k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2433460k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11415 www-data  20   0  787m 203m 147m R   56  2.5   7:37.28 php-fpm
50968 www-data  20   0  739m  16m 7472 R   52  0.2   0:04.66 php-fpm
15493 www-data  20   0  769m 149m 110m R   47  1.8   9:07.38 php-fpm
50967 www-data  20   0  739m  16m 7472 R   46  0.2   0:04.06 php-fpm
13672 www-data  20   0  772m 149m 107m R   44  1.8   7:46.01 php-fpm
50978 www-data  20   0  739m  16m 7472 R   44  0.2   0:04.68 php-fpm
12691 www-data  20   0  768m 163m 125m R   41  2.0   7:42.48 php-fpm
50963 www-data  20   0  739m  16m 7472 R   41  0.2   0:04.06 php-fpm
50964 www-data  20   0  739m  16m 7472 R   41  0.2   0:04.16 php-fpm
 5695 www-data  20   0  778m 148m 101m R   40  1.8   4:12.11 php-fpm
11120 www-data  20   0  765m 142m 107m R   40  1.7   9:24.31 php-fpm
 6009 www-data  20   0  775m 130m  85m R   38  1.6   4:20.16 php-fpm
11422 www-data  20   0  770m 161m 121m R   38  2.0   7:45.76 php-fpm
50981 www-data  20   0  739m  16m 7472 R   38  0.2   0:03.97 php-fpm
50939 www-data  20   0  738m  14m 5684 R   37  0.2   0:04.94 php-fpm
50941 www-data  20   0  739m  16m 7480 R   37  0.2   0:06.56 php-fpm
50970 www-data  20   0  738m  14m 5684 R   37  0.2   0:04.11 php-fpm
14047 www-data  20   0  759m 149m 120m R   35  1.8   7:48.22 php-fpm
14433 www-data  20   0  768m 162m 125m R   35  2.0   8:14.64 php-fpm
50977 www-data  20   0  739m  16m 7472 R   35  0.2   0:04.07 php-fpm
51022 www-data  20   0  739m  16m 7472 R   35  0.2   0:04.18 php-fpm
51023 www-data  20   0  738m  12m 3904 R   35  0.2   0:02.34 php-fpm
 5686 www-data  20   0  791m 156m  96m R   34  1.9   4:12.40 php-fpm
50980 www-data  20   0  739m  16m 7472 R   34  0.2   0:04.31 php-fpm
51029 www-data  20   0  735m 6756 1540 R   34  0.1   0:01.02 php-fpm
50960 www-data  20   0  739m  16m 7472 R   32  0.2   0:04.74 php-fpm
 5696 www-data  20   0  777m 138m  91m R   29  1.7   4:33.54 php-fpm
50942 www-data  20   0  739m  16m 7472 R   29  0.2   0:05.36 php-fpm
50969 www-data  20   0  739m  16m 7472 R   29  0.2   0:04.29 php-fpm
51030 www-data  20   0  739m  15m 6208 R   28  0.2   0:03.38 php-fpm
51031 www-data  20   0  735m 6912 1700 R   28  0.1   0:01.77 php-fpm
11732 www-data  20   0  837m 222m 116m R   25  2.7   8:58.31 php-fpm
50976 www-data  20   0  739m  16m 7472 R   25  0.2   0:04.09 php-fpm
 5691 www-data  20   0  775m 134m  89m R   24  1.6   4:39.07 php-fpm
 5700 www-data  20   0  778m 137m  90m R   24  1.7   3:51.47 php-fpm
51024 www-data  20   0  738m  14m 5572 R   24  0.2   0:02.52 php-fpm
51027 www-data  20   0  735m 6912 1700 R   24  0.1   0:01.40 php-fpm
15486 www-data  20   0  779m 171m 123m R   22  2.1   8:46.50 php-fpm
16620 www-data  20   0  769m 151m 113m R   19  1.9   7:55.02 php-fpm
15496 www-data  20   0  770m 153m 113m R   18  1.9   9:36.26 php-fpm
50944 www-data  20   0  739m  16m 7472 R   13  0.2   0:04.99 php-fpm
47157 www-data  20   0 74420  10m 1928 R   12  0.1   0:09.84 nginx
27129 redis     20   0  191m  52m  920 R    3  0.6   2:53.16 redis-server
51223 root      20   0     0    0    0 R    3  0.0   0:00.02 sleep
50614 root      20   0 43396 8028 1084 S    1  0.1   0:04.77 munin-node
51227 root      20   0 19200 1384  912 R    1  0.0   0:00.02 top
    1 root      20   0  8356  780  648 S    0  0.0   0:08.74 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd

09:59:58

 09:59:58 up 2 days, 26 min,  1 user,  load average: 18.10, 5.17, 1.93
top :
top - 09:59:59 up 2 days, 26 min,  1 user,  load average: 18.10, 5.17, 1.93
Tasks: 266 total,  40 running, 226 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.5%us,  2.4%sy,  0.0%ni, 91.3%id,  3.3%wa,  0.0%hi,  0.1%si,  0.5%st
Cpu1  :  1.9%us,  2.3%sy,  0.0%ni, 94.8%id,  0.4%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu2  :  1.9%us,  2.0%sy,  0.0%ni, 95.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu3  :  1.3%us,  1.6%sy,  0.0%ni, 96.3%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.5%sy,  0.0%ni, 96.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.3%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.2%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu7  :  0.8%us,  1.2%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu8  :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu9  :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.2%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu10 :  0.6%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu12 :  0.7%us,  1.1%sy,  0.0%ni, 97.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.4%st
Cpu13 :  0.7%us,  1.1%sy,  0.0%ni, 97.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.4%st
Mem:   8372060k total,  7259364k used,  1112696k free,  1821832k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2483304k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
51208 www-data  20   0  759m  81m  52m R   95  1.0   2:24.82 php-fpm
53242 www-data  20   0  764m  94m  61m R   56  1.2   2:02.07 php-fpm
51027 www-data  20   0  767m  87m  51m R   54  1.1   2:40.76 php-fpm
53626 www-data  20   0  772m 101m  58m R   54  1.2   2:11.45 php-fpm
53664 www-data  20   0  767m  96m  60m R   54  1.2   2:30.49 php-fpm
51211 www-data  20   0  767m  89m  52m R   50  1.1   2:11.54 php-fpm
51215 www-data  20   0  759m  85m  57m R   50  1.1   2:24.41 php-fpm
53077 www-data  20   0  774m 100m  56m R   50  1.2   2:10.90 php-fpm
 6240 www-data  20   0  740m  21m  10m R   49  0.3   0:23.19 php-fpm
 6248 www-data  20   0  740m  21m  10m R   49  0.3   0:20.48 php-fpm
51028 www-data  20   0  768m  93m  55m R   49  1.1   2:13.78 php-fpm
51030 www-data  20   0  770m 104m  64m R   49  1.3   2:52.08 php-fpm
53138 www-data  20   0  777m  96m  50m R   49  1.2   2:07.81 php-fpm
 6243 www-data  20   0  739m  16m 7448 R   47  0.2   0:19.16 php-fpm
 6250 www-data  20   0  739m  16m 7444 R   47  0.2   0:19.81 php-fpm
 6249 www-data  20   0  739m  16m 7448 R   45  0.2   0:19.25 php-fpm
 6253 www-data  20   0  739m  16m 7444 R   45  0.2   0:14.76 php-fpm
51026 www-data  20   0  772m  94m  52m R   45  1.2   2:10.43 php-fpm
51031 www-data  20   0  768m  90m  52m R   45  1.1   2:36.67 php-fpm
 6372 tn        20   0  258m  49m 9080 R   43  0.6   0:13.00 drush.php
51029 www-data  20   0  772m 123m  82m R   43  1.5   2:50.91 php-fpm
53340 www-data  20   0  769m  89m  50m R   43  1.1   1:57.83 php-fpm
 6252 www-data  20   0  739m  16m 7464 R   41  0.2   0:20.59 php-fpm
52759 www-data  20   0  767m  96m  59m R   40  1.2   2:15.25 php-fpm
53041 www-data  20   0  768m  95m  57m R   40  1.2   2:15.41 php-fpm
 6257 www-data  20   0  735m 8372 2952 R   36  0.1   0:03.66 php-fpm
51441 www-data  20   0  768m 100m  62m R   36  1.2   1:59.49 php-fpm
53300 www-data  20   0  762m  82m  50m R   34  1.0   2:17.06 php-fpm
51062 www-data  20   0  759m  90m  62m R   32  1.1   2:15.69 php-fpm
 6254 www-data  20   0  739m  16m 7444 R   31  0.2   0:13.16 php-fpm
 6659 root      20   0 19200 1376  912 R    2  0.0   0:00.02 top
    1 root      20   0  8356  780  648 S    0  0.0   0:09.22 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:34.09 migration/0

11:15:58

 11:15:58 up 2 days,  1:42,  1 user,  load average: 15.80, 4.66, 1.95
top :
top - 11:15:59 up 2 days,  1:42,  1 user,  load average: 15.80, 4.66, 1.95
Tasks: 259 total,  18 running, 239 sleeping,   0 stopped,   2 zombie
Cpu0  :  2.5%us,  2.4%sy,  0.0%ni, 91.3%id,  3.2%wa,  0.0%hi,  0.1%si,  0.5%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.8%id,  0.4%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.4%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu3  :  1.3%us,  1.6%sy,  0.0%ni, 96.3%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.5%sy,  0.0%ni, 96.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.3%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.2%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.8%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.7%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.7%us,  1.1%sy,  0.0%ni, 97.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  7125752k used,  1246308k free,  1824588k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2525444k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8588 www-data  20   0  770m  92m  53m R  130  1.1   2:11.90 php-fpm
 6571 www-data  20   0  770m  91m  52m R  128  1.1   1:39.57 php-fpm
 7297 www-data  20   0  771m  89m  49m R  127  1.1   2:13.49 php-fpm
 7309 www-data  20   0  776m 103m  57m R  125  1.3   2:23.27 php-fpm
 8012 www-data  20   0  769m  87m  48m R  125  1.1   1:24.40 php-fpm
 8771 www-data  20   0  759m  84m  55m R  121  1.0   1:27.05 php-fpm
 7725 www-data  20   0  770m  88m  49m R  115  1.1   1:38.88 php-fpm
32521 www-data  20   0  769m  90m  51m R  115  1.1   1:02.91 php-fpm
 6774 www-data  20   0  759m  77m  49m R  111  1.0   1:38.98 php-fpm
 6408 www-data  20   0  772m  97m  55m R  100  1.2   1:44.62 php-fpm
 6260 www-data  20   0  754m  84m  60m R   98  1.0   2:08.71 php-fpm
 6259 www-data  20   0  770m  98m  58m R   96  1.2   2:30.25 php-fpm
 6264 www-data  20   0  770m  98m  58m R   83  1.2   2:42.95 php-fpm
11340 www-data  20   0  768m 104m  66m R   81  1.3   1:31.02 php-fpm
 6263 www-data  20   0  769m  89m  50m R   76  1.1   2:37.23 php-fpm
63347 root      20   0 10620 1368 1148 S    4  0.0   0:06.56 bash
63776 root      20   0 19200 1380  912 R    2  0.0   0:00.02 top
    1 root      20   0  8356  780  648 S    0  0.0   0:09.94 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd

12:02:34

At noon there was a load spike that resulted in nginx and php-fpm being killed.

The load of 81 (equivalent to between 5 and 6 on a uni-processor system, given the 14 CPUs listed by top, see http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages) caused php-fpm and nginx to be killed. I think this threshold is probably a bit too low: looking at the CPU states above, the CPUs are not doing much, and with the current 8GB of RAM the server isn't swapping at all. There is 1GB of RAM free.
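
As a rough check of that comparison, here is a minimal Python sketch that divides a load average by the number of CPUs. The count of 14 is taken from the Cpu0 to Cpu13 lines in the top output, and the function name is only illustrative; it is not part of second.sh.

```python
def per_cpu_load(load_average, cpu_count):
    """Return the load average divided by the number of CPUs.

    A result of 1.0 roughly means every CPU has one runnable process.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


# 1 minute load average at 12:04:41, with the 14 CPUs shown by top
print(round(per_cpu_load(81.24, 14), 2))  # prints 5.8
```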

nginx high load on
ONEX_LOAD = 2628
FIVX_LOAD = 1095
uptime :
 12:02:34 up 2 days,  2:29,  1 user,  load average: 43.12, 15.94, 6.19
top :
====================
nginx high load on
ONEX_LOAD = 6109
FIVX_LOAD = 3001
uptime :
 12:04:41 up 2 days,  2:31,  1 user,  load average: 81.24, 38.13, 15.40
top :
====================
php-fpm and nginx about to be killed
ONEX_LOAD = 8124
FIVX_LOAD = 3813
uptime :
 12:04:41 up 2 days,  2:31,  1 user,  load average: 81.24, 38.13, 15.40
top :
====================
php-fpm and nginx about to be killed
ONEX_LOAD = 8124
FIVX_LOAD = 3813
uptime :
 12:04:41 up 2 days,  2:31,  1 user,  load average: 81.24, 38.13, 15.40
top :
top - 12:04:41 up 2 days,  2:31,  1 user,  load average: 81.24, 38.13, 15.40
Tasks: 354 total,  61 running, 292 sleeping,   0 stopped,   1 zombie
Cpu0  :  2.5%us,  2.5%sy,  0.0%ni, 91.2%id,  3.2%wa,  0.0%hi,  0.1%si,  0.6%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.7%id,  0.4%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu3  :  1.3%us,  1.7%sy,  0.0%ni, 96.2%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.6%sy,  0.0%ni, 96.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  7311676k used,  1060384k free,  1827280k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2413744k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
33536 www-data  20   0  739m  16m 7412 R   34  0.2   0:27.62 php-fpm
 1650 www-data  20   0  767m  83m  47m R   32  1.0   2:02.19 php-fpm
33602 www-data  20   0  738m  14m 5524 R   32  0.2   0:15.04 php-fpm
 1639 www-data  20   0  768m  96m  59m R   30  1.2   2:32.42 php-fpm
 1646 www-data  20   0  779m  94m  46m R   30  1.2   2:28.53 php-fpm
 1649 www-data  20   0  778m 117m  69m R   29  1.4   1:42.00 php-fpm
33489 www-data  20   0  739m  16m 7392 R   29  0.2   0:36.28 php-fpm
33511 www-data  20   0  739m  16m 7392 R   29  0.2   0:30.93 php-fpm
33587 www-data  20   0  739m  16m 7392 R   29  0.2   0:19.65 php-fpm
33610 www-data  20   0  738m  14m 5620 R   29  0.2   0:12.82 php-fpm
 1638 www-data  20   0  768m  82m  45m R   27  1.0   1:27.38 php-fpm
33490 www-data  20   0  739m  16m 7412 R   27  0.2   0:35.29 php-fpm
33555 www-data  20   0  739m  16m 7392 R   27  0.2   0:21.10 php-fpm
 1637 www-data  20   0  770m  88m  49m R   25  1.1   1:32.50 php-fpm
 1640 www-data  20   0  769m  85m  46m R   25  1.0   1:18.79 php-fpm
 1641 www-data  20   0  769m  86m  47m R   25  1.1   1:37.08 php-fpm
 1644 www-data  20   0  759m  77m  49m R   25  0.9   1:24.41 php-fpm
 1645 www-data  20   0  756m  76m  50m R   25  0.9   1:29.48 php-fpm
33508 www-data  20   0  739m  16m 7376 R   25  0.2   0:31.68 php-fpm
33546 www-data  20   0  739m  16m 7392 R   25  0.2   0:29.41 php-fpm
33615 www-data  20   0  738m  13m 4996 R   25  0.2   0:06.10 php-fpm
 1633 www-data  20   0  790m 124m  64m R   23  1.5   2:07.66 php-fpm
 1651 www-data  20   0  775m  98m  53m R   23  1.2   1:52.55 php-fpm
33542 www-data  20   0  739m  16m 7376 R   23  0.2   0:26.59 php-fpm
33601 www-data  20   0  738m  13m 4996 R   23  0.2   0:16.93 php-fpm
33606 www-data  20   0  739m  15m 6668 R   23  0.2   0:13.79 php-fpm
33607 www-data  20   0  739m  16m 7392 R   23  0.2   0:11.05 php-fpm
33611 www-data  20   0  739m  16m 7376 R   23  0.2   0:10.72 php-fpm
33614 www-data  20   0  738m  13m 4648 R   23  0.2   0:08.19 php-fpm
33616 www-data  20   0  738m  13m 5036 R   23  0.2   0:05.09 php-fpm
33114 www-data  20   0  739m  17m 7548 R   22  0.2   0:51.43 php-fpm
33554 www-data  20   0  739m  16m 7376 R   22  0.2   0:22.99 php-fpm
33586 www-data  20   0  739m  16m 7392 R   22  0.2   0:21.11 php-fpm
33594 www-data  20   0  739m  16m 6932 R   22  0.2   0:17.97 php-fpm
33613 www-data  20   0  738m  12m 3968 R   22  0.2   0:06.04 php-fpm
 1636 www-data  20   0  780m  97m  47m R   20  1.2   1:57.37 php-fpm
 1643 www-data  20   0  769m  87m  49m R   20  1.1   1:34.98 php-fpm
 1647 www-data  20   0  769m  87m  48m R   20  1.1   2:31.06 php-fpm
33488 www-data  20   0  739m  16m 7352 R   20  0.2   0:38.12 php-fpm
33539 www-data  20   0  739m  16m 7392 R   20  0.2   0:28.44 php-fpm
33552 www-data  20   0  739m  16m 7392 R   20  0.2   0:23.66 php-fpm
33619 www-data  20   0  738m  12m 3956 R   20  0.2   0:08.27 php-fpm
33618 www-data  20   0  738m  14m 5476 R   16  0.2   0:08.94 php-fpm
33671 www-data  20   0  738m  12m 3992 R   16  0.2   0:01.69 php-fpm
33795 root      20   0 13288 6952  424 R   14  0.1   0:00.08 bzip2
 1652 www-data  20   0  761m  78m  48m R   13  1.0   1:15.18 php-fpm
33592 www-data  20   0  738m  13m 4996 R   13  0.2   0:19.70 php-fpm
33593 www-data  20   0  738m  14m 5652 S    9  0.2   0:19.63 php-fpm
33806 aegir     20   0 37152 2324 1848 D    9  0.0   0:00.05 postdrop
 1648 www-data  20   0  768m  85m  47m R    7  1.0   1:20.05 php-fpm
33604 www-data  20   0  738m  14m 5636 S    7  0.2   0:15.85 php-fpm
 1634 www-data  20   0  771m  88m  48m R    5  1.1   1:45.03 php-fpm
 2234 root      20   0  117m 1940 1076 S    5  0.0   0:45.27 rsyslogd
33624 root      20   0 19200 1412  912 R    5  0.0   0:19.34 top
33738 root      20   0 19200 1464  912 R    5  0.0   0:00.06 top
33794 root      20   0 19200 1448  912 S    5  0.0   0:00.03 top
 1642 www-data  20   0  771m  89m  48m R    4  1.1   1:35.15 php-fpm
 3356 mysql     20   0 2000m 1.1g  10m S    4 14.3  92:04.57 mysqld
27129 redis     20   0  191m  37m  920 S    4  0.5   4:25.78 redis-server
33298 root      20   0 10808 1596 1188 S    4  0.0   0:01.00 backupninja
33780 root      20   0 16852 1080  868 S    4  0.0   0:00.02 tar
33846 root      20   0 19200 1452  912 S    4  0.0   0:00.02 top
  225 root      20   0     0    0    0 R    2  0.0   0:47.92 kjournald
 1434 www-data  20   0 75076  11m 1892 S    2  0.1   0:02.24 nginx
33299 root      20   0 10836 1624 1192 S    2  0.0   0:00.08 metche
33307 root      20   0 10752 1584 1228 S    2  0.0   0:00.20 bash
33332 root      20   0 10624 1372 1148 S    2  0.0   0:00.20 bash
33783 aegir     20   0 37164 2344 1868 S    2  0.0   0:00.01 sendmail
33827 postfix   20   0 39340 2528 2004 D    2  0.0   0:00.01 cleanup
33836 postfix   20   0 39252 2404 1908 S    2  0.0   0:00.01 trivial-rewrite
33882 root      20   0 39848 2784 2128 S    2  0.0   0:00.01 mysqladmin
    1 root      20   0  8356  780  648 S    0  0.0   0:10.02 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd

12:04:55

This is after php-fpm and nginx have been stopped:

php-fpm and nginx about to be killed
ONEX_LOAD = 6941
FIVX_LOAD = 3772
uptime :
 12:04:55 up 2 days,  2:31,  1 user,  load average: 69.41, 37.72, 15.64
top :
====================
php-fpm and nginx about to be killed
ONEX_LOAD = 6941
FIVX_LOAD = 3772
uptime :
 12:04:55 up 2 days,  2:31,  1 user,  load average: 69.41, 37.72, 15.64
top :
top - 12:04:56 up 2 days,  2:31,  1 user,  load average: 69.41, 37.72, 15.64
Tasks: 232 total,   1 running, 231 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.5%us,  2.5%sy,  0.0%ni, 91.2%id,  3.2%wa,  0.0%hi,  0.1%si,  0.6%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.7%id,  0.4%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu3  :  1.3%us,  1.7%sy,  0.0%ni, 96.2%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.6%sy,  0.0%ni, 96.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  6151508k used,  2220552k free,  1827288k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2322900k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
36669 root      20   0 19068 1344  912 R    6  0.0   0:00.05 top
36672 root      20   0 19068 1344  912 S    4  0.0   0:00.02 top
    1 root      20   0  8356  780  648 S    0  0.0   0:10.02 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:35.41 migration/0

12:05:01

php-fpm and nginx about to be killed
ONEX_LOAD = 6385
FIVX_LOAD = 3709
uptime :
 12:05:01 up 2 days,  2:31,  1 user,  load average: 63.85, 37.09, 15.56
top :
top - 12:05:01 up 2 days,  2:31,  1 user,  load average: 63.85, 37.09, 15.56
Tasks: 260 total,   4 running, 256 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.5%us,  2.5%sy,  0.0%ni, 91.2%id,  3.2%wa,  0.0%hi,  0.1%si,  0.6%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.7%id,  0.4%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu3  :  1.3%us,  1.7%sy,  0.0%ni, 96.2%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.6%sy,  0.0%ni, 96.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  6187608k used,  2184452k free,  1827288k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2323052k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
36832 root      20   0 39156  19m  16m R  100  0.2   0:00.74 apt-get
36826 aegir     20   0  235m  26m 8912 R   57  0.3   0:00.52 drush.php
36910 root      20   0 13288 6956  428 R   27  0.1   0:00.14 bzip2
 3356 mysql     20   0 2000m 1.1g  10m S    4 14.3  92:04.95 mysqld
36867 root      20   0 19200 1368  912 R    4  0.0   0:00.04 top
36909 root      20   0 16852 1080  868 S    4  0.0   0:00.02 tar
36823 root      20   0 10620 1368 1148 S    2  0.0   0:00.01 bash
36899 root      20   0  5368  564  480 S    2  0.0   0:00.01 sleep
36912 root      20   0  5368  564  480 S    2  0.0   0:00.01 sleep
    1 root      20   0  8356  780  648 S    0  0.0   0:10.02 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd

12:05:03

php-fpm and nginx about to be killed
ONEX_LOAD = 5898
FIVX_LOAD = 3652
uptime :
 12:05:03 up 2 days,  2:31,  1 user,  load average: 58.98, 36.52, 15.49
top :
top - 12:05:03 up 2 days,  2:31,  1 user,  load average: 58.98, 36.52, 15.49
Tasks: 253 total,   3 running, 250 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.5%us,  2.5%sy,  0.0%ni, 91.2%id,  3.2%wa,  0.0%hi,  0.1%si,  0.6%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.7%id,  0.4%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu3  :  1.3%us,  1.7%sy,  0.0%ni, 96.2%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.6%sy,  0.0%ni, 96.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  6167872k used,  2204188k free,  1827292k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2323876k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
36910 root      20   0 13292 7024  448 R   99  0.1   0:02.11 bzip2
36991 root      20   0 38728  19m  16m R   52  0.2   0:00.27 apt-get
 2524 root      20   0 53504  15m 1532 S    2  0.2  10:06.58 lfd
 3356 mysql     20   0 2000m 1.1g  10m S    2 14.3  92:05.08 mysqld
36909 root      20   0 16852 1084  868 S    2  0.0   0:00.10 tar
36990 root      20   0 19200 1368  912 R    2  0.0   0:00.02 top
    1 root      20   0  8356  780  648 S    0  0.0   0:10.02 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd

12:05:06

php-fpm and nginx about to be killed
ONEX_LOAD = 5898
FIVX_LOAD = 3652
uptime :
 12:05:06 up 2 days,  2:31,  1 user,  load average: 58.98, 36.52, 15.49
top :
top - 12:05:07 up 2 days,  2:31,  1 user,  load average: 58.98, 36.52, 15.49
Tasks: 244 total,   1 running, 243 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.5%us,  2.5%sy,  0.0%ni, 91.2%id,  3.2%wa,  0.0%hi,  0.1%si,  0.6%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.7%id,  0.4%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu3  :  1.3%us,  1.7%sy,  0.0%ni, 96.2%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.6%sy,  0.0%ni, 96.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  6156464k used,  2215596k free,  1827296k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2324872k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
37138 root      20   0 19200 1368  912 R    4  0.0   0:00.03 top
37164 root      20   0 19200 1360  912 S    4  0.0   0:00.02 top
    1 root      20   0  8356  780  648 S    0  0.0   0:10.02 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:35.41 migration/0

12:05:13

php-fpm and nginx about to be killed
ONEX_LOAD = 5426
FIVX_LOAD = 3592
uptime :
 12:05:13 up 2 days,  2:32,  1 user,  load average: 54.26, 35.92, 15.41
top :
top - 12:05:13 up 2 days,  2:32,  1 user,  load average: 49.91, 35.32, 15.32
Tasks: 240 total,   1 running, 239 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.5%us,  2.5%sy,  0.0%ni, 91.2%id,  3.2%wa,  0.0%hi,  0.1%si,  0.6%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.7%id,  0.4%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu3  :  1.3%us,  1.7%sy,  0.0%ni, 96.2%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.6%sy,  0.0%ni, 96.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  6155624k used,  2216436k free,  1827300k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2325092k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
37383 root      20   0 19200 1364  912 R    6  0.0   0:00.05 top
    1 root      20   0  8356  780  648 S    0  0.0   0:10.02 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:35.41 migration/0

12:09:31

Here is the log file record from after they were killed, once the processes had been started again:

nginx high load off
ONEX_LOAD = 89
FIVX_LOAD = 1519
uptime :
 12:09:31 up 2 days,  2:36,  1 user,  load average: 0.89, 15.19, 11.70
top :
top - 12:09:32 up 2 days,  2:36,  1 user,  load average: 0.89, 15.19, 11.70
Tasks: 240 total,   2 running, 238 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.5%us,  2.5%sy,  0.0%ni, 91.2%id,  3.2%wa,  0.0%hi,  0.1%si,  0.6%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.7%id,  0.4%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu3  :  1.3%us,  1.7%sy,  0.0%ni, 96.2%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.6%sy,  0.0%ni, 96.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  7008972k used,  1363088k free,  1827428k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2404800k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
37456 www-data  20   0  777m  85m  39m R   21  1.0   0:03.45 php-fpm
41269 root      20   0 19200 1356  912 R    8  0.0   0:00.06 top
 3356 mysql     20   0 2000m 1.1g  10m S    2 14.3  92:08.32 mysqld
37435 www-data  20   0 70996 8628 1860 S    2  0.1   0:00.23 nginx
39446 root      20   0 33988 5648 2120 S    2  0.1   0:00.43 vi
40519 root      20   0 10620 1360 1148 S    2  0.0   0:00.04 bash
    1 root      20   0  8356  780  648 S    0  0.0   0:10.03 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
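
For reference, the ONEX_LOAD and FIVX_LOAD values in most of these records look like the 1 and 5 minute load averages multiplied by 100 (0.89 and 15.19 above are logged as 89 and 1519). The following is a minimal Python sketch of that conversion, assuming this is how the values are derived. The function name is hypothetical and this is not the actual second.sh code.

```python
import re


def scaled_loads(uptime_line):
    """Return the 1 and 5 minute load averages from an uptime line,
    each multiplied by 100 and rounded to an integer."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line)
    if match is None:
        raise ValueError("no load average found in line")
    one_minute = float(match.group(1))
    five_minute = float(match.group(2))
    return round(one_minute * 100), round(five_minute * 100)


line = " 12:09:31 up 2 days,  2:36,  1 user,  load average: 0.89, 15.19, 11.70"
onex_load, fivx_load = scaled_loads(line)
print("ONEX_LOAD =", onex_load)
print("FIVX_LOAD =", fivx_load)
```

Working in integers like this would let a shell script compare loads against thresholds without floating point arithmetic, which may be why the values are logged in this form.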

The following was written to /var/log/php/php53-fpm-error.log:

[22-Jun-2013 12:00:23] WARNING: [pool www] child 1651, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.092773 sec), logging
[22-Jun-2013 12:00:24] NOTICE: child 1651 stopped for tracing
[22-Jun-2013 12:00:24] NOTICE: about to trace 1651
[22-Jun-2013 12:00:24] NOTICE: finished trace of 1651
[22-Jun-2013 12:00:33] WARNING: [pool www] child 1635, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.119338 sec), logging
[22-Jun-2013 12:00:34] NOTICE: child 1635 stopped for tracing
[22-Jun-2013 12:00:34] NOTICE: about to trace 1635
[22-Jun-2013 12:00:34] NOTICE: finished trace of 1635
[22-Jun-2013 12:00:43] WARNING: [pool www] child 1647, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.397008 sec), logging
[22-Jun-2013 12:00:43] WARNING: [pool www] child 1646, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.888318 sec), logging
[22-Jun-2013 12:00:43] WARNING: [pool www] child 1639, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.109913 sec), logging
[22-Jun-2013 12:00:43] NOTICE: child 1639 stopped for tracing
[22-Jun-2013 12:00:43] NOTICE: about to trace 1639
[22-Jun-2013 12:00:44] NOTICE: finished trace of 1639
[22-Jun-2013 12:00:44] NOTICE: child 1646 stopped for tracing
[22-Jun-2013 12:00:44] NOTICE: about to trace 1646
[22-Jun-2013 12:00:44] NOTICE: finished trace of 1646
[22-Jun-2013 12:00:44] NOTICE: child 1647 stopped for tracing
[22-Jun-2013 12:00:44] NOTICE: about to trace 1647
[22-Jun-2013 12:00:45] NOTICE: finished trace of 1647
[22-Jun-2013 12:01:04] WARNING: [pool www] child 1633, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.350825 sec), logging
[22-Jun-2013 12:01:04] NOTICE: child 1633 stopped for tracing
[22-Jun-2013 12:01:04] NOTICE: about to trace 1633
[22-Jun-2013 12:01:06] NOTICE: finished trace of 1633
[22-Jun-2013 12:01:14] WARNING: [pool www] child 1650, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.470621 sec), logging
[22-Jun-2013 12:01:14] WARNING: [pool www] child 1636, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.120095 sec), logging
[22-Jun-2013 12:01:14] NOTICE: child 1636 stopped for tracing
[22-Jun-2013 12:01:14] NOTICE: about to trace 1636
[22-Jun-2013 12:01:15] NOTICE: finished trace of 1636
[22-Jun-2013 12:01:15] NOTICE: child 1650 stopped for tracing
[22-Jun-2013 12:01:15] NOTICE: about to trace 1650
[22-Jun-2013 12:01:17] NOTICE: finished trace of 1650
[22-Jun-2013 12:01:24] WARNING: [pool www] child 33114, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.482372 sec), logging
[22-Jun-2013 12:01:25] NOTICE: child 33114 stopped for tracing
[22-Jun-2013 12:01:25] NOTICE: about to trace 33114
[22-Jun-2013 12:01:26] NOTICE: finished trace of 33114
[22-Jun-2013 12:01:34] WARNING: [pool www] child 1643, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.580010 sec), logging
[22-Jun-2013 12:01:34] WARNING: [pool www] child 1637, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.509937 sec), logging
[22-Jun-2013 12:01:35] NOTICE: child 1637 stopped for tracing
[22-Jun-2013 12:01:35] NOTICE: about to trace 1637
[22-Jun-2013 12:01:36] ERROR: failed to ptrace(PEEKDATA) pid 1637: Input/output error (5)
[22-Jun-2013 12:01:37] NOTICE: finished trace of 1637
[22-Jun-2013 12:01:37] NOTICE: child 1643 stopped for tracing
[22-Jun-2013 12:01:37] NOTICE: about to trace 1643
[22-Jun-2013 12:01:40] NOTICE: finished trace of 1643
[22-Jun-2013 12:01:45] WARNING: [pool www] child 1649, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.152829 sec), logging
[22-Jun-2013 12:01:45] WARNING: [pool www] child 1642, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.611072 sec), logging
[22-Jun-2013 12:01:45] WARNING: [pool www] child 1641, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.466280 sec), logging
[22-Jun-2013 12:01:45] NOTICE: child 1649 stopped for tracing
[22-Jun-2013 12:01:45] NOTICE: about to trace 1649
[22-Jun-2013 12:01:46] ERROR: failed to ptrace(PEEKDATA) pid 1649: Input/output error (5)
[22-Jun-2013 12:01:46] NOTICE: finished trace of 1649
[22-Jun-2013 12:01:46] NOTICE: child 1641 stopped for tracing
[22-Jun-2013 12:01:46] NOTICE: about to trace 1641
[22-Jun-2013 12:01:47] ERROR: failed to ptrace(PEEKDATA) pid 1641: Input/output error (5)
[22-Jun-2013 12:01:48] NOTICE: finished trace of 1641
[22-Jun-2013 12:01:48] NOTICE: child 1642 stopped for tracing
[22-Jun-2013 12:01:48] NOTICE: about to trace 1642
[22-Jun-2013 12:01:49] NOTICE: finished trace of 1642
[22-Jun-2013 12:01:52] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 9 idle, and 32 total children
[22-Jun-2013 12:01:53] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 9 idle, and 33 total children
[22-Jun-2013 12:01:55] WARNING: [pool www] child 1645, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.111287 sec), logging
[22-Jun-2013 12:01:55] WARNING: [pool www] child 1640, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.842618 sec), logging
[22-Jun-2013 12:01:55] NOTICE: child 1640 stopped for tracing
[22-Jun-2013 12:01:55] NOTICE: about to trace 1640
[22-Jun-2013 12:01:57] NOTICE: finished trace of 1640
[22-Jun-2013 12:01:57] NOTICE: child 1645 stopped for tracing
[22-Jun-2013 12:01:57] NOTICE: about to trace 1645
[22-Jun-2013 12:01:59] NOTICE: finished trace of 1645
[22-Jun-2013 12:02:05] WARNING: [pool www] child 1638, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.258995 sec), logging
[22-Jun-2013 12:02:05] NOTICE: child 1638 stopped for tracing
[22-Jun-2013 12:02:05] NOTICE: about to trace 1638
[22-Jun-2013 12:02:06] ERROR: failed to ptrace(PEEKDATA) pid 1638: Input/output error (5)
[22-Jun-2013 12:02:07] NOTICE: finished trace of 1638
[22-Jun-2013 12:02:15] WARNING: [pool www] child 1652, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.402964 sec), logging
[22-Jun-2013 12:02:15] WARNING: [pool www] child 1648, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.221868 sec), logging
[22-Jun-2013 12:02:15] WARNING: [pool www] child 1644, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (37.352610 sec), logging
[22-Jun-2013 12:02:15] WARNING: [pool www] child 1634, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.993610 sec), logging
[22-Jun-2013 12:02:15] NOTICE: child 1652 stopped for tracing
[22-Jun-2013 12:02:15] NOTICE: about to trace 1652
[22-Jun-2013 12:02:16] NOTICE: finished trace of 1652
[22-Jun-2013 12:02:16] NOTICE: child 1634 stopped for tracing
[22-Jun-2013 12:02:16] NOTICE: about to trace 1634
[22-Jun-2013 12:02:16] NOTICE: about to trace 1634
[22-Jun-2013 12:02:18] NOTICE: finished trace of 1634
[22-Jun-2013 12:02:18] NOTICE: child 1644 stopped for tracing
[22-Jun-2013 12:02:18] NOTICE: about to trace 1644
[22-Jun-2013 12:02:21] NOTICE: finished trace of 1644
[22-Jun-2013 12:02:21] NOTICE: child 1648 stopped for tracing
[22-Jun-2013 12:02:21] NOTICE: about to trace 1648
[22-Jun-2013 12:02:24] NOTICE: finished trace of 1648
[22-Jun-2013 12:02:25] WARNING: [pool www] child 33488, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.277113 sec), logging
[22-Jun-2013 12:02:25] WARNING: [pool www] child 33486, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.158089 sec), logging
[22-Jun-2013 12:02:25] WARNING: [pool www] child 1635, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (36.440551 sec), logging
[22-Jun-2013 12:02:26] NOTICE: child 33488 stopped for tracing
[22-Jun-2013 12:02:26] NOTICE: about to trace 33488
[22-Jun-2013 12:02:28] ERROR: failed to ptrace(PEEKDATA) pid 33488: Input/output error (5)
[22-Jun-2013 12:02:29] NOTICE: finished trace of 33488
[22-Jun-2013 12:02:29] NOTICE: child 1635 stopped for tracing
[22-Jun-2013 12:02:29] NOTICE: about to trace 1635
[22-Jun-2013 12:02:31] ERROR: failed to ptrace(PEEKDATA) pid 1635: Input/output error (5)
[22-Jun-2013 12:02:32] NOTICE: finished trace of 1635
[22-Jun-2013 12:02:32] NOTICE: child 33486 stopped for tracing
[22-Jun-2013 12:02:32] NOTICE: about to trace 33486
[22-Jun-2013 12:02:34] ERROR: failed to ptrace(PEEKDATA) pid 33486: Input/output error (5)
[22-Jun-2013 12:02:36] NOTICE: finished trace of 33486
[22-Jun-2013 12:02:36] WARNING: [pool www] child 33490, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.771847 sec), logging
[22-Jun-2013 12:02:36] WARNING: [pool www] child 33489, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.684622 sec), logging
[22-Jun-2013 12:02:36] NOTICE: child 33490 stopped for tracing
[22-Jun-2013 12:02:36] NOTICE: about to trace 33490
[22-Jun-2013 12:02:38] ERROR: failed to ptrace(PEEKDATA) pid 33490: Input/output error (5)
[22-Jun-2013 12:02:40] NOTICE: finished trace of 33490
[22-Jun-2013 12:02:40] NOTICE: child 33489 stopped for tracing
[22-Jun-2013 12:02:40] NOTICE: about to trace 33489
[22-Jun-2013 12:02:42] ERROR: failed to ptrace(PEEKDATA) pid 33489: Input/output error (5)
[22-Jun-2013 12:02:43] NOTICE: finished trace of 33489
[22-Jun-2013 12:02:43] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 8 idle, and 39 total children
[22-Jun-2013 12:02:44] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 8 idle, and 41 total children
[22-Jun-2013 12:02:46] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 8 idle, and 43 total children
[22-Jun-2013 12:02:47] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 8 idle, and 45 total children
[22-Jun-2013 12:02:49] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 6 idle, and 47 total children
[22-Jun-2013 12:02:50] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 8 idle, and 51 total children
[22-Jun-2013 12:02:56] WARNING: [pool www] child 33536, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.151578 sec), logging
[22-Jun-2013 12:02:56] WARNING: [pool www] child 33511, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.119777 sec), logging
[22-Jun-2013 12:02:56] WARNING: [pool www] child 33508, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.194974 sec), logging
[22-Jun-2013 12:02:56] NOTICE: child 33511 stopped for tracing
[22-Jun-2013 12:02:56] NOTICE: about to trace 33511
[22-Jun-2013 12:02:58] NOTICE: finished trace of 33511
[22-Jun-2013 12:02:58] NOTICE: child 33508 stopped for tracing
[22-Jun-2013 12:02:58] NOTICE: about to trace 33508
[22-Jun-2013 12:02:59] NOTICE: finished trace of 33508
[22-Jun-2013 12:02:59] NOTICE: child 33536 stopped for tracing
[22-Jun-2013 12:02:59] NOTICE: about to trace 33536
[22-Jun-2013 12:03:00] ERROR: failed to ptrace(PEEKDATA) pid 33536: Input/output error (5)
[22-Jun-2013 12:03:01] NOTICE: finished trace of 33536
[22-Jun-2013 12:03:06] WARNING: [pool www] child 33546, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.052783 sec), logging
[22-Jun-2013 12:03:06] WARNING: [pool www] child 33542, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.571572 sec), logging
[22-Jun-2013 12:03:06] WARNING: [pool www] child 33539, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.905059 sec), logging
[22-Jun-2013 12:03:07] NOTICE: child 33539 stopped for tracing
[22-Jun-2013 12:03:07] NOTICE: about to trace 33539
[22-Jun-2013 12:03:07] ERROR: failed to ptrace(PEEKDATA) pid 33539: Input/output error (5)
[22-Jun-2013 12:03:09] NOTICE: finished trace of 33539
[22-Jun-2013 12:03:09] NOTICE: child 33542 stopped for tracing
[22-Jun-2013 12:03:09] NOTICE: about to trace 33542
[22-Jun-2013 12:03:10] ERROR: failed to ptrace(PEEKDATA) pid 33542: Input/output error (5)
[22-Jun-2013 12:03:10] NOTICE: finished trace of 33542
[22-Jun-2013 12:03:10] NOTICE: child 33546 stopped for tracing
[22-Jun-2013 12:03:10] NOTICE: about to trace 33546
[22-Jun-2013 12:03:11] ERROR: failed to ptrace(PEEKDATA) pid 33546: Input/output error (5)
[22-Jun-2013 12:03:12] NOTICE: finished trace of 33546
[22-Jun-2013 12:03:17] WARNING: [pool www] child 33554, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.365713 sec), logging
[22-Jun-2013 12:03:17] WARNING: [pool www] child 33552, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.734208 sec), logging
[22-Jun-2013 12:03:17] NOTICE: child 33552 stopped for tracing
[22-Jun-2013 12:03:17] NOTICE: about to trace 33552
[22-Jun-2013 12:03:18] ERROR: failed to ptrace(PEEKDATA) pid 33552: Input/output error (5)
[22-Jun-2013 12:03:21] NOTICE: finished trace of 33552
[22-Jun-2013 12:03:21] NOTICE: child 33554 stopped for tracing
[22-Jun-2013 12:03:21] NOTICE: about to trace 33554
[22-Jun-2013 12:03:25] ERROR: failed to ptrace(PEEKDATA) pid 33554: Input/output error (5)
[22-Jun-2013 12:03:26] NOTICE: finished trace of 33554
[22-Jun-2013 12:03:27] WARNING: [pool www] child 33594, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.143826 sec), logging
[22-Jun-2013 12:03:27] WARNING: [pool www] child 33587, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.058360 sec), logging
[22-Jun-2013 12:03:27] WARNING: [pool www] child 33586, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.421254 sec), logging
[22-Jun-2013 12:03:27] WARNING: [pool www] child 33555, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.496059 sec), logging
[22-Jun-2013 12:03:27] WARNING: [pool www] child 33114, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.303183 sec), logging
[22-Jun-2013 12:03:27] WARNING: [pool www] child 1643, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.175861 sec), logging
[22-Jun-2013 12:03:28] NOTICE: child 33114 stopped for tracing
[22-Jun-2013 12:03:28] NOTICE: about to trace 33114
[22-Jun-2013 12:03:30] ERROR: failed to ptrace(PEEKDATA) pid 33114: Input/output error (5)
[22-Jun-2013 12:03:32] NOTICE: finished trace of 33114
[22-Jun-2013 12:03:32] NOTICE: child 1643 stopped for tracing
[22-Jun-2013 12:03:32] NOTICE: about to trace 1643
[22-Jun-2013 12:03:34] ERROR: failed to ptrace(PEEKDATA) pid 1643: Input/output error (5)
[22-Jun-2013 12:03:35] NOTICE: finished trace of 1643
[22-Jun-2013 12:03:35] NOTICE: child 33555 stopped for tracing
[22-Jun-2013 12:03:35] NOTICE: about to trace 33555
[22-Jun-2013 12:03:36] ERROR: failed to ptrace(PEEKDATA) pid 33555: Input/output error (5)
[22-Jun-2013 12:03:38] NOTICE: finished trace of 33555
[22-Jun-2013 12:03:38] NOTICE: child 33586 stopped for tracing
[22-Jun-2013 12:03:38] NOTICE: about to trace 33586
[22-Jun-2013 12:03:39] ERROR: failed to ptrace(PEEKDATA) pid 33586: Input/output error (5)
[22-Jun-2013 12:03:39] NOTICE: finished trace of 33586
[22-Jun-2013 12:03:39] NOTICE: child 33587 stopped for tracing
[22-Jun-2013 12:03:39] NOTICE: about to trace 33587
[22-Jun-2013 12:03:41] NOTICE: finished trace of 33587
[22-Jun-2013 12:03:42] NOTICE: child 33594 stopped for tracing
[22-Jun-2013 12:03:42] NOTICE: about to trace 33594
[22-Jun-2013 12:03:44] NOTICE: finished trace of 33594
[22-Jun-2013 12:03:44] WARNING: [pool www] child 33603, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.711441 sec), logging
[22-Jun-2013 12:03:44] WARNING: [pool www] child 33601, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.574718 sec), logging
[22-Jun-2013 12:03:44] WARNING: [pool www] child 33593, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (42.349060 sec), logging
[22-Jun-2013 12:03:44] WARNING: [pool www] child 33592, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (42.648372 sec), logging
[22-Jun-2013 12:03:44] NOTICE: child 33601 stopped for tracing
[22-Jun-2013 12:03:44] NOTICE: about to trace 33601
[22-Jun-2013 12:03:46] ERROR: failed to ptrace(PEEKDATA) pid 33601: Input/output error (5)
[22-Jun-2013 12:03:47] NOTICE: finished trace of 33601
[22-Jun-2013 12:03:47] NOTICE: child 33592 stopped for tracing
[22-Jun-2013 12:03:47] NOTICE: about to trace 33592
[22-Jun-2013 12:03:50] ERROR: failed to ptrace(PEEKDATA) pid 33592: Input/output error (5)
[22-Jun-2013 12:03:52] NOTICE: finished trace of 33592
[22-Jun-2013 12:03:52] NOTICE: child 33593 stopped for tracing
[22-Jun-2013 12:03:52] NOTICE: about to trace 33593
[22-Jun-2013 12:03:53] ERROR: failed to ptrace(PEEKDATA) pid 33593: Input/output error (5)
[22-Jun-2013 12:03:56] NOTICE: finished trace of 33593
[22-Jun-2013 12:03:56] NOTICE: child 33603 stopped for tracing
[22-Jun-2013 12:03:56] NOTICE: about to trace 33603
[22-Jun-2013 12:03:58] ERROR: failed to ptrace(PEEKDATA) pid 33603: Input/output error (5)
[22-Jun-2013 12:04:00] NOTICE: finished trace of 33603
[22-Jun-2013 12:04:00] WARNING: [pool www] child 33610, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.728167 sec), logging
[22-Jun-2013 12:04:00] WARNING: [pool www] child 33606, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.620598 sec), logging
[22-Jun-2013 12:04:00] WARNING: [pool www] child 33604, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (41.222512 sec), logging
[22-Jun-2013 12:04:00] WARNING: [pool www] child 33602, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (37.904272 sec), logging
[22-Jun-2013 12:04:00] NOTICE: child 33604 stopped for tracing
[22-Jun-2013 12:04:00] NOTICE: about to trace 33604
[22-Jun-2013 12:04:01] ERROR: failed to ptrace(PEEKDATA) pid 33604: Input/output error (5)
[22-Jun-2013 12:04:03] NOTICE: finished trace of 33604
[22-Jun-2013 12:04:03] NOTICE: child 33602 stopped for tracing
[22-Jun-2013 12:04:03] NOTICE: about to trace 33602
[22-Jun-2013 12:04:05] NOTICE: finished trace of 33602
[22-Jun-2013 12:04:05] NOTICE: child 33606 stopped for tracing
[22-Jun-2013 12:04:05] NOTICE: about to trace 33606
[22-Jun-2013 12:04:09] NOTICE: finished trace of 33606
[22-Jun-2013 12:04:09] NOTICE: child 33610 stopped for tracing
[22-Jun-2013 12:04:09] NOTICE: about to trace 33610
[22-Jun-2013 12:04:11] ERROR: failed to ptrace(PEEKDATA) pid 33610: Input/output error (5)
[22-Jun-2013 12:04:13] NOTICE: finished trace of 33610
[22-Jun-2013 12:04:13] WARNING: [pool www] child 33607, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.034944 sec), logging
[22-Jun-2013 12:04:14] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 7 idle, and 59 total children
[22-Jun-2013 12:04:14] NOTICE: child 33607 stopped for tracing
[22-Jun-2013 12:04:14] NOTICE: about to trace 33607
[22-Jun-2013 12:04:17] ERROR: failed to ptrace(PEEKDATA) pid 33607: Input/output error (5)
[22-Jun-2013 12:04:19] NOTICE: finished trace of 33607
[22-Jun-2013 12:04:19] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 5 idle, and 62 total children
[22-Jun-2013 12:04:20] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 6 idle, and 67 total children
[22-Jun-2013 12:04:21] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 9 idle, and 71 total children
[22-Jun-2013 12:04:23] WARNING: [pool www] child 33618, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.925871 sec), logging
[22-Jun-2013 12:04:23] WARNING: [pool www] child 33611, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.520972 sec), logging
[22-Jun-2013 12:04:24] NOTICE: child 33611 stopped for tracing
[22-Jun-2013 12:04:25] NOTICE: about to trace 33611
[22-Jun-2013 12:04:27] NOTICE: finished trace of 33611
[22-Jun-2013 12:04:27] NOTICE: child 33618 stopped for tracing
[22-Jun-2013 12:04:27] NOTICE: about to trace 33618
[22-Jun-2013 12:04:29] ERROR: failed to ptrace(PEEKDATA) pid 33618: Input/output error (5)
[22-Jun-2013 12:04:30] NOTICE: finished trace of 33618
[22-Jun-2013 12:04:34] WARNING: [pool www] child 33619, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.695058 sec), logging
[22-Jun-2013 12:04:34] WARNING: [pool www] child 33614, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.567031 sec), logging
[22-Jun-2013 12:04:34] NOTICE: child 33614 stopped for tracing
[22-Jun-2013 12:04:34] NOTICE: about to trace 33614
[22-Jun-2013 12:04:37] NOTICE: finished trace of 33614
[22-Jun-2013 12:04:37] NOTICE: child 33619 stopped for tracing
[22-Jun-2013 12:04:37] NOTICE: about to trace 33619
[22-Jun-2013 12:04:38] ERROR: failed to ptrace(PEEKDATA) pid 33619: Input/output error (5)
[22-Jun-2013 12:04:39] NOTICE: finished trace of 33619
[22-Jun-2013 12:04:44] WARNING: [pool www] child 33615, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.554511 sec), logging
[22-Jun-2013 12:04:44] NOTICE: child 33615 stopped for tracing
[22-Jun-2013 12:04:44] NOTICE: about to trace 33615
[22-Jun-2013 12:04:44] NOTICE: finished trace of 33615
[22-Jun-2013 12:04:44] NOTICE: Finishing ...
[22-Jun-2013 12:04:44] NOTICE: Finishing ...
[22-Jun-2013 12:04:44] NOTICE: exiting, bye-bye!
[22-Jun-2013 12:05:15] NOTICE: fpm is running, pid 37446
[22-Jun-2013 12:05:15] NOTICE: ready to handle connections
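
A log like the one above is easier to read as a count of slow requests per minute. The following is only a minimal sketch: the function name count_slow_per_minute is made up for illustration, and it assumes the php-fpm log lines are fed in on stdin in the same format as shown above (the log file location is not given here, so none is assumed).

```shell
#!/bin/sh
# Hypothetical helper, not part of BOA or php-fpm.
# Reads php-fpm log lines on stdin and prints "HH:MM count" for every
# minute that contains at least one "executing too slow" warning.
count_slow_per_minute() {
    awk '/executing too slow/ {
        # Field 2 looks like "12:01:45]"; keep the first five characters.
        minute = substr($2, 1, 5)
        count[minute]++
    }
    END {
        for (m in count) print m, count[m]
    }' | sort
}
```

Usage would be something like `count_slow_per_minute < some-php-fpm.log`, where the file name is a placeholder.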

The following mysql warnings were written to /var/log/daemon.log. Note that they only start at 12:04:44, when php-fpm was shut down, so they are a result of php-fpm being shut down:

Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303769 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303775 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303774 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303773 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303742 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303752 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303782 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303743 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303770 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303724 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303755 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303734 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303763 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303754 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303750 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303761 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303771 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303715 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303766 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303776 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303768 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303751 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303746 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303749 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303748 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303738 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303729 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303777 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303676 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303825 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303817 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303745 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303720 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303762 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303757 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303747 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303737 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303765 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303740 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303756 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303753 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303730 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303717 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303731 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303781 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303778 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303735 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303759 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303722 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303727 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303714 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun 22 12:04:44 puffin mysqld: 130622 12:04:44 [Warning] Aborted connection 303739 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)

I have edited /var/xdrago/second.sh to remove the dumping of the ps -lA and cat /proc/interrupts output to the /var/log/high-load.log file, and I have also multiplied the original thresholds by 5 rather than by 4:

# Original values:
#CTL_ONEX_SPIDER_LOAD=388
#CTL_FIVX_SPIDER_LOAD=388
#CTL_ONEX_LOAD=1444
#CTL_FIVX_LOAD=888
#CTL_ONEX_LOAD_CRIT=1888
#CTL_FIVX_LOAD_CRIT=1555
# x4 of original:
#CTL_ONEX_SPIDER_LOAD=1552
#CTL_FIVX_SPIDER_LOAD=1552
#CTL_ONEX_LOAD=5776
#CTL_FIVX_LOAD=3552
#CTL_ONEX_LOAD_CRIT=7552
#CTL_FIVX_LOAD_CRIT=6220
# 5x of original:
CTL_ONEX_SPIDER_LOAD=1940
CTL_FIVX_SPIDER_LOAD=1940
CTL_ONEX_LOAD=7220
CTL_FIVX_LOAD=4440
CTL_ONEX_LOAD_CRIT=9440
CTL_FIVX_LOAD_CRIT=7775
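
The x4 and 5x values above are plain integer multiples of the original values. As a rough sketch for regenerating them with a different multiplier (the scale_threshold helper and the FACTOR variable are made up for illustration and are not part of second.sh):

```shell
#!/bin/sh
# Hypothetical helper: multiply an original threshold by an integer factor.
scale_threshold() {
    # $1 = original value, $2 = integer multiplier
    echo $(( $1 * $2 ))
}

# Assumed multiplier; 5 reproduces the values currently in use.
FACTOR=5

# Original second.sh values as listed above.
for pair in \
    CTL_ONEX_SPIDER_LOAD=388 \
    CTL_FIVX_SPIDER_LOAD=388 \
    CTL_ONEX_LOAD=1444 \
    CTL_FIVX_LOAD=888 \
    CTL_ONEX_LOAD_CRIT=1888 \
    CTL_FIVX_LOAD_CRIT=1555
do
    name=${pair%%=*}    # text before the "="
    value=${pair#*=}    # text after the "="
    echo "${name}=$(scale_threshold "$value" "$FACTOR")"
done
```

The output can be pasted into second.sh in place of the active block, keeping the older values as comments as is done above.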

comment:53 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.3
Total Hours changed from 19.48 to 20.78

I have made a start on the awstats configuration, see wiki:AwStatsInstall

There was just another load spike. I caught the end of it and the site was slow but still responsive: I didn't get any 502 or 503 errors when browsing it, it was just very sluggish. I think this indicates that the 5x settings in second.sh are probably about right.

See also the graphs; since the munin refresh rate was changed from 5 mins to 3 mins the spikes are better recorded.

I'll upload these images to this ticket.

This is what was written to high-load.log at the start of the spike:

nginx high load on
ONEX_LOAD = 2245
FIVX_LOAD = 672
uptime :
 16:29:29 up 2 days,  6:56,  2 users,  load average: 22.45, 6.72, 2.62
top :
top - 16:29:30 up 2 days,  6:56,  2 users,  load average: 22.45, 6.72, 2.62
Tasks: 274 total,  34 running, 240 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.5%us,  2.5%sy,  0.0%ni, 91.4%id,  3.1%wa,  0.0%hi,  0.1%si,  0.6%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.8%id,  0.4%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.4%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu3  :  1.3%us,  1.7%sy,  0.0%ni, 96.2%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.5%sy,  0.0%ni, 96.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.8%us,  1.2%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.7%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  7551116k used,   820944k free,  1834400k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2597216k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
37460 www-data  20   0  779m 111m  63m R   93  1.4   5:07.66 php-fpm
28122 www-data  20   0  738m  12m 3956 R   64  0.2   0:18.24 php-fpm
37449 www-data  20   0  777m  98m  51m R   61  1.2   5:13.37 php-fpm
27935 www-data  20   0  739m  16m 7392 R   60  0.2   0:31.46 php-fpm
37462 www-data  20   0  778m 108m  61m R   60  1.3   5:05.99 php-fpm
28119 www-data  20   0  738m  12m 3980 R   56  0.2   0:18.83 php-fpm
37452 www-data  20   0  777m 109m  63m R   55  1.3   5:00.77 php-fpm
27938 www-data  20   0  739m  16m 7376 R   53  0.2   0:25.12 php-fpm
27939 www-data  20   0  738m  12m 3980 R   53  0.2   0:21.74 php-fpm
37450 www-data  20   0  780m 115m  65m R   53  1.4   5:47.06 php-fpm
37465 www-data  20   0  764m  94m  61m R   53  1.2   5:15.87 php-fpm
28124 www-data  20   0  738m  12m 3972 R   52  0.2   0:18.12 php-fpm
37461 www-data  20   0  776m  98m  52m R   52  1.2   5:11.53 php-fpm
37467 www-data  20   0  778m 135m  87m R   52  1.7   5:20.84 php-fpm
28179 www-data  20   0  735m 7968 2628 R   50  0.1   0:02.08 php-fpm
28270 www-data  20   0  738m  12m 3968 R   48  0.2   0:18.80 php-fpm
37463 www-data  20   0  768m 104m  66m R   48  1.3   5:22.53 php-fpm
37453 www-data  20   0  777m 108m  61m R   47  1.3   5:46.63 php-fpm
28149 www-data  20   0  738m  12m 3972 R   45  0.2   0:14.44 php-fpm
37448 www-data  20   0  776m 104m  58m R   45  1.3   5:22.87 php-fpm
37457 www-data  20   0  778m 142m  95m R   45  1.7   5:18.96 php-fpm
37451 www-data  20   0  773m  97m  55m R   44  1.2   5:34.51 php-fpm
37455 www-data  20   0  771m 100m  59m R   44  1.2   5:18.36 php-fpm
37447 www-data  20   0  829m 178m  80m R   42  2.2   5:15.28 php-fpm
28172 www-data  20   0  735m 8044 2688 R   39  0.1   0:07.80 php-fpm
37468 www-data  20   0  770m  99m  59m R   39  1.2   5:25.87 php-fpm
37459 www-data  20   0  775m 105m  61m R   35  1.3   5:03.91 php-fpm
28860 root      20   0 10624  532  304 S    3  0.0   0:00.02 bash
  225 root      20   0     0    0    0 S    2  0.0   0:51.50 kjournald
28346 root      20   0 10624 1368 1144 S    2  0.0   0:00.09 bash
28859 root      20   0 19200 1380  912 R    2  0.0   0:00.02 top
    1 root      20   0  8356  780  648 S    0  0.0   0:10.47 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd

And this when the high load settings were switched off:

nginx high load off
ONEX_LOAD = 1823
FIVX_LOAD = 1418
uptime :
 16:32:11 up 2 days,  6:59,  2 users,  load average: 18.23, 14.18, 6.31
top :
top - 16:32:12 up 2 days,  6:59,  2 users,  load average: 18.23, 14.18, 6.31
Tasks: 253 total,   1 running, 252 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.5%us,  2.5%sy,  0.0%ni, 91.3%id,  3.1%wa,  0.0%hi,  0.1%si,  0.6%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.7%id,  0.4%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.6%st
Cpu3  :  1.3%us,  1.7%sy,  0.0%ni, 96.2%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.6%sy,  0.0%ni, 96.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  6993624k used,  1378436k free,  1834460k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2598336k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
33473 root      20   0 19200 1360  912 R    8  0.0   0:00.06 top
   30 root      RT   0     0    0    0 S    2  0.0   0:21.87 migration/9
33274 root      20   0 10620 1364 1148 S    2  0.0   0:00.03 bash
    1 root      20   0  8356  780  648 S    0  0.0   0:10.48 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:39.41 migration/0
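
The ONEX_LOAD and FIVX_LOAD figures above look like the 1 and 5 minute load averages multiplied by 100 (18.23 becomes 1823, 14.18 becomes 1418). The following is a minimal Python sketch of that conversion and a threshold check; it is not the code from second.sh, and the function names and the threshold values are hypothetical.

```python
import re

# Matches the three load averages at the end of an `uptime` line.
LOAD_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def scaled_loads(uptime_line):
    """Return the 1, 5 and 15 minute load averages as integers scaled by 100."""
    match = LOAD_RE.search(uptime_line)
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(round(float(value) * 100) for value in match.groups())


def load_state(one_min, five_min, high, critical):
    """Classify the scaled load against two caller-supplied thresholds."""
    if one_min >= critical or five_min >= critical:
        return "critical"
    if one_min >= high or five_min >= high:
        return "high"
    return "normal"


line = " 16:32:11 up 2 days,  6:59,  2 users,  load average: 18.23, 14.18, 6.31"
onex, fivx, _ = scaled_loads(line)
print("ONEX_LOAD =", onex)
print("FIVX_LOAD =", fivx)
# The thresholds here are illustrative assumptions, not the values in second.sh.
print(load_state(onex, fivx, high=2000, critical=4000))
```

Working with integers scaled by 100 avoids floating point comparisons, which is presumably why the script logs the values in that form.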

Following is what was written to the php-fpm error log:

[22-Jun-2013 16:28:48] WARNING: [pool www] child 37467, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (32.756779 sec), logging
[22-Jun-2013 16:28:48] WARNING: [pool www] child 37463, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (32.961709 sec), logging
[22-Jun-2013 16:28:48] WARNING: [pool www] child 37462, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (33.430994 sec), logging
[22-Jun-2013 16:28:48] WARNING: [pool www] child 37461, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (32.855172 sec), logging
[22-Jun-2013 16:28:48] WARNING: [pool www] child 37459, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (33.163302 sec), logging
[22-Jun-2013 16:28:48] WARNING: [pool www] child 37456, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (33.285125 sec), logging
[22-Jun-2013 16:28:48] WARNING: [pool www] child 37453, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (32.776045 sec), logging
[22-Jun-2013 16:28:48] WARNING: [pool www] child 37451, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (32.548963 sec), logging
[22-Jun-2013 16:28:48] WARNING: [pool www] child 37450, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (32.966145 sec), logging
[22-Jun-2013 16:28:48] WARNING: [pool www] child 37448, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (32.471156 sec), logging
[22-Jun-2013 16:28:48] WARNING: [pool www] child 37447, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (32.587559 sec), logging
[22-Jun-2013 16:28:48] NOTICE: child 37467 stopped for tracing
[22-Jun-2013 16:28:48] NOTICE: about to trace 37467
[22-Jun-2013 16:28:48] ERROR: failed to ptrace(PEEKDATA) pid 37467: Input/output error (5)
[22-Jun-2013 16:28:49] NOTICE: finished trace of 37467
[22-Jun-2013 16:28:49] NOTICE: child 37447 stopped for tracing
[22-Jun-2013 16:28:49] NOTICE: about to trace 37447
[22-Jun-2013 16:28:49] NOTICE: finished trace of 37447
[22-Jun-2013 16:28:49] NOTICE: child 37448 stopped for tracing
[22-Jun-2013 16:28:49] NOTICE: about to trace 37448
[22-Jun-2013 16:28:50] NOTICE: finished trace of 37448
[22-Jun-2013 16:28:50] NOTICE: child 37450 stopped for tracing
[22-Jun-2013 16:28:50] NOTICE: about to trace 37450
[22-Jun-2013 16:28:50] NOTICE: finished trace of 37450
[22-Jun-2013 16:28:50] NOTICE: child 37451 stopped for tracing
[22-Jun-2013 16:28:50] NOTICE: about to trace 37451
[22-Jun-2013 16:28:50] ERROR: failed to ptrace(PEEKDATA) pid 37451: Input/output error (5)
[22-Jun-2013 16:28:50] NOTICE: finished trace of 37451
[22-Jun-2013 16:28:50] NOTICE: child 37453 stopped for tracing
[22-Jun-2013 16:28:50] NOTICE: about to trace 37453
[22-Jun-2013 16:28:50] ERROR: failed to ptrace(PEEKDATA) pid 37453: Input/output error (5)
[22-Jun-2013 16:28:50] NOTICE: finished trace of 37453
[22-Jun-2013 16:28:50] NOTICE: child 37456 stopped for tracing
[22-Jun-2013 16:28:50] NOTICE: about to trace 37456
[22-Jun-2013 16:28:50] NOTICE: finished trace of 37456
[22-Jun-2013 16:28:50] NOTICE: child 37459 stopped for tracing
[22-Jun-2013 16:28:50] NOTICE: about to trace 37459
[22-Jun-2013 16:28:50] NOTICE: finished trace of 37459
[22-Jun-2013 16:28:50] NOTICE: child 37461 stopped for tracing
[22-Jun-2013 16:28:50] NOTICE: about to trace 37461
[22-Jun-2013 16:28:51] NOTICE: finished trace of 37461
[22-Jun-2013 16:28:51] NOTICE: child 37462 stopped for tracing
[22-Jun-2013 16:28:51] NOTICE: about to trace 37462
[22-Jun-2013 16:28:51] NOTICE: finished trace of 37462
[22-Jun-2013 16:28:51] NOTICE: child 37463 stopped for tracing
[22-Jun-2013 16:28:51] NOTICE: about to trace 37463
[22-Jun-2013 16:28:51] NOTICE: finished trace of 37463
[22-Jun-2013 16:28:58] WARNING: [pool www] child 37468, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.575101 sec), logging
[22-Jun-2013 16:28:58] WARNING: [pool www] child 37465, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.165764 sec), logging
[22-Jun-2013 16:28:58] WARNING: [pool www] child 37460, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.519804 sec), logging
[22-Jun-2013 16:28:58] WARNING: [pool www] child 37455, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (33.314712 sec), logging
[22-Jun-2013 16:28:58] NOTICE: child 37455 stopped for tracing
[22-Jun-2013 16:28:58] NOTICE: about to trace 37455
[22-Jun-2013 16:28:58] NOTICE: finished trace of 37455
[22-Jun-2013 16:28:58] NOTICE: child 37468 stopped for tracing
[22-Jun-2013 16:28:58] NOTICE: about to trace 37468
[22-Jun-2013 16:28:58] NOTICE: finished trace of 37468
[22-Jun-2013 16:28:58] NOTICE: child 37465 stopped for tracing
[22-Jun-2013 16:28:58] NOTICE: about to trace 37465
[22-Jun-2013 16:28:58] NOTICE: finished trace of 37465
[22-Jun-2013 16:28:58] NOTICE: child 37460 stopped for tracing
[22-Jun-2013 16:28:58] NOTICE: about to trace 37460
[22-Jun-2013 16:28:58] NOTICE: finished trace of 37460
[22-Jun-2013 16:29:08] WARNING: [pool www] child 27935, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.151806 sec), logging
[22-Jun-2013 16:29:08] WARNING: [pool www] child 37457, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (35.993879 sec), logging
[22-Jun-2013 16:29:08] WARNING: [pool www] child 37452, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (35.808820 sec), logging
[22-Jun-2013 16:29:08] WARNING: [pool www] child 37449, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.840217 sec), logging
[22-Jun-2013 16:29:08] NOTICE: child 37449 stopped for tracing
[22-Jun-2013 16:29:08] NOTICE: about to trace 37449
[22-Jun-2013 16:29:08] NOTICE: finished trace of 37449
[22-Jun-2013 16:29:08] NOTICE: child 37457 stopped for tracing
[22-Jun-2013 16:29:08] NOTICE: about to trace 37457
[22-Jun-2013 16:29:08] NOTICE: finished trace of 37457
[22-Jun-2013 16:29:08] NOTICE: child 27935 stopped for tracing
[22-Jun-2013 16:29:08] NOTICE: about to trace 27935
[22-Jun-2013 16:29:08] NOTICE: finished trace of 27935
[22-Jun-2013 16:29:08] NOTICE: child 37452 stopped for tracing
[22-Jun-2013 16:29:08] NOTICE: about to trace 37452
[22-Jun-2013 16:29:08] NOTICE: finished trace of 37452
[22-Jun-2013 16:29:18] WARNING: [pool www] child 27938, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.701855 sec), logging
[22-Jun-2013 16:29:19] NOTICE: child 27938 stopped for tracing
[22-Jun-2013 16:29:19] NOTICE: about to trace 27938
[22-Jun-2013 16:29:19] NOTICE: finished trace of 27938
[22-Jun-2013 16:29:28] WARNING: [pool www] child 28270, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.445436 sec), logging
[22-Jun-2013 16:29:28] WARNING: [pool www] child 28124, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (34.018885 sec), logging
[22-Jun-2013 16:29:28] WARNING: [pool www] child 28122, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (35.779858 sec), logging
[22-Jun-2013 16:29:28] WARNING: [pool www] child 28119, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (38.116857 sec), logging
[22-Jun-2013 16:29:28] WARNING: [pool www] child 27939, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (39.816647 sec), logging
[22-Jun-2013 16:29:28] NOTICE: child 27939 stopped for tracing
[22-Jun-2013 16:29:28] NOTICE: about to trace 27939
[22-Jun-2013 16:29:28] NOTICE: finished trace of 27939
[22-Jun-2013 16:29:28] NOTICE: child 28119 stopped for tracing
[22-Jun-2013 16:29:28] NOTICE: about to trace 28119
[22-Jun-2013 16:29:28] NOTICE: finished trace of 28119
[22-Jun-2013 16:29:28] NOTICE: child 28122 stopped for tracing
[22-Jun-2013 16:29:28] NOTICE: about to trace 28122
[22-Jun-2013 16:29:28] NOTICE: finished trace of 28122
[22-Jun-2013 16:29:28] NOTICE: child 28270 stopped for tracing
[22-Jun-2013 16:29:28] NOTICE: about to trace 28270
[22-Jun-2013 16:29:28] NOTICE: finished trace of 28270
[22-Jun-2013 16:29:28] NOTICE: child 28124 stopped for tracing
[22-Jun-2013 16:29:28] NOTICE: about to trace 28124
[22-Jun-2013 16:29:28] NOTICE: finished trace of 28124
[22-Jun-2013 16:29:48] WARNING: [pool www] child 28172, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (36.209643 sec), logging
[22-Jun-2013 16:29:48] NOTICE: child 28172 stopped for tracing
[22-Jun-2013 16:29:48] NOTICE: about to trace 28172
[22-Jun-2013 16:29:48] NOTICE: finished trace of 28172
[22-Jun-2013 16:29:58] WARNING: [pool www] child 28179, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.836693 sec), logging
[22-Jun-2013 16:29:58] NOTICE: child 28179 stopped for tracing
[22-Jun-2013 16:29:58] NOTICE: about to trace 28179
[22-Jun-2013 16:29:58] NOTICE: finished trace of 28179
[22-Jun-2013 16:31:19] WARNING: [pool www] child 28274, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.870912 sec), logging
[22-Jun-2013 16:31:19] NOTICE: child 28274 stopped for tracing
[22-Jun-2013 16:31:19] NOTICE: about to trace 28274
[22-Jun-2013 16:31:19] NOTICE: finished trace of 28274
[22-Jun-2013 16:31:29] WARNING: [pool www] child 28585, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.940790 sec), logging
[22-Jun-2013 16:31:29] WARNING: [pool www] child 28330, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (31.058969 sec), logging
[22-Jun-2013 16:31:29] WARNING: [pool www] child 28319, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.307274 sec), logging
[22-Jun-2013 16:31:29] WARNING: [pool www] child 37456, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.795803 sec), logging
[22-Jun-2013 16:31:29] NOTICE: child 37456 stopped for tracing
[22-Jun-2013 16:31:29] NOTICE: about to trace 37456
[22-Jun-2013 16:31:29] NOTICE: finished trace of 37456
[22-Jun-2013 16:31:29] NOTICE: child 28319 stopped for tracing
[22-Jun-2013 16:31:29] NOTICE: about to trace 28319
[22-Jun-2013 16:31:29] NOTICE: finished trace of 28319
[22-Jun-2013 16:31:29] NOTICE: child 28330 stopped for tracing
[22-Jun-2013 16:31:29] NOTICE: about to trace 28330
[22-Jun-2013 16:31:29] NOTICE: finished trace of 28330
[22-Jun-2013 16:31:29] NOTICE: child 28585 stopped for tracing
[22-Jun-2013 16:31:29] NOTICE: about to trace 28585
[22-Jun-2013 16:31:29] NOTICE: finished trace of 28585
[22-Jun-2013 16:31:39] WARNING: [pool www] child 28149, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "POST /index.php") executing too slow (37.492814 sec), logging
[22-Jun-2013 16:31:39] WARNING: [pool www] child 28133, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "POST /index.php") executing too slow (39.911357 sec), logging
[22-Jun-2013 16:31:39] WARNING: [pool www] child 37462, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (32.647110 sec), logging
[22-Jun-2013 16:31:39] NOTICE: child 28133 stopped for tracing
[22-Jun-2013 16:31:39] NOTICE: about to trace 28133
[22-Jun-2013 16:31:39] ERROR: failed to ptrace(PEEKDATA) pid 28133: Input/output error (5)
[22-Jun-2013 16:31:39] NOTICE: finished trace of 28133
[22-Jun-2013 16:31:39] NOTICE: child 28149 stopped for tracing
[22-Jun-2013 16:31:39] NOTICE: about to trace 28149
[22-Jun-2013 16:31:39] ERROR: failed to ptrace(PEEKDATA) pid 28149: Input/output error (5)
[22-Jun-2013 16:31:39] NOTICE: finished trace of 28149
[22-Jun-2013 16:31:39] NOTICE: child 37462 stopped for tracing
[22-Jun-2013 16:31:39] NOTICE: about to trace 37462
[22-Jun-2013 16:31:39] ERROR: failed to ptrace(PEEKDATA) pid 37462: Input/output error (5)
[22-Jun-2013 16:31:39] NOTICE: finished trace of 37462
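
A log of this size is easier to read once summarised. The following Python sketch counts the "executing too slow" warnings per request method and reports the slowest request; the regular expression is written against the line format shown above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern for the php-fpm "executing too slow" warnings shown in the log above.
SLOW_RE = re.compile(
    r"child (?P<pid>\d+), script '(?P<script>[^']+)' "
    r'\(request: "(?P<method>[A-Z]+) (?P<uri>[^"]+)"\) '
    r"executing too slow \((?P<seconds>[\d.]+) sec\)"
)


def summarise_slow_requests(lines):
    """Return (Counter of request methods, slowest duration in seconds)."""
    methods = Counter()
    slowest = 0.0
    for line in lines:
        match = SLOW_RE.search(line)
        if match is None:
            continue  # NOTICE and ERROR lines carry no timing information
        methods[match.group("method")] += 1
        slowest = max(slowest, float(match.group("seconds")))
    return methods, slowest


# Four lines copied from the log above; the NOTICE line is skipped.
sample = """\
[22-Jun-2013 16:28:48] WARNING: [pool www] child 37467, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "HEAD /index.php") executing too slow (32.756779 sec), logging
[22-Jun-2013 16:28:48] NOTICE: child 37467 stopped for tracing
[22-Jun-2013 16:28:58] WARNING: [pool www] child 37468, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "GET /index.php") executing too slow (30.575101 sec), logging
[22-Jun-2013 16:31:39] WARNING: [pool www] child 28149, script '/data/disk/tn/static/transition-network-d6-p005/index.php' (request: "POST /index.php") executing too slow (37.492814 sec), logging
""".splitlines()

methods, slowest = summarise_slow_requests(sample)
print(dict(methods), slowest)
```

Run against the full log file, the same function would show how the slow requests split between HEAD, GET and POST.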

Changed 3 years ago by chris

Attachment puffin-cpu-day-2013-06-22.png added

Puffin CPU Spikes 2013-06-22

Changed 3 years ago by chris

Attachment puffin-load-day-2013-06-22.png added

Puffin Load Spikes 2013-06-22

comment:54 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 20.78 to 20.88

The two images attached above (Puffin CPU Spikes 2013-06-22 and Puffin Load Spikes 2013-06-22) illustrate the spikes.

comment:55 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.5
Total Hours changed from 20.88 to 21.38

I have done some more work on configuring awstats, see wiki:AwStatsInstall#Awstatsinstall

comment:56 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.3
Total Hours changed from 21.38 to 21.68

The maximum of 75 MySQL connections has been reached again. Following is the output of perl /usr/local/bin/mysqltuner.pl:

 >>  MySQLTuner 1.2.0 - Major Hayden <major@mhtx.net>
 >>  Bug reports, feature requests, and downloads at http://mysqltuner.com/
 >>  Run with '--help' for additional options and output filtering
[OK] Logged in using credentials from debian maintenance account.

-------- General Statistics --------------------------------------------------
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.5.31-MariaDB-1~squeeze-log
[OK] Operating on 64-bit architecture

-------- Storage Engine Statistics -------------------------------------------
[--] Status: +Archive -BDB +Federated +InnoDB -ISAM -NDBCluster 
[--] Data in MyISAM tables: 104M (Tables: 2)
[--] Data in InnoDB tables: 447M (Tables: 1037)
[--] Data in PERFORMANCE_SCHEMA tables: 0B (Tables: 17)
[!!] Total fragmented tables: 99

-------- Security Recommendations  -------------------------------------------
[OK] All database users have passwords assigned

-------- Performance Metrics -------------------------------------------------
[--] Up for: 3d 4h 15m 14s (16M q [60.080 qps], 449K conn, TX: 31B, RX: 2B)
[--] Reads / Writes: 87% / 13%
[--] Total buffers: 1.1G global + 13.4M per thread (75 max threads)
[OK] Maximum possible memory usage: 2.1G (26% of installed RAM)
[OK] Slow queries: 0% (93/16M)
[!!] Highest connection usage: 100%  (76/75)
[OK] Key buffer size / total MyISAM indexes: 509.0M/93.2M
[OK] Key buffer hit rate: 98.3% (32M cached / 568K reads)
[OK] Query cache efficiency: 73.8% (10M cached / 14M selects)
[!!] Query cache prunes per day: 888102
[OK] Sorts requiring temporary tables: 2% (9K temp sorts / 440K sorts)
[!!] Joins performed without indexes: 15445
[!!] Temporary tables created on disk: 30% (157K on disk / 522K total)
[OK] Thread cache hit rate: 99% (76 created / 449K connections)
[!!] Table cache hit rate: 0% (128 open / 118K opened)
[OK] Open file limit used: 0% (4/196K)
[OK] Table locks acquired immediately: 99% (5M immediate / 5M locks)
[OK] InnoDB data size / buffer pool: 447.2M/509.0M

-------- Recommendations -----------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
    Reduce or eliminate persistent connections to reduce connection usage
    Adjust your join queries to always utilize indexes
    When making adjustments, make tmp_table_size/max_heap_table_size equal
    Reduce your SELECT DISTINCT queries without LIMIT clauses
    Increase table_cache gradually to avoid file descriptor limits
Variables to adjust:
    max_connections (> 75)
    wait_timeout (< 3600)
    interactive_timeout (< 28800)
    query_cache_size (> 64M)
    join_buffer_size (> 1.0M, or always use indexes with joins)
    tmp_table_size (> 64M)
    max_heap_table_size (> 128M)
    table_cache (> 128)

These values in /etc/my.cnf have been increased as suggested, but I haven't changed the timeout values as these should also be checked for php-fpm and nginx:

#join_buffer_size        = 1M
join_buffer_size        = 2M

#max_connections         = 75
#max_user_connections    = 75
max_connections         = 100
max_user_connections    = 100

#query_cache_size        = 64M
query_cache_size        = 128M

#table_cache             = 128
table_cache             = 256

#max_heap_table_size     = 128M
max_heap_table_size     = 256M

#tmp_table_size          = 64M
tmp_table_size          = 128M

And mysql has been restarted.

comment:57 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Priority changed from critical to major
Total Hours changed from 21.68 to 21.93
Description modified (diff)
Summary changed from Load spikes, ksoftirqd using all the CPU and services stopping for 15 min at a time to Load spikes causing the TN site to be stopped for 15 min at a time

I think there is still potential for performance improvements via mysql tuning, however I'm reluctant to allocate additional RAM to MySQL when I'm not sure if the current allocation of 8GB to puffin (it did have 4GB) is going to be permanent.

Following is the latest result of perl /usr/local/bin/mysqltuner.pl:

 >>  MySQLTuner 1.2.0 - Major Hayden <major@mhtx.net>
 >>  Bug reports, feature requests, and downloads at http://mysqltuner.com/
 >>  Run with '--help' for additional options and output filtering
[OK] Logged in using credentials from debian maintenance account.

-------- General Statistics --------------------------------------------------
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.5.31-MariaDB-1~squeeze-log
[OK] Operating on 64-bit architecture

-------- Storage Engine Statistics -------------------------------------------
[--] Status: +Archive -BDB +Federated +InnoDB -ISAM -NDBCluster 
[--] Data in MyISAM tables: 104M (Tables: 2)
[--] Data in InnoDB tables: 456M (Tables: 1037)
[--] Data in PERFORMANCE_SCHEMA tables: 0B (Tables: 17)
[!!] Total fragmented tables: 101

-------- Security Recommendations  -------------------------------------------
[OK] All database users have passwords assigned

-------- Performance Metrics -------------------------------------------------
[--] Up for: 20h 55m 18s (5M q [67.881 qps], 130K conn, TX: 9B, RX: 841M)
[--] Reads / Writes: 82% / 18%
[--] Total buffers: 1.3G global + 14.4M per thread (100 max threads)
[OK] Maximum possible memory usage: 2.7G (33% of installed RAM)
[OK] Slow queries: 0% (24/5M)
[OK] Highest usage of available connections: 60% (60/100)
[OK] Key buffer size / total MyISAM indexes: 509.0M/93.9M
[OK] Key buffer hit rate: 98.7% (9M cached / 117K reads)
[OK] Query cache efficiency: 78.3% (3M cached / 4M selects)
[!!] Query cache prunes per day: 486359
[OK] Sorts requiring temporary tables: 2% (3K temp sorts / 109K sorts)
[!!] Joins performed without indexes: 4597
[!!] Temporary tables created on disk: 27% (37K on disk / 136K total)
[OK] Thread cache hit rate: 99% (60 created / 130K connections)
[!!] Table cache hit rate: 0% (256 open / 38K opened)
[OK] Open file limit used: 0% (6/196K)
[OK] Table locks acquired immediately: 99% (1M immediate / 1M locks)
[OK] InnoDB data size / buffer pool: 456.7M/509.0M

-------- Recommendations -----------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
    MySQL started within last 24 hours - recommendations may be inaccurate
    Adjust your join queries to always utilize indexes
    When making adjustments, make tmp_table_size/max_heap_table_size equal
    Reduce your SELECT DISTINCT queries without LIMIT clauses
    Increase table_cache gradually to avoid file descriptor limits
Variables to adjust:
    query_cache_size (> 128M)
    join_buffer_size (> 2.0M, or always use indexes with joins)
    tmp_table_size (> 128M)
    max_heap_table_size (> 256M)
    table_cache (> 256)
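
The "Maximum possible memory usage" figure in both mysqltuner reports is consistent with the global buffers plus the per-thread buffers multiplied by the maximum number of connections. A short Python sketch of that arithmetic follows; the inputs are the rounded values printed in the reports, so the results can differ slightly from the tool's own figures, and the function name is hypothetical.

```python
def max_memory_mb(global_buffers_mb, per_thread_mb, max_connections):
    """Worst case: every permitted connection uses its full per-thread buffers."""
    return global_buffers_mb + per_thread_mb * max_connections


# "Mem: 8372060k total" from the top output earlier in this ticket.
RAM_MB = 8372060 / 1024

# (global buffers in MB, per-thread buffers in MB, max connections)
runs = {
    "75 connections": (1.1 * 1024, 13.4, 75),
    "100 connections": (1.3 * 1024, 14.4, 100),
}

totals = {}
for label, (global_mb, thread_mb, connections) in runs.items():
    totals[label] = max_memory_mb(global_mb, thread_mb, connections)
    print(
        f"{label}: {totals[label] / 1024:.1f}G, "
        f"about {100 * totals[label] / RAM_MB:.0f}% of RAM"
    )
```

Raising max_connections from 75 to 100 therefore costs roughly 25 times the per-thread buffer size in worst-case memory, on top of any increase in the global buffers.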

And this is the result of the /usr/local/bin/tuning-primer.sh script, found via http://www.day32.com/MySQL/:

        -- MYSQL PERFORMANCE TUNING PRIMER --
             - By: Matthew Montgomery -

MySQL Version 5.5.31-MariaDB-1~squeeze-log x86_64

Uptime = 0 days 20 hrs 57 min 38 sec
Avg. qps = 67
Total Questions = 5123458
Threads Connected = 3

Warning: Server has not been running for at least 48hrs.
It may not be safe to use these recommendations

To find out more information on how each of these
runtime variables effects performance visit:
http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html
Visit http://www.mysql.com/products/enterprise/advisors.html
for info about MySQL's Enterprise Monitoring and Advisory Service

SLOW QUERIES
The slow query log is enabled.
Current long_query_time = 5.000000 sec.
You have 25 out of 5123512 that take longer than 5.000000 sec. to complete
Your long_query_time seems to be fine

BINARY UPDATE LOG
The binary update log is NOT enabled.
You will not be able to do point in time recovery
See http://dev.mysql.com/doc/refman/5.5/en/point-in-time-recovery.html

WORKER THREADS
Current thread_cache_size = 128
Current threads_cached = 58
Current threads_per_sec = 0
Historic threads_per_sec = 0
Your thread_cache_size is fine

MAX CONNECTIONS
Current max_connections = 100
Current threads_connected = 2
Historic max_used_connections = 60
The number of used connections is 60% of the configured maximum.
Your max_connections variable seems to be fine.

INNODB STATUS
Current InnoDB index space = 177 M
Current InnoDB data space = 457 M
Current InnoDB buffer pool free = 0 %
Current innodb_buffer_pool_size = 509 M
Depending on how much space your innodb indexes take up it may be safe
to increase this value to up to 2 / 3 of total system memory

MEMORY USAGE
Max Memory Ever Allocated : 1.97 G
Configured Max Per-thread Buffers : 1.40 G
Configured Max Global Buffers : 1.13 G
Configured Max Memory Limit : 2.53 G
Physical Memory : 7.98 G
Max memory limit seem to be within acceptable norms

KEY BUFFER
Current MyISAM index space = 93 M
Current key_buffer_size = 509 M
Key cache miss rate is 1 : 78
Key buffer free ratio = 81 %
Your key_buffer_size seems to be fine

QUERY CACHE
Query cache is enabled
Current query_cache_size = 128 M
Current query_cache_used = 78 M
Current query_cache_limit = 128 K
Current Query cache Memory fill ratio = 61.62 %
Current query_cache_min_res_unit = 4 K
MySQL won't cache query results that are larger than query_cache_limit in size

SORT OPERATIONS
Current sort_buffer_size = 128 K
Current read_rnd_buffer_size = 4 M
Sort buffer seems to be fine

JOINS
tuning-primer.sh: line 402: export: `2097152': not a valid identifier
Current join_buffer_size = 2.00 M
You have had 4607 queries where a join could not use an index properly
You should enable "log-queries-not-using-indexes"
Then look for non indexed joins in the slow query log.
If you are unable to optimize your queries you may want to increase your
join_buffer_size to accommodate larger joins in one pass.

Note! This script will still suggest raising the join_buffer_size when
ANY joins not using indexes are found.

OPEN FILES LIMIT
Current open_files_limit = 196608 files
The open_files_limit should typically be set to at least 2x-3x
that of table_cache if you have heavy MyISAM usage.
Your open_files_limit value seems to be fine

TABLE CACHE
Current table_open_cache = 256 tables
Current table_definition_cache = 512 tables
You have a total of 1080 tables
You have 256 open tables.
Current table_cache hit rate is 0%
, while 100% of your table cache is in use
You should probably increase your table_cache
You should probably increase your table_definition_cache value.

TEMP TABLES
Current max_heap_table_size = 256 M
Current tmp_table_size = 128 M
Of 99254 temp tables, 27% were created on disk
Perhaps you should increase your tmp_table_size and/or max_heap_table_size
to reduce the number of disk-based temporary tables
Note! BLOB and TEXT columns are not allow in memory tables.
If you are using these columns raising these values might not impact your 
ratio of on disk temp tables.

TABLE SCANS
Current read_buffer_size = 8 M
Current table scan ratio = 92 : 1
read_buffer_size seems to be fine

TABLE LOCKING
Current Lock Wait ratio = 1 : 114280
Your table locking seems to be fine

The description of this ticket has been edited.

Last edited 3 years ago by chris (previous) (diff)

comment:58 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.85
Total Hours changed from 21.93 to 22.78

Following a chat with Ed these are the things I'm going to work on:

RAM - document how the additional 4GB has improved performance.
Finish sorting out the wiki:AwStatsInstall so we have some better data on the site traffic, (wiki:PiwikServer excludes bots).

I also intend to do these things:

Further tune MySQL
Script the edits to the MySQL, Nginx and php-fpm config files so they don't take 15 mins to do manually after each BOA upgrade. Perhaps a set of vim scripts for editing the config files would make sense so that each change could be reviewed rather than totally automating it.
Raise a BOA ticket regarding the thresholds in second.sh which we have had to tweak.
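
One way to keep each config change reviewable rather than fully automated is to generate the proposed file and a unified diff, and only install the result after reading the diff. The following Python sketch assumes the convention used in this ticket (comment out the old line, add the new value beneath it); the function names and the sample settings are illustrative only.

```python
import difflib
import re


def apply_overrides(text, overrides):
    """Comment out each matching setting and add the new value beneath it."""
    out = []
    for line in text.splitlines():
        match = re.match(r"^(\w+)\s*=", line)
        if match and match.group(1) in overrides:
            key = match.group(1)
            out.append("#" + line)
            out.append(f"{key:<24}= {overrides[key]}")
        else:
            out.append(line)  # comments and untouched settings pass through
    return "\n".join(out) + "\n"


def review_diff(old, new, name="my.cnf"):
    """Return a unified diff of the proposed change for manual review."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=name,
            tofile=name + ".proposed",
        )
    )


current = "max_connections         = 75\ntable_cache             = 128\n"
proposed = apply_overrides(current, {"max_connections": "100"})
print(review_diff(current, proposed))
```

Settings named in the overrides but absent from the file are silently ignored in this sketch, so a real script would want to report them.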

A question for Jim: there is currently enough RAM to double the Redis allocation from 512MB to 1GB. Is there any reason not to do this? The performance drop when we didn't have Redis running was very significant, and I expect that giving Redis extra RAM would speed things up.

comment:59 follow-up: ↓ 68 Changed 3 years ago by jim

Chris, check out the variables in .barracuda.cnf as they allow seeing
custom php.ini, my.cnf & others. No need to script most stuff.

The question is around second.sh - ideally we'd supply a patch that allows
tuning of scripts in /var/xdrago to our needs.

comment:60 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.3
Total Hours changed from 22.78 to 23.08

Since we are going to run with the extra 4GB of RAM, for some weeks at least, I have made the following tweaks to /etc/mysql/my.cnf based on suggestions in #comment:57:

#innodb_buffer_pool_size = 509M
innodb_buffer_pool_size = 600M

#query_cache_limit       = 128K
query_cache_limit       = 256K

#query_cache_size        = 64M
query_cache_size        = 512M

#log_queries_not_using_indexes
log_queries_not_using_indexes

#join_buffer_size        = 1M
join_buffer_size        = 6M

#table_cache             = 128
table_cache             = 2048

#tmp_table_size          = 64M
tmp_table_size          = 512M

#max_heap_table_size     = 128M
max_heap_table_size     = 1024M

There is a copy of my.cnf in /root/ just in case all the changes are clobbered.
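
A note on the tmp_table_size and max_heap_table_size pair above: MySQL and MariaDB cap in-memory temporary tables at the smaller of the two values, so raising only one of them does not raise the cap. A minimal Python sketch of that rule follows; the parse_size helper is hypothetical and only handles the K/M/G suffixes used in this my.cnf.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value):
    """Convert a my.cnf style size such as '512M' into bytes (hypothetical helper)."""
    value = value.strip().upper()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def effective_tmp_table_limit(tmp_table_size, max_heap_table_size):
    """In-memory temporary tables are limited by the smaller of the two settings."""
    return min(parse_size(tmp_table_size), parse_size(max_heap_table_size))


# Values taken from the my.cnf changes listed above.
limit = effective_tmp_table_limit("512M", "1024M")
print(limit // UNITS["M"], "MB")
```

With the values above the effective cap works out at 512M, so the larger max_heap_table_size should only come into play for explicitly created MEMORY tables.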

The tuning scripts should be run again tomorrow to see what they suggest. They already suggest some further changes, but I don't want to restart mysql more than absolutely necessary so these tweaks can wait. The suggestions are also contradictory regarding joins / indexes:

[!!] Joins performed without indexes: 124
[!!] Temporary tables created on disk: 26% (1K on disk / 3K total)

    Adjust your join queries to always utilize indexes

    join_buffer_size (> 6.0M, or always use indexes with joins)

Current join_buffer_size = 6.00 M
You have had 129 queries where a join could not use an index properly
join_buffer_size >= 4 M
This is not advised
You should enable "log-queries-not-using-indexes"
Then look for non indexed joins in the slow query log.

Current table_open_cache = 2048 tables
Current table_definition_cache = 512 tables
You have a total of 1080 tables
You have 1107 open tables.
The table_cache value seems to be fine
You should probably increase your table_definition_cache value.

comment:61 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 23.08 to 23.33

I have made a start on a page documenting the wiki:RamUsage by wiki:PuffinServer. I'll also add a section to it about wiki:PenguinServer, as this machine is currently swapping a fair amount and looks like it would also benefit from some additional RAM; see the munin stats here:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/penguin.transitionnetwork.org/memory.html

Direct link to the RAM usage page:

/trac/wiki/RamUsage

The plan for the wiki:RamUsage page is to document the need for additional RAM in order that the cost can be justified.

comment:62 Changed 3 years ago by jim

Further to my last regarding a patch for second.sh...

cat /proc/cpuinfo | grep processor | wc -l returns the number of CPUs for any Linux box.

So if we had sensible defaults for 1 cpu, then we should be able to multiply it through by the number of CPUs.

As long as the results match up with the default (which I think expects 4 CPUs) and ours with 14, we're onto a winner and the patch is more likely to be accepted.

Unless there's a load variable in the system that is already aware of the # of cores?

comment:63 Changed 3 years ago by jim

grep -c processor /proc/cpuinfo is a much nicer way of getting the CPU count, and is already used in /var/xdrago/proc_num_ctrl.cgi.
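
A rough sketch of the scaling idea, in Python rather than the shell used by second.sh. It assumes the 1444 and 888 values recorded in /root/.barracuda.cnf elsewhere in this ticket are the defaults, and that they were chosen for 4 CPUs, which is only a guess. The function names and the sample cpuinfo text are made up for illustration.

```python
def count_cpus(cpuinfo_text):
    """Count lines containing 'processor', like `grep -c processor /proc/cpuinfo`."""
    return sum(1 for line in cpuinfo_text.splitlines() if "processor" in line)


def scale_thresholds(limit_one, limit_two, reference_cpus, cpus):
    """Scale both load limits from a reference CPU count to the actual one."""
    return (limit_one * cpus // reference_cpus,
            limit_two * cpus // reference_cpus)


# Made-up stand-in for /proc/cpuinfo on a 14 CPU machine.
sample = "".join(
    "processor\t: {}\nmodel name\t: Example CPU\n\n".format(n) for n in range(14)
)

cpus = count_cpus(sample)
print(cpus, scale_thresholds(1444, 888, 4, cpus))
```

Under those assumptions 14 CPUs would give limits of 5054 and 3108, i.e. a multiplier of 3.5.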

comment:64 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.15
Total Hours changed from 23.33 to 23.48

comment:65 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.7
Total Hours changed from 23.48 to 25.18

We just had another load spike. The high load config was switched on at 10:01:14 am when the load hit 23, then 8 mins later the load hit 90 and the web server was stopped; it wasn't until 6 mins after that that the load had dropped enough for services to be started up again. Pingdom measured 5 mins of downtime:

 www.transitionnetwork.org (www.transitionnetwork.org) is UP again at 25/06/2013  10:12:57, after 5m of downtime.

Following is what was logged when the high load config was switched on:

nginx high load on
ONEX_LOAD = 2182
FIVX_LOAD = 624
uptime :
 10:01:14 up 5 days, 28 min,  1 user,  load average: 23.44, 6.83, 2.61
top :
top - 10:01:22 up 5 days, 28 min,  1 user,  load average: 24.36, 7.30, 2.79
Tasks: 290 total,  23 running, 261 sleeping,   2 stopped,   4 zombie
Cpu0  :  2.4%us,  2.4%sy,  0.0%ni, 91.3%id,  3.2%wa,  0.0%hi,  0.1%si,  0.5%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.9%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.3%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu3  :  1.2%us,  1.6%sy,  0.0%ni, 96.4%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.5%sy,  0.0%ni, 96.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  6287004k used,  2085056k free,  1023812k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2230356k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
65172 nobody    20   0     8    4    0 R  177  0.0   0:03.52 munin-node
65080 root      20   0     0    0    0 R  137  0.0   0:13.79 lfd
65168 root      20   0 19200 1396  912 R  128  0.0   0:03.75 top
65115 aegir     20   0  217m  10m 6888 R  103  0.1   0:10.47 drush.php
65134 root      20   0 53504  14m  560 R   95  0.2   0:05.99 lfd
65117 root      20   0 53504  14m  592 R   75  0.2   0:06.34 lfd
64944 root      20   0 13292 3512  452 R   71  0.0   0:38.39 bzip2
64971 tn        20   0  222m  17m 8600 R   62  0.2   0:31.50 php
65175 root      20   0     0    0    0 R   54  0.0   0:00.93 bash
65107 root      20   0 10684 1428 1144 R   47  0.0   0:03.16 bash
65176 root      20   0  5368  568  480 S   40  0.0   0:00.68 sleep
64825 aegir     20   0  234m  25m 8740 R   38  0.3   0:48.37 drush.php
56567 root      20   0  734m 7312 2160 R   27  0.1   0:12.85 php-fpm
30652 redis     20   0  191m  35m  920 R   22  0.4   3:42.47 redis-server
64828 root      20   0 10852 1660 1208 S   21  0.0   0:23.99 backupninja
65153 root      20   0     0    0    0 R   21  0.0   0:00.36 grep
65155 root      20   0     0    0    0 R   19  0.0   0:00.59 awk
28288 www-data  20   0  776m 143m  99m S   14  1.8   3:00.57 php-fpm
65129 root      20   0 10616 1344 1128 R    8  0.0   0:00.86 bash
    1 root      20   0  8356  780  648 S    0  0.0   0:20.22 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd

Switching to the high load configuration didn't bring the load down this time; 8 mins later the load was at 90 and this was logged:

php-fpm and nginx about to be killed
ONEX_LOAD = 9086
FIVX_LOAD = 6301
uptime :
 10:09:59 up 5 days, 36 min,  1 user,  load average: 90.86, 63.01, 30.97
top :
top - 10:10:00 up 5 days, 36 min,  1 user,  load average: 87.10, 62.69, 31.04
Tasks: 343 total,  32 running, 311 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.4%us,  2.4%sy,  0.0%ni, 91.3%id,  3.2%wa,  0.0%hi,  0.1%si,  0.6%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.8%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.3%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu3  :  1.2%us,  1.7%sy,  0.0%ni, 96.4%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.5%sy,  0.0%ni, 96.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.7%us,  1.2%sy,  0.0%ni, 97.4%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.6%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  6081764k used,  2290296k free,  1023892k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2230472k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1713 www-data  20   0  740m  21m  10m R   62  0.3   1:07.85 php-fpm
 3326 www-data  20   0  742m  34m  21m R   57  0.4   1:05.02 php-fpm
 4208 www-data  20   0  739m  16m 7400 R   55  0.2   0:57.85 php-fpm
 6065 tn        20   0  258m  49m 9084 R   54  0.6   0:53.65 drush.php
 7825 www-data  20   0  738m  12m 3972 R   46  0.2   0:00.29 php-fpm
 3449 www-data  20   0  739m  17m 7552 R   44  0.2   0:51.53 php-fpm
65185 www-data  20   0  755m  64m  37m R   43  0.8   1:56.78 php-fpm
 9852 aegir     20   0  236m  27m 8936 R   41  0.3   0:00.27 drush.php
 2989 www-data  20   0  744m  39m  24m R   40  0.5   1:09.97 php-fpm
 4207 www-data  20   0  741m  29m  17m R   40  0.4   1:06.60 php-fpm
 4073 www-data  20   0  744m  38m  24m R   38  0.5   1:03.83 php-fpm
65397 www-data  20   0  755m  63m  36m R   38  0.8   1:44.07 php-fpm
 4133 www-data  20   0  743m  35m  22m R   36  0.4   1:08.74 php-fpm
65284 www-data  20   0  745m  50m  33m R   36  0.6   1:58.90 php-fpm
 3517 www-data  20   0  743m  34m  21m R   35  0.4   1:06.83 php-fpm
 7172 www-data  20   0  738m  12m 4064 R   35  0.2   0:16.42 php-fpm
 9615 aegir     20   0  241m  32m 8936 R   30  0.4   0:40.06 drush.php
65252 www-data  20   0  745m  47m  31m R   30  0.6   1:47.14 php-fpm
 3375 www-data  20   0  743m  36m  23m R   27  0.5   1:03.95 php-fpm
28313 www-data  20   0  759m  79m  51m R   27  1.0   3:40.67 php-fpm
 7284 www-data  20   0  738m  12m 3928 R   25  0.2   0:00.16 php-fpm
 8714 www-data  20   0  736m  10m 3892 R   25  0.1   0:00.18 php-fpm
 5018 www-data  20   0  739m  16m 7400 R   24  0.2   0:53.26 php-fpm
  321 www-data  20   0 72336  10m 1836 R   17  0.1   0:02.74 nginx
 8350 www-data  20   0  736m  10m 3896 R   16  0.1   0:00.10 php-fpm
 9849 aegir     20   0  221m  15m 8460 R   14  0.2   0:00.11 drush.php
56567 root      20   0  734m 7316 2160 S   14  0.1   1:11.92 php-fpm
11744 mysql     20   0 2269m 1.3g  10m S    9 15.8  33:09.05 mysqld
28290 www-data  20   0  768m 105m  68m S    9  1.3   4:32.54 php-fpm
 9863 root      20   0 19200 1448  912 R    6  0.0   0:00.04 top
 9840 root      20   0 19340 1476  912 R    5  0.0   0:00.05 top
 9915 www-data  20   0  734m 5984  828 S    5  0.1   0:00.03 php-fpm
 9932 root      20   0 19200 1444  912 S    5  0.0   0:00.03 top
65200 www-data  20   0  747m  47m  29m S    5  0.6   1:42.20 php-fpm
   16 root      20   0     0    0    0 S    2  0.0   7:28.25 ksoftirqd/4
   55 root      20   0     0    0    0 S    2  0.0   0:19.35 events/10
 9608 root      20   0 10628 1368 1144 S    2  0.0   0:01.39 bash
 9846 root      20   0 10620 1356 1136 S    2  0.0   0:00.01 bash
 9871 root      20   0 10684 1428 1148 S    2  0.0   0:00.01 bash
10099 root      20   0     0    0    0 R    2  0.0   0:00.01 mysqladmin
30652 redis     20   0  191m  39m  920 S    2  0.5   4:07.71 redis-server
64822 root      20   0 10624 1372 1148 S    2  0.0   0:10.06 bash
65129 root      20   0 10624 1372 1148 S    2  0.0   0:07.69 bash
65232 www-data  20   0  745m  48m  32m R    2  0.6   1:01.58 php-fpm
65353 root      20   0 10624 1368 1148 S    2  0.0   0:01.99 bash
65488 www-data  20   0 72336  10m 1832 S    2  0.1   0:09.98 nginx
65535 www-data  20   0 72336  10m 1840 S    2  0.1   0:01.15 nginx
    1 root      20   0  8356  780  648 S    0  0.0   0:20.23 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd

This was a big enough spike to cause a gap in the Munin stats, and php-fpm and nginx were not started again till 6 mins later:

nginx high load off
ONEX_LOAD = 59
FIVX_LOAD = 1922
uptime :
 10:16:11 up 5 days, 43 min,  1 user,  load average: 0.59, 19.22, 21.52
top :
top - 10:16:12 up 5 days, 43 min,  1 user,  load average: 0.59, 19.22, 21.52
Tasks: 244 total,   1 running, 243 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.4%us,  2.4%sy,  0.0%ni, 91.3%id,  3.2%wa,  0.0%hi,  0.1%si,  0.6%st
Cpu1  :  1.9%us,  2.4%sy,  0.0%ni, 94.8%id,  0.3%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu2  :  1.9%us,  2.1%sy,  0.0%ni, 95.3%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu3  :  1.2%us,  1.7%sy,  0.0%ni, 96.4%id,  0.2%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu4  :  1.1%us,  1.5%sy,  0.0%ni, 96.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu5  :  0.9%us,  1.4%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu6  :  0.8%us,  1.3%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu7  :  0.7%us,  1.2%sy,  0.0%ni, 97.4%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu8  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu9  :  0.7%us,  1.2%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu10 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu11 :  0.7%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu12 :  0.6%us,  1.1%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.5%st
Cpu13 :  0.6%us,  1.1%sy,  0.0%ni, 97.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.5%st
Mem:   8372060k total,  5933896k used,  2438164k free,  1024032k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2125704k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18971 root      20   0 19200 1364  912 R    6  0.0   0:00.05 top
    1 root      20   0  8356  780  648 S    0  0.0   0:20.24 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   1:26.39 migration/0

I have looked through all the logs and can't find anything worth noting that hasn't already been noted on other ticket comments.

I have run the mysql tuning scripts again, these are the results:

perl mysqltuner.pl 

 >>  MySQLTuner 1.2.0 - Major Hayden <major@mhtx.net>
 >>  Bug reports, feature requests, and downloads at http://mysqltuner.com/
 >>  Run with '--help' for additional options and output filtering
[OK] Logged in using credentials from debian maintenance account.

-------- General Statistics --------------------------------------------------
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.5.31-MariaDB-1~squeeze-log
[OK] Operating on 64-bit architecture

-------- Storage Engine Statistics -------------------------------------------
[--] Status: +Archive -BDB +Federated +InnoDB -ISAM -NDBCluster 
[--] Data in MyISAM tables: 104M (Tables: 2)
[--] Data in InnoDB tables: 449M (Tables: 1037)
[--] Data in PERFORMANCE_SCHEMA tables: 0B (Tables: 17)
[!!] Total fragmented tables: 96

-------- Security Recommendations  -------------------------------------------
[OK] All database users have passwords assigned

-------- Performance Metrics -------------------------------------------------
[--] Up for: 21h 53m 32s (4M q [59.066 qps], 129K conn, TX: 8B, RX: 760M)
[--] Reads / Writes: 81% / 19%
[--] Total buffers: 2.1G global + 18.4M per thread (100 max threads)
[OK] Maximum possible memory usage: 3.9G (48% of installed RAM)
[OK] Slow queries: 0% (34K/4M)
[!!] Highest connection usage: 100%  (101/100)
[OK] Key buffer size / total MyISAM indexes: 509.0M/94.5M
[OK] Key buffer hit rate: 99.8% (9M cached / 16K reads)
[OK] Query cache efficiency: 83.5% (3M cached / 4M selects)
[OK] Query cache prunes per day: 0
[OK] Sorts requiring temporary tables: 3% (2K temp sorts / 75K sorts)
[!!] Joins performed without indexes: 2320
[!!] Temporary tables created on disk: 27% (28K on disk / 105K total)
[OK] Thread cache hit rate: 99% (101 created / 129K connections)
[OK] Table cache hit rate: 22% (1K open / 8K opened)
[OK] Open file limit used: 0% (58/196K)
[OK] Table locks acquired immediately: 99% (1M immediate / 1M locks)
[OK] InnoDB data size / buffer pool: 449.1M/600.0M

-------- Recommendations -----------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
    MySQL started within last 24 hours - recommendations may be inaccurate
    Reduce or eliminate persistent connections to reduce connection usage
    Adjust your join queries to always utilize indexes
    Temporary table size is already large - reduce result set size
    Reduce your SELECT DISTINCT queries without LIMIT clauses
Variables to adjust:
    max_connections (> 100)
    wait_timeout (< 3600)
    interactive_timeout (< 28800)
    join_buffer_size (> 6.0M, or always use indexes with joins)

bash tuning-primer.sh 
 
        -- MYSQL PERFORMANCE TUNING PRIMER --
             - By: Matthew Montgomery -

MySQL Version 5.5.31-MariaDB-1~squeeze-log x86_64

Uptime = 0 days 21 hrs 35 min 21 sec
Avg. qps = 58
Total Questions = 4559816
Threads Connected = 2

Warning: Server has not been running for at least 48hrs.
It may not be safe to use these recommendations

To find out more information on how each of these
runtime variables effects performance visit:
http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html
Visit http://www.mysql.com/products/enterprise/advisors.html
for info about MySQL's Enterprise Monitoring and Advisory Service

SLOW QUERIES
The slow query log is enabled.
Current long_query_time = 5.000000 sec.
You have 33837 out of 4559870 that take longer than 5.000000 sec. to complete
Your long_query_time seems to be fine

BINARY UPDATE LOG
The binary update log is NOT enabled.
You will not be able to do point in time recovery
See http://dev.mysql.com/doc/refman/5.5/en/point-in-time-recovery.html

WORKER THREADS
Current thread_cache_size = 128
Current threads_cached = 95
Current threads_per_sec = 0
Historic threads_per_sec = 0
Your thread_cache_size is fine

MAX CONNECTIONS
Current max_connections = 100
Current threads_connected = 6
Historic max_used_connections = 101
The number of used connections is 101% of the configured maximum.
You should raise max_connections

INNODB STATUS
Current InnoDB index space = 178 M
Current InnoDB data space = 449 M
Current InnoDB buffer pool free = 3 %
Current innodb_buffer_pool_size = 600 M
Depending on how much space your innodb indexes take up it may be safe
to increase this value to up to 2 / 3 of total system memory

MEMORY USAGE
Max Memory Ever Allocated : 3.40 G
Configured Max Per-thread Buffers : 1.79 G
Configured Max Global Buffers : 1.59 G
Configured Max Memory Limit : 3.38 G
Physical Memory : 7.98 G
Max memory limit seem to be within acceptable norms

KEY BUFFER
Current MyISAM index space = 94 M
Current key_buffer_size = 509 M
Key cache miss rate is 1 : 589
Key buffer free ratio = 78 %
Your key_buffer_size seems to be fine

QUERY CACHE
Query cache is enabled
Current query_cache_size = 512 M
Current query_cache_used = 109 M
Current query_cache_limit = 256 K
Current Query cache Memory fill ratio = 21.34 %
Current query_cache_min_res_unit = 4 K
Query Cache is 30 % fragmented
Run "FLUSH QUERY CACHE" periodically to defragment the query cache memory
If you have many small queries lower 'query_cache_min_res_unit' to reduce fragmentation.
Your query_cache_size seems to be too high.
Perhaps you can use these resources elsewhere
MySQL won't cache query results that are larger than query_cache_limit in size

SORT OPERATIONS
Current sort_buffer_size = 128 K
Current read_rnd_buffer_size = 4 M
Sort buffer seems to be fine

JOINS
tuning-primer.sh: line 402: export: `2097152': not a valid identifier
Current join_buffer_size = 6.00 M
You have had 2305 queries where a join could not use an index properly
join_buffer_size >= 4 M
This is not advised
You should enable "log-queries-not-using-indexes"
Then look for non indexed joins in the slow query log.

OPEN FILES LIMIT
Current open_files_limit = 196608 files
The open_files_limit should typically be set to at least 2x-3x
that of table_cache if you have heavy MyISAM usage.
Your open_files_limit value seems to be fine

TABLE CACHE
Current table_open_cache = 2048 tables
Current table_definition_cache = 512 tables
You have a total of 1080 tables
You have 1974 open tables.
Current table_cache hit rate is 25%
, while 96% of your table cache is in use
You should probably increase your table_cache
You should probably increase your table_definition_cache value.

TEMP TABLES
Current max_heap_table_size = 1.00 G
Current tmp_table_size = 512 M
Of 74733 temp tables, 27% were created on disk
Perhaps you should increase your tmp_table_size and/or max_heap_table_size
to reduce the number of disk-based temporary tables
Note! BLOB and TEXT columns are not allow in memory tables.
If you are using these columns raising these values might not impact your 
ratio of on disk temp tables.

TABLE SCANS
Current read_buffer_size = 8 M
Current table scan ratio = 92 : 1
read_buffer_size seems to be fine

TABLE LOCKING
Current Lock Wait ratio = 1 : 1025630
Your table locking seems to be fine

These values in /etc/mysql/my.cnf have been changed:

max_connections         = 120
max_user_connections    = 120

#table_definition_cache  = 512
table_definition_cache  = 2048

#sort_buffer_size        = 128K
sort_buffer_size        = 512K
#bulk_insert_buffer_size = 128K
bulk_insert_buffer_size = 256K

table_cache             = 4096

#table_open_cache        = 64
table_open_cache        = 2048

#wait_timeout            = 3600
wait_timeout            = 300

max_heap_table_size     = 2048M

tmp_table_size          = 1024M

join_buffer_size        = 8M

And I have restarted MySQL.

Jim, it might be worth looking at the slow query log. The tuning scripts made these suggestions / notes; I don't know how valid they are:

Sorts requiring temporary tables: 1% (63 temp sorts / 5K sorts)
Of 5858 temp tables, 26% were created on disk
Reduce your SELECT DISTINCT queries without LIMIT clauses
Adjust your join queries to always utilize indexes
You have had 81 queries where a join could not use an index properly -- look for non indexed joins in the slow query log.

comment:66 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 25.18 to 25.43

I have made these additional changes to the my.cnf file:

#key_buffer_size         = 509M
key_buffer_size         = 256M

join_buffer_size        = 32M

And restarted mysql.

comment:67 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 25.43 to 25.68

Reading this, https://dev.mysql.com/doc/refman/5.1/en/query-cache-configuration.html and looking at these values:

MariaDB [mysql]> SHOW GLOBAL STATUS;
...
| Qcache_free_blocks                       | 31783       |
| Qcache_free_memory                       | 420047440   |
| Qcache_hits                              | 1510058     |
| Qcache_inserts                           | 532864      |
| Qcache_lowmem_prunes                     | 0           |
| Qcache_not_cached                        | 99024       |
| Qcache_queries_in_cache                  | 77218       |
| Qcache_total_blocks                      | 187187      |
| Queries                                  | 2468635     |
...

I have adjusted these variables:

query_cache_limit       = 1M

query_cache_min_res_unit = 2K

And restarted mysql.
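
For reference, a small Python sketch of the arithmetic behind these two settings, using the Qcache_* values above and assuming query_cache_size is still the 512M set earlier in this ticket. The hit rate formula is a common approximation and differs from the one MySQLTuner uses; the function name is made up.

```python
def query_cache_summary(status, query_cache_size):
    """Derive rough query cache ratios from SHOW GLOBAL STATUS counters."""
    hits = status["Qcache_hits"]
    inserts = status["Qcache_inserts"]
    not_cached = status["Qcache_not_cached"]
    used_bytes = query_cache_size - status["Qcache_free_memory"]
    return {
        # Share of cacheable lookups answered from the cache.
        "hit_rate": hits / (hits + inserts + not_cached),
        # Average memory used per cached result, in bytes.
        "avg_result_bytes": used_bytes / status["Qcache_queries_in_cache"],
        # Many free blocks relative to the total hints at fragmentation.
        "free_block_ratio": status["Qcache_free_blocks"] / status["Qcache_total_blocks"],
    }


status = {
    "Qcache_free_blocks": 31783,
    "Qcache_free_memory": 420047440,
    "Qcache_hits": 1510058,
    "Qcache_inserts": 532864,
    "Qcache_not_cached": 99024,
    "Qcache_queries_in_cache": 77218,
    "Qcache_total_blocks": 187187,
}

summary = query_cache_summary(status, 512 * 1024 ** 2)
for name, value in summary.items():
    print(name, round(value, 3))
```

On these numbers the average cached result is roughly 1.5K, which is in line with dropping query_cache_min_res_unit below its previous 4K.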

comment:68 in reply to: ↑ 59 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.5
Total Hours changed from 25.68 to 26.18

Replying to jim:

Chris, check out the variables in .barracuda.cnf as they allow seeing
custom php.ini, my.cnf & others. No need to script most stuff.

Thanks Jim, I have changed these variables in /root/.barracuda.cnf :

_LOAD_LIMIT_ONE=1444
_LOAD_LIMIT_TWO=888

_CUSTOM_CONFIG_SQL=NO

_CUSTOM_CONFIG_PHP_5_3=NO

So they now read:

_LOAD_LIMIT_ONE=7220
_LOAD_LIMIT_TWO=4440

_CUSTOM_CONFIG_SQL=YES

_CUSTOM_CONFIG_PHP_5_3=YES

The question is around second.sh - ideally we'd supply a patch that allows
tuning of scripts in /var/xdrago to our needs.

Do you know if the values for _LOAD_LIMIT_ONE and _LOAD_LIMIT_TWO in /root/.barracuda.cnf are calculated somewhere or are they simply set at standard reasonable defaults?

Re-running the mysql tuning scripts:

 >>  MySQLTuner 1.2.0 - Major Hayden <major@mhtx.net>
 >>  Bug reports, feature requests, and downloads at http://mysqltuner.com/
 >>  Run with '--help' for additional options and output filtering
[OK] Logged in using credentials from debian maintenance account.

-------- General Statistics --------------------------------------------------
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.5.31-MariaDB-1~squeeze-log
[OK] Operating on 64-bit architecture

-------- Storage Engine Statistics -------------------------------------------
[--] Status: +Archive -BDB +Federated +InnoDB -ISAM -NDBCluster 
[--] Data in MyISAM tables: 105M (Tables: 2)
[--] Data in InnoDB tables: 453M (Tables: 1039)
[--] Data in PERFORMANCE_SCHEMA tables: 0B (Tables: 17)
[!!] Total fragmented tables: 101

-------- Security Recommendations  -------------------------------------------
[OK] All database users have passwords assigned

-------- Performance Metrics -------------------------------------------------
[--] Up for: 2d 15h 51m 55s (14M q [61.416 qps], 391K conn, TX: 27B, RX: 2B)
[--] Reads / Writes: 84% / 16%
[--] Total buffers: 2.3G global + 44.8M per thread (120 max threads)
[!!] Maximum possible memory usage: 7.6G (95% of installed RAM)
[OK] Slow queries: 0% (105K/14M)
[!!] Highest connection usage: 100%  (121/120)
[OK] Key buffer size / total MyISAM indexes: 256.0M/96.6M
[OK] Key buffer hit rate: 99.9% (28M cached / 24K reads)
[OK] Query cache efficiency: 81.0% (10M cached / 12M selects)
[OK] Query cache prunes per day: 0
[OK] Sorts requiring temporary tables: 1% (2K temp sorts / 259K sorts)
[!!] Joins performed without indexes: 8000
[OK] Temporary tables created on disk: 25% (100K on disk / 393K total)
[OK] Thread cache hit rate: 99% (121 created / 391K connections)
[!!] Table cache hit rate: 16% (2K open / 12K opened)
[OK] Open file limit used: 0% (80/196K)
[OK] Table locks acquired immediately: 99% (3M immediate / 3M locks)
[OK] InnoDB data size / buffer pool: 453.4M/600.0M

-------- Recommendations -----------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
    Reduce your overall MySQL memory footprint for system stability
    Reduce or eliminate persistent connections to reduce connection usage
    Adjust your join queries to always utilize indexes
    Increase table_cache gradually to avoid file descriptor limits
Variables to adjust:
  *** MySQL's maximum memory usage is dangerously high ***
  *** Add RAM before increasing MySQL buffer variables ***
    max_connections (> 120)
    wait_timeout (< 300)
    interactive_timeout (< 28800)
    join_buffer_size (> 32.0M, or always use indexes with joins)
    table_cache (> 4096)

 
        -- MYSQL PERFORMANCE TUNING PRIMER --
             - By: Matthew Montgomery -

MySQL Version 5.5.31-MariaDB-1~squeeze-log x86_64

Uptime = 2 days 15 hrs 52 min 41 sec
Avg. qps = 61
Total Questions = 14124710
Threads Connected = 3

Server has been running for over 48hrs.
It should be safe to follow these recommendations

To find out more information on how each of these
runtime variables effects performance visit:
http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html
Visit http://www.mysql.com/products/enterprise/advisors.html
for info about MySQL's Enterprise Monitoring and Advisory Service

SLOW QUERIES
The slow query log is enabled.
Current long_query_time = 5.000000 sec.
You have 105118 out of 14124819 that take longer than 5.000000 sec. to complete
Your long_query_time seems to be fine

BINARY UPDATE LOG
The binary update log is NOT enabled.
You will not be able to do point in time recovery
See http://dev.mysql.com/doc/refman/5.5/en/point-in-time-recovery.html

WORKER THREADS
Current thread_cache_size = 128
Current threads_cached = 118
Current threads_per_sec = 0
Historic threads_per_sec = 0
Your thread_cache_size is fine

MAX CONNECTIONS
Current max_connections = 120
Current threads_connected = 2
Historic max_used_connections = 121
The number of used connections is 100% of the configured maximum.
You should raise max_connections

INNODB STATUS
Current InnoDB index space = 179 M
Current InnoDB data space = 453 M
Current InnoDB buffer pool free = 1 %
Current innodb_buffer_pool_size = 600 M
Depending on how much space your innodb indexes take up it may be safe
to increase this value to up to 2 / 3 of total system memory

MEMORY USAGE
Max Memory Ever Allocated : 6.63 G
Configured Max Per-thread Buffers : 5.24 G
Configured Max Global Buffers : 1.34 G
Configured Max Memory Limit : 6.59 G
Physical Memory : 7.98 G
Max memory limit seem to be within acceptable norms

KEY BUFFER
Current MyISAM index space = 96 M
Current key_buffer_size = 256 M
Key cache miss rate is 1 : 1148
Key buffer free ratio = 71 %
Your key_buffer_size seems to be fine

QUERY CACHE
Query cache is enabled
Current query_cache_size = 512 M
Current query_cache_used = 287 M
Current query_cache_limit = 1 M
Current Query cache Memory fill ratio = 56.07 %
Current query_cache_min_res_unit = 2 K
MySQL won't cache query results that are larger than query_cache_limit in size

SORT OPERATIONS
Current sort_buffer_size = 512 K
Current read_rnd_buffer_size = 4 M
Sort buffer seems to be fine

JOINS
tuning-primer.sh: line 402: export: `2097152': not a valid identifier
Current join_buffer_size = 32.00 M
You have had 8004 queries where a join could not use an index properly
join_buffer_size >= 4 M
This is not advised
You should enable "log-queries-not-using-indexes"
Then look for non indexed joins in the slow query log.

OPEN FILES LIMIT
Current open_files_limit = 196608 files
The open_files_limit should typically be set to at least 2x-3x
that of table_cache if you have heavy MyISAM usage.
Your open_files_limit value seems to be fine

TABLE CACHE
Current table_open_cache = 4096 tables
Current table_definition_cache = 2048 tables
You have a total of 1082 tables
You have 2089 open tables.
The table_cache value seems to be fine

TEMP TABLES
Current max_heap_table_size = 2.00 G
Current tmp_table_size = 1.00 G
Of 292953 temp tables, 25% were created on disk
Perhaps you should increase your tmp_table_size and/or max_heap_table_size
to reduce the number of disk-based temporary tables
Note! BLOB and TEXT columns are not allow in memory tables.
If you are using these columns raising these values might not impact your 
ratio of on disk temp tables.

TABLE SCANS
Current read_buffer_size = 8 M
Current table scan ratio = 96 : 1
read_buffer_size seems to be fine

TABLE LOCKING
Current Lock Wait ratio = 1 : 1239726
Your table locking seems to be fine

These variables in /etc/mysql/my.cnf have been changed:

wait_timeout            = 120

max_connections         = 150
max_user_connections    = 150

query_cache_limit       = 2M
query_cache_min_res_unit = 1K

table_open_cache        = 6144
table_definition_cache  = 6144
table_cache             = 8192

tmp_table_size          = 1024M
max_heap_table_size     = 2048M
max_tmp_tables          = 32768

innodb_buffer_pool_size = 1024M
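
The "Maximum possible memory usage" figures in the MySQLTuner output can be reproduced, to within rounding, as global buffers plus per-thread buffers times max_connections. A minimal Python sketch using the rounded figures from the two MySQLTuner runs in this ticket; the function name is made up and 7.98G is the physical memory reported by tuning-primer.sh.

```python
PHYSICAL_GB = 7.98  # "Physical Memory" as reported by tuning-primer.sh


def max_memory_gb(global_buffers_gb, per_thread_mb, max_connections):
    """Worst case: global buffers plus every connection using its full per-thread buffers."""
    return global_buffers_gb + per_thread_mb * max_connections / 1024


# (global buffers in G, per-thread buffers in M, max threads) from the two runs.
runs = {
    "earlier run": (2.1, 18.4, 100),
    "this run": (2.3, 44.8, 120),
}

for label, (global_gb, thread_mb, connections) in runs.items():
    total = max_memory_gb(global_gb, thread_mb, connections)
    print(label, round(total, 2), "G,", round(100 * total / PHYSICAL_GB), "% of RAM")
```

If the per-thread figure stays at 44.8M, raising max_connections from 120 to 150 would add about another 1.3G to the worst case, so it may be worth re-running MySQLTuner after these changes.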

comment:69 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.0
Total Hours changed from 26.18 to 27.18

I have done some work on the awstats config and detailed data is being written to the data file, /var/lib/awstats/awstats062013.www.transitionnetwork.org.txt on penguin. However, the resulting graph doesn't contain this data, only the total number of hits: https://penguin.transitionnetwork.org/awstats/www.transitionnetwork.org/stats-2013-06/awstats.www.transitionnetwork.org.html

I can't see what I'm doing wrong, I'll revisit this tomorrow.

comment:70 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.0
Total Hours changed from 27.18 to 28.18

I have installed Munin nginx_vhost_traffic plugin on puffin from http://www.mygento.net/blog/munin_nginx_vhost_traffic_plugin/

cd /usr/share/munin/plugins
wget http://www.mygento.net/media/nginx_vhost_traffic
chmod 755 nginx_vhost_traffic
cd /etc/munin/plugins/
ln -s /usr/share/munin/plugins/nginx_vhost_traffic

The /etc/munin/plugin-conf.d/munin-node was edited and this section was added:

[nginx_vhost_traffic]
group adm
env.vhosts puffin.webarch.net www.transitionnetwork.org space.transitionnetwork.org cgp.master.puffin.webarch.net newlive.puffin.webarch.net
env.logdir /var/log/nginx
env.logfile access.log
env.aggregate false

And munin-node was restarted.

I'm not sure how useful this will be, or even if it works properly, stats will be generated here https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/nginx_vhost_traffic.html

I have spent some more time trying to get awstats to generate some graphs from the nginx logs but I haven't had any luck with that so I'm going to switch to using Piwik, see http://piwik.org/log-analytics/how-to/

comment:71 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 2.0
Total Hours changed from 28.18 to 30.18

To import the puffin nginx logs this was tried:

python /web/stats.transitionnetwork.org/piwik/misc/log-analytics/import_logs.py --url=https://stats.transitionnetwork.org/ \
       --dry-run --show-progress \
       --idsite=12 --enable-static --enable-bots --enable-http-errors --enable-http-redirect \
       --log-format-regex='"(?P<ip>\S+)" (?P<host>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<status>\S+) (?P<length>\S+) \S+ \S+ "(?P<referrer>.*?)" "(?P<user_agent>.*?)" \S+ "\S+"' \
       --recorders=8 \
       /home/puffin/nginx/puffin-nginx-2013-06-22.log

But this resulted in lots of lines being missed:

    34220 requests imported successfully
    3258 requests were downloads
    9157 requests ignored:
        9157 invalid log lines

This is due to Nginx recording HTTPS requests with the IP and proxy IP, eg this Google bot request:

"66.249.75.112, 127.0.0.1"

So the logs need to be run through sed first:

cat puffin-nginx-2013-06-22.log | sed 's/, 127.0.0.1//' > puffin-nginx-2013-06-22.log.fixed
cat puffin-nginx-2013-06-23.log | sed 's/, 127.0.0.1//' > puffin-nginx-2013-06-23.log.fixed
cat puffin-nginx-2013-06-24.log | sed 's/, 127.0.0.1//' > puffin-nginx-2013-06-24.log.fixed
cat puffin-nginx-2013-06-25.log | sed 's/, 127.0.0.1//' > puffin-nginx-2013-06-25.log.fixed
cat puffin-nginx-2013-06-26.log | sed 's/, 127.0.0.1//' > puffin-nginx-2013-06-26.log.fixed
cat puffin-nginx-2013-06-27.log | sed 's/, 127.0.0.1//' > puffin-nginx-2013-06-27.log.fixed
cat puffin-nginx-2013-06-28.log | sed 's/, 127.0.0.1//' > puffin-nginx-2013-06-28.log.fixed
cat puffin-nginx-2013-06-29.log | sed 's/, 127.0.0.1//' > puffin-nginx-2013-06-29.log.fixed
cat puffin-nginx-2013-06-30.log | sed 's/, 127.0.0.1//' > puffin-nginx-2013-06-30.log.fixed
cat puffin-nginx-2013-07-01.log | sed 's/, 127.0.0.1//' > puffin-nginx-2013-07-01.log.fixed
cat puffin-nginx-2013-07-02.log | sed 's/, 127.0.0.1//' > puffin-nginx-2013-07-02.log.fixed
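
The same clean-up can be sketched as a small Python helper instead of one sed line per file. This is only an illustration: the names strip_proxy_ip and fix_log are made up here, and unlike the sed expression above the pattern escapes the dots so that only the literal address 127.0.0.1 is matched.

```python
import re

# Matches the ", 127.0.0.1" suffix that appears after the client IP for
# requests that arrived via the local HTTPS reverse proxy.
PROXY_SUFFIX = re.compile(r", 127\.0\.0\.1")


def strip_proxy_ip(line: str) -> str:
    """Remove the first ', 127.0.0.1' from a log line (like sed without /g)."""
    return PROXY_SUFFIX.sub("", line, count=1)


def fix_log(src_path: str, dst_path: str) -> None:
    """Write a cleaned copy of src_path to dst_path, one line at a time."""
    # surrogateescape keeps any odd bytes in the log intact on the way through.
    with open(src_path, encoding="utf-8", errors="surrogateescape") as src, \
         open(dst_path, "w", encoding="utf-8", errors="surrogateescape") as dst:
        for line in src:
            dst.write(strip_proxy_ip(line))
```

For example, fix_log("puffin-nginx-2013-06-22.log", "puffin-nginx-2013-06-22.log.fixed") would produce the same kind of .fixed file as the commands above.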

After the dry-run was tested the data was actually imported and a script was written to run via cron, see wiki:PiwikImportScript

Looking at these stats we are getting between 5.5k and 7.5k visitors a day, of which around 1.2k are bots. Stats are not generated for the total bandwidth etc. Parsing these logs does mean we need to check the privacy policy.

comment:72 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.5
Total Hours changed from 30.18 to 30.68

To recap on where we are at with the log analysis:

AWStats - haven't managed to get this working with the default BOA Nginx log format, it still looks like the best bet for getting stats about total hits, total bandwidth and bots.
Piwik - I did import several days worth of Nginx logs into Piwik, but although it does illustrate how the regular Piwik stats only report a fraction of the traffic, it's not that useful as Piwik is designed for reporting on human interactions, not bots.

This is how I suggest I proceed:

Disable the importing of logs into Piwik -- the data that it produces isn't that great.
Set up Nginx to write an additional access log in a format that works with AWStats and sort out the remaining issues here wiki:AwStatsInstall

I have installed logstalgia on my local machine, tailing the access log via ssh, see https://code.google.com/p/logstalgia/ -- this might be good if we wanted to produce a video of the traffic or something, but it's not as exciting as the videos here https://www.youtube.com/user/Logstalgia -- this might be because images, css and javascript don't appear to be logged, not sure why.

comment:73 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.0
Total Hours changed from 30.68 to 31.68

I really don't understand why I can't get AWStats working, it doesn't generate a data file and the generated graphs are blank.

I have had more luck with http://www.webalizer.org/ and that looks like the best option. I'll finish setting it up tomorrow and document it.

comment:74 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.9
Total Hours changed from 31.68 to 32.58

This comment is to record the time spent on an email to the ttech list about the ongoing load spike issues.

comment:75 Changed 3 years ago by chris

I have created a new wiki page to document what tools we have for analysing web server logs, wiki:WebServerLogs. This isn't yet complete and there are some oddities with the Webalizer stats and also the goaccess stats, which makes me think the log format isn't exactly right. I'll try to resolve this tomorrow and finish documenting how to get a handle on what the web servers are doing.

These tools are only available to people with ssh and sudo, Jim should give them a try when he has a few spare mins:

These stats will be available for everyone with a password when I have them sorted:

/trac/wiki/WebServerLogs#webalizer

comment:76 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.0
Total Hours changed from 32.58 to 33.58

Oops I forgot to add the time to that last comment.

comment:77 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.0
Total Hours changed from 33.58 to 34.58

Goaccess on Debian Squeeze is version 0.12-1 and this doesn't have features of the newer version in Debian Wheezy, like the ability to generate HTML reports and the ability to specify the log file format in a ~/.goaccessrc file.

I suggest that when we upgrade to Wheezy, see ticket:535 we set up Goaccess to generate a HTML report per day.

I have documented Webalizer, wiki:WebServerLogs#webalizer and sent a password to the ttech list.

Looking at the last few days of Webalizer stats the busiest was yesterday (11th July 2013):

7,558 visits (this will not be exact due to Nginx reverse proxying HTTPS connections)
59,295 pages
65,338 hits
1.7GB of files

Contrast to Piwik stats:

1,131 visits
2,813 page views

I think this difference is mostly due to the bots.

comment:78 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 34.58 to 34.83

The following might, or might not, be related to the issues we have been having. I'd love to know what the kernel issue that causes "random load spikes" is. Looking at the stats for the server that puffin is hosted on I don't think we have a "noisy neighbors" issue.

High load average alerts

Thread on BOA forum, in relation to a far lower spec virtual server, https://groups.drupal.org/node/306518

The advice from omega8cc is:

I would strongly suggest to ask Linode to move your VPS to some other
machine. We have seen this too many times - people wasting hours and days
trying to figure out what the problem could be, only to see it magically
fixed once moved away from noisy neighbors. Note that one migration may be
not enough if you are migrated to another machine with another set of noisy
neighbors.

Only if this will not help, continue with debugging - but since the load
seems to be related to some cron tasks, it is almost for sure disk I/O and/or
CPU power shortage - a typical sign of being hosted on a critically
overloaded machine.

There were no changes on the BOA side which could cause issues like this.

Linux Kernel Upgrade

A notification regarding BOA hosted servers, http://omega8.cc/emergency-linux-kernel-upgrade-nyc-1-274

Omega8.cc has announced the following emergency maintenance:

Start Date: Tuesday, July 9th, 2013 05:30 AM (EDT) End Date: Tuesday, July
9th, 2013 06:30 AM (EDT)

Locations: NYC 1 (New York, US)

During this maintenance window, Omega8.cc engineers will be performing
emergency reboot on all machines affected by random load spikes after recent
Linux kernel security upgrade. Expected downtime depends on the possible
hardware reconfiguration on some machines and may take 5 to 15 minutes on an
average.

comment:79 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 2.0
Total Hours changed from 34.83 to 36.83

When looking at the Puffin Nginx access log via wiki:WebServerLogs#logstalgia it's clear that most of the hits are not logged -- there are no records of css, images or js files being served to clients, but, based on the figures from http://tools.pingdom.com/fpt/ the front page has:

Image 31
Other 3
CSS 3

So the Nginx config files were traced, starting from /etc/nginx/nginx.conf and following the includes; these files were checked:

/etc/nginx/mime.types
/etc/nginx/conf.d/*.conf
- /etc/nginx/conf.d/aegir.conf
- /var/aegir/config/server_master/nginx/pre.d/*
  - /var/aegir/config/server_master/nginx/pre.d/nginx_speed_purge.conf
  - /var/aegir/config/server_master/nginx/pre.d/nginx_wild_ssl.conf
- /var/aegir/config/server_master/nginx/platform.d/*
  - /data/disk/tn/config/server_master/nginx/vhost.d/*
    - /data/disk/tn/config/server_master/nginx/vhost.d/news.transitionnetwork.org
      - /data/disk/tn/config/includes/fastcgi_params.conf
      - /data/disk/tn/config/includes/nginx_octopus_include.conf
    - /data/disk/tn/config/server_master/nginx/vhost.d/space.transitionnetwork.org
      - /data/disk/tn/config/includes/fastcgi_params.conf
      - /data/disk/tn/config/includes/nginx_modern_include.conf
    - /data/disk/tn/config/server_master/nginx/vhost.d/stg.transitionnetwork.org
      - /data/disk/tn/config/includes/fastcgi_params.conf
      - /data/disk/tn/config/includes/nginx_octopus_include.conf
    - /data/disk/tn/config/server_master/nginx/vhost.d/tn.puffin.webarch.net
      - /data/disk/tn/config/includes/fastcgi_params.conf
      - /data/disk/tn/config/includes/nginx_modern_include.conf
    - /data/disk/tn/config/server_master/nginx/vhost.d/www.transitionnetwork.org
      - /data/disk/tn/config/includes/fastcgi_params.conf
      - /data/disk/tn/config/includes/nginx_octopus_include.conf
        /data/conf/nginx_high_load.c*
        /data/disk/tn/config/server_master/nginx/post.d/nginx_force_include*
        /data/disk/tn/config/server_master/nginx/post.d/nginx_vhost_include*
- /var/aegir/config/server_master/nginx/vhost.d/*
  - /var/aegir/config/server_master/nginx/vhost.d/cgp.master.puffin.webarch.net
  - /var/aegir/config/server_master/nginx/vhost.d/chive.master.puffin.webarch.net
  - /var/aegir/config/server_master/nginx/vhost.d/master.puffin.webarch.net
- /var/aegir/config/server_master/nginx/post.d/*
/etc/nginx/sites-enabled/*

And the following resources have access logs disabled, in /etc/nginx/conf.d/aegir.conf, data for Munin stats:

## chris
  location /nginx_status {
    access_log   off;
  }
  location ~ ^/(status|ping)$ {
    access_log off;
  }

Requests for purging the speed cache in /var/aegir/config/server_master/nginx/pre.d/nginx_speed_purge.conf:

  location ~ /purge-([a-z\-]*)(/.*) {
    fastcgi_cache_purge speed $1$host$request_method$2;
    log_not_found off;
  }

The HTTPS reverse proxy, in /var/aegir/config/server_master/nginx/pre.d/nginx_wild_ssl.conf:

  location / {
    access_log                 off;
    log_not_found              off;
  }

This is where Nginx is set to not log images, css, etc etc, in /data/disk/tn/config/includes/nginx_octopus_include.conf:

location ^~ /cdn/farfuture/ {
  access_log    off;
  log_not_found off;
  }

location = /favicon.ico {
  access_log    off;
  log_not_found off;
  }

location = /robots.txt {
  access_log    off;
  log_not_found off;
  }

location = /cron.php {
  access_log   off;
  }

location = /core/cron.php {
  access_log   off;
  }

location ~ (?<upload_form_uri>.*)/x-progress-id:(?<upload_id>\w*) {
  access_log off;
}

location ^~ /progress {
  access_log off;
}

location ^~ /hosting/c/server_master {
  access_log off;
}

location ^~ /hosting/c/server_localhost {
  access_log off;
}

location ^~ /hosting {
  access_log off;
}

location ^~ /admin/settings/performance/cache-backend {
  access_log off;
}

location ^~ /admin {
  access_log off;
}

location ^~ /audio/download {
  location ~* ^/audio/download/.*/.*\.(?:mp3|mp4|m4a|ogg)$ {
    access_log off;
  }
}

location ~* (?:cgi-bin|vti-bin) {
  access_log off;
}

location ~* \.r\.(?:jpe?g|png|gif) {
  access_log off;
}

location ~* /(?:.+)/files/styles/adaptive/(?:.+)$ {
  access_log off;
}

location ~* /(?:external|system|files/imagecache|files/styles)/ {
  access_log off;
}

location ~* ^/sites/.*/files/backup_migrate/ {
  access_log off;
}

location ~* ^/sites/.*/files/config_.* {
  access_log off;
}

location ~* ^/sites/.*/files/private/ {

  access_log off;
}

location ~* ^/sites/.*/private/ {
  access_log off;
}

location ~* wysiwyg_fields/(?:plugins|scripts)/.*\.(?:js|css) {
  access_log off;
  log_not_found off;
}

location ~* files/advagg_(?:css|js)/ {
  access_log off;
}

location ~* \.css$ {
  access_log  off;
}

location ~* \.(?:js|htc)$ {
  access_log  off;
}

location ~* \.json$ {
  access_log  off;
}

location @uncached {
  access_log off;
}

location ~* ^.+\.(?:jpe?g|gif|png|ico|bmp|svg|swf|pdf|docx?|xlsx?|pptx?|tiff?|txt|rtf|cgi|bat|pl|dll|aspx?|class|otf|ttf|woff|eot|less)$ {
  access_log    off;
  log_not_found off;
}

location ~* /(?:cross-?domain)\.xml$ {
  access_log  off;
}

location ~* /(?:modules|libraries)/(?:contrib/)?(?:ad|tinybrowser|f?ckeditor|tinymce|wysiwyg_spellcheck|ecc|civicrm|fbconnect|radioactivity)/.*\.php$ {
  access_log   off;
}

location ~* ^/sites/.*/(?:modules|libraries)/(?:contrib/)?(?:tinybrowser|f?ckeditor|tinymce)/.*\.(?:html?|xml)$ {
  access_log      off;
}

location ~* ^/sites/.*/files/ {
  access_log      off;
}

location ~* \.xml$ {
  access_log off;
}

location ~* ^/(?:.*/)?(?:admin|user|cart|checkout|logout|flag|comment/reply) {
  access_log off;
}

location ~* ^/(?:core/)?(?:boost_stats|update|authorize|rtoc|xmlrpc|js)\.php$ {
  access_log   off;
}

I have tried enabling the access log for everything in /data/disk/tn/config/includes/nginx_octopus_include.conf as an experiment to see how much the Nginx logs then diverge from the Piwik stats. I have run this command on that file using vim and restarted Nginx:

cp /data/disk/tn/config/includes/nginx_octopus_include.conf /root/
vim /data/disk/tn/config/includes/nginx_octopus_include.conf 
  :1,$s/access_log\s\+off/access_log on/c
  40 substitutions on 40 lines
/etc/init.d/nginx restart

However we still don't have a record of images, css and js in the log files... and I don't understand why, but it's now very clear that the wiki:WebServerLogs totally under record the actual traffic / hits.
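
One possible explanation, offered as an assumption rather than something verified on puffin: the Nginx documentation only describes the forms "access_log off;" and "access_log path [format ...];", so "access_log on;" would be read as a request to log to a file called "on" rather than as a switch that restores the default log. The Python sketch below (the helper name and the sample are hypothetical) counts the directives in a config by kind so that any "on" entries can be checked by hand.

```python
import re

# Captures the first argument of every access_log directive.
# "off" disables logging; any other value is treated by Nginx as a path.
DIRECTIVE = re.compile(r"^[ \t]*access_log[ \t]+([^\s;]+)[^;]*;", re.MULTILINE)


def classify_access_logs(config_text: str) -> dict:
    """Count access_log directives as 'off', 'on' (suspect) or 'path'."""
    counts = {"off": 0, "on": 0, "path": 0}
    for target in DIRECTIVE.findall(config_text):
        if target in ("off", "on"):
            counts[target] += 1
        else:
            counts["path"] += 1
    return counts


SAMPLE = """
location = /favicon.ico {
  access_log    off;
}
location ^~ /admin {
  access_log on;
}
server {
  access_log /var/log/nginx/access.log main;
}
"""

if __name__ == "__main__":
    print(classify_access_logs(SAMPLE))
```

If the assumption holds, replacing "access_log off;" with an explicit log path, or deleting the directive so the location inherits the parent setting, would be the things to try.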

Looking at the Xen bandwidth stats:

 puffin  /  monthly

       month        rx      |     tx      |    total    |   avg. rate
    ------------------------+-------------+-------------+---------------
      Nov '12         0 KiB |       0 KiB |       0 KiB |    0.00 kbit/s
      Dec '12         0 KiB |       0 KiB |       0 KiB |    0.00 kbit/s
      Mar '13     32.26 GiB |    4.20 GiB |   36.46 GiB |  114.19 kbit/s
      Apr '13     68.61 GiB |   14.06 GiB |   82.66 GiB |  267.52 kbit/s
      May '13     65.49 GiB |   22.61 GiB |   88.10 GiB |  275.92 kbit/s
      Jun '13     68.12 GiB |   16.18 GiB |   84.31 GiB |  272.85 kbit/s
      Jul '13     44.54 GiB |   10.87 GiB |   55.41 GiB |  369.55 kbit/s
    ------------------------+-------------+-------------+---------------
    estimated     94.85 GiB |   23.15 GiB |  117.99 GiB |

These are in https://en.wikipedia.org/wiki/GiB (1GiB ≈ 1.074GB) and also in and out are clearly reversed, so we have these stats for data served to clients for these whole months:

Apr '13 68.61 GiB | 73.69 GB | 2.5 GB / day
May '13 65.49 GiB | 70.34 GB | 2.3 GB / day
Jun '13 68.12 GiB | 73.16 GB | 2.4 GB / day

And for the first half of July:

Jul '13 44.54 GiB | 47.84 GB | 3.2 GB / day

And compared with the stats from Webalizer we have:

10th July | 1219928 kB | 1.16 GB
11th July | 1740502 kB | 1.66 GB
12th July | 1364893 kB | 1.30 GB
13th July | 1498118 kB | 1.43 GB
14th July | 1645123 kB | 1.57 GB

So it's clear that the traffic recorded by Webalizer is roughly half the actual traffic (however the Xen stats do include data transferred by ssh, so backups are counted there).
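
The conversions above can be reproduced with a short Python sketch. The function names are made up for illustration, and two assumptions are worth stating: it uses the exact factor 2^30 / 10^9 rather than the rounded 1.074, so the GB totals differ from the table in the second decimal place, and it treats the Webalizer "kB" column as units of 1024 bytes, since dividing by 1024 twice is what reproduces the figures quoted.

```python
GIB = 2**30  # bytes in one GiB
GB = 10**9   # bytes in one GB


def gib_to_gb(gib: float) -> float:
    """Convert binary gigabytes (GiB) to decimal gigabytes (GB)."""
    return gib * GIB / GB


def gb_per_day(gib: float, days: int) -> float:
    """Average GB per day for a period that transferred `gib` GiB."""
    return gib_to_gb(gib) / days


def webalizer_kb_to_gib(kb: int) -> float:
    """Convert a Webalizer kB figure, assumed to be KiB, to GiB."""
    return kb / 1024 / 1024


# (label, GiB served according to the Xen stats, days covered)
PERIODS = [("Apr '13", 68.61, 30), ("May '13", 65.49, 31),
           ("Jun '13", 68.12, 30), ("Jul '13", 44.54, 15)]

if __name__ == "__main__":
    for label, gib, days in PERIODS:
        print(f"{label} {gib:.2f} GiB | {gib_to_gb(gib):.2f} GB | "
              f"{gb_per_day(gib, days):.1f} GB / day")
```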

Incidentally I noticed puffin had the standard limit of 1024 open files so I multiplied this by 4:

ulimit -n
  1024
ulimit -n 4096
ulimit -n
  4096

comment:80 Changed 3 years ago by chris

Adding in the other servers, we served 0.1TB of data to clients in July 2013:

 puffin  /  monthly

       month        rx      |     tx      |    total    |   avg. rate
    ------------------------+-------------+-------------+---------------
      Nov '12         0 KiB |       0 KiB |       0 KiB |    0.00 kbit/s
      Dec '12         0 KiB |       0 KiB |       0 KiB |    0.00 kbit/s
      Mar '13     32.26 GiB |    4.20 GiB |   36.46 GiB |  114.19 kbit/s
      Apr '13     68.61 GiB |   14.06 GiB |   82.66 GiB |  267.52 kbit/s
      May '13     65.49 GiB |   22.61 GiB |   88.10 GiB |  275.92 kbit/s
      Jun '13     68.12 GiB |   16.18 GiB |   84.31 GiB |  272.85 kbit/s
      Jul '13     44.57 GiB |   10.87 GiB |   55.44 GiB |  369.57 kbit/s
    ------------------------+-------------+-------------+---------------
    estimated     94.86 GiB |   23.14 GiB |  118.00 GiB |

 penguin  /  monthly

       month        rx      |     tx      |    total    |   avg. rate
    ------------------------+-------------+-------------+---------------
      Dec '12         0 KiB |       0 KiB |       0 KiB |    0.00 kbit/s
      Mar '13      3.61 GiB |  971.74 MiB |    4.56 GiB |   14.27 kbit/s
      Apr '13      8.56 GiB |    1.84 GiB |   10.40 GiB |   33.65 kbit/s
      May '13      7.28 GiB |    2.51 GiB |    9.79 GiB |   30.67 kbit/s
      Jun '13     10.14 GiB |    3.06 GiB |   13.20 GiB |   42.71 kbit/s
      Jul '13      5.82 GiB |    2.03 GiB |    7.85 GiB |   52.20 kbit/s
    ------------------------+-------------+-------------+---------------
    estimated     12.36 GiB |    4.30 GiB |   16.67 GiB |

 parrot  /  monthly

       month        rx      |     tx      |    total    |   avg. rate
    ------------------------+-------------+-------------+---------------
      Apr '13         0 KiB |       0 KiB |       0 KiB |    0.00 kbit/s
      May '13     23.01 GiB |    3.76 GiB |   26.77 GiB |   83.85 kbit/s
      Jun '13     20.95 GiB |    2.60 GiB |   23.54 GiB |   76.20 kbit/s
      Jul '13     11.37 GiB |    1.40 GiB |   12.76 GiB |   84.91 kbit/s
    ------------------------+-------------+-------------+---------------
    estimated     24.14 GiB |    2.97 GiB |   27.11 GiB |

comment:81 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 36.83 to 37.08

I have edited all the Nginx config files listed above and changed access_log off; to access_log on;, but still I don't see the hits for images and js and css.

I think there must be some Nginx config files I haven't managed to find or something...

Changed 3 years ago by chris

Attachment puffin_2013-07-19_mysql_connections-month.png added

Puffin MySQL Connections by Month for 2013-07-19

Changed 3 years ago by chris

Attachment puffin_2013-07-19_mysql_qcache_mem-day.png added

Puffin MySQL Query Cache by Day 2013-07-19

comment:82 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.35
Total Hours changed from 37.08 to 37.43

The Puffin MySQL query cache, which was set at 512MB filled up last night:

And we are also not seeing the same large number of connections that we had before:

The latest from mysqltuner.pl:

 >>  MySQLTuner 1.2.0 - Major Hayden <major@mhtx.net>
 >>  Bug reports, feature requests, and downloads at http://mysqltuner.com/
 >>  Run with '--help' for additional options and output filtering
[OK] Logged in using credentials from debian maintenance account.

-------- General Statistics --------------------------------------------------
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.5.31-MariaDB-1~squeeze-log
[OK] Operating on 64-bit architecture

-------- Storage Engine Statistics -------------------------------------------
[--] Status: +Archive -BDB +Federated +InnoDB -ISAM -NDBCluster 
[--] Data in MyISAM tables: 106M (Tables: 2)
[--] Data in InnoDB tables: 451M (Tables: 1039)
[--] Data in PERFORMANCE_SCHEMA tables: 0B (Tables: 17)
[!!] Total fragmented tables: 100

-------- Security Recommendations  -------------------------------------------
[OK] All database users have passwords assigned

-------- Performance Metrics -------------------------------------------------
[--] Up for: 4d 9h 18m 51s (30M q [79.462 qps], 756K conn, TX: 56B, RX: 4B)
[--] Reads / Writes: 77% / 23%
[--] Total buffers: 3.8G global + 44.8M per thread (150 max threads)
[!!] Maximum possible memory usage: 10.3G (129% of installed RAM)
[OK] Slow queries: 0% (151K/30M)
[OK] Highest usage of available connections: 28% (42/150)
[OK] Key buffer size / total MyISAM indexes: 256.0M/100.4M
[OK] Key buffer hit rate: 99.9% (44M cached / 24K reads)
[OK] Query cache efficiency: 87.4% (23M cached / 27M selects)
[!!] Query cache prunes per day: 5291
[OK] Sorts requiring temporary tables: 1% (3K temp sorts / 344K sorts)
[!!] Joins performed without indexes: 6089
[OK] Temporary tables created on disk: 25% (108K on disk / 433K total)
[OK] Thread cache hit rate: 99% (42 created / 756K connections)
[OK] Table cache hit rate: 31% (2K open / 9K opened)
[OK] Open file limit used: 0% (80/196K)
[OK] Table locks acquired immediately: 99% (5M immediate / 5M locks)
[OK] InnoDB data size / buffer pool: 451.7M/1.0G

-------- Recommendations -----------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
    Reduce your overall MySQL memory footprint for system stability
    Increasing the query_cache size over 128M may reduce performance
    Adjust your join queries to always utilize indexes
Variables to adjust:
  *** MySQL's maximum memory usage is dangerously high ***
  *** Add RAM before increasing MySQL buffer variables ***
    query_cache_size (> 512M) [see warning above]
    join_buffer_size (> 32.0M, or always use indexes with joins)

I have changed these values in /etc/mysql/my.cnf, doubling the memory for the join_buffer_size and halving the number of connections:

join_buffer_size        = 64M
max_connections         = 75
max_user_connections    = 75
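
As a rough check of what this change does to the "Maximum possible memory usage" figure, here is a Python sketch. It assumes that mysqltuner.pl works the figure out as global buffers plus per-thread buffers multiplied by max_connections, and that join_buffer_size is counted once per thread, so doubling it from 32M to 64M adds 32M to the 44.8M per-thread figure. Both are assumptions about the tool, not something taken from its output.

```python
def max_memory_gib(global_mib: float, per_thread_mib: float,
                   max_connections: int) -> float:
    """Worst-case memory in GiB: global buffers + per-thread * connections."""
    return (global_mib + per_thread_mib * max_connections) / 1024


GLOBAL_MIB = 3.8 * 1024  # "Total buffers: 3.8G global"

# Before: 44.8M per thread, 150 connections.
before = max_memory_gib(GLOBAL_MIB, 44.8, 150)
# After: join buffer doubled (+32M per thread), 75 connections.
after = max_memory_gib(GLOBAL_MIB, 44.8 + 32, 75)

if __name__ == "__main__":
    print(f"before: {before:.2f} GiB, after: {after:.2f} GiB")
```

On these assumptions the worst case drops from about 10.4G to about 9.4G, which is still more than the 7.98G of physical memory reported by tuning-primer.sh.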

I was tempted to increase the size of the query cache but after reading this:

It might be that we would be better off making it smaller...

Following is the result of the tuning-primer.sh script:

        -- MYSQL PERFORMANCE TUNING PRIMER --
             - By: Matthew Montgomery -

MySQL Version 5.5.31-MariaDB-1~squeeze-log x86_64

Uptime = 4 days 9 hrs 37 min 26 sec
Avg. qps = 79
Total Questions = 30217672
Threads Connected = 2

Server has been running for over 48hrs.
It should be safe to follow these recommendations

To find out more information on how each of these
runtime variables effects performance visit:
http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html
Visit http://www.mysql.com/products/enterprise/advisors.html
for info about MySQL's Enterprise Monitoring and Advisory Service

SLOW QUERIES
The slow query log is enabled.
Current long_query_time = 5.000000 sec.
You have 151779 out of 30217798 that take longer than 5.000000 sec. to complete
Your long_query_time seems to be fine

BINARY UPDATE LOG
The binary update log is NOT enabled.
You will not be able to do point in time recovery
See http://dev.mysql.com/doc/refman/5.5/en/point-in-time-recovery.html

WORKER THREADS
Current thread_cache_size = 128
Current threads_cached = 41
Current threads_per_sec = 0
Historic threads_per_sec = 0
Your thread_cache_size is fine

MAX CONNECTIONS
Current max_connections = 150
Current threads_connected = 1
Historic max_used_connections = 42
The number of used connections is 28% of the configured maximum.
Your max_connections variable seems to be fine.

INNODB STATUS
Current InnoDB index space = 178 M
Current InnoDB data space = 451 M
Current InnoDB buffer pool free = 42 %
Current innodb_buffer_pool_size = 1.00 G
Depending on how much space your innodb indexes take up it may be safe
to increase this value to up to 2 / 3 of total system memory

MEMORY USAGE
Max Memory Ever Allocated : 3.59 G
Configured Max Per-thread Buffers : 6.55 G
Configured Max Global Buffers : 1.76 G
Configured Max Memory Limit : 8.31 G
Physical Memory : 7.98 G

Max memory limit exceeds 90% of physical memory

KEY BUFFER
Current MyISAM index space = 100 M
Current key_buffer_size = 256 M
Key cache miss rate is 1 : 1857
Key buffer free ratio = 72 %
Your key_buffer_size seems to be fine

QUERY CACHE
Query cache is enabled
Current query_cache_size = 512 M
Current query_cache_used = 326 M
Current query_cache_limit = 2 M
Current Query cache Memory fill ratio = 63.76 %
Current query_cache_min_res_unit = 1 K
Query Cache is 22 % fragmented
Run "FLUSH QUERY CACHE" periodically to defragment the query cache memory
If you have many small queries lower 'query_cache_min_res_unit' to reduce fragmentation.
MySQL won't cache query results that are larger than query_cache_limit in size

SORT OPERATIONS
Current sort_buffer_size = 512 K
Current read_rnd_buffer_size = 4 M
Sort buffer seems to be fine

JOINS
tuning-primer.sh: line 402: export: `2097152': not a valid identifier
Current join_buffer_size = 32.00 M
You have had 6104 queries where a join could not use an index properly
join_buffer_size >= 4 M
This is not advised
You should enable "log-queries-not-using-indexes"
Then look for non indexed joins in the slow query log.

OPEN FILES LIMIT
Current open_files_limit = 196608 files
The open_files_limit should typically be set to at least 2x-3x
that of table_cache if you have heavy MyISAM usage.
Your open_files_limit value seems to be fine

TABLE CACHE
Current table_open_cache = 8192 tables
Current table_definition_cache = 6144 tables
You have a total of 1082 tables
You have 2834 open tables.
The table_cache value seems to be fine

TEMP TABLES
Current max_heap_table_size = 4.00 G
Current tmp_table_size = 2.00 G
Of 326085 temp tables, 25% were created on disk
Perhaps you should increase your tmp_table_size and/or max_heap_table_size
to reduce the number of disk-based temporary tables
Note! BLOB and TEXT columns are not allow in memory tables.
If you are using these columns raising these values might not impact your 
ratio of on disk temp tables.

TABLE SCANS
Current read_buffer_size = 8 M
Current table scan ratio = 94 : 1
read_buffer_size seems to be fine

TABLE LOCKING
Current Lock Wait ratio = 1 : 1832134
Your table locking seems to be fine

I'm not going to restart MySQL right now as we have a MySQL update pending, see ticket:573.

Changed 3 years ago by chris

Attachment puffin_2013-07-19_phpfpm_status-day.png added

Puffin 2013-07-19 PHP-FPM Status

comment:83 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 37.43 to 37.68

Looking at the number of php-fpm processes I think we have too many spare, it seems to be a waste of resources for most of the time:

I had set the number of spare servers high so that there would be lots available at peak time but I'm not sure this is the best way to do it. BOA has the max spare set to 1 by default, so I have edited these values in /opt/local/etc/php53-fpm.conf:

pm.start_servers = 2
pm.min_spare_servers = 2
pm.max_spare_servers = 6
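
A quick sanity check for values like these can be sketched in Python. The function name is made up, and the rules encoded are the ones PHP-FPM is understood to enforce for the dynamic process manager (start_servers between the two spare limits, max_spare_servers no higher than max_children); pm.max_children is not shown in this comment, so it is optional here.

```python
def check_pm_settings(start, min_spare, max_spare, max_children=None):
    """Return a list of problems with dynamic pm.* values (empty if none)."""
    problems = []
    if min_spare < 1:
        problems.append("pm.min_spare_servers must be at least 1")
    if not (min_spare <= start <= max_spare):
        problems.append("pm.start_servers must lie between "
                        "pm.min_spare_servers and pm.max_spare_servers")
    if max_children is not None and max_spare > max_children:
        problems.append("pm.max_spare_servers must not exceed pm.max_children")
    return problems


if __name__ == "__main__":
    # The values set above: start 2, min spare 2, max spare 6.
    print(check_pm_settings(start=2, min_spare=2, max_spare=6) or "looks consistent")
```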

And restarted php-fpm53.

Changed 3 years ago by chris

Attachment puffin_2013-07-19_2_phpfpm_status-day.png added

Puffin PHP-FPM 2013-07-19

Changed 3 years ago by chris

Attachment puffin_2013-07-10_multips_memory-day.png added

Puffin 2013-07-19 Memory Usage

comment:84 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 37.68 to 37.78

You can see the effect of the changes to php-fpm here:

And the drop in memory usage:

The drop in MySQL memory usage is because of the restart after the upgrade, see ticket:573.

Changed 3 years ago by chris

Attachment puffin_2013-07-19_fw_packets-day.png added

Puffin 2013-07-19 Firewall Packets

Changed 3 years ago by chris

Attachment puffin_2013-07-19-2_mysql_queries-day.png added

Puffin 2013-07-19 Mysql Query Cache

comment:85 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 37.78 to 37.88

One further thought on the MySQL query cache, when we have a big traffic spike like the one yesterday around noon:

It appears that many of the requests are being served from the MySQL cache:

In fact it's almost impossible to see any inserts or modify queries on this graph so I suspect that for our usage it might make sense to have a massive query cache.

Changed 3 years ago by chris

Attachment puffin_daily_usage_201307.png added

Puffin Webalizer 2013-07-19

comment:86 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 37.88 to 38.13

Bandwidth wise we hit a new high yesterday, over 3GB in a day:

Yesterday was a high for the week with the Piwik stats also.

Here are the latest bandwidth stats from Xen:

 puffin  /  monthly

       month        rx      |     tx      |    total    |   avg. rate
    ------------------------+-------------+-------------+---------------
      Nov '12         0 KiB |       0 KiB |       0 KiB |    0.00 kbit/s
      Dec '12         0 KiB |       0 KiB |       0 KiB |    0.00 kbit/s
      Mar '13     32.26 GiB |    4.20 GiB |   36.46 GiB |  114.19 kbit/s
      Apr '13     68.61 GiB |   14.06 GiB |   82.66 GiB |  267.52 kbit/s
      May '13     65.49 GiB |   22.61 GiB |   88.10 GiB |  275.92 kbit/s
      Jun '13     68.12 GiB |   16.18 GiB |   84.31 GiB |  272.85 kbit/s
      Jul '13     62.42 GiB |   14.31 GiB |   76.73 GiB |  401.87 kbit/s
    ------------------------+-------------+-------------+---------------
    estimated    104.39 GiB |   23.92 GiB |  128.31 GiB |

Changed 3 years ago by chris

Attachment puffin-2013-07-26-load-day.png added

Puffin Load 2013-07-26

comment:87 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.6
Total Hours changed from 38.13 to 38.73

There was a massive load spike today that resulted in BOA shutting down Nginx and PHP-FPM.

The issue started around 2:20pm:

uptime :
 14:21:15 up 36 days,  5:38,  0 users,  load average: 93.68, 71.10, 37.93

Twenty mins later the load was approaching 200:

uptime :
 14:41:00 up 36 days,  5:57,  0 users,  load average: 189.37, 169.22, 118.81

At this load, dumping the output of top into a log file is pointless, as it doesn't work, so I have edited this out of the /var/xdrago/second.sh script.

Looking at the logs I noticed an FTP connection that happened around the same time (which is probably unrelated), this is from /var/log/messages:

Jul 26 14:29:06 puffin pure-ftpd: (?@50.192.103.201) [INFO] New connection from 50.192.103.201
Jul 26 14:41:00 puffin pure-ftpd: (?@50.192.103.201) [INFO] Logout.

Jim, does anyone need FTP access, or can we simply uninstall the pure-ftpd server?

Looking in /var/log/php/php53-fpm-error.log, php-fpm wasn't running for about 8 mins:

[26-Jul-2013 14:41:08] NOTICE: Finishing ...
[26-Jul-2013 14:41:08] NOTICE: exiting, bye-bye!
[26-Jul-2013 14:49:00] NOTICE: fpm is running, pid 44718
[26-Jul-2013 14:49:00] NOTICE: ready to handle connections
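
The length of the gap can be worked out from the "exiting" and "fpm is running" timestamps. This is a minimal sketch, assuming both timestamps fall on the same day; the helper names `hms_to_secs` and `gap_secs` are made up for illustration.

```shell
#!/bin/sh
# hms_to_secs HH:MM:SS -> seconds since midnight
hms_to_secs() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# gap_secs START END -> seconds between two times on the same day
gap_secs() {
  echo $(( $(hms_to_secs "$2") - $(hms_to_secs "$1") ))
}

# gap_secs 14:41:08 14:49:00   -> 472 (7 min 52 s)
```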

I have done some grepping but haven't found anything else of note in the logs.

comment:88 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 38.73 to 38.98

This issue is very much ongoing, for example this is a list of all the recent load alert emails from lfd (there are a _lot_ fewer of these than the 5min munin alert emails that are sent when the load is over 4, see https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/load.html).

Aug 04 High 5 minute load average alert - 7.20
Aug 04 High 5 minute load average alert - 17.31
Aug 04 High 5 minute load average alert - 18.88
Aug 05 High 5 minute load average alert - 8.47
Aug 05 High 5 minute load average alert - 7.95
Aug 05 High 5 minute load average alert - 29.18
Aug 05 High 5 minute load average alert - 9.16
Aug 06 High 5 minute load average alert - 7.52
Aug 06 High 5 minute load average alert - 8.95
Aug 06 High 5 minute load average alert - 7.60
Aug 06 High 5 minute load average alert - 7.03
Aug 06 High 5 minute load average alert - 6.92
Aug 06 High 5 minute load average alert - 6.18
Aug 06 High 5 minute load average alert - 6.64
Aug 06 High 5 minute load average alert - 8.55
Aug 07 High 5 minute load average alert - 6.02
Aug 07 High 5 minute load average alert - 6.58
Aug 07 High 5 minute load average alert - 6.19
Aug 07 High 5 minute load average alert - 7.26
Aug 07 High 5 minute load average alert - 6.43
Aug 07 High 5 minute load average alert - 6.59
Aug 07 High 5 minute load average alert - 7.67
Aug 07 High 5 minute load average alert - 8.17
Aug 07 High 5 minute load average alert - 6.25
Aug 07 High 5 minute load average alert - 8.29
Aug 07 High 5 minute load average alert - 6.74
Aug 07 High 5 minute load average alert - 7.62
Aug 08 High 5 minute load average alert - 6.07
Aug 08 High 5 minute load average alert - 27.59
Aug 08 High 5 minute load average alert - 11.33
Aug 08 High 5 minute load average alert - 7.89
Aug 08 High 5 minute load average alert - 7.43
Aug 08 High 5 minute load average alert - 6.06
Aug 08 High 5 minute load average alert - 7.47
Aug 09 High 5 minute load average alert - 7.94
Aug 09 High 5 minute load average alert - 6.62
Aug 09 High 5 minute load average alert - 6.14
Aug 09 High 5 minute load average alert - 12.39
Aug 09 High 5 minute load average alert - 6.81
Aug 09 High 5 minute load average alert - 6.82
Aug 09 High 5 minute load average alert - 11.91
Aug 09 High 5 minute load average alert - 9.55
Aug 09 High 5 minute load average alert - 6.96
Aug 09 High 5 minute load average alert - 6.19
Aug 09 High 5 minute load average alert - 7.54
Aug 10 High 5 minute load average alert - 7.05
Aug 10 High 5 minute load average alert - 6.28
Aug 10 High 5 minute load average alert - 7.46
Aug 10 High 5 minute load average alert - 7.24
Aug 10 High 5 minute load average alert - 12.92
Aug 10 High 5 minute load average alert - 6.72
Aug 11 High 5 minute load average alert - 7.22
Aug 11 High 5 minute load average alert - 6.22
Aug 11 High 5 minute load average alert - 10.06
Aug 11 High 5 minute load average alert - 7.39
Aug 12 High 5 minute load average alert - 6.22
Aug 12 High 5 minute load average alert - 7.39
Aug 12 High 5 minute load average alert - 9.60
Aug 12 High 5 minute load average alert - 6.86
Aug 12 High 5 minute load average alert - 11.38
Aug 12 High 5 minute load average alert - 7.61
Aug 12 High 5 minute load average alert - 6.11
Aug 13 High 5 minute load average alert - 13.65
Aug 13 High 5 minute load average alert - 9.53
Aug 13 High 5 minute load average alert - 23.08
Aug 13 High 5 minute load average alert - 6.66
Aug 13 High 5 minute load average alert - 8.62
Aug 13 High 5 minute load average alert - 10.84
Aug 13 High 5 minute load average alert - 6.78
Aug 13 High 5 minute load average alert - 7.84
Aug 14 High 5 minute load average alert - 6.59
Aug 14 High 5 minute load average alert - 7.25
Aug 14 High 5 minute load average alert - 6.74
Aug 14 High 5 minute load average alert - 6.77
Aug 14 High 5 minute load average alert - 6.90
Aug 15 High 5 minute load average alert - 6.63
Aug 15 High 5 minute load average alert - 10.86
Aug 15 High 5 minute load average alert - 6.72
Aug 15 High 5 minute load average alert - 6.85
Aug 15 High 5 minute load average alert - 7.58
Aug 15 High 5 minute load average alert - 6.08
Aug 15 High 5 minute load average alert - 11.27
Aug 16 High 5 minute load average alert - 6.38
Aug 16 High 5 minute load average alert - 6.10
Aug 16 High 5 minute load average alert - 7.14
Aug 16 High 5 minute load average alert - 7.31
Aug 16 High 5 minute load average alert - 7.39
Aug 16 High 5 minute load average alert - 6.32
Aug 16 High 5 minute load average alert - 18.47
Aug 16 High 5 minute load average alert - 18.47
Aug 16 High 5 minute load average alert - 18.47
Aug 16 High 5 minute load average alert - 7.61
Aug 16 High 5 minute load average alert - 6.38
Aug 17 High 5 minute load average alert - 8.68
Aug 18 High 5 minute load average alert - 9.78
Aug 18 High 5 minute load average alert - 7.21
Aug 18 High 5 minute load average alert - 6.13
Aug 18 High 5 minute load average alert - 7.18
Aug 19 High 5 minute load average alert - 12.06
Aug 19 High 5 minute load average alert - 7.05
Aug 19 High 5 minute load average alert - 8.62
Aug 19 High 5 minute load average alert - 6.71
Aug 19 High 5 minute load average alert - 7.32
Aug 19 High 5 minute load average alert - 9.75
Aug 20 High 5 minute load average alert - 13.68
Aug 20 High 5 minute load average alert - 13.62
Aug 20 High 5 minute load average alert - 11.06
Aug 20 High 5 minute load average alert - 6.77
Aug 20 High 5 minute load average alert - 10.87
Aug 21 High 5 minute load average alert - 8.23
Aug 21 High 5 minute load average alert - 8.92
Aug 21 High 5 minute load average alert - 6.69
Aug 21 High 5 minute load average alert - 6.79
Aug 22 High 5 minute load average alert - 6.11
Aug 22 High 5 minute load average alert - 8.87
Aug 22 High 5 minute load average alert - 6.16
Aug 22 High 5 minute load average alert - 6.17
Aug 23 High 5 minute load average alert - 6.93
Aug 23 High 5 minute load average alert - 10.37
Aug 23 High 5 minute load average alert - 6.00
Aug 23 High 5 minute load average alert - 7.09
Aug 23 High 5 minute load average alert - 14.06
Aug 23 High 5 minute load average alert - 6.72
Aug 23 High 5 minute load average alert - 7.86
Aug 23 High 5 minute load average alert - 7.69
Aug 24 High 5 minute load average alert - 6.27
Aug 24 High 5 minute load average alert - 6.97
Aug 24 High 5 minute load average alert - 6.44
Aug 24 High 5 minute load average alert - 6.63
Aug 24 High 5 minute load average alert - 8.20
Aug 24 High 5 minute load average alert - 6.99
Aug 25 High 5 minute load average alert - 7.16
Aug 25 High 5 minute load average alert - 7.88
Aug 25 High 5 minute load average alert - 16.66
Aug 25 High 5 minute load average alert - 7.27
Aug 26 High 5 minute load average alert - 6.87
Aug 26 High 5 minute load average alert - 6.81
Aug 26 High 5 minute load average alert - 11.00
Aug 26 High 5 minute load average alert - 7.23
Aug 26 High 5 minute load average alert - 8.68
Aug 26 High 5 minute load average alert - 6.44
Aug 26 High 5 minute load average alert - 6.15
Aug 27 High 5 minute load average alert - 6.31
Aug 27 High 5 minute load average alert - 11.43
Aug 27 High 5 minute load average alert - 11.61
Aug 27 High 5 minute load average alert - 7.18
Aug 27 High 5 minute load average alert - 6.05
Aug 27 High 5 minute load average alert - 41.91
Aug 27 High 5 minute load average alert - 17.85
Aug 27 High 5 minute load average alert - 6.61
Aug 28 High 5 minute load average alert - 6.33
Aug 28 High 5 minute load average alert - 6.70
Aug 28 High 5 minute load average alert - 6.86
Aug 28 High 5 minute load average alert - 13.35
Aug 28 High 5 minute load average alert - 17.06
Aug 28 High 5 minute load average alert - 31.46
Aug 28 High 5 minute load average alert - 6.64
Aug 28 High 5 minute load average alert - 6.15
Aug 28 High 5 minute load average alert - 7.15
Aug 29 High 5 minute load average alert - 9.57
Aug 29 High 5 minute load average alert - 7.41
Aug 29 High 5 minute load average alert - 7.03
Aug 29 High 5 minute load average alert - 7.40
Aug 29 High 5 minute load average alert - 8.42
Aug 29 High 5 minute load average alert - 6.56
Aug 29 High 5 minute load average alert - 8.52
Aug 29 High 5 minute load average alert - 8.37
Aug 30 High 5 minute load average alert - 7.52
Aug 30 High 5 minute load average alert - 8.78
Aug 30 High 5 minute load average alert - 6.87
Aug 30 High 5 minute load average alert - 8.27
Aug 30 High 5 minute load average alert - 9.88
Aug 30 High 5 minute load average alert - 7.89
Aug 31 High 5 minute load average alert - 6.16
Aug 31 High 5 minute load average alert - 7.59
Aug 31 High 5 minute load average alert - 7.05
Aug 31 High 5 minute load average alert - 7.95
Sep 01 High 5 minute load average alert - 7.33
Sep 01 High 5 minute load average alert - 7.09
Sep 01 High 5 minute load average alert - 6.82
Sep 02 High 5 minute load average alert - 8.23
Sep 02 High 5 minute load average alert - 6.16
Sep 02 High 5 minute load average alert - 7.51
Sep 02 High 5 minute load average alert - 7.04
Sep 02 High 5 minute load average alert - 7.57
Sep 02 High 5 minute load average alert - 6.08
Sep 03 High 5 minute load average alert - 13.18
Sep 03 High 5 minute load average alert - 8.05
Sep 03 High 5 minute load average alert - 7.75
Sep 03 High 5 minute load average alert - 6.86
Sep 03 High 5 minute load average alert - 6.08
Sep 03 High 5 minute load average alert - 19.09
Sep 03 High 5 minute load average alert - 8.56
Sep 03 High 5 minute load average alert - 7.01
Sep 03 High 5 minute load average alert - 7.29
Sep 03 High 5 minute load average alert - 6.32
Sep 03 High 5 minute load average alert - 6.64
Sep 04 High 5 minute load average alert - 20.16
Sep 04 High 5 minute load average alert - 9.14
Sep 04 High 5 minute load average alert - 7.35
Sep 04 High 5 minute load average alert - 6.50
Sep 04 High 5 minute load average alert - 9.07
Sep 04 High 5 minute load average alert - 6.63
Sep 05 High 5 minute load average alert - 8.44
Sep 05 High 5 minute load average alert - 6.43
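
A list like this is easier to read as a per-day count and peak. The following is a rough sketch, assuming the alerts are available one per line in the format shown above; the function name `summarise_alerts` is made up for illustration.

```shell
#!/bin/sh
# summarise_alerts: reads lines such as
#   "Aug 04 High 5 minute load average alert - 7.20"
# on stdin and prints "Mon DD count peak" for each day, in input order.
summarise_alerts() {
  awk '/High 5 minute load average alert/ {
    day = $1 " " $2
    load = $NF + 0                      # last field is the 5 min load
    if (!(day in count)) order[++n] = day
    count[day]++
    if (load > peak[day]) peak[day] = load
  }
  END {
    for (i = 1; i <= n; i++)
      printf "%s %d %.2f\n", order[i], count[order[i]], peak[order[i]]
  }'
}
```

Fed the first seven lines of the list, this prints "Aug 04 3 18.88" and "Aug 05 4 29.18".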

Changed 3 years ago by jim

Attachment mem-9sept2013-after-extra-drupal-caching.png added

9 sept 2013 - mem usage after more Drupal caching enabled

comment:89 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 38.98 to 39.23

It looks to me like since I've upped the use of Redis caches (which store cache_view, cache_block etc) we're now getting short of memory, which is crippling the server's IO.

I restarted php53-fpm, redis and nginx as part of a little debug moment for work in #590, and after that we were short of memory.

I think we could drop some from MySQL so that more is spare/available to Redis.

See attached memory chart (https://tech.transitionnetwork.org/trac/attachment/ticket/555/mem-9sept2013-after-extra-drupal-caching.png) for the current situation.

comment:90 follow-up: ↓ 92 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 39.23 to 39.48

I could be wrong, but having looked around Drupal and made some improvements, I have a hunch that a lot of what makes the server's load spike is IO contention -- when disk activity is higher, load spikes quickly. Hence my suggestion over on #591.

Also, and this is just me wondering out loud now, does Redis-server have enough memory? I see its utilisation is stuck low -- around 200-400MB with occasional moves towards 1GB and back down again -- which might well be OK, but in an ideal world it would have lots of things in it, as what it contains is post-processed goodies like query results, rendered views output, blocks and whole pages etc.

So two questions, Chris:

Do you think disk IO speed/latency is an issue? You suggested a move to a ZFS/SSD thingy recently which implies you might...
Do you think Redis has enough memory? (or if it 'takes what's left' does MySQL have too much?)
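
One way to put a number on the Redis question is to compare its current memory use with its configured limit. This is a minimal sketch, assuming redis-cli can reach the server with its default connection settings (the BOA setup may need a socket or password); the function name `redis_used_mb` is made up for illustration.

```shell
#!/bin/sh
# redis_used_mb: reads the text of "redis-cli INFO memory" on stdin and
# prints the used_memory value converted from bytes to MiB.
redis_used_mb() {
  awk -F: '$1 == "used_memory" {
    sub(/\r$/, "", $2)                  # INFO lines end in CRLF
    printf "%.1f\n", $2 / 1048576
  }'
}

# Usage on the server (not run here):
#   redis-cli INFO memory | redis_used_mb
#   redis-cli CONFIG GET maxmemory
```

If used memory sits well below the limit, giving Redis more RAM is unlikely to help on its own; if it sits at the limit, keys are probably being evicted.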

comment:91 Changed 3 years ago by jim

Aha.. 2 things about redis:

I had my readings out by a factor of 5 - peaks in the past were 200MB, not 1GB.
The work I have done on cron in #590 (around comment 33) has massively improved Redis utilisation by stopping it being wiped hourly... I plan to move system_cron to run once a day at ~5am instead of every 3 hours, which should give us a really big boost. I am monitoring the effects first though.

comment:92 in reply to: ↑ 90 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 39.48 to 39.58

Replying to jim:

does Redis-server have enough memory?

I was also wondering this some months back:

Replying to chris:

A question for Jim, currently there is enough RAM that we could double the Redis RAM from 512MB to 1GB -- any reason not to do this? The performance drop when we didn't have Redis running was very significant and I expect that giving Redis extra RAM would speed things up.

If we do this I think we should also consider reducing the RAM usage of MySQL.

Changed 3 years ago by chris

Attachment puffin_load-week_2013-10-03.png added

Changed 3 years ago by chris

Attachment puffin_load-month_2013-10-03.png added

comment:93 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 39.58 to 39.83

We had a two week period without the frequent load spikes, which appears to have coincided with the time that New Relic was running (ticket:586), but since midnight on the 29th September they have returned.

I suspect that it is coincidental that the spikes dropped off while New Relic was running, but I don't have a good explanation for the change in the pattern; there isn't anything noticeable in the webalizer stats.

comment:94 Changed 3 years ago by chris

Following the discussion on ticket:601#comment:4, the changes to dump data when the high load trigger is tripped (see ticket:555#comment:48) have been revisited. First the old log file was moved:

mv /var/log/high-load.log /var/log/high-load.log.1

And the following changes were made to /var/xdrago/:

nginx_high_load_on()
{
  mv -f /data/conf/nginx_high_load_off.conf /data/conf/nginx_high_load.conf
  /etc/init.d/nginx reload

  # start additions
  echo "====================" >> /var/log/high-load.log
  echo "nginx high load on" >> /var/log/high-load.log
  echo "ONEX_LOAD = $ONEX_LOAD" >> /var/log/high-load.log
  echo "FIVX_LOAD = $FIVX_LOAD" >> /var/log/high-load.log
  echo "uptime : " >> /var/log/high-load.log
  uptime >> /var/log/high-load.log
  echo "vmstat : " >> /var/log/high-load.log
  vmstat -S M -s >> /var/log/high-load.log
  vmstat -S M -d >> /var/log/high-load.log
  echo "top : " >> /var/log/high-load.log
  top -n 1 -b >> /var/log/high-load.log
  echo "====================" >> /var/log/high-load.log
  # end additions

}
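
The additions above could also be written with a single redirect around a command group, which keeps the log path in one place. This is only a sketch of an alternative, not what is deployed: the function name `log_high_load` and the `run_if_present` helper are made up for illustration, and unlike the original it skips any tool that is missing and also sends error output to the log.

```shell
#!/bin/sh
# run_if_present CMD [ARGS...]: run CMD only if it is installed, and never
# let a failure abort the caller.
run_if_present() {
  command -v "$1" >/dev/null 2>&1 && "$@" || true
}

# log_high_load LOGFILE
# Appends one snapshot to LOGFILE. ONEX_LOAD and FIVX_LOAD are expected to
# have been set by the calling script.
log_high_load() {
  {
    echo "===================="
    echo "nginx high load on"
    echo "ONEX_LOAD = $ONEX_LOAD"
    echo "FIVX_LOAD = $FIVX_LOAD"
    echo "uptime : "
    run_if_present uptime
    echo "vmstat : "
    run_if_present vmstat -S M -s
    run_if_present vmstat -S M -d
    echo "top : "
    run_if_present top -n 1 -b
    echo "===================="
  } >> "$1" 2>&1
}

# Usage: log_high_load /var/log/high-load.log
```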

comment:95 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 39.83 to 40.08

Oops, forgot to add the time taken to make the above changes.

comment:96 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 40.08 to 40.18

There was a load spike 15 mins ago; this is what was written to /var/log/high-load.log:

====================
nginx high load on
ONEX_LOAD = 1283
FIVX_LOAD = 395
uptime :
 13:16:36 up 11 days,  3:58,  2 users,  load average: 14.28, 4.39, 2.11
vmstat :
         8175 M total memory
         7638 M used memory
         4913 M active memory
         2048 M inactive memory
          537 M free memory
          700 M buffer memory
         2904 M swap cache
         1023 M total swap
            0 M used swap
         1023 M free swap
     30401331 non-nice user cpu ticks
           15 nice user cpu ticks
     45313222 system cpu ticks
   2530109095 idle cpu ticks
      4701595 IO-wait cpu ticks
          536 IRQ cpu ticks
       644384 softirq cpu ticks
      6274220 stolen cpu ticks
      7029105 pages paged in
    362477344 pages paged out
            0 pages swapped in
            0 pages swapped out
   1026236759 interrupts
    786785317 CPU context switches
   1379837901 boot time
     20612641 forks
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
xvda2 391011   2506 13647722 2073724 17869827 49729737 724954688 952865084      0  29399
xvda1  26909  24402  410488  114452      0      0       0       0      0     19
top :
top - 13:16:37 up 11 days,  3:58,  2 users,  load average: 14.28, 4.39, 2.11
Tasks: 318 total,  47 running, 268 sleeping,   0 stopped,   3 zombie
Cpu0  :  2.6%us,  2.3%sy,  0.0%ni, 92.2%id,  2.2%wa,  0.0%hi,  0.3%si,  0.3%st
Cpu1  :  1.7%us,  2.3%sy,  0.0%ni, 95.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu2  :  1.7%us,  2.2%sy,  0.0%ni, 95.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu3  :  1.3%us,  1.9%sy,  0.0%ni, 96.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu4  :  1.2%us,  1.8%sy,  0.0%ni, 96.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu5  :  1.0%us,  1.7%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu6  :  0.9%us,  1.6%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu7  :  0.9%us,  1.6%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu8  :  0.9%us,  1.5%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu9  :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu10 :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu11 :  0.8%us,  1.5%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu12 :  0.8%us,  1.4%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu13 :  0.8%us,  1.4%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:   8372060k total,  7801800k used,   570260k free,   717156k buffers
Swap:  1048568k total,        0k used,  1048568k free,  2973932k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4125 mysql     20   0 2782m 2.0g  10m S  104 25.3 145:37.28 mysqld
41336 tn        20   0  250m  40m 8960 R   65  0.5   0:09.07 drush.php
30570 www-data  20   0  791m 116m  56m S   63  1.4   0:29.94 php-fpm
30585 www-data  20   0  759m  73m  45m R   56  0.9   0:28.39 php-fpm
30577 www-data  20   0  770m  83m  43m R   52  1.0   0:29.00 php-fpm
39927 www-data  20   0  779m  82m  33m R   50  1.0   0:20.49 php-fpm
30587 www-data  20   0  769m  78m  42m R   47  1.0   0:26.04 php-fpm
30575 www-data  20   0  759m  74m  45m R   45  0.9   0:34.82 php-fpm
30584 www-data  20   0  754m  66m  43m R   45  0.8   0:28.25 php-fpm
30586 www-data  20   0  778m  90m  43m R   45  1.1   0:29.98 php-fpm
39848 www-data  20   0  772m  79m  35m R   45  1.0   0:18.66 php-fpm
30588 www-data  20   0  777m  90m  43m S   43  1.1   0:29.14 php-fpm
30589 www-data  20   0  753m  66m  43m R   43  0.8   0:32.59 php-fpm
41542 tn        20   0  240m  31m 9004 R   43  0.4   0:07.38 drush.php
30573 www-data  20   0  782m  94m  42m R   41  1.2   0:27.88 php-fpm
39831 www-data  20   0  762m  69m  35m R   41  0.8   0:20.34 php-fpm
30578 www-data  20   0  781m  90m  41m R   39  1.1   0:37.33 php-fpm
40034 www-data  20   0  780m  83m  33m R   39  1.0   0:17.34 php-fpm
30572 www-data  20   0  778m  91m  43m R   38  1.1   0:30.43 php-fpm
30581 www-data  20   0  791m 103m  43m R   38  1.3   0:28.55 php-fpm
30582 www-data  20   0  783m  97m  45m R   38  1.2   0:28.15 php-fpm
30579 www-data  20   0  782m  93m  41m R   34  1.1   0:30.45 php-fpm
26795 root      20   0 72968  10m 1764 R   32  0.1   0:07.78 nginx
30569 www-data  20   0  781m  95m  45m S   32  1.2   0:31.14 php-fpm
30568 www-data  20   0  802m 134m  59m S   31  1.6   0:26.51 php-fpm
30576 www-data  20   0  782m  96m  45m S   25  1.2   0:29.96 php-fpm
58401 redis     20   0  475m  75m  928 S   11  0.9   5:09.65 redis-server
41953 root      20   0 19200 1400  912 R    4  0.0   0:00.06 top
39554 root      20   0 10624 1372 1148 S    2  0.0   0:00.17 bash
39607 root      20   0 10624 1368 1148 S    2  0.0   0:00.38 bash
40013 www-data  20   0     0    0    0 R    2  0.0   0:03.96 nginx
40032 www-data  20   0     0    0    0 R    2  0.0   0:02.72 nginx
40047 www-data  20   0 69868 8396 1972 R    2  0.1   0:01.58 nginx
40048 www-data  20   0 72968  11m 1928 S    2  0.1   0:01.81 nginx
    1 root      20   0  8356  768  636 S    0  0.0   0:55.98 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   1:13.56 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:55.15 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root      RT   0     0    0    0 S    0  0.0   1:16.81 migration/1
    7 root      20   0     0    0    0 S    0  0.0   1:23.90 ksoftirqd/1
    8 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1
    9 root      RT   0     0    0    0 S    0  0.0   1:06.91 migration/2
   10 root      20   0     0    0    0 S    0  0.0   1:16.97 ksoftirqd/2
   11 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/2
   12 root      RT   0     0    0    0 S    0  0.0   1:08.80 migration/3
   13 root      20   0     0    0    0 R    0  0.0   1:45.13 ksoftirqd/3
   14 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/3
   15 root      RT   0     0    0    0 S    0  0.0   1:05.97 migration/4
   16 root      20   0     0    0    0 S    0  0.0   1:37.83 ksoftirqd/4
   17 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/4
   18 root      RT   0     0    0    0 S    0  0.0   1:06.23 migration/5
   19 root      20   0     0    0    0 S    0  0.0   1:45.94 ksoftirqd/5
   20 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/5
   21 root      RT   0     0    0    0 S    0  0.0   1:08.04 migration/6
   22 root      20   0     0    0    0 S    0  0.0   1:35.66 ksoftirqd/6
   23 root      RT   0     0    0    0 S    0  0.0   0:00.02 watchdog/6
   24 root      RT   0     0    0    0 S    0  0.0   1:00.92 migration/7
   25 root      20   0     0    0    0 S    0  0.0   1:39.80 ksoftirqd/7
   26 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/7
   27 root      RT   0     0    0    0 S    0  0.0   1:02.56 migration/8
   28 root      20   0     0    0    0 S    0  0.0   1:39.21 ksoftirqd/8
   29 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/8
   30 root      RT   0     0    0    0 S    0  0.0   1:06.20 migration/9
   31 root      20   0     0    0    0 S    0  0.0   1:40.45 ksoftirqd/9
   32 root      RT   0     0    0    0 S    0  0.0   0:00.01 watchdog/9
   33 root      RT   0     0    0    0 S    0  0.0   1:05.33 migration/10
   34 root      20   0     0    0    0 S    0  0.0   1:52.43 ksoftirqd/10
   35 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/10
   36 root      RT   0     0    0    0 S    0  0.0   1:03.28 migration/11
   37 root      20   0     0    0    0 S    0  0.0   1:39.61 ksoftirqd/11
   38 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/11
   39 root      RT   0     0    0    0 S    0  0.0   1:04.47 migration/12
   40 root      20   0     0    0    0 S    0  0.0   1:27.33 ksoftirqd/12
   41 root      RT   0     0    0    0 S    0  0.0   0:00.03 watchdog/12
   42 root      RT   0     0    0    0 S    0  0.0   1:08.44 migration/13
   43 root      20   0     0    0    0 S    0  0.0   0:20.49 ksoftirqd/13
   44 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/13
   45 root      20   0     0    0    0 S    0  0.0   0:15.25 events/0
   46 root      20   0     0    0    0 S    0  0.0   0:17.27 events/1
   47 root      20   0     0    0    0 S    0  0.0   0:15.69 events/2
   48 root      20   0     0    0    0 S    0  0.0   0:12.99 events/3
   49 root      20   0     0    0    0 S    0  0.0   0:14.55 events/4
   50 root      20   0     0    0    0 S    0  0.0   0:18.10 events/5
   51 root      20   0     0    0    0 S    0  0.0   0:13.79 events/6
   52 root      20   0     0    0    0 S    0  0.0   0:14.06 events/7
   53 root      20   0     0    0    0 S    0  0.0   0:14.13 events/8
   54 root      20   0     0    0    0 S    0  0.0   0:13.40 events/9
   55 root      20   0     0    0    0 S    0  0.0   0:10.96 events/10
   56 root      20   0     0    0    0 S    0  0.0   0:11.43 events/11
   57 root      20   0     0    0    0 S    0  0.0   0:14.11 events/12
   58 root      20   0     0    0    0 S    0  0.0   0:18.72 events/13
   59 root      20   0     0    0    0 S    0  0.0   0:00.00 cpuset
   60 root      20   0     0    0    0 S    0  0.0   0:00.00 khelper
   61 root      20   0     0    0    0 S    0  0.0   0:00.00 netns
   62 root      20   0     0    0    0 S    0  0.0   0:00.00 async/mgr
   63 root      20   0     0    0    0 S    0  0.0   0:00.00 pm
   64 root      20   0     0    0    0 S    0  0.0   0:00.00 xenwatch
   65 root      20   0     0    0    0 S    0  0.0   0:00.00 xenbus
   66 root      20   0     0    0    0 S    0  0.0   0:02.06 sync_supers
   67 root      20   0     0    0    0 S    0  0.0   0:12.14 bdi-default
   68 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/0
   69 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/1
   70 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/2
   71 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/3
   72 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/4
   73 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/5
   74 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/6
   75 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/7
   76 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/8
   77 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/9
   78 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/10
   79 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/11
   80 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/12
   81 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/13
   82 root      20   0     0    0    0 S    0  0.0   0:11.48 kblockd/0
   83 root      20   0     0    0    0 S    0  0.0   0:00.03 kblockd/1
   84 root      20   0     0    0    0 S    0  0.0   0:00.01 kblockd/2
   85 root      20   0     0    0    0 S    0  0.0   0:00.01 kblockd/3
   86 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/4
   87 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/5
   88 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/6
   89 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/7
   90 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/8
   91 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/9
   92 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/10
   93 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/11
   94 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/12
   95 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/13
   96 root      20   0     0    0    0 S    0  0.0   0:00.00 kseriod
  111 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/0
  112 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/1
  113 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/2
  114 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/3
  115 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/4
  116 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/5
  117 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/6
  118 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/7
  119 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/8
  120 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/9
  121 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/10
  122 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/11
  123 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/12
  124 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/13
  125 root      20   0     0    0    0 S    0  0.0   0:00.66 khungtaskd
  126 root      20   0     0    0    0 S    0  0.0   0:04.11 kswapd0
  127 root      25   5     0    0    0 S    0  0.0   0:00.00 ksmd
  128 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/0
  129 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/1
  130 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/2
  131 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/3
  132 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/4
  133 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/5
  134 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/6
  135 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/7
  136 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/8
  137 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/9
  138 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/10
  139 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/11
  140 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/12
  141 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/13
  142 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/0
  143 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/1
  144 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/2
  145 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/3
  146 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/4
  147 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/5
  148 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/6
  149 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/7
  150 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/8
  151 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/9
  152 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/10
  153 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/11
  154 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/12
  155 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/13
  158 root      20   0     0    0    0 S    0  0.0   0:00.00 khvcd
  211 root      20   0     0    0    0 S    0  0.0   0:00.00 kstriped
  220 root      20   0     0    0    0 S    0  0.0   3:11.03 kjournald
  266 root      16  -4 16744  744  372 S    0  0.0   0:00.21 udevd
  304 root      18  -2 16740  724  348 S    0  0.0   0:01.55 udevd
  368 root      20   0     0    0    0 S    0  0.0   0:59.37 flush-202:2
  720 root      20   0  6468  604  480 S    0  0.0  10:55.47 vnstatd
  745 root      20   0     0    0    0 S    0  0.0   0:00.00 kauditd
 3233 root      20   0 19164 1716 1332 S    0  0.0   0:00.25 mysqld_safe
 4126 root      20   0  5352  688  584 S    0  0.0   0:00.02 logger
 4581 root      20   0 49176 1140  584 S    0  0.0   0:05.95 sshd
 4708 root      16  -4 45180  964  612 S    0  0.0   0:00.31 auditd
 4710 root      12  -8 14296  780  648 S    0  0.0   0:00.73 audispd
 4739 root      20   0 22432 1060  796 S    0  0.0   7:57.74 cron
 5897 root      20   0  117m 1676 1068 S    0  0.0   2:45.63 rsyslogd
 5961 daemon    20   0 18716  448  284 S    0  0.0   0:00.01 atd
 5987 pdnsd     20   0  207m 1984  632 S    0  0.0   2:55.65 pdnsd
 6010 messageb  20   0 23268  788  564 S    0  0.0   0:00.01 dbus-daemon
 6631 root      20   0 70480 3184 2492 S    0  0.0   0:00.03 sshd
 6979 chris     20   0 70480 1584  876 S    0  0.0   0:00.50 sshd
 6980 chris     20   0 25736 8528 1544 S    0  0.1   0:00.59 bash
 7542 root      20   0 24572 1244  992 S    0  0.0   0:00.00 sudo
 7543 root      20   0 22156 5036 1632 S    0  0.1   0:01.14 bash
 7621 root      20   0 41872 8756 1820 S    0  0.1   2:24.25 munin-node
 7651 root      20   0  5932  612  516 S    0  0.0   0:00.00 getty
 9389 root      20   0 37176 2384 1868 S    0  0.0   2:09.26 master
 9393 postfix   20   0 39472 2644 1984 S    0  0.0   0:19.09 qmgr
 9394 root      20   0 28712 1736 1224 S    0  0.0   0:02.12 pure-ftpd
13773 root      20   0 56612  17m 1544 S    0  0.2  10:06.22 lfd
14843 postfix   20   0 39240 2420 1912 S    0  0.0   0:00.01 pickup
18931 postfix   20   0 42176 3708 2440 S    0  0.0   0:11.16 tlsmgr
20232 root      18  -2 16740  596  228 S    0  0.0   0:00.00 udevd
30567 root      20   0  734m 6976 1828 S    0  0.1   0:00.18 php-fpm
36953 postfix   20   0 39252 2396 1908 S    0  0.0   0:00.01 trivial-rewrite
36954 postfix   20   0 43660 3244 2568 S    0  0.0   0:00.01 smtp
36955 postfix   20   0 43660 3244 2568 S    0  0.0   0:00.00 smtp
36960 postfix   20   0 39272 2416 1924 S    0  0.0   0:00.01 bounce
36961 postfix   20   0 39272 2364 1888 S    0  0.0   0:00.00 bounce
39541 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
39542 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
39543 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
39546 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
39548 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
39549 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
39551 root      20   0 10612 1356 1148 S    0  0.0   0:00.00 bash
39552 root      20   0 10660 1412 1156 S    0  0.0   0:00.15 bash
39581 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
39582 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
39585 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
39594 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
39598 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
39601 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
39605 root      20   0 10612 1352 1148 S    0  0.0   0:00.01 bash
39610 root      20   0 10660 1408 1156 S    0  0.0   0:00.12 bash
39671 root      20   0 70480 3180 2492 S    0  0.0   0:00.02 sshd
39758 chris     20   0 70480 1580  876 S    0  0.0   0:00.13 sshd
39759 chris     20   0 25736 8524 1544 S    0  0.1   0:00.54 bash
39800 root      20   0 24572 1244  992 S    0  0.0   0:00.00 sudo
39801 root      20   0 22104 4920 1568 S    0  0.1   0:00.33 bash
39802 root      20   0 41872 8016 1080 S    0  0.1   0:01.43 munin-node
39851 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
39855 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
39856 root      20   0 32812 1080  776 S    0  0.0   0:00.01 cron
39858 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
39860 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
39862 root      20   0 10624 1372 1148 S    0  0.0   0:00.21 bash
39865 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
39866 root      20   0 10644 1368 1124 S    0  0.0   0:00.05 bash
39876 root      20   0 10672 1436 1168 S    0  0.0   0:00.04 bash
40018 www-data  20   0 72968  11m 1952 R    0  0.1   0:08.89 nginx
40021 www-data  20   0 72968  11m 1916 R    0  0.1   0:02.40 nginx
40025 www-data  20   0     0    0    0 R    0  0.0   0:02.95 nginx
40027 www-data  20   0     0    0    0 R    0  0.0   0:02.15 nginx
40028 www-data  20   0 72968  11m 1940 R    0  0.1   0:02.54 nginx
40030 www-data  20   0 72968  11m 1940 R    0  0.1   0:01.67 nginx
40031 www-data  20   0 72968  11m 1948 R    0  0.1   0:02.55 nginx
40033 www-data  20   0     0    0    0 Z    0  0.0   0:01.72 nginx <defunct>
40035 www-data  20   0 72968  11m 1932 S    0  0.1   0:02.22 nginx
40036 www-data  20   0     0    0    0 Z    0  0.0   0:01.46 nginx <defunct>
40037 www-data  20   0 69868 8392 1968 R    0  0.1   0:01.30 nginx
40038 www-data  20   0 72968  11m 1928 R    0  0.1   0:01.20 nginx
40039 www-data  20   0 72968  11m 1928 R    0  0.1   0:01.24 nginx
40040 www-data  20   0 72968  11m 1948 R    0  0.1   0:01.66 nginx
40041 www-data  20   0 72968  11m 1940 R    0  0.1   0:01.87 nginx
40042 www-data  20   0 72968  11m 1928 R    0  0.1   0:03.00 nginx
40043 www-data  20   0 72968  11m 1948 R    0  0.1   0:02.97 nginx
40045 www-data  20   0 72968  11m 1944 S    0  0.1   0:02.72 nginx
40046 www-data  20   0 72968  11m 1936 S    0  0.1   0:02.87 nginx
40049 www-data  20   0 72968  11m 1928 S    0  0.1   0:01.80 nginx
40226 root      20   0 10612 1348 1136 S    0  0.0   0:00.00 bash
40238 tn        20   0 36888 1236  968 S    0  0.0   0:00.00 su
40245 tn        20   0 10592 1304 1112 S    0  0.0   0:00.00 bash
40249 tn        20   0  255m  46m 9168 S    0  0.6   0:08.76 php
40439 root      20   0 10612 1344 1136 S    0  0.0   0:00.00 bash
40442 tn        20   0 36888 1232  968 S    0  0.0   0:00.18 su
40474 tn        20   0 10592 1304 1112 S    0  0.0   0:00.27 bash
40526 tn        20   0  252m  44m 9156 S    0  0.5   0:07.16 php
41328 tn        20   0  3956  580  484 S    0  0.0   0:00.00 sh
41540 tn        20   0  3956  576  484 S    0  0.0   0:00.05 sh
41644 nobody    20   0     0    0    0 Z    0  0.0   0:00.18 phpfpm_st <defunct>
41911 root      20   0  5368  568  480 S    0  0.0   0:00.00 sleep
41913 root      20   0  5368  560  480 S    0  0.0   0:00.00 sleep
41914 root      20   0 18320 2296 1624 S    0  0.0   0:00.01 perl
41972 www-data  20   0 72968 9424  504 R    0  0.1   0:00.00 nginx
41985 www-data  20   0 72968 9424  504 R    0  0.1   0:00.00 nginx
41986 www-data  20   0 72968 9424  504 R    0  0.1   0:00.00 nginx
41988 www-data  20   0 72968 9424  504 R    0  0.1   0:00.00 nginx
41993 www-data  20   0 72968 9424  504 R    0  0.1   0:00.00 nginx
41994 www-data  20   0 72968 9424  504 R    0  0.1   0:00.00 nginx
41995 www-data  20   0 72968 9424  504 R    0  0.1   0:00.00 nginx
41997 root      20   0  3956  608  492 S    0  0.0   0:00.00 newrelic-daemon
42005 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42007 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42008 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42013 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42020 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42022 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42025 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42026 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42028 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42031 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42036 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42039 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42041 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42044 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42045 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42046 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42047 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42048 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42050 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42055 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42056 www-data  20   0 72968 9424  504 S    0  0.1   0:00.00 nginx
42057 www-data  20   0 72968 9436  516 S    0  0.1   0:00.00 nginx
42060 root      20   0  3872  500  416 S    0  0.0   0:00.00 sleep
42062 root      20   0  5368  564  480 S    0  0.0   0:00.00 sleep
42066 root      20   0 10624  528  304 S    0  0.0   0:00.00 bash
42069 root      20   0 39852 2380 1860 R    0  0.0   0:00.00 mysqladmin
42071 root      20   0 10376  912  768 S    0  0.0   0:00.00 awk
42072 root      20   0  7552  820  704 S    0  0.0   0:00.00 grep
42073 root      20   0 10376  916  768 S    0  0.0   0:00.00 awk
44174 ntp       20   0 38340 2180 1592 S    0  0.0   2:01.49 ntpd
45625 root      20   0 10352 1596  876 S    0  0.0   0:00.04 man
45719 root      20   0  9884  992  800 S    0  0.0   0:00.05 pager

====================

comment:97 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 40.18 to 40.28

The load spiked to 70. The following has been written to /var/log/high-load.log since the above was added:

====================
nginx high load on
ONEX_LOAD = 1000
FIVX_LOAD = 527
uptime :
 13:56:21 up 11 days,  4:40,  2 users,  load average: 32.23, 10.49, 4.28
vmstat :
         8175 M total memory
         7352 M used memory
         4637 M active memory
         2035 M inactive memory
          822 M free memory
          702 M buffer memory
         2943 M swap cache
         1023 M total swap
            0 M used swap
         1023 M free swap
     30507990 non-nice user cpu ticks
           15 nice user cpu ticks
     45545416 system cpu ticks
   2536353958 idle cpu ticks
      4714177 IO-wait cpu ticks
          536 IRQ cpu ticks
       647216 softirq cpu ticks
      6359171 stolen cpu ticks
      7042505 pages paged in
    363318888 pages paged out
            0 pages swapped in
            0 pages swapped out
   1029974210 interrupts
    789452439 CPU context switches
   1379837755 boot time
     20691756 forks
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
xvda2 391625   2507 13674530 2078048 17916830 49812579 726637776 954473156      0  29479
xvda1  26909  24402  410488  114452      0      0       0       0      0     19
top :
top - 13:56:21 up 11 days,  4:40,  2 users,  load average: 32.23, 10.49, 4.28
Tasks: 325 total,  23 running, 297 sleeping,   0 stopped,   5 zombie
Cpu0  :  2.6%us,  2.4%sy,  0.0%ni, 92.2%id,  2.2%wa,  0.0%hi,  0.3%si,  0.3%st
Cpu1  :  1.7%us,  2.3%sy,  0.0%ni, 95.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.3%st
Cpu2  :  1.7%us,  2.2%sy,  0.0%ni, 95.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu3  :  1.3%us,  1.9%sy,  0.0%ni, 96.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu4  :  1.2%us,  1.8%sy,  0.0%ni, 96.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu5  :  1.0%us,  1.7%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu6  :  0.9%us,  1.6%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu7  :  0.9%us,  1.6%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu8  :  0.9%us,  1.5%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu9  :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu10 :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu11 :  0.8%us,  1.5%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu12 :  0.8%us,  1.4%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu13 :  0.8%us,  1.4%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:   8372060k total,  7502616k used,   869444k free,   718960k buffers
Swap:  1048568k total,        0k used,  1048568k free,  3014180k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30588 www-data  20   0  770m  90m  50m R   65  1.1   2:10.47 php-fpm
30579 www-data  20   0  765m  87m  52m S   55  1.1   2:11.09 php-fpm
30584 www-data  20   0  771m  94m  54m S   53  1.2   2:12.63 php-fpm
55858 www-data  20   0  739m  17m 7864 R   53  0.2   0:01.64 php-fpm
30575 www-data  20   0  758m  80m  52m R   52  1.0   2:04.57 php-fpm
40034 www-data  20   0  775m  96m  51m R   52  1.2   1:47.48 php-fpm
39927 www-data  20   0  759m  80m  51m R   50  1.0   1:49.42 php-fpm
30585 www-data  20   0  759m  81m  52m R   46  1.0   1:54.97 php-fpm
30581 www-data  20   0  761m  83m  53m R   45  1.0   2:04.42 php-fpm
39848 www-data  20   0  775m  96m  51m R   45  1.2   1:56.80 php-fpm
30587 www-data  20   0  760m  86m  57m R   43  1.1   2:04.33 php-fpm
30589 www-data  20   0  794m 117m  53m R   43  1.4   2:04.99 php-fpm
30577 www-data  20   0  770m  88m  49m R   40  1.1   2:06.27 php-fpm
39831 www-data  20   0  771m  92m  51m R   40  1.1   2:16.28 php-fpm
30573 www-data  20   0  785m 113m  58m R   38  1.4   1:58.50 php-fpm
30578 www-data  20   0  761m  82m  52m R   38  1.0   2:30.05 php-fpm
55854 www-data  20   0  739m  18m 8012 R   38  0.2   0:01.36 php-fpm
30576 www-data  20   0  773m  94m  52m R   36  1.2   1:40.62 php-fpm
30582 www-data  20   0  772m  95m  54m R   34  1.2   1:58.82 php-fpm
30586 www-data  20   0  772m  93m  51m R   34  1.1   2:10.08 php-fpm
55977 www-data  20   0  739m  17m 7872 R   34  0.2   0:00.25 php-fpm
55857 www-data  20   0  739m  18m 8068 R   29  0.2   0:01.28 php-fpm
26795 root      20   0 72952  10m 1764 S   28  0.1   0:08.25 nginx
49887 www-data  20   0 73000  10m 1944 S   21  0.1   0:13.44 nginx
 4125 mysql     20   0 2782m 2.0g  10m S    7 25.3 147:23.31 mysqld
55383 root      20   0 41872 8016 1080 S    7  0.1   0:06.21 munin-node
56011 root      20   0 19200 1396  912 R    5  0.0   0:00.05 top
56252 nobody    20   0 22244 3616 1792 R    5  0.0   0:00.03 diskstats
55193 root      20   0 10752 1584 1228 S    3  0.0   0:02.34 bash
55650 root      20   0 10752 1588 1228 S    3  0.0   0:08.28 bash
56049 root      20   0 18320 2296 1624 S    3  0.0   0:00.02 perl
58401 redis     20   0  475m  79m  928 R    3  1.0   5:29.48 redis-server
 5897 root      20   0  117m 1676 1068 S    2  0.0   2:46.25 rsyslogd
49896 www-data  20   0 73000  10m 1944 S    2  0.1   0:05.12 nginx
49897 www-data  20   0 73000  10m 1896 S    2  0.1   0:00.45 nginx
49901 www-data  20   0 73000  10m 1928 S    2  0.1   0:01.14 nginx
49906 www-data  20   0 73000  10m 1936 S    2  0.1   0:00.81 nginx
49907 www-data  20   0 73000  10m 1916 S    2  0.1   0:14.72 nginx
55910 root      20   0 10624 1368 1144 S    2  0.0   0:00.01 bash
56105 postfix   20   0 39340 2524 2004 S    2  0.0   0:00.01 cleanup
56149 postfix   20   0 39500 3080 2336 S    2  0.0   0:00.01 local
56164 postfix   20   0 39500 2940 2224 S    2  0.0   0:00.01 local
    1 root      20   0  8356  768  636 S    0  0.0   0:56.02 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   1:13.87 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:55.23 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root      RT   0     0    0    0 S    0  0.0   1:17.61 migration/1
    7 root      20   0     0    0    0 S    0  0.0   1:25.39 ksoftirqd/1
    8 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1
    9 root      RT   0     0    0    0 S    0  0.0   1:07.32 migration/2
   10 root      20   0     0    0    0 S    0  0.0   1:19.50 ksoftirqd/2
   11 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/2
   12 root      RT   0     0    0    0 S    0  0.0   1:09.32 migration/3
   13 root      20   0     0    0    0 S    0  0.0   1:45.18 ksoftirqd/3
   14 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/3
   15 root      RT   0     0    0    0 S    0  0.0   1:06.28 migration/4
   16 root      20   0     0    0    0 S    0  0.0   1:38.55 ksoftirqd/4
   17 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/4
   18 root      RT   0     0    0    0 S    0  0.0   1:07.12 migration/5
   19 root      20   0     0    0    0 S    0  0.0   1:47.44 ksoftirqd/5
   20 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/5
   21 root      RT   0     0    0    0 S    0  0.0   1:08.42 migration/6
   22 root      20   0     0    0    0 S    0  0.0   1:37.52 ksoftirqd/6
   23 root      RT   0     0    0    0 S    0  0.0   0:00.02 watchdog/6
   24 root      RT   0     0    0    0 S    0  0.0   1:01.27 migration/7
   25 root      20   0     0    0    0 S    0  0.0   1:41.68 ksoftirqd/7
   26 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/7
   27 root      RT   0     0    0    0 S    0  0.0   1:03.10 migration/8
   28 root      20   0     0    0    0 S    0  0.0   1:40.43 ksoftirqd/8
   29 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/8
   30 root      RT   0     0    0    0 S    0  0.0   1:06.53 migration/9
   31 root      20   0     0    0    0 S    0  0.0   1:42.21 ksoftirqd/9
   32 root      RT   0     0    0    0 S    0  0.0   0:00.01 watchdog/9
   33 root      RT   0     0    0    0 S    0  0.0   1:05.54 migration/10
   34 root      20   0     0    0    0 S    0  0.0   1:53.28 ksoftirqd/10
   35 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/10
   36 root      RT   0     0    0    0 S    0  0.0   1:03.77 migration/11
   37 root      20   0     0    0    0 S    0  0.0   1:41.66 ksoftirqd/11
   38 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/11
   39 root      RT   0     0    0    0 S    0  0.0   1:04.98 migration/12
   40 root      20   0     0    0    0 S    0  0.0   1:28.28 ksoftirqd/12
   41 root      RT   0     0    0    0 S    0  0.0   0:00.03 watchdog/12
   42 root      RT   0     0    0    0 S    0  0.0   1:08.80 migration/13
   43 root      20   0     0    0    0 S    0  0.0   0:20.57 ksoftirqd/13
   44 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/13
   45 root      20   0     0    0    0 S    0  0.0   0:15.28 events/0
   46 root      20   0     0    0    0 S    0  0.0   0:17.31 events/1
   47 root      20   0     0    0    0 S    0  0.0   0:15.72 events/2
   48 root      20   0     0    0    0 S    0  0.0   0:13.02 events/3
   49 root      20   0     0    0    0 S    0  0.0   0:14.59 events/4
   50 root      20   0     0    0    0 S    0  0.0   0:18.14 events/5
   51 root      20   0     0    0    0 S    0  0.0   0:13.82 events/6
   52 root      20   0     0    0    0 S    0  0.0   0:14.09 events/7
   53 root      20   0     0    0    0 S    0  0.0   0:14.16 events/8
   54 root      20   0     0    0    0 S    0  0.0   0:13.43 events/9
   55 root      20   0     0    0    0 S    0  0.0   0:10.98 events/10
   56 root      20   0     0    0    0 S    0  0.0   0:11.46 events/11
   57 root      20   0     0    0    0 S    0  0.0   0:14.47 events/12
   58 root      20   0     0    0    0 S    0  0.0   0:18.77 events/13
   59 root      20   0     0    0    0 S    0  0.0   0:00.00 cpuset
   60 root      20   0     0    0    0 S    0  0.0   0:00.00 khelper
   61 root      20   0     0    0    0 S    0  0.0   0:00.00 netns
   62 root      20   0     0    0    0 S    0  0.0   0:00.00 async/mgr
   63 root      20   0     0    0    0 S    0  0.0   0:00.00 pm
   64 root      20   0     0    0    0 S    0  0.0   0:00.00 xenwatch
   65 root      20   0     0    0    0 S    0  0.0   0:00.00 xenbus
   66 root      20   0     0    0    0 S    0  0.0   0:02.12 sync_supers
   67 root      20   0     0    0    0 S    0  0.0   0:12.14 bdi-default
   68 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/0
   69 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/1
   70 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/2
   71 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/3
   72 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/4
   73 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/5
   74 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/6
   75 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/7
   76 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/8
   77 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/9
   78 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/10
   79 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/11
   80 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/12
   81 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/13
   82 root      20   0     0    0    0 S    0  0.0   0:11.51 kblockd/0
   83 root      20   0     0    0    0 S    0  0.0   0:00.03 kblockd/1
   84 root      20   0     0    0    0 S    0  0.0   0:00.01 kblockd/2
   85 root      20   0     0    0    0 S    0  0.0   0:00.01 kblockd/3
   86 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/4
   87 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/5
   88 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/6
   89 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/7
   90 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/8
   91 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/9
   92 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/10
   93 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/11
   94 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/12
   95 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/13
   96 root      20   0     0    0    0 S    0  0.0   0:00.00 kseriod
  111 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/0
  112 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/1
  113 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/2
  114 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/3
  115 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/4
  116 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/5
  117 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/6
  118 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/7
  119 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/8
  120 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/9
  121 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/10
  122 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/11
  123 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/12
  124 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/13
  125 root      20   0     0    0    0 S    0  0.0   0:00.66 khungtaskd
  126 root      20   0     0    0    0 S    0  0.0   0:04.11 kswapd0
  127 root      25   5     0    0    0 S    0  0.0   0:00.00 ksmd
  128 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/0
  129 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/1
  130 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/2
  131 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/3
  132 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/4
  133 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/5
  134 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/6
  135 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/7
  136 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/8
  137 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/9
  138 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/10
  139 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/11
  140 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/12
  141 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/13
  142 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/0
  143 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/1
  144 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/2
  145 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/3
  146 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/4
  147 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/5
  148 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/6
  149 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/7
  150 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/8
  151 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/9
  152 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/10
  153 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/11
  154 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/12
  155 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/13
  158 root      20   0     0    0    0 S    0  0.0   0:00.00 khvcd
  211 root      20   0     0    0    0 S    0  0.0   0:00.00 kstriped
  220 root      20   0     0    0    0 S    0  0.0   3:12.40 kjournald
  266 root      16  -4 16744  744  372 S    0  0.0   0:00.21 udevd
  304 root      18  -2 16740  724  348 S    0  0.0   0:01.55 udevd
  368 root      20   0     0    0    0 S    0  0.0   0:59.54 flush-202:2
  720 root      20   0  6468  604  480 S    0  0.0  11:19.83 vnstatd
  745 root      20   0     0    0    0 S    0  0.0   0:00.00 kauditd
 3233 root      20   0 19164 1716 1332 S    0  0.0   0:00.25 mysqld_safe
 4126 root      20   0  5352  688  584 S    0  0.0   0:00.02 logger
 4581 root      20   0 49176 1140  584 S    0  0.0   0:05.97 sshd
 4708 root      16  -4 45180  964  612 S    0  0.0   0:00.31 auditd
 4710 root      12  -8 14296  780  648 S    0  0.0   0:00.73 audispd
 4739 root      20   0 22432 1060  796 S    0  0.0   8:11.61 cron
 5961 daemon    20   0 18716  448  284 S    0  0.0   0:00.01 atd
 5987 pdnsd     20   0  207m 1984  632 S    0  0.0   2:56.16 pdnsd
 6010 messageb  20   0 23268  788  564 S    0  0.0   0:00.01 dbus-daemon
 6631 root      20   0 70480 3184 2492 S    0  0.0   0:00.03 sshd
 6979 chris     20   0 70480 1584  876 S    0  0.0   0:00.53 sshd
 6980 chris     20   0 25736 8528 1544 S    0  0.1   0:00.59 bash
 7542 root      20   0 24572 1244  992 S    0  0.0   0:00.00 sudo
 7543 root      20   0 22156 5040 1636 S    0  0.1   0:01.15 bash
 7621 root      20   0 41872 8756 1820 S    0  0.1   2:24.60 munin-node
 7651 root      20   0  5932  612  516 S    0  0.0   0:00.00 getty
 9389 root      20   0 37176 2384 1868 S    0  0.0   2:09.38 master
 9393 postfix   20   0 39472 2644 1984 S    0  0.0   0:19.11 qmgr
 9394 root      20   0 28712 1736 1224 S    0  0.0   0:02.12 pure-ftpd
13773 root      20   0 56612  17m 1544 S    0  0.2  10:28.80 lfd
14843 postfix   20   0 39240 2420 1912 S    0  0.0   0:00.02 pickup
18931 postfix   20   0 42176 3708 2440 S    0  0.0   0:11.17 tlsmgr
20232 root      18  -2 16740  596  228 S    0  0.0   0:00.00 udevd
30567 root      20   0  734m 7040 1888 S    0  0.1   0:04.30 php-fpm
39671 root      20   0 70480 3180 2492 S    0  0.0   0:00.02 sshd
39758 chris     20   0 70480 1580  876 S    0  0.0   0:00.13 sshd
39759 chris     20   0 25736 8524 1544 S    0  0.1   0:00.54 bash
39800 root      20   0 24572 1244  992 S    0  0.0   0:00.00 sudo
39801 root      20   0 22104 4920 1568 S    0  0.1   0:00.33 bash
44174 ntp       20   0 38340 2180 1592 S    0  0.0   2:05.58 ntpd
45625 root      20   0 10352 1596  876 S    0  0.0   0:00.04 man
45719 root      20   0  9884  992  800 S    0  0.0   0:00.05 pager
48232 root      20   0 32992 4572 2080 S    0  0.1   0:00.06 vi
49873 www-data  20   0 73000  10m 1940 S    0  0.1   0:00.91 nginx
49874 www-data  20   0 73000  10m 1944 S    0  0.1   0:13.47 nginx
49876 www-data  20   0 73000  10m 1920 S    0  0.1   0:00.50 nginx
49878 www-data  20   0 73000  10m 1896 S    0  0.1   0:00.96 nginx
49880 www-data  20   0 73000  10m 1964 S    0  0.1   0:04.36 nginx
49884 www-data  20   0 73000  10m 1960 S    0  0.1   0:01.18 nginx
49888 www-data  20   0 73000  10m 1940 S    0  0.1   0:00.99 nginx
49889 www-data  20   0 73000  10m 1936 S    0  0.1   0:00.79 nginx
49890 www-data  20   0 73000  10m 1920 S    0  0.1   0:02.53 nginx
49891 www-data  20   0 73000  10m 1924 S    0  0.1   0:00.78 nginx
49892 www-data  20   0 73000  10m 1924 S    0  0.1   0:00.74 nginx
49893 www-data  20   0 73000  10m 1920 S    0  0.1   0:03.38 nginx
49894 www-data  20   0 73000  10m 1920 S    0  0.1   0:04.42 nginx
49898 www-data  20   0 73000  10m 1924 S    0  0.1   0:08.21 nginx
49899 www-data  20   0 73000  10m 1916 S    0  0.1   0:00.57 nginx
49900 www-data  20   0 73000  10m 1924 S    0  0.1   0:00.78 nginx
49902 www-data  20   0 73000  10m 1932 S    0  0.1   0:01.13 nginx
49903 www-data  20   0 73000  10m 1936 S    0  0.1   0:01.09 nginx
49904 www-data  20   0 73000  10m 1924 S    0  0.1   0:05.91 nginx
49905 www-data  20   0 73000  10m 1960 S    0  0.1   0:10.76 nginx
49908 www-data  20   0 73000  10m 1932 S    0  0.1   0:01.19 nginx
49909 www-data  20   0 73000  10m 1936 S    0  0.1   0:17.92 nginx
49910 www-data  20   0 73000 9488  568 S    0  0.1   0:09.91 nginx
55173 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
55177 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
55179 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
55184 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
55188 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
55190 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
55195 root      20   0 10620 1364 1148 S    0  0.0   0:04.04 bash
55197 root      20   0 10660 1416 1156 S    0  0.0   0:06.56 bash
55225 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
55226 root      20   0 32812 1080  776 S    0  0.0   0:00.01 cron
55230 root      20   0  3956  580  484 S    0  0.0   0:00.09 sh
55233 root      20   0  3956  576  484 S    0  0.0   0:00.10 sh
55239 root      20   0 10620 1368 1148 S    0  0.0   0:04.06 bash
55243 root      20   0 10672 1432 1168 S    0  0.0   0:08.08 bash
55632 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
55635 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
55636 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
55643 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
55645 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
55646 root      20   0  3956  580  484 S    0  0.0   0:00.03 sh
55654 root      20   0 10620 1368 1148 S    0  0.0   0:01.03 bash
55677 root      20   0 10660 1416 1156 S    0  0.0   0:00.04 bash
55709 root      20   0 18320 2296 1624 S    0  0.0   0:00.06 perl
55737 root      20   0  3956  612  492 S    0  0.0   0:00.15 newrelic-daemon
55774 nobody    20   0     0    0    0 Z    0  0.0   0:07.09 df_inode <defunct>
55802 root      20   0     0    0    0 Z    0  0.0   0:04.78 mysql_que <defunct>
55830 root      20   0     0    0    0 Z    0  0.0   0:04.67 mysql_que <defunct>
55853 nobody    20   0     0    0    0 Z    0  0.0   0:02.80 nginx_req <defunct>
55864 nobody    20   0     0    0    0 Z    0  0.0   0:05.41 nginx_req <defunct>
55868 root      20   0 18320 2304 1624 S    0  0.0   0:04.59 perl
55890 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
55897 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
55898 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
55903 root      20   0  3872  500  416 S    0  0.0   0:00.00 sleep
55906 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
55915 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
55916 root      20   0  5368  564  480 S    0  0.0   0:00.00 sleep
55932 root      20   0 10644 1356 1116 S    0  0.0   0:00.00 bash
55936 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
55946 root      20   0 10660 1416 1156 S    0  0.0   0:00.01 bash
55950 root      20   0  5368  560  480 S    0  0.0   0:00.00 sleep
56044 root      20   0  3956  612  492 S    0  0.0   0:00.00 newrelic-daemon
56125 root      20   0  5368  564  480 S    0  0.0   0:00.00 sleep
56129 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56130 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56132 root      20   0  5368  568  480 S    0  0.0   0:00.00 sleep
56134 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56136 postfix   20   0 39252 2400 1908 S    0  0.0   0:00.00 trivial-rewrite
56139 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56146 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56147 root      20   0  3872  500  416 S    0  0.0   0:00.00 sleep
56148 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56155 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56158 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56166 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56169 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56174 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56177 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56178 root      20   0  3956  612  492 S    0  0.0   0:00.00 newrelic-daemon
56185 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56190 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56193 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56196 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56209 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56211 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56216 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56219 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56221 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56229 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56232 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56238 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56244 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56246 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56249 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56250 root      20   0  3872  500  416 S    0  0.0   0:00.00 sleep
56254 www-data  20   0 72952 9404  476 S    0  0.1   0:00.00 nginx
56256 www-data  20   0 72952 9400  472 S    0  0.1   0:00.00 nginx
56269 root      20   0 10624  532  304 S    0  0.0   0:00.00 bash

====================

And another report, captured at 14:08:33:

====================
nginx high load on
ONEX_LOAD = 2348
FIVX_LOAD = 830
uptime :
 14:08:33 up 11 days,  4:53,  2 users,  load average: 38.70, 14.25, 6.83
vmstat :
         8175 M total memory
         7214 M used memory
         4514 M active memory
         2026 M inactive memory
          961 M free memory
          702 M buffer memory
         2949 M swap cache
         1023 M total swap
            0 M used swap
         1023 M free swap
     30545673 non-nice user cpu ticks
           15 nice user cpu ticks
     45732405 system cpu ticks
   2538005325 idle cpu ticks
      4718092 IO-wait cpu ticks
          536 IRQ cpu ticks
       647908 softirq cpu ticks
      6407733 stolen cpu ticks
      7046089 pages paged in
    363616780 pages paged out
            0 pages swapped in
            0 pages swapped out
   1031704229 interrupts
    790432564 CPU context switches
   1379837710 boot time
     20727741 forks
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
xvda2 391902   2507 13681690 2079772 17933984 49837492 727233752 954914288      0  29503
xvda1  26909  24402  410488  114452      0      0       0       0      0     19
top :
top - 14:08:34 up 11 days,  4:53,  2 users,  load average: 38.70, 14.25, 6.83
Tasks: 301 total,  27 running, 274 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.6%us,  2.4%sy,  0.0%ni, 92.2%id,  2.2%wa,  0.0%hi,  0.3%si,  0.3%st
Cpu1  :  1.7%us,  2.3%sy,  0.0%ni, 95.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.3%st
Cpu2  :  1.7%us,  2.2%sy,  0.0%ni, 95.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu3  :  1.3%us,  1.9%sy,  0.0%ni, 96.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu4  :  1.2%us,  1.8%sy,  0.0%ni, 96.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu5  :  1.0%us,  1.7%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu6  :  0.9%us,  1.6%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu7  :  0.9%us,  1.6%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu8  :  0.9%us,  1.6%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu9  :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu10 :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu11 :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu12 :  0.8%us,  1.5%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu13 :  0.8%us,  1.5%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:   8372060k total,  7404988k used,   967072k free,   719500k buffers
Swap:  1048568k total,        0k used,  1048568k free,  3019864k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30581 www-data  20   0  759m  85m  55m R   86  1.0   3:08.44 php-fpm
30584 www-data  20   0  769m  94m  55m R   80  1.2   3:06.66 php-fpm
30589 www-data  20   0  794m 117m  53m R   80  1.4   2:51.56 php-fpm
39927 www-data  20   0  758m  80m  52m R   69  1.0   2:32.86 php-fpm
30586 www-data  20   0  773m  94m  52m R   57  1.2   3:13.68 php-fpm
55858 www-data  20   0  759m  76m  47m R   48  0.9   0:58.81 php-fpm
26630 www-data  20   0  739m  16m 6132 R   46  0.2   0:19.78 php-fpm
26659 www-data  20   0  739m  17m 7852 R   46  0.2   0:18.24 php-fpm
30579 www-data  20   0  765m  89m  54m R   46  1.1   2:52.85 php-fpm
30582 www-data  20   0  770m  95m  55m R   46  1.2   3:03.82 php-fpm
30587 www-data  20   0  759m  86m  58m R   46  1.1   2:49.60 php-fpm
55854 www-data  20   0  761m  78m  48m R   46  1.0   1:08.47 php-fpm
55857 www-data  20   0  763m  80m  47m R   46  1.0   0:44.35 php-fpm
55977 www-data  20   0  750m  73m  53m R   46  0.9   0:56.27 php-fpm
26651 www-data  20   0  739m  17m 7856 R   45  0.2   0:19.76 php-fpm
40034 www-data  20   0  776m  98m  52m R   45  1.2   2:54.24 php-fpm
30578 www-data  20   0  761m  83m  52m R   43  1.0   3:37.15 php-fpm
30585 www-data  20   0  759m  83m  54m R   43  1.0   2:48.34 php-fpm
30588 www-data  20   0  773m  94m  51m R   43  1.2   3:15.86 php-fpm
26667 www-data  20   0  739m  17m 7856 R   41  0.2   0:16.89 php-fpm
39831 www-data  20   0  769m  89m  51m R   39  1.1   3:01.63 php-fpm
39848 www-data  20   0  775m  97m  52m R   32  1.2   3:03.99 php-fpm
26795 root      20   0 72968  10m 1764 S   23  0.1   0:08.68 nginx
15293 root      20   0 31072 9124 1948 S    4  0.1   0:38.27 csf
26815 root      20   0 18320 2292 1624 S    4  0.0   0:00.02 perl
26867 root      20   0 19200 1376  912 R    4  0.0   0:00.04 top
 8213 www-data  20   0 72984  10m 1868 S    2  0.1   0:05.11 nginx
 9389 root      20   0 37176 2384 1868 S    2  0.0   2:09.41 master
26874 postfix   20   0 39500 3084 2336 S    2  0.0   0:00.01 local
26889 aegir     20   0 39500 3052 2312 S    2  0.0   0:00.01 local
26908 postfix   20   0 39500 2940 2224 S    2  0.0   0:00.01 local
26979 root      20   0 16072 1660  552 R    2  0.0   0:00.01 iptables
58401 redis     20   0  475m  82m  928 S    2  1.0   5:37.49 redis-server
    1 root      20   0  8356  768  636 S    0  0.0   0:56.04 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   1:14.48 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:55.26 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root      RT   0     0    0    0 S    0  0.0   1:18.02 migration/1
    7 root      20   0     0    0    0 S    0  0.0   1:25.40 ksoftirqd/1
    8 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1
    9 root      RT   0     0    0    0 S    0  0.0   1:07.62 migration/2
   10 root      20   0     0    0    0 S    0  0.0   1:19.51 ksoftirqd/2
   11 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/2
   12 root      RT   0     0    0    0 S    0  0.0   1:09.78 migration/3
   13 root      20   0     0    0    0 S    0  0.0   1:45.20 ksoftirqd/3
   14 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/3
   15 root      RT   0     0    0    0 S    0  0.0   1:06.49 migration/4
   16 root      20   0     0    0    0 S    0  0.0   1:38.57 ksoftirqd/4
   17 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/4
   18 root      RT   0     0    0    0 S    0  0.0   1:07.36 migration/5
   19 root      20   0     0    0    0 S    0  0.0   1:47.46 ksoftirqd/5
   20 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/5
   21 root      RT   0     0    0    0 S    0  0.0   1:08.66 migration/6
   22 root      20   0     0    0    0 S    0  0.0   1:37.54 ksoftirqd/6
   23 root      RT   0     0    0    0 S    0  0.0   0:00.02 watchdog/6
   24 root      RT   0     0    0    0 S    0  0.0   1:01.51 migration/7
   25 root      20   0     0    0    0 S    0  0.0   1:41.70 ksoftirqd/7
   26 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/7
   27 root      RT   0     0    0    0 S    0  0.0   1:03.38 migration/8
   28 root      20   0     0    0    0 S    0  0.0   1:40.45 ksoftirqd/8
   29 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/8
   30 root      RT   0     0    0    0 S    0  0.0   1:06.63 migration/9
   31 root      20   0     0    0    0 S    0  0.0   1:42.39 ksoftirqd/9
   32 root      RT   0     0    0    0 S    0  0.0   0:00.01 watchdog/9
   33 root      RT   0     0    0    0 S    0  0.0   1:05.88 migration/10
   34 root      20   0     0    0    0 S    0  0.0   1:53.30 ksoftirqd/10
   35 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/10
   36 root      RT   0     0    0    0 S    0  0.0   1:04.08 migration/11
   37 root      20   0     0    0    0 S    0  0.0   1:41.68 ksoftirqd/11
   38 root      RT   0     0    0    0 S    0  0.0   0:00.02 watchdog/11
   39 root      RT   0     0    0    0 S    0  0.0   1:05.40 migration/12
   40 root      20   0     0    0    0 S    0  0.0   1:28.30 ksoftirqd/12
   41 root      RT   0     0    0    0 S    0  0.0   0:00.03 watchdog/12
   42 root      RT   0     0    0    0 S    0  0.0   1:09.22 migration/13
   43 root      20   0     0    0    0 S    0  0.0   0:20.76 ksoftirqd/13
   44 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/13
   45 root      20   0     0    0    0 S    0  0.0   0:15.30 events/0
   46 root      20   0     0    0    0 S    0  0.0   0:17.32 events/1
   47 root      20   0     0    0    0 S    0  0.0   0:15.73 events/2
   48 root      20   0     0    0    0 S    0  0.0   0:13.05 events/3
   49 root      20   0     0    0    0 S    0  0.0   0:14.60 events/4
   50 root      20   0     0    0    0 S    0  0.0   0:18.15 events/5
   51 root      20   0     0    0    0 S    0  0.0   0:13.85 events/6
   52 root      20   0     0    0    0 S    0  0.0   0:14.09 events/7
   53 root      20   0     0    0    0 S    0  0.0   0:14.18 events/8
   54 root      20   0     0    0    0 S    0  0.0   0:13.44 events/9
   55 root      20   0     0    0    0 S    0  0.0   0:11.03 events/10
   56 root      20   0     0    0    0 S    0  0.0   0:11.50 events/11
   57 root      20   0     0    0    0 S    0  0.0   0:14.53 events/12
   58 root      20   0     0    0    0 S    0  0.0   0:18.80 events/13
   59 root      20   0     0    0    0 S    0  0.0   0:00.00 cpuset
   60 root      20   0     0    0    0 S    0  0.0   0:00.00 khelper
   61 root      20   0     0    0    0 S    0  0.0   0:00.00 netns
   62 root      20   0     0    0    0 S    0  0.0   0:00.00 async/mgr
   63 root      20   0     0    0    0 S    0  0.0   0:00.00 pm
   64 root      20   0     0    0    0 S    0  0.0   0:00.00 xenwatch
   65 root      20   0     0    0    0 S    0  0.0   0:00.00 xenbus
   66 root      20   0     0    0    0 S    0  0.0   0:02.12 sync_supers
   67 root      20   0     0    0    0 S    0  0.0   0:12.14 bdi-default
   68 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/0
   69 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/1
   70 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/2
   71 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/3
   72 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/4
   73 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/5
   74 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/6
   75 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/7
   76 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/8
   77 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/9
   78 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/10
   79 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/11
   80 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/12
   81 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/13
   82 root      20   0     0    0    0 S    0  0.0   0:11.52 kblockd/0
   83 root      20   0     0    0    0 S    0  0.0   0:00.03 kblockd/1
   84 root      20   0     0    0    0 S    0  0.0   0:00.01 kblockd/2
   85 root      20   0     0    0    0 S    0  0.0   0:00.01 kblockd/3
   86 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/4
   87 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/5
   88 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/6
   89 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/7
   90 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/8
   91 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/9
   92 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/10
   93 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/11
   94 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/12
   95 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/13
   96 root      20   0     0    0    0 S    0  0.0   0:00.00 kseriod
  111 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/0
  112 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/1
  113 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/2
  114 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/3
  115 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/4
  116 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/5
  117 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/6
  118 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/7
  119 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/8
  120 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/9
  121 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/10
  122 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/11
  123 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/12
  124 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/13
  125 root      20   0     0    0    0 S    0  0.0   0:00.66 khungtaskd
  126 root      20   0     0    0    0 S    0  0.0   0:04.11 kswapd0
  127 root      25   5     0    0    0 S    0  0.0   0:00.00 ksmd
  128 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/0
  129 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/1
  130 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/2
  131 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/3
  132 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/4
  133 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/5
  134 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/6
  135 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/7
  136 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/8
  137 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/9
  138 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/10
  139 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/11
  140 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/12
  141 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/13
  142 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/0
  143 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/1
  144 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/2
  145 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/3
  146 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/4
  147 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/5
  148 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/6
  149 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/7
  150 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/8
  151 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/9
  152 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/10
  153 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/11
  154 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/12
  155 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/13
  158 root      20   0     0    0    0 S    0  0.0   0:00.00 khvcd
  211 root      20   0     0    0    0 S    0  0.0   0:00.00 kstriped
  220 root      20   0     0    0    0 S    0  0.0   3:12.56 kjournald
  266 root      16  -4 16744  744  372 S    0  0.0   0:00.21 udevd
  304 root      18  -2 16740  724  348 S    0  0.0   0:01.55 udevd
  368 root      20   0     0    0    0 S    0  0.0   0:59.57 flush-202:2
  720 root      20   0  6468  604  480 S    0  0.0  11:32.08 vnstatd
  745 root      20   0     0    0    0 S    0  0.0   0:00.00 kauditd
 3233 root      20   0 19164 1716 1332 S    0  0.0   0:00.25 mysqld_safe
 4125 mysql     20   0 2782m 2.0g  10m S    0 25.3 148:05.64 mysqld
 4126 root      20   0  5352  688  584 S    0  0.0   0:00.02 logger
 4581 root      20   0 49176 1140  584 S    0  0.0   0:05.97 sshd
 4708 root      16  -4 45180  964  612 S    0  0.0   0:00.31 auditd
 4710 root      12  -8 14296  780  648 S    0  0.0   0:00.73 audispd
 4739 root      20   0 22432 1060  796 S    0  0.0   8:20.01 cron
 5897 root      20   0  117m 1676 1068 S    0  0.0   2:46.69 rsyslogd
 5961 daemon    20   0 18716  448  284 S    0  0.0   0:00.01 atd
 5987 pdnsd     20   0  207m 1984  632 S    0  0.0   2:56.29 pdnsd
 6010 messageb  20   0 23268  788  564 S    0  0.0   0:00.01 dbus-daemon
 6631 root      20   0 70480 3184 2492 S    0  0.0   0:00.03 sshd
 6979 chris     20   0 70480 1584  876 S    0  0.0   0:00.53 sshd
 6980 chris     20   0 25736 8528 1544 S    0  0.1   0:00.59 bash
 7542 root      20   0 24572 1244  992 S    0  0.0   0:00.00 sudo
 7543 root      20   0 22156 5040 1636 S    0  0.1   0:01.15 bash
 7621 root      20   0 41872 8756 1820 S    0  0.1   2:24.67 munin-node
 7651 root      20   0  5932  612  516 S    0  0.0   0:00.00 getty
 8206 www-data  20   0 72984  10m 1796 S    0  0.1   0:05.87 nginx
 8219 www-data  20   0 72984  10m 1904 S    0  0.1   0:01.80 nginx
 8220 www-data  20   0 72984  10m 1732 S    0  0.1   0:02.48 nginx
 8222 www-data  20   0 72984  10m 1904 S    0  0.1   0:02.50 nginx
 8223 www-data  20   0 72984  10m 1876 S    0  0.1   0:02.92 nginx
 8224 www-data  20   0 72984 9868  936 S    0  0.1   0:01.37 nginx
 8225 www-data  20   0 72984  10m 1908 S    0  0.1   0:01.04 nginx
 8226 www-data  20   0 72984  10m 1860 S    0  0.1   0:03.13 nginx
 8228 www-data  20   0 72984  10m 1904 S    0  0.1   0:01.92 nginx
 8229 www-data  20   0 72984  10m 1900 S    0  0.1   0:02.64 nginx
 8232 www-data  20   0 72984  10m 1780 S    0  0.1   0:05.38 nginx
 8233 www-data  20   0 72984  10m 1856 S    0  0.1   0:01.01 nginx
 8234 www-data  20   0 72984 9.8m 1144 S    0  0.1   0:02.47 nginx
 8235 www-data  20   0 72984  10m 1916 S    0  0.1   0:02.88 nginx
 8237 www-data  20   0 72984  10m 1916 S    0  0.1   0:02.35 nginx
 8239 www-data  20   0 72984 9992 1060 S    0  0.1   0:00.43 nginx
 8242 www-data  20   0 72984  10m 1916 S    0  0.1   0:02.21 nginx
 9393 postfix   20   0 39472 2644 1984 S    0  0.0   0:19.12 qmgr
 9394 root      20   0 28712 1736 1224 S    0  0.0   0:02.12 pure-ftpd
13773 root      20   0 56612  17m 1544 S    0  0.2  10:29.66 lfd
14843 postfix   20   0 39240 2420 1912 S    0  0.0   0:00.02 pickup
18931 postfix   20   0 42176 3708 2440 S    0  0.0   0:11.18 tlsmgr
20232 root      18  -2 16740  596  228 S    0  0.0   0:00.00 udevd
26324 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
26325 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
26330 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
26331 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
26335 root      20   0 10624 1368 1148 S    0  0.0   0:01.82 bash
26336 root      20   0 10660 1412 1156 S    0  0.0   0:00.00 bash
26360 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
26361 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
26368 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
26371 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
26373 root      20   0 10624 1368 1148 S    0  0.0   0:01.14 bash
26374 root      20   0 10660 1416 1156 S    0  0.0   0:00.07 bash
26588 root      20   0 32812 1080  776 S    0  0.0   0:02.98 cron
26589 root      20   0 32812 1080  776 S    0  0.0   0:02.69 cron
26603 root      20   0  3956  580  484 S    0  0.0   0:01.76 sh
26609 root      20   0  3956  576  484 S    0  0.0   0:01.74 sh
26629 root      20   0 10672 1436 1168 S    0  0.0   0:04.30 bash
26634 root      20   0 10624 1364 1144 S    0  0.0   0:02.02 bash
26696 root      20   0 32812 1080  776 S    0  0.0   0:02.36 cron
26700 root      20   0 32812 1080  776 S    0  0.0   0:02.60 cron
26701 root      20   0 32812 1080  776 S    0  0.0   0:02.82 cron
26709 root      20   0  3956  576  484 S    0  0.0   0:02.05 sh
26713 root      20   0  3956  580  484 S    0  0.0   0:02.21 sh
26714 root      20   0  3956  576  484 S    0  0.0   0:02.50 sh
26718 root      20   0 10660 1412 1156 S    0  0.0   0:01.31 bash
26723 root      20   0 10644 1368 1124 S    0  0.0   0:02.76 bash
26725 root      20   0 10624 1364 1144 S    0  0.0   0:01.68 bash
26751 root      20   0  5368  568  480 S    0  0.0   0:00.28 sleep
26779 root      20   0  5368  568  480 S    0  0.0   0:00.00 sleep
26824 root      20   0  5368  568  480 S    0  0.0   0:00.00 sleep
26830 postfix   20   0 39340 2528 2004 S    0  0.0   0:00.01 cleanup
26844 root      20   0  5368  564  480 S    0  0.0   0:00.00 sleep
26856 postfix   20   0 39252 2404 1908 S    0  0.0   0:00.01 trivial-rewrite
26873 root      20   0  5368  564  480 S    0  0.0   0:00.00 sleep
26887 www-data  20   0 72968 9440  504 R    0  0.1   0:00.00 nginx
26890 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26892 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26897 root      20   0  3956  612  492 S    0  0.0   0:00.00 newrelic-daemon
26898 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26899 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26902 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26904 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26907 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26909 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26910 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26911 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26916 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26917 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26920 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26926 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26931 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26933 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26934 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26936 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26937 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26938 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26939 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26940 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26941 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26942 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26943 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26944 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26951 www-data  20   0 72968 9440  504 S    0  0.1   0:00.00 nginx
26952 www-data  20   0 72968 9452  516 S    0  0.1   0:00.00 nginx
26953 root      20   0  3872  500  416 S    0  0.0   0:00.00 sleep
26973 root      20   0 10624  528  304 S    0  0.0   0:00.00 bash
26975 root      20   0 10624  224    0 R    0  0.0   0:00.00 bash
26976 root      20   0  7552  816  704 S    0  0.0   0:00.00 grep
26977 root      20   0 10376  908  768 S    0  0.0   0:00.00 awk
26980 root      20   0 10624  220    0 R    0  0.0   0:00.00 bash
30567 root      20   0  734m 7040 1888 S    0  0.1   0:06.84 php-fpm
39671 root      20   0 70480 3180 2492 S    0  0.0   0:00.02 sshd
39758 chris     20   0 70480 1580  876 S    0  0.0   0:00.13 sshd
39759 chris     20   0 25736 8524 1544 S    0  0.1   0:00.54 bash
39800 root      20   0 24572 1244  992 S    0  0.0   0:00.00 sudo
39801 root      20   0 22104 4920 1568 S    0  0.1   0:00.33 bash
44174 ntp       20   0 38340 2180 1592 S    0  0.0   2:06.61 ntpd
48232 root      20   0 32992 4572 2080 S    0  0.1   0:00.06 vi

====================
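
The ONEX_LOAD and FIVX_LOAD values in these reports look like the 1 and 5 minute load averages multiplied by 100, so that they can be compared as integers in shell. They do not match the uptime line exactly, presumably because they are sampled at a slightly different moment. This is an assumption and has not been confirmed against second.sh in this ticket. A minimal sketch of the idea follows; the function names scale_load and load_state are hypothetical, and the thresholds are supplied by the caller rather than taken from second.sh:

```shell
# Sketch only: scale_load and load_state are hypothetical helper names,
# not functions from second.sh, and no real thresholds are assumed here.

# Convert a load average such as "38.70" into an integer scaled by 100,
# rounding to the nearest whole number.
scale_load() {
    awk -v load="$1" 'BEGIN { printf "%.0f\n", load * 100 }'
}

# Classify a scaled load against two caller-supplied integer thresholds.
# Arguments: scaled load, "high" threshold, "critical" threshold.
# Prints "normal", "high" or "critical".
load_state() {
    scaled=$1
    high=$2
    critical=$3
    if [ "$scaled" -ge "$critical" ]; then
        echo critical
    elif [ "$scaled" -ge "$high" ]; then
        echo high
    else
        echo normal
    fi
}
```

On Linux the two inputs could be read with `read -r one five rest < /proc/loadavg` and passed to scale_load. Scaling to integers avoids floating point comparisons, which plain sh and bash test expressions do not support.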

And another report, captured at 14:16:02:

====================
nginx high load on
ONEX_LOAD = 285
FIVX_LOAD = 412
uptime :
 14:16:02 up 11 days,  5:01,  3 users,  load average: 5.59, 4.67, 4.94
vmstat :
         8175 M total memory
         7423 M used memory
         4675 M active memory
         2067 M inactive memory
          752 M free memory
          702 M buffer memory
         2934 M swap cache
         1023 M total swap
            0 M used swap
         1023 M free swap
     30566473 non-nice user cpu ticks
           15 nice user cpu ticks
     45784546 system cpu ticks
   2539162558 idle cpu ticks
      4720530 IO-wait cpu ticks
          536 IRQ cpu ticks
       648323 softirq cpu ticks
      6420136 stolen cpu ticks
      7048877 pages paged in
    363775744 pages paged out
            0 pages swapped in
            0 pages swapped out
   1032427982 interrupts
    790941773 CPU context switches
   1379837683 boot time
     20744047 forks
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
xvda2 391989   2508 13684618 2080132 17942725 49855397 727551488 955206856      0  29519
xvda1  27090  24552  413136  114684      0      0       0       0      0     19
top :
top - 14:16:04 up 11 days,  5:01,  3 users,  load average: 5.59, 4.67, 4.94
Tasks: 312 total,  29 running, 283 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.6%us,  2.4%sy,  0.0%ni, 92.2%id,  2.2%wa,  0.0%hi,  0.3%si,  0.3%st
Cpu1  :  1.7%us,  2.3%sy,  0.0%ni, 95.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.3%st
Cpu2  :  1.7%us,  2.2%sy,  0.0%ni, 95.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu3  :  1.3%us,  1.9%sy,  0.0%ni, 96.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu4  :  1.2%us,  1.8%sy,  0.0%ni, 96.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu5  :  1.0%us,  1.7%sy,  0.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu6  :  0.9%us,  1.7%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu7  :  0.9%us,  1.6%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu8  :  0.9%us,  1.6%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu9  :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu10 :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu11 :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu12 :  0.8%us,  1.5%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu13 :  0.8%us,  1.5%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:   8372060k total,  7611296k used,   760764k free,   719808k buffers
Swap:  1048568k total,        0k used,  1048568k free,  3004672k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
36093 www-data  20   0  761m  73m  42m R  138  0.9   0:10.58 php-fpm
36100 www-data  20   0  750m  62m  40m R  136  0.8   0:07.99 php-fpm
43007 aegir     20   0  234m  25m 8704 R  136  0.3   0:01.06 drush.php
42850 aegir     20   0  241m  32m 8992 R  135  0.4   0:01.91 drush.php
36082 www-data  20   0  793m 123m  60m R  125  1.5   0:10.47 php-fpm
43200 nobody    20   0 17484 3008 1712 S  125  0.0   0:00.74 irqstats
42894 aegir     20   0  242m  33m 8912 R  116  0.4   0:02.67 drush.php
36083 www-data  20   0  766m  81m  46m R  114  1.0   0:10.25 php-fpm
36096 www-data  20   0  758m  70m  42m R  114  0.9   0:08.66 php-fpm
36102 www-data  20   0  759m  70m  42m R  114  0.9   0:09.12 php-fpm
42801 www-data  20   0  743m  34m  21m R  111  0.4   0:03.28 php-fpm
36090 www-data  20   0  753m  67m  44m R  109  0.8   0:09.08 php-fpm
36103 www-data  20   0  795m 107m  43m R  109  1.3   0:07.24 php-fpm
42922 www-data  20   0  739m  17m 7548 S  109  0.2   0:02.29 php-fpm
36086 www-data  20   0  768m  82m  44m R  108  1.0   0:09.22 php-fpm
36104 www-data  20   0  768m  79m  42m R  106  1.0   0:09.39 php-fpm
42799 www-data  20   0  757m  61m  33m R  106  0.8   0:03.99 php-fpm
36091 www-data  20   0  767m  82m  45m R  104  1.0   0:10.27 php-fpm
36097 www-data  20   0  766m  77m  41m R  103  1.0   0:08.83 php-fpm
43017 tn        20   0  236m  27m 8728 R  103  0.3   0:00.85 php
36088 www-data  20   0  770m  83m  43m R  101  1.0   0:08.75 php-fpm
36101 www-data  20   0  761m  73m  42m R  101  0.9   0:09.71 php-fpm
36106 www-data  20   0  829m 134m  38m R  101  1.6   0:20.20 php-fpm
26795 root      20   0 72984  10m 1764 R   98  0.1   0:09.86 nginx
42800 www-data  20   0  761m  59m  28m R   98  0.7   0:03.44 php-fpm
36092 www-data  20   0  759m  72m  43m R   96  0.9   0:08.10 php-fpm
43016 root      20   0 10644 1364 1124 S   86  0.0   0:00.51 bash
36084 www-data  20   0  769m  85m  46m R   84  1.0   0:08.80 php-fpm
42875 root      20   0 10752 1580 1228 S   82  0.0   0:00.74 bash
43186 root      20   0 19200 1432  912 R   77  0.0   0:00.48 top
43109 root      20   0 16852 1084  868 S   69  0.0   0:00.43 tar
42656 www-data  20   0 73000  10m 1876 S   67  0.1   0:01.95 nginx
36087 www-data  20   0  767m  80m  44m R   66  1.0   0:08.75 php-fpm
43071 root      20   0  3956  616  492 S   64  0.0   0:00.38 newrelic-daemon
43110 root      20   0 13292 7016  448 R   59  0.1   0:00.57 bzip2
43086 root      20   0  3956  612  492 S   47  0.0   0:00.28 newrelic-daemon
42674 www-data  20   0 73000 9472  536 S   25  0.1   0:00.15 nginx
42829 root      20   0 10624 1360 1144 S   13  0.0   0:00.11 bash
42928 root      20   0 10624 1364 1144 S   13  0.0   0:00.43 bash
58401 redis     20   0  475m  87m  928 S   10  1.1   5:42.27 redis-server
 4125 mysql     20   0 2782m 2.0g  10m S    8 25.4 148:23.33 mysqld
43207 www-data  20   0 72984 9444  504 S    5  0.1   0:00.03 nginx
42639 www-data  20   0 73000 9980 1044 S    2  0.1   0:00.20 nginx
42802 root      20   0 41872 7972 1036 S    2  0.1   0:00.04 munin-node
    1 root      20   0  8356  768  636 S    0  0.0   0:56.05 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   1:14.56 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:55.27 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root      RT   0     0    0    0 S    0  0.0   1:18.06 migration/1
    7 root      20   0     0    0    0 S    0  0.0   1:25.46 ksoftirqd/1
    8 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1
    9 root      RT   0     0    0    0 S    0  0.0   1:07.72 migration/2
   10 root      20   0     0    0    0 S    0  0.0   1:19.52 ksoftirqd/2
   11 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/2
   12 root      RT   0     0    0    0 S    0  0.0   1:09.92 migration/3
   13 root      20   0     0    0    0 S    0  0.0   1:45.21 ksoftirqd/3
   14 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/3
   15 root      RT   0     0    0    0 S    0  0.0   1:06.56 migration/4
   16 root      20   0     0    0    0 S    0  0.0   1:38.58 ksoftirqd/4
   17 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/4
   18 root      RT   0     0    0    0 S    0  0.0   1:07.41 migration/5
   19 root      20   0     0    0    0 S    0  0.0   1:47.48 ksoftirqd/5
   20 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/5
   21 root      RT   0     0    0    0 S    0  0.0   1:08.78 migration/6
   22 root      20   0     0    0    0 S    0  0.0   1:37.56 ksoftirqd/6
   23 root      RT   0     0    0    0 S    0  0.0   0:00.02 watchdog/6
   24 root      RT   0     0    0    0 S    0  0.0   1:01.59 migration/7
   25 root      20   0     0    0    0 S    0  0.0   1:41.71 ksoftirqd/7
   26 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/7
   27 root      RT   0     0    0    0 S    0  0.0   1:03.53 migration/8
   28 root      20   0     0    0    0 S    0  0.0   1:40.46 ksoftirqd/8
   29 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/8
   30 root      RT   0     0    0    0 S    0  0.0   1:06.74 migration/9
   31 root      20   0     0    0    0 S    0  0.0   1:42.40 ksoftirqd/9
   32 root      RT   0     0    0    0 S    0  0.0   0:00.01 watchdog/9
   33 root      RT   0     0    0    0 S    0  0.0   1:05.94 migration/10
   34 root      20   0     0    0    0 S    0  0.0   1:53.31 ksoftirqd/10
   35 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/10
   36 root      RT   0     0    0    0 S    0  0.0   1:04.12 migration/11
   37 root      20   0     0    0    0 S    0  0.0   1:41.69 ksoftirqd/11
   38 root      RT   0     0    0    0 S    0  0.0   0:00.02 watchdog/11
   39 root      RT   0     0    0    0 S    0  0.0   1:05.44 migration/12
   40 root      20   0     0    0    0 S    0  0.0   1:28.31 ksoftirqd/12
   41 root      RT   0     0    0    0 S    0  0.0   0:00.03 watchdog/12
   42 root      RT   0     0    0    0 S    0  0.0   1:09.32 migration/13
   43 root      20   0     0    0    0 S    0  0.0   0:20.77 ksoftirqd/13
   44 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/13
   45 root      20   0     0    0    0 S    0  0.0   0:15.31 events/0
   46 root      20   0     0    0    0 S    0  0.0   0:17.32 events/1
   47 root      20   0     0    0    0 S    0  0.0   0:15.74 events/2
   48 root      20   0     0    0    0 S    0  0.0   0:13.05 events/3
   49 root      20   0     0    0    0 S    0  0.0   0:14.60 events/4
   50 root      20   0     0    0    0 S    0  0.0   0:18.16 events/5
   51 root      20   0     0    0    0 S    0  0.0   0:13.86 events/6
   52 root      20   0     0    0    0 S    0  0.0   0:14.10 events/7
   53 root      20   0     0    0    0 S    0  0.0   0:14.18 events/8
   54 root      20   0     0    0    0 S    0  0.0   0:13.44 events/9
   55 root      20   0     0    0    0 S    0  0.0   0:11.04 events/10
   56 root      20   0     0    0    0 S    0  0.0   0:11.51 events/11
   57 root      20   0     0    0    0 S    0  0.0   0:14.53 events/12
   58 root      20   0     0    0    0 S    0  0.0   0:18.82 events/13
   59 root      20   0     0    0    0 S    0  0.0   0:00.00 cpuset
   60 root      20   0     0    0    0 S    0  0.0   0:00.00 khelper
   61 root      20   0     0    0    0 S    0  0.0   0:00.00 netns
   62 root      20   0     0    0    0 S    0  0.0   0:00.00 async/mgr
   63 root      20   0     0    0    0 S    0  0.0   0:00.00 pm
   64 root      20   0     0    0    0 S    0  0.0   0:00.00 xenwatch
   65 root      20   0     0    0    0 S    0  0.0   0:00.00 xenbus
   66 root      20   0     0    0    0 S    0  0.0   0:02.12 sync_supers
   67 root      20   0     0    0    0 S    0  0.0   0:12.14 bdi-default
   68 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/0
   69 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/1
   70 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/2
   71 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/3
   72 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/4
   73 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/5
   74 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/6
   75 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/7
   76 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/8
   77 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/9
   78 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/10
   79 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/11
   80 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/12
   81 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/13
   82 root      20   0     0    0    0 S    0  0.0   0:11.53 kblockd/0
   83 root      20   0     0    0    0 S    0  0.0   0:00.03 kblockd/1
   84 root      20   0     0    0    0 S    0  0.0   0:00.01 kblockd/2
   85 root      20   0     0    0    0 S    0  0.0   0:00.01 kblockd/3
   86 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/4
   87 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/5
   88 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/6
   89 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/7
   90 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/8
   91 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/9
   92 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/10
   93 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/11
   94 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/12
   95 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/13
   96 root      20   0     0    0    0 S    0  0.0   0:00.00 kseriod
  111 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/0
  112 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/1
  113 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/2
  114 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/3
  115 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/4
  116 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/5
  117 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/6
  118 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/7
  119 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/8
  120 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/9
  121 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/10
  122 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/11
  123 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/12
  124 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/13
  125 root      20   0     0    0    0 S    0  0.0   0:00.66 khungtaskd
  126 root      20   0     0    0    0 S    0  0.0   0:04.11 kswapd0
  127 root      25   5     0    0    0 S    0  0.0   0:00.00 ksmd
  128 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/0
  129 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/1
  130 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/2
  131 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/3
  132 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/4
  133 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/5
  134 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/6
  135 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/7
  136 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/8
  137 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/9
  138 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/10
  139 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/11
  140 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/12
  141 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/13
  142 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/0
  143 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/1
  144 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/2
  145 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/3
  146 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/4
  147 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/5
  148 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/6
  149 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/7
  150 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/8
  151 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/9
  152 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/10
  153 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/11
  154 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/12
  155 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/13
  158 root      20   0     0    0    0 S    0  0.0   0:00.00 khvcd
  211 root      20   0     0    0    0 S    0  0.0   0:00.00 kstriped
  220 root      20   0     0    0    0 S    0  0.0   3:12.64 kjournald
  266 root      16  -4 16744  744  372 S    0  0.0   0:00.21 udevd
  304 root      18  -2 16740  724  348 S    0  0.0   0:01.56 udevd
  368 root      20   0     0    0    0 S    0  0.0   0:59.60 flush-202:2
  720 root      20   0  6468  604  480 S    0  0.0  11:32.88 vnstatd
  745 root      20   0     0    0    0 S    0  0.0   0:00.00 kauditd
 3233 root      20   0 19164 1716 1332 S    0  0.0   0:00.25 mysqld_safe
 4126 root      20   0  5352  688  584 S    0  0.0   0:00.02 logger
 4581 root      20   0 49176 1140  584 S    0  0.0   0:05.97 sshd
 4708 root      16  -4 45180  964  612 S    0  0.0   0:00.31 auditd
 4710 root      12  -8 14296  780  648 S    0  0.0   0:00.73 audispd
 4739 root      20   0 22432 1060  796 S    0  0.0   8:21.32 cron
 5897 root      20   0  117m 1676 1068 S    0  0.0   2:46.84 rsyslogd
 5961 daemon    20   0 18716  448  284 S    0  0.0   0:00.01 atd
 5987 pdnsd     20   0  207m 1984  632 S    0  0.0   2:56.35 pdnsd
 6010 messageb  20   0 23268  788  564 S    0  0.0   0:00.01 dbus-daemon
 6631 root      20   0 70480 3184 2492 S    0  0.0   0:00.03 sshd
 6979 chris     20   0 70480 1584  876 S    0  0.0   0:00.53 sshd
 6980 chris     20   0 25736 8528 1544 S    0  0.1   0:00.59 bash
 7542 root      20   0 24572 1244  992 S    0  0.0   0:00.00 sudo
 7543 root      20   0 22156 5040 1636 S    0  0.1   0:01.15 bash
 7621 root      20   0 41872 8756 1820 S    0  0.1   2:25.36 munin-node
 7651 root      20   0  5932  612  516 S    0  0.0   0:00.00 getty
 9389 root      20   0 37176 2384 1868 S    0  0.0   2:09.41 master
 9393 postfix   20   0 39472 2644 1984 S    0  0.0   0:19.12 qmgr
 9394 root      20   0 28712 1736 1224 S    0  0.0   0:02.12 pure-ftpd
13773 root      20   0 56612  17m 1544 S    0  0.2  10:30.02 lfd
14843 postfix   20   0 39240 2420 1912 S    0  0.0   0:00.03 pickup
18931 postfix   20   0 42176 3708 2440 S    0  0.0   0:11.18 tlsmgr
20232 root      18  -2 16740  596  228 S    0  0.0   0:00.00 udevd
26902 www-data  20   0 72968  10m 1892 S    0  0.1   0:00.14 nginx
26942 www-data  20   0 72968  10m 1880 S    0  0.1   0:00.13 nginx
36081 root      20   0  734m 6960 1812 S    0  0.1   0:00.10 php-fpm
39671 root      20   0 70480 3180 2492 S    0  0.0   0:00.02 sshd
39758 chris     20   0 70480 1580  876 S    0  0.0   0:00.13 sshd
39759 chris     20   0 25736 8524 1544 S    0  0.1   0:00.54 bash
39800 root      20   0 24572 1244  992 S    0  0.0   0:00.00 sudo
39801 root      20   0 22104 4920 1568 S    0  0.1   0:00.33 bash
40728 root      20   0 70480 3188 2500 S    0  0.0   0:00.02 sshd
40730 jim       20   0 70480 1568  876 S    0  0.0   0:00.05 sshd
40733 jim       20   0 25652 8444 1544 S    0  0.1   0:00.54 bash
40739 postfix   20   0 39340 2524 2004 S    0  0.0   0:00.01 cleanup
40743 postfix   20   0 39252 2404 1908 S    0  0.0   0:00.00 trivial-rewrite
40744 postfix   20   0 43808 3596 2828 S    0  0.0   0:00.07 smtp
42532 root      20   0 24572 1240  992 S    0  0.0   0:00.00 sudo
42533 root      20   0 22104 4896 1544 S    0  0.1   0:00.27 bash
42571 tn.ftp    20   0 36888 1236  964 S    0  0.0   0:00.00 su
42572 tn.ftp    20   0 19232 1980 1508 S    0  0.0   0:00.04 bash
42641 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42647 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42648 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42649 www-data  20   0 73000  10m 1848 S    0  0.1   0:00.02 nginx
42650 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42651 www-data  20   0 73000 9872  936 S    0  0.1   0:01.08 nginx
42652 www-data  20   0 73000 9440  504 R    0  0.1   0:00.00 nginx
42653 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42654 www-data  20   0 73000 9440  504 S    0  0.1   0:00.02 nginx
42655 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42658 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42659 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42660 www-data  20   0 73000 9860  924 S    0  0.1   0:00.23 nginx
42661 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42662 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42663 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42664 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42665 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42666 www-data  20   0 73000 9728  792 S    0  0.1   0:00.00 nginx
42667 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42668 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42669 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42670 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42671 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42672 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42673 www-data  20   0 73000 9440  504 S    0  0.1   0:00.00 nginx
42804 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42805 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42808 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42809 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42810 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42814 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
42816 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
42817 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
42820 root      20   0 10612 1356 1148 S    0  0.0   0:00.10 bash
42821 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
42827 root      20   0 10660 1416 1156 S    0  0.0   0:00.01 bash
42836 root      20   0 10644 1360 1124 S    0  0.0   0:00.24 bash
42840 aegir     20   0 10588 1300 1112 S    0  0.0   0:00.00 bash
42852 root      20   0 18320 2300 1624 S    0  0.0   0:00.12 perl
42859 root      20   0 32812 1080  776 S    0  0.0   0:00.01 cron
42863 root      20   0 32812 1080  776 S    0  0.0   0:00.03 cron
42864 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42865 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42867 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42868 root      20   0 32812 1080  776 S    0  0.0   0:00.01 cron
42869 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42870 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
42872 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
42874 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
42876 root      20   0 10644 1368 1124 S    0  0.0   0:00.24 bash
42879 root      20   0 10612 1356 1148 S    0  0.0   0:00.55 bash
42887 aegir     20   0 10588 1300 1112 S    0  0.0   0:00.00 bash
42900 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
42901 root      20   0 10664 1412 1148 S    0  0.0   0:00.01 bash
42913 root      20   0  3956  584  488 S    0  0.0   0:00.00 sh
42914 root      20   0 10836 1624 1192 S    0  0.0   0:00.36 metche
42918 root      20   0  3956  576  484 S    0  0.0   0:00.14 sh
42942 root      20   0  5368  568  480 S    0  0.0   0:00.17 sleep
42943 root      20   0  5368  564  480 S    0  0.0   0:00.20 sleep
42945 root      20   0  5368  564  480 S    0  0.0   0:00.00 sleep
42947 root      20   0 10612 1348 1136 S    0  0.0   0:00.00 bash
42949 root      20   0 32812 1080  776 S    0  0.0   0:00.02 cron
42950 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42955 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42956 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
42961 tn        20   0 36888 1232  968 S    0  0.0   0:00.01 su
42962 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
42963 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
42967 root      20   0 10624 1368 1144 S    0  0.0   0:00.00 bash
42973 root      20   0 10660 1416 1156 S    0  0.0   0:00.01 bash
42985 tn        20   0 10592 1304 1112 S    0  0.0   0:00.00 bash
42989 aegir     20   0 10588 1296 1112 S    0  0.0   0:00.00 bash
43003 root      20   0 18320 2304 1624 S    0  0.0   0:00.02 perl
43005 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
44174 ntp       20   0 38340 2180 1592 S    0  0.0   2:06.74 ntpd
48232 root      20   0 32992 4572 2080 S    0  0.1   0:00.06 vi

====================
}}}
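
A rough way to see which commands dominate a dump like the one above is to total the %CPU column per command. This is only a sketch: `cpu_by_command` is a hypothetical helper, not part of BOA or second.sh, and it assumes the 12-column `top` batch layout shown in these reports (PID first, %CPU ninth, COMMAND last). The sample rows are copied from the report above.

```python
from collections import defaultdict


def cpu_by_command(top_lines):
    """Sum the %CPU column per COMMAND for `top` batch-mode process rows."""
    totals = defaultdict(float)
    for line in top_lines:
        fields = line.split()
        # Process rows have 12 columns:
        # PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        # Header, vmstat and per-CPU lines fail this check and are skipped.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        totals[fields[11]] += float(fields[8])
    return dict(totals)


# A few rows taken from the report above, used here only as sample input.
sample = """\
36097 www-data  20   0  766m  77m  41m R  103  1.0   0:08.83 php-fpm
36088 www-data  20   0  770m  83m  43m R  101  1.0   0:08.75 php-fpm
26795 root      20   0 72984  10m 1764 R   98  0.1   0:09.86 nginx
 4125 mysql     20   0 2782m 2.0g  10m S    8 25.4 148:23.33 mysqld
""".splitlines()

# Print commands in descending order of total %CPU.
for command, cpu in sorted(cpu_by_command(sample).items(), key=lambda kv: -kv[1]):
    print(f"{command:12s} {cpu:6.1f}")
```

Fed a whole report, this should make it easier to compare the php-fpm total against the backup-related processes (tar, bzip2) that show up in the same samples.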

And a later report, from 14:33:07:

{{{
====================
nginx high load on
ONEX_LOAD = 881
FIVX_LOAD = 407
uptime :
 14:33:07 up 11 days,  5:19,  3 users,  load average: 70.52, 32.99, 14.82
vmstat :
         8175 M total memory
         7501 M used memory
         4731 M active memory
         2087 M inactive memory
          673 M free memory
          703 M buffer memory
         2969 M swap cache
         1023 M total swap
            0 M used swap
         1023 M free swap
     30617854 non-nice user cpu ticks
           15 nice user cpu ticks
     45889550 system cpu ticks
   2541317540 idle cpu ticks
      4726769 IO-wait cpu ticks
          537 IRQ cpu ticks
       649925 softirq cpu ticks
      6576532 stolen cpu ticks
      7056193 pages paged in
    364190220 pages paged out
            0 pages swapped in
            0 pages swapped out
   1033966482 interrupts
    792110970 CPU context switches
   1379837621 boot time
     20775019 forks
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
xvda2 392300   2511 13699250 2087468 17964640 49896940 728380448 956223556      0  29584
xvda1  27090  24552  413136  114684      0      0       0       0      0     19
top :
top - 14:33:08 up 11 days,  5:19,  3 users,  load average: 70.52, 32.99, 14.82
Tasks: 358 total,  36 running, 322 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.6%us,  2.4%sy,  0.0%ni, 92.2%id,  2.2%wa,  0.0%hi,  0.3%si,  0.3%st
Cpu1  :  1.7%us,  2.3%sy,  0.0%ni, 95.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.3%st
Cpu2  :  1.7%us,  2.2%sy,  0.0%ni, 95.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.3%st
Cpu3  :  1.3%us,  1.9%sy,  0.0%ni, 96.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.3%st
Cpu4  :  1.2%us,  1.8%sy,  0.0%ni, 96.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu5  :  1.0%us,  1.7%sy,  0.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu6  :  0.9%us,  1.7%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu7  :  0.9%us,  1.6%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu8  :  0.9%us,  1.6%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu9  :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu10 :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu11 :  0.8%us,  1.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu12 :  0.8%us,  1.5%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu13 :  0.8%us,  1.5%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:   8372060k total,  7750224k used,   621836k free,   720804k buffers
Swap:  1048568k total,        0k used,  1048568k free,  3042316k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8750 root      20   0 13292 7020  448 R  313  0.1   0:15.48 bzip2
 3796 www-data  20   0  777m  91m  45m R  109  1.1   0:57.95 php-fpm
 8765 root      20   0     0    0    0 R  106  0.0   0:12.03 lfd
 9033 root      20   0     0    0    0 R   65  0.0   0:00.42 awk
 8608 www-data  20   0  742m  31m  18m R   61  0.4   0:35.34 php-fpm
 9032 root      20   0     0    0    0 R   61  0.0   0:00.39 grep
 3795 www-data  20   0  767m  79m  41m R   59  1.0   0:57.99 php-fpm
 3801 www-data  20   0  784m  97m  43m R   53  1.2   0:49.55 php-fpm
 8584 www-data  20   0  742m  29m  17m R   53  0.4   0:38.34 php-fpm
 3793 www-data  20   0  769m  82m  44m R   51  1.0   0:56.23 php-fpm
 8617 www-data  20   0  740m  21m  10m R   51  0.3   0:31.79 php-fpm
 3789 www-data  20   0  791m 122m  61m R   50  1.5   0:18.79 php-fpm
 3791 www-data  20   0  782m  97m  45m R   50  1.2   0:38.37 php-fpm
 3802 www-data  20   0  790m 106m  46m R   50  1.3   0:36.65 php-fpm
 3794 www-data  20   0  777m  91m  45m R   48  1.1   0:50.90 php-fpm
 8972 root      20   0 13292 7016  448 R   47  0.1   0:00.37 bzip2
 3798 www-data  20   0  767m  79m  42m R   44  1.0   0:48.10 php-fpm
26795 root      20   0 72984  10m 1764 S   44  0.1   0:10.66 nginx
 9098 root      20   0     8    4    0 R   42  0.0   0:00.27 bash
 8621 www-data  20   0  740m  23m  12m R   41  0.3   0:35.84 php-fpm
 9076 root      20   0     8    4    0 R   41  0.0   0:00.26 sh
 9096 root      20   0 29420 1732 1280 R   41  0.0   0:00.26 mysqladmin
 9099 root      20   0     8    4    0 R   41  0.0   0:00.26 bash
 3800 www-data  20   0  772m  88m  47m R   36  1.1   0:51.04 php-fpm
 9074 postfix   20   0 39500 3084 2336 S   34  0.0   0:00.22 local
 3792 www-data  20   0  760m  74m  44m R   31  0.9   0:42.55 php-fpm
 3799 www-data  20   0  760m  73m  43m R   31  0.9   0:48.37 php-fpm
 3806 www-data  20   0  824m 140m  48m R   31  1.7   0:44.02 php-fpm
 9090 root      20   0 10616  520  304 S   30  0.0   0:00.19 bash
 9087 www-data  20   0 72984 9448  504 S   28  0.1   0:00.18 nginx
 9095 www-data  20   0 72984 9448  504 S   26  0.1   0:00.17 nginx
 9092 www-data  20   0 72984 9448  504 S   25  0.1   0:00.16 nginx
 9094 www-data  20   0 72984 9448  504 S   25  0.1   0:00.16 nginx
 9089 root      20   0 10616  520  304 S   22  0.0   0:00.14 bash
 3805 www-data  20   0  788m 100m  43m R   20  1.2   0:39.67 php-fpm
 9034 root      20   0 19200 1444  912 R   20  0.0   0:00.14 top
 3804 www-data  20   0  775m  87m  42m R   19  1.1   0:43.48 php-fpm
 8555 root      20   0 10752 1580 1228 S   17  0.0   0:04.90 bash
 9102 www-data  20   0 72984 9448  504 S   16  0.1   0:00.10 nginx
 4125 mysql     20   0 2782m 2.0g  10m S   11 25.4 149:15.28 mysqld
54352 www-data  20   0 73016  10m 1896 S   11  0.1   0:07.75 nginx
 9097 root      20   0 10376  908  768 S    9  0.0   0:00.06 awk
 9103 www-data  20   0 72984 9448  504 S    9  0.1   0:00.06 nginx
 9105 www-data  20   0 72984 9448  504 S    8  0.1   0:00.05 nginx
 9106 www-data  20   0 72984 9448  504 S    6  0.1   0:00.04 nginx
54336 www-data  20   0 73016  10m 1888 S    6  0.1   0:00.51 nginx
 9108 root      20   0 10376  912  768 S    5  0.0   0:00.03 awk
54350 www-data  20   0 73016  10m 1876 S    5  0.1   0:00.26 nginx
 9107 root      20   0  7552  816  704 S    3  0.0   0:00.02 grep
58401 redis     20   0  475m  96m  928 R    3  1.2   5:56.65 redis-server
 3790 www-data  20   0  784m  98m  45m R    2  1.2   1:06.53 php-fpm
 8435 root      20   0 10752 1584 1228 R    2  0.0   0:00.51 bash
 8439 root      20   0 10624 1368 1144 R    2  0.0   0:00.01 bash
 9114 root      20   0 39852 2808 2136 R    2  0.0   0:00.01 mysqladmin
54326 www-data  20   0 73016  10m 1920 S    2  0.1   0:00.32 nginx
54335 www-data  20   0 73016  10m 1888 S    2  0.1   0:00.89 nginx
    1 root      20   0  8356  768  636 S    0  0.0   0:56.06 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   1:14.81 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:55.34 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root      RT   0     0    0    0 S    0  0.0   1:18.11 migration/1
    7 root      20   0     0    0    0 S    0  0.0   1:25.48 ksoftirqd/1
    8 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1
    9 root      RT   0     0    0    0 S    0  0.0   1:07.79 migration/2
   10 root      20   0     0    0    0 S    0  0.0   1:19.54 ksoftirqd/2
   11 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/2
   12 root      RT   0     0    0    0 S    0  0.0   1:10.14 migration/3
   13 root      20   0     0    0    0 S    0  0.0   1:45.23 ksoftirqd/3
   14 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/3
   15 root      RT   0     0    0    0 S    0  0.0   1:06.61 migration/4
   16 root      20   0     0    0    0 S    0  0.0   1:38.65 ksoftirqd/4
   17 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/4
   18 root      RT   0     0    0    0 S    0  0.0   1:08.00 migration/5
   19 root      20   0     0    0    0 S    0  0.0   1:47.51 ksoftirqd/5
   20 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/5
   21 root      RT   0     0    0    0 S    0  0.0   1:09.04 migration/6
   22 root      20   0     0    0    0 S    0  0.0   1:37.58 ksoftirqd/6
   23 root      RT   0     0    0    0 S    0  0.0   0:00.02 watchdog/6
   24 root      RT   0     0    0    0 S    0  0.0   1:01.77 migration/7
   25 root      20   0     0    0    0 S    0  0.0   1:41.73 ksoftirqd/7
   26 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/7
   27 root      RT   0     0    0    0 S    0  0.0   1:03.64 migration/8
   28 root      20   0     0    0    0 S    0  0.0   1:40.48 ksoftirqd/8
   29 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/8
   30 root      RT   0     0    0    0 S    0  0.0   1:07.02 migration/9
   31 root      20   0     0    0    0 S    0  0.0   1:42.41 ksoftirqd/9
   32 root      RT   0     0    0    0 S    0  0.0   0:00.01 watchdog/9
   33 root      RT   0     0    0    0 S    0  0.0   1:06.05 migration/10
   34 root      20   0     0    0    0 S    0  0.0   1:53.33 ksoftirqd/10
   35 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/10
   36 root      RT   0     0    0    0 S    0  0.0   1:04.41 migration/11
   37 root      20   0     0    0    0 S    0  0.0   1:41.75 ksoftirqd/11
   38 root      RT   0     0    0    0 S    0  0.0   0:00.02 watchdog/11
   39 root      RT   0     0    0    0 S    0  0.0   1:05.66 migration/12
   40 root      20   0     0    0    0 S    0  0.0   1:28.38 ksoftirqd/12
   41 root      RT   0     0    0    0 S    0  0.0   0:00.03 watchdog/12
   42 root      RT   0     0    0    0 S    0  0.0   1:09.50 migration/13
   43 root      20   0     0    0    0 S    0  0.0   0:20.79 ksoftirqd/13
   44 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/13
   45 root      20   0     0    0    0 S    0  0.0   0:15.34 events/0
   46 root      20   0     0    0    0 S    0  0.0   0:17.34 events/1
   47 root      20   0     0    0    0 S    0  0.0   0:15.75 events/2
   48 root      20   0     0    0    0 S    0  0.0   0:13.07 events/3
   49 root      20   0     0    0    0 S    0  0.0   0:14.61 events/4
   50 root      20   0     0    0    0 S    0  0.0   0:18.18 events/5
   51 root      20   0     0    0    0 S    0  0.0   0:13.87 events/6
   52 root      20   0     0    0    0 S    0  0.0   0:14.11 events/7
   53 root      20   0     0    0    0 S    0  0.0   0:14.22 events/8
   54 root      20   0     0    0    0 S    0  0.0   0:13.45 events/9
   55 root      20   0     0    0    0 S    0  0.0   0:11.05 events/10
   56 root      20   0     0    0    0 S    0  0.0   0:11.52 events/11
   57 root      20   0     0    0    0 S    0  0.0   0:14.54 events/12
   58 root      20   0     0    0    0 S    0  0.0   0:18.84 events/13
   59 root      20   0     0    0    0 S    0  0.0   0:00.00 cpuset
   60 root      20   0     0    0    0 S    0  0.0   0:00.00 khelper
   61 root      20   0     0    0    0 S    0  0.0   0:00.00 netns
   62 root      20   0     0    0    0 S    0  0.0   0:00.00 async/mgr
   63 root      20   0     0    0    0 S    0  0.0   0:00.00 pm
   64 root      20   0     0    0    0 S    0  0.0   0:00.00 xenwatch
   65 root      20   0     0    0    0 S    0  0.0   0:00.00 xenbus
   66 root      20   0     0    0    0 S    0  0.0   0:02.12 sync_supers
   67 root      20   0     0    0    0 S    0  0.0   0:12.15 bdi-default
   68 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/0
   69 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/1
   70 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/2
   71 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/3
   72 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/4
   73 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/5
   74 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/6
   75 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/7
   76 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/8
   77 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/9
   78 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/10
   79 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/11
   80 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/12
   81 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/13
   82 root      20   0     0    0    0 S    0  0.0   0:11.55 kblockd/0
   83 root      20   0     0    0    0 S    0  0.0   0:00.03 kblockd/1
   84 root      20   0     0    0    0 S    0  0.0   0:00.01 kblockd/2
   85 root      20   0     0    0    0 S    0  0.0   0:00.01 kblockd/3
   86 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/4
   87 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/5
   88 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/6
   89 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/7
   90 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/8
   91 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/9
   92 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/10
   93 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/11
   94 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/12
   95 root      20   0     0    0    0 S    0  0.0   0:00.00 kblockd/13
   96 root      20   0     0    0    0 S    0  0.0   0:00.00 kseriod
  111 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/0
  112 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/1
  113 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/2
  114 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/3
  115 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/4
  116 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/5
  117 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/6
  118 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/7
  119 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/8
  120 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/9
  121 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/10
  122 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/11
  123 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/12
  124 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/13
  125 root      20   0     0    0    0 S    0  0.0   0:00.67 khungtaskd
  126 root      20   0     0    0    0 S    0  0.0   0:04.11 kswapd0
  127 root      25   5     0    0    0 S    0  0.0   0:00.00 ksmd
  128 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/0
  129 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/1
  130 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/2
  131 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/3
  132 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/4
  133 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/5
  134 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/6
  135 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/7
  136 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/8
  137 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/9
  138 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/10
  139 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/11
  140 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/12
  141 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/13
  142 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/0
  143 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/1
  144 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/2
  145 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/3
  146 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/4
  147 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/5
  148 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/6
  149 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/7
  150 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/8
  151 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/9
  152 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/10
  153 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/11
  154 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/12
  155 root      20   0     0    0    0 S    0  0.0   0:00.00 crypto/13
  158 root      20   0     0    0    0 S    0  0.0   0:00.00 khvcd
  211 root      20   0     0    0    0 S    0  0.0   0:00.00 kstriped
  220 root      20   0     0    0    0 S    0  0.0   3:15.24 kjournald
  266 root      16  -4 16744  744  372 S    0  0.0   0:00.21 udevd
  304 root      18  -2 16740  724  348 S    0  0.0   0:01.56 udevd
  368 root      20   0     0    0    0 S    0  0.0   0:59.78 flush-202:2
  720 root      20   0  6468  604  480 S    0  0.0  11:48.57 vnstatd
  745 root      20   0     0    0    0 S    0  0.0   0:00.00 kauditd
 2933 root      20   0 24572 1240  992 S    0  0.0   0:00.00 sudo
 2934 root      20   0 22120 4964 1592 S    0  0.1   0:00.30 bash
 3233 root      20   0 19164 1716 1332 S    0  0.0   0:00.25 mysqld_safe
 3788 root      20   0  734m 7024 1876 S    0  0.1   0:13.09 php-fpm
 3797 www-data  20   0  775m  89m  45m S    0  1.1   0:51.40 php-fpm
 3803 www-data  20   0  770m  78m  39m R    0  1.0   0:50.09 php-fpm
 4126 root      20   0  5352  688  584 S    0  0.0   0:00.04 logger
 4240 tn        20   0 36888 1236  964 S    0  0.0   0:00.00 su
 4242 tn        20   0 19312 2036 1480 S    0  0.0   0:00.01 bash
 4581 root      20   0 49176 1140  584 S    0  0.0   0:05.97 sshd
 4708 root      16  -4 45180  964  612 S    0  0.0   0:00.31 auditd
 4710 root      12  -8 14296  780  648 S    0  0.0   0:00.73 audispd
 4739 root      20   0 22432 1060  796 S    0  0.0   8:25.19 cron
 5897 root      20   0  117m 1676 1068 S    0  0.0   2:47.52 rsyslogd
 5961 daemon    20   0 18716  448  284 S    0  0.0   0:00.01 atd
 5987 pdnsd     20   0  207m 1984  632 S    0  0.0   2:56.77 pdnsd
 6010 messageb  20   0 23268  788  564 S    0  0.0   0:00.01 dbus-daemon
 6631 root      20   0 70480 3184 2492 S    0  0.0   0:00.03 sshd
 6979 chris     20   0 70480 1584  876 S    0  0.0   0:00.54 sshd
 6980 chris     20   0 25736 8528 1544 S    0  0.1   0:00.59 bash
 7542 root      20   0 24572 1244  992 S    0  0.0   0:00.00 sudo
 7543 root      20   0 22156 5040 1636 S    0  0.1   0:01.79 bash
 7621 root      20   0 41872 8756 1820 S    0  0.1   2:25.52 munin-node
 7651 root      20   0  5932  612  516 S    0  0.0   0:00.00 getty
 8414 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
 8416 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
 8419 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
 8421 root      20   0 32812 1080  776 S    0  0.0   0:00.00 cron
 8424 root      20   0  3956  584  488 S    0  0.0   0:00.00 sh
 8426 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
 8427 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
 8430 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
 8436 root      20   0 10836 1628 1192 S    0  0.0   0:00.48 metche
 8437 root      20   0 10660 1408 1156 S    0  0.0   0:00.00 bash
 8477 root      20   0 32812 1080  776 S    0  0.0   0:06.13 cron
 8478 root      20   0 32812 1080  776 S    0  0.0   0:01.86 cron
 8498 root      20   0  3956  576  484 S    0  0.0   0:03.82 sh
 8503 root      20   0 32812 1080  776 S    0  0.0   0:05.74 cron
 8505 root      20   0 32812 1080  776 S    0  0.0   0:06.91 cron
 8506 root      20   0  3956  580  484 S    0  0.0   0:03.58 sh
 8511 root      20   0 32812 1080  776 S    0  0.0   0:07.10 cron
 8516 root      20   0 32812 1080  776 S    0  0.0   0:05.77 cron
 8520 root      20   0 10672 1436 1168 S    0  0.0   0:03.23 bash
 8535 root      20   0 10616 1352 1136 S    0  0.0   0:02.62 bash
 8538 root      20   0  3956  580  484 S    0  0.0   0:03.18 sh
 8542 root      20   0  3956  580  488 S    0  0.0   0:02.83 sh
 8544 root      20   0  3956  580  484 S    0  0.0   0:03.30 sh
 8553 root      20   0 10836 1628 1192 S    0  0.0   0:02.57 metche
 8557 root      20   0  3956  576  484 S    0  0.0   0:03.26 sh
 8560 root      20   0 10660 1416 1156 S    0  0.0   0:04.09 bash
 8577 root      20   0 10624 1364 1144 S    0  0.0   0:03.10 bash
 8630 root      20   0 32812 1080  776 S    0  0.0   0:06.31 cron
 8636 root      20   0 32812 1080  776 S    0  0.0   0:05.83 cron
 8637 root      20   0 32812 1080  776 S    0  0.0   0:06.24 cron
 8669 root      20   0  3956  576  484 S    0  0.0   0:03.32 sh
 8671 root      20   0  3956  576  484 S    0  0.0   0:03.34 sh
 8673 root      20   0  3956  580  484 S    0  0.0   0:02.98 sh
 8690 root      20   0 10660 1416 1156 S    0  0.0   0:03.71 bash
 8692 root      20   0 10616 1348 1136 S    0  0.0   0:02.77 bash
 8693 root      20   0 10644 1368 1124 S    0  0.0   0:04.23 bash
 8723 root      20   0 32812 1080  776 S    0  0.0   0:05.65 cron
 8730 root      20   0 32812 1080  776 S    0  0.0   0:06.05 cron
 8736 root      20   0 16852 1080  868 S    0  0.0   0:04.10 tar
 8743 root      20   0 32812 1080  776 S    0  0.0   0:06.01 cron
 8746 root      20   0 14820 1036  852 D    0  0.0   0:04.12 ps
 8749 root      20   0  6028  668  548 S    0  0.0   0:01.04 grep
 8751 root      20   0  8172  692  580 S    0  0.0   0:02.43 sed
 8752 root      20   0  8852  784  652 S    0  0.0   0:02.86 awk
 8759 root      20   0  3956  576  484 S    0  0.0   0:02.64 sh
 8762 root      20   0  3956  576  484 S    0  0.0   0:02.69 sh
 8781 root      20   0 10624 1368 1144 S    0  0.0   0:02.16 bash
 8785 root      20   0 10644 1364 1124 S    0  0.0   0:02.59 bash
 8791 root      20   0  3956  580  484 S    0  0.0   0:03.46 sh
 8810 root      20   0     8    4    0 R    0  0.0   0:08.50 munin-node
 8815 root      20   0 10660 1416 1156 S    0  0.0   0:02.41 bash
 8826 root      20   0 32812 1080  776 S    0  0.0   0:01.36 cron
 8832 root      20   0 32812 1080  776 S    0  0.0   0:00.57 cron
 8833 root      20   0 32812 1080  776 S    0  0.0   0:00.74 cron
 8834 root      20   0  5368  560  480 S    0  0.0   0:00.37 sleep
 8836 root      20   0  5368  564  480 S    0  0.0   0:00.14 sleep
 8857 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
 8868 root      20   0 10616 1348 1136 S    0  0.0   0:00.01 bash
 8871 root      20   0 18320 2120 1528 S    0  0.0   0:00.01 perl
 8876 root      20   0  3956  576  484 S    0  0.0   0:00.00 sh
 8887 root      20   0 10644 1368 1124 S    0  0.0   0:00.00 bash
 8894 root      20   0 18320 2120 1528 S    0  0.0   0:00.01 perl
 8913 postfix   20   0 39340 2528 2004 S    0  0.0   0:00.01 cleanup
 8934 root      20   0 18320 2120 1528 S    0  0.0   0:00.03 perl
 8936 postfix   20   0 39252 2400 1908 S    0  0.0   0:00.00 trivial-rewrite
 8938 root      20   0  3956  580  484 S    0  0.0   0:00.00 sh
 8945 root      20   0 16456 1172  852 D    0  0.0   0:00.02 ps
 8946 root      20   0 10660 1412 1156 S    0  0.0   0:00.00 bash
 8951 root      20   0 16456 1176  852 D    0  0.0   0:00.02 ps
 8963 root      20   0 16852 1080  868 S    0  0.0   0:00.02 tar
 8969 postfix   20   0 39500 3084 2336 S    0  0.0   0:00.00 local
 8977 root      20   0 16456 1168  852 D    0  0.0   0:00.02 ps
 8997 postfix   20   0 39500 3084 2336 S    0  0.0   0:00.01 local
 9008 root      20   0  5368  568  480 S    0  0.0   0:00.00 sleep
 9020 root      20   0 10616  524  304 S    0  0.0   0:00.00 bash
 9046 root      20   0 18320 2100 1508 S    0  0.0   0:00.00 perl
 9068 root      20   0  3956  560  468 S    0  0.0   0:00.00 sh
 9077 root      20   0  7552  816  704 S    0  0.0   0:00.00 grep
 9109 www-data  20   0 72984 9448  504 S    0  0.1   0:00.00 nginx
 9111 www-data  20   0 72984 9448  504 S    0  0.1   0:00.00 nginx
 9112 root      20   0 10624  528  304 S    0  0.0   0:00.00 bash
 9113 www-data  20   0 72984 9448  504 S    0  0.1   0:00.00 nginx
 9115 root      20   0  7552  816  704 S    0  0.0   0:00.00 grep
 9116 root      20   0 10376  912  768 S    0  0.0   0:00.00 awk
 9117 root      20   0 10376  912  768 S    0  0.0   0:00.00 awk
 9118 root      20   0  7552  820  704 S    0  0.0   0:00.00 grep
 9119 root      20   0 10376  908  768 S    0  0.0   0:00.00 awk
 9120 www-data  20   0 72984 9448  504 S    0  0.1   0:00.00 nginx
 9121 www-data  20   0 72984 9448  504 S    0  0.1   0:00.00 nginx
 9389 root      20   0 37176 2384 1868 S    0  0.0   2:09.44 master
 9393 postfix   20   0 39472 2644 1984 S    0  0.0   0:19.14 qmgr
 9394 root      20   0 28712 1736 1224 S    0  0.0   0:02.12 pure-ftpd
13773 root      20   0 56612  17m 1544 S    0  0.2  10:44.66 lfd
18931 postfix   20   0 42176 3708 2440 S    0  0.0   0:11.18 tlsmgr
20232 root      18  -2 16740  596  228 S    0  0.0   0:00.00 udevd
39671 root      20   0 70480 3180 2492 S    0  0.0   0:00.02 sshd
39758 chris     20   0 70480 1580  876 S    0  0.0   0:00.13 sshd
39759 chris     20   0 25736 8524 1544 S    0  0.1   0:00.54 bash
39800 root      20   0 24572 1244  992 S    0  0.0   0:00.00 sudo
39801 root      20   0 22104 4920 1568 S    0  0.1   0:00.33 bash
40728 root      20   0 70480 3188 2500 S    0  0.0   0:00.02 sshd
40730 jim       20   0 70480 1592  876 S    0  0.0   0:00.23 sshd
40733 jim       20   0 25652 8452 1552 S    0  0.1   0:01.94 bash
44174 ntp       20   0 38340 2180 1592 S    0  0.0   2:07.16 ntpd
54321 www-data  20   0 73016  10m 1900 S    0  0.1   0:01.90 nginx
54322 www-data  20   0 73016  10m 1904 S    0  0.1   0:00.55 nginx
54324 www-data  20   0 73016  10m 1908 S    0  0.1   0:00.87 nginx
54328 www-data  20   0 73016  10m 1888 S    0  0.1   0:00.17 nginx
54334 www-data  20   0 73016  10m 1892 S    0  0.1   0:03.08 nginx
54337 www-data  20   0 73016  10m 1912 S    0  0.1   0:00.92 nginx
54338 www-data  20   0 73016  10m 1908 S    0  0.1   0:01.37 nginx
54339 www-data  20   0 73016  10m 1888 S    0  0.1   0:00.49 nginx
54340 www-data  20   0 73016  10m 1920 S    0  0.1   0:00.76 nginx
54341 www-data  20   0 73016  10m 1900 S    0  0.1   0:03.03 nginx
54342 www-data  20   0 73016  10m 1888 S    0  0.1   0:00.51 nginx
54343 www-data  20   0 73016  10m 1908 S    0  0.1   0:00.40 nginx
54345 www-data  20   0 73016  10m 1876 S    0  0.1   0:00.27 nginx
54346 www-data  20   0 73016  10m 1880 S    0  0.1   0:11.44 nginx
54347 www-data  20   0 73016  10m 1924 S    0  0.1   0:08.56 nginx
54348 www-data  20   0 73016  10m 1908 S    0  0.1   0:10.32 nginx
54349 www-data  20   0 73016 9448  504 S    0  0.1   0:00.07 nginx
54351 www-data  20   0 73016  10m 1904 S    0  0.1   0:10.71 nginx
54353 www-data  20   0 73016  10m 1912 S    0  0.1   0:03.77 nginx
54354 www-data  20   0 73016  10m 1932 S    0  0.1   0:02.16 nginx
54355 www-data  20   0 73016  10m 1884 S    0  0.1   0:00.61 nginx
54356 www-data  20   0 73016  10m 1896 S    0  0.1   0:00.63 nginx
54357 www-data  20   0 73016  10m 1920 S    0  0.1   0:00.45 nginx
54358 www-data  20   0 73016 9504  560 S    0  0.1   0:06.65 nginx
62065 postfix   20   0 39240 2416 1912 S    0  0.0   0:00.01 pickup

====================                 
}}}

comment:98 follow-up: ↓ 99 Changed 3 years ago by jim

Odd. What's bzip2 up to??

comment:99 in reply to: ↑ 98 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.2
Total Hours changed from 40.28 to 40.48

Replying to jim:

Odd. What's bzip2 up to??

I'd guess that it's apt-get related:

36910 root      20   0 13292 7024  448 R   99  0.1   0:02.11 bzip2
36991 root      20   0 38728  19m  16m R   52  0.2   0:00.27 apt-get

Last night's New Relic reinstall (ticket:586#comment:29) clobbered the changes to /var/xdrago/second.sh, so I have re-added the following:

nginx_high_load_on()
{
  mv -f /data/conf/nginx_high_load_off.conf /data/conf/nginx_high_load.conf
  /etc/init.d/nginx reload

  # start additions
  echo "====================" >> /var/log/high-load.log
  echo "nginx high load on" >> /var/log/high-load.log
  echo "ONEX_LOAD = $ONEX_LOAD" >> /var/log/high-load.log
  echo "FIVX_LOAD = $FIVX_LOAD" >> /var/log/high-load.log
  echo "uptime : " >> /var/log/high-load.log
  uptime >> /var/log/high-load.log
  echo "vmstat : " >> /var/log/high-load.log
  vmstat -S M -s >> /var/log/high-load.log
  vmstat -S M -d >> /var/log/high-load.log
  echo "top : " >> /var/log/high-load.log
  top -n 1 -b >> /var/log/high-load.log
  echo "====================" >> /var/log/high-load.log
  # end additions

}

And manually rotated the logs:

cd /var/log
mv high-load.log.1 high-load.log.2
mv high-load.log high-load.log.1
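
If this rotation needs doing again, the two mv commands could be wrapped in a small helper. This is only a sketch, assuming the .1/.2 naming used above; rotate_log is a hypothetical name and is not part of BOA or second.sh:

```shell
#!/bin/sh
# Hypothetical helper (not part of BOA): shift LOG.1 -> LOG.2 and so on,
# then LOG -> LOG.1, keeping at most KEEP old generations (default 2).
rotate_log() {
  log="$1"
  keep="${2:-2}"
  i="$keep"
  # Move the oldest generations first so nothing is overwritten too early
  while [ "$i" -gt 1 ]; do
    prev=$((i - 1))
    if [ -f "$log.$prev" ]; then
      mv -f "$log.$prev" "$log.$i"
    fi
    i="$prev"
  done
  # Finally move the live log out of the way
  if [ -f "$log" ]; then
    mv -f "$log" "$log.1"
  fi
}
```

Called as rotate_log /var/log/high-load.log 2 this should do the same as the manual commands above; the next run of second.sh then starts a fresh log via its >> redirects.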

Changed 3 years ago by chris

Attachment puffin-2013-10-09-load-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-09-load-week.png added

comment:100 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.2
Total Hours changed from 40.48 to 40.68

The load spikes are back, see ticket:586#comment:31, I guess it was just a long weekend:

comment:101 Changed 3 years ago by ed

so now we have load spikes *and* NR - this is good timing.

Last edited 3 years ago by ed

comment:102 Changed 3 years ago by jim

Per my comment over on the NR ticket, NR was NOT collecting data at the time of this spike. It is now...

comment:103 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.3
Total Hours changed from 40.68 to 40.98

OK, so the database operations occasionally become slow when a load spike happens.

According to NR, there were some slow DB ops for the url_alias, node, system and term tables, and this appears to have made Drupal slow too.

I then compared this to Munin and in this case there was a huge spike in 'low memory prunes' at that time.

Other spikes have coincided with high numbers of PHP requests...

I'm therefore waiting for a 'proper' spike, since these appear to either be rare memory issues or regular usage spikes.

comment:104 Changed 3 years ago by jim

I would add that we can probably remove MANY of the URL aliases which would clean that table up a bit. I'll continue these thoughts on #590.

comment:105 follow-up: ↓ 106 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 1.25
Total Hours changed from 40.98 to 42.23

I have a hunch... It's too coincidental that the load spikes occurred in tandem with the NR install/removal, which ran the BOA install script.

Looking at the events in #599 (clock drift) and this ticket, the start and end of the periods of load spikes coincide with two things: the BOA install script being run, and my enablement of the cron job that syncs time...

According to Munin, the load spikes started on the 29th, which is when Chris set up the original clock sync cron job: https://tech.transitionnetwork.org/trac/timeline?from=2013-09-29T21%3A29%3A19%2B01%3A00&precision=second
NR was 'uninstalled' late on the 10th -- THOUGH this could easily be the 11th as we had such a lot of clock drift.
NR was reinstalled on 3rd https://tech.transitionnetwork.org/trac/timeline?from=2013-10-03T21%3A28%3A39%2B01%3A00&precision=second -- load spikes stopped. This wiped the date cron job.
I reinstated the cron job and CSF settings on the 7th. Load spikes started again around that time.

Now there's a few hours discrepancy here, but we cannot actually trust the times in Munin or Trac during this period.

My hunch is that the server software including MySQL, Drupal, Redis, Munin and NR are being confused by the time being changed so often, and that is the reason for the load spikes.

The simple test is to disable Chris' cron date sync job, per the last comment on #599, and wait a day or so.

Ideally though the clock would just work!

If I don't hear anything from Chris before tomorrow noon I'll disable the cron job and test myself.

Adding time for this detective work, plus 1 hour for my meeting with Ed at 2.15pm.

comment:106 in reply to: ↑ 105 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.0
Total Hours changed from 42.23 to 43.23

Replying to jim:

My hunch is that the server software including MySQL, Drupal, Redis, Munin and NR are being confused by the time being changed so often, and that is the reason for the load spikes.

I don't doubt that the clock drift, and it being reset every min, could be causing some confusion. But I doubt it's the cause of the load spikes, as these pre-date the clock problems.

I have written up a summary of what the load trigger thresholds are for BOA for Ed at wiki:PuffinServer#LoadSpikes, the key sentence being:

When the load hits 3.88 robots are served 403 Forbidden responses and when the load hits 18.88 maintenance tasks are killed and when the load hits 72.2 the server terminates until the 5 min load average falls below 44.4.

Note that the Max load for the last month is 40.97 and for the last year 95.23, see the puffin munin load graphs.
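
To make the thresholds concrete, here is a rough sketch of how a load value maps onto the actions in the sentence above. The classify_load function is hypothetical and is not taken from second.sh; it only encodes the three trigger values quoted here and ignores the 44.4 restart condition:

```shell
#!/bin/sh
# Hypothetical helper (not the BOA code): print the action that the
# thresholds quoted in this ticket would trigger for a given load value.
classify_load() {
  awk -v load="$1" 'BEGIN {
    if (load >= 72.2)       print "stop-web-server"         # nginx and php-fpm stopped
    else if (load >= 18.88) print "kill-maintenance-tasks"  # php, drush.php, wget killed
    else if (load >= 3.88)  print "nginx-high-load"         # robots refused
    else                    print "normal"
  }'
}
```

With the figures above, classify_load 40.97 (last month's max) falls in the maintenance-kill band, while classify_load 95.23 (last year's max) is above the shutdown threshold.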

comment:107 follow-up: ↓ 108 Changed 3 years ago by ed

Thanks Chris, pls can you define a maintenance task? will that affect a user editing a blog post for example?

comment:108 in reply to: ↑ 107 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 43.23 to 43.33

Replying to ed:

can you define a maintenance task? will that affect a user editing a blog post for example?

No, it shouldn't. As far as I'm aware it's tasks which are run via cron (Jim should be able to confirm this). These are the processes which are killed when the load reaches 18.88:

  if test -f /var/run/boa_run.pid ; then
    sleep 1
  else
    killall -9 php drush.php wget
  fi

comment:109 Changed 3 years ago by jim

Agreed, doing a killall -9 php drush.php wget won't do much except ensure the system isn't downloading or running maintenance stuff.

comment:110 Changed 3 years ago by ed

Changed 3 years ago by chris

Attachment puffin-2013-10-15-load-day.png added

comment:111 follow-up: ↓ 115 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.2
Total Hours changed from 43.33 to 43.53

We had a massive load spike this evening, this is the Munin load graph:

But this understates how big it was, this is the Munin email I got at the height of it:

Date: Tue, 15 Oct 2013 23:02:03 +0100
Subject: puffin.transitionnetwork.org Munin Alert

transitionnetwork.org :: puffin.transitionnetwork.org :: Load average
        CRITICALs: load is 91.54 (outside range [:8]).

This was high enough for the server to kill itself, see wiki:PuffinServer#LoadSpikes

Changed 3 years ago by jim

Attachment phpfpm_status-day.png added

comment:112 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.05
Total Hours changed from 43.53 to 43.58

The gaps in Munin whenever it gets spicy render it nearly useless for establishing any kind of cause and effect. And webalizer seems to just aggregate so drilling down to an hour is impossible (I think?).

Would be great to know what the precursor was -- though I note a spike in PHP-FPM active connections just before it dies -- and a spike in CPU before that.

Some questions:

1. Can we rule in or out a network/traffic issue somehow? I'm concerned we wouldn't know if the server was getting slammed as Munin dies as soon as it does.
2. Can we tell if it's choking because of software glitches somewhere, or because of a burst of hits? (this is nearly the same question as 1, but turning the question into the server, rather than out onto the network).
3. Can we increase the resolution of Munin down to a minute window? And make the zoom function work?

Chris, Ed?

comment:113 Changed 3 years ago by chris

Cc aland added

Adding Alan to the CC list.

comment:114 Changed 3 years ago by chris

Sorry I haven't had time this week to look at the logs to try and work out what the cause of the massive load spike was on Tuesday, I'll try to spend some time on it later today or over the weekend.

Changed 3 years ago by chris

Attachment 30.png added

comment:115 in reply to: ↑ 111 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.6
Total Hours changed from 43.58 to 45.18

Replying to chris:

We had a massive load spike this evening

Looking at the Piwik stats, this day had more page views than any, bar one, this year:

Looking at the Nginx logs for the day, there were:

77,544 Hits
7,276 Unique Visitors
2.556 GB Bandwidth

These figures are taken from running goaccess against the log file:

sudo -i
cd /var/log/nginx
gunzip awstats.log.6.gz
goaccess -b -s -f awstats.log.6

And this is a screenshot of the top of the output:

Further down there are hits by top IP addresses; one in Abidjan, Côte d'Ivoire accounts for 12.20% of the hits (9,463 lines in the access log file). These are the stats for the traffic from that IP address:

The hits from this IP address started at 14/Oct/2013:10:42:01 and ended at 14/Oct/2013:13:00:23, so the load spike cannot be solely attributed to the traffic from this IP, which also has a user agent string which indicates that it is a spider:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36 Squider/0.01

The Munin email at the height of the load spike was sent at 23:02:03, so taking the logs from 22:30 to 23:30 we have:

Things to note here:

Total hits 2742
Total Unique Visitors 573
BW 0.100 GB
Common browsers: 7.85% Googlebot
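
For reference, a one hour window like the 22:30 to 23:30 one above can be cut out of an access log with awk before feeding it to goaccess. This is a minimal sketch, assuming the bracketed [dd/Mon/yyyy:HH:MM:SS +zone] timestamp format seen in the log lines in this ticket; log_window is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read access log lines on stdin and print those whose
# timestamp is on day $1 with a time in the range [$2, $3).
# Usage: log_window 14/Oct/2013 22:30:00 23:30:00 < awstats.log.6
log_window() {
  awk -v day="$1" -v start="$2" -v end="$3" '
    {
      # Find the "[dd/Mon/yyyy:HH:MM:SS" part of the line
      if (match($0, /\[[0-9][0-9]\/[A-Za-z]+\/[0-9]+:[0-9:]+/)) {
        ts = substr($0, RSTART + 1, RLENGTH - 1)   # drop the leading "["
        d  = substr(ts, 1, length(day))            # dd/Mon/yyyy
        t  = substr(ts, length(day) + 2)           # HH:MM:SS
        # Zero-padded times compare correctly as plain strings
        if (d == day && t >= start && t < end) print
      }
    }'
}
```

The output can be redirected to a temporary file and passed to goaccess with -f in the same way as the full log.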

962 lines of the logs, out of 2,742, have the same user agent:

Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36

Of these 962 lines a total of 309 contain "register" and 532 contain "add" -- this was clearly a spam bot trying to register accounts and to post content.
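
Counts like these can be reproduced with grep. A minimal sketch, assuming the log lines for the window are piped in on stdin; ua_keyword_count is a hypothetical helper and, like the figures above, it does plain substring matching:

```shell
#!/bin/sh
# Hypothetical helper: read access log lines on stdin, keep those that
# contain the user agent string $1, and count how many also contain $2.
ua_keyword_count() {
  # grep -c exits non-zero when the count is 0, so mask that with "|| true"
  grep -F -- "$1" | grep -F -c -- "$2" || true
}
```

For example, passing the Chrome/27.0.1453.116 user agent string above as the first argument and register as the second would give the number of registration attempts from that agent in whatever window is piped in.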

But there were also some requests that look like admin work being done on the site just at the time of the massive load spike, perhaps by Jim? Specifically these POSTs:

127.0.0.1 - - [14/Oct/2013:22:59:29 +0100] "POST /node/81?destination=node%2F81 HTTP/1.0" 302 0 "https://tn.puffin.webarch.net/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
127.0.0.1 - - [14/Oct/2013:23:00:04 +0100] "POST /node/671/delete HTTP/1.0" 302 0 "https://tn.puffin.webarch.net/node/671/delete" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
127.0.0.1 - - [14/Oct/2013:23:00:59 +0100] "POST /hosting/js/node/691/platform_migrate?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX HTTP/1.0" 302 0 "https://tn.puffin.webarch.net/hosting/js/node/691/platform_migrate?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"
127.0.0.1 - - [14/Oct/2013:23:01:00 +0100] "POST /hosting/js/batch?id=1&op=do HTTP/1.0" 200 99 "https://tn.puffin.webarch.net/hosting/js/batch?op=start&id=1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"

My conclusion is that the site was getting a fair amount of bot activity, both "good" (Googlebot) and bad (spam bots), and that this also coincided with a popular day for the site, but the thing that probably caused the massive load spike was work being done on the site via the https://tn.puffin.webarch.net/ domain. If this is right then I must admit I find it a little worrying that doing work to migrate platforms via https://tn.puffin.webarch.net/ appears to have the potential to cause load spikes which result in BOA shutting the server down. Isn't this the worst time to have drush and php cli tasks killed (this happens when the load hits 18.88) and nginx and php-fpm killed (when the load hits 72.2)? The load reached 91.54.

Regarding the questions Jim asked:

Some questions:

1. Can we rule in or out a network/traffic issue somehow? I'm concerned we wouldn't know if the server was getting slammed as Munin dies as soon as it does.
2. Can we tell if it's choking because of software glitches somewhere, or because of a burst of hits? (this is nearly the same question as 1, but turning the question into the server, rather than out onto the network).
3. Can we increase the resolution of Munin down to a minute window? And make the zoom function work?

I think 1. and 2. have been more-or-less answered? Regarding 3., the way to increase the resolution would be to make the zoom work. This is something I spent a little time on when it was installed, but I concluded that it probably wasn't worth spending too much time on. I could revisit this, but I'd suggest it's something we revisit after we have upgraded to Wheezy?

Changed 3 years ago by chris

Attachment puffin-2013-10-20-load-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20-cpu-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20-phpfpm_status-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20-phpfpm_connections-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20-phpfpm_connections-day.2.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20-nginx_request-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20-fw_packets-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20-if_eth0-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20-fw_conntrack-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20-mysql_qcache-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20-mysql_queries-day.png added

comment:116 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.4
Total Hours changed from 45.18 to 45.58

Yesterday we had the first load spike since we migrated to the ZFS server, and it's clear from these Munin stats that it was caused by a spike in traffic:

The lfd email alert recorded the spike as resulting in a load of 21.78:

From: root@puffin.webarch.net
Date: Sat, 19 Oct 2013 15:06:36 +0100 (BST)
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 7.36

Time:                    Sat Oct 19 15:06:35 2013 +0100
1 Min Load Avg:          21.78
5 Min Load Avg:          7.36
15 Min Load Avg:         2.86
Running/Total Processes: 34/343

Taking the nginx logs from 19/Oct/2013:15:00:00 to 19/Oct/2013:15:10:00, there were:

Total hits 1092
Total Unique Visitors 142
Total Requests 91
70.70% of hits from one IP address

This one IP address, a dynamic French IP address, had a total of 772 hits in this 10 min period and 0.032 GB of bandwidth was used. Again it was the same "Squider" spider found last Tuesday:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36 Squider/0.01

This bot is listed here

http://www.botsvsbrowsers.com/details/1303776/index.html

I don't know if it is a "good" or a "bad" bot, but it appears to me that this load spike can be attributed to this bot.
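
The 70.70% figure above is simply 772 of 1092 hits. A rough sketch of how the busiest address and its share could be pulled straight from a log extract with awk follows; top_ip_share is a hypothetical helper, and it assumes the client address is the first field on each line, which may not hold if requests are proxied and logged as 127.0.0.1:

```shell
#!/bin/sh
# Hypothetical helper: read access log lines on stdin and print the busiest
# client address, its hit count and its share of all hits.
# Assumes the client address is the first whitespace-separated field.
top_ip_share() {
  awk '
    { hits[$1]++; total++ }
    END {
      # Pick the address with the most hits (ties are resolved arbitrarily)
      for (ip in hits) if (hits[ip] > max) { max = hits[ip]; top = ip }
      if (total > 0) printf "%s %d %.2f%%\n", top, max, 100 * max / total
    }'
}
```

Combined with a time window filter, this gives a quick check on whether a single client dominates the traffic during a spike.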

comment:117 follow-up: ↓ 118 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 45.58 to 45.68

Three things:

I was indeed doing work on the server when you suggested above... Building and migrating platforms is pretty IO intensive (backups, file moves, downloads) so that's expected and unavoidable. I'll try to do them later in the day in future.
I say we block Squider. It smells bad, works bad and clearly falls into the 'misuse' category. If there's legitimate use for it, then let other sites humour its users. KILL IT!
It seems we're on the downward slope, load spike-wise... During the 'motherboard pain days' things were really bad, then recently they got pretty good and some Drupal work reduced IO needs further, now with the IO boosted by the ZFS move, we're in good shape. If it turns out that most spikes are now caused by a few misbehaving users/bots, we're in very good shape. Time will tell...

Changed 3 years ago by chris

Attachment puffin-2013-10-20_2-load-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20_2-phpfpm_connections-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20_2-fw_conntrack-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20_2-fw_conntrack-day.2.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20_2-mysql_qcache-day.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20_2-mysql_qcache-day.2.png added

Changed 3 years ago by chris

Attachment puffin-2013-10-20_2-mysql_queries-day.png added

comment:118 in reply to: ↑ 117 ; follow-up: ↓ 119 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 45.68 to 45.93

Replying to jim:

I was indeed doing work on the server when you suggested above... Building and migrating platforms is pretty IO intensive (backups, file moves, downloads) so that's expected and unavoidable. I'll try to do them later in the day in future.

It was late in the day (11pm); I'm not sure if any later would make that much difference?

If the result of using the web interface to build and migrate platforms is the server killing itself for 15 or 20 mins then this indicates to me that the application needs faster, more powerful hardware to run it (the load reached 91.54, which is significantly above the suicide threshold which is currently set at 72.2). Or am I missing something here?

I say we block Squider. It smells bad, works bad and clearly falls into the 'misuse' category. If there's legitimate use for it, then let other sites humour its users. KILL IT!

Personally I'd be interested in knowing more about it before blocking it, but I'm not that fussed. How would it be best blocked? At a BOA/Drupal, Nginx or firewall level?

It seems we're on the downward slope, load spike-wise... During the 'motherboard pain days' things were really bad, then recently they got pretty good and some Drupal work reduced IO needs further, now with the IO boosted by the ZFS move, we're in good shape. If it turns out that most spikes are now caused by a few misbehaving users/bots, we're in very good shape. Time will tell...

Perhaps, I think it's probably too soon to call this, we have just had another load spike and weekends generally have a lot less traffic than weekdays...

Just now the load went over 44, the lfd email:

From: root@puffin.webarch.net
Date: Sun, 20 Oct 2013 14:11:47 +0100 (BST)
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 11.99

Time:                    Sun Oct 20 14:11:47 2013 +0100
1 Min Load Avg:          44.32
5 Min Load Avg:          11.99
15 Min Load Avg:         4.41
Running/Total Processes: 44/341

And some Munin graphs:

comment:119 in reply to: ↑ 118 ; follow-up: ↓ 120 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.05
Total Hours changed from 45.93 to 45.98

Replying to chris:

If the result of using the web interface to build and migrate platforms is the server killing itself for 15 or 20 mins then this indicates to me that the application needs faster, more powerful hardware to run it (the load reached 91.54, which is significantly above the suicide threshold which is currently set at 72.2). Or am I missing something here?

You're jumping to some pretty big conclusions there! It's far more likely that a load spike coincided with the work I was doing. The hardware is ample now it is working properly.

Aegir is actually a set of Drush commands that the web interface kicks off... So command line or web UI it's the same.

Finally, backing up, cloning, and tweaking a database from such a big site as TN.org is highly IO intensive, as is safely moving sites around and reloading Nginx when done. These tasks are not common in normal usage and are not a risk to the server.

comment:120 in reply to: ↑ 119 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 45.98 to 46.23

Replying to jim:

It's far more likely that a load spike coincided with the work I was doing.

The migrate POST happened at 23:00:59 and the load hit 91.54 at 23:02:03 -- these things happened at basically the same time, I'm not convinced that this was a coincidence.

The work you were doing that evening "hung", was this perhaps caused by the load reaching 18.88 and then the drush and php processes being killed by second.sh?

Finally, backing up, cloning, and tweaking a database from such a big site as TN.org is highly IO intensive, as is safely moving sites around and reloading Nginx when done. These tasks are not common in normal usage and are not a risk to the server.

Can I suggest that the next time you do any tasks like this, using the web interface, you keep a very close eye on the load, eg by running top in a terminal and see what level it reaches?
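
As an alternative to watching top by hand, the load could be sampled to a file while the task runs. This is only a sketch: `sample_load` is a hypothetical helper, and the only assumption about the system is that `/proc/loadavg` exists, as it does on Linux.

```shell
sample_load()
{
  # $1 = loadavg file (defaults to /proc/loadavg on Linux).
  # Prints a timestamp followed by the 1, 5 and 15 minute load averages.
  echo "$(date '+%Y-%m-%d %H:%M:%S') $(cut -d ' ' -f 1-3 "${1:-/proc/loadavg}")"
}

# Print one sample now if the file is readable.
[ -r /proc/loadavg ] && sample_load

# To keep sampling every 10 seconds during a migrate (stop with Ctrl-C):
# while sleep 10; do sample_load; done >> /tmp/load-samples.log
```

The highest one minute value in the resulting file can then be compared with the kill thresholds.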

Changed 3 years ago by chris

Attachment puffin-2013-10-20_3_load-day.png added

comment:121 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 46.23 to 46.33

We just had another load spike which came close to the 18.88 php and drush killing threshold:

From: root@puffin.webarch.net
Date: Sun, 20 Oct 2013 16:44:01 +0100 (BST)
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 6.36

Time:                    Sun Oct 20 16:44:01 2013 +0100
1 Min Load Avg:          17.91
5 Min Load Avg:          6.36
15 Min Load Avg:         3.14
Running/Total Processes: 24/333

The Munin graphs are based on the 5 Min Load Avg so they don't record the peaks as high as they reach:

comment:122 follow-up: ↓ 124 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.05
Total Hours changed from 46.33 to 46.38

These particular load spikes almost certainly coincide with Aegir backing up the STG database in preparation to migrate/clone the site. That will take up a fair amount of IO/CPU.

Chris, did you add the top, uptime and vmstat output to the high load syslog entries as discussed? Would be nice to know what was running at the time.

comment:123 Changed 3 years ago by jim

It's worth adding that the recent work on boa has only been going on for 7-10 days so it cannot be the only cause....

comment:124 in reply to: ↑ 122 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.5
Total Hours changed from 46.38 to 46.88

Replying to jim:

Chris, did you add the top, uptime and vmstat output to the high load syslog entries as discussed? Would be nice to know what was running at the time.

Yes, but not the date / time, so it'll be hard to find the right part of the log file, sorry. I'll look if you want, by working it out from the uptimes. I have manually rotated the log now so it'll be in /var/log/high-load.log.1.

I have now added in a date so it'll be easier to find things in the future, this is what we now have in /var/xdrago/second.sh:

nginx_high_load_on()
{
  mv -f /data/conf/nginx_high_load_off.conf /data/conf/nginx_high_load.conf
  /etc/init.d/nginx reload

  # start additions
  echo "====================" >> /var/log/high-load.log
  date >> /var/log/high-load.log
  echo "nginx high load on" >> /var/log/high-load.log
  echo "ONEX_LOAD = $ONEX_LOAD" >> /var/log/high-load.log
  echo "FIVX_LOAD = $FIVX_LOAD" >> /var/log/high-load.log
  echo "uptime : " >> /var/log/high-load.log
  uptime >> /var/log/high-load.log
  echo "vmstat : " >> /var/log/high-load.log
  vmstat -S M -s >> /var/log/high-load.log
  vmstat -S M -d >> /var/log/high-load.log
  echo "top : " >> /var/log/high-load.log
  top -n 1 -b >> /var/log/high-load.log
  echo "====================" >> /var/log/high-load.log
  # end additions

}
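
The repeated redirections above could be grouped into one helper so that the log path lives in a single place. A minimal sketch, assuming a hypothetical `log_high_load` function and `HIGH_LOAD_LOG` variable (neither is part of BOA); the commands it runs are the same ones added above:

```shell
# Hypothetical helper, not part of BOA's second.sh.
HIGH_LOAD_LOG="${HIGH_LOAD_LOG:-/var/log/high-load.log}"

log_high_load()
{
  # $1 is a short label such as "nginx high load on".
  # Everything inside the braces is appended to the log in one go,
  # including any error messages from the commands.
  {
    echo "===================="
    date
    echo "$1"
    echo "ONEX_LOAD = $ONEX_LOAD"
    echo "FIVX_LOAD = $FIVX_LOAD"
    echo "uptime : "
    uptime
    echo "vmstat : "
    vmstat -S M -s
    vmstat -S M -d
    echo "top : "
    top -n 1 -b
    echo "===================="
  } >> "$HIGH_LOAD_LOG" 2>&1
}
```

Inside nginx_high_load_on() the additions would then reduce to a single call, `log_high_load "nginx high load on"`, and the same helper could be reused wherever else logging is wanted.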

I also think we should try to radically change these variables in /var/xdrago/second.sh:

CTL_ONEX_SPIDER_LOAD=388
CTL_FIVX_SPIDER_LOAD=388
CTL_ONEX_LOAD=7220
CTL_FIVX_LOAD=4440
CTL_ONEX_LOAD_CRIT=1888
CTL_FIVX_LOAD_CRIT=1555

We have already changed two of them from their defaults by editing /root/.barracuda.cnf and multiplying these values by 5:

#_LOAD_LIMIT_ONE=1444
#_LOAD_LIMIT_TWO=888
_LOAD_LIMIT_ONE=7220
_LOAD_LIMIT_TWO=4440

Remember we have 14 cores (see [https://en.wikipedia.org/wiki/Load_average#Unix-style_load_calculation Wikipedia on Unix-style load calculation]), so if we multiply the original values by 6 we then have these values in /var/xdrago/second.sh:

CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328
CTL_ONEX_LOAD_CRIT=11328
CTL_FIVX_LOAD_CRIT=9330

This would mean that the new thresholds would be:

When the load hits 23.28 then robots are served 403 Forbidden responses
When the load hits 86.64 then drush and php maintenance tasks are killed until the 5min load drops below 53.28
When the load hits 113.28 then the web server applications are killed until the 5min load drops below 93.30

And dividing the above by 14 we get:

When the load hits 1.66 then robots are served 403 Forbidden responses
When the load hits 6.19 then drush and php maintenance tasks are killed until the 5min load drops below 3.81
When the load hits 8.09 then the web server applications are killed until the 5min load drops below 6.66

Perhaps these values are too low, if we multiply by 7 we get:

CTL_ONEX_SPIDER_LOAD=2716
CTL_FIVX_SPIDER_LOAD=2716
CTL_ONEX_LOAD=10108
CTL_FIVX_LOAD=6216
CTL_ONEX_LOAD_CRIT=13216
CTL_FIVX_LOAD_CRIT=10885

This would mean that the new thresholds would be:

When the load hits 27.16 then robots are served 403 Forbidden responses
When the load hits 101.08 then drush and php maintenance tasks are killed until the 5min load drops below 62.16
When the load hits 132.16 then the web server applications are killed until the 5min load drops below 108.85

And dividing the above by 14 we get:

When the load hits 1.94 then robots are served 403 Forbidden responses
When the load hits 7.22 then drush and php maintenance tasks are killed until the 5min load drops below 4.44
When the load hits 9.44 then the web server applications are killed until the 5min load drops below 7.78

This would mean that the load spike of 91 the other day wouldn't have caused drush and php tasks to be killed.
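
The arithmetic above can be re-checked with a short script. This is a sketch only: `scale_threshold` and `as_load` are hypothetical helpers, not part of BOA, and they assume the thresholds are stored as load times 100, as they are in second.sh.

```shell
# Hypothetical helpers for checking the arithmetic; not part of BOA.
# Thresholds are stored as load * 100, as in second.sh.

scale_threshold()
{
  # $1 = default value (load * 100), $2 = integer multiplier
  echo $(( $1 * $2 ))
}

as_load()
{
  # $1 = value stored as load * 100, $2 = number of cores (default 1)
  awk -v v="$1" -v c="${2:-1}" 'BEGIN { printf "%.2f\n", v / 100 / c }'
}

# The five distinct default values, multiplied by 7 and shown per core
# for a 14 core server.
for value in 388 1444 888 1888 1555; do
  scaled=$(scale_threshold "$value" 7)
  echo "$value -> $scaled = load $(as_load "$scaled"), per core $(as_load "$scaled" 14)"
done
```

Changing the 7 to 6 gives the first set of proposed values instead.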

Jim, Alan, does this sound sensible to you?

comment:125 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 1.5
Total Hours changed from 46.88 to 48.38

I have looked through all the Octopus tickets and I couldn't find a question like the following:

second.sh kill thresholds and multiple CPU cores

Every minute /var/xdrago/second.sh is run by cron. It loops every 10 seconds, checks the load levels and kills tasks if set thresholds are breached. With the default settings a load of 18.88 causes php and drush tasks to be killed and a load of 14.44 causes the web server to be killed. This is my reading of the script:

If the load average over the last minute is greater than 3.88 and less than 14.44 and the nginx high load config isn't in use then start to use it.
Else if the load average over the last 5 mins is greater than 3.88 and less than 8.88 and the nginx high load config isn't in use then start to use it.
Else if the load average over the last minute is less than 3.88 and the load average over the last 5 mins is less than 3.88 and the nginx high load config is in use then stop using it.
If the load average over the last minute is greater than 18.88 and if /var/run/boa_run.pid exists, wait a second, if not kill some maintenance jobs: killall -9 php drush.php wget
Else if the load average over the last 5 mins is greater than 15.55 and if /var/run/boa_run.pid exists, wait a second, if not kill some maintenance jobs: killall -9 php drush.php wget
If the load average over the last minute is greater than 14.44 then kill the web server, killall -9 nginx and killall -9 php-fpm php-cgi
Else if the load average over the last 5 mins is greater than 8.88 then kill the web server, killall -9 nginx and killall -9 php-fpm php-cgi
Else restart all the services via /var/xdrago/proc_num_ctrl.cgi
How many cores are these thresholds based on?

The thresholds for killing the server are configurable via these variables in /root/.barracuda.cnf:
_LOAD_LIMIT_ONE=1444
_LOAD_LIMIT_TWO=888
Could the thresholds for triggering the high load config and killing drush and php also be made configurable?

The current default settings do not appear to be suited for a server with a lot of CPU cores, what should the default values be multiplied by?

However the above isn't in the correct format, so I was looking at munging it into the correct format and gathering the info that needs to be posted, which includes /root/.tn.octopus.cnf and I see this file has these settings:

_CLIENT_OPTION="SSD"
_CLIENT_CORES="8"

The server isn't running off SSDs and there are 14, not 8, cores.

Grepping the BOA code, which I downloaded as a zip file from github, these files contain "CLIENT_CORES":

./nginx-for-drupal-master/OCTOPUS.sh.txt
./nginx-for-drupal-master/docs/cnf/octopus.cnf
./nginx-for-drupal-master/aegir/scripts/AegirSetupC.sh.txt
./nginx-for-drupal-master/aegir/scripts/AegirSetupB.sh.txt
./nginx-for-drupal-master/aegir/scripts/AegirSetupA.sh.txt
./nginx-for-drupal-master/aegir/tools/system/weekly.sh

The first file above contains:

if [ ! -e "/data/disk/$_USER/log/cores.txt" ] ; then
  echo $_CLIENT_CORES > /data/disk/$_USER/log/cores.txt
fi

And the /data/disk/tn/log/cores.txt file does exist and it simply contains the number 8. I suggest we increase this to 14.

However the second file contains:

_CLIENT_OPTION="SSD" #---------- Currently not used
_CLIENT_CORES="8" #------------- Currently not used

So I'm not sure if this value is used?

The 3rd and 4th files limit the platforms available to servers with fewer than 2 cores (eg no CiviCRM if you only have one core).

I can't find any code which causes the correct value for the number of cores to be written to /data/disk/tn/log/cores.txt automatically.
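
For illustration, the core count could be detected rather than typed in. This is a sketch and not something BOA ships: `detect_cores` is a hypothetical helper, `nproc` comes from GNU coreutils and reports the processors available to the current process, and the two fallbacks are assumptions about what else is installed.

```shell
# Hypothetical helper; BOA does not ship this.
detect_cores()
{
  # Prefer nproc, fall back to getconf, then to counting /proc/cpuinfo entries.
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
    getconf _NPROCESSORS_ONLN
  else
    grep -c '^processor' /proc/cpuinfo
  fi
}

# Example use (path taken from this ticket; uncomment to write the file):
# detect_cores > /data/disk/tn/log/cores.txt
echo "cores detected: $(detect_cores)"
```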

For the SSD option it looks like this is related to disk space, rather than IO speed. In nginx-for-drupal-master/aegir/tools/system/weekly.sh there is this code:

check_limits () {
  read_account_data
  if [ "$_CLIENT_OPTION" = "POWER" ] ; then
    _SQL_MIN_LIMIT=5120
    _DSK_MIN_LIMIT=51200
    _SQL_MAX_LIMIT=$(($_SQL_MIN_LIMIT + 256))
    _DSK_MAX_LIMIT=$(($_DSK_MIN_LIMIT + 5120))
  elif [ "$_CLIENT_OPTION" = "SSD" ] ; then
    _SQL_MIN_LIMIT=512
    _DSK_MIN_LIMIT=10240
    _SQL_MAX_LIMIT=$(($_SQL_MIN_LIMIT + 128))
    _DSK_MAX_LIMIT=$(($_DSK_MIN_LIMIT + 2560))
  else
    _SQL_MIN_LIMIT=256
    _DSK_MIN_LIMIT=5120
    _SQL_MAX_LIMIT=$(($_SQL_MIN_LIMIT + 64))
    _DSK_MAX_LIMIT=$(($_DSK_MIN_LIMIT + 1280))
  fi

Perhaps we should simply raise the kill thresholds as suggested in ticket:555#comment:124 and not bother raising this as a BOA ticket because we can't suggest a general "proposed resolution"?

comment:126 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 48.38 to 48.48

We are running out of budget for this month so I have applied the suggested changes from ticket:555#comment:124 and saved a copy of the modified second.sh script to /root/ and I have also changed /data/disk/tn/log/cores.txt and /root/.tn.octopus.cnf to show 14 rather than 8 cores.

comment:127 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 48.48 to 48.73

I have updated the load spike docs and also updated /root/.barracuda.cnf with the new values and updated the upgrade notes.

comment:128 follow-up: ↓ 130 Changed 3 years ago by jim

Add Hours to Ticket changed from 0.0 to 0.5
Total Hours changed from 48.73 to 49.23

I agree with Chris's load threshold changes, and yes the cores thing is not needed/used for us.

So... The question is: should we raise a Barracuda ticket so the changes to second.sh are avoided? I.e. make https://tech.transitionnetwork.org/trac/wiki/PuffinServer#xdragoshellscriptchanges part of the customisable bits for BOA.

I say yes. I'd suggest a check in second.sh for a /root/.barracuda-variables.cnf file and load the values from there if it exists, otherwise use the defaults in the control() function in second.sh.

D.org is still down for the scheduled upgrade, so I'll do this shortly. A first stab at such a patch would be:

In second.sh we add a new function that loads the limits from the /root/.barracuda-overrides.cnf file, if it exists:

load_limits()
{
  if [ -e "/root/.barracuda-overrides.cnf" ] ; then
    source /root/.barracuda-overrides.cnf
  else
    CTL_ONEX_SPIDER_LOAD=388
    CTL_FIVX_SPIDER_LOAD=388
    CTL_ONEX_LOAD=1444
    CTL_FIVX_LOAD=888
    CTL_ONEX_LOAD_CRIT=1888
    CTL_FIVX_LOAD_CRIT=1555
  fi
}

Then we can remove the hard coded limits in the control() function and instead put a call to load_limits() at the bottom of the file, just before it starts the control function calls:

load_limits  # new
control      # as before
sleep 10
... etc ...
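
One limitation of the first stab above is that an overrides file which sets only some of the variables would leave the others unset. A variant sketch that sets the defaults first and then lets the file replace them is shown here; the file name follows the proposal above and is not an existing BOA convention, and `OVERRIDES_FILE` is a hypothetical variable added so the path can be changed.

```shell
# Variant sketch; OVERRIDES_FILE is a hypothetical variable, not a BOA setting.
OVERRIDES_FILE="${OVERRIDES_FILE:-/root/.barracuda-overrides.cnf}"

load_limits()
{
  # Defaults (load * 100), as listed in the proposal above.
  CTL_ONEX_SPIDER_LOAD=388
  CTL_FIVX_SPIDER_LOAD=388
  CTL_ONEX_LOAD=1444
  CTL_FIVX_LOAD=888
  CTL_ONEX_LOAD_CRIT=1888
  CTL_FIVX_LOAD_CRIT=1555

  # Any variable set in the overrides file replaces its default.
  if [ -r "$OVERRIDES_FILE" ] ; then
    . "$OVERRIDES_FILE"
  fi
}

load_limits
echo "CTL_ONEX_LOAD=$CTL_ONEX_LOAD CTL_FIVX_LOAD=$CTL_FIVX_LOAD"
```

With this shape an overrides file containing only `CTL_ONEX_LOAD=8664` would change that one limit and keep the other five defaults.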

Chris, your thoughts? If it looks good to you I'll make an actual patch on GitHub/Drupal.org.

Changed 3 years ago by chris

Attachment tn-piwik-2013-11-03.png added

comment:129 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 49.23 to 49.48

It's worth noting that the wiki:ServerBandwidth figures for April, May and June 2013:

       month        rx      |     tx      |    total    |   avg. rate
    ------------------------+-------------+-------------+---------------
      Apr '13     68.61 GiB |   14.06 GiB |   82.66 GiB |  267.52 kbit/s
      May '13     65.49 GiB |   22.61 GiB |   88.10 GiB |  275.92 kbit/s
      Jun '13     68.12 GiB |   16.18 GiB |   84.31 GiB |  272.85 kbit/s

Basically doubled for July, August, September and October 2013:

       month        rx      |     tx      |    total    |   avg. rate
    ------------------------+-------------+-------------+---------------
      Jul '13    113.14 GiB |   21.98 GiB |  135.12 GiB |  423.18 kbit/s
      Aug '13    124.42 GiB |   17.20 GiB |  141.62 GiB |  443.56 kbit/s
      Sep '13    139.33 GiB |   13.78 GiB |  153.10 GiB |  495.49 kbit/s
      Oct '13    143.35 GiB |   13.97 GiB |  157.32 GiB |  492.72 kbit/s

Time wise this also more-or-less corresponds with the increase in load spikes.
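
As a rough check on the size of the increase, the monthly totals from the two tables can be averaged and compared. This is a sketch only; `mean` is a hypothetical helper and the figures are copied from the tables above.

```shell
# Monthly totals in GiB, copied from the two tables above.
before="82.66 88.10 84.31"
after="135.12 141.62 153.10 157.32"

mean()
{
  # Prints the arithmetic mean of its arguments to two decimal places.
  echo "$@" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.2f\n", s / NF }'
}

# The variables are left unquoted on purpose so each figure is one argument.
mean_before=$(mean $before)
mean_after=$(mean $after)
ratio=$(awk -v a="$mean_after" -v b="$mean_before" 'BEGIN { printf "%.2f\n", a / b }')
echo "mean before: $mean_before GiB, mean after: $mean_after GiB, ratio: $ratio"
```

On these figures the average monthly total rose from about 85 GiB to about 147 GiB, roughly 1.7 times.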

It's also worth noting that there hasn't been a corresponding increase in the number of visitors recorded by Piwik:

Note also that the bandwidth usage only went up in one direction (outgoing).

According to pingdom this is the current size of the front page, by domain the content was loaded from:

www.transitionnetwork.org	773.1 kB	
s.ytimg.com	                43.5 kB	
stats.transitionnetwork.org	22.2 kB	
www.youtube.com	                10.4 kB	
i1.ytimg.com	                0 B

And this is by content type:

Image	647.4 kB	
CSS	76.7 kB	
Script	74.7 kB	
Other	27.0 kB	
HTML	23.4 kB

Was there a site redesign around this time that added a/some big image(s) to the front page perhaps?


comment:130 in reply to: ↑ 128 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 49.48 to 49.58

Replying to jim:

So... The question is: should we raise a Barracuda ticket so the changes to second.sh are avoided?

That would be good :-)

I say yes. I'd suggest a check in second.sh for a /root/.barracuda-variables.cnf file and load the values from there if it exists, otherwise use the defaults in the control() function in second.sh.

I'm happy with you suggesting it is done like this but I'd guess they might want to put the variables in the already existing /root/.barracuda.cnf file, but we can leave the implementation detail to them to sort as they see fit.

Chris, your thoughts? If it looks good to you I'll make an actual patch on GitHub/Drupal.org.

Go for it, I think having a suggested patch, even if they chose to implement the feature in another way, is good.

comment:131 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.5
Total Hours changed from 49.58 to 50.08

Based on a look at the munin stats I have tweaked some things:

php-fpm processes

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/phpfpm_status.html

We had the default number of php-fpm processes that come with BOA, 18, we really don't need this many, so I have edited /opt/local/etc/php53-fpm.conf and changed these values:

;pm.start_servers = 18
pm.start_servers = 4

;pm.max_spare_servers = 18
pm.max_spare_servers = 4

And restarted:

/etc/init.d/php53-fpm restart

And updated the docs at wiki:PuffinServer#php-fpmconfigchanges

MySQL query cache

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/mysql_qcache_mem.html

This was increased from 512MB to 1GB 3 weeks ago but since then the memory use hasn't got above 460MB so I have reduced it to 512MB again via editing /etc/mysql/my.cnf:

query_cache_size        = 512M

MySQL connections

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/mysql_connections.html

We have hit the limit of 40 so that has been increased to 60 in /etc/mysql/my.cnf:

max_connections         = 60
max_user_connections    = 60

IO state graph

This had stopped working:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/iostat_ios.html

Fixed by removing the state files:

rm /var/lib/munin-node/plugin-state/nobody/iostat-ios.state /var/lib/munin/plugin-state/iostat-ios.state

Changed 3 years ago by chris

Attachment puffin_2013-12-15_load-pinpoint_1363530313_1387203913.png added

Puffin Load Spikes 2013

comment:132 Changed 3 years ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Status changed from new to closed
Resolution set to fixed
Total Hours changed from 50.08 to 50.18

I'm closing this ticket as the load spikes that started in May 2013 and persisted through to mid October 2013 are now not happening to the same extent, see this graph of the load on puffin:

This ticket isn't the only one related to this issue. There are also the following ones, which have been listed in the Load Spikes documentation, wiki:PuffinServer#LoadSpikes. The following is a sum of the time spent on these tickets:

16.35 ticket:483 Nginx 502 Bad Gateway Errors with BOA
00.55 ticket:543 Puffin Load Spike
01.00 ticket:552 Puffin Downtime 23rd May 2013
00.75 ticket:554 Site slow down and MySQL load increase
50.08 ticket:555 Load spikes causing the TN site to be stopped for 15 min at a time
06.95 ticket:563 503 Errors
01.60 ticket:569 403s served to editors, admin very slow
00.25 ticket:576 Site down

Total: 77.53 which is 77 hours 31 mins and 48 seconds -- approx two weeks' work.
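
The total and its conversion can be reproduced with awk. A small sketch, where `hours_to_hms` is a hypothetical helper and the hours are copied from the list above:

```shell
hours_to_hms()
{
  # $1 = decimal hours; rounds to the nearest second before splitting.
  awk -v h="$1" 'BEGIN {
    s = int(h * 3600 + 0.5)
    printf "%d hours %d mins %d seconds\n", int(s / 3600), int((s % 3600) / 60), s % 60
  }'
}

# Sum the per-ticket hours, one value per line.
total=$(printf '%s\n' 16.35 0.55 1.00 0.75 50.08 6.95 1.60 0.25 |
  awk '{ s += $1 } END { printf "%.2f\n", s }')

echo "Total: $total"
hours_to_hms "$total"
```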

comment:133 Changed 3 years ago by chris

Posting the documentation of this issue as of 2014-01-13 from the wiki:PuffinServer page here in anticipation of all this documentation being deleted after the next BOA update:

Load Spikes

The server has been suffering from load spikes which cause the site to be unresponsive for clients, you can see the current status via the puffin Munin load graph, note the Max values for the last day, week, month and year.

When the load hits 23.28 robots are served 403 Forbidden responses, when it hits 86.64 maintenance tasks are killed, and when it hits 113.28 the web server is stopped until the 5 min load average falls below 93.30.

The default thresholds have been changed as they were causing the site to shut down for 15 min at a time far too often. The current values were applied on 23rd October 2013.

The server has 14 CPU cores, see Unix-style load calculation, the current thresholds are generated from these variables in /root/.barracuda.cnf, the commented out values are the default ones:

#_LOAD_LIMIT_ONE=1444
#_LOAD_LIMIT_TWO=888
_LOAD_LIMIT_ONE=8664
_LOAD_LIMIT_TWO=5328

These variables are used by the /var/xdrago/second.sh script, which is run every minute via cron and has an internal loop which causes it to run 5 times, waiting 10 seconds between each run. It has the following variables in it (these have been edited from their default values):

ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg`
FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg`
CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328
CTL_ONEX_LOAD_CRIT=11328
CTL_FIVX_LOAD_CRIT=9330

These values translate to the following loads for comparison to the Munin graphs:

ONEX_LOAD: load average over the last minute times 100
FIVX_LOAD: load average over the last 5 minutes times 100
CTL_ONEX_SPIDER_LOAD: 23.28
CTL_FIVX_SPIDER_LOAD: 23.28
CTL_ONEX_LOAD: 86.64
CTL_FIVX_LOAD: 53.28
CTL_ONEX_LOAD_CRIT: 113.28
CTL_FIVX_LOAD_CRIT: 93.30

And the logic, translated into English, is:

If the load average over the last minute is greater than 23.28 and less than 86.64 and the nginx high load config isn't in use then start to use it.
Else if the load average over the last 5 mins is greater than 23.28 and less than 53.28 and the nginx high load config isn't in use then start to use it.
Else if the load average over the last minute is less than 23.28 and the load average over the last 5 mins is less than 23.28 and the nginx high load config is in use then stop using it.

If the load average over the last minute is greater than 113.28 and if /var/run/boa_run.pid exists, wait a second, if not kill some maintenance jobs: killall -9 php drush.php wget
Else if the load average over the last 5 mins is greater than 93.30 and if /var/run/boa_run.pid exists, wait a second, if not kill some maintenance jobs: killall -9 php drush.php wget

If the load average over the last minute is greater than 86.64 then kill the web server, killall -9 nginx and killall -9 php-fpm php-cgi
Else if the load average over the last 5 mins is greater than 53.28 then kill the web server, killall -9 nginx and killall -9 php-fpm php-cgi
Else restart all the services via /var/xdrago/proc_num_ctrl.cgi
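
The rules above can be sketched as a function that only reports which actions would be selected for a given pair of load values. This is not the BOA code: `describe_actions` is a hypothetical name, the variable values are the edited ones listed earlier in this documentation, and the checks for /var/run/boa_run.pid and for whether the high load config is already in use are left out.

```shell
# Thresholds as load * 100 (the edited variable values listed above).
CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328
CTL_ONEX_LOAD_CRIT=11328
CTL_FIVX_LOAD_CRIT=9330

describe_actions()
{
  # $1 = one minute load * 100, $2 = five minute load * 100 (integers).
  # Prints the actions the rules would select; it does not run anything.
  onex=$1
  fivx=$2

  if [ "$onex" -gt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$onex" -lt "$CTL_ONEX_LOAD" ]; then
    echo "nginx high load config: on"
  elif [ "$fivx" -gt "$CTL_FIVX_SPIDER_LOAD" ] && [ "$fivx" -lt "$CTL_FIVX_LOAD" ]; then
    echo "nginx high load config: on"
  elif [ "$onex" -lt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$fivx" -lt "$CTL_FIVX_SPIDER_LOAD" ]; then
    echo "nginx high load config: off"
  fi

  if [ "$onex" -gt "$CTL_ONEX_LOAD_CRIT" ] || [ "$fivx" -gt "$CTL_FIVX_LOAD_CRIT" ]; then
    echo "maintenance jobs: kill"
  fi

  if [ "$onex" -gt "$CTL_ONEX_LOAD" ] || [ "$fivx" -gt "$CTL_FIVX_LOAD" ]; then
    echo "web server: kill"
  else
    echo "web server: run"
  fi
}

# Example: the 20 Oct 2013 spike quoted earlier in the ticket (44.32 / 11.99).
describe_actions 4432 1199
```

The two arguments are in the same form as ONEX_LOAD and FIVX_LOAD above, so on the server they could be taken from the awk commands that read /proc/loadavg.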

Tickets generated in relation to these issues include:

ticket:483 Nginx 502 Bad Gateway Errors with BOA
ticket:543 Puffin Load Spike
ticket:552 Puffin Downtime 23rd May 2013
ticket:554 Site slow down and MySQL load increase
ticket:555 Load spikes causing the TN site to be stopped for 15 min at a time
ticket:563 503 Errors
ticket:569 403s served to editors, admin very slow
ticket:576 Site down

A total of 77.5 hours was spent on the tickets listed above, the final one was closed on 15th December 2013 and the total time was added up, see ticket:555#comment:132.
