#846 (Load Spikes on BOA PuffinServer) – Transition Technology

comment:1 Changed 19 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 0.0 to 0.25

The site was unavailable for a little time today due to a load spike, lfd email:

Date: Thu, 16 Apr 2015 11:47:48 +0100 (BST)
From: root@puffin.webarch.net
To: chris@webarchitects.co.uk
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 155.39


Time:                    Thu Apr 16 11:47:47 2015 +0100
1 Min Load Avg:          169.26
5 Min Load Avg:          155.39
15 Min Load Avg:         91.87
Running/Total Processes: 156/776

There is simply a gap in the munin graphs, so nothing to indicate the cause there.

Around the same time as the load spike are 107 entries in /var/log/syslog like this:

Apr 16 11:33:14 puffin mysqld: 150416 11:33:14 [Warning] Aborted connection 4573872 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
...
Apr 16 11:48:17 puffin mysqld: 150416 11:48:17 [Warning] Aborted connection 4573989 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)

In /var/log/php/php53-fpm-error.log we have a lot or errors around the same time, for example:

[16-Apr-2015 11:43:14] WARNING: [pool tn] server reached max_children setting (48), consider raising it
[16-Apr-2015 11:43:14] ERROR: failed to ptrace(PEEKDATA) pid 22039: Input/output error (5)
[16-Apr-2015 11:43:35] WARNING: [pool tn] child 21124 exited on signal 15 (SIGTERM) after 504.251025 seconds from start
[16-Apr-2015 11:43:40] WARNING: [pool tn] child 21202 exited on signal 15 (SIGTERM) after 476.476796 seconds from start
[16-Apr-2015 11:44:06] WARNING: [pool tn] child 21214 exited on signal 15 (SIGTERM) after 498.760981 seconds from start
[16-Apr-2015 11:44:06] ERROR: failed to ptrace(PEEKDATA) pid 22031: Input/output error (5)
[16-Apr-2015 11:44:59] WARNING: [pool tn] child 21018 exited on signal 15 (SIGTERM) after 631.765051 seconds from start
[16-Apr-2015 11:45:02] WARNING: [pool tn] child 21169 exited on signal 15 (SIGTERM) after 570.740053 seconds from start
[16-Apr-2015 11:45:03] WARNING: [pool tn] child 21196 exited on signal 15 (SIGTERM) after 562.125105 seconds from start
[16-Apr-2015 11:45:09] ERROR: failed to ptrace(PEEKDATA) pid 22086: Input/output error (5)
[16-Apr-2015 11:45:14] ERROR: failed to ptrace(PEEKDATA) pid 22096: Input/output error (5)
[16-Apr-2015 11:45:14] WARNING: [pool tn] child 22220, script '/data/disk/tn/static/transition-network-d6-35-p001b-booker/index.php' (request: "GET /index.php") executing too slow (70.534516 sec), logging
[16-Apr-2015 11:45:14] WARNING: [pool tn] child 22190, script '/data/disk/tn/static/transition-network-d6-35-p001b-booker/index.php' (request: "GET /index.php") executing too slow (108.971549 sec), logging
[16-Apr-2015 11:45:14] WARNING: [pool tn] child 22177, script '/data/disk/tn/static/transition-network-d6-35-p001b-booker/index.php' (request: "GET /index.php") executing too slow (113.345634 sec), logging
[16-Apr-2015 11:45:15] WARNING: [pool tn] child 22162, script '/data/disk/tn/static/transition-network-d6-35-p001b-booker/index.php' (request: "GET /index.php") executing too slow (140.537344 sec), logging
[16-Apr-2015 11:45:15] WARNING: [pool tn] child 22138, script '/data/disk/tn/static/transition-network-d6-35-p001b-booker/index.php' (request: "GET /index.php") executing too slow (162.128243 sec), logging
...

Then it looks like BOA killed the server:

[16-Apr-2015 11:45:42] WARNING: [pool tn] child 22115 exited on signal 15 (SIGTERM) after 228.834465 seconds from start
[16-Apr-2015 11:45:42] WARNING: [pool tn] child 22121 exited on signal 15 (SIGTERM) after 226.189439 seconds from start
[16-Apr-2015 11:45:42] WARNING: [pool tn] child 22138 exited on signal 15 (SIGTERM) after 212.408443 seconds from start
[16-Apr-2015 11:45:42] WARNING: [pool tn] child 21920 exited on signal 15 (SIGTERM) after 365.389874 seconds from start
[16-Apr-2015 11:45:44] WARNING: [pool tn] child 21388 exited on signal 15 (SIGTERM) after 506.200344 seconds from start
[16-Apr-2015 11:45:54] WARNING: [pool tn] child 22031 exited on signal 15 (SIGTERM) after 302.818195 seconds from start
[16-Apr-2015 11:45:54] WARNING: [pool tn] child 21441 exited on signal 15 (SIGTERM) after 471.575229 seconds from start
[16-Apr-2015 11:45:54] WARNING: [pool tn] child 21457 exited on signal 15 (SIGTERM) after 462.377788 seconds from start
[16-Apr-2015 11:45:54] WARNING: [pool tn] child 21819 exited on signal 15 (SIGTERM) after 433.145685 seconds from start
[16-Apr-2015 11:45:54] WARNING: [pool tn] child 21942 exited on signal 15 (SIGTERM) after 366.174700 seconds from start
[16-Apr-2015 11:45:54] WARNING: [pool tn] child 21994 exited on signal 15 (SIGTERM) after 337.253893 seconds from start
[16-Apr-2015 11:45:54] WARNING: [pool tn] child 21998 exited on signal 15 (SIGTERM) after 323.425116 seconds from start
[16-Apr-2015 11:45:54] WARNING: [pool tn] child 22019 exited on signal 15 (SIGTERM) after 307.843527 seconds from start
[16-Apr-2015 11:45:54] WARNING: [pool tn] child 22096 exited on signal 15 (SIGTERM) after 251.391717 seconds from start
[16-Apr-2015 11:45:56] WARNING: [pool tn] child 22022 exited on signal 15 (SIGTERM) after 308.369336 seconds from start
[16-Apr-2015 11:45:56] WARNING: [pool tn] child 22039 exited on signal 15 (SIGTERM) after 300.004118 seconds from start
[16-Apr-2015 11:45:56] WARNING: [pool tn] child 21247 exited on signal 15 (SIGTERM) after 596.245154 seconds from start
[16-Apr-2015 11:46:10] WARNING: [pool tn] child 22063 exited on signal 15 (SIGTERM) after 294.977738 seconds from start
[16-Apr-2015 11:46:40] WARNING: [pool tn] child 21323 exited on signal 15 (SIGTERM) after 601.894171 seconds from start
[16-Apr-2015 11:46:58] WARNING: [pool tn] child 21356 exited on signal 15 (SIGTERM) after 600.301272 seconds from start
[16-Apr-2015 11:46:58] WARNING: [pool tn] child 21425 exited on signal 15 (SIGTERM) after 552.032084 seconds from start
[16-Apr-2015 11:46:58] WARNING: [pool tn] child 21515 exited on signal 15 (SIGTERM) after 524.282000 seconds from start
[16-Apr-2015 11:46:58] WARNING: [pool tn] child 21552 exited on signal 15 (SIGTERM) after 524.116433 seconds from start
[16-Apr-2015 11:46:58] WARNING: [pool tn] child 21820 exited on signal 15 (SIGTERM) after 493.075464 seconds from start
[16-Apr-2015 11:46:58] WARNING: [pool tn] child 21917 exited on signal 15 (SIGTERM) after 444.102529 seconds from start
[16-Apr-2015 11:46:59] WARNING: [pool tn] child 21939 exited on signal 15 (SIGTERM) after 431.164583 seconds from start
[16-Apr-2015 11:47:00] WARNING: [pool tn] child 21949 exited on signal 15 (SIGTERM) after 419.710066 seconds from start
[16-Apr-2015 11:47:00] WARNING: [pool tn] child 21975 exited on signal 15 (SIGTERM) after 416.390367 seconds from start
[16-Apr-2015 11:47:00] WARNING: [pool tn] child 21992 exited on signal 15 (SIGTERM) after 405.098629 seconds from start
[16-Apr-2015 11:47:00] WARNING: [pool tn] child 22000 exited on signal 15 (SIGTERM) after 388.104313 seconds from start
[16-Apr-2015 11:47:00] WARNING: [pool tn] child 22086 exited on signal 15 (SIGTERM) after 321.356026 seconds from start
[16-Apr-2015 11:47:00] WARNING: [pool tn] child 22118 exited on signal 15 (SIGTERM) after 306.410884 seconds from start

In the /var/log/lfd.log we have this:

Apr 16 11:21:35 puffin lfd[38092]: csf (re)start completed
Apr 16 11:32:13 puffin lfd[20740]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:33:14 puffin lfd[20831]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:34:02 puffin lfd[20948]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:34:44 puffin lfd[21058]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:34:44 puffin lfd[20944]: Directory Watching terminated after 40 seconds
Apr 16 11:34:46 puffin lfd[20944]: LF_DIRWATCH taking 40 seconds, temporarily throttled to run every 900 seconds
Apr 16 11:35:14 puffin lfd[21132]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:35:53 puffin lfd[21227]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:36:31 puffin lfd[21312]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:37:42 puffin lfd[21419]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:38:13 puffin lfd[20639]: *LOAD* 5 minute load average is 63.60, threshold is 12 - email sent
Apr 16 11:42:34 puffin lfd[22124]: STATS: 15 sec. timeout performing iptables_log
Apr 16 11:42:49 puffin lfd[22168]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:44:36 puffin lfd[22241]: STATS: 15 sec. timeout performing iptables_log
Apr 16 11:45:05 puffin lfd[22302]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:45:48 puffin lfd[22753]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:45:54 puffin lfd[22994]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:46:58 puffin lfd[23740]: *Lock Error* [PT_LOAD] still active - section skipped
Apr 16 11:47:10 puffin lfd[24155]: *LOAD* 5 minute load average is 155.39, threshold is 12 - email sent
Apr 16 11:47:25 puffin lfd[22973]: (sshd) Failed SSH login from 90.157.113.153 (RU/Russian Federation/153.113-157-90.telenet.ru): 5 in the last 300 secs - *Blocked in csf* [LF_SSHD]
Apr 16 11:47:30 puffin lfd[22658]: *Error*: csf output: deny failed: 81.95.52.66 is in the allow file /etc/csf/csf.allow

Apr 16 11:47:30 puffin lfd[22658]: (sshd) Failed SSH login from 81.95.52.66 (GB/United Kingdom/nsa.rat.burntout.org): 5 in the last 300 secs - *Blocked in csf* [LF_SSHD]
Apr 16 11:49:27 puffin lfd[33980]: *LOAD* 5 minute load average is 131.14, threshold is 12 - email sent
Apr 16 11:51:30 puffin lfd[38042]: *LOAD* 5 minute load average is 111.98, threshold is 12 - email sent
Apr 16 11:52:11 puffin lfd[49976]: (sshd) Failed SSH login from 179.189.93.22 (BR/Brazil/-): 5 in the last 300 secs - *Blocked in csf* [LF_SSHD]
Apr 16 11:53:06 puffin lfd[59954]: (sshd) Failed SSH login from 177.20.162.97 (BR/Brazil/network-ppp177-162-97.prtelecom.com.br): 5 in the last 300 secs - *Blocked in csf* [LF_SSHD]
Apr 16 11:53:31 puffin lfd[63549]: *LOAD* 5 minute load average is 75.55, threshold is 12 - email sent
Apr 16 11:54:38 puffin lfd[5231]: 46.141.24.202 (IT/Italy/-), 11 distributed sshd attacks on account [admin] in the last 300 secs - *Blocked in csf* [LF_DISTATTACK]
Apr 16 11:55:43 puffin lfd[7238]: *LOAD* 5 minute load average is 51.00, threshold is 12 - email sent
Apr 16 11:57:45 puffin lfd[14052]: *LOAD* 5 minute load average is 37.26, threshold is 12 - email sent
Apr 16 11:59:45 puffin lfd[20997]: *LOAD* 5 minute load average is 25.00, threshold is 12 - email sent
Apr 16 12:00:28 puffin lfd[24081]: (sshd) Failed SSH login from 110.36.53.71 (PK/Pakistan/WimaxUser3653-71.wateen.net): 5 in the last 300 secs - *Blocked in csf* [LF_SSHD]
Apr 16 12:01:40 puffin lfd[27617]: (sshd) Failed SSH login from 5.172.125.182 (IT/Italy/ip-182-125-172-5-dyn-dsl.customer.digitelitalia.com): 5 in the last 300 secs - *Blocked in csf* [LF_SSHD]
Apr 16 12:01:49 puffin lfd[27991]: *LOAD* 5 minute load average is 18.76, threshold is 12 - email sent
Apr 16 12:02:50 puffin lfd[31578]: 109.161.192.203 (BH/Bahrain/-), 11 distributed sshd attacks on account [admin] in the last 300 secs - *Blocked in csf* [LF_DISTATTACK]
Apr 16 12:03:05 puffin lfd[33319]: (sshd) Failed SSH login from 31.192.163.159 (RU/Russian Federation/159.163-192-31.telenet.ru): 5 in the last 300 secs - *Blocked in csf* [LF_SSHD]
Apr 16 12:03:08 puffin lfd[38092]: TERM
Apr 16 12:03:08 puffin lfd[38092]: daemon stopped
Apr 16 12:19:06 puffin lfd[31159]: daemon started on puffin.webarch.net - csf v7.67 (generic)

I'm still unsure of the cause of the load spike.

comment:2 Changed 19 months ago by chris

Description modified (diff)

comment:3 Changed 19 months ago by chris

There was a load spike just now:

Date: Sun, 19 Apr 2015 11:15:49 +0100 (BST)
From: root@puffin.webarch.net
To: chris@webarchitects.co.uk
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 45.80

Time:                    Sun Apr 19 11:15:44 2015 +0100
1 Min Load Avg:          75.88
5 Min Load Avg:          45.80
15 Min Load Avg:         23.44
Running/Total Processes: 32/518

I think this was high enough for the webserver to be stopped, I also got a pingdom and Nagios alert about it.

comment:4 Changed 19 months ago by chris

Another just now:

Date: Wed, 22 Apr 2015 10:49:37 +0100 (BST)
From: root@puffin.webarch.net
To: chris@webarchitects.co.uk
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 60.50

Time:                    Wed Apr 22 10:49:37 2015 +0100
1 Min Load Avg:          118.47
5 Min Load Avg:          60.50
15 Min Load Avg:         25.26
Running/Total Processes: 42/524

comment:5 Changed 19 months ago by chris

It seems to have passed now, it spiked a bit higher than the last comment:

Date: Wed, 22 Apr 2015 11:09:31 +0100 (BST)
From: root@puffin.webarch.net
To: chris@webarchitects.co.uk
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 80.03

Time:                    Wed Apr 22 11:09:30 2015 +0100
1 Min Load Avg:          121.18
5 Min Load Avg:          80.03
15 Min Load Avg:         44.00
Running/Total Processes: 57/532

comment:6 Changed 19 months ago by chris

The email this is a reply to should have updated this ticket:

/trac/ticket/846

This is a test to try to work out why it didn't.

On Thu 23-Apr-2015 at 10:43:51AM +0100, Ade Stuart wrote:
> 
> The time frames for these peaks seem consistent. Is there a crone task
> happening between 10.45 and 11.15? My starter for ten would be that
> something is using up the system resource, and hence the spikes.
> Thoughts?

comment:7 Changed 19 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 0.25 to 0.5

OK, weird, my email worked, I have updated the whitelist in /etc/email2trac.conf -- perhaps that is the cause... I did read ticket:494 and ticket:819 trying to find an answer to this...

Regarding the crontab on PuffinServer, perhaps, but there are not any specific ones at that time:

* * * * * bash /var/xdrago/second.sh >/dev/null 2>&1
* * * * * bash /var/xdrago/minute.sh >/dev/null 2>&1
* * * * * bash /var/xdrago/runner.sh >/dev/null 2>&1
* * * * * bash /var/xdrago/manage_ltd_users.sh >/dev/null 2>&1
03 * * * * bash /var/xdrago/clear.sh >/dev/null 2>&1
14 * * * * bash /var/xdrago/purge_binlogs.sh >/dev/null 2>&1
01 0 * * * bash /var/xdrago/graceful.sh >/dev/null 2>&1
08 0 * * * bash /var/xdrago/mysql_backup.sh >/dev/null 2>&1
08 2 * * * bash /opt/local/bin/backboa backup >/dev/null 2>&1
08 8 * * * bash /opt/local/bin/duobackboa backup >/dev/null 2>&1
58 2 * * * bash /var/xdrago/daily.sh >/dev/null 2>&1
58 5 * * * bash /var/xdrago/fetch-remote.sh >/dev/null 2>&1

comment:8 Changed 19 months ago by chris

Just had a massive load spike:

uptime
 19:16:12 up 267 days,  5:59,  1 user,  load average: 198.45, 175.53, 159.21

Seems to be recovering...

comment:9 Changed 19 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 0.5 to 0.75

I did manage to ssh in (eventually) and get top to display what was going on just as the load was dropping off and it was bash and dash processes that were dominating so I have added dash and bash to this graph:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/multips_memory.html

Note that the php5-fpm memory usage is very much up over the last few weeks.

I have also added the multip plugin to track the number of processes (same ones as for multips_memory):

[multips]
env.names nginx php_fpm mysqld redis-server munin-node bash dash
user root

[multips_memory]
env.names nginx php-fpm mysqld redis-server munin-node bash dash
user root

BOA runs multiple bash and dash scripts all the time:

ps -lA | grep dash
4 S     0  9185  9175  1  80   0 -  1033 -      ?        00:00:01 dash
4 S     0  9190  9174  3  80   0 -  1033 -      ?        00:00:03 dash
4 S     0 11864 11861  0  80   0 -  1033 -      ?        00:00:00 dash
4 S     0 11872 11865  0  80   0 -  1033 -      ?        00:00:00 dash
4 S     0 11875 11866  0  80   0 -  1033 -      ?        00:00:00 dash
4 S     0 58966 58959  0  80   0 -  1033 -      ?        00:00:00 dash

ps -lA | grep bash
0 S     0 11924 11875  0  80   0 -  2721 -      ?        00:00:00 bash
4 S     0 15765 15759  0  80   0 -  2693 -      ?        00:00:00 bash
0 S     0 15803 15764  0  80   0 -  2721 -      ?        00:00:00 bash
0 S     0 15805 15762  0  80   0 -  2698 -      ?        00:00:00 bash
0 S  1000 39464 39201  0  80   0 -  5060 -      pts/0    00:00:06 bash
4 S     0 40513 40175  0  80   0 -  5217 -      pts/0    00:00:15 bash
0 S     0 51716     1  2  80   0 -  3180 -      ?        2-05:47:25 bash
0 S     0 51721     1  2  80   0 -  3180 -      ?        2-05:22:41 bash
0 S     0 51726     1  2  80   0 -  3180 -      ?        2-04:59:57 bash
0 S     0 51731     1  2  80   0 -  3180 -      ?        2-05:05:06 bash
0 S     0 51736     1  2  80   0 -  3180 -      ?        2-05:12:54 bash
4 S     0 58973 58966  0  80   0 -  2700 -      ?        00:00:00 bash

I also removed the pure-ftpd server from this list as we are no longer running it.

PS this ticket was closed as a dupe: ticket:852

Changed 18 months ago by chris

Attachment puffin-2015-05-22-load-day.png added

Changed 18 months ago by chris

Attachment puffin-2015-05-22-lmultips-day.png added

comment:10 Changed 18 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 0.75 to 1.0

We just had a load spike:

Date: Fri, 22 May 2015 13:14:39 +0100 (BST)
From: root@puffin.webarch.net
To: chris@webarchitects.co.uk
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 41.22

Time:                    Fri May 22 13:14:39 2015 +0100
1 Min Load Avg:          74.26
5 Min Load Avg:          41.22
15 Min Load Avg:         18.13
Running/Total Processes: 37/509

And there was a corresponding spike in bash and dash processes (not sure if this was the cause or a consequence):

From looking at top I have added to the list of processes the multips graphs track:

[multips]
env.names nginx php_fpm mysqld redis-server munin-node bash dash awk perl csf
user root

[multips_memory]
env.names nginx php-fpm mysqld redis-server munin-node bash dash awk perl csf
user root

See:

comment:11 Changed 18 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.18
Total Hours changed from 1.0 to 1.18

We just had a rather large load spike, one of the email alerts:

Date: Tue,  2 Jun 2015 17:19:28 +0100 (BST)
From: root@puffin.webarch.net
To: chris@webarchitects.co.uk
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 151.07

Time:                    Tue Jun  2 17:19:27 2015 +0100
1 Min Load Avg:          165.73
5 Min Load Avg:          151.07
15 Min Load Avg:         94.10
Running/Total Processes: 122/655

And the BOA firewall script has tried to block the Webarchitects Nagios monitoring server:

Date: Tue,  2 Jun 2015 17:19:39 +0100 (BST)
From: root@puffin.webarch.net
To: chris@webarchitects.co.uk
Subject: lfd on puffin.webarch.net: blocked 81.95.52.66 (GB/United Kingdom/nsa.rat.burntout.org)

Time:     Tue Jun  2 17:19:37 2015 +0100
IP:       81.95.52.66 (GB/United Kingdom/nsa.rat.burntout.org)
Failures: 5 (sshd)
Interval: 300 seconds
Blocked:  Permanent Block (IP match in csf.allow, block may not work)

Log entries:

Jun  2 17:19:02 puffin sshd[32711]: Did not receive identification string from 81.95.52.66
Jun  2 17:19:02 puffin sshd[32728]: Did not receive identification string from 81.95.52.66
Jun  2 17:19:03 puffin sshd[32756]: Did not receive identification string from 81.95.52.66
Jun  2 17:19:04 puffin sshd[310]: Did not receive identification string from 81.95.52.66
Jun  2 17:19:07 puffin sshd[413]: Did not receive identification string from 81.95.52.66

But it looks like it failed as the server is whitelisted.

I can't ssh to the server, the connection times out but I can access the XenShell however I don't have a password to hand as I always use ssh keys so I can't login.

So we could either wait for it to recover or I should shut it down and restart it, I'll wate 5 mins and see if it recovers...

Note that the server load has been so hight that it hasn't been able to generate any Munin stats -- we simply have gap in the graphs:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/index.html

comment:12 Changed 18 months ago by chris

Here is another of the emails, when this one was sent the 1 Min Load Avg was 180:

Date: Tue,  2 Jun 2015 17:21:48 +0100 (BST)
From: root@puffin.webarch.net
To: chris@webarchitects.co.uk
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 161.05

Time:                    Tue Jun  2 17:21:47 2015 +0100
1 Min Load Avg:          180.22
5 Min Load Avg:          161.05
15 Min Load Avg:         105.43
Running/Total Processes: 174/862

comment:13 Changed 18 months ago by chris

The website is producing 502 Bad Gateway errors, this means Nginx is up and running but it can't connect to php-fpm.

comment:14 Changed 18 months ago by chris

I'm going to shut it down and restart it.

comment:15 Changed 18 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 1.18 to 1.28

Running xm shutdown puffin.webarch.net wasn't resulting in the server being shutdown so I ran:

xm destroy puffin.webarch.net
xm create puffin.webarch.net.cfg

And it's on it's way back up now.

comment:16 Changed 18 months ago by chris

It's up again.

comment:17 Changed 18 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 1.28 to 1.53

Not a very big load spike on PuffinServer but it must have got a lot higher:

Date: Thu,  4 Jun 2015 09:54:16 +0100 (BST)
From: root@puffin.webarch.net
To: chris@webarchitects.co.uk
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 14.22

Time:                    Thu Jun  4 09:52:36 2015 +0100
1 Min Load Avg:          39.15
5 Min Load Avg:          14.22
15 Min Load Avg:         5.67
Running/Total Processes: 47/430

It is was not responding to HTTP or SSH requests, but I'm now in and it is recovering:

uptime
 10:12:11 up 1 day, 16:18,  1 user,  load average: 89.40, 126.87, 103.35

Looking in /var/log/syslog it was a kernel issue and we have had this a few times before:

/trac/search?q=blocked+for+more+than+120+seconds

This is what we have in syslog:

Jun  4 09:49:50 puffin mysqld: 150604  9:49:50 [Warning] Aborted connection 294666 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun  4 09:49:51 puffin mysqld: 150604  9:49:51 [Warning] Aborted connection 294667 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun  4 09:49:51 puffin mysqld: 150604  9:49:51 [Warning] Aborted connection 294665 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun  4 09:49:52 puffin mysqld: 150604  9:49:51 [Warning] Aborted connection 294664 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error)
Jun  4 09:50:03 puffin kernel: [143763.270448] INFO: task vnstatd:2101 blocked for more than 120 seconds.
Jun  4 09:50:03 puffin kernel: [143763.270470] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 09:50:03 puffin kernel: [143763.270479] vnstatd       D ffff8801fb30d4c0     0  2101      1 0x00000000
Jun  4 09:50:03 puffin kernel: [143763.270490]  ffff8801fb30d4c0 0000000000000282 ffff8801ffc15000 ffff880000009680
Jun  4 09:50:03 puffin kernel: [143763.270503]  ffff8801fd5a9d58 ffffffff81012cdb 000000000000f9e0 ffff8801fd5a9fd8
Jun  4 09:50:03 puffin kernel: [143763.270514]  00000000000157c0 00000000000157c0 ffff8801ff2a0710 ffff8801ff2a0a08
Jun  4 09:50:03 puffin kernel: [143763.270525] Call Trace:
Jun  4 09:50:03 puffin kernel: [143763.270537]  [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20
Jun  4 09:50:03 puffin kernel: [143763.270548]  [<ffffffff81155196>] ? cap_inode_permission+0x0/0x3
Jun  4 09:50:03 puffin kernel: [143763.270576]  [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192
Jun  4 09:50:03 puffin kernel: [143763.270587]  [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31
Jun  4 09:50:03 puffin kernel: [143763.270595]  [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d
Jun  4 09:50:03 puffin kernel: [143763.270603]  [<ffffffff810f86aa>] ? __link_path_walk+0x323/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.270611]  [<ffffffff810f8caa>] ? path_walk+0x66/0xc9
Jun  4 09:50:03 puffin kernel: [143763.270619]  [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77
Jun  4 09:50:03 puffin kernel: [143763.270627]  [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b
Jun  4 09:50:03 puffin kernel: [143763.270635]  [<ffffffff8130f054>] ? _spin_lock_irqsave+0x15/0x34
Jun  4 09:50:03 puffin kernel: [143763.270644]  [<ffffffff81068fa6>] ? hrtimer_try_to_cancel+0x3a/0x43
Jun  4 09:50:03 puffin kernel: [143763.270653]  [<ffffffff8102de30>] ? pvclock_clocksource_read+0x3a/0x8b
Jun  4 09:50:03 puffin kernel: [143763.270662]  [<ffffffff81103705>] ? alloc_fd+0x67/0x10c
Jun  4 09:50:03 puffin kernel: [143763.270696]  [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc
Jun  4 09:50:03 puffin kernel: [143763.270705]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
Jun  4 09:50:03 puffin kernel: [143763.270763] INFO: task id:11812 blocked for more than 120 seconds.
Jun  4 09:50:03 puffin kernel: [143763.270769] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 09:50:03 puffin kernel: [143763.270777] id            D ffff8801fcd9b880     0 11812  11811 0x00000080
Jun  4 09:50:03 puffin kernel: [143763.270787]  ffff8801fcd9b880 0000000000000286 ffff880002b27d88 ffff8801fedc1c20
Jun  4 09:50:03 puffin kernel: [143763.270798]  00000000000006fe ffffffff81012cdb 000000000000f9e0 ffff880002b27fd8
Jun  4 09:50:03 puffin kernel: [143763.270809]  00000000000157c0 00000000000157c0 ffff8801fabf3170 ffff8801fabf3468
Jun  4 09:50:03 puffin kernel: [143763.270821] Call Trace:
Jun  4 09:50:03 puffin kernel: [143763.270827]  [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20
Jun  4 09:50:03 puffin kernel: [143763.270836]  [<ffffffff81155196>] ? cap_inode_permission+0x0/0x3
Jun  4 09:50:03 puffin kernel: [143763.270845]  [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192
Jun  4 09:50:03 puffin kernel: [143763.270854]  [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31
Jun  4 09:50:03 puffin kernel: [143763.270861]  [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d
Jun  4 09:50:03 puffin kernel: [143763.270869]  [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.270876]  [<ffffffff810f8caa>] ? path_walk+0x66/0xc9
Jun  4 09:50:03 puffin kernel: [143763.270884]  [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77
Jun  4 09:50:03 puffin kernel: [143763.270892]  [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b
Jun  4 09:50:03 puffin kernel: [143763.270901]  [<ffffffff810d217e>] ? vma_link+0x74/0x9a
Jun  4 09:50:03 puffin kernel: [143763.270908]  [<ffffffff81103705>] ? alloc_fd+0x67/0x10c
Jun  4 09:50:03 puffin kernel: [143763.270915]  [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc
Jun  4 09:50:03 puffin kernel: [143763.270923]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
Jun  4 09:50:03 puffin kernel: [143763.270931] INFO: task cron:11813 blocked for more than 120 seconds.
Jun  4 09:50:03 puffin kernel: [143763.270937] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 09:50:03 puffin kernel: [143763.270945] cron          D ffff8801fb1de2e0     0 11813   7882 0x00000080
Jun  4 09:50:03 puffin kernel: [143763.270954]  ffff8801fb1de2e0 0000000000000282 0000000000000001 00000001810423b0
Jun  4 09:50:03 puffin kernel: [143763.270965]  ffff88000bb357c0 0000000000000001 000000000000f9e0 ffff880002ab1fd8
Jun  4 09:50:03 puffin kernel: [143763.270976]  00000000000157c0 00000000000157c0 ffff8801ecdf2350 ffff8801ecdf2648
Jun  4 09:50:03 puffin kernel: [143763.270987] Call Trace:
Jun  4 09:50:03 puffin kernel: [143763.270994]  [<ffffffff811976a6>] ? vsnprintf+0x40a/0x449
Jun  4 09:50:03 puffin kernel: [143763.271002]  [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192
Jun  4 09:50:03 puffin kernel: [143763.271011]  [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31
Jun  4 09:50:03 puffin kernel: [143763.271018]  [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d
Jun  4 09:50:03 puffin kernel: [143763.271025]  [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271033]  [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317
Jun  4 09:50:03 puffin kernel: [143763.271041]  [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271049]  [<ffffffff810f8caa>] ? path_walk+0x66/0xc9
Jun  4 09:50:03 puffin kernel: [143763.271057]  [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77
Jun  4 09:50:03 puffin kernel: [143763.271064]  [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b
Jun  4 09:50:03 puffin kernel: [143763.271072]  [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20
Jun  4 09:50:03 puffin kernel: [143763.271081]  [<ffffffff8130dda8>] ? thread_return+0x79/0xe0
Jun  4 09:50:03 puffin kernel: [143763.271089]  [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50
Jun  4 09:50:03 puffin kernel: [143763.271097]  [<ffffffff81103705>] ? alloc_fd+0x67/0x10c
Jun  4 09:50:03 puffin kernel: [143763.271104]  [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc
Jun  4 09:50:03 puffin kernel: [143763.271111]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
Jun  4 09:50:03 puffin kernel: [143763.271119] INFO: task cron:11814 blocked for more than 120 seconds.
Jun  4 09:50:03 puffin kernel: [143763.271125] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 09:50:03 puffin kernel: [143763.271133] cron          D ffff8801fc48e2e0     0 11814   7882 0x00000080
Jun  4 09:50:03 puffin kernel: [143763.271142]  ffff8801fc48e2e0 0000000000000286 0000000000000001 00000001810423b0
Jun  4 09:50:03 puffin kernel: [143763.271154]  ffff88000bc617c0 000000000000000b 000000000000f9e0 ffff88007f0d7fd8
Jun  4 09:50:03 puffin kernel: [143763.271165]  00000000000157c0 00000000000157c0 ffff8801ecdf0710 ffff8801ecdf0a08
Jun  4 09:50:03 puffin kernel: [143763.271175] Call Trace:
Jun  4 09:50:03 puffin kernel: [143763.271181]  [<ffffffff811976a6>] ? vsnprintf+0x40a/0x449
Jun  4 09:50:03 puffin kernel: [143763.271189]  [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192
Jun  4 09:50:03 puffin kernel: [143763.271198]  [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31
Jun  4 09:50:03 puffin kernel: [143763.271205]  [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d
Jun  4 09:50:03 puffin kernel: [143763.271213]  [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271221]  [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317
Jun  4 09:50:03 puffin kernel: [143763.271229]  [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271237]  [<ffffffff810f8caa>] ? path_walk+0x66/0xc9
Jun  4 09:50:03 puffin kernel: [143763.271244]  [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77
Jun  4 09:50:03 puffin kernel: [143763.271252]  [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b
Jun  4 09:50:03 puffin kernel: [143763.271259]  [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20
Jun  4 09:50:03 puffin kernel: [143763.271268]  [<ffffffff8130dda8>] ? thread_return+0x79/0xe0
Jun  4 09:50:03 puffin kernel: [143763.271275]  [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50
Jun  4 09:50:03 puffin kernel: [143763.271283]  [<ffffffff81103705>] ? alloc_fd+0x67/0x10c
Jun  4 09:50:03 puffin kernel: [143763.271291]  [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc
Jun  4 09:50:03 puffin kernel: [143763.271298]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
Jun  4 09:50:03 puffin kernel: [143763.271305] INFO: task cron:11815 blocked for more than 120 seconds.
Jun  4 09:50:03 puffin kernel: [143763.271311] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 09:50:03 puffin kernel: [143763.271319] cron          D ffff8801fdfc8e20     0 11815   7882 0x00000080
Jun  4 09:50:03 puffin kernel: [143763.271328]  ffff8801fdfc8e20 0000000000000286 0000000000000001 00000001810423b0
Jun  4 09:50:03 puffin kernel: [143763.271339]  ffff88000bc617c0 000000000000000b 000000000000f9e0 ffff880002fa5fd8
Jun  4 09:50:03 puffin kernel: [143763.271350]  00000000000157c0 00000000000157c0 ffff8801ecdf7810 ffff8801ecdf7b08
Jun  4 09:50:03 puffin kernel: [143763.271362] Call Trace:
Jun  4 09:50:03 puffin kernel: [143763.271368]  [<ffffffff811976a6>] ? vsnprintf+0x40a/0x449
Jun  4 09:50:03 puffin kernel: [143763.271376]  [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192
Jun  4 09:50:03 puffin kernel: [143763.271385]  [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31
Jun  4 09:50:03 puffin kernel: [143763.271392]  [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d
Jun  4 09:50:03 puffin kernel: [143763.271400]  [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271408]  [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317
Jun  4 09:50:03 puffin kernel: [143763.271416]  [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271423]  [<ffffffff810f8caa>] ? path_walk+0x66/0xc9
Jun  4 09:50:03 puffin kernel: [143763.271431]  [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77
Jun  4 09:50:03 puffin kernel: [143763.271439]  [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b
Jun  4 09:50:03 puffin kernel: [143763.271446]  [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20
Jun  4 09:50:03 puffin kernel: [143763.271455]  [<ffffffff8130dda8>] ? thread_return+0x79/0xe0
Jun  4 09:50:03 puffin kernel: [143763.271462]  [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50
Jun  4 09:50:03 puffin kernel: [143763.271470]  [<ffffffff81103705>] ? alloc_fd+0x67/0x10c
Jun  4 09:50:03 puffin kernel: [143763.271478]  [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc
Jun  4 09:50:03 puffin kernel: [143763.271485]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
Jun  4 09:50:03 puffin kernel: [143763.271492] INFO: task cron:11816 blocked for more than 120 seconds.
Jun  4 09:50:03 puffin kernel: [143763.271498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 09:50:03 puffin kernel: [143763.271506] cron          D ffff8801fdfcc6a0     0 11816   7882 0x00000080
Jun  4 09:50:03 puffin kernel: [143763.271516]  ffff8801fdfcc6a0 0000000000000286 0000000000000001 00000001810423b0
Jun  4 09:50:03 puffin kernel: [143763.271527]  ffff88000bb537c0 ffffffff81012cdb 000000000000f9e0 ffff88007d50ffd8
Jun  4 09:50:03 puffin kernel: [143763.271537]  00000000000157c0 00000000000157c0 ffff8801ecdf0e20 ffff8801ecdf1118
Jun  4 09:50:03 puffin kernel: [143763.271549] Call Trace:
Jun  4 09:50:03 puffin kernel: [143763.271554]  [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20
Jun  4 09:50:03 puffin kernel: [143763.271563]  [<ffffffff81155196>] ? cap_inode_permission+0x0/0x3
Jun  4 09:50:03 puffin kernel: [143763.271571]  [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192
Jun  4 09:50:03 puffin kernel: [143763.271580]  [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31
Jun  4 09:50:03 puffin kernel: [143763.271587]  [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d
Jun  4 09:50:03 puffin kernel: [143763.271594]  [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271602]  [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317
Jun  4 09:50:03 puffin kernel: [143763.271610]  [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271618]  [<ffffffff810f8caa>] ? path_walk+0x66/0xc9
Jun  4 09:50:03 puffin kernel: [143763.271625]  [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77
Jun  4 09:50:03 puffin kernel: [143763.271633]  [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b
Jun  4 09:50:03 puffin kernel: [143763.271641]  [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20
Jun  4 09:50:03 puffin kernel: [143763.271649]  [<ffffffff8130dda8>] ? thread_return+0x79/0xe0
Jun  4 09:50:03 puffin kernel: [143763.271657]  [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50
Jun  4 09:50:03 puffin kernel: [143763.271665]  [<ffffffff81103705>] ? alloc_fd+0x67/0x10c
Jun  4 09:50:03 puffin kernel: [143763.271672]  [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc
Jun  4 09:50:03 puffin kernel: [143763.271679]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
Jun  4 09:50:03 puffin kernel: [143763.271687] INFO: task cron:11817 blocked for more than 120 seconds.
Jun  4 09:50:03 puffin kernel: [143763.271693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 09:50:03 puffin kernel: [143763.271701] cron          D ffff88007f04b880     0 11817   7882 0x00000080
Jun  4 09:50:03 puffin kernel: [143763.271710]  ffff88007f04b880 0000000000000286 0000000000000001 00000001810423b0
Jun  4 09:50:03 puffin kernel: [143763.271721]  ffff88000bc077c0 0000000000000008 000000000000f9e0 ffff88007d079fd8
Jun  4 09:50:03 puffin kernel: [143763.271732]  00000000000157c0 00000000000157c0 ffff8801ecdf62e0 ffff8801ecdf65d8
Jun  4 09:50:03 puffin kernel: [143763.271743] Call Trace:
Jun  4 09:50:03 puffin kernel: [143763.271749]  [<ffffffff811976a6>] ? vsnprintf+0x40a/0x449
Jun  4 09:50:03 puffin kernel: [143763.271757]  [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192
Jun  4 09:50:03 puffin kernel: [143763.271766]  [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31
Jun  4 09:50:03 puffin kernel: [143763.271773]  [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d
Jun  4 09:50:03 puffin kernel: [143763.271780]  [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271789]  [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317
Jun  4 09:50:03 puffin kernel: [143763.271796]  [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271804]  [<ffffffff810f8caa>] ? path_walk+0x66/0xc9
Jun  4 09:50:03 puffin kernel: [143763.271812]  [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77
Jun  4 09:50:03 puffin kernel: [143763.271819]  [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b
Jun  4 09:50:03 puffin kernel: [143763.271827]  [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20
Jun  4 09:50:03 puffin kernel: [143763.271836]  [<ffffffff8130dda8>] ? thread_return+0x79/0xe0
Jun  4 09:50:03 puffin kernel: [143763.271843]  [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50
Jun  4 09:50:03 puffin kernel: [143763.271851]  [<ffffffff81103705>] ? alloc_fd+0x67/0x10c
Jun  4 09:50:03 puffin kernel: [143763.271858]  [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc
Jun  4 09:50:03 puffin kernel: [143763.271865]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
Jun  4 09:50:03 puffin kernel: [143763.271873] INFO: task cron:11818 blocked for more than 120 seconds.
Jun  4 09:50:03 puffin kernel: [143763.271879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 09:50:03 puffin kernel: [143763.271886] cron          D ffff8801ecdf7810     0 11818   7882 0x00000080
Jun  4 09:50:03 puffin kernel: [143763.271895]  ffff8801ecdf7810 0000000000000282 0000000000000001 00000001810423b0
Jun  4 09:50:03 puffin kernel: [143763.271906]  ffff88000bc7f7c0 000000000000000c 000000000000f9e0 ffff88007d0b3fd8
Jun  4 09:50:03 puffin kernel: [143763.271917]  00000000000157c0 00000000000157c0 ffff8801ecdf2a60 ffff8801ecdf2d58
Jun  4 09:50:03 puffin kernel: [143763.271928] Call Trace:
Jun  4 09:50:03 puffin kernel: [143763.271934]  [<ffffffff811976a6>] ? vsnprintf+0x40a/0x449
Jun  4 09:50:03 puffin kernel: [143763.271942]  [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192
Jun  4 09:50:03 puffin kernel: [143763.271950]  [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31
Jun  4 09:50:03 puffin kernel: [143763.271957]  [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d
Jun  4 09:50:03 puffin kernel: [143763.271964]  [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271972]  [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317
Jun  4 09:50:03 puffin kernel: [143763.271980]  [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.271988]  [<ffffffff810f8caa>] ? path_walk+0x66/0xc9
Jun  4 09:50:03 puffin kernel: [143763.271995]  [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77
Jun  4 09:50:03 puffin kernel: [143763.272003]  [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b
Jun  4 09:50:03 puffin kernel: [143763.272011]  [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20
Jun  4 09:50:03 puffin kernel: [143763.482981]  [<ffffffff8130dda8>] ? thread_return+0x79/0xe0
Jun  4 09:50:03 puffin kernel: [143763.482993]  [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50
Jun  4 09:50:03 puffin kernel: [143763.483003]  [<ffffffff81103705>] ? alloc_fd+0x67/0x10c
Jun  4 09:50:03 puffin kernel: [143763.483012]  [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc
Jun  4 09:50:03 puffin kernel: [143763.483020]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
Jun  4 09:50:03 puffin kernel: [143763.483030] INFO: task id:11820 blocked for more than 120 seconds.
Jun  4 09:50:03 puffin kernel: [143763.483037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 09:50:03 puffin kernel: [143763.483045] id            D ffff8801fabba350     0 11820  11819 0x00000080
Jun  4 09:50:03 puffin kernel: [143763.483055]  ffff8801fabba350 0000000000000286 ffff88007d593d88 ffff8801fedc1c20
Jun  4 09:50:03 puffin kernel: [143763.483067]  00000000000006fe ffff880000009680 000000000000f9e0 ffff88007d593fd8
Jun  4 09:50:03 puffin kernel: [143763.483078]  00000000000157c0 00000000000157c0 ffff8801fabb8710 ffff8801fabb8a08
Jun  4 09:50:03 puffin kernel: [143763.483089] Call Trace:
Jun  4 09:50:03 puffin kernel: [143763.483097]  [<ffffffff810b9b8b>] ? zone_watermark_ok+0x20/0xb1
Jun  4 09:50:03 puffin kernel: [143763.483107]  [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192
Jun  4 09:50:03 puffin kernel: [143763.483115]  [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31
Jun  4 09:50:03 puffin kernel: [143763.483123]  [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d
Jun  4 09:50:03 puffin kernel: [143763.483131]  [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.483139]  [<ffffffff810f8caa>] ? path_walk+0x66/0xc9
Jun  4 09:50:03 puffin kernel: [143763.483146]  [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77
Jun  4 09:50:03 puffin kernel: [143763.483154]  [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b
Jun  4 09:50:03 puffin kernel: [143763.483163]  [<ffffffff810d217e>] ? vma_link+0x74/0x9a
Jun  4 09:50:03 puffin kernel: [143763.483170]  [<ffffffff81103705>] ? alloc_fd+0x67/0x10c
Jun  4 09:50:03 puffin kernel: [143763.483178]  [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc
Jun  4 09:50:03 puffin kernel: [143763.483185]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
Jun  4 09:50:03 puffin kernel: [143763.483194] INFO: task awk:11827 blocked for more than 120 seconds.
Jun  4 09:50:03 puffin kernel: [143763.483200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 09:50:03 puffin kernel: [143763.483207] awk           D ffff88019001a350     0 11827  11823 0x00000080
Jun  4 09:50:03 puffin kernel: [143763.483216]  ffff88019001a350 0000000000000282 0000000000000000 0000000000000000
Jun  4 09:50:03 puffin kernel: [143763.483228]  0000000000000000 ffffffff81012cdb 000000000000f9e0 ffff88007f0d9fd8
Jun  4 09:50:03 puffin kernel: [143763.483239]  00000000000157c0 00000000000157c0 ffff8801fabb8000 ffff8801fabb82f8
Jun  4 09:50:03 puffin kernel: [143763.483249] Call Trace:
Jun  4 09:50:03 puffin kernel: [143763.483256]  [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20
Jun  4 09:50:03 puffin kernel: [143763.483266]  [<ffffffff81155196>] ? cap_inode_permission+0x0/0x3
Jun  4 09:50:03 puffin kernel: [143763.506896]  [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192
Jun  4 09:50:03 puffin kernel: [143763.506910]  [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31
Jun  4 09:50:03 puffin kernel: [143763.506919]  [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d
Jun  4 09:50:03 puffin kernel: [143763.506927]  [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.506935]  [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317
Jun  4 09:50:03 puffin kernel: [143763.506943]  [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5
Jun  4 09:50:03 puffin kernel: [143763.506952]  [<ffffffff810f8caa>] ? path_walk+0x66/0xc9
Jun  4 09:50:03 puffin kernel: [143763.506960]  [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77
Jun  4 09:50:03 puffin kernel: [143763.506967]  [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b
Jun  4 09:50:03 puffin kernel: [143763.506975]  [<ffffffff8100c4e8>] ? pte_pfn_to_mfn+0x21/0x30
Jun  4 09:50:03 puffin kernel: [143763.506983]  [<ffffffff8100cc43>] ? xen_make_pte+0x7b/0x83
Jun  4 09:50:03 puffin kernel: [143763.506991]  [<ffffffff8100e585>] ? xen_set_pte_at+0xa2/0xc2
Jun  4 09:50:03 puffin kernel: [143763.507000]  [<ffffffff8105cd8e>] ? do_sigaction+0x159/0x171
Jun  4 09:50:03 puffin kernel: [143763.507008]  [<ffffffff81103705>] ? alloc_fd+0x67/0x10c
Jun  4 09:50:03 puffin kernel: [143763.507016]  [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc
Jun  4 09:50:03 puffin kernel: [143763.507023]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b

It has more-or-less recovered now:

uptime
 10:20:55 up 1 day, 16:26,  1 user,  load average: 2.21, 24.19, 60.46

I have reset the root passwords on all three servers so that I can login via the xen consol when needed.

comment:18 Changed 18 months ago by chris

The load alert emails didn't go out while the load was so high, they have come in now, this was the peak:

Date: Thu,  4 Jun 2015 10:08:33 +0100 (BST)
From: root@puffin.webarch.net
To: chris@webarchitects.co.uk
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 150.63

Time:                    Thu Jun  4 10:08:33 2015 +0100
1 Min Load Avg:          171.68
5 Min Load Avg:          150.63
15 Min Load Avg:         100.19
Running/Total Processes: 114/959

comment:19 Changed 18 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.05
Total Hours changed from 1.53 to 1.58

Looks like another load spike now, uptime is not producing any results and I can't long via the xen console, I'm going to restart the server.

comment:20 Changed 18 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.03
Total Hours changed from 1.58 to 1.61

It's on it's way back up now, the firewall is loading and this always takes ages.

Last edited 18 months ago by chris (previous) (diff)

comment:23 follow-up: ↓ 24 Changed 18 months ago by ade

Hi Chris,
Do you have an understanding as to why this is happening?
It seems that as well as the known issues, such as the the boa replication
issues, the site going down is now becoming a regular issue, and we need to
ascertain why this is the case.

Your thoughts?

best regards
Ade

On 4 June 2015 at 10:36, Transition Technology Trac <
trac@tech.transitionnetwork.org> wrote:

> #846: Load Spikes on BOA PuffinServer
> -------------------------------------+-------------------------------------
>            Reporter:  chris          |                      Owner:  chris
>                Type:  maintenance    |                     Status:  new
>            Priority:  major          |                  Milestone:
>           Component:  Live server    |  Maintenance
>            Keywords:                 |                 Resolution:
> Add Hours to Ticket:  0.03           |  Estimated Number of Hours:  0.0
>         Total Hours:  1.58           |                  Billable?:  1
> -------------------------------------+-------------------------------------
> Changes (by chris):
>
>  * hours:  0.0 => 0.03
>  * totalhours:  1.58 => 1.61
>
>
> Comment:
>
>  It's on it's way back up now, the firewall is loading and tyhis always
>  takes '''ages'''.
>
> --
> Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:20
> >
> Transition Technology <https://tech.transitionnetwork.org/trac>
> Support and issues tracking for the Transition Network Web Project.
>



-- 
Ade Stuart
Web Manager - Transition network

07595 331877

The Transition Network is a registered charity
address: 43 Fore St, Totnes, Devon, TQ9 5HN, UK
website: www.transitionnetwork.org
TN company no: 6135675 TN charity no: 1128675

comment:24 in reply to: ↑ 23 Changed 18 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 1.61 to 1.86

Replying to ade:

Do you have an understanding as to why this is happening?

Not really, we have had a long history of load spikes with BOA but have never really managed to get to the bottom of it, probably because there is more than one cause, I could point out lots of past tickets on this if that would help?

Of course load spikes are not specific to BOA but BOA makes it really hard to work out what is going on and even harder to intervene to try to fix / mitigate things.

It seems that as well as the known issues, such as the the boa replication
issues, the site going down is now becoming a regular issue, and we need to
ascertain why this is the case.

Your thoughts?

Yes it would be good to know the cause(s) of these, but to be honest I don't know if we ever will -- we have spent a lot of time on the load spike suicides in the past... see PuffinServer#LoadSpikes and:

/trac/wiki/PuffinServerBoaLoadSpikes

Personally I'd be more interested in spending time sorting out a non-BOA server than spending even more time trying understand what BOA is doing, however regarding this specific load spike and kernel issue it might not be BOA related -- there are lots of links related to "task blocked for more than 120 seconds xen":

https://duckduckgo.com/html/?q=task+blocked+for+more+than+120+seconds+xen

Would you like me to spend some time looking into this?

Regarding the time the BOA firewall takes to load, mentioned in ticket:846#comment:20, I just timed how long it takes just to list the firewall rules, 30 mins (!!!) -- there rules are over 5k lines long:

date; iptables -L > /tmp/ip.txt ; date
  Thu Jun  4 10:44:57 BST 2015
  Thu Jun  4 11:15:29 BST 2015
cat /tmp/ip.txt | wc -l
  5239

comment:25 Changed 18 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 1.86 to 2.11

Just had another load spike on PuffinServer, list is a list of the load alert emails from ldf, Saturday, Sunday and today:

Jun 05 High 5 minute load average alert - 14.69
Jun 05 High 5 minute load average alert - 28.96
Jun 05 High 5 minute load average alert - 19.62
Jun 05 High 5 minute load average alert - 13.18
Jun 06 High 5 minute load average alert - 14.89
Jun 06 High 5 minute load average alert - 30.80
Jun 06 High 5 minute load average alert - 53.29
Jun 06 High 5 minute load average alert - 65.00
Jun 06 High 5 minute load average alert - 61.44
Jun 06 High 5 minute load average alert - 29.48
Jun 06 High 5 minute load average alert - 19.79
Jun 06 High 5 minute load average alert - 15.18
Jun 07 High 5 minute load average alert - 13.54
Jun 07 High 5 minute load average alert - 69.17
Jun 07 High 5 minute load average alert - 96.37
Jun 07 High 5 minute load average alert - 90.83
Jun 07 High 5 minute load average alert - 88.37
Jun 07 High 5 minute load average alert - 85.94
Jun 07 High 5 minute load average alert - 85.82
Jun 07 High 5 minute load average alert - 87.89
Jun 07 High 5 minute load average alert - 91.81
Jun 07 High 5 minute load average alert - 97.62
Jun 07 High 5 minute load average alert - 105.56
Jun 07 High 5 minute load average alert - 107.19
Jun 07 High 5 minute load average alert - 114.79
Jun 07 High 5 minute load average alert - 42.17
Jun 07 High 5 minute load average alert - 28.30
Jun 07 High 5 minute load average alert - 19.12
Jun 07 High 5 minute load average alert - 12.91
Jun 07 High 5 minute load average alert - 17.87
Jun 07 High 5 minute load average alert - 18.92
Jun 07 High 5 minute load average alert - 14.26
Jun 07 High 5 minute load average alert - 13.46
Jun 07 High 5 minute load average alert - 20.99
Jun 07 High 5 minute load average alert - 15.80
Jun 08 High 5 minute load average alert - 15.42
Jun 08 High 5 minute load average alert - 19.44
Jun 08 High 5 minute load average alert - 96.48
Jun 08 High 5 minute load average alert - 111.23
Jun 08 High 5 minute load average alert - 73.42
Jun 08 High 5 minute load average alert - 49.12
Jun 08 High 5 minute load average alert - 32.89

The peak of the latest one:

Date: Mon,  8 Jun 2015 10:37:52 +0100 (BST)
From: root@puffin.webarch.net
To: chris@webarchitects.co.uk
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 111.23

Time:                    Mon Jun  8 10:37:52 2015 +0100
1 Min Load Avg:          102.54
5 Min Load Avg:          111.23
15 Min Load Avg:         62.46
Running/Total Processes: 3/472

It has recovered now, I couldn't find anything in the logs.

comment:26 Changed 17 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 2.11 to 2.36

There was a load spike just after 4am this morning, this first ldf alert:

Date: Tue,  7 Jul 2015 03:55:46 +0100 (BST)
From: root@puffin.webarch.net
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 13.07

Time:                    Tue Jul  7 03:54:54 2015 +0100
1 Min Load Avg:          38.25
5 Min Load Avg:          13.07
15 Min Load Avg:         5.17
Running/Total Processes: 52/458

The peak:

Date: Tue,  7 Jul 2015 04:09:28 +0100 (BST)
From: root@puffin.webarch.net
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 94.54

Time:                    Tue Jul  7 04:09:22 2015 +0100
1 Min Load Avg:          94.67
5 Min Load Avg:          94.54
15 Min Load Avg:         66.86
Running/Total Processes: 64/685

And the last alert:

Date: Tue,  7 Jul 2015 04:21:59 +0100 (BST)
From: root@puffin.webarch.net
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 12.87

Time:                    Tue Jul  7 04:21:59 2015 +0100
1 Min Load Avg:          1.04
5 Min Load Avg:          12.87
15 Min Load Avg:         36.69
Running/Total Processes: 1/376

There were a lot of 503 errors served during this time, almost all to bots, for example:

"66.249.92.63" www.transitionnetwork.org [07/Jul/2015:04:13:41 +0100] "GET /news/network/feed HTTP/1.1" 503 206 341 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=12394840076319634974)" 0.000 "-"
"207.46.13.110, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:13:45 +0100] "GET /initiatives/glastonbury HTTP/1.0" 503 206 406 371 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.001 "-"
"66.249.92.60" www.transitionnetwork.org [07/Jul/2015:04:13:52 +0100] "GET /about HTTP/1.1" 503 206 243 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=2875799361570896823)" 0.000 "-"
"207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:13:55 +0100] "GET /blogs/rob-hopkins/2013-01-14/british-bean-back-interview-josiah-meldrum-hodmedod-and-transition HTTP/1.1" 503 206 343 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"65.19.138.34" www.transitionnetwork.org [07/Jul/2015:04:13:58 +0100] "GET /blogs/feed/rob-hopkins/atom/ HTTP/1.1" 503 206 399 371 "-" "Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)" 0.142 "-"
"207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:14:05 +0100] "GET /tags/colin-godmans-farm HTTP/1.1" 503 206 271 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"66.249.92.63" www.transitionnetwork.org [07/Jul/2015:04:14:07 +0100] "GET /blogs/feed HTTP/1.1" 503 206 334 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 10 subscribers; feed-id=2865359885998846808)" 0.000 "-"
"65.19.138.33" www.transitionnetwork.org [07/Jul/2015:04:14:12 +0100] "GET /blogs/feed HTTP/1.1" 503 206 339 371 "-" "Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)" 0.177 "-"
"207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:14:15 +0100] "GET /people/nyle-seabright HTTP/1.1" 503 206 269 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"66.249.78.181" www.transitionnetwork.org [07/Jul/2015:04:14:18 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 503 206 347 376 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.000 "-"
"207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:14:25 +0100] "GET /people/adam-walker1 HTTP/1.1" 503 206 267 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"94.228.34.249" www.transitionnetwork.org [07/Jul/2015:04:14:32 +0100] "GET /forums/themes/local-government HTTP/1.1" 503 206 321 376 "-" "magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)" 0.000 "-"
"207.46.13.110, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:14:35 +0100] "GET /initiatives/building-resilience-mission-brim HTTP/1.0" 503 206 427 371 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"66.249.92.63" www.transitionnetwork.org [07/Jul/2015:04:14:39 +0100] "GET /rss.xml HTTP/1.1" 503 206 245 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=2525513441670857354)" 0.000 "-"
"207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:14:45 +0100] "GET /tags/leicesershire HTTP/1.1" 503 206 266 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"185.2.129.62" www.transitionnetwork.org [07/Jul/2015:04:14:49 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 503 206 243 376 "http://transitionculture.org/?feed=rss2" "SimplePie/1.0 Beta (Feed Parser; http://www.simplepie.org/; Allow like Gecko) Build/20060129" 0.000 "-"
"185.2.129.62" www.transitionnetwork.org [07/Jul/2015:04:14:49 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 503 206 243 376 "http://transitionculture.org/?feed=rss2" "SimplePie/1.0 Beta (Feed Parser; http://www.simplepie.org/; Allow like Gecko) Build/20060129" 0.000 "-"
"66.249.92.57" www.transitionnetwork.org [07/Jul/2015:04:14:49 +0100] "GET /news/feed HTTP/1.1" 503 206 332 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 4 subscribers; feed-id=7675538851756851077)" 0.000 "-"
"207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:14:55 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 503 206 271 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"66.249.92.60" www.transitionnetwork.org [07/Jul/2015:04:15:00 +0100] "GET /blogs/feed/rob-hopkins/2013 HTTP/1.1" 503 206 265 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=3110067181455394580)" 0.000 "-"
"66.249.78.181" www.transitionnetwork.org [07/Jul/2015:04:15:02 +0100] "GET /sites/default/files/resize/remote/b7409b46d23f7dd65940714c8201cc07-92x100.jpg HTTP/1.1" 503 206 313 376 "-" "Googlebot-Image/1.0" 0.000 "-"
"207.46.13.130" www.transitionnetwork.org [07/Jul/2015:04:15:05 +0100] "GET /user/login?destination=news/2013-12-16/bank-england-takes-note-local-money HTTP/1.1" 503 206 354 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"66.249.92.63" www.transitionnetwork.org [07/Jul/2015:04:15:10 +0100] "GET /news/network/feed HTTP/1.1" 503 206 341 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=12394840076319634974)" 0.000 "-"
"207.46.13.130" www.transitionnetwork.org [07/Jul/2015:04:15:05 +0100] "GET /user/login?destination=news/2013-12-16/bank-england-takes-note-local-money HTTP/1.1" 503 206 354 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"66.249.92.63" www.transitionnetwork.org [07/Jul/2015:04:15:10 +0100] "GET /news/network/feed HTTP/1.1" 503 206 341 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=12394840076319634974)" 0.000 "-"
"66.249.78.217, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:11 +0100] "GET /robots.txt HTTP/1.0" 503 206 401 371 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.011 "-"
"127.0.0.1" localhost [07/Jul/2015:04:15:14 +0100] "GET /robots.txt HTTP/1.1" 404 162 179 333 "-" "Wget/1.13.4 (linux-gnu)" 0.000 "-"
"207.46.13.130" www.transitionnetwork.org [07/Jul/2015:04:15:15 +0100] "GET /tags/publications HTTP/1.1" 503 206 297 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"94.228.34.249" www.transitionnetwork.org [07/Jul/2015:04:15:15 +0100] "GET /forum/rss/forum/transitiongroup-food HTTP/1.1" 503 206 327 376 "-" "magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)" 0.000 "-"
"66.249.92.62, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:20 +0100] "GET /news/feed HTTP/1.0" 503 206 379 371 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=730710629253691737)" 0.001 "-"
"207.46.13.130" www.transitionnetwork.org [07/Jul/2015:04:15:25 +0100] "GET /tags/militarisation HTTP/1.1" 503 206 299 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"94.228.34.250" www.transitionnetwork.org [07/Jul/2015:04:15:25 +0100] "GET /forum/rss/tags/covered-market HTTP/1.1" 503 206 320 376 "-" "magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)" 0.000 "-"
"66.249.92.57" www.transitionnetwork.org [07/Jul/2015:04:15:30 +0100] "GET /blogs/feed HTTP/1.1" 503 206 334 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 10 subscribers; feed-id=2865359885998846808)" 0.000 "-"
"178.63.75.73" www.transitionnetwork.org [07/Jul/2015:04:15:31 +0100] "GET /taxonomy/term/998/all HTTP/1.0" 503 206 247 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.037 "-"
"178.63.75.73" www.transitionnetwork.org [07/Jul/2015:04:15:33 +0100] "GET /people/lloyd-williams HTTP/1.0" 503 206 247 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.023 "-"
"207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:15:35 +0100] "GET /blogs/rob-hopkins/2014-04/katrina-brown-resilience-impact-and-learning HTTP/1.1" 503 206 318 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"188.40.120.19" www.transitionnetwork.org [07/Jul/2015:04:15:35 +0100] "GET /forums/process/web-and-comms/free-crowdmapping-solution-web-mobile-apps HTTP/1.0" 503 206 297 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.023 "-"
"144.76.100.237" www.transitionnetwork.org [07/Jul/2015:04:15:38 +0100] "GET /forums/process/web-and-comms/are-dedicated-servers-best-business HTTP/1.0" 503 206 290 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.023 "-"
"136.243.16.102, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:42 +0100] "GET /forums/process/web-and-comms/eco-friendly-printing-and-print-manufacturers HTTP/1.0" 503 206 461 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.000 "-"
"136.243.16.102, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:43 +0100] "GET /people/pravin-ganore HTTP/1.0" 503 206 407 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.001 "-"
"136.243.16.102, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:44 +0100] "GET /taxonomy/term/999/0/feed HTTP/1.0" 503 206 411 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.001 "-"
"136.243.16.102, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:44 +0100] "GET /people/neha-patel HTTP/1.0" 503 206 404 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.000 "-"
"66.249.78.195" www.transitionnetwork.org [07/Jul/2015:04:15:44 +0100] "GET /sites/default/files/resize/remote/ceb9808fa3768e278199f7c502612729-550x366.jpg HTTP/1.1" 503 206 314 376 "-" "Googlebot-Image/1.0" 0.000 "-"
"207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:15:45 +0100] "GET /people/janev-cameron HTTP/1.1" 503 206 268 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-"
"136.243.16.102, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:46 +0100] "GET /forums/process/web-and-comms/free-tool-creating-interactive-iphoneandroid-ebooks HTTP/1.0" 503 206 467 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.000 "-"
"68.180.228.44" www.transitionnetwork.org [07/Jul/2015:04:15:47 +0100] "GET /tags/sustainable-bays HTTP/1.1" 503 206 288 376 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 0.000 "-"
"151.80.31.122" booker-stage-20140501.transitionnetwork.org [07/Jul/2015:04:15:47 +0100] "GET /events/partner HTTP/1.1" 404 134 208 336 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" 0.000 "1.32"
"66.249.92.60" www.transitionnetwork.org [07/Jul/2015:04:15:49 +0100] "GET /news/feed HTTP/1.1" 503 206 332 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 4 subscribers; feed-id=7675538851756851077)" 0.000 "-"

I don't know if all the bots caused the load spike or if they were a victim of the load spike.

Note there was a 404 request above for http://booker-stage-20140501.transitionnetwork.org/robots.txt -- the staging sites should have robots.txt files to ensure they are now crawled or indexed, however this site isn't a copy of the main site: http://booker-stage-20140501.transitionnetwork.org/ so I don't think there is a issue here.

Last edited 17 months ago by chris (previous) (diff)

comment:27 follow-up: ↓ 28 Changed 17 months ago by paul

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 2.36 to 2.61

That stage website has been removed a while back.

Do we have a Google Webmaster account for transitionnetwork.org website? It maybe worth investigating how to manually remove this page from the index.

The most recent stage site does have an appropriate robots.txt but not sure how this was achieved. Next time we build a platform: I'll follow the notes in the wiki and check that the robots.txt file is correct.

https://booker-stage-20150319.transitionnetwork.org/robots.txt

User-agent: *
Disallow: /
Sitemap: http://www.transitionnetwork.org/sitemap.xml

Last edited 17 months ago by paul (previous) (diff)

comment:28 in reply to: ↑ 27 Changed 17 months ago by chris

Replying to paul:

That stage website has been removed a while back.

Right, I realised that half through writing it up in the comment above.

Do we have a Google Webmaster account for transitionnetwork.org website? It maybe worth investigating how to manually remove this page from the index.

Somebody should have one? Ask Sam perhaps?

The most recent stage site does have an appropriate robots.txt but not sure how this was achieved.

I think robots.txt files are created for staging sites but not for the default BOA splash pages for domains matching the wild card DNS, for example:

So I'd suggest this is an issue for BOA (the lack for robots.txt files), however as we appear to now have a policy to not apply BOA updates (see ticket:754#comment:60) there seems to be no point is raising an issue with them since even if they fix it the fix won't be appied to PuffinServer...?

There was another load spike just now, the peak:

Date: Tue,  7 Jul 2015 12:06:16 +0100 (BST)
From: root@puffin.webarch.net
Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 59.74

Time:                    Tue Jul  7 12:06:16 2015 +0100
1 Min Load Avg:          106.45
5 Min Load Avg:          59.74
15 Min Load Avg:         26.01
Running/Total Processes: 112/600

comment:29 Changed 17 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 2.61 to 2.71

Oops forgot to add time to the comment above.

comment:30 Changed 16 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.15
Total Hours changed from 2.71 to 2.86

We are seeing very high load spikes on a daily basis on PuffinServer at the moment, these are the subject lines from some of the alert emails:

Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 12.55
Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 34.23
Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 73.63
Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 120.41
Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 88.71
Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 59.66
Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 39.93
Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 26.81
Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 17.95
Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 12.07

Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 21.47
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 76.39
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 59.15
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 39.67
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 26.66
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 17.89
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 12.05
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 70.32
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 57.11
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 38.34
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 25.73
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 17.31
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 12.63
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 25.05
Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 16.87

Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 12.04
Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 40.73
Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 92.19
Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 68.15
Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 45.92
Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 30.78
Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 21.26

The tops of these peaks are not picked up by Munin, see:

https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/load.html

I'm not convinced that it is worth spending time on this issue any more, if anyone disagrees please let me know.

Last edited 16 months ago by chris (previous) (diff)

comment:31 Changed 15 months ago by chris

Big load spike today, I'll post some stats about it on this ticket.

Changed 15 months ago by chris

Attachment puffin-2015-08-25-load-day.png added

comment:32 Changed 15 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 2.86 to 3.11

These are the subject lines from the ldf load alert emails from today, these are sent every 5 mins when the load is above 10 or so:

Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 15.84
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 26.13
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 17.56
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.28
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 32.92
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 30.15
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 20.27
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 17.02
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.57
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.04
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 31.83
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 38.05
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 26.59
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 19.76
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.28
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 16.31
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 21.37
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.40
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.99
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 22.18
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 15.10
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.05
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.05
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 33.59
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 46.81
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 33.24
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 22.31
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 15.13
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.46
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.23
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 32.07
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 57.67
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 44.66
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 30.17
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 20.29
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.73
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.51
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 23.35
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 20.75
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.97
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.99
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 29.44
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 32.96
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 15.04
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.95
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.63
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.32
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 31.23
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 70.98
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 78.94
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 77.84
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 86.56
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 106.85
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 112.84
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 105.13
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 93.73
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 97.14
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 100.15
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 104.10
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 101.91
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 97.32
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 30.86
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 19.46
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 16.40
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 30.10
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 38.24
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 35.38
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 23.76
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 19.16
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.97
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.08
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 31.02
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 53.04
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 56.32
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 55.31
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 78.01
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 76.02
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 61.12
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 40.95
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 16.53
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 16.12
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.48
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 32.67
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 31.74
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 21.30
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.62
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 31.64

The tops of these peaks haven't shown up in Munin as usual:

This was a, worse than usual, load spike but we get lots of these every day, it has always been the case with BOA, see wiki:PuffinServer#LoadSpikes, I don't know what can be done about it, given the current server architecture, other than putting up with them?

Last edited 15 months ago by chris (previous) (diff)

Changed 15 months ago by chris

Attachment puffin-load-year-2015-08-28.png added

Changed 15 months ago by chris

Attachment puffin-multips_memory-year-2015-08-28.png added

comment:33 Changed 15 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.5
Total Hours changed from 3.11 to 3.61

I have just spent some time looking through the Puffin Munin stats and this graph indicates the load increase on the server over the last year:

And the only other graph which has changed this much over the year is the PHP-FPM memory usage and this is basically related to the number of processes running (the process size hasn't changed much):

Note that a couple of years ago when trying to manually set the number of PHP-FPM processes to address BOA load spikes, we did have high and load values for this, for example:

These examples were taken from a quick look through some of the wiki:PuffinServerBoaLoadSpikes#BOALoadSpikes tickets.

This autumn will mark the 3rd anniversary of the Transition Network using BOA as it's Drupal hosting platform, personally I'd be happy to move back to a system that we have more control over so we have a chance to find the cause of issues such as these load spikes and address them in an effective way.

We just had a big load spike while writing this comment and watching it in top there were lots of perl, sh and php processes running as the aegir user (the sites runs as tn.web) -- if we want to be able to do something about the downtime caused by all the load spikes we need a server we can control and with BOA one is not in control (this is the whole point of it...).

Last edited 15 months ago by chris (previous) (diff)

comment:34 Changed 15 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 3.61 to 3.86

I have been looking at the Munin stats for Puffin again, trying to work out why the server has been having so many load spikes, and these graphs are perhaps indicating something, these stats are from MySQL:

These from redis, which caches MySQL data:

Firewall connections:

And direction of the traffic:

This last graph seems to indicate that the cause is additional incoming connections.

Changed 15 months ago by chris

Attachment puffin_2015-09-01_logstalgia.png added

comment:35 Changed 15 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.5
Total Hours changed from 3.86 to 4.36

I have spent some time looking at the wiki:WebServerLogs#logstalgia graphical representation of the Nginx access.log (which is like watching pong!) and the things that were very noticeable:

Huge number of requests for /?q=node/add, 8,867 today, 17,113 yesterday, 18,082 the day before.
Huge number of requests for /user/register, 10,496 today, 19,898 yesterday, 20,907 the day before.
Lot of requests for /people/kion-kashefi, 3,277 today, 6,291 yesterday, 6,451 the day before.

All I can conclude from this is that there are a huge number of spam bots out there... and this could be the cause of the change in behaviour of the site and why nothing shows up in the Piwik stats (as Piwik stats are from people not bots).

comment:36 Changed 15 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.43
Total Hours changed from 4.36 to 4.79

Looking at yesterdays Nginx access.log.1 via wiki:WebServerLogs#goaccess there are a huge number of requests via Google proxies:

Volume      No   IP
------------------------------
697.66 KB - 8075 66.249.93.187
694.56 KB - 7944 66.249.84.252
703.52 KB - 7916 66.249.93.179
690.28 KB - 7869 66.249.84.198
668.66 KB - 7810 66.249.84.194
734.59 KB - 7797 66.249.81.219
826.61 KB - 7773 66.249.81.225
673.94 KB - 7750 66.249.93.183
764.81 KB - 7700 66.249.81.222
374.77 KB - 4262 66.102.6.213 
361.39 KB - 4146 66.102.6.205
350.12 KB - 4091 66.102.6.209

You can get the reverse proxy via:

This is what we have for the top requests by hits:

      3.45 KB - 41616 /feeds/importer/transition_blogs_importer/15800
      1.19 MB - 4524  /
    288.78 MB - 3491  /blogs/feed/rob-hopkins/
     78.69 MB - 2633  /news/feed 
     24.63 MB - 2019  /blogs/feed/rob-hopkins   
      1.71 MB - 1684  /robots.txt

I have no idea why there are over 41k requests for this blank page and these are all from Google -- Google is requesting this URL multiple times per second!

"66.249.84.252" www.transitionnetwork.org [01/Sep/2015:19:13:33 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.633 "-"
"66.249.84.252" www.transitionnetwork.org [01/Sep/2015:19:13:33 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.514 "-"
"66.249.93.187" www.transitionnetwork.org [01/Sep/2015:19:13:33 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.493 "-"

Run this command on the server to see, it's really crazy:

sudo -i
tail -f  /var/log/nginx/access.log | grep transition_blogs_importer

Some of the URL's are like this:

"66.249.93.187" www.transitionnetwork.org [01/Sep/2015:19:15:06 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=15765967025011620143&hub.verify_token=ab6f372cedf4209205adf1b9abb9315f&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"

So... does anyone know anyone at Transition Voice? Perhaps they should be contected to see if they have an idea about this behaviour? Or should we contact Google?

I'm very tempted to block all these Google IP addresses via the firewall (I don't think we can do it at a Nginx level as BOA is running the server) to see if that improves the situation with the server... Anyone have any thoughts on this?

Last edited 15 months ago by chris (previous) (diff)

comment:37 Changed 15 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 4.79 to 4.89

I sent a (unencrypted) question via http://transitionvoice.com/contact-us/ and moments later all the Google proxy requests stopped... how weird is that!?

Only requests we now have from the Google subnet addresses are from the Googlebot, eg:

"66.249.75.150" www.transitionnetwork.org [01/Sep/2015:19:39:39 +0100] "GET /tags/newsletter?page=4 HTTP/1.1" 301 178 443 394 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.000 "-"
"66.249.75.3, 127.0.0.1" www.transitionnetwork.org [01/Sep/2015:19:39:59 +0100] "GET /blogs/feed/rob-hopkins HTTP/1.0" 200 66986 392 67674 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.024 "3.49"
"66.249.75.150" www.transitionnetwork.org [01/Sep/2015:19:40:34 +0100] "GET /tags/newsletter HTTP/1.1" 200 9943 436 10656 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 16.405 "5.12"
"66.249.75.11, 127.0.0.1" www.transitionnetwork.org [01/Sep/2015:19:40:38 +0100] "GET /tags/capital HTTP/1.0" 200 7163 444 7797 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.586 "3.95"
"66.249.75.168" www.transitionnetwork.org [01/Sep/2015:19:40:58 +0100] "GET /tags/tamzin-pinkerton HTTP/1.1" 200 7375 320 8088 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.798 "3.93"

comment:38 Changed 15 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 4.89 to 4.99

In terms of registration attempts we are seeing 4-6k a day:

grep POST /var/log/nginx/access.log.1 | grep register | wc -l
4477
zgrep POST /var/log/nginx/access.log.2.gz | grep register | wc -l
4886
zgrep POST /var/log/nginx/access.log.3.gz | grep register | wc -l
5880
zgrep POST /var/log/nginx/access.log.4.gz | grep register | wc -l
4423

A vast number of them from this User Agent:

"Mozilla/5.0 (iPhone; CPU iPhone OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) CriOS/39.0.2171.50 Mobile/12B440 Safari/600.1.4"

But the vast nimber with this User Agent string are only from today -- I guess botnets randomise their UA strings:

grep CriOS /var/log/nginx/access.log | wc -l
17183
grep CriOS /var/log/nginx/access.log.1 | wc -l
5282
zgrep CriOS /var/log/nginx/access.log.2 | wc -l
8

My conclusion to all of this is that I suspect most the problems we have had with the load on the server recently has been caused by bots...

Last edited 15 months ago by chris (previous) (diff)

comment:39 Changed 15 months ago by chris

So the load spiked and low and behold loads of these requests again:

"66.249.93.179" www.transitionnetwork.org [01/Sep/2015:21:05:23 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 37.734 "-"
"66.249.93.187" www.transitionnetwork.org [01/Sep/2015:21:05:23 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 38.393 "-"
"66.249.81.222" www.transitionnetwork.org [01/Sep/2015:21:05:23 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 39.092 "-"
"66.249.81.219" www.transitionnetwork.org [01/Sep/2015:21:05:24 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 40.474 "-"
"66.249.93.183" www.transitionnetwork.org [01/Sep/2015:21:05:25 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 24.565 "-"
"66.249.81.219" www.transitionnetwork.org [01/Sep/2015:21:05:25 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 25.067 "-"
"66.249.84.194" www.transitionnetwork.org [01/Sep/2015:21:05:28 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 43.830 "-"
"66.249.93.183" www.transitionnetwork.org [01/Sep/2015:21:05:43 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=10142066725310477262&hub.verify_token=1eda61c9d1f2bf3e476094acdaf91702&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.93.187" www.transitionnetwork.org [01/Sep/2015:21:05:44 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=6319814073685007505&hub.verify_token=9e89505416e7a1a19987fbfd5b01491a&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.252" www.transitionnetwork.org [01/Sep/2015:21:05:45 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=107922229115863848&hub.verify_token=03272b1b2ff8d44849ccf86666975a1f&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 508 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.81.219" www.transitionnetwork.org [01/Sep/2015:21:05:46 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=16352637103625808541&hub.verify_token=a15234a3b3d477b771799e41d7ad34f4&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.93.187" www.transitionnetwork.org [01/Sep/2015:21:05:46 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=1240898511006880980&hub.verify_token=ebf54df1b54bf1a62cf19fc3cbf13a8f&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.194" www.transitionnetwork.org [01/Sep/2015:21:05:47 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=15419583452535399630&hub.verify_token=af68e5ec296fa25fc7c43e12eae3ee41&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.198" www.transitionnetwork.org [01/Sep/2015:21:05:48 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=8707460082747624150&hub.verify_token=cdca72c68fd1e269d69ed9e9b24238c4&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.93.183" www.transitionnetwork.org [01/Sep/2015:21:05:49 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=11887861429882477665&hub.verify_token=e8b1d38570d15518e9f9dbcd5c6b9257&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.194" www.transitionnetwork.org [01/Sep/2015:21:05:49 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=3110730584339548525&hub.verify_token=25800bcdd6b0106352e5ae3e076b56cb&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.252" www.transitionnetwork.org [01/Sep/2015:21:05:50 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=2476781462168896036&hub.verify_token=beb3f197b835cb1f821ac88209b5fbcc&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.81.222" www.transitionnetwork.org [01/Sep/2015:21:05:51 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=5726390847248026995&hub.verify_token=54d54561037e8777b201926f27ad25f3&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.93.183" www.transitionnetwork.org [01/Sep/2015:21:05:51 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=8578319575267852037&hub.verify_token=bb331ae48606c991244c8c53dc830c1a&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.252" www.transitionnetwork.org [01/Sep/2015:21:05:52 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=16276261841447425468&hub.verify_token=43947eb43c6cf833cc6e00f0b2addc5c&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.93.179" www.transitionnetwork.org [01/Sep/2015:21:06:03 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 62.004 "-"

comment:40 Changed 15 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 4.99 to 5.09

These are the IP addresses I'm tempted to block:

grep transition_blogs_importer /var/log/nginx/access.log.1 | awk '{ print $1}' | sort -u 
"66.102.6.205"
"66.102.6.209"
"66.102.6.213"
"66.249.81.195"
"66.249.81.198"
"66.249.81.201"
"66.249.81.204"
"66.249.81.219"
"66.249.81.222"
"66.249.81.225"
"66.249.84.194"
"66.249.84.198"
"66.249.84.252"
"66.249.93.179"
"66.249.93.183"
"66.249.93.187"
"66.249.93.213"
"66.249.93.216"
"66.249.93.219"
"66.249.93.222"

comment:41 Changed 15 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 5.09 to 5.19

We can't simply block these Google proxy IP's as they do a small number of legitimate requests as well as the massive volume of bogus ones, eg:

grep 66.249.81.219 /var/log/nginx/access.log | grep -v transition_blogs_importer
"2003:57:ee1d:5b01:b1c3:ec90:76e7:40c4, 66.249.81.219" www.transitionnetwork.org [01/Sep/2015:09:52:27 +0100] "GET / HTTP/1.1" 200 11794 263 12478 "-" "Google favicon" 0.036 "4.57"
"66.249.81.219" www.transitionnetwork.org [01/Sep/2015:10:49:14 +0100] "GET /rss.xml HTTP/1.1" 200 26504 230 27183 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=2525513441670857354)" 0.095 "3.40"
grep 66.249.81.222 /var/log/nginx/access.log | grep -v transition_blogs_importer
"66.249.81.222" www.transitionnetwork.org [01/Sep/2015:20:49:56 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 200 67015 246 67767 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=5471198439844167222)" 0.085 "3.49"
grep 66.249.81.225 /var/log/nginx/access.log | grep -v transition_blogs_importer
"66.249.81.225" www.transitionnetwork.org [01/Sep/2015:19:41:32 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 200 67015 246 67767 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=5471198439844167222)" 0.090 "3.49"

So I have added the following to /var/aegir/config/server_master/nginx/pre.d/nginx_wild_ssl.conf:

  location /feeds/importer/transition_blogs_importer/15800 {
    return 403;
  }

Not sure what the effect will be as the requests have stopped again...

comment:42 Changed 15 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 5.19 to 5.44

We just has a small load spike:

WARNINGs: load is 7.46 (outside range [:6]).

And at the same time 2 requests a second from FeedFetcher-Google to feeds/importer/transition_blogs_importer/15800, these are just the last ones:

"66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:19 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=17061375092290510245&hub.verify_token=ce2b689348783c89d4722373ecda9bed&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:19 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.549 "-"
"66.249.81.225" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.477 "-"
"66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=9971194119277265194&hub.verify_token=3313f02505f7d33aab87c368df13a700&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=11023786745442642030&hub.verify_token=d2c4c29eb64e468966c69f533132871a&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.517 "-"
"66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=14715150916754042873&hub.verify_token=3d84aef446543bf2b85f56694c7b6468&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=17982434088673727697&hub.verify_token=0697bb324e57665a4f20694cb6bebc8f&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.497 "-"
"66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.501 "-"
"66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=6040636127189422897&hub.verify_token=84128e5a7cbf890a53b7a6c1cbb5ff3b&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=12494725620258192599&hub.verify_token=f62b6ac0aca1eeb79ec70a3acdbd2976&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.492 "-"
"66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:23 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=11605899247776179542&hub.verify_token=c15417975bad95cdcb7d731e18cfd0e4&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:23 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.535 "-"
"66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.641 "-"
"66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=6671767172620812224&hub.verify_token=3b4f373fa62274a5dc1df9092223bc0b&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=1122299795865433797&hub.verify_token=3fea7d8a2dcffe41c90593140a58154f&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.589 "-"
"66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:25 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.476 "-"
"66.102.6.213" www.transitionnetwork.org [02/Sep/2015:13:37:26 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.496 "-"
"66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:27 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.500 "-"
"66.102.6.205" www.transitionnetwork.org [02/Sep/2015:13:37:27 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.521 "-"
"66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:28 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=5950156067159212032&hub.verify_token=d81f046d7d811ddd073d5085d49609de&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:28 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=14864592753011071735&hub.verify_token=a160f1d2f40967f584fe2a48f6943767&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.249.93.179" www.transitionnetwork.org [02/Sep/2015:13:37:29 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=1114708068878419733&hub.verify_token=014eefe3c76d5456099a00380653dbff&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
"66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:29 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=5789640148520582546&hub.verify_token=4bbd1b66686203c49a7c0cc1a19cbbe3&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"

So I tracked down the Ngnix config file, /data/disk/tn/config/includes/nginx_vhost_common.conf and added:

# google feed fetcher
location ^~ /feeds/importer/transition_blogs_importer/15800 {
    return 403;
}

So next time it comes back rather than these requests being sent to php-fpm they will get a fast, low cost, response from Nginx...

Last edited 15 months ago by chris (previous) (diff)

comment:51 follow-up: ↓ 52 Changed 15 months ago by sam

Hi all

If the load spikes/ downtime are related to the blog import, & this
change doesn't fix it then I think we could consider just turning off
the blog imports? I don't think its massively read?

Thanks

Sam

On 2 September 2015 at 13:47, Transition Technology Trac
<trac@tech.transitionnetwork.org> wrote:
> #846: Load Spikes on BOA PuffinServer
> -------------------------------------+-------------------------------------
>            Reporter:  chris          |                      Owner:  chris
>                Type:  maintenance    |                     Status:  new
>            Priority:  major          |                  Milestone:
>           Component:  Live server    |  Maintenance
>            Keywords:                 |                 Resolution:
> Add Hours to Ticket:  0.25           |  Estimated Number of Hours:  0.0
>         Total Hours:  5.19           |                  Billable?:  1
> -------------------------------------+-------------------------------------
> Changes (by chris):
>
>  * hours:  0.0 => 0.25
>  * totalhours:  5.19 => 5.44
>
>
> Comment:
>
>  We just has a small load spike:
>
>  {{{
>  WARNINGs: load is 7.46 (outside range [:6]).
>  }}}
>
>  And at the same time 2 requests a second from !FeedFetcher-Google to
>  `feeds/importer/transition_blogs_importer/15800`, this is just the last
>  ones:
>
>  {{{
>  "66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:19 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=17061375092290510245&hub.verify_token=ce2b689348783c89d4722373ecda9bed&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:19 +0100]
>  "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0
>  "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.549
>  "-"
>  "66.249.81.225" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100]
>  "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0
>  "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.477
>  "-"
>  "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=9971194119277265194&hub.verify_token=3313f02505f7d33aab87c368df13a700&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=11023786745442642030&hub.verify_token=d2c4c29eb64e468966c69f533132871a&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100]
>  "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0
>  "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.517
>  "-"
>  "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=14715150916754042873&hub.verify_token=3d84aef446543bf2b85f56694c7b6468&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=17982434088673727697&hub.verify_token=0697bb324e57665a4f20694cb6bebc8f&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100]
>  "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0
>  "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.497
>  "-"
>  "66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] "GET
>  /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-"
>  "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.501 "-"
>  "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=6040636127189422897&hub.verify_token=84128e5a7cbf890a53b7a6c1cbb5ff3b&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=12494725620258192599&hub.verify_token=f62b6ac0aca1eeb79ec70a3acdbd2976&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100]
>  "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0
>  "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.492
>  "-"
>  "66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:23 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=11605899247776179542&hub.verify_token=c15417975bad95cdcb7d731e18cfd0e4&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:23 +0100]
>  "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0
>  "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.535
>  "-"
>  "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100]
>  "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0
>  "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.641
>  "-"
>  "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=6671767172620812224&hub.verify_token=3b4f373fa62274a5dc1df9092223bc0b&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=1122299795865433797&hub.verify_token=3fea7d8a2dcffe41c90593140a58154f&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] "GET
>  /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-"
>  "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.589 "-"
>  "66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:25 +0100]
>  "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0
>  "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.476
>  "-"
>  "66.102.6.213" www.transitionnetwork.org [02/Sep/2015:13:37:26 +0100] "GET
>  /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-"
>  "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.496 "-"
>  "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:27 +0100]
>  "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0
>  "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.500
>  "-"
>  "66.102.6.205" www.transitionnetwork.org [02/Sep/2015:13:37:27 +0100] "GET
>  /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-"
>  "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.521 "-"
>  "66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:28 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=5950156067159212032&hub.verify_token=d81f046d7d811ddd073d5085d49609de&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:28 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=14864592753011071735&hub.verify_token=a160f1d2f40967f584fe2a48f6943767&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.249.93.179" www.transitionnetwork.org [02/Sep/2015:13:37:29 +0100]
>  "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=1114708068878419733&hub.verify_token=014eefe3c76d5456099a00380653dbff&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  "66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:29 +0100] "GET
>  /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=5789640148520582546&hub.verify_token=4bbd1b66686203c49a7c0cc1a19cbbe3&hub.mode=subscribe&hub.lease_seconds=432000
>  HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google;
>  (+http://www.google.com/feedfetcher.html)" 0.000 "-"
>  }}}
>
>  So I tracked down the Ngnix config file,
>  `/data/disk/tn/config/includes/nginx_vhost_common.conf` and added:
>
>  {{{
>  # google feed fetcher
>  location ^~ /feeds/importer/transition_blogs_importer/15800 {
>      return 403;
>  }
>  }}}
>
>  So next time it comes back rather than these requests being sent to php-
>  fpm they will get a fast, low cost, response from Nginx...
>
> --
> Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:42>
> Transition Technology <https://tech.transitionnetwork.org/trac>
> Support and issues tracking for the Transition Network Web Project.

comment:52 in reply to: ↑ 51 Changed 15 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 5.44 to 5.69

Replying to sam:

If the load spikes/ downtime are related to the blog import, & this
change doesn't fix it then I think we could consider just turning off
the blog imports? I don't think its massively read?

I'm not sure they are, the crazy number of requests from the Google Feedfetcher has been solved by serving 403's for this URL:

https://www.transitionnetwork.org/feeds/importer/transition_blogs_importer/15800

There were only 6 requests for that URL yesterday:

grep transition_blogs_importer /var/log/nginx/access.log.1 | wc -l
6

Before we started serving 403's we were getting 40k requests a day for this URL...

The main traffic is still for these 3 URLs (see ticket:846#comment:35):

https://www.transitionnetwork.org/?q=node/add (16963 yesterday)
https://www.transitionnetwork.org/user/register (19118 yesterday)
https://www.transitionnetwork.org/people/kion-kashefi (5822 yesterday)

I strongly suspect botnet(s) are behind the massive number of requests for these URLs, I'm not sure how, without ditching BOA, to address this.

I have been running wiki:WebServerLogs#logstalgia for the last week or so to keep an eye on things.

Last edited 15 months ago by chris (previous) (diff)

Changed 12 months ago by chris

Attachment puffin_2015-12-04_load-day.png added

comment:53 Changed 12 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.15
Total Hours changed from 5.69 to 5.84

There is a on-going very high load on PuffinServer today, these are the subjects lines from the LFD emails I have been sent so far today:

Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 14.93                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 15.05                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.34                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.12                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 27.72                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 48.07                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 53.77                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 54.35                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 54.49                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 56.10                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 62.23                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 66.63                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 61.97                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 41.70                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 28.15                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 19.04                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 13.20                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.52                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 13.16                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 14.60                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 34.55                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 52.49                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 52.04                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 37.02                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 25.41                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 16.86                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 35.16                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 25.18                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 16.90                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 13.72                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 15.58                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 33.53                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 18.99                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 22.53                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 35.66                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 32.54                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 21.86                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 14.73                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.54                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 28.89                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 28.37                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 27.05                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 18.12                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 13.42                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.85                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 29.11                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 23.91                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 30.78                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 20.67                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 16.29                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.90                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 27.81                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 42.35                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 36.69                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 24.65                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 19.49                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 31.78                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 49.33                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 58.00                                                   
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 69.29                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 61.48                                                      
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 51.63

comment:54 Changed 12 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 5.84 to 6.09

The load is through the roof again, BOA is busy killing things (see the killall processes in this top output):

top - 10:33:16 up 3 days, 6 min,  1 user,  load average: 103.85, 116.45, 99.92
Tasks: 659 total, 112 running, 547 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.5 us, 15.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 83.7 st
%Cpu1  :  1.0 us, 15.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 83.7 st
%Cpu2  :  1.4 us, 14.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 83.9 st
%Cpu3  :  1.4 us, 13.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 85.2 st
%Cpu4  :  1.0 us, 15.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 83.3 st
%Cpu5  :  1.0 us, 14.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 84.5 st
%Cpu6  :  0.5 us, 14.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 85.5 st
%Cpu7  :  1.4 us, 14.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 84.3 st
%Cpu8  :  1.8 us, 16.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 82.0 st
%Cpu9  :  1.3 us, 12.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 85.9 st
%Cpu10 :  1.9 us, 14.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 83.8 st
%Cpu11 :  1.4 us, 15.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 83.1 st
%Cpu12 :  1.4 us, 16.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 82.2 st
%Cpu13 :  1.9 us, 14.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 84.1 st
KiB Mem:   8372060 total,  6507456 used,  1864604 free,   411396 buffers
KiB Swap:        0 total,        0 used,        0 free,  3172436 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                                   
19900 root      20   0     0    0    0 R    46  0.0   0:02.43 mysqladmin                                                                
17809 root      20   0     0    0    0 R    36  0.0   0:01.70 php53-fpm                                                                 
19975 root      20   0     0    0    0 R    28  0.0   0:01.16 bash                                                                      
20038 root      20   0     0    0    0 R    26  0.0   0:01.05 hostname                                                                  
20053 root      20   0     8    4    0 R    25  0.0   0:01.03 bash                                                                      
20060 root      20   0     8    4    0 R    24  0.0   0:00.97 service                                                                   
20067 root      20   0     8    4    0 R    23  0.0   0:00.95 php53-fpm                                                                 
19923 root      20   0  5560  604  512 R    23  0.0   0:01.05 expr                                                                      
19927 root      20   0  5544  720  596 R    23  0.0   0:00.93 killall                                                                   
20058 root      20   0     8    4    0 R    23  0.0   0:00.93 bash                                                                      
20050 root      20   0     8    4    0 R    22  0.0   0:00.92 bash                                                                      
19825 aegir     20   0  187m 9304 6152 R    22  0.1   0:01.41 php                                                                       
19476 root      20   0  5544  724  596 R    21  0.0   0:02.07 killall                                                                   
20041 root      20   0     8    4    0 R    20  0.0   0:00.83 perl                                                                      
19454 root      20   0     0    0    0 R    20  0.0   0:01.65 awk                                                                       
19940 root      20   0  5544  720  596 R    20  0.0   0:00.82 killall                                                                   
20061 root      20   0     8    4    0 R    20  0.0   0:00.82 service                                                                   
19925 root      20   0  3948  168  128 R    20  0.0   0:00.86 expr                                                                      
20045 root      20   0     8    4    0 R    20  0.0   0:00.80 php53-fpm                                                                 
15911 root      20   0     0    0    0 R    19  0.0   0:01.11 php53-fpm                                                                 
19914 root      20   0  7788  836  724 R    19  0.0   0:01.06 grep                                                                      
19474 root      20   0  5548  716  596 R    19  0.0   0:01.77 killall                                                                   
19994 root      20   0 40372 2416 1856 R    18  0.0   0:00.72 mysqladmin                                                                
20029 root      20   0     8    4    0 R    18  0.0   0:00.72 service                                                                   
20057 root      20   0     8    4    0 R    17  0.0   0:00.70 php53-fpm                                                                 
20063 root      20   0     8    4    0 R    17  0.0   0:00.70 php53-fpm                                                                 
19898 root      20   0 40372 2412 1856 R    17  0.0   0:01.37 mysqladmin                                                                
19459 root      20   0     0    0    0 R    17  0.0   0:02.13 grep                                                                      
19990 root      20   0  5544  712  588 R    17  0.0   0:00.68 killall                                                                   
20042 root      20   0     8    4    0 R    17  0.0   0:00.68 bash                                                                      
20024 root      20   0 40372 2416 1856 R    16  0.0   0:00.67 mysqladmin

There are 92 BOA bash scripts running (two are these are my shell):

ps -lA | grep bash | wc -l
94

I sorry but don't know what to do apart from let BOA go on with running loads of bash scripts killing loads of things... this is the nature of BOA... it is designed not to be managed by sysadmins and at times like this it results in a server that nothing can be done for apart from a reboot and perhaps I should try that since the site is unavailable...

Can we one day have a discussion about moving away from BOA? Please?

comment:55 Changed 12 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.15
Total Hours changed from 6.09 to 6.24

I rebooted the server, once BOA finds high loads and goes on a killing spree it doesn't restart things for a while, so rebooting has resulted in the site coming back on line quicker. This is not how servers should be managed.

comment:56 Changed 12 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 6.24 to 6.49

Another massive load spike:

lfd on puffin.webarch.net: High 5 minute load average alert - 158.55

Rebooting the server again...

comment:57 Changed 12 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 6.49 to 6.74

Perhaps the suggestions posted here, 3 months ago, could be considered someday soonish?

ticket:754#comment:61

The amount of downtime we have with BOA at the moment is really appalling...

I did look at the Piwik stats to see if there is a lot more human traffic due to cop21 but there didn't seem to be, so perhaps it is bots...

The server is back up now following the reboot but again it is unresponsive...

load average: 95.23, 76.54, 46.78

I'm going to have a look at what is happing via the Nginx logs, see wiki:WebServerLogs#logstalgia but it might be too overloaded for this to work, just getting a response from uptime took 5 mins...

comment:58 Changed 12 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.2
Total Hours changed from 6.74 to 6.94

BOA is on another killing spree:

top - 17:00:34 up 26 min,  2 users,  load average: 86.28, 77.39, 57.07
Tasks: 499 total,  99 running, 400 sleeping,   0 stopped,   0 zombie
   32 root       0 -20     0    0    0 S   0.0  0.0   0:00.00 netns
%Cpu1  :  0.2 us,  1.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 98.6 st
%Cpu2  :  0.0 us,  1.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 98.7 st
%Cpu3  :  0.0 us,  1.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 98.9 st
%Cpu4  :  0.2 us,  1.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 98.7 st
%Cpu5  :  0.0 us,  1.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 98.9 st
%Cpu6  :  0.2 us,  1.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si, 98.0 st
%Cpu7  :  0.0 us,  1.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si, 98.7 st
%Cpu8  :  0.2 us,  1.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 98.7 st
%Cpu9  :  0.0 us,  1.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 99.0 st
%Cpu10 :  0.2 us,  1.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 98.8 st
%Cpu11 :  0.0 us,  1.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 98.9 st
%Cpu12 :  0.0 us,  1.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 98.9 st
sudo -i
KiB Mem:   8372060 total,  1556628 used,  6815432 free,   143036 buffers
Tasks: 500 total, 100 running, 400 sleeping,   0 stopped,   0 zombie

%Cpu1  :  0.1 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 99.2 st
28289 root      20   0     0    0    0 R   113  0.0   0:13.88 mysqladmin
28410 root      20   0     8    4    0 R    83  0.0   0:04.07 bash
25660 root      20   0     0    0    0 R    80  0.0   0:08.32 cron
28344 root      20   0     0    0    0 R    77  0.0   0:08.98 id
28413 root      20   0     8    4    0 R    74  0.0   0:03.65 bash
28411 root      20   0     8    4    0 R    71  0.0   0:04.20 bash
28147 root      20   0     0    0    0 R    59  0.0   0:04.66 grep
28216 root      20   0     0    0    0 R    56  0.0   0:03.18 bash
28148 root      20   0     0    0    0 R    45  0.0   0:05.18 awk
28392 root      20   0 28900  840  556 R    36  0.0   0:05.36 cron
28312 root      20   0 50544  512  300 R    36  0.0   0:08.00 host
27866 root      20   0  4064  660  500 R    35  0.0   0:24.31 pidof
28301 root      20   0 40376 2500 1900 R    35  0.0   0:08.28 mysqladmin
28378 root      20   0 11064  240  172 R    31  0.0   0:04.74 awk
23627 aegir     20   0  230m  40m 9512 R    30  0.5   1:03.83 php
28368 aegir     20   0  9796  724  608 R    28  0.0   0:05.84 id
28021 root      20   0  201m  12m 7896 R    26  0.2   0:17.97 php-fpm
28379 root      20   0 27828  336  216 R    26  0.0   0:05.39 mysqladmin
28283 root      20   0 40372 2504 1904 R    25  0.0   0:08.92 mysqladmin
28351 root      20   0 29980  356  228 R    25  0.0   0:05.98 mysqladmin
27843 root      20   0  4064  668  500 R    25  0.0   0:23.27 pidof
27965 root      20   0  4064  616  500 R    23  0.0   0:20.62 pidof
28406 root      20   0     8    4    0 R    21  0.0   0:03.66 bash
28403 root      20   0 24840  776  512 R    17  0.0   0:03.06 cron
28068 aegir     20   0  4132  592  500 R    17  0.0   0:09.51 dash
28381 root      20   0  6124  336  256 R    16  0.0   0:04.20 grep
28206 root      20   0  5544  696  588 R    16  0.0   0:10.63 killall
27472 root      20   0  5544  708  596 R    16  0.0   0:39.45 killall
28184 root      20   0  5544  704  596 R    16  0.0   0:10.05 killall
28404 root      20   0 24792  820  548 R    16  0.0   0:02.76 cron
28229 root      20   0  5544  704  588 R    16  0.0   0:10.05 killall
27864 root      20   0  4064  640  500 R    15  0.0   0:22.94 pidof
28401 root      20   0 24664  780  512 R    15  0.0   0:02.62 cron
27473 root      20   0  5544  712  596 R    15  0.0   0:38.61 killall
28400 root      20   0 24840  776  512 R    15  0.0   0:02.60 cron
28317 root      20   0  5544  700  588 R    15  0.0   0:07.47 killall
23147 aegir     20   0  231m  41m 9516 R    14  0.5   1:12.12 php
28405 root      20   0 24664  780  512 R    14  0.0   0:02.53 cron

At least 12 killall processes are running:

ps -lA | grep killall
4 R     0 27472 26445 18  80   0 -  1386 -      ?        00:00:59 killall
4 R     0 27473 26027 18  80   0 -  1386 -      ?        00:00:59 killall
4 R     0 27505 23173 18  80   0 -  1386 -      ?        00:01:00 killall
4 R     0 27506 22402 18  80   0 -  1386 -      ?        00:00:57 killall
4 R     0 28184 19270 14  80   0 -  1386 -      ?        00:00:30 killall
4 R     0 28206 11411 14  80   0 -  1387 -      ?        00:00:30 killall
4 R     0 28214 23723 15  80   0 -  1386 -      ?        00:00:31 killall
4 R     0 28215 22836 15  80   0 -  1386 -      ?        00:00:32 killall
4 R     0 28229 24782 15  80   0 -  1386 -      ?        00:00:30 killall
4 R     0 28235 25066 13  80   0 -  1386 -      ?        00:00:27 killall
4 R     0 28317 23862 15  80   0 -  1387 -      ?        00:00:29 killall
4 R     0 28462 26319 13  80   0 -  1386 -      ?        00:00:08 killall

Now it is down to 10...

ps -lA | grep killall | wc -l
10

But back to 12 again...

puffin:~# ps -lA | grep killall | wc -l
12

I'm going to reboot the server with four time as much RAM to see if makes it less suicidal...

comment:59 Changed 12 months ago by chris

Again a massive number of BOA bash jobs are running:

ps -lA | grep bash | wc -l
91

comment:60 Changed 12 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 6.94 to 7.04

It is up, I don't know how long it'll stay up:

uptime
 17:33:42 up 22 min,  1 user,  load average: 46.07, 49.30, 26.33

BOA hasn't started killing itself yet...

ps -lA | grep killall | wc -l
0

comment:61 Changed 12 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 7.04 to 7.14

So now it seems fine...

17:58:32 up 46 min,  1 user,  load average: 0.25, 1.40, 8.90

I have no idea what was causing the massive load spikes earlier.

comment:62 follow-up: ↓ 63 Changed 12 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.2
Total Hours changed from 7.14 to 7.34

So wiki:WebServerLogs#logstalgia now works given the extra RAM and guess what, massive number of hits for /user/register, ?q=/node/add and /people/kion-kashefi, the same as 3 months ago, see ticket:846#comment:35 ...

comment:63 in reply to: ↑ 62 ; follow-up: ↓ 65 Changed 12 months ago by sam

Replying to chris:

So wiki:WebServerLogs#logstalgia now works given the extra RAM and guess what, massive number of hits for /user/register, ?q=/node/add and /people/kion-kashefi, the same as 3 months ago, see ticket:846#comment:35 ...

I guess it would be fairly trivial for us to move the URL of that registration page?

/user/new/register ?

My assumption is this traffic is generated by a script that attacks Drupal sites. Moving the URL might be enough to outwit it if it's a crappy script?

I realise we'd still have to serve them a 404, but my assumption is that the load is actually generated by login attempts, not just the serving of a simple page?

It might be a more sophisticated script that would find the page anyway, but for five minutes work it might be worth a try?

Thanks

Sam

comment:64 follow-up: ↓ 66 Changed 12 months ago by sam

also just unpublished https://www.transitionnetwork.org/people/kion-kashefi Sorry didn't spot it in previous ticket.

comment:65 in reply to: ↑ 63 Changed 12 months ago by chris

Replying to sam:

I guess it would be fairly trivial for us to move the URL of that registration page?

/user/new/register ?

Yes I think that might help, but we would need to check everything that links there first?

comment:66 in reply to: ↑ 64 Changed 12 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 7.34 to 7.44

Replying to sam:

also just unpublished https://www.transitionnetwork.org/people/kion-kashefi Sorry didn't spot it in previous ticket.

Thanks, we could also do with identifying other spam pages like this, I don't think this is the only one...

Changed 12 months ago by chris

Attachment puffin_memory-pinpoint_2015-12-06.png added

comment:67 Changed 12 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 7.44 to 7.69

Looking at the memory usage for the last 12 hours (during which we haven't had any load spikes, I don't know for sure if this is related to the extra memory or a coincidence):

The minimum unused memory has been 23.36G, so 32-23.36=8.64, therefore I'm going to reboot the server with 9G, this is 1G more than it had, so this would be an extra £28 per quarter if Ade is happy for the Transition Network to pay this, if not I can reboot it again back to 8G.

comment:70 follow-up: ↓ 71 Changed 12 months ago by ade

Sounds ideal Chris,
Please invoice as proposed.

Best regards
Ade

On 6 December 2015 at 15:59, Transition Technology Trac <
trac@tech.transitionnetwork.org> wrote:

> #846: Load Spikes on BOA PuffinServer
> -------------------------------------+-------------------------------------
>            Reporter:  chris          |                      Owner:  chris
>                Type:  maintenance    |                     Status:  new
>            Priority:  major          |                  Milestone:
>           Component:  Live server    |  Maintenance
>            Keywords:                 |                 Resolution:
> Add Hours to Ticket:  0.25           |  Estimated Number of Hours:  0.0
>         Total Hours:  7.44           |                  Billable?:  1
> -------------------------------------+-------------------------------------
> Changes (by chris):
>
>  * hours:  0.0 => 0.25
>  * totalhours:  7.44 => 7.69
>
>
> Comment:
>
>  Looking at the
>  [
> https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/memory.html
>  memory usage] for the last 12 hours (during which we haven't had any load
>  spikes, I don't know for sure if this is related to the extra memory or a
>  coincidence):
>
>  [[Image(puffin_memory-pinpoint_2015-12-06.png)]]
>
>  The minimum unused memory has been 23.36G, so 32-23.36=8.64, therefore I'm
>  going to reboot the server with 9G, this is 1G more than it had, so this
>  would be an extra [https://www.webarch.net/virtual-servers#ram £28 per
>  quarter] if Ade is happy for the Transition Network to pay this, if not I
>  can reboot it again back to 8G.
>
> --
> Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:67
> >
> Transition Technology <https://tech.transitionnetwork.org/trac>
> Support and issues tracking for the Transition Network Web Project.
>



-- 
Ade Stuart
Web Manager - Transition network

07595 331877

The Transition Network is a registered charity
address: 43 Fore St, Totnes, Devon, TQ9 5HN, UK
website: www.transitionnetwork.org
TN company no: 6135675 TN charity no: 1128675

comment:71 in reply to: ↑ 70 Changed 12 months ago by chris

Replying to ade:

Sounds ideal Chris,
Please invoice as proposed.

Thanks Ade, will do. Our email, just crossed -- the one I just sent to the Ttech list was sent before I saw this.

comment:72 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 7.69 to 7.94

Another bad day for load spikes, these are the subject lines from lfd email today:

Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.64
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 12.96
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.60
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 16.13
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 12.11
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 14.80
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.63
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 32.27
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 42.51
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 28.61
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 19.86
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 13.46
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.38
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 18.87
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 13.11
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 12.90
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 14.44
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 14.83
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 23.33
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 18.92
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 17.96
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 12.71
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 12.86
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 13.10
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 13.14
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.55
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 14.72
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 32.77
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 25.37
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 20.09
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.84
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 40.58
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 55.44
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 64.21
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 71.34
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 54.87
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 37.07
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 24.86
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 16.80

I have spent a fair amount of time looking at the output of top as well as all the email alerts and the munin graphs, as ever I can't see any simply cause or fix for this while we stick with BOA...

comment:73 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 7.94 to 8.19

I think a lot of the load issues are caused by bots, looking at the latest /var/log/nginx/access.log, it is 31k lines long and these are the User Agent strings for the UA's with 60 or more hits:

     60 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2; Feeder.co) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36
     60 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36
     61 Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2
     61 Stringer (https://github.com/swanson/stringer)
     62 LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.1 +http://www.linkedin.com)
     64 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
     64 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0
     66 Mozilla/5.0 (compatible; FeedBooster; +http://feeds.qsensei.com)
     66 NewsBlur Feed Fetcher - 42 subscribers - http://www.newsblur.com/site/40901/rob-hopkinss-blog (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
     66 NewsBlur Page Fetcher - 42 subscribers - http://www.newsblur.com/site/40901/rob-hopkinss-blog (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
     68 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:152016352
     69 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:111162297
     70 Automattic Feed Fetcher 1.0
     70 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8
     70 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:107327965
     70 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:243955592
     70 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:90309734
     70 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:92605190
     75 Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0;  http://www.bing.com/bingbot.htm)
     76 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:90556700
     80 Java/1.8.0_51
     81 Feedbin feed-id:189396 - 5 subscribers
     89 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
    120 dlvr.it/1.0 (+http://dlvr.it/)
    124 check_http/v1.4.16 (nagios-plugins 1.4.16)
    128 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
    130 Java/1.8.0_65
    138 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:90159368
    143 Mozilla/5.0 (compatible)
    149 Owlin bot v3 http://owlin.com/bot
    158 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36
    172 curl
    265 Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
    323 robots
    346 Wget/1.13.4 (linux-gnu)
    384 Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)
    436 Mozilla/5.0 (compatible; SpiderLing (a SPIDER for LINGustic research); +http://nlp.fi.muni.cz/projects/biwec/)
    441 Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20100101 Firefox/17.0
    489 magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)
    545 Y!J-ASR/0.1 crawler (http://www.yahoo-help.jp/app/answers/detail/p/595/a_id/42716/)
    559 Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)
    633 Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)
    645 Pingdom.com_bot_version_1.4_(http://www.pingdom.com)
    747 Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)
    793 Domain Re-Animator Bot (http://domainreanimator.com) - support@domainreanimator.com
   1004 Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
   1323 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
   1841 -
   2080 Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
   2633 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
   3193 Mozilla/5.0 (compatible; spbot/4.4.2; +http://OpenLinkProfiler.org/bot )
   3705 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36

The above hits account for 25k of the 31k total, if you take out the ones which might be people you have 20.5k hits from bots.

So I think we can say that at least 60% of the traffic to the site is from bots.

comment:74 Changed 11 months ago by chris

The UAs were extracted like this:

awk -F\" '{print $8}' /var/log/nginx/access.log | sort | uniq -c | sort -bn | less

comment:75 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.2
Total Hours changed from 8.19 to 8.39

PuffinServer was down overnight again due to MySQL locking up, see ticket:890#comment:6 and today is another big day for load spikes and the server going down...

So it is back up and almost totally unresponsive again due to the load:

 12:22:54 up 6 min,  1 user,  load average: 56.00, 23.47, 8.84

I think we might be better off simply switching the server off for the day rather than me wasting hours trying to get it to stay up for moments at a time?

Last edited 11 months ago by chris (previous) (diff)

comment:76 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.02
Total Hours changed from 8.39 to 8.41

It is only getting worse:

 12:26:18 up 9 min,  1 user,  load average: 95.24, 57.20, 25.01

comment:77 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 8.41 to 8.51

 12:30:36 up 13 min,  2 users,  load average: 73.60, 65.90, 36.46

I think I'm going to give up on trying to keep the server up today -- if anyone does want a server that we can keep up to run the site then let's have a discussion about dropping BOA?

comment:78 follow-up: ↓ 79 Changed 11 months ago by paul

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 8.51 to 8.76

Is the server swapping to the hard drive?

I'll be around later this afternoon, If I can be of any assistance.

comment:79 in reply to: ↑ 78 ; follow-up: ↓ 80 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 8.76 to 9.01

Replying to paul:

Is the server swapping to the hard drive?

No it hasn't got any swap.

I'll be around later this afternoon, If I can be of any assistance.

Thanks but I think what we really need is to have the conversation about what to do regarding BOA and we can only have that conversation with Ade...

Results from checking uptime over the last half hour or so:

 12:32:03 up 15 min,  2 users,  load average: 65.47, 65.04, 38.86

 12:39:52 up 23 min,  1 user,  load average: 97.46, 83.81, 57.33

 12:41:46 up 25 min,  1 user,  load average: 103.87, 90.45, 62.82

 12:43:40 up 26 min,  1 user,  load average: 110.41, 97.36, 68.43

 12:49:07 up 32 min,  1 user,  load average: 133.84, 120.30, 86.88

 12:52:09 up 35 min,  1 user,  load average: 117.21, 117.41, 91.68

 12:54:34 up 37 min,  1 user,  load average: 136.75, 124.76, 98.08

 12:56:04 up 39 min,  1 user,  load average: 141.40, 128.69, 101.89

 12:59:54 up 43 min,  1 user,  load average: 160.20, 145.55, 114.65

It is only getting worse and due to the nature of BOA that isn't really anything we can do apart from watch it failing.

comment:80 in reply to: ↑ 79 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.15
Total Hours changed from 9.01 to 9.16

Replying to paul:

I'll be around later this afternoon, If I can be of any assistance.

By all means try to get to the bottom of it, I think the problem is that we can't as long as we are running BOA. You could start by reading this ticket thread I guess...

We might as well switch the server off for the day:

 13:03:31 up 46 min,  1 user,  load average: 185.61, 162.76, 127.80

 13:11:13 up 54 min,  1 user,  load average: 201.08, 191.55, 155.24

comment:81 Changed 11 months ago by chris

So fucked:

 13:35:41 up  1:18,  1 user,  load average: 263.11, 261.47, 233.17

comment:82 Changed 11 months ago by chris

I'm going to have one last try for the day to get the server up, rebooting it with double the RAM...

comment:83 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.1
Total Hours changed from 9.16 to 9.26

It is up again:

 13:43:13 up 4 min,  1 user,  load average: 2.72, 0.91, 0.33

free -m
             total       used       free     shared    buffers     cached
Mem:         18415       2086      16329          0        129        586
-/+ buffers/cache:       1370      17044
Swap:            0          0          0

No idea how long it will be up for, not long I suspect...

comment:84 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.05
Total Hours changed from 9.26 to 9.31

It is knackered again already:

 13:46:25 up 7 min,  1 user,  load average: 51.22, 20.76, 7.90

I'm really giving up for the day now.

comment:85 Changed 11 months ago by sam

Hi all

Thanks for your work on this.

I see the site is available again at the moment.

Paul: one thing we have discussed is moving the /login page, the assumption being that the load is caused by bots filling and submitting the form. If the bots can't find the form they can't submit it repeatedly.

I know this would be trivial on Wordpress but I've never done it on a Drupal site. Is it something that would be fairly easy for you to do?

I see there is a module for it: https://www.drupal.org/project/rename_admin_paths but I seem to recall we already have a module to edit paths?

I realise we might end up with a few broken internal links, I think this would be a price worth paying if it reduces the load and keeps the site up. I could fix any broken links early in the new year.

Thanks

Sam

comment:86 follow-up: ↓ 87 Changed 11 months ago by ade

Hi Chris,
Sorry have taken holiday to coincide with kids holidays.
I have no idea how to combat the BOT issues that the server is currently
receiving. With regard to the long term plan, we are looking at the next
site to be a WordPress site over the next 3-4 months, so this will mitigate
the problem in the long-term, well either that or push the issue to a
different platform. So the question is what can we do in the short term? We
can't afford to rip out BOA and rebuild the server for 3-4 months.
Are the bots targeting a particular URL or port or are they just flooding
port 80? If its a URL can we remove that URL or change it internally?

I had every intention of discussing with you the plans for next year before
I came on holiday, but a question has arisen that I needed answering before
hand. I understand that the lack of communication from my end has had to be
very frustrating. I get back the first week of Jan..can we catch up then?

Regards
Ade
On 23 Dec 2015 1:48 p.m., "Transition Technology Trac" <
trac@tech.transitionnetwork.org> wrote:

> #846: Load Spikes on BOA PuffinServer
> -------------------------------------+-------------------------------------
>            Reporter:  chris          |                      Owner:  chris
>                Type:  maintenance    |                     Status:  new
>            Priority:  major          |                  Milestone:
>           Component:  Live server    |  Maintenance
>            Keywords:                 |                 Resolution:
> Add Hours to Ticket:  0.05           |  Estimated Number of Hours:  0.0
>         Total Hours:  9.26           |                  Billable?:  1
> -------------------------------------+-------------------------------------
> Changes (by chris):
>
>  * hours:  0.0 => 0.05
>  * totalhours:  9.26 => 9.31
>
>
> Comment:
>
>  It is knackered again already:
>
>  {{{
>   13:46:25 up 7 min,  1 user,  load average: 51.22, 20.76, 7.90
>  }}}
>
>  I'm really giving up for the day now.
>
> --
> Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:84
> >
> Transition Technology <https://tech.transitionnetwork.org/trac>
> Support and issues tracking for the Transition Network Web Project.
>

comment:87 in reply to: ↑ 86 Changed 11 months ago by chris

Cc kate added
Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 9.31 to 9.56

Replying to ade:

Sorry have taken holiday to coincide with kids holidays.

No problem, I understand!

I have no idea how to combat the BOT issues that the server is currently
receiving.

I'm not sure we can with BOA as the configuration isn't designed to be managed by sysadmins.

With regard to the long term plan, we are looking at the next site to be a
WordPress site over the next 3-4 months, so this will mitigate the problem in
the long-term, well either that or push the issue to a different platform.

OK.

So the question is what can we do in the short term? We can't afford to rip
out BOA and rebuild the server for 3-4 months.

OK, I have had enough of BOA, we have been running PuffinServer for three years and we have had problems with it for ages, see wiki:PuffinServer#LoadSpikes, it hasn't been updated for seven months (see ticket:889, ticket:872, ticket:864, ticket:863 and ticket:854).

Kate, the other syadmin at Webarchitects has done some work on PuffinServer to bring it back up, reducing the number of CPUs and removing some BOA cron jobs, I have created an account for her on trac and she will hopefully document what she has done and record the time spent as a comment on this thread or as a new ticket.

Are the bots targeting a particular URL or port or are they just flooding
port 80? If its a URL can we remove that URL or change it internally?

It isn't entirly clear as most requests are not logged due to the way BOA has configured Nginx, however there is evidence of some URLs being abused by bots, see ticket:846#comment:35 and yes it might be possible for someone to mitigate this to some extent.

I had every intention of discussing with you the plans for next year before I
came on holiday, but a question has arisen that I needed answering before
hand. I understand that the lack of communication from my end has had to be
very frustrating. I get back the first week of Jan..can we catch up then?

It would be really great to have a chat about what we are going to do, I'm back at work when the kids go back to school.

Last edited 11 months ago by chris (previous) (diff)

comment:88 follow-up: ↓ 89 Changed 11 months ago by kate

Add Hours to Ticket changed from 0.0 to 1.0
Total Hours changed from 9.56 to 10.56

Hi,

My colleague Chris Croome mentioned in our IRC channel that there were issues with server puffin.
I logged into the host around 16:00 today.
I noticed that the host has 14 CPU, was showing a high load ( >100 ).
I noticed that there was a high "steal" percentage of CPU time
There were many bash "killall" processes running.

From previous discussion with Chris, I am aware that the server has a number of shell scripts called "BOA" used to dynamically monitor and maintain it.

I took the following action

I shutdown the server
I reduced the number of CPU to 4
I restarted the server
I commented out several shell scripts that are scheduled to run from root's crontab ( see output of crontab -l -u root If it is commented out I did it. )

Since performing this action ( about 1.5 hours now ) the load is < 1

top - 18:09:30 up  1:39,  2 users,  load average: 0.31, 0.23, 0.19
Tasks: 155 total,   1 running, 154 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.2 us,  0.0 sy,  0.0 ni, 98.5 id,  1.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu1  :  7.4 us,  4.3 sy,  0.0 ni, 88.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.9 us,  0.2 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.1 us,  0.1 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  18868060 total,  2889180 used, 15978880 free,   151080 buffers
KiB Swap:        0 total,        0 used,        0 free,   767888 cached

The actions I took should be considered emergency actions. I'm not familiar with the BOA management, nor with other systems on this host. My reasoning for taking these steps was...

Spawning bash scripts every minute for dynamic system management ( parsing log files, checking file system acls, checking for processes running ) is going to be expensive in resources. There are probably much better ways of doing these things. ( sv, inotify ... )

From experience of running hosts in virtualised environments I'm not convinced that web applications run well with many virtual CPU. I believe that there may be hypervisor cpu scheduling conflicts with other VM's, and that unless there is clear evidence that performance is enhanced with additional CPU and that CPU numbers are required, then extra CPU numbers can detract from performance, rather than improve it.

Regards,

Kate

Last edited 11 months ago by kate (previous) (diff)

comment:89 in reply to: ↑ 88 ; follow-up: ↓ 100 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 10.56 to 10.81

Replying to kate:

There were many bash "killall" processes running.

Yes, BOA goes on suicidal killing sprees quite often, for example see ticket:846#comment:58 ...

I reduced the number of CPU to 4

Note that the server also currently has twice the RAM, I have updated the wiki page to reflect this, wiki:PuffinServer?action=diff&version=189&old_version=188

I commented out several shell scripts that are scheduled to run from root's crontab ( see output of crontab -l -u root If it is commented out I did it. )

I'm not sure what the effect will be of this:

crontab -l -u root
#* * * * * bash /var/xdrago/second.sh >/dev/null 2>&1
#* * * * * bash /var/xdrago/minute.sh >/dev/null 2>&1
#* * * * * bash /var/xdrago/runner.sh >/dev/null 2>&1
#* * * * * bash /var/xdrago/manage_ltd_users.sh >/dev/null 2>&1
#03 * * * * bash /var/xdrago/clear.sh >/dev/null 2>&1
#14 * * * * bash /var/xdrago/purge_binlogs.sh >/dev/null 2>&1
#01 0 * * * bash /var/xdrago/graceful.sh >/dev/null 2>&1
#08 0 * * * bash /var/xdrago/mysql_backup.sh >/dev/null 2>&1
#08 2 * * * bash /opt/local/bin/backboa backup >/dev/null 2>&1
#08 8 * * * bash /opt/local/bin/duobackboa backup >/dev/null 2>&1
#58 2 * * * bash /var/xdrago/daily.sh >/dev/null 2>&1
#58 5 * * * bash /var/xdrago/fetch-remote.sh >/dev/null 2>&1

Some of the bash scripts were probably doing things that are needed so I fear that someone if going to have to go through all of these and work out which ones...

comment:90 Changed 11 months ago by paul

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 10.81 to 11.06

I'll install the honeypot module which should prevent spam bots from completing forms successfully ..

https://www.drupal.org/project/honeypot

comment:91 follow-up: ↓ 92 Changed 11 months ago by paul

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 11.06 to 11.31

Sam,

Would you have a look to see if you can configure the honeypot module with the primary administrator account (uid=1)?

https://www.transitionnetwork.org/admin/settings/honeypot

comment:92 in reply to: ↑ 91 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 11.31 to 11.56

Replying to paul:

Would you have a look to see if you can configure the honeypot module with the primary administrator account (uid=1)?

Paul you can get a temp login as UID 1 (without knowing the password or changing it) using drush uli.

I spent some time looking at the munin stats and things seem to be going OK so far.

I have opened a new ticket for the BOA cron jobs, ticket:893.

comment:93 Changed 11 months ago by sam

Hi Paul

Yep it looks like I can.

I enabled it for

User Registration form
User Password Reset form

However I can't see a login form option, unless Drupal regards the login/registration forms as equivalent?

I'll do a bit of testing now to make sure it doesn't conflict with the existing spam protection stuff; Botcha etc.

Thanks

Sam

comment:94 follow-up: ↓ 95 Changed 11 months ago by paul

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 11.56 to 11.81

@Chris

Normally I can but I couldn't figure out quickly how to to run the command as user tn with sudo or su?

@Sam

Sorry. I don't know. Probably easier for you to explore the problem if you have access. But, let me know if you need further assistance.

comment:95 in reply to: ↑ 94 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 11.81 to 12.06

Replying to paul:

@Chris

Normally I can but I couldn't figure out quickly how to to run the command as user tn with sudo or su?

OK, to find where the site is I looked in /etc/passwd and found:

tn.web:x:118:33::/home/tn.web:/bin/false

But the site isn't in /home/tn.web.

So trying to find it via /etc/nginx/nginx.conf, wehere we have include /etc/nginx/sites-enabled/*; but:

ls -lah /etc/nginx/sites-enabled/
ls: cannot access /etc/nginx/sites-enabled/: No such file or directory

However /etc/nginx/conf.d/aegir.conf is a symlink to /var/aegir/config/nginx.conf and this ends with:

include /var/aegir/config/server_master/nginx/pre.d/*;
include /var/aegir/config/server_master/nginx/platform.d/*;
include /var/aegir/config/server_master/nginx/vhost.d/*;
include /var/aegir/config/server_master/nginx/post.d/*;

So looking in /var/aegir/config/server_master/nginx/vhost.d/ but there isn't the TN site config there... I don't want to waste any more time trying to work through the maze of Nginx config BOA created.

Paul if you know where BOA has hidden the site directory then I'll be able to work out the drush command...

comment:96 Changed 11 months ago by sam

Well it all seems Ok for now. The bash script changes may have done the trick temporarily ?

I suggest we just leave it alone for now. Go and have Christmas and look again in the new year.

Thanks everyone for your work on this.

Hope you all have a nice few days off. 

Best 

Sam

Sent from my Sony Xperia™ smartphone

---- Transition Technology Trac wrote ----

>#846: Load Spikes on BOA PuffinServer
>-------------------------------------+-------------------------------------
>           Reporter:  chris          |                      Owner:  chris
>               Type:  maintenance    |                     Status:  new
>           Priority:  major          |                  Milestone:
>          Component:  Live server    |  Maintenance
>           Keywords:                 |                 Resolution:
>Add Hours to Ticket:  0.25           |  Estimated Number of Hours:  0.0
>        Total Hours:  11.56          |                  Billable?:  1
>-------------------------------------+-------------------------------------
>Changes (by paul):
>
> * hours:  0.0 => 0.25
> * totalhours:  11.56 => 11.81
>
>
>Comment:
>
> @Chris
>
> Normally I can but I couldn't figure out quickly how to to run the command
> as user tn with sudo or su?
>
> @Sam
>
> Sorry. I don't know. Probably easier for you to explore the problem if you
> have access. But, let me know if you need further assistance.
>
>-- 
>Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:94>
>Transition Technology <https://tech.transitionnetwork.org/trac>
>Support and issues tracking for the Transition Network Web Project.

comment:97 Changed 11 months ago by sam

I think I was a bit unclear before. It's not that I was unable to configure honeypot it just seems it doesn't protect login forms. We have protected the registration form with it, so that may be enough to reduce load, certainly alongside the script changes.

Let's just leave it alone for now. 

Thanks

Sam

Sent from my Sony Xperia™ smartphone

---- Transition Technology Trac wrote ----

>#846: Load Spikes on BOA PuffinServer
>-------------------------------------+-------------------------------------
>           Reporter:  chris          |                      Owner:  chris
>               Type:  maintenance    |                     Status:  new
>           Priority:  major          |                  Milestone:
>          Component:  Live server    |  Maintenance
>           Keywords:                 |                 Resolution:
>Add Hours to Ticket:  0.25           |  Estimated Number of Hours:  0.0
>        Total Hours:  11.81          |                  Billable?:  1
>-------------------------------------+-------------------------------------
>Changes (by chris):
>
> * hours:  0.0 => 0.25
> * totalhours:  11.81 => 12.06
>
>
>Comment:
>
> Replying to [comment:94 paul]:
> > @Chris
> >
> > Normally I can but I couldn't figure out quickly how to to run the
> command as user tn with sudo or su?
>
> OK, to find where the site is I looked in `/etc/passwd` and found:
>
> {{{
> tn.web:x:118:33::/home/tn.web:/bin/false
> }}}
>
> But the site isn't in `/home/tn.web`.
>
> So trying to find it via `/etc/nginx/nginx.conf`, wehere we have `include
> /etc/nginx/sites-enabled/*;` but:
>
> {{{
> ls -lah /etc/nginx/sites-enabled/
> ls: cannot access /etc/nginx/sites-enabled/: No such file or directory
> }}}
>
> However `/etc/nginx/conf.d/aegir.conf` is a symlink to
> `/var/aegir/config/nginx.conf` and this ends with:
>
> {{{
> include /var/aegir/config/server_master/nginx/pre.d/*;
> include /var/aegir/config/server_master/nginx/platform.d/*;
> include /var/aegir/config/server_master/nginx/vhost.d/*;
> include /var/aegir/config/server_master/nginx/post.d/*;
> }}}
>
> So looking in `/var/aegir/config/server_master/nginx/vhost.d/` but there
> isn't the TN site config there... I don't want to waste any more time
> trying to work through the maze of Nginx config BOA created.
>
> Paul if you know where BOA has hidden the site directory then I'll be able
> to work out the `drush` command...
>
>-- 
>Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:95>
>Transition Technology <https://tech.transitionnetwork.org/trac>
>Support and issues tracking for the Transition Network Web Project.

comment:98 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.25
Total Hours changed from 12.06 to 12.31

Thanks Sam, I wrote the following before I read your comments:

I think I have found the sites:

cd /data/
find ./ -name index.php
./disk/tn/static/transition-network-d6-35-p001b-booker/index.php
./disk/tn/static/drupal-6.37/index.php
./disk/tn/static/transition-network-d6-35-s001b-booker/index.php
./disk/tn/aegir/distro/019/profiles/hostmaster/modules/hosting/example/index.php
./disk/tn/aegir/distro/019/index.php
./disk/tn/platforms/transitionnetwork.org/index.php
./all/000/core/drupal-7.34.1/index.php
./all/000/core/drupal-7.33.1/index.php
./all/000/core/pressflow-6.29.1/index.php
./all/000/core/pressflow-6.34.1/index.php
./all/000/core/drupal-7.23.3/index.php
./all/000/core/drupal-7.32.1/index.php
./all/000/core/drupal-7.27.1/index.php
./all/000/core/drupal-7.30.1/index.php
./all/000/core/drupal-7.36.1/index.php
./all/000/core/pressflow-6.32.2/index.php
./all/000/core/drupal-7.28.1/index.php
./all/000/core/pressflow-6.35.1/index.php
./all/000/core/drupal-7.24.1/index.php
./all/000/core/pressflow-6.33.1/index.php
./all/000/core/pressflow-6.31.1/index.php
./all/000/core/pressflow-6.31.2/index.php
./all/000/core/pressflow-6.28.3/index.php
./all/000/core/drupal-7.31.1/index.php

So I guess /data/disk/tn/platforms/transitionnetwork.org/ is the site?

So:

sudo -i
su - tn -s /bin/bash
cd /data/disk/tn/platforms/transitionnetwork.org
drush uli
Drush command terminated abnormally due to an unrecoverable error.                                                     [error]

Hmm.... according to the BOA Changelog:

  * Allow 'drush uli' as root, but deny root access to Drush by default.

This came with:

### Stable BOA-2.2.0 Release - Full Edition
### Date: Mon Mar 31 06:44:08 SGT 2014

And we are running:

tail -1 /var/log/barracuda_log.txt
Fri May 1 22:50:58 BST 2015 / Debian.wheezy x86_64 XEN / Aegir BOA-2.4.2 / Barracuda BOA-2.4.2 / Nginx 1.8.0 / PHP-MI 5.3 / PHP-SE / FPM 5.3 / CLI 5.3 / MariaDB-5.5.43 localhost

So, trying that:

sudo -i
cd /data/disk/tn/platforms/transitionnetwork.org
drush uli
Drush command terminated abnormally due to an unrecoverable error.                                                     [error]

So sorry I don't know how to fix this, this looks like more broken by design by BOA nonsense.

comment:99 Changed 11 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.05
Status changed from new to closed
Resolution set to fixed
Total Hours changed from 12.31 to 12.36

Closing this ticket as the load spike problem appears to have been solved by commenting out all the BOA cron jobs, see ticket:893

Changed 10 months ago by chris

Attachment puffin-2016-01-16-memory-week.png added

comment:100 in reply to: ↑ 89 Changed 10 months ago by chris

Add Hours to Ticket changed from 0.0 to 0.23
Total Hours changed from 12.36 to 12.59

Replying to chris:

Replying to kate:

I reduced the number of CPU to 4

Note that the server also currently has twice the RAM, I have updated the wiki page to reflect this, wiki:PuffinServer?action=diff&version=189&old_version=188

For the last week the min spare RAM for PuffinServer has been 4.76G:

So I have edited the Xen config for the server:

#memory      = '18432'
memory      = '13824'

Based on this calculation:

18-4.5
13.5
13.5*1024
13824.0

I haven't yet rebooted it (I'll do it late one evening when less people are using the site), so it is still currently running with 18GB of RAM, even though the TN has only being invoiced for 9GB, but the invoice is for a server with 14 cores when it currently has 4 (Webarchitects will sort out their virtual server specs to better reflect what works well for the next quarter). When that has been done the top of the PuffinServer wiki page will need updating to reflect the change.

Reported by:	chris	Owned by:	chris
Priority:	major	Milestone:	Maintenance
Component:	Live server	Keywords:
Cc:	ade, annesley, paul, sam, kate	Estimated Number of Hours:	0.0
Add Hours to Ticket:	0	Billable?:	yes
Total Hours:	12.59

Context Navigation

Ticket #846 (closed maintenance: fixed)

Load Spikes on BOA PuffinServer

Description (last modified by chris) (diff)

Attachments

Change History

comment:1 Changed 19 months ago by chris

comment:2 Changed 19 months ago by chris

comment:3 Changed 19 months ago by chris

comment:4 Changed 19 months ago by chris

comment:5 Changed 19 months ago by chris

comment:6 Changed 19 months ago by chris

comment:7 Changed 19 months ago by chris

comment:8 Changed 19 months ago by chris

comment:9 Changed 19 months ago by chris

Changed 18 months ago by chris

Changed 18 months ago by chris

comment:10 Changed 18 months ago by chris

comment:11 Changed 18 months ago by chris

comment:12 Changed 18 months ago by chris

comment:13 Changed 18 months ago by chris

comment:14 Changed 18 months ago by chris

comment:15 Changed 18 months ago by chris

comment:16 Changed 18 months ago by chris

comment:17 Changed 18 months ago by chris

comment:18 Changed 18 months ago by chris

comment:19 Changed 18 months ago by chris

comment:20 Changed 18 months ago by chris

comment:23 follow-up: ↓ 24 Changed 18 months ago by ade

comment:24 in reply to: ↑ 23 Changed 18 months ago by chris

comment:25 Changed 18 months ago by chris

comment:26 Changed 17 months ago by chris

comment:27 follow-up: ↓ 28 Changed 17 months ago by paul

comment:28 in reply to: ↑ 27 Changed 17 months ago by chris

comment:29 Changed 17 months ago by chris

comment:30 Changed 16 months ago by chris

comment:31 Changed 15 months ago by chris

Changed 15 months ago by chris

comment:32 Changed 15 months ago by chris

Changed 15 months ago by chris

Changed 15 months ago by chris

comment:33 Changed 15 months ago by chris

Changed 15 months ago by chris

Changed 15 months ago by chris

Changed 15 months ago by chris

Changed 15 months ago by chris

comment:34 Changed 15 months ago by chris

Changed 15 months ago by chris

comment:35 Changed 15 months ago by chris

comment:36 Changed 15 months ago by chris

comment:37 Changed 15 months ago by chris

comment:38 Changed 15 months ago by chris

comment:39 Changed 15 months ago by chris

comment:40 Changed 15 months ago by chris

comment:41 Changed 15 months ago by chris

comment:42 Changed 15 months ago by chris

comment:51 follow-up: ↓ 52 Changed 15 months ago by sam

comment:52 in reply to: ↑ 51 Changed 15 months ago by chris

Changed 12 months ago by chris

comment:53 Changed 12 months ago by chris

comment:54 Changed 12 months ago by chris

comment:55 Changed 12 months ago by chris

comment:56 Changed 12 months ago by chris

comment:57 Changed 12 months ago by chris

comment:58 Changed 12 months ago by chris

comment:59 Changed 12 months ago by chris

comment:60 Changed 12 months ago by chris

comment:61 Changed 12 months ago by chris

comment:62 follow-up: ↓ 63 Changed 12 months ago by chris

comment:63 in reply to: ↑ 62 ; follow-up: ↓ 65 Changed 12 months ago by sam

comment:64 follow-up: ↓ 66 Changed 12 months ago by sam

comment:65 in reply to: ↑ 63 Changed 12 months ago by chris

comment:66 in reply to: ↑ 64 Changed 12 months ago by chris

Changed 12 months ago by chris

comment:67 Changed 12 months ago by chris

comment:70 follow-up: ↓ 71 Changed 12 months ago by ade

comment:71 in reply to: ↑ 70 Changed 12 months ago by chris

comment:72 Changed 11 months ago by chris

comment:73 Changed 11 months ago by chris

comment:74 Changed 11 months ago by chris