Ticket #846 (closed maintenance: fixed)
Load Spikes on BOA PuffinServer
Reported by: | chris | Owned by: | chris |
---|---|---|---|
Priority: | major | Milestone: | Maintenance |
Component: | Live server | Keywords: | |
Cc: | ade, annesley, paul, sam, kate | Estimated Number of Hours: | 0.0 |
Add Hours to Ticket: | 0 | Billable?: | yes |
Total Hours: | 12.59 |
Description (last modified by chris) (diff)
Creating this as a ticket to record load spikes and related site outages.
See wiki:PuffinServer#LoadSpikes for links to historic issues of this nature.
Attachments
Change History
comment:1 Changed 20 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 0.0 to 0.25
comment:3 Changed 19 months ago by chris
There was a load spike just now:
Date: Sun, 19 Apr 2015 11:15:49 +0100 (BST) From: root@puffin.webarch.net To: chris@webarchitects.co.uk Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 45.80 Time: Sun Apr 19 11:15:44 2015 +0100 1 Min Load Avg: 75.88 5 Min Load Avg: 45.80 15 Min Load Avg: 23.44 Running/Total Processes: 32/518
I think this was high enough for the webserver to be stopped, I also got a pingdom and Nagios alert about it.
comment:4 Changed 19 months ago by chris
Another just now:
Date: Wed, 22 Apr 2015 10:49:37 +0100 (BST) From: root@puffin.webarch.net To: chris@webarchitects.co.uk Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 60.50 Time: Wed Apr 22 10:49:37 2015 +0100 1 Min Load Avg: 118.47 5 Min Load Avg: 60.50 15 Min Load Avg: 25.26 Running/Total Processes: 42/524
comment:5 Changed 19 months ago by chris
It seems to have passed now, it spiked a bit higher than the last comment:
Date: Wed, 22 Apr 2015 11:09:31 +0100 (BST) From: root@puffin.webarch.net To: chris@webarchitects.co.uk Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 80.03 Time: Wed Apr 22 11:09:30 2015 +0100 1 Min Load Avg: 121.18 5 Min Load Avg: 80.03 15 Min Load Avg: 44.00 Running/Total Processes: 57/532
comment:6 Changed 19 months ago by chris
The email this is a reply to should have updated this ticket: /trac/ticket/846 This is a test to try to work out why it didn't. On Thu 23-Apr-2015 at 10:43:51AM +0100, Ade Stuart wrote: > > The time frames for these peaks seem consistent. Is there a crone task > happening between 10.45 and 11.15? My starter for ten would be that > something is using up the system resource, and hence the spikes. > Thoughts?
comment:7 Changed 19 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 0.25 to 0.5
OK, weird, my email worked, I have updated the whitelist in /etc/email2trac.conf -- perhaps that is the cause... I did read ticket:494 and ticket:819 trying to find an answer to this...
Regarding the crontab on PuffinServer, perhaps, but there are not any specific ones at that time:
* * * * * bash /var/xdrago/second.sh >/dev/null 2>&1 * * * * * bash /var/xdrago/minute.sh >/dev/null 2>&1 * * * * * bash /var/xdrago/runner.sh >/dev/null 2>&1 * * * * * bash /var/xdrago/manage_ltd_users.sh >/dev/null 2>&1 03 * * * * bash /var/xdrago/clear.sh >/dev/null 2>&1 14 * * * * bash /var/xdrago/purge_binlogs.sh >/dev/null 2>&1 01 0 * * * bash /var/xdrago/graceful.sh >/dev/null 2>&1 08 0 * * * bash /var/xdrago/mysql_backup.sh >/dev/null 2>&1 08 2 * * * bash /opt/local/bin/backboa backup >/dev/null 2>&1 08 8 * * * bash /opt/local/bin/duobackboa backup >/dev/null 2>&1 58 2 * * * bash /var/xdrago/daily.sh >/dev/null 2>&1 58 5 * * * bash /var/xdrago/fetch-remote.sh >/dev/null 2>&1
comment:8 Changed 19 months ago by chris
Just had a massive load spike:
uptime 19:16:12 up 267 days, 5:59, 1 user, load average: 198.45, 175.53, 159.21
Seems to be recovering...
comment:9 Changed 19 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 0.5 to 0.75
I did manage to ssh in (eventually) and get top to display what was going on just as the load was dropping off and it was bash and dash processes that were dominating so I have added dash and bash to this graph:
Note that the php5-fpm memory usage is very much up over the last few weeks.
I have also added the multip plugin to track the number of processes (same ones as for multips_memory):
[multips] env.names nginx php_fpm mysqld redis-server munin-node bash dash user root [multips_memory] env.names nginx php-fpm mysqld redis-server munin-node bash dash user root
BOA runs multiple bash and dash scripts all the time:
ps -lA | grep dash 4 S 0 9185 9175 1 80 0 - 1033 - ? 00:00:01 dash 4 S 0 9190 9174 3 80 0 - 1033 - ? 00:00:03 dash 4 S 0 11864 11861 0 80 0 - 1033 - ? 00:00:00 dash 4 S 0 11872 11865 0 80 0 - 1033 - ? 00:00:00 dash 4 S 0 11875 11866 0 80 0 - 1033 - ? 00:00:00 dash 4 S 0 58966 58959 0 80 0 - 1033 - ? 00:00:00 dash ps -lA | grep bash 0 S 0 11924 11875 0 80 0 - 2721 - ? 00:00:00 bash 4 S 0 15765 15759 0 80 0 - 2693 - ? 00:00:00 bash 0 S 0 15803 15764 0 80 0 - 2721 - ? 00:00:00 bash 0 S 0 15805 15762 0 80 0 - 2698 - ? 00:00:00 bash 0 S 1000 39464 39201 0 80 0 - 5060 - pts/0 00:00:06 bash 4 S 0 40513 40175 0 80 0 - 5217 - pts/0 00:00:15 bash 0 S 0 51716 1 2 80 0 - 3180 - ? 2-05:47:25 bash 0 S 0 51721 1 2 80 0 - 3180 - ? 2-05:22:41 bash 0 S 0 51726 1 2 80 0 - 3180 - ? 2-04:59:57 bash 0 S 0 51731 1 2 80 0 - 3180 - ? 2-05:05:06 bash 0 S 0 51736 1 2 80 0 - 3180 - ? 2-05:12:54 bash 4 S 0 58973 58966 0 80 0 - 2700 - ? 00:00:00 bash
I also removed the pure-ftpd server from this list as we are no longer running it.
PS this ticket was closed as a dupe: ticket:852
comment:10 Changed 18 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 0.75 to 1.0
We just had a load spike:
Date: Fri, 22 May 2015 13:14:39 +0100 (BST) From: root@puffin.webarch.net To: chris@webarchitects.co.uk Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 41.22 Time: Fri May 22 13:14:39 2015 +0100 1 Min Load Avg: 74.26 5 Min Load Avg: 41.22 15 Min Load Avg: 18.13 Running/Total Processes: 37/509
And there was a corresponding spike in bash and dash processes (not sure if this was the cause or a consequence):
From looking at top I have added to the list of processes the multips graphs track:
[multips] env.names nginx php_fpm mysqld redis-server munin-node bash dash awk perl csf user root [multips_memory] env.names nginx php-fpm mysqld redis-server munin-node bash dash awk perl csf user root
See:
comment:11 Changed 18 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.18
- Total Hours changed from 1.0 to 1.18
We just had a rather large load spike, one of the email alerts:
Date: Tue, 2 Jun 2015 17:19:28 +0100 (BST) From: root@puffin.webarch.net To: chris@webarchitects.co.uk Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 151.07 Time: Tue Jun 2 17:19:27 2015 +0100 1 Min Load Avg: 165.73 5 Min Load Avg: 151.07 15 Min Load Avg: 94.10 Running/Total Processes: 122/655
And the BOA firewall script has tried to block the Webarchitects Nagios monitoring server:
Date: Tue, 2 Jun 2015 17:19:39 +0100 (BST) From: root@puffin.webarch.net To: chris@webarchitects.co.uk Subject: lfd on puffin.webarch.net: blocked 81.95.52.66 (GB/United Kingdom/nsa.rat.burntout.org) Time: Tue Jun 2 17:19:37 2015 +0100 IP: 81.95.52.66 (GB/United Kingdom/nsa.rat.burntout.org) Failures: 5 (sshd) Interval: 300 seconds Blocked: Permanent Block (IP match in csf.allow, block may not work) Log entries: Jun 2 17:19:02 puffin sshd[32711]: Did not receive identification string from 81.95.52.66 Jun 2 17:19:02 puffin sshd[32728]: Did not receive identification string from 81.95.52.66 Jun 2 17:19:03 puffin sshd[32756]: Did not receive identification string from 81.95.52.66 Jun 2 17:19:04 puffin sshd[310]: Did not receive identification string from 81.95.52.66 Jun 2 17:19:07 puffin sshd[413]: Did not receive identification string from 81.95.52.66
But it looks like it failed as the server is whitelisted.
I can't ssh to the server, the connection times out but I can access the XenShell however I don't have a password to hand as I always use ssh keys so I can't login.
So we could either wait for it to recover or I should shut it down and restart it, I'll wate 5 mins and see if it recovers...
Note that the server load has been so hight that it hasn't been able to generate any Munin stats -- we simply have gap in the graphs:
comment:12 Changed 18 months ago by chris
Here is another of the emails, when this one was sent the 1 Min Load Avg was 180:
Date: Tue, 2 Jun 2015 17:21:48 +0100 (BST) From: root@puffin.webarch.net To: chris@webarchitects.co.uk Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 161.05 Time: Tue Jun 2 17:21:47 2015 +0100 1 Min Load Avg: 180.22 5 Min Load Avg: 161.05 15 Min Load Avg: 105.43 Running/Total Processes: 174/862
comment:13 Changed 18 months ago by chris
The website is producing 502 Bad Gateway errors, this means Nginx is up and running but it can't connect to php-fpm.
comment:14 Changed 18 months ago by chris
I'm going to shut it down and restart it.
comment:15 Changed 18 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 1.18 to 1.28
Running xm shutdown puffin.webarch.net wasn't resulting in the server being shutdown so I ran:
xm destroy puffin.webarch.net xm create puffin.webarch.net.cfg
And it's on it's way back up now.
comment:16 Changed 18 months ago by chris
It's up again.
comment:17 Changed 18 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 1.28 to 1.53
Not a very big load spike on PuffinServer but it must have got a lot higher:
Date: Thu, 4 Jun 2015 09:54:16 +0100 (BST) From: root@puffin.webarch.net To: chris@webarchitects.co.uk Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 14.22 Time: Thu Jun 4 09:52:36 2015 +0100 1 Min Load Avg: 39.15 5 Min Load Avg: 14.22 15 Min Load Avg: 5.67 Running/Total Processes: 47/430
It is was not responding to HTTP or SSH requests, but I'm now in and it is recovering:
uptime 10:12:11 up 1 day, 16:18, 1 user, load average: 89.40, 126.87, 103.35
Looking in /var/log/syslog it was a kernel issue and we have had this a few times before:
This is what we have in syslog:
Jun 4 09:49:50 puffin mysqld: 150604 9:49:50 [Warning] Aborted connection 294666 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error) Jun 4 09:49:51 puffin mysqld: 150604 9:49:51 [Warning] Aborted connection 294667 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error) Jun 4 09:49:51 puffin mysqld: 150604 9:49:51 [Warning] Aborted connection 294665 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error) Jun 4 09:49:52 puffin mysqld: 150604 9:49:51 [Warning] Aborted connection 294664 to db: 'transitionnetw_0' user: 'transitionnetw_0' host: 'localhost' (Unknown error) Jun 4 09:50:03 puffin kernel: [143763.270448] INFO: task vnstatd:2101 blocked for more than 120 seconds. Jun 4 09:50:03 puffin kernel: [143763.270470] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 4 09:50:03 puffin kernel: [143763.270479] vnstatd D ffff8801fb30d4c0 0 2101 1 0x00000000 Jun 4 09:50:03 puffin kernel: [143763.270490] ffff8801fb30d4c0 0000000000000282 ffff8801ffc15000 ffff880000009680 Jun 4 09:50:03 puffin kernel: [143763.270503] ffff8801fd5a9d58 ffffffff81012cdb 000000000000f9e0 ffff8801fd5a9fd8 Jun 4 09:50:03 puffin kernel: [143763.270514] 00000000000157c0 00000000000157c0 ffff8801ff2a0710 ffff8801ff2a0a08 Jun 4 09:50:03 puffin kernel: [143763.270525] Call Trace: Jun 4 09:50:03 puffin kernel: [143763.270537] [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20 Jun 4 09:50:03 puffin kernel: [143763.270548] [<ffffffff81155196>] ? cap_inode_permission+0x0/0x3 Jun 4 09:50:03 puffin kernel: [143763.270576] [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192 Jun 4 09:50:03 puffin kernel: [143763.270587] [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31 Jun 4 09:50:03 puffin kernel: [143763.270595] [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d Jun 4 09:50:03 puffin kernel: [143763.270603] [<ffffffff810f86aa>] ? __link_path_walk+0x323/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.270611] [<ffffffff810f8caa>] ? path_walk+0x66/0xc9 Jun 4 09:50:03 puffin kernel: [143763.270619] [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77 Jun 4 09:50:03 puffin kernel: [143763.270627] [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b Jun 4 09:50:03 puffin kernel: [143763.270635] [<ffffffff8130f054>] ? _spin_lock_irqsave+0x15/0x34 Jun 4 09:50:03 puffin kernel: [143763.270644] [<ffffffff81068fa6>] ? hrtimer_try_to_cancel+0x3a/0x43 Jun 4 09:50:03 puffin kernel: [143763.270653] [<ffffffff8102de30>] ? pvclock_clocksource_read+0x3a/0x8b Jun 4 09:50:03 puffin kernel: [143763.270662] [<ffffffff81103705>] ? alloc_fd+0x67/0x10c Jun 4 09:50:03 puffin kernel: [143763.270696] [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc Jun 4 09:50:03 puffin kernel: [143763.270705] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b Jun 4 09:50:03 puffin kernel: [143763.270763] INFO: task id:11812 blocked for more than 120 seconds. Jun 4 09:50:03 puffin kernel: [143763.270769] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 4 09:50:03 puffin kernel: [143763.270777] id D ffff8801fcd9b880 0 11812 11811 0x00000080 Jun 4 09:50:03 puffin kernel: [143763.270787] ffff8801fcd9b880 0000000000000286 ffff880002b27d88 ffff8801fedc1c20 Jun 4 09:50:03 puffin kernel: [143763.270798] 00000000000006fe ffffffff81012cdb 000000000000f9e0 ffff880002b27fd8 Jun 4 09:50:03 puffin kernel: [143763.270809] 00000000000157c0 00000000000157c0 ffff8801fabf3170 ffff8801fabf3468 Jun 4 09:50:03 puffin kernel: [143763.270821] Call Trace: Jun 4 09:50:03 puffin kernel: [143763.270827] [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20 Jun 4 09:50:03 puffin kernel: [143763.270836] [<ffffffff81155196>] ? cap_inode_permission+0x0/0x3 Jun 4 09:50:03 puffin kernel: [143763.270845] [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192 Jun 4 09:50:03 puffin kernel: [143763.270854] [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31 Jun 4 09:50:03 puffin kernel: [143763.270861] [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d Jun 4 09:50:03 puffin kernel: [143763.270869] [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.270876] [<ffffffff810f8caa>] ? path_walk+0x66/0xc9 Jun 4 09:50:03 puffin kernel: [143763.270884] [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77 Jun 4 09:50:03 puffin kernel: [143763.270892] [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b Jun 4 09:50:03 puffin kernel: [143763.270901] [<ffffffff810d217e>] ? vma_link+0x74/0x9a Jun 4 09:50:03 puffin kernel: [143763.270908] [<ffffffff81103705>] ? alloc_fd+0x67/0x10c Jun 4 09:50:03 puffin kernel: [143763.270915] [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc Jun 4 09:50:03 puffin kernel: [143763.270923] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b Jun 4 09:50:03 puffin kernel: [143763.270931] INFO: task cron:11813 blocked for more than 120 seconds. Jun 4 09:50:03 puffin kernel: [143763.270937] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 4 09:50:03 puffin kernel: [143763.270945] cron D ffff8801fb1de2e0 0 11813 7882 0x00000080 Jun 4 09:50:03 puffin kernel: [143763.270954] ffff8801fb1de2e0 0000000000000282 0000000000000001 00000001810423b0 Jun 4 09:50:03 puffin kernel: [143763.270965] ffff88000bb357c0 0000000000000001 000000000000f9e0 ffff880002ab1fd8 Jun 4 09:50:03 puffin kernel: [143763.270976] 00000000000157c0 00000000000157c0 ffff8801ecdf2350 ffff8801ecdf2648 Jun 4 09:50:03 puffin kernel: [143763.270987] Call Trace: Jun 4 09:50:03 puffin kernel: [143763.270994] [<ffffffff811976a6>] ? vsnprintf+0x40a/0x449 Jun 4 09:50:03 puffin kernel: [143763.271002] [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192 Jun 4 09:50:03 puffin kernel: [143763.271011] [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31 Jun 4 09:50:03 puffin kernel: [143763.271018] [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d Jun 4 09:50:03 puffin kernel: [143763.271025] [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271033] [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317 Jun 4 09:50:03 puffin kernel: [143763.271041] [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271049] [<ffffffff810f8caa>] ? path_walk+0x66/0xc9 Jun 4 09:50:03 puffin kernel: [143763.271057] [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77 Jun 4 09:50:03 puffin kernel: [143763.271064] [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b Jun 4 09:50:03 puffin kernel: [143763.271072] [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20 Jun 4 09:50:03 puffin kernel: [143763.271081] [<ffffffff8130dda8>] ? thread_return+0x79/0xe0 Jun 4 09:50:03 puffin kernel: [143763.271089] [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50 Jun 4 09:50:03 puffin kernel: [143763.271097] [<ffffffff81103705>] ? alloc_fd+0x67/0x10c Jun 4 09:50:03 puffin kernel: [143763.271104] [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc Jun 4 09:50:03 puffin kernel: [143763.271111] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b Jun 4 09:50:03 puffin kernel: [143763.271119] INFO: task cron:11814 blocked for more than 120 seconds. Jun 4 09:50:03 puffin kernel: [143763.271125] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 4 09:50:03 puffin kernel: [143763.271133] cron D ffff8801fc48e2e0 0 11814 7882 0x00000080 Jun 4 09:50:03 puffin kernel: [143763.271142] ffff8801fc48e2e0 0000000000000286 0000000000000001 00000001810423b0 Jun 4 09:50:03 puffin kernel: [143763.271154] ffff88000bc617c0 000000000000000b 000000000000f9e0 ffff88007f0d7fd8 Jun 4 09:50:03 puffin kernel: [143763.271165] 00000000000157c0 00000000000157c0 ffff8801ecdf0710 ffff8801ecdf0a08 Jun 4 09:50:03 puffin kernel: [143763.271175] Call Trace: Jun 4 09:50:03 puffin kernel: [143763.271181] [<ffffffff811976a6>] ? vsnprintf+0x40a/0x449 Jun 4 09:50:03 puffin kernel: [143763.271189] [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192 Jun 4 09:50:03 puffin kernel: [143763.271198] [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31 Jun 4 09:50:03 puffin kernel: [143763.271205] [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d Jun 4 09:50:03 puffin kernel: [143763.271213] [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271221] [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317 Jun 4 09:50:03 puffin kernel: [143763.271229] [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271237] [<ffffffff810f8caa>] ? path_walk+0x66/0xc9 Jun 4 09:50:03 puffin kernel: [143763.271244] [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77 Jun 4 09:50:03 puffin kernel: [143763.271252] [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b Jun 4 09:50:03 puffin kernel: [143763.271259] [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20 Jun 4 09:50:03 puffin kernel: [143763.271268] [<ffffffff8130dda8>] ? thread_return+0x79/0xe0 Jun 4 09:50:03 puffin kernel: [143763.271275] [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50 Jun 4 09:50:03 puffin kernel: [143763.271283] [<ffffffff81103705>] ? alloc_fd+0x67/0x10c Jun 4 09:50:03 puffin kernel: [143763.271291] [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc Jun 4 09:50:03 puffin kernel: [143763.271298] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b Jun 4 09:50:03 puffin kernel: [143763.271305] INFO: task cron:11815 blocked for more than 120 seconds. Jun 4 09:50:03 puffin kernel: [143763.271311] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 4 09:50:03 puffin kernel: [143763.271319] cron D ffff8801fdfc8e20 0 11815 7882 0x00000080 Jun 4 09:50:03 puffin kernel: [143763.271328] ffff8801fdfc8e20 0000000000000286 0000000000000001 00000001810423b0 Jun 4 09:50:03 puffin kernel: [143763.271339] ffff88000bc617c0 000000000000000b 000000000000f9e0 ffff880002fa5fd8 Jun 4 09:50:03 puffin kernel: [143763.271350] 00000000000157c0 00000000000157c0 ffff8801ecdf7810 ffff8801ecdf7b08 Jun 4 09:50:03 puffin kernel: [143763.271362] Call Trace: Jun 4 09:50:03 puffin kernel: [143763.271368] [<ffffffff811976a6>] ? vsnprintf+0x40a/0x449 Jun 4 09:50:03 puffin kernel: [143763.271376] [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192 Jun 4 09:50:03 puffin kernel: [143763.271385] [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31 Jun 4 09:50:03 puffin kernel: [143763.271392] [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d Jun 4 09:50:03 puffin kernel: [143763.271400] [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271408] [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317 Jun 4 09:50:03 puffin kernel: [143763.271416] [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271423] [<ffffffff810f8caa>] ? path_walk+0x66/0xc9 Jun 4 09:50:03 puffin kernel: [143763.271431] [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77 Jun 4 09:50:03 puffin kernel: [143763.271439] [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b Jun 4 09:50:03 puffin kernel: [143763.271446] [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20 Jun 4 09:50:03 puffin kernel: [143763.271455] [<ffffffff8130dda8>] ? thread_return+0x79/0xe0 Jun 4 09:50:03 puffin kernel: [143763.271462] [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50 Jun 4 09:50:03 puffin kernel: [143763.271470] [<ffffffff81103705>] ? alloc_fd+0x67/0x10c Jun 4 09:50:03 puffin kernel: [143763.271478] [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc Jun 4 09:50:03 puffin kernel: [143763.271485] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b Jun 4 09:50:03 puffin kernel: [143763.271492] INFO: task cron:11816 blocked for more than 120 seconds. Jun 4 09:50:03 puffin kernel: [143763.271498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 4 09:50:03 puffin kernel: [143763.271506] cron D ffff8801fdfcc6a0 0 11816 7882 0x00000080 Jun 4 09:50:03 puffin kernel: [143763.271516] ffff8801fdfcc6a0 0000000000000286 0000000000000001 00000001810423b0 Jun 4 09:50:03 puffin kernel: [143763.271527] ffff88000bb537c0 ffffffff81012cdb 000000000000f9e0 ffff88007d50ffd8 Jun 4 09:50:03 puffin kernel: [143763.271537] 00000000000157c0 00000000000157c0 ffff8801ecdf0e20 ffff8801ecdf1118 Jun 4 09:50:03 puffin kernel: [143763.271549] Call Trace: Jun 4 09:50:03 puffin kernel: [143763.271554] [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20 Jun 4 09:50:03 puffin kernel: [143763.271563] [<ffffffff81155196>] ? cap_inode_permission+0x0/0x3 Jun 4 09:50:03 puffin kernel: [143763.271571] [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192 Jun 4 09:50:03 puffin kernel: [143763.271580] [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31 Jun 4 09:50:03 puffin kernel: [143763.271587] [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d Jun 4 09:50:03 puffin kernel: [143763.271594] [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271602] [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317 Jun 4 09:50:03 puffin kernel: [143763.271610] [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271618] [<ffffffff810f8caa>] ? path_walk+0x66/0xc9 Jun 4 09:50:03 puffin kernel: [143763.271625] [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77 Jun 4 09:50:03 puffin kernel: [143763.271633] [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b Jun 4 09:50:03 puffin kernel: [143763.271641] [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20 Jun 4 09:50:03 puffin kernel: [143763.271649] [<ffffffff8130dda8>] ? thread_return+0x79/0xe0 Jun 4 09:50:03 puffin kernel: [143763.271657] [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50 Jun 4 09:50:03 puffin kernel: [143763.271665] [<ffffffff81103705>] ? alloc_fd+0x67/0x10c Jun 4 09:50:03 puffin kernel: [143763.271672] [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc Jun 4 09:50:03 puffin kernel: [143763.271679] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b Jun 4 09:50:03 puffin kernel: [143763.271687] INFO: task cron:11817 blocked for more than 120 seconds. Jun 4 09:50:03 puffin kernel: [143763.271693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 4 09:50:03 puffin kernel: [143763.271701] cron D ffff88007f04b880 0 11817 7882 0x00000080 Jun 4 09:50:03 puffin kernel: [143763.271710] ffff88007f04b880 0000000000000286 0000000000000001 00000001810423b0 Jun 4 09:50:03 puffin kernel: [143763.271721] ffff88000bc077c0 0000000000000008 000000000000f9e0 ffff88007d079fd8 Jun 4 09:50:03 puffin kernel: [143763.271732] 00000000000157c0 00000000000157c0 ffff8801ecdf62e0 ffff8801ecdf65d8 Jun 4 09:50:03 puffin kernel: [143763.271743] Call Trace: Jun 4 09:50:03 puffin kernel: [143763.271749] [<ffffffff811976a6>] ? vsnprintf+0x40a/0x449 Jun 4 09:50:03 puffin kernel: [143763.271757] [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192 Jun 4 09:50:03 puffin kernel: [143763.271766] [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31 Jun 4 09:50:03 puffin kernel: [143763.271773] [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d Jun 4 09:50:03 puffin kernel: [143763.271780] [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271789] [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317 Jun 4 09:50:03 puffin kernel: [143763.271796] [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271804] [<ffffffff810f8caa>] ? path_walk+0x66/0xc9 Jun 4 09:50:03 puffin kernel: [143763.271812] [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77 Jun 4 09:50:03 puffin kernel: [143763.271819] [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b Jun 4 09:50:03 puffin kernel: [143763.271827] [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20 Jun 4 09:50:03 puffin kernel: [143763.271836] [<ffffffff8130dda8>] ? thread_return+0x79/0xe0 Jun 4 09:50:03 puffin kernel: [143763.271843] [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50 Jun 4 09:50:03 puffin kernel: [143763.271851] [<ffffffff81103705>] ? alloc_fd+0x67/0x10c Jun 4 09:50:03 puffin kernel: [143763.271858] [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc Jun 4 09:50:03 puffin kernel: [143763.271865] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b Jun 4 09:50:03 puffin kernel: [143763.271873] INFO: task cron:11818 blocked for more than 120 seconds. Jun 4 09:50:03 puffin kernel: [143763.271879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 4 09:50:03 puffin kernel: [143763.271886] cron D ffff8801ecdf7810 0 11818 7882 0x00000080 Jun 4 09:50:03 puffin kernel: [143763.271895] ffff8801ecdf7810 0000000000000282 0000000000000001 00000001810423b0 Jun 4 09:50:03 puffin kernel: [143763.271906] ffff88000bc7f7c0 000000000000000c 000000000000f9e0 ffff88007d0b3fd8 Jun 4 09:50:03 puffin kernel: [143763.271917] 00000000000157c0 00000000000157c0 ffff8801ecdf2a60 ffff8801ecdf2d58 Jun 4 09:50:03 puffin kernel: [143763.271928] Call Trace: Jun 4 09:50:03 puffin kernel: [143763.271934] [<ffffffff811976a6>] ? vsnprintf+0x40a/0x449 Jun 4 09:50:03 puffin kernel: [143763.271942] [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192 Jun 4 09:50:03 puffin kernel: [143763.271950] [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31 Jun 4 09:50:03 puffin kernel: [143763.271957] [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d Jun 4 09:50:03 puffin kernel: [143763.271964] [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271972] [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317 Jun 4 09:50:03 puffin kernel: [143763.271980] [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.271988] [<ffffffff810f8caa>] ? path_walk+0x66/0xc9 Jun 4 09:50:03 puffin kernel: [143763.271995] [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77 Jun 4 09:50:03 puffin kernel: [143763.272003] [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b Jun 4 09:50:03 puffin kernel: [143763.272011] [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20 Jun 4 09:50:03 puffin kernel: [143763.482981] [<ffffffff8130dda8>] ? thread_return+0x79/0xe0 Jun 4 09:50:03 puffin kernel: [143763.482993] [<ffffffff811901fb>] ? _atomic_dec_and_lock+0x33/0x50 Jun 4 09:50:03 puffin kernel: [143763.483003] [<ffffffff81103705>] ? alloc_fd+0x67/0x10c Jun 4 09:50:03 puffin kernel: [143763.483012] [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc Jun 4 09:50:03 puffin kernel: [143763.483020] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b Jun 4 09:50:03 puffin kernel: [143763.483030] INFO: task id:11820 blocked for more than 120 seconds. Jun 4 09:50:03 puffin kernel: [143763.483037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 4 09:50:03 puffin kernel: [143763.483045] id D ffff8801fabba350 0 11820 11819 0x00000080 Jun 4 09:50:03 puffin kernel: [143763.483055] ffff8801fabba350 0000000000000286 ffff88007d593d88 ffff8801fedc1c20 Jun 4 09:50:03 puffin kernel: [143763.483067] 00000000000006fe ffff880000009680 000000000000f9e0 ffff88007d593fd8 Jun 4 09:50:03 puffin kernel: [143763.483078] 00000000000157c0 00000000000157c0 ffff8801fabb8710 ffff8801fabb8a08 Jun 4 09:50:03 puffin kernel: [143763.483089] Call Trace: Jun 4 09:50:03 puffin kernel: [143763.483097] [<ffffffff810b9b8b>] ? zone_watermark_ok+0x20/0xb1 Jun 4 09:50:03 puffin kernel: [143763.483107] [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192 Jun 4 09:50:03 puffin kernel: [143763.483115] [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31 Jun 4 09:50:03 puffin kernel: [143763.483123] [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d Jun 4 09:50:03 puffin kernel: [143763.483131] [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.483139] [<ffffffff810f8caa>] ? path_walk+0x66/0xc9 Jun 4 09:50:03 puffin kernel: [143763.483146] [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77 Jun 4 09:50:03 puffin kernel: [143763.483154] [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b Jun 4 09:50:03 puffin kernel: [143763.483163] [<ffffffff810d217e>] ? vma_link+0x74/0x9a Jun 4 09:50:03 puffin kernel: [143763.483170] [<ffffffff81103705>] ? alloc_fd+0x67/0x10c Jun 4 09:50:03 puffin kernel: [143763.483178] [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc Jun 4 09:50:03 puffin kernel: [143763.483185] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b Jun 4 09:50:03 puffin kernel: [143763.483194] INFO: task awk:11827 blocked for more than 120 seconds. Jun 4 09:50:03 puffin kernel: [143763.483200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 4 09:50:03 puffin kernel: [143763.483207] awk D ffff88019001a350 0 11827 11823 0x00000080 Jun 4 09:50:03 puffin kernel: [143763.483216] ffff88019001a350 0000000000000282 0000000000000000 0000000000000000 Jun 4 09:50:03 puffin kernel: [143763.483228] 0000000000000000 ffffffff81012cdb 000000000000f9e0 ffff88007f0d9fd8 Jun 4 09:50:03 puffin kernel: [143763.483239] 00000000000157c0 00000000000157c0 ffff8801fabb8000 ffff8801fabb82f8 Jun 4 09:50:03 puffin kernel: [143763.483249] Call Trace: Jun 4 09:50:03 puffin kernel: [143763.483256] [<ffffffff81012cdb>] ? xen_hypervisor_callback+0x1b/0x20 Jun 4 09:50:03 puffin kernel: [143763.483266] [<ffffffff81155196>] ? cap_inode_permission+0x0/0x3 Jun 4 09:50:03 puffin kernel: [143763.506896] [<ffffffff8130e5c3>] ? __mutex_lock_common+0x122/0x192 Jun 4 09:50:03 puffin kernel: [143763.506910] [<ffffffff8130e6eb>] ? mutex_lock+0x1a/0x31 Jun 4 09:50:03 puffin kernel: [143763.506919] [<ffffffff810f7eac>] ? do_lookup+0x80/0x15d Jun 4 09:50:03 puffin kernel: [143763.506927] [<ffffffff810f892c>] ? __link_path_walk+0x5a5/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.506935] [<ffffffff810f826a>] ? do_follow_link+0x1fa/0x317 Jun 4 09:50:03 puffin kernel: [143763.506943] [<ffffffff810f86e0>] ? __link_path_walk+0x359/0x6f5 Jun 4 09:50:03 puffin kernel: [143763.506952] [<ffffffff810f8caa>] ? path_walk+0x66/0xc9 Jun 4 09:50:03 puffin kernel: [143763.506960] [<ffffffff810fa114>] ? do_path_lookup+0x20/0x77 Jun 4 09:50:03 puffin kernel: [143763.506967] [<ffffffff810fa2a0>] ? do_filp_open+0xe5/0x94b Jun 4 09:50:03 puffin kernel: [143763.506975] [<ffffffff8100c4e8>] ? pte_pfn_to_mfn+0x21/0x30 Jun 4 09:50:03 puffin kernel: [143763.506983] [<ffffffff8100cc43>] ? xen_make_pte+0x7b/0x83 Jun 4 09:50:03 puffin kernel: [143763.506991] [<ffffffff8100e585>] ? xen_set_pte_at+0xa2/0xc2 Jun 4 09:50:03 puffin kernel: [143763.507000] [<ffffffff8105cd8e>] ? do_sigaction+0x159/0x171 Jun 4 09:50:03 puffin kernel: [143763.507008] [<ffffffff81103705>] ? alloc_fd+0x67/0x10c Jun 4 09:50:03 puffin kernel: [143763.507016] [<ffffffff810eeacf>] ? do_sys_open+0x55/0xfc Jun 4 09:50:03 puffin kernel: [143763.507023] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
It has more-or-less recovered now:
uptime 10:20:55 up 1 day, 16:26, 1 user, load average: 2.21, 24.19, 60.46
I have reset the root passwords on all three servers so that I can login via the xen consol when needed.
comment:18 Changed 18 months ago by chris
The load alert emails didn't go out while the load was so high, they have come in now, this was the peak:
Date: Thu, 4 Jun 2015 10:08:33 +0100 (BST) From: root@puffin.webarch.net To: chris@webarchitects.co.uk Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 150.63 Time: Thu Jun 4 10:08:33 2015 +0100 1 Min Load Avg: 171.68 5 Min Load Avg: 150.63 15 Min Load Avg: 100.19 Running/Total Processes: 114/959
comment:19 Changed 18 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.05
- Total Hours changed from 1.53 to 1.58
Looks like another load spike now, uptime is not producing any results and I can't long via the xen console, I'm going to restart the server.
comment:20 Changed 18 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.03
- Total Hours changed from 1.58 to 1.61
It's on it's way back up now, the firewall is loading and this always takes ages.
comment:23 follow-up: ↓ 24 Changed 18 months ago by ade
Hi Chris, Do you have an understanding as to why this is happening? It seems that as well as the known issues, such as the the boa replication issues, the site going down is now becoming a regular issue, and we need to ascertain why this is the case. Your thoughts? best regards Ade On 4 June 2015 at 10:36, Transition Technology Trac < trac@tech.transitionnetwork.org> wrote: > #846: Load Spikes on BOA PuffinServer > -------------------------------------+------------------------------------- > Reporter: chris | Owner: chris > Type: maintenance | Status: new > Priority: major | Milestone: > Component: Live server | Maintenance > Keywords: | Resolution: > Add Hours to Ticket: 0.03 | Estimated Number of Hours: 0.0 > Total Hours: 1.58 | Billable?: 1 > -------------------------------------+------------------------------------- > Changes (by chris): > > * hours: 0.0 => 0.03 > * totalhours: 1.58 => 1.61 > > > Comment: > > It's on it's way back up now, the firewall is loading and tyhis always > takes '''ages'''. > > -- > Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:20 > > > Transition Technology <https://tech.transitionnetwork.org/trac> > Support and issues tracking for the Transition Network Web Project. > -- Ade Stuart Web Manager - Transition network 07595 331877 The Transition Network is a registered charity address: 43 Fore St, Totnes, Devon, TQ9 5HN, UK website: www.transitionnetwork.org TN company no: 6135675 TN charity no: 1128675
comment:24 in reply to: ↑ 23 Changed 18 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 1.61 to 1.86
Replying to ade:
Do you have an understanding as to why this is happening?
Not really, we have had a long history of load spikes with BOA but have never really managed to get to the bottom of it, probably because there is more than one cause, I could point out lots of past tickets on this if that would help?
Of course load spikes are not specific to BOA but BOA makes it really hard to work out what is going on and even harder to intervene to try to fix / mitigate things.
It seems that as well as the known issues, such as the the boa replication
issues, the site going down is now becoming a regular issue, and we need to
ascertain why this is the case.
Your thoughts?
Yes it would be good to know the cause(s) of these, but to be honest I don't know if we ever will -- we have spent a lot of time on the load spike suicides in the past... see PuffinServer#LoadSpikes and:
Personally I'd be more interested in spending time sorting out a non-BOA server than spending even more time trying understand what BOA is doing, however regarding this specific load spike and kernel issue it might not be BOA related -- there are lots of links related to "task blocked for more than 120 seconds xen":
Would you like me to spend some time looking into this?
Regarding the time the BOA firewall takes to load, mentioned in ticket:846#comment:20, I just timed how long it takes just to list the firewall rules, 30 mins (!!!) -- there rules are over 5k lines long:
date; iptables -L > /tmp/ip.txt ; date Thu Jun 4 10:44:57 BST 2015 Thu Jun 4 11:15:29 BST 2015 cat /tmp/ip.txt | wc -l 5239
comment:25 Changed 18 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 1.86 to 2.11
Just had another load spike on PuffinServer, list is a list of the load alert emails from ldf, Saturday, Sunday and today:
Jun 05 High 5 minute load average alert - 14.69 Jun 05 High 5 minute load average alert - 28.96 Jun 05 High 5 minute load average alert - 19.62 Jun 05 High 5 minute load average alert - 13.18 Jun 06 High 5 minute load average alert - 14.89 Jun 06 High 5 minute load average alert - 30.80 Jun 06 High 5 minute load average alert - 53.29 Jun 06 High 5 minute load average alert - 65.00 Jun 06 High 5 minute load average alert - 61.44 Jun 06 High 5 minute load average alert - 29.48 Jun 06 High 5 minute load average alert - 19.79 Jun 06 High 5 minute load average alert - 15.18 Jun 07 High 5 minute load average alert - 13.54 Jun 07 High 5 minute load average alert - 69.17 Jun 07 High 5 minute load average alert - 96.37 Jun 07 High 5 minute load average alert - 90.83 Jun 07 High 5 minute load average alert - 88.37 Jun 07 High 5 minute load average alert - 85.94 Jun 07 High 5 minute load average alert - 85.82 Jun 07 High 5 minute load average alert - 87.89 Jun 07 High 5 minute load average alert - 91.81 Jun 07 High 5 minute load average alert - 97.62 Jun 07 High 5 minute load average alert - 105.56 Jun 07 High 5 minute load average alert - 107.19 Jun 07 High 5 minute load average alert - 114.79 Jun 07 High 5 minute load average alert - 42.17 Jun 07 High 5 minute load average alert - 28.30 Jun 07 High 5 minute load average alert - 19.12 Jun 07 High 5 minute load average alert - 12.91 Jun 07 High 5 minute load average alert - 17.87 Jun 07 High 5 minute load average alert - 18.92 Jun 07 High 5 minute load average alert - 14.26 Jun 07 High 5 minute load average alert - 13.46 Jun 07 High 5 minute load average alert - 20.99 Jun 07 High 5 minute load average alert - 15.80 Jun 08 High 5 minute load average alert - 15.42 Jun 08 High 5 minute load average alert - 19.44 Jun 08 High 5 minute load average alert - 96.48 Jun 08 High 5 minute load average alert - 111.23 Jun 08 High 5 minute load average alert - 73.42 Jun 08 High 5 minute load average alert - 49.12 Jun 08 High 5 minute load average alert - 32.89
The peak of the latest one:
Date: Mon, 8 Jun 2015 10:37:52 +0100 (BST) From: root@puffin.webarch.net To: chris@webarchitects.co.uk Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 111.23 Time: Mon Jun 8 10:37:52 2015 +0100 1 Min Load Avg: 102.54 5 Min Load Avg: 111.23 15 Min Load Avg: 62.46 Running/Total Processes: 3/472
It has recovered now, I couldn't find anything in the logs.
comment:26 Changed 17 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 2.11 to 2.36
There was a load spike just after 4am this morning, this first ldf alert:
Date: Tue, 7 Jul 2015 03:55:46 +0100 (BST) From: root@puffin.webarch.net Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 13.07 Time: Tue Jul 7 03:54:54 2015 +0100 1 Min Load Avg: 38.25 5 Min Load Avg: 13.07 15 Min Load Avg: 5.17 Running/Total Processes: 52/458
The peak:
Date: Tue, 7 Jul 2015 04:09:28 +0100 (BST) From: root@puffin.webarch.net Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 94.54 Time: Tue Jul 7 04:09:22 2015 +0100 1 Min Load Avg: 94.67 5 Min Load Avg: 94.54 15 Min Load Avg: 66.86 Running/Total Processes: 64/685
And the last alert:
Date: Tue, 7 Jul 2015 04:21:59 +0100 (BST) From: root@puffin.webarch.net Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 12.87 Time: Tue Jul 7 04:21:59 2015 +0100 1 Min Load Avg: 1.04 5 Min Load Avg: 12.87 15 Min Load Avg: 36.69 Running/Total Processes: 1/376
There were a lot of 503 errors served during this time, almost all to bots, for example:
"66.249.92.63" www.transitionnetwork.org [07/Jul/2015:04:13:41 +0100] "GET /news/network/feed HTTP/1.1" 503 206 341 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=12394840076319634974)" 0.000 "-" "207.46.13.110, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:13:45 +0100] "GET /initiatives/glastonbury HTTP/1.0" 503 206 406 371 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.001 "-" "66.249.92.60" www.transitionnetwork.org [07/Jul/2015:04:13:52 +0100] "GET /about HTTP/1.1" 503 206 243 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=2875799361570896823)" 0.000 "-" "207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:13:55 +0100] "GET /blogs/rob-hopkins/2013-01-14/british-bean-back-interview-josiah-meldrum-hodmedod-and-transition HTTP/1.1" 503 206 343 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "65.19.138.34" www.transitionnetwork.org [07/Jul/2015:04:13:58 +0100] "GET /blogs/feed/rob-hopkins/atom/ HTTP/1.1" 503 206 399 371 "-" "Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)" 0.142 "-" "207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:14:05 +0100] "GET /tags/colin-godmans-farm HTTP/1.1" 503 206 271 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "66.249.92.63" www.transitionnetwork.org [07/Jul/2015:04:14:07 +0100] "GET /blogs/feed HTTP/1.1" 503 206 334 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 10 subscribers; feed-id=2865359885998846808)" 0.000 "-" "65.19.138.33" www.transitionnetwork.org [07/Jul/2015:04:14:12 +0100] "GET /blogs/feed HTTP/1.1" 503 206 339 371 "-" "Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)" 0.177 "-" "207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:14:15 +0100] "GET /people/nyle-seabright HTTP/1.1" 503 206 269 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "66.249.78.181" www.transitionnetwork.org [07/Jul/2015:04:14:18 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 503 206 347 376 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.000 "-" "207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:14:25 +0100] "GET /people/adam-walker1 HTTP/1.1" 503 206 267 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "94.228.34.249" www.transitionnetwork.org [07/Jul/2015:04:14:32 +0100] "GET /forums/themes/local-government HTTP/1.1" 503 206 321 376 "-" "magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)" 0.000 "-" "207.46.13.110, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:14:35 +0100] "GET /initiatives/building-resilience-mission-brim HTTP/1.0" 503 206 427 371 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "66.249.92.63" www.transitionnetwork.org [07/Jul/2015:04:14:39 +0100] "GET /rss.xml HTTP/1.1" 503 206 245 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=2525513441670857354)" 0.000 "-" "207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:14:45 +0100] "GET /tags/leicesershire HTTP/1.1" 503 206 266 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "185.2.129.62" www.transitionnetwork.org [07/Jul/2015:04:14:49 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 503 206 243 376 "http://transitionculture.org/?feed=rss2" "SimplePie/1.0 Beta (Feed Parser; http://www.simplepie.org/; Allow like Gecko) Build/20060129" 0.000 "-" "185.2.129.62" www.transitionnetwork.org [07/Jul/2015:04:14:49 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 503 206 243 376 "http://transitionculture.org/?feed=rss2" "SimplePie/1.0 Beta (Feed Parser; http://www.simplepie.org/; Allow like Gecko) Build/20060129" 0.000 "-" "66.249.92.57" www.transitionnetwork.org [07/Jul/2015:04:14:49 +0100] "GET /news/feed HTTP/1.1" 503 206 332 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 4 subscribers; feed-id=7675538851756851077)" 0.000 "-" "207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:14:55 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 503 206 271 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "66.249.92.60" www.transitionnetwork.org [07/Jul/2015:04:15:00 +0100] "GET /blogs/feed/rob-hopkins/2013 HTTP/1.1" 503 206 265 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=3110067181455394580)" 0.000 "-" "66.249.78.181" www.transitionnetwork.org [07/Jul/2015:04:15:02 +0100] "GET /sites/default/files/resize/remote/b7409b46d23f7dd65940714c8201cc07-92x100.jpg HTTP/1.1" 503 206 313 376 "-" "Googlebot-Image/1.0" 0.000 "-" "207.46.13.130" www.transitionnetwork.org [07/Jul/2015:04:15:05 +0100] "GET /user/login?destination=news/2013-12-16/bank-england-takes-note-local-money HTTP/1.1" 503 206 354 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "66.249.92.63" www.transitionnetwork.org [07/Jul/2015:04:15:10 +0100] "GET /news/network/feed HTTP/1.1" 503 206 341 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=12394840076319634974)" 0.000 "-" "207.46.13.130" www.transitionnetwork.org [07/Jul/2015:04:15:05 +0100] "GET /user/login?destination=news/2013-12-16/bank-england-takes-note-local-money HTTP/1.1" 503 206 354 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "66.249.92.63" www.transitionnetwork.org [07/Jul/2015:04:15:10 +0100] "GET /news/network/feed HTTP/1.1" 503 206 341 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=12394840076319634974)" 0.000 "-" "66.249.78.217, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:11 +0100] "GET /robots.txt HTTP/1.0" 503 206 401 371 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.011 "-" "127.0.0.1" localhost [07/Jul/2015:04:15:14 +0100] "GET /robots.txt HTTP/1.1" 404 162 179 333 "-" "Wget/1.13.4 (linux-gnu)" 0.000 "-" "207.46.13.130" www.transitionnetwork.org [07/Jul/2015:04:15:15 +0100] "GET /tags/publications HTTP/1.1" 503 206 297 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "94.228.34.249" www.transitionnetwork.org [07/Jul/2015:04:15:15 +0100] "GET /forum/rss/forum/transitiongroup-food HTTP/1.1" 503 206 327 376 "-" "magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)" 0.000 "-" "66.249.92.62, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:20 +0100] "GET /news/feed HTTP/1.0" 503 206 379 371 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=730710629253691737)" 0.001 "-" "207.46.13.130" www.transitionnetwork.org [07/Jul/2015:04:15:25 +0100] "GET /tags/militarisation HTTP/1.1" 503 206 299 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "94.228.34.250" www.transitionnetwork.org [07/Jul/2015:04:15:25 +0100] "GET /forum/rss/tags/covered-market HTTP/1.1" 503 206 320 376 "-" "magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)" 0.000 "-" "66.249.92.57" www.transitionnetwork.org [07/Jul/2015:04:15:30 +0100] "GET /blogs/feed HTTP/1.1" 503 206 334 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 10 subscribers; feed-id=2865359885998846808)" 0.000 "-" "178.63.75.73" www.transitionnetwork.org [07/Jul/2015:04:15:31 +0100] "GET /taxonomy/term/998/all HTTP/1.0" 503 206 247 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.037 "-" "178.63.75.73" www.transitionnetwork.org [07/Jul/2015:04:15:33 +0100] "GET /people/lloyd-williams HTTP/1.0" 503 206 247 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.023 "-" "207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:15:35 +0100] "GET /blogs/rob-hopkins/2014-04/katrina-brown-resilience-impact-and-learning HTTP/1.1" 503 206 318 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "188.40.120.19" www.transitionnetwork.org [07/Jul/2015:04:15:35 +0100] "GET /forums/process/web-and-comms/free-crowdmapping-solution-web-mobile-apps HTTP/1.0" 503 206 297 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.023 "-" "144.76.100.237" www.transitionnetwork.org [07/Jul/2015:04:15:38 +0100] "GET /forums/process/web-and-comms/are-dedicated-servers-best-business HTTP/1.0" 503 206 290 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.023 "-" "136.243.16.102, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:42 +0100] "GET /forums/process/web-and-comms/eco-friendly-printing-and-print-manufacturers HTTP/1.0" 503 206 461 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.000 "-" "136.243.16.102, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:43 +0100] "GET /people/pravin-ganore HTTP/1.0" 503 206 407 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.001 "-" "136.243.16.102, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:44 +0100] "GET /taxonomy/term/999/0/feed HTTP/1.0" 503 206 411 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.001 "-" "136.243.16.102, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:44 +0100] "GET /people/neha-patel HTTP/1.0" 503 206 404 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.000 "-" "66.249.78.195" www.transitionnetwork.org [07/Jul/2015:04:15:44 +0100] "GET /sites/default/files/resize/remote/ceb9808fa3768e278199f7c502612729-550x366.jpg HTTP/1.1" 503 206 314 376 "-" "Googlebot-Image/1.0" 0.000 "-" "207.46.13.110" www.transitionnetwork.org [07/Jul/2015:04:15:45 +0100] "GET /people/janev-cameron HTTP/1.1" 503 206 268 376 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.000 "-" "136.243.16.102, 127.0.0.1" www.transitionnetwork.org [07/Jul/2015:04:15:46 +0100] "GET /forums/process/web-and-comms/free-tool-creating-interactive-iphoneandroid-ebooks HTTP/1.0" 503 206 467 371 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html" 0.000 "-" "68.180.228.44" www.transitionnetwork.org [07/Jul/2015:04:15:47 +0100] "GET /tags/sustainable-bays HTTP/1.1" 503 206 288 376 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 0.000 "-" "151.80.31.122" booker-stage-20140501.transitionnetwork.org [07/Jul/2015:04:15:47 +0100] "GET /events/partner HTTP/1.1" 404 134 208 336 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" 0.000 "1.32" "66.249.92.60" www.transitionnetwork.org [07/Jul/2015:04:15:49 +0100] "GET /news/feed HTTP/1.1" 503 206 332 376 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 4 subscribers; feed-id=7675538851756851077)" 0.000 "-"
I don't know if all the bots caused the load spike or if they were a victim of the load spike.
Note there was a 404 request above for http://booker-stage-20140501.transitionnetwork.org/robots.txt -- the staging sites should have robots.txt files to ensure they are now crawled or indexed, however this site isn't a copy of the main site: http://booker-stage-20140501.transitionnetwork.org/ so I don't think there is a issue here.
comment:27 follow-up: ↓ 28 Changed 17 months ago by paul
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 2.36 to 2.61
That stage website has been removed a while back.
Do we have a Google Webmaster account for transitionnetwork.org website? It maybe worth investigating how to manually remove this page from the index.
The most recent stage site does have an appropriate robots.txt but not sure how this was achieved. Next time we build a platform I'll follow the notes in the wiki and check that the robots.txt file is correct.
https://booker-stage-20150319.transitionnetwork.org/robots.txt
User-agent: *
Disallow: /
Sitemap: http://www.transitionnetwork.org/sitemap.xml
comment:28 in reply to: ↑ 27 Changed 17 months ago by chris
Replying to paul:
That stage website has been removed a while back.
Right, I realised that half through writing it up in the comment above.
Do we have a Google Webmaster account for transitionnetwork.org website? It maybe worth investigating how to manually remove this page from the index.
Somebody should have one? Ask Sam perhaps?
The most recent stage site does have an appropriate robots.txt but not sure how this was achieved.
I think robots.txt files are created for staging sites but not for the default BOA splash pages for domains matching the wild card DNS, for example:
- https://foo.transitionnetwork.org/robots.txt
- https://bar.transitionnetwork.org/robots.txt
- https://baz.transitionnetwork.org/robots.txt
So I'd suggest this is an issue for BOA (the lack for robots.txt files), however as we appear to now have a policy to not apply BOA updates (see ticket:754#comment:60) there seems to be no point is raising an issue with them since even if they fix it the fix won't be appied to PuffinServer...?
There was another load spike just now, the peak:
Date: Tue, 7 Jul 2015 12:06:16 +0100 (BST) From: root@puffin.webarch.net Subject: lfd on puffin.webarch.net: High 5 minute load average alert - 59.74 Time: Tue Jul 7 12:06:16 2015 +0100 1 Min Load Avg: 106.45 5 Min Load Avg: 59.74 15 Min Load Avg: 26.01 Running/Total Processes: 112/600
comment:29 Changed 17 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 2.61 to 2.71
Oops forgot to add time to the comment above.
comment:30 Changed 16 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.15
- Total Hours changed from 2.71 to 2.86
We are seeing very high load spikes on a daily basis on PuffinServer at the moment, these are the subject lines from some of the alert emails:
Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 12.55 Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 34.23 Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 73.63 Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 120.41 Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 88.71 Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 59.66 Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 39.93 Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 26.81 Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 17.95 Jul 27 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 12.07 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 21.47 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 76.39 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 59.15 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 39.67 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 26.66 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 17.89 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 12.05 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 70.32 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 57.11 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 38.34 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 25.73 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 17.31 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 12.63 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 25.05 Jul 28 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 16.87 Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 12.04 Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 40.73 Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 92.19 Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 68.15 Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 45.92 Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 30.78 Jul 29 root@puffi lfd on puffin.webarch.net: High 5 minute load average alert - 21.26
The tops of these peaks are not picked up by Munin, see:
I'm not convinced that it is worth spending time on this issue any more, if anyone disagrees please let me know.
comment:31 Changed 15 months ago by chris
Big load spike today, I'll post some stats about it on this ticket.
comment:32 Changed 15 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 2.86 to 3.11
These are the subject lines from the ldf load alert emails from today, these are sent every 5 mins when the load is above 10 or so:
Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 15.84 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 26.13 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 17.56 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.28 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 32.92 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 30.15 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 20.27 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 17.02 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.57 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.04 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 31.83 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 38.05 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 26.59 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 19.76 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.28 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 16.31 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 21.37 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.40 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.99 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 22.18 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 15.10 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.05 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.05 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 33.59 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 46.81 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 33.24 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 22.31 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 15.13 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.46 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.23 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 32.07 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 57.67 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 44.66 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 30.17 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 20.29 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.73 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.51 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 23.35 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 20.75 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.97 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.99 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 29.44 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 32.96 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 15.04 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.95 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 13.63 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.32 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 31.23 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 70.98 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 78.94 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 77.84 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 86.56 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 106.85 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 112.84 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 105.13 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 93.73 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 97.14 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 100.15 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 104.10 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 101.91 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 97.32 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 30.86 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 19.46 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 16.40 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 30.10 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 38.24 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 35.38 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 23.76 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 19.16 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.97 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.08 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 31.02 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 53.04 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 56.32 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 55.31 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 78.01 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 76.02 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 61.12 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 40.95 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 16.53 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 16.12 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 12.48 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 32.67 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 31.74 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 21.30 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 14.62 Aug 25 lfd on puffin.webarch.net: High 5 minute load average alert - 31.64
The tops of these peaks haven't shown up in Munin as usual:
This was a, worse than usual, load spike but we get lots of these every day, it has always been the case with BOA, see wiki:PuffinServer#LoadSpikes, I don't know what can be done about it, given the current server architecture, other than putting up with them?
comment:33 Changed 15 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.5
- Total Hours changed from 3.11 to 3.61
I have just spent some time looking through the Puffin Munin stats and this graph indicates the load increase on the server over the last year:
And the only other graph which has changed this much over the year is the PHP-FPM memory usage and this is basically related to the number of processes running (the process size hasn't changed much):
Note that a couple of years ago when trying to manually set the number of PHP-FPM processes to address BOA load spikes, we did have high and load values for this, for example:
- /trac/attachment/ticket/483/multips_memory-month.png
- /trac/attachment/ticket/555/puffin_2013-07-10_multips_memory-day.png
These examples were taken from a quick look through some of the wiki:PuffinServerBoaLoadSpikes#BOALoadSpikes tickets.
This autumn will mark the 3rd anniversary of the Transition Network using BOA as it's Drupal hosting platform, personally I'd be happy to move back to a system that we have more control over so we have a chance to find the cause of issues such as these load spikes and address them in an effective way.
We just had a big load spike while writing this comment and watching it in top there were lots of perl, sh and php processes running as the aegir user (the sites runs as tn.web) -- if we want to be able to do something about the downtime caused by all the load spikes we need a server we can control and with BOA one is not in control (this is the whole point of it...).
comment:34 Changed 15 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 3.61 to 3.86
I have been looking at the Munin stats for Puffin again, trying to work out why the server has been having so many load spikes, and these graphs are perhaps indicating something, these stats are from MySQL:
These from redis, which caches MySQL data:
Firewall connections:
And direction of the traffic:
This last graph seems to indicate that the cause is additional incoming connections.
comment:35 Changed 15 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.5
- Total Hours changed from 3.86 to 4.36
I have spent some time looking at the wiki:WebServerLogs#logstalgia graphical representation of the Nginx access.log (which is like watching pong!) and the things that were very noticeable:
- Huge number of requests for /?q=node/add, 8,867 today, 17,113 yesterday, 18,082 the day before.
- Huge number of requests for /user/register, 10,496 today, 19,898 yesterday, 20,907 the day before.
- Lot of requests for /people/kion-kashefi, 3,277 today, 6,291 yesterday, 6,451 the day before.
All I can conclude from this is that there are a huge number of spam bots out there... and this could be the cause of the change in behaviour of the site and why nothing shows up in the Piwik stats (as Piwik stats are from people not bots).
comment:36 Changed 15 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.43
- Total Hours changed from 4.36 to 4.79
Looking at yesterdays Nginx access.log.1 via wiki:WebServerLogs#goaccess there are a huge number of requests via Google proxies:
Volume No IP ------------------------------ 697.66 KB - 8075 66.249.93.187 694.56 KB - 7944 66.249.84.252 703.52 KB - 7916 66.249.93.179 690.28 KB - 7869 66.249.84.198 668.66 KB - 7810 66.249.84.194 734.59 KB - 7797 66.249.81.219 826.61 KB - 7773 66.249.81.225 673.94 KB - 7750 66.249.93.183 764.81 KB - 7700 66.249.81.222 374.77 KB - 4262 66.102.6.213 361.39 KB - 4146 66.102.6.205 350.12 KB - 4091 66.102.6.209
You can get the reverse proxy via:
- https://www.robtex.net/#!dns=66.249.93.187
- https://www.robtex.net/#!dns=66.249.84.252
- https://www.robtex.net/#!dns=66.249.93.179
- https://www.robtex.net/#!dns=66.249.84.198
- https://www.robtex.net/#!dns=66.249.84.194
- https://www.robtex.net/#!dns=66.249.81.219
- https://www.robtex.net/#!dns=66.249.81.225
- https://www.robtex.net/#!dns=66.249.93.183
- https://www.robtex.net/#!dns=66.249.81.222
- https://www.robtex.net/#!dns=66.102.6.213
- https://www.robtex.net/#!dns=66.102.6.205
- https://www.robtex.net/#!dns=66.102.6.209
This is what we have for the top requests by hits:
3.45 KB - 41616 /feeds/importer/transition_blogs_importer/15800 1.19 MB - 4524 / 288.78 MB - 3491 /blogs/feed/rob-hopkins/ 78.69 MB - 2633 /news/feed 24.63 MB - 2019 /blogs/feed/rob-hopkins 1.71 MB - 1684 /robots.txt
I have no idea why there are over 41k requests for this blank page and these are all from Google -- Google is requesting this URL multiple times per second!
"66.249.84.252" www.transitionnetwork.org [01/Sep/2015:19:13:33 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.633 "-" "66.249.84.252" www.transitionnetwork.org [01/Sep/2015:19:13:33 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.514 "-" "66.249.93.187" www.transitionnetwork.org [01/Sep/2015:19:13:33 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.493 "-"
Run this command on the server to see, it's really crazy:
sudo -i tail -f /var/log/nginx/access.log | grep transition_blogs_importer
Some of the URL's are like this:
"66.249.93.187" www.transitionnetwork.org [01/Sep/2015:19:15:06 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=15765967025011620143&hub.verify_token=ab6f372cedf4209205adf1b9abb9315f&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
So... does anyone know anyone at Transition Voice? Perhaps they should be contected to see if they have an idea about this behaviour? Or should we contact Google?
I'm very tempted to block all these Google IP addresses via the firewall (I don't think we can do it at a Nginx level as BOA is running the server) to see if that improves the situation with the server... Anyone have any thoughts on this?
comment:37 Changed 15 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 4.79 to 4.89
I sent a (unencrypted) question via http://transitionvoice.com/contact-us/ and moments later all the Google proxy requests stopped... how weird is that!?
Only requests we now have from the Google subnet addresses are from the Googlebot, eg:
"66.249.75.150" www.transitionnetwork.org [01/Sep/2015:19:39:39 +0100] "GET /tags/newsletter?page=4 HTTP/1.1" 301 178 443 394 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.000 "-" "66.249.75.3, 127.0.0.1" www.transitionnetwork.org [01/Sep/2015:19:39:59 +0100] "GET /blogs/feed/rob-hopkins HTTP/1.0" 200 66986 392 67674 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.024 "3.49" "66.249.75.150" www.transitionnetwork.org [01/Sep/2015:19:40:34 +0100] "GET /tags/newsletter HTTP/1.1" 200 9943 436 10656 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 16.405 "5.12" "66.249.75.11, 127.0.0.1" www.transitionnetwork.org [01/Sep/2015:19:40:38 +0100] "GET /tags/capital HTTP/1.0" 200 7163 444 7797 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.586 "3.95" "66.249.75.168" www.transitionnetwork.org [01/Sep/2015:19:40:58 +0100] "GET /tags/tamzin-pinkerton HTTP/1.1" 200 7375 320 8088 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.798 "3.93"
comment:38 Changed 15 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 4.89 to 4.99
In terms of registration attempts we are seeing 4-6k a day:
grep POST /var/log/nginx/access.log.1 | grep register | wc -l 4477 zgrep POST /var/log/nginx/access.log.2.gz | grep register | wc -l 4886 zgrep POST /var/log/nginx/access.log.3.gz | grep register | wc -l 5880 zgrep POST /var/log/nginx/access.log.4.gz | grep register | wc -l 4423
A vast number of them from this User Agent:
"Mozilla/5.0 (iPhone; CPU iPhone OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) CriOS/39.0.2171.50 Mobile/12B440 Safari/600.1.4"
But the vast nimber with this User Agent string are only from today -- I guess botnets randomise their UA strings:
grep CriOS /var/log/nginx/access.log | wc -l 17183 grep CriOS /var/log/nginx/access.log.1 | wc -l 5282 zgrep CriOS /var/log/nginx/access.log.2 | wc -l 8
My conclusion to all of this is that I suspect most the problems we have had with the load on the server recently has been caused by bots...
comment:39 Changed 15 months ago by chris
So the load spiked and low and behold loads of these requests again:
"66.249.93.179" www.transitionnetwork.org [01/Sep/2015:21:05:23 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 37.734 "-" "66.249.93.187" www.transitionnetwork.org [01/Sep/2015:21:05:23 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 38.393 "-" "66.249.81.222" www.transitionnetwork.org [01/Sep/2015:21:05:23 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 39.092 "-" "66.249.81.219" www.transitionnetwork.org [01/Sep/2015:21:05:24 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 40.474 "-" "66.249.93.183" www.transitionnetwork.org [01/Sep/2015:21:05:25 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 24.565 "-" "66.249.81.219" www.transitionnetwork.org [01/Sep/2015:21:05:25 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 25.067 "-" "66.249.84.194" www.transitionnetwork.org [01/Sep/2015:21:05:28 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 43.830 "-" "66.249.93.183" www.transitionnetwork.org [01/Sep/2015:21:05:43 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=10142066725310477262&hub.verify_token=1eda61c9d1f2bf3e476094acdaf91702&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.93.187" www.transitionnetwork.org [01/Sep/2015:21:05:44 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=6319814073685007505&hub.verify_token=9e89505416e7a1a19987fbfd5b01491a&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.252" www.transitionnetwork.org [01/Sep/2015:21:05:45 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=107922229115863848&hub.verify_token=03272b1b2ff8d44849ccf86666975a1f&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 508 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.81.219" www.transitionnetwork.org [01/Sep/2015:21:05:46 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=16352637103625808541&hub.verify_token=a15234a3b3d477b771799e41d7ad34f4&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.93.187" www.transitionnetwork.org [01/Sep/2015:21:05:46 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=1240898511006880980&hub.verify_token=ebf54df1b54bf1a62cf19fc3cbf13a8f&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.194" www.transitionnetwork.org [01/Sep/2015:21:05:47 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=15419583452535399630&hub.verify_token=af68e5ec296fa25fc7c43e12eae3ee41&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.198" www.transitionnetwork.org [01/Sep/2015:21:05:48 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=8707460082747624150&hub.verify_token=cdca72c68fd1e269d69ed9e9b24238c4&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.93.183" www.transitionnetwork.org [01/Sep/2015:21:05:49 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=11887861429882477665&hub.verify_token=e8b1d38570d15518e9f9dbcd5c6b9257&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.194" www.transitionnetwork.org [01/Sep/2015:21:05:49 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=3110730584339548525&hub.verify_token=25800bcdd6b0106352e5ae3e076b56cb&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.252" www.transitionnetwork.org [01/Sep/2015:21:05:50 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=2476781462168896036&hub.verify_token=beb3f197b835cb1f821ac88209b5fbcc&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.81.222" www.transitionnetwork.org [01/Sep/2015:21:05:51 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=5726390847248026995&hub.verify_token=54d54561037e8777b201926f27ad25f3&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.93.183" www.transitionnetwork.org [01/Sep/2015:21:05:51 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=8578319575267852037&hub.verify_token=bb331ae48606c991244c8c53dc830c1a&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.252" www.transitionnetwork.org [01/Sep/2015:21:05:52 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=16276261841447425468&hub.verify_token=43947eb43c6cf833cc6e00f0b2addc5c&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.93.179" www.transitionnetwork.org [01/Sep/2015:21:06:03 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 62.004 "-"
comment:40 Changed 15 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 4.99 to 5.09
These are the IP addresses I'm tempted to block:
grep transition_blogs_importer /var/log/nginx/access.log.1 | awk '{ print $1}' | sort -u "66.102.6.205" "66.102.6.209" "66.102.6.213" "66.249.81.195" "66.249.81.198" "66.249.81.201" "66.249.81.204" "66.249.81.219" "66.249.81.222" "66.249.81.225" "66.249.84.194" "66.249.84.198" "66.249.84.252" "66.249.93.179" "66.249.93.183" "66.249.93.187" "66.249.93.213" "66.249.93.216" "66.249.93.219" "66.249.93.222"
comment:41 Changed 15 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 5.09 to 5.19
We can't simply block these Google proxy IP's as they do a small number of legitimate requests as well as the massive volume of bogus ones, eg:
grep 66.249.81.219 /var/log/nginx/access.log | grep -v transition_blogs_importer "2003:57:ee1d:5b01:b1c3:ec90:76e7:40c4, 66.249.81.219" www.transitionnetwork.org [01/Sep/2015:09:52:27 +0100] "GET / HTTP/1.1" 200 11794 263 12478 "-" "Google favicon" 0.036 "4.57" "66.249.81.219" www.transitionnetwork.org [01/Sep/2015:10:49:14 +0100] "GET /rss.xml HTTP/1.1" 200 26504 230 27183 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=2525513441670857354)" 0.095 "3.40" grep 66.249.81.222 /var/log/nginx/access.log | grep -v transition_blogs_importer "66.249.81.222" www.transitionnetwork.org [01/Sep/2015:20:49:56 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 200 67015 246 67767 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=5471198439844167222)" 0.085 "3.49" grep 66.249.81.225 /var/log/nginx/access.log | grep -v transition_blogs_importer "66.249.81.225" www.transitionnetwork.org [01/Sep/2015:19:41:32 +0100] "GET /blogs/feed/rob-hopkins/ HTTP/1.1" 200 67015 246 67767 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=5471198439844167222)" 0.090 "3.49"
So I have added the following to /var/aegir/config/server_master/nginx/pre.d/nginx_wild_ssl.conf:
location /feeds/importer/transition_blogs_importer/15800 { return 403; }
Not sure what the effect will be as the requests have stopped again...
comment:42 Changed 15 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 5.19 to 5.44
We just has a small load spike:
WARNINGs: load is 7.46 (outside range [:6]).
And at the same time 2 requests a second from FeedFetcher-Google to feeds/importer/transition_blogs_importer/15800, these are just the last ones:
"66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:19 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=17061375092290510245&hub.verify_token=ce2b689348783c89d4722373ecda9bed&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:19 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.549 "-" "66.249.81.225" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.477 "-" "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=9971194119277265194&hub.verify_token=3313f02505f7d33aab87c368df13a700&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=11023786745442642030&hub.verify_token=d2c4c29eb64e468966c69f533132871a&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.517 "-" "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=14715150916754042873&hub.verify_token=3d84aef446543bf2b85f56694c7b6468&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=17982434088673727697&hub.verify_token=0697bb324e57665a4f20694cb6bebc8f&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.497 "-" "66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.501 "-" "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=6040636127189422897&hub.verify_token=84128e5a7cbf890a53b7a6c1cbb5ff3b&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=12494725620258192599&hub.verify_token=f62b6ac0aca1eeb79ec70a3acdbd2976&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.492 "-" "66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:23 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=11605899247776179542&hub.verify_token=c15417975bad95cdcb7d731e18cfd0e4&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:23 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.535 "-" "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.641 "-" "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=6671767172620812224&hub.verify_token=3b4f373fa62274a5dc1df9092223bc0b&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=1122299795865433797&hub.verify_token=3fea7d8a2dcffe41c90593140a58154f&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.589 "-" "66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:25 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.476 "-" "66.102.6.213" www.transitionnetwork.org [02/Sep/2015:13:37:26 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.496 "-" "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:27 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.500 "-" "66.102.6.205" www.transitionnetwork.org [02/Sep/2015:13:37:27 +0100] "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.521 "-" "66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:28 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=5950156067159212032&hub.verify_token=d81f046d7d811ddd073d5085d49609de&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:28 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=14864592753011071735&hub.verify_token=a160f1d2f40967f584fe2a48f6943767&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.249.93.179" www.transitionnetwork.org [02/Sep/2015:13:37:29 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=1114708068878419733&hub.verify_token=014eefe3c76d5456099a00380653dbff&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-" "66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:29 +0100] "GET /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=5789640148520582546&hub.verify_token=4bbd1b66686203c49a7c0cc1a19cbbe3&hub.mode=subscribe&hub.lease_seconds=432000 HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 0.000 "-"
So I tracked down the Ngnix config file, /data/disk/tn/config/includes/nginx_vhost_common.conf and added:
# google feed fetcher location ^~ /feeds/importer/transition_blogs_importer/15800 { return 403; }
So next time it comes back rather than these requests being sent to php-fpm they will get a fast, low cost, response from Nginx...
comment:51 follow-up: ↓ 52 Changed 15 months ago by sam
Hi all If the load spikes/ downtime are related to the blog import, & this change doesn't fix it then I think we could consider just turning off the blog imports? I don't think its massively read? Thanks Sam On 2 September 2015 at 13:47, Transition Technology Trac <trac@tech.transitionnetwork.org> wrote: > #846: Load Spikes on BOA PuffinServer > -------------------------------------+------------------------------------- > Reporter: chris | Owner: chris > Type: maintenance | Status: new > Priority: major | Milestone: > Component: Live server | Maintenance > Keywords: | Resolution: > Add Hours to Ticket: 0.25 | Estimated Number of Hours: 0.0 > Total Hours: 5.19 | Billable?: 1 > -------------------------------------+------------------------------------- > Changes (by chris): > > * hours: 0.0 => 0.25 > * totalhours: 5.19 => 5.44 > > > Comment: > > We just has a small load spike: > > {{{ > WARNINGs: load is 7.46 (outside range [:6]). > }}} > > And at the same time 2 requests a second from !FeedFetcher-Google to > `feeds/importer/transition_blogs_importer/15800`, this is just the last > ones: > > {{{ > "66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:19 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=17061375092290510245&hub.verify_token=ce2b689348783c89d4722373ecda9bed&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:19 +0100] > "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 > "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.549 > "-" > "66.249.81.225" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] > "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 > "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.477 > "-" > "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=9971194119277265194&hub.verify_token=3313f02505f7d33aab87c368df13a700&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=11023786745442642030&hub.verify_token=d2c4c29eb64e468966c69f533132871a&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:20 +0100] > "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 > "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.517 > "-" > "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=14715150916754042873&hub.verify_token=3d84aef446543bf2b85f56694c7b6468&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=17982434088673727697&hub.verify_token=0697bb324e57665a4f20694cb6bebc8f&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:21 +0100] > "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 > "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.497 > "-" > "66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] "GET > /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" > "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.501 "-" > "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=6040636127189422897&hub.verify_token=84128e5a7cbf890a53b7a6c1cbb5ff3b&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=12494725620258192599&hub.verify_token=f62b6ac0aca1eeb79ec70a3acdbd2976&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:22 +0100] > "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 > "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.492 > "-" > "66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:23 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=11605899247776179542&hub.verify_token=c15417975bad95cdcb7d731e18cfd0e4&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:23 +0100] > "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 > "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.535 > "-" > "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] > "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 > "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.641 > "-" > "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=6671767172620812224&hub.verify_token=3b4f373fa62274a5dc1df9092223bc0b&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.249.84.198" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=1122299795865433797&hub.verify_token=3fea7d8a2dcffe41c90593140a58154f&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:24 +0100] "GET > /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" > "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.589 "-" > "66.249.84.252" www.transitionnetwork.org [02/Sep/2015:13:37:25 +0100] > "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 > "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.476 > "-" > "66.102.6.213" www.transitionnetwork.org [02/Sep/2015:13:37:26 +0100] "GET > /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" > "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.496 "-" > "66.249.93.183" www.transitionnetwork.org [02/Sep/2015:13:37:27 +0100] > "GET /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 > "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.500 > "-" > "66.102.6.205" www.transitionnetwork.org [02/Sep/2015:13:37:27 +0100] "GET > /feeds/importer/transition_blogs_importer/15800 HTTP/1.1" 200 0 327 0 "-" > "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)" 20.521 "-" > "66.249.84.194" www.transitionnetwork.org [02/Sep/2015:13:37:28 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=5950156067159212032&hub.verify_token=d81f046d7d811ddd073d5085d49609de&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.249.93.187" www.transitionnetwork.org [02/Sep/2015:13:37:28 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=14864592753011071735&hub.verify_token=a160f1d2f40967f584fe2a48f6943767&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 510 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.249.93.179" www.transitionnetwork.org [02/Sep/2015:13:37:29 +0100] > "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=1114708068878419733&hub.verify_token=014eefe3c76d5456099a00380653dbff&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > "66.102.6.209" www.transitionnetwork.org [02/Sep/2015:13:37:29 +0100] "GET > /feeds/importer/transition_blogs_importer/15800?hub.topic=http://feeds.feedburner.com/TransitionVoice&hub.challenge=5789640148520582546&hub.verify_token=4bbd1b66686203c49a7c0cc1a19cbbe3&hub.mode=subscribe&hub.lease_seconds=432000 > HTTP/1.1" 301 178 509 425 "-" "FeedFetcher-Google; > (+http://www.google.com/feedfetcher.html)" 0.000 "-" > }}} > > So I tracked down the Ngnix config file, > `/data/disk/tn/config/includes/nginx_vhost_common.conf` and added: > > {{{ > # google feed fetcher > location ^~ /feeds/importer/transition_blogs_importer/15800 { > return 403; > } > }}} > > So next time it comes back rather than these requests being sent to php- > fpm they will get a fast, low cost, response from Nginx... > > -- > Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:42> > Transition Technology <https://tech.transitionnetwork.org/trac> > Support and issues tracking for the Transition Network Web Project.
comment:52 in reply to: ↑ 51 Changed 15 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 5.44 to 5.69
Replying to sam:
If the load spikes/ downtime are related to the blog import, & this
change doesn't fix it then I think we could consider just turning off
the blog imports? I don't think its massively read?
I'm not sure they are, the crazy number of requests from the Google Feedfetcher has been solved by serving 403's for this URL:
There were only 6 requests for that URL yesterday:
grep transition_blogs_importer /var/log/nginx/access.log.1 | wc -l 6
Before we started serving 403's we were getting 40k requests a day for this URL...
The main traffic is still for these 3 URLs (see ticket:846#comment:35):
- https://www.transitionnetwork.org/?q=node/add (16963 yesterday)
- https://www.transitionnetwork.org/user/register (19118 yesterday)
- https://www.transitionnetwork.org/people/kion-kashefi (5822 yesterday)
I strongly suspect botnet(s) are behind the massive number of requests for these URLs, I'm not sure how, without ditching BOA, to address this.
I have been running wiki:WebServerLogs#logstalgia for the last week or so to keep an eye on things.
comment:53 Changed 12 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.15
- Total Hours changed from 5.69 to 5.84
There is a on-going very high load on PuffinServer today, these are the subjects lines from the LFD emails I have been sent so far today:
Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 14.93 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 15.05 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.34 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.12 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 27.72 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 48.07 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 53.77 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 54.35 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 54.49 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 56.10 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 62.23 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 66.63 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 61.97 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 41.70 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 28.15 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 19.04 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 13.20 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.52 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 13.16 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 14.60 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 34.55 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 52.49 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 52.04 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 37.02 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 25.41 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 16.86 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 35.16 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 25.18 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 16.90 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 13.72 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 15.58 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 33.53 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 18.99 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 22.53 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 35.66 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 32.54 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 21.86 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 14.73 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.54 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 28.89 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 28.37 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 27.05 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 18.12 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 13.42 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.85 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 29.11 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 23.91 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 30.78 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 20.67 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 16.29 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 12.90 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 27.81 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 42.35 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 36.69 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 24.65 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 19.49 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 31.78 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 49.33 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 58.00 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 69.29 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 61.48 Dec 04 lfd on puffin.webarch.net: High 5 minute load average alert - 51.63
comment:54 Changed 12 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 5.84 to 6.09
The load is through the roof again, BOA is busy killing things (see the killall processes in this top output):
top - 10:33:16 up 3 days, 6 min, 1 user, load average: 103.85, 116.45, 99.92 Tasks: 659 total, 112 running, 547 sleeping, 0 stopped, 0 zombie %Cpu0 : 0.5 us, 15.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 83.7 st %Cpu1 : 1.0 us, 15.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 83.7 st %Cpu2 : 1.4 us, 14.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 83.9 st %Cpu3 : 1.4 us, 13.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 85.2 st %Cpu4 : 1.0 us, 15.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 83.3 st %Cpu5 : 1.0 us, 14.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 84.5 st %Cpu6 : 0.5 us, 14.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 85.5 st %Cpu7 : 1.4 us, 14.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 84.3 st %Cpu8 : 1.8 us, 16.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 82.0 st %Cpu9 : 1.3 us, 12.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 85.9 st %Cpu10 : 1.9 us, 14.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 83.8 st %Cpu11 : 1.4 us, 15.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 83.1 st %Cpu12 : 1.4 us, 16.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 82.2 st %Cpu13 : 1.9 us, 14.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 84.1 st KiB Mem: 8372060 total, 6507456 used, 1864604 free, 411396 buffers KiB Swap: 0 total, 0 used, 0 free, 3172436 cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 19900 root 20 0 0 0 0 R 46 0.0 0:02.43 mysqladmin 17809 root 20 0 0 0 0 R 36 0.0 0:01.70 php53-fpm 19975 root 20 0 0 0 0 R 28 0.0 0:01.16 bash 20038 root 20 0 0 0 0 R 26 0.0 0:01.05 hostname 20053 root 20 0 8 4 0 R 25 0.0 0:01.03 bash 20060 root 20 0 8 4 0 R 24 0.0 0:00.97 service 20067 root 20 0 8 4 0 R 23 0.0 0:00.95 php53-fpm 19923 root 20 0 5560 604 512 R 23 0.0 0:01.05 expr 19927 root 20 0 5544 720 596 R 23 0.0 0:00.93 killall 20058 root 20 0 8 4 0 R 23 0.0 0:00.93 bash 20050 root 20 0 8 4 0 R 22 0.0 0:00.92 bash 19825 aegir 20 0 187m 9304 6152 R 22 0.1 0:01.41 php 19476 root 20 0 5544 724 596 R 21 0.0 0:02.07 killall 20041 root 20 0 8 4 0 R 20 0.0 0:00.83 perl 19454 root 20 0 0 0 0 R 20 0.0 0:01.65 awk 19940 root 20 0 5544 720 596 R 20 0.0 0:00.82 killall 20061 root 20 0 8 4 0 R 20 0.0 0:00.82 service 19925 root 20 0 3948 168 128 R 20 0.0 0:00.86 expr 20045 root 20 0 8 4 0 R 20 0.0 0:00.80 php53-fpm 15911 root 20 0 0 0 0 R 19 0.0 0:01.11 php53-fpm 19914 root 20 0 7788 836 724 R 19 0.0 0:01.06 grep 19474 root 20 0 5548 716 596 R 19 0.0 0:01.77 killall 19994 root 20 0 40372 2416 1856 R 18 0.0 0:00.72 mysqladmin 20029 root 20 0 8 4 0 R 18 0.0 0:00.72 service 20057 root 20 0 8 4 0 R 17 0.0 0:00.70 php53-fpm 20063 root 20 0 8 4 0 R 17 0.0 0:00.70 php53-fpm 19898 root 20 0 40372 2412 1856 R 17 0.0 0:01.37 mysqladmin 19459 root 20 0 0 0 0 R 17 0.0 0:02.13 grep 19990 root 20 0 5544 712 588 R 17 0.0 0:00.68 killall 20042 root 20 0 8 4 0 R 17 0.0 0:00.68 bash 20024 root 20 0 40372 2416 1856 R 16 0.0 0:00.67 mysqladmin
There are 92 BOA bash scripts running (two are these are my shell):
ps -lA | grep bash | wc -l 94
I sorry but don't know what to do apart from let BOA go on with running loads of bash scripts killing loads of things... this is the nature of BOA... it is designed not to be managed by sysadmins and at times like this it results in a server that nothing can be done for apart from a reboot and perhaps I should try that since the site is unavailable...
Can we one day have a discussion about moving away from BOA? Please?
comment:55 Changed 12 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.15
- Total Hours changed from 6.09 to 6.24
I rebooted the server, once BOA finds high loads and goes on a killing spree it doesn't restart things for a while, so rebooting has resulted in the site coming back on line quicker. This is not how servers should be managed.
comment:56 Changed 12 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 6.24 to 6.49
Another massive load spike:
lfd on puffin.webarch.net: High 5 minute load average alert - 158.55
Rebooting the server again...
comment:57 Changed 12 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 6.49 to 6.74
Perhaps the suggestions posted here, 3 months ago, could be considered someday soonish?
The amount of downtime we have with BOA at the moment is really appalling...
I did look at the Piwik stats to see if there is a lot more human traffic due to cop21 but there didn't seem to be, so perhaps it is bots...
The server is back up now following the reboot but again it is unresponsive...
load average: 95.23, 76.54, 46.78
I'm going to have a look at what is happing via the Nginx logs, see wiki:WebServerLogs#logstalgia but it might be too overloaded for this to work, just getting a response from uptime took 5 mins...
comment:58 Changed 12 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.2
- Total Hours changed from 6.74 to 6.94
BOA is on another killing spree:
top - 17:00:34 up 26 min, 2 users, load average: 86.28, 77.39, 57.07 Tasks: 499 total, 99 running, 400 sleeping, 0 stopped, 0 zombie 32 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 netns %Cpu1 : 0.2 us, 1.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 98.6 st %Cpu2 : 0.0 us, 1.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 98.7 st %Cpu3 : 0.0 us, 1.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 98.9 st %Cpu4 : 0.2 us, 1.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 98.7 st %Cpu5 : 0.0 us, 1.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 98.9 st %Cpu6 : 0.2 us, 1.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.2 si, 98.0 st %Cpu7 : 0.0 us, 1.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.2 si, 98.7 st %Cpu8 : 0.2 us, 1.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 98.7 st %Cpu9 : 0.0 us, 1.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 99.0 st %Cpu10 : 0.2 us, 1.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 98.8 st %Cpu11 : 0.0 us, 1.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 98.9 st %Cpu12 : 0.0 us, 1.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 98.9 st sudo -i KiB Mem: 8372060 total, 1556628 used, 6815432 free, 143036 buffers Tasks: 500 total, 100 running, 400 sleeping, 0 stopped, 0 zombie %Cpu1 : 0.1 us, 0.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 99.2 st 28289 root 20 0 0 0 0 R 113 0.0 0:13.88 mysqladmin 28410 root 20 0 8 4 0 R 83 0.0 0:04.07 bash 25660 root 20 0 0 0 0 R 80 0.0 0:08.32 cron 28344 root 20 0 0 0 0 R 77 0.0 0:08.98 id 28413 root 20 0 8 4 0 R 74 0.0 0:03.65 bash 28411 root 20 0 8 4 0 R 71 0.0 0:04.20 bash 28147 root 20 0 0 0 0 R 59 0.0 0:04.66 grep 28216 root 20 0 0 0 0 R 56 0.0 0:03.18 bash 28148 root 20 0 0 0 0 R 45 0.0 0:05.18 awk 28392 root 20 0 28900 840 556 R 36 0.0 0:05.36 cron 28312 root 20 0 50544 512 300 R 36 0.0 0:08.00 host 27866 root 20 0 4064 660 500 R 35 0.0 0:24.31 pidof 28301 root 20 0 40376 2500 1900 R 35 0.0 0:08.28 mysqladmin 28378 root 20 0 11064 240 172 R 31 0.0 0:04.74 awk 23627 aegir 20 0 230m 40m 9512 R 30 0.5 1:03.83 php 28368 aegir 20 0 9796 724 608 R 28 0.0 0:05.84 id 28021 root 20 0 201m 12m 7896 R 26 0.2 0:17.97 php-fpm 28379 root 20 0 27828 336 216 R 26 0.0 0:05.39 mysqladmin 28283 root 20 0 40372 2504 1904 R 25 0.0 0:08.92 mysqladmin 28351 root 20 0 29980 356 228 R 25 0.0 0:05.98 mysqladmin 27843 root 20 0 4064 668 500 R 25 0.0 0:23.27 pidof 27965 root 20 0 4064 616 500 R 23 0.0 0:20.62 pidof 28406 root 20 0 8 4 0 R 21 0.0 0:03.66 bash 28403 root 20 0 24840 776 512 R 17 0.0 0:03.06 cron 28068 aegir 20 0 4132 592 500 R 17 0.0 0:09.51 dash 28381 root 20 0 6124 336 256 R 16 0.0 0:04.20 grep 28206 root 20 0 5544 696 588 R 16 0.0 0:10.63 killall 27472 root 20 0 5544 708 596 R 16 0.0 0:39.45 killall 28184 root 20 0 5544 704 596 R 16 0.0 0:10.05 killall 28404 root 20 0 24792 820 548 R 16 0.0 0:02.76 cron 28229 root 20 0 5544 704 588 R 16 0.0 0:10.05 killall 27864 root 20 0 4064 640 500 R 15 0.0 0:22.94 pidof 28401 root 20 0 24664 780 512 R 15 0.0 0:02.62 cron 27473 root 20 0 5544 712 596 R 15 0.0 0:38.61 killall 28400 root 20 0 24840 776 512 R 15 0.0 0:02.60 cron 28317 root 20 0 5544 700 588 R 15 0.0 0:07.47 killall 23147 aegir 20 0 231m 41m 9516 R 14 0.5 1:12.12 php 28405 root 20 0 24664 780 512 R 14 0.0 0:02.53 cron
At least 12 killall processes are running:
ps -lA | grep killall 4 R 0 27472 26445 18 80 0 - 1386 - ? 00:00:59 killall 4 R 0 27473 26027 18 80 0 - 1386 - ? 00:00:59 killall 4 R 0 27505 23173 18 80 0 - 1386 - ? 00:01:00 killall 4 R 0 27506 22402 18 80 0 - 1386 - ? 00:00:57 killall 4 R 0 28184 19270 14 80 0 - 1386 - ? 00:00:30 killall 4 R 0 28206 11411 14 80 0 - 1387 - ? 00:00:30 killall 4 R 0 28214 23723 15 80 0 - 1386 - ? 00:00:31 killall 4 R 0 28215 22836 15 80 0 - 1386 - ? 00:00:32 killall 4 R 0 28229 24782 15 80 0 - 1386 - ? 00:00:30 killall 4 R 0 28235 25066 13 80 0 - 1386 - ? 00:00:27 killall 4 R 0 28317 23862 15 80 0 - 1387 - ? 00:00:29 killall 4 R 0 28462 26319 13 80 0 - 1386 - ? 00:00:08 killall
Now it is down to 10...
ps -lA | grep killall | wc -l 10
But back to 12 again...
puffin:~# ps -lA | grep killall | wc -l 12
I'm going to reboot the server with four time as much RAM to see if makes it less suicidal...
comment:59 Changed 12 months ago by chris
Again a massive number of BOA bash jobs are running:
ps -lA | grep bash | wc -l 91
comment:60 Changed 12 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 6.94 to 7.04
It is up, I don't know how long it'll stay up:
uptime 17:33:42 up 22 min, 1 user, load average: 46.07, 49.30, 26.33
BOA hasn't started killing itself yet...
ps -lA | grep killall | wc -l 0
comment:61 Changed 12 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 7.04 to 7.14
So now it seems fine...
17:58:32 up 46 min, 1 user, load average: 0.25, 1.40, 8.90
I have no idea what was causing the massive load spikes earlier.
comment:62 follow-up: ↓ 63 Changed 12 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.2
- Total Hours changed from 7.14 to 7.34
So wiki:WebServerLogs#logstalgia now works given the extra RAM and guess what, massive number of hits for /user/register, ?q=/node/add and /people/kion-kashefi, the same as 3 months ago, see ticket:846#comment:35 ...
comment:63 in reply to: ↑ 62 ; follow-up: ↓ 65 Changed 12 months ago by sam
Replying to chris:
So wiki:WebServerLogs#logstalgia now works given the extra RAM and guess what, massive number of hits for /user/register, ?q=/node/add and /people/kion-kashefi, the same as 3 months ago, see ticket:846#comment:35 ...
I guess it would be fairly trivial for us to move the URL of that registration page?
/user/new/register ?
My assumption is this traffic is generated by a script that attacks Drupal sites. Moving the URL might be enough to outwit it if it's a crappy script?
I realise we'd still have to serve them a 404, but my assumption is that the load is actually generated by login attempts, not just the serving of a simple page?
It might be a more sophisticated script that would find the page anyway, but for five minutes work it might be worth a try?
Thanks
Sam
comment:64 follow-up: ↓ 66 Changed 12 months ago by sam
also just unpublished https://www.transitionnetwork.org/people/kion-kashefi Sorry didn't spot it in previous ticket.
comment:65 in reply to: ↑ 63 Changed 12 months ago by chris
Replying to sam:
I guess it would be fairly trivial for us to move the URL of that registration page?
/user/new/register ?
Yes I think that might help, but we would need to check everything that links there first?
comment:66 in reply to: ↑ 64 Changed 12 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 7.34 to 7.44
Replying to sam:
also just unpublished https://www.transitionnetwork.org/people/kion-kashefi Sorry didn't spot it in previous ticket.
Thanks, we could also do with identifying other spam pages like this, I don't think this is the only one...
comment:67 Changed 12 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 7.44 to 7.69
Looking at the memory usage for the last 12 hours (during which we haven't had any load spikes, I don't know for sure if this is related to the extra memory or a coincidence):
The minimum unused memory has been 23.36G, so 32-23.36=8.64, therefore I'm going to reboot the server with 9G, this is 1G more than it had, so this would be an extra £28 per quarter if Ade is happy for the Transition Network to pay this, if not I can reboot it again back to 8G.
comment:70 follow-up: ↓ 71 Changed 12 months ago by ade
Sounds ideal Chris, Please invoice as proposed. Best regards Ade On 6 December 2015 at 15:59, Transition Technology Trac < trac@tech.transitionnetwork.org> wrote: > #846: Load Spikes on BOA PuffinServer > -------------------------------------+------------------------------------- > Reporter: chris | Owner: chris > Type: maintenance | Status: new > Priority: major | Milestone: > Component: Live server | Maintenance > Keywords: | Resolution: > Add Hours to Ticket: 0.25 | Estimated Number of Hours: 0.0 > Total Hours: 7.44 | Billable?: 1 > -------------------------------------+------------------------------------- > Changes (by chris): > > * hours: 0.0 => 0.25 > * totalhours: 7.44 => 7.69 > > > Comment: > > Looking at the > [ > https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/memory.html > memory usage] for the last 12 hours (during which we haven't had any load > spikes, I don't know for sure if this is related to the extra memory or a > coincidence): > > [[Image(puffin_memory-pinpoint_2015-12-06.png)]] > > The minimum unused memory has been 23.36G, so 32-23.36=8.64, therefore I'm > going to reboot the server with 9G, this is 1G more than it had, so this > would be an extra [https://www.webarch.net/virtual-servers#ram £28 per > quarter] if Ade is happy for the Transition Network to pay this, if not I > can reboot it again back to 8G. > > -- > Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:67 > > > Transition Technology <https://tech.transitionnetwork.org/trac> > Support and issues tracking for the Transition Network Web Project. > -- Ade Stuart Web Manager - Transition network 07595 331877 The Transition Network is a registered charity address: 43 Fore St, Totnes, Devon, TQ9 5HN, UK website: www.transitionnetwork.org TN company no: 6135675 TN charity no: 1128675
comment:71 in reply to: ↑ 70 Changed 12 months ago by chris
Replying to ade:
Sounds ideal Chris,
Please invoice as proposed.
Thanks Ade, will do. Our email, just crossed -- the one I just sent to the Ttech list was sent before I saw this.
comment:72 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 7.69 to 7.94
Another bad day for load spikes, these are the subject lines from lfd email today:
Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.64 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 12.96 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.60 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 16.13 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 12.11 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 14.80 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.63 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 32.27 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 42.51 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 28.61 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 19.86 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 13.46 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.38 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 18.87 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 13.11 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 12.90 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 14.44 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 14.83 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 23.33 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 18.92 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 17.96 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 12.71 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 12.86 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 13.10 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 13.14 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.55 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 14.72 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 32.77 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 25.37 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 20.09 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 15.84 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 40.58 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 55.44 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 64.21 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 71.34 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 54.87 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 37.07 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 24.86 Dec 13 lfd on puffin.webarch.net: High 5 minute load average alert - 16.80
I have spent a fair amount of time looking at the output of top as well as all the email alerts and the munin graphs, as ever I can't see any simply cause or fix for this while we stick with BOA...
comment:73 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 7.94 to 8.19
I think a lot of the load issues are caused by bots, looking at the latest /var/log/nginx/access.log, it is 31k lines long and these are the User Agent strings for the UA's with 60 or more hits:
60 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2; Feeder.co) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36 60 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36 61 Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2 61 Stringer (https://github.com/swanson/stringer) 62 LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.1 +http://www.linkedin.com) 64 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729) 64 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0 66 Mozilla/5.0 (compatible; FeedBooster; +http://feeds.qsensei.com) 66 NewsBlur Feed Fetcher - 42 subscribers - http://www.newsblur.com/site/40901/rob-hopkinss-blog (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3) 66 NewsBlur Page Fetcher - 42 subscribers - http://www.newsblur.com/site/40901/rob-hopkinss-blog (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3) 68 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:152016352 69 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:111162297 70 Automattic Feed Fetcher 1.0 70 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8 70 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:107327965 70 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:243955592 70 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:90309734 70 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:92605190 75 Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; http://www.bing.com/bingbot.htm) 76 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:90556700 80 Java/1.8.0_51 81 Feedbin feed-id:189396 - 5 subscribers 89 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) 120 dlvr.it/1.0 (+http://dlvr.it/) 124 check_http/v1.4.16 (nagios-plugins 1.4.16) 128 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) 130 Java/1.8.0_65 138 Superfeedr bot/2.0 http://superfeedr.com - Make your feeds realtime: get in touch - feed-id:90159368 143 Mozilla/5.0 (compatible) 149 Owlin bot v3 http://owlin.com/bot 158 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36 172 curl 265 Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) 323 robots 346 Wget/1.13.4 (linux-gnu) 384 Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google) 436 Mozilla/5.0 (compatible; SpiderLing (a SPIDER for LINGustic research); +http://nlp.fi.muni.cz/projects/biwec/) 441 Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20100101 Firefox/17.0 489 magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net) 545 Y!J-ASR/0.1 crawler (http://www.yahoo-help.jp/app/answers/detail/p/595/a_id/42716/) 559 Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com) 633 Pingdom.com_bot_version_1.4_(http://www.pingdom.com/) 645 Pingdom.com_bot_version_1.4_(http://www.pingdom.com) 747 Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/) 793 Domain Re-Animator Bot (http://domainreanimator.com) - support@domainreanimator.com 1004 Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/) 1323 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 1841 - 2080 Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) 2633 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) 3193 Mozilla/5.0 (compatible; spbot/4.4.2; +http://OpenLinkProfiler.org/bot ) 3705 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36
The above hits account for 25k of the 31k total, if you take out the ones which might be people you have 20.5k hits from bots.
So I think we can say that at least 60% of the traffic to the site is from bots.
comment:74 Changed 11 months ago by chris
The UAs were extracted like this:
awk -F\" '{print $8}' /var/log/nginx/access.log | sort | uniq -c | sort -bn | less
comment:75 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.2
- Total Hours changed from 8.19 to 8.39
PuffinServer was down overnight again due to MySQL locking up, see ticket:890#comment:6 and today is another big day for load spikes and the server going down...
So it is back up and almost totally unresponsive again due to the load:
12:22:54 up 6 min, 1 user, load average: 56.00, 23.47, 8.84
I think we might be better off simply switching the server off for the day rather than me wasting hours trying to get it to stay up for moments at a time?
comment:76 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.02
- Total Hours changed from 8.39 to 8.41
It is only getting worse:
12:26:18 up 9 min, 1 user, load average: 95.24, 57.20, 25.01
comment:77 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 8.41 to 8.51
12:30:36 up 13 min, 2 users, load average: 73.60, 65.90, 36.46
I think I'm going to give up on trying to keep the server up today -- if anyone does want a server that we can keep up to run the site then let's have a discussion about dropping BOA?
comment:78 follow-up: ↓ 79 Changed 11 months ago by paul
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 8.51 to 8.76
Is the server swapping to the hard drive?
I'll be around later this afternoon, If I can be of any assistance.
comment:79 in reply to: ↑ 78 ; follow-up: ↓ 80 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 8.76 to 9.01
Replying to paul:
Is the server swapping to the hard drive?
No it hasn't got any swap.
I'll be around later this afternoon, If I can be of any assistance.
Thanks but I think what we really need is to have the conversation about what to do regarding BOA and we can only have that conversation with Ade...
Results from checking uptime over the last half hour or so:
12:32:03 up 15 min, 2 users, load average: 65.47, 65.04, 38.86 12:39:52 up 23 min, 1 user, load average: 97.46, 83.81, 57.33 12:41:46 up 25 min, 1 user, load average: 103.87, 90.45, 62.82 12:43:40 up 26 min, 1 user, load average: 110.41, 97.36, 68.43 12:49:07 up 32 min, 1 user, load average: 133.84, 120.30, 86.88 12:52:09 up 35 min, 1 user, load average: 117.21, 117.41, 91.68 12:54:34 up 37 min, 1 user, load average: 136.75, 124.76, 98.08 12:56:04 up 39 min, 1 user, load average: 141.40, 128.69, 101.89 12:59:54 up 43 min, 1 user, load average: 160.20, 145.55, 114.65
It is only getting worse and due to the nature of BOA that isn't really anything we can do apart from watch it failing.
comment:80 in reply to: ↑ 79 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.15
- Total Hours changed from 9.01 to 9.16
Replying to paul:
I'll be around later this afternoon, If I can be of any assistance.
By all means try to get to the bottom of it, I think the problem is that we can't as long as we are running BOA. You could start by reading this ticket thread I guess...
We might as well switch the server off for the day:
13:03:31 up 46 min, 1 user, load average: 185.61, 162.76, 127.80 13:11:13 up 54 min, 1 user, load average: 201.08, 191.55, 155.24
comment:81 Changed 11 months ago by chris
So fucked:
13:35:41 up 1:18, 1 user, load average: 263.11, 261.47, 233.17
comment:82 Changed 11 months ago by chris
I'm going to have one last try for the day to get the server up, rebooting it with double the RAM...
comment:83 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 9.16 to 9.26
It is up again:
13:43:13 up 4 min, 1 user, load average: 2.72, 0.91, 0.33
free -m total used free shared buffers cached Mem: 18415 2086 16329 0 129 586 -/+ buffers/cache: 1370 17044 Swap: 0 0 0
No idea how long it will be up for, not long I suspect...
comment:84 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.05
- Total Hours changed from 9.26 to 9.31
It is knackered again already:
13:46:25 up 7 min, 1 user, load average: 51.22, 20.76, 7.90
I'm really giving up for the day now.
comment:85 Changed 11 months ago by sam
Hi all
Thanks for your work on this.
I see the site is available again at the moment.
Paul: one thing we have discussed is moving the /login page, the assumption being that the load is caused by bots filling and submitting the form. If the bots can't find the form they can't submit it repeatedly.
I know this would be trivial on Wordpress but I've never done it on a Drupal site. Is it something that would be fairly easy for you to do?
I see there is a module for it: https://www.drupal.org/project/rename_admin_paths but I seem to recall we already have a module to edit paths?
I realise we might end up with a few broken internal links, I think this would be a price worth paying if it reduces the load and keeps the site up. I could fix any broken links early in the new year.
Thanks
Sam
comment:86 follow-up: ↓ 87 Changed 11 months ago by ade
Hi Chris, Sorry have taken holiday to coincide with kids holidays. I have no idea how to combat the BOT issues that the server is currently receiving. With regard to the long term plan, we are looking at the next site to be a WordPress site over the next 3-4 months, so this will mitigate the problem in the long-term, well either that or push the issue to a different platform. So the question is what can we do in the short term? We can't afford to rip out BOA and rebuild the server for 3-4 months. Are the bots targeting a particular URL or port or are they just flooding port 80? If its a URL can we remove that URL or change it internally? I had every intention of discussing with you the plans for next year before I came on holiday, but a question has arisen that I needed answering before hand. I understand that the lack of communication from my end has had to be very frustrating. I get back the first week of Jan..can we catch up then? Regards Ade On 23 Dec 2015 1:48 p.m., "Transition Technology Trac" < trac@tech.transitionnetwork.org> wrote: > #846: Load Spikes on BOA PuffinServer > -------------------------------------+------------------------------------- > Reporter: chris | Owner: chris > Type: maintenance | Status: new > Priority: major | Milestone: > Component: Live server | Maintenance > Keywords: | Resolution: > Add Hours to Ticket: 0.05 | Estimated Number of Hours: 0.0 > Total Hours: 9.26 | Billable?: 1 > -------------------------------------+------------------------------------- > Changes (by chris): > > * hours: 0.0 => 0.05 > * totalhours: 9.26 => 9.31 > > > Comment: > > It is knackered again already: > > {{{ > 13:46:25 up 7 min, 1 user, load average: 51.22, 20.76, 7.90 > }}} > > I'm really giving up for the day now. > > -- > Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:84 > > > Transition Technology <https://tech.transitionnetwork.org/trac> > Support and issues tracking for the Transition Network Web Project. >
comment:87 in reply to: ↑ 86 Changed 11 months ago by chris
- Cc kate added
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 9.31 to 9.56
Replying to ade:
Sorry have taken holiday to coincide with kids holidays.
No problem, I understand!
I have no idea how to combat the BOT issues that the server is currently
receiving.
I'm not sure we can with BOA as the configuration isn't designed to be managed by sysadmins.
With regard to the long term plan, we are looking at the next site to be a
WordPress site over the next 3-4 months, so this will mitigate the problem in
the long-term, well either that or push the issue to a different platform.
OK.
So the question is what can we do in the short term? We can't afford to rip
out BOA and rebuild the server for 3-4 months.
OK, I have had enough of BOA, we have been running PuffinServer for three years and we have had problems with it for ages, see wiki:PuffinServer#LoadSpikes, it hasn't been updated for seven months (see ticket:889, ticket:872, ticket:864, ticket:863 and ticket:854).
Kate, the other syadmin at Webarchitects has done some work on PuffinServer to bring it back up, reducing the number of CPUs and removing some BOA cron jobs, I have created an account for her on trac and she will hopefully document what she has done and record the time spent as a comment on this thread or as a new ticket.
Are the bots targeting a particular URL or port or are they just flooding
port 80? If its a URL can we remove that URL or change it internally?
It isn't entirly clear as most requests are not logged due to the way BOA has configured Nginx, however there is evidence of some URLs being abused by bots, see ticket:846#comment:35 and yes it might be possible for someone to mitigate this to some extent.
I had every intention of discussing with you the plans for next year before I
came on holiday, but a question has arisen that I needed answering before
hand. I understand that the lack of communication from my end has had to be
very frustrating. I get back the first week of Jan..can we catch up then?
It would be really great to have a chat about what we are going to do, I'm back at work when the kids go back to school.
comment:88 follow-up: ↓ 89 Changed 11 months ago by kate
- Add Hours to Ticket changed from 0.0 to 1.0
- Total Hours changed from 9.56 to 10.56
Hi,
My colleague Chris Croome mentioned in our IRC channel that there were issues with server puffin.
I logged into the host around 16:00 today.
I noticed that the host has 14 CPU, was showing a high load ( >100 ).
I noticed that there was a high "steal" percentage of CPU time
There were many bash "killall" processes running.
From previous discussion with Chris, I am aware that the server has a number of shell scripts called "BOA" used to dynamically monitor and maintain it.
I took the following action
- I shutdown the server
- I reduced the number of CPU to 4
- I restarted the server
- I commented out several shell scripts that are scheduled to run from root's crontab ( see output of crontab -l -u root If it is commented out I did it. )
Since performing this action ( about 1.5 hours now ) the load is < 1
top - 18:09:30 up 1:39, 2 users, load average: 0.31, 0.23, 0.19 Tasks: 155 total, 1 running, 154 sleeping, 0 stopped, 0 zombie %Cpu0 : 0.2 us, 0.0 sy, 0.0 ni, 98.5 id, 1.0 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu1 : 7.4 us, 4.3 sy, 0.0 ni, 88.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 0.9 us, 0.2 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 0.1 us, 0.1 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 18868060 total, 2889180 used, 15978880 free, 151080 buffers KiB Swap: 0 total, 0 used, 0 free, 767888 cached
The actions I took should be considered emergency actions. I'm not familiar with the BOA management, nor with other systems on this host. My reasoning for taking these steps was...
- Spawning bash scripts every minute for dynamic system management ( parsing log files, checking file system acls, checking for processes running ) is going to be expensive in resources. There are probably much better ways of doing these things. ( sv, inotify ... )
- From experience of running hosts in virtualised environments I'm not convinced that web applications run well with many virtual CPU. I believe that there may be hypervisor cpu scheduling conflicts with other VM's, and that unless there is clear evidence that performance is enhanced with additional CPU and that CPU numbers are required, then extra CPU numbers can detract from performance, rather than improve it.
Regards,
Kate
comment:89 in reply to: ↑ 88 ; follow-up: ↓ 100 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 10.56 to 10.81
Replying to kate:
- There were many bash "killall" processes running.
Yes, BOA goes on suicidal killing sprees quite often, for example see ticket:846#comment:58 ...
- I reduced the number of CPU to 4
Note that the server also currently has twice the RAM, I have updated the wiki page to reflect this, wiki:PuffinServer?action=diff&version=189&old_version=188
I commented out several shell scripts that are scheduled to run from root's crontab ( see output of crontab -l -u root If it is commented out I did it. )
I'm not sure what the effect will be of this:
crontab -l -u root #* * * * * bash /var/xdrago/second.sh >/dev/null 2>&1 #* * * * * bash /var/xdrago/minute.sh >/dev/null 2>&1 #* * * * * bash /var/xdrago/runner.sh >/dev/null 2>&1 #* * * * * bash /var/xdrago/manage_ltd_users.sh >/dev/null 2>&1 #03 * * * * bash /var/xdrago/clear.sh >/dev/null 2>&1 #14 * * * * bash /var/xdrago/purge_binlogs.sh >/dev/null 2>&1 #01 0 * * * bash /var/xdrago/graceful.sh >/dev/null 2>&1 #08 0 * * * bash /var/xdrago/mysql_backup.sh >/dev/null 2>&1 #08 2 * * * bash /opt/local/bin/backboa backup >/dev/null 2>&1 #08 8 * * * bash /opt/local/bin/duobackboa backup >/dev/null 2>&1 #58 2 * * * bash /var/xdrago/daily.sh >/dev/null 2>&1 #58 5 * * * bash /var/xdrago/fetch-remote.sh >/dev/null 2>&1
Some of the bash scripts were probably doing things that are needed so I fear that someone if going to have to go through all of these and work out which ones...
comment:90 Changed 11 months ago by paul
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 10.81 to 11.06
I'll install the honeypot module which should prevent spam bots from completing forms successfully ..
comment:91 follow-up: ↓ 92 Changed 11 months ago by paul
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 11.06 to 11.31
Sam,
Would you have a look to see if you can configure the honeypot module with the primary administrator account (uid=1)?
comment:92 in reply to: ↑ 91 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 11.31 to 11.56
Replying to paul:
Would you have a look to see if you can configure the honeypot module with the primary administrator account (uid=1)?
Paul you can get a temp login as UID 1 (without knowing the password or changing it) using drush uli.
I spent some time looking at the munin stats and things seem to be going OK so far.
I have opened a new ticket for the BOA cron jobs, ticket:893.
comment:93 Changed 11 months ago by sam
Hi Paul
Yep it looks like I can.
I enabled it for
User Registration form
User Password Reset form
However I can't see a login form option, unless Drupal regards the login/registration forms as equivalent?
I'll do a bit of testing now to make sure it doesn't conflict with the existing spam protection stuff; Botcha etc.
Thanks
Sam
comment:94 follow-up: ↓ 95 Changed 11 months ago by paul
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 11.56 to 11.81
@Chris
Normally I can but I couldn't figure out quickly how to to run the command as user tn with sudo or su?
@Sam
Sorry. I don't know. Probably easier for you to explore the problem if you have access. But, let me know if you need further assistance.
comment:95 in reply to: ↑ 94 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 11.81 to 12.06
Replying to paul:
@Chris
Normally I can but I couldn't figure out quickly how to to run the command as user tn with sudo or su?
OK, to find where the site is I looked in /etc/passwd and found:
tn.web:x:118:33::/home/tn.web:/bin/false
But the site isn't in /home/tn.web.
So trying to find it via /etc/nginx/nginx.conf, wehere we have include /etc/nginx/sites-enabled/*; but:
ls -lah /etc/nginx/sites-enabled/ ls: cannot access /etc/nginx/sites-enabled/: No such file or directory
However /etc/nginx/conf.d/aegir.conf is a symlink to /var/aegir/config/nginx.conf and this ends with:
include /var/aegir/config/server_master/nginx/pre.d/*; include /var/aegir/config/server_master/nginx/platform.d/*; include /var/aegir/config/server_master/nginx/vhost.d/*; include /var/aegir/config/server_master/nginx/post.d/*;
So looking in /var/aegir/config/server_master/nginx/vhost.d/ but there isn't the TN site config there... I don't want to waste any more time trying to work through the maze of Nginx config BOA created.
Paul if you know where BOA has hidden the site directory then I'll be able to work out the drush command...
comment:96 Changed 11 months ago by sam
Well it all seems Ok for now. The bash script changes may have done the trick temporarily ? I suggest we just leave it alone for now. Go and have Christmas and look again in the new year. Thanks everyone for your work on this. Hope you all have a nice few days off. Best Sam Sent from my Sony Xperia™ smartphone ---- Transition Technology Trac wrote ---- >#846: Load Spikes on BOA PuffinServer >-------------------------------------+------------------------------------- > Reporter: chris | Owner: chris > Type: maintenance | Status: new > Priority: major | Milestone: > Component: Live server | Maintenance > Keywords: | Resolution: >Add Hours to Ticket: 0.25 | Estimated Number of Hours: 0.0 > Total Hours: 11.56 | Billable?: 1 >-------------------------------------+------------------------------------- >Changes (by paul): > > * hours: 0.0 => 0.25 > * totalhours: 11.56 => 11.81 > > >Comment: > > @Chris > > Normally I can but I couldn't figure out quickly how to to run the command > as user tn with sudo or su? > > @Sam > > Sorry. I don't know. Probably easier for you to explore the problem if you > have access. But, let me know if you need further assistance. > >-- >Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:94> >Transition Technology <https://tech.transitionnetwork.org/trac> >Support and issues tracking for the Transition Network Web Project.
comment:97 Changed 11 months ago by sam
I think I was a bit unclear before. It's not that I was unable to configure honeypot it just seems it doesn't protect login forms. We have protected the registration form with it, so that may be enough to reduce load, certainly alongside the script changes. Let's just leave it alone for now. Thanks Sam Sent from my Sony Xperia™ smartphone ---- Transition Technology Trac wrote ---- >#846: Load Spikes on BOA PuffinServer >-------------------------------------+------------------------------------- > Reporter: chris | Owner: chris > Type: maintenance | Status: new > Priority: major | Milestone: > Component: Live server | Maintenance > Keywords: | Resolution: >Add Hours to Ticket: 0.25 | Estimated Number of Hours: 0.0 > Total Hours: 11.81 | Billable?: 1 >-------------------------------------+------------------------------------- >Changes (by chris): > > * hours: 0.0 => 0.25 > * totalhours: 11.81 => 12.06 > > >Comment: > > Replying to [comment:94 paul]: > > @Chris > > > > Normally I can but I couldn't figure out quickly how to to run the > command as user tn with sudo or su? > > OK, to find where the site is I looked in `/etc/passwd` and found: > > {{{ > tn.web:x:118:33::/home/tn.web:/bin/false > }}} > > But the site isn't in `/home/tn.web`. > > So trying to find it via `/etc/nginx/nginx.conf`, wehere we have `include > /etc/nginx/sites-enabled/*;` but: > > {{{ > ls -lah /etc/nginx/sites-enabled/ > ls: cannot access /etc/nginx/sites-enabled/: No such file or directory > }}} > > However `/etc/nginx/conf.d/aegir.conf` is a symlink to > `/var/aegir/config/nginx.conf` and this ends with: > > {{{ > include /var/aegir/config/server_master/nginx/pre.d/*; > include /var/aegir/config/server_master/nginx/platform.d/*; > include /var/aegir/config/server_master/nginx/vhost.d/*; > include /var/aegir/config/server_master/nginx/post.d/*; > }}} > > So looking in `/var/aegir/config/server_master/nginx/vhost.d/` but there > isn't the TN site config there... I don't want to waste any more time > trying to work through the maze of Nginx config BOA created. > > Paul if you know where BOA has hidden the site directory then I'll be able > to work out the `drush` command... > >-- >Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/846#comment:95> >Transition Technology <https://tech.transitionnetwork.org/trac> >Support and issues tracking for the Transition Network Web Project.
comment:98 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 12.06 to 12.31
Thanks Sam, I wrote the following before I read your comments:
I think I have found the sites:
cd /data/ find ./ -name index.php ./disk/tn/static/transition-network-d6-35-p001b-booker/index.php ./disk/tn/static/drupal-6.37/index.php ./disk/tn/static/transition-network-d6-35-s001b-booker/index.php ./disk/tn/aegir/distro/019/profiles/hostmaster/modules/hosting/example/index.php ./disk/tn/aegir/distro/019/index.php ./disk/tn/platforms/transitionnetwork.org/index.php ./all/000/core/drupal-7.34.1/index.php ./all/000/core/drupal-7.33.1/index.php ./all/000/core/pressflow-6.29.1/index.php ./all/000/core/pressflow-6.34.1/index.php ./all/000/core/drupal-7.23.3/index.php ./all/000/core/drupal-7.32.1/index.php ./all/000/core/drupal-7.27.1/index.php ./all/000/core/drupal-7.30.1/index.php ./all/000/core/drupal-7.36.1/index.php ./all/000/core/pressflow-6.32.2/index.php ./all/000/core/drupal-7.28.1/index.php ./all/000/core/pressflow-6.35.1/index.php ./all/000/core/drupal-7.24.1/index.php ./all/000/core/pressflow-6.33.1/index.php ./all/000/core/pressflow-6.31.1/index.php ./all/000/core/pressflow-6.31.2/index.php ./all/000/core/pressflow-6.28.3/index.php ./all/000/core/drupal-7.31.1/index.php
So I guess /data/disk/tn/platforms/transitionnetwork.org/ is the site?
So:
sudo -i su - tn -s /bin/bash cd /data/disk/tn/platforms/transitionnetwork.org drush uli Drush command terminated abnormally due to an unrecoverable error. [error]
Hmm.... according to the BOA Changelog:
* Allow 'drush uli' as root, but deny root access to Drush by default.
This came with:
### Stable BOA-2.2.0 Release - Full Edition ### Date: Mon Mar 31 06:44:08 SGT 2014
And we are running:
tail -1 /var/log/barracuda_log.txt Fri May 1 22:50:58 BST 2015 / Debian.wheezy x86_64 XEN / Aegir BOA-2.4.2 / Barracuda BOA-2.4.2 / Nginx 1.8.0 / PHP-MI 5.3 / PHP-SE / FPM 5.3 / CLI 5.3 / MariaDB-5.5.43 localhost
So, trying that:
sudo -i cd /data/disk/tn/platforms/transitionnetwork.org drush uli Drush command terminated abnormally due to an unrecoverable error. [error]
So sorry I don't know how to fix this, this looks like more broken by design by BOA nonsense.
comment:99 Changed 11 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.05
- Status changed from new to closed
- Resolution set to fixed
- Total Hours changed from 12.31 to 12.36
Closing this ticket as the load spike problem appears to have been solved by commenting out all the BOA cron jobs, see ticket:893
comment:100 in reply to: ↑ 89 Changed 10 months ago by chris
- Add Hours to Ticket changed from 0.0 to 0.23
- Total Hours changed from 12.36 to 12.59
Replying to chris:
Replying to kate:
- I reduced the number of CPU to 4
Note that the server also currently has twice the RAM, I have updated the wiki page to reflect this, wiki:PuffinServer?action=diff&version=189&old_version=188
For the last week the min spare RAM for PuffinServer has been 4.76G:
So I have edited the Xen config for the server:
#memory = '18432' memory = '13824'
Based on this calculation:
18-4.5 13.5 13.5*1024 13824.0
I haven't yet rebooted it (I'll do it late one evening when less people are using the site), so it is still currently running with 18GB of RAM, even though the TN has only being invoiced for 9GB, but the invoice is for a server with 14 cores when it currently has 4 (Webarchitects will sort out their virtual server specs to better reflect what works well for the next quarter). When that has been done the top of the PuffinServer wiki page will need updating to reflect the change.
The site was unavailable for a little time today due to a load spike, lfd email:
There is simply a gap in the munin graphs, so nothing to indicate the cause there.
Around the same time as the load spike are 107 entries in /var/log/syslog like this:
In /var/log/php/php53-fpm-error.log we have a lot or errors around the same time, for example:
Then it looks like BOA killed the server:
In the /var/log/lfd.log we have this:
I'm still unsure of the cause of the load spike.