Ticket #287 (closed defect: duplicate)
Live Server Load
Reported by: | chris | Owned by: | chris |
---|---|---|---|
Priority: | major | Milestone: | Phase 4 |
Component: | Live server | Keywords: | |
Cc: | ed,jim | Estimated Number of Hours: | 0.2 |
Add Hours to Ticket: | 0 | Billable?: | yes |
Total Hours: | 0.93 |
Description
Currently the load is through the roof (anything over 4 means it's maxed out, as it has 4 processors):
load average: 43.63, 33.84, 18.85
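For reference, a quick way to compare those figures against the core count (a sketch, not actual output from quince):

grep -c ^processor /proc/cpuinfo   # number of CPUs - should report 4 here
uptime                             # current 1/5/15 minute load averages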
See also: https://kiwi.transitionnetwork.org/munin/webarch.net/quince.webarch.net.html
This could be because of things Jim is doing combined with the GA import that is running, but I'm seeing a *lot* of Apache processes, so it seems to be combined with a spike in traffic.
I'm keen to migrate the site onto a faster server :-)
Change History
comment:2 Changed 5 years ago by jim
- Priority changed from major to blocker
Cannot SSH into server.
OK, I think the best plan is to restart every service, then for me to put the OLD version of the CR file engine on, then re-enable it. That should give us speed without the data loss... Hopefully.
Chris, when can I get in?
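For reference, the "restart every service" step would presumably just be the usual Debian init scripts once SSH access is back (a sketch, assuming the standard service names on quince):

/etc/init.d/apache2 restart
/etc/init.d/mysql restart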
comment:3 Changed 5 years ago by ed
OK - things are getting a little on top of us and we need to back off LIVE. We're clearly doing too many things at once, and knickers are getting in a twist. Timing is very critical with the newsletter today and the conference in one week...
I'm getting texts from Network...
- we can't do any server moves right now
- we have a newsletter going out in a few hours which is not delayable - the conference is undersold and we *need* to have a fast conference page etc.
- What is driving the activity? We'll need to close whatever it is down, or delay it until a quiet time over the weekend - definitely stop the GA import, it's not critical
- presumably it's also the cacherouter being turned off, along with the aliases work and other things?
SO:
- What can we not do right now to calm the situation in the short term?
- What are the caching priorities to calm the situation in the medium term? Will Varnish help this? If so, we focus on just that, and resolving the issues brought up in the Skype call.
e.g. If cacherouter is needed to keep the site running smoothly, do we bar all UGC addition to the site for a bit?
comment:4 Changed 5 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 0.0 to 0.1
Looks like I'm going to have to reboot it, my ssh sessions are unresponsive, the load has clearly gone so high it's not responding :-(
comment:5 Changed 5 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.03
- Total Hours changed from 0.1 to 0.13
It is in the process of shutting down, it's just taking a while...
comment:6 Changed 5 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 0.13 to 0.23
I'm getting a lot of these messages on the console and it's still using loads of CPU:
[1205922.667284] INFO: task sh:22253 blocked for more than 120 seconds.
[1205922.667302] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1205922.667311] sh            D ffff8800459bfe30     0 22253  22234
[1205922.667320]  ffff880001733cb8 0000000000000286 0000000000000000 0000000000000001
[1205922.667330]  ffff88005bcec200 ffff88000152af80 ffff88005bcec480 0000000080223845
[1205922.667340]  ffff880000000005 00000000002943de ffff88007fe0d400 000000000000000f
[1205922.667349] Call Trace:
[1205922.667370]  [<ffffffffa000b5a1>] :jbd:journal_stop+0x1e7/0x1f3
[1205922.667381]  [<ffffffff80243f42>] getnstimeofday+0x39/0x98
[1205922.667390]  [<ffffffff802647bd>] sync_page+0x0/0x41
[1205922.667399]  [<ffffffff80435337>] io_schedule+0x5c/0x9e
[1205922.667406]  [<ffffffff802647f9>] sync_page+0x3c/0x41
[1205922.667412]  [<ffffffff804354aa>] __wait_on_bit_lock+0x36/0x66
[1205922.667422]  [<ffffffff802647aa>] __lock_page+0x5e/0x64
[1205922.667430]  [<ffffffff8023f6db>] wake_bit_function+0x0/0x23
[1205922.667436]  [<ffffffff8026498b>] find_lock_page+0x79/0xc3
[1205922.667443]  [<ffffffff8026675d>] filemap_fault+0x72/0x345
[1205922.667452]  [<ffffffff80271b6d>] __do_fault+0x51/0x50b
[1205922.667459]  [<ffffffff8027691c>] handle_mm_fault+0x646/0xcaa
[1205922.667470]  [<ffffffff802181bf>] do_page_fault+0xb69/0xf46
[1205922.667477]  [<ffffffff80311bab>] __up_write+0x21/0x10e
[1205922.667485]  [<ffffffff80436e67>] error_exit+0x0/0x69
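For what it's worth, a quick way to see which processes are stuck in uninterruptible sleep (the "D" state these hung_task warnings refer to) would be something like this - a sketch, not output from quince:

ps -eo pid,stat,wchan:30,cmd | awk 'NR==1 || $2 ~ /^D/'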
I will give it a bit longer and then might have to do a more brutal restart via a virtual power cycle if we don't get anywhere...
comment:7 Changed 5 years ago by chris
It's running again - I did the brutal thing. Jim, you can ssh in now.
I'm not going to restart the GA import again, for now.
comment:8 Changed 5 years ago by jim
Cheers Chris, CR file engine is back on...
Am working on a patch to try to improve it, may put it on the server later to see if it makes a difference...
In the meantime the best thing is to keep LIVE load down until the Friday rush/newsletter send cools off.
comment:9 Changed 5 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.05
- Total Hours changed from 0.23 to 0.28
Messages from the boot:
Jul 1 11:53:52 quince /etc/mysql/debian-start[1492]: Triggering myisam-recover for all MyISAM tables
Jul 1 11:53:52 quince mysqld[1439]: 110701 11:53:52 [ERROR] /usr/sbin/mysqld: Table './live/cache' is marked as crashed and should be repaired
Jul 1 11:53:52 quince mysqld[1439]: 110701 11:53:52 [Warning] Checking table: './live/cache'
Jul 1 11:53:53 quince mysqld[1439]: 110701 11:53:53 [ERROR] /usr/sbin/mysqld: Table './live/cache_filter' is marked as crashed and should be repaired
Jul 1 11:53:53 quince mysqld[1439]: 110701 11:53:53 [Warning] Checking table: './live/cache_filter'
Jul 1 11:53:55 quince mysqld[1439]: 110701 11:53:55 [ERROR] /usr/sbin/mysqld: Table './live/cache_menu' is marked as crashed and should be repaired
Jul 1 11:53:55 quince mysqld[1439]: 110701 11:53:55 [Warning] Checking table: './live/cache_menu'
Jul 1 11:53:57 quince mysqld[1439]: 110701 11:53:57 [ERROR] /usr/sbin/mysqld: Table './live/cache_page' is marked as crashed and should be repaired
Jul 1 11:53:57 quince mysqld[1439]: 110701 11:53:57 [Warning] Checking table: './live/cache_page'
I guess these all need purging in any case, but perhaps this indicates what caused the problem?
comment:10 Changed 5 years ago by jim
Almost certainly. Truncate the tables and repair them, Drupal won't mind!
Could explain the issues I was having last night with url bulk update...
comment:11 Changed 5 years ago by chris
Truncate the tables and repair them
Do you want to do that or should I?
comment:12 Changed 5 years ago by jim
Never repaired anything in MySQL... Can you do it please?
Might be worth taking the site down for a minute while it happens - that's at https://www.transitionnetwork.org/admin/settings/site-maintenance - then clearing the caches in Drupal, then running the mysql command, then taking Drupal out of maintenance mode.
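For reference, the truncate-and-repair would presumably be something along these lines for each of the tables flagged in the boot log (a sketch, not yet run against live):

mysql> TRUNCATE TABLE cache;
mysql> REPAIR TABLE cache;
mysql> TRUNCATE TABLE cache_filter;
mysql> REPAIR TABLE cache_filter;

(and the same for cache_menu and cache_page)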
comment:13 Changed 5 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.2
- Total Hours changed from 0.28 to 0.48
Actually it seems to have sorted itself out?
mysql> check table cache;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| live.cache | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (0.01 sec)

mysql> check table cache_filter;
+-------------------+-------+----------+----------+
| Table             | Op    | Msg_type | Msg_text |
+-------------------+-------+----------+----------+
| live.cache_filter | check | status   | OK       |
+-------------------+-------+----------+----------+
1 row in set (0.05 sec)

mysql> check table cache_menu;
+-----------------+-------+----------+----------+
| Table           | Op    | Msg_type | Msg_text |
+-----------------+-------+----------+----------+
| live.cache_menu | check | status   | OK       |
+-----------------+-------+----------+----------+
1 row in set (0.01 sec)

mysql> check table cache_page;
+-----------------+-------+----------+----------+
| Table           | Op    | Msg_type | Msg_text |
+-----------------+-------+----------+----------+
| live.cache_page | check | status   | OK       |
+-----------------+-------+----------+----------+
1 row in set (0.06 sec)
If we need to look at this again, some suggestions here: http://www.felipecruz.com/repair-mysql-database.php
comment:15 Changed 5 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.05
- Total Hours changed from 0.48 to 0.53
Reading through the comments here:
http://serverfault.com/questions/280916/unable-to-remove-limit-on-memory-usage-for-php-script
Makes me wonder whether it wasn't Piwik that brought down the server today.
comment:16 Changed 5 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.4
- Total Hours changed from 0.53 to 0.93
Looks like there is a problem with MySQL, lots of these messages in /var/log/syslog:
Jul 2 00:23:53 quince apache2: PHP Warning: MySQL server has gone away#012query: INSERT INTO watchdog#012 (uid, type, message, variables, severity, link, location, referer, hostname, timestamp)#012 VALUES#012 (0, 'php', '%type: %message in %function (line %line of %file).', 'a:5:{s:5:\"%type\";s:12:\"User warning\";s:8:\"%message\";s:120:\"MySQL server has gone away\nquery: SELECT expire, value FROM semaphore WHERE name = &#039;variable_cache_regenerate&#039;\&
One suggestion I found:
If the problem is with watchdog table, disable Syslog module.
Having read some of these:
- http://dev.mysql.com/doc/refman/5.0/en/gone-away.html
- http://drupal.org/node/259580
- http://drupal.org/node/163145
- http://drupal.org/node/51263
I have tweaked some settings in /etc/mysql/my.cnf and so far the errors appear to have gone...
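The usual suspects for "gone away" errors, per the pages above, are settings along these lines in the [mysqld] section (illustrative values only, not necessarily the exact changes made):

max_allowed_packet = 16M
wait_timeout       = 600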
comment:17 Changed 5 years ago by jim
I restarted mysql around 00:23 on Jul 2 because the bulk URL update hung it again. Runs fine on my local machine... can run 4 at once.
Now you've changed other settings on the other ticket, is it worth trying this again?
comment:18 Changed 5 years ago by chris
Now you've changed other settings on the other ticket, is it worth trying this again?
Sure, go for it, let me know if you want me to keep an eye on it when you run it.
comment:19 Changed 5 years ago by chris
- Status changed from new to closed
- Resolution set to duplicate
I'm closing this as it appears that it was caused, twice, by the Piwik GA import code, so we might as well deal with it on ticket:160.
What's using CPU, apache or mysql? I'm guessing the latter.
Dropping CacheRouter is almost certainly the cause of this, plus the extra Thu/Fri traffic the site seems to get.
Did you say memcache is available on LIVE? Could we turn that on as a stop-gap?
I would re-enable the path alias cache now too (I turned it off last night when the url rebuild happened, but I cannot get into the site!)
OR we could re-enable CR and just accept we're going to lose a little data... Not a great choice.