Ticket #287 (closed defect: duplicate)

Opened 5 years ago

Last modified 5 years ago

Live Server Load

Reported by: chris Owned by: chris
Priority: major Milestone: Phase 4
Component: Live server Keywords:
Cc: ed,jim Estimated Number of Hours: 0.2
Add Hours to Ticket: 0 Billable?: yes
Total Hours: 0.93

Description

Currently the load is through the roof (anything over 4 means it's maxed out, as it has 4 processors):

load average: 43.63, 33.84, 18.85

See also: https://kiwi.transitionnetwork.org/munin/webarch.net/quince.webarch.net.html

This could be because of things Jim is doing combined with the GA import that is running, but I'm seeing a *lot* of Apache processes, so it seems to be combined with a spike in traffic.

I'm keen to migrate the site onto a faster server :-)
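
For reference, a quick way to confirm the core count and see whether it's Apache or MySQL eating the CPU (standard commands, a sketch rather than output from quince):

grep -c ^processor /proc/cpuinfo      # number of cores - load above this means the box is saturated
uptime                                # 1, 5 and 15 minute load averages
ps -C apache2 --no-headers | wc -l    # rough count of Apache processes
top -bn1 | head -n 20                 # which processes are actually using the CPU right now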

Change History

comment:1 Changed 5 years ago by jim

What's using CPU, apache or mysql? I'm guessing the latter.

Dropping CacheRouter is almost certainly the cause of this, plus the extra Thursday/Friday traffic the site seems to get.

Did you say memcache is available on LIVE? Could we turn that on as a stop-gap?

I would re-enable the path alias cache now too (I turned it off last night when the URL rebuild happened), but I cannot get into the site!

OR we could re-enable CR and just accept we're going to lose a little data... Not a great choice.
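
For reference, switching the cache backend to memcache (as asked above) would be roughly the following in settings.php - a sketch only, assuming memcached is running locally on the default port and the memcache module lives at sites/all/modules/memcache:

$conf['cache_inc'] = './sites/all/modules/memcache/memcache.inc';    // use memcache instead of the cache_* DB tables
$conf['memcache_servers'] = array('127.0.0.1:11211' => 'default');   // assumes a local memcached on the default port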

comment:2 Changed 5 years ago by jim

  • Priority changed from major to blocker

Cannot SSH into server.

OK, I think the best plan is to restart every service, then for me to put the OLD version of the CR file engine on, then re-enable it. That should give us speed without the data loss... Hopefully.

Chris, when can I get in?

comment:3 Changed 5 years ago by ed

OK - things are getting a little on top of us and we need to back off LIVE. We're clearly doing too many things at once, and knickers are getting in a twist. Timing is very critical with the newsletter today and the conference in one week...

I'm getting texts from Network...

  1. We can't do any server moves right now.
  2. We have a newsletter going out in a few hours which is not delayable - the conference is undersold and we *need* to have a fast conference page etc.
  3. What is driving the activity? We'll need to close whatever it is down, or delay it until the quiet time over the weekend - definitely stop the GA import, it's not critical.
  4. Presumably it's also CacheRouter being turned off, along with the aliases work and other things?

SO:

  1. What can we stop doing right now to calm the situation in the short term?
  2. What are the caching priorities to calm the situation in the medium term? Will Varnish help this? If so, we focus on just that, and on resolving the issues brought up in the Skype call.

e.g. if CacheRouter is needed to keep the site running smoothly, do we bar all UGC additions to the site for a bit?

comment:4 Changed 5 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 0.0 to 0.1

Looks like I'm going to have to reboot it, my ssh sessions are unresponsive, the load has clearly gone so high that it's not responding :-(

comment:5 Changed 5 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.03
  • Total Hours changed from 0.1 to 0.13

It is in the process of shutting down, it's just taking a while...

comment:6 Changed 5 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 0.13 to 0.23

I'm getting a lot of these messages from the console and it's still using loads of CPU:

[1205922.667284] INFO: task sh:22253 blocked for more than 120 seconds.
[1205922.667302] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1205922.667311] sh            D ffff8800459bfe30     0 22253  22234
[1205922.667320]  ffff880001733cb8 0000000000000286 0000000000000000 0000000000000001
[1205922.667330]  ffff88005bcec200 ffff88000152af80 ffff88005bcec480 0000000080223845
[1205922.667340]  ffff880000000005 00000000002943de ffff88007fe0d400 000000000000000f
[1205922.667349] Call Trace:
[1205922.667370]  [<ffffffffa000b5a1>] :jbd:journal_stop+0x1e7/0x1f3
[1205922.667381]  [<ffffffff80243f42>] getnstimeofday+0x39/0x98
[1205922.667390]  [<ffffffff802647bd>] sync_page+0x0/0x41
[1205922.667399]  [<ffffffff80435337>] io_schedule+0x5c/0x9e
[1205922.667406]  [<ffffffff802647f9>] sync_page+0x3c/0x41
[1205922.667412]  [<ffffffff804354aa>] __wait_on_bit_lock+0x36/0x66
[1205922.667422]  [<ffffffff802647aa>] __lock_page+0x5e/0x64
[1205922.667430]  [<ffffffff8023f6db>] wake_bit_function+0x0/0x23
[1205922.667436]  [<ffffffff8026498b>] find_lock_page+0x79/0xc3
[1205922.667443]  [<ffffffff8026675d>] filemap_fault+0x72/0x345
[1205922.667452]  [<ffffffff80271b6d>] __do_fault+0x51/0x50b
[1205922.667459]  [<ffffffff8027691c>] handle_mm_fault+0x646/0xcaa
[1205922.667470]  [<ffffffff802181bf>] do_page_fault+0xb69/0xf46
[1205922.667477]  [<ffffffff80311bab>] __up_write+0x21/0x10e
[1205922.667485]  [<ffffffff80436e67>] error_exit+0x0/0x69

I will give it a bit longer and then might have to do a more brutal restart via a virtual power cycle if we don't get anywhere...

comment:7 Changed 5 years ago by chris

It's running again, I did the brutal thing; Jim, you can ssh in now.

I'm not going to restart the GA import again, for now.

comment:8 Changed 5 years ago by jim

Cheers Chris, CR file engine is back on...

I am working on a patch to try to improve it, and may put it on the server later to see if it makes a difference...

In the meantime the best thing is to keep LIVE load down until the Friday rush/newsletter send cools off.

comment:9 Changed 5 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.05
  • Total Hours changed from 0.23 to 0.28

Messages from the boot:

Jul  1 11:53:52 quince /etc/mysql/debian-start[1492]: Triggering myisam-recover for all MyISAM tables
Jul  1 11:53:52 quince mysqld[1439]: 110701 11:53:52 [ERROR] /usr/sbin/mysqld: Table './live/cache' is marked as crashed and should be repaired
Jul  1 11:53:52 quince mysqld[1439]: 110701 11:53:52 [Warning] Checking table:   './live/cache'
Jul  1 11:53:53 quince mysqld[1439]: 110701 11:53:53 [ERROR] /usr/sbin/mysqld: Table './live/cache_filter' is marked as crashed and should be repaired
Jul  1 11:53:53 quince mysqld[1439]: 110701 11:53:53 [Warning] Checking table:   './live/cache_filter'
Jul  1 11:53:55 quince mysqld[1439]: 110701 11:53:55 [ERROR] /usr/sbin/mysqld: Table './live/cache_menu' is marked as crashed and should be repaired
Jul  1 11:53:55 quince mysqld[1439]: 110701 11:53:55 [Warning] Checking table:   './live/cache_menu'
Jul  1 11:53:57 quince mysqld[1439]: 110701 11:53:57 [ERROR] /usr/sbin/mysqld: Table './live/cache_page' is marked as crashed and should be repaired
Jul  1 11:53:57 quince mysqld[1439]: 110701 11:53:57 [Warning] Checking table:   './live/cache_page' 

I guess these all need purging in any case, but perhaps this indicates what caused the problem?

comment:10 Changed 5 years ago by jim

Almost certainly. Truncate the tables and repair them, Drupal won't mind!

Could explain the issues I was having last night with the URL bulk update...

comment:11 Changed 5 years ago by chris

Truncate the tables and repair them

Do you want to do that or should I?

comment:12 Changed 5 years ago by jim

Never repaired anything in MySQL... Can you do it please?

Might be worth taking the site down for a minute while it happens (at https://www.transitionnetwork.org/admin/settings/site-maintenance), then clearing the caches in Drupal, then running the MySQL commands, then taking Drupal out of maintenance mode.
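
For the record, the MySQL side of that would be roughly this (a sketch only; table names taken from the boot log above):

mysql> TRUNCATE TABLE cache;        -- safe to empty, Drupal rebuilds these caches
mysql> TRUNCATE TABLE cache_filter;
mysql> TRUNCATE TABLE cache_menu;
mysql> TRUNCATE TABLE cache_page;
mysql> REPAIR TABLE cache, cache_filter, cache_menu, cache_page;   -- MyISAM repair of the crashed tables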

comment:13 Changed 5 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.2
  • Total Hours changed from 0.28 to 0.48

Actually it seems to have sorted itself out?

mysql> check table cache;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| live.cache | check | status   | OK       | 
+------------+-------+----------+----------+
1 row in set (0.01 sec)

mysql> check table cache_filter;
+-------------------+-------+----------+----------+
| Table             | Op    | Msg_type | Msg_text |
+-------------------+-------+----------+----------+
| live.cache_filter | check | status   | OK       | 
+-------------------+-------+----------+----------+
1 row in set (0.05 sec)

mysql> check table cache_menu;
+-----------------+-------+----------+----------+
| Table           | Op    | Msg_type | Msg_text |
+-----------------+-------+----------+----------+
| live.cache_menu | check | status   | OK       | 
+-----------------+-------+----------+----------+
1 row in set (0.01 sec)

mysql> check table cache_page;
+-----------------+-------+----------+----------+
| Table           | Op    | Msg_type | Msg_text |
+-----------------+-------+----------+----------+
| live.cache_page | check | status   | OK       | 
+-----------------+-------+----------+----------+
1 row in set (0.06 sec)

If we need to look at this again, some suggestions here: http://www.felipecruz.com/repair-mysql-database.php

comment:14 Changed 5 years ago by ed

  • Priority changed from blocker to major

comment:15 Changed 5 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.05
  • Total Hours changed from 0.48 to 0.53

Reading through the comments here:

http://serverfault.com/questions/280916/unable-to-remove-limit-on-memory-usage-for-php-script

It makes me wonder whether it was Piwik that brought the server down today.

comment:16 Changed 5 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.4
  • Total Hours changed from 0.53 to 0.93

Looks like there is a problem with MySQL; there are lots of these messages in /var/log/syslog:

Jul  2 00:23:53 quince apache2: PHP Warning:  MySQL server has gone away#012query: INSERT INTO watchdog#012    (uid, type, message, variables, severity, link, location, referer, hostname, timestamp)#012    VALUES#012    (0, 'php', '%type: %message in %function (line %line of %file).', 'a:5:{s:5:\"%type\";s:12:\"User warning\";s:8:\"%message\";s:120:\"MySQL server has gone away\nquery: SELECT expire, value FROM semaphore WHERE name = &#039;variable_cache_regenerate&#039;\&

One suggestion I found:

If the problem is with watchdog table, disable Syslog module.

http://drupal.org/node/227445#comment-4365944

Having read some of these, I have tweaked some settings in /etc/mysql/my.cnf and so far the errors appear to have gone...
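
(The usual my.cnf suspects for "server has gone away" are along these lines - illustrative values only, not necessarily the exact settings changed here:)

[mysqld]
max_allowed_packet = 16M    # a too-small packet limit is a classic cause of "gone away" on big inserts like watchdog
wait_timeout       = 600    # connections dropped by a short timeout produce the same error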

comment:17 Changed 5 years ago by jim

I restarted MySQL around 00:23 on Jul 2 because the bulk URL update hung it again. It runs fine on my local machine... I can run 4 at once.

Now that you've changed other settings on the other ticket, is it worth trying this again?

comment:18 Changed 5 years ago by chris

Now that you've changed other settings on the other ticket, is it worth trying this again?

Sure, go for it, let me know if you want me to keep an eye on it when you run it.

comment:19 Changed 5 years ago by chris

  • Status changed from new to closed
  • Resolution set to duplicate

I'm closing this as it appears that it was caused, twice, by the Piwik GA import code, so we might as well deal with it on ticket:160

Note: See TracTickets for help on using tickets.