Ticket #543 (closed maintenance: fixed)

Opened 4 years ago

Last modified 3 years ago

Puffin Load Spike

Reported by: chris Owned by: chris
Priority: minor Milestone: Maintenance
Component: Live server Keywords:
Cc: ed, jim Estimated Number of Hours: 0.0
Add Hours to Ticket: 0 Billable?: yes
Total Hours: 0.55

Description

Puffin just had a massive load spike up to 65 and then it calmed down and it's up in the 70's again:

uptime
 21:46:43 up 91 days,  3:01,  4 users,  load average: 73.45, 30.76, 28.09

I'm trying to find the cause, it's making the TN site unresponsive.

Change History

comment:1 follow-up: ↓ 2 Changed 4 years ago by mark

I did a development platform build earlier on Aegir, but haven't done
anything since (for the last 4 hours).  Guessing it's *not* this, but
thought I'd mention it anyway...


On 3 May 2013 21:48, Transiton Technology Trac <
trac@tech.transitionnetwork.org> wrote:

> #543: Puffin Load Spike
> -----------------------------------------+-----------------------------
>                  Reporter:  chris        |                Owner:  chris
>                      Type:  maintenance  |               Status:  new
>                  Priority:  critical     |            Milestone:
>                 Component:  Live server  |             Keywords:
> Estimated Number of Hours:  0            |  Add Hours to Ticket:  0
>                 Billable?:  1            |          Total Hours:  0
> -----------------------------------------+-----------------------------
>  Puffin just had a massive load spike up to 65 and then it calmed down and
>  it's up in the 70's again:
>
>  {{{
>  uptime
>   21:46:43 up 91 days,  3:01,  4 users,  load average: 73.45, 30.76, 28.09
>  }}}
>
>  I'm trying to find the cause, it's making the TN site unresponsive.
>
> --
> Ticket URL: <https://tech.transitionnetwork.org/trac/ticket/543>
> Transition Technology <https://tech.transitionnetwork.org/trac>
> Support and issues tracking for the Transition Network Web Project.
>

comment:2 in reply to: ↑ 1 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 0.0 to 0.25

Replying to mark:

I did a development platform build earlier on Aegir, but haven't done
anything since (for the last 4 hours). Guessing it's *not* this, but
thought I'd mention it anyway...

4 hours ago -- that rules it out, thanks for confirming that this wasn't the cause.

The load peaked almost in the 80s:

uptime
 21:46:37 up 91 days,  3:01,  3 users,  load average: 79.76, 31.26, 28.24

It looks to me like there was some kind of kernel issue, this from /var/log/kern.log there were a lot of these for different CPUs:

May  3 21:46:35 puffin kernel: [7873260.245957] BUG: soft lockup - CPU#0 stuck for 61s! [munin-node:40006]
May  3 21:46:35 puffin kernel: [7873260.245973] Modules linked in: nf_nat_ftp ip6table_filter ip6_tables joydev sg st sd_mod crc_t10dif sr_mod scsi_mod ide_gd_mod ide_cd_mod ide_core cdrom xt_tcpudp ipt_REDIRECT xt_conntrack iptable_mangle nf_conntrack_ftp ipt_REJECT ipt_LOG xt_limit xt_multiport xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr evdev ext3 jbd mbcache dm_mod xen_blkfront xen_netfront
May  3 21:46:35 puffin kernel: [7873260.246078] CPU 0:
May  3 21:46:35 puffin kernel: [7873260.246084] Modules linked in: nf_nat_ftp ip6table_filter ip6_tables joydev sg st sd_mod crc_t10dif sr_mod scsi_mod ide_gd_mod ide_cd_mod ide_core cdrom xt_tcpudp ipt_REDIRECT xt_conntrack iptable_mangle nf_conntrack_ftp ipt_REJECT ipt_LOG xt_limit xt_multiport xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr evdev ext3 jbd mbcache dm_mod xen_blkfront xen_netfront
May  3 21:46:35 puffin kernel: [7873260.246181] Pid: 40006, comm: munin-node Not tainted 2.6.32-5-xen-amd64 #1
May  3 21:46:35 puffin kernel: [7873260.246188] RIP: e030:[<ffffffff8130e0a8>]  [<ffffffff8130e0a8>] _spin_lock+0x15/0x1b
May  3 21:46:35 puffin kernel: [7873260.246204] RSP: e02b:ffff8800ef5f1ec0  EFLAGS: 00000293
May  3 21:46:35 puffin kernel: [7873260.246210] RAX: 0000000000000c27 RBX: ffff8800b99accc0 RCX: 0000000000000000
May  3 21:46:35 puffin kernel: [7873260.246218] RDX: 0000000000000c24 RSI: ffffffff813e3d79 RDI: ffffffff814a1000
May  3 21:46:35 puffin kernel: [7873260.246226] RBP: ffff8800ef5f1ef8 R08: 000000000097c1f0 R09: 0000000000000000
May  3 21:46:35 puffin kernel: [7873260.246234] R10: 0000000000a3a6d8 R11: ffffffff81154882 R12: ffff8800ff44f240
May  3 21:46:35 puffin kernel: [7873260.246241] R13: 0000000000c9b268 R14: ffff8800ef5f1f58 R15: 0000000000c9b058
May  3 21:46:35 puffin kernel: [7873260.246255] FS:  00007f54ce992700(0000) GS:ffff880003b85000(0000) knlGS:0000000000000000
May  3 21:46:35 puffin kernel: [7873260.246264] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
May  3 21:46:35 puffin kernel: [7873260.246270] CR2: 0000000000a33948 CR3: 000000007cc21000 CR4: 0000000000000660
May  3 21:46:35 puffin kernel: [7873260.246278] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May  3 21:46:35 puffin kernel: [7873260.247305] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May  3 21:46:35 puffin kernel: [7873260.247305] Call Trace:
May  3 21:46:35 puffin kernel: [7873260.247305]  [<ffffffff811001f6>] ? d_alloc+0x14b/0x186
May  3 21:46:35 puffin kernel: [7873260.247305]  [<ffffffff810f6836>] ? create_write_pipe+0xe9/0x1aa
May  3 21:46:35 puffin kernel: [7873260.247305]  [<ffffffff810f691c>] ? do_pipe_flags+0x25/0xdd
May  3 21:46:35 puffin kernel: [7873260.247305]  [<ffffffff810f69e7>] ? sys_pipe2+0x13/0x4f
May  3 21:46:35 puffin kernel: [7873260.247305]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b

This is the result from grepping for BUG in kern.log from today

May  3 21:46:35 puffin kernel: [7873260.245957] BUG: soft lockup - CPU#0 stuck for 61s! [munin-node:40006]
May  3 21:46:35 puffin kernel: [7873260.273706] BUG: soft lockup - CPU#10 stuck for 61s! [nginx:39158]
May  3 21:46:35 puffin kernel: [7873260.251798] BUG: soft lockup - CPU#2 stuck for 61s! [swapper:0]
May  3 21:46:35 puffin kernel: [7873260.246490] BUG: soft lockup - CPU#4 stuck for 61s! [apt_all:39947]
May  3 21:46:35 puffin kernel: [7873260.309621] BUG: soft lockup - CPU#1 stuck for 61s! [bzip2:40041]
May  3 21:46:35 puffin kernel: [7873260.251808] BUG: soft lockup - CPU#13 stuck for 61s! [bash:40060]
May  3 21:46:35 puffin kernel: [7873260.309639] BUG: soft lockup - CPU#12 stuck for 61s! [swapper:0]
May  3 21:46:35 puffin kernel: [7873260.309629] BUG: soft lockup - CPU#9 stuck for 61s! [php-fpm:39175]
May  3 21:46:35 puffin kernel: [7873260.309652] BUG: soft lockup - CPU#6 stuck for 61s! [drush.php:39942]
May  3 21:46:35 puffin kernel: [7873260.323249] BUG: soft lockup - CPU#7 stuck for 61s! [swapper:0]
May  3 21:46:35 puffin kernel: [7873260.323247] BUG: soft lockup - CPU#5 stuck for 61s! [swapper:0]
May  3 21:46:35 puffin kernel: [7873260.309637] BUG: soft lockup - CPU#8 stuck for 61s! [swapper:0]
May  3 21:46:35 puffin kernel: [7873260.323252] BUG: soft lockup - CPU#11 stuck for 61s! [swapper:0]
May  3 21:46:35 puffin kernel: [7873261.417057] BUG: soft lockup - CPU#3 stuck for 61s! [swapper:0]

The same thing happened on 29th April:

Apr 29 16:04:00 puffin kernel: [7506959.700011] BUG: soft lockup - CPU#0 stuck for 61s! [munin-node:16305]
Apr 29 16:04:00 puffin kernel: [7506959.728797] BUG: soft lockup - CPU#4 stuck for 61s! [bash:2248]
Apr 29 16:04:00 puffin kernel: [7506959.734715] BUG: soft lockup - CPU#11 stuck for 61s! [bash:2243]
Apr 29 16:04:00 puffin kernel: [7506959.779966] BUG: soft lockup - CPU#9 stuck for 61s! [bash:2247]
Apr 29 16:04:00 puffin kernel: [7506959.807022] BUG: soft lockup - CPU#3 stuck for 61s! [drush.php:2227]
Apr 29 16:04:00 puffin kernel: [7506959.841325] BUG: soft lockup - CPU#6 stuck for 61s! [metche:2240]
Apr 29 16:04:00 puffin kernel: [7506959.839965] BUG: soft lockup - CPU#1 stuck for 61s! [bash:2244]
Apr 29 16:04:00 puffin kernel: [7506959.882711] BUG: soft lockup - CPU#7 stuck for 61s! [bash:2246]
Apr 29 16:04:00 puffin kernel: [7506959.859438] BUG: soft lockup - CPU#13 stuck for 61s! [perl:2241]
Apr 29 16:04:00 puffin kernel: [7506959.908581] BUG: soft lockup - CPU#8 stuck for 61s! [backupninja:2239]
Apr 29 16:04:00 puffin kernel: [7506959.908581] BUG: soft lockup - CPU#12 stuck for 61s! [apt_all:2237]
Apr 29 16:04:00 puffin kernel: [7506960.812723] BUG: soft lockup - CPU#10 stuck for 61s! [bash:2249]
Apr 29 16:04:00 puffin kernel: [7506960.844747] BUG: soft lockup - CPU#2 stuck for 61s! [php-fpm:1139]
Apr 29 16:04:00 puffin kernel: [7506961.894745] BUG: soft lockup - CPU#5 stuck for 61s! [swapper:0]
Apr 29 16:04:00 puffin kernel: [7507025.168019] BUG: soft lockup - CPU#0 stuck for 61s! [munin-node:16305]
Apr 29 16:04:00 puffin kernel: [7507025.302077] BUG: soft lockup - CPU#13 stuck for 61s! [perl:2241]
Apr 29 16:04:00 puffin kernel: [7507025.170185] BUG: soft lockup - CPU#11 stuck for 61s! [bash:2243]
Apr 29 16:04:00 puffin kernel: [7507025.170188] BUG: soft lockup - CPU#9 stuck for 61s! [bash:2247]
Apr 29 16:04:00 puffin kernel: [7507025.246231] BUG: soft lockup - CPU#6 stuck for 61s! [metche:2240]
Apr 29 16:04:00 puffin kernel: [7507025.305361] BUG: soft lockup - CPU#7 stuck for 61s! [bash:2246]
Apr 29 16:04:00 puffin kernel: [7507025.305648] BUG: soft lockup - CPU#8 stuck for 61s! [backupninja:2239]
Apr 29 16:04:00 puffin kernel: [7507025.306237] BUG: soft lockup - CPU#4 stuck for 61s! [bash:2248]
Apr 29 16:04:00 puffin kernel: [7507025.306234] BUG: soft lockup - CPU#3 stuck for 61s! [drush.php:2227]
Apr 29 16:04:00 puffin kernel: [7507025.308802] BUG: soft lockup - CPU#1 stuck for 61s! [bash:2244]
Apr 29 16:04:00 puffin kernel: [7507025.315487] BUG: soft lockup - CPU#12 stuck for 61s! [apt_all:2237]
Apr 29 16:04:00 puffin kernel: [7507026.280892] BUG: soft lockup - CPU#2 stuck for 61s! [php-fpm:1139]
Apr 29 16:04:00 puffin kernel: [7507026.502369] BUG: soft lockup - CPU#10 stuck for 61s! [bash:2249]
Apr 29 16:04:00 puffin kernel: [7507027.370539] BUG: soft lockup - CPU#5 stuck for 61s! [swapper:0]
Apr 29 16:04:00 puffin kernel: [7507090.700009] BUG: soft lockup - CPU#0 stuck for 61s! [munin-node:16305]
Apr 29 16:04:00 puffin kernel: [7507090.673472] BUG: soft lockup - CPU#3 stuck for 61s! [drush.php:2227]
Apr 29 16:04:00 puffin kernel: [7507090.764482] BUG: soft lockup - CPU#6 stuck for 61s! [metche:2240]
Apr 29 16:04:00 puffin kernel: [7507090.820913] BUG: soft lockup - CPU#11 stuck for 61s! [bash:2243]
Apr 29 16:04:00 puffin kernel: [7507090.764491] BUG: soft lockup - CPU#9 stuck for 61s! [bash:2247]
Apr 29 16:04:00 puffin kernel: [7507090.764500] BUG: soft lockup - CPU#1 stuck for 61s! [bash:2244]
Apr 29 16:04:00 puffin kernel: [7507090.764492] BUG: soft lockup - CPU#13 stuck for 61s! [perl:2241]
Apr 29 16:04:00 puffin kernel: [7507090.764500] BUG: soft lockup - CPU#4 stuck for 61s! [bash:2248]
Apr 29 16:04:00 puffin kernel: [7507090.820931] BUG: soft lockup - CPU#8 stuck for 61s! [backupninja:2239]
Apr 29 16:04:00 puffin kernel: [7507090.858239] BUG: soft lockup - CPU#7 stuck for 61s! [bash:2246]
Apr 29 16:04:00 puffin kernel: [7507090.942543] BUG: soft lockup - CPU#12 stuck for 61s! [apt_all:2237]
Apr 29 16:04:00 puffin kernel: [7507092.017604] BUG: soft lockup - CPU#2 stuck for 61s! [php-fpm:1139]
Apr 29 16:04:00 puffin kernel: [7507092.054278] BUG: soft lockup - CPU#10 stuck for 61s! [bash:2249]
Apr 29 16:04:00 puffin kernel: [7507092.983020] BUG: soft lockup - CPU#5 stuck for 61s! [swapper:0]

I don't know the cause.

comment:3 Changed 4 years ago by chris

There are some old Xen bug reports of this problem, related to heavy load, here: http://bugzilla.xensource.com/bugzilla/buglist.cgi?quicksearch=BUG%3A+soft+lockup+detected

comment:4 Changed 4 years ago by ed

  • Milestone set to Maintenance

comment:5 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Priority changed from critical to minor
  • Total Hours changed from 0.25 to 0.35

Dropping this to minor but not closing because the cause of the kernel problems haven't been tracked down however this issue hasn't occurred again.

Since then the host machines kernel and version of Xen has been downgraded due to an unresolved bug, see ticket:535#comment:6

comment:6 Changed 4 years ago by chris

  • Status changed from new to accepted

comment:7 Changed 4 years ago by chris

See also ticket:552 for another load spike issue.

comment:8 Changed 3 years ago by mark

  • Cc mark removed

Just removing myself from this ticket, as I don't believe I can offer any help.

comment:9 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.2
  • Status changed from accepted to closed
  • Resolution set to fixed
  • Total Hours changed from 0.35 to 0.55

This issue has been followed up on ticket:555 and documented at wiki:PuffinServer#LoadSpikes which now included a list of all related tickets.

Note: See TracTickets for help on using tickets.