Ticket #666 (closed maintenance: fixed)

Opened 3 years ago

Last modified 3 years ago

Parrot lockups

Reported by: chris Owned by: chris
Priority: critical Milestone: Maintenance
Component: Dev server Keywords:
Cc: ed, aland Estimated Number of Hours: 0.0
Add Hours to Ticket: 0 Billable?: yes
Total Hours: 0.32

Description

We have this on the console; I'm going to reboot the server:

[28800.164426]  [<ffffffffa001c1ba>] ? do_get_write_access+0x22c/0x452 [jbd2]
[28800.164435]  [<ffffffff81066360>] ? wake_bit_function+0x0/0x23
[28800.164443]  [<ffffffff8104b51c>] ? try_to_wake_up+0x289/0x29b
[28800.164453]  [<ffffffffa001c402>] ? jbd2_journal_get_write_access+0x22/0x33 [jbd2]
[28800.164475]  [<ffffffffa006289e>] ? __ext4_journal_get_write_access+0x4e/0x56 [ext4]
[28800.164492]  [<ffffffffa0042b8e>] ? ext4_reserve_inode_write+0x37/0x73 [ext4]
[28800.164508]  [<ffffffffa0042c05>] ? ext4_mark_inode_dirty+0x3b/0x1c4 [ext4]
[28800.164528]  [<ffffffffa005bdc7>] ? ext4_journal_start_sb+0xd4/0x10e [ext4]
[28800.164543]  [<ffffffffa0042eb0>] ? ext4_dirty_inode+0x30/0x46 [ext4]
[28800.164553]  [<ffffffff81109ead>] ? __mark_inode_dirty+0x25/0x14a
[28800.164560]  [<ffffffff8110138b>] ? file_update_time+0x101/0x130
[28800.164569]  [<ffffffff810b6835>] ? __generic_file_aio_write+0x16e/0x293
[28800.164578]  [<ffffffff810b69b3>] ? generic_file_aio_write+0x59/0x9f
[28800.164588]  [<ffffffff810f0316>] ? do_sync_write+0xce/0x113
[28800.164596]  [<ffffffff810fcd0c>] ? filldir+0x0/0xb7
[28800.164605]  [<ffffffff810549b1>] ? _local_bh_enable_ip+0x22/0x8f
[28800.164613]  [<ffffffff81066332>] ? autoremove_wake_function+0x0/0x2e
[28800.164626]  [<ffffffff8130f1a1>] ? _spin_lock_bh+0x9/0x25
[28800.164626]  [<ffffffff810f0c68>] ? vfs_write+0xa9/0x102
[28800.164632]  [<ffffffff810f0d18>] ? sys_pwrite64+0x57/0x77
[28800.164639]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
[28800.164657] INFO: task apache2:31559 blocked for more than 120 seconds.
[28800.164665] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28800.164675] apache2       D 0000000000000000     0 31559  28011 0x00000000
[28800.164691]  ffffffff8149f1f0 0000000000000286 0000000000000000 ffffffff81274f4f
[28800.164711]  ffff880002dc99d8 ffff8800bd1ca000 000000000000f9e0 ffff880002dc9fd8
[28800.164730]  00000000000157c0 00000000000157c0 ffff8800bd9746a0 ffff8800bd974998
[28800.164752] Call Trace:
[28800.164762]  [<ffffffff81274f4f>] ? sch_direct_xmit+0x7f/0x14c
[28800.164773]  [<ffffffff81066253>] ? bit_waitqueue+0x10/0xa0
[28800.164787]  [<ffffffffa001c1ba>] ? do_get_write_access+0x22c/0x452 [jbd2]
[28800.164798]  [<ffffffff81066360>] ? wake_bit_function+0x0/0x23
[28800.164812]  [<ffffffffa001c402>] ? jbd2_journal_get_write_access+0x22/0x33 [jbd2]
[28800.164833]  [<ffffffffa006289e>] ? __ext4_journal_get_write_access+0x4e/0x56 [ext4]
[28800.164852]  [<ffffffffa0042b8e>] ? ext4_reserve_inode_write+0x37/0x73 [ext4]
[28800.164871]  [<ffffffffa0042c05>] ? ext4_mark_inode_dirty+0x3b/0x1c4 [ext4]
[28800.164890]  [<ffffffffa005bdc7>] ? ext4_journal_start_sb+0xd4/0x10e [ext4]
[28800.164908]  [<ffffffffa0042eb0>] ? ext4_dirty_inode+0x30/0x46 [ext4]
[28800.164921]  [<ffffffff81109ead>] ? __mark_inode_dirty+0x25/0x14a
[28800.164932]  [<ffffffff8110138b>] ? file_update_time+0x101/0x130
[28800.164943]  [<ffffffff810b6835>] ? __generic_file_aio_write+0x16e/0x293
[28800.164958]  [<ffffffff8125227b>] ? sock_aio_write+0x0/0xbc
[28800.164969]  [<ffffffff8100cc43>] ? xen_make_pte+0x7b/0x83
[28800.164980]  [<ffffffff810b69b3>] ? generic_file_aio_write+0x59/0x9f
[28800.164992]  [<ffffffff810f0316>] ? do_sync_write+0xce/0x113
[28800.165003]  [<ffffffff81066332>] ? autoremove_wake_function+0x0/0x2e
[28800.165015]  [<ffffffff810ce24c>] ? handle_mm_fault+0x3b8/0x80f
[28800.165027]  [<ffffffff810f0c68>] ? vfs_write+0xa9/0x102
[28800.165038]  [<ffffffff810f0d7d>] ? sys_write+0x45/0x6e
[28800.165049]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
[125024.867759] hrtimer: interrupt took 38561246 ns
[1412520.196163] INFO: task mysqld:7928 blocked for more than 120 seconds.
[1412520.196183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1412520.196191] mysqld        D 0000000000000000     0  7928  18454 0x00000000
[1412520.196203]  ffff8800bfa69530 0000000000000286 0000000000000000 0000000000000000
[1412520.196215]  0007ffffffffffff 0000000000000001 000000000000f9e0 ffff880074cb5fd8
[1412520.196226]  00000000000157c0 00000000000157c0 ffff880002ce1530 ffff880002ce1828
[1412520.196236] Call Trace:
[1412520.196259]  [<ffffffffa00232bf>] ? jbd2_log_wait_commit+0xbf/0x112 [jbd2]
[1412520.196273]  [<ffffffff81066332>] ? autoremove_wake_function+0x0/0x2e
[1412520.196293]  [<ffffffffa003fb41>] ? ext4_sync_file+0x199/0x25c [ext4]
[1412520.196304]  [<ffffffff8110d6e0>] ? vfs_fsync_range+0x73/0x9e
[1412520.196319]  [<ffffffff8110d78a>] ? do_fsync+0x28/0x39
[1412520.196325]  [<ffffffff8110d7b9>] ? sys_fsync+0xb/0x10
[1412520.196333]  [<ffffffff81011b63>] ? sysret_check+0x17/0x5a
[1412520.196341]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b

Change History

comment:1 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.32
  • Total Hours changed from 0.0 to 0.32

The server is back up.

The console errors were not recent; this is from /var/log/kern.log.2.gz:

Dec 25 01:28:54 parrot kernel: [1412520.196163] INFO: task mysqld:7928 blocked for more than 120 seconds.
Dec 25 01:28:54 parrot kernel: [1412520.196183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 25 01:28:54 parrot kernel: [1412520.196191] mysqld        D 0000000000000000     0  7928  18454 0x00000000
Dec 25 01:28:54 parrot kernel: [1412520.196203]  ffff8800bfa69530 0000000000000286 0000000000000000 0000000000000000
Dec 25 01:28:54 parrot kernel: [1412520.196215]  0007ffffffffffff 0000000000000001 000000000000f9e0 ffff880074cb5fd8
Dec 25 01:28:54 parrot kernel: [1412520.196226]  00000000000157c0 00000000000157c0 ffff880002ce1530 ffff880002ce1828
Dec 25 01:28:54 parrot kernel: [1412520.196236] Call Trace:
Dec 25 01:28:54 parrot kernel: [1412520.196259]  [<ffffffffa00232bf>] ? jbd2_log_wait_commit+0xbf/0x112 [jbd2]
Dec 25 01:28:54 parrot kernel: [1412520.196273]  [<ffffffff81066332>] ? autoremove_wake_function+0x0/0x2e
Dec 25 01:28:54 parrot kernel: [1412520.196293]  [<ffffffffa003fb41>] ? ext4_sync_file+0x199/0x25c [ext4]
Dec 25 01:28:54 parrot kernel: [1412520.196304]  [<ffffffff8110d6e0>] ? vfs_fsync_range+0x73/0x9e
Dec 25 01:28:54 parrot kernel: [1412520.196319]  [<ffffffff8110d78a>] ? do_fsync+0x28/0x39
Dec 25 01:28:54 parrot kernel: [1412520.196325]  [<ffffffff8110d7b9>] ? sys_fsync+0xb/0x10
Dec 25 01:28:54 parrot kernel: [1412520.196333]  [<ffffffff81011b63>] ? sysret_check+0x17/0x5a
Dec 25 01:28:54 parrot kernel: [1412520.196341]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
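
For future lockups, something like the following could pull the hung-task reports out of a kern.log file and estimate when the machine was booted, which helps to tell old console output from recent events. This is only a rough sketch: the function names (parse_hung_task, scan_kern_log) are made up for illustration, the year has to be passed in because syslog lines don't include it (2013 is an assumption for the Dec 25 entry above), and it assumes the bracketed kernel timestamp is seconds since boot.

```python
import gzip
import re
from datetime import datetime, timedelta

# Matches syslog-formatted hung-task reports such as the mysqld line above.
HUNG_TASK = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] INFO: task (?P<task>[^:]+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def parse_hung_task(line, year):
    """Return details of one hung-task report, or None for any other line.

    `year` is supplied by the caller (an assumption), since the syslog
    timestamp only carries month, day and time.
    """
    m = HUNG_TASK.match(line)
    if m is None:
        return None
    logged = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
    uptime = float(m["uptime"])  # kernel timestamp: seconds since boot
    return {
        "task": m["task"],
        "pid": int(m["pid"]),
        "logged": logged,
        "uptime": uptime,
        # Approximate boot time: wall-clock log time minus uptime.
        "booted": logged - timedelta(seconds=uptime),
    }


def scan_kern_log(path, year):
    """Collect all hung-task reports from a plain or gzipped kern.log."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", errors="replace") as fh:
        reports = (parse_hung_task(line.rstrip("\n"), year) for line in fh)
        return [r for r in reports if r is not None]
```

For example, scan_kern_log("/var/log/kern.log.2.gz", 2013) would be expected to return one entry for the mysqld report above. Its kernel timestamp of about 1412520 seconds is a little over 16 days of uptime, so the estimated boot time comes out on 8 December.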

I can't see anything in the logs to indicate why it was not responding today.

There is also nothing I can see in the Munin logs: https://penguin.transitionnetwork.org/munin/transitionnetwork.org/parrot.transitionnetwork.org/

I was alerted to the lack of response from the server by this email:

From: munin@penguin.webarch.net
Date: Wed, 08 Jan 2014 11:25:23 +0000
Subject: parrot.transitionnetwork.org Munin Alert

transitionnetwork.org :: parrot.transitionnetwork.org :: eth0 errors
        UNKNOWNs: errors is unknown, errors is unknown.

And I couldn't connect via SSH.

It's possible that it would have recovered without intervention.

Closing this ticket as I can't think of anything else to do on it and the server is up and running now.

comment:2 Changed 3 years ago by chris

  • Status changed from new to closed
  • Resolution set to fixed

comment:3 Changed 3 years ago by chris

  • Cc aland added
  • Status changed from closed to reopened
  • Resolution fixed deleted
  • Summary changed from Parrot isn't responding to Parrot lockups

wiki:ParrotServer locked up again today, again with nothing in the logs. I stopped and restarted it at the Xen level.

I have reopened this ticket to keep an eye on this issue and also added Alan as a CC.

comment:4 Changed 3 years ago by chris

The alert I got about the problem from munin was as before:

From: munin@penguin.webarch.net
Date: Wed, 15 Jan 2014 13:55:22 +0000
Subject: parrot.transitionnetwork.org Munin Alert

transitionnetwork.org :: parrot.transitionnetwork.org :: eth0 errors
        UNKNOWNs: errors is unknown, errors is unknown.

comment:5 Changed 3 years ago by chris

  • Status changed from reopened to closed
  • Resolution set to fixed

Closing this in the hope that the fix for the NFZ/ZFS server has resolved it; see ticket:618#comment:5
