Ticket #599 (closed maintenance: fixed)

Opened 3 years ago

Last modified 3 years ago

Server time drift

Reported by: chris Owned by: chris
Priority: critical Milestone: Maintenance
Component: Live server Keywords:
Cc: jim, ed, aland Estimated Number of Hours: 0.0
Add Hours to Ticket: 0 Billable?: yes
Total Hours: 1.54

Description

The servers are not keeping good time at the moment.

Attachments

puffin-2013-10-18_irqstats-day.png (144.7 KB) - added by chris 3 years ago.
puffin-2013-10-19_interrupts-day.png (26.7 KB) - added by chris 3 years ago.

Change History

comment:1 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 0.0 to 0.25

rdate doesn't run on wiki:PuffinServer because the firewall blocks it, so /etc/csf/csf.conf was edited to add port 37 (the Time Protocol port that rdate uses) to TCP_IN and TCP_OUT, and the firewall was restarted:

csf -r
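A quick way to confirm that the port survived the edit (or a later BOA update) is to read the list back out of the config file. This is only a sketch: the function name csf_port_listed is made up for illustration, it assumes the usual KEY = "comma,separated,list" layout of csf.conf, and it does not understand port ranges such as 30000:35000.

```shell
#!/bin/sh
# Report whether a port appears in a CSF port list such as TCP_IN or TCP_OUT.
# Usage: csf_port_listed KEY PORT [CONF_FILE]
# (hypothetical helper; assumes lines of the form: TCP_IN = "20,21,22,37")
csf_port_listed() {
    key=$1
    port=$2
    conf=${3:-/etc/csf/csf.conf}
    # Pull the quoted value out of the first line that starts with KEY.
    list=$(sed -n "s/^${key}[[:space:]]*=[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$conf" | head -n 1)
    # Wrap the list in commas so that 37 does not match 3700 or 137.
    case ",${list}," in
        *",${port},"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `csf_port_listed TCP_OUT 37 && echo "port 37 allowed out"` run as root on the server would print the message only if 37 is still in the list.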

Checking and then setting the date and checking it again:

date ; rdate -s ntp.demon.co.uk ; date
Mon Sep 30 02:22:43 BST 2013
Sun Sep 29 21:23:33 BST 2013

The following was added to the root crontab:

* * * * * rdate -s ntp.demon.co.uk

On wiki:PenguinServer

date ; rdate -s ntp.demon.co.uk ; date
Sun Sep 29 23:06:12 BST 2013
Sun Sep 29 21:25:50 BST 2013

A crontab was also added.

On wiki:ParrotServer:

date ; rdate -s ntp.demon.co.uk ; date
Sun Sep 29 21:39:25 BST 2013
Sun Sep 29 21:27:26 BST 2013

A crontab was also added.
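To put numbers on the drift, the before and after readings above can be subtracted. This is a rough sketch: the function name drift_seconds is made up for illustration, the timestamps are retyped from the output above in YYYY-MM-DD form, and it needs GNU date for the -d option. The second reading also includes the second or so that rdate took to run, so the figures are approximate.

```shell
#!/bin/sh
# Print how many seconds the "before" reading is ahead of the "after" reading
# (negative if the clock was behind). Both arguments are
# "YYYY-MM-DD HH:MM:SS" strings in the same time zone; because only the
# difference matters, they are parsed as UTC. Requires GNU date.
drift_seconds() {
    before=$(date -u -d "$1" +%s) || return 1
    after=$(date -u -d "$2" +%s) || return 1
    echo $((before - after))
}

drift_seconds "2013-09-30 02:22:43" "2013-09-29 21:23:33"   # puffin
drift_seconds "2013-09-29 23:06:12" "2013-09-29 21:25:50"   # penguin
drift_seconds "2013-09-29 21:39:25" "2013-09-29 21:27:26"   # parrot
```

That works out to 17950, 6022 and 719 seconds, so Puffin was nearly 5 hours fast, Penguin about 1 hour 40 minutes fast and Parrot about 12 minutes fast.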

This is just a temporary workaround. We still need to fix the clock drift itself, and I'm not sure what the answer is yet.

comment:2 follow-up: ↓ 3 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.2
  • Total Hours changed from 0.25 to 0.45

Puffin: This doesn't seem to be working... This has consequences for Drupal, so it needs fixing.

tn@puffin:~/static/transition-network-d6-p005$ date
Tue Oct  8 00:38:45 BST 2013

It's actually 18:58 on Oct 7!

rdate -s hangs, so the firewall change must have been hit by the BOA update. I've edited /etc/csf/csf.conf to allow port 37 again and set _CUSTOM_CSF to 'YES' in /root/.barracuda.cnf to avoid future clobberage.

Added crontab entry again too. Ideally we'd use a non-root user with their own cron, or another place that isn't controlled by BOA.

puffin:/data/conf# date ; rdate -s ntp.demon.co.uk ; date
Tue Oct  8 00:44:26 BST 2013
Mon Oct  7 19:02:36 BST 2013

comment:3 in reply to: ↑ 2 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 0.45 to 0.55

Replying to jim:

Ideally we'd use a non-root user with their own cron

Thanks for spotting this, and good point. I have commented out the root cron job and set it to run as me instead.

comment:4 Changed 3 years ago by jim

Per my comment over on #555 I have a suspicion the regular date alteration by the cron task is the cause of the load spikes...

I'd like to reduce the cron frequency to every 4 hours to see what effect that has.
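As a sketch of what that change could look like, the every-minute entry from comment:1 can be rescheduled to minute 0 of every fourth hour. The function name retime_rdate is made up for illustration, and nothing in this ticket records whether the change was ever applied.

```shell
#!/bin/sh
# Read a crontab on stdin and write it to stdout with any every-minute
# rdate entry rescheduled to minute 0 of every fourth hour.
# Lines that are not "* * * * * rdate ..." pass through unchanged.
retime_rdate() {
    sed 's/^\* \* \* \* \* \(rdate .*\)$/0 *\/4 * * * \1/'
}
```

Running `crontab -l | retime_rdate` previews the result, and piping that on into `crontab -` would install it for the current user.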

I await Chris' thoughts on my brain-fart over on #555.

comment:5 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 0.55 to 0.65

I don't think the time drift has anything to do with the load spikes. The time drift only started when the motherboard was swapped and the load spikes predate this by some months.

Alan is going to try to solve the time drift issue during tonight's downtime, see https://lists.webarch.co.uk/pipermail/webarch-xen1/2013-October/000005.html

comment:6 follow-up: ↓ 7 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 0.65 to 0.75

There are two types of load spikes:

  • Those from 'before', which were high and irregular and ultimately caused by hardware issues -- fixed by the motherboard change.
  • The recent ones, which are much lower in intensity and regularly spaced around the hour (or thereabouts) -- these coincide with the crontab date sync being enabled, as far as I can tell (see #555).

I note that the server has been rebooted (which I think was the clock fix) and the load spikes have now largely stopped.

Two questions for Chris:

  1. Has the clock issue been resolved?
  2. Has the crontab entry for the date sync been removed?

comment:7 in reply to: ↑ 6 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 0.75 to 1.0

Replying to jim:

There are two types of load spikes:

  • Those from 'before', which were high and irregular and ultimately caused by hardware issues -- fixed by the motherboard change.

Changing the motherboard stopped the crashes and created the time drift problem but I'm not sure that anything else can be put down to it.

  • The recent ones, which are much lower in intensity and regularly spaced around the hour (or thereabouts) -- these coincide with the crontab date sync being enabled, as far as I can tell (see #555).

I note that the server has been rebooted (which I think was the clock fix) and the load spikes have now largely stopped.

The reboot was to try to fix the clock but it didn't fix it.

Two questions for Chris:

  1. Has the clock issue been resolved?

No.

  2. Has the crontab entry for the date sync been removed?

No, it's running as me now.

comment:8 Changed 3 years ago by chris

  • Cc aland added

Adding Alan to the CC list.

comment:9 Changed 3 years ago by aland

  • Add Hours to Ticket changed from 0.0 to 0.17
  • Total Hours changed from 1.0 to 1.17

Installed chrony (a replacement for ntp) and commented out the crontab entry that ran rdate.

aptitude install chrony

Watched the clock with date for a few minutes.

clocks now in sync
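Rather than watching date by eye, chrony can report the offset itself via chronyc tracking. The sketch below pulls the signed offset out of that report. The function name chrony_offset is made up for illustration, and it assumes the "System time" line has the form "System time : N seconds fast of NTP time" (or "slow"), as printed by chronyc.

```shell
#!/bin/sh
# Read `chronyc tracking` output on stdin and print the system clock offset
# in seconds: positive when the clock is fast of NTP time, negative when slow.
# Assumes a line like: "System time     : 0.000001234 seconds slow of NTP time"
chrony_offset() {
    awk '/^System time/ {
        sign = ($6 == "slow") ? "-" : ""
        print sign $4
    }'
}
```

On a server with chrony running, `chronyc tracking | chrony_offset` prints the current offset, and `chronyc sources` lists the time servers it is following.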

comment:10 Changed 3 years ago by aland

Logged into parrot

installed chrony

clocks in sync now

comment:11 Changed 3 years ago by aland

  • Add Hours to Ticket changed from 0.0 to 0.17
  • Total Hours changed from 1.17 to 1.34

logged into penguin

installed chrony

removed cronjob

clock is running accurately now

Changed 3 years ago by chris

comment:12 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 1.34 to 1.44

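These two munin graphs, attached above as puffin-2013-10-18_irqstats-day.png and puffin-2013-10-19_interrupts-day.png, changed quite dramatically when chrony was installed.

I don't know exactly what the change means, though.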

comment:13 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Status changed from new to closed
  • Resolution set to fixed
  • Total Hours changed from 1.44 to 1.54

The servers haven't had an issue with timekeeping since chrony was installed, as far as I'm aware, so I'm closing this ticket.

comment:14 Changed 3 years ago by chris

There is a crontab entry to reset the clock after a reboot; see wiki:PuffinServer#Cron.
