<?xml version="1.0"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>Transition Technology: Ticket #101: 27th June 2010 Site Downtime</title>
    <link>http://localhost:8080/trac/ticket/101</link>
    <description>&lt;p&gt;
On Sunday 27th June the site was down for about 10 hours, see the attached graph of MySQL throughput.
&lt;/p&gt;
&lt;p&gt;
The following are some extracts from the emails sent by Gaia regarding this issue.
&lt;/p&gt;
&lt;p&gt;
We need to investigate the cause of this to see if we can avoid it happening in the future...
&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;
The issue was the 'variable' table in the "live" drupal database had crashed. I took the web server fully offline for a few minutes just now as I repaired that table and then brought it back online.
The homepage is now working fine, but other pages are still generating errors.
There's a backup of the database I took just now at:
&lt;/p&gt;
&lt;pre class="wiki"&gt;/web/transitionnetwork.org/live.20100627.sql
&lt;/pre&gt;&lt;hr /&gt;
&lt;p&gt;
I had to take the site back down again and manually empty all 'cache*' tables in the database.
&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;
We received emails at support@… from two people browsing the site early this morning. The default server error notice includes a reference to our support email.
&lt;/p&gt;
&lt;p&gt;
So what can be done to mitigate this in the future?
You should review the Drupal-related logs to see if they give any clues as to what caused the crash. An action on the 'variable' table in the database could be a clue, as that was the sole table that crashed in the db.
We would have responded to the outage faster if our monitors had been specifically watching the Drupal database named "live", alongside a more content-specific monitor for an actual page generated by Drupal. We can coordinate on monitor customization, just let us know.
&lt;/p&gt;
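&lt;p&gt;
A content-specific check of the kind described above could be sketched roughly as follows (Python, illustrative only). The function name and the marker text are assumptions made for this sketch and are not part of the existing monitoring setup; fetching the page is left to whatever monitoring tool is in use.
&lt;/p&gt;
&lt;pre class="wiki"&gt;
```python
def page_looks_healthy(status_code, body, marker):
    """Return True only for a 200 response whose body contains the marker."""
    # A 200 status alone is not enough: the check also requires text that
    # should only appear when Drupal has rendered a real page.
    return status_code == 200 and marker in body


# Hypothetical marker (an assumption): pick text unique to a rendered page.
MARKER = "Transition Network"

# Made-up sample responses, for illustration only.
print(page_looks_healthy(200, "Welcome to Transition Network", MARKER))
print(page_looks_healthy(500, "Internal Server Error", MARKER))
print(page_looks_healthy(200, "", MARKER))
```
&lt;/pre&gt;
&lt;p&gt;
The point of the sketch is that a server error page or an empty page fails the check even when the web server itself still responds.
&lt;/p&gt;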
&lt;p&gt;
It looks like the internal server errors started happening right after /cron.php was run at 23:00 UTC:
&lt;/p&gt;
&lt;pre class="wiki"&gt;transitiontowns.gaiahost.coop - - [26/Jun/2010:23:00:00 +0000] "GET /cron.php HTTP/1.0" 200 - "-" "ApacheBench/2.3"
&lt;/pre&gt;&lt;p&gt;
It specifically didn't look like a corruption related to a regular system or database backup process. The other traffic at this time was normal.
&lt;/p&gt;
</description>
    <language>en-us</language>
    <image>
      <title>Transition Technology</title>
      <url>/trac/chrome/site/TransitionNetwork-Logo-Web-Small.jpg</url>
      <link>http://localhost:8080/trac/ticket/101</link>
    </image>
    <generator>Trac 0.12.5</generator>
    <item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Mon, 28 Jun 2010 08:04:23 GMT</pubDate>
      <title>attachment set</title>
      <link>http://localhost:8080/trac/ticket/101</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/101</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;attachment&lt;/strong&gt;
                set to &lt;em&gt;transitiontowns.gaiahost.coop-mysql_bytes-day.png&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
MySQL throughput
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Mon, 28 Jun 2010 13:22:10 GMT</pubDate>
      <title>hours, totalhours changed</title>
      <link>http://localhost:8080/trac/ticket/101#comment:1</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/101#comment:1</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;hours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.5&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;totalhours&lt;/strong&gt;
                changed from &lt;em&gt;0.0&lt;/em&gt; to &lt;em&gt;0.5&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
There are lots of errors in the logs from the downtime; the first error was triggered by the cron run on Sunday, 27 June 2010 at 12:00am:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Table 'variable' is marked as crashed and should be repaired query: UPDATE variable SET value = 'd:0.06666666666666666574148081281236954964697360992431640625;' WHERE name = 'node_cron_comments_scale' in /web/transitionnetwork.org/www/includes/bootstrap.inc on line 523.
&lt;/pre&gt;&lt;p&gt;
&lt;a class="ext-link" href="https://www.transitionnetwork.org/admin/reports/event/174362"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://www.transitionnetwork.org/admin/reports/event/174362&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
After 4 log entries like the above the errors changed to:
&lt;/p&gt;
&lt;pre class="wiki"&gt;You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ') ORDER BY fit DESC LIMIT 0, 1' at line 1 query: SELECT * FROM menu_router WHERE path IN () ORDER BY fit DESC LIMIT 0, 1 in /web/transitionnetwork.org/www/includes/menu.inc on line 315.
&lt;/pre&gt;&lt;p&gt;
&lt;a class="ext-link" href="https://www.transitionnetwork.org/admin/reports/event/174367"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://www.transitionnetwork.org/admin/reports/event/174367&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
And the above error then appears to have been repeated for every request to the site.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>jim</dc:creator>

      <pubDate>Mon, 28 Jun 2010 15:19:15 GMT</pubDate>
      <title></title>
      <link>http://localhost:8080/trac/ticket/101#comment:2</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/101#comment:2</guid>
      <description>
        &lt;p&gt;
Interesting... The variable table holds common variables that are used site-wide by Drupal. It shouldn't be super-heavily updated, but some variables might get to be quite big serialized strings.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;All cache* tables are wipe-able, and indeed &lt;em&gt;should&lt;/em&gt; be wiped after any major DB operation as a matter of course.&lt;/strong&gt;
&lt;/p&gt;
&lt;p&gt;
The log file errors are consistent with what you say, Chris, and with what I just said: the variable table is read in its entirety for each request, and broken data will often get into a cache table, which then just needs wiping or waiting for its next clearance.
&lt;/p&gt;
&lt;p&gt;
Drupal (or indeed any application) using a DB table should not be able to corrupt it in normal use, unless that database has problems, bugs or issues with the underlying hardware...
&lt;/p&gt;
&lt;p&gt;
My questions are:
&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Is MySQL patched and up to date?
&lt;/li&gt;&lt;li&gt;Has this server's physical memory been tested recently? How about the disks?
&lt;/li&gt;&lt;li&gt;What do the MySQL and syslog logs show? Could /tmp or /var or /etc (or similar) have run out of disk space momentarily?
&lt;/li&gt;&lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>john</dc:creator>

      <pubDate>Tue, 29 Jun 2010 09:52:55 GMT</pubDate>
      <title></title>
      <link>http://localhost:8080/trac/ticket/101#comment:3</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/101#comment:3</guid>
      <description>
        &lt;p&gt;
This is a weird one. The previous cron ran without any problems, and I re-created the crashed variable table using Chris's dump to see if anything looked amiss (I couldn't see anything). The problem occurred early on in the 12am cron run; the first 4 errors are from subsequent modules running their cron processes and finding that the variable table is corrupt. As it has not happened before or since, and I don't think there is anything special running on the cron at that time, we may not see it happen again. I agree, though, that we should check that there is nothing wrong with the space allocation and possibly the hardware (unlikely, I know), and that there wasn't a brief power outage at that time.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Tue, 29 Jun 2010 12:44:38 GMT</pubDate>
      <title></title>
      <link>http://localhost:8080/trac/ticket/101#comment:4</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/101#comment:4</guid>
      <description>
        &lt;p&gt;
My questions are:
&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Is MySQL patched and up to date?
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
The server is running:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Server version: 5.1.39-log FreeBSD port: mysql-server-5.1.39
&lt;/pre&gt;&lt;p&gt;
5.1.48 is available: &lt;a class="ext-link" href="http://www.freebsd.org/cgi/cvsweb.cgi/ports/databases/mysql51-server/"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://www.freebsd.org/cgi/cvsweb.cgi/ports/databases/mysql51-server/&lt;/a&gt;
&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Has this server's physical memory been tested recently? How about the disks?
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
I don't know.
&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;What do the MySQL and syslog logs show? Could /tmp or /var or /etc (or similar) have run out of disk space momentarily?
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
No shortage of disk space:
&lt;/p&gt;
&lt;pre class="wiki"&gt;Filesystem            Size    Used   Avail Capacity  Mounted on
/dev/mirror/gm0s1a    129G     19G    100G    16%    /
&lt;/pre&gt;&lt;p&gt;
I'll check the logs next...
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>chris</dc:creator>

      <pubDate>Tue, 29 Jun 2010 13:53:41 GMT</pubDate>
      <title></title>
      <link>http://localhost:8080/trac/ticket/101#comment:5</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/101#comment:5</guid>
      <description>
        &lt;p&gt;
MySQL is currently set to log only slow queries (in /var/db/mysql); perhaps we should change this. There is nothing in /var/log/messages.
&lt;/p&gt;
&lt;p&gt;
Should we consider whether some automatic cleanup of the cache tables and checking of the variable table is needed, or should we just wait to see if this happens again and fix it when it does?
&lt;/p&gt;
&lt;p&gt;
Perhaps we should wait to see if the problem happens again after the site has been migrated to the new hardware.
&lt;/p&gt;
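&lt;p&gt;
If an automatic check of the variable table were wanted, one possible shape is sketched below (Python, illustrative only). It assumes the rows come from running CHECK TABLE variable through some MySQL client, which is not shown, and that each row carries the four columns MySQL returns for that statement (Table, Op, Msg_type, Msg_text). The function name and the sample rows are made up for this sketch, and how strictly to treat warning rows is left open.
&lt;/p&gt;
&lt;pre class="wiki"&gt;
```python
def needs_repair(check_rows):
    """Return True if any CHECK TABLE result row reports a problem."""
    # Status texts treated as healthy in this sketch (an assumption).
    healthy = ("OK", "Table is already up to date")
    for _table, _op, msg_type, msg_text in check_rows:
        if msg_type == "error":
            return True
        if msg_type == "status" and msg_text not in healthy:
            return True
    return False


# Made-up sample rows, for illustration only.
good = [("live.variable", "check", "status", "OK")]
bad = [
    ("live.variable", "check", "error",
     "Table 'variable' is marked as crashed and should be repaired"),
    ("live.variable", "check", "status", "Operation failed"),
]

print(needs_repair(good))
print(needs_repair(bad))
```
&lt;/pre&gt;
&lt;p&gt;
A check like this only reports a problem; whether to repair the table automatically or alert someone is a separate decision.
&lt;/p&gt;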
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>jim</dc:creator>

      <pubDate>Wed, 30 Jun 2010 13:17:07 GMT</pubDate>
      <title></title>
      <link>http://localhost:8080/trac/ticket/101#comment:6</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/101#comment:6</guid>
      <description>
        &lt;p&gt;
As long as errors are logged, that's fine - no point logging everything.
&lt;/p&gt;
&lt;p&gt;
And I don't think the cache tables should be auto-cleared; there's no point unless an operation is done directly on the database rather than through Drupal. Drupal clears these tables when it needs to, and developers do so whenever they've done something that needs it too. For normal sites running normally, clearing the caches just slows things down.
&lt;/p&gt;
&lt;p&gt;
I definitely agree that, apart from MySQL error logging, waiting for new hardware is the best bet.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>jim</dc:creator>

      <pubDate>Thu, 15 Jul 2010 14:42:10 GMT</pubDate>
      <title>status changed; resolution set</title>
      <link>http://localhost:8080/trac/ticket/101#comment:7</link>
      <guid isPermaLink="false">http://localhost:8080/trac/ticket/101#comment:7</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;new&lt;/em&gt; to &lt;em&gt;closed&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;resolution&lt;/strong&gt;
                set to &lt;em&gt;fixed&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
This looks like a random one-off and the move to the new server is approaching; re-open if it happens again...
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item>
 </channel>
</rss>