Ticket #300 (closed task: fixed)

Opened 5 years ago

Last modified 5 years ago

Memcache problem: cache data being truncated/removed causing issues site-wide

Reported by: ed Owned by: chris
Priority: minor Milestone: Phase 5
Component: Drupal modules & settings Keywords:
Cc: ed, chris Estimated Number of Hours: 0.0
Add Hours to Ticket: 0 Billable?: yes
Total Hours: 6.4

Description (last modified by jim) (diff)

LIVE is being odd, like it's not getting all the data from MySQL - truncated somehow... It's having a reasonable impact on the site.

Effects noted:

  • Menus weren't showing fully (see https://tech.transitionnetwork.org/trac/ticket/300) -- added menu items not appearing, flushed caches a few times, no change. Waited overnight, no change. Admin menu for navigation menu showing different elements to public menu.
  • Several views are having problems displaying/filtering on certain fields (see /events is empty, missing video at /video and missing filters on /initiatives page etc etc etc).
  • Ed mentioned some counts of initiatives don't tally which could also be a symptom.
  • sometimes the diversity & inclusion page (and indeed other Panels pages) shows the underlying page rather than the panel version.
  • Certain permissions randomly dropped (#293)
  • Probably many more.

Change History

comment:1 Changed 5 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.05
  • Owner changed from jim to ed
  • Status changed from new to assigned
  • Total Hours changed from 0.0 to 0.05

Need more info: which menu, what items, where.

black admin menu doesn't work the same as normal menus.... please update.

comment:2 Changed 5 years ago by ed

Main, primary navigation. Added stuff to news, nothing happened - have a look - backend shows extra things, frontend doesn't...

comment:3 Changed 5 years ago by jim

  • Cc ed, chris added
  • Add Hours to Ticket changed from 0.0 to 1.0
  • Total Hours changed from 0.05 to 1.05

Something very odd is happening, and it's only affecting LIVE as far as I can see...

I've pulled a copy of the site onto my machine to check a few things and as soon as I refreshed the page the new menu items appeared. No amount of cache-slapping or menu saving will make them appear on LIVE. DEV doesn't have the issue.

My gut says it's probably a memcache issue... Perhaps some of the memcache 'buckets' are not big enough to hold the cache_menu table, for example. So I'm including Chris in this for his 2p on memcache settings.

Both the menu rendered as the top bar AND when I add it as a block is affected, so it seems to the the menu cache data that is incomplete somehow, despite the actual menu items existing as normal.

In the mean time I'll try to test this theory by disabling memcache module tonight out of hours...

comment:4 Changed 5 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 2.0
  • Owner changed from ed to chris
  • Total Hours changed from 1.05 to 3.05
  • Description modified (diff)
  • Summary changed from Navigation Menu not updating to Memcache problem: cache data being truncated/removed causing issues site-wide

Actions taken:

  • My gut says this is a memcache issue... So I tried disabling the memcache and cacherouter modules... No effect so re-enabled.
  • However, when I import the LIVE database onto my machine all the missing field handlers and menu items magically reappear -- same data, different machine, different result. This means the underlying data is fine, but (likely) the generated/stored cached version is getting truncated or partially filled somehow.
  • Tweaked the /etc/php5/conf.d/suhosin.ini settings and restarted Apache but nothing changed...
  • Since no PHP errors I assume this is not a max packet size issue for mysql.

I just restarted memcache:

jim@quince:~$ sudo /etc/init.d/memcached restart
Restarting memcached: memcached.

And the site suddenly started working normally again! Menus back,

SO this is a memcache issue - either the config settings aren't quite right, or the Drupal module has a bug. In any case the net result is that Drupal is often working with incorrect/incomplete cached data despite the underlying structures and data being fine. This is because either the output or stored data in memcache is truncated, or what Drupal sends to it is wrong, or it's not being cleared properly when Drupal says it should be.

Passing to Chris, but will keep looking through the Memcache issue list at http://drupal.org/project/issues/memcache

comment:5 Changed 5 years ago by ed

looks like some users aren't being able to login too - possibly related

comment:6 Changed 5 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.35
  • Total Hours changed from 3.05 to 3.4

These links looked interesting:

So I did a cat /var/log/memcached.log | grep SERVER_ERROR and got a lot of lines like ">9 SERVER_ERROR object too large for cache" and ">13 SERVER_ERROR object too large for cache" etc. Hence my hunch the data is truncated seems correct. The second link's fixes seem to be a good thing to try, i.e. increasing bucket size with the -l option (e.g. -l 3m or similar)

However, I see CacheRouter? is being used for memcache access, rather than the Memcache module. We should consider trying out the memcache-only approach to avoid the shoddy CR module. That's the next thing I'll look into.

comment:7 Changed 5 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 3.4 to 3.9

to increase the slab size, see http://stackoverflow.com/questions/5650085/caching-a-302mb-object

the problem is we are running a version of memcache that is too old to have this option, the latest debian does have this, so upgrading to debian seems like a good option

have had a chat with Jim about this...

comment:8 Changed 5 years ago by jim

Memcache to be updated at some point - see ticket #301

I'll try routing the caches to avoid the issue by setting some database cache_* tables to NOT go in memcache but disk instead. All others go to memcache.

Only certain tables make big entries (to confirm): cache_views, cache_content and cache_form tables

Will attempt a fix on DEV tomorrow and report back before pushing to LIVE if effective.

comment:9 Changed 5 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 1.0
  • Priority changed from critical to major
  • Total Hours changed from 3.9 to 4.9

I applied a patch from http://drupal.org/node/435694#comment-4844382 to add error messages to the Drupal Watchdog if memcached complains or fails when setting cache entries.

Then settings.php on DEV and, after testing, LIVE has been updated to use the Memcache module, rather than Cache Router (which I plain do not trust). The old CR code has been commented out, and the Memcache new code looks like this:

# JK - issue #300 (memcache large dataset issues) trying memcache instraed of cacherouter
$conf = array(
  'cache_inc' => './sites/all/modules/memcache/memcache.inc',
  'memcache_servers' => array('127.0.0.1:11211' => 'default'),
  'memcache_bins' => array(
    'cache' => 'default',
    'cache_content' => 'database',
    'cache_form' => 'database',
    'cache_views' => 'database'
  ),
);

Which means cache_content, _form and _views will go to the database and all else will go into memcache.

The site seems fine, now we need some longer term testing... I'll keep an eye on the logs.

Downgrading

comment:10 Changed 5 years ago by jim

  • Priority changed from major to minor
  • Type changed from defect to task

3 days later and there is no mention of any memcache errors in Drupal's log.

2 things changed though, so we might still get better performance out of all this (not that it isn't already good):

  1. The caching was switched over from CacheRouter? to Memcache on the Drupal side of things
  2. 3 tables (cache_content, cache_form, cache_views) are now stored in the database as is Drupal default, all others go to memcache.

So this means we can continue to run the site with these settings OR we can try to add back some of these tables to see if we can get better use from Memcached. I'd start with cache_content, then _views, then probably leave _form because other issues can arise (read some issues, can't find links now). Similarly, cache_page is probably redundant with Varnish being uesd instead, so perhaps this could be set to 'none'.

It's all low priority now though. Will leave open so Chris can add his 2p and possibly tweak when he gets back from holiday.

I've also updated the NewLiveServer documentation here: https://tech.transitionnetwork.org/trac/wiki/NewLiveServer?version=75#memcache

comment:11 Changed 5 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 1.5
  • Total Hours changed from 4.9 to 6.4

Now we have a new version I have increased the max object size in /etc/memcached.conf:

# Increase max object size
# chris
# https://tech.transitionnetwork.org/trac/ticket/300
-I 3m

I deleted the old log file (1.2Gb!), no new problems are being added to the new log file.

I installed a munin plugin for memcache, http://exchange.munin-monitoring.org/plugins/memcached-multigraph/details

cd /usr/share/munin/plugins
wget http://exchange.munin-monitoring.org/plugins/memcached-multigraph/version/2/download -O memcached_multi_
chmod 755 memcached_multi_
cd /etc/munin/plugins/
ln -s /usr/share/munin/plugins/memcached_multi_ memcached_multi_bytes 
ln -s /usr/share/munin/plugins/memcached_multi_ memcached_multi_commands
ln -s /usr/share/munin/plugins/memcached_multi_ memcached_multi_conns 
ln -s /usr/share/munin/plugins/memcached_multi_ memcached_multi_evictions
ln -s /usr/share/munin/plugins/memcached_multi_ memcached_multi_items
ln -s /usr/share/munin/plugins/memcached_multi_ memcached_multi_memory

Added this to /etc/munin/plugin-conf.d/munin-node

[memcached_multi_*]
env.port 11211
env.timescale 3

The memcache stats are starting to appear here:

https://kiwi.transitionnetwork.org/munin/webarch.net/quince.webarch.net/index.html#memcached

I have added a note about this plugin on the server docs page, wiki:NewLiveServer#munin

Is there anything else I need to do on this ticket?

comment:12 Changed 5 years ago by chris

  • Milestone set to Phase 5

comment:13 Changed 5 years ago by ed

chris - please close this ticket if we're done

comment:14 Changed 5 years ago by chris

  • Status changed from assigned to closed
  • Resolution set to fixed

Closing this ticket as the memcache problems are sorted.

Note: See TracTickets for help on using tickets.