Ticket #610 (closed defect: fixed)

Opened 3 years ago

Last modified 3 years ago

Aegir database intensive (migrate, clone, restore) tasks hang for larger sites

Reported by: jim Owned by: jim
Priority: minor Milestone: Maintenance
Component: Live server Keywords:
Cc: chris, ed Estimated Number of Hours: 0.0
Add Hours to Ticket: 0 Billable?: yes
Total Hours: 10.75

Description (last modified by jim) (diff)

Large sites (TN.org and variants) will simply not complete their migrate, clone or restore tasks in Aegir.

However, smaller sites are fine, and all tasks work for them.

The process largely completes -- the codebase installs, the database is cloned, and the symlinks for site aliases and files are created... BUT the task never completes in Aegir, so the final step of switching the site's served location never occurs.

Useful links/comments in this issue:

Change History

comment:1 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Priority changed from critical to blocker
  • Total Hours changed from 0.0 to 0.1

Yeah the OA site is doing it too -- can't migrate to the new OpenAtrium platform either.

Looking into it... This now blocks #582 and #560.

Last edited 3 years ago by jim (previous) (diff)

comment:2 Changed 3 years ago by jim

  • Summary changed from STG won't migrate - hangs to Aegir migrate tasks hang on Drushrc load

comment:3 in reply to: ↑ description Changed 3 years ago by chris

Replying to jim:

Per the docs, the command is:

octopus up-stable o1

I'll do this tonight at 9pm.

I probably won't be around then but there is also the outstanding issue of removing New Relic, ticket:568 -- it might be worth removing that at the same time?

comment:4 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.05
  • Total Hours changed from 0.1 to 0.15

Good idea. I'll remove NR too... This time I won't re-run the Barracuda update, as that didn't seem to affect it. I'll simply visit the files that were changed and comment out the relevant lines, then killall the newrelic services.

The files changed based on

Once that is completed and nginx and php53-fpm are restarted, we're all hopefully good. I'll go on to update Octopus after that.

Shout if you have any other thoughts or issues.

comment:5 Changed 3 years ago by jim

  • Description modified (diff)

And (just noticed) it'll be octopus up-stable tn (not o1).

Last edited 3 years ago by jim (previous) (diff)

comment:6 Changed 3 years ago by jim

Ok, NR has been removed so I'll run the octopus update...

As a precursor I've also told Octopus to always install the new Open Atrium D7 (2.x) branch -- this is the same one that we want for #560 and it's recently been added. Plus, the update is set to 'autopilot' now... changes to .tn.octopus.cnf are:

_PLATFORMS_LIST="D7P OA7"
_AUTOPILOT=YES

comment:7 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.05
  • Total Hours changed from 0.15 to 0.2

Octopus updated... will test later on.

comment:8 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 0.2 to 0.3

(Missed a little time)

comment:9 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.05
  • Total Hours changed from 0.3 to 0.35

No change... the command hangs on drushrc.php load as before...

tn@puffin:~$ drush @stg.transitionnetwork.org provision-migrate '@platform_TransitionNetworkD6S006' --backend --debug -v
Bootstrap to phase 0. [0.03 sec, 2.87 MB]                            [bootstrap]
Drush bootstrap phase : _drush_bootstrap_drush() [0.04 sec, 3.09 MB] [bootstrap]
Load alias @stg.transitionnetwork.org [0.04 sec, 3.1 MB]                                                                       [notice]
Loading drushrc "/data/disk/tn/static/transition-network-d6-s002/sites/stg.transitionnetwork.org/drushrc.php" into "site"   [bootstrap]
scope. [0.04 sec, 3.11 MB]
--- HANGS ---

So now we're into debug land.

This issue is strange since it was all working just a couple of months back. I'll see if there's a way to rebuild/clean the drushrc.php file.
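For reference, the generated drushrc.php should be rewritten by a verify, so one way to force a rebuild from the shell would be something like this (a sketch, assuming the stock provision-verify behaviour; not yet run as part of this ticket):

# Re-verify the site so provision regenerates its drushrc.php and settings files
drush @stg.transitionnetwork.org provision-verify -v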

comment:10 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.2
  • Total Hours changed from 0.35 to 0.55

We could try tweaking some settings in /etc/php5/cli/php.ini, e.g. allowing scripts more time to run and giving them more memory. I have looked at the CLI error log, /var/log/php/error_log_cli_53, but it is clobbered by BOA so there is nothing in it; it might be worth removing the clobbering to see if anything gets recorded there.

Also, there was a load spike around the time you were working on this -- if the load goes above 3.88 while you are running drush, the task will be killed by second.sh. We should probably increase this setting, but it's not one we can set in /root/.barracuda.cnf as far as I'm aware (see wiki:PuffinServer#LoadSpikes) -- it needs to be set in /var/xdrango/second.sh.
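A couple of quick, read-only checks along those lines (a sketch using the paths mentioned above; nothing here changes any settings):

# Watch the CLI error log while a task runs, in case anything does get written
tail -f /var/log/php/error_log_cli_53

# Look for the 3.88 load threshold in the BOA kill script
grep -n '3.88' /var/xdrango/second.sh

# Keep an eye on the load at the same time
uptime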

comment:11 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 0.55 to 0.8

Having followed (tail -f) those files whilst I ran the command, nothing gets put in there. And there's no memory issue...

So I re-ran the command with the maximum debugging output (drush @stg.transitionnetwork.org provision-migrate '@platform_TransitionNetworkD6S006' --backend --debug -vvvv), and that generated a HUGE amount of JSON data. Having scanned this, the following entry looks to be the cause:

"PROVISION_BACKUP_EXTRACTION_FAILED":["Failed to extract the contents of \/data\/disk\/tn\/backups\/stg.transitionnetwork.org-20131017.123311.tar.gz to \/data\/disk\/tn\/static\/transition-network-d6-s006\/sites\/stg.transitionnetwork.org.restore (The target directory could not be written to)"]

So this looks to be a simple folder perms/owner issue. I'll take a gander now.
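A quick way to take that gander (a read-only sketch using the paths from the error above; the .restore directory may not exist yet, hence also checking its parent):

# Who owns the target sites directory, and is it writable by the aegir user?
ls -ld /data/disk/tn/static/transition-network-d6-s006/sites
ls -ld /data/disk/tn/static/transition-network-d6-s006/sites/stg.transitionnetwork.org.restore

# And the backup tarball being extracted
ls -lh /data/disk/tn/backups/stg.transitionnetwork.org-20131017.123311.tar.gz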

comment:12 Changed 3 years ago by chris

The problems with these tasks hanging coincide with load spikes, just after 10pm, which were above the 18.88 threshold that causes php and drush tasks to be killed.

comment:13 follow-up: ↓ 14 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.2
  • Total Hours changed from 0.8 to 1.0

Odd. I wasn't working at 10-11pm, I was working at 3-5pm. So the work I did wasn't directly the cause at the time... And the 'Hang' isn't doing any processing...

What is happening (as far as I can tell so far) is:

  • I start a migrate or clone task;
  • The drush command never completes as it fails after making the DB backup in its prep tasks.
  • Actually behind the scenes it's stopped (as far as I can tell) due to the permission issue I mentioned a few comments back.
  • However, Aegir then never gets told that the task failed for some reason, which leaves a task in a 'Processing' state, even though it's actually dead.
  • What MIGHT coincide with the times Chris mentions, is when I go into the Aegir UI, go to the task node that's just sitting there, confused, and manually delete it... This is recommended on How to fix a stuck task (Aegir docs). I wonder if when this task node is deleted it then runs off and finishes up some stuff, hence the load spike.... Hmm.
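For the record, the same stuck-task cleanup can be done from the shell rather than the UI -- a rough sketch, assuming the standard @hostmaster alias, with 12345 standing in for the stuck task's node id (hypothetical value):

# Delete the stuck hosting task node -- the shell equivalent of the manual delete in the Aegir UI
drush @hostmaster php-eval 'node_delete(12345);'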

So this needs investigation, and, as Chris says, I'll keep an eye on uptime and top when I work.

I think fundamentally there's a file/directory perms/owner issue causing the actual hang -- and that's what I'll work to fix next week when I have some time, and again I'll also watch the load too.

comment:14 in reply to: ↑ 13 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 1.0 to 1.1

Replying to jim:

Odd. I wasn't working at 10-11pm, I was working at 3-5pm.

Ah, I was also getting the 24 hour clock mixed up -- the comments here were posted between:

I read "11:25" as "23:25", do'h.

What is happening (as far as I can tell so far) is:

  • I start a migrate or clone task;
  • The drush command never completes as it fails after making the DB backup in its prep tasks.
  • Actually behind the scenes it's stopped (as far as I can tell) due to the permission issue I mentioned a few comments back.

OK, we need to track that down.

  • However, Aegir then never gets told that the task failed for some reason, which leaves a task in a 'Processing' state, even though it's actually dead.
  • What MIGHT coincide with the times Chris mentions, is when I go into the Aegir UI, go to the task node that's just sitting there, confused, and manually delete it... This is recommended on How to fix a stuck task (Aegir docs). I wonder if when this task node is deleted it then runs off and finishes up some stuff, hence the load spike.... Hmm.

So this needs investigation, and, as Chris says, I'll keep an eye on uptime and top when I work.

I think fundamentally there's a file/directory perms/owner issue causing the actual hang -- and that's what I'll work to fix next week when I have some time, and again I'll also watch the load too.

Cool, thanks.

comment:15 Changed 3 years ago by jim

I'm working on the basis of the tips/issue here: https://drupal.org/node/1279860

And I can avoid the high load by not testing with the migration of such a huge site...

comment:16 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.15
  • Total Hours changed from 1.1 to 1.25

Test 1: PASS - Aegir created a site no problem on its stock D7 platform.

Will now try same on non-stock platform.

Have also had a look around but not seen any general permission/owner issues as yet - will get specific next.

comment:17 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.05
  • Total Hours changed from 1.25 to 1.3

Test 2: PASS - site made on the new STG platform. Will now attempt to migrate it.

comment:18 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 1.3 to 1.8

Test 3: PASS - new site migrated from the S006 to the S002 platform OK... now back the other way.

This is good: it means Aegir is fine, and the perms issue is specific to the site's files rather than the platform or the system.

comment:19 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.05
  • Total Hours changed from 1.8 to 1.85

Confirmed, it's an issue with the sites made so far - gotta run now but will continue at 5pm and fix this shit.

comment:20 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 1.85 to 1.95

Aha, but migrating the test site to the new Open Atrium platform (the one I made) hangs. I'll try sending the site back to S002 to rule out S006 being the cause.

comment:21 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 1.95 to 2.05

Getting somewhere now... Deleted the old spaces/OA sites/platforms. Am building a new one based on the processes in https://github.com/omega8cc/nginx-for-drupal/blob/ecae6a98e70b17fc7421ac1b228f15f3b1217364/aegir/scripts/AegirSetupC.sh.txt

Why?

Because in a moment of perfect timing, Drupal.org is down for Halloween whilst they upgrade to D7!

So it's a manual job, though the steps should only take 10 mins or so...

comment:22 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.75
  • Total Hours changed from 2.05 to 2.8

Steps taken (damn you Drupal.org!) based on the BOA script:

# Get, unzip OA
wget http://ftp.drupal.org/files/projects/openatrium-7.x-2.0-core.tar.gz
tar -xzf openatrium-7.x-2.0-core.tar.gz

# Manually do prepare_drupal7_core()
wget "http://files.aegir.cc/dev/drupal-7.23.3.tar.gz"
tar -xzf drupal-7.23.3.tar.gz
cd drupal-7.23.3/
# .. fix_dirs_files()
rm -f ./*.txt
rm -f ./modules/*.txt
rm -f ./themes/*.txt
rm -f -r ./modules/cookie_cache_bypass
mkdir -p ./sites/default/files
mkdir -p ./cache/{normal,perm}
chmod -R 777 ./cache
cp -af ./sites/default/default.settings.php ./sites/default/settings.php
chmod a+rw ./sites/default/settings.php
chmod a+rwx ./sites/default/files
mkdir -p ./profiles
mkdir -p ./sites/all/{modules,libraries,themes}
rm -f ./core/modules/*.txt
rm -f ./core/themes/*.txt
rm -f ./modules/*.txt
rm -f ./themes/*.txt
rm -f ./sites/all/*.txt
echo empty > ./profiles/EMPTY.txt
echo empty > ./sites/all/EMPTY.txt
echo empty > ./sites/all/modules/EMPTY.txt
echo empty > ./sites/all/libraries/EMPTY.txt
echo empty > ./sites/all/themes/EMPTY.txt
chmod 02775 ./profiles &> /dev/null
chmod 0751 ./sites
chmod 0751 ./sites/all
chmod 02775 ./sites/all/{modules,libraries,themes}
cp /data/disk/tn/distro/002/drupal-7.22.1-prod/.htaccess .
cp /data/disk/tn/distro/002/drupal-7.22.1-prod/crossdomain.xml .

# Copy D7 stuff into OA folder
cd ..
cp -af drupal-7.23.3/* openatrium-7.x-2.0/

# Manually do some of nocore_d7_dist_clean()
cd openatrium-7.x-2.0/
rm -f *.txt
rm -f web.config

# Do remove_default_core_seven_profiles()
rm -f -r profiles/minimal
rm -f -r profiles/standard
rm -f -r profiles/testing

# Do upgrade_contrib_less()
cd profiles/openatrium/modules/contrib
rm -R context
wget -q -U iCab http://files.aegir.cc/dev/contrib/context-7.x-3.1.tar.gz
tar -xzf context-7.x-3.1.tar.gz
rm -R entity
wget -q -U iCab http://files.aegir.cc/dev/contrib/entity-7.x-1.2.tar.gz
tar -xzf entity-7.x-1.2.tar.gz
rm *.tar.gz
# http://drupal.org/node/1766338#comment-6445882
curl -s -O -A iCab http://files.aegir.cc/dev/patches/views-revert-broken-filter-or-groups-1766338-7.patch
patch -p1 < views-revert-broken-filter-or-groups-1766338-7.patch
# BOA Apps patch
curl -s -O -A iCab  https://raw.github.com/omega8cc/nginx-for-drupal/ecae6a98e70b17fc7421ac1b228f15f3b1217364/aegir/patches/apps_msg.patch
patch -p1 < apps_msg.patch

# Finally make owners tn:users, www-data for key files/dirs
cd ../../../../..
chown -R tn:users *
chown -R tn:www-data cache/
chown -R tn:www-data sites/default/{local.settings.php,settings.php,civicrm.settings.php}

PHEW! It's clearly much better when Octopus does this, but two issues: a) Drupal.org is down and b) this is all HEAD stuff, and we want the best OA install with all the trimmings NOW, not when BOA gets to v2.0.10.

Now to add the platform and install OA 2.0.

comment:23 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.3
  • Total Hours changed from 2.8 to 3.1

OK so it installed ok... and works.

But the Aegir 'install' task hasn't completed... Not sure where it got lost, but it has somewhere. Could be the UI never got told Drush completed it.

Am playing now, will then create Ed's account and set up SSL.

comment:24 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.3
  • Priority changed from blocker to critical
  • Total Hours changed from 3.1 to 3.4

Hmm... had to delete the install task and re-verify the space site, then 'enable' it in Aegir... now seems fine.

So something is still up, but I'm not sure if it's my stuff or Aegir/BOA. This is because everything so far has worked when using stock stuff, and not when it's my/our custom platforms.

I note too that the stg2 site I made a few weeks back works fine. This again means it looks like the commands are all working (when there's no perms issue) but Drush is timing out and so the Aegir UI (TN hostmaster site) never gets told that the job is a goodun...

So from here we're in semi-good shape... but for this likely timeout issue. I've been looking at issues and found this interesting link (Google cached, since D.org is still being updated): Successful site install appears to hang in front-end

I'll downgrade this ticket and proceed on the basis that this is a timeout issue.

comment:25 Changed 3 years ago by ed

  • Milestone set to Maintenance

comment:26 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 3.4 to 3.5

I'll wait to continue with this ticket until #535 (Wheezy update, which covers the #618 BOA 2.1.1 update) is done. This will give us a 'clean'-ish system and some bug fixes, plus permissions tweaks built in.

The key parts of pending #582 (theme tweaks, gmap module #615 and comment notifications) can be done manually if required. In fact, I'll do so for the theme tweaks and comment notifications later today.

My hope is that the update and some newly built platforms will resolve this ticket for us with little or no extra work.

comment:27 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 3.5 to 3.75

OK so I'm now certain this issue is not a permission problem, because a) there are no errors, and b) BOA is now fixing the perms of our platforms automatically every day.

In a nutshell: drush does its thing OK, files and databases are moved, but Aegir's tasks hang -- as if they never get told about the task finishing.

So this is strange because make platform, clone and migrate tasks are affected, but verify and others are not. Now clearly, making, cloning and migrating platforms can be rather long tasks that need to move databases etc, so I'm thinking it's a timeout.

Based on the work on the New Relic tidy-up, I think the issue could be with the maximum execution settings for the CLI PHP instance... So I've now set max_execution_time in /opt/local/lib/php.ini (line 444) to 7200s (2 hours), up from 3600s.

Will try another migration of STG now.

comment:28 follow-up: ↓ 29 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.05
  • Total Hours changed from 3.75 to 3.8

I also note that when Aegir gets the platform lists - or does other tasks in the UI - several of the following syslog errors appear:

Jan  7 22:35:33 puffin mysqld: 140107 22:35:33  InnoDB: O_DIRECT is known to result in 'Invalid argument' on Linux on tmpfs, see MySQL Bug#26662
Jan  7 22:35:33 puffin mysqld: 140107 22:35:33  InnoDB: Failed to set O_DIRECT on file /run/shm/mysql/#sql1fb6_316b55_4.ibd: OPEN: Invalid argument, continuing anyway

MySQL bug 26662 is 'mysqld assertion when creating temporary (InnoDB) table on a tmpfs filesystem'. It seems the tmpfs work done to enhance performance may have side effects -- though this appears to be a warning rather than an error.
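A quick sanity check of where MySQL is putting its temp tables (a read-only sketch; run as a user with MySQL access, credentials omitted):

# Is tmpdir pointing at the tmpfs mount?
mysql -e "SHOW VARIABLES LIKE 'tmpdir'"

# Confirm /run/shm really is tmpfs and how much room it has
df -hT /run/shm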

comment:29 in reply to: ↑ 28 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 3.8 to 4.05

Replying to jim:

It seems the tmpfs work done to enhance performance may have side effects -- though this appears to be a warning rather than an error.

I have read the thread and it seems safe to ignore this warning.

Replying to jim:

I've now set max_execution_time in /opt/local/lib/php.ini (line 444) to 7200s (2 hours), up from 3600s.

Will that file get clobbered when BOA is upgraded?

It might be worth increasing the debugging -- there is nothing in the error log to indicate that there are PHP errors; see /var/log/php/error_log_cli_53.

Some other settings in this file that might need tweaking:

max_input_time = 3600
memory_limit = 512M
post_max_size = 100M
upload_max_filesize = 100M
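To re-check those quickly later (for example after a BOA upgrade, to see whether the file has been clobbered), a one-line read-only sketch:

grep -E '^(max_execution_time|max_input_time|memory_limit|post_max_size|upload_max_filesize)' /opt/local/lib/php.ini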

comment:30 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.8
  • Total Hours changed from 4.05 to 4.85

Thanks Chris re the MySQL issue... Also, I don't think we have a memory issue or a post size issue -- those would throw errors we could see... Having done further tests, I don't think it's a PHP env issue. And if the settings get overridden by BOA updates, no problem -- we don't need them.

---

I've just confirmed everything I said before... Database gets made, files get moved, sites folder set up, but the final switchover in Aegir from old to new version of site never happens.

Am now trying each key Aegir task in turn on stg2 to see what happens... Results:

  • Backup = ok
  • Disable = ok
  • Enable = ok
  • Restore (backup from earlier) = FAIL! (task deleted, stuck at /data/disk/tn/tools/drush/drush.php @stg2.transitionnetwork.org provision-restore '/data/disk/tn/backups/stg2.transitionnetwork.org-20140109.140706.tar.gz' --backend 2>&1)
  • Site briefly dead, simply ran 'verify' again to reinstate
  • Verify = ok
  • Site health check (new and quite handy!) = no issues found.

Made new sites on each platform in question (S006 and S008):

  • stg-s006.transitionnetwork.org:
    • site created = ok
    • Site cloned (to own platform) = ok
    • site migrated to S008 platform = ok
    • Site and clone deleted = ok
  • stg-s008.transitionnetwork.org:
    • Site created = ok
    • Site cloned (to own platform) = ok
    • Site and clone deleted = ok

SO THIS MEANS... all main tasks on new, clean sites on these two platforms work fine, but anything to do with clone, migrate or restore (all database heavy) on the STG2 site fails. I'm thinking this is down to either some modules we have enabled on STG2 exploding, OR something to do with the symlinked files, OR that its database is big and there's some issue there.

And a related issue I've found: https://drupal.org/node/1392102 -- tried running mysql_repair.sh last night, no errors, no changes.

I'm now thinking it's database-, modules- or files-related... Going to walk the dog now... will trial a few other options when I get back, plus clean up the databases left orphaned by my testing on the STG2 site.

comment:31 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.05
  • Total Hours changed from 4.85 to 4.9

Spare databases for STG2 cleaned up.

comment:32 Changed 3 years ago by jim

Having gone for a walk, it's clear that Aegir is actually working properly. BUT something about our STG2 site is wonky.

So I'll start next by cloning a new STG from PROD to see what happens. PROD will be unaffected.

comment:33 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 4.9 to 5.0

Clone of PROD did not work - hung, as others did, on /data/disk/tn/tools/drush/drush.php @www.transitionnetwork.org provision-clone '@stg.transitionnetwork.org' '@platform_TransitionNetworkD6P005' --backend 2>&1

I'll now clear up the commands, databases and files, then run this manually to see what happens.

comment:34 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.55
  • Total Hours changed from 5.0 to 5.55

AHA! I just realised that if you 'verify' a platform that has sites in it that are not present in the Aegir list, they will be auto-discovered and imported.

So I've done that on PROD platform, and imported the new STG. I'm now going to alter STG to be like STG2 and not have any chance of sending emails etc.

1) Re-copy PROD files to STG:

[as ROOT, to avoid perms issues]
cd /data/disk/tn/static/sites/
rm -R transitionnetwork.org-STG/
cp -af transitionnetwork.org-PROD/ transitionnetwork.org-STG/

2) Wait 5 mins for it to complete, then symlink new STG files to our new site's directory:

[as an aegir user]
cd ~/static/transition-network-d6-p005/sites/stg.transitionnetwork.org
rm -R files/
ln -s /data/disk/tn/static/sites/transitionnetwork.org-STG/files/ .

3) Ensure the STG stuff happens using the local-settings.php file:

[as ROOT]
cd /data/disk/tn/static/transition-network-d6-p005/sites/stg.transitionnetwork.org
cp ../../../transition-network-d6-s006/sites/stg2.transitionnetwork.org/local.settings.php .

And the new STG is done. I'll remove the old STG2 site.

Now to make a new STG2 from STG by running the drush provision-clone command myself, and add -vv for extra verbosity:

drush @stg.transitionnetwork.org provision-clone '@stg2.transitionnetwork.org' '@platform_TransitionNetworkD6P005' --backend -vv

comment:35 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 5.55 to 5.8

Hmmm... with -vvv, after ~5 mins a load of JSON got spat out, and I lost my connection to Puffin... The JSON looks fine, no errors, though it's big at 700Kb.

So I'll now try revalidating the platform, since the DB, files and other bits all seem present and correct.

comment:36 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 5.8 to 5.9

That worked.

So it's clear that Aegir is working as expected, but for the 'last gasp' of Drush telling it it's done.

I'll try the same again with the migrate command originally run in comment:9 and comment:11, but with the new platform name - like so:

drush @stg2.transitionnetwork.org provision-migrate '@platform_TransitionNetworkD6S008' --backend --debug

comment:37 Changed 3 years ago by chris

On Thu 09-Jan-2014 at 05:54:42PM -0000, Transiton Technology Trac wrote:
>  I lost my connection to Puffin... 

ssh in and then start a screen session:

  screen

Then if your ssh session fails you can ssh in again and resume it:

  screen -r

We could also install mosh, the mobile shell, if you are able to install it locally -- it's very good for keeping sessions alive over the most dire connections. It's installed on penguin already.
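Naming the session also makes it easier to find again if there are several; a small sketch ('migrate' is just an example name):

  # Start a named session, run the long drush task inside it, detach with Ctrl-a d
  screen -S migrate

  # Later, from a fresh ssh session, reattach to it
  screen -r migrate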


comment:38 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 5.9 to 6.4

OK so this time we have an error! About 3/4 of the way through the JSON output we have:

    "error_log": {
        "PROVISION_DROP_DB_FAILED": [
            "Unable to drop database."
        ]
    },
    "error_status": 1, 

and then a little later the actual error (see items 2 and 3 below, rest is for context):

        {
            "error": null, 
            "memory": 18616968, 
            "message": "Dropping database stg2transitionne", 
            "timestamp": 1389292889.0647, 
            "type": "notice"
        }, 
        {
            "error": null, 
            "memory": 18619472, 
            "message": "Failed to drop database stg2transitionne", 
            "timestamp": 1389292889.065, 
            "type": "warning"
        }, 
        {
            "error": "PROVISION_DROP_DB_FAILED", 
            "memory": 18627632, 
            "message": "Unable to drop database.", 
            "timestamp": 1389292889.0661, 
            "type": "error"
        }, 
        {
            "error": null, 
            "memory": 18623680, 
            "message": "Running: /usr/bin/php-cli /var/aegir/drush/drush.php --php='/usr/bin/php-cli'  --platform='@platform_TransitionNetworkD6P005' provision-save '@stg2.transitionnetwork.org' --backend  2>&1", 
            "timestamp": 1389292889.0664, 
            "type": "command"
        }, 

And finally, near the end, some "rollback" messages:

        {
            "error": null, 
            "memory": 21928080, 
            "message": "Peak memory usage was 14.41 MB", 
            "timestamp": 1389292889.4101, 
            "type": "memory"
        }, 
        {
            "error": null, 
            "memory": 18656400, 
            "message": "Changes made in drush_provision_drupal_provision_migrate have been rolled back.", 
            "timestamp": 1389292889.413, 
            "type": "rollback"
        }, 
        {
            "error": null, 
            "memory": 18657952, 
            "message": "Bringing site out of maintenance", 
            "timestamp": 1389292889.4131, 
            "type": "notice"
        }, 
        {
            "error": null, 
            "memory": 18665776, 
            "message": "Changed group ownership of <code>/data/disk/tn/static/transition-network-d6-p005/sites/stg2.transitionnetwork.org/local.settings.php</code> to www-data", 
            "timestamp": 1389292889.4148, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18667424, 
            "message": "Changed permissions of <code>/data/disk/tn/static/transition-network-d6-p005/sites/stg2.transitionnetwork.org/local.settings.php</code> to 440", 
            "timestamp": 1389292889.4149, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18668576, 
            "message": "Template loaded: /data/disk/tn/.drush/provision/Provision/Config/Drupal/provision_drupal_settings.tpl.php", 
            "timestamp": 1389292889.415, 
            "type": "notice"
        }, 
        {
            "error": null, 
            "memory": 18675920, 
            "message": "Changed permissions of /data/disk/tn/static/transition-network-d6-p005/sites/stg2.transitionnetwork.org/settings.php to 640", 
            "timestamp": 1389292889.4152, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18677160, 
            "message": "Generated config Drupal settings.php file", 
            "timestamp": 1389292889.416, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18678800, 
            "message": "Changed permissions of /data/disk/tn/static/transition-network-d6-p005/sites/stg2.transitionnetwork.org/settings.php to 440", 
            "timestamp": 1389292889.4161, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18679824, 
            "message": "Change group ownership of /data/disk/tn/static/transition-network-d6-p005/sites/stg2.transitionnetwork.org/settings.php to www-data", 
            "timestamp": 1389292889.4165, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18674360, 
            "message": "Platforms path /data/disk/tn/platforms exists.", 
            "timestamp": 1389292889.4166, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18675696, 
            "message": "Platforms ownership of /data/disk/tn/platforms has been changed to tn.", 
            "timestamp": 1389292889.4169, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18677088, 
            "message": "Platforms permissions of /data/disk/tn/platforms have been changed to 711.", 
            "timestamp": 1389292889.417, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18677824, 
            "message": "Platforms path /data/disk/tn/platforms is writable.", 
            "timestamp": 1389292889.417, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18672560, 
            "message": "Removed unused migration site package", 
            "timestamp": 1389292889.4359, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18675512, 
            "message": "Template loaded: /data/disk/tn/.drush/provision/http/Provision/Config/Nginx/vhost.tpl.php", 
            "timestamp": 1389292889.4365, 
            "type": "notice"
        }, 
        {
            "error": null, 
            "memory": 18680320, 
            "message": "Generated config virtual host configuration file", 
            "timestamp": 1389292889.4374, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18678080, 
            "message": "nginx on puffin.webarch.net has been restarted", 
            "timestamp": 1389292889.596, 
            "type": "notice"
        }, 
        {
            "error": null, 
            "memory": 18678808, 
            "message": "Changes made in drush_provision_drupal_pre_provision_migrate have been rolled back.", 
            "timestamp": 1389292889.5962, 
            "type": "rollback"
        }, 
        {
            "error": null, 
            "memory": 18681224, 
            "message": "Template loaded: /data/disk/tn/.drush/provision/http/Provision/Config/Nginx/vhost.tpl.php", 
            "timestamp": 1389292889.5975, 
            "type": "notice"
        }, 
        {
            "error": null, 
            "memory": 18685640, 
            "message": "Generated config virtual host configuration file", 
            "timestamp": 1389292889.5989, 
            "type": "message"
        }, 
        {
            "error": null, 
            "memory": 18682296, 
            "message": "Changes made in drush_http_pre_provision_migrate have been rolled back.", 
            "timestamp": 1389292889.5992, 
            "type": "rollback"
        }, 

comment:39 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.15
  • Total Hours changed from 6.4 to 6.55

OK so the verify on the S008 platform didn't add the site this time -- I expect because a site with the same name (URL) is already known to Aegir and it won't do dupes. If I were to delete the STG2 site in P005, I'm sure the one in S008 would show up on the next verify.

So I wonder if this is just down to database permissions... Looking into that now.

comment:40 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.75
  • Total Hours changed from 6.55 to 7.3

So a quick review and the database permissions seem fine to me...

I've now done a bunch of Googling and others have similar reports, but no fixes/causes as yet. There's an Aegir Provision issue that seems relevant here: Unable to migrate remote site after 1.7 upgrade.

I've now tried:

  • deleted the site node -- NOT the site! -- for STG2 on the P005 platform (by using the 'edit' tab on the STG2 site, then changing the URL to end in 'delete' rather than 'edit', then confirming the delete)
  • then re-verified the S008 platform to see if it can see the STG2 that was migrated there.
  • It can! It found it and tried to import it but hung on Running: /data/disk/tn/tools/drush/drush.php @stg2.transitionnetwork.org provision-import --backend 2>&1

It's now almost certainly a database issue -- either slow, permissions or timeout.

comment:41 Changed 3 years ago by jim

I'm now downloading an Aegir backup of STG to see what's in the files, and see what might be taking the time.

comment:42 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.35
  • Total Hours changed from 7.3 to 7.65

Ok so 51Mb for the zipped backup is pretty big... And it's pretty much all database -- that's 224Mb unzipped.

Also I cleaned up S008 STG2 site that failed to import with:

[as an Aegir user]
cd ~/static/transition-network-d6-s008/sites
rm -R stg2.transitionnetwork.org/ www.stg2.transitionnetwork.org

Now re-running verify to see if we can get STG2 back. I'll kill it by hand if needs be (we don't really need it apart from my testing here).

Now I'm going to try some mysql dumps and gzips by hand, to see what sort of time these things take.

comment:43 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 7.65 to 8.15

Looking at the PROD database in Chive, the numbers for the basic inbuilt Drupal search are scary:

  • search_dataset = 118.79 MiB
  • search_index = 55.66 MiB
  • search_total = 5.78 MiB
  • Total = 180.23 MiB -- OR 180/224 = 80% of total database size

This is almost certainly part of the issue...

I'm about to do a manual DB dump of PROD to test server speed now.
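Something along these lines, timed so we can compare against what Aegir's backup step has to do (a sketch -- DBNAME is a placeholder for the PROD database name, the output path is arbitrary, and credentials are assumed to come from a defaults file):

# Time a dump-and-gzip of the PROD database to see how long the backup step alone takes
time mysqldump DBNAME | gzip > /tmp/prod-dump-test.sql.gz
ls -lh /tmp/prod-dump-test.sql.gz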

comment:44 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.6
  • Total Hours changed from 8.15 to 8.75
  • Description modified (diff)
  • Summary changed from Aegir migrate tasks hang on Drushrc load to Aegir database intensive (migrate, clone, restore) tasks hang for larger sites

Just tried migrating STG after truncating the search_* tables, but that hung... this time the DB will have been only 20-25% of its previous size, weighing in at around 50Mb unzipped.
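The truncation itself was along these lines (a sketch against the three search tables listed in the previous comment, using drush's standard sql-query; the exact statements weren't recorded here):

drush @stg.transitionnetwork.org sql-query "TRUNCATE search_dataset"
drush @stg.transitionnetwork.org sql-query "TRUNCATE search_index"
drush @stg.transitionnetwork.org sql-query "TRUNCATE search_total"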

Updating the issue description with what I've found...

My last work tonight will be to review the changes made to the server in early October that may have caused this issue -- the ZFS and tmpfs changes are the best candidates, I'd guess.

comment:45 Changed 3 years ago by chris

On Fri 10-Jan-2014 at 07:38:28PM -0000, Transiton Technology Trac wrote:
> 
>  My last work tonight will be to review the changes made to the server in
>  early October that may have caused this issue -- the ZFS and tmpfs changes
>  are the best candidates, I'd guess.

Edit /etc/mysql/my.cnf to change MySQL back to using the filesystem
rather than RAM for temp files.


comment:46 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 8.75 to 8.85

Per Chris' suggestion, I've set the tmpdir in /etc/mysql/my.cnf back to /tmp from /run/shm/mysql -- this is just a matter of commenting/uncommenting lines...

I waited until the load dropped below 2, then restarted MySQL -- it took two attempts for some reason -- at 20:26:

puffin:~# service mysql restart
[ ok ] Stopping MariaDB database server: mysqld.
[ ok ] Starting MariaDB database server: mysqld . . . . ..
[info] Checking for corrupt, not cleanly closed and upgrade needing tables..
puffin:~# service mysql status
[info] MariaDB is stopped..
puffin:~# service mysql start
[ ok ] Starting MariaDB database server: mysqld . . . ..
[info] Checking for corrupt, not cleanly closed and upgrade needing tables..
puffin:~# service mysql status
[info] /usr/bin/mysqladmin  Ver 9.0 Distrib 5.5.34-MariaDB, for debian-linux-gnu on x86_64
Copyright (c) 2000, 2013, Oracle, Monty Program Ab and others.
<snip>

I'll let the caches warm up and the traffic die down a bit and try another clone of STG in a couple of hours.

comment:47 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 8.85 to 9.35

Tried to clone the slimmer STG site with MySQL using the disk for temp files -- still no dice... Temp tables issue ruled out: I've undone the previous config change, reverted back to tmpfs from disk, and restarted MySQL.

In a parallel test I've built the TN D6 platform on my server (Babylon), and have imported the fat version of the database, which took ~4.5 mins for the 213Mb DB file. I'll try cloning/migrating there too to rule out our code/DB.

Also ran drush @stg.transitionnetwork.org dis -y notifications messaging piwik google_analytics mailchimp captcha mollom to turn off stuff not needed/wanted in STG - did this on all copies on Puffin and Babylon, too.

Clone on Babylon still running after 10 mins...

comment:48 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 9.35 to 9.45

Babylon took 15 mins to clone the site, but it succeeded and the site works, no probs. This rules out our database as the issue... Also, the site upgrades went without a hitch; will do some simple testing.

So we're left with DB perms, ZFS, timeouts or something else as the cause of this... But it's definitely database server-related.

I'll continue over the weekend.

comment:49 Changed 3 years ago by jim

FYI http://tn-test2.i-jk.co.uk/ is up to date with modules etc if anyone wants to test...

comment:50 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 9.45 to 9.95

OK so let's try some timeout, size and other settings for php and mysql...

In /opt/local/lib/php.ini the following variables have been copied with new values around double (or more) what was there, with the originals kept and prefixed with ;; ... The cleaned-up output below lists the affected settings:

cat /data/disk/tn/php.ini | grep ";;"
;;max_execution_time = 3600
;;max_input_time = 3600
;;post_max_size = 100M
;;upload_max_filesize = 100M
;;default_socket_timeout = 30
;;mysql.connect_timeout = 60

As I said, the above have been doubled or in some cases increased by more. The originals are in the file, and a copy at /opt/local/lib/php.ini.jk#610 is available.

A quick way to check that these values are being used by CLI PHP, when logged in as the tn Aegir user, is to execute this:

tn@puffin:~$ drush php-eval "echo phpinfo();"  | grep "max"
log_errors_max_len => 1024 => 1024
max_execution_time => 0 => 0
max_file_uploads => 50 => 50
max_input_nesting_level => 64 => 64
max_input_time => -1 => -1
max_input_vars => 9999 => 9999
post_max_size => 300M => 300M
upload_max_filesize => 500M => 500M
mysql.max_links => Unlimited => Unlimited
mysql.max_persistent => Unlimited => Unlimited
mysqli.max_links => Unlimited => Unlimited
mysqli.max_persistent => Unlimited => Unlimited
session.gc_maxlifetime => 1440 => 1440

I note some values (max_execution_time, max_input_time etc) are set to 0 or -1, which means another file is being included that overrides these values. 0 or -1 generally means 'unlimited'.
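To pin down exactly which ini file the CLI/Drush PHP is actually loading (and any extra scanned files doing the overriding), a quick sketch using stock PHP functions:

drush php-eval 'print php_ini_loaded_file() . "\n"; print php_ini_scanned_files() . "\n";'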

So the new values are in place for Drush... Now I'll take a look at what mysql timeout changes I can make with limited impact.

comment:51 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 9.95 to 10.2

Re mysql: nothing useful in /etc/php5 or /data/conf (overrides etc), so will assume for now it's only /etc/mysql that matters, then also look in /var/aegir where BOA keeps a few core things.

puffin:/etc/mysql# grep time /etc/mysql/my.cnf 
connect_timeout         = 60
#wait_timeout            = 3600
wait_timeout            = 120
long_query_time         = 5
innodb_lock_wait_timeout = 120
interactive-timeout

Looks like wait_timeout has been altered before... But according to a search of the BOA source code the correct value is 3600, not 120 as presently set... So I've set this value back to the default -- could be important!

Load is low, so restarting MySQL. The PHP settings changes in previous comment are instantly effective on CLI mode, so no FPM restarts needed.
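Once MySQL is back up, the new value can be confirmed with a read-only check (a sketch; run as a user with MySQL access):

mysql -e "SHOW GLOBAL VARIABLES LIKE 'wait_timeout'"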

comment:52 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.15
  • Total Hours changed from 10.2 to 10.35

Old STG2 and STG3 testing sites directories and databases removed, now will try a clone of STG...

comment:53 Changed 3 years ago by jim

  • Priority changed from critical to minor

IT WORKED!!!!

WOOP!

So it's almost certainly the mysql wait_timeout value was far too low for big DB operations... A search of Trac shows Chris' work to optimise MySQL was the cause.

I'd guess we scraped through before under the 2-minute limit because the throughput on the old disks was faster than ZFS, and the site DB was a good deal smaller than it is now. But wrangling a 1/4GB database needs a good timeout set, or very fast disks.
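For future reference, the global value can also be raised on a running server without a restart (a sketch -- note this only affects new connections and does not survive a restart without the my.cnf change):

mysql -e "SET GLOBAL wait_timeout = 3600"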

So this is now a 'minor' ticket that needs a few little cleanups -- reverting my PHP changes from a couple of comments back, and raising two new tickets with recommendations I've come to during this ticket...

Last edited 3 years ago by jim (previous) (diff)

comment:54 Changed 3 years ago by chris

On Sat 11-Jan-2014 at 08:11:20PM -0000, Transiton Technology Trac wrote:
> 
>  So it's almost certainly the mysql {{{wait_timeout}}} value was far too
>  low for big DB operations... A search of Trac shows
>  [https://tech.transitionnetwork.org/trac/ticket/555#comment:68 Chris' work
>  to optimise MySQL] was the cause.

Ug, sorry :-(

comment:55 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.15
  • Status changed from new to closed
  • Resolution set to fixed
  • Total Hours changed from 10.35 to 10.5

Chris said:

Ug, sorry :-(

No worries mate, easily done... It was a needle in a haystack kinda issue though, glad it's fixed!

PHP tweaks from 3 comments up reverted.

Migration of STG2 to S008 platform worked. Now will continue over on #582 to do updates and wrap this puppy up...

New issues arising from this one:

Happy to close this!

comment:56 follow-up: ↓ 57 Changed 3 years ago by ed

Excellent work Holmes.

I'm noting also that Chris' changes were made in conversation with Jim, so don't feel bad Chris - you did what you needed to do at the time, and it only became an issue now.

So this was documented in a flurry at the time, and what I'm noting now is that that type of documentation wasn't enough to help us track issues later. Humanly, this is down to the speed at which you were both working, the pressure you were under, the different nature of the things you were sorting out, and the whole ticket not having one 'documentation process and owner'.

THEREFORE - as per my suggestion on #670 I'm happy for you two to do what you do to simplify the rig now - with the clear requirement:

* make it clearly documented on a wiki page *

comment:57 in reply to: ↑ 56 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 10.5 to 10.75

Replying to ed:

So this was documented in a flurry at the time, and what I'm noting now is that that type of documentation wasn't enough to help us track issues later.

The key change to the MySQL config, which was made at the suggestion of mysqltuner, was documented well enough for it to be found via a search:

That change is also referenced from the Puffin MySQL tuning ticket, see the list in the description at the top:

However, it is true that the changes documented in the tickets hadn't been copied to the wiki page; this is where they should have been documented:

Very sorry about that.

Reflecting on this, I feel this might be related to the pressure of trying to do things as quickly as possible, to keep costs down (see for example ticket:629#comment:11) -- the documentation is often the thing that doesn't get done as comprehensively as it could be, because it is time consuming. It doesn't seem right that a job that, for example, takes perhaps 20 seconds to do might take 5 mins to document. I'll try to do more and better documentation (i.e. distil the documentation in the ticket comments into the wiki pages) in future, and resist the pressure to skip it to keep the time down.

Last edited 3 years ago by chris (previous) (diff)