Ticket #630 (closed task: fixed)
Archiving Transition Town Totnes site
Reported by: | ed | Owned by: | chris
---|---|---|---
Priority: | major | Milestone: | Maintenance
Component: | Unassigned | Keywords: |
Cc: | | Estimated Number of Hours: | 0.0
Add Hours to Ticket: | 0 | Billable?: | yes
Total Hours: | 4.55 | |
Description
Please estimate to archive TTT site as per conversation with Ed:
- convert to html
- host on Penguin (incl. any likely issues for Penguin doing this)
- any ongoing maintenance
Then Ed will discuss with Frances at TTT as per agreement with Chris
Attachments
Change History
comment:2 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 0.0 to 0.25
I have set it off, it won't take long, on wiki:PenguinServer:
    cd /web
    mkdir -p archive.transitiontowntotnes.org/www/
    cd archive.transitiontowntotnes.org
    httrack
    Welcome to HTTrack Website Copier (Offline Browser) 3.43-9+libhtsjava.so.2
    Copyright (C) Xavier Roche and other contributors
    To see the option list, enter a blank line or try httrack --help
    Enter project name :Transition Town Totnes Archive
    Base path (return=/root/websites/) :/web/archive.transitiontowntotnes.org/www/
    Enter URLs (separated by commas or blank spaces) :http://archive.transitiontowntotnes.org/
    Action:
    (enter) 1 Mirror Web Site(s)
            2 Mirror Web Site(s) with Wizard
            3 Just Get Files Indicated
            4 Mirror ALL links in URLs (Multiple Mirror)
            5 Test Links In URLs (Bookmark Test)
            0 Quit
    : 1
    Proxy (return=none) :
    You can define wildcards, like: -*.gif +www.*.com/*.zip -*img_*.zip
    Wildcards (return=none) :
    You can define additional options, such as recurse level (-r<number>), separed by blank spaces
    To see the option list, type help
    Additional options (return=none) :
    ---> Wizard command line: httrack http://archive.transitiontowntotnes.org/ -O "/web/archive.transitiontowntotnes.org/www/Transition Town Totnes Archive" -%v
    Ready to launch the mirror? (Y/n) :Y
    WARNING! You are running this program as root!
    It might be a good idea to use the -%U option to change the userid:
    Example: -%U smith
    Mirror launched on Fri, 29 Nov 2013 12:51:01 by HTTrack Website Copier/3.43-9+libhtsjava.so.2 [XR&CO'2010]
    mirroring http://archive.transitiontowntotnes.org/ with the wizard help..
I'll check back when it's done to see what we have and set up Nginx to serve it.
Changed 3 years ago by chris
- Attachment hts-log.txt added
HTTrack Log File for Transition Town Totnes Archive 30th Nov 2013
comment:3 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Status changed from new to accepted
- Total Hours changed from 0.25 to 0.5
HTTrack finally completed:
    PANIC! : Too many URLs : >99995 [2870]
    Done.
    Thanks for using HTTrack!
    *
I think there must be some infinite loops.
I have attached the log file, /trac/attachment/ticket/630/hts-log.txt. The errors I noticed include the print versions of pages being skipped because of robots rules:
13:12:10 Warning: Link archive.transitiontowntotnes.org/print/2324 not scanned (follow robots meta tag)
And PDF versions of pages generating 500 errors:
05:25:48 Error: "" (500) after 2 retries at link archive.transitiontowntotnes.org/printpdf/Central/DVDsPresentedByTTT (from archive.transitiontowntotnes.org/Central/DVDsPresentedByTTT)
But I haven't read through the whole log file.
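Since the whole log has not been read, a quick tally of warnings and errors can show where the bulk of the problems are. The following is a minimal sketch, assuming the log lines all follow the two shapes quoted above (a timestamp, then "Warning:" or "Error:", with the HTTP status in brackets for errors); the names summarise_log and SAMPLE are made up for illustration and nothing here is part of HTTrack itself.

```python
import re
from collections import Counter

# Matches lines such as "13:12:10 Warning: ..." or "05:25:48 Error: ..."
LINE_RE = re.compile(r"^\d{2}:\d{2}:\d{2}\s+(Warning|Error):\s+(.*)$")
# HTTP status code in brackets, e.g. "(500)"
STATUS_RE = re.compile(r"\((\d{3})\)")


def summarise_log(lines):
    """Count log entries, grouping errors by HTTP status (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a warning or error line, skip it
        level, message = match.groups()
        status = STATUS_RE.search(message) if level == "Error" else None
        key = f"Error {status.group(1)}" if status else level
        counts[key] += 1
    return dict(counts)


# The two lines quoted in this comment, used as sample input.
SAMPLE = [
    "13:12:10 Warning: Link archive.transitiontowntotnes.org/print/2324 "
    "not scanned (follow robots meta tag)",
    '05:25:48 Error: "" (500) after 2 retries at link '
    "archive.transitiontowntotnes.org/printpdf/Central/DVDsPresentedByTTT "
    "(from archive.transitiontowntotnes.org/Central/DVDsPresentedByTTT)",
]

if __name__ == "__main__":
    print(summarise_log(SAMPLE))
```

To run it against the attached log, pass the open file instead of SAMPLE, for example summarise_log(open("hts-log.txt")).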
Based on this trial run I expect it would take about 1 to 2 hours to get the site optimised for running HTTrack against it, get the HTTrack command line right and sort out the Nginx config to serve the site.
Once it has been done for one site we will be in a position to archive other sites more quickly.
comment:5 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 1.7
- Total Hours changed from 0.5 to 2.2
There are also errors for requests like this:
- http://archive.transitiontowntotnes.org/system/files/David_Strahan_Press_Release.doc
- http://archive.transitiontowntotnes.org/content/sites/default/files/images/starlings_04.jpg
- http://archive.transitiontowntotnes.org/drupal/sites/default/files/images/tttos4.jpg
So I have added a Redirect to the .htaccess file:
    RedirectPermanent /system/files http://archive.transitiontowntotnes.org/sites/default/files
    RedirectPermanent /content/ http://archive.transitiontowntotnes.org/
    RedirectPermanent /drupal/ http://archive.transitiontowntotnes.org/
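To illustrate what these three rules should do to the failing requests listed above, here is a rough sketch that models the prefix matching in Python. It is a simplified model of how Apache's Redirect directives append the rest of the path to the target, not a reimplementation of mod_alias, and the names REDIRECTS and redirect_target are invented for this example.

```python
BASE = "http://archive.transitiontowntotnes.org"

# (path prefix, redirect target) pairs, in the same order as the .htaccess rules.
REDIRECTS = [
    ("/system/files", BASE + "/sites/default/files"),
    ("/content/", BASE + "/"),
    ("/drupal/", BASE + "/"),
]


def redirect_target(path):
    """Return the URL a request path would be redirected to, or None (hypothetical helper)."""
    for prefix, target in REDIRECTS:
        # Only match whole path segments: "/system/files" must be followed by "/" or nothing.
        boundary = prefix if prefix.endswith("/") else prefix + "/"
        if path == prefix or path.startswith(boundary):
            # Whatever follows the matched prefix is appended to the target.
            return target + path[len(prefix):]
    return None


if __name__ == "__main__":
    for path in (
        "/system/files/David_Strahan_Press_Release.doc",
        "/content/sites/default/files/images/starlings_04.jpg",
        "/drupal/sites/default/files/images/tttos4.jpg",
    ):
        print(path, "->", redirect_target(path))
```

Under this model all three example requests end up under /sites/default/files/ on the same host.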
I have also removed anonymous users' ability to view the print and PDF versions of documents via http://archive.transitiontowntotnes.org/admin/user/permissions
These pages needed the Input Type to be changed to "Full HTML" for them to render properly:
- http://archive.transitiontowntotnes.org/content/what-economic-crisis
- http://archive.transitiontowntotnes.org/content/publications
- http://archive.transitiontowntotnes.org/content/who-we-are-0
- http://archive.transitiontowntotnes.org/projects/cycling-group
- http://archive.transitiontowntotnes.org/projects/food-link-project
- http://archive.transitiontowntotnes.org/projects/heart-and-soul-workshops
- http://archive.transitiontowntotnes.org/projects/local-food-guide
- http://archive.transitiontowntotnes.org/projects/my-story
- http://archive.transitiontowntotnes.org/projects/nut-tree-planting-project
- http://archive.transitiontowntotnes.org/projects/planning-action
- http://archive.transitiontowntotnes.org/projects/totnes-pound
- http://archive.transitiontowntotnes.org/projects/totnes-sustainable-construction-company
- http://archive.transitiontowntotnes.org/projects/transition-library
- http://archive.transitiontowntotnes.org/groups/education
- http://archive.transitiontowntotnes.org/content/local-people
- http://archive.transitiontowntotnes.org/content/young-people
These needed the "Published" box ticked:
- http://archive.transitiontowntotnes.org/videos/totnes-what-past-can-teach-us-about-future
- http://archive.transitiontowntotnes.org/heartandsoul/meetingnotes040308
- http://archive.transitiontowntotnes.org/heartandsoul/meetingnotes200207
- http://archive.transitiontowntotnes.org/heartandsoul/gettingstartedpublicity
- http://archive.transitiontowntotnes.org/buildingandhousing/cohominutes
- http://archive.transitiontowntotnes.org/6thMay/comments/rolesetc
- http://archive.transitiontowntotnes.org/6thMay/event/email
- http://archive.transitiontowntotnes.org/node/420
I have enabled DirectoryIndexes here http://archive.transitiontowntotnes.org/sites/default/files/ and I'll rsync across all these files as well to be sure nothing is missing.
There are 404's for old wiki addresses:
- http://archive.transitiontowntotnes.org/Building_and_housing/Planning/PlanningConsultationLocalDevelopmentFramework?q=uploads/Main/CorePolicyResponse-Part-V.pdf
The file CorePolicyResponse-Part-V.pdf can't be found on the Drupal site filesystem, but I found it here: http://2011.archive.transitionnetwork.org/Totnes/uploads/Main/CorePolicyResponse-Part-V.pdf
I'll look at sorting out some redirects for these.
Ed, I'm going to go over time wise perhaps on this, what does £50 equate to in terms of time? I'm happy to do some time for free on this if needs be.
comment:6 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.5
- Total Hours changed from 2.2 to 2.7
Actually I can't see an easy fix for the URLs that should point to http://2011.archive.transitionnetwork.org/Totnes/ so I'm going to ignore them due to time constraints.
Running HTTrack again:
    screen
    sudo -i
    cd /web/archive.transitiontowntotnes.org/
    rm -rf Transition\ Town\ Totnes\ Archive/
    httrack http://archive.transitiontowntotnes.org/ -O "/web/archive.transitiontowntotnes.org/www" -%v
That should do the trick, it'll take several hours to run.
Copying across the files:
    cd /web/archive.transitiontowntotnes.org/www/archive.transitiontowntotnes.org/sites/default/files
    rsync -av host-a:/home/ttt/archive_public_html/sites/default/files/* .
Nginx config, minus comments
    server {
        listen 80;
        server_name archive.transitiontowntotnes.org ttarchive.penguin.webarch.net;
        access_log /var/log/nginx/ttarchive.access.log;
        error_log /var/log/nginx/ttarchive.error.log crit;
        root "/web/archive.transitiontowntotnes.org/www/archive.transitiontowntotnes.org";
        autoindex on;
        index index.html;
        location ~ /\. {
            access_log off;
            log_not_found off;
            deny all;
        }
        include gzip;
    }
    server {
        listen 443;
        server_name archive.transitiontowntotnes.org ttarchive.penguin.webarch.net;
        access_log /var/log/nginx/ttarchive.ssl_access.log;
        error_log /var/log/nginx/ttarchive.ssl_error.log debug;
        ssl on;
        ssl_certificate /etc/ssl/transitionnetwork.org/transitionnetwork.org.chained.pem;
        ssl_certificate_key /etc/ssl/transitionnetwork.org/transitionnetwork.org.key;
        ssl_protocols SSLv3 TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers RC4:HIGH:!aNULL:!MD5;
        ssl_prefer_server_ciphers on;
        root "/web/archive.transitiontowntotnes.org/www/archive.transitiontowntotnes.org";
        autoindex on;
        index index.html;
        location ~ /\. {
            access_log off;
            log_not_found off;
            deny all;
        }
    }
The site is now available here: http://ttarchive.penguin.webarch.net/
I have added a link to the site at http://penguin.transitionnetwork.org/
And it's looking good so far:
    Bytes saved: 67,75MiB                      Links scanned: 251/1229 (+975)
    Time: 41min17s                             Files written: 1170
    Transfer rate: 25,58KiB/s (20,90KiB/s)     Files updated: 0
    Active connections: 4                      Errors: 8
It should be done in a few hours and then the DNS for archive.transitiontowntotnes.org will need updating.
comment:7 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 2.7 to 2.8
There are some embedded images like this which are 404's:
But they are available:
We can fix this with a DNS update, these need doing:
- Point totnes.transitionnetwork.org to 81.95.52.111
- Point archive.transitiontowntotnes.org to 81.95.52.111
Changed 3 years ago by chris
- Attachment hts-log2.txt added
HTTrack Log File for Transition Town Totnes Archive 3rd Dec 2013
comment:8 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 1.2
- Total Hours changed from 2.8 to 4.0
I have attached the log from the last run of HTTrack, /trac/attachment/ticket/630/hts-log2.txt
Entries like this:
- 15:24:56 Error: "Forbidden" (403) at link archive.transitiontowntotnes.org/sites/default/files/Cycliing+at+Hebmbury+Woods.Press+Release.doc?q=system/files/Cycliing+at+Hebmbury+Woods.Press+Release.doc (from archive.transitiontowntotnes.org/central/pressreleases)
This is a request for http://ttarchive.penguin.webarch.net/sites/default/files/Cycliing+at+Hebmbury+Woods.Press+Release.doc whereas the file is actually at http://ttarchive.penguin.webarch.net/sites/default/files/Cycliing%20at%20Hebmbury%20Woods.Press%20Release.doc
I have done some searching for an Nginx-level solution for this but it doesn't look easy to fix at that level.
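The mismatch comes from "+" being a literal character in a URL path (it only stands for a space in query strings and form data), so the "+" link never matches the file with spaces in its name. A small sketch using Python's standard urllib.parse shows the difference; the helper name plus_to_percent20 is made up for this example and it assumes every "+" in the old link was meant as a space.

```python
from urllib.parse import quote, unquote, unquote_plus

# The link as requested, and the name of the file as it exists on disk.
REQUESTED = "Cycliing+at+Hebmbury+Woods.Press+Release.doc"
ON_DISK = "Cycliing at Hebmbury Woods.Press Release.doc"


def plus_to_percent20(path_segment):
    """Rewrite a path segment so that '+' (meant as a space) becomes %20.

    Hypothetical helper: it assumes every '+' in the old link stood for a space.
    """
    return quote(unquote_plus(path_segment))


if __name__ == "__main__":
    # Decoding as a path leaves the '+' alone, so it cannot match the file on disk.
    print(unquote(REQUESTED) == ON_DISK)
    # Decoding as form data turns '+' into a space, which does match.
    print(unquote_plus(REQUESTED) == ON_DISK)
    # The working form of the link.
    print(plus_to_percent20(REQUESTED))
```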
Also HTTrack has created this file:
The file that it's linked from:
Is available from the HTTrack archive here:
This file needs this HTML:
<a href="../sites/default/files/Cycliing%2bat%2bHebmbury%2bWoods.Press%2bReleaseb2c2.html">Cycle to Hembury Woods </a>
Changing to:
<a href="../sites/default/files/Cycliing%20at%20Hebmbury%20Woods.Press%20Release.doc">Cycle to Hembury Woods </a>
Although we could manually fix these links, there isn't the time / budget for this work. There are 337 files ending in .html in the sites/default/files folder and that will be roughly the number of broken links. A very keen researcher could find these files themselves from this index, http://ttarchive.penguin.webarch.net/sites/default/files/
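For anyone wanting to reproduce that count, here is a minimal sketch that walks a folder and counts the files ending in .html. The function name count_html_files is invented for this example, and it assumes the count should include subfolders, which may not be how the 337 figure was arrived at.

```python
from pathlib import Path


def count_html_files(root):
    """Count regular files ending in .html under root, including subfolders.

    Hypothetical helper; searching subfolders is an assumption.
    """
    return sum(1 for path in Path(root).rglob("*.html") if path.is_file())


if __name__ == "__main__":
    # Run from the document root of the archive on the server.
    print(count_html_files("sites/default/files"))
```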
I have also noticed that some pages didn't get HTML ticked, eg:
I have added this to the Nginx config:
    # serve files without .html
    default_type "text/html";
    try_files $uri $uri.html $uri/ =404;
So an old URL like:
Is available at this URL:
In addition to:
This should dramatically reduce the 404's from existing links to pages on archive.transitiontowntotnes.org
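The lookup order of the try_files line can be sketched as follows. This is a simplified model in Python rather than anything Nginx runs: the function name resolve, the sets of files and directories and the example paths are all made up for illustration, and it assumes HTTrack saved the old Drupal page /projects/totnes-pound as /projects/totnes-pound.html.

```python
def resolve(uri, files, dirs):
    """Mimic `try_files $uri $uri.html $uri/ =404` (simplified, hypothetical model).

    files: set of file paths under the document root
    dirs:  set of directory paths, each with a trailing slash
    Returns the path that would be served, or None for a 404.
    """
    if uri in files:                      # $uri
        return uri
    if uri + ".html" in files:            # $uri.html
        return uri + ".html"
    as_dir = uri if uri.endswith("/") else uri + "/"
    if as_dir in dirs:                    # $uri/ (index or autoindex takes over)
        return as_dir
    return None                           # =404


# Illustrative document root contents, not a listing of the real archive.
FILES = {"/index.html", "/projects/totnes-pound.html"}
DIRS = {"/", "/projects/", "/sites/default/files/"}

if __name__ == "__main__":
    for uri in ("/projects/totnes-pound", "/projects/totnes-pound.html",
                "/sites/default/files", "/missing"):
        print(uri, "->", resolve(uri, FILES, DIRS))
```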
I have added some text to the front page:
<strong>This site is a static archive of the dynamic Drupal site created in December 2013, if you are having problems with some files not being found you can look at the <a href="/sites/default/files/">index of files</a>. See also, the archives of the wiki site from <a href="http://2011.archive.transitionnetwork.org/Totnes/">2011</a> and <a href="http://2010.archive.transitionnetwork.org/Totnes/">2010</a>.</strong>
And copied the edited index.html to home.html.
The archiving failed in the end with:
    Too many URLs, giving up..(>100000)
    To avoid that: use #L option for more links (example: -#L1000000)
From the times I checked on its progress, I think the massive number of links was caused by the dynamically generated events section of the site.
I have updated the DNS for archive.transitiontowntotnes.org:
    dig @dns0.webarchitects.co.uk archive.transitiontowntotnes.org +short
    81.95.52.111
    dig @dns1.webarchitects.co.uk archive.transitiontowntotnes.org +short
    81.95.52.111
And for totnes.transitionnetwork.org, but it hasn't updated yet:
    dig @A.DNS.GANDI.NET totnes.transitionnetwork.org +short
    81.95.52.103
I think it's now safe to delete the Drupal site and database. Sorry this has taken longer than expected. I think this ticket can now be closed; I have also added a link to this ticket from the front page of the archive in case people are desperate to find something.
comment:9 in reply to: ↑ description ; follow-up: ↓ 10 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.3
- Status changed from accepted to closed
- Resolution set to fixed
- Total Hours changed from 4.0 to 4.3
The DNS for http://totnes.transitionnetwork.org/ has updated:
    dig @A.DNS.GANDI.NET totnes.transitionnetwork.org +short
    81.95.52.111
Regarding the original ticket questions:
Replying to ed:
Please estimate to archive TTT site as per conversation with Ed:
- convert to html
Done, 1.5 hours including this comment.
- host on Penguin (incl. any likely issues for Penguin doing this)
Done, no issues.
In terms of disk space it was using 1.7GB:
    cd /web/archive.transitiontowntotnes.org/
    du -h --max-depth=2
    1.3G    ./www/archive.transitiontowntotnes.org
    397M    ./www/hts-cache
    8.0K    ./www/archive.transitiontowntotnes.org_
    1.7G    ./www
    1.7G    .
I have reduced this to 1.3GB by deleting all the HTTrack files and moving the document root:
    cd /web/archive.transitiontowntotnes.org/
    du -h --max-depth=1
    1.3G    ./www
    1.3G    .
There is still lots of space on the server:
    df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvda2       40G   16G   22G  42% /
- any ongoing maintenance
It's static HTML, so the only issue I can think of is if something in the archive needs editing or deleting.
I have deleted the Drupal site off the server it was running on, host-a.ecodis.net but we still have backups of it for now.
I have updated the list of sites / servers here wiki:WikiStart#ServersandWebsites
And added the archive to the list of sites here wiki:PenguinServer#totnes.transitionnetwork.org
I have added the site to the list of sites here http://penguin.transitionnetwork.org/
And for reference here is the Nginx config:
    # totnes.transitionnetwork.org virtual server
    # http://nginx.org/en/docs/http/ngx_http_core_module.html#server
    server {
        # listen for ipv4
        # http://nginx.org/en/docs/http/ngx_http_core_module.html#listen
        #listen 8000;
        listen 80;
        # server name and server aliases
        # http://nginx.org/en/docs/http/ngx_http_core_module.html#server_name
        server_name totnes.transitionnetwork.org archive.transitiontowntotnes.org ttarchive.penguin.webarch.net;
        # logs, error log levels: info | notice | warn | error | crit | alert
        # http://nginx.org/en/docs/http/ngx_http_log_module.html#access_log
        # http://nginx.org/en/docs/ngx_core_module.html#error_log
        access_log /var/log/nginx/ttarchive.access.log;
        error_log /var/log/nginx/ttarchive.error.log crit;
        # document root
        # http://nginx.org/en/docs/http/ngx_http_core_module.html#root
        root "/web/archive.transitiontowntotnes.org/www";
        # serve files without .html
        default_type "text/html";
        try_files $uri $uri.html $uri/ =404;
        # http://nginx.org/en/docs/http/ngx_http_autoindex_module.html#autoindex
        autoindex on;
        # document index
        # http://nginx.org/en/docs/http/ngx_http_index_module.html#index
        index index.html;
        # location match
        # http://nginx.org/en/docs/http/ngx_http_core_module.html#location
        # Prevent access to any files starting with a dot, like .htaccess
        # or text editor temp files
        location ~ /\. {
            access_log off;
            log_not_found off;
            deny all;
        }
        # gzip content
        include gzip;
    }
    # totnes.transitionnetwork.org https virtual server
    server {
        #listen 4430;
        listen 443;
        server_name totnes.transitionnetwork.org archive.transitiontowntotnes.org ttarchive.penguin.webarch.net;
        access_log /var/log/nginx/ttarchive.ssl_access.log;
        error_log /var/log/nginx/ttarchive.ssl_error.log crit;
        ssl on;
        ssl_certificate /etc/ssl/transitionnetwork.org/transitionnetwork.org.chained.pem;
        ssl_certificate_key /etc/ssl/transitionnetwork.org/transitionnetwork.org.key;
        #ssl_protocols SSLv3 TLSv1 TLSv1.1 TLSv1.2;
        #ssl_ciphers RC4-SHA:HIGH:!ADH:!SSLv2:!aNULL;
        ssl_protocols SSLv3 TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers RC4:HIGH:!aNULL:!MD5;
        ssl_prefer_server_ciphers on;
        root "/web/archive.transitiontowntotnes.org/www";
        autoindex on;
        index index.html;
        # serve files without .html
        default_type "text/html";
        try_files $uri $uri.html $uri/ =404;
        # Prevent access to any files starting with a dot, like .htaccess
        # or text editor temp files
        location ~ /\. {
            access_log off;
            log_not_found off;
            deny all;
        }
    }
comment:10 in reply to: ↑ 9 Changed 3 years ago by chris
Replying to chris:
Replying to ed:
Please estimate to archive TTT site as per conversation with Ed:
- convert to html
Done, 1.5 hours including this comment.
Sorry I can't see where I got that figure from, I'm afraid the total time is 4.3 hours.
I said earlier ticket:630#comment:5
Ed, I'm going to go over time wise perhaps on this, what does £50 equate to in terms of time? I'm happy to do some time for free on this if needs be.
comment:11 follow-up: ↓ 12 Changed 3 years ago by ed
£50 is just under two hours of your time I think Chris
comment:12 in reply to: ↑ 11 Changed 3 years ago by chris
Replying to ed:
£50 is just under two hours of your time I think Chris
This ticket has clocked up 4.3 hours so shall we deduct 2.5 hours from my time this month when it comes to billing?
comment:13 Changed 3 years ago by ed
yes please chris - however if we don't hit December's £670 limit, I'm happy to include as much of this as fits
comment:14 Changed 2 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.25
- Total Hours changed from 4.3 to 4.55
Frances at TTT has asked that the archived TTT Drupal site at http://archive.transitiontowntotnes.org/ be excluded from search engine results.
To prevent the site being dropped from the Internet Archive I have allowed their robot but excluded others:
    User-agent: ia_archiver
    Disallow:

    User-agent: *
    Disallow: /
See http://archive.transitiontowntotnes.org/robots.txt
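As a sanity check on those rules, Python's standard urllib.robotparser can be fed the same lines to see which robots they let in. This is only a sketch: the function name allowed is invented here, "Googlebot" is just an example of another crawler, and real crawlers may interpret robots.txt slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules from robots.txt on the archive site.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent, url="http://archive.transitiontowntotnes.org/"):
    """Return True if the rules above let user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "Googlebot"):
        print(agent, "allowed" if allowed(agent) else "blocked")
```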
For background information see:
Hi chris - please estimate :)