Ticket #630 (closed task: fixed)

Opened 3 years ago

Last modified 2 years ago

Archiving Transition Town Totnes site

Reported by: ed Owned by: chris
Priority: major Milestone: Maintenance
Component: Unassigned Keywords:
Cc: Estimated Number of Hours: 0.0
Add Hours to Ticket: 0 Billable?: yes
Total Hours: 4.55

Description

Please estimate to archive TTT site as per conversation with Ed:

  1. convert to html
  2. host on Penguin (incl. any likely issues for Penguin doing this)
  3. any ongoing maintenance

Then Ed will discuss with Frances at TTT as per agreement with Chris

Attachments

hts-log.txt (437.2 KB) - added by chris 3 years ago.
HTTrack Log File for Transition Town Totnes Archive 30th Nov 2013
hts-log2.txt (178.8 KB) - added by chris 3 years ago.
HTTrack Log File for Transition Town Totnes Archive 3rd Dec 2013

Change History

comment:1 Changed 3 years ago by ed

Hi chris - please estimate :)

comment:2 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 0.0 to 0.25

I have set it off, it won't take long, on wiki:PenguinServer:

cd /web
mkdir -p archive.transitiontowntotnes.org/www/
cd archive.transitiontowntotnes.org
httrack 

Welcome to HTTrack Website Copier (Offline Browser) 3.43-9+libhtsjava.so.2
Copyright (C) Xavier Roche and other contributors
To see the option list, enter a blank line or try httrack --help

Enter project name :Transition Town Totnes Archive

Base path (return=/root/websites/) :/web/archive.transitiontowntotnes.org/www/

Enter URLs (separated by commas or blank spaces) :http://archive.transitiontowntotnes.org/

Action:
(enter) 1       Mirror Web Site(s)
        2       Mirror Web Site(s) with Wizard
        3       Just Get Files Indicated
        4       Mirror ALL links in URLs (Multiple Mirror)
        5       Test Links In URLs (Bookmark Test)
        0       Quit

: 1

Proxy (return=none) :

You can define wildcards, like: -*.gif +www.*.com/*.zip -*img_*.zip
Wildcards (return=none) :

You can define additional options, such as recurse level (-r<number>), separed by blank spaces
To see the option list, type help
Additional options (return=none) :

---> Wizard command line: httrack http://archive.transitiontowntotnes.org/  -O "/web/archive.transitiontowntotnes.org/www/Transition Town Totnes Archive"  -%v  

Ready to launch the mirror? (Y/n) :Y

WARNING! You are running this program as root!
It might be a good idea to use the -%U option to change the userid:
Example: -%U smith

Mirror launched on Fri, 29 Nov 2013 12:51:01 by HTTrack Website Copier/3.43-9+libhtsjava.so.2 [XR&CO'2010]
mirroring http://archive.transitiontowntotnes.org/ with the wizard help..

I'll check back when it's done to see what we have and set up Nginx to serve it.

Changed 3 years ago by chris

HTTrack Log File for Transition Town Totnes Archive 30th Nov 2013

comment:3 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Status changed from new to accepted
  • Total Hours changed from 0.25 to 0.5

HTTrack finally completed:

PANIC! : Too many URLs : >99995 [2870]
Done.
Thanks for using HTTrack!
* 

I think there must be some infinite loops.

I have attached the log file, /trac/attachment/ticket/630/hts-log.txt. The errors I noticed include the print versions of pages being skipped because of robots rules:

13:12:10        Warning:        Link archive.transitiontowntotnes.org/print/2324 not scanned (follow robots meta tag)

And PDF versions of pages generating 500 errors:

05:25:48        Error:  "" (500) after 2 retries at link archive.transitiontowntotnes.org/printpdf/Central/DVDsPresentedByTTT (from archive.transitiontowntotnes.org/Central/DVDsPresentedByTTT)

But I haven't read through the whole log file.
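
For a quick overview of a log this size, the error lines could be tallied per HTTP status. This is only a sketch: the function name summarise_httrack_errors is made up for illustration, and it assumes every error line has the same shape as the ones quoted above (Error: "message" (status) ...).

```shell
# Count HTTrack "Error:" lines per HTTP status code in a log file.
# Assumes lines shaped like:  05:25:48  Error:  "" (500) after 2 retries at link ...
summarise_httrack_errors() {
    log_file="$1"
    grep 'Error:' "$log_file" \
        | sed -n 's/.*Error:[[:space:]]*"[^"]*" (\([0-9][0-9]*\)).*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'   # print "status count"
}
```

It would be run as summarise_httrack_errors hts-log.txt; warning lines are ignored because they do not contain "Error:".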

Based on this trial run I expect it would take about 1 to 2 hours to get the site optimised for running HTTrack against it, get the HTTrack command line right and sort out the Nginx config to serve the site.

Once it has been done for one site we will be in a position to archive other sites quicker.

comment:4 Changed 3 years ago by ed

We've got a £50 budget for archiving TTT...

comment:5 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 1.7
  • Total Hours changed from 0.5 to 2.2

There are also errors for requests like this:

So I have added a Redirect to the .htaccess file:

RedirectPermanent /system/files  http://archive.transitiontowntotnes.org/sites/default/files
RedirectPermanent /content/  http://archive.transitiontowntotnes.org/
RedirectPermanent /drupal/  http://archive.transitiontowntotnes.org/

I have also removed anonymous users' ability to view the print and PDF versions of documents via http://archive.transitiontowntotnes.org/admin/user/permissions

These pages needed the Input Type to be changed to "Full HTML" for them to render properly:

These needed the "Published" box ticked:

I have enabled DirectoryIndexes here http://archive.transitiontowntotnes.org/sites/default/files/ and I'll rsync across all these files as well to be sure nothing is missing.

There are 404's for old wiki addresses:

I'll look at sorting out some redirects for these.

Ed, I'm going to go over time wise perhaps on this, what does £50 equate to in terms of time? I'm happy to do some time for free on this if needs be.

comment:6 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 2.2 to 2.7

Actually I can't see an easy fix for the URLs that should point to http://2011.archive.transitionnetwork.org/Totnes/ so I'm going to ignore them due to time constraints.

Running HTTrack again:

screen
sudo -i
cd /web/archive.transitiontowntotnes.org/
rm -rf Transition\ Town\ Totnes\ Archive/
httrack http://archive.transitiontowntotnes.org/  -O "/web/archive.transitiontowntotnes.org/www"  -%v

That should do the trick, it'll take several hours to run.

Copying across the files:

cd /web/archive.transitiontowntotnes.org/www/archive.transitiontowntotnes.org/sites/default/files
rsync -av host-a:/home/ttt/archive_public_html/sites/default/files/* .

Nginx config, minus comments

server {
        listen   80;
        server_name archive.transitiontowntotnes.org ttarchive.penguin.webarch.net;
        access_log  /var/log/nginx/ttarchive.access.log;
        error_log   /var/log/nginx/ttarchive.error.log crit;
        root   "/web/archive.transitiontowntotnes.org/www/archive.transitiontowntotnes.org";
        autoindex  on;
        index  index.html;
        location ~ /\. {
                access_log off;
                log_not_found off;
                deny all;
        }
        include  gzip;
}

server {
        listen   443;
        server_name archive.transitiontowntotnes.org ttarchive.penguin.webarch.net;
        access_log  /var/log/nginx/ttarchive.ssl_access.log;
        error_log   /var/log/nginx/ttarchive.ssl_error.log debug;
        ssl  on;
        ssl_certificate  /etc/ssl/transitionnetwork.org/transitionnetwork.org.chained.pem;
        ssl_certificate_key  /etc/ssl/transitionnetwork.org/transitionnetwork.org.key;
        ssl_protocols  SSLv3 TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers  RC4:HIGH:!aNULL:!MD5;
        ssl_prefer_server_ciphers   on;
        root   "/web/archive.transitiontowntotnes.org/www/archive.transitiontowntotnes.org";
        autoindex  on;
        index  index.html;
        location ~ /\. {
                access_log off;
                log_not_found off;
                deny all;
        }
}

The site is now available here: http://ttarchive.penguin.webarch.net/

I have added a link to the site at http://penguin.transitionnetwork.org/

And it's looking good so far:

Bytes saved:    67,75MiB               Links scanned:   251/1229 (+975)
Time:   41min17s                       Files written:   1170
Transfer rate:  25,58KiB/s (20,90KiB/s)Files updated:   0
Active connections:     4              Errors:  8

It should be done in a few hours and then the DNS for archive.transitiontowntotnes.org will need updating.

comment:7 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 2.7 to 2.8

There are some embedded images like this which are 404's:

But they are available:

We can fix this with a DNS update, these need doing:

  • Point totnes.transitionnetwork.org to 81.95.52.111
  • Point archive.transitiontowntotnes.org to 81.95.52.111

Changed 3 years ago by chris

HTTrack Log File for Transition Town Totnes Archive 3rd Dec 2013

comment:8 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 1.2
  • Total Hours changed from 2.8 to 4.0

I have attached the log from the last run of HTTrack, /trac/attachment/ticket/630/hts-log2.txt

Entries like this:

  • 15:24:56 Error: "Forbidden" (403) at link archive.transitiontowntotnes.org/sites/default/files/Cycliing+at+Hebmbury+Woods.Press+Release.doc?q=system/files/Cycliing+at+Hebmbury+Woods.Press+Release.doc (from archive.transitiontowntotnes.org/central/pressreleases)

This is for the URL http://ttarchive.penguin.webarch.net/sites/default/files/Cycliing+at+Hebmbury+Woods.Press+Release.doc but the file is actually at http://ttarchive.penguin.webarch.net/sites/default/files/Cycliing%20at%20Hebmbury%20Woods.Press%20Release.doc

I have done some searching for an Nginx-level solution for this but it doesn't look easy to fix at that level.
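
One possible filesystem-level workaround, not something that was done on this ticket, would be to add symlinks whose names use + in place of spaces, so that the + form of a URL finds the real file. The sketch below assumes Nginx is left at its default of following symlinks and that find supports -maxdepth; link_plus_names is a hypothetical helper name.

```shell
# For every file in a directory whose name contains spaces, create a
# sibling symlink with the spaces replaced by "+", for example
#   "a b.doc"  <-  "a+b.doc"
# Existing files or links with the "+" name are left untouched.
link_plus_names() {
    dir="$1"
    find "$dir" -maxdepth 1 -type f -name '* *' | while IFS= read -r path; do
        name=$(basename "$path")
        plus_name=$(printf '%s' "$name" | tr ' ' '+')
        # Relative target, so the link survives the directory being moved.
        [ -e "$dir/$plus_name" ] || ln -s "$name" "$dir/$plus_name"
    done
}
```

It would be run against the files directory, for example link_plus_names sites/default/files. It does nothing for the links HTTrack has already rewritten to point at .html placeholder files.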

Also HTTrack has created this file:

The file that it's linked from:

Is available from the HTTrack archive here:

This file needs this HTML:

<a href="../sites/default/files/Cycliing%2bat%2bHebmbury%2bWoods.Press%2bReleaseb2c2.html">Cycle to Hembury Woods </a>

Changing to:

<a href="../sites/default/files/Cycliing%20at%20Hebmbury%20Woods.Press%20Release.doc">Cycle to Hembury Woods </a>

Although we could manually fix these links, there isn't the time / budget for this work. There are 337 files ending in .html in the sites/default/files folder and that will be roughly the number of broken links. A very keen researcher could find these files themselves from this index: http://ttarchive.penguin.webarch.net/sites/default/files/
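
A count like the one above can be reproduced with find. This is a minimal sketch; count_html_placeholders is a made-up name and the directory is passed in rather than assumed.

```shell
# Count regular files ending in .html below a directory.
# In the files folder these are the placeholder pages HTTrack wrote
# instead of the original documents.
count_html_placeholders() {
    find "$1" -type f -name '*.html' | wc -l | tr -d ' '
}
```

For example: count_html_placeholders /web/archive.transitiontowntotnes.org/www/archive.transitiontowntotnes.org/sites/default/files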

I have also noticed that some pages didn't get HTML ticked, eg:

I have added this to the Nginx config:

        # serve files without .html
        default_type "text/html";
        try_files  $uri $uri.html $uri/ =404;

So an old URL like:

Is available at this URL:

In addition to:

This should dramatically reduce the 404's from existing links to pages on archive.transitiontowntotnes.org
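
To show the order in which that try_files line looks for content, here is a rough shell emulation of it. It is only an approximation for illustration (Nginx does more, such as applying the index directive to directories), and resolve_uri is a hypothetical name.

```shell
# Approximate "try_files $uri $uri.html $uri/ =404" for a document root.
# Prints the path that would be used, or "404" with a non-zero status.
resolve_uri() {
    root="$1"
    uri="$2"
    if [ -f "$root$uri" ]; then
        printf '%s\n' "$root$uri"          # $uri: an exact file match
    elif [ -f "$root$uri.html" ]; then
        printf '%s\n' "$root$uri.html"     # $uri.html: old extensionless URL
    elif [ -d "$root$uri" ]; then
        printf '%s\n' "$root$uri/"         # $uri/: a directory
    else
        echo 404
        return 1
    fi
}
```

So resolve_uri /web/archive.transitiontowntotnes.org/www /some/page would report /some/page.html when only the .html file exists.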

I have added some text to the front page:

<strong>This site is a static archive of the dynamic Drupal site created in December 2013, if you are having problems with some files not being found you can look at the <a href="/sites/default/files/">index of files</a>. See also, the archives of the wiki site from <a href="http://2011.archive.transitionnetwork.org/Totnes/">2011</a> and <a href="http://2010.archive.transitionnetwork.org/Totnes/">2010</a>.</strong>

And copied the edited index.html to home.html.

The archiving failed in the end with:

Too many URLs, giving up..(>100000)
To avoid that: use #L option for more links (example: -#L1000000)

From the times I checked on its progress I think the massive number of links was caused by the dynamically generated events section of the site.
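
If the mirror were ever re-run, the link explosion might be avoided by raising the limit as HTTrack suggests and excluding the generated sections with filters. This was not done here and is only a sketch: the /print/ and /printpdf/ paths come from the first log, but the /events/ path is a guess at where the events section lives, and mirror_site is a hypothetical wrapper. The HTTRACK variable exists only so the command can be swapped out for a dry run.

```shell
# Mirror a site with a higher link limit and exclusion filters.
# Usage: mirror_site <hostname> <output-directory>
# ASSUMPTION: "/events/*" is a guessed path for the events section.
mirror_site() {
    site="$1"
    out_dir="$2"
    "${HTTRACK:-httrack}" "http://${site}/" -O "$out_dir" -%v '-#L1000000' \
        "-${site}/print/*" \
        "-${site}/printpdf/*" \
        "-${site}/events/*"
}
```

Setting HTTRACK=echo before calling it prints the arguments instead of starting a mirror.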

I have updated the DNS for archive.transitiontowntotnes.org:

dig @dns0.webarchitects.co.uk archive.transitiontowntotnes.org +short
81.95.52.111

dig @dns1.webarchitects.co.uk archive.transitiontowntotnes.org +short
81.95.52.111

And for totnes.transitionnetwork.org, but it hasn't updated yet:

dig @A.DNS.GANDI.NET totnes.transitionnetwork.org +short
81.95.52.103
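
Checks like the dig commands above could be wrapped in a small helper that compares each nameserver's answer with the expected address. A sketch only; check_dns is a made-up name and it assumes a single A record is returned.

```shell
# Ask each nameserver for a name and compare the answer with the
# expected IP address. Returns non-zero if any server disagrees.
# Usage: check_dns <name> <expected-ip> <nameserver>...
check_dns() {
    name="$1"
    expected="$2"
    shift 2
    status=0
    for ns in "$@"; do
        answer=$(dig "@$ns" "$name" +short)
        if [ "$answer" = "$expected" ]; then
            echo "$ns ok"
        else
            echo "$ns returned ${answer:-nothing}"
            status=1
        fi
    done
    return $status
}
```

For example: check_dns totnes.transitionnetwork.org 81.95.52.111 A.DNS.GANDI.NET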

I think it's now safe to delete the Drupal site and database. Sorry this has taken longer than expected. I think this ticket can now be closed. I have also added a link to this ticket from the front page of the archive in case people are desperate to find something.

comment:9 in reply to: ↑ description ; follow-up: ↓ 10 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.3
  • Status changed from accepted to closed
  • Resolution set to fixed
  • Total Hours changed from 4.0 to 4.3

The DNS for http://totnes.transitionnetwork.org/ has updated:

dig @A.DNS.GANDI.NET totnes.transitionnetwork.org +short
81.95.52.111

Regarding the original ticket questions:

Replying to ed:

Please estimate to archive TTT site as per conversation with Ed:

  1. convert to html

Done, 1.5 hours including this comment.

  2. host on Penguin (incl. any likely issues for Penguin doing this)

Done, no issues.

In terms of disk space it was using 1.7GB:

cd /web/archive.transitiontowntotnes.org/
du -h --max-depth=2
1.3G    ./www/archive.transitiontowntotnes.org
397M    ./www/hts-cache
8.0K    ./www/archive.transitiontowntotnes.org_
1.7G    ./www
1.7G    .

I have reduced this to 1.3GB by deleting all the HTTrack files and moving the document root:

cd /web/archive.transitiontowntotnes.org/
du -h --max-depth=1
1.3G    ./www
1.3G    .

There is still lots of space on the server:

df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda2             40G   16G   22G  42% /

  3. any ongoing maintenance

It's static HTML, the only issue I can think of is if something in the archive needs editing or deleting.

I have deleted the Drupal site off the server it was running on, host-a.ecodis.net but we still have backups of it for now.

I have updated the list of sites / servers here wiki:WikiStart#ServersandWebsites

And added the archive to the list of sites here wiki:PenguinServer#totnes.transitionnetwork.org

I have added the site to the list of sites here http://penguin.transitionnetwork.org/

And for reference here is the Nginx config:

# totnes.transitionnetwork.org virtual server
# http://nginx.org/en/docs/http/ngx_http_core_module.html#server
server {

        # listen for ipv4
        # http://nginx.org/en/docs/http/ngx_http_core_module.html#listen
        #listen   8000; 
        listen   80;

        # server name and server aliases        
        # http://nginx.org/en/docs/http/ngx_http_core_module.html#server_name 
        server_name totnes.transitionnetwork.org archive.transitiontowntotnes.org ttarchive.penguin.webarch.net;

        # logs, error log levels: info | notice | warn | error | crit | alert 
        # http://nginx.org/en/docs/http/ngx_http_log_module.html#access_log
        # http://nginx.org/en/docs/ngx_core_module.html#error_log
        access_log  /var/log/nginx/ttarchive.access.log;
        error_log   /var/log/nginx/ttarchive.error.log crit;

        # document root
        # http://nginx.org/en/docs/http/ngx_http_core_module.html#root
        root   "/web/archive.transitiontowntotnes.org/www";

        # serve files without .html
        default_type "text/html";
        try_files  $uri $uri.html $uri/ =404;

        # http://nginx.org/en/docs/http/ngx_http_autoindex_module.html#autoindex
        autoindex  on;

        # document index
        # http://nginx.org/en/docs/http/ngx_http_index_module.html#index
        index  index.html;

        # location match
        # http://nginx.org/en/docs/http/ngx_http_core_module.html#location

        # Prevent access to any files starting with a dot, like .htaccess
        # or text editor temp files
        location ~ /\. {
                access_log off;
                log_not_found off;
                deny all;
        }

        # gzip content
        include  gzip;

}

# totnes.transitionnetwork.org https virtual server
server {
        #listen   4430;
        listen   443;
        server_name totnes.transitionnetwork.org archive.transitiontowntotnes.org ttarchive.penguin.webarch.net;
        access_log  /var/log/nginx/ttarchive.ssl_access.log;
        error_log   /var/log/nginx/ttarchive.ssl_error.log crit;

        ssl  on;
        ssl_certificate  /etc/ssl/transitionnetwork.org/transitionnetwork.org.chained.pem;
        ssl_certificate_key  /etc/ssl/transitionnetwork.org/transitionnetwork.org.key;
        #ssl_protocols  SSLv3 TLSv1 TLSv1.1 TLSv1.2;
        #ssl_ciphers  RC4-SHA:HIGH:!ADH:!SSLv2:!aNULL;
        ssl_protocols  SSLv3 TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers  RC4:HIGH:!aNULL:!MD5;
        ssl_prefer_server_ciphers   on;

        root   "/web/archive.transitiontowntotnes.org/www";
        autoindex  on;
        index  index.html;

        # serve files without .html
        default_type "text/html";
        try_files  $uri $uri.html $uri/ =404;

        # Prevent access to any files starting with a dot, like .htaccess
        # or text editor temp files
        location ~ /\. {
                access_log off;
                log_not_found off;
                deny all;
        }
}

comment:10 in reply to: ↑ 9 Changed 3 years ago by chris

Replying to chris:

Replying to ed:

Please estimate to archive TTT site as per conversation with Ed:

  1. convert to html

Done, 1.5 hours including this comment.

Sorry I can't see where I got that figure from, I'm afraid the total time is 4.3 hours.

I said earlier ticket:630#comment:5

Ed, I'm going to go over time wise perhaps on this, what does £50 equate to in terms of time? I'm happy to do some time for free on this if needs be.

comment:11 follow-up: ↓ 12 Changed 3 years ago by ed

£50 is just under two hours of your time I think Chris

comment:12 in reply to: ↑ 11 Changed 3 years ago by chris

Replying to ed:

£50 is just under two hours of your time I think Chris

This ticket has clocked up 4.3 hours so shall we deduct 2.5 hours from my time this month when it comes to billing?

comment:13 Changed 3 years ago by ed

yes please chris - however if we don't hit December's £670 limit, I'm happy to include as much of this as fits

comment:14 Changed 2 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 4.3 to 4.55

Frances at TTT has asked that the archived TTT Drupal site at http://archive.transitiontowntotnes.org/ be excluded from search engine results.

To prevent the site being dropped from the Internet Archive I have allowed their robot but excluded others:

User-agent: ia_archiver 
Disallow:

User-agent: *
Disallow: /

See http://archive.transitiontowntotnes.org/robots.txt

For background information see:
