Ticket #487 (closed defect: fixed)

Opened 4 years ago

Last modified 2 years ago

robots.txt files for development sites

Reported by: chris Owned by: jim
Priority: minor Milestone: Maintenance
Component: Live server Keywords:
Cc: ed, chris, jim, sam Estimated Number of Hours: 0.0
Add Hours to Ticket: 0 Billable?: yes
Total Hours: 0.45


All the sites on wiki:PuffinServer other than www.transitionnetwork.org should have a robots.txt file that excludes them from being crawled and indexed, so that development versions of sites do not appear in search results.

Change History

comment:1 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Owner changed from chris to jim
  • Status changed from new to assigned
  • Total Hours changed from 0.0 to 0.1

On the old dev sites we had things in place to ensure that the sites didn't send out emails etc. Is this in hand for the new dev sites on puffin?

I'm assigning this ticket to Jim, in the hope that this can be sorted at a Drupal level rather than at an Nginx / Postfix level.

comment:2 Changed 4 years ago by jim

The Robotstxt module is auto-installed and enabled on all platforms.

On any (D6) site an admin can go to: https://example.com/admin/settings/robotstxt and put

User-agent: *
Disallow: /

This is done on newlive, and the copy of the site on my server. I also ran drush newlive.puffin.webarch.net dis googleanalytics piwik to stop any further reporting -- sorry I did this before but missed it on my 3rd platform/site.


As for config that would prevent emails going out, on the old Kiwi DEV we had 'reroute_email' running, which had a settings.php switch for each environment to allow/reroute emails. That section looked like this:

/**
 * Reroute Email 6.x-1.x-dev variable to send emails to a different address for LIVE
 * JK - this is LIVE so NOT rerouting emails!
 */
$conf['reroute_email_enable'] = 0;

Based on http://groups.drupal.org/node/101274, I looked in our site settings.php on the new platform and saw:

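  # Additional host wide configuration settings. Useful for safely specifying configuration settings.
  if (file_exists('/data/disk/tn/config/includes/global.inc')) {
    include_once('/data/disk/tn/config/includes/global.inc');
  }

  # Additional site configuration settings.
  if (file_exists('/data/disk/tn/static/transition-network-d6-002/sites/transitionnetwork.org/local.settings.php')) {
    include_once('/data/disk/tn/static/transition-network-d6-002/sites/transitionnetwork.org/local.settings.php');
  }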

Meaning we can either add our config to each site's local.settings.php file, or better still add some logic to /data/disk/tn/config/includes/global.inc that just knows when a site is DEV or TEST and does stuff accordingly.

I have some ideas and will throw some PHP together to do this shortly.
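
A rough sketch of what "just knows when a site is DEV or TEST" could mean in global.inc, assuming the environment can be guessed from the first label of the hostname. The function name and the prefix list are made up for illustration and are not what is deployed on puffin:

```php
<?php
/**
 * Sketch only: guess the environment type from a hostname.
 * tn_guess_environment() and the prefix map are hypothetical.
 */
function tn_guess_environment($hostname) {
  // Hypothetical mapping of the leading host label to an environment type.
  $prefixes = array(
    'www' => 'Production',
    'stg' => 'Testing',
    'test' => 'Testing',
    'dev' => 'Development',
  );
  $labels = explode('.', strtolower($hostname));
  if (isset($prefixes[$labels[0]])) {
    return $prefixes[$labels[0]];
  }
  // Fail safe: anything unrecognised is treated as Development,
  // so emails get rerouted and crawlers get blocked by default.
  return 'Development';
}
```

In global.inc this would presumably be called with the request's host name (for example $_SERVER['HTTP_HOST'], when it is set). Defaulting to Development means a forgotten or misnamed site errs on the safe side.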

comment:3 Changed 4 years ago by jim

I've added Reroute Email and Environment Indicator to the makefile and to the platform.

And based on #136 I've created a new /data/conf/override.global.inc with some goodies in. It's checked into Github at https://github.com/transitionnetwork/transitionnetwork.org-d6.profile/blob/master/override.global.php and currently looks like:

<?php // OVERRIDE global settings.php

/**
 * @file override.settings.php
 * Sets up some key Dev/Stage/Test/Prod behaviours
 * Works with
 * -- Session 443
 * -- Environment Indicator
 * -- Reroute Email
 */

/* ------------ DEFAULTS ------------ */

...These enforce some things we want, like HTTPS cookies and certain module settings...

/**
 * Enforce secure cookies handling
 * @see: http://drupal.org/project/session443
 */
ini_set('session.cookie_secure', 1);

/**
 * Reroute Email 6.x-1.x-dev switch means we'll always reroute (if module enabled).
 * @see: http://drupal.org/project/reroute_email
 */
$conf['reroute_email_enable'] = 1;
$conf['reroute_email_address'] = "transition-dev@email-lists.org";

/**
 * Environment Indicator to remind users what site they're looking at (if module enabled).
 * @see: http://drupal.org/project/environment_indicator
 */
$conf['environment_indicator_text'] = 'UNKNOWN SERVER!';
$conf['environment_indicator_color'] = 'red';
$conf['environment_indicator_enabled'] = TRUE;

...And the following function allows us to add to each site's own local.settings.php to set their environment type, and associated overrides of the settings above...

/**
 * Allows server environment settings to be changed on a per site basis
 * from defaults above based on environment type.
 * $environment_name must start with 'Production', 'Testing' or
 * 'Development' (default) else no changes will be made. Any other names
 * can be added after a space.
 * e.g. 'Testing - TN.org commerce'
 */
function puffin_server_override_settings_set_environment($environment_name = 'Development') {
  // $conf is a global in settings.php; without this the assignments below
  // would only change a local variable and have no effect.
  global $conf;
  // use full string for Environment Indicator module label.
  $conf['environment_indicator_text'] = $environment_name;
  // use string before space so we know which environment to choose.
  $env_type = explode(' ', $environment_name);
  // set our own $_SERVER variable for other uses if needs be
  $_SERVER['_TN_ENVIRONMENT'] = $env_type;

  switch ($env_type[0]) {
    case 'Production':
      $conf['reroute_email_enable'] = 0;
      $conf['environment_indicator_color'] = '#D0E7B4';
      $conf['environment_indicator_enabled'] = TRUE;
      // break so Production does not fall through and re-enable rerouting.
      break;
    case 'Testing':
      $conf['reroute_email_enable'] = 1;
      $conf['environment_indicator_color'] = '#D0E7B4';
      $conf['environment_indicator_enabled'] = TRUE;
      break;
    case 'Development':
      $conf['reroute_email_enable'] = 1;
      $conf['environment_indicator_color'] = '#D0E7B4';
      $conf['environment_indicator_enabled'] = TRUE;
      break;
  }
}

E.g. add this line to our Prod site when all done: puffin_server_override_settings_set_environment('Production TN.org');
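
To make the "defaults plus per-environment overrides" idea easier to check in isolation, here is a minimal sketch that computes the overrides as a plain array and merges them over the defaults. tn_environment_overrides() is a hypothetical helper, not part of override.global.inc, and it only covers the reroute and label settings:

```php
<?php
/**
 * Sketch only: return the settings an environment name would override.
 * Unknown environment types return an empty array, i.e. "no changes".
 */
function tn_environment_overrides($environment_name = 'Development') {
  // The word before the first space selects the environment type.
  $parts = explode(' ', $environment_name);
  $overrides = array('environment_indicator_text' => $environment_name);
  switch ($parts[0]) {
    case 'Production':
      // Live site: real emails must go out.
      $overrides['reroute_email_enable'] = 0;
      break;
    case 'Testing':
    case 'Development':
      // Non-live sites: always reroute emails.
      $overrides['reroute_email_enable'] = 1;
      break;
    default:
      return array();
  }
  return $overrides;
}

// Defaults as in the override file, then the per-site call.
$defaults = array(
  'reroute_email_enable' => 1,
  'environment_indicator_text' => 'UNKNOWN SERVER!',
);
$conf = array_merge($defaults, tn_environment_overrides('Production TN.org'));
```

With this shape a typo in the environment name leaves the safe defaults in place (rerouting on, 'UNKNOWN SERVER!' label), which is the behaviour the docblock above describes.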

So the only thing we need do now is enforce reroute_email and environment_indicator to be enabled on every site. There are a number of ways to do this:

But for now we've made good progress and made future development easier. Also, the $_SERVER['_TN_ENVIRONMENT'] allows us to check quickly where we are in our code and do/not do things accordingly.

For another time, there will probably be a more efficient approach based on something here: http://community.aegirproject.org/content/overriding-site-specific-php-values.

comment:4 Changed 4 years ago by jim

I see there's a patch for Robots.txt module that would allow setting it from settings.php (or override.global.inc on our setup): http://drupal.org/node/619404#comment-2237812

We'd need our own version of robots though -- currently it comes from BOA automatically... For another day.

Once the dev and test sites are set up, devs should only ever clone/migrate them, which minimises this risk going forward.

comment:5 Changed 4 years ago by jim

  • Priority changed from major to minor

Patched version of Robots module added to makefile.

Downgrading but keeping open to automate 'deny everything' on dev and test if possible.
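
For the 'deny everything' automation, the decision itself is small. A sketch, assuming the environment type is already known as one of 'Production', 'Testing' or 'Development'; tn_robots_txt_for() is a hypothetical name, and how the result gets handed to the patched Robots module is not shown because it depends on that patch:

```php
<?php
/**
 * Sketch only: choose the robots.txt body for an environment type.
 * Returns NULL for Production, meaning "leave the live rules alone".
 */
function tn_robots_txt_for($env_type) {
  if ($env_type === 'Production') {
    return NULL;
  }
  // Everything that is not explicitly Production blocks all crawlers.
  return "User-agent: *\nDisallow: /\n";
}
```

Treating every non-Production value as "block" means a new or mislabelled dev/test site is hidden from search engines by default.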

comment:6 Changed 4 years ago by chris

  • Status changed from assigned to closed
  • Resolution set to fixed

Closing this ticket as Jim has, urm, fixed it! (is this bad taste joke day?)

comment:7 Changed 3 years ago by chris

  • Cc sam added
  • Add Hours to Ticket changed from 0.0 to 0.25
  • Status changed from closed to reopened
  • Resolution fixed deleted
  • Total Hours changed from 0.1 to 0.35

Reopening this as Sam noticed that https://stg.transitionnetwork.org/ is indexed by Google, "About 847 results", https://www.google.com/search?q=transition+site%3Astg.transitionnetwork.org

The site had the live robots.txt:

I get "You are not authorized to access this page." at https://stg.transitionnetwork.org/admin/settings/robotstxt

Following https://omega8.cc/how-to-use-robotstxt-properly-243 I created a robots.txt file in /data/disk/tn/clients/tnusers/stg.transitionnetwork.org/files and this has fixed it for this site.

There might be a better way to do this?

There might be other sites it needs doing for?

comment:8 Changed 3 years ago by jim

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 0.35 to 0.45

Hi chris,

The RobotsTxt module should be enabled in STG -- this should then have the 'Disallow everyone' settings added.

Robots is included in the makefile.

So the module needs to be enabled, the robots.txt file in the docroot folder needs removing (since the module can't work while that file is present), and it needs to be confirmed to work.

comment:9 Changed 3 years ago by sam

  • Status changed from reopened to closed
  • Resolution set to fixed

Hi, I changed robots.txt on the stg & stg2 servers yesterday using the Drupal frontend / robots.txt module.

It seems to have worked: http://stg.transitionnetwork.org/robots.txt

So I'm going to close the ticket and we just have to remember to do it when creating a new stg site.
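
Since this relies on someone remembering, a small check could help catch a stg site that is serving the live rules. A simplified sketch: tn_robots_blocks_everything() is a hypothetical helper that only looks for "Disallow: /" inside a "User-agent: *" group and ignores Allow rules and wildcards, so it is not a full robots.txt parser:

```php
<?php
/**
 * Sketch only: TRUE if the robots.txt body contains "Disallow: /"
 * within a group that applies to "User-agent: *".
 */
function tn_robots_blocks_everything($robots_txt) {
  $in_wildcard_group = FALSE;
  $last_was_agent = FALSE;
  foreach (preg_split('/\r\n|\r|\n/', $robots_txt) as $line) {
    // Strip comments and surrounding whitespace.
    $line = trim(preg_replace('/#.*$/', '', $line));
    if ($line === '' || strpos($line, ':') === FALSE) {
      continue;
    }
    list($field, $value) = explode(':', $line, 2);
    $field = strtolower(trim($field));
    $value = trim($value);
    if ($field === 'user-agent') {
      // A User-agent line after a rule starts a new group;
      // consecutive User-agent lines share one group.
      if (!$last_was_agent) {
        $in_wildcard_group = FALSE;
      }
      if ($value === '*') {
        $in_wildcard_group = TRUE;
      }
      $last_was_agent = TRUE;
      continue;
    }
    $last_was_agent = FALSE;
    if ($in_wildcard_group && $field === 'disallow' && $value === '/') {
      return TRUE;
    }
  }
  return FALSE;
}
```

The body of http://stg.transitionnetwork.org/robots.txt could be fetched (for example with file_get_contents(), where URL access is allowed) and passed to this function for each dev/stg site.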



comment:10 Changed 3 years ago by ed

  • Milestone set to Maintenance

comment:11 Changed 3 years ago by chris

This issue came up again, see ticket:712#comment:27

comment:12 Changed 2 years ago by chris

This issue came up again, see ticket:767
