Version 103 (modified by chris, 3 years ago)

Puffin

puffin.webarch.net is an 8GB RAM, 14 CPU core Debian Squeeze virtual server which replaced NewLiveServer and DevelopmentServer for running the Transition Network Drupal sites. It went live in early 2013.

This server was migrated to run off a ZFS server in October 2013 (see ticket:593), and it was upgraded from Squeeze to Wheezy on 17th November 2013 (see ticket:535).

It was agreed to call this server puffin at the ttech meeting on 22nd November 2012, see ticket:463. The install and initial configuration of this server were tracked on ticket:466, see also the other PuffinServer#migrationtickets. Other services from the old server were migrated to PenguinServer.

System updates are recorded on ticket:218 and BOA updates on ticket:629.

Munin Stats

There are munin stats for the server available here

See ticket:555#comment:13 for the notes regarding the installation of the mysql munin stats package.

We trialled New Relic in 2013 (see ticket:586), but this isn't ongoing. Sometimes the IO State graph stops; this can be fixed by deleting the lock files, see ticket:555#IOstategraph.

HTTP Stats

The wiki:PiwikServer generates stats from the humans visiting the server and some of these stats have been made public on wiki:WebStats.

There are some notes on analysing the raw Nginx stats on wiki:WebServerLogs and Webalizer stats for Puffin are available using the same username/password as this Trac site.

There is a wiki:ErrorCodeCheck script which emails the total number of HTTP errors each day, see ticket:483#comment:63 for a list of the totals for August, September and October 2013.

Load Spikes

The server has been suffering from load spikes which cause the site to be unresponsive for clients. You can see the current status via the puffin Munin load graph; note the Max values for the last day, week, month and year.

When the load hits 23.28, robots are served 403 Forbidden responses; when it hits 86.64, maintenance tasks are killed; and when it hits 113.28, the server terminates until the 5 min load average falls below 93.30.

The default thresholds have been changed as they were causing the site to shut down for 15 min at a time far too often; the current values were applied on 23rd October 2013.

The server has 14 CPU cores (see Unix-style load calculation). The current thresholds are generated from these variables in /root/.barracuda.cnf; the commented out values are the defaults:

#_LOAD_LIMIT_ONE=1444
#_LOAD_LIMIT_TWO=888
_LOAD_LIMIT_ONE=8664
_LOAD_LIMIT_TWO=5328

These variables are used by the /var/xdrago/second.sh script, which is run every minute via cron and has an internal loop which causes it to run 5 times, waiting 10 seconds between each run. It has the following variables in it (these have been edited from their default values):

ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg`
FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg`
CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328
CTL_ONEX_LOAD_CRIT=11328
CTL_FIVX_LOAD_CRIT=9330

These values translate to the following loads for comparison to the Munin graphs:

  • ONEX_LOAD: load average over the last minute times 100
  • FIVX_LOAD: load average over the last 5 minutes times 100
  • CTL_ONEX_SPIDER_LOAD: 23.28
  • CTL_FIVX_SPIDER_LOAD: 23.28
  • CTL_ONEX_LOAD: 86.64
  • CTL_FIVX_LOAD: 53.28
  • CTL_ONEX_LOAD_CRIT: 113.28
  • CTL_FIVX_LOAD_CRIT: 93.30
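
The conversion in the list above is a division by 100. As a rough sketch, the two helpers below convert in both directions; `to_load` and `to_ctl` are hypothetical names used only for illustration and are not part of BOA:

```shell
# Hypothetical helpers (not part of BOA) for converting between the
# integer thresholds in second.sh and the loads shown on the Munin graphs.

# to_load N: print threshold N as a load average with two decimals.
to_load() {
  awk -v n="$1" 'BEGIN { printf "%.2f\n", n / 100 }'
}

# to_ctl L: print load average L as a threshold integer.
# Adding 0.5 before truncating guards against floating point rounding.
to_ctl() {
  awk -v l="$1" 'BEGIN { printf "%d\n", l * 100 + 0.5 }'
}

to_load 8664   # CTL_ONEX_LOAD, prints 86.64
to_ctl 93.30   # CTL_FIVX_LOAD_CRIT, prints 9330
```

This may be handy when reading a Max value off a Munin graph and wanting the matching integer for second.sh.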

And the logic, translated into English, is:

  1. If the load average over the last minute is greater than 23.28 and less than 86.64 and the nginx high load config isn't in use then start to use it.
  2. Else if the load average over the last 5 mins is greater than 23.28 and less than 53.28 and the nginx high load config isn't in use then start to use it.
  3. Else if the load average over the last minute is less than 23.28 and the load average over the last 5 mins is less than 23.28 and the nginx high load config is in use then stop using it.

  1. If the load average over the last minute is greater than 132.16 and if /var/run/boa_run.pid exists, wait a second, if not kill some maintenance jobs: killall -9 php drush.php wget
  2. Else if the load average over the last 5 mins is greater than 108.85 and if /var/run/boa_run.pid exists, wait a second, if not kill some maintenance jobs: killall -9 php drush.php wget

  1. If the load average over the last minute is greater than 101.08 then kill the web server, killall -9 nginx and killall -9 php-fpm php-cgi
  2. Else if the load average over the last 5 mins is greater than 62.16 then kill the web server, killall -9 nginx and killall -9 php-fpm php-cgi
  3. Else restart all the services via /var/xdrago/proc_num_ctrl.cgi
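
The three groups of rules above can be sketched as three small functions. This is an illustration of the English description only, not the real second.sh: the function names and the yes/no arguments are hypothetical, the functions print an action name rather than running anything, and the thresholds are the ones from the second.sh variable listing earlier on this page, which is an assumption about which values are in force.

```shell
# Sketch only: mirrors the English description, not the real second.sh.
# Thresholds (load x 100) as in the second.sh variable listing.
CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328
CTL_ONEX_LOAD_CRIT=11328
CTL_FIVX_LOAD_CRIT=9330

# spider_action ONE FIVE IN_USE
# ONE and FIVE are loads x 100; IN_USE is "yes" when the nginx high
# load config is active, otherwise "no".
spider_action() {
  if [ "$1" -gt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$1" -lt "$CTL_ONEX_LOAD" ] && [ "$3" = no ]; then
    echo high_load_on
  elif [ "$2" -gt "$CTL_FIVX_SPIDER_LOAD" ] && [ "$2" -lt "$CTL_FIVX_LOAD" ] && [ "$3" = no ]; then
    echo high_load_on
  elif [ "$1" -lt "$CTL_ONEX_SPIDER_LOAD" ] && [ "$2" -lt "$CTL_FIVX_SPIDER_LOAD" ] && [ "$3" = yes ]; then
    echo high_load_off
  else
    echo no_change
  fi
}

# maintenance_action ONE FIVE BOA_RUNNING
# BOA_RUNNING is "yes" when /var/run/boa_run.pid exists.
maintenance_action() {
  if [ "$1" -gt "$CTL_ONEX_LOAD_CRIT" ] || [ "$2" -gt "$CTL_FIVX_LOAD_CRIT" ]; then
    if [ "$3" = yes ]; then echo wait; else echo kill_maintenance; fi
  else
    echo no_change
  fi
}

# web_action ONE FIVE
web_action() {
  if [ "$1" -gt "$CTL_ONEX_LOAD" ] || [ "$2" -gt "$CTL_FIVX_LOAD" ]; then
    echo kill_web
  else
    echo restart_services
  fi
}
```

For example, `spider_action 3000 1000 no` corresponds to a one minute load of 30.00 with the high load config not yet in use.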

Tickets generated in relation to these issues include:

A total of 77.5 hours was spent on the tickets listed above. The final one was closed on 15th December 2013 and the total time was added up, see ticket:555#comment:132.

Tickets

Most of the "live server" tickets relate to puffin, but the older ones, prior to ticket #466, are for previous servers.

Current live server tickets

Ticket Summary Owner Reporter
#924 Sheffield Server Shutdown Timetable? chris chris
#918 redirects? chris sam
#905 TN site down due to redis not running chris chris
#904 Issues to consider in the migration from Drupal to WordPress chris chris
#903 Large load spike on PuffinServer chris chris
#901 Enable SSH access to PuffinServer for Ade chris chris
#898 Fwd: Access to Drupal chris ade
#897 Hosting information/requirements for 2016 chris chris
#893 BOA Cron Jobs chris chris
#884 RE: http://news.transitionnetwork.org ade paul
#875 Free HTTPS certificates from Let's Encrypt chris chris
#859 Subscription emails broken paul sam
#847 Upgrade Servers to Debian Jessie chris chris
#836 "Date is invalid" on film content type paul sam
#834 Slovenian State info missing again paul sam
#824 Analysis of the 2014 maintenance ticket time chris chris
#814 Higher that usual loads on PuffinServer since early September chris chris
#812 space.transitionnetwork.org hacked? chris chris
#790 Annesley locked out of puffin chris chris
#763 Server Backups chris chris
#742 Stg site to play with paul sam
#716 Heartbleed chris chris
#692 Debian Updates chris chris
#689 Duplicate comments paul sam
#644 AWstats Nginx config breaks aegir chris jim
#626 Add redirect from an old CMS to a new URL chris ed
#587 Puffin MySQL Tuning chris chris

Closed live server tickets

Ticket Summary Owner Reporter
#920 SSL weirdness? chris sam
#913 Drupal Site off-line chris chris
#900 Unusal High Load on Puffin chris chris
#896 Chive access to TN Drupal DB chris chris
#895 HTTPS wildcard *.transitionnnetwork.org expires on 22nd January 2016 chris chris
#889 BOA-2.4.7 ade chris
#872 BOA 2.4.6 chris chris
#864 BOA 2.4.5 chris chris
#863 BOA-2.4.4 chris chris
#862 Puffin locked ade chris
#854 BOA 2.4.3 chris chris
#846 Load Spikes on BOA PuffinServer chris chris
#845 Unneeded FTP server on PuffinServer chris chris
#844 Stable BOA 2.4.2 Release chris chris
#843 8.8.8.8 (US/United States/google-public-dns-a.google.com) blocked for port scanning chris chris
#839 Stable BOA-2.4.1 Release chris chris
#837 Iframe in a panel page ben sam
#831 Rob is having Image upload issues paul ade
#828 Site down due to massive load spike 2015-01-29 chris chris
#827 Stable BOA-2.4.0 Release chris chris
#820 *.transitionnetwork.org 2015 security certificate chris chris
#797 POODLE: SSLv3.0 vulnerability (CVE-2014-3566) chris chris
#795 SHA1 Deprecation: Regenerate all certs using SHA256 chris chris
#788 New BOA-2.3.3 Stable Edition available chris chris
#784 New BOA-2.3.0 chris chris
#779 Annesley locked out of puffin? chris chris
#775 New BOA-2.2.9 Stable Edition available chris chris
#769 Locked myself out of puffin again paul annesley
#765 New BOA-2.2.8 Stable Edition chris chris
#762 cannot log in to Puffin chris annesley
#760 New BOA-2.2.7 Stable Edition chris chris
#754 Can we upgrade from PHP 5.3? chris chris
#745 Upgrade to BOA-2.2.6 Stable Edition chris chris
#730 Redis Munin stats for puffin chris chris
#725 Upgrade to BOA-2.2.5 chris chris
#721 Upgrade to BOA-2.2.3 Stable Edition chris chris
#717 Heartbleed / Open SSL vunerability chris sam
#707 Upgrade to BOA-2.2.2 chris chris
#698 intransitionmovie.com returns 405 on submit sam sam
#685 SSL certificate about to expire? chris sam
#683 Create Aegir account for Paul jim sam
#678 transitionnetwork.org unavailable chris sam
#677 Spike in MyISAM (search) database activity, Redis unable to cache such requests chris chris
#674 Puffin locked up chris chris
#673 Install mosh - the mobile shell chris chris
#670 Roll back performance customisations and use stock BOA settings where possible jim jim
#629 Upgrade to BOA-2.1.3 Stable Edition chris chris
#612 Upgrade to BOA-2.1.1 Stable Edition chris chris
#610 Aegir database intensive (migrate, clone, restore) tasks hang for larger sites jim jim
#604 Times for admin tasks chris ed
#599 Server time drift chris chris
#593 Migrating Puffin to a ZFS file server chris chris
#591 Move MySQL temporary directory to tmpfs chris jim
#589 Blocking spammers at a firewall level chris chris
#588 RSS feed caching chris chris
#586 New Relic Monitoring for BOA chris chris
#585 TTech Meeting 5th September 2013 ed chris
#580 php5-fpm starting when puffin boots chris chris
#576 Site down chris ed
#574 EFF: How HTTPS Everywhere affects transitionnetwork.org chris chris
#573 MariaDB 5.5.32 is available for Puffin chris chris
#569 403s served to editors, admin very slow chris ed
#567 Update BOA for new Redis 2.6.14 chris chris
#563 503 Errors chris chris
#555 Load spikes causing the TN site to be stopped for 15 min at a time chris chris
#554 Site slow down and MySQL load increase chris chris
#552 Puffin Downtime 23rd May 2013 chris chris
#549 Support with publishing process mark ed
#547 New Barracuda BOA-2.0.9 Edition available chris chris
#545 Registration page: 502 chris ed
#544 CSF / LDF false positive blocks on Puffin chris chris
#543 Puffin Load Spike chris chris
#535 Upgrade Puffin, Penguin and Parrot from Debian Squeeze to Wheezy chris chris
#531 Disk usage on puffin chris chris
#530 New Barracuda BOA-2.0.8 Edition available chris chris
#529 New Barracuda BOA-2.0.7 Edition available chris chris
#522 Uninstall 'collectd' as redundant in face of Munin setup chris chris
#503 Widget owners cannot see project moderation tab jim ed
#500 Quince shutdown chris chris
#499 MySQL backup dump error on puffin chris chris
#489 Problems with SSL? chris ed
#487 robots.txt files for development sites jim chris
#483 Nginx 502 Bad Gateway Errors with BOA chris chris
#481 Puffin tweaks chris jim
#478 Import TN.org site from Quince to Puffin jim jim
#475 Generate a new SSL certificate chris chris
#472 Quince to Puffin rsync script chris chris
#471 Ttech Skype Meeting 17th December 2012 chris chris
#470 Penguin install and configuration chris chris
#468 Load problems on kiwi and quince chris chris
#466 Puffin install and configuration chris chris
#463 Ttech Skype Meeting 22nd November 2012 chris chris
#421 Subdomains: list from user: normal? chris ed
#420 Varnish Downtime chris chris
#417 Images issue on site chris laura
#409 HTTPS Security Issues chris chris
#408 MySQL InnoDB Changes chris chris
#405 Live server APC settings chris chris
#404 Wild card domain names - *.transitionnetwork.org chris chris
#403 Wild card domain names - *.transitionnetwork.org chris chris
#401 Intransitionmovie.com errors with Google and Paypal laura laura
#398 Host Upgrade to Debian Squeeze chris chris
#397 Live server RAM and disk upgrade chris chris
#396 Migrate MySQL Databases from MyISAM to InnoDB chris chris
#392 PSE Server Upgrade chris chris
#391 PSE tracking, moderation and security chris chris
#390 Apache pcre segfaults chris chris
#386 Domain redirect for the InTransitionmovie domain chris laura
#385 Install new SSL certificate on TN.org chris laura
#370 Problem with nightly MySQL backup load chris chris
#369 Drupal-level performance enhancements jim jim
#301 Upgrade LIVE server to Debian Squeeze chris jim
#287 Live Server Load chris chris
#227 Set up mirroring capability for core data types chris ed
#221 Adding Big Blue Button to TN.org: QUOTE PLEASE chris ed
#218 Debian upgrades and updates chris chris
#165 Security certificate warning on new server chris ed
#147 Migration of live server chris chris
#132 Documentation jim ed
#131 nodequeue: front page add broken jim ed
#130 Remove 'promote' function on items jim ed
#128 Design: links in some blocks wrong colour laura ed
#125 'Facilitator' and 'Speaker' to the user profile editing options jim ed
#124 Live database backups chris chris
#122 email alerts for moderation actions john ed
#121 email alerts for new projects and project updates john ed
#120 Patterns Directory: start a new directory jim ed
#119 Initiatives by number table needs UK *not* broken laura ed
#118 Newsletter - integrate MailChimp with Drupal name fields Content Profile fields jim ed
#117 Forum PathAuto settings not working jim jim
#116 Update all Drupal modules jim jim
#115 Update Drupal core to 6.19, plus the theme and modules to latest jim jim
#114 Change web host, use better Drupal stack, save CO2 with modern VPS + server setup Ed jim
#113 A forum list of all the latest topics jim ed
#112 Swap mollom for reCatpcha on registration jim ed
#111 Adding a map block for directory pages ed ed
#110 Remove RHS blocks from events calendar page jim ed
#109 Commenting process needs tidying up jim ed
#106 extra text for initiative addition form jim ed
#105 Initiative profile editors locked out of their profiles jim ed
#104 Media page: add map to listings jim ed
#103 Managing news install chris ed
#101 27th June 2010 Site Downtime chris chris
#97 Hardware upgrade to GH tier2 servers chris ed

Console Access

There is a Xen shell available for console access, see wiki:XenShell.

Barracuda Octopus Aegir

The server uses Octopus to manage Aegir and the updates to the Transition Network Drupal site. This system is installed and upgraded using Barracuda; the Barracuda Octopus Aegir combination is documented on the BOA wiki.

The BOA install script output has been saved on ticket:466#comment:22

MariaDB

BOA installs MariaDB as the MySQL server using the debs from the MariaDB site.

We have set MySQL to use a RAM disk for temp tables, see ticket:591

php-fpm

Please note that the version of php-fpm that the http://transitionnetwork.org/ site needs to be running to work properly is:

/etc/init.d/php53-fpm 

The config file for it is /opt/local/etc/php53-fpm.conf and when it is running it is listed in top and ps as php-fpm:

ps -lA | grep php
1 S     0 29482     1  0  80   0 - 188067 -     ?        00:00:00 php-fpm
5 S    33 29483 29482  2  80   0 - 205351 -     ?        00:01:32 php-fpm
5 S    33 29484 29482  2  80   0 - 199726 -     ?        00:01:28 php-fpm
...

Please note the settings that we changed from the default BOA ones in /opt/local/etc/php53-fpm.conf below.

When the server booted, another version of php-fpm was also started. It is listed in top and ps as php5-fpm and uses this init script:

/etc/init.d/php5-fpm

It is configured via files in /etc/php5/fpm/. This version should be stopped if it is found to be running:

/etc/init.d/php5-fpm stop

It was stopped from running at runlevel 2 by deleting this symlink (see ticket:560#comment:17):

/etc/rc2.d/S01php5-fpm -> ../init.d/php5-fpm

But that didn't solve the problem, see ticket:580.
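
As a rough sketch of the check described above, the stray daemon could be detected by its exact process name, since the wanted one shows up as php-fpm and the unwanted one as php5-fpm. The function names are hypothetical, and the wrapper assumes that `ps -eo comm=` prints bare command names, as procps does:

```shell
# has_php5_fpm: reads process names on stdin, one per line, and
# succeeds if one of them is exactly "php5-fpm" (so "php-fpm" is ignored).
has_php5_fpm() {
  grep -qx 'php5-fpm'
}

# stop_stray_php5_fpm: hypothetical wrapper, to be run as root.
# Assumes "ps -eo comm=" prints one bare command name per line.
stop_stray_php5_fpm() {
  if ps -eo comm= | has_php5_fpm; then
    /etc/init.d/php5-fpm stop
  fi
}
```

Nothing is stopped until `stop_stray_php5_fpm` is called explicitly.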

Upgrading BOA

The steps are documented in UPGRADE.txt. To upgrade everything run these commands; this process can take around 30 mins:

sudo -i
screen
cd
wget -q -U iCab http://files.aegir.cc/BOA.sh.txt
bash BOA.sh.txt
barracuda up-stable
octopus up-stable all

Upgrade tickets

For upgrades to BOA see ticket:629

nginx config changes

To get the nginx and php-fpm munin stats working, the following code, starting with the "## chris" comment, needs adding to /var/aegir/config/server_master/nginx.conf in the nginx default server section:

#######################################################
###  nginx default server
#######################################################

server {
  limit_conn   limreq 32; # like mod_evasive - this allows max 32 simultaneous connections from one IP address
  listen       *:80;
  server_name  _;
  location / {
     root   /var/www/nginx-default;
     index  index.html index.htm;
  }
## chris
  location /nginx_status {
    stub_status on;
    access_log   off;
    allow 127.0.0.1;
    allow 81.95.52.103;
    deny all;
  }
  location ~ ^/(status|ping)$ {
    fastcgi_pass 127.0.0.1:9090;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_intercept_errors on;
    include fastcgi_params;
    access_log off;
    allow 127.0.0.1;
    deny all;
  }
}

Logs for analysis on penguin (see wiki:WebServerLogs) are generated by adding the following to the http section of the /etc/nginx/nginx.conf file:

  # log for awstats
  log_format apache '$remote_addr - $remote_user [$time_local] "$request" '
                     '$status $body_bytes_sent "$http_referer" '
                     '"$http_user_agent"';
  access_log         /var/log/nginx/awstats.log apache;
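
Because the log format above puts `$status` in a fixed position, a quick tally of response codes can be made with awk. This is a minimal sketch with a hypothetical function name; it assumes `$remote_user` contains no spaces and that the request line has the usual three parts (method, path, protocol), which makes the status the ninth whitespace-separated field:

```shell
# count_status: reads log lines in the "apache" format on stdin and
# prints "STATUS COUNT" pairs sorted by status code.
# Assumption: no spaces in $remote_user and a three-part request line,
# so that $status is field 9.
count_status() {
  awk '{ count[$9]++ } END { for (s in count) print s, count[s] }' | sort -n
}

# Example on the server:
# count_status < /var/log/nginx/awstats.log
```

Lines with malformed requests would land in the wrong bucket, so the output is an approximation rather than an exact count.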

php-fpm config changes

The following lines need uncommenting in /opt/local/etc/php53-fpm.conf:

pm.status_path = /status
ping.path = /ping

The following number of servers needs changing:

;pm.start_servers = 18
pm.start_servers = 4

;pm.max_spare_servers = 18
pm.max_spare_servers = 4
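
The uncommenting step could also be done with sed rather than by hand. The filter below is only a sketch: `enable_status` is a hypothetical name, and it assumes the commented defaults start with ";" directly followed by the directive, in the same style as the pm.start_servers lines shown above.

```shell
# enable_status: filter for php53-fpm.conf content on stdin.
# Assumption: the commented defaults look like ";pm.status_path = /status"
# and ";ping.path = /ping". All other lines pass through unchanged.
enable_status() {
  sed -e 's|^;\(pm\.status_path = /status\)$|\1|' \
      -e 's|^;\(ping\.path = /ping\)$|\1|'
}

# Example: preview the change without touching the file.
# enable_status < /opt/local/etc/php53-fpm.conf | diff /opt/local/etc/php53-fpm.conf -
```

Previewing with diff first keeps the edit reviewable before anything is written back.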

After the edits above have been made nginx and php-fpm need restarting:

/etc/init.d/php53-fpm reload
/etc/init.d/nginx restart

Best check the error log:

tail -f /var/log/php/php53-fpm-error.log

These fixes can be tested like this:

cd /etc/munin/plugins
munin-run phpfpm_connections
munin-run phpfpm_status
munin-run nginx_status 
munin-run nginx_request
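
The four checks above can be wrapped in a loop that reports which plugins fail. This is a sketch with a hypothetical function name; the command used to run each plugin is passed as the first argument so the loop itself does not depend on munin-run:

```shell
# check_plugins RUNNER PLUGIN...: run each plugin through RUNNER,
# print "ok NAME" or "FAIL NAME", and return non-zero if any failed.
check_plugins() {
  runner=$1
  shift
  failed=0
  for p in "$@"; do
    if "$runner" "$p" >/dev/null 2>&1; then
      echo "ok $p"
    else
      echo "FAIL $p"
      failed=1
    fi
  done
  return "$failed"
}

# Example on the server:
# cd /etc/munin/plugins
# check_plugins munin-run phpfpm_connections phpfpm_status nginx_status nginx_request
```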

mysql config changes

These settings in /etc/mysql/my.cnf are changed from the default and these changes don't get clobbered when BOA is upgraded as we have set _CUSTOM_CONFIG_SQL=YES in /root/.barracuda.cnf:

max_connections         = 40

xdrago shell script changes

To disable the clobbering of log files two shell scripts need editing and some lines commenting out (see ticket:555#comment:22):

vim /var/xdrago/clear.sh
:1,$s/^echo rotate/# echo rotate/g
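
The vim substitution above has a direct sed equivalent, which may be easier to repeat after an upgrade. A minimal sketch, with a hypothetical function name, written as a stdin/stdout filter so the result can be previewed first:

```shell
# comment_rotate: same substitution as the vim command, as a filter.
# Lines starting with "echo rotate" gain a "# " prefix; lines that are
# already commented out do not match, so running it twice is harmless.
comment_rotate() {
  sed 's/^echo rotate/# echo rotate/'
}

# Example (preview only, the file is not modified):
# comment_rotate < /var/xdrago/clear.sh | diff /var/xdrago/clear.sh -
```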

To adjust the restarting of nginx and the killing of nginx and php-fpm under heavy loads edit /var/xdrago/second.sh changing these values (see ticket:563#comment:9 and ticket:555#comment:52 and ticket:555#comment:124):

CTL_ONEX_SPIDER_LOAD=2716
CTL_FIVX_SPIDER_LOAD=2716
CTL_ONEX_LOAD=10108
CTL_FIVX_LOAD=6216
CTL_ONEX_LOAD_CRIT=13216
CTL_FIVX_LOAD_CRIT=10885

There is a copy of second.sh in /root so after an upgrade do:

cp /root/second.sh /var/xdrago/
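
To avoid copying when the upgrade left the script untouched, the copy could be made conditional. A sketch using cmp, with a hypothetical function name:

```shell
# restore_if_changed SRC DST: copy SRC over DST only when they differ,
# and print what was done.
restore_if_changed() {
  if cmp -s "$1" "$2"; then
    echo "unchanged"
  else
    cp "$1" "$2" && echo "restored"
  fi
}

# Example after a BOA upgrade:
# restore_if_changed /root/second.sh /var/xdrago/second.sh
```

Running `diff /root/second.sh /var/xdrago/second.sh` first is worthwhile, in case the upgrade changed more than the threshold values.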

System Updates

Don't use the regular Debian tools for updating packages, do this:

barracuda up-stable system

After running the above command to update the system you also need to follow the steps documented above at PuffinServer#UpgradingBOA for php-fpm to get the Munin stats working again.

See also ticket:548#comment:33 for the steps that need to be followed after this to get BOA to work with the Session443 plugin.

CSF / LFD

To restart the firewall script:

csf -r

We have set the following variable in /root/.barracuda.cnf to ensure that the CSF / LFD changes are not clobbered by BOA:

_CUSTOM_CONFIG_CSF=YES

False positives

BOA installs CSF / LFD and automatically blocks IP addresses after too many failed SSH login attempts. If someone is blocked who shouldn't be then they can be unblocked like this:

csf -dr 81.95.52.66

To check if an IP address is blocked:

csf -g 81.95.52.66

See this ticket for problems caused by CSF / LFD blocking the monitoring server: ticket:544

Blocklists

Blocklists are configured in /etc/csf/csf.blocklists and some were enabled on ticket:589

Backupninja

backupninja has been installed and configured to back up to another server in the Sheffield colo. Two backup tasks have been configured in /etc/backup.d/: 10.sys, which backs up system settings such as the list of installed packages, and 20.mysql, which dumps all the mysql databases into /var/backups/mysql and uses /etc/mysql/debian.cnf for authentication. In October 2013 we switched the server's filesystem to a ZFS server on the network (see ticket:593#comment:5); filesystem backups are now done via ZFS snapshots, so the rsync backup was disabled, see ticket:535#comment:22.

Postfix

Two changes were made to the default postfix install: it was set to send root emails out (see ticket:466#comment:23), and it was configured to use TLS with the Transition Network cert (see ticket:466#comment:25).

Nginx

The only change made to the default nginx configuration was to move the key and cert it was using out of the way and symlink to the *.transitionnetwork.org ones, see ticket:466#comment:25.

Handy commands

There are some Bash aliases to quickly get around the system added by JK...

For root:

alias cdtn='cd /data/disk/tn/' # cd to tn directory
alias totn='su -s /bin/bash tn' # log into the tn user

# show file usages
alias duf='du -sk * | sort -n | perl -ne '\''($s,$f)=split(m{\t});for (qw(K M G)) {if($s<1024) {printf("%.1f",$s);print "$_\t$f"; last};$s=$s/1024}'\'

For tn:

alias la='ls -Al --color=auto'
alias lc='ls -ltcr --color=auto'
alias lk='ls -lSr --color=auto'
alias ll='ls -la --group-directories-first --color=auto'
alias lr='ls -lR --color=auto'
alias ls='ls -hF --color=auto'
alias lt='ls -ltr --color=auto'
alias lu='ls -ltur --color=auto'
alias lx='ls -lXB --color=auto'

Vim config

To make vim the default editor for root the following was added to /root/.bashrc:

export EDITOR="vim"

To make config files nicer to read in vim the following was added to /root/.vimrc:

syntax on

And a /root/.vim/filetype.vim file was created with the following in it:

au BufRead,BufNewFile /etc/mysql/my.cnf, set ft=mycnf
autocmd BufRead,BufNewFile /etc/php5/fpm/* set syntax=dosini
autocmd BufRead,BufNewFile /opt/local/etc/php53-fpm.conf set syntax=dosini
au BufRead,BufNewFile /etc/nginx/*,/etc/nginx/conf.d/*,/var/aegir/config/server_master/nginx/*/* set ft=nginx
au BufRead,BufNewFile /data/disk/tn/config/server_master/nginx/vhost.d/* set ft=nginx

A /root/.vim/syntax/ directory was also created. mycnf.vim was downloaded into it from http://cvs.pld-linux.org/cgi-bin/cvsweb.cgi/packages/vim-syntax-mycnf/ and nginx.vim from http://www.vim.org/scripts/script.php?script_id=1886

Migration Tickets

Tickets created during the migration of the http://www.transitionnetwork.org/ site from NewLiveServer to this server: