[[PageOutline(2-5, Table of Contents, floated)]]

= Puffin =

puffin.webarch.net is an 8GB RAM, 14 CPU core Debian virtual server which replaced NewLiveServer and DevelopmentServer for running the [http://www.transitionnetwork.org/ Transition Network] Drupal sites. It went live in early 2013. The server was migrated to run off a ZFS server in October 2013, see ticket:593, and it was upgraded from Squeeze to Wheezy on 17th November 2013, see ticket:535.

It was agreed to call this server {{{puffin}}} at the ttech meeting on 22nd November 2012, see ticket:463. The install and initial configuration of this server was tracked on ticket:466, see also the other PuffinServer#migrationtickets. Other services from the old server were migrated to PenguinServer.

System updates are recorded on ticket:218 and BOA updates on ticket:629.

== Munin Stats ==

Munin stats for the server are available here:

 * https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/

See ticket:555#comment:13 for the notes regarding the installation of the mysql munin stats package.

We did have a trial with New Relic in 2013, see ticket:586, but this isn't ongoing.

Sometimes the IO State graph stops; this can be fixed by deleting the lock files, see ticket:555#IOstategraph.

== HTTP Stats ==

The wiki:PiwikServer generates stats from the humans visiting the server and some of these stats have been made public on wiki:WebStats.

There are some notes on analysing the raw Nginx stats on wiki:WebServerLogs, and [https://penguin.transitionnetwork.org/webalizer/puffin/ Webalizer stats for Puffin] are available using the same username/password as this Trac site.

There is a wiki:ErrorCodeCheck script which emails the total number of HTTP errors each day, see ticket:483#comment:63 for a list of the totals for August, September and October 2013.
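A daily error total of the kind ErrorCodeCheck reports can also be approximated by hand. The following is a minimal sketch and not the wiki:ErrorCodeCheck script itself: it assumes an Apache combined-style access log, where the HTTP status is the ninth whitespace-separated field, and the function name {{{count_http_errors}}} is hypothetical.

```shell
#!/bin/sh
# Sketch only: count HTTP 4xx and 5xx responses in an access log.
# Assumption: Apache combined-style layout, so the status code is field 9.
# count_http_errors is a hypothetical helper, not the ErrorCodeCheck script.
count_http_errors() {
    # Tally each error status, print one "code count" line per status,
    # then a final "total N" line; sort puts the numeric codes first.
    awk '$9 ~ /^[45][0-9][0-9]$/ { total++; per[$9]++ }
         END {
             for (code in per) printf "%s %d\n", code, per[code]
             printf "total %d\n", total + 0
         }' "$1" | sort
}
```

It could be run as {{{count_http_errors /var/log/nginx/awstats.log}}}, assuming that log uses the {{{apache}}} format described under the nginx config changes on this page.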
== Load Spikes ==

The server has been suffering from load spikes which cause the site to be unresponsive for clients. You can see the current status via the [https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/load.html puffin Munin load graph]; note the Max values for the last day, week, month and year.

'''When the load hits 23.28 robots are served 403 Forbidden responses, when the load hits 86.64 maintenance tasks are killed, and when the load hits 113.28 the server terminates until the 5 min load average falls below 93.30.'''

The [ticket:563#second.sh default thresholds] have been changed as they were causing [ticket:555 the site to shut down for 15 min at a time] far too often. The [ticket:555#comment:124 current values] were applied on [ticket:555#comment:126 23rd October 2013].

The server has 14 CPU cores, see [https://en.wikipedia.org/wiki/Load_average#Unix-style_load_calculation Unix-style load calculation]. The current thresholds are generated from these variables in {{{/root/.barracuda.cnf}}}; the commented out values are the default ones:

{{{
#_LOAD_LIMIT_ONE=1444
#_LOAD_LIMIT_TWO=888
_LOAD_LIMIT_ONE=8664
_LOAD_LIMIT_TWO=5328
}}}

These variables are used by the {{{/var/xdrago/second.sh}}} script, which is run every minute via cron and has an internal loop which causes it to run 5 times, waiting 10 seconds between each run. It has the following variables in it (these have been [ticket:555#comment:126 edited from their default values]):

{{{
ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg`
FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg`
CTL_ONEX_SPIDER_LOAD=2328
CTL_FIVX_SPIDER_LOAD=2328
CTL_ONEX_LOAD=8664
CTL_FIVX_LOAD=5328
CTL_ONEX_LOAD_CRIT=11328
CTL_FIVX_LOAD_CRIT=9330
}}}

These values translate to the following loads for comparison to the Munin graphs:

 * ONEX_LOAD: load average over the last minute times 100
 * FIVX_LOAD: load average over the last 5 minutes times 100
 * CTL_ONEX_SPIDER_LOAD: 23.28
 * CTL_FIVX_SPIDER_LOAD: 23.28
 * CTL_ONEX_LOAD: 86.64
 * CTL_FIVX_LOAD: 53.28
 * CTL_ONEX_LOAD_CRIT: 113.28
 * CTL_FIVX_LOAD_CRIT: 93.30

And the logic, translated into English, is:

 1. If the load average over the last minute is greater than 23.28 and less than 86.64 and the nginx high load config isn't in use then start to use it.
 2. Else if the load average over the last 5 mins is greater than 23.28 and less than 53.28 and the nginx high load config isn't in use then start to use it.
 3. Else if the load average over the last minute is less than 23.28 and the load average over the last 5 mins is less than 23.28 and the nginx high load config is in use then stop using it.

 1. If the load average over the last minute is greater than 132.16 and if {{{/var/run/boa_run.pid}}} exists, wait a second, if not kill some maintenance jobs: {{{killall -9 php drush.php wget}}}
 2. Else if the load average over the last 5 mins is greater than 108.85 and if {{{/var/run/boa_run.pid}}} exists, wait a second, if not kill some maintenance jobs: {{{killall -9 php drush.php wget}}}

 1. If the load average over the last minute is greater than 101.08 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}}
 2. Else if the load average over the last 5 mins is greater than 62.16 then kill the web server, {{{killall -9 nginx}}} and {{{killall -9 php-fpm php-cgi}}}
 3. Else restart all the services via {{{/var/xdrago/proc_num_ctrl.cgi}}}

Tickets generated in relation to these issues include:

 * ticket:483 Nginx 502 Bad Gateway Errors with BOA
 * ticket:543 Puffin Load Spike
 * ticket:552 Puffin Downtime 23rd May 2013
 * ticket:554 Site slow down and MySQL load increase
 * ticket:555 Load spikes causing the TN site to be stopped for 15 min at a time
 * ticket:563 503 Errors
 * ticket:569 403s served to editors, admin very slow
 * ticket:576 Site down

A total of 77.5 hours was spent on the tickets listed above. The final one was closed on 15th December 2013 and the total time was added up, see ticket:555#comment:132.

== Tickets ==

Most of the "live server" tickets relate to puffin, but the older ones, prior to ticket number #466, are for previous servers.

=== Current live server tickets ===

[[TicketQuery(status=accepted|new|assigned|reopened&component=Live server,order=id,desc=1,format=table,col=summary|owner|reporter)]]

=== Closed live server tickets ===

[[TicketQuery(status=closed&component=Live server,order=id,desc=1,format=table,col=summary|owner|reporter)]]

== Barracuda Octopus Aegir ==

The server is using [https://drupal.org/project/octopus Octopus] to manage [http://community.aegirproject.org/ Aegir] and also the updates to the Transition Network Drupal site. This system is installed and upgraded using [https://drupal.org/project/barracuda Barracuda]; the Barracuda Octopus Aegir combination is documented on the [http://groups.drupal.org/node/163784 BOA wiki].

The initial BOA install script output has been saved on ticket:466#comment:22 and the updates are now documented on ticket:629.

=== MariaDB ===

The MySQL root password is available in {{{/root/.my.cnf}}}.

Tuning of the MySQL server is being tracked on ticket:587.

We have set MySQL to use a RAM disk for temp tables, see ticket:591.
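Whether the RAM disk directory for the temp tables is in place after a reboot can be checked with a few lines of shell. This is a rough sketch under stated assumptions: the path {{{/run/shm/mysql}}} and the owner {{{mysql}}} are taken from the {{{@reboot}}} entry in the Cron section of this page, the function name {{{check_tmpdir}}} is hypothetical, and {{{stat -c}}} is the GNU coreutils form found on Debian.

```shell
#!/bin/sh
# Sketch only: verify that a temp directory exists and has the expected owner.
# Assumptions: check_tmpdir is a hypothetical helper; stat -c is GNU coreutils.
check_tmpdir() {
    dir="$1"
    owner="$2"
    # The directory is recreated at boot, so it may be missing after a reboot
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # Compare the owning user name with the expected one
    actual=$(stat -c %U "$dir")
    if [ "$actual" != "$owner" ]; then
        echo "wrong owner: $dir is owned by $actual, expected $owner"
        return 1
    fi
    echo "ok: $dir"
}
```

On this server it would be called as {{{check_tmpdir /run/shm/mysql mysql}}}; a non-zero exit status suggests that the {{{mkdir}}} and {{{chown}}} commands from the crontab need running again.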
BOA installs MariaDB as the MySQL server using the debs from the MariaDB site, see {{{/etc/apt/sources.list.d/mariadb.list}}}. These are the current (2013-01-13) packages which are installed (note that only the config files remain for php5-mysql as PHP is now installed from source code by BOA):

{{{
dpkg -l | grep -i mysql
ii libdbd-mysql-perl 4.021-1+b1 amd64 Perl5 database interface to the MySQL database
ii libmysqlclient16 5.1.72-2 amd64 MySQL database client library
ii libmysqlclient18 5.5.34+maria-1~wheezy amd64 Virtual package to satisfy external depends
ii mariadb-common 5.5.34+maria-1~wheezy all MariaDB database common files (e.g. /etc/mysql/conf.d/mariadb.cnf)
ii mysql-common 5.5.34+maria-1~wheezy all MariaDB database common files (e.g. /etc/mysql/my.cnf)
ii mytop 1.6-6 all top like query monitor for MySQL
rc php5-mysql 5.3.27-1~dotdeb.0 amd64 MySQL module for php5
ii python-mysqldb 1.2.3-2 amd64 Python interface to MySQL
}}}

=== Nginx ===

BOA did use Nginx from dotdeb but now it compiles it from source; the dotdeb config files remain:

{{{
dpkg -l | grep -i nginx
rc nginx-common 1.4.1-1~dotdeb.0 all small, powerful, scalable web/proxy server - common files
}}}

The only change made to the default nginx configuration during the initial install was to move the key and cert it was using out of the way and symlink to the *.transitionnetwork.org ones, see ticket:466#comment:25.

The other change made from the default BOA config is to enable Munin graphs, see wiki:PuffinServer#nginxconfigchanges

=== php-fpm ===

Please note that the version of php-fpm that the http://transitionnetwork.org/ site needs to be running to work properly is:

{{{
/etc/init.d/php53-fpm
}}}

The config file for it is {{{/opt/local/etc/php53-fpm.conf}}} and when it is running it is listed in top and ps as php-fpm:

{{{
ps -lA | grep php
1 S 0 29482 1 0 80 0 - 188067 - ? 00:00:00 php-fpm
5 S 33 29483 29482 2 80 0 - 205351 - ? 00:01:32 php-fpm
5 S 33 29484 29482 2 80 0 - 199726 - ? 00:01:28 php-fpm
...
}}}

Please note the settings that we changed from the default BOA ones in {{{/opt/local/etc/php53-fpm.conf}}} below.

When the server boots another version of php-fpm is also started, which is listed in top and ps as php5-fpm, this one:

{{{
/etc/init.d/php5-fpm
}}}

It is configured via files in {{{/etc/php5/fpm/}}}. This version should be stopped if it is found to be running:

{{{
/etc/init.d/php5-fpm stop
}}}

It was stopped from running at runlevel 2 by deleting this symlink (see ticket:560#comment:17):

{{{
/etc/rc2.d/S01php5-fpm -> ../init.d/php5-fpm
}}}

But that didn't solve the problem, see ticket:580.

=== Upgrading BOA ===

The steps are documented in [http://drupalcode.org/project/barracuda.git/blob/HEAD:/docs/UPGRADE.txt UPGRADE.txt]. To upgrade everything run these commands; this process can take around 30 mins:

{{{
sudo -i
screen
cd
wget -q -U iCab http://files.aegir.cc/BOA.sh.txt
bash BOA.sh.txt
barracuda up-stable
octopus up-stable all
}}}

==== Upgrade tickets ====

For upgrades to BOA see ticket:629

 * BOA-2.1.3 ticket:629#comment:8
 * BOA-2.1.1 ticket:612
 * BOA-2.0.9 ticket:547
 * BOA-2.0.8 ticket:530
 * BOA-2.0.7 ticket:529
 * BOA-2.0.5 ticket:466#comment:26

==== nginx config changes ====

To get the nginx and php-fpm munin stats working, the following code, starting at the {{{## chris}}} comment, needs adding to {{{/var/aegir/config/server_master/nginx.conf}}} in the nginx default server section:

{{{
#######################################################
###  nginx default server
#######################################################

server {
  limit_conn   limreq 32; # like mod_evasive - this allows max 32 simultaneous connections from one IP address
  listen       *:80;
  server_name  _;
  location / {
     root   /var/www/nginx-default;
     index  index.html index.htm;
  }
  ## chris
  location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    allow 81.95.52.103;
    deny all;
  }
  location ~ ^/(status|ping)$ {
    fastcgi_pass 127.0.0.1:9090;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_intercept_errors on;
    include fastcgi_params;
    access_log off;
    allow 127.0.0.1;
    deny all;
  }
}
}}}

Logs for analysis on penguin, see wiki:WebServerLogs, are generated via the following being added to the {{{http}}} section of the {{{/etc/nginx/nginx.conf}}} file:

{{{
  # log for awstats
  log_format apache '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent"';
  access_log /var/log/nginx/awstats.log apache;
}}}

==== php-fpm config changes ====

The following lines need uncommenting in {{{/opt/local/etc/php53-fpm.conf}}}:

{{{
pm.status_path = /status
ping.path = /ping
}}}

The following numbers of servers need changing:

{{{
;pm.start_servers = 18
pm.start_servers = 4
;pm.max_spare_servers = 18
pm.max_spare_servers = 4
}}}

After the edits above have been made nginx and php-fpm need restarting:

{{{
/etc/init.d/php53-fpm reload
/etc/init.d/nginx restart
}}}

It is best to check the error log:

{{{
tail -f /var/log/php/php53-fpm-error.log
}}}

These fixes can be tested like this:

{{{
cd /etc/munin/plugins
munin-run phpfpm_connections
munin-run phpfpm_status
munin-run nginx_status
munin-run nginx_request
}}}

==== mysql config changes ====

These settings in {{{/etc/mysql/my.cnf}}} are changed from the default, and these changes don't get clobbered when BOA is upgraded as we have set {{{_CUSTOM_CONFIG_SQL=YES}}} in {{{/root/.barracuda.cnf}}}:

{{{
max_connections         = 40
}}}

==== xdrago shell script changes ====

To disable the clobbering of log files two shell scripts need editing and some lines commenting out (see ticket:555#comment:22):

{{{
vim /var/xdrago/clear.sh
  :1,$s/^echo rotate/# echo rotate/g
}}}

To adjust the restarting of nginx and the killing of nginx and php-fpm under heavy loads edit {{{/var/xdrago/second.sh}}}, changing these values (see ticket:563#comment:9 and ticket:555#comment:52 and ticket:555#comment:124):

{{{
CTL_ONEX_SPIDER_LOAD=2716
CTL_FIVX_SPIDER_LOAD=2716
CTL_ONEX_LOAD=10108
CTL_FIVX_LOAD=6216
CTL_ONEX_LOAD_CRIT=13216
CTL_FIVX_LOAD_CRIT=10885
}}}

There is a copy of {{{second.sh}}} in {{{/root}}} so after an upgrade do:

{{{
cp /root/second.sh /var/xdrago/
}}}

=== System Updates ===

Don't use the regular Debian tools for updating packages, do this:

{{{
barracuda up-stable system
}}}

After running the above command to update the system you also need to follow the steps documented above at PuffinServer#UpgradingBOA for php-fpm to get the Munin stats working again.

See also ticket:548#comment:33 for the steps that need to be followed after this to get BOA to work with the Session443 plugin.

=== CSF / LFD ===

To restart the firewall script:

{{{
csf -r
}}}

We have set the following variable in {{{/root/.barracuda.cnf}}} to ensure that the CSF / LFD changes are not clobbered by BOA:

{{{
_CUSTOM_CONFIG_CSF=YES
}}}

==== False positives ====

BOA installs [http://configserver.com/cp/csf.html CSF / LFD], which automatically blocks IP addresses after too many failed SSH login attempts. If someone is blocked who shouldn't be then they can be unblocked like this:

{{{
csf -dr 81.95.52.66
}}}

To check if an IP address is blocked:

{{{
csf -g 81.95.52.66
}}}

See this ticket for problems caused by CSF / LFD blocking the monitoring server: ticket:544

==== Blocklists ====

Blocklists are configured in {{{/etc/csf/csf.blocklists}}} and some were enabled on ticket:589

== Console and SSH Access ==

There is a Xen shell available for console access, see wiki:XenShell.

For developers and sysadmins there is SSH access, contact [mailto:chris@webarchitects.co.uk chris@webarchitects.co.uk] if you need an account creating.

The server is also running [http://mosh.mit.edu/ Mosh : the mobile shell] which is very handy when your internet connection is poor, for example on a train. Mosh was installed on ticket:673.
== Cron ==

BOA controls the root crontab and any changes made there will be overwritten, so things that would normally be in the root crontab need to go into users' crontabs and use sudo. These are the ones in chris' crontab:

{{{
# delete metche backups which are more than a day old
# see https://tech.transitionnetwork.org/trac/ticket/531
28 11 * * * sudo /usr/local/bin/metche-clean -d
# set the clock after a reboot
# see https://trac.transitionnetwork.org/trac/ticket/599
@reboot sudo rdate -s ntp.demon.co.uk
# create a tmp dir on the ram disk for mysql
# see https://trac.transitionnetwork.org/trac/ticket/591
@reboot sudo mkdir /run/shm/mysql ; sudo chown mysql:mysql /run/shm/mysql
}}}

To edit chris' crontab after logging in as another user:

{{{
sudo -i
export EDITOR=vim
crontab -e -u chris
}}}

== Backupninja ==

backupninja has been installed and configured to back up to another server in the Sheffield colo. Two backup tasks have been configured in {{{/etc/backup.d/}}}: {{{10.sys}}}, which does backups of system settings, like all the packages installed, and {{{20.mysql}}}, which dumps all the mysql databases into {{{/var/backups/mysql}}} and uses {{{/etc/mysql/debian.cnf}}} for authentication.

In October 2013 we switched the server's filesystem to a ZFS server on the network, see ticket:593#comment:5, and now filesystem backups are done via ZFS snapshots so the rsync backup was disabled, see ticket:535#comment:22

== Postfix ==

Two changes were made to the default postfix install: it was set to send root emails out, see ticket:466#comment:23, and it was configured to use TLS with the transition network cert, see ticket:466#comment:25.

== Handy commands ==

There are some Bash aliases to quickly get around the system added by JK...
For {{{root}}}:

{{{
alias cdtn='cd /data/disk/tn/' # cd to tn directory
alias totn='su -s /bin/bash tn' # log into the tn user
# show file usages
alias duf='du -sk * | sort -n | perl -ne '\''($s,$f)=split(m{\t});for (qw(K M G)) {if($s<1024) {printf("%.1f",$s);print "$_\t$f"; last};$s=$s/1024}'\'
}}}

For {{{tn}}}:

{{{
alias la='ls -Al --color=auto'
alias lc='ls -ltcr --color=auto'
alias lk='ls -lSr --color=auto'
alias ll='ls -la --group-directories-first --color=auto'
alias lr='ls -lR --color=auto'
alias ls='ls -hF --color=auto'
alias lt='ls -ltr --color=auto'
alias lu='ls -ltur --color=auto'
alias lx='ls -lXB --color=auto'
}}}

=== Vim config ===

To make vim the default editor for root the following was added to {{{/root/.bashrc}}}:

{{{
export EDITOR="vim"
}}}

To make config files nicer to read in vim the following was added to {{{/root/.vimrc}}}:

{{{
syntax on
}}}

And a {{{/root/.vim/filetype.vim}}} file was created with the following in it:

{{{
au BufRead,BufNewFile /etc/mysql/my.cnf, set ft=mycnf
autocmd BufRead,BufNewFile /etc/php5/fpm/* set syntax=dosini
autocmd BufRead,BufNewFile /opt/local/etc/php53-fpm.conf set syntax=dosini
au BufRead,BufNewFile /etc/nginx/*,/etc/nginx/conf.d/*,/var/aegir/config/server_master/nginx/*/* set ft=nginx
au BufRead,BufNewFile /data/disk/tn/config/server_master/nginx/vhost.d/* set ft=nginx
}}}

And a {{{/root/.vim/syntax/}}} directory was created; {{{mycnf.vim}}} was created in it by downloading it from http://cvs.pld-linux.org/cgi-bin/cvsweb.cgi/packages/vim-syntax-mycnf/ and {{{nginx.vim}}} was downloaded from http://www.vim.org/scripts/script.php?script_id=1886

== Migration Tickets ==

Tickets created during the migration of the http://www.transitionnetwork.org/ site from NewLiveServer to this server:

 * ticket:466 Puffin install and configuration
 * ticket:472 Script to copy files from NewLiveServer to puffin
 * ticket:479 Transfer live transitionnetwork.org site to puffin
 * ticket:480 Transfer news.transitionnetwork.org to puffin
 * ticket:483 Nginx 502 Bad Gateway Errors with BOA, see the summary on ticket:483#comment:46
 * ticket:487 robots.txt files for development sites