Ticket #563 (closed maintenance: fixed)
503 Errors
Reported by:               chris
Owned by:                  chris
Priority:                  major
Milestone:                 Maintenance
Component:                 Live server
Keywords:
Cc:                        ed, jim
Estimated Number of Hours: 0.0
Add Hours to Ticket:       0
Billable?:                 yes
Total Hours:               6.95
Description (last modified by chris)
The BOA /var/xdrago/second.sh script is run every minute via the root crontab. If it detects a certain load level it changes the Nginx config to a "high load" config, which results in bots being served 503 errors when they spider the site. When the load goes higher and hits another threshold, second.sh stops the site; see ticket:555.
Original Description
The site is generating a lot of 503 errors: 83 since 6:30am today, and around 750 yesterday.
Change History
comment:1 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.3
- Total Hours changed from 0.0 to 0.3
comment:2 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.15
- Total Hours changed from 0.3 to 0.45
Following our phone call, ideas we are going to try:
- Jim - firewall
- Chris - php-fpm restarts, try disabling them
- Chris - logging - can we add any more?
comment:3 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 1.4
- Total Hours changed from 0.45 to 1.85
I have changed the following settings in the /opt/local/etc/php53-fpm.conf file to prevent php-fpm restarts, to see if this helps with the problems:
; If this number of child processes exit with SIGSEGV or SIGBUS within the time
; interval set by emergency_restart_interval then FPM will restart. A value
; of '0' means 'Off'.
; Default Value: 0
# chris
#emergency_restart_threshold = 5
emergency_restart_threshold = 0

; Interval of time used by emergency_restart_interval to determine when
; a graceful restart will be initiated. This can be useful to work around
; accidental corruptions in an accelerator's shared memory.
; Available Units: s(econds), m(inutes), h(ours), or d(ays)
; Default Unit: seconds
; Default Value: 0
# chris
#emergency_restart_interval = 1m
emergency_restart_interval = 0

; The number of requests each child process should execute before respawning.
; This can be useful to work around memory leaks in 3rd party libraries. For
; endless request processing specify '0'. Equivalent to PHP_FCGI_MAX_REQUESTS.
; Default Value: 0
# chris
#pm.max_requests = 500
pm.max_requests = 0
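Before restarting it is worth testing the edited config for syntax errors. A quick check, assuming the FPM binary is available on the path as php-fpm (on this box it may be called php53-fpm or live under /opt/local):

# Test the FPM configuration file without starting the daemon.
php-fpm -t -y /opt/local/etc/php53-fpm.conf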
I have restarted php53-fpm.
Updated results from the script that finds 50x errors; these totals are since 6:30am:
Total 502 errors: 7
Total 503 errors: 173
Total 504 errors: 0
The increase from earlier in the day is due to the site going down again at noon. I haven't looked at all the logs for this event, but it looks much like the previous ones documented on ticket:555.
I have set this script to run just before the Nginx logs are rotated. It can also be run on the command line -- Jim, try this:
/usr/local/bin/50x-errors | less
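For reference, something along the lines of the sketch below is roughly what a script like this needs to do. This is only a sketch of the idea, not the actual contents of /usr/local/bin/50x-errors, and it assumes a combined-format access log at /var/log/nginx/access.log with the status code as the ninth whitespace-separated field:

#!/bin/bash
# Sketch only: count 502/503/504 responses in an Nginx access log.
# Assumes the default combined log format, where field 9 is the status code.
LOG=${1:-/var/log/nginx/access.log}
for code in 502 503 504; do
    count=$(awk -v c="$code" '$9 == c' "$LOG" | wc -l)
    echo "Total $code errors: $count"
done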
Looking at the cron jobs which kill and restart critical processes, I think we should consider stopping the second.sh one to see if this helps. The following is a (hopefully) human readable version of what they appear to do.
second.sh
This runs every minute:
* * * * * bash /var/xdrago/second.sh >/dev/null 2>&1
It does things based on the load average. Specifically, if:
awk '{print $1*100}' /proc/loadavg
is greater than 388 and less than 1444, and /data/conf/nginx_high_load_off.conf exists, then nginx is restarted.
Else, if:
awk '{print $2*100}' /proc/loadavg
is greater than 388 and less than 888, and /data/conf/nginx_high_load_off.conf exists, then nginx is restarted.
Else, if the results of the awk commands above are less than 388 and /data/conf/nginx_high_load.conf exists, then /data/conf/nginx_high_load.conf is moved to /data/conf/nginx_high_load_off.conf and nginx is reloaded.
If the value of the first awk command is greater than 1888, or the second is greater than 1555, then these processes are killed:
killall -9 php drush.php wget
If the first awk command is greater than 1444 or the second is greater than 888 then the following is done:
/etc/init.d/nginx stop
killall -9 nginx
sleep 1
killall -9 nginx
/etc/init.d/php-fpm stop
/etc/init.d/php53-fpm stop
killall -9 php-fpm php-cgi
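Put together, my reading of that logic looks roughly like the sketch below. This is NOT the actual BOA code, just a paraphrase of the behaviour described above, and I may have the branches slightly wrong (int() is only there so the shell gets whole numbers to compare):

#!/bin/bash
# Paraphrase of the second.sh behaviour as described above.
ONEX=$(awk '{print int($1*100)}' /proc/loadavg)   # 1-minute load x 100
FIVX=$(awk '{print int($2*100)}' /proc/loadavg)   # 5-minute load x 100

# Critical load: kill command line PHP work.
if [ "$ONEX" -gt 1888 ] || [ "$FIVX" -gt 1555 ]; then
    killall -9 php drush.php wget
fi

if [ "$ONEX" -gt 1444 ] || [ "$FIVX" -gt 888 ]; then
    # Very high load: stop the whole web stack.
    /etc/init.d/nginx stop
    killall -9 nginx; sleep 1; killall -9 nginx
    /etc/init.d/php-fpm stop
    /etc/init.d/php53-fpm stop
    killall -9 php-fpm php-cgi
elif [ "$ONEX" -gt 388 ] || [ "$FIVX" -gt 388 ]; then
    # High load: switch to the "high load" nginx config (bots get 503s).
    if [ -e /data/conf/nginx_high_load_off.conf ]; then
        mv /data/conf/nginx_high_load_off.conf /data/conf/nginx_high_load.conf
        /etc/init.d/nginx restart
    fi
elif [ -e /data/conf/nginx_high_load.conf ]; then
    # Load back below 3.88: switch the high load config off again.
    mv /data/conf/nginx_high_load.conf /data/conf/nginx_high_load_off.conf
    /etc/init.d/nginx reload
fi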
minute.sh
This runs every minute:
* * * * * bash /var/xdrago/minute.sh >/dev/null 2>&1
And checks for segfaults
runner.sh
This runs every minute:
* * * * * bash /var/xdrago/runner.sh >/dev/null 2>&1
And checks that the load is less than 388 and, if it is, runs /var/xdrago/run-tn.
clear.sh
This was clobbering log files but all the clobbering has been commented out.
graceful.sh
This runs at 6pm each day and does things like updating the GeoIP database; it also restarts nginx and php53-fpm.
comment:4 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.5
- Total Hours changed from 1.85 to 2.35
Latest running total of 50x errors for today:
Total 502 errors: 11
Total 503 errors: 261
Total 504 errors: 0
One thought: the server has 14 CPUs, which means a load of 14 is equivalent to a load of 1 on a server with 1 CPU. The BOA scripts, with all the killing of processes that happens when the load reaches various levels, may be tuned for servers with a different hardware configuration: fewer CPUs, less RAM and perhaps SSDs (for example, the log clobbering was justified on the basis that busy servers with presumably very fast but also very small SSDs required it). Perhaps this could be part of the problem?
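A quick way to see the per-core figure at any time, just dividing the load average by the CPU count:

# Normalise the 1-minute load average by the number of CPUs to get a
# per-core figure that is comparable across machines with different core counts.
CPUS=$(grep -c ^processor /proc/cpuinfo)
awk -v cpus="$CPUS" '{printf "1-min load %.2f / %d CPUs = %.2f per core\n", $1, cpus, $1/cpus}' /proc/loadavg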
Looking at the User-Agent strings for the 503 errors, most are bots:
50x-errors | grep "Feedfetcher-Google" | wc -l 26 50x-errors | grep "Baiduspider" | wc -l 10 50x-errors | grep "Googlebot" | wc -l 117 50x-errors | grep "YandexBot" | wc -l 14 50x-errors | grep "bingbot" | wc -l 46 50x-errors | grep "Sogou web spider" | wc -l 8 50x-errors | grep "magpie-crawler" | wc -l 16 50x-errors | grep "Feedly" | wc -l 5 50x-errors | grep "Sosospider" | wc -l 4
In fact, out of the total of 272 502 and 503 errors, only 26 were not served to bots, and of those 26 around half went to RSS readers:
50x-errors | grep -v "Googlebot" | grep -v "Feedfetcher-Google" | grep -v "Baiduspider" | grep -v "YandexBot" | grep -v "bingbot" | grep -v "Sogou web spider" | grep -v "magpie-crawler" | grep -v "Feedly" | grep -v "Sosospider" | grep transitionnetwork | wc -l 26
I'm not sure what the significance of the large number of bot hits is, other than the observation that the site gets a lot more traffic from bots than from people, and that this traffic will never be recorded by Piwik. Perhaps I should generate some stats from the Nginx log files so we get a better picture of how much total traffic the server is getting, not just the human traffic?
I suggest we try:
- Disabling the second.sh script which does a lot of process killing based on load.
- Increasing the number of php53-fpm processes.
- See what happens.
Jim -- would you be happy with trying that?
comment:5 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.5
- Total Hours changed from 2.35 to 2.85
Looking at /opt/local/etc/php53-fpm.conf:
; Set open file descriptor rlimit for the master process.
; Default Value: system defined value
rlimit_files = 1024
This seems quite low, but there is no evidence in the logs that we have hit this limit. There are some notes on increasing it, should we want to: http://www.cyberciti.biz/faq/linux-unix-nginx-too-many-open-files/
I have edited these values to dramatically increase the number of php-fpm processes:
process.max = 100
pm.max_children = 90
pm.start_servers = 20
pm.min_spare_servers = 10
pm.max_spare_servers = 20
pm.process_idle_timeout = 30s;
request_slowlog_timeout = 60s
The thinking (hunch / guess) being that if lots of processes need to be spawned when demand peaks, this might increase the load to the point where the process killing is triggered.
There are now 21 php-fpm processes by default:
ps -lA | grep php | wc -l
21
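As a rough sanity check that 90 children will fit in RAM, a back-of-the-envelope sketch; it assumes the workers show up as php-fpm in ps and that the current average process size is representative:

# Average resident size of the current php-fpm processes x pm.max_children (90).
ps -eo rss=,comm= | awk '/php-fpm/ {sum+=$1; n++}
  END {if (n) printf "avg %.0f MB each, ~%.0f MB if all 90 children are busy\n",
       sum/n/1024, sum/n/1024*90}'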
I haven't commented out the second.sh script -- waiting for Jim's thoughts on whether we should try that.
comment:6 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.6
- Total Hours changed from 2.85 to 3.45
Regarding getting a better idea of what Nginx is serving by processing the logs, how about doing it like this:
- Just before the Nginx access log is rotated (this happens daily), copy it to penguin using ssh and rsync (see the sketch below).
- Set up AWstats on penguin to analyse the log files, http://awstats.sourceforge.net/
- Perhaps generate some Munin stats from the awstats, http://www.gho.no/2009/03/munin-website-traffic-graph-plugin-for-awstats/
The thinking being that penguin has spare resources for log processing, and it will save setting up awstats on parrot.
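For the copy step, something like the following could work. This is only a sketch: the location of the rotated log and the destination directory on penguin are assumptions, not the real paths.

# Run from puffin just after logrotate has produced access.log.1, before it is compressed.
# Assumes /var/log/nginx-puffin/ already exists on penguin.
rsync -az -e ssh /var/log/nginx/access.log.1 \
    penguin.transitionnetwork.org:/var/log/nginx-puffin/access.log.$(date +%F)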
I have been keeping an eye on the Munin stats since increasing the number of php-fpm processes; more RAM is being used, but not more than we have available, see:
- https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/phpfpm_memory.html
- https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/phpfpm_processes.html
We just had another load spike:
uptime
 22:01:44 up 34 days, 22:40, 2 users, load average: 31.81, 12.29, 4.81
A load of 28 on this machine is like a load of 2 on a 1 CPU machine -- a load of 31 isn't enough on this server to start killing processes, but I saw php-fpm killed so I manually restarted it.
IMHO second.sh is killing too many processes, so I have disabled it by commenting it out in the root crontab:
#* * * * * bash /var/xdrago/second.sh >/dev/null 2>&1
Let's see what happens now.
comment:7 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.7
- Total Hours changed from 3.45 to 4.15
These are the times php-fpm was stopped/killed in the last few days:
grep bye-bye php53-fpm-error.log
[19-Jun-2013 12:02:12] NOTICE: exiting, bye-bye!
[19-Jun-2013 14:33:57] NOTICE: exiting, bye-bye!
[19-Jun-2013 17:00:45] NOTICE: exiting, bye-bye!
[19-Jun-2013 21:26:17] NOTICE: exiting, bye-bye!
[19-Jun-2013 22:01:48] NOTICE: exiting, bye-bye!
[19-Jun-2013 22:02:22] NOTICE: exiting, bye-bye!

grep bye-bye php53-fpm-error.log.1
[18-Jun-2013 08:14:30] NOTICE: exiting, bye-bye!
[18-Jun-2013 10:02:57] NOTICE: exiting, bye-bye!
[18-Jun-2013 11:00:54] NOTICE: exiting, bye-bye!
[18-Jun-2013 14:20:58] NOTICE: exiting, bye-bye!
[18-Jun-2013 14:36:25] NOTICE: exiting, bye-bye!
[18-Jun-2013 15:00:36] NOTICE: exiting, bye-bye!
[18-Jun-2013 16:00:49] NOTICE: exiting, bye-bye!
[18-Jun-2013 16:30:00] NOTICE: exiting, bye-bye!
[18-Jun-2013 16:50:54] NOTICE: exiting, bye-bye!
[18-Jun-2013 17:00:53] NOTICE: exiting, bye-bye!
[18-Jun-2013 18:00:30] NOTICE: exiting, bye-bye!
[18-Jun-2013 20:36:30] NOTICE: exiting, bye-bye!
[18-Jun-2013 21:57:53] NOTICE: exiting, bye-bye!
[18-Jun-2013 23:20:41] NOTICE: exiting, bye-bye!
[19-Jun-2013 04:00:27] NOTICE: exiting, bye-bye!
[19-Jun-2013 06:29:00] NOTICE: exiting, bye-bye!

zgrep bye-bye php53-fpm-error.log.2.gz
[17-Jun-2013 08:35:39] NOTICE: exiting, bye-bye!
[17-Jun-2013 10:16:08] NOTICE: exiting, bye-bye!
[17-Jun-2013 10:16:42] NOTICE: exiting, bye-bye!
[17-Jun-2013 12:48:29] NOTICE: exiting, bye-bye!
[17-Jun-2013 13:21:45] NOTICE: exiting, bye-bye!
[17-Jun-2013 14:00:39] NOTICE: exiting, bye-bye!
[17-Jun-2013 15:00:42] NOTICE: exiting, bye-bye!
[17-Jun-2013 15:30:22] NOTICE: exiting, bye-bye!
[17-Jun-2013 15:51:11] NOTICE: exiting, bye-bye!
[17-Jun-2013 17:06:52] NOTICE: exiting, bye-bye!
[17-Jun-2013 23:02:25] NOTICE: exiting, bye-bye!
[18-Jun-2013 01:01:03] NOTICE: exiting, bye-bye!
[18-Jun-2013 05:01:16] NOTICE: exiting, bye-bye!
[18-Jun-2013 06:26:18] NOTICE: exiting, bye-bye!

zgrep bye-bye php53-fpm-error.log.3.gz
[16-Jun-2013 17:22:56] NOTICE: exiting, bye-bye!
[16-Jun-2013 22:03:49] NOTICE: exiting, bye-bye!
[17-Jun-2013 05:00:40] NOTICE: exiting, bye-bye!
[17-Jun-2013 06:00:39] NOTICE: exiting, bye-bye!
[17-Jun-2013 06:26:02] NOTICE: exiting, bye-bye!

zgrep bye-bye php53-fpm-error.log.4.gz
[15-Jun-2013 08:15:51] NOTICE: exiting, bye-bye!
[15-Jun-2013 08:50:41] NOTICE: exiting, bye-bye!
[15-Jun-2013 20:00:26] NOTICE: exiting, bye-bye!
[16-Jun-2013 06:26:31] NOTICE: exiting, bye-bye!

zgrep bye-bye php53-fpm-error.log.5.gz
[14-Jun-2013 13:50:57] NOTICE: exiting, bye-bye!
[14-Jun-2013 14:01:40] NOTICE: exiting, bye-bye!
[14-Jun-2013 14:35:59] NOTICE: exiting, bye-bye!
[15-Jun-2013 01:29:13] NOTICE: exiting, bye-bye!
[15-Jun-2013 01:35:29] NOTICE: exiting, bye-bye!
[15-Jun-2013 01:35:40] NOTICE: exiting, bye-bye!
[15-Jun-2013 01:39:17] NOTICE: exiting, bye-bye!
[15-Jun-2013 01:52:54] NOTICE: exiting, bye-bye!
[15-Jun-2013 02:55:20] NOTICE: exiting, bye-bye!
[15-Jun-2013 02:56:40] NOTICE: exiting, bye-bye!
[15-Jun-2013 05:20:34] NOTICE: exiting, bye-bye!
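A quicker way to get a per-day count across all of these (assuming the log names above and the [DD-Mon-YYYY HH:MM:SS] prefix):

# Count the php-fpm "bye-bye" exits per day across the current and rotated logs.
( cat php53-fpm-error.log php53-fpm-error.log.1 2>/dev/null
  zcat php53-fpm-error.log.*.gz 2>/dev/null ) \
    | grep 'bye-bye' \
    | sed 's/^\[\([^ ]*\).*/\1/' \
    | sort | uniq -c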
When chatting with Jim on the phone today he wondered about the /var/xdrago/guest-fire.sh script; this runs every minute, checks logs and blocks hosts using the firewall. I haven't yet found where this script is called from.
We haven't yet solved the issue with the 503 errors -- the last one was 1 min ago. These are the latest totals for today:
Total 502 errors: 12
Total 503 errors: 522
Total 504 errors: 0
More tomorrow...
comment:8 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.6
- Total Hours changed from 4.15 to 4.75
We are getting lots more 503 errors; the totals for today are now:
Total 502 errors: 12
Total 503 errors: 823
Total 504 errors: 0
The load is very low:
uptime
 23:44:04 up 35 days, 22 min, 2 users, load average: 0.13, 0.21, 0.25
uptime
 23:54:10 up 35 days, 32 min, 2 users, load average: 0.03, 0.22, 0.28
uptime
 23:59:23 up 35 days, 37 min, 2 users, load average: 0.24, 0.27, 0.27
uptime
 00:02:10 up 35 days, 40 min, 2 users, load average: 0.36, 0.31, 0.28
I have set this variable back to where it was:
; The timeout for serving a single request after which a PHP backtrace will be
; dumped to the 'slowlog' file. A value of '0s' means 'off'.
; Available units: s(econds)(default), m(inutes), h(ours), or d(ays)
; Default Value: 0
; chris
request_slowlog_timeout = 30s
;request_slowlog_timeout = 60s
We have had nothing in the php error log since I increased it to 60 seconds.
Commenting out the second.sh script hasn't solved the 503 problem. I was tempted to leave it commented out overnight and reassess in the morning, as I'm not sure what will happen when there is a spike in the load with it not running; however, as the 503s are stacking up faster than ever, I'm going to uncomment it again to see if that reduces them.
Again, the clients getting the 503s are mostly bots -- perhaps this is due to some BOA script(s) intentionally denying access to bots?
I'm not sure if we should also be considering tweaking some mysql settings:
/usr/local/bin/mysqltuner.pl

 >>  MySQLTuner 1.2.0 - Major Hayden <major@mhtx.net>
 >>  Bug reports, feature requests, and downloads at http://mysqltuner.com/
 >>  Run with '--help' for additional options and output filtering
[OK] Logged in using credentials from debian maintenance account.

-------- General Statistics --------------------------------------------------
[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.5.31-MariaDB-1~squeeze-log
[OK] Operating on 64-bit architecture

-------- Storage Engine Statistics -------------------------------------------
[--] Status: +Archive -BDB +Federated +InnoDB -ISAM -NDBCluster
[--] Data in MyISAM tables: 104M (Tables: 2)
[--] Data in InnoDB tables: 452M (Tables: 1031)
[--] Data in PERFORMANCE_SCHEMA tables: 0B (Tables: 17)
[!!] Total fragmented tables: 100

-------- Security Recommendations -------------------------------------------
[OK] All database users have passwords assigned

-------- Performance Metrics -------------------------------------------------
[--] Up for: 1d 12h 22m 0s (8M q [68.424 qps], 222K conn, TX: 16B, RX: 1B)
[--] Reads / Writes: 88% / 12%
[--] Total buffers: 1.1G global + 13.4M per thread (75 max threads)
[OK] Maximum possible memory usage: 2.1G (52% of installed RAM)
[OK] Slow queries: 0% (44/8M)
[OK] Highest usage of available connections: 74% (56/75)
[OK] Key buffer size / total MyISAM indexes: 509.0M/92.3M
[OK] Key buffer hit rate: 98.1% (15M cached / 281K reads)
[OK] Query cache efficiency: 73.0% (5M cached / 8M selects)
[!!] Query cache prunes per day: 937872
[OK] Sorts requiring temporary tables: 1% (4K temp sorts / 243K sorts)
[!!] Joins performed without indexes: 10338
[!!] Temporary tables created on disk: 29% (83K on disk / 285K total)
[OK] Thread cache hit rate: 99% (56 created / 222K connections)
[!!] Table cache hit rate: 0% (128 open / 60K opened)
[OK] Open file limit used: 0% (4/196K)
[OK] Table locks acquired immediately: 99% (3M immediate / 3M locks)
[OK] InnoDB data size / buffer pool: 452.9M/509.0M

-------- Recommendations -----------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
    Adjust your join queries to always utilize indexes
    When making adjustments, make tmp_table_size/max_heap_table_size equal
    Reduce your SELECT DISTINCT queries without LIMIT clauses
    Increase table_cache gradually to avoid file descriptor limits
Variables to adjust:
    query_cache_size (> 64M)
    join_buffer_size (> 1.0M, or always use indexes with joins)
    tmp_table_size (> 64M)
    max_heap_table_size (> 128M)
    table_cache (> 128)
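Before changing anything it is probably worth noting the current values of the variables mysqltuner flags. A sketch only -- it assumes the debian-sys-maint credentials in /etc/mysql/debian.cnf also work for the command line client, which is how mysqltuner logged in above:

# Show the current settings that mysqltuner suggests raising.
mysql --defaults-file=/etc/mysql/debian.cnf -e \
  "SHOW VARIABLES WHERE Variable_name IN
   ('query_cache_size','join_buffer_size','tmp_table_size',
    'max_heap_table_size','table_open_cache','table_cache');"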
Really off to bed now...
comment:9 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.5
- Total Hours changed from 4.75 to 5.25
One final thought: the 503 errors and the periods of downtime reported in ticket:555 might be unrelated -- I suspect that the down times are caused by second.sh killing almost everything when the load gets to 14.44 -- it does the following:
/etc/init.d/nginx stop
killall -9 nginx
sleep 1
killall -9 nginx
/etc/init.d/php-fpm stop
/etc/init.d/php53-fpm stop
killall -9 php-fpm php-cgi
Which seems totally crazy to me -- just because the load reaches 14 we shouldn't brutally kill all the key processes that run the site should we?
Running total for the 503s:
Total 502 errors: 12
Total 503 errors: 1570
Total 504 errors: 0
I have commented second.sh out again, but I think we should experiment with the thresholds at which things are killed -- I think perhaps they are set for servers with a lot fewer than 14 CPUs. Jim, what values do you have on your server for these in /var/xdrago/second.sh? Are these values hard coded irrespective of the number of CPUs, or are they generated by BOA based on the number of CPUs?
CTL_ONEX_SPIDER_LOAD=388
CTL_FIVX_SPIDER_LOAD=388
CTL_ONEX_LOAD=1444
CTL_FIVX_LOAD=888
CTL_ONEX_LOAD_CRIT=1888
CTL_FIVX_LOAD_CRIT=1555
Note how the load values are multiplied by 100:
ONEX_LOAD=`awk '{print $1*100}' /proc/loadavg`
FIVX_LOAD=`awk '{print $2*100}' /proc/loadavg`
So for example:
uptime ; awk '{print $1*100}' /proc/loadavg
 00:31:03 up 35 days, 1:09, 2 users, load average: 0.50, 0.42, 0.36
50
So a load of 0.5 corresponds to 50 in terms of the CTL_ONEX_ variables; see ticket:563#comment:3 for some more detail on what second.sh does.
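For reference, translating those thresholds back into plain load averages (my reading, based on the behaviour described in comment:3):

# CTL_* values are load average x 100, so on this box they mean roughly:
#   CTL_ONEX_SPIDER_LOAD=388 / CTL_FIVX_SPIDER_LOAD=388 -> load  3.88         (serve bots the 503 "high load" config)
#   CTL_ONEX_LOAD=1444       / CTL_FIVX_LOAD=888        -> load 14.44 /  8.88 (stop nginx and php-fpm)
#   CTL_ONEX_LOAD_CRIT=1888  / CTL_FIVX_LOAD_CRIT=1555  -> load 18.88 / 15.55 (killall php, drush.php, wget)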
Do we also need to look at the redis settings? This is all we get each day in the redis log, from when it's restarted:
[52978] 19 Jun 01:26:29.201 # User requested shutdown...
[52978] 19 Jun 01:26:30.366 # Redis is now ready to exit, bye bye...
[24363] 19 Jun 01:26:31.524 # Server started, Redis version 2.6.13
[24363] 19 Jun 01:26:31.524 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
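The overcommit warning at least has a straightforward fix, exactly as the log message itself suggests:

# Apply immediately, then persist across reboots.
sysctl vm.overcommit_memory=1
echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf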
comment:10 Changed 3 years ago by jim
503s are a red herring I think... The rest of the stuff on this ticket is good though Chris.
Bots/503s come from https://github.com/omega8cc/nginx-for-drupal/blob/master/aegir/conf/nginx_modern_include.conf#L17 and https://github.com/omega8cc/nginx-for-drupal/blob/master/aegir/conf/nginx_high_load_off.conf
It's deliberate, and a good idea IMHO, to prevent bots getting to the site when it's really busy. BUT! Clearly the detection of the high load is squiffy, hence the issues in #555 and your investigation of second.sh.
So the strategy should be to make the BOA scripts work with the correct thresholds. There's a lot of good stuff (with the occasional headscratcher/WTF moment, I agree!) wrapped up in these scripts, from a team that hosts a LOT of fast Drupal sites. Clearly they don't have 14 cores, so the detection is off.
We need to either find the variable that sets this, or manually change the thresholds ourselves.
To get going faster, I suggest you search in GitHub rather than manually looking at the scripts... e.g. two searches:
comment:11 Changed 3 years ago by jim
- Add Hours to Ticket changed from 0.0 to 0.5
- Total Hours changed from 5.25 to 5.75
comment:12 Changed 3 years ago by jim
I'll look at the firewall late tonight... per comment 2.
comment:13 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.5
- Priority changed from critical to major
- Status changed from new to accepted
- Description modified (diff)
- Total Hours changed from 5.75 to 6.25
This is a summary of recent results from the /usr/local/bin/50x-errors script, which is run every day just before the nginx logs are rotated:
Date: Wed, 19 Jun 2013 14:53:27 +0100
Subject: 7 502, 173 503 and 0 504 errors from puffin.webarch.net

Date: Thu, 20 Jun 2013 06:25:29 +0100
Subject: 12 502, 1570 503 and 0 504 errors from puffin.webarch.net

Date: Fri, 21 Jun 2013 06:25:20 +0100
Subject: 464 502, 65 503 and 1 504 errors from puffin.webarch.net

Date: Sat, 22 Jun 2013 06:25:24 +0100
Subject: 0 502, 149 503 and 1 504 errors from puffin.webarch.net

Date: Sun, 23 Jun 2013 06:25:47 +0100
Subject: 1 502, 212 503 and 0 504 errors from puffin.webarch.net

Date: Mon, 24 Jun 2013 06:25:16 +0100
Subject: 2 502, 103 503 and 1 504 errors from puffin.webarch.net
It's worth noting that since the downtime on 20th June (which appears in the stats above under 21st June, as the stats are generated at 6:25am) there have been very few 502 or 504 errors. The 503 errors over the last few days have been generated while the site is in high load mode, and this has been happening quite often -- see the spikes on the graphs here: ticket:555#comment:54.
The thresholds in second.sh were initially multiplied by four, see ticket:555#comment:43, and then by five, see ticket:555#comment:52, and the RAM was doubled after the 20th June downtime, from 4GB to 8GB.
I think it would be worth doing a bit more work on the 50x-errors script -- if it counted, sorted and listed the user agents that were triggering the 503 errors, this would be a good check that the error really is only being served to bots.
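Something along these lines might do it -- a sketch only, assuming the script's output is raw access-log lines with the user agent as the last quoted field:

# Count today's 503s per user agent, most frequent first.
/usr/local/bin/50x-errors \
    | grep ' 503 ' \
    | awk -F'"' '{print $(NF-1)}' \
    | sort | uniq -c | sort -rn | head -20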
Apart from that this ticket is probably ready to be closed.
comment:14 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.5
- Total Hours changed from 6.25 to 6.75
I have done some work on the script which checks for web server error codes each day and emails a summary; it's now documented here: wiki:ErrorCodeCheck.
Yesterday was quite a bad day (see ticket:569); the script reported:
Date: Wed, 10 Jul 2013 06:25:28 +0100
Subject: 770 403, 4219 404, 12 502, 2271 503 and 0 504 errors from puffin.webarch.net
comment:15 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.2
- Status changed from accepted to closed
- Resolution set to fixed
- Total Hours changed from 6.75 to 6.95
Since early August there has only been one day on which there were more than 100 502 errors; however, there have been days when the 503 errors have topped 6k, see the list posted on ticket:483#comment:63.
I have added a new section to the Puffin wiki page, wiki:PuffinServer#HTTPStats, and the whole server load issue has been followed up on ticket:555 and documented at wiki:PuffinServer#LoadSpikes.
I think this ticket can now be closed.
I have copied the script we have to find 502 errors in the nginx logs and email them out once a day, /usr/local/bin/find-502s, to /usr/local/bin/50x-errors and changed it to search for 502, 503 and 504 errors.
So far today:
I'm on the phone with Jim now.