Ticket #483 (closed defect: fixed)

Opened 4 years ago

Last modified 3 years ago

Nginx 502 Bad Gateway Errors with BOA

Reported by: chris Owned by: chris
Priority: major Milestone: Maintenance
Component: Live server Keywords:
Cc: ed, jim Estimated Number of Hours: 0.0
Add Hours to Ticket: 0 Billable?: yes
Total Hours: 16.35

Description

The Barracuda Octopus Aegir server on puffin.webarch.net is getting generating a lot of Nginx 502 Bad Gateway errors, see ticket:466#comment:31 and ticket:466#comment:38

The Nginx logs contain 58 502 errors:

cd /var/log/nginx
grep " 502 " * | wc -l
58

I expect the cause of these is Nginx asking PHP-FPM for a page and not getting one, an answer might be to either speed up PHP-FPM response time or adjust the Nginx settings so that it waits longs for a response from PHP-FPM.

There are several threads about this on http://groups.drupal.org/ but none of them have any useful suggestions (eg 1 2 3 4).

Attachments

multips_memory-month.png (30.1 KB) - added by chris 4 years ago.
multips memory usage to 7th March 2013
phpfpm_average-month.png (35.5 KB) - added by chris 4 years ago.
phpfpm average process size to 7th March 2013
phpfpm_memory-month.png (29.8 KB) - added by chris 4 years ago.
phpfpm memory usage to 7th March 2013
multips_memory-week.png (25.4 KB) - added by chris 4 years ago.
multips memory usage 7th to 11th March 2013
phpfpm_average-week.png (25.7 KB) - added by chris 4 years ago.
phpfpm average process 7th to 11th March 2013
phpfpm_memory-week.png (36.9 KB) - added by chris 4 years ago.
phpfpm memory usage 7th to 11th March 2013
cpu-day_2013-03-12..png (22.4 KB) - added by chris 4 years ago.
cpu load spike 2013-03-12
load-day_2013-03-12..png (24.1 KB) - added by chris 4 years ago.
firewall spike 2013-03-12
mysql_queries-day_2013-03-12.png (34.2 KB) - added by chris 4 years ago.
mysql spike 2013-03-12
phpfpm_connections-day_2013-03-12..png (21.1 KB) - added by chris 4 years ago.
php-fpm spike 2013-03-12

Change History

comment:1 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 0.0 to 0.25

comment:2 follow-up: ↓ 3 Changed 4 years ago by jim

It's not a timeout, the 502s are instant.

It seems to begin, or at least centre around the 'Aborted connection' issue I mentioned. And once it starts happening, it keeps doing so. It appears to need a service mysql restart to fix it. Restarting nginx, php52-fpm or php53-fpm does nothing AFAIK.

That it's doing it this morning on http://newlive.puffin.webarch.net/ means the 'fresh start' given it by reinstalling PHP and updating BOA to HEAD did not work. Which, given the lack of issues on the Barracuda issue tracker, makes it most likely to be a configuration issue on Puffin.

In fact, since enabling SSL we have an issue whereby https://master.puffin.webarch.net/ redirects to 'install.php' with an '403 Forbidden'. So my current hunch is that the SSL work, or any changes done to any NGINX, PHP or MySQL config are the cause of the issue.

I propose we revert to stock settings and work from a clear base.

comment:3 in reply to: ↑ 2 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.4
  • Total Hours changed from 0.25 to 0.65

Replying to jim:

It's not a timeout, the 502s are instant.

I'm not so sure, it appears to me that the problem is that Nginx doesn't get a page from PHP-FPM, this is what the status code means:

502 Bad Gateway

The server, while acting as a gateway or proxy, received an invalid response from the upstream server it accessed in attempting to fulfill the request.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.5.3

It seems to begin, or at least centre around the 'Aborted connection' issue I mentioned.

Have you a reference to these?

And once it starts happening, it keeps doing so. It appears to need a service mysql restart to fix it. Restarting nginx, php52-fpm or php53-fpm does nothing AFAIK.

If it is related to the database then we should look at the Maria DB settings to try to increase response times.

That it's doing it this morning on http://newlive.puffin.webarch.net/ means the 'fresh start' given it by reinstalling PHP and updating BOA to HEAD did not work.

I wouldn't expect reinstalling BOA to change anything.

In fact, since enabling SSL we have an issue whereby https://master.puffin.webarch.net/ redirects to 'install.php' with an '403 Forbidden'. So my current hunch is that the SSL work, or any changes done to any NGINX, PHP or MySQL config are the cause of the issue.

The wrong SSL certificate is being served on that domain, it's using the self generated one rather than the Gandi.net one.

I propose we revert to stock settings and work from a clear base.

I don't see the need for that, isn't all that has changed from the "stock settings" is a couple of sym links, see ticket:466#comment:39 ?

comment:4 follow-up: ↓ 5 Changed 4 years ago by jim

It seems to begin, or at least centre around the 'Aborted connection' issue I mentioned.

Have you a reference to these?

Yes, ticket:466#comment:29

If it is related to the database then we should look at the Maria DB settings to try to increase response times.

Except, according to syslog when I restart it, it's Percona, not MariaDB. I use MariaDB on my server, never tried Percona -- weren't we on MariaDB at one point, or did I imagine that? If not, we should change the setup to use MariaDB as the MySQL server instead to rule out bugs there. I've seen issues with Percona before.

I wouldn't expect reinstalling BOA to change anything.

Well I told it to reinstall PHP which is why I ran the installer, and it should run scripts that restart everything as part of that... Clean slate etc.

The wrong SSL certificate is being served on that domain, it's using the self generated

Because I gave it 'transitionnetwork.org' assuming we had a wildcard cert, not 'www...' When Aegir is fixed (by me disabling SSL support) we can try this again with the right settings.

---

FYI last night installed tcpdump and started to do the debugging process at http://www.mysqlperformanceblog.com/2008/08/23/how-to-track-down-the-source-of-aborted_connects/, but ran out of time/energy. We need to establish if it's PHP or MySQL doing this, and main hurdle here is the lack of good log information

According to Omega8cc (and paraphrasing since I don't have the link to hand): "502 errors are caused by PHP or MySQL segfaulting or dropping off partway though a response. Barracuda works out of the box with no problems, so if you're seeing these its not BOA or NGINX, and you might need to reinstall". or words to that effect!

Finally, I think this might come down to a simple connectivity issue between the database, 'localhost' and the IP address -- perhaps something is being blocked or does not have permission via one or another connection route? But, then why would it be fine for ages, then plagued by the issue... Does that sound possible?


Rambling aside..

My proposed debug/plan to fix, in order of pain level with nuclear option last:

  1. Debug issue -- not sure how best to do this?
  2. Revert SSL: Jim to disable Aegir SSL support (via SSH/Drush since we're being redirected to 'install.php'. Test newlive and master.
  3. Reinstall stuff: Run installer with _SQL_FORCE_REINSTALL=YES and _NGX_FORCE_REINSTALL=YES; Test newlive and master sites.
  4. Change DB: Replace Percona with MariaDB in .barracuda.cnf, set _FULL_FORCE_REINSTALL=YES; run reinstall; Test newlive and master.1. Chris to roll back all SSL changes, and any other NGINX/PHP/MYSQL changes done (if there are some); Test newlive and master.
  5. NUCLEAR: Wipe Puffin, re-run setup with NO non-essential changes to .barracuda.cnf whatsoever (i.e. email only), get TN site imported/Aegir running; THEN set up SSL and backups and leave well alone.

What do you think, Chris, got anything you want to add to that list.

Despite 'nuclear' being bad', there's nothing we need on the server provided the Quince->Puffin script is backed up (put in GitHub?), all my work has been to prep/document/setup Github and I need to do one last run through anyway.

Even if we completely reinstalled, that should only take an hour or so, and most of that is waiting for it to finish. So that's the backup and should be weighed against spending too long trying to fix an otherwise empty server.

Chris?

comment:5 in reply to: ↑ 4 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 0.65 to 1.15

Replying to jim:

It seems to begin, or at least centre around the 'Aborted connection' issue I mentioned.

Have you a reference to these?

Yes, ticket:466#comment:29

There are quite a lot of these:

grep "Aborted" /var/log/daemon* | wc -l
91


If it is related to the database then we should look at the Maria DB settings to try to increase response times.

Except, according to syslog when I restart it, it's Percona, not MariaDB.

It's MariaDB:

which mysqld
  /usr/sbin/mysqld
dpkg -S /usr/sbin/mysqld 
  mariadb-server-core-5.5: /usr/sbin/mysqld

I'm not sure to what, if any, extent BOA has modified the config files in /etc/mysql/ but we could look at tweaking settings there to improve performance.

I use MariaDB on my server, never tried Percona -- weren't we on MariaDB at one point, or did I imagine that? If not, we should change the setup to use MariaDB as the MySQL server instead to rule out bugs there. I've seen issues with Percona before.

I don't know where you got that idea from, the BOA uses MariaDB.

Finally, I think this might come down to a simple connectivity issue between the database, 'localhost' and the IP address -- perhaps something is being blocked or does not have permission via one or another connection route? But, then why would it be fine for ages, then plagued by the issue... Does that sound possible?

Not really, where is the file with the database settings for the site?

My proposed debug/plan to fix, in order of pain level with nuclear option last:

  1. Debug issue -- not sure how best to do this?

I'd suggest adjusing PHP-FPM and MariaDB settings.

  1. Revert SSL: Jim to disable Aegir SSL support (via SSH/Drush since we're being redirected to 'install.php'. Test newlive and master.
  2. Reinstall stuff: Run installer with _SQL_FORCE_REINSTALL=YES and _NGX_FORCE_REINSTALL=YES; Test newlive and master sites.
  3. Change DB: Replace Percona with MariaDB in .barracuda.cnf, set _FULL_FORCE_REINSTALL=YES; run reinstall; Test newlive and master.1. Chris to roll back all SSL changes, and any other NGINX/PHP/MYSQL changes done (if there are some); Test newlive and master.
  4. NUCLEAR: Wipe Puffin, re-run setup with NO non-essential changes to .barracuda.cnf whatsoever (i.e. email only), get TN site imported/Aegir running; THEN set up SSL and backups and leave well alone.

What do you think, Chris, got anything you want to add to that list.

I'm not convinced that any of these will help, I think we need to look at the PHP-FPM and MariaDB settings, and perhaps the Nginx settings as well.

Despite 'nuclear' being bad', there's nothing we need on the server provided the Quince->Puffin script is backed up (put in GitHub), all my work has been to prep/document/setup Github and I need to do one last run through anyway.

Even if we completely reinstalled, that should only take an hour or so, and most of that is waiting for it to finish. So that's the backup and should be weighed against spending too long trying to fix an otherwise empty server.

I expect it would take a bit longer and all it would result in would be you being assured that the problems are not as a result of anything we have done, but perhaps this is a good enough reason to do it.

comment:6 follow-up: ↓ 8 Changed 4 years ago by jim

  • MariaDB: Ok, great! It said 'Percona' a bunch of times in syslog when I restarted MySQL.
  • Settings: These should just work, that's the point of using BOA, and why I'm confused... That we have to change them implies the original install wasn't quite right, which is why I thought a reinstall might correct things. Happy to stay with what we have for now if you've happy you're not changed much.

I wonder if it's caused by the daily tasks for looking after sites? In that it was fine, then the issue came back the next day... Perhaps something in the new TN.org site is making it flaky somehow. Hmm...

I'll investigate along these lines and later raise an issue on the Barracuda issues list to get some support from the horse's mouth.

Last edited 4 years ago by jim (previous) (diff)

comment:7 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.45
  • Total Hours changed from 1.15 to 1.6

Looking at the PHP settings, the key files are the ones in /etc/php5/fpm/ where there is php-fpm.conf which doesn't contain much apart from an include:

cat /etc/php5/fpm/php-fpm.conf | grep -v '^;' | grep -v "^[[:space:]]*$"

[global]
pid = /var/run/php5-fpm.pid
error_log = /var/log/php5-fpm.log
include=/etc/php5/fpm/pool.d/*.conf

The file included is /etc/php5/fpm/pool.d/www.conf, which contains:

cat www.conf | grep -v '^;' | grep -v "^[[:space:]]*$"

[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = dynamic
pm.max_children = 5
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 3
chdir = /

There is also a symlink at /etc/php5/fpm/php.ini to /opt/local/etc/php53.ini which contains:

cat /opt/local/etc/php53.ini | grep -v '^;' | grep -v "^[[:space:]]*$"

[PHP]
display_errors = Off
user_ini.filename =
engine = On
short_open_tag = On
asp_tags = Off
precision = 14
y2k_compliance = On
output_buffering = 4096
zlib.output_compression = Off
implicit_flush = Off
unserialize_callback_func =
serialize_precision = 100
allow_call_time_pass_reference = Off
safe_mode = Off
safe_mode_gid = Off
safe_mode_include_dir =
safe_mode_exec_dir =
safe_mode_allowed_env_vars = PHP_
safe_mode_protected_env_vars = LD_LIBRARY_PATH
open_basedir = ".:/data:/mnt:/opt:/srv:/tmp:/usr:/var/aegir:/var/lib/collectd:/var/lib/nginx:/var/run:/var/second:/var/www"
disable_functions = "allow_url_fopen,allow_url_include,link,passthru,posix_getpwnam,posix_getpwuid,posix_getrlimit,posix_uname,proc_nice,symlink"
disable_classes =
realpath_cache_size=8M
realpath_cache_ttl=180
zend.enable_gc = On
expose_php = On
max_execution_time = 300
max_input_time = 300
max_input_vars = 9999
memory_limit = 256M
error_reporting = E_ALL & ~E_DEPRECATED
display_errors = Off
display_startup_errors = Off
log_errors = On
log_errors_max_len = 1024
ignore_repeated_errors = Off
ignore_repeated_source = Off
report_memleaks = On
track_errors = Off
html_errors = Off
error_log = /var/log/php/error_log_53
variables_order = "GPCS"
request_order = "GP"
register_globals = Off
register_long_arrays = Off
register_argc_argv = Off
auto_globals_jit = On
post_max_size = 100M
magic_quotes_gpc = Off
magic_quotes_runtime = Off
magic_quotes_sybase = Off
auto_prepend_file =
auto_append_file =
default_mimetype = "text/html"
include_path    =  ".:/opt/local/lib/php:/usr/lib/php5"
doc_root =
user_dir =
extension_dir = "/opt/local/lib/php/extensions/no-debug-non-zts-20090626/"
enable_dl = Off
file_uploads = On
upload_tmp_dir = /tmp
upload_max_filesize = 100M
max_file_uploads = 50
allow_url_fopen = On
allow_url_include = Off
default_socket_timeout = 60

[Date]
date.timezone = "UTC"

[filter]

[iconv]

[intl]

[sqlite]

[sqlite3]

[Pcre]

[Pdo]

[Pdo_mysql]
pdo_mysql.cache_size = 2000
pdo_mysql.default_socket=

[Phar]

[Syslog]
define_syslog_variables  = Off

[mail function]
sendmail_path = /usr/sbin/sendmail -t -i
mail.add_x_header = Off

[SQL]
sql.safe_mode = Off

[ODBC]
odbc.allow_persistent = On
odbc.check_persistent = On
odbc.max_persistent = -1
odbc.max_links = -1
odbc.defaultlrl = 4096
odbc.defaultbinmode = 1

[Interbase]
ibase.allow_persistent = 1
ibase.max_persistent = -1
ibase.max_links = -1
ibase.timestampformat = "%Y-%m-%d %H:%M:%S"
ibase.dateformat = "%Y-%m-%d"
ibase.timeformat = "%H:%M:%S"

[MySQL]
mysql.allow_local_infile = On
mysql.allow_persistent = On
mysql.cache_size = 2000
mysql.max_persistent = -1
mysql.max_links = -1
mysql.default_port =
mysql.default_socket =
mysql.default_host =
mysql.default_user =
mysql.default_password =
mysql.connect_timeout = 60
mysql.trace_mode = Off

[MySQLi]
mysqli.max_persistent = -1
mysqli.allow_persistent = On
mysqli.max_links = -1
mysqli.cache_size = 2000
mysqli.default_port = 3306
mysqli.default_socket =
mysqli.default_host =
mysqli.default_user =
mysqli.default_pw =
mysqli.reconnect = Off

[mysqlnd]
mysqlnd.collect_statistics = On
mysqlnd.collect_memory_statistics = Off

[OCI8]

[PostgreSQL]
pgsql.allow_persistent = On
pgsql.auto_reset_persistent = Off
pgsql.max_persistent = -1
pgsql.max_links = -1
pgsql.ignore_notice = 0
pgsql.log_notice = 0

[Sybase-CT]
sybct.allow_persistent = On
sybct.max_persistent = -1
sybct.max_links = -1
sybct.min_server_severity = 10
sybct.min_client_severity = 10

[bcmath]
bcmath.scale = 0

[browscap]

[Session]
session.save_handler = files
session.save_path = "/opt/tmp"
session.use_cookies = 1
session.use_only_cookies = 1
session.name = PHPSESSID
session.auto_start = 0
session.cookie_lifetime = 0
session.cookie_path = /
session.cookie_domain =
session.cookie_httponly = 1
session.serialize_handler = php
session.gc_probability = 1
session.gc_divisor = 1000
session.gc_maxlifetime = 1440
session.bug_compat_42 = Off
session.bug_compat_warn = Off
session.referer_check =
session.entropy_length = 32
session.entropy_file = /dev/urandom
session.cache_limiter = nocache
session.cache_expire = 180
session.use_trans_sid = 0
session.hash_function = 0
session.hash_bits_per_character = 5
url_rewriter.tags = "a=href,area=href,frame=src,input=src,form=fakeentry"

[MSSQL]
mssql.allow_persistent = On
mssql.max_persistent = -1
mssql.max_links = -1
mssql.min_error_severity = 10
mssql.min_message_severity = 10
mssql.compatability_mode = Off
mssql.secure_connection = Off

[Assertion]

[COM]

[mbstring]

[gd]

[exif]

[Tidy]
tidy.clean_output = Off

[soap]
soap.wsdl_cache_enabled=1
soap.wsdl_cache_dir="/tmp"
soap.wsdl_cache_ttl=86400
soap.wsdl_cache_limit = 5

[sysvshm]

[ldap]
ldap.max_links = -1

[mcrypt]

[dba]

[xsl]

[Zend]
zend_extension="/usr/local/ioncube/ioncube_loader_lin_5.3.so"
auto_globals_jit = Off
extension=apc.so
apc.shm_size = 512M
apc.stat = 1
apc.stat_ctime = 1
apc.include_once_override = 0
cgi.fix_pathinfo = 1
mbstring.http_input = "pass"
mbstring.http_output = "pass"
mbstring.encoding_translation = 0
extension=uploadprogress.so
extension=imagick.so
extension=jsmin.so
extension=redis.so  

I'd suggest starting by increasing the number of servers in /etc/php5/fpm/pool.d/www.conf, I haven't yet gone through the long list of settings above in detail.

For MariaDB AKA MySQL the main configuration file is /etc/mysql/my.cnf}:

cat my.cnf | grep -v '^#' | grep -v "^[[:space:]]*$"

[client]
port                    = 3306
socket                  = /var/run/mysqld/mysqld.sock
default-character-set   = utf8

[mysqld]
user                    = mysql
pid-file                = /var/run/mysqld/mysqld.pid
socket                  = /var/run/mysqld/mysqld.sock
port                    = 3306
basedir                 = /usr
datadir                 = /var/lib/mysql
tmpdir                  = /tmp
lc_messages_dir         = /usr/share/mysql
lc_messages             = en_US
character_set_server    = utf8
collation_server        = utf8_general_ci
skip-external-locking
skip-name-resolve
connect_timeout         = 60
join_buffer_size        = 1M
key_buffer              = 1M
key_buffer_size         = 32M
log_warnings            = 2
max_allowed_packet      = 32M
max_connect_errors      = 30
max_connections         = 30
max_user_connections    = 30
myisam_sort_buffer_size = 1024K
query_cache_limit       = 128K
query_cache_size        = 64M
query_cache_type        = 1
read_buffer_size        = 8M
read_rnd_buffer_size    = 4M
sort_buffer_size        = 128K
bulk_insert_buffer_size = 128K
table_open_cache        = 64
table_definition_cache  = 512
table_cache             = 128
thread_stack            = 256K
thread_cache_size       = 128
thread_concurrency      = 8
wait_timeout            = 3600
tmp_table_size          = 64M
max_heap_table_size     = 128M
low_priority_updates    = 1
concurrent_insert       = 2
max_tmp_tables          = 16384
server-id               = 8
myisam-recover          = BACKUP
log_bin                 = /var/log/mysql/mariadb-bin
log_bin_index           = /var/log/mysql/mariadb-bin.index
expire_logs_days        = 1
max_binlog_size         = 100M
open_files_limit        = 196608
default_storage_engine  = InnoDB
sql_mode                = NO_ENGINE_SUBSTITUTION
innodb_buffer_pool_size = 1019M
innodb_log_buffer_size  = 4M
innodb_file_per_table   = 1
innodb_open_files       = 256
innodb_io_capacity      = 512
innodb_flush_method     = O_DIRECT
innodb_flush_log_at_trx_commit = 2
innodb_thread_concurrency = 8
innodb_lock_wait_timeout = 120

[mysqld_safe]
socket                  = /var/run/mysqld/mysqld.sock
nice                    = 0
open_files_limit        = 196608
syslog

[mysqldump]
quick
max_allowed_packet      = 32M
quote-names

[mysql]
no-auto-rehash

[myisamchk]
key_buffer              = 1M
sort_buffer_size        = 64K
read_buffer             = 4M
write_buffer            = 4M

[isamchk]
key_buffer              = 1M
sort_buffer_size        = 64K
read_buffer             = 4M
write_buffer            = 4M

[mysqlhotcopy]
interactive-timeout
!includedir /etc/mysql/conf.d/

For the MySQL settings I think we should compare them with the ones on quince to start with.

comment:8 in reply to: ↑ 6 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.15
  • Total Hours changed from 1.6 to 1.75

Replying to jim:

  • Settings: These should just work, that's the point of using BOA, and why I'm confused... That we have to change them implies the original install wasn't quite right

Or that the settings are not quite right...

We are on the 3rd fresh install already, the first one was ticket:466#comment:2 the second was ticket:466#comment:9 and the third was ticket:466#comment:22

comment:9 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.75
  • Total Hours changed from 1.75 to 2.5

I started looking at increasing the number of php-fpm processes, in /etc/php5/fpm/pool.d/www.conf and these values have been increased:

;pm.max_children = 5
pm.max_children = 10

;pm.start_servers = 2
pm.start_servers = 4

;pm.min_spare_servers = 1
pm.min_spare_servers = 2

However this didn't change the number of php-fpm processes running -- it's still 3 and the reason for this is that this config file doesn't appear to be used after all, there are two php-fpm init scripts, /etc/init.d/php5-fpm and /etc/init.d/php53-fpm and the second one uses /opt/local/etc/php53-fpm.conf which contains:

cat /opt/local/etc/php53-fpm.conf | grep -v '^;' | grep -v "^[[:space:]]*$"

[global]
pid = /var/run/php53-fpm.pid
error_log = /var/log/php/php53-fpm-error.log
syslog.facility = daemon
syslog.ident = php-fpm
log_level = notice
emergency_restart_threshold = 5
emergency_restart_interval = 1m
process_control_timeout = 0s
process.max = 8
daemonize = yes
rlimit_files = 1024
rlimit_core = 0
events.mechanism = epoll

[www]
user = www-data
group = www-data
listen = 127.0.0.1:9090
listen.allowed_clients = 127.0.0.1,81.95.52.103
pm = dynamic
pm.max_children = 4
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 2
pm.process_idle_timeout = 10s;
pm.max_requests = 500
slowlog = /var/log/php/php53-fpm-slow.log
request_slowlog_timeout = 30s
request_terminate_timeout = 300s
rlimit_files = 1024
rlimit_core = 0
env[HOSTNAME] = $HOSTNAME
env[PATH] = /usr/local/bin:/usr/bin:/bin
env[TMP] = /tmp
env[TMPDIR] = /tmp
env[TEMP] = /tmp

I'm not clear which config files BOA might overwrite and which ones are safe to edit.

comment:10 follow-up: ↓ 11 Changed 4 years ago by jim

Chris, I don't think we should be changing settings if they look 'ok'. The point being BOA runs on hundreds (probably thousands) of servers, and does so with out of the box settings.

That we're hitting a big problem implies one of:

  1. There is a bug in the current version of BOA that we installed on the server. If so, we should raise an appropriate ticket at http://drupal.org/project/issues/barracuda -- I just don't know what information to give at present since we've not seen an error that describes the fail, or points us to a specific component. As soon as we have proof it's a bug, I'll post something.
  2. There was a misconfiguration or other issue when it was installed, or afterwards. Chris seems pretty happy there wasn't anything done besides a few tweaks to the .barracuda.cnf, and SSL setup. However it's strange that bash autocompletes don't work proprely so perhaps the original OS or other installs went awry. Back burner for now...
  3. Something is odd with our TN site. I'm going to try the import process and makefile out on my server next, and if that breaks I'll let you know!

comment:11 in reply to: ↑ 10 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.2
  • Total Hours changed from 2.5 to 2.7

Replying to jim:

I don't think we should be changing settings if they look 'ok'. The point being BOA runs on hundreds (probably thousands) of servers, and does so with out of the box settings.

OK let's look at how to duplicate the problem .

That we're hitting a big problem implies one of:

  1. There is a bug in the current version of BOA that we installed on the server. If so, we should raise an appropriate ticket at http://drupal.org/project/issues/barracuda -- I just don't know what information to give at present since we've not seen an error that describes the fail, or points us to a specific component. As soon as we have proof it's a bug, I'll post something.

OK.

  1. There was a misconfiguration or other issue when it was installed, or afterwards. Chris seems pretty happy there wasn't anything done besides a few tweaks to the .barracuda.cnf, and SSL setup. However it's strange that bash autocompletes don't work proprely so perhaps the original OS or other installs went awry. Back burner for now...

Does Ubuntu have a different bash auto completion setup by default and you are used to that? This isn't something that has any settings other than the default debian ones AFAIK.

comment:12 follow-up: ↓ 13 Changed 4 years ago by jim

Good plan re duplication... SSL issue (ticket:478#comment:12) to fix first then I'll start on narrowing it down.

Are we running Ubuntu? I thought this was Debian? My server is Debian Squeeze too.

IF we're Ubuntu, this APC issue could be a problem: http://drupal.org/node/1643616

comment:13 in reply to: ↑ 12 Changed 4 years ago by chris

Replying to jim:

Are we running Ubuntu?

No, debian.

comment:14 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 2.7 to 2.8

I have installed the bash-completion package in puffin.

comment:15 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 1.0
  • Total Hours changed from 2.8 to 3.8

There have been seven 502 errors today -- Google is crawling the site (I have created a ticket for a robots.txt file to prevent this for dev sites ticket:487):

cd /var/log/nginx
grep '1.[0|1]" 502 ' access.log 
"66.249.75.117" newlive.puffin.webarch.net [29/Jan/2013:07:19:15 +0000] "GET /transition-themes HTTP/1.1" 502 166 257 311 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.112 "-"
"66.249.78.117" newlive.puffin.webarch.net [29/Jan/2013:08:11:18 +0000] "GET /cms/26134/news/feed HTTP/1.1" 502 166 259 311 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.030 "-"
"66.249.78.117" newlive.puffin.webarch.net [29/Jan/2013:09:14:56 +0000] "GET /robots.txt HTTP/1.1" 502 166 276 311 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.029 "-"
"66.249.78.117" newlive.puffin.webarch.net [29/Jan/2013:10:11:03 +0000] "GET /about/people/trustees HTTP/1.1" 502 166 261 311 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.030 "-"
"66.249.78.117" newlive.puffin.webarch.net [29/Jan/2013:11:19:37 +0000] "GET /people/w HTTP/1.1" 502 166 248 311 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.030 "-"
"66.249.78.117" newlive.puffin.webarch.net [29/Jan/2013:12:10:55 +0000] "GET /people/y HTTP/1.1" 502 166 248 311 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.030 "-"
"66.249.75.117" newlive.puffin.webarch.net [29/Jan/2013:13:17:34 +0000] "GET /blogs/feed HTTP/1.1" 502 166 250 311 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.112 "-"

It might be of note that the errors above only happened once an hour at between 7 and 13 past the hour.

I tried to duplicate this using wget to recursively download the site (without limits) but this didn't generate the error, so I changed to requesting pages every 10 mins to see if this generated errors and it has (once so far):

wget -r --wait=10m --user-agent="Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" --server-response http://newlive.puffin.webarch.net/

Connecting to newlive.puffin.webarch.net|81.95.52.103|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 502 Bad Gateway
  Server: nginx
  Date: Tue, 29 Jan 2013 14:11:45 GMT
  Content-Type: text/html
  Content-Length: 166
  Connection: close
2013-01-29 14:11:45 ERROR 502: Bad Gateway.

Notes that robots.txt is a static file -- no need for a database connection to serve this file.

I have the impression that perhaps it's something like the disks spinning down due to inactivity behind this?

comment:16 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 3.8 to 4.05

There have been two more 502s for Googlebot this afternoon, but no additional ones from the wget running every 10 mins. Both of the Googlebot 502s were as 12 mins past the hour:

"66.249.75.117" newlive.puffin.webarch.net [29/Jan/2013:15:12:28 +0000] "GET /people/malomo HTTP/1.1" 502 166 253 311 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.112 "-"
"66.249.75.117" newlive.puffin.webarch.net [29/Jan/2013:16:12:04 +0000] "GET /initiatives/h HTTP/1.1" 502 166 253 311 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.111 "-"

comment:17 Changed 4 years ago by jim

FYI I did a diff on Babylon's my.cnf vs. Puffin's... There were two;

  • Puffin has a higher innodb_buffer_pool_size as it's got much more memory
  • Puffin has the skip-name-resolve option enabled, which should avoid DNS lookups at the cost of only being able to use IPs in the GRANTS table.

So nothing relevant in my.cnf, and no differences in /etc/mysql/conf.d/*

Hmmm... this is now the only issue left stopping us going ahead with the move AFAIK. More debugging tomorrow...

comment:18 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 4.05 to 4.3

Further 502s since the ones reported in ticket:483#comment:16, yesterday, the XXX.XXX.XXX.XXX requests are me and the YYY.YYY.YYY.YYY requests are (I assume) Jim. The update to robots.txt has resulted in the Googlebot requests stopping:

"66.249.75.117" newlive.puffin.webarch.net [29/Jan/2013:17:13:34 +0000] "GET /cms/23663/events HTTP/1.1" 502 166 256 311 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.114 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [29/Jan/2013:18:17:22 +0000] "GET /media HTTP/1.0" 502 166 209 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.015 "-"
"YYY.YYY.YYY.YYY" tn.puffin.webarch.net [29/Jan/2013:19:10:27 +0000] "GET /hosting/tasks/queue HTTP/1.0" 502 568 700 713 "https://tn.puffin.webarch.net/admin/hosting/settings" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11" 0.013 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [29/Jan/2013:20:17:46 +0000] "GET /syndication-and-social-media HTTP/1.0" 502 166 232 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.013 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [29/Jan/2013:21:18:20 +0000] "GET /support/dealing-conflict HTTP/1.0" 502 166 228 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.013 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [29/Jan/2013:22:18:27 +0000] "GET /training HTTP/1.0" 502 166 212 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.014 "-"
"YYY.YYY.YYY.YYY" newlive.puffin.webarch.net [29/Jan/2013:23:10:01 +0000] "GET /user/login HTTP/1.0" 502 568 656 713 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11" 0.269 "-"
"YYY.YYY.YYY.YYY" newlive.puffin.webarch.net [29/Jan/2013:23:10:03 +0000] "GET /user/login HTTP/1.0" 502 568 612 713 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11" 0.013 "-"
"YYY.YYY.YYY.YYY" tn.puffin.webarch.net [30/Jan/2013:00:15:55 +0000] "GET /hosting/servers HTTP/1.0" 502 568 709 713 "https://tn.puffin.webarch.net/admin/content/node" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11" 0.014 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:01:18:54 +0000] "GET /ingredients HTTP/1.0" 502 166 215 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.014 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:02:19:03 +0000] "GET /support/branding HTTP/1.0" 502 166 220 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.013 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:03:19:10 +0000] "GET /transition-2/film-news HTTP/1.0" 502 166 226 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.016 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:04:19:18 +0000] "GET /video HTTP/1.0" 502 166 209 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.013 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:05:19:34 +0000] "GET /why-do-transition HTTP/1.0" 502 166 221 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.014 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:06:19:42 +0000] "GET /news/2012-12-14/transition-recognised-european-parliament HTTP/1.0" 502 166 261 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.013 "-"

And todays:

"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:07:19:52 +0000] "GET /initiatives/madeira-em-transi-o HTTP/1.0" 502 166 235 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.014 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:08:10:01 +0000] "GET /stories/caroline-jackson/2013-01/wet-feet-cold-baths-and-lukewarm-soup-resilience-lancaster HTTP/1.0" 502 166 295 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.015 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:09:10:13 +0000] "GET /about/funding HTTP/1.0" 502 166 217 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.014 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:10:10:19 +0000] "GET /terms HTTP/1.0" 502 166 209 311 "http://newlive.puffin.webarch.net/" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.013 "-"

I think the next things to look at is increasing the log level of Nginx, php-fpm and !MariaDB to try to get some more data -- we are managing to reproduce the bug, just not on demand.

comment:19 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 1.0
  • Total Hours changed from 4.3 to 5.3

Currently !MariaDB, has a binary log, in /etc/mysql/my.cnf:

log_bin                 = /var/log/mysql/mariadb-bin
log_bin_index           = /var/log/mysql/mariadb-bin.index

This can be read using mysqlbinlog, see https://kb.askmonty.org/en/using-mysqlbinlog/2727/ however it doesn't work:

cd /var/log/mysql
mysqlbinlog mariadb-bin.000079 
mysqlbinlog: unknown variable 'default-character-set=utf8'

And answer here is to edit this line in /etc/mysql/mysql.cnf:

#default-character-set   = utf8
loose-default-character-set   = utf8

Then the log can be read like this:

mysqlbinlog mariadb-bin.000079 | vim -

Comparing the Nginx 502 times with the !MariaDB log times, the logs at around 10:10:19 today, nothing was logged between 10:10:12 and 10:10:22:

# at 9449034
#130130 10:10:12 server id 8  end_log_pos 9449061       Xid = 1265233
COMMIT/*!*/;
# at 9449061
#130130 10:10:22 server id 8  end_log_pos 9449131       Query   thread_id=121909        exec_time=0     error_code=0
SET TIMESTAMP=1359540622/*!*/;
flush hosts
/*!*/;

Logs around 9:10:13:

# at 6600743
#130130  9:10:11 server id 8  end_log_pos 6600813       Query   thread_id=119123        exec_time=0     error_code=0
SET TIMESTAMP=1359537011/*!*/;
flush hosts
/*!*/;
# at 6600813
#130130  9:10:22 server id 8  end_log_pos 6600883       Query   thread_id=119129        exec_time=0     error_code=0
SET TIMESTAMP=1359537022/*!*/;
flush hosts
/*!*/;

Around 8:10:01:

# at 4456538
#130130  8:09:52 server id 8  end_log_pos 4456608       Query   thread_id=116323        exec_time=0     error_code=0
SET TIMESTAMP=1359533392/*!*/;
flush hosts
/*!*/;
# at 4456608
#130130  8:10:01 server id 8  end_log_pos 4456678       Query   thread_id=116330        exec_time=0     error_code=0
SET TIMESTAMP=1359533401/*!*/;
flush hosts
/*!*/;
# at 4456678
#130130  8:10:01 server id 8  end_log_pos 4456758       Query   thread_id=116333        exec_time=0     error_code=0
SET TIMESTAMP=1359533401/*!*/;
BEGIN
/*!*/;
# at 4456758
#130130  8:10:01 server id 8  end_log_pos 4456917       Query   thread_id=116333        exec_time=0     error_code=0
use `masterpuffinweba`/*!*/;
SET TIMESTAMP=1359533401/*!*/;
UPDATE variable SET value = 'i:1359533401;' WHERE name = 'hosting_dispatch_last_run'
/*!*/;
# at 4456917
#130130  8:10:01 server id 8  end_log_pos 4456944       Xid = 1235267
COMMIT/*!*/;
# at 4456944
#130130  8:10:01 server id 8  end_log_pos 4457024       Query   thread_id=116333        exec_time=0     error_code=0
SET TIMESTAMP=1359533401/*!*/;
BEGIN
/*!*/;
# at 4457024
#130130  8:10:01 server id 8  end_log_pos 4457232       Query   thread_id=116333        exec_time=0     error_code=0
SET TIMESTAMP=1359533401/*!*/;
INSERT INTO semaphore (name, value, expire) VALUES ('variable_cache_regenerate', '9021422225108d559ec3f29.55950919', 1359533431.9676)
/*!*/;
# at 4457232
#130130  8:10:01 server id 8  end_log_pos 4457259       Xid = 1235280
COMMIT/*!*/;
# at 4457259
#130130  8:10:01 server id 8  end_log_pos 4457339       Query   thread_id=116333        exec_time=0     error_code=0
SET TIMESTAMP=1359533401/*!*/;
BEGIN
/*!*/;
# at 4457339
#130130  8:10:01 server id 8  end_log_pos 4459118       Query   thread_id=116333        exec_time=0     error_code=0
SET TIMESTAMP=1359533401/*!*/;
UPDATE cache SET data = 'a:33:{s:9:\"clean_url\";b:1;s:9:\"cron_last\";i:1359532983;s:19:\"css_js_query_string\";s:20:\"DsKMxXpRC1BUJwu4tHkL\";s:21:\"date_default_timezone\";s:1:\"0\";s:19:\"file_directory_path\";s:37:\"sites/master.puffin.webarch.net/files\";s:19:\"file_directory_temp\";s:41:\"sites/master.puffin.webarch.net/files/tmp\";s:13:\"filter_html_1\";i:1;s:20:\"hosting_admin_client\";s:1:\"1\";s:19:\"hosting_cron_backup\";a:0:{}s:22:\"hosting_default_client\";s:1:\"1\";s:25:\"hosting_default_db_server\";s:1:\"4\";s:26:\"hosting_default_web_server\";s:1:\"2\";s:24:\"hosting_dispatch_enabled\";b:1;s:25:\"hosting_dispatch_last_run\";i:1359533401;s:21:\"hosting_own_db_server\";s:1:\"4\";s:20:\"hosting_own_platform\";s:1:\"7\";s:22:\"hosting_own_web_server\";s:1:\"2\";s:27:\"hosting_queue_cron_last_run\";i:1359532982;s:28:\"hosting_queue_tasks_last_run\";i:1359533343;s:15:\"install_profile\";s:10:\"hostmaster\";s:12:\"install_task\";s:4:\"done\";s:12:\"install_time\";i:1355588145;s:11:\"install_url\";s:32:\"http://master.puffin.webarch.net\";s:17:\"javascript_parsed\";a:0:{}s:13:\"menu_expanded\";a:0:{}s:10:\"menu_masks\";a:15:{i:0;i:61;i:1;i:31;i:2;i:30;i:3;i:29;i:4;i:24;i:5;i:21;i:6;i:15;i:7;i:14;i:8;i:13;i:9;i:11;i:10;i:7;i:11;i:5;i:12;i:3;i:13;i:2;i:14;i:1;}s:18:\"node_options_forum\";a:1:{i:0;s:6:\"status\";}s:14:\"site_frontpage\";s:13:\"hosting/sites\";s:9:\"site_mail\";s:35:\"webmaster@master.puffin.webarch.net\";s:9:\"site_name\";s:25:\"master.puffin.webarch.net\";s:13:\"theme_default\";s:5:\"eldir\";s:33:\"user_mail_status_activated_notify\";b:1;s:13:\"user_register\";i:0;}', created = 1359533401, expire = 0, headers = '', serialized = 1 WHERE cid = 'variables'
/*!*/;
# at 4459118
#130130  8:10:01 server id 8  end_log_pos 4459145       Xid = 1235282
COMMIT/*!*/;
# at 4459145
#130130  8:10:01 server id 8  end_log_pos 4459225       Query   thread_id=116333        exec_time=0     error_code=0
SET TIMESTAMP=1359533401/*!*/;
BEGIN
/*!*/;
# at 4459225
#130130  8:10:01 server id 8  end_log_pos 4459409       Query   thread_id=116333        exec_time=0     error_code=0
SET TIMESTAMP=1359533401/*!*/;
DELETE FROM semaphore WHERE name = 'variable_cache_regenerate' AND value = '9021422225108d559ec3f29.55950919'
/*!*/;
# at 4459409
#130130  8:10:01 server id 8  end_log_pos 4459436       Xid = 1235283
COMMIT/*!*/;
# at 4459436
#130130  8:10:11 server id 8  end_log_pos 4459516       Query   thread_id=116339        exec_time=0     error_code=0
SET TIMESTAMP=1359533411/*!*/;
BEGIN
/*!*/;

And around 7:19:52:

# at 2727237
#130130  7:19:42 server id 8  end_log_pos 2727307       Query   thread_id=114014        exec_time=0     error_code=0
SET TIMESTAMP=1359530382/*!*/;
flush hosts
/*!*/;
# at 2727307
#130130  7:19:52 server id 8  end_log_pos 2727377       Query   thread_id=114020        exec_time=0     error_code=0
SET TIMESTAMP=1359530392/*!*/;
flush hosts
/*!*/;
# at 2727377
#130130  7:20:01 server id 8  end_log_pos 2727447       Query   thread_id=114026        exec_time=0     error_code=0
SET TIMESTAMP=1359530401/*!*/;
flush hosts
/*!*/;

I can't see anything of interest in these logs.

To increase the PHP-FPM log level the /opt/local/etc/php53-fpm.conf file was edited and the following was changed:

;log_level = notice
log_level = debug

And the server was restarted (restart didn't work, stop and start were used):

/etc/init.d/php53-fpm stop
Gracefully shutting down php53-fpm... done
/etc/init.d/php53-fpm stop
Gracefully shutting down php53-fpm..warning, no pid file found - php53-fpm not running ?
/etc/init.d/php53-fpm start

So next time there is a 502 we can check /var/log/php/php53-fpm-error.log.

Since starting this ticket comment there have been two more 502s:

"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:11:10:31 +0000] "GET /resources/who-we-are-and-what-we-do HTTP/1.0" 502 166 244 311 "http://newlive.puffin.webarch.net/about" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.013 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:12:20:36 +0000] "GET /user/password HTTP/1.0" 502 166 227 311 "http://newlive.puffin.webarch.net/user/login" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.014 "-"

Comparing the php-fpm and the times of the 502 (this is from before the log level was set to debug), around 12:20:36 we have:

rotate
[30-Jan-2013 12:10:01] NOTICE: Reloading in progress ...
[30-Jan-2013 12:10:01] NOTICE: reloading: execvp("/opt/local/sbin/php-fpm", {"/opt/local/sbin/php-fpm", "--fpm-config", "/opt/local/etc/php53-fpm.conf", "-c", "/opt/local/etc/php53.ini"})
[30-Jan-2013 12:10:01] NOTICE: using inherited socket fd=7, "127.0.0.1:9090"
[30-Jan-2013 12:10:01] NOTICE: using inherited socket fd=7, "127.0.0.1:9090"
[30-Jan-2013 12:10:01] NOTICE: fpm is running, pid 687
[30-Jan-2013 12:10:01] NOTICE: ready to handle connections
[30-Jan-2013 12:20:36] WARNING: [pool www] child 691 exited with code 1 after 635.103064 seconds from start
[30-Jan-2013 12:20:36] NOTICE: [pool www] child 4876 started
[30-Jan-2013 12:52:41] NOTICE: Finishing ...
[30-Jan-2013 12:52:41] NOTICE: exiting, bye-bye!

There are no previous logs -- it looks like there is a problem with log rotation for php-fpm also the following line from the log above corresponds exactly with a 502:

[30-Jan-2013 12:20:36] WARNING: [pool www] child 691 exited with code 1 after 635.103064 seconds from start

I think we are getting somewhere...

comment:20 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 5.3 to 5.55

The last two 502 errors in /var/log/nginx/access.log:

"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:13:00:40 +0000] "GET /blogs/ed-mitchell/2010-03/editing-your-personal-profile HTTP/1.0" 502 166 268 311 "http://newlive.puffin.webarch.net/site-help" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.013 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:13:10:40 +0000] "GET /blogs/ed-mitchell/2010-08/adding-yourself-speakers-and-facilitators-directories HTTP/1.0" 502 166 292 311 "http://newlive.puffin.webarch.net/site-help" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.012 "-"

The logs are being rotated so fast that log for the first 502 has already been lost, for the second 502, at 13:10:40 the corresponding lines in /var/log/php/php53-fpm-error.log are:

[30-Jan-2013 13:10:39.988513] DEBUG: pid 23781, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 0 active children, 2 spare children, 2 running children. Spawning rate 1
[30-Jan-2013 13:10:40.001237] DEBUG: pid 23781, fpm_got_signal(), line 72: received SIGCHLD
[30-Jan-2013 13:10:40.001436] WARNING: pid 23781, fpm_children_bury(), line 252: [pool www] child 23782 exited with code 1 after 38.148025 seconds from start
[30-Jan-2013 13:10:40.009202] NOTICE: pid 23781, fpm_children_make(), line 421: [pool www] child 24119 started
[30-Jan-2013 13:10:40.009310] DEBUG: pid 23781, fpm_event_loop(), line 411: event module triggered 1 events
[30-Jan-2013 13:10:40.992493] DEBUG: pid 23781, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 0 active children, 2 spare children, 2 running children. Spawning rate 1

I don't see what is causing the php-fpm logs to be rotated, there is nothing in /etc/logrotate.d/.

comment:21 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 5.55 to 6.05

The latest 502 was at 13:50:45:

"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:13:50:45 +0000] "GET /blogs/ed-mitchell/2011-01/adding-your-initiatives-event-site HTTP/1.0" 502 166 273 311 "http://newlive.puffin.webarch.net/site-help" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.014 "-

And the corresponding entry in /var/log/php/php53-fpm-error.log:

[30-Jan-2013 13:50:45.446679] DEBUG: pid 5737, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 1 active children, 1 spare children, 2 running children. Spawning rate 1
[30-Jan-2013 13:50:45.446794] DEBUG: pid 5737, fpm_got_signal(), line 72: received SIGCHLD
[30-Jan-2013 13:50:45.447000] WARNING: pid 5737, fpm_children_bury(), line 252: [pool www] child 5738 exited with code 1 after 225.534299 seconds from start
[30-Jan-2013 13:50:45.454058] NOTICE: pid 5737, fpm_children_make(), line 421: [pool www] child 7222 started
[30-Jan-2013 13:50:45.454162] DEBUG: pid 5737, fpm_event_loop(), line 411: event module triggered 1 events
[30-Jan-2013 13:50:46.447263] DEBUG: pid 5737, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 0 active children, 2 spare children, 2 running children. Spawning rate 1

Email from Jim:

You should check /var/xdrago for the BOA scripts that do various self healing, tweaking and other tasks -- might be in there.

OR you could download the source from https://github.com/omega8cc/nginx-for-drupal and search it.

If the logs are rotating too fast, we should raise a bug on the Barracuda issue tracker.

In /var/xdrago/cron/crontabs/root there are these tasks:

# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/hm.cronBTAzqF installed on Thu Jul 22 19:24:53 2010)
# (Cron version -- $Id: crontab.c,v 2.13 1994/01/17 03:20:37 vixie Exp $)
* * * * * bash /var/xdrago/second.sh >/dev/null 2>&1
* * * * * bash /var/xdrago/minute.sh >/dev/null 2>&1
* * * * * bash /var/xdrago/runner.sh >/dev/null 2>&1
*/5 * * * * bash /var/xdrago/manage_ltd_users.sh >/dev/null 2>&1
10 * * * * bash /var/xdrago/clear.sh >/dev/null 2>&1
10 * * * * bash /var/xdrago/purge_binlogs.sh >/dev/null 2>&1
15 3 * * * bash /var/xdrago/graceful.sh >/dev/null 2>&1
15 4 * * * bash /var/xdrago/mysql_backup.sh >/dev/null 2>&1
15 7 * * * bash /var/xdrago/usage.sh >/dev/null 2>&1
15 8 * * * bash /var/xdrago/fetch-remote.sh >/dev/null 2>&1
###EOF2013###

Reading all the scripts at /var/xdrago/*.sh, the file clobbering the logs is /var/xdrago/clear.sh:

#!/bin/bash

SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/opt/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

echo rotate > /var/log/php/php-fpm-error.log
echo rotate > /var/log/php/php-fpm-slow.log
echo rotate > /var/log/php/php53-fpm-error.log
echo rotate > /var/log/php/php53-fpm-slow.log
echo rotate > /var/log/php/error_log_52
echo rotate > /var/log/php/error_log_53
echo rotate > /var/log/php/error_log_cli_52
echo rotate > /var/log/php/error_log_cli_53
echo rotate > /var/log/redis/redis-server.log
echo rotate > /var/log/mysql/sql-slow-query.log
if test -f /root/.high_traffic.cnf ; then
  echo rotate > /var/log/nginx/access.log
else
  /etc/init.d/php53-fpm reload
  if test -f /etc/init.d/php-fpm ; then
    /etc/init.d/php-fpm reload
  fi
fi
if test -f /var/run/boa_run.pid ; then
  sleep 1
else
  rm -f /tmp/*error*
fi
ntpdate pool.ntp.org
touch /var/xdrago/log/clear.done
###EOF2013###

I have commented the /var/xdrago/clear.sh task from the root crontab.

I have added include = /var/xdrago to /etc/backup.d/90.rdiff.

Jim -- by this evening we should have enough examples of 502s and the corresponding PHP+FPM log entries to raise a ticket upstream? Would it help if I put these all on a wiki page to save digging through the comments on this ticket?

comment:22 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.4
  • Total Hours changed from 6.05 to 6.45

I have just checked the nginx log:

grep '1.[0|1]" 502 ' /var/log/nginx/access.log

And there have been no 502s since the 10 mins past the hour /var/xdrago/clear.sh cronjob was disabled, the wget job crawling the site at a page every 10 mins is still running.

Looking at the /var/xdrago/clear.sh script I think the problem might relate to this section, I have added comments explaining what each part does:

# the following if tests to see if /root/.high_traffic.cnf exists:
if test -f /root/.high_traffic.cnf ; then
  
  # if /root/.high_traffic.cnf exists then clobber the Nginx access log:
  echo rotate > /var/log/nginx/access.log

# if /root/.high_traffic.cnf doesn't exist (and it doesn't and I expect it
# won't unless the site has a lot of traffic) then:
else

  # reload /etc/init.d/php53-fpm 
  /etc/init.d/php53-fpm reload

  # test to see if /etc/init.d/php-fpm exists - it does!
  if test -f /etc/init.d/php-fpm ; then

    # reload the other install of php-fpm -- this is perhaps where the problem is?
    /etc/init.d/php-fpm reload
  fi
fi

So I have changed that section to only reload one version of php-fpm and I have also disabled the log clobbering:

#!/bin/bash

SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/opt/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

#echo rotate > /var/log/php/php-fpm-error.log
#echo rotate > /var/log/php/php-fpm-slow.log
#echo rotate > /var/log/php/php53-fpm-error.log
#echo rotate > /var/log/php/php53-fpm-slow.log
#echo rotate > /var/log/php/error_log_52
#echo rotate > /var/log/php/error_log_53
#echo rotate > /var/log/php/error_log_cli_52
#echo rotate > /var/log/php/error_log_cli_53
#echo rotate > /var/log/redis/redis-server.log
#echo rotate > /var/log/mysql/sql-slow-query.log
if test -f /root/.high_traffic.cnf ; then
  echo rotate > /var/log/nginx/access.log
else
  if test -f /etc/init.d/php53-fpm ; then
    /etc/init.d/php53-fpm reload
  elif test -f /etc/init.d/php-fpm ; then
    /etc/init.d/php-fpm reload
  fi
fi
if test -f /var/run/boa_run.pid ; then
  sleep 1
else
  rm -f /tmp/*error*
fi
ntpdate pool.ntp.org
touch /var/xdrago/log/clear.done
###EOF2013###

I have reenabled the cronjob but changed it so the results will be mailed to root:

#10 * * * * bash /var/xdrago/clear.sh >/dev/null 2>&1
10 * * * * bash /var/xdrago/clear.sh 

comment:23 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 6.45 to 6.55

So there has been another 502, while I was writing the comment above:

"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:17:31:17 +0000] "GET /user/login?destination=comment%2Freply%2F28633%23comment-form HTTP/1.0" 502 166 269 311 "http://newlive.puffin.webarch.net/news" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.014 "-"

And the entry in the php-fpm log is much the same as before:

[30-Jan-2013 17:31:16.664650] DEBUG: pid 24701, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 0 active children, 2 spare children, 2 running children. Spawning rate 1
[30-Jan-2013 17:31:17.501155] DEBUG: pid 24701, fpm_got_signal(), line 72: received SIGCHLD
[30-Jan-2013 17:31:17.501375] WARNING: pid 24701, fpm_children_bury(), line 252: [pool www] child 24702 exited with code 1 after 192.936592 seconds from start
[30-Jan-2013 17:31:17.509328] NOTICE: pid 24701, fpm_children_make(), line 421: [pool www] child 25983 started
[30-Jan-2013 17:31:17.509420] DEBUG: pid 24701, fpm_event_loop(), line 411: event module triggered 1 events
[30-Jan-2013 17:31:17.668624] DEBUG: pid 24701, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 0 active children, 2 spare children, 2 running children. Spawning rate 1

What is the reason for reloading php-fpm every hour -- does it have memory leaks?

comment:24 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.35
  • Total Hours changed from 6.55 to 6.9

Two more 502s:

"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:18:21:18 +0000] "GET /user/register?destination=comment%2Freply%2F28501%23comment-form HTTP/1.0" 502 166 272 311 "http://newlive.puffin.webarch.net/news" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.013 "-"
"XXX.XXX.XXX.XXX" newlive.puffin.webarch.net [30/Jan/2013:19:11:25 +0000] "GET /user/login?destination=comment%2Freply%2F28485%23comment-form HTTP/1.0" 502 166 269 311 "http://newlive.puffin.webarch.net/news" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 0.017 "-"

And the corresponding php-fpm logs from 18:21:18:

[30-Jan-2013 18:21:18.090712] DEBUG: pid 8258, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 0 active children, 2 spare children, 2 running children. Spawning rate 1
[30-Jan-2013 18:21:18.811415] DEBUG: pid 8258, fpm_got_signal(), line 72: received SIGCHLD
[30-Jan-2013 18:21:18.811647] WARNING: pid 8258, fpm_children_bury(), line 252: [pool www] child 8261 exited with code 1 after 678.214427 seconds from start
[30-Jan-2013 18:21:18.819596] NOTICE: pid 8258, fpm_children_make(), line 421: [pool www] child 12623 started
[30-Jan-2013 18:21:18.819710] DEBUG: pid 8258, fpm_event_loop(), line 411: event module triggered 1 events
[30-Jan-2013 18:21:19.094755] DEBUG: pid 8258, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 0 active children, 2 spare children, 2 running children. Spawning rate 1

And 19:11:25:

[30-Jan-2013 19:11:25.370780] DEBUG: pid 31407, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 1 active children, 1 spare children, 2 running children. Spawning rate 1
[30-Jan-2013 19:11:25.370883] DEBUG: pid 31407, fpm_got_signal(), line 72: received SIGCHLD
[30-Jan-2013 19:11:25.371088] WARNING: pid 31407, fpm_children_bury(), line 252: [pool www] child 31408 exited with code 1 after 84.060608 seconds from start
[30-Jan-2013 19:11:25.377938] NOTICE: pid 31407, fpm_children_make(), line 421: [pool www] child 32060 started
[30-Jan-2013 19:11:25.378060] DEBUG: pid 31407, fpm_event_loop(), line 411: event module triggered 1 events
[30-Jan-2013 19:11:26.374138] DEBUG: pid 31407, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 0 active children, 2 spare children, 2 running children. Spawning rate 1

Because it's still happening the edits I made to /var/xdrago/clear.sh were clearly not the answer.

So I have changed these values in /opt/local/etc/php53-fpm.conf to see if that helps:

;emergency_restart_interval = 1m
emergency_restart_interval = 0

;pm.max_requests = 500
pm.max_requests = 0

;access.log = log/$pool.access.log
access.log = /var/log/php/$pool.access.log

And /etc/init.d/php53-fpm was stopped and started.

The following packages were updated via aptitude (should we be updating packages using aptitude of just using the baracuda script?):

libmariadbclient-dev/squeeze libmariadbclient18/squeeze libmariadbd-dev/squeeze libmysqlclient18/squeeze mariadb-client-5.5/squeeze mariadb-client-core-5.5/squeeze mariadb-common/squeeze mariadb-server-5.5/squeeze mariadb-server-core-5.5/squeeze mysql-common/squeeze

comment:25 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 6.9 to 7.0

There have been no more 502s since the one at 19:11 so perhaps we have cracked it... I'm not around tomorrow during the day but I'll check on the 502s again tomorrow night.

comment:26 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 1.0
  • Total Hours changed from 7.0 to 8.0

No more 502s, I think one of the changes made to /opt/local/etc/php53-fpm.conf solved it and I suspected it was this one:

;pm.max_requests = 500
pm.max_requests = 0

To confirm this I undid the change to emergency_restart_interval made in ticket:483#comment:24 but left pm.max_requests set to 0 (infinity).

Since the requests to the server from spiders has stopped due to the new robots.txt files I have set off a speed unlimited crawl of the site using http://httrack.org/ from penguin, with httrack set to ignore robots rules to see if this generates some 502s.

And while writing this ticket comment there was another 502, however I think this was caused by the restart of php-pfm done after making the changes above -- there have been no others (so far!):

"81.95.52.111" newlive.puffin.webarch.net [31/Jan/2013:16:22:11 +0000] "GET /news?page=21 HTTP/1.1" 502 166 452 311 "http://newlive.puffin.webarch.net/news?page=13" "Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" 1.445 "-"

The corresponding entry in /var/log/php/php53-fpm-error.log:

[31-Jan-2013 16:22:11.138278] DEBUG: pid 29287, fpm_got_signal(), line 108: received SIGUSR2
[31-Jan-2013 16:22:11.138352] NOTICE: pid 29287, fpm_got_signal(), line 109: Reloading in progress ...
[31-Jan-2013 16:22:11.138412] DEBUG: pid 29287, fpm_pctl(), line 231: switching to 'reloading' state
[31-Jan-2013 16:22:11.138479] DEBUG: pid 29287, fpm_pctl_kill_all(), line 159: [pool www] sending signal 3 SIGQUIT to child 31291
[31-Jan-2013 16:22:11.138532] DEBUG: pid 29287, fpm_pctl_kill_all(), line 159: [pool www] sending signal 3 SIGQUIT to child 29291
[31-Jan-2013 16:22:11.138573] DEBUG: pid 29287, fpm_pctl_kill_all(), line 168: 2 child(ren) still alive
[31-Jan-2013 16:22:11.138615] DEBUG: pid 29287, fpm_event_loop(), line 411: event module triggered 1 events
[31-Jan-2013 16:22:11.138649] DEBUG: pid 29287, fpm_pctl_kill_all(), line 159: [pool www] sending signal 15 SIGTERM to child 31291
[31-Jan-2013 16:22:11.138680] DEBUG: pid 29287, fpm_pctl_kill_all(), line 159: [pool www] sending signal 15 SIGTERM to child 29291
[31-Jan-2013 16:22:11.138703] DEBUG: pid 29287, fpm_pctl_kill_all(), line 168: 2 child(ren) still alive
[31-Jan-2013 16:22:11.171483] DEBUG: pid 29287, fpm_got_signal(), line 72: received SIGCHLD
[31-Jan-2013 16:22:11.171574] DEBUG: pid 29287, fpm_children_bury(), line 252: [pool www] child 29291 exited on signal 15 (SIGTERM) after 729.775003 seconds from start
[31-Jan-2013 16:22:11.171593] DEBUG: pid 29287, fpm_event_loop(), line 411: event module triggered 1 events
[31-Jan-2013 16:22:11.172417] DEBUG: pid 29287, fpm_got_signal(), line 72: received SIGCHLD
[31-Jan-2013 16:22:11.172466] DEBUG: pid 29287, fpm_children_bury(), line 252: [pool www] child 31291 exited on signal 15 (SIGTERM) after 443.710119 seconds from start
[31-Jan-2013 16:22:11.172479] NOTICE: pid 29287, fpm_pctl_exec(), line 98: reloading: execvp("/opt/local/sbin/php-fpm", {"/opt/local/sbin/php-fpm", "--fpm-config", "/opt/local/etc/php53-fpm.conf", "-c", "/opt/local/etc/php53.ini"})
[31-Jan-2013 16:22:11.305963] DEBUG: pid 2674, fpm_scoreboard_init_main(), line 40: got clock tick '100'
[31-Jan-2013 16:22:11.306956] NOTICE: pid 2674, fpm_sockets_init_main(), line 328: using inherited socket fd=8, "127.0.0.1:9090"
[31-Jan-2013 16:22:11.306956] NOTICE: pid 2674, fpm_sockets_init_main(), line 328: using inherited socket fd=8, "127.0.0.1:9090"
[31-Jan-2013 16:22:11.307145] DEBUG: pid 2674, fpm_event_init_main(), line 333: event module is epoll and 1 fds have been reserved
[31-Jan-2013 16:22:11.307442] NOTICE: pid 2674, fpm_init(), line 83: fpm is running, pid 2674
[31-Jan-2013 16:22:11.307575] DEBUG: pid 2674, main(), line 1832: Sending "1" (OK) to parent via fd=6
[31-Jan-2013 16:22:11.316120] DEBUG: pid 2674, fpm_children_make(), line 421: [pool www] child 2675 started
[31-Jan-2013 16:22:11.323642] DEBUG: pid 2674, fpm_children_make(), line 421: [pool www] child 2676 started
[31-Jan-2013 16:22:11.323755] DEBUG: pid 2674, fpm_pctl_heartbeat(), line 460: heartbeat have been set up with a timeout of 10000ms
[31-Jan-2013 16:22:11.323816] DEBUG: pid 2674, fpm_event_loop(), line 362: 4808 bytes have been reserved in SHM
[31-Jan-2013 16:22:11.323844] NOTICE: pid 2674, fpm_event_loop(), line 363: ready to handle connections

comment:27 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.2
  • Total Hours changed from 8.0 to 8.2

Running the post code speed test script (see ticket:466#comment:55) triggered a 502:

"81.95.52.111" newlive.puffin.webarch.net [31/Jan/2013:21:10:01 +0000] "GET /nearby?distance[postal_code]=PE3&distance[country]=uk&distance[search_distance]=50&distance[search_units]=mile HTTP/1.0" 502 166 240 311 "-" "Mozilla/5.0 (TN Speed Test)" 1.248 "-"

This is from the /var/log/php/php53-fpm-error.log log:

[31-Jan-2013 21:10:01.854236] DEBUG: pid 406, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 1 active children, 1 spare children, 2 running children. Spawning rate 1
[31-Jan-2013 21:10:01.931397] DEBUG: pid 406, fpm_got_signal(), line 108: received SIGUSR2
[31-Jan-2013 21:10:01.931492] NOTICE: pid 406, fpm_got_signal(), line 109: Reloading in progress ...
[31-Jan-2013 21:10:01.931571] DEBUG: pid 406, fpm_pctl(), line 231: switching to 'reloading' state
[31-Jan-2013 21:10:01.931659] DEBUG: pid 406, fpm_pctl_kill_all(), line 159: [pool www] sending signal 3 SIGQUIT to child 408
[31-Jan-2013 21:10:01.931727] DEBUG: pid 406, fpm_pctl_kill_all(), line 159: [pool www] sending signal 3 SIGQUIT to child 407
[31-Jan-2013 21:10:01.931786] DEBUG: pid 406, fpm_pctl_kill_all(), line 168: 2 child(ren) still alive
[31-Jan-2013 21:10:01.931851] DEBUG: pid 406, fpm_event_loop(), line 411: event module triggered 1 events
[31-Jan-2013 21:10:01.931934] DEBUG: pid 406, fpm_pctl_kill_all(), line 159: [pool www] sending signal 15 SIGTERM to child 408
[31-Jan-2013 21:10:01.932007] DEBUG: pid 406, fpm_pctl_kill_all(), line 159: [pool www] sending signal 15 SIGTERM to child 407
[31-Jan-2013 21:10:01.932047] DEBUG: pid 406, fpm_pctl_kill_all(), line 168: 2 child(ren) still alive
[31-Jan-2013 21:10:01.962708] DEBUG: pid 406, fpm_got_signal(), line 72: received SIGCHLD
[31-Jan-2013 21:10:01.963062] DEBUG: pid 406, fpm_children_bury(), line 252: [pool www] child 407 exited on signal 15 (SIGTERM) after 3599.775706 seconds from start
[31-Jan-2013 21:10:01.963169] DEBUG: pid 406, fpm_event_loop(), line 411: event module triggered 1 events
[31-Jan-2013 21:10:01.969971] DEBUG: pid 406, fpm_got_signal(), line 72: received SIGCHLD
[31-Jan-2013 21:10:01.970087] DEBUG: pid 406, fpm_children_bury(), line 252: [pool www] child 408 exited on signal 15 (SIGTERM) after 3599.779086 seconds from start
[31-Jan-2013 21:10:01.970109] NOTICE: pid 406, fpm_pctl_exec(), line 98: reloading: execvp("/opt/local/sbin/php-fpm", {"/opt/local/sbin/php-fpm", "--fpm-config", "/opt/local/etc/php53-fpm.conf", "-c", "/opt/local/etc/php53.ini"})
[31-Jan-2013 21:10:02.115087] DEBUG: pid 24440, fpm_scoreboard_init_main(), line 40: got clock tick '100'
[31-Jan-2013 21:10:02.116062] NOTICE: pid 24440, fpm_sockets_init_main(), line 328: using inherited socket fd=8, "127.0.0.1:9090"
[31-Jan-2013 21:10:02.116062] NOTICE: pid 24440, fpm_sockets_init_main(), line 328: using inherited socket fd=8, "127.0.0.1:9090"
[31-Jan-2013 21:10:02.116183] DEBUG: pid 24440, fpm_event_init_main(), line 333: event module is epoll and 1 fds have been reserved
[31-Jan-2013 21:10:02.116528] NOTICE: pid 24440, fpm_init(), line 83: fpm is running, pid 24440
[31-Jan-2013 21:10:02.116641] DEBUG: pid 24440, main(), line 1832: Sending "1" (OK) to parent via fd=6
[31-Jan-2013 21:10:02.125758] DEBUG: pid 24440, fpm_children_make(), line 421: [pool www] child 24444 started
[31-Jan-2013 21:10:02.134330] DEBUG: pid 24440, fpm_children_make(), line 421: [pool www] child 24449 started
[31-Jan-2013 21:10:02.134454] DEBUG: pid 24440, fpm_pctl_heartbeat(), line 460: heartbeat have been set up with a timeout of 10000ms
[31-Jan-2013 21:10:02.134524] DEBUG: pid 24440, fpm_event_loop(), line 362: 4808 bytes have been reserved in SHM
[31-Jan-2013 21:10:02.134554] NOTICE: pid 24440, fpm_event_loop(), line 363: ready to handle connections
[31-Jan-2013 21:10:03.138981] DEBUG: pid 24440, fpm_pctl_perform_idle_server_maintenance(), line 379: [pool www] currently 1 active children, 1 spare children, 2 running children. Spawning rate 1

This 502 corresponds exactly with the php-fpm reload that the clear.sh script does at 10 mins past each hour, so I have now commented out that cron task and set the httrack spidering of the site off again.

comment:28 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 8.2 to 8.7

Have been doing some more searching on this issue and there are a some interesting comments here:

comment:29 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 8.7 to 8.8

There have been no more 502s since the cronjob to run clear.sh at 10 past the hour was stopped in ticket:483#comment:27

comment:30 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 8.8 to 9.05

A few hours ago I set a recursive wget off on penguin, requesting pages from puffin using this command:

wget -r -e robots=off --user-agent="Mozilla/5.0 (TN web checker) Gecko/20100101 Firefox/18.0" --no-check-cert --server-response http://newlive.puffin.webarch.net/ 

And the result, so far, is 586 502s -- so this problem is far from fixed after all :-(

grep '1.[0|1]" 502 ' /var/log/nginx/access.log | wc -l
586

So I think the next step is try tweaking some settings as suggested in the links in ticket:483#comment:28

comment:31 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 9.05 to 9.15

The 586 502s all happened at the same time, between 12:25:29 and 12:25:32 and I expect it was me that cause it, so the issue might be solved, more testing needed.

comment:32 Changed 4 years ago by jim

Sounds promising!!! Anything you want me to do Chris?

comment:33 Changed 4 years ago by chris

  • Owner changed from chris to jim
  • Priority changed from blocker to minor
  • Status changed from new to assigned

Apart from Jim raising a ticket against BOA regarding the 502 issue I think this can be now be closed.

The following is about to be posted to ticket:466#comment:67 about this issue:

A recursive wget is being run from penguin against puffin.

Following this advice:

Note: Barracuda will *not* overwrite your changes made to php.ini or php-fpm.conf on the next upgrade if you will touch the empty control file:

$ touch /opt/etc/custom.php.ini

This file was touched.

Looking at /var/xdrago/clear.sh it really doesn't do anything we need, it does things like clobbering log files but we are not so short of space that this is an issue so I don't think the root crontab needs it uncommenting, however after the next BOA upgrade we should check to see that it's not been reinstated.

comment:34 follow-up: ↓ 35 Changed 4 years ago by jim

  • Owner changed from jim to chris
  • Priority changed from minor to major

Though the 502 issue is resolved, I think stopping the killing of PHP-FPM is bad:

Noticed server very slow (and Munin stopped on Thursday for all Web Architect assets BTW), ran top and saw PHP-FPM sitting at 70-90% CPU. Did my work, stayed slow, ran top again, same thing...

So I service php53-fpm restart and there was an issue as below:

puffin:~# service php53-fpm restart
Gracefully shutting down php53-fpm.......... failed. Use force-quit
Starting php53-fpm..[10-Mar-2013 17:25:29.232841] ERROR: pid 8103, fpm_sockets_new_listening_socket(), line 191: unable to bind listening socket for address '127.0.0.1:9090': Address already in use (98)
[10-Mar-2013 17:25:29.233035] ERROR: pid 8103, fpm_init(), line 72: FPM initialization failed
 done
puffin:~# service php53-fpm restart
Gracefully shutting down php53-fpm.......... failed. Use force-quit
Starting php53-fpm..[10-Mar-2013 17:25:39.882861] ERROR: pid 8203, fpm_sockets_new_listening_socket(), line 191: unable to bind listening socket for address '127.0.0.1:9090': Address already in use (98)
[10-Mar-2013 17:25:39.883052] ERROR: pid 8203, fpm_init(), line 72: FPM initialization failed
 done
puffin:~# service php53-fpm force-quit 
Terminating php53-fpm... done
puffin:~# service php53-fpm start
Starting php53-fpm.. done

However, after this the site has been super fast again.

Suggest going back to stock BOA and letting the cron jobs do their thing -- the 502s appear to have been caused by other issues (GEOS IP etc).

Escalating, passing to Chris for consideration...

Changed 4 years ago by chris

multips memory usage to 7th March 2013

Changed 4 years ago by chris

phpfpm average process size to 7th March 2013

Changed 4 years ago by chris

phpfpm memory usage to 7th March 2013

comment:35 in reply to: ↑ 34 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 1.1
  • Total Hours changed from 9.15 to 10.25

Replying to jim:

Though the 502 issue is resolved

I don't think it is resolved, the outstanding issue, as far as I'm concerned, is for you to raise this matter with the BOA developers.

I think stopping the killing of PHP-FPM is bad:

I don't agree, I think killing PHP-FPM was bad.

Noticed server very slow

Do you have any data to show this?

(and Munin stopped on Thursday for all Web Architect assets BTW)

Yes, the munin stats were moved to another server and I sent a email to the list about it. The server they were moved to is password protected and also uses a Cacert.org certificate and this has caused problems, so I have set up penguin to generate munin stats for the two servers.

Munin was installed from backports:

aptitude install munin/squeeze-backports

The following as added to /etc/munin/munin.conf (with the email address for my phone not the one shown here):

contact.me.command mail -s "${var:host} Munin Alert" nobody@example.org
contact.me.always_send warning critical

[penguin.transitionnetwork.org]
    #address 81.95.52.111
    address 127.0.0.1
    use_node_name yes

[puffin.transitionnetwork.org]
    address 81.95.52.103
    use_node_name yes

And this was added to /etc/munin/munin-node.conf on both servers:

# https://penguin.webarch.net/
allow ^81\.95\.52\.111$

This was added to the ssl config in /etc/nginx/sites-available/penguin:

        location /munin/ {
                # http://munin.readthedocs.org/en/latest/example/webserver/ 
                alias /var/cache/munin/www/;
                expires modified +310s;
        }

And this was added to /etc/csf/csf.allow on puffin:

tcp:in:d=4949:s=81.95.52.111 # penguin.webarch.net

And csf was restarted on puffin:

csf -r

And now we have our own munin stats here: https://penguin.transitionnetwork.org/munin/ and the wiki pages have been updated, wiki:PuffinServer#MuninStats and wiki:PenguinServer#MuninStats

ran top and saw PHP-FPM sitting at 70-90% CPU. Did my work, stayed slow, ran top again, same thing...

So I service php53-fpm restart and there was an issue as below:

puffin:~# service php53-fpm restart
Gracefully shutting down php53-fpm.......... failed. Use force-quit
Starting php53-fpm..[10-Mar-2013 17:25:29.232841] ERROR: pid 8103, fpm_sockets_new_listening_socket(), line 191: unable to bind listening socket for address '127.0.0.1:9090': Address already in use (98)
[10-Mar-2013 17:25:29.233035] ERROR: pid 8103, fpm_init(), line 72: FPM initialization failed
 done
puffin:~# service php53-fpm restart
Gracefully shutting down php53-fpm.......... failed. Use force-quit
Starting php53-fpm..[10-Mar-2013 17:25:39.882861] ERROR: pid 8203, fpm_sockets_new_listening_socket(), line 191: unable to bind listening socket for address '127.0.0.1:9090': Address already in use (98)
[10-Mar-2013 17:25:39.883052] ERROR: pid 8203, fpm_init(), line 72: FPM initialization failed
 done
puffin:~# service php53-fpm force-quit 
Terminating php53-fpm... done
puffin:~# service php53-fpm start
Starting php53-fpm.. done

However, after this the site has been super fast again.

The above problem with restarting php-fpm illustrates why there were problems with it being killed via cron -- this isn't a graceful event, it involves downtime and while php-fpm is down people get the 502 errors.

The justification for php-fpm being killed all the time is that it leaks memory. But we have no evidence of this, if you look at these graphs, which cover the period from the server going live to 7th March 2013, you can see that the memory usage for php-fpm does not keep growing -- it's stable:

Suggest going back to stock BOA and letting the cron jobs do their thing

I strongly disgree, I consider that all the evidence points to the 502s being causes by the cron job killing php-fpm and that if this cron job was reinstated then we would start seeing a lot of 502s again.

the 502s appear to have been caused by other issues (GEOS IP etc).

What "other issues" and what is "GEOS IP"?

comment:36 follow-up: ↓ 37 Changed 4 years ago by jim

Sigh... Here goes...

Re Killing PHP-FPM:
That decision was made by the BOA creators based on lots of experience, so I'm not going to second guess them. If you want to and disagree then that's your choice, but I would urge you to understand their reasons (beyond the risk of memory leaks).

I just want to point out that the 502s might have been caused by other issues I mentioned to you before (during a Skype I think): http://drupal.org/node/1913488 and http://drupal.org/node/1263602.

As an aside, you might want to check http://drupal.org/project/issues/barracuda occasionally to keep up to date with what's going on with the platform. I also follow the RSS from http://drupalcode.org/project/barracuda.git to keep up to date.

Re slow site:
I have no data to share, but it was VERY slow, taking ages to show pages and top showed a busy PHP-FPM task. Had Munin been working we would have had data.

Ed also complained of this on the call, but I'd never seen it until now.


That covers your points I think. I just want to say we have an update coming up for BOA 2.0.6 that includes fixes for the above issues and a bunch more stuff besides. I would like to make it clear that my position is we should aim for 'stock' where possible to avoid missing out on good stuff, and to avoid spending Transition's money on potentially unnecessary fixes.

We should try update out -- and try the FPM killing too -- then we will have all the information we need to raise a bug should the issue still arise. Else we might have been seeing manifestations of other issues since fixed that appear to be due to the FPM killer cron job.

I can't raise a sensible ticket until we've proven/disproven your theory against a clean base.

comment:37 in reply to: ↑ 36 ; follow-ups: ↓ 38 ↓ 41 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 1.1
  • Total Hours changed from 10.25 to 11.35

sigh indeed -- I thought this issue was resolved, but it's back :-(

Replying to jim:

Re Killing PHP-FPM:
That decision was made by the BOA creators based on lots of experience, so I'm not going to second guess them. If you want to and disagree then that's your choice, but I would urge you to understand their reasons (beyond the risk of memory leaks).

I'd love to know the exact reasons, am I wrong in understanding that it was down to memory leaks? Can you point out where the BOA creators have documented the reasons for the killing of php-fpm?

I just want to point out that the 502s might have been caused by other issues I mentioned to you before (during a Skype I think): http://drupal.org/node/1913488

OK, we do have logs that indicate that the issue raised on that ticket is happening on the server:

grep " Aborted connection " /var/log/daemon.log
Mar 10 17:25:21 puffin mysqld: 130310 17:25:21 [Warning] Aborted connection 3885543 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 10 17:25:31 puffin mysqld: 130310 17:25:31 [Warning] Aborted connection 3885595 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 10 17:25:31 puffin mysqld: 130310 17:25:31 [Warning] Aborted connection 3885603 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 10 17:25:31 puffin mysqld: 130310 17:25:31 [Warning] Aborted connection 3885606 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 10 17:25:31 puffin mysqld: 130310 17:25:31 [Warning] Aborted connection 3885604 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 10 17:25:31 puffin mysqld: 130310 17:25:31 [Warning] Aborted connection 3885601 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 10 17:25:31 puffin mysqld: 130310 17:25:31 [Warning] Aborted connection 3885605 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 10 17:25:46 puffin mysqld: 130310 17:25:46 [Warning] Aborted connection 3885698 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 10 17:25:46 puffin mysqld: 130310 17:25:46 [Warning] Aborted connection 3885700 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 10 17:25:46 puffin mysqld: 130310 17:25:46 [Warning] Aborted connection 3885695 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 10 17:25:46 puffin mysqld: 130310 17:25:46 [Warning] Aborted connection 3885699 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 11 03:16:14 puffin mysqld: 130311  3:16:14 [Warning] Aborted connection 3938842 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

There are 45 502 errors that correspond to the above error on March 10th at 17:25 and one from March 11th at 03:16:

grep '1.[0|1]" 502 ' /var/log/nginx/access.log.1

"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /user/login HTTP/1.0" 502 166 747 311 "http://www.transitionnetwork.org/login" "Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/17.0 Firefox/17.0" 11.902 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /user/login HTTP/1.0" 502 568 787 713 "http://www.transitionnetwork.org/login" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11" 12.009 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "POST /node/23907/edit HTTP/1.0" 502 166 10553 311 "https://www.transitionnetwork.org/node/23907/edit" "Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0" 12.053 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "POST /user/login HTTP/1.0" 502 166 713 311 "http://www.transitionnetwork.org/user/login" "Opera/9.80 (Windows NT 6.1; Win64; x64) Presto/2.12.388 Version/12.11" 18.508 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /blogs/feed/transition-us/2011 HTTP/1.0" 502 166 366 311 "-" "Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/)" 1.667 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /node/add/forum HTTP/1.0" 502 568 798 713 "http://www.transitionnetwork.org/node/add/forum" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2" 19.549 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /node/add/forum HTTP/1.0" 502 166 762 311 "http://www.transitionnetwork.org/node/add/forum" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0" 6.786 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "POST /user/login HTTP/1.0" 502 568 745 713 "http://www.transitionnetwork.org/user/login" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11" 9.409 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "POST /user/login HTTP/1.0" 502 568 762 713 "http://www.transitionnetwork.org/user/login" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.12 (KHTML, like Gecko) Maxthon/3.0 Chrome/18.0.966.0 Safari/535.12" 25.193 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "POST /user/login HTTP/1.0" 502 166 728 311 "http://www.transitionnetwork.org/user/login" "Opera/9.80 (Windows NT 5.1; U; MRA 6.0 (build 5970); ru) Presto/2.7.62 Version/11.00" 23.682 "-"
"XX.XX.XX.XX" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /transition-movie HTTP/1.1" 502 166 410 311 "http://www.transitionnetwork.org/books" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/536.26.17 (KHTML, like Gecko) Version/6.0.2 Safari/536.26.17" 1.745 "-"
"XX.XX.XX.XX" chive.master.puffin.webarch.net [10/Mar/2013:17:25:21 +0000] "POST /index.php/schema/transitionnetwor/tables HTTP/1.1" 502 568 677 713 "http://chive.master.puffin.webarch.net/index.php/schema/transitionnetwor" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22" 12.133 "-"
"XX.XX.XX.XX" chive.master.puffin.webarch.net [10/Mar/2013:17:25:21 +0000] "POST /index.php/schema/transitionnetwor/tables/node/structure HTTP/1.1" 502 568 661 713 "http://chive.master.puffin.webarch.net/index.php/schema/transitionnetwor" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22" 12.155 "-"
"XX.XX.XX.XX" chive.master.puffin.webarch.net [10/Mar/2013:17:25:21 +0000] "POST /index.php/schema/transitionnetwor/views HTTP/1.1" 502 568 676 713 "http://chive.master.puffin.webarch.net/index.php/schema/transitionnetwor" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22" 12.119 "-"
"XX.XX.XX.XX" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /transition-movie HTTP/1.0" 502 166 271 311 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1" 1.458 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /forums/process/web-and-comms/eco-friendly-printing-and-print-manufacturers HTTP/1.0" 502 166 882 311 "http://www.transitionnetwork.org/forums/process/web-and-comms/eco-friendly-printing-and-print-manufacturers" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0" 0.001 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /forum/recent HTTP/1.0" 502 166 764 311 "http://www.transitionnetwork.org/forum/recent" "Opera/9.80 (Windows NT 6.1; Win64; x64) Presto/2.12.388 Version/12.11" 0.001 "-"
"XX.XX.XX.XX" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /people/nerissa-warren HTTP/1.1" 502 166 301 311 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.530 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /user HTTP/1.0" 502 166 740 311 "http://www.transitionnetwork.org/user" "Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/17.0 Firefox/17.0" 0.001 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /forum/recent HTTP/1.0" 502 568 813 713 "http://www.transitionnetwork.org/forum/recent" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.12 (KHTML, like Gecko) Maxthon/3.0 Chrome/18.0.966.0 Safari/535.12" 0.001 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /forum/recent HTTP/1.0" 502 568 796 713 "http://www.transitionnetwork.org/forum/recent" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11" 0.002 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /forum/recent HTTP/1.0" 502 166 779 311 "http://www.transitionnetwork.org/forum/recent" "Opera/9.80 (Windows NT 5.1; U; MRA 6.0 (build 5970); ru) Presto/2.7.62 Version/11.00" 0.002 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /user HTTP/1.0" 502 568 780 713 "http://www.transitionnetwork.org/user" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11" 0.015 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /forums/process/web-and-comms/eco-friendly-printing-and-print-manufacturers HTTP/1.0" 502 568 918 713 "http://www.transitionnetwork.org/forums/process/web-and-comms/eco-friendly-printing-and-print-manufacturers" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2" 0.001 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /forums/themes/local-economics/community-idea-needs-careful-exploration HTTP/1.0" 502 166 874 311 "http://www.transitionnetwork.org/forums/themes/local-economics/community-idea-needs-careful-exploration" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0" 0.001 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:21 +0000] "GET /forum HTTP/1.0" 502 166 750 311 "http://www.transitionnetwork.org/forum" "Opera/9.80 (Windows NT 6.1; Win64; x64) Presto/2.12.388 Version/12.11" 0.001 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:31 +0000] "GET /forums/process/researchers/collaborative-bibliography-academic-work-transition HTTP/1.0" 502 568 945 713 "http://www.transitionnetwork.org/forums/process/researchers/collaborative-bibliography-academic-work-transition" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.12 (KHTML, like Gecko) Maxthon/3.0 Chrome/18.0.966.0 Safari/535.12" 3.382 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:31 +0000] "GET /forum/recent HTTP/1.0" 502 568 796 713 "http://www.transitionnetwork.org/forum/recent" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11" 1.880 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:31 +0000] "GET /node/add/forum HTTP/1.0" 502 166 766 311 "http://www.transitionnetwork.org/node/add/forum" "Mozilla/5.0 (Windows NT 6.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1" 1.322 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:31 +0000] "GET /index.php HTTP/1.0" 502 568 788 713 "http://www.transitionnetwork.org/index.php" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2" 0.813 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:31 +0000] "GET /node/add/forum HTTP/1.0" 502 166 768 311 "http://www.transitionnetwork.org/node/add/forum" "Opera/9.80 (Windows NT 6.1; Win64; x64) Presto/2.12.388 Version/12.11" 0.982 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:31 +0000] "GET /node/add/forum HTTP/1.0" 502 166 760 311 "http://www.transitionnetwork.org/node/add/forum" "Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/17.0 Firefox/17.0" 0.384 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:31 +0000] "GET / HTTP/1.0" 502 166 748 311 "http://www.transitionnetwork.org/node/add/forum" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0" 0.594 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:31 +0000] "GET /forums/process/researchers/collaborative-bibliography-academic-work-transition HTTP/1.0" 502 568 928 713 "http://www.transitionnetwork.org/forums/process/researchers/collaborative-bibliography-academic-work-transition" "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.91 Safari/537.11" 0.000 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:31 +0000] "GET /forums/process/web-and-comms/eco-friendly-printing-and-print-manufacturers HTTP/1.0" 502 166 880 311 "http://www.transitionnetwork.org/forums/process/web-and-comms/eco-friendly-printing-and-print-manufacturers" "Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/17.0 Firefox/17.0" 0.017 "-"
"XX.XX.XX.XX" www.transitionnetwork.org [10/Mar/2013:17:25:31 +0000] "GET /people/braziers HTTP/1.1" 502 166 376 311 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.833 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:46 +0000] "GET /node/add/forum HTTP/1.0" 502 166 783 311 "http://www.transitionnetwork.org/node/add/forum" "Opera/9.80 (Windows NT 5.1; U; MRA 6.0 (build 5970); ru) Presto/2.7.62 Version/11.00" 0.525 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:46 +0000] "GET /forums/process/researchers/collaborative-bibliography-academic-work-transition HTTP/1.0" 502 166 911 311 "http://www.transitionnetwork.org/forums/process/researchers/collaborative-bibliography-academic-work-transition" "Opera/9.80 (Windows NT 5.1; U; MRA 6.0 (build 5970); ru) Presto/2.7.62 Version/11.00" 0.000 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:46 +0000] "GET /index.php HTTP/1.0" 502 568 790 713 "http://www.transitionnetwork.org/index.php" "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.91 Safari/537.11" 0.001 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:46 +0000] "GET /index.php HTTP/1.0" 502 568 807 713 "http://www.transitionnetwork.org/index.php" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.12 (KHTML, like Gecko) Maxthon/3.0 Chrome/18.0.966.0 Safari/535.12" 0.001 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:47 +0000] "GET /forums/process/web-and-comms/eco-friendly-printing-and-print-manufacturers HTTP/1.0" 502 166 903 311 "http://www.transitionnetwork.org/forums/process/web-and-comms/eco-friendly-printing-and-print-manufacturers" "Opera/9.80 (Windows NT 5.1; U; MRA 6.0 (build 5970); ru) Presto/2.7.62 Version/11.00" 0.001 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:47 +0000] "GET /index.php HTTP/1.0" 502 166 756 311 "http://www.transitionnetwork.org/index.php" "Mozilla/5.0 (Windows NT 6.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1" 0.001 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:47 +0000] "GET /forums/themes/local-economics/community-idea-needs-careful-exploration HTTP/1.0" 502 166 895 311 "http://www.transitionnetwork.org/forums/themes/local-economics/community-idea-needs-careful-exploration" "Opera/9.80 (Windows NT 5.1; U; MRA 6.0 (build 5970); ru) Presto/2.7.62 Version/11.00" 0.001 "-"
"XX.XX.XX.XX, 127.0.0.1" www.transitionnetwork.org [10/Mar/2013:17:25:47 +0000] "GET /forums/process/general/sailing-farm-tribe-transition-transport-tribe-tomorrow-tttt HTTP/1.0" 502 166 919 311 "http://www.transitionnetwork.org/forums/process/general/sailing-farm-tribe-transition-transport-tribe-tomorrow-tttt" "Opera/9.80 (Windows NT 5.1; U; MRA 6.0 (build 5970); ru) Presto/2.7.62 Version/11.00" 0.001 "-"
"XX.XX.XX.XX" www.transitionnetwork.org [10/Mar/2013:17:25:50 +0000] "GET /people/alistair-millar HTTP/1.1" 502 166 383 311 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.146 "-"
"XX.XX.XX.XX" www.transitionnetwork.org [11/Mar/2013:03:16:14 +0000] "GET /people/rachael-ludlow-williams HTTP/1.1" 502 166 460 311 "-" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" 0.491 "-"

Jim you logged into the server at 17:24 on 10th March -- did you by any chance stop or restart the MySQL server? If you did then this would explain the 45 502s that happened within a minute of you sshing in.

There have been no 502 errors today but as documented above, 46 in yesterdays log, 2 the day before, one 5 days ago and 6 days ago, 19 seven days ago, 36 ten days ago, 4 eleven days ago, 2 fourteen days ago, one 16 days ago, 2 twenty one days ago.

Todays nginx log:

grep '1.[0|1]" 502 ' /var/log/nginx/access.log

Yesterdays log:

grep '1.[0|1]" 502 ' /var/log/nginx/access.log.1

The day before (gz'ed log file):

zgrep '1.[0|1]" 502 ' /var/log/nginx/access.log.2.gz

I have added a cronjob on puffin to email me the 502s from todays nginx access log every hour:

1 * * * * grep '1.[0|1]" 502 ' /var/log/nginx/access.log | mutt -s "502s errors from puffin" chris@webarchitects.co.uk 

and http://drupal.org/node/1263602.

I can find no evidence that newrelic is installed on the server, what has this do to with the problem?

As an aside, you might want to check http://drupal.org/project/issues/barracuda occasionally to keep up to date with what's going on with the platform. I also follow the RSS from http://drupalcode.org/project/barracuda.git to keep up to date.

OK, thanks.

Re slow site:
I have no data to share, but it was VERY slow, taking ages to show pages and top showed a busy PHP-FPM task. Had Munin been working we would have had data.

Munin was working, as I said on the list it was moved to another server, I'll post to graphs to this ticket for the period 7th March to today.

Ed also complained of this on the call, but I'd never seen it until now.

Yes, but we need something concrete to work on don't we?

That covers your points I think. I just want to say we have an update coming up for BOA 2.0.6 that includes fixes for the above issues and a bunch more stuff besides. I would like to make it clear that my position is we should aim for 'stock' where possible to avoid missing out on good stuff, and to avoid spending Transition's money on potentially unnecessary fixes.

I don't have a problem with that.

If you read the comments above it is clear that there was an issue with the 502s that did appear to solved by the stopping of the php-fpm killing cron job -- the 502s did correspond to the times when the cron job was killing php-fpm. Now the 502s have started again, so this is now very much a critical issue again, in fact more critical than before as the site is now live.

We should try update out -- and try the FPM killing too -- then we will have all the information we need to raise a bug should the issue still arise. Else we might have been seeing manifestations of other issues since fixed that appear to be due to the FPM killer cron job.

I can't raise a sensible ticket until we've proven/disproven your theory against a clean base.

I'm not sure that a "clean base" is in this context, but I agree we can no longer blame the 502s on the php-fpm killing alone, however I still don't think we need to start killing php-fpm again -- what is the reason to kill it, when killing it results in 502s? What is the benefit?

Changed 4 years ago by chris

multips memory usage 7th to 11th March 2013

Changed 4 years ago by chris

phpfpm average process 7th to 11th March 2013

Changed 4 years ago by chris

phpfpm memory usage 7th to 11th March 2013

comment:38 in reply to: ↑ 37 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.2
  • Total Hours changed from 11.35 to 11.55

Replying to chris:

Munin was working, as I said on the list it was moved to another server, I'll post to graphs to this ticket for the period 7th March to today.

I have attached the graphs for 7th to 11th March 2013 and they do indicate that there was an issue with php-fpm memory usage growing for a 3 hour period on 10th March:

I don't know the cause of this memory usage spike. It is however a lot less dramatic when looked at in the context of the other main processes running on the server, so I don't know if it means much.

comment:39 follow-up: ↓ 40 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 11.55 to 11.8

There was one 502 last night, at 3:17am, served to a spambot:

"XX.XX.XX.XX" www.transitionnetwork.org [12/Mar/2013:03:17:20 +0000] "GET /user/register HTTP/1.1" 502 568 366 713 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" 10.257 "-"  

In /var/log/messages.1 there is this corresponding entry, the IP address matches the one in the Nginx log:

Mar  3 13:17:19 puffin drupal: http://www.transitionnetwork.org|1362316639|BOTCHA|XX.XX.XX.XX|http://www.transitionnetwork.org/user/register|http://www.transitionnetwork.org/user/register|0||user_register post blocked by BOTCHA: submission looks like from a spambot.Failed 4 of 5 recipes [honeypot, honeypot2, no_resubmit, obscure_url] from "default" recipe book.
Mar  3 13:17:19 puffin drupal: http://www.transitionnetwork.org|1362316639|spambot|XX.XX.XX.XX|http://www.transitionnetwork.org/user/register|http://www.transitionnetwork.org/user/register|0||Blocked registration: email=fashionzxv25@XX.XX,ip=XX.XX.XX.XX

It coincides with these !MySQL errors in /var/log/daemon.log:

Mar 12 03:17:20 puffin mysqld: 130312  3:17:20 [Warning] Aborted connection 4086920 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 12 03:17:20 puffin mysqld: 130312  3:17:20 [Warning] Aborted connection 4086921 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 12 03:17:20 puffin mysqld: 130312  3:17:20 [Warning] Aborted connection 4086922 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 12 03:17:20 puffin mysqld: 130312  3:17:20 [Warning] Aborted connection 4086919 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

This doesn't correspond with the nightly backup which happened at 1am last night.

It does correspond with a large load spike, I'll post the munin graphs of this to this ticket and link to them in the next comment.


Changed 4 years ago by chris

cpu load spike 2013-03-12

Changed 4 years ago by chris

firewall spike 2013-03-12

Changed 4 years ago by chris

mysql spike 2013-03-12

Changed 4 years ago by chris

php-fpm spike 2013-03-12

comment:40 in reply to: ↑ 39 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 11.8 to 11.9

Replying to chris:

In /var/log/messages.1 there is this corresponding entry, the IP address matches the one in the Nginx log:

Mar  3 13:17:19 puffin drupal: http://www.transitionnetwork.org|1362316639|BOTCHA|XX.XX.XX.XX|http://www.transitionnetwork.org/user/register|http://www.transitionnetwork.org/user/register|0||user_register post blocked by BOTCHA: submission looks like from a spambot.Failed 4 of 5 recipes [honeypot, honeypot2, no_resubmit, obscure_url] from "default" recipe book.
Mar  3 13:17:19 puffin drupal: http://www.transitionnetwork.org|1362316639|spambot|XX.XX.XX.XX|http://www.transitionnetwork.org/user/register|http://www.transitionnetwork.org/user/register|0||Blocked registration: email=fashionzxv25@XX.XX,ip=XX.XX.XX.XX

Oops, the above was the same IP but the date was wrong, it does show it was a spambot but that request was on 3rd March 2013.

Load spike graphs:

I haven't tracked down the corresponding Nginx logs for this spike.

comment:41 in reply to: ↑ 37 ; follow-up: ↓ 44 Changed 4 years ago by jim

Replying to chris:

Re Killing PHP-FPM:

I'd love to know the exact reasons, am I wrong in understanding that it was down to memory leaks? Can you point out where the BOA creators have documented the reasons for the killing of php-fpm?

Not right now, but all the 'self-healing' scripts in /var/xdrago do things to generally keep servers serving pages. A zombie or hung or high-cpu FPM task would be killed by the cron jobs in this instance.

OK, we do have logs that indicate that the issue raised on that ticket is happening on the server:

<snip logs>
I've checked and the GEOS PHP extension is not installed, which is good because been marked experimental and removed from the default install, due to badness happening. That's that rules out for now...

However: What worries me is that the issue the person in the OP and comment 1 on the GEOS issue appears to have the same symptoms as us -- have you read that and followed the links, checked the screenshots?

So I'm thinking: either the GEOS thing is is masking another issue that BOA has but no-one is reporting as yet, OR we have something behaving like the same kind of thing -- a PHP extension (.so file) that is causing issues that crashes PHP, and sometimes database connections.

There are 45 502 errors that correspond to the above error on March 10th at 17:25 and one from March 11th at 03:16:

<snippedy>

Jim you logged into the server at 17:24 on 10th March -- did you by any chance stop or restart the MySQL server? If you did then this would explain the 45 502s that happened within a minute of you sshing in.

Nope, as outlined: I logged in, was sloooow, did some stuff, still slow. I did clear the Drupal caches a few times as part of the work I was doing, but didn't restart MySQL -- but I did restart php53-fpm per my earlier comment 34.

Obviously flushing the caches in Drupal will cause LOTS of transactions whist things get rebuilt, so that might explain the source of flurry of 502s/aborted connections if not the cause of the errors themselves.

I can find no evidence that newrelic is installed on the server, what has this do to with the problem?

It is on my server, got my wires crossed, no worries.

Actually that's a good point: New Relic gives you 2 weeks of 'full' reporting for free -- we might be able to use their analytics (that go far deeper than Munin and work with Drupal) to help root out the cause? We can always uninstall it when done.


...update coming up for BOA 2.0.6 that includes fixes for the above issues and a bunch more stuff besides...

I don't have a problem with that.

Ok, well we're just waiting on Omega8cc to finish up and release the next version of BOA.

If you read the comments above it is clear that there was an issue with the 502s that did appear to solved by the stopping of the php-fpm killing cron job -- the 502s did correspond to the times when the cron job was killing php-fpm. Now the 502s have started again, so this is now very much a critical issue again, in fact more critical than before as the site is now live.

I know, what a bummer.

I can't raise a sensible ticket until we've proven/disproven your theory against a clean base.

I'm not sure that a "clean base" is in this context, but I agree we can no longer blame the 502s on the php-fpm killing alone, however I still don't think we need to start killing php-fpm again -- what is the reason to kill it, when killing it results in 502s? What is the benefit?

I broadly agree, but it's clear the 502s are caused by something that is exacerbated by the PHP-FPM killing -- or maybe the re-spawning of the processes once killed. The point being that we've found another thing that generates them (clearing Drupal caches) and a thing that -- at least when I tried it -- stopped some of the errors (that was when I restarted the php53-fpm service.

All I can suggest is these:

  • Wait for the next BOA version
  • Keep a keener eye on this (you're now doing that via email, which is great)
  • Raise an issue on the Barracuda issue tracker.

We're doing all these except the latter... Since we're still in this boat I'll be happy to raise a ticket tomorrow, BUT can I please ask that you try to summarise this ticket and your findings to date in a paragraph or two? That would be really handy as I know what I've seen, but you've tried other stuff... I'll then add to your outline and post in the issue queue with all the appropriate files, and a link to this page as a source further information.

Finally, when logged in to Drupal as an admin/developer the phpinfo() page shows the extensions loaded and versions from the Drupal PHP context... It's a big job but looking these up and comparing their versions to known bugs might get us somewhere... This is a back burner thing I guess.

comment:42 follow-up: ↓ 45 Changed 4 years ago by jim

OK so MariaDB 5.5.30 has finally been released and committed to Barracuda, so we're due a release any day now as that was a blocker...

I am going to go out on a limb, but based on the OP in the GEOS issue, I reckon this might be the solution to our pain -- it's also the only thing that didn't get updated by the up-head I did.

Fingers crossed.

comment:43 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 11.9 to 12.15

I'll try to spend some time tomorrow addressing the points raised by Jim in ticket:483#comment:41 and ticket:483#comment:42

There was one 502 last night, at 3:16 am served to bingbot:

"157.56.93.216" www.transitionnetwork.org [14/Mar/2013:03:16:39 +0000] "GET /initiatives/honiton HTTP/1.1" 502 166 299 311 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 4.335 "-"

I need to write a small script so that I only get a email about a 502 when there is one to save having to check the email that is being sent ever hour even when there are none.

comment:44 in reply to: ↑ 41 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.35
  • Total Hours changed from 12.15 to 12.5

Replying to jim:

A zombie or hung or high-cpu FPM task would be killed by the cron jobs in this instance.

But we haven't had an issue like this, look at the stats:

There is no sign of a php-fpm process that has needed killing.

However: What worries me is that the issue the person in the OP and comment 1 on the GEOS issue appears to have the same symptoms as us -- have you read that and followed the links, checked the screenshots?

I don't see what this has to do with us -- his server ran out of memory and stopped working as far as I can see. This isn't something we need to worry about, we are not running out of memory:

So I'm thinking: either the GEOS thing is is masking another issue that BOA has but no-one is reporting as yet, OR we have something behaving like the same kind of thing -- a PHP extension (.so file) that is causing issues that crashes PHP, and sometimes database connections.

What evidence is there that anything like this is happening?

Jim you logged into the server at 17:24 on 10th March -- did you by any chance stop or restart the MySQL server? If you did then this would explain the 45 502s that happened within a minute of you sshing in.

Nope, as outlined: I logged in, was sloooow, did some stuff, still slow. I did clear the Drupal caches a few times as part of the work I was doing, but didn't restart MySQL -- but I did restart php53-fpm per my earlier comment 34.

I think the 502s were caused by you restarting php53-fpm.

Actually that's a good point: New Relic gives you 2 weeks of 'full' reporting for free -- we might be able to use their analytics (that go far deeper than Munin and work with Drupal) to help root out the cause? We can always uninstall it when done.

I don't see the point -- I don't think we have an issue.

I broadly agree, but it's clear the 502s are caused by something that is exacerbated by the PHP-FPM killing -- or maybe the re-spawning of the processes once killed. The point being that we've found another thing that generates them (clearing Drupal caches) and a thing that -- at least when I tried it -- stopped some of the errors (that was when I restarted the php53-fpm service.

I think the 502s happen either when the database or php-fpm isn't running and it's as simple as that.

All I can suggest is these:

  • Wait for the next BOA version
  • Keep a keener eye on this (you're now doing that via email, which is great)
  • Raise an issue on the Barracuda issue tracker.

We're doing all these except the latter... Since we're still in this boat I'll be happy to raise a ticket tomorrow, BUT can I please ask that you try to summarise this ticket and your findings to date in a paragraph or two? That would be really handy as I know what I've seen, but you've tried other stuff... I'll then add to your outline and post in the issue queue with all the appropriate files, and a link to this page as a source further information.

In a nutshell: When php-fpm is killed we get 502 errors from Nginx. So we disabled php-fpm being killed and the 502 stopped.

comment:45 in reply to: ↑ 42 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 12.5 to 13.0

Replying to jim:

OK so MariaDB 5.5.30 has finally been released and committed to Barracuda, so we're due a release any day now as that was a blocker...

I am going to go out on a limb, but based on the OP in the GEOS issue, I reckon this might be the solution to our pain -- it's also the only thing that didn't get updated by the up-head I did.

Yeah, could be, the last 502, ticket/483#comment:43 does have a corresponding entry in /var/log/daemon.log:

Mar 14 03:16:39 puffin mysqld: 130314  3:16:39 [Warning] Aborted connection 194223 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 14 03:16:39 puffin mysqld: 130314  3:16:39 [Warning] Aborted connection 194222 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Version 0, edited 4 years ago by chris (next)

comment:46 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.3
  • Total Hours changed from 13.0 to 13.3

Draft summary of the 502 issue for raising with the BOA developers:

HTTP 502 Errors

We have a Drupal site running via BOA at http://www.transitionnetwork.org/ on a virtual debian machine with 14 CPU cores and 4GB of RAM (munin stats for it are here: https://penguin.transitionnetwork.org/munin/transitionnetwork.org/puffin.transitionnetwork.org/ ).

We appear have two issues causing Nginx to serve 502 error messages to clients, one being caused by occasional (only a handful a week) MariaDB "Aborted connection" errors, which leave messages like this in /var/log/daemon.log:

Mar 14 03:16:39 puffin mysqld: 130314  3:16:39 [Warning] Aborted connection 194223 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 14 03:16:39 puffin mysqld: 130314  3:16:39 [Warning] Aborted connection 194222 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

Which correspond with 502 errors from Ngnix, the one that corresponds with the above error:

"157.56.93.216" www.transitionnetwork.org [14/Mar/2013:03:16:39 +0000] "GET /initiatives/honiton HTTP/1.1" 502 166 299 311 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 4.335 "-"

The other cause of huge number of 502 errors was the cron job that runs /var/xdrago/clear.sh which kills php-fpm on a regular basis -- we found that while it was being killed clients were served 502 errors by Ngnix. We have therefore commented out the cron job that runs /var/xdrago/clear.sh and we can't see any evidence of php-fpm having memory leaks and have no evidence of negative side effects caused by the removal of the cron job. The php-fpm munin stats for the server are here:

And the total memory usage of php-fpm can be seen in this graph:

Further detail of these issues are documented on this ticket: /trac/ticket/483

comment:47 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 13.3 to 13.55

I have written a small bash script which can be run on the command line to display recent 502 errors:

/usr/local/bin/find-502s 
  No recent 502 errors were found

And run via cron with a email address to send the result to.

This is the script:

#!/bin/bash

EMAIL=$1

GREP_NGINX=$(grep '1.[0|1]" 502 ' /var/log/nginx/access.log)

# check that the script is being run by root
if [[ "$(id -u)" != "0" ]] ; then
  echo "You must run $0 as root or via sudo" 
  exit 2
fi

# check to see if any 502s were found
if [[ $GREP_NGINX ]]; then
  # check for a email address 
  if [[ $1 ]]; then
    echo $GREP_NGINX | mutt -s "502 errors from puffin" $EMAIL
  # display the results of the grep
  else [[ ! $1 ]]
    echo "Supply a email address on the command line to send the following results by email" 
    echo ""
    echo $GREP_NGINX
  fi
else
   # we don't want any output if a email address is specified as the script is run via cron
   if [[ ! $EMAIL ]] ; then
     echo "No recent 502 errors were found" 
   fi
fi

comment:48 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 13.55 to 13.8

There was another 502 last night, again at 3:17 am:

"117.26.254.40" www.transitionnetwork.org [19/Mar/2013:03:17:02 +0000] "GET /user/login HTTP/1.1" 502 568 259 713 "http://www.transitionnetwork.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)" 0.389 "-"

The IP address is from China and all that this this bot, which is pretending to be IE6, has done the the last 4 days is make 14 GET requests for /user/login and 16 POST requests to /

This time there were not corresponding MySQL entries in the daemon.log.

I think it's worth noting the times:

  1. 03:17:02 ticket:483#comment:48
  2. 03:16:39 ticket:483#comment:43 and ticket:483#comment:45
  3. 03:17:20 ticket:483#comment:39

I'd guess there is some MySQL related task that runs at this time?

comment:49 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.65
  • Total Hours changed from 13.8 to 14.45

There were two 502 errors last night, both at exactly the same time, both on Port 80 and both from the the same Russian IP address, one request pretending to be Opera and one pretending to be Firefox.

Again this was a spam bot -- this IP address has posted 123 registration attempts to the site today so far.

"31.192.104.144, 127.0.0.1" www.transitionnetwork.org [21/Mar/2013:03:16:56 +0000] "GET /index.php HTTP/1.0" 502 166 757 311 "http://www.transitionnetwork.org/index.php" "Opera/9.80 (Windows NT 6.2; U; en) Presto/2.10.289 Version/12.02" 0.604 "-"
"31.192.104.144, 127.0.0.1" www.transitionnetwork.org [21/Mar/2013:03:16:56 +0000] "GET / HTTP/1.0" 502 166 761 311 "http://www.transitionnetwork.org/node/add/forum" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0" 0.625 "-"  

These are some of the logs from around the same time in /var/log/daemon.log:

Mar 21 03:16:56 puffin mysqld: 130321  3:16:56 [Warning] Aborted connection 1169164 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 21 03:16:56 puffin mysqld: 130321  3:16:56 [Warning] Aborted connection 1169163 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

And this is a selection of a few of the 606 entries from this IP address in /var/log/messages from around the same time:

Mar 21 03:16:38 puffin drupal: https://www.transitionnetwork.org|1363835798|BOTCHA|31.192.104.144|https://www.transitionnetwork.org/user/register|http://www.transitionnetwork.org/user/register|0||user_register post blocked by BOTCHA: submission looks like from a spambot.Failed 4 of 5 recipes [honeypot, honeypot2, obscure_url, timegate] from "default" recipe book.
Mar 21 03:16:38 puffin drupal: https://www.transitionnetwork.org|1363835798|spambot|31.192.104.144|https://www.transitionnetwork.org/user/register|http://www.transitionnetwork.org/user/register|0||Blocked registration: email=beardjfg@outlook.com,ip=31.192.104.144
Mar 21 03:16:42 puffin drupal: https://www.transitionnetwork.org|1363835802|BOTCHA|31.192.104.144|https://www.transitionnetwork.org/user/register|http://www.transitionnetwork.org/user/register|0||user_register post blocked by BOTCHA: submission looks like from a spambot.Failed 4 of 5 recipes [honeypot, honeypot2, obscure_url, timegate] from "default" recipe book.
Mar 21 03:16:42 puffin drupal: https://www.transitionnetwork.org|1363835802|spambot|31.192.104.144|https://www.transitionnetwork.org/user/register|http://www.transitionnetwork.org/user/register|0||Blocked registration: email=beardjfg@outlook.com,ip=31.192.104.144
Mar 21 03:16:47 puffin drupal: https://www.transitionnetwork.org|1363835807|user|31.192.104.144|https://www.transitionnetwork.org/user|http://www.transitionnetwork.org/user|0||Login attempt failed for Spoonsody.
Mar 21 03:16:50 puffin drupal: https://www.transitionnetwork.org|1363835810|user|31.192.104.144|https://www.transitionnetwork.org/user|http://www.transitionnetwork.org/user|0||Login attempt failed for Spoonsody.
Mar 21 03:16:51 puffin drupal: http://www.transitionnetwork.org|1363835811|access denied|31.192.104.144|http://www.transitionnetwork.org/node/add/forum|http://www.transitionnetwork.org/node/add/forum|0||node/add/forum
Mar 21 03:16:52 puffin drupal: https://www.transitionnetwork.org|1363835812|access denied|31.192.104.144|https://www.transitionnetwork.org/node/add/forum|http://www.transitionnetwork.org/node/add/forum|0||node/add/forum
Mar 21 03:16:54 puffin drupal: http://www.transitionnetwork.org|1363835814|access denied|31.192.104.144|http://www.transitionnetwork.org/node/add/forum|http://www.transitionnetwork.org/node/add/forum|0||node/add/forum
Mar 21 03:16:55 puffin drupal: https://www.transitionnetwork.org|1363835815|access denied|31.192.104.144|https://www.transitionnetwork.org/node/add/forum|http://www.transitionnetwork.org/node/add/forum|0||node/add/forum

Again the 502s happened at the same time as usual, these are the 4 times in the last 10 days there have been 502 errors:

  1. 2013-03-21 03:16:56 ticket:483#comment:49
  2. 2013-03-19 03:17:02 ticket:483#comment:48
  3. 2013-03-14 03:16:39 ticket:483#comment:43 and ticket:483#comment:45
  4. 2013-03-12 03:17:20 ticket:483#comment:39

There is a page on the MySQL site, Communication Errors and Aborted Connections which lists a number of reasons for this error and also points out that "the server increments the Aborted_clients status variable" -- we are at 1,169,164!

As we are only seeing one or two 502 a night, every few days and always between 3:16:30 am and 3:17:30 am and generally these are served up to spam bots I'm not sure this issue is worth spending much more time on?

I have checked for cron jobs that run at this time in /etc/cron.d/ but there are none, I don't know why these errors only happen at this time.

comment:50 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 14.45 to 14.55

There was a pair of 502 error's served up to a couple of bots last night:

"86.48.79.199, 127.0.0.1" transitionnetwork.org [24/Mar/2013:03:16:37 +0000] "GET /training/sponsoring HTTP/1.0" 502 166 354 311 "-" "Riddler (http://riddler.io/about.html)" 0.526 "-"                                                                                       
"180.76.5.151" www.transitionnetwork.org [24/Mar/2013:03:16:37 +0000] "GET /events/ HTTP/1.1" 502 166 234 311 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" 1.133 "-"

These are corresponding /var/log/daemon.log entries:

Mar 24 03:16:37 puffin mysqld: 130324  3:16:37 [Warning] Aborted connection 1601751 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 24 03:16:37 puffin mysqld: 130324  3:16:37 [Warning] Aborted connection 1601752 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

Updating the list of recent 502 errors:

comment:51 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 14.55 to 14.65

There was a 502 last night served up to a real person in the USA:

"XX.XX.XX.XX" www.transitionnetwork.org [26/Mar/2013:03:16:52 +0000] "GET /initiatives?page=2&themes=13&community_type=All&status_value=All&country=us&field_title_search= HTTP/1.1" 502 166 629 311 "http://www.transitionnetwork.org/initiatives?page=1&themes=13&community_type=All&status_value=All&country=us&field_title_search=" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; rv:16.0) Gecko/20100101 Firefox/16.0" 1.175 "-"

The corresponding /var/log/daemon.log entries:

Mar 26 03:16:52 puffin mysqld: 130326  3:16:52 [Warning] Aborted connection 1874815 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 26 03:16:52 puffin mysqld: 130326  3:16:52 [Warning] Aborted connection 1874816 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

Updating the list of recent 502 errors:

comment:52 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 14.65 to 14.75

Upgrading php-fpm, see ticket:218#comment:72, required a restart of php-fpm and during the restart these 502 errors were served to 4 different clients, at least two of these were bots:

"AAA.AAA.AAA.AAA" www.transitionnetwork.org [26/Mar/2013:12:46:45 +0000] "GET /training/courses HTTP/1.1" 502 166 367 311 "http://www.transitionnetwork.org/deepening" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0" 0.035 "-"
"BB.BB.BBB.BBB" www.transitionnetwork.org [26/Mar/2013:12:46:46 +0000] "GET /resources/all?stage_nid=All&themes=All&language=es&country=All&asset_type=All&keywords= HTTP/1.1" 502 568 407 713 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MALN; .NET4.0C; InfoPath.2)" 0.105 "-"
"CCC.CCC.CCC.CC" www.transitionnetwork.org [26/Mar/2013:12:46:46 +0000] "GET /news/2011-05-24/transitioner-selected-yellow-pages-cover HTTP/1.1" 502 166 222 311 "-" "ichiro/3.0 (http://help.goo.ne.jp/door/crawler.html)" 0.299 "-"
"DDD.DD.DD.DDD" www.transitionnetwork.org [26/Mar/2013:12:46:48 +0000] "GET /tags/community-supported-agriculture HTTP/1.1" 502 166 368 311 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.152 "-"
"86.48.79.199, 127.0.0.1" transitionnetwork.org [26/Mar/2013:12:46:51 +0000] "GET /user/login?destination=users/javier-guerrero HTTP/1.0" 502 166 379 311 "-" "Riddler (http://riddler.io/about.html)" 0.001 "-"

comment:53 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 14.75 to 14.85

There were two 502 errors served up to bots last night:

"86.48.79.199" transitionnetwork.org [29/Mar/2013:03:18:59 +0000] "GET /media/%22/people/peter-aston/%22/u003ePeter?page=8&order=name&sort=asc HTTP/1.0" 502 166 267 311 "-" "Riddler (http://riddler.io/about.html)" 0.199 "-"                                                                                                
"95.108.247.253" www.transitionnetwork.org [29/Mar/2013:03:18:59 +0000] "GET /initiatives/marlborough HTTP/1.1" 502 166 363 311 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 1.645 "-"

These were a bit later than usual time -- the server clock was out by several minutes so I installed ntp, see ticket:466#comment:78.

The corresponding /var/log/daemon.log entries:

Mar 29 03:18:59 puffin mysqld: 130329  3:18:59 [Warning] Aborted connection 2370670 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Mar 29 03:18:59 puffin mysqld: 130329  3:18:59 [Warning] Aborted connection 2370672 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

The updated list of recent 502s which were not caused by things like php-fpm updates and manual restarts:

comment:54 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 14.85 to 14.95

The upgrade to the latest BOA, ticket:529 has resulted in the script which clobbers log files and restarts php-fpm being re-enabled, it's run via the root crontab at 11 mins past the hour:

11 * * * * bash /var/xdrago/clear.sh >/dev/null 2>&1

It contains:

#!/bin/bash

SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/opt/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

touch /var/run/fmp_wait.pid
if [ -e "/etc/cron.daily/logrotate" ] ; then
  _SYSLOG_SIZE_TEST=$(du -s -h /var/log/syslog)
  if [[ "$_SYSLOG_SIZE_TEST" =~ "G" ]] ; then
    echo $_SYSLOG_SIZE_TEST too big
    bash /etc/cron.daily/logrotate
    echo system logs rotated
  fi
fi
echo rotate > /var/log/php/php-fpm-error.log
echo rotate > /var/log/php/php-fpm-slow.log
echo rotate > /var/log/php/php53-fpm-error.log
echo rotate > /var/log/php/php53-fpm-slow.log
echo rotate > /var/log/php/error_log_52
echo rotate > /var/log/php/error_log_53
echo rotate > /var/log/php/error_log_cli_52
echo rotate > /var/log/php/error_log_cli_53
echo rotate > /var/log/redis/redis-server.log
echo rotate > /var/log/mysql/sql-slow-query.log
if test -f /root/.high_traffic.cnf ; then
  echo rotate > /var/log/nginx/access.log
else
  /etc/init.d/php53-fpm reload
  if test -f /etc/init.d/php-fpm ; then
    /etc/init.d/php-fpm reload
  fi
  sleep 8
fi
if test -f /var/run/boa_run.pid ; then
  sleep 1
else
  rm -f /tmp/*error*
fi
rm -f /var/run/fmp_wait.pid
touch /var/xdrago/log/clear.done
###EOF2013###

If the restart of php-fpm that it'll do in 10 mins time caused any 502 errors I'm going to disable this cronjob again.

comment:55 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.2
  • Total Hours changed from 14.95 to 15.15

There were no 502's at 9:11pm so I'm leaving the script in place for now.

comment:56 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 15.15 to 15.4

There was a 502 served to a Chinese dynamic IP address at 07:11:04 am pretending to be IE6, I expect it was a spambot, it was in response to a registration POST:

"117.26.XX.XX" www.transitionnetwork.org [08/Apr/2013:07:11:04 +0100] "POST /user/register HTTP/1.1" 502 568 1289 713 "http://www.transitionnetwork.org/user/register" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)" 3.479 "-"

This means that the cronjob running /var/xdrago/clear.sh at 11 mins past the hour is, once again, causing clients to get 502 errors when it kills php-fpm, so I have commented it out or root crontab:

#11 * * * * bash /var/xdrago/clear.sh >/dev/null 2>&1

This will need re-commenting after the upgrade to BOA-2.0.8 on ticket:530 and I have added this task to the BOA upgrade process documneted here wiki:PuffinServer#UpgradingBOA.

comment:57 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.1
  • Total Hours changed from 15.4 to 15.5

The cronjob to run the script to find 502 errors (see ticket:483#comment:47) and send them out by email got clobbered in a BOA upgrade, so I have now added it to the pre-rotation task in /etc/logrotate.d/nginx:

/var/log/nginx/*.log {
        daily
        missingok
        rotate 52
        compress
        delaycompress
        notifempty
        create 0640 www-data adm
        sharedscripts
        prerotate
                /usr/local/bin/find-502s chris@webarchitects.co.uk
                if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
                        run-parts /etc/logrotate.d/httpd-prerotate; \
                fi \
        endscript
        postrotate
                [ ! -f /var/run/nginx.pid ] || kill -USR1 `cat /var/run/nginx.pid`
        endscript
}

comment:58 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.35
  • Total Hours changed from 15.5 to 15.85

The 502 errors that were happening between 3:16am and 3:19am, see ticket:483#comment:53 are now happening at 19 mins past midnight (apart from one case at 9pm):

9th April 00:19:07

Apache errors:

"192.XX.XXX.XX" www.transitionnetwork.org [11/Apr/2013:00:19:07 +0100] "GET /user/register HTTP/1.0" 502 568 507 713 "http://www.transitionnetwork.org/user/register" "Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0.193 "-"
"66.249.XX.XXX" www.transitionnetwork.org [11/Apr/2013:00:19:07 +0100] "GET /tags/stop-shopping HTTP/1.1" 502 166 307 311 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.528 "-"

The corresponding MariaDB error from the daemon.log:

Apr 11 00:19:07 puffin mysqld: 130411  0:19:07 [Warning] Aborted connection 341263 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

12th April 21:00:32

Apache error:

"XXX.X.XXX.X" www.transitionnetwork.org [11/Apr/2013:21:00:32 +0100] "GET /forums/process/researchers/collaborative-bibliography-academic-work-transition HTTP/1.0" 502 568 778 713 "http://www.transitionnetwork.org/forums/process/researchers/collaborative-bibliography-academic-work-transition" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11" 0.146 "-" 

The corresponding MariaDB errors:

Apr 11 21:00:19 puffin mysqld: 130411 21:00:19 [Warning] Aborted connection 499168 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 11 21:00:19 puffin mysqld: 130411 21:00:19 [Warning] Aborted connection 499172 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 11 21:00:19 puffin mysqld: 130411 21:00:19 [Warning] Aborted connection 499170 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 11 21:00:19 puffin mysqld: 130411 21:00:19 [Warning] Aborted connection 499147 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 11 21:00:19 puffin mysqld: 130411 21:00:19 [Warning] Aborted connection 499146 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

13th April 00:19:06

Apache error:

"XX.XXX.XX.XX, 127.0.0.1" www.transitionnetwork.org [13/Apr/2013:00:19:06 +0100] "GET /forums/process/general/list-ideas-starting-transition HTTP/1.0" 502 166 838 311 "http://www.transitionnetwork.org/forums/process/general/list-ideas-starting-transition" "Mozilla/5.0 (Windows NT 5.2; rv:17.0) Gecko/17.0 Firefox/17.0" 1.273 "-"

MariaDB error:

Apr 13 00:19:06 puffin mysqld: 130413  0:19:06 [Warning] Aborted connection 678014 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

14th April 00:19:07

Apache errors, served to the same IP address but different browsers, this will be a spambot:

"XXX.XXX.XXX.X, 127.0.0.1" www.transitionnetwork.org [14/Apr/2013:00:19:07 +0100] "POST /user/register HTTP/1.0" 502 568 2826 713 "http://www.transitionnetwork.org/user/register" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" 1.710 "-"
"XXX.XXX.XXX.X, 127.0.0.1" www.transitionnetwork.org [14/Apr/2013:00:19:07 +0100] "POST /user/register HTTP/1.0" 502 568 2870 713 "http://www.transitionnetwork.org/user/register" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.79 Safari/537.4" 2.244 "-"
"222.187.221.8, 127.0.0.1" www.transitionnetwork.org [14/Apr/2013:00:19:07 +0100] "POST /user/register HTTP/1.0" 502 166 2851 311 "http://www.transitionnetwork.org/user/register" "Opera/9.80 (Windows NT 5.1; U; MRA 6.0 (build 5970); ru) Presto/2.7.62 Version/11.00" 2.966 "-" 

MariaDB errors:

Apr 14 00:19:07 puffin mysqld: 130414  0:19:07 [Warning] Aborted connection 819687 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 14 00:19:07 puffin mysqld: 130414  0:19:07 [Warning] Aborted connection 819688 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 14 00:19:07 puffin mysqld: 130414  0:19:07 [Warning] Aborted connection 819683 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 14 00:19:07 puffin mysqld: 130414  0:19:07 [Warning] Aborted connection 819684 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 14 00:19:07 puffin mysqld: 130414  0:19:07 [Warning] Aborted connection 819681 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 14 00:19:07 puffin mysqld: 130414  0:19:07 [Warning] Aborted connection 819682 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 14 00:19:07 puffin mysqld: 130414  0:19:07 [Warning] Aborted connection 819685 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 14 00:19:07 puffin mysqld: 130414  0:19:07 [Warning] Aborted connection 819686 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

15th April 00:19:58

Apache errors:

"66.XXX.XX.XX" www.transitionnetwork.org [15/Apr/2013:00:19:58 +0100] "GET /ingredients/starting/measurement HTTP/1.1" 502 166 462 311 "-" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" 0.622 "-"
"157.XX.XX.XX" www.transitionnetwork.org [15/Apr/2013:00:19:58 +0100] "GET /people/tully-wakeman HTTP/1.1" 502 166 352 311 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.274 "-" 

MariaDB errors:

Apr 15 00:19:58 puffin mysqld: 130415  0:19:58 [Warning] Aborted connection 968953 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 15 00:19:58 puffin mysqld: 130415  0:19:58 [Warning] Aborted connection 968952 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

comment:59 Changed 4 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Total Hours changed from 15.85 to 16.1

There was another 502 last night at 00:19:18, the usual time...

"XXX.XXX.XX.XXX" www.transitionnetwork.org [16/Apr/2013:00:19:18 +0100] "GET /ingredients/deepening/celebrate-failure HTTP/1.1" 502 568 854 713 "http://www.transitionnetwork.org/ingredients" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MDDR; InfoPath.3)" 1.187 "-"

This is the corresponding entry in /var/log/daemon.log:

Apr 16 00:19:18 puffin mysqld: 130416  0:19:18 [Warning] Aborted connection 1136789 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

Since then there have been a further two sets of "Aborted connection" errors, one set at 08:01:53:

Apr 16 08:01:53 puffin mysqld: 130416  8:01:53 [Warning] Aborted connection 1187027 to db: 'masterpuffinweba' user: 'masterpuffinweba' host: 'localhost' (Unknown error)
Apr 16 08:01:54 puffin mysqld: 130416  8:01:54 [Warning] Aborted connection 1187075 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 16 08:01:54 puffin mysqld: 130416  8:01:54 [Warning] Aborted connection 1187071 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 16 08:01:54 puffin mysqld: 130416  8:01:54 [Warning] Aborted connection 1187073 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 16 08:01:54 puffin mysqld: 130416  8:01:54 [Warning] Aborted connection 1187025 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 16 08:01:54 puffin mysqld: 130416  8:01:54 [Warning] Aborted connection 1187023 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 16 08:01:54 puffin mysqld: 130416  8:01:54 [Warning] Aborted connection 1187061 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

And again at 17:01:00:

Apr 16 17:01:00 puffin mysqld: 130416 17:01:00 [Warning] Aborted connection 1241538 to db: 'masterpuffinweba' user: 'masterpuffinweba' host: 'localhost' (Unknown error)
Apr 16 17:01:00 puffin mysqld: 130416 17:01:00 [Warning] Aborted connection 1241522 to db: 'masterpuffinweba' user: 'masterpuffinweba' host: 'localhost' (Unknown error)
Apr 16 17:01:01 puffin mysqld: 130416 17:01:01 [Warning] Aborted connection 1241564 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 16 17:01:01 puffin mysqld: 130416 17:01:01 [Warning] Aborted connection 1241518 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)
Apr 16 17:01:01 puffin mysqld: 130416 17:01:01 [Warning] Aborted connection 1241544 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error) 
Apr 16 17:01:01 puffin mysqld: 130416 17:01:01 [Warning] Aborted connection 1241574 to db: 'transitionnetwor' user: 'transitionnetwor' host: 'localhost' (Unknown error)

Can the possibility that it could have been caused by the GEOS PHP extension now be ruled out as this has been disabled in the latest BOA?

comment:60 Changed 4 years ago by jim

Chris, here's a commit that'll make you happy :-) PHP: Do not reload FPM every hour - it causes error 502 if request is in progress.

There are a couple of cleanup commits after this, so there will be no more commenting out the offending scripts etc from the next BOA onwards. Good news (and a vindication of your position)!

comment:61 Changed 4 years ago by ed

  • Milestone set to Maintenance

comment:62 Changed 4 years ago by chris

This text has been removed from the wiki:PuffinServer#UpgradingBOA

Then the root cronjob running /var/xdrago/clear.sh at 11 mins past the hour needs commenting out due to the 502 errors it causes, see ticket:483:

sudo -i
export EDITOR=vim
crontab -e

#11 * * * * bash /var/xdrago/clear.sh >/dev/null 2>&1

As the script no longer kills php-fpm every hour, see ticket:547.

comment:63 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0.0 to 0.25
  • Status changed from assigned to closed
  • Resolution set to fixed
  • Total Hours changed from 16.1 to 16.35

We have only once had more that 100 502 errors in a day since early August 2013, these are the subject lines from the cron script that checks for these errors just before the logs are rotated, which contains the totals for each day:

Aug 04  558 403, 3121 404, 0 502, 69 503 and 0 504 errors from puffin.webarch.net
Aug 05  532 403, 2897 404, 1 502, 82 503 and 1 504 errors from puffin.webarch.net
Aug 06  770 403, 2879 404, 5 502, 265 503 and 1 504 errors from puffin.webarch.net
Aug 07  827 403, 3339 404, 6 502, 142 503 and 0 504 errors from puffin.webarch.net
Aug 08  852 403, 3919 404, 12 502, 349 503 and 45 504 errors from puffin.webarch.net
Aug 09  1165 403, 3637 404, 3 502, 267 503 and 5 504 errors from puffin.webarch.net
Aug 10  1166 403, 3679 404, 12 502, 226 503 and 0 504 errors from puffin.webarch.net
Aug 11  695 403, 3397 404, 1 502, 105 503 and 0 504 errors from puffin.webarch.net
Aug 12  619 403, 4100 404, 1 502, 46 503 and 0 504 errors from puffin.webarch.net
Aug 13  587 403, 2877 404, 5050 502, 295 503 and 0 504 errors from puffin.webarch.net
Aug 14  654 403, 4196 404, 8 502, 233 503 and 2 504 errors from puffin.webarch.net
Aug 15  853 403, 3385 404, 7 502, 30 503 and 0 504 errors from puffin.webarch.net
Aug 16  750 403, 2967 404, 2 502, 127 503 and 0 504 errors from puffin.webarch.net
Aug 17  934 403, 3568 404, 7 502, 155 503 and 0 504 errors from puffin.webarch.net
Aug 18  890 403, 2881 404, 1 502, 8 503 and 0 504 errors from puffin.webarch.net
Aug 19  532 403, 2473 404, 1 502, 170 503 and 1 504 errors from puffin.webarch.net
Aug 20  619 403, 2636 404, 2 502, 157 503 and 3 504 errors from puffin.webarch.net
Aug 21  640 403, 3700 404, 7 502, 265 503 and 7 504 errors from puffin.webarch.net
Aug 22  1361 403, 2907 404, 4 502, 91 503 and 1 504 errors from puffin.webarch.net
Aug 23  1324 403, 3321 404, 2 502, 89 503 and 0 504 errors from puffin.webarch.net
Aug 24  840 403, 3308 404, 14 502, 582 503 and 0 504 errors from puffin.webarch.net
Aug 25  730 403, 2772 404, 1 502, 29 503 and 0 504 errors from puffin.webarch.net
Aug 26  673 403, 2702 404, 19 502, 212 503 and 0 504 errors from puffin.webarch.net
Aug 27  726 403, 3459 404, 9 502, 121 503 and 0 504 errors from puffin.webarch.net
Aug 28  636 403, 3582 404, 20 502, 396 503 and 7 504 errors from puffin.webarch.net
Aug 29  828 403, 3209 404, 15 502, 490 503 and 11 504 errors from puffin.webarch.net
Aug 30  679 403, 4128 404, 11 502, 411 503 and 3 504 errors from puffin.webarch.net
Aug 31  593 403, 5750 404, 8 502, 56 503 and 0 504 errors from puffin.webarch.net
Sep 01  735 403, 3348 404, 0 502, 17 503 and 0 504 errors from puffin.webarch.net
Sep 02  551 403, 2939 404, 2 502, 38 503 and 0 504 errors from puffin.webarch.net
Sep 04  1352 403, 6744 404, 59 502, 935 503 and 7 504 errors from puffin.webarch.net
Sep 05  1679 403, 3364 404, 29 502, 219 503 and 1 504 errors from puffin.webarch.net
Sep 06  1916 403, 3627 404, 9 502, 317 503 and 1 504 errors from puffin.webarch.net
Sep 08  2947 403, 6114 404, 18 502, 1366 503 and 14 504 errors from puffin.webarch.net
Sep 09  1634 403, 3698 404, 9 502, 219 503 and 0 504 errors from puffin.webarch.net
Sep 10  1734 403, 5584 404, 11 502, 531 503 and 0 504 errors from puffin.webarch.net
Sep 11  1269 403, 13393 404, 36 502, 1591 503 and 30 504 errors from puffin.webarch.net
Sep 12  1679 403, 7105 404, 4 502, 863 503 and 19 504 errors from puffin.webarch.net
Sep 14  2780 403, 8882 404, 15 502, 2634 503 and 0 504 errors from puffin.webarch.net
Sep 15  1752 403, 5810 404, 1 502, 485 503 and 0 504 errors from puffin.webarch.net
Sep 16  1910 403, 4832 404, 60 502, 361 503 and 0 504 errors from puffin.webarch.net
Sep 18  3388 403, 9494 404, 35 502, 1461 503 and 0 504 errors from puffin.webarch.net
Sep 19  1489 403, 5994 404, 6 502, 792 503 and 0 504 errors from puffin.webarch.net
Sep 20  1774 403, 4775 404, 10 502, 762 503 and 0 504 errors from puffin.webarch.net
Sep 21  1936 403, 4081 404, 14 502, 580 503 and 0 504 errors from puffin.webarch.net
Sep 22  1942 403, 4589 404, 32 502, 301 503 and 0 504 errors from puffin.webarch.net
Sep 23  1878 403, 4447 404, 10 502, 406 503 and 0 504 errors from puffin.webarch.net
Sep 24  1822 403, 4616 404, 6 502, 215 503 and 0 504 errors from puffin.webarch.net
Sep 25  1836 403, 5263 404, 4 502, 64 503 and 0 504 errors from puffin.webarch.net
Sep 26  2319 403, 5889 404, 7 502, 288 503 and 0 504 errors from puffin.webarch.net
Sep 27  2278 403, 6601 404, 7 502, 262 503 and 0 504 errors from puffin.webarch.net
Sep 28  2169 403, 7631 404, 8 502, 152 503 and 0 504 errors from puffin.webarch.net
Sep 29  2099 403, 6310 404, 10 502, 77 503 and 0 504 errors from puffin.webarch.net
Sep 30  2628 403, 6351 404, 27 502, 802 503 and 0 504 errors from puffin.webarch.net
Oct 01  2744 403, 7097 404, 103 502, 3162 503 and 0 504 errors from puffin.webarch.net
Oct 02  2448 403, 7249 404, 38 502, 2152 503 and 0 504 errors from puffin.webarch.net
Oct 03  2515 403, 6332 404, 54 502, 3554 503 and 0 504 errors from puffin.webarch.net
Oct 04  2394 403, 4961 404, 71 502, 3453 503 and 0 504 errors from puffin.webarch.net
Oct 05  2337 403, 4719 404, 7 502, 228 503 and 0 504 errors from puffin.webarch.net
Oct 06  2327 403, 5885 404, 32 502, 363 503 and 0 504 errors from puffin.webarch.net
Oct 07  2661 403, 6267 404, 8 502, 286 503 and 0 504 errors from puffin.webarch.net
Oct 08  3006 403, 5432 404, 37 502, 1466 503 and 0 504 errors from puffin.webarch.net
Oct 09  2685 403, 5714 404, 40 502, 3504 503 and 0 504 errors from puffin.webarch.net
Oct 10  3371 403, 5213 404, 59 502, 6327 503 and 0 504 errors from puffin.webarch.net
Oct 11  3304 403, 5825 404, 56 502, 2065 503 and 0 504 errors from puffin.webarch.net
Oct 12  3113 403, 4921 404, 52 502, 2252 503 and 0 504 errors from puffin.webarch.net
Oct 13  3527 403, 4912 404, 40 502, 289 503 and 0 504 errors from puffin.webarch.net
Oct 14  3210 403, 7936 404, 41 502, 436 503 and 0 504 errors from puffin.webarch.net
Oct 15  3400 403, 9043 404, 33 502, 677 503 and 0 504 errors from puffin.webarch.net
Oct 16  3286 403, 5362 404, 50 502, 1010 503 and 0 504 errors from puffin.webarch.net
Oct 17  3409 403, 5837 404, 17 502, 426 503 and 0 504 errors from puffin.webarch.net
Oct 18  4060 403, 4747 404, 15 502, 224 503 and 0 504 errors from puffin.webarch.net
Oct 19  3808 403, 4420 404, 6 502, 14 503 and 0 504 errors from puffin.webarch.net
Oct 20  3457 403, 3904 404, 13 502, 215 503 and 0 504 errors from puffin.webarch.net
Oct 21  3200 403, 4270 404, 5 502, 379 503 and 0 504 errors from puffin.webarch.net
Oct 22  2994 403, 5661 404, 9 502, 275 503 and 0 504 errors from puffin.webarch.net
Oct 23  3359 403, 4950 404, 3 502, 54 503 and 0 504 errors from puffin.webarch.net
Oct 24  3200 403, 4620 404, 5 502, 16 503 and 0 504 errors from puffin.webarch.net
Oct 25  3262 403, 4728 404, 10 502, 16 503 and 0 504 errors from puffin.webarch.net
Oct 26  3026 403, 5553 404, 9 502, 19 503 and 0 504 errors from puffin.webarch.net
Oct 27  3001 403, 4599 404, 6 502, 41 503 and 0 504 errors from puffin.webarch.net
Oct 28  2939 403, 3725 404, 2 502, 20 503 and 0 504 errors from puffin.webarch.net
Oct 29  2967 403, 4119 404, 8 502, 15 503 and 0 504 errors from puffin.webarch.net
Oct 30  2782 403, 5032 404, 7 502, 9 503 and 0 504 errors from puffin.webarch.net
Oct 31  3500 403, 4254 404, 10 502, 24 503 and 0 504 errors from puffin.webarch.net
Nov 01  3335 403, 4973 404, 11 502, 70 503 and 0 504 errors from puffin.webarch.net
Nov 02  3581 403, 7293 404, 9 502, 17 503 and 0 504 errors from puffin.webarch.net
Nov 03  2972 403, 4330 404, 7 502, 18 503 and 0 504 errors from puffin.webarch.net
Nov 04  3284 403, 4797 404, 5 502, 26 503 and 0 504 errors from puffin.webarch.net
Nov 05  2844 403, 4637 404, 5 502, 10 503 and 0 504 errors from puffin.webarch.net
Nov 06  3523 403, 4657 404, 17 502, 27 503 and 0 504 errors from puffin.webarch.net
Nov 07  3210 403, 13280 404, 4 502, 37 503 and 0 504 errors from puffin.webarch.net
Nov 08  3472 403, 7198 404, 11 502, 28 503 and 0 504 errors from puffin.webarch.net
Nov 09  6897 403, 6470 404, 11 502, 315 503 and 0 504 errors from puffin.webarch.net
Nov 10  3996 403, 4431 404, 9 502, 31 503 and 0 504 errors from puffin.webarch.net
Nov 11  4268 403, 4168 404, 2 502, 45 503 and 0 504 errors from puffin.webarch.net
Nov 12  3800 403, 4111 404, 10 502, 33 503 and 0 504 errors from puffin.webarch.net
Nov 13  3513 403, 7100 404, 11 502, 27 503 and 0 504 errors from puffin.webarch.net
Nov 14  3969 403, 9565 404, 12 502, 23 503 and 0 504 errors from puffin.webarch.net
Nov 15  3260 403, 8559 404, 8 502, 8 503 and 0 504 errors from puffin.webarch.net

It's worth noting that 404 and 403 errors have gone up by a significant amount in this time.

However this issue was resolved on ticket:483#comment:60 so this ticket is being closed.

Note: See TracTickets for help on using tickets.