Hi Folks
(I hope Andy doesn’t mind me posting about a VPS with a provider over the road from him.
He knows that the eventual intention is to bring all of my business over! I am tempted to
“solve” this problem by migrating the offending WordPress site to my BitFolk box, but I
suspect the problem would just follow with it.)
I have a server where Apache is falling over more often than I’d like. I’ve not previously
been big on server monitoring, so I don’t have huge amounts of data that could point to
the problem. (I’m changing that with as much time as I have available.)
I’ll try to keep the story brief:
This is happening on an Ubuntu 12.04 box. It runs Varnish and Apache with many virtual
hosts, and also MySQL and Nagios. One virtual host is a WordPress site, another is a
Drupal 7 site that has not had much publicity yet, but could attract “enemies”
when it does [1]. Not much customisation was done to the usual install, but some TCP
tweaking was done (mostly by the hosting company) because the VPS was set up to expect
massive traffic.
For the most part, I configure Drupal sites and then take on the hosting. I was asked to
take on the hosting of a WordPress site. Not wanting to run my sites and the WP site with
the same user, I set up a second Apache instance for the WordPress site, with Varnish
forwarding to the appropriate Apache instance.
Some time later, the second Apache instance started segfaulting far too often, and this
could be seen in the error logs. It couldn’t go any longer than about an hour before it
fell over. This started at around the time, though not exactly when, I installed
an extra PHP module at the user’s request [2]. The site was down until Apache was
restarted, and as a temporary measure, I wrote a very crude cron job to automatically
restart Apache as needed.
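
To illustrate the idea (this is a Python 3 sketch rather than the script I actually run; the check URL, the port and the restart command are all assumptions, as are the function names):

```python
import subprocess
import urllib.error
import urllib.request

# Hypothetical values: adjust the URL and restart command for the real setup.
CHECK_URL = "http://127.0.0.1:8080/"
RESTART_CMD = ["service", "apache2", "restart"]


def site_is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with a non-5xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # A 4xx still means Apache is alive and answering.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, reset: treat as down.
        return False


def check_and_restart(url=CHECK_URL, is_up=site_is_up, run=subprocess.call):
    """Restart Apache only when the check fails; return True if restarted."""
    if is_up(url):
        return False
    run(RESTART_CMD)
    return True
```

Calling `check_and_restart()` from a script run by cron every few minutes is the whole job. It only papers over the problem, of course, and it would not notice a vhost that answers slowly but within the timeout.
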
Dealing with the core dump and gdb seemed too much effort, so I decided to try mpm_itk [3]
instead (which is actually a better fit with my longer term plans). This worked swimmingly
for about a week. However, over the past 3 nights, all of my sites have fallen over at
about 2am, with nothing being served until I woke up and restarted Apache. Some
distinguishing features are:
* This happens at roughly the same time each day.
* All other services on the server are responsive; it’s just Apache that is bound up.
* CPU loads are next to nothing.
* Nothing shows up in the Apache logs during the downtime.
* Nagios reports many more processes than usual are in the system.
I am getting a bit too far from my comfort zone now, but it seems to smell of slowloris to
me.
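
One way to test that hunch would be to count established connections per remote IP while the problem is happening. The following is a minimal Python sketch, not something I have run on this box: it assumes the text comes from `netstat -tn` and that the front-end listener is on port 80, and the function name is made up for illustration.

```python
from collections import Counter


def connections_per_ip(netstat_output, local_port="80"):
    """Count ESTABLISHED TCP connections to local_port, grouped by remote IP."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        # The port is whatever follows the last colon (works for tcp6 too).
        if local.rsplit(":", 1)[-1] != local_port:
            continue
        remote_ip = foreign.rsplit(":", 1)[0]
        counts[remote_ip] += 1
    return counts
```

Feeding it the saved output of `netstat -tn` and looking at `connections_per_ip(text).most_common(10)` should show whether a handful of addresses are holding a disproportionate number of connections open.
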
I’ve since enabled server-status, and if the issue comes up again, I should look at it
before I restart Apache. Watching it now, some distinguishing features are:
* wp-login.php is getting hit quite a bit on the WordPress vhost from different IPs.
* KeepAlive doesn’t seem to be working as I understand it: why should there be Varnish
connections that seem to be open and waiting for 100s or 1000s of seconds? (I’m looking
at the SS column.)
* I was running my Drupal cron jobs too often (embarrassingly so) via cron and wget,
and cron runs for the same vhost seemed to be overlapping [4]. I’ve slowed this down.
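
To make the server-status output easier to compare between a good period and a hang, the scoreboard line from the machine-readable view can be tallied by worker state. A rough Python sketch, assuming mod_status is reachable at something like http://localhost/server-status?auto (that URL is an assumption about the local config) and using a helper name of my own invention:

```python
from collections import Counter

# Meaning of the scoreboard characters reported by mod_status.
STATES = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def summarise_scoreboard(status_text):
    """Return {state description: worker count} from ?auto status output."""
    counts = Counter()
    for line in status_text.splitlines():
        if line.startswith("Scoreboard:"):
            board = line.split(":", 1)[1].strip()
            # Each character on the scoreboard is one worker slot.
            counts.update(board)
    return {STATES.get(ch, ch): n for ch, n in counts.items()}
```

If nearly every slot turns out to be in "reading request" during an outage, that would be consistent with the slowloris theory; a wall of "sending reply" would point more towards slow PHP or the database.
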
Any thoughts? What would you do differently?
(OK, so the story wasn’t brief — apologies!)
Thanks
Ben [5]
[1] [2] My suspicion is that these are red herrings.
[3] Is this a daft thing to do?
[4] I’ll sort out separately why this might have happened, but I’ll move to managing this
via Drush rather than wget.
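
Whatever the cause, a lock around the job should stop runs from overlapping. A minimal Python sketch using an advisory file lock (Linux only; the lock path and the function name are placeholders, and `job` stands in for whatever actually triggers the Drupal cron run):

```python
import fcntl


def run_exclusively(lock_path, job):
    """Run job() only if no other process holds the lock; return True if it ran."""
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if another run holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        try:
            job()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

With one lock file per vhost, a cron run that starts while the previous one is still going simply exits instead of piling up behind it.
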
[5] I am a site builder first, and an admin second, so please excuse the things I’ve done
that are clearly sub-optimal.