
Detecting and Resolving LAMP Stack Performance Problems


As sysadmins, we sometimes run into performance problems with several overlapping causes. It's often not obvious where the actual bottleneck is, and resolving the one problem you can see might bring another couple of problems to the surface.

What follows comes from a consulting gig that I've been working on recently. The parties will remain nameless. I'm going to break this into several parts, since it took over three weeks to resolve all of the immediate problems with the site, and we're still not all the way through the task list.

Going in, I knew that we were dealing with a heavily loaded Drupal site that shared a MySQL database with a wiki and a forum. The site would go down at random times, sometimes multiple times per hour. The server seemed slow the first time I logged in, so I immediately ran 'uptime' and all three load averages came back over 90 on an 8-core server. There were 125 Apache processes running, but most of them were in a deadlocked state. The very second command I ran on the server was killall -9 httpd, which is never the way you want to start out a consulting gig…
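For context, the triage amounted to something like this (the ps flags here are just an illustration of what I was looking at, not a transcript from the box):

    uptime                                    # all three load averages were over 90 on an 8-core server
    ps -C httpd -o pid,stat,etime,cmd | less  # eyeball the Apache workers and their states
    killall -9 httpd                          # last resort: clear out the wedged workers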

While that was busy killing off processes, I checked the Apache configuration. Sure enough, it was still at the stock settings. I immediately cranked up the requests per process to 20,000 and upped the server limit to 300. (Remember, we’re dealing with prefork here.) I restarted Apache and watched it churn. It handled the load far more gracefully with some room to move around, and I quickly saw the number of Apache processes spike, and then sink down to about 80 and stay there.
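For reference, the prefork directives involved live in httpd.conf and look something like the block below. The 300 and 20,000 match what I describe above; the spare-server numbers are just placeholders, so tune all of this to your own memory budget rather than copying it.

    <IfModule prefork.c>
        StartServers           10
        MinSpareServers        10
        MaxSpareServers        50
        ServerLimit           300
        MaxClients            300
        MaxRequestsPerChild 20000
    </IfModule>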

The next step was looking through the logs. A quick aside about logs: I like my logs to be clean. I don't like debug messages, I don't like status messages, and I don't want to see either of them. If there's a certain type of status message that I get a lot of and *do* want to trap, I make sure that syslog puts it into its own file, or I fix the problem that's causing it. In this case, /var/log/messages had a bunch of SNMP messages logging each get, plus some messages about martian packets. The martian packets issue could be (and was) resolved with a quick firewall tweak to reject packets from an illegal source. The SNMP issue was resolved by editing snmpd's startup configuration to log to local1 instead of the default (check the snmpd man page to make sure you get the right flags, since they've changed), then editing syslog's configuration to send everything on local1 to /var/log/snmpd. And don't forget to add the new file to logrotate!
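In case it helps anyone, here's roughly what those fixes look like on a Red Hat-style box. The snmpd facility flag has changed syntax between net-snmp versions (as noted above), so treat -Ls1 as an assumption and check snmpd(8) before using it; the firewall rule and the address range are likewise just an illustration of rejecting an illegal source.

    # Firewall: drop traffic claiming to come from a source that should never
    # arrive on this interface (illustrative range, use whatever your martians are)
    iptables -A INPUT -i eth0 -s 192.168.0.0/16 -j DROP

    # /etc/sysconfig/snmpd: send snmpd's logging to syslog facility local1
    OPTIONS="-Ls1"

    # /etc/syslog.conf: give local1 its own file and keep it out of messages
    local1.*                                                 /var/log/snmpd
    *.info;mail.none;authpriv.none;cron.none;local1.none     /var/log/messages

    # /etc/logrotate.d/snmpd: rotate the new file so it doesn't grow forever
    /var/log/snmpd {
        weekly
        rotate 4
        compress
        missingok
        notifempty
    }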

Now we were down to two classes of errors. The first was obvious and fairly easy to troubleshoot: “MySQL server has gone away.” Log into the MySQL server. See if there are slow-running queries. Nope? Then double-check the timeouts set in /etc/my.cnf: on this server, the slow-query threshold was set to twenty seconds, but the connection timeout was set to ten seconds, which is not a very useful combination. Also, check your caches and table types. In this case, everything was MyISAM. More on that later; for now, just make sure you're using the right caching strategy for your table type and system specs, which for MyISAM means the key cache (and lots of it!). Try to fit all of your most-used tables in memory.
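For completeness, the my.cnf directives in play are, under their current names, long_query_time, wait_timeout, and key_buffer_size. The values below are illustrative rather than the client's actual settings; keep in mind that the MyISAM key cache holds index blocks only, so size it to your hot indexes and let the OS file cache carry the table data.

    # /etc/my.cnf (illustrative values, not the client's)
    [mysqld]
    # log anything that runs longer than this many seconds
    long_query_time  = 5
    log-slow-queries = /var/log/mysql-slow.log
    # don't cut idle connections off before the application is done with them
    wait_timeout     = 300
    # MyISAM key cache: big enough to hold the indexes of your busiest tables
    key_buffer_size  = 1G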

On this gig, these changes got the site back on its feet. Downtime went from multiple events an hour down to one or two events per six-hour period. Unfortunately, we were also out of easy things to change. Next time I post, we'll start to get into fixes that will cause downtime.

