There are many LAMP stacks of anywhere from a few to a dozen servers out there that have serious trouble with performance, troubleshooting and manageability.
To give a clear example, consider a setup with 10 Apache servers serving static files and PHP, a MySQL database server and an HAProxy load balancer in front. Assume that 5 sites with different domains are load balanced across the servers. Some common problems in a setup like this, and possible solutions, are:
Apache performance is mediocre at best.
Most probably 2-4 of the Apache servers can be removed or reused for something else by switching to a well-configured lightweight web server. The two most common choices are nginx and lighttpd, in that order. In my opinion they are both stable and roughly equivalent in performance, but I do like the flexibility of the nginx configuration a bit better.
No good cache system is in use, so each and every PHP request comes with a big overhead.
Changing the plain load balancer into a caching one can bring great performance gains. Varnish is now probably the first choice for most people, and it is both very stable and fast. By setting an expiration time in the responses from your backend (web) servers, for instance with a Cache-Control max-age header, you can control exactly how long each file should be cached. Put some time and effort into achieving high hit rates; it is probably the most important change you can make.
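To make the expiration idea concrete, here is a minimal Python sketch of how a shared cache could turn backend response headers into a cache lifetime. It is a simplified model, not Varnish's actual code: it assumes s-maxage wins over max-age, which wins over Expires, and it falls back to 120 seconds (assumed here to match Varnish's default_ttl) when the backend sends no expiry information. The function names are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumed fallback lifetime (seconds) when the backend sends no expiry information.
DEFAULT_TTL = 120


def parse_cache_control(value):
    """Split a Cache-Control header into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg.strip('"')
    return directives


def ttl_from_headers(headers, now=None):
    """Return how many seconds a shared cache may keep a response (simplified model)."""
    now = now or datetime.now(timezone.utc)
    directives = parse_cache_control(headers.get("Cache-Control", ""))
    # s-maxage is aimed at shared caches, so it takes precedence over max-age.
    for key in ("s-maxage", "max-age"):
        if directives.get(key, "").isdigit():
            return int(directives[key])
    if "Expires" in headers:
        try:
            expires = parsedate_to_datetime(headers["Expires"])
        except (TypeError, ValueError):
            return 0  # an unparsable Expires header is treated as already expired
        if expires.tzinfo is None:
            expires = expires.replace(tzinfo=timezone.utc)
        return max(0, int((expires - now).total_seconds()))
    return DEFAULT_TTL
```

On the PHP side, the matching step is simply sending a header such as Cache-Control: max-age=300 for responses that may be cached for five minutes.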
No good way to identify bottlenecks or performance problems in the PHP code. With many recent releases across several sites in the web cluster, it's very hard to know why the servers are overloaded.
Having a live profiler for PHP on your web servers is a big win in this situation. If you want something free, there is XHProf, which was open sourced by Facebook. Unfortunately, they did not release the web interfaces that made it useful in production. There are, however, two open source solutions for that: XHGui, which logs to a MongoDB database, and XHProf Analytics, which logs to MySQL. Be careful about where and how you use it, as the profiler results can quickly fill up your disk. A good setup could be to profile only a percentage of the requests and to purge old data regularly (with a TTL index for MongoDB and a cron job for MySQL).
If you don't mind paying a bit, there is also a hosted solution called New Relic with a nice, intuitive interface. It covers more than just PHP as well.
No good way to get an overview of where errors happen. To read the PHP error logs, you need to log into each of the machines.
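The sampling and cleanup advice for a self-hosted profiler can be sketched as follows. In a real setup this logic would live in PHP next to the XHProf calls; Python is used here purely as an illustration. The table name profile_runs and its created_at column (a Unix timestamp) are hypothetical, and SQLite stands in for MySQL so the example runs on its own; the DELETE is what a nightly cron job would execute.

```python
import random
import sqlite3
import time


def should_profile(sample_rate, rng=random.random):
    """Return True for roughly sample_rate of all calls (0.01 means 1% of requests)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    # random.random() returns a float in [0.0, 1.0), so a rate of 1.0 always profiles.
    return rng() < sample_rate


def purge_old_runs(conn, max_age_days, now=None):
    """Delete profiler runs older than max_age_days; return how many rows were removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    cur = conn.execute("DELETE FROM profile_runs WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


def make_demo_db():
    """Create an in-memory stand-in for the hypothetical profile_runs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE profile_runs (url TEXT, created_at REAL)")
    return conn
```

Passing rng explicitly is only there to make the decision testable; in production the default random source is enough.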
Use a centralized logging service.
If you want to roll your own, there's a very good solution in Graylog2 together with Logstash for collection. Graylog2 provides a user interface with search, log streams and alerts, and Logstash provides an easy-to-use client for parsing log files and forwarding their messages to the Graylog2 server.
For hosted solutions, the most frequently mentioned service is Loggly. Beware that it can become costly depending on your logging volume.
MySQL uses MyISAM and it keeps causing locks.
With row-level locks instead of table-level locks, InnoDB is more or less a must when you get a lot of writes to a few heavily hit tables. It will also generally be faster for an average setup thanks to its buffer pool caching data, not to mention its clustered primary key index. Make sure that you have a large enough buffer pool (and sufficient memory for it), and it will do wonders for your database reliability.
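To give an idea of what the collection side does, here is a minimal Python sketch that parses one line of a PHP error log into a GELF message (the JSON format Graylog2 accepts) and sends it over UDP. It is not Logstash, just an illustration of the same parse-and-forward idea. The log line format, the UTC assumption, the severity mapping and the extra field names (_php_error_type, _file, _line) are assumptions; 12201 is the usual GELF port, but check the input configured on your Graylog2 server.

```python
import json
import re
import socket
from datetime import datetime, timezone

# Assumed PHP error_log line format, e.g.:
# [12-Mar-2013 10:15:32 UTC] PHP Warning:  Division by zero in /var/www/a.php on line 7
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \w+\] "
    r"PHP (?P<kind>[\w ]+?):\s+(?P<msg>.*?) in (?P<file>\S+) on line (?P<line>\d+)$"
)

# Hypothetical mapping from PHP error kinds to syslog severities (3=error, 4=warning, ...).
LEVELS = {"Fatal error": 3, "Parse error": 3, "Warning": 4, "Notice": 5, "Deprecated": 5}


def php_line_to_gelf(line, host):
    """Turn one PHP error log line into a GELF 1.1 dict, or None if it doesn't match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Assumes the log timestamps are in UTC; the zone name in the line is ignored.
    ts = datetime.strptime(m.group("ts"), "%d-%b-%Y %H:%M:%S").replace(tzinfo=timezone.utc)
    return {
        "version": "1.1",
        "host": host,
        "short_message": m.group("msg"),
        "timestamp": ts.timestamp(),
        "level": LEVELS.get(m.group("kind"), 6),  # 6 = informational
        "_php_error_type": m.group("kind"),
        "_file": m.group("file"),
        "_line": int(m.group("line")),
    }


def send_gelf_udp(message, server, port=12201):
    """Send one uncompressed GELF message as a single UDP datagram (no chunking)."""
    payload = json.dumps(message).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (server, port))
```

Very large messages would need GELF compression or chunking, which this sketch deliberately leaves out.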
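One way to check whether the buffer pool is large enough is to compare two InnoDB status counters: Innodb_buffer_pool_read_requests (logical reads) and Innodb_buffer_pool_reads (reads that had to go to disk). The Python sketch below assumes you saved the tab-separated output of mysql -B -N -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'" and computes the hit ratio from it. Treating a ratio well below roughly 99% on a warmed-up server as a sign of a too-small buffer pool is a common rule of thumb, not a hard limit.

```python
def parse_status(text):
    """Parse 'Variable_name<TAB>Value' lines (mysql -B -N output) into a dict."""
    status = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2:
            status[parts[0].strip()] = parts[1].strip()
    return status


def buffer_pool_hit_ratio(status):
    """Fraction of InnoDB page reads served from memory, or None without data."""
    requests = int(status.get("Innodb_buffer_pool_read_requests", 0))
    disk_reads = int(status.get("Innodb_buffer_pool_reads", 0))
    if requests == 0:
        return None
    return 1.0 - disk_reads / requests
```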
Other things to consider
Are you using the right database system?
There are many cases where a lot of your database load can be transferred to a more lightweight system. A common situation is that many simple data structures can be moved from a somewhat heavy RDBMS (say MySQL) to a more lightweight NoSQL database such as Redis. This can increase performance significantly when applied to hotspots with simple data structures.
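As an example of such a hotspot, page view counters are often kept as an UPDATE per request in MySQL. The Python sketch below keeps them in Redis instead. It assumes a client object with incr and get methods, such as redis-py's Redis client; the key prefix views: and the class names are hypothetical, and InMemoryClient is only a stand-in so the sketch runs without a Redis server.

```python
class PageViewCounter:
    """Page view counters stored in Redis instead of one UPDATE per hit in MySQL."""

    def __init__(self, client, prefix="views"):
        self.client = client  # e.g. redis.Redis(host="localhost", port=6379)
        self.prefix = prefix

    def _key(self, page_id):
        return f"{self.prefix}:{page_id}"

    def hit(self, page_id):
        # INCR is atomic on the Redis server, so concurrent web servers don't lose counts.
        return self.client.incr(self._key(page_id))

    def count(self, page_id):
        value = self.client.get(self._key(page_id))  # redis-py returns bytes or None
        return int(value) if value is not None else 0


class InMemoryClient:
    """Tiny stand-in mimicking the two Redis commands used above (for local testing)."""

    def __init__(self):
        self.data = {}

    def incr(self, key, amount=1):
        self.data[key] = self.data.get(key, 0) + amount
        return self.data[key]

    def get(self, key):
        value = self.data.get(key)
        return None if value is None else str(value).encode()
```

If the counts are still needed in MySQL, a periodic job could copy them back in one batch instead of writing on every request.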
Revisit your database load
Database load is a very common bottleneck and is most often the hardest part of your application to scale. By periodically checking what is causing the load, you will usually find many easy optimizations. pt-query-digest is an excellent tool for profiling your database load and finding the heavy queries.
Measure your performance. See what causes the most load on the servers with XHProf/New Relic, the Varnish command line tools, pt-query-digest etc. You will surely find things among the top requests and queries that can be cached more efficiently.
This list is far from complete, but I think (and hope) I managed to get the most common and important points down.
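The core idea behind pt-query-digest is to group queries that differ only in their literal values and rank the groups by total time. The Python sketch below imitates that idea in a much simplified form; it is not pt-query-digest's actual fingerprinting, and it assumes you already have (query, seconds) pairs, for example extracted from a slow query log.

```python
import re
from collections import defaultdict


def fingerprint(sql):
    """Reduce a query to a normalized shape by replacing literal values with '?'."""
    q = sql.strip().lower()
    q = re.sub(r"'(?:[^'\\]|\\.)*'", "?", q)  # single-quoted string literals
    q = re.sub(r"\b\d+(?:\.\d+)?\b", "?", q)  # numeric literals
    q = re.sub(r"\bin\s*\((?:\s*\?\s*,?)+\)", "in(?+)", q)  # IN lists of any length
    return re.sub(r"\s+", " ", q)


def digest(entries):
    """entries: iterable of (sql, seconds). Returns (fingerprint, count, total) by total time."""
    stats = defaultdict(lambda: [0, 0.0])
    for sql, seconds in entries:
        row = stats[fingerprint(sql)]
        row[0] += 1
        row[1] += seconds
    ranked = [(fp, count, total) for fp, (count, total) in stats.items()]
    return sorted(ranked, key=lambda row: row[2], reverse=True)
```

Ranking by total time rather than by the slowest single query is what surfaces cheap queries that run thousands of times.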
Varnish comes with several very useful command line tools that can be a bit hard to get a grasp of. The list below is by no means meant to be exhaustive, but gives an introduction to some tools and use cases.
Easily dismissed as a geeky graph with little information, varnishhist is actually extremely useful for getting an overview of the overall status of your backend servers and Varnish.
The pipes (|) are requests served from the cache, whereas the hash signs (#) are requests to the backend. The X axis is a logarithmic scale of request times. So the histogram above shows that we have a good amount of cache hits that are served really fast, whereas roughly half of the backend requests take a bit more than 0.1 s. As with most of the other command line tools, you can filter the data you need with a regex, or show only backend requests or only cache hits. See https://www.varnish-cache.org/docs/trunk/reference/varnishhist.html for a complete list of parameters.
varnishstat can be used in two modes, both equally useful. If run as plain "varnishstat", you get a continuously updated list of the counters that fit on your screen. If you want all the values for in-depth analysis, use "varnishstat -1" to print the current counter values once.
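The logarithmic axis is what makes varnishhist readable: each step to the right is ten times slower. The Python sketch below buckets response times the same way and prints a sideways text histogram with | for hits and # for backend requests. The input format of (seconds, is_hit) pairs is an assumption; varnishhist itself reads the Varnish shared memory log and draws vertical bars.

```python
import math
from collections import Counter


def bucket(seconds):
    """Return the power of ten a response time falls into (e.g. -1 for 0.1-1 s)."""
    return math.floor(math.log10(max(seconds, 1e-6)))  # clamp to avoid log10(0)


def text_histogram(samples):
    """samples: iterable of (seconds, is_hit). Returns one text line per decade."""
    hits, misses = Counter(), Counter()
    for seconds, is_hit in samples:
        (hits if is_hit else misses)[bucket(seconds)] += 1
    decades = hits.keys() | misses.keys()
    if not decades:
        return []
    return [
        f"1e{d:+d}s {'|' * hits[d]}{'#' * misses[d]}"
        for d in range(min(decades), max(decades) + 1)
    ]
```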
Now a couple of important figures from this image:
- The very first row is the uptime of Varnish. This instance had been restarted 1 day, 2 hours and 44 minutes before the screenshot was taken.
- The counters below that are the average hit rates of Varnish. The first row is the timeframe and the second is the average hit rate. If varnishstat is kept open longer, the second timeframe will go up to 100 seconds and the third to 1000 seconds.
- In the list of counters below that, the columns are the total value, followed by the current value and finally the average value. Some particularly interesting rows are:
- cache_hit/cache_miss to see if you have monumental miss storms
- The relationship between client_conn and client_req, to see whether connections are being reused. In this case there is only API traffic, where very few connections are kept open, so the almost 1:1 ratio is normal.
- The relationship between s_hdrbytes and s_bodybytes is also rather interesting, as it shows how much of your bandwidth is actually used by headers. If s_hdrbytes is a high percentage of s_bodybytes, you might want to consider whether all your headers are actually necessary and useful.
varnishtop is a very handy tool for getting filtered information about your traffic. Since a lot of high-traffic Varnish sites do not keep access logs on their backend servers, this can be of great use.
In the tag names, Tx means data Varnish transmits and Rx data it receives, so for requests TxURL and TxHeader are what Varnish sends to the backends, whereas RxURL and RxHeader are what clients send to Varnish. The examples below should clarify what I mean.
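The ratios listed above are easy to compute from varnishstat -1 output. The Python sketch below assumes each line starts with the counter name followed by its value, as in Varnish 3, and strips the MAIN. prefix that later versions add. Counter names such as s_hdrbytes differ between Varnish versions, so treat the names used here as assumptions to adjust.

```python
def parse_varnishstat(text):
    """Parse 'name value [rate] description' lines into {name: value}."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit():
            name = parts[0]
            if name.startswith("MAIN."):
                name = name[len("MAIN."):]
            counters[name] = int(parts[1])
    return counters


def ratio(numerator, denominator):
    """Safe division that returns None instead of failing on a zero denominator."""
    return numerator / denominator if denominator else None


def summarize(c):
    """Hit rate, connection reuse and header overhead from the parsed counters."""
    hits, misses = c.get("cache_hit", 0), c.get("cache_miss", 0)
    return {
        "hit_rate": ratio(hits, hits + misses),
        "requests_per_connection": ratio(c.get("client_req", 0), c.get("client_conn", 0)),
        "header_share_of_body": ratio(c.get("s_hdrbytes", 0), c.get("s_bodybytes", 0)),
    }
```

A requests_per_connection value close to 1 matches the API-only case described above, where connections are rarely reused.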
Some handy examples to work from:
See which URLs are most commonly requested from the backend servers.
varnishtop -i txurl
See which user agents are the most common among the clients.
varnishtop -i RxHeader -C -I ^User-Agent
See which user agents most commonly reach the backend servers. Compare with the previous example to find clients that commonly cause misses.
varnishtop -i TxHeader -C -I ^User-Agent
See which cookie values are most commonly sent to Varnish.
varnishtop -i RxHeader -I Cookie
See which hosts are being accessed through Varnish. This will of course only give useful information if there are several hosts behind your Varnish instance.
varnishtop -i RxHeader -I '^Host:'
See which Accept-Charset values are used by clients.
varnishtop -i RxHeader -I '^Accept-Charset'
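The comparison suggested for the two User-Agent examples can be automated. The Python sketch below assumes you saved a one-shot snapshot of each (varnishtop -1 with the same options as above) and that each output line is a count followed by the tag and the header text; check that against your Varnish version. The min_requests cut-off is an arbitrary illustration.

```python
def parse_varnishtop(text):
    """Parse assumed 'count tag header-text' lines into {header-text: count}."""
    counts = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) == 3:
            try:
                # The tag (RxHeader/TxHeader) is dropped so both snapshots share keys.
                counts[parts[2]] = float(parts[0])
            except ValueError:
                continue  # skip lines that don't start with a number
    return counts


def miss_prone_agents(client_counts, backend_counts, min_requests=100):
    """Rank user agents by the share of their client requests that reached the backend."""
    rows = []
    for agent, total in client_counts.items():
        if total < min_requests:
            continue  # too little traffic to say anything useful
        rows.append((agent, backend_counts.get(agent, 0) / total))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Agents near the top of the list send most of their traffic through to the backend, which makes them the first candidates to look at in your VCL.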
varnishlog is yet another powerful tool for logging the requests you want to analyze. Run without parameters, it is also very useful when developing your VCL, to see the exact results of your changes in all their verbosity. See https://www.varnish-cache.org/docs/trunk/tutorial/logging.html for the manual and a few examples. You will find its syntax very similar to that of varnishtop.
One useful example for listing all details about requests resulting in a 500 status:
varnishlog -b -m "RxStatus:500"
varnishncsa is a handy tool for producing Apache/NCSA formatted logs. This is very useful if you want to log the requests to Varnish and analyze them with one of the many available log analyzers that read such logs, for instance AWStats.
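For a quick look without a full log analyzer, the Python sketch below parses lines in the NCSA combined format and counts responses per status code. It assumes varnishncsa's default output, which follows that combined format; a custom varnishncsa format string would need a different regex.

```python
import re
from collections import Counter

# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one combined-format line as a dict, or None."""
    m = COMBINED_RE.match(line)
    return m.groupdict() if m else None


def status_counts(lines):
    """Count responses per HTTP status code, skipping lines that don't parse."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry:
            counts[entry["status"]] += 1
    return counts
```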