Scaling PHP Applications (2014)

Case Study: HTTP Caching and the Nginx Fastcgi Cache

This entire book has been, for the most part, dedicated to scaling our backend services— making our frontend faster by improving server performance, database caching, and getting the result back to the client as fast as possible. Generally this is a good strategy, but we’ve skipped talking about an entirely different side of caching: HTTP Caching. What if we could have the client cache our output, in the browser or API client, and skip the trip to the server altogether? That would be most excellent.

This is exactly what HTTP Caching does— it provides a mechanism for us to define rules on when the client should check back for a new version in the future.

information

In the examples below, > designates the start of a client request, while < represents the start of a server response.

In its simplest form, HTTP Caching works by sending back an Expires header that tells the browser that this response will be valid until the timestamp specified.

> GET /some_page.html

< HTTP/1.1 200 OK
  Expires: Mon, 01 Jun 2015 01:00:00 GMT
  Cache-Control: max-age=29557199

question

Wait, what is Cache-Control? It’s a newer header. Use both Expires and Cache-Control with max-age. They both essentially do the same thing, and you’ll be covered for every browser. Cache-Control: max-age will override Expires in newer browsers. Think of the two synonymously.

This works great for static assets (images, stylesheets, etc) that we know have a long lifetime. But sometimes we don’t know when content will expire— an example being the front page of a news website. The homepage is updated several times per day when a new story is posted, but we can’t predict the news and don’t know when that will be. Guesstimating a timestamp for the Expires header means that people with the homepage cached in the browser could miss breaking news. We need something more robust.

To do more complex caching, we use the ETag header— it works like this:

1.    On the first GET request, the server generates a unique string value for the ETag. In our news homepage example, it can be something as simple as md5($latest_article->created_at).

2.    On subsequent requests, the browser will first make a HEAD request with the HTTP header If-None-Match: {$ETag}. On the server, you can take this value, compare it to md5($latest_article->created_at), and if they match, just return HTTP status 304 Not Modified with no body. This instructs the browser to use the page it has cached, and lets you skip fully generating the page (if the values don’t match, you return a 200 OK instead, instructing the browser to refetch the full page).

An example showing ETags in practice is below.

# Make the first/uncached HTTP request
> GET /some_page.html

< HTTP/1.1 200 OK
  ETag: "098f6bcd4621d373cade4e832627b4f6"
  Cache-Control: public, max-age=86400

# Sometime in the future we go to the page again— instead,
# the browser checks for a newer version than the one in its cache
> HEAD /some_page.html
  If-None-Match: "098f6bcd4621d373cade4e832627b4f6"

< HTTP/1.1 304 Not Modified

Simple enough. By the way, notice the Cache-Control header. This is useful for adding more caching instructions— the public part is useful later when we talk about proxy cache servers, but essentially it tells everyone in between the server and the client whether the content is public (cacheable) or private (uncacheable). For example, you may not want a page with private or personalized content (think /timeline on twitter or /account on facebook) to be cached the same for everyone. The max-age part tells the browser to invalidate the cache and make a full request after the specified number of seconds.
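To make the max-age bookkeeping concrete, here is a minimal sketch (in Python for compactness; the function name and timestamps are made up) of the freshness check a client performs before deciding to reuse a cached response:

```python
import time

def is_fresh(stored_at, max_age, now=None):
    """True if a response cached at `stored_at` (Unix timestamp) is still
    fresh under Cache-Control: max-age=`max_age` (seconds)."""
    if now is None:
        now = time.time()
    return (now - stored_at) < max_age

# Cached 10 minutes ago with max-age=86400: still fresh, no request needed
print(is_fresh(stored_at=1000, max_age=86400, now=1000 + 600))    # True
# One second past max-age: stale, the browser must revalidate or refetch
print(is_fresh(stored_at=1000, max_age=86400, now=1000 + 86401))  # False
```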

Caching Static Assets

For simple files (images, stylesheets, webfonts), we can have nginx handle caching headers for us automatically— we just need to add a location block to our nginx server configuration.

location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
  expires max;
}

Easy. But what about when you make a change to these assets? They are cached in the browser and outside of our control— how do we tell the browser to download the newest version?

Assets in the browser are cached based on their full URL— including the query string! All you have to do is append a “cache buster” to the query string of the asset to force the browser to do a cold pull. Incrementing a version on deploy is easy (some frameworks, like Rails, even do this for you).

The simplest solution is something like <link rel="stylesheet" type="text/css" href="main.css?v=5">, where ?v=5 gets updated every time you make a change to your stylesheet and want to force browsers to update their cached version immediately.
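An alternative to hand-incrementing ?v=5 is deriving the buster from the file’s contents, so the URL changes exactly when the asset does. A sketch (in Python; the helper name is made up):

```python
import hashlib

def cache_busted_url(path, contents):
    """Append a short content hash as the cache-busting query string,
    so the URL changes if and only if the file's contents change."""
    digest = hashlib.md5(contents).hexdigest()[:8]
    return path + "?v=" + digest

old = cache_busted_url("main.css", b"body { color: black; }")
new = cache_busted_url("main.css", b"body { color: navy; }")
print(old != new)  # True -- editing the stylesheet yields a fresh URL
```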

Ok, we’ve got static asset caching down, but how do we handle the magic in our PHP controllers? How do we add caching to the code paths that allow it? Here’s an example of HTTP Caching in some framework pseudo-code.

<?php
class UsersController {
  public function show($id) {

    $user = $memcache->get("user:{$id}");

    $method = $_SERVER["REQUEST_METHOD"];

    if ($method == "HEAD") {

      // Some browsers wrap the ETag in double quotes, so strip them
      $theirs = trim($_SERVER["HTTP_IF_NONE_MATCH"], '"');
      $mine   = md5($user["updated_at"]);

      if ($theirs == $mine) {
        return header("HTTP/1.1 304 Not Modified");
      } else {
        return header("HTTP/1.1 200 OK");
      }
    }

    // ... Do normal action stuff
    $user->expensiveDatabaseCall();
    $view->render("users/show");
  }
}

Notice how we are able to quickly compare the ETag with some data from memcached and potentially skip our expensive database calls? It’s also worth pointing out that you should wrap $_SERVER['HTTP_IF_NONE_MATCH'] in a call to trim() to remove double quotes from the ETag value— some browsers send them, some do not.

We can also set the Expires header here just as easily, with header("Expires: $timestamp"), if you know that the action will not change for a long time (maybe a contact page or terms of service).
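Note that Expires wants an RFC 1123 HTTP-date, not a raw Unix timestamp. A sketch of producing one (shown in Python for brevity; in PHP, gmdate('D, d M Y H:i:s T', $ts) does the same job):

```python
from email.utils import formatdate

def expires_header(expires_at):
    """Format a Unix timestamp as the RFC 1123 HTTP-date Expires expects."""
    return "Expires: " + formatdate(expires_at, usegmt=True)

print(expires_header(0))  # Expires: Thu, 01 Jan 1970 00:00:00 GMT
```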

information

A quick note about Nginx and ETags— there is currently a bug with Nginx where if you have gzip on set, Nginx will eat your ETags and never send them to the client. Annoying!

Caching Proxy

Ok, now that we’ve got client caching down, in practice we’ll find that it’s incredibly frustrating to not be in control of the end client. Browsers don’t always obey caching rules, each implements the rules slightly differently, and they don’t always do what you expect them to. While caching is fantastic and can greatly reduce the load on your servers, it’s annoying to leave your server load up to the mercy of Internet Explorer.

Enter the caching proxy— our own middle-man that acts as a middle-layer HTTP page cache between your backend php-fpm workers and the browser. It’s even built into Nginx! Basically, Nginx looks out for cache headers being set from PHP, and when it sees them, it stores a static copy of that response for the future. When a future web request is made, if there is a cached version nginx will skip PHP entirely and send the static copy to the client— all while respecting Expires and Cache-Control. Voilà! For the right use cases, it’s like magic.

I really like using Nginx for this sort of job— it’s already part of our stack on our app servers, we know how to configure it, and it’s fast as hell. You might hear people talk about Varnish, which is also decent, albeit much more complex to configure for a tiny bit of extra functionality while being equally as fast. Stick with Nginx unless you have a specific need that Varnish can solve better.

Which pages can I use my caching proxy for?

A caching proxy really works best for public, global pages. Remember, the cache is shared for all users, without any custom logic. Think user profiles, uniquely generated content, etc.— anything with a public, globally accessible URL. This limitation can be tricky for some scenarios that I’ve outlined below.

Logged in / logged out navigation bars

It’s pretty common to have a navigation bar on every page that changes depending on whether the user is logged in or not. If we cache the logged out version of the page, for instance, a logged in user that gets the same page from the proxy cache (instead of from PHP-FPM) will see the navigation bar in the logged out state. This is actually easily solvable— toggle the state in JavaScript by looking at some type of session cookie instead of doing it in the PHP view.

Authenticated views (private messages, etc)

Another concern is authenticated views with unique URLs. An example might be private messages— /messages/123456 is unique, but we only want authenticated users (the sender and receiver) to be able to view the page. Unfortunately, we have to skip the cache for this scenario. We’ll have to send it from PHP with the header Cache-Control: private, no-cache, no-store, which tells the caching proxy to skip caching the page.

Highly Dynamic Content

One more common scenario is handling content that’s highly dynamic, like a view counter that is increased on every page load. We don’t want something small like this hindering us from seeing the huge benefits of HTTP caching. The answer? Simple— move this type of dynamic content to AJAX and load it from the browser with a separate request.

Configuring Nginx to be a Cache for PHP-FPM

We can easily enable HTTP caching on our app server by using the built-in nginx fastcgi cache. It’s pretty simple to do, just add the following to your nginx server configuration.

# Set the path where the cache is stored; set the zone name (my_app),
# the shared key zone size (100m), and the inactive lifetime (60m)
fastcgi_cache_path /tmp/cache levels=1:2 keys_zone=my_app:100m inactive=60m;

# Set the cache key used, in this case: httpsGETtest.com/somepage.html
fastcgi_cache_key "$scheme$request_method$host$request_uri";

server {
  listen 80;
  root /u/apps/my_app;

  location ~ \.php$ {
    fastcgi_pass 127.0.0.1:8080;

    # Use the zone name we defined in fastcgi_cache_path above
    fastcgi_cache my_app;

    # Only cache HTTP 200 responses (no error pages)
    fastcgi_cache_valid 200 60m;

    # Only enable for GET and HEAD requests (no POSTs)
    fastcgi_cache_methods GET HEAD;

    # Bypass the cache if the request contains an HTTP Authorization header
    fastcgi_cache_bypass $http_authorization;

    # Don't store the response if the request contains an HTTP Authorization header
    fastcgi_no_cache $http_authorization;

    # Add a debugging header to show if the page came from cache
    add_header X-Fastcgi-Cache $upstream_cache_status;
  }
}

This configuration sets up a fastcgi cache in front of our PHP-FPM server. We define the location of the cached files with fastcgi_cache_path, reserving 100MB of shared memory for cache keys and automatically purging items that haven’t been accessed for 60 minutes.

The cache key is defined with fastcgi_cache_key— this can be changed depending on your use-case. For instance, notice how $scheme is included in the cache key? The $scheme variable holds the HTTP scheme (i.e., http or https), so by default pages accessed over SSL are cached separately from those accessed over plain HTTP. This may not be what you want. An example cache key would look something like httpsGETmy_app.com/index.php?foo=bar.
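To see how the pieces of fastcgi_cache_key combine, here is the concatenation spelled out (a Python sketch; the function name is made up):

```python
def fastcgi_cache_key(scheme, request_method, host, request_uri):
    """Mirror nginx's "$scheme$request_method$host$request_uri" cache key."""
    return scheme + request_method + host + request_uri

key = fastcgi_cache_key("https", "GET", "my_app.com", "/index.php?foo=bar")
print(key)  # httpsGETmy_app.com/index.php?foo=bar
```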

Next up, we define what types of HTTP responses are valid with fastcgi_cache_valid— just 200 responses. We don’t want to cache error pages. Likewise, with fastcgi_cache_methods, we limit caching to only GET and HEAD requests. Generally, it doesn’t make sense to cache the response of a POST, which is more often than not going to have dynamically generated content.

Lastly, we put in some definitions for fastcgi_cache_bypass and fastcgi_no_cache. These two settings allow you to either bypass checking the cache or skip storing a response in the cache, based on the presence of an arbitrary cookie, header, or nginx variable. In this case, we are skipping the cache if the Authorization header is present.

Remember, though— nginx will respect and obey your Expires and Cache-Control headers. If a page is sent from PHP-FPM with the Cache-Control: private, no-store header set, it will not be stored in the nginx cache.

information

Notice we also added a dynamic header, X-Fastcgi-Cache, which is really helpful for debugging. This header will display HIT if the file was served from the cache and MISS if it was not and was served from PHP. You can log this in your access_log and use it to calculate cache hit ratios.
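As a sketch of what that log analysis looks like (Python; the helper name is hypothetical), assuming you log the $upstream_cache_status value for each request:

```python
def cache_hit_ratio(statuses):
    """Fraction of requests answered from the cache, given the
    X-Fastcgi-Cache values (HIT, MISS, BYPASS, ...) from the access_log."""
    if not statuses:
        return 0.0
    hits = sum(1 for s in statuses if s == "HIT")
    return hits / len(statuses)

print(cache_hit_ratio(["HIT", "HIT", "MISS", "HIT"]))  # 0.75
```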

PHP-FPM with and without a proxy cache

I quickly benchmarked this on an i2.8xlarge EC2 server really simply— I set up a blank Laravel application (no database), with one page (/public) coded to always return Cache-Control: public and the other (/private) set up to always return Cache-Control: private. I ran ab (ApacheBench) with 100,000 requests— here are the number of requests per second:

/public  - 24690 requests per second, 4ms per request
/private - 3253  requests per second, 30ms per request

The takeaway here is— it is much, much, much faster to skip PHP and serve responses directly from an HTTP cache if at all possible. The results are clearly visible with the most basic of PHP scripts, and we’d see an even more substantial difference if the PHP code were doing more complex work, like hitting the database multiple times. That said, sometimes it’s just not possible if the data is too dynamic. But for the pages where it does make sense— boy, this can be a huge win.

question

How is the fastcgi cache stored on disk?

The cached responses that the nginx fastcgi cache saves are stored on disk and can be easily manipulated. The output of the response from PHP-FPM is captured and stored in fastcgi_cache_path (/tmp/cache in our example). The filename is predictable— it’s derived from the cache key of the request. For instance, for a GET request to https://domain.com/index.html, the cache key is httpsGETdomain.com/index.html. The response body will be stored in a file named md5(httpsGETdomain.com/index.html), or d5c94ba8b944742f64c115fa8c8e65ea.

The cache files aren’t just stored at the base of /tmp/cache, though. Nginx uses a two-level folder structure to avoid having millions of files in a single directory, which is bad for performance. The last three characters of the filename are used to build the folder structure (the last character for the first level, the two before it for the second, per levels=1:2), so for the example request above, the response body will be stored in /tmp/cache/a/5e/d5c94ba8b944742f64c115fa8c8e65ea. Why is this useful? If you delete this file, it will remove it from the fastcgi cache and force nginx to re-pull the page the next time it’s requested.
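The md5-plus-levels scheme is easy to reproduce yourself. Here’s a sketch (in Python; the helper name is made up) that derives the on-disk path for any cache key under levels=1:2:

```python
import hashlib

def cache_file_path(cache_dir, key, levels=(1, 2)):
    """Derive the path nginx stores a cached response at: md5 the cache
    key, then build subdirectories from the tail of the hex digest."""
    name = hashlib.md5(key.encode()).hexdigest()
    parts, pos = [], len(name)
    for n in levels:
        parts.append(name[pos - n:pos])  # take n chars working backwards
        pos -= n
    return "/".join([cache_dir] + parts + [name])

# md5("test") is 098f6bcd4621d373cade4e832627b4f6, so with levels=1:2:
print(cache_file_path("/tmp/cache", "test"))
# /tmp/cache/6/4f/098f6bcd4621d373cade4e832627b4f6
```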