Scaling PHP Applications (2014)
Case Study: How DNS took Twitpic down for 4 hours
Believe it or not, DNS resolution actually caused a significant amount of downtime at Twitpic in 2010.
At the time, we weren’t using our own DNS servers and relied on our datacenter’s DNS servers to handle DNS resolution. That morning, I pushed a code change to move our sessions from a single memcache server to be split across multiple memcache servers. The configuration line listing the memcache servers looked like this:
Notice the newline after the second host. In my text editor, it just wrapped, so I didn’t notice it before deploying.
I deployed the code and the site went down. I noticed the extra newline, figured that it caused the issue, reverted the change, and everything was back to normal. Great.
A few minutes later our application processes started hanging. This caused a huge domino effect, which took down the entire site. What broke? I had already reverted the bad change…
It turns out, due to a bug in the memcache library, when it saw the host \n\r192.0.2.7 in $SESSION_SAVE_PATH, it parsed it as a domain name and not an IP address. In an attempt to resolve the domain \n\r192.0.2.7, it made millions of invalid domain lookups to our hosting provider’s DNS resolver.
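A pre-deploy sanity check on the server list can catch this kind of stray line break before it reaches production. The following is only a minimal sketch: the function name validateMemcacheHosts, the optional tcp:// prefix, the IPv4-only rule, and the 192.0.2.6 address are all assumptions made for illustration, not the code Twitpic ran.

```php
<?php
/**
 * Hypothetical pre-deploy check for a comma-separated memcache server list.
 * Returns cleaned "ip:port" entries, or throws if an entry could be
 * mistaken for a hostname. Assumes an IPv4-only pool.
 */
function validateMemcacheHosts(string $savePath): array
{
    $clean = [];
    foreach (explode(',', $savePath) as $entry) {
        // A stray CR/LF is what turned an IP address into a bogus "domain".
        if (preg_match('/[\r\n]/', $entry) === 1) {
            throw new InvalidArgumentException('Line break in entry: ' . json_encode($entry));
        }

        // Plain spaces or tabs around the comma are harmless; strip them.
        $entry = trim($entry, " \t");

        // Drop an optional tcp:// prefix, then split host from port.
        $hostPort = preg_replace('#^tcp://#', '', $entry);
        $parts = explode(':', $hostPort);
        if (count($parts) !== 2) {
            throw new InvalidArgumentException('Expected ip:port, got: ' . json_encode($entry));
        }
        [$host, $port] = $parts;

        // Anything that is not a literal IPv4 address would need a DNS lookup.
        if (filter_var($host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
            throw new InvalidArgumentException('Not an IPv4 address: ' . json_encode($host));
        }

        $portOptions = ['options' => ['min_range' => 1, 'max_range' => 65535]];
        if (filter_var($port, FILTER_VALIDATE_INT, $portOptions) === false) {
            throw new InvalidArgumentException('Invalid port: ' . json_encode($port));
        }

        $clean[] = $host . ':' . $port;
    }
    return $clean;
}
```

Calling validateMemcacheHosts("tcp://192.0.2.6:11211, tcp://192.0.2.7:11211") returns the two cleaned entries, while a list containing \n\r192.0.2.7 throws, so a deploy script could stop before the broken value ever goes live.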
They really didn’t like this: it’s a shared service and we were effectively DoSing them with invalid requests, so they promptly blacklisted our servers from their DNS resolver, and the block stayed in place even after the change was reverted. Being blocked from DNS, with the default system settings, caused a 5-second timeout every time we tried to do a DNS lookup, which we do quite frequently as our code uses several 3rd party APIs, including Twitter, Amazon S3, ZenCoder, etc. Blocking for 5 seconds on each lookup ate into our available PHP-FPM daemons and eventually brought everything to a halt.
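To see why a 5-second stall is so destructive, it helps to run the numbers on a worker pool. This is a back-of-the-envelope sketch: the pool size of 100 workers and the 100 ms baseline request time are made-up figures for illustration, and maxRequestsPerSecond is a hypothetical helper.

```php
<?php
// Hypothetical helper: upper bound on throughput when each worker
// handles one request at a time.
function maxRequestsPerSecond(int $workers, float $secondsPerRequest): float
{
    return $workers / $secondsPerRequest;
}

// Illustrative numbers only, not Twitpic's real figures.
$workers = 100;
$healthy = maxRequestsPerSecond($workers, 0.1);       // 100 ms per request
$blocked = maxRequestsPerSecond($workers, 0.1 + 5.0); // same request plus one 5 s DNS timeout

printf("healthy: %.0f req/s, blocked: %.1f req/s\n", $healthy, $blocked);
```

Under these assumptions, a single timed-out lookup per request cuts capacity from about 1,000 requests per second to under 20, so incoming traffic quickly ties up every available worker.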
This was a strange, unexpected problem, so it took several hours to debug and determine the root cause. That was promptly followed by an hour of frantically setting up our own internal DNS solution. So, recognize that if your application makes any 3rd party API calls, you need to take control of your DNS situation BEFORE it bites you in the ass.
Obviously, using a caching resolver like nscd or dnsmasq, coupled with a low timeout in /etc/resolv.conf, would have helped to avoid the issue entirely.
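As a rough illustration, a low-timeout /etc/resolv.conf might look like the fragment below. It assumes a caching resolver such as dnsmasq is already listening on 127.0.0.1 with its own upstream servers configured separately; the specific values are a starting point to tune, not a recommendation from the original incident.

```
# Send lookups to the local caching resolver (assumed to be on 127.0.0.1).
nameserver 127.0.0.1

# Give up after 1 second per attempt instead of the default 5,
# and try only once before returning an error.
options timeout:1 attempts:1
```

With settings like these, a dead or blocked resolver costs each lookup about a second rather than 5 seconds or more, which leaves far more worker capacity free while the underlying problem is fixed.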