October 27th Downtime Postmortem
Hi, my name is Matthew Lyon and I’m responsible for rolling out changes to the PHP Fog website and orchestrating changes across our service infrastructure, and this is not what I wanted to be writing.
I wanted to write about how we’ve rolled out our new logging service. I wanted to tell you about how we can provide you with real-time viewing of your app’s access and error logs from our web console (or other HTTP tools such as curl), about how it will aggregate logs from across multiple servers, and about how we’ll be able to tie this into third-party log archiving and search services in the coming weeks.
Instead I’m writing about our downtime on the evening of Thursday, October 27th, and how we’re working to prevent it from happening again.
What Went Wrong
A platform as a service is a complex, interconnected system. Deploying a complex feature like a unified logging system, which collects and centralizes logs for tens of thousands of different applications, exposed some unforeseen scaling issues. Although we worked hard to predict and plan around such issues, developed and ran this feature in a cloned QA environment, and did load testing, we missed a few small details that temporarily brought our systems down.
It is our responsibility to provide the most reliable and robust platform for web development out there. Since reaching general availability, our reliability has been pretty terrific, which makes this outage really painful for us. We are using this opportunity to put the proper pieces in place to prevent this from ever happening again.
How We’re Going to Prevent This From Happening Again
We’re going to announce deployments before we do them. We are going to do a much better job letting you know when there will be maintenance on our site that may affect your sites.
We’re implementing staged deployments for new features. This will allow us to catch problems that get past our other safeguards before they affect all of our customers.
We’re improving our tools for deploying complex features like the logging service daemon. We are built on the cloud and we provide a cloud service, so we can take advantage of that. We are going to bake new service features into our server templates, fire up new servers for everyone’s apps, ensure consistency and reliability of data, flip a switch so that those new servers go live, and then turn the old servers off.
We’re improving our load testing for services that run on your servers to ensure they can handle organic real-world load. While many people are content to use ab or httperf and call it a day, we know that’s not enough, and apparently our load testing process wasn’t enough, either. We’re improving our load testing processes to better handle the needs of real-world applications, and will be testing them on our own applications before they are put to general use.
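As a rough illustration of the server-replacement sequence described above (new servers from an updated template, a consistency check, a switch of live traffic, then retiring the old servers), here is a minimal Python sketch. The Server class and the checks_pass and cut_over functions are hypothetical names invented for this illustration and are not PHP Fog's actual tooling; the check is only a stand-in for whatever data consistency and health verification a real deployment would run.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Server:
    """Hypothetical model of one app server (illustration only)."""
    name: str
    template: str          # which server template it was built from
    live: bool = False     # whether it currently receives traffic
    running: bool = True   # whether the machine is powered on


def checks_pass(server: Server) -> bool:
    """Stand-in for real data consistency and health checks (assumption)."""
    return server.running


def cut_over(
    old: list[Server],
    new_template: str,
    check: Callable[[Server], bool] = checks_pass,
) -> list[Server]:
    """Replace `old` servers with ones built from `new_template`.

    Returns the list of servers that are live when the function finishes.
    """
    # 1. Fire up replacements from the new template; they take no traffic yet.
    new = [Server(name=f"{s.name}-new", template=new_template) for s in old]

    # 2. Verify every replacement before it sees any traffic.
    if not all(check(s) for s in new):
        # Abort: shut the replacements down and leave the old servers live.
        for s in new:
            s.running = False
        return old

    # 3. Flip the switch: new servers go live, old ones stop taking traffic.
    for s in new:
        s.live = True
    for s in old:
        s.live = False

    # 4. Only after the switch, turn the old servers off.
    for s in old:
        s.running = False
    return new


if __name__ == "__main__":
    fleet = [Server("app-1", "v1", live=True), Server("app-2", "v1", live=True)]
    for server in cut_over(fleet, "v2"):
        print(server)
```

The point of ordering the steps this way is that the old servers stay live and untouched until the replacements have passed their checks, so a failed check leaves customers on the servers they were already using.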
Reliability has always been my first goal for PHP Fog, and is our primary goal as a service. We’re very sorry about the outage and are working to ensure that our service is more reliable than ever.
I look forward to being able to write about our new logging service soon.