Two reasons why PaaS is so much more than automation
Bruno Terkaly is a heck of an interesting and intelligent guy. I suggest you check out his many videos and writings. As a fellow developer evangelist, I look up to Bruno a lot. And like him, I’m heavily invested in the platform-as-a-service (PaaS) paradigm, as you can well imagine given that I work at AppFog. And so when I came across this piece of his from a while back, I couldn’t help but devour it and ruminate on it for several days. It’s an impressive bit of thinking but I feel that there are some serious problems with his understanding of PaaS.
The argument of the piece, titled “Why Platform as a Service (Robotics) will rule the world,” is essentially this: PaaS will rule the (cloud) world because the principle behind PaaS is automation, and automation is the core of a “radical technology revolution” that is slowly but surely making our global digital architecture more efficient. Terkaly even goes as far as to equate PaaS and “robotics” in the very title of the piece. The premise is that PaaS essentially roboticizes cloud infrastructure and thereby makes it vastly more efficient and easier to use.
How does this roboticized system work? The answer is what Terkaly calls the Fabric Controller, which sits at the very heart of any PaaS. It’s a kind of autonomous master robot that streamlines the use of virtual machines, automates repetitive tasks, provisions servers, provides complex monitoring capabilities, and the like. This master robot is what is missing from IaaS; the mission of PaaS, on his reading, is to fill this gap.
Terkaly’s core argument: PaaS == automation == value
For Terkaly, IaaS is an essential part of the cloud equation (how could it not be?), but IaaS is by definition not automated. This strikes me as fundamentally right. I would even go so far as to say that IaaS would lose a great deal of its value if it were to have too much automation built in, because the fundamental configurability of IaaS is what has guaranteed its rise thus far.
But IaaS, as Terkaly says, is “more flexible but more labor intensive.” We can slot a variety of things under this labor intensity: mastering and using tools like Puppet and Chef, becoming acquainted with the many nuances of EC2 or OpenStack or whatever–in essence, the entire “DevOps” paradigm. The problem with IaaS on its own is that it requires developers to do things like “directly interact with a portal or execute scripts for VMs to be created.”
This kind of direct interaction sounds nice, but what it ultimately means is lost time, lost productivity, and sunken resources of many other kinds. This labor intensity is a large part of what has guaranteed the rise of PaaS as we’ve witnessed it over the past few years. And whenever processes are needlessly labor intensive, there’s room for automation to make a decisive impact, and this, for Terkaly, is what we’ve been seeing.
The advantage of PaaS is that it (at best) provides a seamless-as-humanly-possible automation layer on top of cloud hardware that separates the application developer/deployer/manager from the hardware itself. This layer is responsible for providing developers with a solid UI, perhaps a command line tool, perhaps a set of database options, etc., instead of forcing them to touch bare metal. With that layer in place and operational, the developer no longer needs to mess with things like nginx or Apache configuration or SSHing into cloud VMs or being on call in the event of hardware failures. Automation moves into that territory in ways that save resources.
So far, so good. Terkaly’s framing of PaaS has a lot to be said for it. But…
Automation will never exhaust the value of PaaS
Terkaly’s picture of PaaS provides a solid conceptual jumping-off point for understanding aspects of PaaS in general, but it doesn’t even approach our particular vision of PaaS at AppFog. We think that Terkaly’s automation-centered vision has two crucial shortcomings that I’d like to discuss in turn.
1. Automation is great, but it requires skilled guidance
We think that automation is hugely important, and we use lots of it behind the scenes at AppFog. Automation drives our various analytics interfaces; we use Chef and other tools to automate large portions of our workflow; and core components of Cloud Foundry’s architecture, such as the Droplet Execution Agent (DEA), which handles tasks like starting and stopping applications, are designed with automation as a core principle.
But in our opinion, automation still requires management and deeply informed yet often on-the-spot decision making. Automation simply doesn’t add a great deal of value unless there is an incredibly skilled team of ops engineers working to constantly oversee the web of automation processes that drive PaaS as we envision it.
Automation is well and good, but if you’re stuck facing something like a major infrastructure outage (which does happen, as we all know), there’s simply no automating your way out of it. It takes experts with intimate knowledge of the granular details of a complex distributed system–actually, in our case a distributed system of distributed systems (!)–who are willing to respond to phone calls at all hours of the day and night. When we said a while back that AppFog is an ops company, this is what we meant. We simply dispute the idea that any Fabric Controller can be intelligent enough on its own.
“Let the robots do it” works just fine as a principle within highly circumscribed domains. But we believe strongly that PaaS will never be fully automatable as a computing paradigm. There are simply too many contingencies that can befall cloud hardware to be able to construct exception-driven automated processes to fully cope with them. Somebody needs to be on call 24/7 to deal with problems, to resuscitate non-responsive VMs, to re-start elements of AppFog’s architecture that are lagging, to investigate processes that are spiking in CPU usage, and so on.
PaaS should combine the best of both worlds. It should be a healthy admixture of automation and human oversight. And it will remain that way as long as PaaS is with us.
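To make that admixture concrete, here is a minimal Python sketch of a watchdog that tries an automated fix first and pages a human when no fix exists or the fix fails. Every name in it (`Incident`, `Watchdog`, `restart_vm`, the incident kinds) is hypothetical and chosen purely for illustration; it is not a description of AppFog's actual tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Incident:
    kind: str    # hypothetical label, e.g. "vm_unresponsive"
    target: str  # which VM, process, or data center is affected


@dataclass
class Watchdog:
    # Maps an incident kind to an automated fix that returns True on success.
    playbooks: Dict[str, Callable[[Incident], bool]] = field(default_factory=dict)
    # Incidents that automation could not resolve and a human must look at.
    pager: List[Incident] = field(default_factory=list)

    def handle(self, incident: Incident) -> str:
        playbook = self.playbooks.get(incident.kind)
        if playbook is not None and playbook(incident):
            return "auto-resolved"
        # No playbook, or the playbook failed: hand over to the on-call engineer.
        self.pager.append(incident)
        return "escalated"


def restart_vm(incident: Incident) -> bool:
    # Stand-in for a real restart call; it always reports success here.
    return True


watchdog = Watchdog(playbooks={"vm_unresponsive": restart_vm})
results = [
    watchdog.handle(Incident("vm_unresponsive", "vm-17")),
    watchdog.handle(Incident("regional_outage", "dc-east")),
]
print(results)
```

The circumscribed case (a hung VM) is handled by the robot, while the contingency nobody wrote a playbook for ends up on the pager, which is roughly the division of labor argued for above.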
2. PaaS should be about cloud interoperability, not just abstracting away hardware
This second point is, I think, even more decisive. An underlying yet unexpressed assumption throughout Terkaly’s piece is that PaaS is a single-infrastructure animal. There are repeated references to “the” data center. This isn’t surprising given that Terkaly is writing specifically about Windows Azure’s PaaS offering. But never in his discussion does he distinguish between Windows Azure and PaaS as such, which leaves him highly vulnerable to reasonable criticism.
The problem is that providing an automation/abstraction layer on top of cloud hardware is only one crucial component of PaaS. PaaS begins to take on a much different character when it is re-envisioned as a cross-cloud animal. Never once did we at AppFog envision PaaS as a tool for simply making it easier to run applications on Amazon EC2 or OpenStack or Eucalyptus or whatever. We wanted to abstract away the differences between all of them. We wanted to inject orchestration and portability where incompatibility had previously reigned.
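One way to picture "abstracting away the differences" is a single interface with one adapter per infrastructure. The sketch below is an assumption-laden illustration in Python: the class names, the `start_instance` method, and the returned handles are all invented for this example and do not call any real EC2 or OpenStack SDK.

```python
from abc import ABC, abstractmethod
from typing import Dict


class Infrastructure(ABC):
    """What the platform needs from any cloud, whatever its native API looks like."""

    name: str = "abstract"

    @abstractmethod
    def start_instance(self, app: str) -> str:
        """Start something that can run the app and return an opaque handle."""


class Ec2Style(Infrastructure):
    name = "ec2-style"

    def start_instance(self, app: str) -> str:
        # A real adapter would call the provider's API here.
        return f"instance:{app}"


class OpenStackStyle(Infrastructure):
    name = "openstack-style"

    def start_instance(self, app: str) -> str:
        # Different vocabulary underneath, same contract on top.
        return f"server:{app}"


def deploy(app: str, infra: Infrastructure) -> Dict[str, str]:
    # The caller never branches on which cloud it is talking to.
    handle = infra.start_instance(app)
    return {"app": app, "infra": infra.name, "handle": handle}


deployments = [deploy("blog", infra) for infra in (Ec2Style(), OpenStackStyle())]
for d in deployments:
    print(d)
```

The point of the design is that `deploy` stays the same when a new infrastructure is added; only a new adapter is written.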
For us, “the” data center or “the” cloud hardware doesn’t make a whole lot of sense heading into 2013. It is indeed true that PaaS has emerged because there’s a lot of value in turning hardware into a non-issue for developers and IT departments, but it also matters a great deal what hardware is being abstracted away. People don’t simply want a slightly smoother path to using this or that virtualization platform. This is not the kind of abstraction that we’re after.
In addition, thinking in a multi-infra way means deeply modifying Terkaly’s very conception of the Fabric Controller. AppFog indeed acts as a kind of Fabric Controller (although the analogy isn’t necessarily perfect), but we divide this function into two discrete levels: the data center (DC) level and the meta-DC level.
For Terkaly, the Fabric Controller is simply an OS for “the” data center. It is a “distributed stateful application distributed across data center nodes and fault domains.” That’s starting to sound a bit like AppFog. Except that for us, this distributed stateful application is distributed across data centers throughout the world (this is the meta-DC level).
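The two levels can be sketched as a small Python toy: one controller per data center, and a meta-level object that decides which data center an application lands in. The class names, the capacity numbers, and the "most free slots wins" placement rule are assumptions made up for this illustration, not AppFog's real scheduling logic.

```python
from typing import List, Optional


class DataCenterController:
    """DC level: knows only about capacity inside one data center."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.apps: List[str] = []

    def free_slots(self) -> int:
        return self.capacity - len(self.apps)

    def place(self, app: str) -> None:
        if self.free_slots() <= 0:
            raise RuntimeError(f"{self.name} is full")
        self.apps.append(app)


class MetaController:
    """Meta-DC level: chooses among data centers, then delegates."""

    def __init__(self, data_centers: List[DataCenterController]) -> None:
        self.data_centers = data_centers

    def place(self, app: str) -> Optional[str]:
        candidates = [dc for dc in self.data_centers if dc.free_slots() > 0]
        if not candidates:
            return None  # nothing automated can be done; a human has to decide
        # Pick the data center with the most free slots (first one wins ties).
        target = max(candidates, key=lambda dc: dc.free_slots())
        target.place(app)
        return target.name


meta = MetaController([
    DataCenterController("us-east", capacity=1),
    DataCenterController("eu-west", capacity=2),
])
placements = [meta.place(app) for app in ("a", "b", "c", "d")]
print(placements)
```

Each `DataCenterController` plays the role of a single-infrastructure Fabric Controller, while `MetaController` is the extra layer that a cross-cloud platform adds on top.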
The generation gap
Or, we could describe it this way: first-generation PaaS involved infra-specific Fabric Controllers of exactly the sort that Terkaly describes. In this generation of PaaS, the lone data center was the most important locus of automation and control. The cloud was so new and there were so few players in the IaaS space (well, really only one) that single-infra abstraction was an incredibly valuable service. It still is highly valuable in its own way, but this approach has not kept pace with the ever-expanding constellation of IaaS players and platforms from OpenStack clouds (Rackspace, HP, etc.) to Windows Azure and beyond.
Because we’re building AppFog as a platform for cross-cloud interoperability (and are already well on the way), our vision of cloud control departs quite a bit from Terkaly’s concept of the Fabric Controller. We do something similar, but we do so in principle across any hardware that AppFog can run on (including bare metal). If you want to describe AppFog as a Fabric Controller, that’s fine, but ours is a controller of controllers, called upon to orchestrate an array of AppFog instances that will grow and grow over time. There’s little in Terkaly’s vision that can fully conceptualize this kind of system.
And so if Terkaly’s Fabric Controller is a kind of meta-VM, responsible for managing and provisioning subordinate VMs, then AppFog is a kind of crazy meta-meta-VM, responsible for keeping in line all of the different AppFog mini-systems that combine into the system that is AppFog proper. Take all of the things that Terkaly’s Fabric Controller does, multiply them across data centers, and then insert a kind of Master Controller (not to mention the separate AppFog instance that we use for quality assurance).
I say a “kind” of Master Controller because we don’t do anything quite like that. At the end of the day, our engineering and ops teams are the Master Controller and the Single Point of Truth behind the operation–not to mention the place where automation reaches its limits and gives way to human oversight and intelligence.
Conclusion: the “Platform” in PaaS is a many-headed hydra
Terkaly’s article is a real tour de force and I highly recommend reading it. It’s excellent food for thought. But I would challenge you not to conflate Terkaly’s vision of PaaS with PaaS as such, because companies like AppFog show that there are vastly divergent visions of what PaaS should be and what it should offer.
Automation–the “rule of robots”–simply does not exhaust the definition of PaaS.