Engineering and Operations: Bridging the Divide
A recent post by the folks over at Agile Web Operations discusses some common sources of tension between engineering and operations organizations in web companies: a mutual lack of experience in each other’s domains, conflicting departmental goals, and an us–against–them mentality drawn from social identity theory. Continuing the conversation, I suggest there is a subtler but more fundamental source of tension between engineers and operators that has to do with their different mindsets: developers think in terms of possibilities, while administrators think in terms of realities.
Developers tend to downplay—perhaps unconsciously—the significance of bugs because they understand how to fix them: just make a one-line change over here and tweak a unit test over there and we’re done. If she has a good idea how to fix a bug, a developer may file it away in the “solved” folder in her brain before she’s actually implemented the fix. I’m not saying developers aren’t concerned with quality—they are—or that they don’t fix bugs—they do. But how many times have you spotted a bug and dutifully reported it only to have the developer reassuringly tell you that, “yes, it’s a known issue, we’ll fix it sooner or later—probably later”?
Systems administrators, on the other hand, face the stark binary reality that the software either works or it doesn’t. It survives unanticipated load or it doesn’t. The pager goes off or it doesn’t. No amount of reassurance that the bug can be fixed easily will appease an administrator: if it’s broken, it’s broken. And during the first few iterations of a new product, the software frequently is, in fact, broken. Over time, administrators become conditioned to believe the software will always be broken. It is not uncommon for administrators to worry that a bug that brought the site down in months past, and has long since been fixed, might strike again during their next on-call shift.
I point to the difference in mindsets not to disparage one group or the other—I wore a sysadmin hat long before I wore my developer hat—but to expose a fundamental flaw with organizational structures that divide all site development and maintenance functions into just these two separate–but–equal groups. Despite the benefits afforded by the separation of responsibility that you get with distinct engineering and operations groups, such a structure breeds an inefficiency that can threaten a company’s ability to scale.
How well does your operations team understand your software components and how they interact? How well does your engineering team understand how your systems are built, or how they’re connected? When engineering and operations don’t understand each other’s domains, the result is a release process that is at best inefficient, and at worst dangerously fragile.
For example, even though engineering may write detailed release notes describing new features, systems administrators often don’t speak the same language—release notes are practically useless to operations. As a result, valuable time is wasted translating release notes into a language that operations understands: listings of the commands needed to deploy the software. Conversely, developers may not understand infrastructure dependencies (operating system versions, libraries, NFS mount points, firewall rules), leading to confusion (and possibly outages) when code is deployed to machines where it has no chance of working.
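One way to narrow this gap is for engineering to state infrastructure dependencies in a form that can be run on the target host before a deploy. The sketch below is only an illustration: `Requirement`, `preflight`, and the `/mnt/shared` mount point are hypothetical names chosen for the example, not part of any tool mentioned in this post.

```python
import os
import platform
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Requirement:
    """One infrastructure dependency, stated so that a machine can verify it."""
    description: str           # human-readable, e.g. "/mnt/shared is mounted"
    check: Callable[[], bool]  # returns True when the host satisfies the requirement


def preflight(requirements: list[Requirement]) -> list[str]:
    """Return the descriptions of every requirement the current host fails."""
    return [r.description for r in requirements if not r.check()]


if __name__ == "__main__":
    # Hypothetical requirements for an imaginary release.
    requirements = [
        Requirement("runs on Linux", lambda: platform.system() == "Linux"),
        Requirement("/mnt/shared is mounted", lambda: os.path.ismount("/mnt/shared")),
    ]
    failures = preflight(requirements)
    if failures:
        print("Do not deploy here; unmet requirements:")
        for description in failures:
            print(f"  - {description}")
    else:
        print("All declared requirements are met.")
```

A list like this could ship alongside the release notes, giving operations something to execute rather than something to translate.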
In shops that split all work on the production site along the false dichotomy of engineering and operations roles, most software releases require the two teams to work closely together, and so releases become a significant source of tension between the groups. If your systems administrators cringe whenever a release is coming up, you know you’ve got a problem. Releasing software is how your company grows, both by adding new features and by fixing bugs in the existing features. Yet if the administrators had their way, there’d be no releases.
Just about the time I had started thinking that what is needed is a third team responsible solely for releases and other aspects of the production site, a friend and colleague forwarded along a slide deck describing Google’s Site Reliability Engineering organization. This team is responsible for one thing: the production web site. Engineering is free to develop features and operations is free to think strategically about systems, storage, and network. What makes the SRE team so interesting is that it is staffed with (junior) engineers, so it’s got an engineering mindset, but at the same time it’s charged with an operations objective: keeping the web site up.
Using Google’s Site Reliability Engineering concept to frame my own thoughts, I tend to think of SRE as an internal customer of both the engineering and operations teams. SRE expects engineering to deliver working software, and they will file and track bugs when that is not the case. SRE should also make an effort to fix the bugs they have filed—something not possible when operations files all the bugs against production. Conversely, SRE expects operations to deliver the server, storage, and network infrastructure required to meet the demands of the production site. SRE leads capacity planning efforts, placing orders with operations for server, storage, and network expansion. SRE also constantly monitors the production site and is responsible for installing and configuring the monitoring software.
With the addition of an SRE team, the division of responsibilities starts to look like this:
- Operations delivers infrastructure
- Engineering delivers features
- Site Reliability delivers uptime
Despite the title, SRE should not report into the engineering organization. Rather, it should be its own, first-class, top-level organization, complete with executive representation at the VP level. I know what you’re saying: how much is it going to cost to staff yet another organization? Not as much as you think. Since SRE will off–load releases from operations, it may be possible to scale back the operations team. And since SRE removes the inefficiencies involved in translating release notes to deployment plans, engineers will have more time to work on features.
Operations managers may balk at the idea of scaling back their teams, arguing that they’re already so busy that they can’t complete all the work on their plates with the team they have. But look at what is consuming most of the time. It’s probably deployments, especially if they occur anywhere near the frequency of deployments at Flickr. Operations teams are also burdened with production incident response, a responsibility that rightly belongs in the SRE organization. By handing both releases and first–response duties off to SRE, the operations team workload will fall and the team can be restructured, eliminating some middle–tier systems administrator positions while retaining mostly the strategic thinkers (operations architects) and data center support engineers.
If you’ve been thinking “AUTOMATION!” while reading this, I hear you. I wholeheartedly agree that automation, when carefully conceived and conscientiously deployed, can improve efficiency and ease the tensions stemming from a manual release process. But for all the advances in the current generation of automation tools, it may still be a while before automation tools can configure themselves. Until then, who should own the configuration? Engineering understands the intrinsic properties of the software (the proper sequence in which to start the various components, the proper settings for feature-related properties), but operations has the extrinsic knowledge necessary to make the site work (which databases are available, which load balancers to use, and so on). It might be possible to arrive at a working configuration by merging the two teams’ knowledge, but I think it makes more sense if one group owns production and the associated automation configuration and workflows.
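To make the intrinsic/extrinsic split concrete, here is a minimal sketch of a single owner combining the two kinds of knowledge into one production configuration. Every key, host name, and function name in it is a hypothetical placeholder rather than the format of any real automation tool.

```python
# Hypothetical example values; none of these names come from a real system.

# Intrinsic: what engineering knows about the software itself.
INTRINSIC = {
    "start_order": ["cache", "api", "web"],
    "feature_flags": {"new_signup_flow": True},
}

# Extrinsic: what operations knows about the environment.
EXTRINSIC = {
    "databases": ["db1.example.internal"],
    "load_balancers": ["lb1.example.internal"],
}


def build_production_config(intrinsic: dict, extrinsic: dict) -> dict:
    """Combine both kinds of knowledge into one config owned by one team.

    Refuses to merge when both sources define the same key, because a
    silent override would hide a disagreement about who owns that setting.
    """
    overlap = sorted(set(intrinsic) & set(extrinsic))
    if overlap:
        raise ValueError(f"keys defined by both teams: {overlap}")
    return {**intrinsic, **extrinsic}


if __name__ == "__main__":
    config = build_production_config(INTRINSIC, EXTRINSIC)
    for key in sorted(config):
        print(key, "=", config[key])
```

Raising on overlapping keys is a deliberate choice in this sketch: it forces the owning team to resolve the conflict explicitly instead of letting one side quietly win.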
Ultimately, by freeing other teams to focus on their core competencies, Site Reliability Engineering can increase uptime and help the company scale, all while reducing tensions between engineering and operations. What more can you want from a three-letter acronym?



Interesting idea!
Jonathan Aquino
6 Apr 09 at 2:13 pm
While I agree with a lot of the points you make, and I see the purpose of the SRE distinction, I can’t help but feel that the “us versus them” divide in perspectives can be bridged without creating a ‘middle’ ground group.
Still thinking on it…
John Allspaw
26 Jun 09 at 4:02 pm
@allspaw: Thanks for your comment. Flickr is a great example of how two teams can amicably and efficiently run a large, dynamic site. It’s interesting to see how different sites approach the SRE position. Google SRE reports into engineering, whereas Facebook SRE reports into operations. The design of the reliability team, and of the software release process, seems to vary with company culture and platform architecture.
At Ning, we don’t have a dedicated SRE team — instead, we have weekly on-call rotations in eng and ops. I have maybe a somewhat unusual perspective, having been on-call first in ops and then later in eng. Our guiding philosophy is that “operations owns production,” which means the on-call engineer does not have authority to deploy code or restart services. Instead, the on-call engineer can request that the on-call operations person make a certain change. I think this is a common approach in enterprise shops, especially those with SOX or PCI compliance concerns, but I’m not sure it scales for web shops doing multiple deployments a day — hence my post.
Something about a tripartite organization consisting of specialized infrastructure, feature, and availability teams appeals to me, but I’ve not seen it done this way in practice. The success of Flickr, Facebook, and Google does seem to suggest that a third, top-level SRE team may not be necessary. In Ning’s case, we might be able to increase efficiency and reduce tension if we empowered engineering to deploy code and restated our core philosophy as, “the on-call team owns production.”
clay
27 Jun 09 at 9:01 am
Very nice post, Clay. Unfortunately, it is a rare treat to read something so articulate and thoughtful about our industry.
I agree with much of what you say, but I have a slightly different take, and may share some of John’s reservations. I’ve always been confused by the sheer number of silos that exist and are always being created within IT. When I began my Internet life, I was dumped into a shell without much more than a blinking cursor and a pat on the back. I had to figure it out. When I wanted to run a machine that could run the flavor of BSD I preferred, I had to build it. When I wanted a lab to try things out in, I had to learn networking in order to build it. In order to single-handedly manage a growing userbase, I learned how to write scripts and tools. When my employer wanted to build a national network, I had to learn a lot more.
Once I learned enough about networks to realize the intriguing growth possibilities, I had to improve my programming skills and learn elements of software architecture, so my tools would scale properly. When my attraction to scale took me to a much larger shop, I had to learn about the theory of operations. When I started doing consulting, I had to learn about project management, as well as “the business”. When I joined a VC company, I had to learn how to evaluate things that didn’t exist, and consider the many paths that we might all end up following. When I went to academia, my attraction to scale and complexity took me into high performance computing and cognitive computing, where concepts of parallelism, distribution, and efficient automation and management were not nice-to-haves but must-haves.
All the while, I never felt that I was changing into a different person. My pursuit of perfection in my work, birthed by access to uptime(1) and a motley community of competitive hackers, never wavered. My interest in understanding how systems worked remained acute, and the more systems the better, including ‘soft’ systems like organizations, businesses, economies, and brains. I was a hacker, and my primary tools were not shell commands or Linux packages. They were, are, and always will be my curiosity, my intuition, my way of thinking and unpacking problems into constituent parts, my energy, my passion for learning, my willingness to push boundaries, and my fearlessness.
At the time I began, available resources were scarce — most of the knowledge was locked up in small social circles, traded at conferences and deep within vendor circles and universities. Of course, information quickly exploded, and today the amount of valuable information out there is stunning.
While I am reluctant to call this out as a bad thing, I have noticed the industry change enormously. I did a stint in academia for about 4 or 5 years, and when I came back to the valley, everything had changed. Many of the top engineers I worked with after the hiatus had graduated from CS programs while I was out of the industry. Some of them were born with Linux distros in their laps. Their skillsets were impressive and accomplishments myriad, but it felt like something was missing. The average mentality that I had been exposed to when I came up was like me — a hacker. Now it seems to have morphed into something else, but I’m not sure what to call it. Computer scientist? Or maybe Capitalist?
I can’t be sure, but I feel that the massive growth in the industry, the amount of money being thrown around, and the ever growing amount of shared knowledge in the cloud is having at least one adverse effect. In my opinion, there is too much specialization, and too much focus around goals that involve more traditional pursuits such as ego, money, and traditional status. I see less focus on knowledge for the pure sake of it, ability for the honor in it, and uptime for the glory.
I dearly miss the old days, but I’m glad that I’m young enough still (31) to be a part of more than one major era in the development of the Internet, and I’m proud to have been raised by hackers who were there at the beginnings. For myself personally, I count 5 eras so far — pre-Mosaic, pre-Netscape IPO, pre-dotcom crash, web2.0, and now the cloud era. For my 15 or so years in IT, that’s an average of 1 new era every 3 years. It’s incredible when you think about it with some perspective.
Things are moving quickly. I don’t know what will come next, but I hope we don’t end up destroying all of the opportunities for young kids like me to be exposed to the broad variety of fascinating and exciting challenges within IT, and within life. I hope that curious young people are given the opportunity to take their time with things, to learn about them more deeply, and to gain wisdom rather than just collections of repeatable skills. The world needs hackers, not just role players.
Brian Merritt
21 Jul 09 at 4:09 pm