~clay

merely my musings

Engineering and Operations: Bridging the Divide

A recent post by the folks over at Agile Web Operations discusses some common sources of tension between engineering and operations organizations in web companies: a mutual lack of experience in each other’s domains, conflicting departmental goals, and an us-against-them mentality drawn from social identity theory. Continuing the conversation, I suggest there is a subtler but more fundamental source of tension between engineers and operators that has to do with their different mindsets: developers think in terms of possibilities, while administrators think in terms of realities.

Developers tend to downplay, perhaps unconsciously, the significance of bugs because they understand how to fix them: just make a one-line change over here and tweak a unit test over there and we’re done. A developer who has a good idea of how to fix a bug may file it away in the “solved” folder in her brain before she has actually implemented the fix. I’m not saying developers aren’t concerned with quality (they are) or that they don’t fix bugs (they do). But how many times have you spotted a bug and dutifully reported it, only to have the developer reassuringly tell you, “Yes, it’s a known issue; we’ll fix it sooner or later, probably later”?

Systems administrators, on the other hand, face the stark binary reality that the software either works or it doesn’t. It survives unanticipated load or it doesn’t. The pager goes off or it doesn’t. No amount of reassurance that the bug can be fixed easily will appease an administrator: if it’s broken, it’s broken. And during the first few iterations of a new product, the software frequently is, in fact, broken. Over time, administrators become conditioned to believe the software will always be broken. It is not uncommon for administrators to worry that a bug that once brought the site down will strike again during their next on-call shift, even though it was fixed months ago.

I point to the difference in mindsets not to disparage one group or the other (I wore a sysadmin hat long before I wore my developer hat) but to expose a fundamental flaw in organizational structures that divide all site development and maintenance functions into just these two separate-but-equal groups. Despite the benefits of the separation of responsibility that comes with distinct engineering and operations groups, such a structure breeds an inefficiency that can threaten a company’s ability to scale.

How well does your operations team understand your software components and how they interact? How well does your engineering team understand how your systems are built, or how they’re connected? When engineering and operations don’t understand each other’s domains, the result is a release process that is at best inefficient, and at worst dangerously fragile.

For example, even though engineering may write detailed release notes describing new features, systems administrators often don’t speak the same language—release notes are practically useless to operations. As a result, valuable time is wasted translating release notes into a language that operations understands: listings of the commands needed to deploy the software. Conversely, developers may not understand infrastructure dependencies (operating system versions, libraries, NFS mount points, firewall rules), leading to confusion (and possibly outages) when code is deployed to machines where it has no chance of working.
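One way to narrow that gap is to have a release declare its infrastructure dependencies in a form that can be checked mechanically before anyone deploys. What follows is only a minimal sketch of that idea: the `Requirements` and `HostFacts` types, the `preflight` function, and every field name are hypothetical, invented here for illustration rather than taken from any real tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Requirements:
    """Infrastructure a release needs (fields are illustrative assumptions)."""
    os_version: str
    libraries: set[str] = field(default_factory=set)
    nfs_mounts: set[str] = field(default_factory=set)


@dataclass
class HostFacts:
    """What a target machine actually provides (hypothetical inventory data)."""
    os_version: str
    libraries: set[str] = field(default_factory=set)
    nfs_mounts: set[str] = field(default_factory=set)


def preflight(req: Requirements, host: HostFacts) -> list[str]:
    """Return readable problems; an empty list means no mismatch was found."""
    problems = []
    if host.os_version != req.os_version:
        problems.append(f"os_version: need {req.os_version}, found {host.os_version}")
    # Set difference yields whatever the release needs but the host lacks.
    for lib in sorted(req.libraries - host.libraries):
        problems.append(f"missing library: {lib}")
    for mount in sorted(req.nfs_mounts - host.nfs_mounts):
        problems.append(f"missing NFS mount: {mount}")
    return problems
```

A check like this covers only what the release bothers to declare, so it complements rather than replaces the two teams actually understanding each other’s domains.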

In shops that force all work on the production site into the false dichotomy of engineering and operations roles, most software releases require the two teams to work closely together, and so releases become a significant source of tension between the groups. If your systems administrators cringe whenever a release is coming up, you know you’ve got a problem. Releasing software is how your company grows, both by adding new features and by fixing bugs in existing ones. Yet if the administrators had their way, there’d be no releases.

Just about the time I had started thinking that what is needed is a third team responsible solely for releases and other aspects of the production site, a friend and colleague forwarded along a slide deck describing Google’s Site Reliability Engineering organization. This team is responsible for one thing: the production web site. Engineering is free to develop features and operations is free to think strategically about systems, storage, and network. What makes the SRE team so interesting is that it is staffed with (junior) engineers, so it’s got an engineering mindset, but at the same time it’s charged with an operations objective: keeping the web site up.

Using Google’s Site Reliability Engineering concept to frame my own thoughts, I tend to think of SRE as an internal customer of both the engineering and operations teams. SRE expects engineering to deliver working software, and it will file and track bugs when that is not the case. SRE should also make an effort to fix the bugs it has filed, something that is not possible when operations files all the bugs against production. Conversely, SRE expects operations to deliver the server, storage, and network infrastructure required to meet the demands of the production site. SRE leads capacity planning efforts, placing orders with operations for server, storage, and network expansion. SRE also constantly monitors the production site and is responsible for installing and configuring the monitoring software.

With the addition of an SRE team, the division of responsibilities starts to look like this:

  • Operations delivers infrastructure
  • Engineering delivers features
  • Site Reliability delivers uptime

Despite the name, SRE should not report to the engineering organization. Rather, it should be its own first-class, top-level organization, complete with executive representation at the VP level. I know what you’re saying: how much is it going to cost to staff yet another organization? Not as much as you think. Since SRE will offload releases from operations, it may be possible to scale back the operations team. And since SRE removes the inefficiencies involved in translating release notes into deployment plans, engineers will have more time to work on features.

Operations managers may balk at the idea of scaling back their teams, arguing that they’re already so busy that they can’t complete all the work on their plates with the team they have. But look at what is consuming most of the time. It’s probably deployments, especially if they occur anywhere near the frequency of deployments at Flickr. Operations teams are also burdened with production incident response, a responsibility that rightly belongs in the SRE organization. By handing both releases and first-response duties off to SRE, the operations team’s workload will fall and the team can be restructured, eliminating some middle-tier systems administrator positions while retaining mostly the strategic thinkers (operations architects) and data center support engineers.

If you’ve been thinking “AUTOMATION!” while reading this, I hear you. I wholeheartedly agree that automation, when carefully conceived and conscientiously deployed, can improve efficiencies and ease the tensions stemming from a manual release process. But for all the advances in the current generation of automation tools, it may still be a while before automation tools can configure themselves. Until then, who should own the configuration? Engineering understands the intrinsic properties of the software (the proper sequence to start the various components, the proper settings for feature-related properties), but operations has the extrinsic knowledge necessary to make the site work (which databases are available, which load balancers to use, and so on). It might be possible to arrive at a working configuration by merging the two teams’ knowledge, but I think it makes more sense if one group owns production and the associated automation configuration and workflows.
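To make the ownership question concrete, here is a rough sketch of what merging the two kinds of knowledge might look like. Everything in it is an assumption made for illustration: the `build_production_config` function, the setting names, and the example hostnames are hypothetical and do not come from any particular automation tool.

```python
def build_production_config(intrinsic: dict, extrinsic: dict) -> dict:
    """Combine engineering-owned and operations-owned settings into one config.

    Refuses to guess when both sides define the same key.
    """
    contested = sorted(intrinsic.keys() & extrinsic.keys())
    if contested:
        raise ValueError(f"settings defined by both teams: {contested}")
    return {**intrinsic, **extrinsic}


# Hypothetical settings, invented purely for illustration.
engineering = {"start_order": ["cache", "api", "web"], "photo_upload_enabled": True}
operations = {"database_host": "db1.example.com", "load_balancer": "lb1.example.com"}

production = build_production_config(engineering, operations)
```

The merge itself is trivial. The hard part is the `ValueError` branch: someone has to decide what happens when both sides claim a setting, which is exactly why a single owner for the production configuration is appealing.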

Ultimately, by freeing other teams to focus on their core competencies, Site Reliability Engineering can increase uptime and help the company scale, all while reducing tensions between engineering and operations. What more could you want from a three-letter acronym?

Written by clay

April 2nd, 2009 at 8:22 pm

Posted in Engineering, Operations