<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>~clay &#187; Engineering</title>
	<atom:link href="http://daemons.net/~clay/category/work/engineering/feed/" rel="self" type="application/rss+xml" />
	<link>http://daemons.net/~clay</link>
	<description>merely my musings</description>
	<lastBuildDate>Mon, 10 May 2010 17:48:58 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The various flavors of Ruby class attributes</title>
		<link>http://daemons.net/~clay/2009/05/16/the-various-flavors-of-ruby-class-attributes/</link>
		<comments>http://daemons.net/~clay/2009/05/16/the-various-flavors-of-ruby-class-attributes/#comments</comments>
		<pubDate>Sun, 17 May 2009 04:57:38 +0000</pubDate>
		<dc:creator>clay</dc:creator>
				<category><![CDATA[Engineering]]></category>
		<category><![CDATA[Geek]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://daemons.net/~clay/?p=292</guid>
		<description><![CDATA[Here&#8217;s a curious thing about Ruby: it&#8217;s got three flavors of class attributes. You can adorn your classes with class variables, class instance variables, and class constants. Not knowing the differences between them, and thinking that one of them might be useful for a project I was working on, I set out to figure out [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a curious thing about Ruby: it&#8217;s got three flavors of class attributes. You can adorn your classes with class variables, class instance variables, and class constants. Not knowing the differences between them, and thinking that one of them might be useful for a project I was working on, I set out to figure out how they all worked, especially with respect to inheritance.</p>
<p>As a trivial example of the design scenario I was working with, consider the case of an object-oriented vegetable garden. Vegetables come in all shapes, sizes, and colors, but we might want to say that all vegetables should be green unless we&#8217;ve said otherwise. We might start modeling our vegetable garden with a <code>Vegetable</code> class, and we could set a <code>color</code> attribute on it with a default value of <code>"green"</code>. <code>Lettuce</code>, which happens to be green, could inherit that attribute from <code>Vegetable</code>. <code>Eggplant</code>, however, should redefine <code>color</code> to be <code>"purple"</code>. </p>
<p>While certainly a contrived and flawed example, it demonstrates the behavior I was looking for. Let&#8217;s see how Ruby&#8217;s various flavors of class attributes can help us solve this design problem &#8212; or not.</p>
<p>First up, class variables:</p>
<pre class="brush: ruby;">
class Vegetable
  @@color = 'green'
  def color
    @@color
  end
end

class Eggplant &lt; Vegetable
  @@color = 'purple'
end

Vegetable.new.color  # =&gt; &quot;purple&quot;
Eggplant.new.color   # =&gt; &quot;purple&quot;
</pre>
<p>I wasn&#8217;t expecting that! Apparently a class variable is shared by a class and all of its subclasses, so assigning <code>@@color</code> in <code>Eggplant</code> overwrites the value that <code>Vegetable</code> sees as well.</p>
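<p>To see where the variable actually lives, a quick check along these lines may help. This is only a sketch, and it assumes Ruby 2.0 or later, where <code>class_variables</code> accepts an <code>inherit</code> flag:</p>
<pre class="brush: ruby;">
class Vegetable
  @@color = 'green'
end

class Eggplant &lt; Vegetable
  @@color = 'purple'   # finds the existing @@color in Vegetable and assigns to it
end

# Passing false lists only the variables defined directly on the receiver.
Vegetable.class_variables(false)          # =&gt; [:@@color]
Eggplant.class_variables(false)           # =&gt; []
Vegetable.class_variable_get(:@@color)    # =&gt; &quot;purple&quot;
</pre>
<p>There is only one <code>@@color</code> in the whole hierarchy, and it belongs to <code>Vegetable</code>.</p>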
<p>Next up, class instance variables:</p>
<pre class="brush: ruby;">
class Vegetable
  @color = 'green'
  class &lt;&lt; self
    attr_reader :color
  end
  def color
    self.class.color
  end
end

class Lettuce &lt; Vegetable
  # no need to set @color here, since lettuce is green ... right?
end

Vegetable.new.color  # =&gt; &quot;green&quot;
Lettuce.new.color    # =&gt; nil
</pre>
<p>No love here, either: a class instance variable belongs to one class object only, so <code>Lettuce</code> never sees the <code>@color</code> that was set on <code>Vegetable</code>. Probably for the better, since the code needed to reach class instance variables from instances is even uglier than the code needed to reach class variables.</p>
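<p>The <code>nil</code> in the example above comes from the variable, not from the reader: <code>Lettuce</code> does inherit the class-level <code>color</code> method, but that method reads <code>@color</code> on <code>Lettuce</code> itself. A minimal sketch of this, assuming Ruby 1.9 or later (where <code>instance_variables</code> returns symbols):</p>
<pre class="brush: ruby;">
class Vegetable
  @color = 'green'   # stored on the Vegetable class object itself
  class &lt;&lt; self
    attr_reader :color
  end
end

class Lettuce &lt; Vegetable
end

Lettuce.respond_to?(:color)      # =&gt; true, the reader method is inherited
Vegetable.instance_variables     # =&gt; [:@color]
Lettuce.instance_variables       # =&gt; [], nothing was ever set on Lettuce
Lettuce.color                    # =&gt; nil
</pre>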
<p>Class constants are right out:</p>
<pre class="brush: ruby;">
class Vegetable
  Color = 'green'
  def color
    Color
  end
end

class Eggplant &lt; Vegetable
  Color = 'purple'
end

Vegetable.new.color  # =&gt; &quot;green&quot;
Eggplant.new.color   # =&gt; &quot;green&quot;
</pre>
<p>Class constants are statically bound, so the polymorphic call to Vegetable#color from an Eggplant instance references the Color constant defined in Vegetable, not the one defined in Eggplant.</p>
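<p>Both constants do exist side by side; it is only the bare <code>Color</code> reference inside <code>Vegetable#color</code> that is pinned to <code>Vegetable</code>. A small sketch to illustrate:</p>
<pre class="brush: ruby;">
class Vegetable
  Color = 'green'
  def color
    Color              # looked up from where this method is written: Vegetable
  end
end

class Eggplant &lt; Vegetable
  Color = 'purple'     # defines a separate constant, Eggplant::Color
end

Vegetable::Color       # =&gt; &quot;green&quot;
Eggplant::Color        # =&gt; &quot;purple&quot;
Eggplant.new.color     # =&gt; &quot;green&quot;
</pre>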
<p>Giving up on the class attributes approach, I resorted to defining the attributes at the instance level. I considered explicitly setting a <code>@color</code> instance variable in the class <code>initialize</code> method, but then the attribute wouldn&#8217;t be constant. Instead, the simplest implementation that does what I want seems to be to use methods that return constant values:</p>
<pre class="brush: ruby;">
class Vegetable
  def color
    'green'
  end
end

class Lettuce &lt; Vegetable
end

class Eggplant &lt; Vegetable
  def color
    'purple'
  end
end

Vegetable.new.color  # =&gt; &quot;green&quot;
Lettuce.new.color    # =&gt; &quot;green&quot;
Eggplant.new.color   # =&gt; &quot;purple&quot;
</pre>
<p>So as it turns out, each of Ruby&#8217;s class attribute mechanisms behaves differently in subclasses. I&#8217;m sure class variables, class instance variables, and class constants have their utility, but none of them is a good fit for defining constant attributes that are shared by all instances of a class and can still be redefined in subclasses.</p>
]]></content:encoded>
			<wfw:commentRss>http://daemons.net/~clay/2009/05/16/the-various-flavors-of-ruby-class-attributes/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Simulating synchronous programming with Python generators</title>
		<link>http://daemons.net/~clay/2009/05/15/simulating-synchronous-programming-with-python-generators/</link>
		<comments>http://daemons.net/~clay/2009/05/15/simulating-synchronous-programming-with-python-generators/#comments</comments>
		<pubDate>Sat, 16 May 2009 06:41:06 +0000</pubDate>
		<dc:creator>clay</dc:creator>
				<category><![CDATA[Engineering]]></category>
		<category><![CDATA[Geek]]></category>
		<category><![CDATA[Systems Management]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[twisted]]></category>

		<guid isPermaLink="false">http://daemons.net/~clay/?p=287</guid>
		<description><![CDATA[Robey&#8217;s recent article on naggati reminded me of something I&#8217;d been idly pondering for a while. Having recently written an SSH-based host discovery scanner on top of the Twisted asynchronous programming library, I too yearned for a way to write sequences of commands in plain-old imperative code, hiding the callback complexities of event-driven code from [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://robey.lag.net/">Robey</a>&#8217;s recent article on <a href="http://robey.lag.net/2009/03/02/actors-mina-and-naggati.html">naggati</a> reminded me of something I&#8217;d been idly pondering for a while. Having recently written an SSH-based host discovery scanner on top of the <a href="http://twistedmatrix.com/">Twisted</a> asynchronous programming library, I too yearned for a way to write sequences of commands in plain-old imperative code, hiding the callback complexities of event-driven code from users.</p>
<p><a href="http://en.wikipedia.org/wiki/Continuation">Continuations</a> fit the bill nicely. These are functions from which you can return multiple times, resuming right where you left off. With continuations, you could write a sequence of functions that might make asynchronous calls, but the framework would call your continuation back where it left off.</p>
<p>Python does not have first-class continuations, but it does have <a href="http://en.wikipedia.org/wiki/Generator_(computer_science)">generators</a>, and these behave almost identically (for my purposes, at least). A generator is a function that can yield multiple values. Well, actually, it returns an iterator, which then can be used to fetch multiple values from the generator. An example will probably make it clear:</p>
<pre class="brush: python;">
&gt;&gt;&gt; def finite_generator():
...     yield 'apple'
...     yield 'orange'
...     yield 'pear'
...
&gt;&gt;&gt; iterator = finite_generator()
&gt;&gt;&gt; for fruit in iterator:
...     print fruit
...
apple
orange
pear
</pre>
<p>Generators can also run forever:</p>
<pre class="brush: python;">
&gt;&gt;&gt; def infinite_generator():
...     i = 0
...     while True:
...         yield i
...         i += 1
...
&gt;&gt;&gt; iterator = infinite_generator()
&gt;&gt;&gt; for i in iterator:
...     print i
...
0
1
2
3
4
5
... and on and on forever
</pre>
<p>I had been using iterators in my asynchronous host scanner whenever I needed to run asynchronous commands within a loop. The asynchronous programming model prevents you from writing something like:</p>
<pre class="brush: python;">
for foo in bar:
    async_method(foo)
</pre>
<p>Instead, you would do something like this:</p>
<pre class="brush: python;">
def callback(response, iterator):
    do_something_with_response(response)
    schedule_next_task(iterator)

def schedule_next_task(iterator):
    try:
        foo = iterator.next()
        deferred = async_method(foo)
        deferred.addCallback(callback, iterator)
    except StopIteration:
        pass

iterator = iter(bar)
schedule_next_task(iterator)
</pre>
<p>It works like this:</p>
<ol>
<li>We get an iterator for our list, bar &#8212; this could just as well be a generator function</li>
<li>We fetch the first value from the iterator and pass it to the asynchronous method</li>
<li>That method presumably makes some type of I/O request, and responds immediately with a Deferred instance</li>
<li>We add a callback function to the Deferred and request that our iterator instance be passed to it when it is called</li>
<li>Control returns to the event loop, which might be busy scheduling other I/O requests</li>
<li>When the I/O completes, the event loop calls our callback function with the response and our iterator instance</li>
<li>The callback processes the response, and then returns to step 2, fetching the next item from the iterator</li>
<li>When the iterator is exhausted, the cycle stops</li>
</ol>
<p>It occurred to me that I might be able to extend this concept to use generators as a sort of continuation to emulate synchronous code. What if, instead of returning strings or numbers from a generator, you returned functions? Some wrapper code could initialize the iterator, and then loop over it using the technique above, calling each function returned from the generator.</p>
<p>Tonight I decided to give this a try. Forking off an experimental branch and making a few modifications to the underlying fido host discovery routines, I crafted the following plesiochronous host scanner:</p>
<pre class="brush: python;">
#!/usr/bin/env python
#
# Use a generator to simulate synchronous execution on an asynchronous framework
#

from fido.common.command import RemoteCommandExecutor
from fido.common.host.unix import UnixHost
from fido.common.ssh import SSHCredentials

from contrib.host.software.sun.host import SolarisHost
from contrib.host.software.linux.host import LinuxHost

from twisted.internet import reactor

import pprint

class PlesiochronousHostScanner(object):
    &quot;&quot;&quot;
    Scans a host over SSH, building a list of host attributes. Built on the Twisted asynchronous
    library, but uses a Python generator function to emulate garden variety synchronous code.
    &quot;&quot;&quot;

    def __init__(self, address, credentials):
        &quot;&quot;&quot;
        address: the IP address to scan
        credentials: a hash like: { 'username': '...' , 'password': '...', 'public_key': '&lt;optional&gt;' }
        &quot;&quot;&quot;

        self.address = address
        self.credentials = credentials
        self.host = UnixHost(RemoteCommandExecutor(address, credentials))
        self.pp = pprint.PrettyPrinter()

        # create some scratch space for the discovery methods
        self.context = { }

        # get an iterator from the generator
        self.iterator = self.scanning_sequence()

    def scanning_sequence(self):
        &quot;&quot;&quot;
        A typical nugget of synchronous code, with one important exception: asynchronous
        functions must be yielded instead of being called directly.
        &quot;&quot;&quot;
        yield self.host.uname

        os = self.context['uname'].split()[0]

        if os == 'SunOS':
            self.host = SolarisHost.from_host(self.host)
            yield self.host.zonename
            yield self.host.zones
        elif os == 'Linux':
            self.host = LinuxHost.from_host(self.host)
        else:
            print &quot;Unable to scan host type: %s&quot; % os
            return

        yield self.host.hostid
        yield self.host.device
        yield self.host.bios
        yield self.host.installed_memory_in_MB
        yield self.host.interfaces

    def callback(self, response):
        self.context.update(response)
        self.schedule_next_task()

    def errback(self, error):
        print &quot;scanning error: %s&quot; % error

    def schedule_next_task(self):
        try:
            function = self.iterator.next()
            deferred = function()
            deferred.addCallbacks(self.callback, self.errback)
        except StopIteration:
            self.scan_complete()

    def start_scan(self):
        self.schedule_next_task()

    def scan_complete(self):
        print &quot;Scan of %s is complete&quot; % self.address
        self.pp.pprint(self.context)

        # In this contrived example, we'll stop the reactor when we've finished scanning a host
        reactor.stop()

if __name__ == '__main__':
    import sys
    from optparse import OptionParser
    parser = OptionParser()
    parser.add_option(&quot;-u&quot;, &quot;--username&quot;, dest=&quot;username&quot;)
    parser.add_option(&quot;-p&quot;, &quot;--password&quot;, dest=&quot;password&quot;)

    (options, args) = parser.parse_args()

    address = args.pop(0)
    credentials = iter([SSHCredentials(options.username, options.password, None)])

    scanner = PlesiochronousHostScanner(address, credentials)

    reactor.callWhenRunning(scanner.start_scan)

    reactor.run()
</pre>
<p>It works:</p>
<pre class="brush: plain;">
satellite:~ clay$ python pleisio.py -u username -p password 10.20.30.40
Scan of 10.20.30.40 is complete
{'bios': {'bios_date': '11/15/2007',
          'bios_vendor': 'Sun Microsystems',
          'bios_version': 'S39_3B25'},
 'device': {'system_product': 'Sun Fire X2200 M2',
            'system_serial': '0805QAT0EA',
            'system_uuid': 'bd6529dc-fc79-0010-9e1b-001b245c1d4f',
            'system_vendor': 'Sun Microsystems',
            'system_version': 'Rev 50'},
 'hostid': '0ec2daa6',
 'installed_memory_in_MB': 32768,
 'interfaces': {'bge0': {'ipv4_addresses': [10.20.30.40],
                         'ipv6_addresses': [],
                         'mac_address': 00:1B:24:5C:18:B5,
                         'zone': None},
                'lo0': {'ipv4_addresses': [],
                        'ipv6_addresses': [],
                        'mac_address': None,
                        'zone': None}},
 'uname': 'SunOS myhost.mydomain.com 5.10 Generic_127112-11 i86pc i386 i86pc',
 'zonename': 'global',
 'zones': {'myzone': {'brand': 'native',
                          'ip_mode': 'shared',
                          'root': '/zones/myzone',
                          'state': 'running',
                          'uuid': '09fbf9ba-c0c5-408f-c9e9-820471983f25',
                          'zonename': 'myzone'}}}
</pre>
<p>The beauty of this approach is that the <code>PlesiochronousHostScanner#scanning_sequence</code> method is pretty straightforward, and could actually be written by end users familiar with Python but not familiar with asynchronous programming. It also makes discovery logic much easier to understand than in the state-machine-based asynchronous discovery engine I had previously built.</p>
<p>Having just concocted this tonight, I&#8217;m not sure whether this is something I&#8217;ll pursue, but it has been a fun experiment. I&#8217;m curious what other asynchronous programmers think of this approach.</p>
]]></content:encoded>
			<wfw:commentRss>http://daemons.net/~clay/2009/05/15/simulating-synchronous-programming-with-python-generators/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Ruby, why do you torment me?</title>
		<link>http://daemons.net/~clay/2009/05/03/ruby-why-do-you-torment-me/</link>
		<comments>http://daemons.net/~clay/2009/05/03/ruby-why-do-you-torment-me/#comments</comments>
		<pubDate>Sun, 03 May 2009 17:39:03 +0000</pubDate>
		<dc:creator>clay</dc:creator>
				<category><![CDATA[Engineering]]></category>
		<category><![CDATA[Geek]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://daemons.net/~clay/?p=91</guid>
		<description><![CDATA[I want to like Ruby, I really do. The language is expressive, powerful, and eminently readable. Moreover, it&#8217;s fun to write. But try as I might to be productive, I keep running into quirks and gotchas with Ruby libraries that make me wish I were using a language with a more mature standard library. Things [...]]]></description>
			<content:encoded><![CDATA[<p>I want to like Ruby, I really do. The language is expressive, powerful, and eminently readable. Moreover, it&#8217;s fun to write. But try as I might to be productive, I keep running into quirks and gotchas with Ruby libraries that make me wish I were using a language with a more mature standard library. Things that take five minutes in Perl or Python have taken me all day to get working in Ruby.</p>
<p>SOAP support, which ought to be fully baked in Ruby by now, is still somewhat painful to work with. In Perl, SOAP just works. When I wrote our release orchestration tool a year ago, it took way longer than it should have to get Ruby talking to the SOAP iControl interface on our BigIP load balancers. By contrast, it took all of five minutes to get the Perl sample working &#8212; and that includes time spent installing the <code>SOAP::Lite</code> CPAN module.</p>
<p>Using Rails for the first time in a recent project, I was immediately struck by how little work is required to get a web app off the ground. I almost felt guilty for writing so little code. But a lot of the clever Rails magic that&#8217;s supposed to make life easier, didn&#8217;t. While error messages like &#8220;<a href="http://blog.teksol.info/2007/03/09/expected-x-to-define-y-error">Expected foo.rb to define Foo</a>&#8221; seem pretty straightforward, they are maddening when foo.rb does indeed define Foo. For their next trick, the Rails developers ought to use their meta-programming fu to produce intelligible error messages! <img src='http://daemons.net/~clay/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>We recently ported a Rails app to JRuby, and straight away we ran into bugs. JRuby couldn&#8217;t call Java correctly, and it had a file descriptor leak in Net::SSH that caused the site crawler component of our application to go belly-up after a few hours. And we should have known better than to try talking to Oracle from JRuby on Rails. The <code>activerecord-jdbc-adapter</code> component had myriad issues &#8212; goofy things like <code>"uninitialized constant ActiveRecord::VERSION"</code>, improper column name quoting, and incorrect integer datatype coercions. Finally we gave up and ported the database to MySQL.</p>
<p>I understand that Ruby and its libraries are open-source efforts written mostly by unpaid enthusiasts, so I try not to get too upset when things don&#8217;t work correctly. I wish I had the time to jump in and submit patches to fix issues when I run into them.</p>
]]></content:encoded>
			<wfw:commentRss>http://daemons.net/~clay/2009/05/03/ruby-why-do-you-torment-me/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>setuid() ate my CSS</title>
		<link>http://daemons.net/~clay/2009/05/02/setuid-ate-my-css/</link>
		<comments>http://daemons.net/~clay/2009/05/02/setuid-ate-my-css/#comments</comments>
		<pubDate>Sat, 02 May 2009 10:15:35 +0000</pubDate>
		<dc:creator>clay</dc:creator>
				<category><![CDATA[Engineering]]></category>
		<category><![CDATA[Systems Management]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[setuid]]></category>

		<guid isPermaLink="false">http://daemons.net/~clay/?p=244</guid>
		<description><![CDATA[We ran into an interesting problem while testing a new version of our code deployment tool tonight. By all appearances, the tool was happily deploying code and launching our Java applications, but one of our QA engineers noticed missing CSS on some pages in our test environment. Could that possibly be related to the code deployment tool, which essentially just untars an archive and forks off a little ruby script to start the application?]]></description>
			<content:encoded><![CDATA[<p>We ran into an interesting problem while testing a new version of our code deployment tool tonight. By all appearances, the tool was happily deploying code and launching our Java applications, but one of our QA engineers noticed missing CSS on some pages in our test environment. Could that possibly be related to the code deployment tool, which essentially just untars an archive and forks off a little ruby script to start the application?</p>
<p>Tracing the application&#8217;s system calls with truss revealed that the process was getting EPERM errors while trying to read the CSS files, which live on NFS. One of our more clever engineers decided to start up the application manually, not via the code deployment tool, and found that the CSS loaded just fine when the Java process was invoked directly from the shell. He compared user and group ids, as reported by ps, of JVMs started by our tool and those started manually and found no differences. Hmm.</p>
<p>When looking at the processes&#8217; <code>/proc/&lt;pid&gt;/cred</code> files, however, some differences were apparent. The <code>cred</code> file contains binary data and is best viewed with <code>od</code>:</p>
<p><code><br />
$ od -X /proc/$$/cred<br />
0000000 00002716 00002716 00002716 0000000a<br />
0000020 0000000a 0000000a 00000002 0000000a<br />
0000040 0000000e<br />
0000044<br />
</code></p>
<p>The file consists of a sequence of 32-bit id values in the following order:</p>
<p>* euid<br />
* uid (the real user id)<br />
* suid<br />
* egid<br />
* gid (the real group id)<br />
* sgid<br />
* number of supplemental group ids<br />
* supplemental group ids &#8230;</p>
<p>You can see how that maps to decimal ids by comparing with <code>id</code> output:</p>
<p><code><br />
$ id -a<br />
uid=10006(clay) gid=10(staff) groups=10(staff),14(sysadmin)<br />
</code></p>
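<p>As a rough sketch of that mapping, the hex words can be converted by hand or with a few lines of Ruby. The split below assumes three user ids, three group ids, then a count (the <code>00000002</code> in the dump) followed by that many supplemental group ids:</p>
<pre class="brush: ruby;">
# Hex words copied from the od output above.
words = %w[00002716 00002716 00002716 0000000a
           0000000a 0000000a 00000002 0000000a 0000000e]

values = words.map { |w| w.to_i(16) }   # hex string to integer

uids   = values[0, 3]       # =&gt; [10006, 10006, 10006]
gids   = values[3, 3]       # =&gt; [10, 10, 10]
count  = values[6]          # =&gt; 2
groups = values[7, count]   # =&gt; [10, 14]
</pre>
<p>The last two values, 10 and 14, line up with the <code>staff</code> and <code>sysadmin</code> groups reported by <code>id</code>.</p>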
<p>[Solaris geek aside: remember when you wanted to be a member of the sysadmin group so you could run the handy-dandy admintool?]</p>
<p>What we noticed was that while the manually started JVM and the JVM launched via our code deployment tool had identical uid/euid/suid and gid/egid/sgid values, they had different supplemental group id lists. Notably, the JVM running under the code deployment tool still had a gid of 0 in its supplemental group list. Letting our Java application servers traipse around the filesystem with elevated privileges is perhaps not the best &#8220;feature&#8221; we&#8217;ve ever implemented.</p>
<p>Trust but verify might be a good foreign policy, but our NFS server wasn&#8217;t having any of it. It thoroughly distrusted the Java app servers claiming to have elevated privileges, and rewarded them with EPERMs for their trouble. Root squash is, after all, a pretty common NFS security measure.</p>
<p>As it turns out, I had implemented a new feature in the code deployment agent to make it switch user id on startup. Previously we handled the user switch by launching the tool under <code>su</code>, but that approach prevented the tool from writing its pid file to the root-owned /var/run directory. The solution, I thought, was just to call <code>setgid()</code> followed by <code>setuid()</code>. We tested that code by verifying the user and group ids with <code>ps</code>, and it seemed to work just great.</p>
<p>Quick: what&#8217;s wrong with this?</p>
<pre class="brush: ruby;">
    def HostUtils.switch_user user
      pwent = Etc::getpwnam(user)
      Process::GID::change_privilege(pwent.gid)
      Process::UID::change_privilege(pwent.uid)
    end
</pre>
<p>Maybe several things, but certainly one thing is that I&#8217;ve completely neglected supplemental group ids. I should have written:</p>
<pre class="brush: ruby;">
    def HostUtils.switch_user user
      pwent = Etc::getpwnam(user)
      Process::initgroups(user, pwent.gid)
      Process::GID::change_privilege(pwent.gid)
      Process::UID::change_privilege(pwent.uid)
    end
</pre>
<p>That call to <a href="http://www.ruby-doc.org/core/classes/Process.html#M003208">Process::initgroups</a> makes all the difference. After making the change, the apps could access NFS and our test site looked all pretty again. Good thing we caught it when we did!</p>
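<p>Since <code>ps</code> did not reveal the problem, a check of the supplemental groups right after the switch might have caught it sooner. The following is only a sketch: <code>leftover_groups</code> is a hypothetical helper, not part of our tool, and it assumes the user&#8217;s group memberships are visible through <code>Etc.group</code>:</p>
<pre class="brush: ruby;">
require 'etc'

# Hypothetical helper: returns the gids in `current` that `user` is not
# entitled to, judging by the passwd and group databases.
def leftover_groups(user, current = Process.groups)
  pwent = Etc.getpwnam(user)
  allowed = [pwent.gid]
  # Walk every group entry and collect the ones that list the user as a member.
  Etc.group { |g| allowed.push(g.gid) if g.mem.include?(user) }
  current - allowed
end

# After switch_user, an empty list is what we would hope to see
# ('appuser' is a made-up account name):
#   leftover_groups('appuser')
</pre>
<p>In our case, a check like this would presumably have reported the stray gid 0 that root left behind.</p>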
<p>Turns out this is a fairly <a href="http://www.ruby-forum.com/topic/110492">common problem</a>, and I feel especially dumb for overlooking something so obvious. Live and learn.</p>
]]></content:encoded>
			<wfw:commentRss>http://daemons.net/~clay/2009/05/02/setuid-ate-my-css/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Engineering and Operations: Bridging the Divide</title>
		<link>http://daemons.net/~clay/2009/04/02/engineering-and-operations-bridging-the-divide/</link>
		<comments>http://daemons.net/~clay/2009/04/02/engineering-and-operations-bridging-the-divide/#comments</comments>
		<pubDate>Fri, 03 Apr 2009 03:22:56 +0000</pubDate>
		<dc:creator>clay</dc:creator>
				<category><![CDATA[Engineering]]></category>
		<category><![CDATA[Operations]]></category>
		<category><![CDATA[eng]]></category>
		<category><![CDATA[ops]]></category>
		<category><![CDATA[sre]]></category>

		<guid isPermaLink="false">http://daemons.net/~clay/?p=97</guid>
		<description><![CDATA[A recent post by the folks over at Agile Web Operations discusses some common sources of tension between engineering and operations organizations in web companies: a mutual lack of experience in each other&#8217;s domains, conflicting departmental goals, and an us–against–them mentality drawn from social identity theory. Continuing the conversation, I suggest there is a subtler but more fundamental source [...]]]></description>
			<content:encoded><![CDATA[<p>A <a href="http://www.agileweboperations.com/partitions-and-warfare/">recent post</a> by the folks over at <a href="http://www.agileweboperations.com/">Agile Web Operations</a> discusses some common sources of tension between engineering and operations organizations in web companies: a mutual lack of experience in each other&#8217;s domains, conflicting departmental goals, and an us–against–them mentality drawn from social identity theory. Continuing the conversation, I suggest there is a subtler but more fundamental source of tension between engineers and operators that has to do with their different mindsets: developers think in terms of <em>possibilities</em>, while administrators think in terms of <em>realities</em>.</p>
<p>Developers tend to downplay—perhaps unconsciously—the significance of bugs because they understand how to fix them: just make a one-line change over here and tweak a unit test over there and we&#8217;re done. If she has a good idea how to fix a bug, a developer may file it away in the &#8220;solved&#8221; folder in her brain before she&#8217;s actually implemented the fix. I&#8217;m not saying developers aren&#8217;t concerned with quality—they are—or that they don&#8217;t fix bugs—they do. But how many times have you spotted a bug and dutifully reported it only to have the developer reassuringly tell you that, &#8220;yes, it&#8217;s a known issue, we&#8217;ll fix it sooner or later—probably later&#8221;?</p>
<p>Systems administrators, on the other hand, face the stark binary reality that the software either works or it doesn&#8217;t. It survives unanticipated load or it doesn&#8217;t. The pager goes off or it doesn&#8217;t. No amount of reassurance that the bug can be fixed easily will appease an administrator—if it&#8217;s broken, it&#8217;s broken. And during the first few iterations of a new product, frequently the software is, in fact, broken. Over time, administrators become conditioned to believe the software will always be broken. It is not uncommon for administrators to express concern about bugs that were known to bring the site down in months past as if they might strike again the next time they are on-call, despite having been fixed months ago.</p>
<p>I point to the difference in mindsets not to disparage one group or the other—I wore a sysadmin hat long before I wore my developer hat—but to expose a fundamental flaw with organizational structures that divide all site development and maintenance functions into just these two separate–but–equal groups. Despite the benefits afforded by the separation of responsibility that you get with distinct engineering and operations groups, such a structure breeds an inefficiency that can threaten a company&#8217;s ability to scale.</p>
<p>How well does your operations team understand your software components and how they interact? How well does your engineering team understand how your systems are built, or how they&#8217;re connected? When engineering and operations don&#8217;t understand each other&#8217;s domains, the result is a release process that is at best inefficient, and at worst dangerously fragile.</p>
<p>For example, even though engineering may write detailed release notes describing new features, systems administrators often don&#8217;t speak the same language—release notes are practically useless to operations. As a result, valuable time is wasted translating release notes into a language that operations understands: listings of the commands needed to deploy the software. Conversely, developers may not understand infrastructure dependencies (operating system versions, libraries, NFS mount points, firewall rules), leading to confusion (and possibly outages) when code is deployed to machines where it has no chance of working.</p>
<p>In shops that split all work on the production site between the false dichotomy of engineering and operations roles, most software releases will require the two teams to work closely together, and so releases become a significant source of tension between the groups. If your systems administrators cringe whenever a release is coming up, you know you&#8217;ve got a problem. Releasing software is how your company grows, both by adding new features and by fixing bugs in the existing features. Yet if the administrators had it their way, there&#8217;d be no releases.</p>
<p>Just about the time I had started thinking that what is needed is a third team responsible solely for releases and other aspects of the production site, a friend and colleague forwarded along a <a href="http://research.google.com/archive/LinuxWorld-07-describeSRE.pdf">slide deck describing Google&#8217;s Site Reliability Engineering</a> organization. This team is responsible for one thing: the production web site. Engineering is free to develop features and operations is free to think strategically about systems, storage, and network. What makes the SRE team so interesting is that it is staffed with (junior) engineers, so it&#8217;s got an engineering mindset, but at the same time it&#8217;s charged with an operations objective: keeping the web site up.</p>
<p>Using Google&#8217;s Site Reliability Engineering concept to frame my own thoughts, I tend to think of SRE as an internal customer of both the engineering and operations teams. SRE expects engineering to deliver working software, and they will file and track bugs when that is not the case. SRE should also make an effort to <em>fix</em> the bugs they have filed—something not possible when operations files all the bugs against production. Conversely, SRE expects operations to deliver the server, storage, and network infrastructure required to meet the demands of the production site. SRE leads capacity planning efforts, placing orders with operations for server, storage, and network expansion. SRE also constantly monitors the production site and is responsible for installing and configuring the monitoring software.</p>
<p>With the addition of an SRE team, the division of responsibilities starts to look like this:</p>
<ul>
<li>Operations delivers infrastructure</li>
<li>Engineering delivers features</li>
<li>Site Reliability delivers uptime</li>
</ul>
<p>Despite the title, SRE should not report into the engineering organization. Rather, it should be its own, first-class, top-level organization, complete with executive representation at the VP level. I know what you&#8217;re saying: how much is it going to cost to staff yet another organization? Not as much as you think. Since SRE will offload releases from operations, it may be possible to scale back the operations team. And since SRE removes the inefficiencies involved in translating release notes to deployment plans, engineers will have more time to work on features.</p>
<p>Operations managers may balk at the idea of scaling back their teams, arguing that they&#8217;re already so busy that they can&#8217;t complete all the work on their plates with the team they have. But look at what is consuming most of the time. It&#8217;s probably deployments, especially if they occur anywhere near the <a href="http://en.oreilly.com/velocity2009/public/schedule/detail/7641">frequency of deployments at Flickr</a>. Operations teams are also burdened with production incident response, a responsibility that rightly belongs in the SRE organization. By handing both releases and first-response duties off to SRE, the operations team&#8217;s workload will fall and the team can be restructured, eliminating some middle-tier systems administrator positions while retaining mostly the strategic thinkers (operations architects) and data center support engineers.</p>
<p>If you&#8217;ve been thinking &#8220;AUTOMATION!&#8221; while reading this, I hear you. I wholeheartedly agree that automation, when carefully conceived and conscientiously deployed, can improve efficiency and ease the tensions stemming from a manual release process. But for all the advances in the current generation of automation tools, it may still be a while before automation tools can configure themselves. Until then, who should own the configuration? Engineering understands the intrinsic properties of the software (the proper sequence to start the various components, the proper settings for feature-related properties), but operations has the extrinsic knowledge necessary to make the site work (which databases are available, which load balancers to use, and so on). It might be possible to arrive at a working configuration by merging the two teams&#8217; knowledge, but I think it makes more sense if one group owns production and the associated automation configuration and workflows.</p>
<p>Ultimately, by freeing other teams to focus on their core competencies, Site Reliability Engineering can increase uptime and help the company scale, all while reducing tensions between engineering and operations. What more could you want from a three-letter acronym?</p>
]]></content:encoded>
			<wfw:commentRss>http://daemons.net/~clay/2009/04/02/engineering-and-operations-bridging-the-divide/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
