What I learned about starting a web company
- Don’t expect other people to be as enthusiastic about your project as you are, and don’t let their indifference infect you.
- Don’t hand the ball to non-founders and expect them to run with it. You’ll have to make everything happen yourself, even if you’re paying people.
- Don’t hire a graphic designer if what you want is an illustrator.
- Be very critical of your design—you have to live with it for a long time.
- Make sure you know your co-founders’ commitment levels up front, in terms of time and money.
- Starting a company is not the right time to learn new skills or technologies: pick people who know what they’re doing.
- Don’t go into business with someone you haven’t known for a while. It takes a while to learn someone’s strengths and weaknesses.
- Know the difference between a hobby and a business.
Reimagining social messaging
We’re building a social utility for tennis players, called TennisMatch. I say social utility and not social network because, while we have all the features you’d expect from a social network, our goal is to help tennis players make new connections and set up tennis matches—to be useful, and not just fun.
One of the most useful features we offer, not surprisingly, is messaging—the ability for one tennis player to contact another, whether to schedule a match, get directions to a court, find a tournament partner, or just to say hello. When we designed our messaging platform, our first instinct was to follow the de facto standard Inbox convention, found in nearly all email clients and replicated on social networks like Facebook and LinkedIn.
But the Inbox model leaves some things to be desired. You’d like to be able to group related messages into conversations, and not have to bother with cumbersome folder-oriented organizational schemes. Google delivered innovative solutions to these problems when they built Gmail: messages are threaded into conversations automatically, and Google-powered search makes it easy to find archived emails.
So we refined our design for TennisMatch messaging, adopting a conversation-based approach. Because it was simpler to implement at the time, we opted not to include a subject field in our database, and to only allow one conversation per pair of users. We planned to go back and add support for subjects and multiple conversations in subsequent iterations, but after looking at what we had built, we realized that our messaging feature worked very much like the iPhone’s Messages app, which has no subjects and only one conversation per contact. Since that design seems to work well on the iPhone, we decided to explore that paradigm a bit.
Text messages are limited to 160 characters. Adding a subject line to each message would eat into the available message size, but I think there’s another reason the telco engineers who designed SMS opted not to support subjects: they just slow you down. Typing on a mobile device is slow to begin with, and having to summarize what you’re about to say before you say it adds a cognitive burden that makes a message with a subject line take even longer to send. And, since text messages are short enough as it is, there’s really no reason to summarize them with a subject.
Thinking about how tennis players would use messaging on TennisMatch, we suspected that most messages would be short and conversational, much like text messages. “Do you want to play today?” and “Sure, I’m free after work. I’ll reserve us a court at Piedmont Park” seemed like common use cases for our players. We couldn’t think of many examples of where having a subject line would make the message more meaningful. And without subject lines, it doesn’t make much sense to have multiple conversations going.
Like many other web apps, we send an email notification when a user receives a new message. If the user has given us their mobile number, we’ll also send them a text. That can be great for keeping on top of your inbox, but if you get engaged in a back-and-forth conversation with someone, all those notifications can be really annoying. That’s a situation I’ve sometimes found myself in with Twitter direct messages—I want to get a text when someone first DMs me, but not after every subsequent DM.
The solution seems straightforward: if you’re not logged in to TennisMatch, we’ll send you a notification when someone sends you a message, but if you are logged in, we’ll just update the unread message count in the page header. As a convenience, if you happen to be on the conversations page when someone sends you a message, we’ll dynamically insert that message into the page. We also dynamically add any messages that you send.
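As a rough sketch of that routing rule in Python (the names Recipient and delivery_actions are made up for illustration and are not TennisMatch code), the decision might look like the following. It assumes the unread count is still updated when the recipient is on the conversations page, which the description above leaves open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recipient:
    """Hypothetical snapshot of the recipient's state when a message arrives."""
    logged_in: bool = False
    on_conversations_page: bool = False
    mobile_number: Optional[str] = None


def delivery_actions(recipient: Recipient) -> List[str]:
    """Decide how to surface one incoming message to its recipient."""
    if not recipient.logged_in:
        # Offline: notify by email, and by text if we have a mobile number.
        actions = ["email"]
        if recipient.mobile_number:
            actions.append("sms")
        return actions
    # Online: skip the notifications and update the header count instead.
    # (Assumption: the count is updated even on the conversations page.)
    actions = ["update_unread_count"]
    if recipient.on_conversations_page:
        # The conversation is open, so the message is inserted into the page.
        actions.append("insert_into_page")
    return actions


print(delivery_actions(Recipient()))
print(delivery_actions(Recipient(mobile_number="555-0100")))
print(delivery_actions(Recipient(logged_in=True, on_conversations_page=True)))
```

The point of keeping the rule in one function is that the back-and-forth case (both players logged in) never triggers an email or a text.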
With those changes, our messaging interface started to look and feel a lot like chat—it’s real-time, conversation-based, subject-less message exchange. But it’s not exactly like chat. We don’t show a buddy list. We show whether the person you’re chatting with is online, but we don’t show when they’re typing.
Probably the biggest difference between our messaging system and chat is that on TennisMatch, you use the same interface to send someone a message regardless of whether they’re online. You don’t have to burn the cognitive energy to figure out whether someone is online before sending them a message, like you do on other sites. If I want to send a message to a friend on Facebook, for instance, I’ll first check to see if they’re online in the chat interface, and fall back to the message interface if not.
So we’ve arrived at a messaging design that’s something of a hybrid between asynchronous, Inbox-style messaging and synchronous, real-time chat. When we launched our site, we were curious whether users would understand our hybrid messaging paradigm. Now that we’ve had a chance to collect some empirical data, it looks like most people do understand it well enough for it to be useful. The most common source of confusion for new users is that when they’re composing a message, pressing enter is all it takes to send the message. So we see some conversations that start with nothing but a “Hi,”. But after that first surprise, users seem to learn how the interface works.
I’m curious what user experience designers think of our hybrid messaging approach. Are we asking for more trouble than it’s worth by conflating two well-understood concepts—Inbox messages and chat—or could our hybrid messaging model prove to be a smash hit?
Jumping into the Atlanta start-up scene
There’s no better way to meet Atlanta entrepreneurs than to become one, and there’s no better place to do that than at Atlanta Start-Up Weekend. If you haven’t heard of that, let me break it down for you: 150 people pitch 50 ideas and then start 15 companies in 3 days. Now how long those companies last is anybody’s guess, but where else can you take a product from concept to launch in only three days? It’s exciting stuff, but it’s also a great way to meet other like-minded folk, and with that goal in mind I signed up for this year’s Start-Up Weekend.
First rule of Start-Up Weekend: it’s not a conference! It’s not like you go in and get lectured by CEOs older and richer than you. Start-Up Weekend is hands-on from day one, and rightly so, because you only have a couple of hours to pitch and vote on ideas before breaking into teams and building products and companies. In that way, Start-Up Weekend is *nothing* like Start-Up School—both were time well spent, but vastly different.
Here’s a very stream-of-consciousness play-by-play of how the weekend went for me:
Day One: The Pitch
Friday around 6:00 people started rolling in, and by 7:00 the room was at capacity and tropically hot. The first round of pitches lasted maybe an hour. One by one people stood and presented their ideas. Most of them rambled but there were some great presenters in the crowd. We raised our hands in support of each idea as it was presented. I pitched a crowd-sourced translations service for web apps; didn’t get many votes, which was sort of a bummer because I could really use that tool for my current project, TennisMatch.
After all ideas had been pitched, those with the most votes were selected for a second round of discussion with some time allotted for Q&A. Since my idea didn’t make it to the second round, I listened for a project that might have similarities with some other concepts I had been mulling over. My ears perked up when Jay Cuthrell pitched an idea for a location-aware mobile app that would show you what’s going on near you right now. Of all the product ideas I heard that night, Jay’s idea (originally called SPACE or PAGE or AGAPE or something similarly acronymious) had the right combination of real-time, location-aware, mobile, and fun that I was looking for.
Also in the second round, Mike Schinkel pitched an idea for an event aggregation service, something like an Eventful–MeetUp hybrid. His idea sounded similar enough to Jay’s that the crowd suggested the ideas be merged into a super event aggregation and discovery tool. So, we went with it.
With the pitching complete, teams started forming. I joined Jay and Mike; my role on the team was technical, as I had a couple of years of hands-on experience building web apps in Django. We were also joined by a graphic designer, a lawyer, a copywriter, a tester, a marketer, and some serial entrepreneurs that I’ll neglect to name here in the interest of time, with the exception of Justin Dawkins (no relation to Richard, by the way), whom I will mention because he’s awesome.
Our team convened in an ATDC conference room and began brainstorming about how the product should work. Justin and I seemed to have the same vision for a web app with a very clean UI that simply showed you what was happening near you right now, with kayak.com–style filters. Mike was less interested in the “now” component and more interested in the ability to discover MeetUp–style planned events, so we explored a concept where the central UI metaphor consisted of two prominent tabs: “Now” and “Later”. Not arriving at any consensus that evening, we decided to sleep on it and come back in the morning with ideas and name suggestions.
Day Two: The Rush
Still no consensus. After a few hours of trying to design a product that appealed to both Mike and Jay, we decided to rethink the merger that had brought the two ideas together, and instead approach it from the perspective of Mike’s concept as an API upon which Jay’s idea could be built. Once we recast the problem like that, it was simply a matter of splitting into teams and building two pretty straightforward products. Justin, Jay, and I branched off and started working on the mobile, location-aware, real-time event discovery tool, while Mike and the rest of the team started designing the ultimate event aggregation service, complete with handy API.
We decided an iPhone app was too ambitious for a weekend project, but that a mobile-friendly web site was doable. We were behind schedule, having only begun to design our product after lunch on Saturday. Fortunately, Jay is handy with Linux and got our Linode server instance up and running lickety-split, complete with Django and MySQL. And I was able to reuse the Django deployment process and workflow that I had designed for TennisMatch—more details to come in a future post. So we were up and running with the necessary infrastructure by that afternoon.
I’m not usually very creative when it comes to naming things, but inspiration hit as I was playing with combinations of the words “go” and “mobile”: Gomodo! It’s short, rolls off the tongue, sounds interesting, is unique and brandable, and conjures a logo idea featuring a Komodo dragon. So we went with it. Jay doodled a dragon and our mascot was born—fortunately, he chose an obnoxious shade of orange to replace the dragon’s original obnoxious shade of brown, which had something of a poop patina.
Gomodo.com was taken, but not in use—have I mentioned how much I hate domain squatters?—so we opted for something fun and short: gomodo.me. We figured that if we were wildly successful and made it into the vernacular as a verb (à la “just google it”), it’d be sort of catchy to say “gomodo me!”
In technical terms, the app is pretty simple: we have two models, Event and Venue, populated by an ingestor daemon that fetches event data from event aggregation sites like Mike’s EventTank and Eventful, and searched by a simple web front-end that figures out where you are using GeoIP or JavaScript location services. I worked on the app all night, more out of excitement than necessity, and had it mostly working by 6:00 AM.
Day Three: The Pitch, Part Deux
The first real test of Gomodo’s utility was when I pulled out my iPhone early Sunday morning, and it told me about a running group meeting for breakfast and a run at Piedmont Park at 7:00 AM—so it successfully got the time and location and showed me something relevant! Much like a director must feel when he attends his film’s premiere, there’s an inescapable feeling of pride and accomplishment when you’re first able to actually use the software that you’ve written. And if for nothing else, Start-Up Weekend was worth giving up my weekend for that reason alone.
The rest of the day we spent polishing details and working on the UI—and on our presentation. Justin took the lead on UI design and did a bang-up job making something simple and easy with the limited time and tools at our disposal. Jay took the lead on our presentation, and did a fantastic job of keeping it short and sweet.
Teams were feverishly racing to finish their products before presentations, and since ours was mostly done I took some time to stroll around and see what everyone was working on. Some folks had gotten wind of the fact that I worked at Twitter and that I knew Django, so I got to help a few teams with Twitter integration, Django arcana, and the vagaries of DNS.
As evening rolled around, all the teams reconvened and we launched into presentations. The energy in the room was electric, and it was really fun seeing what everyone had come up with. Almost all of the teams had working demos, and many had minimum viable products with which they could start attracting real customers. At least half of the companies had changed names, and a few had changed product ideas altogether.
Jay presented Gomodo—you can check out the slide deck here. I was pretty nervous because the presentation included a live demo—but not just Jay demonstrating the app on his phone. No, he gave everyone the URL and let them try it themselves, so we had about 50 visits to the site during the presentation. It was super exciting watching the logs and seeing the app serve up events just like it was supposed to!
The Aftermath
Something like 15 companies and alpha-quality products came out of Start-Up Weekend 3. Many of them are still alive. Gomodo hummed along with occasional care-and-feeding for a few months. Sadly, our main source of event data (Eventful) seems to have wised up and started prohibiting full event dumps, so Gomodo doesn’t return useful results any more. Neither Jay nor Justin nor I have had the time to invest in looking for other sources of event data because we’re all busy with our own start-ups and consulting gigs. I still think Gomodo is a great proof-of-concept for a real-time, location-aware, mobile event discovery app, and I’d love to pick it back up once things settle down with TennisMatch.
Thanks to Lance Weatherby and ATDC for organizing and hosting such a valuable event. It’s great to know that the Atlanta start-up scene is alive and well. With the connections I’ve made over the weekend, I have no doubt that Atlanta is the best place to bring TennisMatch to life.
[Ed note: this post is rather late. But hey, I'm starting a company, so the blog's pretty low priority.]
The various flavors of Ruby class attributes
Here’s a curious thing about Ruby: it’s got three flavors of class attributes. You can adorn your classes with class variables, class instance variables, and class constants. Not knowing the differences between them, and thinking that one of them might be useful for a project I was working on, I set out to figure out how they all worked, especially with respect to inheritance.
As a trivial example of the design scenario I was working with, consider the case of an object-oriented vegetable garden. Vegetables come in all shapes, sizes, and colors, but we might want to say that all vegetables should be green unless we’ve said otherwise. We might start modeling our vegetable garden with a Vegetable class, and we could set a color attribute on it with a default value of "green". Lettuce, which happens to be green, could inherit that attribute from Vegetable. Eggplant, however, should redefine color to be "purple".
While certainly a contrived and flawed example, it demonstrates the behavior I was looking for. Let’s see how Ruby’s various flavors of class attributes can help us solve this design problem — or not.
First up, class variables:
class Vegetable
  @@color = 'green'

  def color
    @@color
  end
end

class Eggplant < Vegetable
  @@color = 'purple'
end

Vegetable.new.color # => "purple"
Eggplant.new.color # => "purple"
I wasn’t expecting that! Apparently class variables are shared among subclasses, so you can’t redefine their value in subclasses without changing the value in the base class.
Next up, class instance variables:
class Vegetable
  @color = 'green'

  class << self
    attr_reader :color
  end

  def color
    self.class.color
  end
end

class Lettuce < Vegetable
  # no need to set @color here, since lettuce is green ... right?
end

Vegetable.new.color # => "green"
Lettuce.new.color # => nil
No love here, either: class instance variables belong to the class they’re set on and aren’t inherited by subclasses. Probably for the better, since the code needed to access class instance variables from instances is even uglier than that needed to access class variables from instances.
Class constants are right out:
class Vegetable
  Color = 'green'

  def color
    Color
  end
end

class Eggplant < Vegetable
  Color = 'purple'
end

Vegetable.new.color # => "green"
Eggplant.new.color # => "green"
Class constants are statically bound, so the polymorphic call to Vegetable#color from an Eggplant instance references the Color constant defined in Vegetable, not the one defined in Eggplant.
Giving up on the class attributes approach, I resorted to defining the attributes at the instance level. I considered explicitly setting a @color instance variable in the class initialize method, but then the attribute wouldn’t be constant. Instead, the simplest implementation that does what I want seems to be to use methods that return constant values:
class Vegetable
  def color
    'green'
  end
end

class Lettuce < Vegetable
end

class Eggplant < Vegetable
  def color
    'purple'
  end
end

Vegetable.new.color # => "green"
Lettuce.new.color # => "green"
Eggplant.new.color # => "purple"
So as it turns out, each of Ruby’s class attribute mechanisms behaves differently in subclasses. I’m sure class variables, class instance variables, and class constants have their utility, but none of them is useful for defining constant attributes that are shared by all instances of a class yet can be redefined in subclasses.
Simulating synchronous programming with Python generators
Robey’s recent article on naggati reminded me of something I’d been idly pondering for a while. Having recently written an SSH-based host discovery scanner on top of the Twisted asynchronous programming library, I too yearned for a way to write sequences of commands in plain-old imperative code, hiding the callback complexities of event-driven code from users.
Continuations fit the bill nicely. These are functions from which you can return multiple times, resuming right where you left off. With continuations, you could write a sequence of functions that might make asynchronous calls, but the framework would call your continuation back where it left off.
Python does not have first-class continuations, but it does have generators, and these behave almost identically (for my purposes, at least). A generator is a function that can yield multiple values. Well, actually, it returns an iterator, which then can be used to fetch multiple values from the generator. An example will probably make it clear:
>>> def finite_generator():
...     yield 'apple'
...     yield 'orange'
...     yield 'pear'
...
>>> iterator = finite_generator()
>>> for fruit in iterator:
...     print fruit
...
apple
orange
pear
Generators can also run forever:
>>> def infinite_generator():
...     i = 0
...     while True:
...         yield i
...         i += 1
...
>>> iterator = infinite_generator()
>>> for i in iterator:
...     print i
...
0
1
2
3
4
5
... and on and on forever
I had been using iterators in my asynchronous host scanner whenever I needed to run asynchronous commands within a loop. The asynchronous programming model prevents you from writing something like:
for foo in bar:
    async_method(foo)
Instead, you would do something like this:
def callback(response, iterator):
    do_something_with_response(response)
    schedule_next_task(iterator)

def schedule_next_task(iterator):
    try:
        foo = iterator.next()
        deferred = async_method(foo)
        deferred.addCallback(callback, iterator)
    except StopIteration:
        pass

iterator = iter(bar)
schedule_next_task(iterator)
It works like this:
- We get an iterator for our list, bar — this could just as well be a generator function
- We fetch the first value from the iterator and pass it to the asynchronous method
- That method presumably makes some type of I/O request, and responds immediately with a Deferred instance
- We add a callback function to the Deferred and request that our iterator instance be passed to it when it is called
- Control returns to the event loop, which might be busy scheduling other I/O requests
- When the I/O completes, the event loop calls our callback function with the response and our iterator instance
- The callback processes the response, and then goes back to step 2, fetching the next item from the iterator
- When the iterator is exhausted, the cycle stops
It occurred to me that I might be able to extend this concept to use generators as a sort of continuation to emulate synchronous code. What if, instead of returning strings or numbers from a generator, you returned functions? Some wrapper code could initialize the iterator, and then loop over it using the technique above, calling each function returned from the generator.
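Stripped of Twisted and SSH, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: FakeDeferred and FakeLoop are made-up stand-ins for a Deferred and an event loop, drive and sequence are hypothetical names, and the values are placeholders. It is written for Python 3, so it calls next(iterator) rather than iterator.next().

```python
from collections import deque


class FakeDeferred:
    """Tiny stand-in for a Deferred (hypothetical, for illustration only)."""

    def __init__(self):
        self._callbacks = []

    def add_callback(self, fn):
        self._callbacks.append(fn)

    def fire(self, result):
        for fn in self._callbacks:
            fn(result)


class FakeLoop:
    """Stand-in event loop: a queue of pending 'I/O completions'."""

    def __init__(self):
        self.pending = deque()

    def async_call(self, result):
        # Returns immediately; the result is delivered later by run().
        deferred = FakeDeferred()
        self.pending.append((deferred, result))
        return deferred

    def run(self):
        while self.pending:
            deferred, result = self.pending.popleft()
            deferred.fire(result)


def drive(generator_fn, context):
    """Call each function the generator yields, resuming it from the callback."""
    iterator = generator_fn(context)

    def step():
        try:
            function = next(iterator)
        except StopIteration:
            context["done"] = True
            return
        function().add_callback(on_result)

    def on_result(response):
        context.update(response)
        step()

    step()


loop = FakeLoop()


def sequence(context):
    # Reads like synchronous code, but async functions are yielded, not called.
    yield lambda: loop.async_call({"os": "SunOS"})
    if context["os"] == "SunOS":
        yield lambda: loop.async_call({"zonename": "global"})
    yield lambda: loop.async_call({"hostid": "0ec2daa6"})


context = {}
drive(sequence, context)
loop.run()
print(context)
```

Note that the generator can branch on results stored by earlier callbacks, because it is only resumed after the previous response has been merged into the context.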
Tonight I decided to give this a try. Forking off an experimental branch and making a few modifications to the underlying fido host discovery routines, I crafted the following plesiochronous host scanner:
#!/usr/bin/env python
#
# Use a generator to simulate synchronous execution on an asynchronous framework
#

from fido.common.command import RemoteCommandExecutor
from fido.common.host.unix import UnixHost
from fido.common.ssh import SSHCredentials
from contrib.host.software.sun.host import SolarisHost
from contrib.host.software.linux.host import LinuxHost
from twisted.internet import reactor
import pprint


class PlesiochronousHostScanner(object):
    """
    Scans a host over SSH, building a list of host attributes. Built on the Twisted asynchronous
    library, but uses a Python generator function to emulate garden variety synchronous code.
    """

    def __init__(self, address, credentials):
        """
        address: the IP address to scan
        credentials: a hash like: { 'username': '...' , 'password': '...', 'public_key': '<optional>' }
        """
        self.address = address
        self.credentials = credentials
        self.host = UnixHost(RemoteCommandExecutor(address, credentials))
        self.pp = pprint.PrettyPrinter()

        # create some scratch space for the discovery methods
        self.context = { }

        # get an iterator from the generator
        self.iterator = self.scanning_sequence()

    def scanning_sequence(self):
        """
        A typical nugget of synchronous code, with one important exception: asynchronous
        functions must be yielded instead of being called directly.
        """
        yield self.host.uname
        os = self.context['uname'].split()[0]
        if os == 'SunOS':
            self.host = SolarisHost.from_host(self.host)
            yield self.host.zonename
            yield self.host.zones
        elif os == 'Linux':
            self.host = LinuxHost.from_host(self.host)
        else:
            print "Unable to scan host type: %s" % os
            return
        yield self.host.hostid
        yield self.host.device
        yield self.host.bios
        yield self.host.installed_memory_in_MB
        yield self.host.interfaces

    def callback(self, response):
        self.context.update(response)
        self.schedule_next_task()

    def errback(self, error):
        print "scanning error: %s" % error

    def schedule_next_task(self):
        try:
            function = self.iterator.next()
            deferred = function()
            deferred.addCallbacks(self.callback, self.errback)
        except StopIteration:
            self.scan_complete()

    def start_scan(self):
        self.schedule_next_task()

    def scan_complete(self):
        print "Scan of %s is complete" % self.address
        self.pp.pprint(self.context)
        # In this contrived example, we'll stop the reactor when we've finished scanning a host
        reactor.stop()


if __name__ == '__main__':
    import sys
    from optparse import OptionParser

    parser = OptionParser()
    parser.add_option("-u", "--username", dest="username")
    parser.add_option("-p", "--password", dest="password")
    (options, args) = parser.parse_args()

    address = args.pop(0)
    credentials = iter([SSHCredentials(options.username, options.password, None)])
    scanner = PlesiochronousHostScanner(address, credentials)
    reactor.callWhenRunning(scanner.start_scan)
    reactor.run()
It works:
satellite:~ clay$ python pleisio.py -u username -p password 10.20.30.40
Scan of 10.20.30.40 is complete
{'bios': {'bios_date': '11/15/2007',
          'bios_vendor': 'Sun Microsystems',
          'bios_version': 'S39_3B25'},
 'device': {'system_product': 'Sun Fire X2200 M2',
            'system_serial': '0805QAT0EA',
            'system_uuid': 'bd6529dc-fc79-0010-9e1b-001b245c1d4f',
            'system_vendor': 'Sun Microsystems',
            'system_version': 'Rev 50'},
 'hostid': '0ec2daa6',
 'installed_memory_in_MB': 32768,
 'interfaces': {'bge0': {'ipv4_addresses': [10.20.30.40],
                         'ipv6_addresses': [],
                         'mac_address': 00:1B:24:5C:18:B5,
                         'zone': None},
                'lo0': {'ipv4_addresses': [],
                        'ipv6_addresses': [],
                        'mac_address': None,
                        'zone': None}},
 'uname': 'SunOS myhost.mydomain.com 5.10 Generic_127112-11 i86pc i386 i86pc',
 'zonename': 'global',
 'zones': {'myzone': {'brand': 'native',
                      'ip_mode': 'shared',
                      'root': '/zones/myzone',
                      'state': 'running',
                      'uuid': '09fbf9ba-c0c5-408f-c9e9-820471983f25',
                      'zonename': 'myzone'}}}
The beauty of this approach is that the PlesiochronousHostScanner#scanning_sequence method is pretty straightforward, and could actually be written by end users familiar with Python but not familiar with asynchronous programming. It also makes discovery logic much easier to understand than in the state-machine-based asynchronous discovery engine I had previously built.
Having just concocted this tonight, I’m not sure whether this is something I’ll pursue, but it has been a fun experiment. I’m curious what other asynchronous programmers think of this approach.
Ruby, why do you torment me?
I want to like Ruby, I really do. The language is expressive, powerful, and eminently readable. Moreover, it’s fun to write. But try as I might to be productive, I keep running into quirks and gotchas with Ruby libraries that make me wish I were using a language with a more mature standard library. Things that take five minutes in Perl or Python have taken me all day to get working in Ruby.
SOAP support, which ought to be fully baked in Ruby by now, is still somewhat painful to work with. In Perl, SOAP just works. When I wrote our release orchestration tool a year ago, it took way longer than it should have to get Ruby talking to the SOAP iControl interface on our BigIP load balancers. By contrast, it took all of five minutes to get the Perl sample working — and that includes time spent installing the SOAP::Lite CPAN module.
Using Rails for the first time in a recent project, I was immediately struck by how little work is required to get a web app off the ground. I almost felt guilty for writing so little code. But a lot of the clever Rails magic that’s supposed to make life easier didn’t. While error messages like “Expected foo.rb to define Foo” seem pretty straightforward, they are maddening when foo.rb does indeed define Foo. For their next trick, the Rails developers ought to use their meta-programming fu to produce intelligible error messages!
We recently ported a Rails app to JRuby, and straight away we ran into bugs. JRuby couldn’t call Java correctly, and it had a file descriptor leak in Net::SSH that caused the site crawler component of our application to go belly-up after a few hours. And we should have known better than to try talking to Oracle from JRuby on Rails. The activerecord-jdbc-adapter component had myriad issues — goofy things like "uninitialized constant ActiveRecord::VERSION", improper column name quoting, and incorrect integer datatype coercions. Finally we gave up and ported the database to MySQL.
I understand that Ruby and its libraries are open-source efforts written mostly by unpaid enthusiasts, so I try not to get too upset when things don’t work correctly. I wish I had the time to jump in and submit patches to fix issues when I run into them.
setuid() ate my CSS
We ran into an interesting problem while testing a new version of our code deployment tool tonight. By all appearances, the tool was happily deploying code and launching our Java applications, but one of our QA engineers noticed missing CSS on some pages in our test environment. Could that possibly be related to the code deployment tool, which essentially just untars an archive and forks off a little Ruby script to start the application?
Tracing the application’s system calls with truss revealed that the process was getting EPERM errors while trying to read the CSS files, which live on NFS. One of our more clever engineers decided to start up the application manually, not via the code deployment tool, and found that the CSS loaded just fine when the Java process was invoked directly from the shell. He compared user and group ids, as reported by ps, of JVMs started by our tool and those started manually and found no differences. Hmm.
When looking at the processes’ /proc/<pid>/cred files, however, some differences were apparent. The cred file contains binary data and is best viewed with od:
$ od -X /proc/$$/cred
0000000 00002716 00002716 00002716 0000000a
0000020 0000000a 0000000a 00000002 0000000a
0000040 0000000e
0000044
The file consists of a sequence of 32-bit id values in the following order:
* uid
* euid
* suid
* gid
* egid
* sgid
* number of supplemental group ids (the 00000002 in the output above)
* supplemental group ids …
You can see how that maps to decimal ids by comparing with id output:
$ id -a
uid=10006(clay) gid=10(staff) groups=10(staff),14(sysadmin)
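To make the mapping concrete, here is a small Python sketch that decodes the od output shown above. It follows the field order listed earlier and assumes, based on this sample alone, that the word after sgid (00000002) is a count of the supplemental group ids that follow. The name parse_od_cred is a made-up helper, not part of any tool mentioned here.

```python
OD_OUTPUT = """\
0000000 00002716 00002716 00002716 0000000a
0000020 0000000a 0000000a 00000002 0000000a
0000040 0000000e
0000044
"""

# Field order as listed in the article.
FIELDS = ("uid", "euid", "suid", "gid", "egid", "sgid")


def parse_od_cred(od_text):
    """Turn `od -X` output for a cred file into a dict of decimal ids."""
    words = []
    for line in od_text.splitlines():
        parts = line.split()
        # parts[0] is the byte offset; the rest are 32-bit words in hex.
        words.extend(int(part, 16) for part in parts[1:])
    cred = dict(zip(FIELDS, words))
    # Assumption: the next word is the number of supplemental groups.
    ngroups = words[len(FIELDS)]
    start = len(FIELDS) + 1
    cred["groups"] = words[start:start + ngroups]
    return cred


cred = parse_od_cred(OD_OUTPUT)
print(cred["uid"], cred["gid"], cred["groups"])  # prints: 10006 10 [10, 14]
```

The decoded values line up with the id output: 0x2716 is 10006, 0xa is 10, and 0xe is 14.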
[Solaris geek aside: remember when you wanted to be a member of the sysadmin group so you could run the handy-dandy admintool?]
So what we noticed was that while the manually started JVM and the JVM launched via our code deployment tool had identical uid/euid/suid and gid/egid/sgid values, they had different supplemental group id lists. Notably, the JVM running under the code deployment tool still had a gid of 0 in its supplemental group list. Letting our Java application servers traipse around the filesystem with elevated privileges is perhaps not the best “feature” we’ve ever implemented.
Trust but verify might be a good foreign policy, but our NFS server wasn’t having any of it. It thoroughly distrusted the Java app servers claiming to have elevated privileges, and rewarded them with EPERMs for their trouble. Root squash is, after all, a pretty common NFS security measure.
As it turns out, I had implemented a new feature in the code deployment agent to make it switch user id on startup. Previously we handled the user switch by launching the tool under su, but that approach prevented the tool from writing its pid file to the root-owned /var/run directory. The solution, I thought, was just to call setgid() followed by setuid(). We tested that code by verifying the user and group ids with ps, and it seemed to work just great.
Quick: what’s wrong with this?
def HostUtils.switch_user user
  pwent = Etc::getpwnam(user)
  Process::GID::change_privilege(pwent.gid)
  Process::UID::change_privilege(pwent.uid)
end
Maybe several things, but certainly one thing is that I’ve completely neglected supplemental group ids. I should have written:
require 'etc'

def HostUtils.switch_user user
  pwent = Etc::getpwnam(user)
  # Reset the supplemental group list first, while we are still root
  Process::initgroups(user, pwent.gid)
  Process::GID::change_privilege(pwent.gid)
  Process::UID::change_privilege(pwent.uid)
end
That call to Process::initgroups makes all the difference. After making the change, the apps could access NFS and our test site looked all pretty again. Good thing we caught it when we did!
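Since ps never showed the problem, a check of the supplemental group list itself is what would have caught it. Here is a rough sketch of such a check. The helper names expected_gids and leftover_gids are hypothetical, and the sketch assumes Process.groups reports the supplemental group list of the current process and that group membership can be read through Etc.

```ruby
require 'etc'

# Hypothetical helper: the gids a user should hold, i.e. the primary gid
# from the passwd entry plus every group that lists the user as a member.
def expected_gids(user)
  primary = Etc.getpwnam(user).gid
  member_of = []
  Etc.group { |g| member_of << g.gid if g.mem.include?(user) }
  ([primary] + member_of).uniq.sort
end

# Hypothetical helper: supplemental gids the process holds that the target
# user is not entitled to (for example, gid 0 inherited from root).
def leftover_gids(actual, expected)
  (actual - expected).sort
end

# After HostUtils.switch_user(user), a test could require that
#   leftover_gids(Process.groups, expected_gids(user))
# is empty. The call below uses made-up lists to show the idea.
puts leftover_gids([0, 10, 14], [10, 14]).inspect
```

The comparison is kept separate from the system lookups so it can be exercised without root privileges.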
Turns out this is a fairly common problem, and I feel especially dumb for overlooking something so obvious. Live and learn.
Engineering and Operations: Bridging the Divide
A recent post by the folks over at Agile Web Operations discusses some common sources of tension between engineering and operations organizations in web companies: a mutual lack of experience in each other’s domains, conflicting departmental goals, and an us-against-them mentality drawn from social identity theory. Continuing the conversation, I suggest there is a subtler but more fundamental source of tension between engineers and operators that has to do with their different mindsets: developers think in terms of possibilities, while administrators think in terms of realities.
Developers tend to downplay—perhaps unconsciously—the significance of bugs because they understand how to fix them: just make a one-line change over here and tweak a unit test over there and we’re done. If she has a good idea how to fix a bug, a developer may file it away in the “solved” folder in her brain before she’s actually implemented the fix. I’m not saying developers aren’t concerned with quality—they are—or that they don’t fix bugs—they do. But how many times have you spotted a bug and dutifully reported it only to have the developer reassuringly tell you that, “yes, it’s a known issue, we’ll fix it sooner or later—probably later”?
Systems administrators, on the other hand, face the stark binary reality that the software either works or it doesn’t. It survives unanticipated load or it doesn’t. The pager goes off or it doesn’t. No amount of reassurance that the bug can be fixed easily will appease an administrator—if it’s broken, it’s broken. And during the first few iterations of a new product, frequently the software is, in fact, broken. Over time, administrators become conditioned to believe the software will always be broken. It is not uncommon for administrators to express concern about bugs that were known to bring the site down in months past as if they might strike again the next time they are on-call, despite having been fixed months ago.
I point to the difference in mindsets not to disparage one group or the other—I wore a sysadmin hat long before I wore my developer hat—but to expose a fundamental flaw with organizational structures that divide all site development and maintenance functions into just these two separate–but–equal groups. Despite the benefits afforded by the separation of responsibility that you get with distinct engineering and operations groups, such a structure breeds an inefficiency that can threaten a company’s ability to scale.
How well does your operations team understand your software components and how they interact? How well does your engineering team understand how your systems are built, or how they’re connected? When engineering and operations don’t understand each other’s domains, the result is a release process that is at best inefficient, and at worst dangerously fragile.
For example, even though engineering may write detailed release notes describing new features, systems administrators often don’t speak the same language—release notes are practically useless to operations. As a result, valuable time is wasted translating release notes into a language that operations understands: listings of the commands needed to deploy the software. Conversely, developers may not understand infrastructure dependencies (operating system versions, libraries, NFS mount points, firewall rules), leading to confusion (and possibly outages) when code is deployed to machines where it has no chance of working.
In shops that split all work on the production site between just two roles, engineering and operations (a false dichotomy), most software releases require the two teams to work closely together, and so releases become a significant source of tension between the groups. If your systems administrators cringe whenever a release is coming up, you know you’ve got a problem. Releasing software is how your company grows, both by adding new features and by fixing bugs in the existing features. Yet if the administrators had their way, there’d be no releases.
Just about the time I had started thinking that what is needed is a third team responsible solely for releases and other aspects of the production site, a friend and colleague forwarded along a slide deck describing Google’s Site Reliability Engineering organization. This team is responsible for one thing: the production web site. Engineering is free to develop features and operations is free to think strategically about systems, storage, and network. What makes the SRE team so interesting is that it is staffed with (junior) engineers, so it’s got an engineering mindset, but at the same time it’s charged with an operations objective: keeping the web site up.
Using Google’s Site Reliability Engineering concept to frame my own thoughts, I tend to think of SRE as an internal customer of both the engineering and operations teams. SRE expects engineering to deliver working software, and they will file and track bugs when that is not the case. SRE should also make an effort to fix the bugs they have filed—something not possible when operations files all the bugs against production. Conversely, SRE expects operations to deliver the server, storage, and network infrastructure required to meet the demands of the production site. SRE leads capacity planning efforts, placing orders with operations for server, storage, and network expansion. SRE also constantly monitors the production site and is responsible for installing and configuring the monitoring software.
With the addition of an SRE team, the division of responsibilities starts to look like this:
- Operations delivers infrastructure
- Engineering delivers features
- Site Reliability delivers uptime
Despite the title, SRE should not report into the engineering organization. Rather, it should be its own, first-class, top-level organization, complete with executive representation at the VP level. I know what you’re saying: how much is it going to cost to staff yet another organization? Not as much as you think. Since SRE will offload releases from operations, it may be possible to scale back the operations team. And since SRE removes the inefficiencies involved in translating release notes to deployment plans, engineers will have more time to work on features.
Operations managers may balk at the idea of scaling back their teams, arguing that they’re already so busy that they can’t complete all the work on their plates with the team they have. But look at what is consuming most of the time. It’s probably deployments, especially if they occur anywhere near the frequency of deployments at Flickr. Operations teams are also burdened with production incident response, a responsibility that rightly belongs in the SRE organization. By handing both releases and first-response duties off to SRE, the operations team workload will fall and the team can be restructured, eliminating some middle-tier systems administrator positions while retaining mostly the strategic thinkers (operations architects) and data center support engineers.
If you’ve been thinking “AUTOMATION!” while reading this, I hear you. I wholeheartedly agree that automation, when carefully conceived and conscientiously deployed, can improve efficiencies and ease the tensions stemming from a manual release process. But for all the advances in the current generation of automation tools, it may still be a while before automation tools can configure themselves. Until then, who should own the configuration? Engineering understands the intrinsic properties of the software (the proper sequence to start the various components, the proper settings for feature-related properties), but operations has the extrinsic knowledge necessary to make the site work (which databases are available, which load balancers to use, etc.). It might be possible to arrive at a working configuration by merging the two teams’ knowledge, but I think it makes more sense if one group owns production and the associated automation configuration and workflows.
Ultimately, by freeing other teams to focus on their core competencies, Site Reliability Engineering can increase uptime and help the company scale, all while reducing tensions among engineering and operations—what more can you want from a three-letter acronym?
Dual-booting Windows XP and Mac OS X on Intel Macs
I had hoped to use this blog entry to post step-by-step instructions for installing Windows XP on shiny new MacIntels, but alas, it appears that someone has beaten me to it. The winners, narf2006 and blanka, have been working on the problem for quite a while and have been posting pictures of their progress over the past few weeks. Today they uploaded a video showing a fresh install of XP on an iMac and they submitted their solution to sud0n1m for testing. Assuming the testing goes well, they will be declared the winners and will share the $13k prize.
While I’m disappointed not to have won, I’m encouraged to see that our approaches were remarkably similar. We both wrote custom EFI CSM drivers to emulate the BIOS functions Windows requires to boot. I’m very curious how they managed to get VGA working, and I won’t be surprised if it doesn’t work in either the Mini or the MacBook Pro, as it looks like they did all their development on an iMac.
If nothing else, this was a tremendous learning experience for me, and the timing couldn’t have been better. I have recently become interested in Intel assembly and protected mode programming, topics I considered too challenging years ago when I was doing DOS programming, but concepts that make much more sense to me now. I had randomly dusted off some old assembly language references from my bookshelf and read some chapters on protected mode programming a few weeks prior to beginning work on this project, so I was able to grasp what needed to be done to provide a working solution almost immediately.
With the deadline fast approaching and narf’s Flickr images haunting me, I coded quickly and didn’t spend much time making the code pretty or maintainable, but I’m still proud of what I wrote. It’s fairly succinct, but does quite a bit. Anyone who’s interested can peruse the code here.
The main function is in OSXP.c, which contains code for reading the GPT partition table that Mac OS X uses, writing an MBR partition table that Windows can use, and loading a bootloader from an El Torito bootable CD-ROM.
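For readers who have never looked inside an MBR, here is a small Ruby sketch of the on-disk layout being written. It is not the code from OSXP.c. The helper names mbr_entry and mbr_sector are invented for the example, the partition values are made up, and the CHS bytes are filled with a commonly used placeholder on the assumption that the reader of the table relies on the LBA fields.

```ruby
# Build one 16-byte MBR partition entry: status, CHS start, type, CHS end,
# then little-endian 32-bit starting LBA and sector count.
# The CHS bytes are a placeholder (an assumption for this sketch).
def mbr_entry(type:, lba_start:, sectors:, active: false)
  status = active ? 0x80 : 0x00
  chs = [0xFE, 0xFF, 0xFF]
  ([status] + chs + [type] + chs + [lba_start, sectors]).pack('C8V2')
end

# Assemble a 512-byte sector: 446 bytes of (empty) boot code, a 64-byte
# table of up to four entries, then the 0x55 0xAA boot signature.
def mbr_sector(entries)
  raise ArgumentError, 'an MBR holds at most four entries' if entries.size > 4
  table = entries.join.b.ljust(64, "\x00".b)
  ("\x00".b * 446) + table + [0x55, 0xAA].pack('C2')
end

# Example values only: one active partition of type 0x07 starting at LBA 63.
entry = mbr_entry(type: 0x07, lba_start: 63, sectors: 1_000_000, active: true)
sector = mbr_sector([entry])
puts sector.bytesize
```

A real conversion would fill each entry from the corresponding GPT entry's starting and ending LBA. That step is left out here.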
Code to switch from protected mode to real mode (called a thunk) is in thunk.c and asmthunk.s. It’s not very general, but it’s the first protected mode assembly code I’ve ever written, and, surprisingly enough, it works.
Code to set up a real-mode interrupt vector table and the real-mode interrupt service routine is in rmisr.s. For the most part, this duplicates the thunk code, but in reverse order: it switches from real mode to protected mode and then back again. This reverse thunk is necessary to emulate BIOS functions using the native EFI functions (read disk sector, print character, etc.).
The protected-mode interrupt service routine, which does the actual BIOS emulation, is in pmisr.c. It reads and writes a saved register context on the stack that the real-mode code inherits upon return from the interrupt service routine.
Just writing those thunk and interrupt service routines is probably about 50% of a complete solution, and I’m very happy with how they came out. The first time I thunked into the MBR code, it worked better than I expected, and actually identified the active partition, loaded the boot sector from it, and jumped to it. Booting into the CD bootloader worked also, though it hung right after probing memory.
I’m pretty amazed that my code works, but to toot my own horn a little more, I’m pretty happy with some of the debug techniques I came up with along the way. Running EFI applications in the pre-boot environment leaves a lot to be desired. You can’t exactly fire up gdb and throw a bunch of watchpoints on your code to find out what’s going wrong (though I did toy with the idea of compiling GDB for EFI). And even if I had a debugger at my disposal, it’s hard to debug a protected mode/real mode transition.
At first, my debugging consisted of writing copious amounts of debug output to the console and waiting for my tester, Chris, to run the code and take a picture of the result. At that time I didn’t have an Intel Mac to play with, so needless to say, progress was slow. We did get fairly far with this method, though. I wrote all of the partition table (GPT and MBR) code without ever having seen my code run with my own eyes. Chris’ pictures showed me what I needed to know, then I’d make a change, recompile, and Chris would download the new file and reboot. Again and again. Thanks, Chris.
I was unsure of how to access the CD-ROM under EFI, because it didn’t show up when I listed all the block I/O devices in Chris’ MacBook Pro. Ryan was nice enough to lend me his shiny new Mac Mini, and I was pleased to find that the CD-ROM device showed up once there was a disc in the drive. I was even more pleased to see that my El Torito code worked almost flawlessly from the beginning.
Then came the hard part, thunking into real mode. I wrote code and pored over it for hours making sure it looked right, but when I ran it, the computer spontaneously rebooted. Unsure which instructions were causing the reboots, I added an infinite loop to a section of the code, recompiled, and ran it. The machine hung. That validated that all of the code above the loop was not causing the reboot (at least not directly), so I moved the loop down a few instructions and tried again. Using this technique I was eventually able to find all of the bugs in my code, which were all stupid syntax problems and not logic problems (there’s a big difference between $0x10 and 0x10 in AT&T assembly: the first is an immediate value, the second a memory address).
Once the thunk and interrupt handler code was working, I started looking into why NTLDR was hanging after probing memory. NTLDR is 233 KB and its disassembly is 97k lines long. I knew roughly where the hang occurred, based on the output I had from the last BIOS interrupt it invoked, but I wanted to narrow it down to a specific routine.
It occurred to me that I could just write my own debugger of sorts. By handling interrupt 3, my code would get control anytime the NTLDR code stumbled onto an INT3 instruction. So, using the disassembly listing as a guide, I made a list of instructions that I thought might be interesting stopping points, and wrote a routine to replace the first byte of each of those instructions with 0xCC (the INT3 opcode). Then I wrote an INT3 handler that restored the original instruction and decremented the return address by one so the original code would be run upon return from the interrupt service routine. And, to my surprise, it worked!
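The patch-and-restore dance is easier to follow in a toy model than in real-mode assembly. The Ruby sketch below is only a simulation of the idea, not the actual handler: memory is an array of one-byte opcodes, the run method stands in for the CPU, and the names Breakpoints, set, handle, and run are all invented for the example.

```ruby
INT3 = 0xCC # one-byte x86 breakpoint opcode
NOP  = 0x90
HLT  = 0xF4

# Toy model: every instruction is one byte, so "decrement the return
# address by one" lands exactly on the patched instruction.
class Breakpoints
  def initialize(memory)
    @memory = memory
    @saved = {}
  end

  # Remember the original byte and overwrite it with INT3.
  def set(addr)
    @saved[addr] = @memory[addr]
    @memory[addr] = INT3
  end

  # INT3 handler: return_addr points just past the 0xCC byte. Restore the
  # original opcode and hand back the address to resume at.
  def handle(return_addr)
    addr = return_addr - 1
    @memory[addr] = @saved.delete(addr)
    addr
  end
end

# Minimal fetch/execute loop standing in for the CPU.
def run(memory, breakpoints)
  hits = []
  ip = 0
  loop do
    opcode = memory[ip]
    ip += 1
    case opcode
    when INT3
      hits << ip - 1                 # record where the breakpoint fired
      ip = breakpoints.handle(ip)    # resume at the restored instruction
    when HLT
      break
    end
  end
  hits
end

memory = [NOP, NOP, NOP, HLT]
bps = Breakpoints.new(memory)
bps.set(1)
bps.set(3)
hits = run(memory, bps)
puts hits.inspect
```

Because the handler removes the patch when it fires, each breakpoint in this model triggers only once.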
Earlier today I extended this a bit by automatically enabling trap mode in the INT3 handler so I could repatch the breakpoint instruction with 0xCC right after executing the original code. This change allowed a breakpoint to show up each time through a loop or each time a particular function was called. Then I went a step further and added a breakpoint option that would leave trap mode enabled, so I could get a trace of every instruction executed between two points in the code. This would prove useful for figuring out which branches the program took as it made various tests and decisions.
The one thing that continues to elude me is how to enable VGA text mode. The standard graphics and text framebuffers (0xA0000 and 0xB8000) aren’t even mapped in memory, and reading from the VGA registers appears to return garbage. I suspect if I knew more about PCI programming, I’d be able to map the framebuffer memory and configure the I/O ports, but I’m at a loss for how to do that. In the interim, I’ve been patching NTLDR at load time so that it writes to my own text framebuffer, which I then scan on every interrupt in order to paint a portion of the emulated text-mode screen. This is slow and hackish, and I know it’s possible to enable true VGA mode (narf and blanka did), but I don’t know where to begin.
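To give a feel for what scanning a text framebuffer involves, here is a small Ruby sketch. It assumes the substitute buffer uses the same cell layout as the standard VGA text buffer, a character byte followed by an attribute byte. The helper name text_rows and the tiny four-column screen are made up for the example, and this is not the scanning code from the project.

```ruby
# Decode a VGA-style text buffer: each cell is a character byte followed by
# an attribute (color) byte, laid out row by row.
def text_rows(buffer, cols: 80)
  chars = buffer.bytes.each_slice(2).map { |ch, _attr| ch.chr }
  chars.each_slice(cols).map { |row| row.join.rstrip }
end

# Two 4-column rows, "Hi" and "OK", each cell carrying attribute 0x07.
cells = "Hi  OK  ".bytes.flat_map { |b| [b, 0x07] }.pack('C*')
rows = text_rows(cells, cols: 4)
puts rows.inspect
```

A real screen would use 80 columns, and the painting side would also need to honor the attribute byte rather than discard it.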
All in all I’m pretty happy with the code I wrote, even though I didn’t win the contest. I’m looking forward to the next big challenge and an opportunity to use some of the techniques I learned on this project.
Getting past ptrace()
The holidays have given me a chance to relax and geek around with Mac OS X, and I’ve finally gotten around to installing the Developer Tools package, which includes the GNU C compiler (gcc) and the GNU debugger (gdb). Over the years I’ve gotten pretty comfortable using gdb to troubleshoot programs on SPARC and Intel platforms. Debugging requires that you know a bit about assembly language, and I had learned x86 assembly back in the day when I was coding fun little graphics toys for DOS, and had learned some SPARC assembly trying to, uhm, correct an annoying license issue in a piece of commercial software.
During the holiday break I figured I’d learn some PowerPC (PPC) assembly while I still had the chance, given Apple’s decision to move to x86 early next year. Debugging simple programs isn’t much fun, though, so I figured I’d start poking around with a big application. An annoying thing kept happening every time I fired up the app under the debugger, though; it exited immediately with a strange error code:
% gdb /Applications/blah.app/Contents/MacOS/blah
(gdb) run
Program exited with code 055.
Bummer. I remembered the same problem happening with that commercial Solaris app years before, but I never paid much attention to it back then, because it was possible to work around the problem by attaching to the program after it was already up and running. Apple seems to be a bit smarter than that, though, because whenever I attached to a running copy of the application, GDB seg faulted:
(gdb) attach 17813
Attaching to program: `/Applications/blah.app/Contents/MacOS/blah', process 17813.
Segmentation fault
Since I wanted to know why the application was exiting, I figured I’d step through it one instruction at a time until I found the culprit.


