setuid() ate my CSS
We ran into an interesting problem while testing a new version of our code deployment tool tonight. By all appearances, the tool was happily deploying code and launching our Java applications, but one of our QA engineers noticed missing CSS on some pages in our test environment. Could that possibly be related to the code deployment tool, which essentially just untars an archive and forks off a little Ruby script to start the application?
Tracing the application’s system calls with truss revealed that the process was getting EPERM errors while trying to read the CSS files, which live on NFS. One of our more clever engineers decided to start up the application manually, not via the code deployment tool, and found that the CSS loaded just fine when the Java process was invoked directly from the shell. He compared user and group ids, as reported by ps, of JVMs started by our tool and those started manually and found no differences. Hmm.
When looking at the processes’ /proc/<pid>/cred files, however, some differences were apparent. The cred file contains binary data and is best viewed with od:
$ od -X /proc/$$/cred
0000000 00002716 00002716 00002716 0000000a
0000020 0000000a 0000000a 00000002 0000000a
0000040 0000000e
0000044
The file consists of a sequence of 32-bit id values in the following order:
* euid
* uid
* suid
* egid
* gid
* sgid
* number of supplemental group ids
* supplemental group ids …
You can see how that maps to decimal ids by comparing with id output:
$ id -a
uid=10006(clay) gid=10(staff) groups=10(staff),14(sysadmin)
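If you'd rather not squint at hex, here is a minimal Ruby sketch that decodes the same words. It works on a hard-coded copy of the dump above rather than a live /proc file, so it runs anywhere. The helper name parse_cred is made up for illustration, and it assumes that the word just before the supplemental ids is a count (2 in the sample, followed by 10 and 14) and that the values are native-endian 32-bit integers.

```ruby
# The nine 32-bit words from the od output above.
SAMPLE_WORDS = [0x2716, 0x2716, 0x2716, 0xa, 0xa, 0xa, 0x2, 0xa, 0xe].freeze

# Hypothetical helper: split a cred blob into user ids, group ids and
# supplemental groups. Assumes word 6 is the supplemental group count.
def parse_cred(blob)
  words = blob.unpack('L*') # native-endian unsigned 32-bit values
  ngroups = words[6]
  {
    uids: words[0, 3],        # the three user ids
    gids: words[3, 3],        # the three group ids
    groups: words[7, ngroups] # supplemental group ids
  }
end

# Stand-in for reading the real file, e.g. File.binread("/proc/#{Process.pid}/cred")
blob = SAMPLE_WORDS.pack('L*')
cred = parse_cred(blob)

puts "user ids:            #{cred[:uids].join(', ')}"
puts "group ids:           #{cred[:gids].join(', ')}"
puts "supplemental groups: #{cred[:groups].join(', ')}"
```

For the sample data this yields 10006 for all three user ids, 10 for all three group ids, and supplemental groups 10 and 14, which lines up with the id output.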
[Solaris geek aside: remember when you wanted to be a member of the sysadmin group so you could run the handy-dandy admintool?]
So what we noticed was that while the manually started JVM and the JVM launched via our code deployment tool had identical uid/euid/suid and gid/egid/sgid values, they had different supplemental group id lists. Notably, the JVM running under the code deployment tool still had a gid of 0 in its supplemental group list. Letting our Java application servers traipse around the filesystem with elevated privileges is perhaps not the best “feature” we’ve ever implemented.
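Since ps doesn't show supplemental groups, a process can at least inspect its own list. Below is a rough sketch of such a check using Ruby's Process.groups. The helper leftover_root_group? is a hypothetical name, and treating gid 0 as the only group worth flagging is an assumption made for the example.

```ruby
# Hypothetical helper: true when a non-root process still carries gid 0
# in its supplemental group list.
def leftover_root_group?(euid, groups)
  euid != 0 && groups.include?(0)
end

# Inspect the current process (Unix-like systems only).
puts "euid=#{Process.euid} groups=#{Process.groups.inspect}"
if leftover_root_group?(Process.euid, Process.groups)
  warn 'gid 0 is still in the supplemental group list'
end
```

A check like this could run right after the user switch, before the tool forks off any application.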
Trust but verify might be a good foreign policy, but our NFS server wasn’t having any of it. It thoroughly distrusted the Java app servers claiming to have elevated privileges, and rewarded them with EPERMs for their trouble. Root squash is, after all, a pretty common NFS security measure.
As it turns out, I had implemented a new feature in the code deployment agent to make it switch user id on startup. Previously we handled the user switch by launching the tool under su, but that approach prevented the tool from writing its pid file to the root-owned /var/run directory. The solution, I thought, was just to call setgid() followed by setuid(). We tested that code by verifying the user and group ids with ps, and it seemed to work just great.
Quick: what’s wrong with this?
require 'etc'

def HostUtils.switch_user user
  pwent = Etc::getpwnam(user)
  Process::GID::change_privilege(pwent.gid)
  Process::UID::change_privilege(pwent.uid)
end
Maybe several things, but certainly one thing is that I’ve completely neglected supplemental group ids. I should have written:
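require 'etc'

def HostUtils.switch_user user
  pwent = Etc::getpwnam(user)
  # Reset the supplemental group list to the target user's groups.
  # This needs root, so it has to happen before the uid change.
  Process::initgroups(user, pwent.gid)
  Process::GID::change_privilege(pwent.gid)
  Process::UID::change_privilege(pwent.uid)
end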
That call to Process::initgroups makes all the difference. After making the change, the apps could access NFS and our test site looked all pretty again. Good thing we caught it when we did!
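A test that compared the process's actual supplemental groups against the group database would have caught this where ps could not. Here is a minimal sketch of that idea, not the check we actually ran. The helper names expected_groups and extra_groups are invented for illustration, and it assumes the Etc group iterator sees all of the user's memberships, which may not hold with every naming service.

```ruby
require 'etc'

# Hypothetical helper: gids the user should hold according to the
# passwd and group databases (primary gid plus listed memberships).
def expected_groups(user)
  gids = [Etc.getpwnam(user).gid]
  Etc.group { |g| gids << g.gid if g.mem.include?(user) }
  gids.uniq.sort
end

# Hypothetical helper: gids the process holds but should not.
def extra_groups(expected, actual)
  (actual - expected).sort
end

begin
  me = Etc.getpwuid(Process.uid).name
  extra = extra_groups(expected_groups(me), Process.groups)
  puts extra.empty? ? 'no unexpected groups' : "unexpected groups: #{extra.inspect}"
rescue ArgumentError
  # Etc raises ArgumentError when the uid or name has no passwd entry.
  puts 'current uid has no passwd entry; skipping check'
end
```

Run after switch_user, a leftover gid 0 would show up in the unexpected list.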
Turns out this is a fairly common problem, and I feel especially dumb for overlooking something so obvious. Live and learn.


