The other day I found a webpage that I thought was worth sharing, which presents a set of practices you can follow to deliver defect-free software. I'm going to leave aside in-depth discussions of exactly what "defect-free" means right now, but as you may have guessed, I started researching the topic in the interests of improving the code-quality of my own team's product. I've also been thinking about the topic independently (indeed, that line of research is what led me to the page), so I'm writing this down to hopefully impart some new ideas, gather some new ones from you, and ultimately improve what everyone's doing. Basically what we're interested in is writing code that meets the standards of:
- Our penn-testers
- Ourselves
- Our QA department
More or less in that order. As an aside, our penn-testers have far and away the hardest standards to meet -- they do the things we don't think about, and they do the things we hope no-one thinks about. We've learned a couple of valuable lessons from them, but I'll get back to that later.
I found the page particularly curious because all the literature I've read thus far (and my own experience) has said, more or less, "software is really, really, really hard, so you'll get it wrong. Here's some strategies for mitigating the damage you'll cause"; the idea that you could skip that whole "make bugs, find bugs, fix bugs" step in the software development process is pretty appealing, nay, seductive. The "find bugs, fix bugs" part is the absolute bane of my existence as a program manager, because it makes every schedule I write meaningless and that's not good for anyone. I'm persuaded enough by it that I've put some of those ideas in place as policies inside my team at work to see how they go.
There are a few salient points that resonated somewhat with me, and I'd like to chat about them in some more depth.
Point 4: Design your code for Simplicity and Reliability.
That almost needs to be two bullet points. It will either sound obvious, or it will sound needlessly beurocratic, but designing your code at all makes a huge difference to the end product. For most problems you should use the format that is most convenient for you: I like to scribble boxes in my notebook, other people I've worked with go in for massive great specs on webpages that are very... long. Whatever suits you, just make sure you take the necessary time. I've worked with people who do this, and people who have enough natural genius to get away with working it out as they go along, but both types spend *less time* getting it right when they think about it a bit first.
The second virtual bullet point -- designing for simplicity -- well, that one's a trap. As software-engineers we delight in simplicity, but the problem is that we delight in complexity even more. I was discussing a piece of code today, which is tiny in comparison to the grand master-works my words to this point may have suggested, but I think it demonstrates the point nicely. We had:
- a fixed pool of buffers
- a widget (w), in need of a buffer
- a widget (v), whose buffer we want to steal to sate the need of (w)
So, to make it explicit, the reason we're stealing buffers from (v) (for 'victim') is that the pool was exhausted. Now, the really tricky part of this was when we made the realisation that you could end up in the situation where v == w. Those of you who write multi-threaded code for a living might be starting to see the problem.
The problem is this: to add the buffer to (w), you need to lock it to protect its internal consistency. To get the buffer from (v), you need to lock it to protect its internal consistency. If you build this naively, such that:
- you lock (w)
- you lock the pool
- if the pool has a buffer, claim it and add it to (w)
- if not, lock (v), evict a buffer and add that to (w)
then when you get to that last step, and when v == w, you're going to deadlock. Our first solution to this, well, it sucked. It went something like this:
int alreadyLocked = false;
lock(w)
if(!trylock(v))
{
if(v == w)
{
alreadyLocked = true;
}
else
{
lock(v)
}
}
// ...
if(!alreadyLocked)
{
unlock(v);
}
Now, points are due for realising that a deadlock situation potentially existed (and we picked that one before it bit us. Phew), but man, that's horrible. The hard call to make is that the alternative was to break it up into a bunch of non-overlapped critical sections, using our buffer pool as the mediator between our various bits of code, and that sounded like more CPU work.
The real killer here is that we'd engineered it the way we had because we were worried that we'd be adding CPU work where it wasn't strictly necessary and ultimately slowing our product down (and that's a big deal where I work). I didn't let it go in in the form above because I thought it was just too prone to bugs. The really cool thing is that our re-engineered version which does that extra work actually goes faster. Why is a whole new article.
I'm not against performance optimizations, by the way. Just don't do them prematurely; aim for simplicity first and foremost. If nothing else, it will make the optimization step simpler.
Point 5: Trace Every Line of Code When Written
The next point, number 5, tracing every line of code in a debugger as it is written, is an interesting one. I've never had the discipline to do this (and honestly, the debugger in the kernel absolutely sucks), but reading his justification, it makes a lot of sense. I've been looking into this some more and the solution I'm going to try is using UML inside TotalView (which is a kick-ass debugger, by the way). The problem I have right now is that all of our code needs specialised hardware, which means you can't use UML. However, I realised this morning that it's an opportunity to provide another tool for debugging, which is that every component you can switch out for an alternative implementation gives you a way to confirm the functionality of that component. Having a software implementation of our hardware will almost certainly help us find new problems, so that's what I plan to do.
Point 6: Review Code by Programmer Peers
Code review: definitely. In my group we had been doing this by emailing diffs, but I realised recently that actually there's no good reason to do it on email: we're all 5 metres apart, so I've instituted the policy that code reviews happen in person. I'll be interested to see how this one works out, because whilst I agree that the process of articulating the algorithm behind an implementation engages more of the brain, and thus presents another opportunity for finding bugs, I fear that bugs may be missed because the reviewer has been guided by the coder's walk-through.
This fear is not entirely irrational either: I'm strongly reminded of some code I wrote about a year ago which was highly venn-diagram like in nature. I knew it was hairy, so I dragged over no less than 3 engineers to scrutinise it independently. I wrote test-cases which enumerated what I thought (and these reviewers agreed) was every possible input and output. And we all got it wrong. I repeated the process and got it wrong again. I honestly don't know if that code is right now; I just know that I've stopped getting inexplicable results from it. What I've come to learn since then is that that process's failure probably did not stem from its existence, just its implementation, and the evidence has been that we've already avoided some bugs from this policy, so in my mind it's paid for itself already.
I think that really, a code review shouldn't be much different from tracing the code in question. Going forward I want to try doing our reviews in the debugger. I'll report back when I have a good system for that.
Automated testing is a little hard for us right now. Today I enlisted our penn-tester in helping us get an automated version of himself. It's a shot-gun approach, sure, but it'll run every day, and if it passes, that's a whole day's worth of traffic we've tested on that we wouldn't have had otherwise. The thing to realise here is that any automated testing is so, so much better than no automated testing. If you're stuck in a rut now, because you don't have such a system, my advice is to not put off putting something in place because you won't have time to make it perfect, get cracking on putting *anything* in place, TODAY. You can build on that, or replace it, it really doesn't matter. You have bugs TODAY, that's the problem you need to think about.
Additionally I realised again today that end-to-end testing is different from unit-testing. I'm not sure what direction I want to take on this, but LTP looks like an ok starting point. My other option is to provide a way to get our kernel code running in userspace, which has other desirable properties, so I'll probably try to do that if I can.
Annnyway, it's worth checking out; my problem is subtly different, but there's good ideas in there for everyone. To wrap up, a couple of pieces of advice from what we've found from the issues our penn-testers have found since we started this drive for "high-quality, correct code". This is pretty heavily geared toward kernel-mode networking code, but there's analogues in other disciplines as well, just look out for them.
Use statically allocated pools for everything that you possibly can: one of the killer issues with network programming is that your attacker can easily exhaust your supply of memory if you allow him or her to invoke your memory allocator unwatched. Adding a watcher is a solution to be avoided, because as I learned this week, there's a bunch of different ways running out of memory can affect you, and frankly, watchers are just another bit of code you can get wrong. Your best strategy is to use a pool of memory that you allocate at startup. That way you can be absolutely certain that you have enough memory for any situation that your attacker can throw at you. The big win on this one though, is that it makes catastrophic "I freed the pointer too early!" type bugs far less disastrous -- if you're looking at memory from your pool and do this, you'll get a wrong result. If you do it when the MMU is involved, your kernel will panic (arguably, also a wrong result).
Which brings me nicely to point number two: since you're now circumventing the MMU as a debugging tool, you'll miss a bunch of conditions in your testing that you would otherwise see in the form of kernel panics. The trouble with kernel panics is that you then have to do work to find out what went wrong. So my next suggestion is to use "poison values". The idea is that when you return items to your pool, you set the pointer to some value which you know to be invalid, but which is unique to that particular path through your code. Then, you hammer your box with whatever tool is most appropriate and wait for it to break -- the address where the bad dereference occurred should point you in the right direction to fix your bug. Wise to build macros so that this only happens in debug mode though -- no point inflicting 0xdeadbeef on your customers.
It almost goes without saying (but I'll say it anyway) that you should make your code re-entrant and threadsafe as a matter of practice. Even if you're the only caller, static variables will bite you at some point.
And *please*, use simple locking. Not so brain-dead that you cripple your software's performance, but when you've got more than one lock on a structure, think really hard about whether you're actually going to save on contention there. AVOID OVERLAPPING LOCKS. They're always harder than you think.
Make some ASSERT_* macros to test pre-conditions on your functions. These save time, and let you communicate your assumptions to future readers of your code. I'm generally against the use of the 'assert' macro provided by C, because it's just not that good, and it's too easy to leave in release builds. Roll something that makes sense for your own environment.
I also can't recommend enough running with PREEMPT, on an SMP machine, with every debug option your kernel provides enabled and using whatever tools you have at your disposal to actively seek out flaws. kmemleak is on my list to investigate. We used SGI's lockmeter last week, which allowed us to make about a 20% performance improvement to our codebase and showed up some bits of code that just weren't working the way we thought they were. Oprofile is similarly good for this.
Finally, I highly recommend tcpsic et al as tools to ensure your network code is resilient against all possible inputs. They won't generate all inputs of course (well, they will, but you don't have enough time), but they will generate the kind of inputs that will break your computer, so they'll help you identify a bunch of test-cases you should be interested in. This is one of the tools our automated-penn-tester system will use.
It's a pretty paltry list, huh? Maybe I should add "don't write kernel code" to the list. I'll keep you guys posted. This is not the end, this is only the beginning...
Ideas? Mail me at codelore@james.id.au, if you dare...