Unit Tests: The French Fries of Software Development


by David Snook

TLDWTRBTPIC (Too Long, Don’t Want to Read But the Picture is Cool)

I love unit tests as much as anyone, but we might be wasting our time by misusing them:

  • Good for design time, but a waste to run them all the time
  • Sheer volume plus coupling makes new development burdensome
  • Too simple to prove correctness
  • But still too complicated to get completely right by trial and error

A Tasty Analogy

I fell in love with unit tests way back at the beginning of my software development career, [mumble, mumble] years ago. I loved the speed, the simplicity of testing a single aspect at a time, the visceral satisfaction from seeing that pile of green check marks after every code change. Maybe you have a similar Pavlovian response when you see those green checkmarks fly by on your screen, like a cascade of warm, toasty french fries landing on your plate. Mmm, french fries!

But over time I have also acquired an appreciation for the subtle downsides to my favorite test snack. Whereas before I might have thought that more was always better, counting on my healthy development metabolism to deal with another plateful of little bite-sized tests, I now think of cutting back, or – gasp! – healthier alternatives.

Yes, They are Delicious

But to be fair, unit tests are well-loved for a reason. Multiple reasons, really:

  • Speedy (running quickly, maybe hundreds in a second)
  • Simple (isolating one aspect of logic/behavior at a time, so easier to think about)
  • Sensitive (failing immediately with any changes to expected behavior)
  • Selective (ideally failing for just one reason)

Even more than that, unit tests are valuable in multiple roles, at different stages of the coding process.

They are extremely useful during the initial development of a unit – essential, even, if you follow the TDD (Test Driven Development) approach. Start with a test that asserts some desired behavior (either positive, like doing something useful, or neutral, as in at least not crashing when given bad input), show that it fails, write some code to make it pass, then refactor the code a bit to make it feel better (to you). Red, green, refactor, repeat. It is both useful and addictive, once you get into that flow!
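The cycle above can be sketched in a few lines of Python (the `slugify` unit and its tests are hypothetical, and pytest-style test functions are just one common convention):

```python
# Hypothetical unit under test. In TDD you write the tests first,
# watch them fail (red), then write slugify to make them pass (green).
def slugify(title):
    # Simplest implementation that satisfies the tests; refactor later.
    return title.strip().lower().replace(" ", "-")

# pytest-style tests: run with `pytest this_file.py`
def test_spaces_become_hyphens():
    # Positive case: useful behavior for normal input.
    assert slugify("Hello World") == "hello-world"

def test_blank_input_does_not_crash():
    # Neutral case: at least don't blow up on degenerate input.
    assert slugify("   ") == ""
```

Each pass through the loop adds one more small assertion like these, and the refactor step reshapes `slugify` while the tests stay green.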

Unit tests are also useful at some point in the future when the unit is modified, perhaps to add new behavior. The existing unit tests can catch regressions in which the new code no longer meets the original expectations.

But what about in-between those two points in time? These tests probably run with every other code change, to every other unit. But they’re tiny, right? What’s the problem with one more french fry?

The Accumulated Empty Calories

The problem with unit tests, as with french fries, is their accumulated impact. They weigh you down. Everywhere you go, you are dragging around all those french fries.

When I think of all those unit tests running in the server farms, spinning their tiny legs on their tiny electronic equivalent of a treadmill, I think of all the waste heat that they generate – literal calories of heat – as they go nowhere. And all that wasted heat adds up, with billions of unit tests running millions of times. I don’t know how much heat that amounts to across the entire software development world (it would be an interesting calculation if anyone has some rough numbers), but I’m going to wave my hands here and say that it is probably A LOT. Not enough to boil an ocean, but an uncomfortably large amount of waste heat.

But what do we get for all that expenditure of energy, both for running the tests and then pumping away the heat? As a percentage, not much: 99.999% of the time these tests pass and tell us (almost) nothing. The best unit tests fail maybe twice in their entire lifetime: the first failure before the implementation even exists, and then maybe once more to catch a regression. That’s it. And only the failures are actionable, as they are the ones that tell us we need to fix something. The times that the tests pass are a signal that nothing needs to be done, that we can relax in our cocoons of accumulating adipose tissue.

The Psychological Weight

A more subtle downside to unit tests is that as they accumulate, they discourage making improvements.

Why? Because of the (usually) tight coupling between unit tests and their corresponding units, improving even a single unit can trigger a smattering of red test failures. And because of the moderate coupling between units as they work together (they do work together, don’t they?), changes to that first unit often mean changes to other units, which lead to changes to still other units, and so on – and that can lead to a flood of red test failures.
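Here is a hypothetical illustration of that tight coupling: a test that asserts how a unit does its work rather than what it produces, so even a behavior-preserving refactor turns it red.

```python
from unittest.mock import patch

# Hypothetical units: a report builder that calls a formatting helper.
def format_row(item):
    return f"{item['name']}: {item['qty']}"

def build_report(items):
    return "\n".join(format_row(i) for i in items)

# This test is coupled to the *implementation*, not the behavior:
# it asserts that build_report calls format_row once per item.
def test_build_report_calls_formatter():
    with patch(__name__ + ".format_row", return_value="x") as mock_fmt:
        build_report([{"name": "bolt", "qty": 3}])
        assert mock_fmt.call_count == 1

# If build_report is later refactored to inline the formatting
# (same output, different internals), this test goes red even
# though nothing observable has changed.
```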

Psychologically, the combination of seeing the before state as a sea of green goodness and the after state as a bleeding mess is an inducement to indolence. It is so much easier to just leave it green, even if the code is less good.

The Lack of Protein

And finally, I have come to the conclusion that unit tests are not sufficient for proving the correctness of a complex program.

To explain, let me use another analogy: an old-fashioned mechanical watch.

The watch is made of many tiny parts, like a spring, a knob for setting the time, and many gears. There are many other parts, but you get the idea.

Now, suppose we have a specification for each little part, like how many teeth a gear should have, the bevel on the edges, the weight, etc. Measuring each of those aspects of the part before we put it into the watch would be like running unit tests. If the measurements match the spec, we would say that the unit tests pass.
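In code, these per-part measurements look exactly like unit tests (the spec values here are invented for illustration, not real horology):

```python
# Hypothetical spec for one gear, with tolerances where they make sense.
GEAR_SPEC = {"teeth": 60, "bevel_deg": 45.0, "weight_g": 0.12}

def check_gear(gear):
    # The "unit tests" for one part: every measured aspect must match spec.
    return (gear["teeth"] == GEAR_SPEC["teeth"]
            and abs(gear["bevel_deg"] - GEAR_SPEC["bevel_deg"]) <= 0.5
            and abs(gear["weight_g"] - GEAR_SPEC["weight_g"]) <= 0.01)
```

A part that satisfies `check_gear` "passes its unit tests" in the sense used below.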

Now, suppose we have run all the tests on all the parts, all the tests pass, and we have fully assembled the watch. Does it work? Does it tell time, and is it accurate more often than twice a day?

Well, that depends. The specifications are derived from the theory of operation for a watch, and in theory the theory is perfect. In practice, the theory is always out of sync with reality and the degree of this mismatch will determine if your watch is a perfect chronometer or a pile of crap.

Complex behavior, the kind that makes software valuable, emerges from the interactions between simpler components. But showing that all the simple components are good does not show that the more complex emergent behavior of the system is good.
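A toy sketch of that gap in code (both functions and their implied specs are invented for illustration): each unit honors its own contract, yet the assembled behavior is wrong because the contracts silently disagree about sort order.

```python
import bisect

# Two hypothetical units, each fine in isolation.
def recent_first(timestamps):
    # Its unit test: output is sorted. (It is -- newest first.)
    return sorted(timestamps, reverse=True)

def count_before(sorted_ts, cutoff):
    # Its unit test: correct count for *ascending* input. (It is.)
    return bisect.bisect_left(sorted_ts, cutoff)

# Every unit test passes, but the *assembly* is wrong:
# count_before assumes ascending order, recent_first emits descending.
ts = recent_first([3, 1, 2])                   # [3, 2, 1]
broken = count_before(ts, 2)                   # 0, but the right answer is 1
correct = count_before(sorted([3, 1, 2]), 2)   # 1
```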

And knowing whether a unit is good is harder than it might seem at the outset. In the watch analogy, what if it turns out that friction between parts can inhibit proper operation, and we just never thought of a test for the level of polish at these friction points? We could just add another test, but then discover (through trial and error) that some types of metal don’t polish as smoothly as others, so some parts might need to be plated with another metal before polishing. Reality is remarkably complex.

Returning to the french fry analogy, unit tests don’t have the protein for building muscle, for making the body able to actually do predictably good stuff as a whole system. And that higher-level behavior is the actual value of software, so…unit tests are insufficient.

Can Still Haz Snack?

But I still love unit tests! They help me reason about code implementation in a tiny, manageable context. They are tangible evidence of positive progress. They feel good, dammit. Can’t I still use them?

Sure. But like french fries, we need to cut back on when and how often we use them.

As mentioned earlier, unit tests are great for TDD and are useful in catching regressions when code is modified later. Why not just cut out all the other runs in-between? Unit tests can just be run locally, when that unit of code is being developed or updated, but not for every other change to every other unit.

Hints of Healthier Alternatives

But if we cut these superfluous unit tests out of our repeated test runs, we’ll have hardly any green left. And the few things that remain, like functional tests and UI tests, are slow and prone to flakiness. What do we do instead?

Our natural human inclination is to swing the pendulum in the opposite direction, so perhaps towards functional integration tests, but somewhat famously those are described as a scam – that way lies combinatorial madness and spotty coverage. Or does it? That will be the subject of the next post, which I am tentatively calling “Integration Tests: The Spaghetti of Software Development”.

And we aren’t completely done with unit tests, either, as there are some alternative approaches that are more adaptive to code changes and possibly more thorough – perhaps the equivalent of yam fries? That will also be described in a future post.

But stepping back to look at the big picture, I think that what we need is an approach that tests software at the level at which the value is delivered, to show that it really is delivered, while also accounting for the complex interactions that can lead to the delivery of anti-value, which we call bugs.

Let’s see if we can reason our way to this better place.

Revision History

Date        Comment
2024-03-03  Initial version