A short guide to structuring code to write better tests

Why write this?

Well-written tests often have a positive return on investment. This makes sense; bugs become more expensive to fix the later in the development process they are discovered. This is backed by research. This also matches my experience at Etsy, my current employer. Detecting a bug in our development environment is cheaper than detecting it in staging, which is cheaper than detecting it in production, which is cheaper than trying to divine what a forums post means when it says “THEY BROKE SEARCH AGAIN WHY CAN’T THEY JUST FIX SEARCH??,” which is cheaper than debugging a vague alert about async jobs failing.

Over my career I’ve rediscovered what many know: there are good tests and bad tests. Good tests are mostly invisible except when they catch regressions. Bad tests fail frequently, and their failures usually aren’t real regressions; more often, the test logic makes assumptions about the implementation logic and the two have drifted. These tests need endless tweaking to keep the test and implementation logic in sync.

So here’s a guide to help you write better tests by improving how your code is structured. It’s presented as a set of guidelines. They were developed over a few years when I was at Google. My team noticed that we had good tests and bad tests, and we invested time in digging up characteristics of each. I feel like they are applicable outside the original domain, since I have successfully used these techniques since then.

Some may point out that this post isn’t a “short guide” by many definitions. But I think it’s better than saying “Read this 350 page book on testing. Now that I have pointed you to a resource I will not comment further on the issue.”

Please ask me questions!

Get HYPE for a testing discussion!

“Testing” is a broad topic, so I want to explain the domain I have in mind. I’m targeting a database-driven website or API. I’m not thinking about countless other environments like microcontrollers or hard realtime robotics or batch data processing pipelines. The techniques in this post can often be applied outside the web domain, but not all of them work for all situations. You’re in the best position to decide what works for you.

For discussion, I will introduce an imaginary PHP testing framework for evil scientists looking to make city-wide assertions: “Citizens of New York”, or cony[0]. It will be invoked as follows:

$x = 3;
cony\BEHOLD::that($x)->equals(3);
cony\BEHOLD::that($x)->isNotNull();

Terminology

Everyone has their own testing terminology. That means this blog post is hopeless. People are going to skip this section and disagree with something that I didn’t say. This happened with my test readers even though the terminology section was already in place. But here goes!

Here are some definitions from Martin Fowler – Mocks Aren’t Stubs:

Fake objects actually have working implementations, but usually take some shortcut which makes them not suitable for production (an in memory database is a good example).

Mocks are […] objects pre-programmed with expectations which form a specification of the calls they are expected to receive.

Martin Fowler’s test object definitions

Here are a few more definitions that I will use:

Unit test: A test that verifies the return values, state transitions, and side effects of a single function or class. Assumed to be deterministic.

Integration test: A test that verifies the interaction between multiple components. May be fully deterministic or include non-deterministic elements. For instance, a test that executes a controller’s handler backed by a real database instance.

System test: A test that verifies a full system end-to-end without any knowledge of the code. Often contains nondeterministic elements like database connections and API requests. For instance, a Selenium test.

Real object: A function or class that you’d actually use in production.

Fragile test: A test whose assertion logic easily diverges from the implementation logic. Failures in fragile tests are often not due to regressions, but due to a logic divergence between the test and implementation.

A few more definitions I needed

This post mostly discusses using “real” vs “fake” vs “mocks.” When I say “fake” I will be interchanging a bunch of things that you can find defined in Martin Fowler’s article, like dummy, fake, stub, or a spy. This is because their implementations are often similar or identical despite being conceptually different. The differences matter in some contexts, but they don’t contribute much to this discussion.

Dependency injection is your best friend

Injecting a dependency means passing it in where it is needed rather than statically accessing or constructing it in place.

For instance:

// No dependency injection.
public static function isMobileRequest(): bool {
   $request = HttpRequest::getInstance();
   // OMITTED: calculate $is_mobile from $request's user agent
   return $is_mobile;
}

// With dependency injection.
public static function isMobileRequest(HttpRequest $request): bool {
   // OMITTED: calculate $is_mobile from $request's user agent
   return $is_mobile;
}

Dependency injection makes this easier to test for three reasons.

First, examine the static accessor for the HTTP request and imagine testing it. You’d need to build machinery into the singleton to set an instance for testing, or you’d need to mock out that call. The following test is much simpler:

public static function testIsMobileRequest(): void {
   $mobile_request = Testing_HttpRequest::newMobileRequest();
   $desktop_request = Testing_HttpRequest::newDesktopRequest();
   cony\BEHOLD::that(MyClass::isMobileRequest($mobile_request))->isTrue();
   cony\BEHOLD::that(MyClass::isMobileRequest($desktop_request))->isFalse();
}

Second, passing dependencies allows common utils to be written. There will be a one-time cost to implement newMobileRequest() and newDesktopRequest() if they don’t exist when you start writing your test. But other tests can use them once they exist. Writing utils pays off very quickly. Sometimes after only one or two usages.
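These helpers aren’t shown in this post, but a minimal sketch might look like the following. The idea of an HttpRequest constructor that takes a header map is an assumption for illustration; adapt it to whatever your request class actually accepts.

class Testing_HttpRequest {
    public static string $mobile_useragent =
        'Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) Safari/604.1';
    public static string $desktop_useragent =
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36';

    public static function newMobileRequest(): HttpRequest {
        // Assumes HttpRequest can be built from a header map (hypothetical).
        return new HttpRequest(['User-Agent' => self::$mobile_useragent]);
    }

    public static function newDesktopRequest(): HttpRequest {
        return new HttpRequest(['User-Agent' => self::$desktop_useragent]);
    }
}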

Third, dependency injection will pay off for isMobileRequest() as the program grows. Imagine that it’s nested a few levels deep: used by a configuration object that’s used by a model util that’s called by a view. Now you’re calling your view renderer and you see that it takes an HTTP request. This has two benefits. It exposes that the behavior of the view is parameterized by the HTTP request. It also lets you say, “that’s insane! I need to restructure this” and figure out a cleaner structure. This is a tradeoff; you need to manage some parameter cruft to get these benefits. But in my long experience with this approach, managing these parameters isn’t a problem even when the list grows really long. And the benefits are worth it.

Inject the smallest thing needed by your code

We can make isMobileRequest even more maintainable. Look at testIsMobileRequest again. To write a proper test function, an entire HttpRequest needs to be created twice. Imagine that it gains extra dependencies over time. A MobileDetector and a DesktopDetector and a VirtualHeadsetDetector and a StreamProcessor. And because other tests inject their own, the constructors use dependency injection.

public static function testIsMobileRequest(): void {
   $mobile_detector = new MobileDetector();
   $desktop_detector = new DesktopDetector();
   $vh_detector = new VirtualHeadsetDetector();
   $stream_processor = new StreamProcessor();
   $mobile_request = Testing_HttpRequest::newMobileRequest(
       $mobile_detector, $desktop_detector, $vh_detector, $stream_processor
   );
   $desktop_request = Testing_HttpRequest::newDesktopRequest(
       $mobile_detector, $desktop_detector, $vh_detector, $stream_processor
   );
   cony\BEHOLD::that(MyClass::isMobileRequest($mobile_request))->isTrue();
   cony\BEHOLD::that(MyClass::isMobileRequest($desktop_request))->isFalse();
}

It’s more code than before. That’s fine. This is what tests tend to look like when you have lots of dependency injection. But this test can be simpler. The implementation only needs the user agent in order to properly classify a request.

public static function isMobileRequest(string $user_agent): bool {
   // OMITTED: calculate $is_mobile from $user_agent
   return $is_mobile;
}

public static function testIsMobileRequest(): void {
   $mobile_ua = Testing_HttpRequest::$mobile_useragent;
   $desktop_ua = Testing_HttpRequest::$desktop_useragent;
   cony\BEHOLD::that(MyClass::isMobileRequest($mobile_ua))->isTrue();
   cony\BEHOLD::that(MyClass::isMobileRequest($desktop_ua))->isFalse();
}

We’ve made the code simpler by only passing in the limited dependency. The test is also more maintainable. Now isMobileRequest and testIsMobileRequest won’t need to be changed whenever changes are made to HttpRequest.

You should be aggressive about this. To test an object you need to instantiate the transitive closure of its dependencies, so keeping those dependencies narrow makes objects easier to instantiate in tests, which makes testing easier overall.

Write tests for failure cases

In my experience, failure cases are often neglected in tests. There’s a major temptation to check in a test when it first succeeds. There are often more ways for code to fail than to succeed. Failures can be nearly impossible to replicate manually, so it’s important to automatically verify failure cases in tests.

Understanding the failure cases for your systems is a major step towards resilience. Failure tests execute logic that could be the difference between partial degradation and a full outage: what happens when things go wrong? What happens when the connection to the database is down? What happens when you can’t read a file from disk? The tests will verify that your system behaves as expected when there is a partial outage, or that your users get the proper error messages, or whatever behaviors you need to ensure that the single failure doesn’t turn into a full-scale outage.
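As a concrete sketch of what this can look like, here’s a hypothetical config loader with an explicit failure path and a test that exercises it. The function name, path, and default value are all invented for illustration:

// Hypothetical implementation: fall back to a safe default when the
// config file can't be read, instead of taking the page down.
public static function loadFeatureConfig(string $path): array {
    $raw = @file_get_contents($path);
    if ($raw === false) {
        return ['enabled' => false];
    }
    $config = json_decode($raw, true);
    return is_array($config) ? $config : ['enabled' => false];
}

public static function testLoadFeatureConfigWithMissingFile(): void {
    $config = MyClass::loadFeatureConfig('/nonexistent/path/features.json');
    cony\BEHOLD::that($config['enabled'])->isFalse();
}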

This isn’t a magic wand. There will always be failures that you don’t think to test, and they will inevitably bring down your site. But you can minimize this risk by adding failure tests as you code.

Use real objects whenever possible

You often have several options for injecting dependencies into the implementation being tested. You could construct a real instance of the dependency. You could define an interface for the dependency and write a fake implementation. Or you could mock out the dependency.

When possible, prefer to use a real instance of the object rather than fakes or mocks. This should be done when the following circumstances are true:

  • Constructing the real object is not a burden. This becomes more likely when dependency injecting the smallest thing needed by the code
  • The resulting test is still deterministic
  • State transitions in the real object can be detected completely via the object’s API or the return value of the function

The real object is preferable to the fake because the test verifies the real interaction that your code and the dependency will have in production. You can verify the correct thing happened in a few different ways. Maybe you’re testing whether the return values change in response to the injected object. Or you can check that the function actually modifies the state of the dependency, like seeing that an in-memory key value store has been modified.

The real object is preferable to the mock because it doesn’t make assumptions about how the two objects interact. The exact API details of the interaction are less important than what it actually does to the dependency. Mocks often create fragile tests since they record everything that should be happening: which methods should be invoked, which parameters are passed, and so on.

Even worse, the test author dictates what the object’s return value is. It may not be a sane return value for those parameters when the test is written, and it may not remain true over time. It bakes extra assumptions into the test file that don’t need to be there. Imagine that you go through the trouble of mocking a single method 85 times, then implement a major change to the real method’s behavior that invalidates the mocked returns. Now you need to examine each of the 85 cases, decide how each mocked return should change, and adapt each test to match. Or you will fix the two that fail and hope the other 83 are still accurate just because they’re still passing. For my money, I’d rather just use the real object.

The key observation is that “how did something get changed?” matters way less than “what changed?” Your users don’t care which API puts a word into spellcheck. They just care that it persists between page reloads. A corollary is that if “how” matters quite a lot, then you should be using a mock or a spy or something similar.
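To make the spellcheck example concrete, here’s a sketch of a state-based assertion. Every name here is invented for illustration:

// Hypothetical in-memory dictionary and a function that adds to it.
$dictionary = new Testing_FakeSpellcheckDictionary();
MyClass::learnWord($dictionary, 'cromulent');

// Assert on *what* changed: the word is now in the dictionary.
cony\BEHOLD::that($dictionary->contains('cromulent'))->isTrue();

// A mock-based test would instead assert on *how* it changed, e.g. that
// addWord() was called exactly once with 'cromulent'. That expectation breaks
// if learnWord() is refactored to call addWords(['cromulent']) instead, even
// though nothing user-visible has changed.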

Combining this with the structuring rules above creates a relatively simple rule: Reduce necessary dependencies whenever possible, and prefer the real objects to mocks when you need complex dependencies.

A careful reader will note that using real objects turns unit tests into deterministic integration tests. That’s fine. Reducing the maintenance burden is more desirable than maintaining ideological purity. Plus you will be testing how your code actually runs in production. Note that this isn’t an argument against unit tests – all of the structuring techniques in this doc are designed to make it easier to write unit tests. This is just a tactical case where the best unit test turns out to be a deterministic integration test.

Another complaint I’ve heard about this approach is “but a single error in a common dependency could cause dozens of errors across all tests.” That’s actually good! You made dozens of integration errors and the test suite caught all of them. What a time to be alive. These are also easy to debug. You can choose from dozens of stack traces to help investigate what went wrong. In my experience, the fix is usually in the dependency’s file rather than needing to be fixed across tons of files.

Prefer fakes to mocks

A real object should not be used if you can’t verify what you need from its interface, or it’s frustrating to construct, or it is nondeterministic. At that point the techniques at your disposal are fake implementations and mock implementations. Prefer fake implementations over mock implementations when all else is equal. This reuses much of the same reasoning as the previous section.


Despite the name, a fake implementation is a trivial but real implementation of an interface. When your code interacts with the fake object, side effects and return values should follow the same contract as the real implementation. This is good. You are verifying that your code behaves correctly with a correct implementation of the interface. You can also add convenience setters or getters to your fake implementation that you might not ordinarily put on the interface.

Fakes also minimize the number of assumptions that a test makes about the implementation. You’re not specifying the exact calls that are going to be made, or the order that the same function returns different values, or the exact values of parameters. Instead you will be either checking that the return value of your function changes based on data in the fake, or you will be verifying that the state of the fake matches your expectations after test function execution.

Here’s an example implementation:

interface KeyValueStore {
    public function has(string $key): bool;
    public function get(string $key): string;
    public function set(string $key, string $value);
}

// Only used in production. Connects to a real Redis implementation.
// Includes error logging, StatsD, everything!
class RedisKeyValueStore implements KeyValueStore {
    // OMITTED: the real Redis-backed implementation.
}

class Testing_FakeKeyValueStore implements KeyValueStore {
    private array $data;

    public function __construct() {
        $this->data = [];
    }

    public function has(string $key): bool {
        return array_key_exists($key, $this->data);
    }

    public function get(string $key): string {
        if (!$this->has($key)) {
            throw new Exception("No key $key");
        }
        return $this->data[$key];
    }

    public function set(string $key, string $value) {
        $this->data[$key] = $value;
    }
}
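Here’s a sketch of how a function like needsToBeCached() might use this interface, and how the fake makes it easy to test. The caching rule itself is a made-up placeholder:

// Hypothetical function under test: only cache values we haven't stored yet.
public static function needsToBeCached(KeyValueStore $store, string $key): bool {
    return !$store->has($key);
}

public static function testNeedsToBeCached(): void {
    $store = new Testing_FakeKeyValueStore();
    $store->set('cached_key', 'some value');

    cony\BEHOLD::that(MyClass::needsToBeCached($store, 'cached_key'))->isFalse();
    cony\BEHOLD::that(MyClass::needsToBeCached($store, 'new_key'))->isTrue();
}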

Another benefit is that you now have a reusable test implementation of KeyValueStore that you can easily use anywhere. As you tweak the implementation of needsToBeCached() over time, you will only need to change the tests when its side effects or return value change. You will not need to update tests to keep mocks in sync with the exact logic used in the implementation.

There are many cases where this is a bad fit, and anything that sounds like a bad idea is probably a bad idea. Don’t fake a SQL database. If your code has an I/O boundary like network requests, you will basically have no choice but to mock that. You can always abstract it behind other layers, but at some point you will need to write a test for that final layer.
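One common way to handle that boundary is to put a thin interface around the network call, so everything above it can be tested with a fake and only the bottom layer needs a mock or a sparing system test. A sketch, with invented names:

interface GeocodingClient {
    public function lookup(string $address): ?array;
}

// Production implementation. This thin layer is the part you'd mock
// (or cover with an occasional system test against the real service).
class HttpGeocodingClient implements GeocodingClient {
    public function lookup(string $address): ?array {
        // OMITTED: HTTP request to the geocoding service.
        return null;
    }
}

// Trivial fake for everything above the boundary.
class Testing_FakeGeocodingClient implements GeocodingClient {
    private array $results;

    public function __construct(array $results = []) {
        $this->results = $results;
    }

    public function lookup(string $address): ?array {
        return $this->results[$address] ?? null;
    }
}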

Prefer writing a simple test with mocks to faking a ton of things or writing a massive integration test

I spend lots of time encouraging test authors to avoid mocks as a default testing strategy. I acknowledge that mocks exist for a reason. To borrow the XML adage, an automatic mocking framework is like violence: if it doesn’t solve your problem, you’re not using enough of it. A determined tester can mock as many things as it takes to isolate an effect in any code. My ideal testing strategy is more tactical and requires discipline. Imagine that you’re adding the first test for an ancient monolithic controller. You have roughly three options for writing the test: prepping a database to run against a fake request you construct, spending a ton of time refactoring dependencies, or mocking a couple of methods. You should probably do the last one out of pragmatism. Just writing a test at all will make the file more testable, since now the infrastructure exists.
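For example, one low-ceremony way to “mock a couple of methods” is a test subclass that overrides only the calls that reach out to infrastructure. Every name here is invented for illustration:

// Hypothetical legacy controller with a database call buried inside it.
class Testing_StubbedSearchController extends AncientSearchController {
    protected function fetchListingsFromDatabase(string $query): array {
        return [['id' => 1, 'title' => 'hand-carved viking ship']];
    }

    protected function logQueryToStatsd(string $query): void {
        // Intentionally a no-op in tests.
    }
}

public static function testSearchControllerRendersResults(): void {
    $controller = new Testing_StubbedSearchController();
    $html = $controller->handleRequest(Testing_HttpRequest::newDesktopRequest());
    cony\BEHOLD::that($html)->isNotNull();
}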

You can slowly make improvements as you continue to make edits. You can also slowly improve the code’s organization as you go. This will start to enable you to use techniques that lead to less fragile tests.

Always weigh the cost and benefit of the approaches you take. I’ve outlined several techniques above that I think lead to better tests, but they may not be immediately usable on your project. It takes time to reshape a codebase. As you apply these techniques you will discover what works best for your own projects, and you can keep improving things as you go.

System tests pay for themselves, but it’s hard to predict which ones are worth writing

At Google, my team had a long stretch where we wrote a system test for every regression. We were optimistic that they would become easier to write over time. Eventually the burden could not be ignored: they were flaky and we never ended up in our dream tooling state. So we phased out this strategy. But one day I was discussing an “incremental find” system test with a few teammates. We figured out that this single test saved us from regressing production an average of 4 times per person. Our bugs surfaced on our dev machines instead of later in our deployment process. This saved each of us lots of expensive debugging from user reports or monitoring graphs.

We couldn’t think of another system test that was nearly that valuable. It followed a Pareto distribution: most bugs were caught by a few tests. Many tests caught only a bug or two. Many other tests had similar characteristics (user-visible, simple functionality backed by lots of complex code, easy to make false assumptions about the spec), but only this one saved full eng-months.

So system tests aren’t magic, and my experience suggests we should only use them tactically. The critical paths of your customer flows are a good first-order guide to which system tests are worth writing. Consider adding new system tests as the definition of your critical path changes.

What’s next?

Write tests for your code! Tests are the best forcing function for properly structuring your code, and properly structured implementation code makes testing easier for everyone. As you come up with good generic techniques, share them with people on your team. When you write utilities that others will find useful, share those too.

Even though this guide is well north of 3000 words, it still only scratches the surface of the subject of structuring code and tests. Check out “Refactoring” by Martin Fowler if you’d like to read more on the subject of how to write code to be more testable.

I don’t recommend following me on Twitter unless you want to read a software engineer complain about how cold it is outside.

Thanks to everyone at Etsy who provided feedback on drafts of this, whether you agreed with everything or not!

Footnotes

[0] I’ve seen this joke before but I can’t figure out where. Please send me pointers to the source material!