Testing, or not #

Software testing is one of those things nobody can quite agree on, from what to call the different scopes of tests (what is a "unit" anyways?) to whether to do it in the first place. Consensus has largely formed that writing and maintaining tests is valuable for most applications, but there are still plenty of differing opinions on the details. As I've gotten further into my career, my focus and opinions on what and how to test have changed. This post is a brief summary of my approach and methodology to software testing. It is not intended to be prescriptive of what others should do: I likely don't work on the same kinds of projects as you do, and my methods may not translate to every environment. This gets to the core of my methodology:

It's a tool, not a ceremony, but ceremony is ok #

Software development is full of ceremony. Some of it might even be valuable, especially for junior engineers who haven't had the time to develop a sense for what is appropriate and when. Testing is a very good example of this. When I was a wee junior engineer, I tested nearly everything: every function, every branch, and oftentimes even larger, scarier "integration tests". I didn't know what was worth testing, and my lack of experience meant I was both more likely to make dumb mistakes (I still am, of course) and unsure why tests were valuable in the first place. I was just told that they were, and it made logical sense after all. Tests ensure code is correct! ... right?

Ceremony is the practice of the naive, but not without value.

I shackled myself to a veneer of correctness for fear that I might blunder #

After writing thousands and thousands of lines of tests I got burned one too many times.

One Friday evening I was 7 hours deep into a high-priority fix; production had hit a snag on some poorly formatted data and the team who primarily used the application was waiting on a fix. Around the time I normally call it quits for the night, I had the core issue fixed, but there was one problem: over 80 tests needed changes before the CI pipeline would pass and the change could be reviewed and approved. These tests were not providing any value; in fact, they were providing negative value. They were a barrier to progress. Most of the issues were small things like adjusting a type name or function signature, not massive overhauls to what the tests were validating, though some required more invasive changes. Ultimately, many of those tests were thrown out: not that night, but later on, after a few more cases like this.

The problem with an overly comprehensive test suite, beyond a host of other issues like CI pipeline times asymptotically approaching infinity, is that much of it doesn't test anything important. It's not that those tests weren't testing something, or that they were necessarily incorrect, but the logic they covered was not critical to the behavior of the system, and they increased drag on further changes to those modules.

Tests are not a good fix for a poorly designed system.

Sometimes, there are better tools for the job, such as static type systems and a little defensive programming.
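
To make that concrete, here is a minimal sketch in Python (with hypothetical names, not code from the incident above) of what I mean: one validated type plus a single defensive check at the boundary, which a type checker like mypy can then enforce everywhere, replacing a pile of tests whose only job is to prove that bad inputs get rejected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderId:
    """A validated wrapper type: this constructor is the only place that
    needs to know what a well-formed order id looks like."""

    value: str

    def __post_init__(self) -> None:
        # One defensive check at the boundary; everything downstream
        # accepts an OrderId and can simply trust it.
        if not self.value.startswith("ord_"):
            raise ValueError(f"malformed order id: {self.value!r}")


def cancel_order(order_id: OrderId) -> None:
    # A type checker (e.g. mypy) rejects cancel_order("12345") before
    # the code ever runs, so no test is needed here to prove that raw
    # strings are handled.
    ...
```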

Who's going to test the tests? #

Years later I was working on a large Python code base. It was well tested, enough that we had already beaten our CI pipeline's execution time back down to under 15 minutes for the second time. Python, being a much more... dynamic language, meant we had to rely heavily on testing to validate that code would even run. Ultimately, this worked well enough, but the maintenance burden was intense. If there is any single lesson I learned from this, it's that dynamic languages are the wrong choice for most large, complex systems. The other lesson is that complicated tests are more likely to not be testing what they claim to, or to not be validating anything at all.

Around this time I was fixing a failing test for an internal command-line utility. The utility was not simple, but it had a larger test suite than any command-line utility I have ever seen. The tests I was fixing used pytest's output capture to check that certain information was displayed to the user under a failure condition. This seemed unusual to me at the time, since the primary consumer of that information was a human actively looking at the utility's output, and doubly so because it was effectively testing that print worked properly. After digging further into the test, I discovered that it didn't really validate anything at all: the failure condition it simulated was implemented improperly. It used a different exception type than the real failure, and it was triggered at a different point higher in the call stack, where it was caught by a broad except block that would have made the condition impossible in the first place. But the test was complicated, so it mostly evaded scrutiny until I was left confused as to why it still passed despite substantial changes to what it claimed to be checking. It was deleted shortly after.
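
The shape of the problem looked roughly like this; the names and structure below are a hypothetical reconstruction, not the actual internal utility:

```python
def sync_records(fetch):
    try:
        rows = fetch()
    except ConnectionError:
        # The branch the test below *claims* to exercise.
        print("could not reach the server, retry later")
        return 0
    except Exception:
        # Broad fallback: the simulated failure lands here instead, so
        # the ConnectionError branch is never actually reached.
        print("sync failed, see logs")
        return 0
    return len(rows)


def test_sync_reports_failure(capsys):
    def broken_fetch():
        raise RuntimeError("boom")  # wrong exception type for the claim

    sync_records(broken_fetch)
    out = capsys.readouterr().out
    # Passes, but it mostly proves that print() works; the failure
    # condition the test was written to cover is never triggered.
    assert "failed" in out
```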

Tests should be simpler than the code they validate.

I was wrong then, and I'm wrong now #

Requirements change. Some other team discovers a constraint that was missed, some assumption turns out to be incorrect, or the business simply has different priorities and needs now. Whatever the cause, software is powerful because it is flexible. The first section of this post went into what can happen in an overly tested application, but here's another reason: some code simply needs to remain flexible. Tests are the opposite of flexibility; they are an assertion that something should remain static, that behavior should not change. That is often appropriate. Tests themselves can be changed, but too many of them and "agility" starts to suffer. As with many things, this is a balance, and it takes years of built-up intuition to know where that balance sits.

Test coverage should be inversely proportional to the current and future rate of change, and proportional to your confidence in the requirements and the criticality of the code.

What really matters is what others see #

Coding is inherently abstraction: making decisions so a future user can perform some task with minimal steps and decision making, usually via programming language constructs such as functions, types, interfaces, and too many others to list. Whatever your abstraction tool of choice, a well-designed abstraction makes testing it obvious and easy. "Test the API" is a common mantra, and it's true. At a minimum, the behavior a user relies on is the most important thing to get right, and it's the hardest thing to change after the fact (especially for library authors).

Testing your interfaces is easy, but building good interfaces to test is hard.

Every function is an interface. It has semantics, inputs, outputs, and may even throw exceptions. Not every function need be tested, though. Some are not meant to be used outside of their immediate context. It's sometimes worth writing tests for these, but typically only when they're particularly critical or tricky.
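
As a rough sketch of what that means in practice (hypothetical names, nothing from the codebases above): exercise the public function's contract and leave the internal helper alone.

```python
import re
import unicodedata


def _strip_accents(text: str) -> str:
    # Internal helper; free to change or disappear in a refactor.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def slugify(title: str) -> str:
    """The public contract: lowercase, ASCII, hyphen-separated."""
    cleaned = _strip_accents(title).lower()
    return re.sub(r"[^a-z0-9]+", "-", cleaned).strip("-")


def test_slugify_public_behavior():
    # Asserts the contract, not the helper: this test survives a rewrite
    # of _strip_accents as long as the observable behavior holds.
    assert slugify("Héllo,  Wörld!") == "hello-world"
```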

The Rest #

I haven't written the rest of this, but come back later for more.