When unit testing, I often find myself debating how much of the data I feed to, and expect back from, my units under test should live directly in the test files.
The tradeoff I'm constantly struggling with is:
- If a large portion of the test (in code volume) consists of input and output data, the test becomes hard to actually read, but I can easily see the actual inputs and outputs.
- If I load the test data from files, I can easily test a bunch of variations on possible input data and reuse test data across multiple tests, but I have to leave the source code and look at another file to see exactly what the inputs are.
Is either one of these an anti-pattern?
To answer your question directly – no, I don’t believe either is an anti-pattern when used correctly.
— More verbose answer —
From my experience, this depends heavily on the goal of your test. Here's the rule of thumb that has helped me decide:
Are you actually testing a small unit of code? (A true unit test)
If yes, then I’ve found it’s much easier to create the data inside the test itself, precisely because I can see what is being passed in. In these cases, I will usually look for a Jasmine-like library to use, because I find it makes creating and maintaining the test data easier. That’s a personal preference though – use whatever makes your job easier.
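One common way to keep inline data readable is a small builder/factory helper that fills in defaults, so each test states only the fields it cares about. A sketch of that idea in Python (the `make_order` builder and `order_total` function are hypothetical):

```python
# Hypothetical builder: supplies sensible defaults so each test only
# spells out the fields relevant to that test.
def make_order(**overrides):
    order = {"id": 1, "quantity": 1, "unit_price": 10.0, "discount": 0.0}
    order.update(overrides)
    return order

# Hypothetical unit under test.
def order_total(order):
    return order["quantity"] * order["unit_price"] * (1 - order["discount"])

def test_discount_is_applied():
    # Only the fields this test cares about appear here; the rest are defaults.
    order = make_order(quantity=2, discount=0.5)
    assert order_total(order) == 10.0

test_discount_is_applied()
```

The payoff is that the data in view is exactly the data under test, which is the whole point of keeping it inline.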
If no, then you’re probably actually testing the system itself. In these cases, I often do load data from an external source, the reasons here being:
- This test isn’t about code clarity for programmers (although that is still important – someone has to maintain this), it’s about running enough different types of data through the entire chunk of the system to be reasonably sure it works.
- Often I will write the plumbing code to load and use the test data, but the data itself is created by someone else (usually a QA staff member in my case). These people aren’t usually programmers so I can’t expect them to be editing code.
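The split described above might look like this in Python – the programmer owns the plumbing, while QA owns a plain CSV of cases (the `square` function, column names, and CSV content are made up for illustration; a real setup would read the CSV from disk):

```python
import csv
import io

# QA-maintained test data: non-programmers edit this table, not the code.
# (Inlined here for the demo; normally this lives in its own file.)
QA_CASES = """input,expected
3,9
-2,4
0,0
"""

# Hypothetical unit under test.
def square(n):
    return n * n

# Plumbing code, written once by a programmer: runs every row through the
# function and collects any mismatches.
def run_table_tests(rows, fn):
    failures = []
    for row in rows:
        got = fn(int(row["input"]))
        if got != int(row["expected"]):
            failures.append((row, got))
    return failures

failures = run_table_tests(csv.DictReader(io.StringIO(QA_CASES)), square)
assert failures == []
```

Adding coverage then means adding a CSV row, with no code change at all.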
So, to make a long answer short: it depends on what you’re testing and why. Both approaches are useful and have their place – choose what works best for your situation.
I don’t see a trade-off here. Source code is supposed to describe algorithms, or at least business logic, not large amounts of data. If you write a Fourier transform, you want to verify that a sine tone is correctly mapped to a single peak, a mixed sound to several peaks, and so on – but for that it is completely sufficient to feed a file named sinus.wav into the routine and verify that the output structure is what you expect.
Sure, technically you don’t have an immediate assurance that sinus.wav really does contain a sine tone, but as you said, listing the 100,000 amplitude values in the source doesn’t really give you that either – in fact, it is worse, because you can at least play an external file with an audio player to check, while data values buried in source code are essentially impossible to do anything with.
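A sketch of that kind of check in Python – here the sine is generated in memory rather than loaded from sinus.wav, and the transform is a naive DFT rather than a production FFT, but the assertion is the same: one dominant peak at the expected frequency bin:

```python
import cmath
import math

N = 64        # number of samples
FREQ_BIN = 5  # the sine's frequency, in DFT bins

# Stand-in for reading sinus.wav: a pure sine tone generated in memory.
samples = [math.sin(2 * math.pi * FREQ_BIN * n / N) for n in range(N)]

# Naive O(N^2) discrete Fourier transform, for illustration only.
def dft(x):
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
            for n in range(n_pts))
        for k in range(n_pts)
    ]

spectrum = [abs(c) for c in dft(samples)]

# A real signal has a mirrored spectrum, so only inspect the first half.
peak_bin = max(range(N // 2), key=lambda k: spectrum[k])
assert peak_bin == FREQ_BIN  # the single peak sits exactly where expected
```

The test asserts on the structure of the output (where the peak is), never on the 100,000 raw amplitude values themselves.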