The first idea we want to appropriate from test-driven development is that of regression testing, and our specific analytical variant of this, the idea of a reference test.
We propose a "zeroth level" of test-driven data analysis as recording one or more specific sets of inputs to an analytical process, together with the corresponding outputs generated, and ensuring that the process can be re-run using those recorded inputs. The first test can then simply be checking that the results remain the same if the analysis is re-run.
In the language of test-driven development, this is a regression test, because it tests that no regressions have occurred, i.e. the results are the same now as previously. It is also a system test, in the sense that it checks the functioning of the whole system (the analytical process), rather than one or more specific subunits, as is the case with unit tests.
In our work with Skyscanner, Stochastic Solutions maintains a number of tests of this type for each of our major analytical processes. They help to ensure that as we make changes to the analysis scripts, and any of the software they depend on, we don't break anything without noticing. We also run them whenever we install new versions on Skyscanner servers, to check that we get identical results on their platforms as on our own development systems. We call these whole-system regression tests reference tests, and run them as part of the special commit process we use each time we update the version number of the software. In fact, our process only allows the version number to be updated if the relevant tests—including the relevant reference tests—pass.
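To make the idea concrete, here is a minimal sketch of such a reference test, with a toy stand-in for a real analytical process; in practice the recorded inputs and outputs would live in files under version control rather than as in-code literals:

```python
def analyse(values):
    """Toy stand-in for the real analytical process: summarize a list."""
    return {'n': len(values), 'total': sum(values)}

RECORDED_INPUT = [1, 2, 3, 4]              # inputs captured from a real run
RECORDED_OUTPUT = {'n': 4, 'total': 10}    # outputs captured from that run

def test_reference():
    # The regression test: re-running the analysis on the recorded
    # inputs must reproduce the recorded outputs exactly.
    assert analyse(RECORDED_INPUT) == RECORDED_OUTPUT

test_reference()
```

If a later change to the system alters the output for the recorded inputs, this test fails, flagging the regression.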
Some practical considerations
Stochastic (Randomized) Analyses
We assume that our analytical process is deterministic. If it involves a random component, we can make it deterministic by fixing the seed (or seeds) used by the random number generators. Any seeds should be treated as input parameters; if the process seeds itself (e.g. from the clock), it is important it writes out the seeds to allow the analysis to be re-run.
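As a sketch (with hypothetical names), a randomized analysis can be made reproducible by accepting the seed as an input parameter and reporting whichever seed was actually used:

```python
import random

def analyse(n, seed=None):
    # Treat the seed as an input parameter; if none is supplied,
    # choose one and report it so the run can be reproduced later.
    if seed is None:
        seed = random.randrange(2 ** 32)
    print('Seed used: %d' % seed)    # record this alongside the outputs
    rng = random.Random(seed)        # a private, seeded generator
    return sorted(rng.random() for _ in range(n))
```

Re-running with the reported seed then reproduces the results exactly, so the process becomes deterministic for testing purposes.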
We also assume that the analyst has performed some level of checking of the results to convince herself that they are correct. In the worst case, this may consist of nothing more than verifying that the program runs to completion and produces output of the expected form that is not glaringly obviously incorrect.
Needless to say, it is vastly preferable if more diligent checking than this has been carried out, but even if the level of initial checking of results is superficial, regression tests deliver value by allowing us to verify the impact of changes to the system. Specifically, they allow us to detect situations in which a result is unexpectedly altered by some modification of the process—direct or indirect—that was thought to be innocuous (see below).
Size / Time
Real analysis input datasets can be large, as can outputs, and complex analyses can take a long time. If the data is "too large" or the run-time excessive, it is quite acceptable (and in various ways advantageous) to cut it down. This should obviously be done with a view to maintaining the richness and variability of the inputs. Indeed, the data can also be changed to include more "corner cases", or, for example, to anonymize it, if it is sensitive.
The main reason we are not specifically advocating cutting down the data is that we want to make the overhead of implementing a reference test as low as possible.
If the analytical process directly connects to some dynamic data feed, it will be desirable (and possibly necessary) to replace that feed with a static input source, usually consisting of a snapshot of the input data. Obviously, in some circumstances, this might be onerous, though in our experience it is usually not very hard.
Another factor that can cause analysis of fixed input data, with a fixed analytical process, to produce different results is explicit or implicit time-dependence in the analysis. For example, the analysis might convert an input date stamp to something like "number of whole days before today", or the start of the current month. Obviously, such transformations produce different results when run on different days. As with seeds, if there are such transformations in the analysis code, they need to be handled. To cope with this sort of situation, we typically look up any reference values such as "today" early in the analytical process, and allow optional override parameters to be provided. Thus, in ordinary use we might run an analysis script by saying:

python analysis_AAA.py

but in testing replace this by something like
AAA_TODAY="2015/11/01" python analysis_AAA.py
to set the environment variable AAA_TODAY to an override value, or with a command such as
python analysis_AAA.py -d 2015/11/01
to pass in the date as a command-line option to our script.
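A minimal version of this lookup might read as follows (a sketch assuming the AAA_TODAY convention above; the helper name is ours):

```python
import os
from datetime import date

def get_today(override=None):
    # Priority: explicit parameter (e.g. from a -d command-line option),
    # then the AAA_TODAY environment variable, then the real current date.
    value = override or os.environ.get('AAA_TODAY')
    if value:
        year, month, day = (int(part) for part in value.split('/'))
        return date(year, month, day)
    return date.today()
```

All downstream date arithmetic then uses get_today(), so a reference test can pin the analysis to a fixed date.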
Computers are basically deterministic: regardless of what numerical accuracy they achieve, if they are asked to perform the same operations, on the same inputs, in the same order, they will normally produce identical results every time. Thus even if our outputs are floating-point values, there is no intrinsic problem with testing them for exact equality. The only thing we really need to be careful about is that we don't perform an equality test between a rounded output value and a floating-point value held internally without rounding (or, more accurately, held as an IEEE floating-point value, rather than a decimal value of given precision). In practice, when comparing floating-point values, we either need to compare formatted string output, rounded in some fixed manner, or compare the values to some fixed level of precision. In most cases, the level of precision will not matter very much, though in particular domains we may want to exercise more care in choosing it.
To make this distinction clear, look at the following Python code:
$ python
Python 2.7.10 (default, Jul 14 2015, 19:46:27)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import division
>>> a = 1/3
>>> b = 1/3
>>> print a
0.333333333333
>>> a == 0.333333333333
False
>>> a == b
True
>>> round(a, 12) == round(0.333333333333, 12)
True
>>> str(a) == '0.333333333333'
True
>>> '%.12f' % a == '0.333333333333'
True
In this code fragment:

The first line tells Python to return floating-point values from integer division (always a good idea).

The next two lines assign a and b each to be a third.

The following line confirms that the result of this is, as we'd expect, 0.333333333333. But, crucially, this value is not exact. If we print it to 60 decimal places, we see:
>>> print "%.60f" % a
0.333333333333333314829616256247390992939472198486328125000000
Unsurprisingly, therefore, when in the next statement we ask Python whether a is equal to 0.333333333333, the result is False.

After this, as expected, we confirm that a == b is True.
We then confirm that if we round a to 12 decimal places, the result is exactly round(0.333333333333, 12). Do we need the round on the right-hand side? Probably not, but be aware that 0.333333333333 is not a value that can be stored exactly in binary, so:
>>> print '%.60f' % 0.333333333333
0.333333333333000025877623784253955818712711334228515625000000
It is probably clearer, therefore, either to round both sides or to use string comparisons.
Finally, we perform two string comparisons. The first relies on Python's default string formatting rules, and the second is more explicit.
NOTE: When it comes to actually writing tests, Python's unittest module includes an assertAlmostEqual method that takes a number of decimal places, so if a function f(x) is expected to return the result 1/3 when x = 1, the usual way to test this to 12 decimal places is with the following code fragment:
def testOneThird(self):
    self.assertAlmostEqual(f(1), 0.333333333333, 12)
Another factor that can cause differences in results is parallel execution, which can often result in subtle changes in the detailed sequence of operations carried out. A simple example would be a task farm in which each of a number of workers calculates a result. If those results are then summed by the controller process in the order they are returned, rather than in a predefined sequence, numerical rounding errors may produce different answers. Thus, more care has to be taken in these sorts of cases.
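The effect is easy to demonstrate: because floating-point addition is not associative, accumulating the same partial results in a different order can change the final sum, so a controller that sums worker results in arrival order is not deterministic. A small sketch:

```python
# Partial results from four hypothetical workers. Summing them
# left-to-right loses the 1.0 added to 1e16 (it falls below the
# spacing of representable doubles at that magnitude), whereas
# summing them in sorted order cancels the large values first.
results = [1e16, 1.0, -1e16, 1.0]

in_arrival_order = sum(results)          # (((1e16 + 1.0) - 1e16) + 1.0)
in_sorted_order = sum(sorted(results))   # ((((-1e16) + 1.0) + 1.0) + 1e16)

print(in_arrival_order, in_sorted_order)   # the two sums differ
```

A parallel controller that wants reproducible output must therefore combine results in a predefined order, independent of arrival time.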
A final implementation detail is that we sometimes have to be careful about simply comparing output logs, graph files, etc. It is very common for output to include things that may vary from run to run, such as timestamps, version information or sequence numbers (run 1, run 2, ...). In these cases, the comparison process needs to make suitable allowances. We will discuss some methods for handling this in a future article.
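One simple approach is to normalize the run-varying fields before comparing. The patterns below are illustrative assumptions about the log format, not a general solution:

```python
import re

def normalize(log_text):
    # Replace fields that legitimately vary between runs with fixed
    # placeholders, so that two logs from equivalent runs compare equal.
    text = re.sub(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', '<TIMESTAMP>',
                  log_text)
    text = re.sub(r'run \d+', 'run <N>', text)
    return text
```

Two logs that differ only in timestamps and run numbers then compare equal after normalization, while any substantive difference still causes the comparison to fail.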
Reasons a Regression Test Might Fail
Changes to the system not intended to change the result, but sometimes doing so, can take many forms. For example:
We might extend our analysis code to accommodate some variation in the input data handled.
We might add an extra parameter or code path to allow some variation in the analysis performed.
We might upgrade some software, e.g. the operating system, libraries, the analysis software or the environment in which the software runs.
We might upgrade the hardware (e.g. adding memory, processing capacity or GPUs), potentially causing different code paths to be followed.
We might run the analysis on a different machine.
We might change the way in which the input data is stored, retrieved or presented to the software.
Hardware and software can develop faults, and data corruption can and does occur.
The Law of Software Regressions
Experience shows that regression tests are a very powerful tool for identifying unexpected changes, and that such changes occur more often than anyone expects. In fact, writing this reminds me of the self-referential law1 proposed by Doug Hofstadter:
It always takes longer than you expect, even when you take into account Hofstadter's Law.
— Gödel, Escher, Bach: An Eternal Golden Braid, Douglas R. Hofstadter.
In a similar vein, we might coin a Law of Software Regressions:
The Law of Software Regressions:
Software regressions happen more often than expected, even when you take into account the Law of Software Regressions.
1. Douglas R. Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid, p. 152. Penguin Books (Harmondsworth), 1980.