Unit Testing

Posted on Sat 28 November 2015 in TDDA

Systematic Unit Tests, System Tests and Reference Tests

The third major idea in test-driven data analysis is the one most directly taken from test-driven development, namely systematically developing both unit tests for small components of the analytical process and carefully constructed, specific tests for the whole system or larger components.

With regression testing, we emphasized the idea of taking whole-system analyses that have already been performed and turning those into overall system tests. In contrast, here we are talking about creating (or selecting) specific patterns in the input data for particular functions that exercise both core functionality (the main code paths) and so-called edge cases (the less-trodden paths). This is definitely significant extra work, and may (particularly if retrofitted) require either restructuring of code or use of some kind of mocking.

When we talk about "edge cases", we are referring to patterns, code-paths or functionality that is used only a minority of the time, or perhaps only on rare occasions. Examples might include handling missing values, extreme values, small data volumes and so forth. Some questions that might help to illuminate common edge cases include:

What happens if we use a standard deviation and there is only one value (so that the standard deviation is not defined)?
What happens if there are no records?
What happens if some or all of the data is null?
What happens if a few extreme (and possibly erroneous) values occur: will this, for example, cause a mean to have an non-useful value or bin boundaries to be set to useless values?
What happens if two fields are perfectly correlated: will this cause instability or errors when performing, for example, a statistical regression?
What happens if there are invalid characters in string data, especially field names or file names?
What happens if input data formats (e.g. string encodings, date formats, separators) are not as expected.
Are the various styles of line-endings (e.g. Unix vs. pc.) handled correctly?
What happens if an external system (such as a database) is running in some unexpected mode, such as with a different locale?

This idea of performing checks throughout the software does not really have an analogue in mainstream TDD, but is certainly good software engineering practice—see, e.g. the timeless Little Bobby Tables XKCD https://xkcd.com/327/—and relates fairly closely to the software engineering practice of defensive programming https://en.wikipedia.org/wiki/Defensive_programming. ↩
Obviously, in many situations, it's fine for identifiers or keys to be repeated, but it is also often the case that in a particular table a field value must be unique, typically when the records act as master records, defining the entities that exist in some category. Such tables are often referred to as master tables in database contexts https://encyclopedia2.thefreedictionary.com/master+file. ↩