Test-Driven Data Analysis

Flat Files (a.k.a. CSV files)

Posted on Fri 16 July 2021 in TDDA • Tagged with data

This week, a client I'm working for received a large volume of data, and as usual the data was sent as "flat" files, or CSV (comma-separated values¹) files, as they are more often called. Everyone hates CSV files, because they are badly specified, contain little metadata and are generally an unreliable way to transfer information accurately. They continue to be used, of course, because they are the lowest-common-denominator format and just about everything can read and write them in some fashion.

Some of the problems with CSV files are well captured in a pithy blog post by Jesse Donat entitled Falsehoods Programmers Believe about CSVs.

Among other things, the data we received this week featured:

unescaped commas in unquoted (comma-separated) values;
an unspecified non-UTF-8 encoding that also did not appear to be iso-8859-1 ("latin-1" to its friends), nor indeed iso-8859-15 ("latin-9");
different null markers in different fields, and in some cases, different null markers in a single field;²
field names (column headers) that included spaces, apostrophes, dashes and (in at least one case) a non-ASCII non-alphanumeric character;
multiple date formats, even within a single field, including some dates with three-digit years.

All of this is a bit frustrating, but far from unusual, and only one of these problems was actually fatal—the use of unquoted, unescaped separators in values, which makes the file inherently ambiguous. I'm almost sure this data was written but not read or validated, because I don't believe the supplier would have been able to read it reliably either.
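To make the ambiguity concrete, here is a small sketch (the function name candidate_repairs is invented for illustration, and it handles only the simplest case of a row with exactly one separator too many). It lists every way such a row could be mapped back onto the expected number of fields.

```python
def candidate_repairs(fields, sep=","):
    """List every way of merging one adjacent pair of fields.

    Handles only the simplest case: a row with exactly one separator
    too many.
    """
    repairs = []
    for i in range(len(fields) - 1):
        merged = fields[i] + sep + fields[i + 1]
        repairs.append(fields[:i] + [merged] + fields[i + 2:])
    return repairs


header = "Price,DisplayPrice".split(",")
row = "1000.0,£1,000.00".split(",")   # three fields where two are expected

for repair in candidate_repairs(row):
    print(repair)
# ['1000.0,£1', '000.00']
# ['1000.0', '£1,000.00']
```

Both candidates are syntactically valid; only knowledge of what the fields mean picks out the second, and with more fields or more stray separators the number of candidates grows quickly.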

Metadata

In an ideal world, we'd move away from CSV files, but we also need to recognise not only that this probably won't happen, but that the universality, plain-text nature, grokkability and simplicity of CSV files are all strengths; for all that we might gain using fancier, better-specified formats, we would lose quite a lot too, not least the utility of awk, split, grep and friends in many cases.

So if we can't get away from CSV files, how can we increase reliability when using them? Standardizing might be good, but again, this is going to be hard to achieve. What we might be able to do, however, is to work towards a way of specifying flat files that at least allows a receiver of them to know what to expect, or a generator to know what to write. I've been involved with a few such ideas over the years, and the software my company produces (Miró) used its own non-standard, XML-based way of describing flat files.

What I'm thinking about is trying to produce something more general, less opinionated, and more modern (think JSON, rather than XML, for starters) that addresses more issues. The initial goal would be simply descriptive—to allow a metadata file to be created that accurately describes the specific features of a given flat file so that a reader (human or machine) knows how to interpret it. Over time, this might grow into something bigger. I think obvious things to do after the format is created include:

1. In my case, getting Miró to accept these in place of³ its current XML-based files when reading (or writing) flat files. (Initially, at least, Miró would not be able to read or write all files that could be specified in this way, but could at least warn the user when it couldn't.)
2. Also getting the Python tdda library to be able to use this when using CSV files for input (and perhaps also for output).
3. Writing an "argument generator" for some of the standard (Python) CSV readers and writers to set the read/write options to be consistent with a given metadata description, and then probably to provide wrapped versions of those readers/writers that can accept a path for a CSV file and a path to a metadata file and use the underlying CSV reader or writer to read or write the file using that specification.
4. Writing (yet another) "smart" reader to try to read any old CSV files (using heuristics) and write out a metadata file that appears to match the data provided. This could not possibly work completely reliably because of all the inherent ambiguity in flat files already alluded to, but an "80%" solution for real-world files should certainly be achievable as many programs make a reasonable job of handling arbitrary CSV files already.
5. Writing a validator to confirm whether a given CSV file is consistent with the specification in the metadata file.
6. Incorporating such a flat-file validator into TDDA so that it can check not only the (semantic) content of a dataset, but also the syntactic/formatting validity of data, confirming that it has been or can be read correctly.⁴
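As a rough illustration of the "argument generator" idea in the list above, the sketch below maps a metadata dictionary onto keyword arguments for Python's standard csv.reader. The metadata key names (separator, quote-char, escape-char, stutter-quotes) are invented for this example and are not a proposed format; a real version would load the dictionary from a JSON file and accept file paths.

```python
import csv
import io


def csv_reader_kwargs(meta):
    """Translate a metadata dict into keyword arguments for csv.reader."""
    kwargs = {
        "delimiter": meta.get("separator", ","),
        "doublequote": meta.get("stutter-quotes", True),
    }
    quote = meta.get("quote-char")
    if quote:
        kwargs["quotechar"] = quote
    else:
        # No quote character described: treat quotes as ordinary data.
        kwargs["quoting"] = csv.QUOTE_NONE
    if meta.get("escape-char"):
        kwargs["escapechar"] = meta["escape-char"]
    return kwargs


def read_rows(lines, meta):
    """Read all rows from an iterable of lines using the described dialect."""
    return list(csv.reader(lines, **csv_reader_kwargs(meta)))


# Hypothetical description of a pipe-separated file with stuttered quotes.
meta = {"separator": "|", "quote-char": '"', "stutter-quotes": True}
text = 'id|name\n1|"Smith | Sons"\n2|"say ""hi"""\n'
rows = read_rows(io.StringIO(text), meta)
```

Here rows holds the header followed by two data rows, with the quoted pipe and the stuttered quotes both restored to plain text.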

Together, a smart reader that generates a metadata file for a CSV file (item 4 above) and a validator that validates a CSV file against such a metadata specification (item 5) are very analogous to the current constraint discovery and data verification, respectively, but in the space of CSV files—roughly, "syntactic" conformance—rather than data (or "semantic") correctness.
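A very small validator along these lines might look like the sketch below. It assumes a hypothetical metadata dictionary with encoding, separator and fields keys (names invented here), and checks only three things: that the bytes decode, that the header matches, and that every row has the expected number of fields.

```python
import csv
import io


def validate_flat_file(data, meta):
    """Check raw bytes against a description; return a list of problems."""
    try:
        text = data.decode(meta["encoding"])
    except UnicodeDecodeError as err:
        return [f"not valid {meta['encoding']} at byte {err.start}"]

    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=meta["separator"])
    expected = meta["fields"]
    problems = []
    header = next(reader, None)
    if header != expected:
        problems.append(f"header {header} does not match {expected}")
    for row in reader:
        if len(row) != len(expected):
            problems.append(f"line {reader.line_num}: {len(row)} fields, "
                            f"expected {len(expected)}")
    return problems


meta = {"encoding": "utf-8", "separator": ",", "fields": ["id", "name"]}
good = "id,name\n1,Miró\n".encode("utf-8")
bad = "id,name\n1,Miró\n2,x,y\n".encode("utf-8")
wrong_encoding = "id,name\n1,Miró\n".encode("latin-1")

for data in (good, bad, wrong_encoding):
    print(validate_flat_file(data, meta))
```

A fuller validator would also check null markers, quoting, types and date formats field by field; this one only shows the overall shape of "description in, list of infringements out".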

Miró's Flat File Description format (XMD Files)

Here is an example, from its documentation, of the XMD data files that Miró uses.

<?xml version="1.0" encoding="UTF-8"?>
<dataformat>
    <sep>,</sep>                     <!-- field separator -->
    <null></null>                    <!-- NULL marker -->
    <quoteChar>"</quoteChar>         <!-- Quotation mark -->
    <encoding>UTF-8</encoding>       <!-- any python coding name -->
    <allowApos>True</allowApos>      <!-- allow apostrophes in strings -->
    <skipHeader>False</skipHeader>   <!-- ignore the first line of file -->
    <pc>False</pc>                   <!-- Convert 1.2% to 0.012 etc. -->
    <excel>False</excel>             <!-- pad short lines with NULLs -->
    <dateFormat>eurodt</dateFormat>  <!-- Miró date format name -->
    <fields>
        <field extname="mc id" name="ID" type="string"/>
        <field extname="mc nm" name="MachineName" type="int"/>
        <field extname="secs" name="TimeToManufacture" type="real"/>
        <field extname="commission date" name="DateOfCommission"
               type="date"/>
        <field extname="mc cp" name="Completion Time" type="date"
               format="rdt"/>
        <field extname="sh dt" name="ShipDate" type="date" format="rd"/>
        <field extname="qa passed?" name="Passed QA" type="bool"/>
    </fields>
    <requireAllFields>False</requireAllFields>
    <banExtraFields>False</banExtraFields>
</dataformat>

Three things to note immediately about this:

I'm not presenting this as the solution: the XMD format is now rather out of vogue and there are a number of things I would definitely do differently fifteen years on (such as using more standard names for types and more standard date format specifiers).
The XMD format is slightly more than just a flat file description, in that it contains a couple of things that are more about how to interpret and handle the data after reading, rather than simply describing the data.
The XMD file supports the notion of two different names for a field. The extname is the name in the CSV file (the external name), while the name is the name for Miró to use for the field. The semantics of this are slightly complicated, but allow for renaming of fields on import, and for naming of fields where there is no external name, or external names are repeated, or the external name is otherwise unusable by Miró. If the CSV file has a header and each field has a different name in the header, the order of the fields in the XMD file does not matter, but if there are missing or repeated field names, Miró will use the field order in the XMD file.
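For readers who want to experiment, the sketch below reads an abbreviated version of the XMD example with Python's standard xml.etree.ElementTree and builds the extname-to-name renaming map. This is not how Miró reads XMD files; the function name parse_xmd, the keys of the returned dictionary and the fallback defaults are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated from the example above (XML declaration omitted).
XMD = """
<dataformat>
    <sep>,</sep>
    <null></null>
    <quoteChar>"</quoteChar>
    <encoding>UTF-8</encoding>
    <skipHeader>False</skipHeader>
    <fields>
        <field extname="mc id" name="ID" type="string"/>
        <field extname="secs" name="TimeToManufacture" type="real"/>
        <field extname="sh dt" name="ShipDate" type="date" format="rd"/>
    </fields>
</dataformat>
"""


def parse_xmd(xml_text):
    """Extract overall settings and per-field attributes from XMD text."""
    root = ET.fromstring(xml_text.strip())

    def flag(tag, default=False):
        text = root.findtext(tag)
        return default if text is None else text.strip() == "True"

    return {
        "separator": root.findtext("sep", default=","),
        "null": root.findtext("null") or "",
        "encoding": root.findtext("encoding", default="UTF-8"),
        "skip_header": flag("skipHeader"),
        "fields": [dict(field.attrib) for field in root.iter("field")],
    }


spec = parse_xmd(XMD)
# Map external (CSV header) names to the names used after import.
rename = {f["extname"]: f["name"] for f in spec["fields"]}
```

The rename dictionary is what a reader would apply to the header line when each external name is present and unique; the fallback to field order described above is not covered here.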

Notwithstanding the amazing variety seen in CSV files, as illuminated by Jesse Donat's aforementioned blog post, most CSV files from mature systems vary only in the ways covered by a few of the items described in the XMD file. The most important things to know about a flat file overall are normally:

Encoding. The file encoding—these days, most commonly UTF-8.
Separator. The separator character—most commonly a comma (,), but pipe (|), tab and semicolon (;) are also frequently used.
Quoting. What character is used to quote strings (if any). There are quite a number of subtleties here (not all capable of being expressed in the XMD file) including:
- Are all strings quoted or just some (e.g. ones containing the field separator)?
- Are non-string values (e.g. numbers) quoted too?⁵
- Are missing values (NULL) quoted?⁶
Missing Values. How are missing values (NULLs) denoted in the file, should there be any?
Escaping. How are characters "escaped"? This really covers a set of different issues, and the XMD file is not rich enough to cover all possibilities. One aspect is, when strings are quoted, how are quotes in the string handled? The most common answers are either by preceding them with an escape character, usually backslash (\), e.g.
```
"This is an escaped \" character in a string"
```
or by stuttering:
```
"This is a stuttered "" character in a string"
```
Escaping is also a way of including the separator in non-quoted values, like these display prices:
```
Price,DisplayPrice
100.0,£100.00
1000.0,£1\,000.00
1000000,£1\,000\,000.00
```
Escaping is also a way of specifying some special characters, e.g. \n for a newline, \t for a tab etc., and as a result when an actual backslash is required it is self-escaped (as \\).
Row Truncation after the last non-null value. Are rows in which the last value is missing truncated? Like many CSV writers, Excel writes missing values as blanks so that 1,,3 is read as 1 for the first field, a missing value for the second field and 3 for the third field. More quirkily, when Excel writes out CSV files, if there are n columns and the last m of them on a row are missing, Excel will write out only the non-missing values, and no further separators, so that there will be only n – m values on that line and only n – m – 1 separators. This behaviour is hard to describe and (as far as I know) unique to Excel, so in the XMD file this is simply marked as <excel>True</excel>.⁷
Header handling. Although the common case is for CSV files to have a single line at the start with the field names, sometimes there is no such line, and sometimes there are multiple lines before the data (one or more of which may specify the field names). As a minimum, a metadata description needs to be able to specify whether there is a header line, and ideally how many such lines there are and how headers should be extracted from them. If there are no headers, the specification should probably specify the field names. (Miró imaginatively calls the fields Field1 to FieldN if no fieldnames are available in the flat file or any XMD file.)
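Several of the items above (Quoting, Escaping and Row Truncation) correspond to options in Python's standard csv module, so they can be tried out directly. The sketch below is only an illustration: the helper names read and pad_row are invented, and padding is just one plausible way to undo Excel-style truncation once the expected number of fields is known.

```python
import csv
import io


def read(text, **dialect):
    """Parse text with csv.reader using the given dialect options."""
    return list(csv.reader(io.StringIO(text), **dialect))


# Quote inside a quoted string, stuttered (csv's default, doublequote=True).
stuttered = read('1,"a ""quoted"" word"\n')

# The same value with a backslash escape instead of stuttering.
escaped = read('1,"a \\"quoted\\" word"\n',
               doublequote=False, escapechar="\\")

# A separator escaped inside an unquoted value.
unquoted = read("1000.0,£1\\,000.00\n",
                quoting=csv.QUOTE_NONE, escapechar="\\")


def pad_row(row, n_fields, null=""):
    """Restore trailing missing values dropped by Excel-style truncation."""
    return row + [null] * (n_fields - len(row))


# Second row has lost its two trailing (missing) values.
truncated = read("1,,3\n4\n")
padded = [pad_row(row, 3) for row in truncated]
```

The stuttered and escaped inputs both yield the same string, a "quoted" word, which is exactly why a reader needs to be told which convention a file uses: applying the wrong one silently produces different data.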

Per-Field Information

It's always useful and sometimes necessary to specify field types, and as discussed above, sometimes field names. Typing is almost always ambiguous, and such ambiguity is increased if there are any bad values in the data. Moreover, in some cases (especially dates and timestamps), it is useful to specify the date format. Although good flat-file readers generally make a reasonable job of inferring types, and often date formats too, it is clearly helpful for a metadata specification to include these.

Just as date formats can vary between fields, other things can vary too, most obviously null indicators (missing value information), quoting and escaping. Moreover, if numeric data is formatted (e.g. including currency indicators, thousand separators etc.) these can all usefully be specified.
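As a sketch of what per-field information buys a reader, the code below converts raw string values using a hypothetical per-field description. The layout of FIELDS, its key names, and the use of strptime-style date formats are all assumptions for illustration, not part of the XMD format.

```python
from datetime import date, datetime

# Hypothetical per-field descriptions: type, null markers, formatting.
FIELDS = {
    "ID": {"type": "string", "nulls": {""}},
    "ShipDate": {"type": "date", "format": "%d/%m/%Y", "nulls": {"", "NULL"}},
    "Price": {"type": "real", "nulls": {"", "-"}, "thousands": ","},
}


def convert(value, spec):
    """Convert one raw string value according to its field description."""
    if value in spec.get("nulls", ()):
        return None
    kind = spec["type"]
    if kind == "date":
        return datetime.strptime(value, spec["format"]).date()
    if kind == "real":
        thousands = spec.get("thousands")
        if thousands:
            value = value.replace(thousands, "")
        return float(value)
    return value


raw = {"ID": "A17", "ShipDate": "31/12/2020", "Price": "1,000.50"}
row = {name: convert(value, FIELDS[name]) for name, value in raw.items()}
```

Without the description, "31/12/2020" is merely unambiguous by luck, "01/02/2020" would not be, and "-" could as easily be a string as a missing price.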

Required/Allowed Fields

The final pair of settings in the XMD file look slightly different from the others, partly because they are phrased as directives rather than descriptions. requireAllFields, when set, is a directive to Miró to raise a warning or an error if any of the fields in the XMD file are not present in the CSV file. Similarly, banExtraFields is a directive to raise such a warning or error if any fields are found in the CSV file that are not listed in the XMD file. Miró has several ways to specify whether infringements result in warnings or errors.

These directives can, however, be recast as declarations. The banExtraFields directive, when true, can equally be thought of as a declaration that the field list is complete. Similarly, the requireAllFields directive, when true, can be thought of as a declaration that the field list is not just describing types and formats for fields that might be in the CSV files, but rather that all fields listed are actually in the file.⁸

In principle, I think it would probably be better if these descriptions were more obviously descriptive or declarative, but I am struggling to find a pair of words/phrases that would capture that elegantly. At this point I am tempted to retain their imperative nature but make them slightly more symmetrical, perhaps with:

"require-all-fields": true,
"allow-extra-fields": false

Alternatively a more declarative syntax might be something like:

"csv-file-might-omit-fields": false,
"csv-file-might-include-extra-fields": false

The reader might wonder why the fields in the metadata file would ever not correspond exactly to those in the file. In practice, it is not uncommon when dealing with relatively "good" CSV files to write an XMD file that specifies types and formats only for fields that trip up the flat-file reader. Conversely, it can be useful to have XMD files that describe a variety of possible files that share field names and types; in those cases, the extra ones do no harm.
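Whichever phrasing is chosen, the check itself is simple. The sketch below uses the imperative pair of names suggested above as function parameters; the function check_fields and its message wording are invented for illustration and say nothing about how Miró reports infringements.

```python
def check_fields(header, described,
                 require_all_fields=True, allow_extra_fields=False):
    """Compare a CSV header with the described field names.

    Returns a list of problem messages (empty if the header conforms).
    """
    problems = []
    missing = [name for name in described if name not in header]
    extra = [name for name in header if name not in described]
    if require_all_fields and missing:
        problems.append(f"missing fields: {missing}")
    if not allow_extra_fields and extra:
        problems.append(f"unexpected fields: {extra}")
    return problems


described = ["mc id", "secs", "sh dt"]
print(check_fields(["mc id", "secs", "sh dt"], described))
print(check_fields(["mc id", "secs", "notes"], described))
print(check_fields(["mc id", "secs", "notes"], described,
                   require_all_fields=False, allow_extra_fields=True))
```

The last call corresponds to the relaxed use described above, where the description covers only some fields, or more fields than a given file contains.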

What Might a Metadata File Look Like?

The XMD file gets quite a lot of things right:

As XML, it's a standard format that's easy to read, though today JSON is clearly more popular for this sort of use. (It would be fairly easy to allow a common format to be expressed in JSON, XML or YAML, but there's something to be said for a single format, probably JSON.)
All of the most fundamental overall properties are represented—encoding, separator, null marker, escape characters, and date format.
There's a separation between the overall file properties and the per-field properties, with the ability to specify the actual fieldname in the file, the field type and, in the case of date fields, custom formats on a per-field basis, if necessary.
It can give enough information to allow Excel-style truncated lines to be read successfully.

There are also a few major shortcomings:

The single escape character specification covers multiple things.
There is no explicit support for quote stuttering (which is fairly common).
The format does not recognise multiple headers.
The format does not provide any way to specify non-date field formats such as boolean specifiers, possible thousand separators and decimal point markers.
The format assumes a single NULL indicator for all fields and assumes that there is only one kind of missing value.
The date formats supported are not comprehensive and are not expressed in a standard way.
Type specifiers are also somewhat non-standard.
XMD files fail to recognize the possibility that null markers are quoted, and implicitly assume that any empty string is distinct from a missing string value. This is probably too opinionated.

Some of these shortcomings reflect the fact that the XMD format was conceived less as a general-purpose flat-file descriptor than as a specification of how Miró should read or write a given flat file, and also a way for Miró to specify how it has written a flat file.

Essentially, I think a good flat-file description format would preserve the good aspects and remedy the faults identified, as well as providing a mechanism for specifying some more esoteric possibilities not mentioned so far.

I'll propose something concrete in subsequent posts.

UPDATE: The example metadata was updated on 2025-06-23, to be slightly more interesting and realistic. This coincides with the post tdda.serial: Metadata for Flat Files (CSV Files).

¹ Sometimes the separator in a flat file is a character other than a comma, and you occasionally see .tsv used as an extension when the separator is a tab character, or .psv when the separator is a pipe character (|). Often, however, a .csv extension is still used, and as a result the acronym CSV is sometimes restyled as character-separated values. I had always heard this extension attributed to Microsoft, but have been unable to verify this.
² To be fair, the notion of different kinds of missing values is reasonable: missing because it wasn't recorded, missing because it was unreadable, missing because it's an undefined result (e.g. mean of no values) etc. But this wasn't that: it was just multiple ways of denoting generic missing values.
³ by which, of course, I mean as well as ...
⁴ There's an interesting question as to whether the CSV format specification should be incorporated as an optional part of a TDDA file, and if so, whether it should simply be a nested section or whether the field-specific components should be merged with TDDA's field sections. There are pros and cons.
⁵ Yes, some systems do this.
⁶ I know, madness! But such practices occur!
⁷ Maybe it should have been called quirks mode.
⁸ Miró's slightly extended version of TDDA files includes lists of required and allowed fields, which serve a similar purpose to these settings.

Sharing Tests across Implementations by Externalizing Test Data

Posted on Sun 30 August 2020 in TDDA • Tagged with tests, reference tests, data

I've been dabbling in Swift, Apple's new-ish programming language, recently. One of the things I often do when learning a new language is either to take an existing project in a language I know (usually, Python) and translate it to the new one, or (better) to try a new project, first writing it in Python then translating it. This allows me to separate out debugging the algorithm from debugging my understanding of the new language, and also gives me something to test against.

I have a partially finished Python project for analysing chords that I've been starting to translate, and this has led me to begin to experiment with some new extensions to the TDDA library (not yet pushed/published).

It's a bit fragmented and embryonic, but this is what I'm thinking about.

Sharing test data between languages

Many tests boil down to "check that passing these inputs to this function¹ produces this result". There would be some benefits in sharing the inputs and expected outputs between implementations:

DRY principle (don't repeat yourself);
reducing the chances of things getting out of sync;
more confidence that the two implementations really do the same thing;
less typing / less code.

Looping over test cases

Standard unit-testing dogma tends to focus on the idea of testing small units using many tests, each containing a single assertion, usually as the last statement in the test.² The benefit of using a single assertion is that when there's a failure it's very clear what it was, and an earlier failure doesn't prevent a later check (assertion) from being carried out: you get all your failures in one go. Less importantly, it also means that the number of tests executed is the same as the number of assertions tested, which might be useful and psychologically satisfying.

On the other hand, it is extremely common to want to test multiple input-output pairs and it is natural and convenient to collect those together and loop over them. I do this all the time, and the reference testing capability in the TDDA library already helps mitigate some downsides of this approach in some situations.

A common way I do this is to loop over a dictionary or a list of tuples specifying input-output pairs. For example, if I were testing a function that did string slicing from the left in Python (string[:n]) I might use something like

cases = {
    ('Catherine', 4): 'Cath',
    ('Catherine', -6): 'Cath',  # deliberately wrong, for illustration
    ('', 7): '',
    ('Miró forever', 4): 'Miró',
    ('Miró forever', 0): ' '    # also deliberately wrong
}
for (text, n), expected in cases.items():
    self.assertEqual(left_string(text, n), expected)

In Python this is fine, because tuples, being hashable, can be used as dictionary keys, and there's something quite intuitive and satisfying about the cases being presented as lines of the form input: expected output. But I also often just use nested tuples or lists, partly as a hangover from older versions of Python in which dictionaries weren't ordered.³ Here's a full example using tuples:

from tdda.referencetest import ReferenceTestCase

def left_string(s, n):
    return s[:n]


class TestLeft(ReferenceTestCase):
    def testLeft(self):
        cases = (
            (('Catherine', 4), 'Cath'),
            (('Catherine', -6), 'Cath'),    # deliberately wrong, for illustration
            (('', 7), ''),
            (('Miró forever', 4), 'Miró'),
            (('Miró forever', 0), ' ')      # also deliberately wrong
        )
        for (text, n), expected in cases:
            self.assertEqual(left_string(text, n), expected)


if __name__ == '__main__':
    ReferenceTestCase.main()

As noted above, two problems with this are:

if one test case fails, it's not necessarily easy to figure out which one it was, especially if expected values (e.g. 'Cath') are repeated.
an earlier failure prevents later cases from running.

We can see both of these problems if we run this:

$ python3 looptest.py
F
======================================================================
FAIL: testLeft (__main__.TestLeft)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "looptest.py", line 18, in testLeft
    self.assertEqual(left_string(text, n), expected)
AssertionError: 'Cat' != 'Cath'
- Cat
+ Cath
?    +


----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=1)

It's actually the second case that failed, and the fifth case would also fail if it ran (since it should produce an empty string, not a space).

A technique I've long used to address the first problem is to include the test case in the equality assertion, replacing

    self.assertEqual(actual, expected)

with

    self.assertEqual((case, actual), (case, expected))

like so:

def testLeft(self):
    cases = (
        (('Catherine', 4), 'Cath'),
        (('Catherine', -6), 'Cath'),
        (('', 7), ''),
        (('Miró forever', 4), 'Miró'),
        (('Miró forever', 0), ' ')
        )
    for case, expected in cases:
        (text, n) = case
        self.assertEqual((case, left_string(text, n)),
                         (case, expected))

Now when a case fails, we see what the failure is more easily:

$ python3 looptest2.py
F
======================================================================
FAIL: testLeft (__main__.TestLeft)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "looptest2.py", line 20, in testLeft
    (case, expected))
AssertionError: Tuples differ: (('Catherine', -6), 'Cat') != (('Catherine', -6), 'Cath')

First differing element 1:
'Cat'
'Cath'

- (('Catherine', -6), 'Cat')
+ (('Catherine', -6), 'Cath')
?                         +


----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (failures=1)

I wouldn't call it beautiful, but it does the job, at least when the inputs and outputs are of a manageable size.

This still leaves the problem that the failure of an earlier case prevents later cases from running. The TDDA library already addresses this in the case of file checks, by providing the assertFilesCorrect (plural) assertion in addition to the assertFileCorrect (singular); we'll come back to it later.

Externalizing Test Data

Returning to the main theme of this post, when there are multiple implementations of software, potentially in different languages, there is some attraction to being able to share the test data—ideally, both the inputs being tested and the expected results.

The project I'm translating is a chord analysis tool focused on jazz guitar chords, especially moveable ones with no root. It has various classes, functions and structures concerned with musical notes, scales, abstract chords, tunings, chord shapes, chord names and so forth. It includes an easy-to-type text format that uses # as the sharp sign and b as the flat sign, though on output, these are usually translated to ♯ and ♭. Below are two simple tests from the Python code.

For those interested, the first tests a function transpose that transposes a note by a number of semitones. There's an optional key parameter which, when provided, is used to decide whether to express the result as a sharp or flat note (when appropriate).

def testTranspose(self):
    self.assertEqual(transpose('C', 0), 'C')

    self.assertEqual(transpose('C', 1), 'C#')
    self.assertEqual(transpose('C', 1, key='F'), 'Db')
    self.assertEqual(transpose('C#', -1), 'C')
    self.assertEqual(transpose('Db', -1), 'C')

    self.assertEqual(transpose('C', 2), 'D')
    self.assertEqual(transpose('D', -2), 'C')

    self.assertEqual(transpose('C', 3, key='A'), 'D#')
    self.assertEqual(transpose('C', 3), 'Eb')
    self.assertEqual(transpose('C', 3, key='Bb'), 'Eb')

    self.assertEqual(transpose('D#', -3), 'C')
    self.assertEqual(transpose('Eb', -3), 'C')

    self.assertEqual(transpose('C', -1), 'B')
    self.assertEqual(transpose('B', 1), 'C')

    self.assertEqual(transpose('C', -2), 'Bb')
    self.assertEqual(transpose('C', -2, 'E'), 'A#')
    self.assertEqual(transpose('Bb', 2), 'C')
    self.assertEqual(transpose('A#', 2), 'C')

    self.assertEqual(transpose('C', -3), 'A')
    self.assertEqual(transpose('A', 3), 'C')

    self.assertEqual(transpose('G', 4), 'B')
    self.assertEqual(transpose('B', -4), 'G')

    self.assertEqual(transpose('F#', 4), 'Bb')
    self.assertEqual(transpose('F#', 4, 'E'), 'A#')
    self.assertEqual(transpose('Bb', -4), 'F#')
    self.assertEqual(transpose('A#', -4), 'F#')
    self.assertEqual(transpose('Bb', -4, 'F'), 'Gb')
    self.assertEqual(transpose('A#', -4, 'Eb'), 'Gb')

    self.assertEqual(transpose('G', 4), 'B')
    self.assertEqual(transpose('F#', 4), 'Bb')
    self.assertEqual(transpose('F#', 4, 'E'), 'A#')
    self.assertEqual(transpose('B', -4), 'G')
    self.assertEqual(transpose('Bb', -4), 'F#')
    self.assertEqual(transpose('A#', -4), 'F#')
    self.assertEqual(transpose('Bb', -4, 'F'), 'Gb')
    self.assertEqual(transpose('A#', -4, 'F'), 'Gb')

Clearly, this test does not use looping, but does combine some 36 test cases in a single test (dogma be damned!)

A second test is for a function to_flat_equiv, which (again, for those interested) accepts chord names (in various forms) and, where the chord's key is sharp as written, converts them to the equivalent flat form. (Here, o is one of the ways to indicate a diminished chord (e.g. Dº) and M is one of the ways of describing a major chord (also maj or Δ).) The function also accepts None as an input (returned unmodified) and R as an abstract chord with no key specified (also unmodified).⁴

def test_to_flat_equiv(self):
    cases = (
        ('C', 'C'),
        ('C#m', 'Dbm'),
        ('Db7', 'Db7'),
        ('C#M7', 'DbM7'),
        ('Do', 'Do'),
        ('D#M', 'EbM'),
        ('E9', 'E9'),
        ('FmM7', 'FmM7'),
        ('F#mM7', 'GbmM7'),
        ('G', 'G'),
        ('G#11', 'Ab11'),
        ('Ab11', 'Ab11'),
        ('Am11', 'Am11'),
        ('A#+', 'Bb+'),
        ('Bb+', 'Bb+'),
        ('A♯+', 'B♭+'),
        ('B♭+', 'B♭+'),

        (None, None),
        ('R', 'R'),
        ('R#', 'R#'),
        ('Rm', 'Rm'),
        )
    for k, v in cases:
        self.assertEqual(to_flat_equiv(k), v)

    for letter in 'BEPQaz@':
        self.assertRaises(NoteError, to_flat_equiv, letter + '#')

This test uses two loops, one for the good cases and another for seven illegal input cases that raise exceptions. The looping has a clear benefit, but there's no reason to have combined the good and bad test cases in a single test function other than laziness.

In 2020, if we're going to share the test data between implementations, it is hard to look beyond JSON. Here's an extract from a file scale-tests.json that encapsulates the inputs and expected outputs for all the tests above:

{
    "transpose": [
        [["C", 0], "C"],

        [["C", 1], "C#"],
        [["C", 1, "F"], "Db"],
        [["C#", -1], "C"],
        [["Db", -1], "C"],

        [["C", 2], "D"],
        [["D", -2], "C"],

        [["C", 3, "A"], "D#"],
        [["C", 3], "Eb"],
        [["C", 3, "Bb"], "Eb"],

        [["D#", -3], "C"],
        [["Eb", -3], "C"],

        [["C", -1], "B"],
        [["B", 1], "C"],

        [["C", -2], "Bb"],
        [["C", -2, "E"], "A#"],
        [["Bb", 2], "C"],
        [["A#", 2], "C"],

        [["C", -3], "A"],
        [["A", 3], "C"],

        [["G", 4], "B"],
        [["B", -4], "G"],

        [["F#", 4], "Bb"],
        [["F#", 4, "E"], "A#"],
        [["Bb", -4], "F#"],
        [["A#", -4], "F#"],
        [["Bb", -4, "F"], "Gb"],
        [["A#", -4, "Eb"], "Gb"],

        [["G", 4], "B"],
        [["F#", 4], "Bb"],
        [["F#", 4, "E"], "A#"],
        [["B", -4], "G"],
        [["Bb", -4], "F#"],
        [["A#", -4], "F#"],
        [["Bb", -4, "F"], "Gb"],
        [["A#", -4, "F"], "Gb"]
    ],
    "flat_equivs": [
        ["C", "C"],
        ["C#m", "Dbm"],
        ["Db7", "Db7"],
        ["C#M7", "DbM7"],
        ["Do", "Do"],
        ["D#M", "EbM"],
        ["E9", "E9"],
        ["FmM7", "FmM7"],
        ["F#mM7", "GbmM7"],
        ["G", "G"],
        ["G#11", "Ab11"],
        ["Ab11", "Ab11"],
        ["Am11", "Am11"],
        ["A#+", "Bb+"],
        ["Bb+", "Bb+"],
        ["A♯+", "B♭+"],
        ["B♭+", "B♭+"],

        [null, null],

        ["R", "R"],
        ["R#", "R#"],
        ["Rm", "Rm"]
    ],
    "flat_equiv_bads": "BEPQaz@"
}

I have a function that uses json.load to read this and other test data, storing the results in an object with a .scale attribute, like so:

>>> from moveablechords.utils import ReadJSONTestData
>>> from pprint import pprint
>>> TestData = ReadJSONTestData()
>>> pprint(TestData.scale['transpose'])
[[['C', 0], 'C'],
 [['C', 1], 'C#'],
 [['C', 1, 'F'], 'Db'],
 [['C#', -1], 'C'],
 [['Db', -1], 'C'],
 [['C', 2], 'D'],
 [['D', -2], 'C'],
 [['C', 3, 'A'], 'D#'],
 [['C', 3], 'Eb'],
 [['C', 3, 'Bb'], 'Eb'],
 [['D#', -3], 'C'],
 [['Eb', -3], 'C'],
 [['C', -1], 'B'],
 [['B', 1], 'C'],
 [['C', -2], 'Bb'],
 [['C', -2, 'E'], 'A#'],
 [['Bb', 2], 'C'],
 [['A#', 2], 'C'],
 [['C', -3], 'A'],
 [['A', 3], 'C'],
 [['G', 4], 'B'],
 [['B', -4], 'G'],
 [['F#', 4], 'Bb'],
 [['F#', 4, 'E'], 'A#'],
 [['Bb', -4], 'F#'],
 [['A#', -4], 'F#'],
 [['Bb', -4, 'F'], 'Gb'],
 [['A#', -4, 'Eb'], 'Gb'],
 [['G', 4], 'B'],
 [['F#', 4], 'Bb'],
 [['F#', 4, 'E'], 'A#'],
 [['B', -4], 'G'],
 [['Bb', -4], 'F#'],
 [['A#', -4], 'F#'],
 [['Bb', -4, 'F'], 'Gb'],
 [['A#', -4, 'F'], 'Gb']]

I have also made it so you can get entries using attribute lookup on the objects, i.e. TestData.scale.transpose rather than TestData.scale['transpose'], just because it looks more elegant and readable to me.
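The post does not show how ReadJSONTestData is written. One way to get both forms of lookup, sketched here with invented names (AttrDict, load_test_data) and a two-case extract of the JSON, is to have json.loads build a dict subclass that falls back to key lookup for unknown attributes.

```python
import json


class AttrDict(dict):
    """A dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def load_test_data(json_text):
    """Parse JSON, turning every JSON object into an AttrDict."""
    return json.loads(json_text, object_hook=AttrDict)


SCALE_JSON = """
{
    "transpose": [
        [["C", 0], "C"],
        [["C", 1, "F"], "Db"]
    ],
    "flat_equiv_bads": "BEPQaz@"
}
"""

# Hypothetical stand-in for the TestData object used in the post.
TestData = AttrDict(scale=load_test_data(SCALE_JSON))
```

With this, TestData.scale.transpose and TestData.scale['transpose'] return the same list. One caveat of this approach is that keys that clash with dict method names (such as items) are only reachable with the bracket form.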

A straightforward refactoring of the testTranspose function to use the JSON-loaded data in TestData.scale would be

def testTranspose2(self):
    for (case, expected) in TestData.scale.transpose:
        if len(case) == 2:
            (note, offset) = case
            self.assertEqual((case, transpose(note, offset)),
                             (case, expected))
        else:
            (note, offset, key) = case
            self.assertEqual((case, transpose(note, offset, key=key)),
                             (case, expected))

In case this isn't self-explanatory:

The loop runs over the cases and expected values, so on the first iteration case is ["C", 0] and expected is 'C';
The assignments set the note and offset variables; if the list is of length three the key variable is also set;
As discussed above, rather than just using things like self.assertEqual(transpose(note, offset), expected), we're including the case (the tuple of input parameters) on both sides of the assertion so that if there's a failure, we can see which case is failing.

We can simplify this further since the transpose function has only one optional (keyword) argument, key, which can also be provided as a third positional argument. Assuming we don't specifically need to test the handling of key as a keyword argument, we can combine the two branches as follows:

def testTranspose3(self):
    for (case, expected) in TestData.scale.transpose:
        self.assertEqual((case, transpose(*case)),
                         (case, expected))

Here, we're using the * operator to unpack⁵ case into an argument list for the transpose function.
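
To see the unpacking in isolation, here is a minimal sketch using a hypothetical stand-in function, describe, which is not part of the chord project and merely echoes the arguments it receives:

```python
def describe(note, offset, key=None):
    # Hypothetical stand-in for transpose: it only reports what it was given.
    return (note, offset, key)


two_item_case = ["C", 1]
three_item_case = ["C", 1, "F"]

# *case spreads the list items over the positional parameters, so a
# two-item case leaves key at its default and a three-item case sets it.
print(describe(*two_item_case))    # ('C', 1, None)
print(describe(*three_item_case))  # ('C', 1, 'F')
```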

Adding TDDA Support

It probably hasn't escaped your attention that this third version of testTranspose is rather generic: the same structure would work for any function f and list of input-output pairs Pairs:

def testAnyOldFunction_f(self):
    for (case, expected) in Pairs:
        self.assertEqual((case, f(*case)), (case, expected))

This makes it fairly easy to add TDDA support. I added prototype support for this that allows us to use an even shorter version of the test:

def testTranspose4(self):
    self.checkFunctionByArgs(transpose, TestData.scale.transpose)

This new checkFunctionByArgs takes a function to test and a list of input output pairs and runs a slightly fancier version of testAnyOldFunction. I'll go into extensions in another post, but the most important difference is that it will report all failures rather than stopping at the first one.
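
The idea of reporting every failure, rather than stopping at the first, can be sketched as follows. This is an illustration under my own assumptions, not the TDDA implementation: check_function_by_args and the toy function add are hypothetical, and only the message layout is borrowed from the output shown later in this post.

```python
def check_function_by_args(f, pairs):
    """Call f(*args) for every (args, expected) pair; return failure messages.

    Hypothetical sketch, not TDDA's checkFunctionByArgs.
    """
    failures = []
    for args, expected in pairs:
        actual = f(*args)
        if actual != expected:
            call = "%s(%s)" % (f.__name__, ", ".join(repr(a) for a in args))
            failures.append("Case %s: failure.\n    Actual: %r\n  Expected: %r"
                            % (call, actual, expected))
    return failures  # an empty list means every case passed


def add(a, b=0):
    # Toy function under test.
    return a + b


# The second and fourth cases are deliberately wrong.
pairs = [[[1, 2], 3], [[1], 2], [[5], 5], [[2, 2], 5]]
failures = check_function_by_args(add, pairs)
print("\n\n".join(failures))
```

A test method could then make a single assertion that the returned list is empty, using the joined messages as the assertion message.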

We can illustrate this by changing the first and last cases in TestData.scale['transpose'] to be incorrect, say:

   [[['C', 0], 'Z'],
     ...
    [["A#", -4, "F"], "Zb"]

If we run testTranspose2 using this modified test data, we get only the first failing case, and although the test case is listed in the output, the output isn't particularly easy to grok.

$ python3 testscale.py
.....F.......
======================================================================
FAIL: testTranspose2 (__main__.TestScale)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "testscale.py", line 22, in testTranspose2
    (case, expected))
AssertionError: Tuples differ: (['C', 0], 'C') != (['C', 0], 'Z')

First differing element 1:
'C'
'Z'

- (['C', 0], 'C')
?             ^

+ (['C', 0], 'Z')
?             ^


----------------------------------------------------------------------
Ran 13 tests in 0.002s

FAILED (failures=1)

But if we use the TDDA's prototype checkFunctionByArgs functionality, we see both failures and it shows them in a more digestible format:

$ python3 testscale.py
.....

Case transpose('C', 0): failure.
    Actual: 'C'
  Expected: 'Z'


Case transpose('A#', -4, 'F'): failure.
    Actual: 'Gb'
  Expected: 'Zb'
F.......
======================================================================
FAIL: testTranspose4 (__main__.TestScale)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "testscale.py", line 15, in testTranspose4
    self.checkFunctionByArgs(transpose, TestData.scale.transpose)
  File "/Users/njr/python/tdda/tdda/referencetest/referencetest.py", line 899, in checkFunctionByArgs
    self._check_failures(failures, msgs)
  File "/Users/njr/python/tdda/tdda/referencetest/referencetest.py", line 919, in _check_failures
    self.assert_fn(failures == 0, msgs.message())
AssertionError: False is not true :

Case transpose('C', 0): failure.
    Actual: 'C'
  Expected: 'Z'


Case transpose('A#', -4, 'F'): failure.
    Actual: 'Gb'
  Expected: 'Zb'

----------------------------------------------------------------------
Ran 13 tests in 0.001s

FAILED (failures=1)

The failures currently get shown twice, once during execution of the tests and again at the end in the summary, and the test just counts this as a single failure, though these are both things that could be changed.

There are variant forms of the prototype checking function above to handle keyword arguments only and mixed positional and keyword arguments. There's also a version specifically for single-argument functions, where it's natural to write the argument not as a tuple but as a simple value.
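
As a rough illustration of what such variants could look like, here is a sketch. Every name in it (collect_failures, check_by_kwargs, check_by_args_and_kwargs, check_single_arg, scale) is hypothetical, as is the layout chosen for each kind of case; none of this is taken from the TDDA library.

```python
def collect_failures(call, pairs):
    """Apply call(case) to each (case, expected) pair; return the mismatches."""
    failures = []
    for case, expected in pairs:
        actual = call(case)
        if actual != expected:
            failures.append((case, actual, expected))
    return failures


def check_by_kwargs(f, pairs):
    # Each case is a dict of keyword arguments.
    return collect_failures(lambda case: f(**case), pairs)


def check_by_args_and_kwargs(f, pairs):
    # Each case is a two-item list: [positional args, keyword args].
    return collect_failures(lambda case: f(*case[0], **case[1]), pairs)


def check_single_arg(f, pairs):
    # Each case is a bare value rather than a list of arguments.
    return collect_failures(f, pairs)


def scale(x, factor=2):
    # Toy function under test.
    return x * factor


print(check_by_kwargs(scale, [[{"x": 3, "factor": 3}, 9], [{"x": 1}, 2]]))
print(check_by_args_and_kwargs(scale, [[[[3], {"factor": 3}], 9]]))
print(check_single_arg(scale, [[3, 6], [4, 9]]))  # the second case is wrong
```

All three case layouts use only lists, dicts and simple values, so they can be stored in JSON just like the positional cases.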

Is this a Good Idea?

I think the potential benefits of sharing data between different implementations of the same project are pretty clear. I haven't actually modified the Swift implementation to use the JSON, but I'm sure doing so will be easy and a clear win. I hope the example above also illustrates that good support from testing frameworks can significantly mitigate the downsides of looping over test cases within a single test function. But there are other potential downsides.

The most obvious problem, to me, is that the separation of the test data from the test makes it harder to see what's being tested (and perhaps means you have to trust the framework more, though that is quite easy to check). Arguably, this is even more true when the test is reduced to the one-line form in testTranspose4, rather than the longer form in testTranspose2, where the function arguments are unpacked and named, so that you can see a bit more of what is actually being passed into the function.

There's a broader point about the utility of tests as a form of documentation. A web search for externalizing test data uncovered this post from Arvind Patil in 2005 in which he proposes something like the scheme here for Java (with XML taking the place of JSON, in 2005, of course). Three replies to the post are quite hostile, including the first from Irakli Nadareishvili, who says:

sorry, but this is a quite dangerous anti-pattern. Unit-tests are not simply for testing a piece of code. They carry several, additional, very important roles. One of them is - documentation.

In a well-tested code, unit-tests are the first examples of API usage (API that they test). A TDD-experienced developer can learn a lot about the API, looking at its unit-tests. For the readability and clarity of what unit-test tests, it is very important that test data is in the code and the reader does not have to consistently hop from a configuration file to the test code.

Also, usually boundary conditions for a code (which is what test data commonly is) almost never change, so there is more harm in this "pattern" than gain, indeed.

This is definitely a reasonable concern. Even if code has good documentation, it is all too common for it to become out of date, whereas (passing) tests, almost by definition, tend to stay up-to-date with API changes. We could mitigate this issue quite a lot by hooking into verbose mode (-v or --verbose) and having it show each call as well as the test function being run, which seems like a good idea anyway. At the moment, if you run the scale tests on my chord project with -v, you get output like this:

$ python3 testscale.py -v
testAsSmallestIntervals (__main__.TestScale) ... ok
testDeMinorMajors (__main__.TestScale) ... ok
testFretForNoteOnString (__main__.TestScale) ... ok
testNotePairIntervals (__main__.TestScale) ... ok
testRelMajor (__main__.TestScale) ... ok
testTranspose (__main__.TestScale) ... ok
test_are_not_same (__main__.TestScale) ... ok
test_are_same (__main__.TestScale) ... ok
test_are_same_invalids (__main__.TestScale) ... ok
test_flat_equiv (__main__.TestScale) ... ok
test_flat_equiv_bads (__main__.TestScale) ... ok
test_preferred_equiv (__main__.TestScale) ... ok
test_preferred_equiv_bads (__main__.TestScale) ... ok

----------------------------------------------------------------------
Ran 13 tests in 0.002s

OK

but we could (probably) extend this to something more like:⁶

$ python3 testscale.py -v
testAsSmallestIntervals (__main__.TestScale) ... ok
testDeMinorMajors (__main__.TestScale) ... ok
testFretForNoteOnString (__main__.TestScale) ... ok
testNotePairIntervals (__main__.TestScale) ... ok
testRelMajor (__main__.TestScale) ... ok
testTranspose (__main__.TestScale) ...
    transpose('C', 0): OK
    transpose('C', 1): OK
    transpose('C', 1, 'F'): OK
    transpose('C#', -1): OK
    transpose('Db', -1): OK
    transpose('C', 2): OK
    transpose('D', -2): OK
    transpose('C', 3, 'A'): OK
    transpose('C', 3): OK
    transpose('C', 3, 'Bb'): OK
    transpose('D#', -3): OK
    transpose('Eb', -3): OK
    transpose('C', -1): OK
    transpose('B', 1): OK
    transpose('C', -2): OK
    transpose('C', -2, 'E'): OK
    transpose('Bb', 2): OK
    transpose('A#', 2): OK
    transpose('C', -3): OK
    transpose('A', 3): OK
    transpose('G', 4): OK
    transpose('B', -4): OK
    transpose('F#', 4): OK
    transpose('F#', 4, 'E'): OK
    transpose('Bb', -4): OK
    transpose('A#', -4): OK
    transpose('Bb', -4, 'F'): OK
    transpose('A#', -4, 'Eb'): OK
    transpose('G', 4): OK
    transpose('F#', 4): OK
    transpose('F#', 4, 'E'): OK
    transpose('B', -4): OK
    transpose('Bb', -4): OK
    transpose('A#', -4): OK
    transpose('Bb', -4, 'F'): OK
    transpose('A#', -4, 'F'): OK
... testTranspose (__main__.TestScale): 36 tests: ... ok
test_are_not_same (__main__.TestScale) ... ok
test_are_same (__main__.TestScale) ... ok
test_are_same_invalids (__main__.TestScale) ... ok
test_flat_equiv (__main__.TestScale) ... ok
test_flat_equiv_bads (__main__.TestScale) ... ok
test_preferred_equiv (__main__.TestScale) ... ok
test_preferred_equiv_bads (__main__.TestScale) ... ok

----------------------------------------------------------------------
Ran 49 test cases across 13 tests in 0.002s

OK

I also found this post from Jeremy Wadhams in 2015 on the subject of Sharing unit tests between several language implementations of one spec. It discusses JsonLogic:

JsonLogic is a data format (built on top of JSON) for storing and sharing rules between front-end and back-end code. It's essential that the same rule returns the same result whether executed by the JavaScript client or the PHP client.

Currently the JavaScript client has tests in QUnit, and the PHP client has tests in PHPunit. The vast majority of tests are "given these inputs (rule and data), assert the output equals the expected result."

Jeremy also suggests something very like the scheme above, again using JSON.

Conclusion

I think this has been quite a promising experiment. It reduced the length of testscale.py from 223 lines to 75, which wasn't an aim (and carries the potential issues noted above), but which does make the scope and structure of the tests easier to understand. It also achieved the primary goal of allowing test data to be shared between implementations, which seems like a valuable prize. Eventually, the project might gain a command line in both implementations, and that will potentially enable my favourite mode of testing: pairs of input command lines and expected output. But this is a useful start.

Meanwhile, I will probably refine (and document and test!) the prototype implementations a bit more and then release them.

If you have thoughts, do get in touch.

or, more generally, this callable. ↩
other than, perhaps, any manual teardown in a try...finally block. ↩
From Python 3.7 on, all Python dictionaries are guaranteed to preserve insertion order. This was already the case in the CPython implementation from 3.6 onwards. ↩
The function does not accept B# or E#, even though musically these can be used as alternatives to C and F respectively. That is outside the scope of this function. ↩
this operation is sometimes called splatting, and sometimes unsplatting or desplatting. ↩
Would I seem like a very old fuddy-duddy if I ask "who writes 'ok' in lower case anyway?" ↩

Reference Testing Exercise 2 (pytest flavour)

Posted on Thu 31 October 2019 in TDDA • Tagged with reference test, exercise, screencast, video, pytest

This exercise (video 2m 58s) shows a powerful way to run only a single test, or some subset of tests, by using the @tag decorator available in the TDDA library. This is useful for speeding up the test cycle and allowing you to focus on a single test, or a few tests. We will also see, in the next exercise, how it can be used to update test results more easily and safely when expected behaviour changes.

(If you do not currently use pytest for writing tests, you might prefer the unittest-flavoured version of this exercise, since unittest is in Python's standard library.)

Prerequisites

★ You need to have the TDDA Python library (version 1.0.31 or newer) installed; see installation. Use

tdda version

to check the version that you have.

Step 1: Copy the exercises (if you don't already have them)

You need to change to some directory in which you're happy to create three directories with data. We use ~/tmp for this. Then copy the example code.

$ cd ~/tmp
$ tdda examples    # copy the example code

Step 2: Go to the exercise files and examine them:

$ cd referencetest_examples/exercises-pytest/exercise2  # Go to exercise2

As in the first exercise, you should have at least the following four files:

$ ls
conftest.py expected.html   generators.py   test_all.py

conftest.py is configuration to extend pytest with referencetest capabilities,
expected.html contains the expected output from one test,
generators.py contains the code to be tested,
test_all.py contains the tests.

If you look at test_all.py, you'll see it contains five test functions. Only one of the tests is useful (testExampleStringGeneration) with all the others making manifestly true assertions and most of them deliberately wasting time to simulate annoyingly slow tests.

import time
from generators import generate_string

def testZero():
    assert True

def testOne():
    time.sleep(1)
    assert 1 == 1

def testExampleStringGeneration(ref):
    actual = generate_string()
    ref.assertStringCorrect(actual, 'expected.html')

def testTwo():
    time.sleep(2)
    assert 2 == 2

def testThree():
    time.sleep(3)
    assert 3 == 3

Step 3: Run the tests, which should be slow and produce one failure

$ pytest           #  This will work with Python 3 or Python2

When you run the tests, you should get a single failure, that being the non-trivial test testExampleStringGeneration.

The output will be something like:

============================= test session starts ==============================
test_all.py ..F..

[...details of test failure...]

====================== 1 failed, 4 passed in 6.17 seconds ======================

We get a test failure because we haven't added the ignore_substrings parameter that we saw in Exercise 1 is needed for it to pass.

The tests should take slightly over 6 seconds in total to run, because of the three annoyingly slow tests with sleep statements in them—testOne, testTwo and testThree. (If you're not annoyed by a 6-second delay, increase the sleep time in one of the "sleepy" tests until you are annoyed!)

The point of this exercise is to show some simple but very useful functionality for running only tests on which we wish to focus, such as our failing test.

Step 4: Tag the failing test using @tag

The TDDA library includes a function called tag; this is a decorator function¹ that we can put before individual tests, to mark them as being of special interest temporarily.
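
A decorator of this marking kind can be very small. The sketch below is a hypothetical re-implementation for illustration only: the attribute name _tagged and the helper is_tagged are my assumptions, not the names the TDDA library uses.

```python
def tag(test):
    # Hypothetical re-implementation: mark the function (or class) and
    # hand the same object back, so its behaviour is otherwise unchanged.
    test._tagged = True
    return test


def is_tagged(test):
    # Untagged objects simply lack the attribute.
    return getattr(test, "_tagged", False)


@tag
def example_interesting():
    return "interesting"


def example_boring():
    return "boring"


candidates = [example_interesting, example_boring]
print([f.__name__ for f in candidates if is_tagged(f)])  # ['example_interesting']
```

A test runner that knows about the marker can then filter on it, which is all that running only tagged tests requires.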

Edit test_all.py to add an import statement that brings in tag from the TDDA library, and then decorate the definition of testExampleStringGeneration by preceding it with @tag as follows:

from tdda.referencetest import tag

def testZero():
    assert True

def testOne():
    time.sleep(1)
    assert 1 == 1

@tag
def testExampleStringGeneration(ref):
    actual = generate_string()
    ref.assertStringCorrect(actual, 'expected.html')

Step 5: Run only the tagged test

Having tagged the failing test, if we run the tests again adding --tagged to the command, it will run only the tagged test, and take hardly any time. The (abbreviated) output should be something like

$ pytest --tagged
============================= test session starts ==============================
test_all.py F

[...details of test failure...]

=========================== 1 failed in 0.16 seconds ===========================

We can tag as many tests as we like, across any number of test files, to run a subset of tests, rather than a single one.

Step 6: Locating @tag decorators

Although it's not hard to use grep or grep -r to find @tag decorators, the library can actually do this for you. If you use the --istagged flag, then instead of running the tests, the library will report which test functions in which files are tagged. So in our case:

$ pytest --istagged
============================= test session starts ==============================
platform darwin -- Python 3.7.3, pytest-4.4.0, py-1.8.0, pluggy-0.9.0
rootdir: /Users/njr/tmp/referencetest_examples/exercises-pytest/exercise2
collecting ...
test_all.testExampleStringGeneration
collected 5 items

========================= no tests ran in 0.01 seconds =========================

Obviously, in the case of a single test file, this is not a big deal, but if you have dozens or hundreds of source files, in a directory hierarchy, and have tagged a few functions across them, it becomes significantly more helpful.

Recap: What we have seen

This simple exercise has shown how we can easily run subsets of tests by tagging them and then using --tagged to run only tagged tests.

In this case, the motivation was simply to save time and reduce clutter in the output, focusing on one test, or a small number of tests.

In Exercise 3, we will see how this combines with the ability to automatically regenerate updated reference outputs to make for a safe and efficient way to update tests after code changes.

Decorator functions in Python are functions that are used to transform other functions: they take a function as an argument and return a new function that modifies the original in some way. Our decorator function tag is called by writing @tag on the line before a function (or class) definition, and the effect of this is that the function returned by @tag replaces the function (or class) it precedes. In our case, all @tag does is set an attribute on the function in question so that the TDDA reference test framework can identify it as a tagged function, and choose to run only tagged tests when so requested. ↩

Reference Testing Exercise 2 (unittest flavour)

Posted on Wed 30 October 2019 in TDDA • Tagged with reference test, exercise, screencast, video, unittest

This exercise (video 3m 34s) shows a powerful way to run only a single test, or some subset of tests, by using the @tag decorator available in the TDDA library. This is useful for speeding up the test cycle and allowing you to focus on a single test, or a few tests. We will also see, in the next exercise, how it can be used to update test results more easily and safely when expected behaviour changes.

(If you use pytest for writing tests, you might prefer the pytest-flavoured version of this exercise.)

Prerequisites

★ You need to have the TDDA Python library (version 1.0.31 or newer) installed; see installation. Use

tdda version

to check the version that you have.

Step 1: Copy the exercises (if you don't already have them)

You need to change to some directory in which you're happy to create three directories with data. We use ~/tmp for this. Then copy the example code.

$ cd ~/tmp
$ tdda examples    # copy the example code

Step 2: Go to the exercise files and examine them:

$ cd referencetest_examples/exercises-unittest/exercise2  # Go to exercise2

As in the first exercise, you should have at least the following three files

$ ls
expected.html   generators.py   test_all.py

expected.html contains the expected output from one test,
generators.py contains the code to be tested,
test_all.py contains the tests.

If you look at test_all.py, you'll see it contains two test classes with five tests between them. Only one of the tests is useful (testExampleStringGeneration) with all the others making manifestly true assertions and deliberately wasting time to simulate annoyingly slow tests.

import time
from tdda.referencetest import ReferenceTestCase, tag
from generators import generate_string

class TestQuickThings(ReferenceTestCase):

    def testExampleStringGeneration(self):
        actual = generate_string()
        self.assertStringCorrect(actual, 'expected.html')

    def testZero(self):
        self.assertIsNone(None)


class TestSuperSlowThings(ReferenceTestCase):

    def testOne(self):
        time.sleep(1)
        self.assertEqual(1, 1)

    def testTwo(self):
        time.sleep(2)
        self.assertEqual(2, 2)

    def testThree(self):
        time.sleep(3)
        self.assertEqual(3, 3)

Step 3: Run the tests, which should be slow and produce one failure

$ python test_all.py   #  This will work with Python 3 or Python2

When you run the tests, you should get a single failure, that being the non-trivial test testExampleStringGeneration from the class TestQuickThings.

The output will be:

F....

[...details of test failure...]

Ran 5 tests in 6.007s
FAILED (failures=1)

We get a test failure because we haven't added the ignore_substrings parameter that we saw in Exercise 1 is needed for it to pass.

The tests should take slightly over 6 seconds in total to run, because of the annoyingly slow tests in TestSuperSlowThings. (If you're not annoyed by a 6-second delay, increase the sleep time in one of the "slow" tests until you are annoyed!)

The point of this exercise is to show some simple but very useful functionality for running only tests on which we wish to focus, such as our failing test.

Step 4: Tag the failing test using @tag

If you look at the import statements, you'll see that as well as ReferenceTestCase we also import tag. This is a decorator function¹ that we can put before individual tests, or test classes, to indicate that they are of special interest temporarily.

Edit test_all.py to decorate the failing test by adding @tag on the line before it, thus:

class TestQuickThings(ReferenceTestCase):

    @tag
    def testExampleStringGeneration(self):
        actual = generate_string()
        self.assertStringCorrect(actual, 'expected.html')

    def testZero(self):
        self.assertIsNone(None)

Step 5: Run only the tagged test

Having tagged the failing test, if we run the tests again adding -1 (the digit one, for "single", not the letter ell) to the command, it will run only the tagged test, and take hardly any time. The (abbreviated) output should be something like

$ python test_all.py -1
F

[...details of test failure...]

Ran 1 tests in 0.006s
FAILED (failures=1)

You can also use --tagged instead of -1 if you like more descriptive flags.

We can tag as many tests as we like, across any number of test files, and we can also tag whole classes by placing the @tag decorator before a test class definition. So if we instead use:

@tag
class TestQuickThings(ReferenceTestCase):

    def testExampleStringGeneration(self):
        actual = generate_string()
        self.assertStringCorrect(actual, 'expected.html')

    def testZero(self):
        self.assertIsNone(None)

and run the tests with -1, we will get output more like:

$ python test_all.py -1
F.

[...details of test failure...]

Ran 2 tests in 0.006s
FAILED (failures=1)

In this case, both the tests in our first test class were run, but no others (and, in particular, not our painfully slow tests!)
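
The selection logic can be sketched with plain unittest, as below. This is an illustration under my own assumptions, not how the TDDA library does it: tag here is a hypothetical marker that sets an assumed attribute named _tagged, tagged_only is a hypothetical helper, and the three small test classes are stand-ins.

```python
import io
import unittest


def tag(obj):
    obj._tagged = True  # hypothetical attribute name
    return obj


class TestQuick(unittest.TestCase):
    @tag
    def test_a(self):
        self.assertTrue(True)

    def test_b(self):
        self.assertTrue(True)


@tag
class TestWhole(unittest.TestCase):
    def test_c(self):
        self.assertTrue(True)

    def test_d(self):
        self.assertTrue(True)


class TestSlow(unittest.TestCase):
    def test_e(self):
        self.assertTrue(True)


def tagged_only(suite):
    """Return a new suite holding only tests whose method or class is tagged."""
    kept = unittest.TestSuite()
    for item in suite:
        if isinstance(item, unittest.TestSuite):
            kept.addTests(tagged_only(item))  # recurse into nested suites
        else:
            # _testMethodName is how a TestCase instance records its method.
            method = getattr(item, item._testMethodName)
            if (getattr(method, "_tagged", False)
                    or getattr(type(item), "_tagged", False)):
                kept.addTest(item)
    return kept


loader = unittest.TestLoader()
full = unittest.TestSuite(loader.loadTestsFromTestCase(c)
                          for c in (TestQuick, TestWhole, TestSlow))
selected = tagged_only(full)
result = unittest.TextTestRunner(stream=io.StringIO()).run(selected)
print(result.testsRun)  # test_a plus both tests in the tagged class
```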

Step 6: Locating @tag decorators

In a typical debugging or test development cycle in which you have been using the @tag decorator to focus on just a few failing tests, you might end up with @tag decorations scattered across several files, perhaps in multiple directories. (We're assuming here you have test_all.py or similar that imports all the other test classes so you can easily run them all together.)

Although it's not hard to use grep or grep -r to find them, the library can actually do this for you. If you use the -0 flag (the digit zero, for "no tests"), or the --istagged flag, instead of running the tests, the library will report which test classes in which files have tagged tests. So in our case:

$ python test_all.py -0

produces:

__main__.TestQuickThings

Here, __main__ stands for the current file; other files would be referenced by their imported name.

Recap: What we have seen

This simple exercise has shown how we can easily run subsets of tests by tagging them and then using the -1 flag (or --tagged) to run only tagged tests.

In this case, the motivation was simply to save time and reduce clutter in the output, focusing on one test, or a small number of tests.

In Exercise 3, we will see how this combines with the ability to automatically regenerate updated reference outputs to make for a safe and efficient way to update tests after code changes.

Decorator functions in Python are functions that are used to transform other functions: they take a function as an argument and return a new function that modifies the original in some way. Our decorator function tag is called by writing @tag on the line before a function (or class) definition, and the effect of this is that the function returned by @tag replaces the function (or class) it precedes. In our case, all @tag does is set an attribute on the function in question so that the TDDA reference test framework can identify it as a tagged function, and choose to run only tagged tests when so requested. ↩

Reference Testing Exercise 1 (pytest flavour)

Posted on Tue 29 October 2019 in TDDA • Tagged with reference test, exercise, screencast, video, pytest

This exercise (video 8m 54s) shows how to migrate a test from using pytest directly to exploiting the referencetest capabilities in the TDDA library. (If you do not currently use pytest for writing tests, you might prefer the unittest-flavoured version of this exercise, since unittest is in Python's standard library.)

We will see how even simple use of referencetest

makes it much easier to see how tests have failed when complex outputs are generated
helps us to update reference outputs (the expected values) when we have verified that a new behaviour is correct
allows us easily to write tests of code whose outputs are not identical from run to run. We do this by specifying exclusions from the comparisons used in assertions.

Prerequisites

★ You need to have the TDDA Python library installed (version 1.0.31 or newer); see installation. Use

tdda version

to check the version that you have.

Step 1: Copy the exercises

You need to change to some directory in which you're happy to create three new directories with data. We use ~/tmp for this. Then copy the example code.

$ cd ~/tmp
$ tdda examples    # copy the example code

Step 2: Go to the exercise files and examine them:

$ cd referencetest_examples/exercises-pytest/exercise1  # Go to exercise1

You should have at least the following four files:

$ ls
conftest.py expected.html   generators.py   test_all.py

generators.py contains a function called generate_string that, when called, returns HTML text suitable for viewing as a web page.
expected.html is the result of calling that function, saved to file
test_all.py contains a single pytest-based test of that function.
conftest.py imports key referencetest functionality from the tdda library into pytest.

It's probably useful to look at the web page expected.html in a browser, either by navigating to it in a file browser and double clicking it, or by using

open expected.html

if your OS supports this. As you can see, it's just some text and an image. The image is an inline SVG vector image, generated along with the text.

Also have a look at the test code. The core part of it is very short:

from generators import generate_string

def testExampleStringGeneration():
    actual = generate_string()
    with open('expected.html') as f:
        expected = f.read()
    assert actual == expected

The code

calls generate_string() to create the content
stores its output in the variable actual
reads the expected content into the variable expected
asserts that the two strings are the same.

Step 3. Run the test, which should fail

$ pytest      #  This will work whether pytest uses Python2 or Python3

You should get a failure, and pytest tries quite hard to show what's causing the failure:

=================================== FAILURES ===================================
_________________________ testExampleStringGeneration __________________________

    def testExampleStringGeneration():
        actual = generate_string()
        with open('expected.html') as f:
            expected = f.read()
>       assert actual == expected
E       AssertionError: assert '<!DOCTYPE ht...y>\n</html>\n' == '<!DOCTYPE htm...y>\n</html>\n'
E         Skipping 69 identical leading characters in diff, use -v to show
E         -  Solutions, 2016
E         +  Solutions Limited, 2016
E         ?           ++++++++
E         -     Version 1.0.0
E         ?             ^
E         +     Version 0.0.0...
E
E         ...Full output truncated (31 lines hidden), use '-vv' to show

test_all.py:24: AssertionError
=========================== 1 failed in 0.11 seconds ===========================

You can certainly see that there's a difference in the version number in the output and also in a line including 2016 (a copyright notice, in fact).

But it also says:

...Full output truncated (31 lines hidden), use '-vv' to show

and if you do that, the output becomes a bit overwhelming.

We'll convert the test to use the TDDA library's referencetest and see how that helps.

Step 4. Change the code to use referencetest.

The key change we need to make is to the assertion, which will now be:

ref.assertStringCorrect(actual, 'expected.html')

ref is an object made available by conftest.py, and is passed into our test function by pytest. We therefore need to change the function declaration to take ref as an argument:

def testExampleStringGeneration(ref):

Finally, because assertStringCorrect compares a string in memory to content from a file, we don't need the lines in the middle that read the file:

* Delete the middle two lines of the test function.

The result is:

from generators import generate_string

def testExampleStringGeneration(ref):
    actual = generate_string()
    ref.assertStringCorrect(actual, 'expected.html')

Step 5. Run the modified test

$ pytest

You should see very different output, that includes, near the end, something like this:

E       AssertionError: 2 lines are different, starting at line 5
E       Expected file expected.html
E       Compare raw with:
E           diff /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html expected.html
E
E       Compare post-processed with:
E           diff /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-expected.html /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/expected-expected.html

/Users/njr/python/tdda/tdda/referencetest/referencepytest.py:187: AssertionError

(You will probably need to scroll right to see all of the message on this page.)

Because the test failed, the TDDA library has written a copy of the actual output to file to make it easy for us to examine it and to use diff commands to see how it actually differs from what we expected. (In fact, it's written out two copies, a "raw" and a "post-processed" one, but we haven't used any processing, so they will be the same in our case. So we ignore the second diff command suggested for now.)

It's also given us the precise diff command we need to see the differences between our actual and expected output.

Step 6. Copy the first diff command and run it. You should see something similar to this:

$ diff /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html expected.html
5,6c5,6
<     Copyright (c) Stochastic Solutions, 2016
<     Version 1.0.0
---
>     Copyright (c) Stochastic Solutions Limited, 2016
>     Version 0.0.0
35c35
< </html>
\ No newline at end of file
---
> </html>

(If you have a visual diff tool, you can also use that. For example, on a Mac, if you have Xcode installed, you should have the opendiff command available.)

The diff makes it clear that there are three differences:

The copyright notice has changed slightly
The version number has changed
The string doesn't have a newline at the end, whereas the file does.
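To see why a raw comparison reports the missing final newline as a difference, here is a minimal sketch that compares two strings line by line while keeping line endings. The helper raw_differences and the miniature strings are made up for illustration; they are not part of the TDDA library or the real expected.html.

```python
from itertools import zip_longest


def raw_differences(actual, expected):
    """Return (line_number, actual_line, expected_line) for each differing line.

    Line endings are kept, so a missing final newline counts as a difference.
    """
    a_lines = actual.splitlines(keepends=True)
    e_lines = expected.splitlines(keepends=True)
    pairs = zip_longest(a_lines, e_lines)
    return [(n, a, e) for n, (a, e) in enumerate(pairs, start=1) if a != e]


# Made-up miniature of the situation above (not the real expected.html).
actual = '<!--\n    Version 1.0.0\n-->\n</html>'       # string: no final newline
expected = '<!--\n    Version 0.0.0\n-->\n</html>\n'   # file: ends with a newline

for n, a, e in raw_differences(actual, expected):
    print(n, repr(a), repr(e))
# 2 '    Version 1.0.0\n' '    Version 0.0.0\n'
# 4 '</html>' '</html>\n'
```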

The copyright and version number lines are both in comments in the HTML, so they don't affect the rendering at all. You might want to confirm this: if you look at the actual file it saved (/var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html, the first file in the diff command), you should see that it looks identical.

In this case, therefore, we might now feel that we should simply update expected.html with what generate_string() is now producing. It would be (by design) extremely easy to change the diff in the command it gave us to cp to achieve that.

However, there's a better thing we can do in this case.

Step 7. Specify exclusions

Standing back, it seems likely that the version number and Copyright line written to comments in the HTML will change periodically. If the only differences between our expected output and what we actually generate are those, we'd probably prefer the test didn't fail.

The ref.assertStringCorrect function from referencetest gives us several mechanisms for specifying changes that can be ignored when checking whether a string is correct. The simplest one, which will be enough for our example, is just to specify strings which, if they occur on a line in the output, cause differences in those lines to be ignored, so that the assertion doesn't fail.

Step 7a. Add the ignore_substrings parameter to assertStringCorrect as follows:

    ref.assertStringCorrect(actual, 'expected.html',
                            ignore_substrings=['Copyright', 'Version'])

Step 7b. Run the test again. It should now pass:

$ pytest
============================= test session starts ==============================

test_all.py .                                                            [100%]

=========================== 1 passed in 0.04 seconds ===========================
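The idea behind ignore_substrings can be illustrated with a small standalone sketch. This is not the library's code: differing_lines is a hypothetical helper, and the rule it uses (skip a differing pair of lines when either line contains one of the substrings) is an assumption made for this example.

```python
from itertools import zip_longest


def differing_lines(actual, expected, ignore_substrings=()):
    """Hypothetical helper: list line numbers that differ and are not ignored.

    Assumption for this sketch: a differing pair of lines is skipped when
    either line contains one of the ignore_substrings.
    """
    a_lines = actual.splitlines()      # drops line endings
    e_lines = expected.splitlines()
    failures = []
    for n, (a, e) in enumerate(zip_longest(a_lines, e_lines), start=1):
        if a == e:
            continue
        present = [line for line in (a, e) if line is not None]
        if any(s in line for s in ignore_substrings for line in present):
            continue
        failures.append(n)
    return failures


# Made-up fragments modelled on the diff shown earlier.
actual = '    Copyright (c) Stochastic Solutions, 2016\n    Version 1.0.0\n</html>'
expected = ('    Copyright (c) Stochastic Solutions Limited, 2016\n'
            '    Version 0.0.0\n</html>\n')

print(differing_lines(actual, expected))                            # [1, 2]
print(differing_lines(actual, expected, ['Copyright', 'Version']))  # []
```

Because splitlines() discards line endings, this sketch also treats the missing final newline as no difference, which is consistent with the test passing in Step 7b; how the library itself handles that case is not shown here.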

Recap: What we have seen

We've seen

Converting standard pytest-based tests to use referencetestcase is straightforward.
When we do that, we gain access to powerful new kinds of assertion such as assertStringCorrect. Among the immediate benefits:
- When there is a failure, it saves the failing output to a temporary file
- It tells you the exact diff command you need to be able to see the differences
- This also makes it very easy to copy the new "known good" answer into place if you've verified that the new answer is now correct. (In fact, the library also has a more powerful way to do this, as we'll see in a later exercise.)
The ref.assertStringCorrect function also has a number of mechanisms for allowing specific expected differences to occur without causing the test to fail. The simplest of these mechanisms is the ignore_substrings keyword argument we used here.
