Rexpy for Generating Regular Expressions: Postcodes

Posted on Wed 20 February 2019 in TDDA • Tagged with regular expressions, rexpy, tdda

Rexpy is a powerful tool we created that generates regular expressions from examples. It's available online, and forms part of our open-source TDDA library.

Miró users can use the built-in rex command.

This post illustrates using Rexpy to find regular expressions for UK postcodes.

A regular expression for Postcodes

If someone asked you what a UK postcode looks like, and you don't live in London, you'd probably say something like:

A couple of letters, then a number then a space, then a number then a couple of letters.

About the simplest way to get Rexpy to generate a regular expression is to give it at least two examples. You can do this online or using the open-source TDDA library.

If you give it EH1 3LH and BB2 5NR, Rexpy generates [A-Z]{2}\d \d[A-Z]{2}, as illustrated here, using the online version of rexpy:

Rexpy online, with EH1 3LH and BB2 5NR as inputs, produces [A-Z]{2}\d \d[A-Z]{2}

This is the regular-expression equivalent of what we said:

  • [A-Z]{2} means exactly two ({2}) characters from the range [A-Z], i.e. two capital letters
  • \d means a digit (the same as [0-9]—any single character from the range 0 to 9)
  • the gap between them is a space character
  • \d is another digit
  • [A-Z]{2} is two more letters.

This doesn't cover all postcodes, but it's a good start.
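We can sanity-check this expression with Python's built-in re module (not part of Rexpy; just confirming the pattern behaves as described):

```python
import re

# The regular expression Rexpy generated from EH1 3LH and BB2 5NR
pattern = re.compile(r'[A-Z]{2}\d \d[A-Z]{2}')

# Both of the examples match the whole string
for postcode in ['EH1 3LH', 'BB2 5NR']:
    assert pattern.fullmatch(postcode)

# But a single-letter area, like Glasgow's G1 9PU, does not match
assert pattern.fullmatch('G1 9PU') is None
```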

Other cases

An easy way to try out the regular expression we generated is to use the grep command1. This is built into all Unix and Linux systems, and is available on Windows if you install a Linux distribution under WSL.

If we try matching a few postcodes using this regular expression, we'll see that many—but not all—postcodes match the pattern.

  • On Linux, the particular variant of grep we need is grep -P, to tell it we're using Perl-style regular expressions.
  • On Unix (e.g. Macintosh), we need to use grep -E (or egrep) to tell it we're using "extended" regular expressions

If we write a few postcodes to a file:

$ cat > postcodes
EH1 3LH
BB2 5NR
G1 9PU
RG22 4EX
EC1A 1AB
OL14 8DQ

we can then use grep to find the lines that match:

$ grep -E '[A-Z]{2}\d \d[A-Z]{2}' postcodes

(Use -P instead of -E on Linux.)

More relevantly, for present purposes, we can also add the -v flag, to ask the match to be "inVerted", i.e. to show lines that fail to match:

$ grep -v -E '[A-Z]{2}\d \d[A-Z]{2}' postcodes
G1 9PU
RG22 4EX
EC1A 1AB
OL14 8DQ
  • The first of these, a Glasgow postcode, fails because it only has a single letter at the start.

  • The second and fourth fail because they have two digits after the letters.

  • The third fails because it's a London postcode with an extra letter, A after the EC1.

Let's add an example of each in turn:

If we first add the Glasgow postcode, Rexpy generates ^[A-Z]{1,2}\d \d[A-Z]{2}$.

Rexpy online, adding G1 9PU, produces [A-Z]{1,2}\d \d[A-Z]{2}

Here [A-Z]{1,2} means one or two capital letters, and we've checked the anchor checkbox, to get it to add ^ at the start and $ at the end of the regular expression.2 If we use this with our grep command, we get:

$ grep -v -E '^[A-Z]{1,2}\d \d[A-Z]{2}$' postcodes
RG22 4EX
EC1A 1AB
OL14 8DQ

If we now add an example with two digits in the first part of the postcode—say RG22 4EX—Rexpy further refines the expression to ^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$, which is good for all(?) non-London postcodes. If we repeat the grep with this new pattern:

$ grep -v -E '^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$' postcodes
EC1A 1AB

only the London example now fails.

In a perfect world, just by adding EC1A 1AB, Rexpy would produce our ideal regular expression—something like ^[A-Z]{1,2}\d{1,2}[A-Z]? \d[A-Z]{2}$. (Here, the ? is equivalent to {0,1}, meaning that the term before can occur zero times or once, i.e. it is optional.)

Unfortunately, that's not what happens. Instead, Rexpy produces:

^[A-Z0-9]{2,4} \d[A-Z]{2}$

Rexpy has concluded that the first part is just a jumble of capital letters and digits, and is saying that it can be any mixture of two to four letters and numbers.

In this case, we'd probably fix up the regular expression by hand, or pass in the special Central London postcodes separately from all the rest. If we feed in a few London postcodes on their own, we get:

^[A-Z]{2}\d[A-Z] \d[A-Z]{2}$

which is also a useful start.
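As a final sanity check, Python's built-in re module can confirm that a hand-fixed pattern along the lines suggested above (keeping \d{1,2} from the earlier refinement, so that one or two digits are allowed) matches all of the example postcodes:

```python
import re

# Hand-fixed pattern: 1-2 letters, 1-2 digits, an optional extra letter
# (for Central London postcodes like EC1A), a space, then digit + two letters
pattern = re.compile(r'^[A-Z]{1,2}\d{1,2}[A-Z]? \d[A-Z]{2}$')

postcodes = ['EH1 3LH', 'BB2 5NR', 'G1 9PU', 'RG22 4EX', 'OL14 8DQ', 'EC1A 1AB']
assert all(pattern.match(p) for p in postcodes)
```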

Have fun with Rexpy!

By the way: if you're in easy reach of Edinburgh, we're running a training course on the TDDA library as part of the Fringe of the Edinburgh DataFest, on 20th March. This will include use of Rexpy. You should come!

Training Course on Testing Data and Data Processes

  1. grep stands for global regular expression print, and the e in egrep stands for extended

  2. Sometimes, regular expressions match any line that contains the pattern anywhere in them, rather than requiring the pattern to match the whole line. In such cases, using the anchored form of the regular expression, ^[A-Z]{2}\d \d[A-Z]{2}$, means that matching lines must not contain anything before or after the text that matches the regular expression. (You can think of ^ as matching the start of the string, or line, and $ as matching the end.) 

Tagging PyTest Tests

Posted on Tue 22 May 2018 in TDDA • Tagged with tests, tagging

A recent post described the new ability to run a subset of ReferenceTest tests from the tdda library by tagging tests or test classes with the @tag decorator. Initially, this ability was only available for unittest-based tests. From version 1.0 of the tdda library, now available, we have extended this capability to work with pytest.

This post is very similar to the previous one on tagging unittest-based tests, but adapted for pytest.


  • A decorator called tag can be imported and used to decorate individual tests or whole test classes (by preceding the test function or class with @tag).

  • When pytest is run using the --tagged option, only tagged tests and tests from tagged test classes will be run.

  • There is a second new option, --istagged. When this is used, the software will report which test classes are tagged, or contain tests that are tagged, but will not actually run any tests. This is helpful if you have a lot of test classes, spread across different files, and want to change the set of tagged tests.


The situations where we find this particularly helpful are:

  • Fixing a broken test or working on a new feature or dataset. We often find ourselves with a small subset of tests failing (perhaps, a single test) either because we're adding a new feature, or because something has changed, or because we are working with data that has slightly different characteristics. If the tests of interest run in a few seconds, but the whole test suite takes minutes or hours to run, we can iterate dramatically faster if we have an easy way to run only the subset of tests currently failing.

  • Re-writing test output. The tdda library provides the ability to re-write the expected ("reference") output from tests with the actual result from the code, using the --write-all command-line flag. If it's only a subset of the tests that have failed, there is real benefit in re-writing only their output. This is particularly true if the reference outputs contain some differences each time (version numbers, dates etc.) that are being ignored using the ignore-lines or ignore-patterns options provided by the library. If we regenerate all the test outputs, and then look at which files have changed, we might see differences in many reference files. In contrast, if we only regenerate the tests that need to be updated, we avoid committing unnecessary changes and reduce the likelihood of overlooking changes that may actually be incorrect.


In order to use the reference test functionality with pytest, you have always needed to add some boilerplate code in the directory from which you are running pytest. To use the tagging capability, you need to add one more function definition, pytest_collection_modifyitems.

The recommended imports are now:

from tdda.referencetest.pytestconfig import (pytest_addoption,
                                             pytest_collection_modifyitems,
                                             ref)

This is also a good place to set the reference file location if you want to do so using set_default_data_location.


We'll illustrate this with a simple example. The code below implements four trivial tests, two in a class and two as plain functions.

Note the import of the tag decorator function near the top, and that test_a and the class TestClassA are decorated with the @tag decorator.


from tdda.referencetest import tag

@tag
def test_a(ref):
    assert 'a' == 'a'

def test_b(ref):
    assert 'b' == 'b'

@tag
class TestClassA:
    def test_x(self):
        assert 'x' * 2 == 'x' + 'x'

    def test_y(self):
        assert 'y' > 'Y'

If we run this as normal, all four tests run and pass:

$ pytest
============================= test session starts ==============================
platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 4 items ....

=========================== 4 passed in 0.02 seconds ===========================

But if we add the --tagged flag, only three tests run:

$ pytest --tagged
============================= test session starts ==============================
platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 4 items ...

=========================== 3 passed in 0.02 seconds ===========================

Adding the --verbose flag confirms that these three are the tagged test and the tests in the tagged class, as expected:

$ pytest --tagged --verbose
============================= test session starts ==============================
platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0 -- /usr/local/Cellar/python/3.5.1/bin/python3.5
cachedir: .cache
rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 4 items PASSED PASSED PASSED

=========================== 3 passed in 0.01 seconds ===========================

Finally, if we want to find out which classes include tagged tests, we can use the --istagged flag:

$ pytest --istagged
============================= test session starts ==============================
platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 4 items


========================= no tests ran in 0.01 seconds =========================

This is particularly helpful when our tests are spread across multiple files, as the filenames are then shown as well as the class names.


Information about installing the library is available in this post.

Other Features

Other features of the ReferenceTest capabilities of the tdda library are described in this post. Its capabilities in the area of constraint discovery and verification are discussed in this post, and this post.

Detecting Bad Data and Anomalies with the TDDA Library (Part I)

Posted on Fri 04 May 2018 in TDDA • Tagged with tests, anomaly detection, bad data

The test-driven data analysis library, tdda, has two main kinds of functionality:

  • support for testing complex analytical processes with unittest or pytest
  • support for verifying data against constraints, and optionally for discovering such constraints from example data.

Until now, however, the verification process has only reported which constraints failed to be satisfied by a dataset.

We have now extended the tdda library to allow identification of individual failing records, allowing it to act as a general purpose anomaly detection framework.

The new functionality is available through a new detect_df API call, and from the command line with the new tdda detect command.

The diagram shows conceptually how detection works, separating out anomalous records from the rest.

A simple anomaly detection process, splitting input data into anomalous and non-anomalous records

With the TDDA framework, anomalous simply means fails at least one constraint. We'll discuss cases in which the constraints have been developed to try to model some subset of data of interest (defects, high-risk applicants, heart arrhythmias, flawless diamonds, model patients etc.) in part II of this post. In those cases, we start to be able to discuss classifications such as true and false positives, and true and false negatives.

Example Usage from the Command Line

Suppose we have a simple transaction stream with just three fields, id, category and price, like this:

       id    category     price
710316821       QT       150.39
516025643       AA       346.69
414345845       QT       205.83
590179892       CB        55.61
117687080       QT       142.03
684803436       AA       152.10
611205703       QT        39.65
399848408       AA       455.67
289394404       AA       102.61
863476710       AA       297.82
534170200       KA        80.96
898969231       QT        81.39
255672456       QT        71.67
133111344       TB       229.19
763476994       CB       338.40
769595502       QT       310.19
464477044       QT        54.41
675155634       QT       199.07
483511995       QT       209.53
416094320       QT        83.31

and the following constraints (which might have been created by hand, or generated using the tdda discover command).

  • id (integer): Identifier for item. Should not be null, and should be unique in the table

  • category (string): Should be one of “AA”, “CB”, “QT”, “KA” or “TB”

  • price (floating point value): unit price in pounds sterling. Should be non-negative and no more than 1,000.00.
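Concretely, the checks these three constraints imply can be hand-rolled in a few lines of Python (purely illustrative; the tdda library implements all of this, and much more, for you):

```python
# Hand-rolled sketch of the three constraints above (illustrative only;
# tdda's verify/detect commands implement this for you).
ALLOWED_CATEGORIES = {'AA', 'CB', 'QT', 'KA', 'TB'}

def failing_constraints(record, seen_ids):
    """Return the names of the fields whose constraints this record fails."""
    failures = []
    if record['id'] is None or record['id'] in seen_ids:
        failures.append('id')        # max_nulls = 0 / no_duplicates
    if record['category'] not in ALLOWED_CATEGORIES:
        failures.append('category')  # allowed_values (nulls also fail here)
    price = record['price']
    if price is None or price < 0.0 or price > 1000.0:
        failures.append('price')     # max_nulls = 0 / min / max
    return failures

seen = set()
for rec in [{'id': 710316821, 'category': 'QT', 'price': 150.39},
            {'id': 516025643, 'category': 'XX', 'price': 1200.00}]:
    print(rec['id'], failing_constraints(rec, seen))  # second fails category and price
    seen.add(rec['id'])
```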

This would be represented in a TDDA file with the following constraints.

  "fields": {
    "id": {
      "type": "int",
      "max_nulls": 0,
      "no_duplicates": true
    "category": {
      "type": "string",
      "max_nulls": 0,
        ["AA", "CB", "QT", "KA", "TB"]
    "price": {
      "type": "real",
      "min": 0.0,
      "max": 1000.0,
      "max_nulls": 0

We can use the tdda verify command to verify a CSV file or a feather file1 against these constraints, and get a summary of which constraints pass and fail. If our data is in the file items.feather, the JSON constraints are in constraints.tdda, and there are some violations, we will get output like the following:

$ tdda verify items.feather constraints.tdda

id: 1 failure  2 passes  type ✓  max_nulls ✓  no_duplicates ✗

category: 1 failure  2 passes  type ✓  max_nulls ✓  allowed_values ✗

price: 2 failures  2 passes  type ✓  min ✓  max ✗  max_nulls ✗


Constraints passing: 6
Constraints failing: 4

The new tdda detect command allows us to go further and find which individual records fail.

We can use the following command to write out a CSV file, bads.csv, containing the records that fail constraints:

$ tdda detect items.feather constraints.tdda bads.csv --per-constraint --output-fields

The flag --per-constraint tells the software to write out a boolean column for each constraint, indicating whether each record passed it, and --output-fields tells the software to include all the input fields in the output.

The result is a CSV file, which we can read a bit more easily if we reformat it (using · to denote nulls) as:

       id  category        price        id  category  price     price  n_failures
                                   _nodups   _values   _max  _nonnull
                                       _ok       _ok    _ok       _ok
113791348        TQ       318.63      true     false   true      true           1
102829374        AA        65.24     false      true   true      true           1
720313295        TB     1,004.72      true      true  false      true           1
384044032        QT       478.65     false      true   true      true           1
602948968        TB       209.31     false      true   true      true           1
105983384        AA         8.95     false      true   true      true           1
444140832        QT     1132.87       true      true  false      true           1
593548725        AA       282.58     false      true   true      true           1
545398672        QT     1,026.40      true      true  false      true           1
759425162        CB     1,052.72      true      true  false      true           1
452691252        AA     1,028.19      true      true  false      true           1
105983384        QT       242.64     false      true   true      true           1
102829374        KA        71.64     false      true   true      true           1
105983384        AA        10.24     false      true   true      true           1
405321922        QT        85.23     false      true   true      true           1
102829374         ·   100,000.00     false     false  false      true           3
872018391        QT        51.69     false      true   true      true           1
862101984        QT       158.53     false      true   true      true           1
274332319        AA     1,069.25      true      true  false      true           1
827919239        QT     1,013.00      true      true  false      true           1
105983384        QT       450.68     false      true   true      true           1
102829374         ·   100,000.00     false     false  false      true           3
872018391        QT       199.37     false      true   true      true           1
602948968        KA       558.73     false      true   true      true           1
328073211        CB     1,031.67      true      true  false      true           1
405321922        TB       330.97     false      true   true      true           1
334193154        QT     1,032.31      true      true  false      true           1
194125540        TB            ·      true      true      ·     false           1
724692620        TB     1,025.81      true      true  false      true           1
862101984        QT       186.76     false      true   true      true           1
593548725        QT       196.56     false      true   true      true           1
384044032        AA       157.25     false      true   true      true           1
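The relationship between the _ok columns and n_failures is simple: each false counts as one failure, while a missing value (an inapplicable constraint) does not. A plain-Python sketch:

```python
# How n_failures relates to the per-constraint _ok columns: each False is
# one failure; None (constraint inapplicable) does not count as a failure.
def n_failures(ok_values):
    return sum(1 for v in ok_values if v is False)

# The row with the null category above: duplicate id, bad category, and a
# price of 100,000 over the maximum, but the price non-null check passes.
print(n_failures([False, False, False, True]))  # prints 3
```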

Command Line Syntax

The basic form of the command-line command is:

$ tdda detect INPUT CONSTRAINTS OUTPUT

where:
  • INPUT is normally either a .csv file, in a suitable format, or a .feather file1 containing a DataFrame, preferably with an accompanying .pmm file1
  • CONSTRAINTS is a JSON file containing constraints, usually with a .tdda suffix. This can be created by the tdda discover command, or edited by hand.
  • OUTPUT is again either a .csv or .feather file to be created with the output rows. If the pmmif library is installed, a .pmm metadata file will be generated alongside the .feather file, when .feather output is requested.


Several command line options are available to control the detailed behaviour:

  • defaults: If no command-line options are supplied:

    • only failing records will be written
    • only a record identifier and the number of failing constraints will be written.
      • When the input is Pandas, the record identifier will be the index for the failing records;
      • when the input is a CSV file, the record identifier will be the row number, with the first row after the header being numbered 1.
  • --per-constraint When this is added, an _ok column will also be written for every constraint that has any failures, with true for rows that satisfy the constraint, false for rows that do not satisfy the constraint, and a missing value where the constraint is inapplicable (which does not count as a failure).

  • --output-fields [FIELD1 FIELD2 ...] If the --output-fields flag is used without specifying any fields, all fields from the input will be included in the output. Alternatively, a space-separated list of fields may be provided, in which case only those will be included. Whenever this option is used, no index or row-number is written unless specifically requested.

  • --write-all If this flag is used, all records from the input will be included in the output, including those that have no constraint failures.

  • --index This flag forces the writing of the index (for DataFrame inputs) or row number (for CSV inputs).

  • --int When writing boolean values to CSV files (either from input data or as per-constraint output fields), use 1 for true and 0 for false.

API Access

The detection functionality is also available through the TDDA library's API with a new detect_df function, which takes similar parameters to the command line. The corresponding call, with a DataFrame df in memory, would be:

from tdda.constraints import detect_df

verification = detect_df(df, 'constraints.tdda', per_constraint=True,
                         output_fields=[])
bads_df = verification.detected()

  1. The feather file format is an interoperable way to save DataFrames from Pandas or R. Its aim is to preserve metadata better and be faster than CSV files. It has a few issues, particularly around types and nulls, and when available, we save a secondary .pmm file alongside .feather files which makes reading and writing them more robust when our extensions in the pmmif library are used. We'll do a future blog post about this, but if you install both feather and pmmif with pip install feather and pip install pmmif, and use featherpmm.write_dataframe, imported from pmmif, rather than feather.write_dataframe, you should get more robust behaviour. 

Saving Time Running Subsets of Tests with Tagging

Posted on Tue 01 May 2018 in TDDA • Tagged with tests, tagging

It is common, when working with tests for analytical processes, for test suites to take a non-trivial amount of time to run, so it is often helpful to have a convenient way to execute a subset of tests, or even a single test.

We have added a simple mechanism for allowing this to unittest-based tests in the ReferenceTest functionality of the tdda Python library. It is based on tagging tests.

The quick summary is:

  • A decorator called tag can be imported and used to decorate individual tests or whole test classes (by preceding the test function or class with @tag).

  • When a script calling ReferenceTest.main() is run, if the flag --tagged (or -1, the digit one) is used on the command line, only tagged tests and tests from tagged test classes will be run.

  • There is a second new option, --istagged (or -0, the digit zero). When this is used, the software will report only which test classes are tagged, or contain tests that are tagged, and will not actually run any tests. This is helpful if you have a lot of test classes, spread across different files, and want to change the set of tagged tests.


The situations where we find this particularly helpful are:

  • Fixing a broken test or working on a new feature or dataset. We often find ourselves with a small subset of tests failing (perhaps, a single test case), either because we're adding a new feature, or because something has changed or we are working with data that has slightly different characteristics. If the tests of interest run in a few seconds, but the whole test suite takes minutes or hours to run, we can iterate dramatically faster if we have an easy way to run only the subset of tests currently failing.

  • Re-writing test output. The tdda library provides the ability to re-write the expected ("reference") output from tests with whatever the code is currently generating, using the --write-all command-line flag. If it's only a subset of the tests that have failed, there is real benefit in re-writing only the output for the previously failing tests, rather than for all tests. This is particularly true if the reference outputs contain some differences each time (version numbers, dates etc.) that are being ignored using the ignore-lines or ignore-patterns options provided by the library. If we re-write all the tests, and then look at which files have changed, we might see differences in all reference files, whereas if we only regenerate the tests with meaningful changes, we will avoid committing changes that were not required.


We'll illustrate this with a simple example. The code below implements four trivial tests across two classes.

Note the import of the tag decorator function near the top, and that two of the tests—testTwo and testThree in the class Tests—are decorated with the @tag decorator, as is the entire test class MoreTests.

from tdda.referencetest import ReferenceTestCase, tag

class Tests(ReferenceTestCase):
    def testOne(self):
        self.assertEqual(1, 1)

    @tag
    def testTwo(self):
        self.assertEqual(2, 2)

    @tag
    def testThree(self):
        self.assertEqual(3, 3)

    def testFour(self):
        self.assertEqual(4, 4)

@tag
class MoreTests(ReferenceTestCase):
    def testFive(self):
        self.assertEqual(5, 5)

    def testSix(self):
        self.assertEqual(6, 6)

if __name__ == '__main__':
    ReferenceTestCase.main()
If we run this as normal, all six tests run and pass:

$ python
......
----------------------------------------------------------------------
Ran 6 tests in 0.000s

OK

But if we add the -1 flag, only four tests run:

$ python -1
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK

Adding the -v (verbose) flag confirms that these four are the tagged tests, as expected:

$ python -1 -v
testFive (__main__.MoreTests) ... ok
testSix (__main__.MoreTests) ... ok
testThree (__main__.Tests) ... ok
testTwo (__main__.Tests) ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK


Finally, if we want to find out which classes include tagged tests, we can use the -0 flag:

$ python -0

----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK


This is particularly helpful when our tests are spread across multiple files, as the filenames are then shown as well as the class names.


Information about installing the library is available in this post.

Other Features

Other features of the ReferenceTest capabilities of the tdda library are described in this post. Its capabilities in the area of constraint discovery and verification are discussed in this post, and this post.

Our Approach to Data Provenance

Posted on Tue 12 December 2017 in TDDA • Tagged with data lineage, data provenance, data governance, tdda, constraints, miro

NEW DATA GOVERNANCE RULES: — We need to track data provenance. — No problem! We do that already! — We do? — We do! — (thinks) Results2017_final_FINAL3-revised.xlsx

Our previous post introduced the idea of data provenance (a.k.a. data lineage), which has been discussed on a couple of podcasts recently. This is an issue that is close to our hearts at Stochastic Solutions. Here, we'll talk about how we handle this issue, both methodologically and in our Miró software.

We'll focus on seven key ideas from our approach to data provenance:

  1. Automatic Logging
  2. Audit trail stored in datasets
  3. Recording of field definitions and contexts
  4. Constraint generation and verification
  5. Data Dictionary Generation & Import
  6. Data signatures (hashing)
  7. Comparing datasets (diff commands)

Automatic Logging

Automatic logging provides a concrete record of all analysis performed in the software.

Our analysis software, Miró, is normally used through a scripting interface that automatically writes several detailed logs. Of these, the most important are:

  • A log of all commands executed (in editable, re-executable form)

  • A full interleaved log of commands issued and the resulting output, in several forms (including plain text and rich HTML).

Logs build up in a year/month/day/session hierarchy indefinitely, providing a comprehensive (and searchable) record of the analysis that has been performed.

Even when the software is used through the API, the sequence of operations is recorded in the log, though in that case the ability to re-execute the operations is normally lost.

Audit trail

The audit trail in a dataset tracks changes to the data and metadata across sessions, users, and machines, making it possible to see the sequence of operations that led to the current state of the data.

Like a table in a database, or a DataFrame in Pandas or R, Miró's most important data structure is a dataset—a tabular structure with named fields as columns and different observations (records) stored in rows. These form a column-based data store, and datasets can be saved to disk with a .miro extension—a folder that contains the typed data together with rich metadata.

Every time a change is made to a dataset, the operation that caused the change is recorded in the Audit Trail section of the dataset. This is true both for changes to data and to metadata:

  • Examples of changes to data in a dataset include creating new fields, deleting fields, filtering out records, appending new records and (more rarely) changing the original values in the data.1

  • Miró maintains many types of metadata, including field descriptions, field and dataset tags, binnings on fields, long and short names, custom colours and labels, a current record selection (filter) and various formatting information.

In this post, we'll illustrate these concepts using the following small dataset containing transactions for four different customers, identified by id:

id  date                 categ     amount  days-since-prev
 1  2009-01-31 00:00:00  A       1,000.00
 2  2009-02-02 00:00:00  A       2,000.00
 2  2009-02-02 22:22:22  B       2,222.22             0.93
 3  2009-03-03 00:00:00  A       1,000.00
 3  2009-03-03 13:33:33  B       3,000.00             0.56
 3  2009-03-03 23:33:33  B       3,333.33             0.42
 4  2009-04-04 00:00:00  A       1,000.00
 4  2009-04-04 04:44:44  B       1,111.11             0.20
 4  2009-04-04 14:44:44  B                            0.42
 4  2009-04-04 20:44:44  B       4,444.44             0.25

Here is the audit trail recorded in that dataset:

Date Description Command Host Session Line
2017/12/07 14:05:30 Load from flat file /Users/njr/python/artists/miro/testdata/trans.csv load testdata/trans.csv godel.local /Users/njr/miro/log/2017/12/02/session142 1
2017/12/07 14:05:30 Save as /miro/datasets/trans.miro save trans godel.local /Users/njr/miro/log/2017/12/02/session142 3
2017/12/11 12:48:39 Set dataset description to "Some demo transactions". description -d "Some demo transactions" bartok.local /Users/njr/miro/log/2017/12/11/session013 2
2017/12/11 12:48:39 Set description for field amount to "transaction value (GBP)". description amount "transaction value (GBP)" bartok.local /Users/njr/miro/log/2017/12/11/session013 3
2017/12/11 12:48:39 Defined field days-since-prev as (/ (- date (prev-by date id)) 24 60 60) def days-since-prev (/ (- date (prev-by date id)) 24 60 60) bartok.local /Users/njr/miro/log/2017/12/11/session013 4
2017/12/11 12:48:39 Tag field days-since-prev with L="Time Since Previous Transaction (days)". tag L="Time Since Previous Transaction (days)" days-since-prev bartok.local /Users/njr/miro/log/2017/12/11/session013 5
2017/12/11 12:48:39 Save as /miro/datasets/trans.miro save bartok.local /Users/njr/miro/log/2017/12/11/session013 6

In this case, the history is quite short, but includes information about

  • where the data originally came from (first line)
  • when the data has been saved (second and seventh lines)
  • metadata changes (third, fourth and sixth lines)
  • changes to the data content (in this case, creation of a new field, days-since-prev).
  • detail about when (column 1) and where (column 4) changes were made, in what session these occurred (column 5, linking to the logs), what commands were used (column 3) and where in the log files to find those commands and context (column 6).

Field Definitions and Contexts

Fields remember their definitions, including—where relevant—the context in which they were created. This allows us to understand where any value in a dataset came from, as a sequence of transformations of input values.

In the previous section, we saw that the audit trail contained information about a derived field in the data, including the expression used to derive it. That information is also available as a basic property of the field:

[2]> ls -D days-since-prev
          Field                                 Definition
days-since-prev    (/ (- date (prev-by date id)) 24 60 60)
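For readers unfamiliar with Miró's expression language, the definition divides the gap (in seconds) between each transaction and the previous transaction for the same id by 24 × 60 × 60 to give days. A plain-Python sketch of the same derivation (illustrative, not Miró code, and assuming the transactions are sorted by date):

```python
from datetime import datetime

# Plain-Python sketch of the days-since-prev derivation (illustrative, not
# Miró code): seconds since the previous transaction for the same id,
# divided by 24 * 60 * 60 to convert to days. Assumes input sorted by date.
def days_since_prev(transactions):
    prev = {}     # id -> date of that id's previous transaction
    results = []
    for tid, date in transactions:
        if tid in prev:
            results.append((date - prev[tid]).total_seconds() / (24 * 60 * 60))
        else:
            results.append(None)  # first transaction for this id: null
        prev[tid] = date
    return results

transactions = [(2, datetime(2009, 2, 2, 0, 0, 0)),
                (2, datetime(2009, 2, 2, 22, 22, 22))]
print(days_since_prev(transactions))  # second value is roughly 0.93 days
```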

In other cases, entire datasets are created by taking "measurements" from a base dataset, such as the transactional data shown above. For example, we might want to create a dataset that measures how many transactions each customer has, and their total value.

One way of doing this in Miró is with the following command:

[4]> xtab -d -R MEASURES count sum amount by id
MEASURES: 4 records; 4 (100%) selected; 3 fields.

which creates a new dataset. As you might expect, this is the result:

id count sumamount
1 1 1,000.00
2 2 4,222.22
3 3 7,333.33
4 4 6,555.55

The field definitions in the newly created dataset are attached to the fields by default, as we can see:

[6]> ls -D
    Field                                  Definition
       id             Rollup variable [in trans.miro]
    count         count by id [in dataset trans.miro]
sumamount    sum amount by id [in dataset trans.miro]

Of course, the audit trail for MEASURES also contains this information, together with more detailed information about the session or sessions in which the relevant commands were issued.

Constraint Generation and Verification

Automatically generated constraints can be used to identify anomalous and possibly incorrect data within a source dataset. They can also be used to check that new data with the same structure has similar or expected properties.

We've talked in previous posts (here, here, here, and here) and in a white paper about automatically generating constraints that characterize either all of, or an inferred "good" subset of, the data in a dataset. Such constraints are useful for finding bad and anomalous data in the original dataset, and also for checking ("verifying") new data as it comes in.

We won't go over all the details of constraint generation and verification in this post, but do note that this relates to Roger Peng's idea, discussed in the last post, of tracking changes to data tests as a surrogate for tracking changes to the actual data. Obviously, having generated constraints, it's a good idea to store the constraints under version control, to facilitate such tracking. More directly, the results of verification allow you to see some changes to data directly.
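To make the idea concrete, here is a minimal sketch of constraint generation and verification in plain Python. The function names and the (very small) constraint set are illustrative, not the API of the open-source tdda library, which provides a much fuller implementation for real datasets:

```python
def discover_constraints(values):
    """Infer simple constraints (type, min, max, nulls) from example data."""
    non_null = [v for v in values if v is not None]
    return {
        "type": type(non_null[0]).__name__,
        "min": min(non_null),
        "max": max(non_null),
        "nulls_allowed": len(non_null) < len(values),
    }

def verify(values, c):
    """Return a list of (value, reason) violations for new data."""
    failures = []
    for v in values:
        if v is None:
            if not c["nulls_allowed"]:
                failures.append((v, "null"))
        elif not (c["min"] <= v <= c["max"]):
            failures.append((v, "range"))
    return failures

c = discover_constraints([1000.0, 2222.22, None, 4444.44])
print(c["min"], c["max"], c["nulls_allowed"])   # 1000.0 4444.44 True
print(verify([1500.0, 9999.99], c))             # [(9999.99, 'range')]
```

Storing the dictionary produced by the discovery step under version control is what makes the change-tracking idea above workable: a diff of the constraints file is a (partial) diff of the data's properties.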

Data Dictionary

The data dictionary provides a useful reference for any user of the data, particularly when it includes helpful (and up-to-date) annotations. By making it easily editable, we encourage users to record useful information alongside the data.

Miró can generate a data dictionary from the data. This contains:

  • Summary information about the overall dataset
  • Per-field information, including

    • basic metadata about each field, including name and type
    • some information about the range of each field, including minimum and maximum values, null count etc.
    • further characterization information, including whether there are duplicate values,
    • any annotations that have been added to the dataset (such as descriptions, alternate names, tags etc.)
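As a rough illustration of how the data-derived part of a per-field entry might be computed, here is a hypothetical helper (not Miró's implementation):

```python
def field_summary(name, ftype, values, description=""):
    """Compute the data-derived data-dictionary entries for one field."""
    non_null = [v for v in values if v is not None]
    return {
        "Name": name,
        "Type": ftype,
        "Min": min(non_null) if non_null else None,
        "Max": max(non_null) if non_null else None,
        "# nulls": len(values) - len(non_null),
        "Any duplicates": "yes" if len(set(non_null)) < len(non_null) else "no",
        "Description": description,   # annotation, supplied not derived
    }

row = field_summary("amount", "real",
                    [1000.0, 2000.0, 2222.22, None, 4444.44],
                    "transaction value (GBP)")
print(row["Min"], row["Max"], row["# nulls"], row["Any duplicates"])
# 1000.0 4444.44 1 no
```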

Here's an example of the first part of the data dictionary for our transaction dataset:

Name trans.miro
Path /miro/datasets/trans.miro
Host bartok.local
Hash 0971dde52de7bc2fb2ad2282a572f6ca295c33fae105d8b0fab7a618f4c70b71
Description Some demo transactions
Creation Date 2017-12-11 13:33:36
Number of Records 10
Number of Fields 5

And here's the second part, with an entry for each field:

Name Type Min Max Min Length Max Length # nulls # empty/zero # positive # negative Any duplicates Values Description Tags Long Name Short Name Definition
id int 1 4 0 0 10 0 yes
date date 2009-01-31T00:00:00 2009-04-04T20:44:44 0 0 10 0 no
categ string A B 1 1 0 0 yes "A" "B"
amount real 1,000.00 4,444.44 1 0 9 0 yes transaction value (GBP)
days-since-prev real 0.20 0.93 4 0 6 0 yes Time Since Previous Transaction (days) (/ (- date (prev-by date id)) 24 60 60)

While some of the information in the data dictionary is derived directly from the data, other parts (descriptions, alternate names, and tags) are created by annotation actions, whether by humans, bots or scripts. Although there are Miró commands for setting all the (editable) metadata properties, to encourage maximum use, Miró can also export the metadata to a spreadsheet, where users can update it, and then the appropriate parts of the metadata can be re-imported using Miró's importmetadata command.

Data signatures

A data signature is a very compact way to summarize a dataset. This allows quick and efficient checking that analysis is being performed using the correct data.

One of the properties reported in the data dictionary (the first table above) is a hash. If you're not familiar with hashing, a hash function is one that takes a (typically large) input and converts it to a much smaller output. Hash functions are designed so that different inputs tend (but are not guaranteed) to map to different outputs. So if you store the hashes of two large objects and they are different, you can be certain that the objects are different. If the hashes are the same, this does not absolutely guarantee that the objects are the same, but hash functions are designed to make so-called "hash collisions" extremely rare, especially between similar inputs. For most practical purposes, therefore, we can safely assume that if two hashes are the same, the inputs were the same.2

The point of storing the hash is that it acts as a much smaller, very efficient proxy for our original data, and if we want to know whether some dataset we have lying around contains the same data as the one used to generate the data dictionary, all we have to do is compare the hashes.

The hashes that Miró constructs for fields use only the data values, not any metadata, as inputs.3 This is a choice, of course, and we could also hash some or all of the metadata, but our primary concern here is whether we have the same underlying data or not, so we view it as an advantage that the hash is based solely on the data values.

There is an option to store hashes for individual fields, as well, which we have not used in generating the data dictionary shown.
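As a sketch of the idea (not Miró's actual hashing scheme), a per-field hash can be built with a standard-library hash function by feeding in the values in order, with nulls contributing a marker byte so that they affect the result too:

```python
import hashlib

def field_hash(values):
    """Hash a field's values (in order); nulls contribute a marker byte."""
    h = hashlib.sha256()
    for v in values:
        h.update(b"\x00" if v is None else repr(v).encode("utf-8"))
        h.update(b"\x1e")   # separator so adjacent values can't merge
    return h.hexdigest()

a = field_hash([1000.0, 2222.22, None])
b = field_hash([1000.0, 2222.22, None])
c = field_hash([1000.0, 2222.23, None])
print(a == b, a == c)   # True False
```

Note that the hash depends on value order, so two fields with the same values in different orders hash differently, exactly as described in footnote 3.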

Comparing Datasets (diff commands)

The ability to compare two datasets, and when they are different, to see clearly what the differences are, is as fundamental as the ability to compare two files in Unix or git, or to perform a track changes operation on a Word document.

If we want to compare two datasets to see if they are the same, comparing hashes is a very efficient way to do so.

If, however, we actually want to understand the differences between datasets, we need something more like Unix's diff command, which was discussed in the previous post, or Microsoft Word's Compare Documents functionality.

Miró includes a ddiff command for comparing two datasets. Let's look at an example.

Here's our transaction data again, without the derived days-since-prev field:

id date categ amount
1 2009-01-31 00:00:00 A 1,000.00
2 2009-02-02 00:00:00 A 2,000.00
2 2009-02-02 22:22:22 B 2,222.22
3 2009-03-03 00:00:00 A 1,000.00
3 2009-03-03 13:33:33 B 3,000.00
3 2009-03-03 23:33:33 B 3,333.33
4 2009-04-04 00:00:00 A 1,000.00
4 2009-04-04 04:44:44 B 1,111.11
4 2009-04-04 14:44:44 B
4 2009-04-04 20:44:44 B 4,444.44

and here's a variant of it:

id date categ amount
1 2009-01-31 00:00:00 A 1,000.00
2 2009-02-02 00:00:00 A 2,000.00
2 2009-02-02 22:22:22 B 2,222.22
3 2009-03-03 00:00:00 A 1,000.00
3 2009-03-03 13:33:33 B 3,000.00
3 2009-03-03 23:33:33 B 3,333.33
4 2009-04-04 00:00:00 A 1,000.00
4 2009-04-04 04:44:44 B 1,111.11
4 2009-04-04 14:44:44 B 3,874.18
4 2009-04-04 20:44:44 A 4,444.44

These datasets are small enough that you can probably see the differences by inspection (though it might take a little while to be confident that you've spotted them all), but when there are millions of rows and hundreds of columns, that becomes less easy.

Using the hash trick we talked about previously, we can see whether there are any differences, and slightly more besides. Assuming the current working dataset in Miró is the first, and the second is in TRANS2, we can hash them both:

[46]>  hash
19451c13321284f2b0dd7736b75b443945fac1eae08a8118d600ec0d49b6bf87 id
f3c08f1d2d23abaa06da0529237f63bf8099053b9088328dfd5642d9b06e8f6a date
3a0070ae42b9f341e7e266a18ea1c78c7d8be093cb628c7be06b6175c8b09f23 categ
f5b3f6284f7d510f22df32daac2784597122d5c14b89b5355464ce05f84ce120 amount

b89ae4b74f95187ecc5d49ddd7f45a64849a603539044ae318a06c2dc7292cf9 combined

[47]>  TRANS2.hash
19451c13321284f2b0dd7736b75b443945fac1eae08a8118d600ec0d49b6bf87 id
f3c08f1d2d23abaa06da0529237f63bf8099053b9088328dfd5642d9b06e8f6a date
fcd11dbd69eee0bf6d2a405a6e4ef9227bb3f0279d9cc7866e2efe5b4c97112c categ
64c5c97e9e9676ec085b522303d75ff11b0ebe01a1ceebaf003719b3718f12bb amount

2e171d2a24183e5e25bbcc50d9cd99ad8b4ca48ee7e1abfa6027edd291a22584 combined

We can see immediately that these are different, but that the individual hashes for the id and date fields are the same, indicating that their content is (almost certainly) the same. It's the categ and amount fields that differ between the two datasets.

We can use the ddiff command to get a more detailed diagnosis:

[49]>  ddiff -P TRANS2

Number of differences       Field Pair
           0:                   id : id-2
           0:                 date : date-2
           1:                categ : categ-2
           1:               amount : amount-2

Diff fields:
           1: diff-amount
           1: diff-categ

Total number of differences found: 2

The output here confirms that there are no differences between the id and date fields in the two datasets, but that one value differs for each of the categ and amount fields. The -P flag that we passed to the ddiff command told it to preserve information about the differences, and if we now look at the data, we see five extra fields (on the first dataset)—the fields as they were in the other dataset, TRANS2. Miró also creates an overall diff field showing whether each record has any differences across the two datasets.

[50]> show
id date categ categ-2 diff-categ amount amount-2 diff-amount diff
1 2009-01-31 00:00:00 A A 0 1,000.00 1,000.00 0 0
2 2009-02-02 00:00:00 A A 0 2,000.00 2,000.00 0 0
2 2009-02-02 22:22:22 B B 0 2,222.22 2,222.22 0 0
3 2009-03-03 00:00:00 A A 0 1,000.00 1,000.00 0 0
3 2009-03-03 13:33:33 B B 0 3,000.00 3,000.00 0 0
3 2009-03-03 23:33:33 B B 0 3,333.33 3,333.33 0 0
4 2009-04-04 00:00:00 A A 0 1,000.00 1,000.00 0 0
4 2009-04-04 04:44:44 B B 0 1,111.11 1,111.11 0 0
4 2009-04-04 14:44:44 B B 0 3,874.18 1 1
4 2009-04-04 20:44:44 B A 1 4,444.44 4,444.44 0 1

This makes it easy to identify and select only those fields or records with differences, which is one of the key tasks when trying to track data lineage.
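The core of such a comparison can be sketched in a few lines of Python. This is a simplified illustration in the spirit of ddiff -P, not Miró's implementation, using an implicit join on record number:

```python
def ddiff(rows1, rows2, fields):
    """Compare two equal-length tables column by column, returning
    per-field difference counts and a per-record diff flag."""
    counts = {f: 0 for f in fields}
    row_flags = []
    for r1, r2 in zip(rows1, rows2):
        any_diff = 0
        for f in fields:
            if r1[f] != r2[f]:
                counts[f] += 1
                any_diff = 1
        row_flags.append(any_diff)
    return counts, row_flags

# the two rows that differ in the example above
t1 = [{"categ": "B", "amount": None},    {"categ": "B", "amount": 4444.44}]
t2 = [{"categ": "B", "amount": 3874.18}, {"categ": "A", "amount": 4444.44}]
counts, flags = ddiff(t1, t2, ["categ", "amount"])
print(counts, flags)   # {'categ': 1, 'amount': 1} [1, 1]
```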

As powerful as Miró's ddiff and related commands are, there is also much more that we would like (and plan) to support. Our comparison is fundamentally based on joining the two datasets (either on one or more nominated key fields, or, as in this case, implicitly on record number). When we are using a join key, it's quite easy to deal with row addition and deletion, but that is harder when we are just joining on record number. It would be useful to have a Unix diff-like ability to spot single rows or groups of rows that have been inserted, deleted, or re-ordered, but we don't have that today. In certain cases, spotting other kinds of systematic edits would be interesting—for example, thinking of the table in spreadsheet-like terms, it would be useful to spot cases in which blocks of cells shift up, down, left or right. This situation is not very common in the data we most commonly work with, but there are domains in which those sorts of changes might be frequent.

What Next?

We surveyed a few of the ways we think about and implement features in our software (and workflows) to help track data provenance and data lineage. There's a great deal more we could do, and over time we will almost certainly add more. Hopefully these ideas will prove useful and interesting, and obviously any of you fortunate enough to use Miró can try them out.

We'll keep you posted as we extend our thinking.


Thanks to our Social Media Manager for actively policing our content even as it was being created.

Alfie, our Social Media Manager, inspects progress on the blog post.

  1. Changing (original) values in data is actually so rare that Miró provides very few facilities for doing so directly, but data can effectively be changed through deleting one field or record and adding another, and there are a couple of operations that can change values in place—primarily anonymization operations and functions that update special, automatically created fields, albeit usually by deletion and regeneration. In fact, one of the rules on our "unwritten list" of good practices is never to replace an input field with an edited copy, but instead always to derive a new field with a variant name, so that when we see a field from source data we know its values have not been altered. 

  2. A familiar example comes from git, which uses hashes to compare the contents of files efficiently, and also uses a SHA-1 hash to identify a commit, by hashing all of the important information about that commit. 

  3. The overall hash depends on the name of the fields and their order, as well as the data in the fields, but the individual field hashes do not. As a result, two fields containing the same values (in the same order) will receive the same hash, but datasets in which fields have been renamed or reordered will have different overall hashes. We should also note that missing values (NULLs) also contribute to the field hashes. 

Data Provenance and Data Lineage: the View from the Podcasts

Posted on Thu 30 November 2017 in TDDA • Tagged with data lineage, data provenance, data governance, tdda, constraints

In Episode 49 of the Not So Standard Deviations podcast, the final segment (starting at 59:32) discusses data lineage, after Roger Peng listened to the September 3rd (2017) episode of another podcast, Linear Digressions, which discussed that subject.

This is a topic very close to our hearts, and I thought it would be useful to summarize the discussions on both podcasts, as a precursor to writing up how we approach some of these issues at Stochastic Solutions—in our work, in our Miró software and through the TDDA approaches discussed on this blog.

It probably makes sense to begin by summarizing the Linear Digressions episode, in which Katie Malone explains the ideas of data lineage (also known as data provenance) as the tracking of changes to a dataset.

Any dataset starts from one or more "original sources"—the sensors that first measured the quantities recorded, or the system that generated the original data. In almost all cases, a series of transformations is then applied to the data before it is finally used in some given application. For example, in machine learning, Katie describes typical transformation stages as:

  1. Cleaning the data
  2. Making additions, subtractions and merges to the dataset
  3. Aggregating the data in some way
  4. Imputing missing values

She describes this as the process view of data lineage.

An alternative perspective focuses less on the processes than the resulting sequence of datasets, as snapshots. As her co-host, Ben Jaffe points out, this is more akin to the way version control systems view files or collections of files, and diff tools (see below) effectively reconstruct the process view1 from the data.

In terms of tooling, Katie reckons that the tools for tracking data lineage are relatively well developed (but specialist) in large scientific collaborations such as particle physics (she used to work on the LHC at CERN) and genomics, but her sense is that tools are less well developed in many business contexts.

She then describes five reasons to care about data provenance:

  1. (To improve/track/ensure) data quality
  2. (To provide) an audit trail
  3. (To aid) replication (her example was enabling the rebuilding of a predictive model that had been lost, if you knew how it had been produced—subject, obviously, not only to having the data but also full details of the parameters, training regime and any random number seeds used)
  4. (To support) attribution (e.g. providing a credit to the original data collector when publishing a paper)
  5. Informational (i.e. to keep track of, and aid navigation within, large collections of datasets).

I recommend listening to the episode.

In Not-so-Standard Deviations 49, Roger Peng introduces the idea of tracking and versioning data, as a result of listening to Katie and Ben discuss the issue on their podcast. Roger argues that while you can stick a dataset into Git or other version control software, doing so is not terribly helpful because, most of the time, the dataset acts essentially as a blob,2 rather than as a structured entity that diff tools help you to understand.

In this, he is exactly right. In case you're not familiar with version control and diff tools, let me illustrate the point. In a previous post on Rexpy, I added a link to a page about our Miró software between two edits. If I run the relevant git diff command, this is its output:

git diff of two versions of the markdown for a blog post

As you can see, this shows pretty clearly what's changed. Using a visual diff tool, we get an even clearer picture of the changes (especially when the changes are more numerous and complex):

visual diff (opendiff) of two versions of the markdown for a blog post

In contrast, if I do a diff on two Excel files stored in Git, I get the following:

git diff b1b85ddc448a723845c36688480cfe5072f28c1a -- test-excel-sheet1.xlsx
diff --git a/testdata/test-excel-sheet1.xlsx b/testdata/test-excel-sheet1.xlsx
index 0bd63cb..91e5b0e 100644
Binary files a/testdata/test-excel-sheet1.xlsx and b/testdata/test-excel-sheet1.xlsx differ

This is better than nothing, but gives no insight into what has changed. (In fact, it's worse than it looks, because even changes to the metadata inside an Excel file—such as the location of the selected cell—will cause the files to be shown as different. As a result, there are many hard-to-detect false positives when using diff commands with binary files.)

Going back to the podcast, Hilary Parker then talked about the difficulty of the idea of data provenance in the context of streaming data, but argued there was a bit more hope with batch processes, since at least the datasets used then are "frozen".

Roger then argued that there are good custom tools used in particular places like CERN, but those are not tools he can just pick up and use. He then wondered aloud whether such tools cannot really exist, because they require too much understanding of analytical goals. (I don't agree with this, as the next post will show.) He then rowed back slightly, saying that maybe it's too hard for general data, but more reasonable in a narrower context such as "tidy data".

If you aren't familiar with the term "tidy data", it really just refers to data stored in the same way that relational databases store data in tables, with columns corresponding to variables, rows corresponding to whatever the items being measured are (the observations), consistent types for all the values in a column and a resulting regular, grid-like structure. (This contrasts with, say, JSON data, which is hierarchical, and with many spreadsheets, in which different parts of the sheet are used for different things, and with data in which different observations are in columns and the different variables are in rows.) So "tidy data" is an extremely large subset of the data we use in structured data analysis, at least after initial regularization.
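The contrast is easy to see in code. Below, the same two transactions are shown in a tidy, table-like form and in a hierarchical (JSON-like) form; the values are taken from the example dataset above:

```python
# Tidy: one row per observation, one column (key) per variable.
tidy = [
    {"id": 1, "date": "2009-01-31", "categ": "A", "amount": 1000.00},
    {"id": 2, "date": "2009-02-02", "categ": "A", "amount": 2000.00},
]

# Not tidy: hierarchical, with observations nested per entity.
nested = {
    "1": {"transactions": [{"date": "2009-01-31", "amount": 1000.00}]},
    "2": {"transactions": [{"date": "2009-02-02", "amount": 2000.00}]},
}

# The tidy form gives every value a simple (row, column) address,
# which is what makes generic diff and constraint tools feasible.
print(tidy[0]["amount"])   # 1000.0
```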

Hilary then mentioned an R package called testdat, which she had worked on at an "unconference". This aimed to check things like missing values and contiguity of dates in datasets. These ideas are very similar to those of constraint generation and verification, which we frequently discuss on this blog, as supported in the TDDA package. But Hilary (and others?) concluded that the package was not really useful, and that what was more important was tools to make writing tests for data easier. (I guess we disagree that general tools like this aren't useful, but very much support the latter point.)

Roger then raised the idea of tracking changes to such tests as a kind of proxy for tracking changes to the data, though clearly this is a very partial solution.

Both podcasts are interesting and worth a listen, but both of them seemed to feel that there is very little standardization in this area, and that it's really hard.

We have a lot of processes, software and ideas addressing many aspects of these issues, and I'll discuss them and try to relate them to the various points raised here in a subsequent post, fairly soon, I hope.

  1. Strictly speaking, diff tools cannot know what the actual processes used to transform the data were, but construct a set of atomic changes (known as patches) that are capable of transforming the old dataset into the new one. 

  2. a blob, in version control systems like Git, is a Binary Large OBject. Git allows you to store blobs, and will track different versions of them, but all you can really see is whether two versions are the same, whereas for "normal" files (text files), visual diff tools normally allow you to see easily exactly what changes have been made, much like the track changes feature in Word documents. 

Automatic Constraint Generation and Verification White Paper

Posted on Fri 06 October 2017 in TDDA • Tagged with tdda, constraints, verification, bad data

We have a new White Paper available:

Automatic Constraint Generation and Verification


Correctness is a key problem at every stage of data science projects: completing an entire analysis without a serious error at some stage is surprisingly hard. Even errors that reverse or completely invalidate the analysis can be hard to detect. Test-Driven Data Analysis (TDDA) attempts to identify, reduce, and aid correction of such errors. A core tool that we use in TDDA is Automatic Constraint Discovery and Verification, the focus of this paper.

Download White Paper here

Constraint Generation in the Presence of Bad Data

Posted on Thu 21 September 2017 in TDDA • Tagged with tdda, constraints, discovery, verification, suggestion, cartoon, bad data

Bad data is widespread and pervasive.1

Only datasets and analytical processes that have been subject to rigorous and sustained quality assurance processes are typically capable of achieving low or zero error rates. "Badness" can take many forms and have various aspects, including incorrect values, missing values, duplicated entries, misencoded values, values that are inconsistent with other entries in the same dataset, and values that are inconsistent with those in other datasets, to name but a few.

Person A: 'After our data quality drive, I am happy to say we know of no remaining bad data.' Person B: 'But every day I find problems, and mail the bad data account. They never get fixed.' A: 'Sure, but that's an unmonitored account.' (Covers ears, closes eyes.) A: 'Like I said, we know of no remaining bad data.' B: (Thinks): 'What is this hell?'

We have previously discussed automatic constraint discovery and verification as a mechanism for finding, specifying, and executing checks on data, and have advocated using these to verify input, output, and intermediate datasets. Until now, however, such constraint discovery has been based on the assumption that the datasets provided to the discovery algorithm contain only good data—a significant limitation.

We have recently been thinking about ways to extend our approach to constraint generation to handle cases in which this assumption is relaxed, so that the "discovery" process can be applied to datasets that include bad data. This article discusses and illustrates one such approach, which we have prototyped, as usual, in our own Miró software. We plan to bring the same approach to the open-source Python tdda library as soon as we have gained further experience using it and convinced ourselves we are going in a good direction.

The most obvious benefit of extending constraint generation to function usefully even when the dataset used contains some bad data is increased applicability. Perhaps a more important benefit is that it means that, in general, more and tighter constraints will be generated, potentially increasing their utility and more clearly highlighting areas that should be of concern.

The Goal and Pitfalls: "Seeing Through" Bad Data

Our aim is to add to constraint generation a second mode of operation whose goal is not to discover constraints that are true over our example data, but rather to generate constraints for which there is reasonable evidence, even if some of them are not actually satisfied by all of the example data. This is not so much constraint discovery as constraint suggestion. We want the software to attempt to "see through" the bad data to find constraints similar to those that would have been discovered had there been only good data. This is hard (and ill specified) because the bad data values are not identified, but we shall not be deterred.

For constraints generated in this way, the role of a human supervisor is particularly important: ideally, the user would look at the constraints produced—both the ordinary ones actually satisfied by the data and the more numerous, tighter suggestions typically produced by the "assuming bad data" version—and decide which to accept on a case-by-case basis.

Extending constraint generation with the ability to "see through" bad data is very powerful, but carries obvious risks: as an example, constraint suggestion might look at lending data and conclude that defaults (non-payment of loans—hopefully a small proportion of customers) are probably outliers that should be excluded from the data. Clearly, that would be the wrong conclusion.

Notwithstanding the risks, our initial experiences with generation in the presence of bad data have been rather positive: the process has generated constraints that have identified problems of which we had been blissfully unaware. The very first time we used it highlighted:

  • a nearly but not-quite unique identifier (that should, of course, have been unique)
  • a number of low-frequency, bad categorical values in a field
  • several numeric fields with much smaller true good ranges than we had realized.

Additionally, even when the constraints discovered go beyond identifying incorrect data, they often highlight data that we are happy to have flagged because they do represent outliers deserving of scrutiny.

Our Approach

Our initial attempt at extending constraint discovery to the case in which there is (or might be) bad data is straightforward: we simply allow that a proportion of the data might be bad, and look for the best constraints we can find on that basis.

With our current implementation, in Miró, the user has to supply an upper limit on the proportion, p, of values that are thought likely to be bad. Eventually, we expect to be able to determine this proportion heuristically, at least by default. In practice, for datasets of any reasonable size, we are mostly just using 1%, which seems to be working fairly well.

We should emphasize that the "bad" proportion p provided is not an estimate of how much of the data is bad, but an upper limit on how much we believe is reasonably likely to be bad.2 As a result, when we use 1%, our goal is not to find the tightest possible constraints that are consistent with 99% of data, but rather to work with the assumption that at least 99% of the data values are good, and then to find constraints that separate out values that look very different from the 99%. If this turns out to be only 0.1%, or 0.001%, or none at all, so much the better.

The way this plays out is different for different kinds of constraint. Let's work through some examples.

Minimum and Maximum Constraints

Looking for minimum and maximum constraints in the presence of possible bad values is closely connected to univariate outlier detection. We cannot, however, sensibly make any assumptions about the shape of the data in the general case, so parametric statistics (means, variances etc.) are not really appropriate. We also need to be clear that what we are looking for is not the tails of some smooth distribution, but rather values that look as if they might have been generated by some completely different process, or drawn from some other distribution.

An example of the kind of thing we are looking for is something qualitatively like the situation in the top diagram (though perhaps considerably more extreme), where the red and blue sections of the distribution look completely different from the main body, and we want to find cutoffs somewhere around the values shown. This is the sort of thing we might see, for example, if the main body of values are prices in one currency and the outliers are similar prices, but measured in two other currencies. In contrast, we are not aiming to cut off the tails of a smooth, regular distribution, such as the normal distribution shown in the lower diagram.

Top: a distribution with a double-peaked centre, and two small (coloured) peaks well to the left and right of the centre. Bottom: a normal distribution, with a small part of each tail coloured.

In our initial implementation we have used the possible bad proportion p to define an assumed "main body" of the distribution as running between the quantiles at p and (1 – p) (from the 1st percentile to the 99th, when p = 1%). We assume all values in this range are good. We then search out from this main body of the distribution, looking for a gap between successive values that is at least α times the (1%–99%) interquantile range. We are using α = 0.5 for now. Once we have identified a gap, we then pick a value within it, currently favouring the end nearer the main distribution and also favouring "round" numbers.
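Here is a simplified Python sketch of that search, for the maximum side only. It is illustrative: the real implementation uses proper quantile estimation and favours "round" cutoffs, whereas this version just returns the last value before the first sizable gap:

```python
def quantile(sorted_vals, q):
    """Crude nearest-rank quantile on pre-sorted data."""
    return sorted_vals[min(int(q * len(sorted_vals)), len(sorted_vals) - 1)]

def suggest_max(values, p=0.01, alpha=0.5):
    """Suggest an upper limit by searching upward from the (1 - p) quantile
    for a gap of at least alpha times the p..(1 - p) interquantile range."""
    s = sorted(values)
    lo, hi = quantile(s, p), quantile(s, 1 - p)
    iqr = hi - lo
    distinct = sorted(set(v for v in s if v >= hi))
    for a, b in zip(distinct, distinct[1:]):
        if b - a >= alpha * iqr:
            return a    # favour the end of the gap nearer the main body
    return s[-1]        # no convincing gap: keep the observed maximum

# main body of values 0..999, plus three extreme outliers
data = list(range(1000)) + [5000, 5100, 9900]
print(suggest_max(data))   # 999: the outliers lie beyond the first big gap
```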

Let's look at an example.

We have a dataset with 12,680,141 records and 3 prices (floating-point values), Price1, Price2 and Price3, in sterling. If we run ordinary TDDA constraint discovery on this, we get the following results:

Individual Field Constraints
Name Type Allowed Min Allowed Max Allowed
Price1 real 0.0 198,204.34
Price2 real –468,550,685.56 2,432,595.87
Price3 real 0.0 1,390,276,267.42

With ordinary constraint discovery, the min and max constraints are just set to the largest and smallest values in the dataset, so this tells us that the largest value for Price1 in the dataset is nearly £200,000, that Price2 ranges from close to –£500m to about +£2.5m, and that Price3 runs up to about £1.4bn.

If we rerun the constraint generation allowing for up to 1% bad data, we get the following instead.

Individual Field Constraints
Name Type Allowed Min Allowed Max Allowed
Price1 real 0.0 38,000.0
Price2 real –3,680.0 3,870.0
Price3 real 0.0 125,000.0

Let's look at why these values were chosen, and see whether they are reasonable.

Starting with Price1, let's get a few more statistics.

min Price1 percentile 1 Price1 median Price1 percentile 99 Price1 max Price1
0.00 24.77 299.77 2,856.04 198,204.34

So here our median value is just under £300, while our first percentile value is about £25 and our 99th percentile is a little under £3,000, giving us an interquantile range also somewhat under £3,000 (£2,831.27, to be precise).

The way our algorithm currently works, this interquantile range defines the scale for the gap we need to define something as an outlier—in this case, half of £2,831.27, which is about £1,415.

If we were just looking to find the tightest maximum constraint consistent with 1% of the data being bad (with respect to this constraint), we would obviously just set the upper limit to fractionally above £2,856.04. But this would be fairly nonsensical, as we can see if we look at the data around that threshold. It is a feature of this particular data that most values are duplicated a few times (typically between 3 and 10 times), so we first deduplicate them. Here are the first 16 distinct values above £2,855:

2,855.02   2,855.42   2,855.80   2,856.04*
2,855.16   2,855.50   2,855.82   2,856.08
2,855.22   2,855.64   2,855.86   2,856.13
2,855.38   2,855.76   2,855.91   2,856.14

Obviously, there's nothing special about £2,856.04 (marked with *), and there is no meaningful gap after it, so that would be a fairly odd place to cut off.

Now let's look at the (distinct) values above £30,000:

30,173.70   31,513.21   39,505.67   44,852.01   67,097.01    72,082.60
30,452.45   32,358.52   39,703.93   45,944.21   60,858.88    72,382.57
30,562.55   32,838.23   40,888.79   47,026.70   60,911.84    72,388.55
30,586.96   33,906.98   40,999.04   47,058.99   63,160.28    72,657.04
30,601.67   34,058.27   41,252.50   47,126.60   63,984.64    72,760.90
30,620.78   35,302.33   41,447.00   49,827.28   64,517.16    90,713.28
31,118.10   36,472.53   41,513.95   51,814.22   67,256.27   103,231.44
31,139.86   36,601.86   42,473.36   53,845.76   68,081.68   123,629.97
31,206.71   36,941.76*  43,384.46   55,871.84   70,782.28   198,204.34
31,449.57   39,034.38   43,510.03   56,393.14   71,285.95

The cutoff the TDDA algorithm has chosen (£38,000) is just after the value marked *, which is the start of the first sizable gap. We are not claiming that there is anything uniquely right or "optimal" about the cutoff that has been chosen in the context of this data, but it looks reasonable. The first few values here are still comparatively close together, with gaps of only a few hundred between each pair of successive values, while the last ones are almost an order of magnitude bigger.

We won't look in detail at Price2 and Price3, where the algorithm has narrowed the range rather more, except to comment briefly on the negative values for Price2, which you might have expected to be excluded. This table shows why:3

countnegative Price2   1,300
countzero Price2      51,640
countpositive Price2  39,750
countnull Price2      15,880
countnonnull Price2   92,690

Although negative prices are not very common, they do account for about 1.2% of the data (critically, more than 1%). Additionally, nearly 15% of the values for Price2 are missing, and we exclude nulls from this calculation, so the truly relevant figure is that 1,300 of 92,690 non-null values are negative, or about 1.4%.

Other Constraints

We will now use a slightly wider dataset, with more field types, to look at how constraint generation works for other kinds of constraints in the presence of bad data. Here are the first 10 records from a dataset with 108,570 records, including a similar set of three price fields (with apologies for the slightly unintuitive colouring of the Colour field).

ID                                    Date                 Price1  Price2  Price3  Code  nItems  Parts        Colour
af0370b4-16e4-1891-8b30-cbb075438394  2016-11-01 00:38:34  830.25          830.25  NLOM  1       62444        red
126edf08-16e5-1891-851c-d926416373f2  2016-11-01 00:41:21  983.08  0.00    983.08  QLDZ  1       62677        yellow
73462586-16ed-1891-b164-7da16012a3ab  2016-11-01 01:41:20  540.82  0.00    540.82  TKNX  1       62177 62132  red
1cd55aac-170e-1891-aec3-0ffc0ab020e7  2016-11-01 05:35:08  398.89  0.00    398.89  PKWI  1       61734        red
68f74486-170e-1891-8b30-cbb075438394  2016-11-01 05:37:16  314.54  8.21    314.54  PLUA  1       62611        red
28dff654-16fa-1891-bd69-03d25ee96dee  2016-11-01 03:12:18  479.75          479.75  UHEG  1       62128        red
14321670-1703-1891-ab75-77d18aced2f5  2016-11-01 04:16:09  733.41  0.00    733.41  RTKT  2       61829        red
60957d68-1717-1891-a5b2-67fe1e248d60  2016-11-01 06:41:27  537.81  0.00    537.81  OBFZ  1       61939 62371  red
4fea2bb6-171d-1891-80a6-2f322b85f525  2016-11-01 07:23:56  132.60  2.42    135.02  TACG  2       62356        red
204ed3d6-1725-1891-a1fb-15b6ffac52ce  2016-11-01 08:19:52  866.32          866.32  XHLE  2       61939        red

If we run ordinary constraint discovery on this (not allowing for any bad data), these are the results:

Individual Field Constraints
Name                  Type Allowed  Min Allowed         Max Allowed           Sign Allowed  Nulls Allowed  Duplicates Allowed  Values Allowed  # Regular Expressions
ID                    string        length 32           length 36                           0              -                                   2
Date                  date          1970-01-01          2017-06-03 23:59:06                 0
Date:time-before-now  timedelta     107 days, 11:28:05  17428 days, 11:27:11  positive
Price1                real          0.0                 19,653,405.06         non-negative  0
Price2                real          -4,331,261.54       589,023.50
Price3                real          0.0                 20,242,428.57         non-negative  0
Code                  string        length 4            length 4                            0              -                                   3
nItems                int           1                   99                    positive                     -
Parts                 string        length 5            length 65                           1              -                                   5
Colour                string        length 3            length 6                            0              -                   6 values        2

There are a few points to note:

  • To make the table more manageable, the regular expressions and field values are not shown by default. We'll see them later.
  • The third line of constraints generated is slightly different from the others: it looks not at the actual values of the dates, but rather at how far in the past or future they are. This is something else we are experimenting with, but it doesn't matter too much for this example.

Let's repeat the process telling the software that up to 1% of the data might be bad.

Individual Field Constraints
Name                  Type Allowed  Min Allowed          Max Allowed           Sign Allowed  Nulls Allowed  Duplicates Allowed  Values Allowed  # Regular Expressions
ID                    string        length 32            length 36                           0              no                                  1
Date                  date          1970-01-01 01:23:20  2017-06-03 23:59:06                 0
Date:time-before-now  timedelta     107 days, 11:55:27   17428 days, 11:54:33  positive
Price1                real          0.0                  25000.0               non-negative  0
Price2                real          -520.0               790.0
Price3                real          0.0                  24000.0               non-negative  0
Code                  string        length 4             length 4                            0              -                                   2
nItems                int           1                    10                    positive      0              -
Parts                 string        length 5             length 65                           0              -                                   2
Colour                string        length 3             length 6                            0              -                   3 values        1

Now let's consider the (other) constraint kinds in turn.


Sign constraints.

Nothing has changed for any of the sign constraints. In general, all we do in this case is check whether less than our nominated proportion p of the values is negative (or, alternatively, less than p is positive), and if so write a constraint to that effect; in this example, nothing has changed.


Nulls allowed.

The nulls allowed constraint puts an upper limit on the number of nulls allowed in a field. The only two values we ever use are 0 and 1: 0 means that no nulls are allowed, and 1 means that a single null is allowed. The Parts field originally had a value of 1 here, meaning that there is a single null in this field in the data, and the software therefore wrote a constraint that a maximum of 1 null is permitted. When the software is allowed to assume that 1% of the data might be bad, the more natural constraint is not to allow nulls at all.

No constraint on nulls was produced for the field nItems originally, but now we have one. If we check the number of nulls in nItems, it turns out to be just 37, or around 0.03%. Since that is well below our 1% limit, a "no-nulls" constraint has been generated.


Duplicates allowed.

In the original constraint discovery, no constraints banning duplicates were generated, whereas in this case ID did get a "no duplicates" constraint. If we count the number of distinct values in ID, there are in fact 108,569 for 108,570 records, so there is clearly one duplicated value, affecting 2 records. Since 2/108,570 is again well below our 1% limit (more like 0.002%), the software generates this constraint.
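Both the nulls and duplicates decisions come down to the same simple proportion test. Here is a minimal sketch using a hypothetical helper (not the actual tdda API), with the counts from the examples above:

```python
def passes_threshold(bad_count, total, p=0.01):
    # Hypothetical helper: a constraint is suggested only when the
    # offending fraction of the data is below the bad proportion p
    return bad_count / total < p

# nItems: 37 nulls in 108,570 records, so a no-nulls constraint is suggested
print(passes_threshold(37, 108_570))      # True
# ID: one duplicated value affecting 2 of 108,570 records
print(passes_threshold(2, 108_570))       # True
# Price2: 1,300 negatives among 92,690 non-null values (about 1.4%)
print(passes_threshold(1_300, 92_690))    # False
```

The Price2 case shows why no non-negativity constraint was generated: its negative fraction, computed over non-null values, exceeds the permitted 1%.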

Values allowed.

In the original dataset, a constraint on the values for the field Colour was generated. Specifically, the six values it allowed were:

"red" "yellow" "blue" "green" "CC1734" "???"

When we ran the discovery process allowing for bad data, the result was that only 3 values were allowed, which turn out to be "red", "yellow", and "blue". If we look at the breakdown of the field, we will see why.

Colour count
red 88,258
yellow 10,979
blue 8,350
green 978
CC1734 4
??? 1

The two values we might have picked out as suspicious (or at least, as having a different form from the others) are obviously CC1734 and ???, and constraint generation did exclude those, but it also excluded green, which we probably would not have done. There are 978 green values (about 0.9%), which is slightly under our 1% cutoff, so the algorithm can exclude it, and does. The algorithm simply works through the values in order, starting with the least numerous, and removes values until the maximum possible bad proportion p is reached (cumulatively). Values with the same frequency are treated together, which means that if we had had another value (say "purple") with exactly the same frequency as green (978), neither would have been excluded.
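This pruning rule is easy to sketch in Python. The following is an illustrative reconstruction of the description above, not the actual TDDA code:

```python
def allowed_values(counts, p=0.01):
    """Remove values from the rarest upwards until the cumulative count
    of removed values would exceed the bad proportion p, treating
    equally-frequent values together."""
    n = sum(counts.values())
    budget = p * n
    allowed = dict(counts)
    removed = 0
    for freq in sorted(set(counts.values())):
        group = [v for v, c in counts.items() if c == freq]
        group_total = freq * len(group)
        if removed + group_total > budget:
            break                      # removing this tie-group would exceed p
        for v in group:
            del allowed[v]
        removed += group_total
    return set(allowed)

colour_counts = {'red': 88258, 'yellow': 10979, 'blue': 8350,
                 'green': 978, 'CC1734': 4, '???': 1}
print(sorted(allowed_values(colour_counts)))   # ['blue', 'red', 'yellow']
```

On the Colour counts above this excludes ???, CC1734 and green, exactly as in the post; adding a hypothetical purple with the same count as green would keep both, because the tie-group of 1,956 values would exceed the 1% budget.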

Regular Expressions

Sets of regular expressions were generated to characterize each of the four string fields, and in every case the number generated was smaller when assuming the possible presence of bad data than when not.

For the ID field, two regular expressions were generated (the same two that appear in the coverage table below):

^[0-9a-f]{32}$
^[0-9a-f]{8}\-[0-9a-f]{4}\-1891\-[0-9a-f]{4}\-[0-9a-f]{12}$
The first of these just corresponds to a 32-digit hex number, while the second is a 32-digit hex number broken into five groups of 8, 4, 4, 4, and 12 digits, separated by dashes—i.e. a UUID.4

If we get Rexpy to give us the coverage information, we see that all but three of the IDs are properly formatted UUIDs; the remaining three are plain 32-digit hex numbers.

Regular Expression Incremental Coverage
^[0-9a-f]{8}\-[0-9a-f]{4}\-1891\-[0-9a-f]{4}\-[0-9a-f]{12}$ 108,567
^[0-9a-f]{32}$ 3

and indeed, these are the 3 records that are excluded by constraint generation when assuming bad data.

We won't go through all the cases, but will look at one more. The field Parts is a list of 5-digit numbers, separated by spaces, with a single one being generated most commonly. Here are the regular expressions that Rexpy generated, together with the coverage information.

Regular Expression Incremental Coverage
^\d{5}$ 99,255
^\d{5} \d{5}$ 8,931
^\d{5} \d{5} \d{5}$ 378
^\d{5} \d{5} \d{5} \d{5}$ 4
^61000 61000 61000 61000 61000 61000 61000 61000 61000 61000 61000$ 1

As you can see, Rexpy has generated five separate regular expressions, where in an ideal world we might have preferred it to produce a single one:

^\d{5}( \d{5})*$

In fact, however, the fact it has produced separate expressions for five clearly distinguishable cases turns out to be very helpful for TDDA purposes.

In this case, we can see that the vast bulk of the cases have either one or two 5-digit codes (which are the two regular expressions retained by constraint generation), but we would almost certainly consider the 382 cases with three or four codes to be correct as well. The last one is more interesting. First, the number of codes (eleven) is noticeably larger than for any other record. Secondly, the fact that it is eleven copies of a single, relatively round code (61000) is suspicious. (Looking at the data, it turns out that in no other case is any code repeated when there are multiple codes, suggesting even more strongly that there's something not right with this record.)
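We can check with Python's re module that a single combined expression of the kind suggested above really would match all of these shapes:

```python
import re

# The single combined expression suggested above
combined = r'^\d{5}( \d{5})*$'

examples = ['62444', '62177 62132', '61939 62371',
            ' '.join(['61000'] * 11)]        # the suspicious record
assert all(re.fullmatch(combined, s) for s in examples)
```

Note that the combined form also matches the suspicious eleven-code record, which is precisely why the five separate expressions, with their coverage counts, are more useful for TDDA purposes.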

Verifying the Constraint Generation Data Against the Suggested Constraints

With the previous approach to constraint discovery, in which we assume that the example data given to us contains only good data, it should always be the case that if we verify the example data against constraints generated from it, they will all pass.5 With the new approach, this is no longer the case, because our hope is that the constraints will help us to identify bad data. We show below the result of running verification against the generated constraints for the example we have been looking at:

Failing Constraints
Failing Records
Failing Fields
Failing Values

We can also see a little more information about where the failures were in the table below.

Name  Failing Values  Failing Constraints  Type  Minimum  Maximum  Sign  Max Nulls  Duplicates  Values  Rex
(for each constraint column, the Allowed value is shown first, followed by the Actual value)
Colour 983 2 string string length 3 length 3 length 6 length 6 - - - 0 0 - - - 3 values e.g. "green" 1 pattern e.g. "???"
Parts 384 2 string string length 5 length 5 length 65 length 65 - - - 0 1 - - - - - - 2 patterns e.g. "62702 62132 62341"
Price2 281 2 real real -520.00 -4,331,261.54 790.00 589,023.50 - - - - 15880 - - - - - - - - - -
Price3 92 1 real real 0.00 0.00 24,000.00 20,242,428.57 ≥ 0 0 0 - - - - - - - - -
Price1 91 1 real real 0.00 0.00 25,000.00 19,653,405.06 ≥ 0 0 0 - - - - - - - - -
nItems 44 2 int int 1 1 10 99 > 0 0 37 - - - - - - - - -
Code 4 1 string string length 4 length 4 length 4 length 4 - - - 0 0 - - - - - - 2 patterns e.g. "A__Z"
ID 3 2 string string length 32 length 32 length 36 length 36 - - - 0 0 no e.g. "374e4f9816e51891b1647da16012a3ab" - - - 1 pattern e.g. "374e4f9816e51891b1647da16012a3ab"
Date 3 1 date date 1970-01-01 01:23:20 1970-01-01 2017-06-03 23:59:06 2017-06-03 23:59:06 - - - 0 0 - - - - - - - - -
Date:time-before-now 2 1 - - - 109 days, 16:27:17 109 days, 16:28:09 17430 days, 16:26:23 17430 days, 16:27:15 > 0 - 0 - - - - - - - - - -

Because the verification fails in this way, and in doing so creates indicator fields for each failing constraint, together with an overall field containing the number of failures for each record, it is easy to narrow down to the data flagged by these constraints, to see whether the constraints are useful or overzealous, and to adjust them as necessary.


In this post, we've shown how we're extending Automatic Constraint Generation in TDDA to cover cases where the datasets used are not assumed to be perfect. We think this is quite a significant development. We'll use it a bit more, and when it's solid, extend the open-source tdda library to include this functionality.

  1. As previously, I am aware that, classically, data is the plural of datum, and that purists would prefer the assertion to be "bad data are widespread and pervasive." I apologise to anyone whose sensibilities are offended by my use of the word data in the singular. 

  2. As a further clarification, the proportion p is used for each potential constraint on each field separately, so it's not that "1% of all values might be bad", but rather that "1% of the Ages might be higher than the maximum constraint we generate", and so on. Obviously, we could generalize this approach to allow different possible proportions for each constraint type, or each field, or both, at the cost of increasing the number of free parameters. 

  3. Like you, I look at those figures and am immediately suspicious that all the counts shown are multiples of 10. But this is one of those cases where our suspicions are wrong: this is a random sample from larger data, and the roundness of these numbers is blind chance. 

  4. In fact, looking carefully, the third group of digits is fixed and starts with 1, indicating that this is a UUID-1, which is something we hadn't noticed in this data until we got Rexpy to generate a regular expression for us as part of TDDA. 

  5. With the time-delta constraints we are now generating, this is not strictly true, but this need not concern us in this case. 

Obtaining the Python tdda Library

Posted on Thu 14 September 2017 in TDDA • Tagged with tdda, python

This post is a standing post that we plan to try to keep up to date, describing options for obtaining the open-source Python TDDA library that we maintain.

Using pip from PyPI

If you don't need source, and have Python installed, the easiest way to get the TDDA library is from the Python package index PyPI using the pip utility.

Assuming you have a working pip setup, you should be able to install the tdda library by typing:

pip install tdda

or, if your permissions don't allow installation in this mode:

sudo pip install tdda

The tdda library supports both Python 2 (version 2.7) and Python 3.

Installing from Source

The source for the tdda library is available from Github and can be cloned with

git clone


git clone

When installing from source, if you want the command line tdda utility to be available, you need to run

python setup.py install

from the top-level tdda directory after downloading it.


Documentation

The main documentation for the tdda library is available on Read the Docs.

You can also build it yourself if you have downloaded the source from Github. In order to do this, you will need an installation of Sphinx. The HTML documentation is built by running the following, starting from the top-level tdda directory:

cd doc
make html

Running TDDA's tests

Once you have installed TDDA (whether using pip or from source), you can run its tests by typing

tdda test

If you have all the dependencies, including optional dependencies, installed, you should get a line of dots and the message OK at the end, something like this:

$ tdda test
Ran 122 tests in 3.251s

OK


If you don't have some of the optional dependencies installed, some of the dots will be replaced by the letter 's'. For example:

$ tdda test
Ran 120 tests in 3.221s

OK (skipped=2)

This does not indicate a problem; it simply means that some of the functionality (usually one or more database types) will be unavailable.

Using the TDDA examples

The tdda library includes three sets of examples, covering reference testing, automatic constraint discovery and verification, and Rexpy (discovery of regular expressions from examples, outside the context of constraints).

The tdda command line can be used to copy the relevant files into place. To get the examples, first change to a directory where you would like them to be placed, and then use the command:

tdda examples

This should produce the following output:

Copied example files for tdda.referencetest to ./referencetest-examples
Copied example files for tdda.constraints to ./constraints-examples
Copied example files for tdda.rexpy to ./rexpy-examples

Quick Reference Guides

There are quick reference guides available for the TDDA library. These are often a little behind the current release, but are usually still quite helpful.

These are available from here.

Tutorial from PyData London

There is a video online of a workshop at PyData London 2017. Watching a video of a workshop probably isn't ideal, but it does have a fairly detailed and gentle introduction to using the library, so if you are struggling, it might be a good place to start.

GDPR, Consent and Microformats: A Half-Baked Idea

Posted on Fri 08 September 2017 in TDDA • Tagged with tdda

Last night I went to The Protectors of Data Scotland Meetup on the subject of Marketing and GDPR. If you're not familiar with Europe's fast-approaching General Data Protection Regulation, and you keep or process any personal data about humans,1 you probably ought to learn about it. A good place to start is episode 202 of Horace Dediu's The Critical Path podcast, in which he interviews Tim Walters.

During the meeting, I had an idea, and though it is rather less than half-baked, right now it seems just about interesting enough that I thought I'd record it.

One of the key provisions in GDPR is that data processing generally requires consent of the data subject, and that consent is defined as

any freely given, specific, informed and unambiguous indication of his or her wishes by which the data subject, either by a statement or by a clear affirmative action, signifies agreement to personal data relating to them being processed

This is further clarified as follows:

This could include ticking a box when visiting an Internet website, choosing technical settings for information society services or by any other statement or conduct which clearly indicates in this context the data subject’s acceptance of the proposed processing of their personal data.

Silence, pre-ticked boxes or inactivity should therefore not constitute consent.

The Idea in a Nutshell

Briefly, the idea is this:

  • Websites (and potentially apps) requesting consents should include a digital specification of that consent in a standardized format to be defined (probably either an HTML microformat or a standardized JSON bundle).

  • This would allow software to understand the consents being requested unambiguously and present them in a standardized, uniform, easy-to-understand format. It would also encourage businesses and other organizations to standardize the forms of consent they request. I imagine that if this happened, initially browser extensions and special apps such as password managers would learn to read the format and present the information clearly, but if it were successful, eventually web browsers themselves would do this.

  • Software could also allow people to create one or more templates or default responses, allowing, for example, someone who never wants to receive marketing to make this their default response, and someone who wants as many special offers as possible to have settings that reflect that. Obviously, you might want several different formats for organizations towards which you have different feelings.

  • A very small extension to the idea would extend the format to record the choices made, allowing password managers, browsers, apps etc. to record for the user exactly what consents were given.2


I believe such a standard has potential benefits for all parties—businesses and other organizations requesting consent, individuals giving consent, regulators and courts:

  • Businesses and other data processing/capturing organizations would benefit from a clear set of consent kinds, each of which could have a detailed description (perhaps on an EU or W3C document) that could be referenced by a specific label (e.g. marketing_contact_email_organization). Best practice would hopefully quickly move to using these standard categories.

  • Software could present information in a standardized, clear way to users, highlighting non-standard provisions (preferably with standard symbols, a bit like the Creative Commons symbols).

  • By using template responses, users could more easily complete consent forms with less effort and less fear of ticking the wrong boxes.

Digital Specification? Microformat? JSON?

What are we actually talking about here?

The main (textual) content of a web page consists of the actual human-readable text together with annotation ("markup") to specify formatting and layout, as well as special features like the sort of checkboxes used to request consent. In older versions of the web, a web page was literally a text file in a special format originally defined by Tim Berners-Lee (HTML).

For example, in HTML, this one-sentence paragraph with the word very in bold might be written as follows:

<p>For example, in HTML, this one-sentence paragraph with the
word <b>very</b> in bold might be written as follows:</p>

Since the advent of "Web 2.0", many web pages are generated dynamically, with much of the data being sent in a format called JSON. A simple example of some JSON describing (say) an element in the periodic table might be:

{
    "name": "Lithium",
    "atomicnumber": 3,
    "metal": true,
    "period": 2,
    "group": 1,
    "etymology": "Greek lithos"
}

It would be straightforward3 to develop a format for allowing all the common types of marketing consent (and indeed, many other kinds of processing consent) to be expressed either in JSON or an HTML microformat (which might not be rendered directly by the webpage). As a sketch, the kind of thing a marketing consent request might look like in JSON could be:

{
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requesting_organization": "Stochastic Solutions Limited",
    "requesting_organization_partners": [],
    "requested_consents": [
        ...
    ],
    "request_date": "2017-09-08",
    "request_url": ""
}

Key features I am trying to illustrate here are:

  • The format would include details about the organization making the request
  • The format would have the capacity to list partner organizations in cases in which consent for partner marketing or processing was also requested
  • The format would be granular with a taxonomy of known kinds of consents. These might be parameterized, rather than being simple strings. In this case, I've included a few different contact mechanisms and the suffix "organization" to indicate this is consent for the organization itself, rather than any partners or other randoms.

Undoubtedly, a real implementation would end up a bit bigger than this, and perhaps more hierarchical, but hopefully not too much bigger.

The format could be extended very simply to include the response, which could then be sent back to the site and also made available on the page to the browser/password manager/apps etc. Here is an augmentation of the request format that also captures the responses:

{
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requesting_organization": "Stochastic Solutions Limited",
    "requesting_organization_partners": [],
    "requested_consents": {
        "marketing_contact_email_organization": false,
        "marketing_contact_mobile_phone_organization": false,
        "marketing_contact_physical_mail_organization": true
    },
    "request_date": "2017-09-08",
    "request_url": ""
}

This example indicates consent to marketing contact by paper mail from the organization, but not by phone or email.
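Any conforming consumer (a password manager, say, or a browser extension) could then read such a bundle with a few lines of code. Here is a sketch in Python, using the hypothetical response bundle above:

```python
import json

# Parse the hypothetical response bundle and summarize the consents given
bundle = """
{
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requesting_organization": "Stochastic Solutions Limited",
    "requested_consents": {
        "marketing_contact_email_organization": false,
        "marketing_contact_mobile_phone_organization": false,
        "marketing_contact_physical_mail_organization": true
    }
}
"""
consents = json.loads(bundle)["requested_consents"]
granted = sorted(k for k, v in consents.items() if v)
print(granted)   # only the physical-mail consent was granted
```

This is exactly the kind of processing that would let software compare a site's requests against a user's default template and flag any differences.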

Exactly the same could be achieved with an HTML Microformat, perhaps with something like this:

<div id="GDPR-Marketing-Consent-Bundle">
    <span id="format_version">1.0</span>
    <span id="requesting_organization">
        Stochastic Solutions Limited
    </span>
    <span id="requesting_organization_data_protection_policy_page"></span>
    <ol id="requesting_organization_partners"></ol>
    <ol id="requested_consents">
        ...
    </ol>
    <span id="request_date">2017-09-08</span>
    <span id="request_url"></span>
</div>

(Again, I've no idea whether this is actually what HTML-based microformats typically look like; this is purely illustrative.)


I don't know whether this idea is useful or feasible, nor whether it is merely a half-baked version of something that a phalanx of people in Brussels has already specified, though I did perform many seconds of arduous due diligence in the form of web searches for terms like "marketing consent microformat" without turning up anything obviously relevant.

It seems to me that if something like this were created and adopted, it might help web/app-based GDPR consent avoid the ignominious fate of the cookie pop-ups that were so well intentioned but such a waste of time in practice. Ideally, some kind of collaboration between the relevant part of the EU and the W3C would produce (or at least endorse) any such format.

Do get in touch through any of the channels (@tdda0, mail to info@ this domain etc.) if you have thoughts.

  1. So-called personally identifiable information (PII)

  2. Possibly even on a blockchain, if you want to be terribly au courant and have the possibility of cryptographic verification. 

  3. technically straightforward; obviously this would require much work and hammering out of special cases and mechanisms for non-standard requirements. 

Quick Reference for TDDA Library

Posted on Thu 04 May 2017 in TDDA • Tagged with tdda

A quick-reference guide ("cheat sheet") is now available for the Python TDDA library. This is linked in the sidebar and available here.

We will try to keep it up-to-date as the library evolves.

See you all at PyData London 2017 this weekend (5-6 May 2017), where we'll be running a TDDA tutorial on Friday.

Improving Rexpy

Posted on Thu 09 March 2017 in TDDA • Tagged with tdda, rexpy, regular expressions

Today we are announcing some enhancements to Rexpy, the tdda tool for finding regular expressions from examples. In short, the new version often finds more precise regular expressions than was previously the case, with the only downside being a modest increase in run-time.

Background on Rexpy is available in two previous posts:

  • This post introduced the concept and Python library
  • This post discussed the addition of coverage information—statistics about how many examples each regular expression matched.

Rexpy is also available online at

Weaknesses addressed

Rexpy is not intended to be an entirely general-purpose tool: it is specifically focused on the case of trying to find regular expressions to characterize the sort of structured textual data we most often see in databases and datasets. We are very interested in characterizing things like identifiers, zip codes, phone numbers, URLs, UUIDs, social security numbers and (string) bar codes, and much less interested in characterizing things like sentences, tweets, programs and encrypted text.

Within this focus, there were some obvious shortcomings of Rexpy, a significant subset of which the current release (tdda version 0.3.0) now addresses.

Example 1: Postcodes

Rexpy's tests have always included postcodes, but Rexpy never did a very good job with them. Here is the output from using tdda version 0.2.7:

$ python
B2 8EA

^[A-Z0-9]{2,3}\ [0-9A-F]{3}$

Rexpy's result is completely valid, but not very specific. It has correctly identified that there are two main parts, separated by a space, and that the first part is a mixture of two or three characters, each a capital letter or a number, and that the second part is exactly three characters, again all capital letters or numbers. However, it has failed to notice that the first group starts with a letter and follows this with a single digit and that the second group is one digit followed by two letters. What a human would probably have written is something more like:

^[A-Z]{1,2}[0-9]\ [0-9][A-F]{2}$

Let's try Rexpy 0.3.0.

^[A-Z]{1,2}\d\ \d[A-Z]{2}$

Now Rexpy does exactly what we would probably have wanted it to do, and has actually written it slightly more compactly—\d is any digit, i.e. it is precisely equivalent to [0-9].

With a few more examples, it still does the perfect thing (in 0.3.0).

$ rexpy
W1 0AX
M1 1AE
B33 8TH
DN55 1PT

^[A-Z]{1,2}\d{1,2}\ \d[A-Z]{2}$

(Note that the 0.3.0 release of TDDA includes wrapper scripts, rexpy and tdda, that allow the main functions to be used directly from the command line. These are installed when you pip install tdda. So the rexpy above is exactly equivalent to running the rexpy module directly with python.)
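We can confirm with Python's re module that the generated expression behaves as claimed on these examples:

```python
import re

# The expression Rexpy generated from the four postcode examples
pattern = r'^[A-Z]{1,2}\d{1,2}\ \d[A-Z]{2}$'

for pc in ['W1 0AX', 'M1 1AE', 'B33 8TH', 'DN55 1PT']:
    assert re.match(pattern, pc)

# A London-style postcode, with a trailing letter in the first block,
# is not matched by this expression
assert not re.match(pattern, 'EC1A 1BB')
```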

We should note, however, that it still doesn't work perfectly if we include general London postcodes (even with 0.3.0):

$ rexpy
M1 1AE
B33 8TH
DN55 1PT

^[A-Z0-9]{2,4}\ \d[A-Z]{2}$

In this case, the addition of the final letter in the first block for EC1A and W1A has convinced the software that the first block is just a jumble of capital letters and numbers. We might hope that examples like these (at least, if expanded) would result in something like:

^[A-Z]{1,2}\d{1,2}A?\ \d[A-Z]{2}$

though the real pattern for postcodes is actually quite complex, with only certain London postal areas being allowed a trailing letter, and only in cases where there is a single digit in the first group, and that letter can actually be A, C, P or W.

So while it isn't perfect, Rexpy is doing fairly well with postcodes now.

Let's look at another couple of examples.

Example 2: Toy Examples

Looking at the logs from the Rexpy online, it is clear that a lot of people (naturally) start by trying the sorts of examples commonly used for teaching regular expressions. Here are some examples motivated by what we tend to see in logs.

First, let's try a common toy example in the old version of Rexpy (0.2.7):

$ python


Not so impressive.

Now in the new version (0.3.0):

$ rexpy


That's more like it!

Example 3: Names

Here's another example it's got better at. First under the old version:

$ python
Albert Einstein
Rosalind Franklin
Isaac Newton

^[A-Za-z]+\ [A-Za-z]{6,8}$

Again, this is not wrong, but Rexpy has singularly failed to notice the pattern of capitalization.

Now, under the new version:

$ rexpy
Albert Einstein
Rosalind Franklin
Isaac Newton

^[A-Z][a-z]+\ [A-Z][a-z]{5,7}$


Incidentally, it's not doing anything special with the first character of groups. Here are some related examples:

$ rexpy


Example 4: Identifiers

Some of the examples we used previously were like this (same result under both versions):

$ rexpy


What worked less well in the old version were examples like these:

$ python


These work much better under the 0.3.0:

$ rexpy


Some Remaining Areas for Improvement

The changes that we've made in this release of Rexpy appear to be almost unambiguous improvements. Both from trying examples, and from understanding the underlying code changes, we can find almost no cases in which the changes make the results worse, and a great number where the results are improved. Of course, that's not to say that there don't remain areas that could be improved.

Here we summarize a few of the things we still hope to improve:

Alternations of Whole Groups

Rexpy isn't very good at generating alternations at the moment, either at a character or group level. So for example, you might have hoped that in the following example, Rexpy would notice that the middle two letters are always AA or BB (or, possibly, that the letter is repeated).

$ rexpy


Unfortunately, it does not. (This probably won't be very hard to change.)
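To make this concrete, here is a sketch with made-up examples; the "character class" pattern stands in for the kind of result Rexpy produces today, and is illustrative rather than verbatim output:

```python
import re

# made-up examples in which the middle pair of letters is always AA or BB
examples = ['X-AA-1', 'Y-BB-2', 'Z-AA-3']

character_class = re.compile(r'^[A-Z]\-[AB]{2}\-\d$')  # roughly what we get today
alternation = re.compile(r'^[A-Z]\-(AA|BB)\-\d$')      # what we would like

assert all(character_class.match(e) for e in examples)
assert all(alternation.match(e) for e in examples)

# the character class wrongly admits the mixed pair AB; the alternation does not
assert character_class.match('X-AB-1')
assert not alternation.match('X-AB-1')
```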

Alternations within Groups

Similarly, you might hope that it would do rather better than this:

$ rexpy


Clearly, we would like this to produce


and you might think from the previous examples that it would do this, but it can't combine the fixed oger with the adjacent letter range.
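For instance, with hypothetical inputs such as Roger and Coger that share the fixed suffix oger, the combined expression we would like is easy to check with re:

```python
import re

# the letter range combined with the fixed suffix: what we would like Rexpy to produce
combined = re.compile(r'^[CR]oger$')

assert combined.match('Roger')
assert combined.match('Coger')
assert not combined.match('Doger')
```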

Too Many Expressions (or Combining Results)

Rexpy also produces rather too many regular expressions in many cases, particularly by failing to use optionals when it could. For example:

$ rexpy
Angela Carter
Barbara Kingsolver
Martin Luther King
James Clerk Maxwell

^[A-Z][a-z]{5,6}\ [A-Z][a-z]+$
^[A-Z][a-z]{4,5}\ [A-Z][a-z]{4,5}\ [A-Z][a-z]+$

At least in some circumstances, we might prefer that this would produce a single result such as:

^[A-Z][a-z]+(\ [A-Z][a-z]+(\ [A-Z][a-z]+)?)?$

or, even better:

^[A-Z][a-z]+(\ [A-Z][a-z]+)*$

Although we would definitely like Rexpy to be able to produce one of these results, we don't necessarily always want this behaviour. It transpires that in a TDDA context, producing different expressions for the different cases is very often useful. So if we do crack the "combining" problem, we'll probably make it an option (perhaps with levels); that will just leave the issue of deciding on a default!
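A quick check with re confirms that a single starred expression of this shape covers all four names (a sketch; as noted, Rexpy does not yet produce this):

```python
import re

names = ['Angela Carter', 'Barbara Kingsolver',
         'Martin Luther King', 'James Clerk Maxwell']

# one capitalized word, then any number of further capitalized words
combined = re.compile(r'^[A-Z][a-z]+(\ [A-Z][a-z]+)*$')

assert all(combined.match(n) for n in names)
assert not combined.match('angela carter')
```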


We have ideas on how to address all of these, albeit not perfectly, so expect further improvements.

If you use Rexpy and have feedback, do let us know. You can reach us on Twitter at (@tdda0), and there's also a TDDA Slack (#TDDA) that we'd be happy to invite you to.

An Error of Process

Posted on Wed 08 March 2017 in TDDA • Tagged with tdda, errors of process, errors of interpretation

Yesterday, email subscribers to the blog, and some RSS/casual viewers, will have seen a half-finished (in fact, abandoned) post that began to try to characterize success and failure on the crowd-funding platform Kickstarter.

The post was abandoned because I didn't believe its first conclusion, but it was unfortunately published by mistake yesterday.

This post explains what happened and tries to salvage a "teachable moment" out of this minor fiasco.

The problem the post was trying to address

Kickstarter is a crowd-funding platform that allows people to back creative projects, usually in exchange for rewards of various kinds. Projects set a funding goal and backers only pay out if the aggregate pledges made match or exceed the funding goal during a funding period—usually 30 days.

Project Phoenix on Kickstarter, from The Icon Factory, seeks to fund the development of a new version of Twitterrific for Mac. Twitterrific was the first independent Twitter client, and was responsible for many of the things that define Twitter today.1 (You were, and are, cordially invited to break off from reading this post to go and back the project before reading on.)

Ollie is the bird in Twitterrific's icon.

At the time I started the post, the project had pledges of $63,554 towards a funding goal of $75,000 (84%) after 13 days, with 17 days to go. This is what the amount raised over time looked like (using data from Kicktraq):

Given that the amount raised was falling each day, and looked asymptotic, questions I was interested in were:

  • How likely was the project to succeed (i.e. to reach its funding goal by day 30)? (In fact, it is now fully funded.)
  • How much was the project likely to raise?
  • How likely was the project to reach its stretch goal of $100,000?

The idea was to use some open data from Kickstarter and simple assumptions to try to find out what successful and unsuccessful projects look like.

Data and Assumptions

[This paragraph is unedited from the post yesterday, save that I have made the third item below bold.]

Kickstarter does not have a public API, but is scrapable. The site Web Robots makes available a series of roughly monthly scrapes of Kickstarter data from October 2015 to the present, as well as seven older datasets. We have based our analysis on this data, making the following assumptions:

  1. The data is correct and covers all Kickstarter Projects
  2. That we are interpreting the fields in the data correctly
  3. Most critically: if any projects are missing from this data, the missing projects are random. Our analysis is completely invalid if failing projects are removed from the datasets.

[That last point, heavily signalled as critical, turned out not to be the case. As soon as I saw the 99.9% figure below, I went to try to validate that projects didn't go missing from month to month in the scraped data. In fact, they do, all the time, and when I realised this, I abandoned the post. There would have been other ways to try to make the prediction, but they would have been less reliable and required much more work.]

We started with the latest dataset, from 15th February 2017. This included data about 175,085 projects, which break down as follows.

Only projects with a 30-day funding period were included in the comparison, and only those for which we knew the final outcome.

count           is 30 day?
state           no     yes    TOTAL
failed      41,382  42,134   83,516
successful  44,071  31,142   75,213
canceled     6,319   5,463   11,782
suspended      463     363      826
live         2,084   1,664    3,748
TOTAL       94,319  80,766  175,085
less live:           1,664
Universe            79,102

The table showed that 80,766 of the projects are 30-day, and of these, 79,102 are not live. So this is our starting universe for analysis. NOTE: We deliberately did not exclude suspended or canceled projects, since doing so would have biased our results.

Various fields are available in the JSON data provided by Web Robots. The subset we have used are listed below, together with our interpretation of the meaning of each field:

  • launched_at — Unix timestamp (seconds since 1 January 1970) for the start of the funding period
  • deadline — Unix timestamp for the end of the funding period
  • state — (see above)
  • goal — the amount required to be raised for the project to be funded
  • pledged — the total amount of pledges (today); pledges can only be made during the funding period
  • currency — the currency in which the goal and pledges are made.
  • backers_count — the number of people who have pledged money.
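As a sketch of how these fields can be used, the funding-period length in days falls out of the two Unix timestamps (the record below is made up for illustration):

```python
from datetime import datetime, timezone

def funding_days(launched_at, deadline):
    """Length of the funding period in whole days, from Unix timestamps."""
    return round((deadline - launched_at) / 86400)

# made-up project record with a 30-day funding period
project = {'launched_at': 1486684800, 'deadline': 1489276800}
assert funding_days(project['launched_at'], project['deadline']) == 30

# the timestamps convert to ordinary dates like this
start = datetime.fromtimestamp(project['launched_at'], tz=timezone.utc)
assert start.year == 2017
```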

Overall Statistics for 30-day, non-live projects

These are the overall statistics for our 30-day, non-live projects:

succeeded    count        %
no          47,839   60.48%
yes         31,263   39.52%
TOTAL       79,102  100.00%

Just under 40% of them succeed.
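The percentage follows directly from the counts in the table:

```python
succeeded, total = 31_263, 79_102
rate = succeeded / total
print(f'{rate:.2%} of 30-day, non-live projects succeeded')  # 39.52% ...
```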

But what proportion reach 84% and still fail to reach 100%? According to the detailed data, the answer was just 0.10%, suggesting that 99.90% of 30-day projects that reached 84% of their funding goal at any stage of the campaign went on to be fully funded.

That looked wildly implausible to me, and immediately made me question whether the data I was trying to use was capable of supporting this analysis. In particular, my immediate worry was that projects that looked like they were not going to reach their goal might end up being removed—for whatever reason—more often than those that were on track. Although I have not proved that this is the case, it is clear that projects do quite often disappear between successive scrapes.

To check this, I went back over all the earlier datasets I had collected and extracted the projects that were live in those datasets. There were 47,777 such projects. I then joined those onto the latest dataset (on id) to see how many of them were still present: 15,276 (31.97%) of the once-live projects were not in the latest data.
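The check itself is just a join on id; a minimal sketch with made-up id sets (not the real data):

```python
# made-up id sets standing in for the real scrapes
once_live_ids = {1, 2, 3, 4, 5}   # ids of projects seen live in earlier scrapes
latest_ids = {2, 4, 5, 6, 7}      # ids present in the latest scrape

missing = once_live_ids - latest_ids
pct_missing = len(missing) / len(once_live_ids)

print(f'{len(missing)} of {len(once_live_ids)} once-live projects '
      f'({pct_missing:.2%}) are missing from the latest data')
```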

It was at this point I abandoned the blog post.

Error of Process

So what did we learn?

The whole motivation for test-driven data analysis is the observation that data analysis is hard to get right, and most of us make mistakes all the time. We have previously classified these mistakes as

  • errors of interpretation (where we or a consumer of our analysis misunderstand the data, the methods, or our results)
  • errors of implementation (bugs)
  • errors of process (where we make a mistake in using our analytical process, and this leads to a false result being generated or propagated)
  • errors of applicability (where we use an analytical process with data that does not satisfy the requirements or assumptions (explicit or implicit) of the analysis).

We are trying to develop methodologies and tools to reduce the likelihood and impact of each of these kinds of errors.

While we wouldn't normally regard this blog as an analytical process, it's perhaps close enough that we can view this particular error through the TDDA lens. I was writing up the analysis as I did it, fully expecting to generate a useful post. Although I got as far as writing into the entry the (very dubious) claim that 99.9% of 30-day projects that reach 84% funding at any stage on Kickstarter go on to be fully funded, that result immediately smelled wrong and I went off to try to see whether my assumptions about the data were correct. So I was trying hard to avoid an error of interpretation.

But an error of process occurred. This blog is published using Pelican, a static site generator that I mostly quite like. The way Pelican works is that posts are (usually) written in Markdown with some metadata at the top. One of the bits of metadata is a Status field, which can either be set to Draft or Published.

When writing the posts, before publishing, you can either run a local webserver to view the output, or actually post them to the main site (on Github Pages, in this case). As long as their status is set to Draft, the posts don't show up as part of the blog in either site (local or on Github), but have to be accessed through a special draft URL. Unfortunately, the draft URL is a little hard to guess, so I generally work with posts with status set to Published until I push them to Github to allow other people to review them before setting them live.

What went wrong here is that the abandoned post had its status left as Published, which was fine until I started the next post (due tomorrow) and pushed that (as draft) to Github. Needless to say, a side-effect of pushing the site with a draft of tomorrow's post was that the abandoned post got pushed too, with its status as Published. Oops!

So the learning for me is that I either have to be more careful with Status (which is optimistic) or I need to add some protection in the publishing process to stop this happening. Realistically, that probably means creating a new status—Internal—which the make process will transmogrify into Published when compiling locally, and into Draft when pushing to Github. That should avoid repeats of this particular error of process.
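A minimal sketch of the idea, assuming posts carry a Status: metadata line; the sed commands are illustrative, not the actual make rules:

```shell
# local build: treat Internal as Published (post is visible for review)
echo 'Status: Internal' | sed 's/Internal$/Published/'

# push to Github: treat Internal as Draft (post stays hidden)
echo 'Status: Internal' | sed 's/Internal$/Draft/'
```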

  1. good things, like birds and @names and retweeting; not the abuse. 

Errors of Interpretation: Bad Graphs with Dual Scales

Posted on Mon 20 February 2017 in TDDA • Tagged with tdda, errors of interpretation, graphs

It is a primary responsibility of analysts to present findings and data clearly, in ways to minimize the likelihood of misinterpretation. Graphs should help this, but all too often, if drawn badly (whether deliberately or through oversight) they can make misinterpretation highly likely. I want to illustrate this danger with an unfortunate graph I came across recently in a very interesting—and good, and insightful—article on the US Election.

Take a look at this graph, taken from an article called The Road to Trumpsville: The Long, Long Mistreatment of the American Working Class, by Jeremy Grantham.1

Exhibit 1: Corporate Profits and Employee Compensation

In the article, this graph ("Exhibit 1") is described as follows by Grantham:

The combined result is shown in Exhibit 1: the share of GDP going to labor hit historical lows as recently as 2014 and the share going to corporate profits hit a simultaneous high.

Is that what you interpret from the graph? I agree with these words, but they don't really sum up my first reading of the graph. Rather, I think the natural reading of the graph is as follows:

Wow: Labor's share and Capital's share of GDP crossed over, apparently for good, around 2002. Before then, Capital's share was mostly materially lower than Labor's (though they were nearly equal, briefly, in 1965, and crossed for a few years in 1995), but over the 66-year period shown Capital's share increased while Labor's fell, until now Capital is taking about four times as much as Labor.

I think something like that is what most people will read from the graph, unless they read it particularly carefully.

But that is not what this graph is saying. In fact, this is one of the most misleading graphs I have ever come across.

If you look carefully, the two lines use different scales: the red one, for Labor, uses the scale on the right, which runs from 23% to about 34%, whereas the blue line for Capital, uses the scale on the left, which runs from 3% to 11%.

Dual-scale graphs are always difficult to read; so difficult, in fact, that my personal recommendation is

Never plot data on two different scales on the same graph.

Not everyone agrees with this, but most people accept that dual-scale graphs are confusing and hard to read. Even by the standards of dual-scale graphs, however, this is bad.

Here are the problems, in roughly decreasing order of importance:

  1. The two lines are showing commensurate2 figures of roughly the same order of magnitude, so could and should have been on the same scale: this isn't a case of showing price against volume, where the units are different, or even a case in which one size is in millimetres and another in miles: these are both percentages, of the same thing, all between 4% and 32%.
  2. The graphs cross over when the data doesn't. The very strong suggestion from the graphs that we go from Labor's share of GDP exceeding that of Capital to being radically lower than that of Capital is entirely false.
  3. Despite measuring the same quantity, the magnification is different on the two axes (i.e. the distance on the page between ticks is different, and the percentage-point gap represented by ticks on the two scales is different). As a consequence slopes (gradients) are not comparable.
  4. Neither scale goes to zero.
  5. The position of the two series relative to their scales is inconsistent: the Labor graph goes right down to the x-axis at its minimum (23%) while the Capital graph—whose minimum is also very close to an integer percentage—does not. This adds further to the impression that Labor's share has been absolutely annihilated.
  6. There are no gridlines to help you read the data. (Sure, gridlines are chart junk3, but are especially important when different scales are used, so you have some hope of reading the values.)

I want to be clear: I am not accusing Jeremy Grantham of deliberately plotting the data in a misleading way. I do not believe he intended to distort or manipulate. I suspect he's plotted it this way because stock graphs, which may well be the graphs he most often looks at,4 are frequently plotted with false zeros. Despite this, he has unfortunately plotted the graphs in a way5 that visually distorts the data in almost exactly the way I would choose to do if I wanted to make the points he is making appear even stronger than they are.

I don't have the source numbers, so I have gone through a rather painful exercise, of reading the numbers off the graph (at slightly coarser granularity) so that I can replot the graph as it should, in my opinion, have been plotted in the first place. (I apologise if I have misread any values; transcribing numbers from graphs is tedious and error-prone.) This is the result:

Exhibit 1 (revised): Same Data, with single, zero-based scale (redrawn approximation)

Even after I'd looked carefully at the scales and appreciated all the distortions in the original graph, I was quite shocked to see the data presented neutrally. To be clear: Grantham's textual summary of the data is accurate: a few years ago, Capital's share of GDP (from his figures) was at an all-time high—albeit not dramatically higher than in 1949 or about 1966—and Labor's share of GDP, a few years ago, was at an all-time low of around 23%, down from 30%. But the true picture just doesn't look like the graph Grantham showed. (Again: I feel a bit bad about going on about this graph from such a good article; but it encapsulates so many problematical practices that it makes a perfect illustration.)

How to Lie with Statistics

In 1954, Darrell Huff published a book called How to Lie with Statistics6. Chapter 5 is called The Gee-Whiz Graph. His first example is the following (neutrally presented) graph:

Exhibit 2 (neutral): Sales Data, zero-based scale (redrawn from original)

As Huff says:

That is very well if all you want to do is convey information. But suppose you wish to win an argument, shock a reader, move him into action, sell him something. For that, this chart lacks schmaltz. Chop off the bottom.

Exhibit 2 (non-zero-based): Sales Data, non-zero-based scale (redrawn from original)

Huff continues:

That's more like it. (You've saved paper7 too, something to point out if any carping fellow objects to your misleading graphics.)

But there's more, folks:

Now that you have practised to deceive, why stop with truncating? You have one more trick available that's worth a dozen of that. It will make your modest rise of ten per cent look livelier than one hundred percent is entitled to look. Simply change the proportion between the ordinate and the abscissa:

Exhibit 2 (non-zero-based, expanded): Sales Data, non-zero-based scale, expanded effect (redrawn from original)

Both of these unfortunate practices are present in Exhibit 1, and that's before we even get to dual scales.

Errors of Interpretation

In our various overviews of test-driven data analysis (e.g. this summary), we have described four major classes of errors:

  • errors of interpretation
  • errors of implementation (bugs)
  • errors of process
  • errors of applicability

Errors of interpretation can occur at any point in the process: not only are we, the analysts, susceptible to misinterpreting our inputs, our methods, our intermediate results and our outputs, but the recipients of our insights and analyses are in even greater danger of misinterpreting our results, because they have not worked through the process and seen all that we did. As analysts, we have a special responsibility to make our results as clear as possible, and hard to misinterpret. We should assume not that the reader will be diligent, unhurried and careful, reading every number and observing every subtlety, but that she or he will be hurried and will rely on us to have brought out the salient points and to have helped the reader towards the right conclusions.

The purpose of a graph is to allow a reader to assimilate large quantities of data, and to understand patterns therein, more quickly and more easily than is possible from tables of numbers. There are strong conventions about how to do that, based on known human strengths and weaknesses as well as commonsense "fair treatment" of different series.

However well intentioned, Exhibit 1 fails in every respect: I would guess very few casual readers would get an accurate impression from the data as presented.

If data scientists had the equivalent of a Hippocratic Oath, it would be something like:

First, do not mislead.

  1. The Road to Trumpsville: The Long, Long Mistreatment of the American Working Class, by Jeremy Grantham, in the GMO Quarterly Newsletter, 4Q, 2016. 

  2. two variables are commensurate if they are measured in the same units and it is meaningful to make a direct comparison between them. 

  3. Tufte describes all ink on a graph that is not actually plotting data as "chart junk", and advocates "maximizing data ink" (the amount of the ink on a graph actually devoted to plotting the data points) and minimizing chart junk. These are excellent principles. The Visual Display of Quantitative Information, Edward R. Tufte, Graphics Press (Cheshire, Connecticut) 1983. 

  4. Mr Grantham works for GMO, a "global investment management firm". 

  5. or has chosen to use a plot drawn this way, if he isn't responsible for the plot itself. 

  6. How to Lie with Statistics, Darrell Huff, published Victor Gollancz, 1954. Republished, 1973, by Pelican Books. 

  7. Obviously the "saving paper" argument had more force in 1954, and the constant references to "him", "he" and "fellows" similarly stood out less than they do today. 

TDDA 1-pager

Posted on Fri 10 February 2017 in TDDA • Tagged with tdda

We have written a 1-page summary of some of the core ideas in TDDA.

It is available as a PDF from

Coverage information for Rexpy

Posted on Tue 31 January 2017 in TDDA • Tagged with tdda, regular expressions

Rexpy Stats

We previously added rexpy to the Python tdda module. Rexpy is used to find regular expressions from example strings.

One of the most common requests from Rexpy users has been for information regarding how many examples each resulting regular expression matches.

We have now added a few methods to Rexpy to support this.

Currently this is only available in the Python library for Rexpy, available as part of the tdda module, with either

pip install tdda

or

git clone

Needless to say, we also plan to use this functionality in the online version of Rexpy in the future.

Rexpy: Quick Recap

The following example shows simple use of Rexpy from Python:

$ python
>>> from tdda import rexpy
>>> corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
>>> for r in rexpy.extract(corpus):
...     print(r)

In this case, Rexpy found a single regular expression that matched all the strings, but in many cases it returns a list of regular expressions, each covering some subset of the examples.
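For these strings the single result has the shape ^\d{1,3}\-[A-Z]{2}\-\d{3}$ (assumed here for illustration; the exact escaping may differ between Rexpy versions), which we can verify with the standard re module:

```python
import re

corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']

# the shape of the single extracted expression (assumed, not verbatim output)
pattern = re.compile(r'^\d{1,3}\-[A-Z]{2}\-\d{3}$')

assert all(pattern.match(s) for s in corpus)
```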

The way the algorithm currently works, in most cases1 each example will match only one regular expression, but in general, some examples might match more than one pattern. So we've designed the new functionality to work even when this is the case. We've provided three new methods on the Extractor class, which gives a more powerful API than the simple extract function.

Here's an example based on one of Rexpy's tests:

>>> urls2 = [
    '',  # actual duplicate
>>> x = rexpy.Extractor(urls2)
>>> for r in x.results.rex:
...     print(r)


As you can see, Rexpy has produced six different regular expressions, some of which should probably be collapsed together. The Extractor object we have created has three new methods available.

The New Coverage Methods

The simplest new method is coverage(dedup=False), which returns a list of the number of matches for each regular expression returned, in the same order as the regular expressions in x.results.rex. So:

>>> print(x.coverage())
[2, 3, 2, 4, 2, 3]

is the list of frequencies for the six regular expressions given, in order. So the pairings are illustrated by:

>>> for k, n in zip(x.results.rex, x.coverage()):
...     print('%d examples are matched by %s' % (n, k))

2 examples are matched by ^[a-z]{3,4}\.[a-z]{2,4}$
3 examples are matched by ^[a-z]+\.com\/$
2 examples are matched by ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
4 examples are matched by ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
2 examples are matched by ^http\:\/\/www\.[a-z]{6,8}\.com\/$
3 examples are matched by ^http\:\/\/www\.[a-z]+\.co\.uk\/$

The optional dedup parameter, when set to True, requests deduplicated frequencies, i.e. ignoring any duplicate strings passed in (remembering that Rexpy strips whitespace from both ends of input strings). In this case, there is just one duplicate string. So:

>>> print(x.coverage(dedup=True))
[2, 2, 2, 4, 2, 3]

where the second number (the matches for ^[a-z]+\.com\/$) is now 2, because the duplicated string is counted only once.

We can also find the total number of examples, with or without duplicates, by calling the n_examples(dedup=False) method:

>>> print(x.n_examples())
>>> print(x.n_examples(dedup=True))

But what we will probably normally be most interested in doing is sorting the regular expressions from highest to lowest coverage, ignoring any examples matched by an earlier pattern in cases where they do overlap. That's exactly what the incremental_coverage(dedup=False) method does for us. It returns an ordered dictionary.

>>> for (k, n) in x.incremental_coverage().items():
...     print('%d: %s' % (n, k))
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^[a-z]+\.com\/$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$

This shows, for our sixteen input strings (including duplicates), the number of examples matched by each expression that are not matched by any earlier expression. (As noted earlier, that caveat probably won't make any difference at the moment, but it will in future versions.) So, to be explicit, this is saying:

  • The regular expression that matches most examples is: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$ which matches 4 of the 16 strings.

  • Of the remaining 12 examples, 3 are matched by ^[a-z]+\.com\/$.

  • Of the remaining 9 examples, 3 more are matched by ^http\:\/\/www\.[a-z]+\.co\.uk\/$

  • and so on.

Note that Rexpy breaks ties by sorting the regular expressions as strings.
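The incremental ordering can be sketched in a few lines of stdlib Python (a greedy illustration of the idea, not tdda's own implementation):

```python
import re

def incremental_coverage(patterns, strings):
    """Greedy sketch: repeatedly pick the pattern matching the most
    not-yet-covered strings, breaking ties by string order of the
    patterns (max returns the first of equally good candidates)."""
    remaining = list(strings)
    unused = sorted(patterns)   # sorted order supplies the tie-break
    result = {}
    while unused:
        best = max(unused, key=lambda p: sum(1 for s in remaining
                                             if re.match(p, s)))
        matched = [s for s in remaining if re.match(best, s)]
        result[best] = len(matched)
        remaining = [s for s in remaining if not re.match(best, s)]
        unused.remove(best)
    return result

print(incremental_coverage([r'^\d+$', r'^[a-z]+$'],
                           ['abc', 'def', '12', 'x']))
# → {'^[a-z]+$': 3, '^\\d+$': 1}
```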

We can get the deduplicated numbers if we prefer:

>>> for (k, n) in x.incremental_coverage(dedup=True).items():
...     print('%d: %s' % (n, k))
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]+\.com\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$

That's all the new functionality for now. Let us know how you get on, and if you find any problems. And tweet your email address to @tdda0 if you want to join the TDDA Slack to discuss anything around the subject of test-driven data analysis.

[NOTE: This post was updated on 10.2.2017 after an update to the rexpy library changed function and attribute names from "sequential" (which was not very descriptive) to "incremental", which is better.]

  1. In fact, probably in all cases, currently 

The New ReferenceTest class for TDDA

Posted on Thu 26 January 2017 in TDDA • Tagged with tdda, constraints, rexpy

Since the last post, we have extended the reference test functionality in the Python tdda library. Major changes (as of version 0.2.5, at the time of writing) include:

  • Introduction of a new ReferenceTest class that has significantly more functionality than the previous (now deprecated) WritableTestCase.
  • Support for pytest as well as unittest.
  • Available from PyPI with pip install tdda, as well as from Github.
  • Support for comparing CSV files.
  • Support for comparing pandas DataFrames.
  • Support for preprocessing results before comparison (beyond simply dropping lines) in reference tests.
  • Greater consistency between parameters and options for all comparison methods
  • Support for categorizing kinds of reference data and rewriting only nominated categories (with -w or --write)
  • More (meta) tests of the reference test functionality.

Background: reference tests and WritableTestCase

We previously introduced the idea of a reference test with an example in the post First Test, and then when describing the WritableTestCase library. A reference test is essentially the TDDA equivalent of a software system test (or integration test), and is characterized by:

  • normally testing a relatively large unit of analysis functionality (up to and including whole analytical processes)
  • normally generating one or more large or complex outputs that are complex to verify (e.g. datasets, tables, graphs, files etc.)
  • sometimes featuring unimportant run-to-run differences that mean testing equality of actual and expected output will often fail (e.g. files may contain date stamps, version numbers, or random identifiers)
  • often being impractical to generate by hand
  • often needing to be regenerated automatically (after verification!) when formats change, when bugs are fixed or when understanding changes.

The old WritableTestCase class that we made available in the tdda library provided support for writing, running and updating such reference tests in Python by extending the unittest.TestCase class and providing methods for writing reference tests and commands for running (and, where necessary, updating the reference results used by) reference tests.

Deprecating WritableTestCase in Favour of ReferenceTest

Through our use of reference testing in various contexts and projects at Stochastic Solutions, we have ended up producing three different implementations of reference-test libraries, each with different capabilities. We also became aware that an increasing number of Python developers have a marked preference for pytest over unittest, and wanted to support that more naturally. The new ReferenceTest class brings together all the capabilities we have developed, standardizes them and fills in missing combinations, while providing idiomatic patterns for use with both unittest and pytest.

We have no immediate plans to remove WritableTestCase from the tdda library (and, indeed, continue to use it extensively ourselves), but encourage people to adopt ReferenceTest instead as we believe it is superior in all respects.

Availability and Installation

You can now install the tdda module with pip:

pip install tdda

and the source remains available under an MIT licence from github with

git clone

The tdda library works under Python 2.7 and Python 3 and includes the reference test functionality mentioned above; constraint discovery and verification, including a Pandas version; and automatic discovery of regular expressions from examples (rexpy, also available online).

After installation, you can run TDDA's tests as follows:

$ python -m tdda.testtdda
Ran 76 tests in 0.279s

Getting example code

However you have obtained the tdda module, you can get a copy of its reference test examples by running the command

$ python -m tdda.referencetest.examples

which will place them in a referencetest-examples subdirectory of your current directory. Alternatively, you can specify that you want them in a particular place with

$ python -m tdda.referencetest.examples /path/to/particular/place

There are variations for getting examples for the constraint generation and validation functionality, and for regular expression extraction (rexpy)

$ python -m tdda.constraints.examples [/path/to/particular/place]
$ python -m tdda.rexpy.examples [/path/to/particular/place]

Example use of ReferenceTest from unittest

Here is a cut-down example of how to use the ReferenceTest class with Python's unittest, based on the example in referencetest-examples/unittest. For those who prefer pytest, there is a similar pytest-ready example in referencetest-examples/pytest.

from __future__ import unicode_literals

import os
import sys
import tempfile

from tdda.referencetest import ReferenceTestCase

# ensure we can import the generators module in the directory above,
# wherever that happens to be
FILE_DIR = os.path.abspath(os.path.dirname(__file__))
PARENT_DIR = os.path.dirname(FILE_DIR)
sys.path.insert(0, PARENT_DIR)

from generators import generate_string, generate_file

class TestExample(ReferenceTestCase):
    def testExampleStringGeneration(self):
        actual = generate_string()
        self.assertStringCorrect(actual, 'string_result.html',
                                 ignore_substrings=['Copyright', 'Version'])

    def testExampleFileGeneration(self):
        outdir = tempfile.gettempdir()
        outpath = os.path.join(outdir, 'file_result.html')
        generate_file(outpath)
        self.assertFileCorrect(outpath, 'file_result.html',
                               ignore_patterns=['Copyright', 'Version'])

TestExample.set_default_data_location(os.path.join(PARENT_DIR, 'reference'))

if __name__ == '__main__':
    ReferenceTestCase.main()

  1. These tests illustrate comparing a string generated by some code to a reference string (stored in a file), and comparing a file generated by code to a reference file.

  2. We need to tell the ReferenceTest class where to find reference files used for comparison. The call to set_default_data_location, straight after defining the class, does this.

  3. The first test generates the string actual and compares it to the contents of the file string_result.html in the data location specified (../reference). The ignore_substrings parameter specifies strings which, when encountered in a line, cause that line to be omitted from the comparison.

  4. The second test instead writes a file to a temporary directory (but using the same name as the reference file). In this case, rather than ignore_substrings we have used an ignore_patterns parameter to specify regular expressions which, when matched, cause lines to be disregarded in comparisons.

  5. There are a number of other parameters that can be added to the various assert... methods to allow other kinds of discrepancies between actual and generated files to be disregarded.

Running the reference tests: Success, Failure and Rewriting

If you just run the code above, or the file in the examples, you should see output like this:

$ python
Ran 2 tests in 0.003s


or, if you use pytest, like this:

$ pytest
============================= test session starts ==============================
platform darwin -- Python 2.7.11, pytest-3.0.2, py-1.4.31, pluggy-0.3.1
rootdir: /Users/njr/tmp/referencetest-examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 2 items ..

=========================== 2 passed in 0.01 seconds ===========================

If you then edit in the directory above and make some change to the HTML in the generate_string and generate_file functions (preferably non-semantic changes, like adding an extra space before the > in </html>) and then rerun the tests, you should get failures. Changing just the generate_string function:

$ python unittest/
.1 line is different, starting at line 33
Expected file /Users/njr/tmp/referencetest-examples/reference/string_result.html
Note exclusions:
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
FAIL: testExampleStringGeneration (__main__.TestExample)
Traceback (most recent call last):
  File "unittest/", line 62, in testExampleStringGeneration
    ignore_substrings=['Copyright', 'Version'])
  File "/Users/njr/python/tdda/tdda/referencetest/", line 527, in assertStringCorrect
    self.check_failures(failures, msgs)
  File "/Users/njr/python/tdda/tdda/referencetest/", line 709, in check_failures
    self.assert_fn(failures == 0, '\n'.join(msgs))
AssertionError: 1 line is different, starting at line 33
Expected file /Users/njr/tmp/referencetest-examples/reference/string_result.html
Note exclusions:
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html

    Ran 2 tests in 0.005s

As expected, the string test now fails, and the ReferenceTest library suggests a command you can use to diff the output: because the test failed, it wrote the actual output to a temporary file. (It reports the failure twice, once as it occurs and once at the end. This is deliberate as it's convenient to see it when it happens if the tests take any non-trivial amount of time to run, and convenient to collect together all the failures at the end too.)

Because these are HTML files, I would probably instead open them both (using the open command on Mac OS) and visually inspect them. In this case, the pages look identical, and diff will confirm that the changes are only those we expect:

$ diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
<     Copyright (c) Stochastic Solutions, 2016
<     Version 1.0.0

>     Copyright (c) Stochastic Solutions Limited, 2016
>     Version 0.0.0
< </html >
> </html>

In this case

  • We see that the copyright and version lines are different, but we used ignore_substrings to say that's OK
  • It shows us the extra space before the close tag.

If we are happy that the new output is correct and should replace the previous reference output, we can rerun with the -W (or --write-all) flag.

$ python unittest/ -W
Written /Users/njr/tmp/referencetest-examples/reference/file_result.html
.Written /Users/njr/tmp/referencetest-examples/reference/string_result.html
Ran 2 tests in 0.003s


Now if you run them again without -W, the tests should all pass.

You can do the same with pytest, except that in this case you need to use

pytest --write-all

because pytest does not allow short flags.

Other kinds of tests

Since this post is already quite long, we won't go through all the other options, parameters and kinds of tests in detail, but will mention a few other points:

  • In addition to assertStringCorrect and assertFileCorrect there are various other methods available:

    • assertFilesCorrect for checking that multiple files are as expected
    • assertCSVFileCorrect for checking a single CSV file
    • assertCSVFilesCorrect for checking multiple CSV files
    • assertDataFramesEqual to check equality of Pandas DataFrames
    • assertDataFrameCorrect to check a data frame matches data in a CSV file.
  • Where appropriate, assert methods accept the various optional parameters, including:

    • lstrip — ignore whitespace at start of files/strings (default False)
    • rstrip — ignore whitespace at end of files/strings (default False)
    • kind — optional category for test; these can be used to allow selective rewriting of test output with the -w/--write flag
    • preprocess — function to call on expected and actual data before comparing results

We'll add more detail in future posts.

If you'd like to join the slack group where we discuss TDDA and related topics, DM your email address to @tdda on Twitter and we'll send you an invitation.

Introducing Rexpy: Automatic Discovery of Regular Expressions

Posted on Fri 11 November 2016 in TDDA • Tagged with tdda, constraints, pandas, regular expressions


There's a Skyscanner data feed we have been working with for a year or so. It's produced some six million records so far, each of which has a transaction ID consisting of three parts—a four-digit alphanumeric transaction type, a numeric timestamp and a UUID, with the three parts separated by hyphens. Things like this:


The only thing that really matters is that the transaction IDs are unique, but if everything is working correctly, the three parts should have the right structure and match data that we have in other fields in the feed.

We're pretty familiar with this data; or so we thought . . .

We've added a command to our data analysis software—Miró—for characterizing the patterns in string fields. The command is rex and when we run it on the field (tid), by saying:

[1] load skyscanner/transactions.miro
transactions.miro: 6,020,946 records; 6,020,946 (100%) selected; 57 fields.

[2] rex tid

this is the output:


The rex command has found four patterns that, between them, characterize all the values in the field, and it has expressed these in the form of regular expressions. (If you aren't familiar with regular expressions, you might want to read the linked Wikipedia article.)

The third pattern in the list is more-or-less the one we thought would characterize all the transaction IDs. It reads:

  • start1 (^)
  • then 2–4 letters or numbers ([A-Za-z0-9]{2,4})
  • then a mixture of one to three hyphens and underscores ([\-\_]{1,3})
  • then 10 digits (\d{10})
  • then a hyphen (\-)
  • then a UUID, which is a hyphen-separated collection of 32 hex digits, in groups of 8, 4, 4, 4 and 12 ([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})
  • and end2 ($).

The only difference between this and what we expected is that it turns out that some of the "alphanumeric transaction types" end with two underscores rather than being strictly alphanumeric, and the rex command has expressed this as "2-4 alphanumeric characters followed by 1-3 hyphens or underscores", rather than "2-4 characters that are alphanumeric or underscores, followed by a hyphen", which would have been more like the way we think about it.
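To check that the pattern behaves as described, we can try it with Python's re module. The transaction ID below is invented to have the expected shape (real IDs are not reproduced here):

```python
import re

# The third pattern rex found, reproduced from the list above
TID_PATTERN = re.compile(
    r'^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-'
    r'[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$'
)

# A made-up transaction ID with the expected structure (not real data)
tid = 'dckx-1466604137-1aada032-aa73-48e1-ac0f-cfdd02a80f9c'
print(bool(TID_PATTERN.match(tid)))   # True
```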

What are the other three expressions?

The first one is the same, but without the UUID. Occasionally the UUID is missing (null), and when this happens the UUID portion of the tid is blank. It is possible to write a regular expression that combines these two cases, but rex doesn't quite yet know how to do this. The way to do it would be to make the UUID optional by enclosing it in parentheses and following it with a question mark:

^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})?$

which reads zero or one UUIDs, given that we know the pattern inside the parentheses corresponds to a UUID.
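As a quick sketch of why the optional group works, here it is tried against two invented tids (both made up for illustration), one with a UUID and one with the UUID portion blank:

```python
import re

# UUID made optional by wrapping it in (...)? -- one pattern now covers
# tids both with and without the UUID portion (a sketch on invented data)
pattern = re.compile(
    r'^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-'
    r'([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})?$'
)

full = 'adyn-1466604137-1aada032-aa73-48e1-ac0f-cfdd02a80f9c'
missing_uuid = 'adyn-1466604137-'   # null UUID: the UUID portion is blank
print(bool(pattern.match(full)), bool(pattern.match(missing_uuid)))  # True True
```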

The last pattern is another one we could unify with the other two: it is the same except that it identifies a particular transaction type that again uses underscores, but now in the middle: q__s. So we could replace those three with a single pattern (one that, in an ideal world, we might want rex to find itself):

^[A-Za-z0-9\_]{2,4}\-\d{10}\-([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})?$
This is exactly what we described, except with the inclusion of _ as an allowable character in the 4-character transaction type at the start.

But what about the second pattern?

^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$
It's actually a single, completely specific transaction ID, that matches the main pattern except for omitting the hyphens in the UUID. This shouldn't be possible—by which I mean, the UUID generator should never omit the hyphens. But clearly either it did for this one transaction, or something else stripped them out later. Either way, it shows that among the several million transactions, there is a bad transaction ID.

A further check in Miró shows that this occurs just a single time (i.e. it is not duplicated):

[4]>  select tid = "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c"
transactions.miro: 6,020,946 records; 1 (0%) selected; 57 fields.
(= tid "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c")

So we've learnt something interesting and useful about our data, and Miró's rex command has helped us produce regular expressions that characterize all our transaction IDs. It hasn't done a perfect job, but it was pretty useful, and it's easy for us to merge the three main patterns by hand. We plan to extend the functionality to cover these cases better over coming weeks.

Rexpy outside Miró

If you don't happen to have Miró, or want to find regular expressions from data outside the context of a Miró dataset, you can use equivalent functionality directly from the rexpy module of our open-source, MIT-licenced tdda package, available from Github:

git clone

This provides both a Python API for finding regular expressions from example data, and a command line tool.

The Rexpy Command Line

If you just run Rexpy from the command line, with no arguments, it expects you to type (or paste) a set of strings, and when you finish (with CTRL-D on Unix-like systems, or CTRL-Z on Windows systems) it will spit out the regular expressions it thinks you need to match them. For example:

$ python
G2 3PQ
^[A-Za-z0-9]{2,3}\ [A-Za-z0-9]{3}$

Or, of course, you can pipe input into it. If, for example, I do that in a folder full of photos from a Nikon camera, I get

$ ls *.* | python ~/python/tdda/rexpy/

(because these are all files like DSC_1234.NEF or DSC_9346.xmp).
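The regular expression Rexpy prints for this isn't reproduced above, but a pattern of roughly the following shape (an assumption on our part, not Rexpy's verbatim output) covers such filenames, and is easy to check with Python's re module:

```python
import re

# A hypothetical pattern of the kind Rexpy produces for these filenames
# (the actual output is not shown above, so this is an assumption)
pattern = re.compile(r'^DSC\_\d{4}\.[A-Za-z]{3}$')

for name in ('DSC_1234.NEF', 'DSC_9346.xmp'):
    print(name, bool(pattern.match(name)))   # both True
```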

You can also give it a filename as the first argument, in which case it will read strings (one per line) from a file.

So given the file ids.txt (which is in the rexpy/examples subdirectory in the tdda repository), containing:


we can use rexpy on it by saying:

$ python examples/ids.txt

and if there is a header line you want to skip, you can add either -h or --header to tell Rexpy to skip that.

You can also give a filename as a second command line argument, in which case Rexpy will write the results (one per line) to that file.

Motivation for Rexpy

There's an old joke among programmers, generally attributed to Jamie Zawinski, that bears repeating:

Some people, when confronted with a problem, think "I know, I'll use regular expressions."

Now they have two problems.

— Jamie Zawinski

As powerful as regular expressions are, even their best friends would probably concede that they can be hard to write, harder to read and harder still to debug.

Despite this, regular expressions are an attractive way to specify constraints on string fields for two main reasons:

  1. First, regular expressions constitute a fast, powerful, and near-ubiquitous mechanism for describing a wide variety of possible structures in string data.

  2. Secondly, regular expressions include the concept of capture groups. You may recall that in a grouped3 regular expression, one or more subcomponents are enclosed in parentheses and their matched values can be extracted. As we will see below, this is particularly interesting in the context of test-driven data analysis.

For concreteness, suppose we have a field containing strings such as:

123-AA-971
12-DQ-802
198-AA-045
1-BA-834
One obvious regular expression to describe these would be:

^\d+\-[A-Z]{2}\-\d{3}$
(start [^] with one or more digits [\d+], then a hyphen [\-], then two capital letters [A-Z]{2}, then another hyphen [\-], then three digits [\d{3}] then stop [$]).

But it is in the nature of regular expressions that there are both more and less specific formulations we could use to describe the same strings, ranging from the fairly specific:

^1[2389]{0,2}\-(AA|DQ|BA)\-\d{3}$

(start [^] with a 1 [1], followed by up to two digits chosen from {2, 3, 8, 9} [[2389]{0,2}], followed by a hyphen [\-] then AA, DQ or BA [(AA|DQ|BA)], followed by another hyphen [\-] then three digits [\d{3}] and finish [$])

to the fully general

^.*$
(start [^], then zero or more characters [.*], then stop [$]).

Going back to our first formulation

^\d+\-[A-Z]{2}\-\d{3}$

one possible grouped equivalent is

^(\d+)\-([A-Z]{2})\-(\d{3})$
The three parenthesized sections are known as groups, and regular expression implementations usually provide a way of looking up the value of these groups when a particular string is matched by it. For example, in Python we might say:

import re

pattern = re.compile(r'^(\d+)\-([A-Z]{2})\-(\d{3})$')
m = re.match(pattern, '123-ZQ-987')
if m:
    print('1: "%s"  2: "%s"  3: "%s"' % (m.group(1), m.group(2), m.group(3)))

m = re.match(pattern, '00-FT-020')
if m:
    print('1: "%s"  2: "%s"  3: "%s"' % (m.group(1), m.group(2), m.group(3)))

If we run this, the output is:

1: "123"  2: "ZQ"  3: "987"
1: "00"  2: "FT"  3: "020"

The Big Idea: Automatic Discovery of Constraints on String Structure

In the context of test-driven data analysis, the idea is probably obvious: for string fields with some kind of structure—telephone numbers, post codes, zip codes, UUIDs, more-or-less any kind of structured identifier, airline codes, airport codes, national insurance numbers, credit card numbers, bank sort codes, social security numbers—we would like to specify constraints on the values in the field using regular expressions. A natural extension to the TDDA constraints file format introduced in the last post would be something along the lines of:4

"regex": "^\\d+\\-[A-Z]{2}\\-\\d{3}$"

if there is a single regular expression that usefully matches all the allowed field values. If a field contains strings in multiple formats that are so different that using a single regular expression would be unhelpful, we might instead provide a list of regular expressions, such as:

"regex": ["^\\d+\\-[A-Z]{2}\\-\\d{3}$", "^[A-Z]{5}\\+\\d{5}$"]

which would mean each field value should match at least one of the regular expressions in the list.
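A minimal sketch of what "match at least one" means for a verifier (the helper name is illustrative, not part of the tdda API):

```python
import re

def satisfies_regex_constraint(value, regexes):
    """Return True if value matches at least one of the regular expressions.
    Illustrative helper, not part of the tdda API."""
    return any(re.match(rex, value) for rex in regexes)

regexes = [r'^\d+\-[A-Z]{2}\-\d{3}$', r'^[A-Z]{5}\+\d{5}$']
print(satisfies_regex_constraint('123-AA-971', regexes))   # True (first)
print(satisfies_regex_constraint('ABCDE+12345', regexes))  # True (second)
print(satisfies_regex_constraint('no-match', regexes))     # False (neither)
```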

Just as with the automatic discovery of other types of constraints, we want the TDDA constraint discovery library to be able to suggest suitable regular expression constraints on string fields, where appropriate. This is where Rexpy comes in:

Rexpy is a library for finding regular expressions that usefully characterize a given corpus of strings.

We are choosing to build Rexpy as a stand-alone module because it has clear utility outside the context of constraint generation.

The Other Idea: Automatically discovered Quasi-Fields

We can imagine going beyond simply using (and automatically discovering) regular expressions to describe constraints on string data. Once we have useful regular expressions that characterize some string data—and more particularly, in cases where we have a single regular expression that usefully describes the structure of the string—we can tag meaningful subcomponents. In the example we used above, we had three groups:

  • (\d+) — the digits at the start of the identifier
  • ([A-Z]{2}) — the pair of letters in the middle
  • (\d{3}) — the three digits at the end

It's not totally trivial to work out which subcomponents are useful to tag, but I think we could probably find pretty good heuristics that would do a reasonable job, at least in simple cases. Once we know the groups, we can potentially start to treat them as quasi-fields in their own right. So in this case, if we had a field ID containing string identifiers like those shown, we might create from that three quasi fields as follows:

  • ID_qf1, of type int, values 123, 12, 198, and 1
  • ID_qf2, of type string, values AA, DQ, AA, and BA
  • ID_qf3, of type int, values 971, 802, 45 and 834

Once we have these quasi fields, we can potentially subject them to the usual TDDA constraint generation process, which might suggest extra, stronger constraints. For example, we might find that the numbers in ID_qf3 are unique, or form a contiguous sequence, and we might find that although our regular expression only specified that there were two letters in the middle, in fact the only combinations found in the (full) data were AA, DQ, BA and BB.
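The quasi-field idea can be sketched directly with a grouped regular expression and Python's re module; the hard-coded int/string types here stand in for the type inference discussed below:

```python
import re

# Sketch: deriving quasi-fields from a grouped regular expression.
# The ID_qf naming follows the text; types are hard-coded, not inferred.
pattern = re.compile(r'^(\d+)\-([A-Z]{2})\-(\d{3})$')
ids = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']

matches = [pattern.match(v) for v in ids]
ID_qf1 = [int(m.group(1)) for m in matches]   # int quasi-field
ID_qf2 = [m.group(2) for m in matches]        # string quasi-field
ID_qf3 = [int(m.group(3)) for m in matches]   # int quasi-field

print(ID_qf1)   # [123, 12, 198, 1]
print(ID_qf2)   # ['AA', 'DQ', 'AA', 'BA']
print(ID_qf3)   # [971, 802, 45, 834]
```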

I don't want to suggest that all of this is easy: there are three or four non-trivial steps to get from where Rexpy is today to this full vision:

  • First, it has to get better than it is today at merging related regular expressions into a useful single regular expression with optional components and alternations.

  • Secondly, it would have to be able to identify good subcomponents for grouping.

  • Thirdly, it would have to do useful type inference on the groups it identifies.

  • Finally, it would have to be extended to create the quasi fields and apply the TDDA discovery process to them.

But none of this seems hopelessly difficult. So continue to watch this space.

The Rexpy API

Assuming you have cloned the TDDA library somewhere on your PYTHONPATH, you should then be able to use it through the API as follows:

from tdda import rexpy

corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
results = rexpy.extract(corpus)
print('Number of regular expressions found: %d' % len(results))
for rex in results:
    print('   ' +  rex)

which produces:

$ python
Number of regular expressions found: 1

In general, Rexpy returns a list of regular expressions, and at the moment it is not very good at merging them. There's quite a lot of unused code that, I hope, will soon allow it to do so. But even as it is, it can do a reasonable job of characterizing simple strings. Within reason, the more examples you give it, the better it can do, and it is reasonably performant with hundreds of thousands or even millions of strings.

Rexpy: the Pandas interface

There's also a Pandas binding. You can say:

from tdda import rexpy
results = rexpy.pdextract(df['A'])

to find regular expressions for the strings in column A of a dataframe df, or

from tdda import rexpy
results = rexpy.pdextract([df['A'], df['B'], df['C']])

to find a single set of regular expressions that match all the strings from columns A, B and C. In all cases, null (NaN) values are ignored. The results are returned as a (Python) list of regular expressions, as strings.

Final Words

Take it for a spin and let us know how you get on.

As always, follow us or tweet at us (@tdda0) if you want to hear more, and watch out for the TDDA Slack team, which will be opening up very soon.

  1. More precisely, ^ matches start of line; by default, rex always starts regular expressions with ^ and finishes them with $ on the assumption that strings will be presented one per line.

  2. Again, more precisely, $ matches end of line

  3. Grouped regular expressions are also referred to variously as marked or tagged regular expressions, and the groups are also sometimes known as subexpressions

  4. One thing to notice here is that in JSON we need extra backslashes in the regular expression. This is because regular expressions themselves make fairly liberal use of backslashes, and JSON uses backslash as an escape character. We could avoid this in Python by using raw strings, which are introduced with an r prefix (e.g. '^(\d+)\-([A-Z]{2})\-(\d{3})$'). In such strings, backslashes are not treated in any special way. Since JSON has no equivalent mechanism, we have to escape all our backslashes, leading to the ugliness above. 

The TDDA Constraints File Format

Posted on Fri 04 November 2016 in TDDA • Tagged with tdda, constraints, pandas


We recently extended the tdda library to include support for automatic discovery of constraints from datasets, and for verification of datasets against constraints. Yesterday's post—Constraint Discovery and Verification for Pandas DataFrames—describes these developments and the API.

The library we published is intended to be a base for producing various implementations of the constraint discovery and verification process, and uses a JSON file format (extension .tdda) to save constraints in a form that should be interchangeable between implementations. We currently have two compatible implementations—the open-source Pandas code in the library and the implementation in our own analytical software, Miró.

This post describes the .tdda JSON file format. The bulk of it is merely a snapshot of the documentation shipped with the library in the Github repository (visible on Github). We intend to keep that file up to date as we expand the format.

The TDDA JSON File Format

The TDDA constraints library (Repository, module constraints) uses a JSON file to store constraints.

This document describes that file format.


TDDA files describe constraints on a dataset, with a view to verifying the dataset to check whether any or all of the specified constraints are satisfied.

A dataset is assumed to consist of one or more fields (also known as columns), each of which has a (different) name1 and a well-defined type.2 Each field has a value for each of a number of records (also known as rows). In some cases, values may be null (or missing).3 Even a field consisting entirely of nulls can be considered to have a type.

Familiar examples of datasets include:

  • tables in relational databases
  • DataFrames (in Pandas and R)
  • flat ("CSV") files (subject to type inference or assignment)
  • sheets in spreadsheets, or areas within sheets, if the columns have names, are not merged, and have values with consistent meanings and types over an entire column
  • more generally, many forms of tabular data.

In principle, TDDA files are intended to be capable of supporting any kind of constraint regarding datasets. Today, we are primarily concerned with:

  • field types
  • minimum and maximum values (or, in the case of string fields, minimum and maximum string lengths)
  • whether nulls are allowed
  • whether duplicate values are allowed within a field
  • the allowed values for a field.

The format also has support for describing relations between fields.

Future extensions we can already foresee include:

  • dataset-level constraints (e.g. numbers of records; required or disallowed fields)
  • sortedness of fields, of field values or both
  • regular expressions to which string fields should conform
  • constraints on subsets of the data (e.g. records dated after July 2016 should not have null values for the ID field)
  • constraints on substructure within fields (e.g. constraints on tagged subexpressions from regular expressions to which string fields are expected to conform)
  • potentially checksums (though this is more suitable for checking the integrity of transfer of a specific dataset, than for use across multiple related datasets)
  • constraints between datasets, most obviously key relations (e.g. every value of field KA in dataset A should also occur in field KB in dataset B).

The motivation for generating, storing and verifying datasets against such sets of constraints is that they can provide a powerful way of detecting bad or unexpected inputs to, or outputs from, a data analysis process. They can also be valuable as checks on intermediate results. While manually generating constraints can be burdensome, automatic discovery of constraints from example datasets, potentially followed by manual removal of over-specific constraints, provides a good cost/benefit ratio in many situations.

Filename and encoding

  • The preferred extension for TDDA Constraints files is .tdda.

  • TDDA constraints files must be encoded as UTF-8.

  • TDDA files must be valid JSON.


This is an extremely simple example TDDA file:

    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [

General Structure

A TDDA file is a JSON dictionary. There are currently two supported top-level keys:

  • fields: constraints for individual fields, keyed on the field name. (In TDDA, we generally refer to dataset columns as fields.)

  • field_groups: constraints specifying relations between multiple fields (two, for now). field_groups constraints are keyed on a comma-separated list of the names of the fields to which they relate, and order is significant.

Both top-level keys are optional.

In future, we expect to add further top-level keys (e.g. for possible constraints on the number of rows, required or disallowed fields etc.)

The order of constraints in the file is immaterial (of course; this is JSON), though writers may choose to present fields in a particular order, e.g. dataset order or sorted on fieldname.

Field Constraints

The value of a field constraints entry (in the fields section) is a dictionary keyed on constraint kind. For example, the constraints on field a in the example above are specified as:

"a": {
    "type": "int",
    "min": 1,
    "max": 9,
    "sign": "positive",
    "max_nulls": 0,
    "no_duplicates": true

The TDDA library currently recognizes the following kinds of constraints:

  • type
  • min
  • max
  • min_length
  • max_length
  • sign
  • max_nulls
  • no_duplicates
  • allowed_values

Other constraint libraries are free to define their own, custom kinds of constraints. We will probably recommend that non-standard constraints have names beginning with a colon-terminated prefix. For example, if we wanted to support more specific Pandas type constraints, we would probably use a key such as pandas:type for this.

The value of a constraint is often simply a scalar value, but can be a list or a dictionary; when it is a dictionary, it should include a key value, with the principal value associated with the constraint (true, if there is no specific value beyond the name of the constraint).

If the value of a constraint (the scalar value, or the value key if the value is a dictionary) is null, this is taken to indicate the absence of a constraint. A constraint with value null should be completely ignored, so that a constraints file including null-valued constraints should produce identical results to one omitting those constraints. (This obviously means that we are discouraging using null as a meaningful constraint value, though a string "null" is fine, and in fact we use this for sign constraints.)

The semantics and values of the standard field constraint types are as follows:

  • type: the allowed (standard, TDDA) type of the field. This can be a single value from bool (boolean), int (integer; whole-numbered); real (floating point values); string (unicode in Python3; byte string in Python2) or date (any kind of date or date time, with or without timezone information). It can also be a list of such allowed values (in which case, order is not significant).

    It is up to the generation and verification libraries to map between the actual types in whatever dataset/dataframe/table/... object is used and these TDDA constraint types, though over time we may provide further guidance.


    • {"type": "int"}
    • {"type": ["int", "real"]}
  • min: the minimum allowed value for a field. This is often a simple value, but in the case of real fields, it can be convenient to specify a level of precision. In particular, a minimum value can have precision (default: fuzzy):

    • closed: all non-null values in the field must be greater than or equal to the value specified.
    • open: all non-null values in the field must be strictly greater than the value specified.
    • fuzzy: when the precision is specified as fuzzy, the verifier should allow a small degree of violation of the constraint without generating a failure. Verifiers take a parameter, epsilon, which specifies how the fuzzy constraints should be interpreted: epsilon is the fraction of the constraint value by which field values are allowed to violate the constraint without being considered to fail it. This defaults to 0.01 (i.e. 1%). Notice that this means that constraint values of zero are never fuzzy.

    Examples are:

    • {"min": 1},
    • {"min": 1.2},
    • {"min": {"value": 3.4}, {"precision": "fuzzy"}}.

    JSON does not—of course—have a date type. TDDA files specifying dates should use string representations in one of the following formats:

    • YYYY-MM-DD for dates without times
    • YYYY-MM-DD hh:mm:ss for date-times without timezone
    • YYYY-MM-DD hh:mm:ss [+-]ZZZZ for date-times with timezone.

    We recommend that writers use precisely these formats, but that readers offer some flexibility in reading, e.g. accepting / as well as - to separate date components, and T as well as space to separate the time component from the date.

  • max: the maximum allowed value for a field. Much like min, but for maximum values. Examples are:

    • {"max": 1},
    • {"max": 1.2},
    • {"max": {"value": 3.4}, {"precision": "closed"}}.

    Dates should be formatted as for min values.

  • min_length: the minimum allowed length of strings in a string field. How unicode strings are counted is up to the implementation. Example:

    • {"min_length": 2}
  • max_length: the maximum allowed length of strings in a string field. How unicode strings are counted is up to the implementation. Example:

    • {"max_length": 22}
  • sign: For numeric fields, the allowed sign of (non-null) values. Although this overlaps with minimum and maximum values, it is often useful to have a separate sign constraint, which carries semantically different information. Allowed values are:

    • positive: All values must be greater than zero
    • non-negative: No value may be less than zero
    • zero: All values must be zero
    • non-positive: No value may be greater than zero
    • negative: All values must be negative
    • null: No signed values are allowed, i.e. the field must be entirely null.


    • {"sign": "non-negative"}
  • max_nulls: The maximum number of nulls allowed in the field. This can be any non-negative value. We recommend only writing values of zero (no null values are allowed) or 1 (at most a single null is allowed) into this constraint, but checking against any value found.


    • {"max_nulls": 0}
  • no_duplicates: When this constraint is set on a field (with value true), it means that each non-null value must occur only once in the field. The current implementation only uses this constraint for string fields.


    • {"no_duplicates": true}
  • allowed_values: The value of this constraint is a list of allowed values for the field. The order is not significant.


    • {"allowed_values": ["red", "green", "blue"]}
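To make the semantics above concrete, here is a rough pure-Python sketch of checking a handful of these per-field constraints against a list of values, with None standing in for null. check_field is an illustrative name, not part of the library's API.

```python
def check_field(values, constraints):
    """Return the names of the constraints that fail for the given values.
    A minimal sketch covering a few of the per-field constraints described
    above; not the library's implementation."""
    nonnull = [v for v in values if v is not None]
    failures = []
    if 'min' in constraints and any(v < constraints['min'] for v in nonnull):
        failures.append('min')
    if 'max' in constraints and any(v > constraints['max'] for v in nonnull):
        failures.append('max')
    if 'sign' in constraints:
        sign = constraints['sign']
        if sign == 'positive' and any(v <= 0 for v in nonnull):
            failures.append('sign')
        elif sign == 'non-negative' and any(v < 0 for v in nonnull):
            failures.append('sign')
    if 'max_nulls' in constraints:
        if len(values) - len(nonnull) > constraints['max_nulls']:
            failures.append('max_nulls')
    if constraints.get('no_duplicates'):
        if len(set(nonnull)) < len(nonnull):
            failures.append('no_duplicates')
    if 'allowed_values' in constraints:
        if any(v not in constraints['allowed_values'] for v in nonnull):
            failures.append('allowed_values')
    return failures
```

For example, check_field([0, 10, None], {"min": 1, "max": 9, "max_nulls": 0}) reports failures for min, max and max_nulls.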

MultiField Constraints

Multifield constraints are not yet being generated by this implementation, though our (proprietary) Miró implementation does produce them. The currently planned constraint types for field relations cover field equality and inequality for pairs of fields, with options to specify null relations too.

A simple example would be:

"field_groups": {
    "StartDate,EndDate": {"lt": true}
}

This is a less-than constraint, to be interpreted as

  • StartDate < EndDate wherever StartDate and EndDate are both non-null.

The plan is to support the obvious five equality and inequality relations:

  • lt: first field value is strictly less than the second field value for each record
  • lte: first field value is less than or equal to the second field value for each record
  • eq: first field value is equal to the second field value for each record
  • gte: first field value is greater than or equal to the second field value for each record
  • gt: first field value is strictly greater than the second field value for each record.

In the case of equality (only), we will probably also support a precision parameter with values fuzzy or precise.

There should probably also be an option to specify relations between null values in pairs of columns, either as a separate constraint or as a qualifier on each of the above.
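Under the planned semantics, a null-tolerant lt check over pairs of values drawn from the two fields might look like this. This is an illustrative sketch of the intended behaviour, not the library's implementation.

```python
def check_lt(pairs):
    """Verify a {"lt": true} field-group constraint: the first value must
    be strictly less than the second wherever both are non-null (pairs
    with a null on either side are ignored)."""
    return all(first < second for (first, second) in pairs
               if first is not None and second is not None)
```

With (StartDate, EndDate) pairs, records where either date is null are simply skipped, matching the interpretation above.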

  1. Pandas, of course, allows multiple columns to have the same name. This format makes no concessions to such madness, though there is nothing to stop a verifier or generator sharing constraints across all columns with the same name. The Pandas generators and verifiers in this library do not currently attempt to do this. 

  2. Pandas also allows columns of mixed type. Again, this file format does not recognize such columns, and it would probably be sensible not to use type constraints for columns of mixed type. 

  3. Pandas uses not-a-number (NaN) to represent null values for numeric, string and boolean fields; it uses a special not-a-time (pandas.NaT) value to represent null date (Timestamp) values. 

Constraint Discovery and Verification for Pandas DataFrames

Posted on Thu 03 November 2016 in TDDA • Tagged with tdda, constraints, pandas


In a previous post, Constraints and Assertions, we introduced the idea of using constraints to verify input, output and intermediate datasets for an analytical process. We also demonstrated that candidate constraints can be automatically generated from example datasets. We prototyped this in our own software (Miró) expressing constraints as lisp S-expressions.

Improving and Extending the Approach: Open-Source Pandas Code

We have now taken the core ideas, polished them a little and made them available through an open-source library, currently on Github. We will push it to PyPI when it has solidified a little further.

The constraint code I'm referring to is available in the constraints module of the tdda repository for the tdda user on github. So if you issue the command

git clone

in a directory somewhere on your PYTHONPATH, this should enable you to use it.

The TDDA library:

  • is under an MIT licence;
  • runs under Python2 and Python3;
  • other than Pandas itself, has no dependencies outside the standard library unless you want to use feather files (see below);
  • includes a base layer to help with building constraint verification and discovery libraries for various systems;
  • includes Pandas implementations of constraint discovery and verification through a (Python) API;
  • uses a new JSON format (normally stored in .tdda files) for saving constraints;
  • also includes a prototype command-line tool for verifying a dataframe stored in feather format against a .tdda file of constraints. Feather is a file format developed by Wes McKinney (the original creator of Pandas) and Hadley Wickham (of ggplot and tidyverse fame) for dataframes that allows interchange between R and Pandas while preserving type information and exact values. It is based on the Apache Arrow project. It can be used directly or using our ugly-but-useful extension library pmmif, which allows extra metadata (including extended type information) to be saved alongside a .feather file, in a companion .pmm file.


All the constraint-handling code is in the constraints module within the TDDA repository.

After you've cloned the repository, it's probably a good idea to run the tests. There are two sets, and both should run under Python2 or Python3.

$ cd tdda/constraints
$ python
Ran 5 tests in 0.003s


$ python
Ran 21 tests in 0.123s


There is example code (which we'll walk through below) in the examples subdirectory of constraints.

Basic Use

Constraint Discovery

Here is some minimal code for getting the software to discover constraints satisfied by a Pandas DataFrame:

import numpy as np
import pandas as pd
from tdda.constraints.pdconstraints import discover_constraints

df = pd.DataFrame({'a': [1, 2, 9], 'b': ['one', 'two', np.nan]})
constraints = discover_constraints(df)
with open('/tmp/example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

(This is the core of the example code in tdda/constraints/examples/, and is included in the docstring for the discover_constraints function.)

Hopefully the code is fairly self-explanatory, but walking through the lines after the imports:

  • We first generate a trivial 3-row DataFrame with an integer column a and a string column b. The string column includes a Pandas null (NaN).
  • We then pass that DataFrame to the discover_constraints function from tdda.constraints.pdconstraints, and it returns a DatasetConstraints object, which is defined in tdda.constraints.base.
  • The resulting constraints object has a to_json() method which converts the structured constraints into a JSON string.
  • In the example, we write that to /tmp/example_constraints.tdda; we encourage everyone to use the .tdda extension for these JSON constraint files.

This is what happens if we run the example file:

$ cd tdda/constraints/examples

$ python
Written /tmp/example_constraints.tdda successfully.

$ cat /tmp/example_constraints.tdda
{
    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        },
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [
                "one",
                "two"
            ]
        }
    }
}
As you can see, in this case, the system has 'discovered' six constraints for each field, and it's rather easy to read what they are and at least roughly what they mean. We'll do a separate post describing the .tdda JSON file format, but it's documented in tdda/constraints/ in the repository (which—through the almost unfathomable power of Github—means you can see it formatted here).
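Since .tdda files are just JSON, they can also be loaded and inspected with nothing more than the standard library. A small sketch, using a hand-built, cut-down constraints document in the same shape as the output shown:

```python
import json

# A cut-down constraints document, hand-built for illustration.
tdda_json = '''
{
    "fields": {
        "a": {"type": "int", "min": 1, "max": 9},
        "b": {"type": "string", "max_nulls": 1}
    }
}
'''

constraints = json.loads(tdda_json)

# The top level maps "fields" to a dict of per-field constraint dicts.
field_names = sorted(constraints['fields'])          # ['a', 'b']
a_constraints = sorted(constraints['fields']['a'])   # ['max', 'min', 'type']
```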

Constraint Verification

Now that we have a .tdda file, we can use it to verify a DataFrame.

First, let's look at code that should lead to a successful verification (this code is in tdda/constraints/examples/

import numpy as np
import pandas as pd
from tdda.constraints.pdconstraints import verify_df

df = pd.DataFrame({'a': [2, 4], 'b': ['one', np.nan]})
v = verify_df(df, 'example_constraints.tdda')

print('Passes: %d' % v.passes)
print('Failures: %d\n\n\n' % v.failures)
print(str(v))
print(v.to_frame())

Again, hopefully the code is fairly self-explanatory, but:

  • df is a DataFrame that is different from the one we used to generate the constraints, but is nevertheless consistent with the constraints.
  • The verify_df function from tdda.constraints.pdconstraints takes a DataFrame and the location of a .tdda file and verifies the DataFrame against the constraints in the file. The function returns a PandasVerification object. The PandasVerification class is a subclass of Verification from tdda.constraints.base, adding the ability to turn the verification object into a DataFrame.
  • All verification objects include passes and failures attributes, which respectively indicate the number of constraints that passed and the number that failed. So the simplest complete verification is simply to check that v.failures == 0.
  • Verification objects also include a __str__ method. Its output currently includes a section for fields and a summary.
  • the .to_frame() method converts a PandasVerification into a Pandas DataFrame.

If we run this, the result is as follows:

$ python
Passes: 12
Failures: 0


a: 0 failures  6 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓  no_duplicates ✓

b: 0 failures  6 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓  allowed_values ✓


Passes: 12
Failures: 0

  field  failures  passes  type   min min_length   max max_length  sign  \
0     a         0       6  True  True        NaN  True        NaN  True
1     b         0       6  True   NaN       True   NaN       True   NaN

  max_nulls no_duplicates allowed_values
0      True          True            NaN
1      True          True           True

In the DataFrame produced from the Verification:

  • True indicates a constraint that was satisfied in the dataset;
  • False indicates a constraint that was not satisfied in the dataset;
  • NaN (null) indicates a constraint that was not present for that field.

As you would expect, the field, failures and passes columns are, respectively, the name of the field, the number of failures and the number of passes for that field.

If we now change the DataFrame definition to:

df = pd.DataFrame({'a': [0, 1, 2, 10, np.nan],
                   'b': ['one', 'one', 'two', 'three', np.nan]})

(as is the case in tdda/constraints/examples/), we now expect some constraint failures. If we run this, we see:

$ python
Passes: 5
Failures: 7


a: 4 failures  2 passes  type ✓  min ✗  max ✗  sign ✗  max_nulls ✗  no_duplicates ✓

b: 3 failures  3 passes  type ✓  min_length ✓  max_length ✗  max_nulls ✓  no_duplicates ✗  allowed_values ✗


Passes: 5
Failures: 7

  field  failures  passes  type    min min_length    max max_length   sign  \
0     a         4       2  True  False        NaN  False        NaN  False
1     b         3       3  True    NaN       True    NaN      False    NaN

  max_nulls no_duplicates allowed_values
0     False          True            NaN
1      True         False          False
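One useful way to read such a table is to collect the (field, constraint) pairs that actually failed, remembering that a missing constraint (NaN) is not a failure. Here is a sketch on plain dictionaries mirroring the failing run above; the results dict is hand-built for illustration, not produced by the library.

```python
# True = constraint passed, False = failed, None = no such constraint
# on that field (shown as NaN in the DataFrame form).
results = {
    'a': {'type': True, 'min': False, 'max': False, 'sign': False,
          'max_nulls': False, 'no_duplicates': True},
    'b': {'type': True, 'min_length': True, 'max_length': False,
          'max_nulls': True, 'no_duplicates': False,
          'allowed_values': False},
}

# None means "constraint absent", which is not a failure, so test
# explicitly for False rather than mere falsiness.
failing = [(field, name)
           for field, checks in sorted(results.items())
           for name, ok in sorted(checks.items())
           if ok is False]
```

With the failing example above, this yields the seven failures reported in the summary.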

Final Notes

There are more options and there's more to say about the Pandas implementation, but that's probably enough for one post. We'll have follow-ups on the file format, more options, and the foibles of Pandas.

If you want to hear more, follow us on twitter at @tdda0.

WritableTestCase: Example Use

Posted on Sun 18 September 2016 in TDDA • Tagged with tdda

In my PyCon UK talk yesterday I promised to update and document the copy of writabletestcase.WritableTestCase on GitHub.

The version I've put up is not quite as powerful as the example I showed in the talk—that will follow—but has the basic functionality.

I've now added examples to the repository and, below, show how these work.

The library is available with

git clone

WritableTestCase extends unittest.TestCase, from Python's standard library, in three main ways:

  • It provides methods for testing strings produced in memory or files written to disk against reference results in files. When a test fails, rather than just showing a hard-to-read difference, it writes the actual result to file (if necessary) and then shows the diff command needed to compare it—something like this:

    Compare with "diff /path/to/actual-output /path/to/expected-output"

    Obviously, the diff command can be replaced with a graphical diff tool, an open command or whatever.

    Although this shouldn't be necessary (see below), you also have the option, after verification, of replacing diff with cp to copy the actual output as the new reference output.

  • Secondly, the code supports excluding lines of the output that contain nominated strings. This is often handy for excluding things like date stamps, version numbers, copyright notices etc. These often appear in output, and vary, without affecting the semantics.

    (The version of the library I showed at PyCon had more powerful variants of this, which I'll add later.)

  • Thirdly, if you verify that the new output is correct, the library supports re-running with the -w flag to overwrite the expected ("reference") results with the ones generated by the code.

    Obviously, if this feature is abused, the value of the tests will be lost, but provided you check the output carefully before re-writing, this is a significant convenience.
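The exclusion mechanism in the second point can be sketched in a few lines: drop any line containing one of the nominated substrings from both sides before comparing. This is illustrative only, not the library's code.

```python
def strip_excluded(lines, exclusions):
    """Drop lines containing any of the nominated substrings."""
    return [line for line in lines
            if not any(substring in line for substring in exclusions)]

def equal_except_exclusions(actual, expected, exclusions):
    """Compare two bodies of text line by line, ignoring lines that
    contain any of the exclusion strings (e.g. 'Copyright', 'Version')."""
    return (strip_excluded(actual.splitlines(), exclusions)
            == strip_excluded(expected.splitlines(), exclusions))
```

With this, two outputs that differ only on their Copyright and Version lines compare as equal.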

The example code is in the examples subdirectory. It has two test functions, one of which generates HTML output as a string, and the other of which produces some slightly different HTML output as a file. In each case, the output produced by the function is not identical to the expected "reference" output (in examples/reference), but differs only on lines containing "Copyright" and "Version". Since these are passed into the test as exclusions, the tests should pass.

Here is the example code:

# -*- coding: utf-8 -*-
"""
A simple example of how to use writabletestcase.WritableTestCase.

Source repository:

License: MIT

Copyright (c) Stochastic Solutions Limited 2016
"""

from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import tempfile

from tdda import writabletestcase
from tdda.examples.generators import generate_string, generate_file


class TestExample(writabletestcase.WritableTestCase):
    def testExampleStringGeneration(self):
        """
        This test uses generate_string() from tdda.examples.generators
        to generate some HTML as a string.

        It is similar to the reference HTML in
        tdda/examples/reference/string_result.html except that the
        Copyright and version lines are slightly different.

        As shipped, the test should pass, because the ignore_patterns
        tell it to ignore those lines.

        Make a change to the generation code in the generate_string
        function to change the HTML output.

        The test should then fail and suggest a diff command to run
        to see the difference.

        Rerun with

            python -w

        and it should re-write the reference output to match your
        modified results.
        """
        actual = generate_string()
        this_dir = os.path.abspath(os.path.dirname(__file__))
        expected_file = os.path.join(this_dir, 'reference',
                                     'string_result.html')
        self.check_string_against_file(actual, expected_file,
                                       ignore_patterns=['Copyright',
                                                        'Version'])

    def testExampleFileGeneration(self):
        """
        This test uses generate_file() from tdda.examples.generators
        to generate some HTML as a file.

        It is similar to the reference HTML in
        tdda/examples/reference/file_result.html except that the
        Copyright and version lines are slightly different.

        As shipped, the test should pass, because the ignore_patterns
        tell it to ignore those lines.

        Make a change to the generation code in the generate_file function
        to change the HTML output.

        The test should then fail and suggest a diff command to run
        to see the difference.

        Rerun with

            python -w

        and it should re-write the reference output to match your
        modified results.
        """
        outdir = tempfile.gettempdir()
        outpath = os.path.join(outdir, 'file_result.html')
        generate_file(outpath)
        this_dir = os.path.abspath(os.path.dirname(__file__))
        expected_file = os.path.join(this_dir, 'reference',
                                     'file_result.html')
        self.check_file(outpath, expected_file,
                        ignore_patterns=['Copyright', 'Version'])


if __name__ == '__main__':
    writabletestcase.main()
If you download it, and try running it, you should see output similar to the following:

$ python
Ran 2 tests in 0.004s


The reference output files it compares against are:

  • examples/reference/string_result.html
  • examples/reference/file_result.html

To see what happens when there's a difference, try editing one or both of the main functions that generate the HTML. They're mostly just using explicit strings, so the simplest thing is just to change a word or something in the output.

If I change It's to It is in the generate_string() function and rerun, I get this output:

$ python
File check failed.
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html".

Note exclusions:
FAIL: testExampleStringGeneration (__main__.TestExample)
Traceback (most recent call last):
  File "", line 55, in testExampleStringGeneration
  File "/Users/njr/python/tdda/", line 294, in check_string_against_file
    self.assertEqual(failures, 0)
AssertionError: 1 != 0

Ran 2 tests in 0.005s

FAILED (failures=1)
1 godel:$

If I then run the diff command it suggests, the output is:

$ diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html
<     Copyright (c) Stochastic Solutions, 2016
<     Version 1.0.0
>     Copyright (c) Stochastic Solutions Limited, 2016
>     Version 0.0.0
<     It is not terribly exciting.
>     It's not terribly exciting.

Here you can see the differences that are excluded, and the change I actually made.

(The version I showed at PyCon has an option to see only the non-excluded differences, but this version doesn't; that will come!)

If I now run again using -w, to re-write the reference output, it shows:

$ python -w
.Expected file /Users/njr/python/tdda/examples/reference/string_result.html written.
Ran 2 tests in 0.003s


And, of course, if I run a third time, without -w, the test now passes:

$ python
Ran 2 tests in 0.003s


So that's a quick overview of how it works.

Slides and Rough Transcript of TDDA talk from PyCon UK 2016

Posted on Sat 17 September 2016 in TDDA • Tagged with tdda

Python UK 2016, Cardiff.

I gave a talk on test-driven data analysis at PyCon UK 2016, Cardiff, today.

The slides (which are kind-of useless without the words) are available here.

More usefully, a rough transcript, with thumbnail slides, is available here.

Extracting More Apple Health Data

Posted on Wed 20 April 2016 in TDDA • Tagged with xml, apple, health

The first version of the Python code for extracting data from the XML export from the Apple Health app on iOS neglected to extract Activity Summaries and Workout data. We will now fix that.

As usual, I'll remind you how to get the code, if you want, then discuss the changes to the code, the reference test and the unit tests. Then in the next post, we'll actually start looking at the data.

The Updated Code

As before, you can get the code from Github with

$ git clone

or if you have pulled it before, with

$ git pull --tags

This version of the code is tagged with v1.3, so if it has been updated by the time you read this, get that version with

$ git checkout v1.3

I'm not going to list all the code here, but will pull out a few key changes as we discuss them.


Change 1: Change FIELDS to handle three different field structures.

The first version of the extraction code wrote only Records, which contain the granular activity data (which is the vast bulk of it, by volume).

Now I want to extend the code to handle the other two main kinds of data it writes—ActivitySummary and Workout elements in the XML.

The three different element types contain different XML attributes, which correspond to different fields in the CSV, and although they overlap, I think the best approach is to have separate record structures for each, and then to create a dictionary mapping the element kind to its field information.

Accordingly, the code that sets FIELDS changes to become:

RECORD_FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('type', 's'),
    ('unit', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('value', 'n'),
))

ACTIVITY_SUMMARY_FIELDS = OrderedDict((
    ('dateComponents', 'd'),
    ('activeEnergyBurned', 'n'),
    ('activeEnergyBurnedGoal', 'n'),
    ('activeEnergyBurnedUnit', 's'),
    ('appleExerciseTime', 's'),
    ('appleExerciseTimeGoal', 's'),
    ('appleStandHours', 'n'),
    ('appleStandHoursGoal', 'n'),
))

WORKOUT_FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('workoutActivityType', 's'),
    ('duration', 'n'),
    ('durationUnit', 's'),
    ('totalDistance', 'n'),
    ('totalDistanceUnit', 's'),
    ('totalEnergyBurned', 'n'),
    ('totalEnergyBurnedUnit', 's'),
))

FIELDS = {
    'Record': RECORD_FIELDS,
    'ActivitySummary': ACTIVITY_SUMMARY_FIELDS,
    'Workout': WORKOUT_FIELDS,
}

and we have to change references (in both the main code and the test code) to refer to RECORD_FIELDS where previously there were references to FIELDS.

Change 2: Add a Workout to the test data

There was a single workout in the data I exported from the phone (a token one I performed primarily to generate a record of this type). I simply used grep to extract that line from export.xml and poked it into the test data file testdata/export6s3sample.xml.

Change 3: Update the tag and field counters

The code for counting record types previously considered only nodes of type Record. Now we also want to handle Workout and ActivitySummary elements. Workouts do come in different types (they have a workoutActivityType field), so it may be that we will want to write out different workout types into different CSV files, but since I have only, so far, seen a single workout, I don't really want to do this. So instead, we'll write all Workout elements to a corresponding Workout.csv file, and all ActivitySummary elements to an ActivitySummary.csv file.

Accordingly, the count_record_types method now uses an extra Counter attribute, other_types to count the number of each of these elements, keyed on their tag (i.e. Workout or ActivitySummary).

Change 4: Update the test results to reflect the new behaviour

Two of the unit tests introduced last time need to be updated to reflect Change 3. First, the field counts change, and secondly we need reference values for the other_types counts. Hence the new section in test_extracted_reference_stats:

    expectedOtherCounts = [
       ('ActivitySummary', 2),
       ('Workout', 1),
    ]

Change 5: Open (and close) files for Workouts and ActivitySummaries

We need to open new files for Workout.csv and ActivitySummary.csv if we have any such records. This is handled in the open_for_writing method.

Change 6: Write records for Workouts and ActivitySummaries

There are minor changes to the write_records method to allow it to handle writing Workout and ActivitySummary records. The only real difference is that the different CSV files have different fields, so we need to look up the right values, in the order specified by the header for each kind. The new code does that:

def write_records(self):
    kinds = FIELDS.keys()
    for node in self.nodes:
        if node.tag in kinds:
            attributes = node.attrib
            kind = attributes['type'] if node.tag == 'Record' else node.tag
            values = [format_value(attributes.get(field), datatype)
                      for (field, datatype) in FIELDS[node.tag].items()]
            line = encode(','.join(values) + '\n')
            self.handles[kind].write(line)

Change 7: Update the reference test

Finally, the reference test itself now generates two more files, so I've added reference copies of those to the testdata subdirectory and changed the test to loop over all four files:

def test_tiny_reference_extraction(self):
    path = copy_test_data()
    data = HealthDataExtractor(path, verbose=VERBOSE)
    data.extract()
    for kind in ('StepCount', 'DistanceWalkingRunning',
                 'Workout', 'ActivitySummary'):
        self.check_file('%s.csv' % kind)

Mission Accomplished

We've now extracted essentially all the data from the export.xml file from the Apple Health app, and created various tests for that extraction process. We'll start to look at the data in future posts. There is one more component in my extract—another XML file called export_cda.xml. This contains a ClinicalDocument, apparently conforming to a standard from (or possibly administered by) Health Level Seven International. It contains heart-rate data from my Apple Watch. I probably will extract it and publish the code for doing so, but later.

Unit Tests

Posted on Tue 19 April 2016 in TDDA • Tagged with xml, apple, health

In the last post, we presented some code for implementing a "reference" test for the code for extracting CSV files from the XML dump that the Apple Health app on iOS can produce.

We will now expand that test with a few other, smaller and more conventional unit tests. Each unit test focuses on a smaller block of functionality.

The Test Code

As before, you can get the code from Github with

$ git clone

or if you have pulled it previously, with

$ git pull

This version of the code is tagged with v1.2, so if it has been updated by the time you read this, get that version with

$ git checkout v1.2

Here is the updated test code.

# -*- coding: utf-8 -*-
"""
Unit tests for the applehealthdata extraction code.

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import shutil
import sys
import unittest

from collections import Counter

from applehealthdata import (HealthDataExtractor,
                             format_freqs, format_value,
                             abbreviate, encode)

CLEAN_UP = True
VERBOSE = False

def get_base_dir():
    """
    Return the directory containing this test file,
    which will (normally) be the applehealthdata directory
    also containing the testdata dir.
    """
    return os.path.split(os.path.abspath(__file__))[0]

def get_testdata_dir():
    """Return the full path to the testdata directory"""
    return os.path.join(get_base_dir(), 'testdata')

def get_tmp_dir():
    """Return the full path to the tmp directory"""
    return os.path.join(get_base_dir(), 'tmp')

def remove_any_tmp_dir():
    """
    Remove the temporary directory if it exists.
    Returns its location either way.
    """
    tmp_dir = get_tmp_dir()
    if os.path.exists(tmp_dir):
        shutil.rmtree(tmp_dir)
    return tmp_dir

def make_tmp_dir():
    """
    Remove any existing tmp directory.
    Create empty tmp directory.
    Return the location of the tmp dir.
    """
    tmp_dir = remove_any_tmp_dir()
    os.makedirs(tmp_dir)
    return tmp_dir

def copy_test_data():
    """
    Copy the test data export6s3sample.xml from testdata directory
    to tmp directory.
    """
    tmp_dir = make_tmp_dir()
    name = 'export6s3sample.xml'
    in_xml_file = os.path.join(get_testdata_dir(), name)
    out_xml_file = os.path.join(get_tmp_dir(), name)
    shutil.copyfile(in_xml_file, out_xml_file)
    return out_xml_file

class TestAppleHealthDataExtractor(unittest.TestCase):

    @classmethod
    def tearDownClass(cls):
        """Clean up by removing the tmp directory, if it exists."""
        if CLEAN_UP:
            remove_any_tmp_dir()

    def check_file(self, filename):
        expected_output = os.path.join(get_testdata_dir(), filename)
        actual_output = os.path.join(get_tmp_dir(), filename)
        with open(expected_output) as f:
            expected = f.read()
        with open(actual_output) as f:
            actual = f.read()
        self.assertEqual(expected, actual)

    def test_tiny_reference_extraction(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)
        data.extract()
        self.check_file('StepCount.csv')
        self.check_file('DistanceWalkingRunning.csv')

    def test_format_freqs(self):
        counts = Counter()
        self.assertEqual(format_freqs(counts), '')
        counts['one'] += 1
        self.assertEqual(format_freqs(counts), 'one: 1')
        counts['one'] += 1
        self.assertEqual(format_freqs(counts), 'one: 2')
        counts['two'] += 1
        counts['three'] += 1
        self.assertEqual(format_freqs(counts),
                         '''one: 2
three: 1
two: 1''')

    def test_format_null_values(self):
        for dt in ('s', 'n', 'd', 'z'):
            # Note: even an illegal type, z, produces correct output for
            # null values.
            # Questionable, but we'll leave as a feature
            self.assertEqual(format_value(None, dt), '')

    def test_format_numeric_values(self):
        cases = {
            '0': '0',
            '3': '3',
            '-1': '-1',
            '2.5': '2.5',
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 'n')), (k, v))

    def test_format_date_values(self):
        hearts = 'any string not need escaping or quoting; even this: ♥♥'
        cases = {
            '01/02/2000 12:34:56': '01/02/2000 12:34:56',
            hearts: hearts,
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 'd')), (k, v))

    def test_format_string_values(self):
        cases = {
            'a': '"a"',
            '': '""',
            'one "2" three': r'"one \"2\" three"',
            r'1\2\3': r'"1\\2\\3"',
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 's')), (k, v))

    def test_abbreviate(self):
        changed = {
            'HKQuantityTypeIdentifierHeight': 'Height',
            'HKQuantityTypeIdentifierStepCount': 'StepCount',
            'HK*TypeIdentifierStepCount': 'StepCount',
            'HKCharacteristicTypeIdentifierDateOfBirth': 'DateOfBirth',
            'HKCharacteristicTypeIdentifierBiologicalSex': 'BiologicalSex',
            'HKCharacteristicTypeIdentifierBloodType': 'BloodType',
        }
        unchanged = [
        ]
        for (k, v) in changed.items():
            self.assertEqual((k, abbreviate(k)), (k, v))
            self.assertEqual((k, abbreviate(k, False)), (k, k))
        for k in unchanged:
            self.assertEqual((k, abbreviate(k)), (k, k))

    def test_encode(self):
        # This test looks strange, but because of the import statements
        #     from __future__ import unicode_literals
        # in Python 2, type('a') is unicode, and the point of the encode
        # function is to ensure that it has been converted to a UTF-8 string
        # before writing to file.
        self.assertEqual(type(encode('a')), str)

    def test_extracted_reference_stats(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)

        self.assertEqual(data.n_nodes, 19)

        expectedRecordCounts = [
           ('DistanceWalkingRunning', 5),
           ('StepCount', 10),
        ]
        self.assertEqual(sorted(data.record_types.items()),
                         expectedRecordCounts)

        expectedTagCounts = [
           ('ActivitySummary', 2),
           ('ExportDate', 1),
           ('Me', 1),
           ('Record', 15),
        ]
        self.assertEqual(sorted(data.tags.items()), expectedTagCounts)

        expectedFieldCounts = [
            ('HKCharacteristicTypeIdentifierBiologicalSex', 1),
            ('HKCharacteristicTypeIdentifierBloodType', 1),
            ('HKCharacteristicTypeIdentifierDateOfBirth', 1),
            ('HKCharacteristicTypeIdentifierFitzpatrickSkinType', 1),
            ('activeEnergyBurned', 2),
            ('activeEnergyBurnedGoal', 2),
            ('activeEnergyBurnedUnit', 2),
            ('appleExerciseTime', 2),
            ('appleExerciseTimeGoal', 2),
            ('appleStandHours', 2),
            ('appleStandHoursGoal', 2),
            ('creationDate', 15),
            ('dateComponents', 2),
            ('endDate', 15),
            ('sourceName', 15),
            ('startDate', 15),
            ('type', 15),
            ('unit', 15),
            ('value', 16),
        ]
        self.assertEqual(sorted(data.fields.items()), expectedFieldCounts)

if __name__ == '__main__':
    unittest.main()


We're not going to discuss every part of the code, but will point out a few salient features.

  • I've added a coding line at the top of both the test script and the main script. This tells Python (and my editor, Emacs) the encoding of the file on disk (UTF-8). This is now relevant because one of the new tests (test_format_date_values) features a non-ASCII character in a string literal.

  • The previous test method test_tiny_fixed_extraction has been renamed test_tiny_reference_extraction, but is otherwise unchanged.

  • Several of the tests loop over dictionaries or lists of input-output pairs, with an assertion of some kind in the main body. Some people don't like this, and prefer one assertion per test. I don't really agree with that, but do think it's important to be able to see easily which assertion fails. An idiom I often use to assist this is to include the input on both sides of the test. For example, in test_abbreviate, when checking the abbreviation of items that should change, the code reads:

    for (k, v) in changed.items():
        self.assertEqual((k, abbreviate(k)), (k, v))

    rather than

    for (k, v) in changed.items():
        self.assertEqual(abbreviate(k), v)

    This makes it easy to tell which input fails, if one does, even in cases in which the main values being compared (abbreviate(k) and v, in this case) are long, complex or repeated across different inputs. It doesn't actually make much difference in these examples, but in general I find it helpful.
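    To see why the idiom helps, here is a hypothetical, self-contained version (abbreviate_demo is a stand-in for the real abbreviate function, not code from the repository):

    ```python
    import unittest

    def abbreviate_demo(s):
        # Hypothetical stand-in for the real abbreviate() function.
        return s.replace('Identifier', '')

    class DemoTest(unittest.TestCase):
        def test_abbreviate(self):
            changed = {'HKCharacteristicTypeIdentifierBloodType':
                       'HKCharacteristicTypeBloodType'}
            for (k, v) in changed.items():
                # With k on both sides, a failure message names the
                # offending input as well as the mismatched values.
                self.assertEqual((k, abbreviate_demo(k)), (k, v))

    suite = unittest.TestLoader().loadTestsFromTestCase(DemoTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    ```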

  • The test test_extracted_reference_stats checks that three counters used by the code work as expected. Some people would definitely advocate splitting this into three tests but, even though it's quick, it seems more natural to me to test them together. This also means we don't have to process the XML file three times. There are other ways of achieving the same end, and this approach has the potential disadvantage that the later checks won't run if an earlier one fails.

    The other point to note here is that the Counter objects are unordered, so I've sorted the expected results on their keys, and compared the actual values using Python's sorted function, which returns a new list containing the values of a list (or other iterable) in sorted order. We could avoid the sort by constructing sets or dictionaries from the Counter objects and checking those instead, but the sort here is not expensive, and this approach is probably simpler.
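    A minimal sketch of that comparison, using the record counts from the test data:

    ```python
    from collections import Counter

    # Counters are unordered, so compare sorted (key, count) pairs
    # against a sorted list of expected pairs.
    counts = Counter(['StepCount'] * 10 + ['DistanceWalkingRunning'] * 5)
    expected = [('DistanceWalkingRunning', 5), ('StepCount', 10)]
    assert sorted(counts.items()) == expected
    ```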

  • I haven't bothered to write a separate test for the extraction phase (checking that it writes the right CSV files) because that seems to me to add almost nothing over the existing reference test (test_tiny_reference_extraction).


That's it for this post. The unit tests are not terribly exciting, but they will prove useful as we extend the extraction code, which we'll start to do in the next post.

In a few posts' time, we will start analysing the data extracted from the app; it will be interesting to see whether, at that stage, we discover any more serious problems with the extraction code. Experience teaches that we probably will.

First Test

Posted on Mon 18 April 2016 in TDDA • Tagged with xml, apple, health

In the last post, I presented some code for extracting (some of) the data from the XML file exported by the Apple Health app on iOS, but—almost comically, given this blog's theme—omitted to include any tests. This post and the next couple (in quick succession) will aim to fix that.

This post begins to remedy that by writing a single "reference" test. To recap: a reference test is a test that tests a whole analytical process, checking that the known inputs produce the expected outputs. So far, our analytical process is quite small, consisting only of data extraction, but this will still prove very worthwhile.


While the mainstream TDD dogma states that tests should be written before the code, it is far from uncommon to write them afterwards, and in the context of test-driven data analysis I maintain that this is usually preferable. Regardless, when you find yourself in a situation in which you have written some code and possess any reasonable level of belief that it might be right,1 an excellent starting point is simply to capture the input(s) that you have already used, together with the output that it generates, and write a test that checks that the input you provided produces the expected output. That's exactly the procedure I advocated for TDDA, and that's how we shall start here.

Test Data

The only flies in the ointment in this case are

  1. the input data I used initially was quite large (5.5MB compressed; 109MB uncompressed), leading to quite a slow test;

  2. the data is somewhat personal.

For both these reasons, I have decided to reduce it so that it will be more manageable, run more quickly, and be more suitable for public sharing.

So I cut down the data to contain only the DTD header, the Me record, ten StepCount records, and five DistanceWalkingRunning records. That results in a small, valid XML file (under 7K) containing exactly 100 lines. It's in the testdata subdirectory of the repository, and if I run the extraction code over it (which you probably don't want to do, at least in situ, as that will trample over the reference output), the following output is produced:

$ python applehealthdata/ testdata/export6s3sample.xml
Reading data from testdata/export6s3sample.xml . . . done

ActivitySummary: 2
ExportDate: 1
Me: 1
Record: 15

HKCharacteristicTypeIdentifierBiologicalSex: 1
HKCharacteristicTypeIdentifierBloodType: 1
HKCharacteristicTypeIdentifierDateOfBirth: 1
HKCharacteristicTypeIdentifierFitzpatrickSkinType: 1
activeEnergyBurned: 2
activeEnergyBurnedGoal: 2
activeEnergyBurnedUnit: 2
appleExerciseTime: 2
appleExerciseTimeGoal: 2
appleStandHours: 2
appleStandHoursGoal: 2
creationDate: 15
dateComponents: 2
endDate: 15
sourceName: 15
startDate: 15
type: 15
unit: 15
value: 16

Record types:
DistanceWalkingRunning: 5
StepCount: 10

Opening /Users/njr/qs/testdata/StepCount.csv for writing
Opening /Users/njr/qs/testdata/DistanceWalkingRunning.csv for writing
Written StepCount data.
Written DistanceWalkingRunning data.

The two CSV files it writes, which are also in the testdata subdirectory in the repository, are as follows:

$ cat testdata/StepCount.csv
"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329
"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283
"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282


$ cat testdata/DistanceWalkingRunning.csv
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:28 +0100,2014-09-20 10:41:30 +0100,0.00288
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:30 +0100,2014-09-20 10:41:33 +0100,0.00284
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:33 +0100,2014-09-20 10:41:36 +0100,0.00142
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:54 +0100,2014-09-20 10:43:56 +0100,0.00639
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:59 +0100,2014-09-20 10:44:01 +0100,0.0059

Reference Test

The code for a single reference test is below. It's slightly verbose, because it tries to use sensible locations for everything, but not complex.

As before, you can get the code from Github with

$ git clone

or if you have pulled it previously, you can update it with

$ git pull

This version of the code is tagged with v1.1, so if it has been updated by the time you read this, get that version with

$ git checkout v1.1

Here is the code:

""" tests for the

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import shutil
import sys
import unittest

from applehealthdata import HealthDataExtractor

CLEAN_UP = True
VERBOSE = False


def get_base_dir():
    """
    Return the directory containing this test file,
    which will (normally) be the applehealthdata directory
    also containing the testdata dir.
    """
    return os.path.split(os.path.abspath(__file__))[0]

def get_testdata_dir():
    """Return the full path to the testdata directory"""
    return os.path.join(get_base_dir(), 'testdata')

def get_tmp_dir():
    """Return the full path to the tmp directory"""
    return os.path.join(get_base_dir(), 'tmp')

def remove_any_tmp_dir():
    """
    Remove the temporary directory if it exists.
    Returns its location either way.
    """
    tmp_dir = get_tmp_dir()
    if os.path.exists(tmp_dir):
        shutil.rmtree(tmp_dir)
    return tmp_dir

def make_tmp_dir():
    """
    Remove any existing tmp directory.
    Create an empty tmp directory.
    Return the location of the tmp dir.
    """
    tmp_dir = remove_any_tmp_dir()
    os.mkdir(tmp_dir)
    return tmp_dir

def copy_test_data():
    """
    Copy the test data export6s3sample.xml from the testdata directory
    to the tmp directory.
    """
    tmp_dir = make_tmp_dir()
    name = 'export6s3sample.xml'
    in_xml_file = os.path.join(get_testdata_dir(), name)
    out_xml_file = os.path.join(get_tmp_dir(), name)
    shutil.copyfile(in_xml_file, out_xml_file)
    return out_xml_file

class TestAppleHealthDataExtractor(unittest.TestCase):

    @classmethod
    def tearDownClass(cls):
        """Clean up by removing the tmp directory, if it exists."""
        if CLEAN_UP:
            remove_any_tmp_dir()

    def check_file(self, filename):
        expected_output = os.path.join(get_testdata_dir(), filename)
        actual_output = os.path.join(get_tmp_dir(), filename)
        with open(expected_output) as f:
            expected = f.read()
        with open(actual_output) as f:
            actual = f.read()
        self.assertEqual(expected, actual)

    def test_tiny_fixed_extraction(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)
        data.extract()
        self.check_file('StepCount.csv')
        self.check_file('DistanceWalkingRunning.csv')


if __name__ == '__main__':
    unittest.main()

Running the Test

This is what I get if I run it:

$ python
.
----------------------------------------------------------------------
Ran 1 test in 0.007s

OK


That's encouraging, but not particularly informative. If we change the value of VERBOSE at the top of the test file to True, we see slightly more reassuring output:

$ python
Reading data from /Users/njr/qs/applehealthdata/tmp/export6s3sample.xml . . . done
Opening /Users/njr/qs/applehealthdata/tmp/StepCount.csv for writing
Opening /Users/njr/qs/applehealthdata/tmp/DistanceWalkingRunning.csv for writing
Written StepCount data.
Written DistanceWalkingRunning data.
.
----------------------------------------------------------------------
Ran 1 test in 0.006s

OK

NOTE: The tearDownClass method is a special Python class method that the unit testing framework runs after executing all the tests in the class, regardless of whether they pass, fail or produce errors. I use it to remove the tmp directory containing any test output, which is normally good practice. In a later post, we'll either modify this to leave the output around if any tests fail, or make some other change to make it easier to diagnose what's gone wrong. In the meantime, if you change the value of CLEAN_UP, towards the top of the code, to False, it will leave the tmp directory around, allowing you to examine the files it has produced.


The test itself is in the 5-line method test_tiny_fixed_extraction. Here's what the five lines do:

  1. Copy the input XML file from the testdata directory to the tmp directory. The Github repository contains the 100-line input XML file together with the expected output in the testdata subdirectory. Because the data extractor writes the CSV files next to the input data, the cleanest thing for us to do is to take a copy of the input data, write it into a new directory (applehealthdata/tmp) and also to use that directory as the location for the output CSV files. The copy_test_data function removes any existing tmp directory it finds, creates a fresh one, copies the input test data into it and returns the path to the test data file. The only "magic" here is that the get_base_dir function figures out where to locate everything by using __file__, which is the location of the source file being executed by Python.

  2. Create a HealthDataExtractor object, using the location of the copy of the input data returned by copy_test_data(). Note that it sets verbose to False, making the test silent, and allowing the line of dots from a successful test run (in this case, a single dot) to be presented without interruption.

  3. Extract the data. This writes two output files to the applehealthdata/tmp directory.

  4. Check that the contents of tmp/StepCount.csv match the reference output in testdata/StepCount.csv.

  5. Check that the contents of tmp/DistanceWalkingRunning.csv match the reference output in testdata/DistanceWalkingRunning.csv.


In cases in which the tests are written after the code, it's important to check that they really are running correctly. My usual approach to that is to write the test and, if it appears to pass first time,2 to break it deliberately to verify that it fails when it should, before repairing it. In this case, the simplest way to break the test is to change the reference data temporarily. This will also reveal a weakness in the current check_file function.

We'll try three variants of this:

Variant 1: Break the StepCount.csv reference data.

First, I add a Z to the end of testdata/StepCount.csv and re-run the tests:

$ python
F
======================================================================
FAIL: test_tiny_fixed_extraction (__main__.TestAppleHealthDataExtractor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "", line 98, in test_tiny_fixed_extraction
  File "", line 92, in check_file
    self.assertEqual(expected, actual)
AssertionError: 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\nZ' != 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 
+0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n'

Ran 1 test in 0.005s

FAILED (failures=1)

That causes the expected failure. However, because we've compared the entire contents of the two CSV files, it's hard to see what's actually gone wrong. We'll address this by improving the check_file method in a later post.
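One way to improve matters, sketched here with a hypothetical helper (not necessarily the approach a later post will take), is to report only the first differing line:

```python
def first_difference(expected, actual):
    # Return a description of the first differing line, or None
    # if the two strings are identical.
    exp_lines = expected.splitlines()
    act_lines = actual.splitlines()
    for i, (e, a) in enumerate(zip(exp_lines, act_lines), 1):
        if e != a:
            return 'line %d: expected %r, actual %r' % (i, e, a)
    if len(exp_lines) != len(act_lines):
        return 'line counts differ: %d vs %d' % (len(exp_lines),
                                                 len(act_lines))
    return None
```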

Variant 2: Break the DistanceWalkingRunning.csv reference data.

After restoring the StepCount.csv data, I modify the reference testdata/DistanceWalkingRunning.csv data. This time, I'll change Health to Wealth throughout.

$ python
F
======================================================================
FAIL: test_tiny_fixed_extraction (__main__.TestAppleHealthDataExtractor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "", line 99, in test_tiny_fixed_extraction
  File "", line 92, in check_file
    self.assertEqual(expected, actual)
AssertionError: 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:28 +0100,2014-09-20 10:41:30 +0100,0.00288\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:30 +0100,2014-09-20 10:41:33 +0100,0.00284\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:33 +0100,2014-09-20 10:41:36 +0100,0.00142\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:54 +0100,2014-09-20 10:43:56 +0100,0.00639\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:59 +0100,2014-09-20 10:44:01 +0100,0.0059\n' != 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:28 +0100,2014-09-20 10:41:30 +0100,0.00288\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:30 +0100,2014-09-20 10:41:33 +0100,0.00284\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:33 +0100,2014-09-20 10:41:36 +0100,0.00142\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:54 +0100,2014-09-20 10:43:56 +0100,0.00639\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:59 +0100,2014-09-20 10:44:01 +0100,0.0059\n'

Ran 1 test in 0.005s

FAILED (failures=1)

The story is very much the same: the test has failed, which is good, but again the source of difference is hard to discern.

Variant 3: Break the input XML Data.

After restoring DistanceWalkingRunning.csv, I modify the input XML file. In this case, I'll just change the first step count to be 330 instead of 329:

$ python
F
======================================================================
FAIL: test_tiny_fixed_extraction (__main__.TestAppleHealthDataExtractor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "", line 98, in test_tiny_fixed_extraction
  File "", line 92, in check_file
    self.assertEqual(expected, actual)
AssertionError: 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n' != 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,330\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 
+0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n'

Ran 1 test in 0.005s

FAILED (failures=1)

Again, we get the expected failure, and again it's hard to see exactly what has changed. (We really will need to improve check_file.)


That's enough for this post. We've successfully added a single "reference" test to the code, which should at least make sure that if we break it during further enhancements, we will notice. It will also check that it is working correctly on other platforms (e.g., yours).

We haven't done anything to check that the CSV files produced are genuinely right beyond the initial eye-balling I did on first extracting the data before. But if we see problems when we start doing proper analysis, it will be easy to correct the expected output to keep the test running. And in the meantime, we'll notice if we make changes to the code that result in different output when it wasn't meant to do so. This is one part of the pragmatic essence of basic TDDA.

We also haven't written any unit tests at all for the extraction code; we'll do that in a later post.

  1. For example, you might have already blogged about it and pushed it to a public repository on Github 

  2. Which is not always the case 

In Defence of XML: Exporting and Analysing Apple Health Data

Posted on Fri 15 April 2016 in TDDA • Tagged with xml, apple, health

I'm going to present a series of posts based around the sort of health and fitness data that can now be collected by some phones and dedicated fitness trackers. Not all of these will be centrally on topic for test-driven data analysis, but I think they'll provide an interesting set of data for discussing many issues of relevance, so I hope readers will forgive me to the extent that these stray from the central theme.

The particular focus for this series will be the data available from an iPhone and the Apple Health app, over a couple of different phones, and with a couple of different devices paired to them.

In particular, the setup will be:

  • Apple iPhone 6s (November 2015 to present)
  • Apple iPhone 5s (with fitness data from Sept 2014 to Nov 2015)
  • Several Misfit Shine activity trackers (until early March 2016)
  • An Apple Watch (about a month of data, to date)

Getting data out of Apple Health (The Exploratory Version)

I hadn't initially spotted a way to get the data out of Apple's Health app, but a quick web search1 turned up a very helpful article on iDownloadBlog. It turns out there is a properly supported way to export granular data from Apple Health, described in detail in that post. Essentially:

  • Open the Apple Health App.
  • Navigate to the Health Data section (left icon at the bottom)
  • Select All from the list of categories
  • There is a share icon at the top right (a vertical arrow sticking up from a square)
  • Tap that to export all data
  • It thinks for a while (quite a while, in fact) and then offers you various export options, which for me included Airdrop, email and handing the data to other apps. I used Airdrop to dump it onto a Mac.

The result is a compressed XML file. For me, this was about 5.5MB, which expanded to 109MB when unzipped. (Interestingly, I started this with an earlier export a couple of weeks ago, when the zipped file was about 5MB and the expanded version was 90MB, so it is growing fairly quickly, thanks to the Watch.)

As helpful as the iDownloadBlog article is, I have to comment on its introduction to exporting data, which reads

There are actually two ways to export the data from your Health app. The first way, is one provided by Apple, but it is virtually useless.

To be fair to iDownloadBlog, an XML file like this probably is useless to the general reader, but it builds on a meme fashionable among developers and data scientists to the effect of "XML is painful to process, verbose and always worse than JSON", and I think this is somewhat unfair.

Let's explore export.xml using Python and the ElementTree library. Although the decompressed file is quite large (109MB), it's certainly not problematically large to read into memory on a modern machine, so I'm not going to worry about reading it in bits: I'm just going to find out as quickly as possible what's in it.
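(If the file ever did become too big for memory, ElementTree's iterparse supports incremental processing; here is a small sketch using inline sample data:)

```python
import io
from xml.etree import ElementTree as ET

# Process Record elements one at a time and discard them, rather
# than building the whole tree in memory at once.
xml = (b'<HealthData><Record type="StepCount"/>'
       b'<Record type="StepCount"/></HealthData>')
n_records = 0
for event, elem in ET.iterparse(io.BytesIO(xml)):
    if elem.tag == 'Record':
        n_records += 1
        elem.clear()  # free memory used by the processed element
```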

The first thing to do, of course, is simply to look at the file, probably using either the more or less command, assuming you are on some flavour of Unix or Linux. Let's look at the top of my export.xml:

$ head -79 export6s3/export.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HealthData [
<!-- HealthKit Export Version: 3 -->
<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout|ActivitySummary)*)>
<!ATTLIST HealthData
  locale CDATA #REQUIRED
>
<!ELEMENT ExportDate EMPTY>
<!ATTLIST ExportDate
  value CDATA #REQUIRED
>
<!ELEMENT Me EMPTY>
<!ATTLIST Me
  HKCharacteristicTypeIdentifierDateOfBirth         CDATA #REQUIRED
  HKCharacteristicTypeIdentifierBiologicalSex       CDATA #REQUIRED
  HKCharacteristicTypeIdentifierBloodType           CDATA #REQUIRED
  HKCharacteristicTypeIdentifierFitzpatrickSkinType CDATA #REQUIRED
>
<!ELEMENT Record (MetadataEntry*)>
<!ATTLIST Record
  type          CDATA #REQUIRED
  unit          CDATA #IMPLIED
  value         CDATA #IMPLIED
  sourceName    CDATA #REQUIRED
  sourceVersion CDATA #IMPLIED
  device        CDATA #IMPLIED
  creationDate  CDATA #IMPLIED
  startDate     CDATA #REQUIRED
  endDate       CDATA #REQUIRED
>
<!-- Note: Any Records that appear as children of a correlation also appear as top-level records in this document. -->
<!ELEMENT Correlation ((MetadataEntry|Record)*)>
<!ATTLIST Correlation
  type          CDATA #REQUIRED
  sourceName    CDATA #REQUIRED
  sourceVersion CDATA #IMPLIED
  device        CDATA #IMPLIED
  creationDate  CDATA #IMPLIED
  startDate     CDATA #REQUIRED
  endDate       CDATA #REQUIRED
>
<!ELEMENT Workout ((MetadataEntry|WorkoutEvent)*)>
<!ATTLIST Workout
  workoutActivityType   CDATA #REQUIRED
  duration              CDATA #IMPLIED
  durationUnit          CDATA #IMPLIED
  totalDistance         CDATA #IMPLIED
  totalDistanceUnit     CDATA #IMPLIED
  totalEnergyBurned     CDATA #IMPLIED
  totalEnergyBurnedUnit CDATA #IMPLIED
  sourceName            CDATA #REQUIRED
  sourceVersion         CDATA #IMPLIED
  device                CDATA #IMPLIED
  creationDate          CDATA #IMPLIED
  startDate             CDATA #REQUIRED
  endDate               CDATA #REQUIRED
>
<!ELEMENT WorkoutEvent EMPTY>
<!ATTLIST WorkoutEvent
  type CDATA #REQUIRED
  date CDATA #IMPLIED
>
<!ELEMENT ActivitySummary EMPTY>
<!ATTLIST ActivitySummary
  dateComponents           CDATA #IMPLIED
  activeEnergyBurned       CDATA #IMPLIED
  activeEnergyBurnedGoal   CDATA #IMPLIED
  activeEnergyBurnedUnit   CDATA #IMPLIED
  appleExerciseTime        CDATA #IMPLIED
  appleExerciseTimeGoal    CDATA #IMPLIED
  appleStandHours          CDATA #IMPLIED
  appleStandHoursGoal      CDATA #IMPLIED
>
<!ELEMENT MetadataEntry EMPTY>
<!ATTLIST MetadataEntry
  key   CDATA #IMPLIED
  value CDATA #IMPLIED
>

This is immediately encouraging: Apple has provided DOCTYPE (DTD) information, which, even though slightly old-fashioned, tells us what we should expect to find in the file. DTDs are awkward to use and, when coming from untrusted sources, can leave the user potentially vulnerable to malicious attacks, but despite this they are quite expressive and helpful, even just as plain-text documentation.

Roughly speaking, the lines:

<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout)*)>
<!ATTLIST HealthData
  locale CDATA #REQUIRED
>

tell us:

  • that the top element will be a HealthData element

  • that this HealthData element will contain

    • an ExportDate element
    • a Me element
    • zero or more elements of type Record, Correlation or Workout
  • and that the HealthData element will have an attribute locale (which is mandatory).

The rest of this DTD section describes each kind of record in more detail.
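It is worth noting that Python's ElementTree reads past the DOCTYPE without validating the document against it, so for our purposes the DTD acts purely as documentation; for example:

```python
from xml.etree import ElementTree as ET

# ElementTree (via expat) parses the internal DTD subset but does
# not validate the document against it; validation would need a
# different tool, such as lxml.
doc = '''<!DOCTYPE HealthData [
<!ELEMENT HealthData (ExportDate*)>
<!ELEMENT ExportDate EMPTY>
]>
<HealthData><ExportDate/></HealthData>'''
root = ET.fromstring(doc)
```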

The next 6 lines in my XML file are as follows (spread out for readability):

<HealthData locale="en_GB">
 <ExportDate value="2016-04-15 07:27:26 +0100"/>
 <Me HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"
     HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
     HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
 <Record type="HKQuantityTypeIdentifierHeight"
         creationDate="2016-01-02 09:45:10 +0100"
         startDate="2016-01-02 09:44:00 +0100"
         endDate="2016-01-02 09:44:00 +0100">
  <MetadataEntry key="HKWasUserEntered" value="1"/>
 </Record>

As you can see, the export format is verbose, but extremely comprehensible and comprehensive. It's also very easy to read into Python and explore.

Let's do that, in an interactive Python session:

>>> from xml.etree import ElementTree as ET
>>> with open('export.xml') as f:
...     data = ET.parse(f)
>>> data
<xml.etree.ElementTree.ElementTree object at 0x107347a50>

The ElementTree module turns each XML element into an Element object, described by its tag, with a few standard attributes.

Inspecting the data object, we find:

>>> data.__dict__
{'_root': <Element 'HealthData' at 0x1073c2050>}

i.e., we have a single entry in data—a root element called HealthData.

Like all Element objects, it has the four standard attributes:2

>>> root = data._root
>>> root.__dict__.keys()
['text', 'attrib', 'tag', '_children']

These are:

>>> root.attrib
{'locale': 'en_GB'}

>>> root.text
'\n '

>>> root.tag
'HealthData'

>>> len(root._children)

So nothing much apart from a locale and a whole lot of child nodes. Let's inspect the first few of them:

>>> nodes = root._children
>>> nodes[0]
<Element 'ExportDate' at 0x1073c2090>

>>> ET.dump(nodes[0])
<ExportDate value="2016-04-15 07:27:26 +0100" />

>>> nodes[1]
<Element 'Me' at 0x1073c2190>
>>> ET.dump(nodes[1])
<Me HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
    HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet" />

>>> nodes[2]
<Element 'Record' at 0x1073c2410>
>>> ET.dump(nodes[2])
<Record creationDate="2016-01-02 09:45:10 +0100"
        endDate="2016-01-02 09:44:00 +0100"
        startDate="2016-01-02 09:44:00 +0100"
        type="HKQuantityTypeIdentifierHeight">
  <MetadataEntry key="HKWasUserEntered" value="1" />
</Record>

>>> nodes[3]
<Element 'Record' at 0x1073c2550>
>>> nodes[4]
<Element 'Record' at 0x1073c2650>

So, exactly as the DTD indicated, we have an ExportDate node, a Me node and then what looks like a great number of records. Let's confirm that:

>>> set(node.tag for node in nodes[2:])
set(['Record', 'Workout', 'ActivitySummary'])

So in fact, there are three kinds of nodes after the ExportDate and Me records. Let's count them:

>>> records = [node for node in nodes if node.tag == 'Record']
>>> len(records)

These records are ones like the Height record we saw above, though in fact most of them are not Height but StepCount, DistanceWalkingRunning, ActiveEnergyBurned or BasalEnergyBurned, e.g.:

>>> ET.dump(nodes[100000])
<Record creationDate="2015-01-11 07:40:15 +0000"
        endDate="2015-01-10 13:39:35 +0000"
        sourceName="njr iPhone 6s"
        startDate="2015-01-10 13:39:32 +0000"
        value="4" />

There is also one activity summary per day (since I got the watch).

>>> acts = [node for node in nodes if node.tag == 'ActivitySummary']
>>> len(acts)

The first one isn't very exciting:

>>> ET.dump(acts[0])
<ActivitySummary activeEnergyBurned="0"
                 dateComponents="2016-03-18" />

but they get better:

>>> ET.dump(acts[2])
<ActivitySummary activeEnergyBurned="652.014"
                 dateComponents="2016-03-20" />

Finally, there is a solitary Workout record.

>>> workouts = [node for node in nodes if node.tag == 'Workout']
>>> ET.dump(workouts[0])
<Workout creationDate="2016-04-02 11:12:57 +0100"
         endDate="2016-04-02 11:12:22 +0100"
         sourceName="NJR Apple&#160;Watch"
         startDate="2016-04-02 10:40:38 +0100"
         workoutActivityType="HKWorkoutActivityTypeOther" />

So there we have it.

Getting data out of Apple Health (The Code)

Given this exploration, we can take a first shot at writing an exporter for Apple Health Data. I'm going to ignore the activity summaries and workout(s) for now, and concentrate on the main records. (We'll get to the others in a later post.)

Here is the code:

""" Extract data from Apple Health App's export.xml.

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import sys

from xml.etree import ElementTree
from collections import Counter, OrderedDict

__version__ = '1.0'

FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('type', 's'),
    ('unit', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('value', 'n'),

PREFIX_RE = re.compile('^HK.*TypeIdentifier(.+)$')

def format_freqs(counter):
    Format a counter object for display.
    return '\n'.join('%s: %d' % (tag, counter[tag])
                     for tag in sorted(counter.keys()))

def format_value(value, datatype):
    Format a value for a CSV file, escaping double quotes and backslashes.

    None maps to empty.

    datatype should be
        's' for string (escaped)
        'n' for number
        'd' for datetime
    if value is None:
        return ''
    elif datatype == 's':  # string
        return '"%s"' % value.replace('\\', '\\\\').replace('"', '\\"')
    elif datatype in ('n', 'd'):  # number or date
        return value
        raise KeyError('Unexpected format value: %s' % datatype)

def abbreviate(s):
    Abbreviate particularly verbose strings based on a regular expression
    m = re.match(PREFIX_RE, s)
    return if ABBREVIATE and m else s

def encode(s):
    Encode string for writing to file.
    In Python 2, this encodes as UTF-8, whereas in Python 3,
    it does nothing
    return s.encode('UTF-8') if sys.version_info.major < 3 else s

class HealthDataExtractor(object):
    Extract health data from Apple Health App's XML export, export.xml.

        path:      Relative or absolute path to export.xml
        verbose:   Set to False for less verbose output

        Writes a CSV file for each record type found, in the same
        directory as the input export.xml. Reports each file written
        unless verbose has been set to False.
    def __init__(self, path, verbose=VERBOSE):
        self.in_path = path
        self.verbose = verbose = os.path.abspath(os.path.split(path)[0])
        with open(path) as f:
  'Reading data from %s . . . ' % path, end='')
   = ElementTree.parse(f)
        self.root =
        self.nodes = self.root.getchildren()
        self.n_nodes = len(self.nodes)

    def report(self, msg, end='\n'):
        if self.verbose:
            print(msg, end=end)

    def count_tags_and_fields(self):
        self.tags = Counter()
        self.fields = Counter()
        for record in self.nodes:
            self.tags[record.tag] += 1
            for k in record.keys():
                self.fields[k] += 1

    def count_record_types(self):
        self.record_types = Counter()
        for record in self.nodes:
            if record.tag == 'Record':
                self.record_types[record.attrib['type']] += 1

    def collect_stats(self):

    def open_for_writing(self):
        self.handles = {}
        self.paths = []
        for kind in self.record_types:
            path = os.path.join(, '%s.csv' % abbreviate(kind))
            f = open(path, 'w')
            f.write(','.join(FIELDS) + '\n')
            self.handles[kind] = f
  'Opening %s for writing' % path)

    def abbreviate_types(self):
        Shorten types by removing common boilerplate text.
        for node in self.nodes:
            if node.tag == 'Record':
                if 'type' in node.attrib:
                    node.attrib['type'] = abbreviate(node.attrib['type'])

    def write_records(self):
        for node in self.nodes:
            if node.tag == 'Record':
                attributes = node.attrib
                kind = attributes['type']
                values = [format_value(attributes.get(field), datatype)
                          for (field, datatype) in FIELDS.items()]
                line = encode(','.join(values) + '\n')

    def close_files(self):
        for (kind, f) in self.handles.items():
  'Written %s data.' % abbreviate(kind))

    def extract(self):

    def report_stats(self):
        print('\nTags:\n%s\n' % format_freqs(self.tags))
        print('Fields:\n%s\n' % format_freqs(self.fields))
        print('Record types:\n%s\n' % format_freqs(self.record_types))

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('USAGE: python /path/to/export.xml',
    data = HealthDataExtractor(sys.argv[1])

To run this code, clone the repo with:

$ git clone

or save the code from this post to a file. At the time of posting, the code in the repository is consistent with this post, and the commit is also tagged with the version number v1.0, so if you clone it later and want this exact version, check it out by saying:

$ git checkout v1.0

If your data is in the same directory as the code, then simply run:

$ python export.xml

and, depending on size, wait a few minutes while it runs. The code runs under both Python 2 and Python 3.

When I do this, the output is as follows:

$ python applehealthdata/ export6s3/export.xml
Reading data from export6s3/export.xml . . . done

Tags:
ActivitySummary: 29
ExportDate: 1
Me: 1
Record: 446670
Workout: 1

Fields:
HKCharacteristicTypeIdentifierBiologicalSex: 1
HKCharacteristicTypeIdentifierBloodType: 1
HKCharacteristicTypeIdentifierDateOfBirth: 1
HKCharacteristicTypeIdentifierFitzpatrickSkinType: 1
activeEnergyBurned: 29
activeEnergyBurnedGoal: 29
activeEnergyBurnedUnit: 29
appleExerciseTime: 29
appleExerciseTimeGoal: 29
appleStandHours: 29
appleStandHoursGoal: 29
creationDate: 446671
dateComponents: 29
device: 84303
duration: 1
durationUnit: 1
endDate: 446671
sourceName: 446671
sourceVersion: 86786
startDate: 446671
totalDistance: 1
totalDistanceUnit: 1
totalEnergyBurned: 1
totalEnergyBurnedUnit: 1
type: 446670
unit: 446191
value: 446671
workoutActivityType: 1

Record types:
ActiveEnergyBurned: 19640
AppleExerciseTime: 2573
AppleStandHour: 479
BasalEnergyBurned: 26414
BodyMass: 155
DistanceWalkingRunning: 196262
FlightsClimbed: 2476
HeartRate: 3013
Height: 4
StepCount: 195654

Opening /Users/njr/qs/export6s3/BasalEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/HeartRate.csv for writing
Opening /Users/njr/qs/export6s3/BodyMass.csv for writing
Opening /Users/njr/qs/export6s3/DistanceWalkingRunning.csv for writing
Opening /Users/njr/qs/export6s3/AppleStandHour.csv for writing
Opening /Users/njr/qs/export6s3/StepCount.csv for writing
Opening /Users/njr/qs/export6s3/Height.csv for writing
Opening /Users/njr/qs/export6s3/AppleExerciseTime.csv for writing
Opening /Users/njr/qs/export6s3/ActiveEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/FlightsClimbed.csv for writing
Written BasalEnergyBurned data.
Written HeartRate data.
Written BodyMass data.
Written DistanceWalkingRunning data.
Written ActiveEnergyBurned data.
Written StepCount data.
Written Height data.
Written AppleExerciseTime data.
Written AppleStandHour data.
Written FlightsClimbed data.

As a quick preview of one of the files, here is the top of the second biggest output file, StepCount.csv:

$ head -5 StepCount.csv
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:27:54 +0000,2014-09-13 09:27:59 +0000,329
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:34:09 +0000,2014-09-13 09:34:14 +0000,283
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:39:29 +0000,2014-09-13 09:39:34 +0000,426
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:48 +0000,2014-09-13 09:45:36 +0000,2014-09-13 09:45:41 +0000,61

You may need to scroll right to see all of it, or expand your browser window.

This blog post is long enough already, so I'll discuss (and plot) the contents of the various output files in later posts.

Notes on the Output

Format: The code writes CSV files including a header record with field names. Since the fields are XML attributes, which get read into a dictionary, they are unordered; the code writes them in the fixed order given by FIELDS, which isn't necessarily optimal, but is at least consistent. Nulls are written as empty fields, strings are quoted with double quotes, double quotes in strings are escaped with backslash and backslash is itself escaped with backslash. The output encoding is UTF-8.
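The string-escaping rule can be seen in isolation in a couple of lines (a hypothetical `csv_string` helper, mirroring the logic of `format_value` in the listing above):

```python
def csv_string(value):
    # Escape backslashes first, then double quotes, then wrap in quotes
    return '"%s"' % value.replace('\\', '\\\\').replace('"', '\\"')

print(csv_string('Apple Watch'))   # "Apple Watch"
print(csv_string('say "hi"'))      # "say \"hi\""
```

Escaping backslashes before quotes matters: doing it the other way round would double-escape the backslashes introduced for the quotes.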

Filenames: One file is written per record type, and the name is just the record type with extension .csv, except that for record types including HK...TypeIdentifier, that prefix is excised.

Summary Stats: Summary statistics about the tags, fields and record types found are printed before the main extraction occurs.

Overwriting: Any existing CSV files are silently overwritten, so if you have multiple health data export files in the same directory, take care.

Data Sanitization: The code is almost completely opinionless, and with one exception simply flattens the data in the XML file into a collection of CSV files. The exception concerns file names and the type field. Apple uses extraordinarily verbose and ugly names like HKQuantityTypeIdentifierStepCount and HKQuantityTypeIdentifierHeight to describe the contents of each record: the abbreviate function in the code uses a regular expression to strip off the nonsense, resulting in nicer, shorter, more comprehensible file names and record types. However, if you prefer to get your data verbatim, simply change the value of ABBREVIATE to False near the top of the file and all your HealthKit prefixes will be preserved, at the cost of a non-trivial expansion of the output file sizes.
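For instance, the stripping that abbreviate performs looks like this (shown here without the ABBREVIATE switch, for brevity):

```python
import re

PREFIX_RE = re.compile('^HK.*TypeIdentifier(.+)$')

def abbreviate(s):
    # Keep only the part after the HK...TypeIdentifier boilerplate, if present
    m = PREFIX_RE.match(s)
    return m.group(1) if m else s

print(abbreviate('HKQuantityTypeIdentifierStepCount'))            # StepCount
print(abbreviate('HKCharacteristicTypeIdentifierBiologicalSex'))  # BiologicalSex
print(abbreviate('dateComponents'))                               # left alone
```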

Notes on the code: Wot, no tests?

The first thing to say about the code is that there are no tests provided with it, which is—cough—slightly ironic, given the theme of this blog. This isn't because I've written them but am holding them back for pedagogical reasons, or as an ironical meta-commentary on the whole test-driven movement, but merely because I haven't written any yet. Happily, writing tests is a good way of documenting and explaining code, so another post will follow, in which I will present some tests, possibly correct myriad bugs, and explain more about what the code is doing.

  1. I almost said 'I googled "Apple Health export"', but the more accurate statement would be that 'I DuckDuckGoed "Apple Health export"', but there are so many problems with DuckDuckGo as a verb, even in the present tense, let alone in the past as DuckDuckGod. Maybe I should propose the neologism "to DDGoogle". Or as Greg Wilson suggested, "to Duckle". Or maybe not . . . 

  2. The ElementTree structure in Python 3 is slightly different in this respect: this exploration was carried out with Python 2. However, the main code presented later in the post works under Python 2 and 3. 

Lessons Learned: Bad Data and other SNAFUs

Posted on Mon 15 February 2016 in TDDA • Tagged with tdda, bad data

My first paid programming job was working for my local education authority during the summer. The Advisory Unit for Computer-Based Education (AUCBE), run by a fantastic visionary and literal "greybeard" called Bill Tagg, produced software for schools in Hertfordshire and environs, and one of their products was a simple database called Quest. At this time (the early 1980s), two computers dominated UK schools—the Research Machines 380Z, a Zilog Z-80-based machine running CP/M, and the fantastic new BBC Micro, a 6502-based machine produced by Acorn to a specification agreed with the British Broadcasting Corporation. I was familiar with both, as my school had a solitary 380Z, and I had harangued my parents into getting me a BBC Model B,1 which was the joy of my life.

Figure: BBC Micro

The Quest database existed in two data-compatible forms. Peter Andrews had written a machine code implementation for the 380Z, and Bill Tagg himself had written an implementation in BBC Basic for the BBC Micro. They shared an interface and a manual, and my job was to produce a 6502 version that would also share that manual. Every deviation from the documented and actual behaviour of the BBC Basic implementation had to be personally signed off by Bill Tagg.

Writing Quest was a fantastic project for me, and the most highly constrained I have ever done: every aspect of it was pinned down by a combination of manuals, existing data files, specified interfaces, existing users and reference implementations. Peter Andrews was very generous in writing out, in fountain pen, on four A4 pages, a suggested implementation plan, which I followed scrupulously. That plan probably made the difference between my successfully completing the project and flailing endlessly, and the project was a success.

I learned an enormous amount writing Quest, but the path to success was not devoid of bumps in the road.

Once I had implemented enough of Quest for it to be worth testing, I took to delivering versions to Bill periodically. This was the early 1980s, so he didn't get them by pulling from Github, nor even by FTP or email; rather, I handed him floppy disks,2 in the early days, and later on, EPROMs—Erasable, Programmable Read-Only Memory chips that he could plug into the Zero-Insertion Force ("ZIF") socket3 on the side of his machine. (Did I mention how cool the BBC Micro was?)

Figure: ZIF Socket

Towards the end of my development of the 6502 implementation of Quest, I proudly handed over a version to Bill, and was slightly disappointed when he complained that it didn't work with one of his database files. In fact, his database file caused it to hang. He gave me a copy of his data and I set about finding the problem. It goes without saying that a bug that caused the software to hang was pretty bad, so it was clearly important to find it.

It was hard to track down. As I recall, it took me the best part of two solid days to find the problem. When I eventually did find it, it turned out to be a "bad data" problem. If I remember correctly, Quest saved data as flat files using the pipe character "|" to separate fields. The dataset Bill had given me had an extra pipe separator on one line, and was therefore not compliant with the data format. My reaction to this discovery was to curse Bill for sending me on a 2-day wild goose chase, and the following day I marched into AUCBE and told him—with the righteousness that only an arrogant teenager can muster—that it was his data that was at fault, not my beautiful code.

. . . to which Bill, of course, countered:

"And why didn't your beautiful code detect the bad data and report it, rather than hanging?"


Introducing SNAFU of the Week

Needless to say, Bill was right. Even if my software was perfect and would never write invalid data (which might not have been the case), and even if data could never become corrupt through disk errors (which was demonstrably not the case), that didn't mean it would never encounter bad data. So the software had to deal with invalid inputs rather better than going into an infinite loop (which is exactly what it did—nothing a hard reset wouldn't cure!)
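Quest was written long before Python existed, but the check it lacked is only a few lines in any language. A sketch of a field-count validator for pipe-separated rows, with invented data:

```python
def bad_rows(lines, nfields):
    """Return (line_number, field_count) for rows with the wrong number of fields."""
    return [(i, len(line.split('|')))
            for i, line in enumerate(lines, 1)
            if len(line.split('|')) != nfields]

rows = ['Ada|Lovelace|1815',
        'Alan|Turing|1912|',     # stray trailing pipe: four fields, not three
        'Grace|Hopper|1906']
print(bad_rows(rows, 3))         # [(2, 4)]
```

A loader that ran a check like this before parsing could have reported the rogue line immediately, rather than hanging.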

And so it is with data analysis.

Obviously, there is such a thing as good data—perfectly formatted, every value present and correct; it's just that it is almost never safe to assume that data your software will receive will be good. Rather, we almost always need to perform checks to validate it, and to give various levels of warnings when things are not as they should be. Hanging or crashing on bad data is obviously bad, but in some ways, it is less bad than reading it without generating a warning or error. The hierarchy of evils for analytical software runs something like this:

  1. (Worst) Producing plausible but materially incorrect results from bad inputs.

  2. Producing implausible, materially incorrect results from bad inputs (generally less bad, because these are much less likely to go unnoticed, though obviously they can be even more serious if they do).

  3. (Least serious) Hanging or crashing (embarrassing and inconvenient, but not actively misleading).

In this spirit, we are going to introduce "SNAFU of the Week", which will be a (not-necessarily weekly) series of examples of kinds of things that can go wrong with data (especially data feeds), analysis, and analytical software, together with a discussion of whether and how it was, or could have been detected and what lessons we might learn from them.

  1. BBC Micro Image: Dave Briggs, under CC-BY-2.0

  2. Floppy disks were like 3D-printed versions of the save icon still used in much software, and in some cases could store over half a megabyte of data. Of course, the 6502 was an 8-bit processor with a 16-bit address bus, so it could address a maximum of 64K of RAM. In the case of the BBC Micro, a single program could occupy at most 16K, so a massive floppy disk could store many versions of Quest together with enormous database files. 

  3. Zero-Insertion Force Socket: Windell Oskay, under CC-BY-2.0

How far in advance are flights cheapest? An error of interpretation

Posted on Wed 06 January 2016 in TDDA • Tagged with tdda, errors, interpretation

Guest Post by Patrick Surry, Chief Data Scientist, Hopper

Every year, Expedia and ARC collaborate to publish some annual statistics about domestic airfare, including their treatment of the perennial question "How far in advance should you book your flight?" Here's what they presented in their report last year:

Figure: Average Ticket Price vs. Advance Purchase Days for Domestic Flights (Source: Expedia/ARC)

Although there are a lot of things wrong with this picture (including the callout not being at the right spot on the x-axis, and the $496 average appearing above $500 . . .), the most egregious is a more subtle error of interpretation. The accompanying commentary reads:

Still, the question remains: How early should travelers book? . . . Data collected by ARC indicates that the lowest average ticket price, about US$401, can be found 57 days in advance.

While that statement is presumably mathematically correct, it's completely misleading. The chart is drawn by calculating the average price of all domestic roundtrip tickets sold at each advance. That answers the question "how far in advance is the average price of tickets sold that day lowest?" but is mistakenly interpreted as answering "how far in advance is a typical ticket cheapest?". That's a completely different question, because the mix of tickets changes with advance. Indeed, travelers tend to book more expensive trips earlier, and cheaper trips later. In fact, for most markets, prices are fairly flat at long advances, and then rise more or less steeply at some point before departure. As a simplification, assume there are only two domestic markets: a short, cheap trip (StonesThrow) and a long, expensive one (FarFarAway). Both have prices that are flat at long advances, and which start rising about 60 days before departure:

Figure: Price as a function of booking window, for short-haul and long-haul flights (Simulated Data)

Now let's assume that the relative demand is directly proportional to advance, i.e. 300 days ahead, all tickets sold are for FarFarAway, and 0 days ahead, all tickets sold are for StonesThrow, and let's calculate the price of the average ticket sold as a function of advance:

Figure: Average price as a function of booking window across long- and short-haul flights, with time-varying proportionate demand (simulated data)

What do you know? The average price declines as demand switches from more expensive to cheaper tickets, with a minimum coincidentally just less than 60 days in advance. To get a more meaningful answer to the question "how far in advance is the typical ticket cheapest?", we should instead simply calculate separate advance curves for each market, and then combine them based on the total number (or value) of tickets sold in each market. In our simple example, if we assume the two markets have equal overall weight, we get a much more intuitive result, with prices flat up to 60 days, and then rising towards departure:

Figure: Weighted average advance-purchase price across long-haul and short-haul, with weighting by volume
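The mix-shift effect in these charts can be reproduced with a toy calculation (all numbers invented; the 60-day knee and 300-day window follow the example above):

```python
def price(advance, base, slope=4, knee=60):
    # Flat price beyond `knee` days out; rises linearly closer to departure
    return base + slope * max(knee - advance, 0)

def average_ticket_price(advance, max_advance=300):
    # Demand mix: the share of expensive FarFarAway tickets grows with advance
    w = advance / max_advance
    return w * price(advance, 1000) + (1 - w) * price(advance, 100)

averages = {a: average_ticket_price(a) for a in (0, 30, 60, 120, 300)}
print(averages)   # the blended *average* bottoms out at 60 days out...
print(min(price(a, 100) for a in (0, 30, 60, 120, 300)))   # 100: yet each market alone is cheapest early
```

Even though each individual market's price only ever falls as you book earlier, the average over tickets sold is minimized at 60 days, purely because of the changing mix.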

All this goes to show how important it is that we frame our analytical questions (and answers!) carefully. When the traveller asks: "How far in advance should I book my flight?", it's our responsibility as analysts to recognize that they mean

How far in advance is any given ticket cheapest?

rather than

How far in advance is the average price of tickets sold that day lowest?

Even a correct answer to the latter is dangerously misleading, because the traveller is unlikely to recognize the distinction and will take it as the (wrong!) answer to their real question.

Generalized Overfitting: Errors of Applicability

Posted on Mon 14 December 2015 in TDDA • Tagged with tdda, errors, applicability

Everyone building predictive models or performing statistical fitting knows about overfitting. This arises when the function represented by the model includes components or aspects that are overly specific to the particularities of the sample data used for training the model, and that are not general features of datasets to which the model might reasonably be applied. The failure mode associated with overfitting is that the performance of the model on the data we used to train it is significantly better than the performance when we apply the model to other data.

Figure: Overfitting

Figure: Overfitting. Points drawn from sin(x) + Gaussian noise. Left: Polynomial fit, degree 3 (cubic; good fit). Right: Polynomial fit, degree 10 (overfit).

Statisticians use the term cross-validation to describe the process of splitting the training data into two (or more) parts, and using one part to fit the model, and the other to assess whether or not it exhibits overfitting. In machine learning, this is more often referred to as a "test-training" approach.
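A minimal sketch of the holdout idea, assuming a simple seeded random split (real projects would normally reach for a library routine):

```python
import random

def train_test_split(items, test_fraction=0.25, seed=42):
    """Shuffle a copy of the data, then slice off a holdout (test) set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(range(100))
print(len(train), len(test))   # 75 25
```

The model is fitted on `train` only; performance measured on `test` then estimates how the model will do on data it has never seen.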

A special form of this approach is longitudinal validation, in which we build the model on data from one time period and then check its performance against data from a later time period, either by partitioning the data available at build time into older and newer data, or by using outcomes collected after the model was built for validation. With longitudinal validation, we seek to verify not only that we did not overfit the characteristics of a particular data sample, but also that the patterns we model are stable over time.

Validating against data for which the outcomes were not known when the model was developed has the additional benefit of eliminating a common class of errors that arises when secondary information about validation outcomes "leaks" during the model building process. Some degree of such leakage—sometimes known as contaminating the validation data—is quite common.

Generalized Overfitting

As its name suggests, overfitting as normally conceived is a failure mode specific to model building, arising when we fit the training data "too well". Here, we are going to argue that overfitting is an example of a more general failure mode that can be present in any analytical process, especially if we use the process with data other than that used to build it. Our suggested name for this broader class of failures is errors of applicability.

Here are some of the failure modes we are thinking about:

Changes in Distributions of Inputs (and Outputs)

  1. New categories. When we develop the analytical process, we see only categories A, B and C in some (categorical) input or output. In operation, we also see category D. At this point our process may fail completely ("crash"), produce meaningless outputs or merely produce less good results.

  2. Missing categories. The converse can be a problem too: what if a category disappears? Most prosaically, this might lead to a divide-by-zero error if we've explicitly used each category frequency in a denominator. Subtler errors can also creep in.

  3. Extended ranges. For numeric and other ordered data, the equivalent of new categories is values outside the range we saw in the development data. Even if the analysis code runs without incident, the process will be being used in a way that may be quite outside that considered and tested during development, so this can be dangerous.

  4. Distributions. More generally, even if the range of the input data doesn't change, its distribution may, either slowly or abruptly. At the very least, this indicates the process is being used in unfamiliar territory.

  5. Nulls. Did nulls appear in any fields where there were none when we developed the process? Does the process cater for this appropriately? And are such nulls valid?

  6. Higher Dimensional Shifts. Even if the the data ranges and distribution for individual fields don't change, their higher dimensional distributions (correlations) can change significantly. The pair of 2-dimensional distributions below illustrates this point in an extreme way. The distributions of both x and y values on the left and right are identical. But clearly, in 2 dimensions, we see that the space occupied by the two datasets is actually non-overlapping, and on the left x and y are negatively correlated, while on the right they are positively correlated.

    Figure: A shift in distribution (2D)

    Figure: The same x and y values are shared between these two plots (i.e. the distribution of x and y is identical in each case). However, the pairing of x and y coordinates is different. A model or other analytical process built with negatively correlated data like that on the left might not work well for positively correlated data like that on the right. Even if it does work well, you may want to detect and report a fundamental change like this.

  7. Time (always marching on). Times and dates are notoriously problematical. There are many issues around date and time formats, many specifically around timezones (and the difference between local times and times in a fixed time zone, such as GMT or UTC).

    For now, let's assume that we have an input that is a well-defined time, correctly read and analysed in a known timezone—say UTC.1 Obviously, new data will tend to have later times—sometimes non-overlapping later times. Most often, we need to change these to intervals measured with respect to a moving date (possibly today, or some variable event date, e.g. days since contact). But in other cases, absolute times, or times in a cycle matter. For example, season, time of month or time of day may matter—the last two, probably in local time rather than UTC.

    In handling time, we have to be careful about binnings, about absolute vs. relative measurement (2015-12-11T11:00:00 vs. 299 hours after the start of the current month), universal vs. local time, and appropriate bin boundaries that move or expand with the analytic time window being considered.

    Time is not unique in the way that its range and maximum naturally increase with each new data sample. Most obviously, other counters (such as customer number) and sum-like aggregates may have this same monotonically increasing character, meaning that it should be expected that new, higher (but perhaps not new lower) values will be present in newer data.
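Some of the checks suggested by failure modes 1 to 3 can be automated cheaply. A toy sketch, with a hypothetical `applicability_warnings` helper and invented reference data:

```python
def applicability_warnings(ref_categories, ref_range, new_categories, new_values):
    """Compare a new batch of data against what was seen during development.

    Returns a list of warning strings; an empty list means no issues found.
    (A sketch only: real checks would also cover nulls, distributions, etc.)
    """
    warnings = []
    unseen = set(new_categories) - set(ref_categories)
    if unseen:
        warnings.append('new categories: %s' % sorted(unseen))
    gone = set(ref_categories) - set(new_categories)
    if gone:
        warnings.append('missing categories: %s' % sorted(gone))
    lo, hi = ref_range
    if new_values and (min(new_values) < lo or max(new_values) > hi):
        warnings.append('values outside development range %r' % (ref_range,))
    return warnings

print(applicability_warnings(['A', 'B', 'C'], (0, 100),
                             ['A', 'B', 'D'], [5, 250]))
```

Run routinely against each new batch of inputs, even checks this crude can turn a silent failure of applicability into a loud, early warning.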

Concrete and Abstract Definitions

There's a general issue with choosing values based on data used during development. This concerns the difference between what we will term concrete and abstract values, and what it means to perform "the same" operation on different datasets.

Suppose we decide to handle outliers differently from the rest of the data in a dataset, at least for some part of the analysis. For example, suppose we're looking at flight prices in Sterling and we see the following distribution.

Figure: Ticket Prices

Figure: Ticket prices, in £100 bins to £1,000, then doubling widths to £256,000, with one final bin for prices above £256,000. (On the graph, the £100-width bins are red; the rest are blue.)

On the basis of this, we see that well over 99% of the data has prices under £4,000, and also that while there are a few thousand ticket prices in the £4,000–£32,000 range (most of which are probably real) the final few thousand probably contain bad data, perhaps as a result of currency conversion errors.

We may well want to choose one or more threshold values from the data—say £4,000 in this case—to specify some aspect of our analytical process. We might, for example, use this threshold in the analysis for filtering, outlier reporting, setting a final bin boundary or setting the range for the axes of a graph.

The crucial question here is: How do we specify and represent our threshold value?

  • Concrete Value: Our concrete value is £4,000. In the current dataset there are 60,995 ticket prices (0.55%) above this value and 10,807,905 (99.45%) below. (There are no prices of exactly £4,000.) Obviously, if we specify our threshold using this concrete value—£4,000—it will be the same for any dataset we use with the process.

  • Abstract Value: Alternatively, we might specify the value indirectly, as a function of the input data. One such abstract specification is the price P below which 99.45% of the ticket prices in the dataset lie. If we specify a threshold using this abstract definition, it will vary according to the content of the dataset.

    • In passing, 99.45% is not precise: if we select the bottom 99.45% of this dataset by price we get 10,808,225 records with a maximum price of £4,007.65. The more precise specification is that 99.447046% of the dataset has prices under £4,000.

    • Of course, being human, if we were specifying the value in this way, we would probably round the percentage to 99.5%, and if we did that we would find that we shifted the threshold so that the maximum price below it was £4,186.15, and the minimum price above was £4,186.22.

  • Alternative Abstract Specifications: Of course, if we want to specify this threshold abstractly, there are countless other ways we might do it, some fraught with danger.

    Two things we should definitely avoid when working with data like this are means and variances across the whole column, because they will be rendered largely meaningless by outliers. If we blindly calculate the mean, μ, and standard deviation, σ, in this dataset, we get μ=£2,009.85 and σ=£983,956.28. That's because, as we noted previously, there are a few highly questionable ticket prices in the data, including a maximum of £1,390,276,267.42.2 Within the main body of the data—the ~99.45% with prices below £4,000.00—the corresponding values are μ=£462.09 and σ=£504.82. This emphasizes how dangerous it would be to base a definition on full-field moments3 such as mean or variance.

    In contrast, the median is much less affected by outliers. In the full dataset, for example, the median ticket price is £303.77, while the median of those under £4,000.00 is £301.23. So another reasonably stable abstract definition of a threshold around £4,000.00 would be something like 13 times the median.
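The contrast between the two styles of specification can be sketched in a few lines of Python. The prices and helper names below are illustrative (this is not the Skyscanner data, nor Miró code):

```python
import statistics

# Illustrative ticket prices (not the real data): a main body of modest
# values plus one extreme outlier of the kind discussed above.
prices = [120.0, 250.0, 300.0, 310.0, 450.0, 800.0, 3900.0, 950000.0]

# Concrete specification: a hard-wired value, the same for every dataset.
CONCRETE_THRESHOLD = 4000.0

def quantile_threshold(values, q):
    """Abstract specification 1: the smallest observed value v such that
    at least a fraction q of the values lie at or below v."""
    ordered = sorted(values)
    k = min(len(ordered) - 1, max(0, int(q * len(ordered)) - 1))
    return ordered[k]

def median_multiple_threshold(values, k):
    """Abstract specification 2: a multiple of the median, which (unlike
    the mean) is barely moved by the outlier."""
    return k * statistics.median(values)

print(quantile_threshold(prices, 0.875))      # moves with the data
print(median_multiple_threshold(prices, 13))  # robust to the outlier
```

Note how the mean of these values would be dominated by the single outlier, while the median-based threshold stays close to the body of the data.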

The reason for labouring this point around abstract vs. concrete definitions is that it arises very commonly and it is not always obvious which is preferable. Concrete definitions have the advantage of (numeric) consistency between analyses, but may result in analyses that are not well suited to a later dataset, because different choices would have been made if that later data had been considered by the developer of the process. Conversely, abstract definitions often make it easier to ensure that analyses are suitable for a broader range of input datasets, but can make comparability more difficult; they also tend to make it harder to get "nice" human-centric scales, bin boundaries and thresholds (because you end up, as we saw above, with values like £4,186.22, rather than £4,000).

Making a poor choice between abstract and concrete specifications of any data-derived values can lead to large sections of the data being omitted (if filtering is used), or made invisible (if used for axis boundaries), or conversely can lead to non-comparability between results or miscomputations if values are associated with bins having different boundaries in different datasets.

NOTE: A common source of the leakage of information from validation data into training data, as discussed above, is the use of the full dataset to make decisions about thresholds such as those discussed here. To get the full benefit of cross-validation, all modelling decisions need to be made solely on the basis of the training data; even feeding back performance information from the validation data begins to contaminate that data.
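The discipline the note describes can be made concrete: any data-derived threshold should be computed from the training portion alone and then applied, fixed, to the validation portion. A minimal sketch, with illustrative data:

```python
import random

random.seed(0)
# Illustrative data: 1,000 heavy-tailed values with mean around 500.
values = [random.expovariate(1 / 500.0) for _ in range(1000)]

# Split into training and validation BEFORE deriving any thresholds.
random.shuffle(values)
train, valid = values[:800], values[800:]

# The threshold is computed from the training data only...
threshold = sorted(train)[int(0.99 * len(train))]

# ...and then applied, unchanged, to the validation data, so no
# information about the validation values leaks into the choice.
n_above = sum(v > threshold for v in valid)
print(f"threshold={threshold:.2f}; validation records above it: {n_above}")
```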

Data-derived thresholds and other values can occur almost anywhere in an analytical process, but specific dangers include:

  1. Selections (Filters). In designing analytical processes, we may choose to filter values, perhaps to remove outliers or nonsensical values. Over time, the distribution may shift, and these filters may become less appropriate and remove ever-increasing proportions of the data.

    A good example of this that we have seen recently involves negative charges. In early versions of the ticket price information, almost all charges were positive, and those that were negative were clearly erroneous, so we added a filter to remove all negative charges from the dataset. Later, we started seeing data containing many more, and less obviously erroneous, negative charges. It turned out that a new data source generated valid negative charges, but we were misled in our initial analysis and the process we built was unsuitable for the new context.

  2. Binnings (Bandings, Buckets). Binning data is very common, and it is important to think carefully about when you want bin boundaries to be concrete (common across datasets) and when they should be abstract (computed, and therefore different for different datasets). Quantile binnings (such as deciles), of course, are intrinsically adaptive, though if those are used you have to be aware that any given bin in one dataset may have different boundaries from the "same" bin in another dataset.

  3. Statistics. As noted above, whenever any statistic of the dataset is used, some care has to be taken to decide whether it should be recorded algorithmically (as an abstract value) or numerically (as a concrete value) in the analysis, and particular care should be taken with statistics that are sensitive to outliers.
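The difference between concrete and quantile binnings can be sketched as follows; the bin edges, helper names and small datasets are illustrative:

```python
# Concrete bins: boundaries hard-wired, common across datasets.
FIXED_EDGES = [0, 100, 250, 500, 1000, 4000]

def concrete_bin(value, edges=FIXED_EDGES):
    """Index of the bin containing value; values beyond the last edge
    are clipped into the final bin."""
    for i in range(1, len(edges)):
        if value <= edges[i]:
            return i - 1
    return len(edges) - 2  # clip overflow into the last bin

# Abstract bins: quantile boundaries, recomputed for each dataset.
def quantile_edges(values, n_bins):
    ordered = sorted(values)
    return [ordered[(len(ordered) - 1) * i // n_bins] for i in range(n_bins + 1)]

data_a = [10, 40, 90, 200, 350, 700, 2000]
data_b = [5, 15, 25, 60, 110, 180, 9000]

# The "same" quantile bin has different boundaries in different datasets:
print(quantile_edges(data_a, 2))
print(quantile_edges(data_b, 2))
```

The last two lines illustrate the caveat about quantile binnings above: the first bin of each dataset covers quite different price ranges.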

Other Challenges to Applicability

In addition to the common sources of errors of applicability we have outlined above, we will briefly mention a few more.

  1. Non-uniqueness. Is a value that was different for each record in the input data now non-unique?

  2. Crazy outliers. Are there (crazy) outliers in fields where there were none before?

  3. Actually wrong. Are there detectable data errors in the operational data that were not seen during development?

  4. New data formats. Have formats changed, leading to misinterpretation of values?

  5. New outcomes. Even more problematical than new input categories or ranges are new outcome categories or a larger range of output values. When we see this, we should almost always re-evaluate our analytical processes.

Four Kinds of Analytical Errors

In the overview of TDDA we published in Predictive Analytic Times (available here), we made an attempt to summarize how the four main classes of errors arise with the following diagram:

Figure: Four Kinds of Analytical Error

While this was always intended to be a simplification, a particular problem is that it suggests there's no room for errors of interpretation in the operationalization phase, which is far from the case.4 Probably a better representation is as follows:

Figure: Four Kinds of Analytical Error (revisited)

  1. UTC is the curious abbreviation (malacronym?) used for coordinated universal time, which is the standardized version of Greenwich Mean Time now defined by the scientific community. It is the time at 0° longitude, with no "daylight saving" (British Summer Time) adjustment.

  2. This is probably the result of a currency conversion error. 

  3. Statistical moments are the characterizations of distributions starting with mean and variance, and continuing with skewness and kurtosis. 

  4. It's never too late to misinterpret data or results. 

Overview of TDDA in Predictive Analytics Times

Posted on Fri 11 December 2015 in TDDA • Tagged with tdda

We have an overview piece in Predictive Analytics Times this week.

You can find it here.

Constraints and Assertions

Posted on Thu 26 November 2015 in TDDA • Tagged with tdda, components

Consistency Checking of Inputs, Outputs and Intermediates

While the idea of regression testing comes straight from test-driven development, the next idea we want to discuss is associated more with general defensive programming than TDD. The idea is consistency checking, i.e. verifying that what might otherwise be implicit assumptions are in fact met by adding checks at various points in the process.1

Initially, we will assume that we are working with tabular data, but the ideas can be extended to other kinds of data.

Inputs. It is useful to perform some basic checks on inputs. Typical things to consider include:

  1. Are the names and types of the fields in the input data as we expect? In most cases, we also expect field names to be distinct, and perhaps to conform to some rules.

  2. Is the distribution of values in the fields reasonable? For example, are the minimum and maximum values reasonable?

  3. Are there nulls (missing values) in the data, and if so, are they permitted where they occur? If so, are there any restrictions (e.g., may all the values for a field or record be null)?

  4. Is the volume of data reasonable (exactly as expected, if there is a specific size expectation, or plausible, if the volume is variable)?

  5. Is any required metadata2 included?
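A minimal sketch of checks 1-4 over a small tabular dataset; the field names, limits and record format are all illustrative:

```python
# Expected schema for the input (illustrative field names and types).
EXPECTED_FIELDS = {"id": int, "price": float}

def check_inputs(records, min_rows=1, max_price=1e6):
    """Return a list of problems found; empty means all checks passed."""
    problems = []
    for i, rec in enumerate(records):
        # 1. Field names and types as expected?
        if set(rec) != set(EXPECTED_FIELDS):
            problems.append(f"record {i}: unexpected fields {sorted(rec)}")
            continue
        for name, ftype in EXPECTED_FIELDS.items():
            v = rec[name]
            if v is None:
                # 3. Nulls: here we allow None for price only.
                if name != "price":
                    problems.append(f"record {i}: null {name}")
            elif not isinstance(v, ftype):
                problems.append(f"record {i}: {name} is {type(v).__name__}")
    # 2. Distribution reasonable? (crude min/max check)
    prices = [r.get("price") for r in records
              if isinstance(r.get("price"), float)]
    if prices and (min(prices) < 0 or max(prices) > max_price):
        problems.append("price out of plausible range")
    # 4. Volume of data reasonable?
    if len(records) < min_rows:
        problems.append("too few records")
    return problems

good = [{"id": 1, "price": 10.0}, {"id": 2, "price": None}]
bad = [{"id": "x", "price": -5.0}]
print(check_inputs(good))
print(check_inputs(bad))
```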

In addition to basic sense checks like these, we can also often formulate self-consistency checks on the data. For example:

  1. Are any of the fields identifiers or keys for which every value should occur only once?3

  2. Are there row-level identities that should be true? For example, we might have a set of category counts and an overall total, and expect the category totals to sum to the overall total:

    nCategoryA + nCategoryB + nCategoryC = nOverall
  3. For categorical data, are all the values found in the data allowed, and are any required values missing?

  4. If the data has a time structure, are the times and dates self-consistent? For example, do any end dates precede start dates? Are there impossible future dates?

  5. Are there any ordering constraints on the data, and if so are they respected?
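Several of these self-consistency checks can be sketched directly; the field names and records below are illustrative:

```python
import datetime

def key_is_unique(records, key):
    """Check 1: every value of an identifier field occurs only once."""
    values = [r[key] for r in records]
    return len(values) == len(set(values))

def row_identity_holds(rec):
    """Check 2: category counts sum to the overall total."""
    return (rec["nCategoryA"] + rec["nCategoryB"] + rec["nCategoryC"]
            == rec["nOverall"])

def dates_consistent(rec):
    """Check 4: end dates must not precede start dates."""
    return rec["end"] >= rec["start"]

rows = [
    {"id": 1, "nCategoryA": 2, "nCategoryB": 3, "nCategoryC": 5,
     "nOverall": 10,
     "start": datetime.date(2015, 1, 1), "end": datetime.date(2015, 6, 30)},
    {"id": 2, "nCategoryA": 1, "nCategoryB": 1, "nCategoryC": 1,
     "nOverall": 3,
     "start": datetime.date(2015, 2, 1), "end": datetime.date(2015, 3, 1)},
]

assert key_is_unique(rows, "id")
assert all(row_identity_holds(r) for r in rows)
assert all(dates_consistent(r) for r in rows)
print("all self-consistency checks passed")
```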

Our goal in formulating TDDA is pragmatic: we are not suggesting it is necessary to check for every possible inconsistency in the input data. Rather, we propose that even one or two simple, well-chosen checks can catch a surprising number of problems. As with regression testing, an excellent time to add new checks is when you discover problems. If you add a consistency check every time you discover bad inputs that such a test would have caught, you might quickly build up a powerful, well-targeted set of diagnostics and verification procedures. As we will see below, there is also a definite possibility of tool support in this area.

Intermediate Results and Outputs. Checking intermediates and outputs is very similar to checking inputs, and all the same kinds of tests can be applied. Some further questions to consider in these contexts include:

  1. If we look up reference data, do (and should) all the lookups succeed? And are failed lookups handled appropriately?

  2. If we calculate a set of results that should exhibit some identity properties, do those hold? Just as physics has conservation laws for quantities such as energy and momentum, there are similar conservation principles in some analytical calculations. As a simple example, if we categorize spending into different, non-overlapping categories, the sum of the category totals should usually equal the sum of all the transactions, as long as we are careful about things like non-categorized values.

  3. If we build predictive models, do they cross-validate correctly (if we split the data into a training subset and a validation subset)? And, ideally, do they also validate longitudinally (i.e., on later data, if this is available)?

Transfer checks. With data analysis, our inputs are frequently generated by some other system or systems. Often those systems already perform some checking or reporting of the data they produce. If any information about checks or statistics from source systems is available, it is useful to verify that equivalent statistics calculated over the input data produce the same results. If our input data is transactional, maybe the source system reports (or can report) the number or value of transactions over some time period. Perhaps it breaks things down by category. Maybe we know other summary statistics or there are checksums available that can be verified.

The value of checking that the data received is the same as the data the source system was supposed to send is self-evident, and can help us to detect a variety of problems including data loss, data duplication, data corruption, encoding issues, truncation errors and conversion errors, to name but a few.
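A transfer check of this kind can be as simple as the following sketch, in which the source system's reported figures are illustrative:

```python
# Values we actually received from the source system.
received = [100.00, 250.50, 74.25, 399.99]

# Summary statistics the source system reported alongside the feed
# (illustrative numbers, standing in for a real source-side report).
source_report = {
    "n_transactions": 4,
    "total_value": 824.74,
}

def reconcile(data, report, tolerance=0.01):
    """Recompute the reported statistics over the received data and
    compare; a small tolerance absorbs rounding differences."""
    return {
        "n_transactions": len(data) == report["n_transactions"],
        "total_value": abs(sum(data) - report["total_value"]) <= tolerance,
    }

print(reconcile(received, source_report))
```

If either check fails, we know the data was lost, duplicated, truncated or corrupted somewhere between the source system and us, before we have done any analysis at all.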

Tool Support: Automatic constraint suggestion

A perennial obstacle to better testing is the perception that it is a "nice to have", rather than a sine qua non, and that implementing it will require much tedious work. Because of this, any automated support that tools could provide would seem to be especially valuable.

Fortunately, there is low-hanging fruit in many areas, and one of our goals with this blog is to explore various tool enhancements. We will do this first in our own Miró software and then, as we find things that work, will try to produce some examples, libraries and tools for broader use, probably focused around the Python data analysis stack.

In the spirit of starting simply, we're first going to look at what might be possible by way of automatic input checking.

One characteristic of data analysis is that we often start by trying to get a result from some particular dataset, rather than setting out to implement an analytical process to be used repeatedly with different inputs. In fact, when we start, we may not even have a very specific analytical goal in mind: we may simply have some data (perhaps poorly documented) and perform exploratory analysis guided only by some broad analytical aim. Perhaps we will stop when we have a result that seems useful, and which we have convinced ourselves is plausible. At some later point, we may get a similar dataset (possibly pertaining to a later period, or a different entity) and need to perform a similar analysis. It's at this point we may go back to see whether we can re-use our previous, embryonic analytical process, in whatever form it was recorded.

Let's assume, for simplicity, that the process at least exists as some kind of executable script, but that it's "hard-wired" to the previous data. We then have three main choices.

  • Edit the script (in place) to make it work with the new input data.

  • Take a copy of the script and make that work with the new input data.

  • Modify the script to allow it to work with either the new or the old data, by parameterizing and generalizing it.

Presented like this, the last sounds like the only sensible approach, and in general it is the better way forward. However, we've all taken the other paths from time to time, often because under pressure just changing a few old hard-wired values to new hard-wired values seems as if it will get us to our immediate result faster.4

The problem is that even if we were very diligent when preparing the first script, in the context of the original analysis, it is easy for there to be subtle differences in a later dataset that might compromise or invalidate the analysis, and it's hard to force ourselves to be as vigilant the second (third, fourth, ...) time around.

A simple thing that can help is to generate statements about the original dataset and record these as constraints. If a later dataset violates these constraints, it doesn't necessarily mean that anything is wrong, but being alerted to the difference at least offers us an opportunity to consider whether this difference might be significant or problematical, and indeed, whether it might indicate a problem with the data.

Concretely: let's think about what we probably know about a dataset whenever we work with it. We'll use the Periodic Table as an example dataset, based on a snapshot of data I extracted from Wikipedia a few years ago. This is how Miró summarizes the dataset if we ask for a "long" listing of the fields with its ls -l command:

             Field      Type                        Min                         Max    Nulls
                 Z       int                          1                          92       0
              Name    string                   Actinium                   Zirconium       0
            Symbol    string                         Ac                          Zr       0
            Period       int                          1                           7       0
             Group       int                          1                          18      18
    ChemicalSeries    string                   Actinoid            Transition metal       0
      AtomicWeight      real                       1.01                      238.03       0
         Etymology    string                      Ceres                      zircon       0
RelativeAtomicMass      real                       1.01                      238.03       0
     MeltingPointC      real                    -258.98                    3,675.00       1
MeltingPointKelvin      real                      14.20                    3,948.00       1
     BoilingPointC      real                    -268.93                    5,596.00       0
     BoilingPointF      real                    -452.07                   10,105.00       0
           Density      real                       0.00                       22.61       0
       Description    string                        0.2            transition metal      40
            Colour    string    a soft silver-white ...    yellowish green or g ...      59

For each field, we have a name, a type, minimum and maximum values and a count of the number of missing values. [Scroll sideways if your window is too narrow to see the Nulls column on the right.] We also have, implicitly, the field order.

This immediately suggests a set of constraints we might want to construct. We've added an experimental command to Miró for generating constraints based on the field metadata shown earlier and a few other statistics. First, here's the "human-friendly" view that Miró produces if we use its autoconstraints -l command.

Figure 2: Auto-constraints from 92-element Periodic Table

In this table, the green cells represent constraints the system suggests for fields, and the orange cells show areas in which potential constraints were not constructed, though they would have been had the data been different. Darker shades of orange indicate constraints that were closer to being met by the data.

In addition to this human-friendly view, Miró generates a set of declarations, which can be thought of as candidate assertions. Specifically, they are statements that are true in the current dataset, and therefore constitute potential checks we might want to carry out on any future input datasets we are using for the same analytical process.

Here they are:

declare (>= (min Z) 1)
declare (<= (max Z) 92)
declare (= (countnull Z) 0)
declare (non-nulls-unique Z)

declare (>= (min (length Name)) 3)
declare (<= (min (length Name)) 12)
declare (= (countnull Name) 0)
declare (non-nulls-unique Name)

declare (>= (min (length Symbol)) 1)
declare (<= (min (length Symbol)) 2)
declare (= (countnull Symbol) 0)
declare (non-nulls-unique Symbol)

declare (>= (min Period) 1)
declare (<= (max Period) 7)
declare (= (countnull Period) 0)

declare (>= (min Group) 1)
declare (<= (max Group) 18)

declare (>= (min (length ChemicalSeries)) 7)
declare (<= (min (length ChemicalSeries)) 20)
declare (= (countnull ChemicalSeries) 0)
declare (= (countzero
            (or (isnull ChemicalSeries)
                (in ChemicalSeries (list "Actinoid" "Alkali metal"
                                         "Alkaline earth metal"
                                         "Halogen" "Lanthanoid"
                                         "Metalloid" "Noble gas"
                                         "Nonmetal" "Poor metal"
                                         "Transition metal"))))

declare (>= (min AtomicWeight) 1.007946)
declare (<= (max AtomicWeight) 238.028914)
declare (= (countnull AtomicWeight) 0)
declare (> (min AtomicWeight) 0)

declare (>= (min (length Etymology)) 4)
declare (<= (min (length Etymology)) 39)
declare (= (countnull Etymology) 0)

declare (>= (min RelativeAtomicMass) 1.007946)
declare (<= (max RelativeAtomicMass) 238.028914)
declare (= (countnull RelativeAtomicMass) 0)
declare (> (min RelativeAtomicMass) 0)

declare (>= (min MeltingPointC) -258.975000)
declare (<= (max MeltingPointC) 3675.0)

declare (>= (min MeltingPointKelvin) 14.200000)
declare (<= (max MeltingPointKelvin) 3948.0)
declare (> (min MeltingPointKelvin) 0)

declare (>= (min BoilingPointC) -268.930000)
declare (<= (max BoilingPointC) 5596.0)
declare (= (countnull BoilingPointC) 0)

declare (>= (min BoilingPointF) -452.070000)
declare (<= (max BoilingPointF) 10105.0)
declare (= (countnull BoilingPointF) 0)

declare (>= (min Density) 0.000089)
declare (<= (max Density) 22.610001)
declare (= (countnull Density) 0)
declare (> (min Density) 0)

declare (>= (min (length Description)) 1)
declare (<= (min (length Description)) 83)

declare (>= (min (length Colour)) 4)
declare (<= (min (length Colour)) 80)

Each green entry in the table maps to a declaration in this list. Let's look at a few:

  1. Min and Max. Z is the atomic number. Each element has an atomic number, which is the number of protons in the nucleus, and each is unique. Hydrogen has the smallest number of protons, 1, and in this dataset, Uranium has the largest number—92. So the first suggested constraints are that these values should be in the observed range. These show up as the first two declarations:

    declare (>= (min Z) 1)
    declare (<= (max Z) 92)

    We should say a word about how these constraints are expressed. Miró includes an expression language called (lisp-like) (because it's essentially a dialect of Lisp). Lisp is slightly unusual in that instead of writing f(x, y) you write (f x y). So the first expression would be more commonly expressed as

    min(Z) >= 1

    in regular ("infix") languages.

    Lisp weirdness aside, are these sensible constraints? Well, the first certainly is. Even if we find some elements beyond Uranium (which we will, below), we certainly don't expect them to have zero or negative numbers of protons, so the first constraint seems like a keeper.

    The second constraint is much less sensible. In fact, given that we know the dataset includes every value of Z from 1 to 92, we confidently expect that any future revisions of the periodic table will include values higher than 92. So we would probably discard that constraint.

    The crucial point is that no one wants to sit down and write out a bunch of constraints by hand (and anyway, "why have a dog and bark yourself?"). People are generally much more willing to review a list of suggested constraints and delete the ones that don't make sense, or modify them so that they do.

  2. Nulls. The next observation about Z is that it contains no nulls. This turns into the (lisp-like) constraint:

    declare (= (countnull Z) 0)

    This is also almost certainly a keeper: we'd probably be pretty unhappy if we received a Periodic Table with missing Z values for any elements.

    (Here, (countnull Z) just counts the number of nulls in field Z, and = tests for equality, so the expression reads "the number of nulls in Z is equal to zero".)

  3. Sign. The sign column is more interesting. Here, we have recorded the fact that all the values in Z are positive. Clearly, this is logically implied by the fact that the minimum value for Z is 1, but we think it's useful to record two separate observations about the field—first, that its minimum value is 1, and secondly that it is always strictly positive. In cases where the minimum is 1, for an integer field, these statements are entirely equivalent, but if the minimum had been (say) 3, they would be different. The value of recording these observations separately arises if at some later stage the minimum changes, while remaining positive. In that case, we might want to discard the specific minimum constraint, but leave in place the constraint on the sign.

    Although we record the sign as a separate constraint in the table, in this case it does not generate a separate declaration, as it would be identical to the constraint on the minimum that we already have.

    In contrast, AtomicWeight has a minimum value around 1.008, so it does get a separate sign constraint:

    declare (> (min AtomicWeight) 0)

  4. Uniqueness of Values. The next thing our autoconstraints framework has noticed about Z is that none of its values is repeated in the data—that all are unique (a.k.a. distinct). The table reports this as yes (the values are unique) and 92/92 (100%), meaning that there are 92 distinct values and 92 non-null values in total, so that 100% of values are unique. Other fields, such as Etymology, have potential constraints that are not quite met: Etymology has 89 different values in the field, so the ratio of distinct values to values is about 97%.

    NOTE: in considering this, we ignore nulls if there are any. You can see this if you look at the Unique entry for the field Group: here there are 18 different (non-null) values for Group, and 74 records have non-null values for Group.

    There is a dedicated function in (lisp-like) for checking whether the non-null values in a field are all distinct, so the expression in the declaration is just:

    (non-nulls-unique Z)

    which evaluates to true5 or false.

  5. Min and Max for String Fields. For string fields, the actual minimum and maximum values are usually less interesting. (Indeed, there are lots of reasonable alternative sort orders for strings, given choices such as case sensitivity, whether embedded numbers should be sorted numerically or alphanumerically, how spaces and punctuation should be handled, what to do with accents etc.) In the initial implementation, instead of using any min and max string values as the basis of constraints, we suggest constraints based on string length.

    For the string fields here, none of the constraints is particularly compelling, though a minimum length of 1 might be interesting and you might even think that a maximum length of 2 for the Symbol is sensible. But in many cases such constraints will be useful. One common case is fixed-length strings, such as the increasingly ubiquitous UUIDs,6 where the minimum and maximum lengths would both be 36 if they are canonically formatted. (Of course, we can add much stronger constraints if we know all the strings in a field are UUIDs.)

  6. Categorical Values. The last kind of automatically generated constraint we will discuss today is a restriction of the values in a field to be chosen from some fixed set. In this case, Miró has noticed that there are only 10 different non-null values for ChemicalSeries, so has suggested a constraint to capture that reality. The slightly verbose way this currently gets expressed as a constraint is:

    declare (= (countzero
                (or (isnull ChemicalSeries)
                    (in ChemicalSeries
                        (list "Actinoid" "Alkali metal"
                              "Alkaline earth metal"
                              "Halogen" "Lanthanoid" "Metalloid" "Noble gas"
                              "Nonmetal" "Poor metal" "Transition metal"))))

    (The or statement starting on the second line is true for field values that are either in the list or null. The countzero function, when applied to booleans, counts false values, so this is saying that none of the results of the or statement should be false, i.e. all values should be null or in the list. This would be more elegantly expressed with an (all ...) statement; we will probably change it to that formulation soon, though the current version is more useful for reporting failures.)

    The current implementation generates these constraints only when the number of distinct values it sees is 20 or fewer, only for string fields, and only when not all the values in the field are distinct; all of these aspects can probably be improved, and the user can override the number of categories to allow.

In addition to these constraints, we should also probably generate constraints on the field types and, as we will discuss in future articles, dataset-level constraints.
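The flavour of such a generator is easy to convey. The sketch below is not Miró's implementation (and, for simplicity, handles numeric fields only), but derives the same kinds of min/max, null-count, sign and uniqueness constraints from a column of values:

```python
def auto_constraints(name, values):
    """Suggest simple constraints from the values observed in one numeric
    field. An illustrative sketch only, not Miró's actual generator."""
    non_null = [v for v in values if v is not None]
    cons = []
    if non_null:
        lo, hi = min(non_null), max(non_null)
        cons.append(f"(>= (min {name}) {lo})")
        cons.append(f"(<= (max {name}) {hi})")
        # As described above, suppress the sign constraint when it would
        # duplicate the minimum constraint (an integer minimum of exactly 1).
        if lo > 0 and not (isinstance(lo, int) and lo == 1):
            cons.append(f"(> (min {name}) 0)")
    if len(non_null) == len(values):
        cons.append(f"(= (countnull {name}) 0)")
    if len(set(non_null)) == len(non_null):
        cons.append(f"(non-nulls-unique {name})")
    return [f"declare {c}" for c in cons]

for line in auto_constraints("Z", [1, 2, 3, 92]):
    print(line)
```

Run on a full Z column, this yields the same shape of output as the declarations above: range, null-count and uniqueness constraints, with the sign constraint suppressed because it would repeat the minimum.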

Tool Support: Using the Declarations

Obviously, if we test the constraints against the same dataset we used to generate them, all the constraints should be (and are!) satisfied. Things are slightly more interesting if we run them against a different dataset. In this case, we excluded transuranic elements from the dataset we used to generate the constraints. But we can add them in. If we do so, and then execute a script (e92.miros) containing the autogenerated constraints, we get the following output:

$ miro
This is Miro, version 2.1.90.
Copyright © Stochastic Solutions 2008-2015.
Seed: 1463187505
Logs started at 2015/11/25 17:08:23 host tdda.local.
Logging to /Users/njr/miro/log/2015/11/25/session259.

[1]> load elements
elements.miro: 118 records; 118 (100%) selected; 16 fields.
[2]> . e92
[3]> # Autoconstraints for dataset elements92.miro.
[4]> # Generated from session /Users/njr/miro/log/2015/11/25/session256.miros
[5]> declare (>= (min Z) 1)
[6]> declare (<= (max Z) 92)

Miro Warning: Declaration failed: (<= (max Z) 92)

[7]> declare (= (countnull Z) 0)
[8]> declare (non-nulls-unique Z)
[9]> declare (>= (min (length Name)) 3)
[10]> declare (<= (min (length Name)) 12)
[11]> declare (= (countnull Name) 0)
[12]> declare (non-nulls-unique Name)
[13]> declare (>= (min (length Symbol)) 1)
[14]> declare (<= (min (length Symbol)) 2)
[15]> declare (= (countnull Symbol) 0)
[16]> declare (non-nulls-unique Symbol)
[17]> declare (>= (min Period) 1)
[18]> declare (<= (max Period) 7)
[19]> declare (= (countnull Period) 0)
[20]> declare (>= (min Group) 1)
[21]> declare (<= (max Group) 18)
[22]> declare (>= (min (length ChemicalSeries)) 7)
[23]> declare (<= (min (length ChemicalSeries)) 20)
[24]> declare (= (countnull ChemicalSeries) 0)
[25]> declare (= (countzero
                  (or (isnull ChemicalSeries)
                      (in ChemicalSeries (list "Actinoid" "Alkali metal"
                                               "Alkaline earth metal"
                                               "Halogen" "Lanthanoid"
                                               "Metalloid" "Noble gas"
                                               "Nonmetal" "Poor metal"
                                               "Transition metal"))))
[26]> declare (>= (min AtomicWeight) 1.007946)
[27]> declare (<= (max AtomicWeight) 238.028914)

Miro Warning: Declaration failed: (<= (max AtomicWeight) 238.028914)

[28]> declare (= (countnull AtomicWeight) 0)

Miro Warning: Declaration failed: (= (countnull AtomicWeight) 0)

[29]> declare (> (min AtomicWeight) 0)
[30]> declare (>= (min (length Etymology)) 4)
[31]> declare (<= (min (length Etymology)) 39)
[32]> declare (= (countnull Etymology) 0)

Miro Warning: Declaration failed: (= (countnull Etymology) 0)

[33]> declare (>= (min RelativeAtomicMass) 1.007946)
[34]> declare (<= (max RelativeAtomicMass) 238.028914)

Miro Warning: Declaration failed: (<= (max RelativeAtomicMass) 238.028914)

[35]> declare (= (countnull RelativeAtomicMass) 0)

Miro Warning: Declaration failed: (= (countnull RelativeAtomicMass) 0)

[36]> declare (> (min RelativeAtomicMass) 0)
[37]> declare (>= (min MeltingPointC) -258.975000)
[38]> declare (<= (max MeltingPointC) 3675.0)
[39]> declare (>= (min MeltingPointKelvin) 14.200000)
[40]> declare (<= (max MeltingPointKelvin) 3948.0)
[41]> declare (> (min MeltingPointKelvin) 0)
[42]> declare (>= (min BoilingPointC) -268.930000)
[43]> declare (<= (max BoilingPointC) 5596.0)
[44]> declare (= (countnull BoilingPointC) 0)

Miro Warning: Declaration failed: (= (countnull BoilingPointC) 0)

[45]> declare (>= (min BoilingPointF) -452.070000)
[46]> declare (<= (max BoilingPointF) 10105.0)
[47]> declare (= (countnull BoilingPointF) 0)

Miro Warning: Declaration failed: (= (countnull BoilingPointF) 0)

[48]> declare (>= (min Density) 0.000089)
[49]> declare (<= (max Density) 22.610001)

Miro Warning: Declaration failed: (<= (max Density) 22.610001)

[50]> declare (= (countnull Density) 0)

Miro Warning: Declaration failed: (= (countnull Density) 0)

[51]> declare (> (min Density) 0)
[52]> declare (>= (min (length Description)) 1)
[53]> declare (<= (min (length Description)) 83)
[54]> declare (>= (min (length Colour)) 4)
[55]> declare (<= (min (length Colour)) 80)

10 warnings and 0 errors generated.

Job completed after a total of 10.2801 seconds.
Logs closed at 2015/11/25 17:08:23 host tdda.local.
Logs written to /Users/njr/miro/log/2015/11/25/session259.

By default, Miró generates warnings when declared constraints are violated. In this case, ten of the declared constraints were not met, so there were ten warnings. We can also set the declarations to generate errors rather than warnings, allowing us to stop execution of a script if the data fails to meet our declared expectations.

In this case, the failed declarations are mostly unsurprising and untroubling. The maximum values for Z, AtomicWeight, RelativeAtomicMass, and Density all increase in this version of the data, which is expected given that all the new elements are heavier than those in the initial analysis set. Equally, while the fields AtomicWeight, RelativeAtomicMass, Etymology, BoilingPointC, BoilingPointF and Density were all populated in the original dataset, each now contains nulls. Again, this is unsurprising in this case, but in other contexts, detecting these sorts of changes in a feed of data might be important. Specifically, we should always be interested in unexpected differences between the datasets used to develop an analytical process, and ones for which that process is used at a later time: it is very possible that they will not be handled correctly if they were not seen or considered when the process was developed.
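The verification side can be sketched just as easily; the warnings-versus-errors behaviour described above corresponds to collecting failures versus raising on the first one. The names and structure here are illustrative, not Miró's API:

```python
class DeclarationError(Exception):
    pass

def verify(dataset, declarations, fail_hard=False):
    """Each declaration pairs a human-readable description with a
    predicate over the dataset; return the list of warnings."""
    warnings = []
    for description, predicate in declarations:
        if not predicate(dataset):
            if fail_hard:  # errors-not-warnings mode: stop the script
                raise DeclarationError(description)
            warnings.append(f"Declaration failed: {description}")
    return warnings

# A 118-element dataset violates a constraint generated from 92 elements.
elements118 = {"Z": list(range(1, 119))}
declarations = [
    ("(>= (min Z) 1)", lambda d: min(d["Z"]) >= 1),
    ("(<= (max Z) 92)", lambda d: max(d["Z"]) <= 92),
]

for w in verify(elements118, declarations):
    print("Warning:", w)
```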

There are many further improvements we could make to the current state of the autoconstraint generation, and there are other kinds of constraints it can generate that we will discuss in later posts. But as simple as it is, this level of checking has already identified a number of problems in the work we have been carrying out with Skyscanner and other clients.

We will return to this topic, including discussing how we might add tool support for revising constraint sets in the light of failures, merging different sets of constraints and adding constraints that are true only of subsets of the data.

Parting thoughts

Outputs and Intermediates. While developing the ideas about automatically generating constraints, our focus was mostly on input datasets. But in fact, most of the ideas are almost as applicable to intermediate results and outputs (which, after all, often form the inputs to the next stage of an analysis pipeline). We haven't performed any analysis in this post, but if we had, there might be similar value in generating constraints for the outputs as well.

Living Constraints and Type Systems. In this article, we've also focused on checking constraints at particular points in the process—after loading data, or after generating results. But it's not too much of a stretch to think of constraints as statements that should always be true of data, even as we append records, redefine fields etc. We might call these living or perpetual constraints. If we do this, individual field constraints become more like types. This idea, together with dimensional analysis, will be discussed in future posts.

  1. See e.g. the timeless Little Bobby Tables XKCD and the Wikipedia entry on Defensive Programming.

  2. Metadata is data about data. In the context of tabular data, the simplest kinds of metadata are the field names and types. Any statistics we can compute are another form of metadata, e.g. minimum and maximum values, averages, null counts, values present etc. There is literally no limit to what metadata can be associated with an underlying dataset. 

  3. Obviously, in many situations, it's fine for identifiers or keys to be repeated, but it is also often the case that in a particular table a field value must be unique, typically when the records act as master records, defining the entities that exist in some category. Such tables are often referred to as master tables in database contexts.

  4. We're not saying this conviction is wrong: it is typically quicker just to whack in the new values each time. Our contention is that this is a more error-prone, less systematic approach. 

  5. (lisp-like) actually follows an amalgam of Lisp conventions, using t to represent True, like Common Lisp, and f for False, which is more like Scheme or Clojure. But it doesn't really matter here. 

  6. A so-called "universally unique identifier" (UUID) is a 128-bit number, usually formatted as a string of 32 hex digits separated into blocks of 8, 4, 4, 4, and 12 digits by hyphens—for example 12345678-1234-1234-1234-123456789abc. They are also known as globally unique identifiers (GUIDs) and are usually generated randomly, sometimes basing some bits on device and time to reduce the probability of collisions. Although fundamentally numeric in nature, it is fairly common for them to be stored and manipulated as strings. Wikipedia entry

Site News: Glossary; Table of Contents; Feeds

Posted on Mon 23 November 2015 in blog • Tagged with site news, glossary

The site now has a glossary, and also a table of contents, both linked from the side panel (which is at the top on mobile). The plan, obviously, is to keep these up-to-date as we discuss more topics. The table of contents is similar to the archives link at the top, but is chronological, rather than reverse-chronological, and has a short description of each article.

While writing the glossary, we decided that, in addition to the two classes of errors we discussed in Why Test-Driven Data Analysis—errors of implementation and errors of interpretation—we should probably break out a third category, namely errors of process. The first of the "interpretation" questions we listed was "Is the input data correct?". Presenting incorrect data to an analytical process certainly seems more like an error of process than an error of interpretation (though, as we will discuss in one of the next posts, arguably the process should detect at least some kinds of input errors). We will certainly discuss other examples of process errors in future posts. We'll probably update the Why... post with at least a footnote describing the new category.

We were also informed that some of the links to the RSS and Atom feeds were broken, even though the feeds themselves were OK. Apologies for this. As far as we can tell, they're all OK now. Please let us know if you try them and find they're not OK, or indeed if you find any other problems or errors.

Infinite Gain: The First Test

Posted on Mon 16 November 2015 in TDDA • Tagged with regression tests, reference tests

The first idea we want to appropriate from test-driven development is that of regression testing, and our specific analytical variant of this, the idea of a reference test.

We propose a "zeroth level" of test-driven data analysis as recording one or more specific sets of inputs to an analytical process, together with the corresponding outputs generated, and ensuring that the process can be re-run using those recorded inputs. The first test can then simply be checking that the results remain the same if the analysis is re-run.

In the language of test-driven development, this is a regression test, because it tests that no regressions have occurred, i.e. the results are the same now as previously. It is also a system test, in the sense that it checks the functioning of the whole system (the analytical process), rather than one or more specific subunits, as is the case with unit tests.

In our work with Skyscanner, Stochastic Solutions maintains a number of tests of this type for each of our major analytical processes. They help to ensure that as we make changes to the analysis scripts, and any of the software they depend on, we don't break anything without noticing. We also run them whenever we install new versions on Skyscanner servers, to check that we get identical results on their platforms as on our own development systems. We call these whole-system regression tests reference tests, and run them as part of the special commit process we use each time we update the version number of the software. In fact, our process only allows the version number to be updated if the relevant tests—including the relevant reference tests—pass.
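In its simplest form, a reference test can be an ordinary unit-testing-framework test that re-runs the whole process on recorded inputs and checks for identical outputs. A minimal sketch (the "pipeline" here is a stand-in for a real analytical process, and all names are illustrative):

```python
import unittest

def run_analysis(inputs):
    # Stand-in for a whole analytical pipeline run on recorded inputs.
    return {"n": len(inputs), "total": sum(inputs)}

# Inputs and outputs recorded from a run whose results were checked by hand.
REFERENCE_INPUTS = [3, 1, 4, 1, 5]
REFERENCE_OUTPUT = {"n": 5, "total": 14}

class ReferenceTest(unittest.TestCase):
    def testWholeSystemUnchanged(self):
        # Any difference from the recorded output is a regression and
        # fails the test.
        self.assertEqual(run_analysis(REFERENCE_INPUTS), REFERENCE_OUTPUT)
```

The test says nothing about why the output is right, only that it has not changed since a run we believed to be correct.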

Some practical considerations

  1. Stochastic (Randomized) Analyses

    We assume that our analytical process is deterministic. If it involves a random component, we can make it deterministic by fixing the seed (or seeds) used by the random number generators. Any seeds should be treated as input parameters; if the process seeds itself (e.g. from the clock), it is important it writes out the seeds to allow the analysis to be re-run.
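For instance, a seed treated as an input parameter, and recorded whenever it is generated automatically, makes a randomized analysis repeatable. A sketch, with illustrative names:

```python
import random

def randomized_analysis(n, seed=None):
    if seed is None:
        # Self-seeding: choose a seed, but record it so the run can be re-run.
        seed = random.randrange(2 ** 32)
        print("Using seed: %d" % seed)   # write the seed to the log
    rng = random.Random(seed)            # private generator, fixed by the seed
    return [rng.random() for _ in range(n)]

# Re-running with the same recorded seed reproduces the results exactly.
assert randomized_analysis(5, seed=12345) == randomized_analysis(5, seed=12345)
```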

  2. Correctness

    We also assume that the analyst has performed some level of checking of the results to convince herself that they are correct. In the worst case, this may consist of nothing more than verifying that the program runs to completion and produces output of the expected form that is not glaringly obviously incorrect.

    Needless to say, it is vastly preferable if more diligent checking than this has been carried out, but even if the level of initial checking of results is superficial, regression tests deliver value by allowing us to verify the impact of changes to the system. Specifically, they allow us to detect situations in which a result is unexpectedly altered by some modification of the process—direct or indirect—that was thought to be innocuous (see below).

  3. Size / Time

    Real analysis input datasets can be large, as can outputs, and complex analyses can take a long time. If the data is "too large" or the run-time excessive, it is quite acceptable (and in various ways advantageous) to cut it down. This should obviously be done with a view to maintaining the richness and variability of the inputs. Indeed, the data can also be changed to include more "corner cases", or, for example, to anonymize it, if it is sensitive.

    The main reason we are not specifically advocating cutting down the data is that we want to make the overhead of implementing a reference test as low as possible.

  4. Feeds

    If the analytical process directly connects to some dynamic data feed, it will be desirable (and possibly necessary) to replace that feed with a static input source, usually consisting of a snapshot of the input data. Obviously, in some circumstances, this might be onerous, though in our experience it is usually not very hard.

  5. Time-Dependent Analysis

    Another factor that can cause analysis of fixed input data, with a fixed analytical process, to produce different results is explicit or implicit time-dependence in the analysis. For example, the analysis might convert an input that is a date stamp to something like "number of whole days before today", or the start of the current month. Obviously, such transformations produce different results when run on different days. As with seeds, if there are such transformations in the analysis code, they need to be handled. To cope with this sort of situation, we typically look up any reference values such as today early in the analytical process, and allow optional override parameters to be provided. Thus, in ordinary use we might run an analysis script by saying:


    but in testing replace this by something like

      AAA_TODAY="2015/11/01" python

    to set the environment variable AAA_TODAY to an override value, or with a command such as

     python -d 2015/11/01

    to pass in the date as a command-line option to our script.
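A sketch of how the script itself might honour such an override; AAA_TODAY is the environment variable from the example above, and everything else here is illustrative:

```python
import datetime
import os

def get_today():
    """Look up 'today' once, allowing an environment-variable override."""
    override = os.environ.get('AAA_TODAY')
    if override:
        return datetime.datetime.strptime(override, '%Y/%m/%d').date()
    return datetime.date.today()

def days_before_today(datestamp):
    # Example time-dependent transformation: whole days before 'today'.
    return (get_today() - datestamp).days
```

With AAA_TODAY="2015/11/01" set in the environment, every run computes the same day counts, whatever the actual date of the run.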

  6. Numerical Precision

    Computers are basically deterministic, and, regardless of what numerical accuracy they achieve, if they are asked to perform the same operations, on the same inputs, in the same order, they will normally produce identical results every time. Thus even if our outputs are floating-point values, there is no intrinsic problem with testing them for exact equality. The only thing we really need to be careful about is that we don't perform an equality test between a rounded output value and a floating-point value held internally without rounding (or, more accurately, held as an IEEE floating-point value, rather than a decimal value of given precision). In practice, when comparing floating-point values, we either need to compare formatted string output, rounded in some fixed manner, or compare values to some fixed level of precision. In most cases, the level of precision will not matter very much, though in particular domains we may want to exercise more care in choosing this.

    To make this distinction clear, look at the following Python code:

      $ python
      Python 2.7.10 (default, Jul 14 2015, 19:46:27)
      [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> from __future__ import division
      >>> a = 1/3
      >>> b = 1/3
      >>> print a
      0.333333333333
      >>> a == 0.333333333333
      False
      >>> a == b
      True
      >>> round(a, 12) == round(0.333333333333, 12)
      True
      >>> str(a) == '0.333333333333'
      True
      >>> '%.12f' % a == '0.333333333333'
      True

    In this code fragment,

    • The first line tells Python to return floating-point values from integer division (always a good idea).

    • The next two lines just assign a and b each to be a third.

    • The following line confirms that the result of this is, as we'd expect, 0.3333... But, crucially, this value is not exact. If we print it to 60 decimal places, we see:

      >>> print "%.60f" % a
      0.333333333333333314829616256247390992939472198486328125000000

    • Unsurprisingly, therefore, when in the next statement we ask Python whether a is equal to 0.333333333333, the result is False.

    • After this, as expected, we confirm that a == b is True.

    • We then confirm that if we round a to 12 decimal places, the result is exactly round(0.333333333333, 12). Do we need the round on the right-hand side? Probably not, but be aware that 0.333333333333 is not a value that can be stored exactly in binary, so:

      >>> print '%.60f' % 0.333333333333

      It's probably clearer, therefore, either to round both sides or to use string comparisons.

    • Finally, we perform two string comparisons. The first relies on Python's default string formatting rules, and the second is more explicit.

    NOTE: When it comes to actually writing tests, Python's unittest module includes an assertAlmostEqual method, which takes a number of decimal places, so if a function f(x) is expected to return the result 1/3 when x = 1, the usual way to test this to 12 decimal places is with the following code fragment:

      def testOneThird(self):
          self.assertAlmostEqual(f(1), 0.333333333333, 12)

  7. Parallel Processing

    Another factor that can cause differences in results is parallel execution, which can often result in subtle changes of detailed sequence of operations carried out. A simple example would be a task farm in which each of a number of workers calculates a result. If those results are then summed by the controller process in the order they are returned, rather than in a predefined sequence, numerical rounding errors may result in different answers. Thus, more care has to be taken in these sorts of cases.
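The effect is easy to demonstrate even without parallelism: floating-point addition is not associative, so summing the same values in a different order can change the result.

```python
# Summing identical values in two different orders gives different answers,
# because each intermediate result is rounded to the nearest double.
a = sum([0.1, 0.2, 0.3])   # (0.1 + 0.2) + 0.3
b = sum([0.3, 0.2, 0.1])   # (0.3 + 0.2) + 0.1
print(a, b)                # the two sums differ in the last place
assert a != b
```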

  8. Variable Output

    A final implementation detail is that we sometimes have to be careful about simply comparing output logs, graph files etc. It is very common for output to include things that may vary from run to run, such as timestamps, version information or sequence numbers (run 1, run 2...) In these cases, the comparison process needs to make suitable allowances. We will discuss some methods for handling this in a future article.
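One simple approach is to normalize away the varying parts before comparing, for example by masking anything that looks like a timestamp. A sketch (the timestamp format and log lines are illustrative):

```python
import re

# Mask run-to-run variation (here, timestamps) before comparing two logs.
TIMESTAMP = re.compile(r'\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}')

def normalized(log):
    return TIMESTAMP.sub('<timestamp>', log)

log1 = "Logs closed at 2015/11/25 17:08:23 host tdda.local."
log2 = "Logs closed at 2015/11/26 09:15:02 host tdda.local."
assert normalized(log1) == normalized(log2)   # same log, modulo the timestamp
```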

Reasons a Regression Test Might Fail

Changes to the system not intended to change the result, but sometimes doing so, can take many forms. For example:

  • We might extend our analysis code to accommodate some variation in the input data handled.

  • We might add an extra parameter or code path to allow some variation in the analysis performed.

  • We might upgrade some software, e.g. the operating system, libraries, the analysis software or the environment in which the software runs.

  • We might upgrade the hardware (e.g. adding memory, processing capacity or GPUs), potentially causing different code paths to be followed.

  • We might run the analysis on a different machine.

  • We might change the way in which the input data is stored, retrieved or presented to the software.

  • Hardware and software can develop faults, and data corruption can and does occur.

The Law of Software Regressions

Experience shows that regression tests are a very powerful tool for identifying unexpected changes, and that such changes occur more often than anyone expects. In fact, writing this reminds me of the self-referential law1 proposed by Doug Hofstadter:

Hofstadter's Law:

It always takes longer than you expect, even when you take into account Hofstadter's Law.

Gödel, Escher, Bach: An Eternal Golden Braid, Douglas R. Hofstadter.

In a similar vein, we might coin a Law of Software Regressions:

The Law of Software Regressions:

Software regressions happen more often than expected, even when you take into account the Law of Software Regressions.

  1. Douglas R. Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid, p. 152. Penguin Books (Harmondsworth) 1980. 

How is this Misleading Data Misleading Me?

Posted on Fri 13 November 2015 in TDDA • Tagged with tdda, implementation, interpretation, correctness

"Why is this lying bastard lying to me?"

Louis Heren,1 often attributed to Jeremy Paxman.

In a previous post, we made a distinction between two kinds of errors—implementation errors and errors of interpretation. I want to amplify that today, focusing specifically on interpretation.

The most important question to keep in mind at all times is not whether the analysis is computing the thing we wanted it to compute, but rather whether the result we have produced means what we think it means. The distinction is crucial.

As a simple example, let's suppose we specify the goal of our analysis as calculating the mean of a set of numbers. We can test that by adding them up and dividing by the number of items. But if we think the goal is to characterize a typical transaction size, we have to ask whether the arithmetic mean is the right metric for understanding that. As we move more towards a business or conceptual goal, rather than a mathematical or algorithmic formulation of a calculation, we have more complex and nuanced considerations, such as:

  • Do we believe the inputs are correct?

  • Is our chosen metric capable of addressing our underlying need (in this case, determining a typical transaction size)?

  • How do we handle nulls (missing values)?

  • Will outliers (perhaps extremely large values) or invalid inputs (perhaps negative values) invalidate the calculation?

  • If the values have dimensionality,2 do all of the values have the same dimensionality, and in the same units (e.g. all money and all in pounds sterling, or all distances and all measured in miles).

  • For that matter, are the inputs even commensurate, i.e. do they quantify sufficiently similar things that calculating their mean is even meaningful?
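A toy example makes the gap between the mathematical specification and the conceptual goal concrete: with skewed amounts, the mean is computed exactly as specified, yet describes no typical transaction. The figures are invented for illustration.

```python
# Illustrative transaction amounts: mostly small, with one large outlier.
transactions = [5, 5, 6, 6, 7, 1000]

mean = sum(transactions) / float(len(transactions))
ordered = sorted(transactions)
median = (ordered[2] + ordered[3]) / 2.0    # middle pair of the six values

print(mean)     # 171.5 -- correct as specified, but typical of nothing
print(median)   # 6.0   -- much closer to a "typical" transaction
```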

Paxman/Heren's constant question quoted above—Why is this lying bastard lying to me?—will serve as an excellent question to keep in mind every time we view an analytical result, perhaps recast as how is this misleading data misleading me? There is a great temptation to believe beautifully formatted, painstakingly calculated results produced by the almost unfathomable power of modern computers. In fact, there is much to be said for thinking of the combination of data and processing as an adversary constantly trying to fool you into drawing false conclusions.

The questions of implementation are concerned with checking that the data received as input to the analytical process has been faithfully transmitted from the source systems, and that the calculations and manipulations performed in the analysis correctly implement the algorithms we intended to use. In contrast, as we outlined previously, the questions of interpretation emphasize that we need to be ever vigilant, asking ourselves:

  • Is the input data correct?

  • Is our interpretation of the input data correct?

  • Are the algorithms we are applying to the data meaningful and appropriate?

  • Is our interpretation of the results we produce correct?

  • Are the results plausible?

  • What am I missing?

  • How is this misleading data misleading me?

  1. This quote is usually attributed to Jeremy Paxman, as noted in The Guardian article Paxman answers the questions of 31st January 2005. According to the article, however, the true origin is a former deputy editor of the Times, Louis Heren, in his memoirs, with the full quote being "When a politician tells you something in confidence, always ask yourself: 'Why is this lying bastard lying to me?'" Still other reports, however, say Heren himself was merely quoting advice he was given. Melvin J. Lasky writes in Profanity, Obscenity and the Media, Transaction Publishers (New Brunswick) 2005:

    "Find out why the lying bastards are lying!" This is the famous phrase of an editor of the Times, Louis Heren, who received it as "advice given him early in his career by ... a correspondent of the Daily Worker [the Communist daily in London]: 'Always ask yourself why these lying bastards are lying to you.'"

  2. Here, we use dimensionality in the sense of Dimensional Analysis, which allows us to make inferences about the results of calculations based on classifying the inputs by category. For example, we would distinguish lengths from times from quantities of money, and so forth. We would also treat separately dimensionless quantities, such as counts or ratios of quantities of the same dimension (e.g. a ratio of two lengths). 

Test-Driven Development: A Review

Posted on Mon 09 November 2015 in TDDA • Tagged with tdd

Since a key motivation for developing test-driven data analysis (TDDA) has been test-driven development (TDD), we need to conduct a lightning tour of TDD before outlining how we see TDDA developing. If you are already familiar with test-driven development, this may not contain too much that is new for you, though we will present it with half an eye to the repurposing of it that we plan as we move towards test-driven data analysis.

Test-driven development (TDD) has gained notable popularity as an approach to software engineering, both in its own right and as a key component of the Agile development methodology. Its benefits, as articulated by its adherents, include higher software quality, greater development speed, improved flexibility during development (i.e., more ability to adjust course during development), earlier detection of bugs and regressions1 and an increased ability to restructure ("refactor") code.

The Core Idea of Test-Driven Development

Automation + specification + verification + refactoring

The central idea in test-driven development is that of using a comprehensive suite of automated tests to specify the desired behaviour of a program and to verify that it is working correctly. The goal is to have enough, sufficiently detailed tests to ensure that when they all pass we feel genuine confidence that the system is functioning correctly.

The canonical test-driven approach to software development consists of the following stages:

  • First, a suite of tests is written specifying the correct behaviour of a software system. As a trivial example, if we are implementing a function, f, to compute the sum of two inputs, a and b, we might specify a set of correct input-output pairs. In TDD, we structure our tests as a series of assertions, each of which is a statement that must be satisfied in order for the test to pass. In this case, some possible assertions, expressed in pseudo-code, would be:

    assert f( 0,  0)  =  0
    assert f( 1,  7)  =  8
    assert f(-2, 17)  = 15
    assert f(-3, +3)  =  0

    Importantly, the tests should also, in general, check and specify the generation of errors and the handling of so-called edge cases. Edge cases are atypical but valid cases, which might include extreme input values, handling of null values and handling of empty datasets. For example:

    assert f("a", 7) -> TypeError
    assert f(MAX_FLOAT, MAX_FLOAT) = Infinity

NOTE This is not a comprehensive set of tests for f. We'll talk more about what might be considered adequate for this function in later posts. The purpose of this example is simply to show the general structure of typical tests.

  • An important aspect of testing frameworks is that they allow tests to take the form of executable code that can be run even before the functionality under test has been written. At this stage, since we have not even defined f, we expect the tests not to pass, but to produce errors such as "No such function: f". Once a minimal definition for f has been provided, such as one that always returns 0, or that returns no result, the errors should turn into failures, i.e. assertions that are not true.

  • When we have a suite of failing tests, software is written with the goal of making all the tests pass.

  • Once all the tests pass, TDD methodology dictates that coding should stop because if the test suite is adequate (and free of errors) we have now demonstrated that the software is complete and correct. Part of the TDD philosophy is that if more functionality is required, one or more further tests should be written to specify and demonstrate the need for more (or different) code.

  • There is one more important stage in test-driven development, namely refactoring. This is the process of restructuring, simplifying or otherwise improving code while maintaining its functionality (i.e., keeping the tests passing). It is widely accepted that complexity is one of the biggest problems in software, and simplifying code as soon as the tests pass allows us to attempt to reduce complexity as early as possible. It is a recognition of the fact that the first successful implementation of some feature will typically not be the most direct and straightforward.
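A small illustration of the safety net this provides: a test pins the behaviour, so the implementation can be simplified with confidence. The function and values here are purely illustrative.

```python
# First working implementation: correct, but more complicated than it needs to be.
def total(values):
    result = 0
    for v in values:
        result = result + v
    return result

assert total([1, 7]) == 8          # the test passes...

# Refactored implementation: same behaviour, simpler code.
def total(values):
    return sum(values)

assert total([1, 7]) == 8          # ...and still passes after the rewrite
```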

The philosophy of writing tests before the code they are designed to validate leads some to suggest that the second "D" in TDD (development) should really stand for design (e.g. Allen Holub3). This idea grows out of the observation that with TDD, testing is moved from its traditional place towards the end of the development cycle to a much earlier and more prominent position where specification and design would traditionally occur.

TDD advocates tend to argue for making tests very quick to run (preferably mere seconds for the entire suite) so that there is no impediment to running them frequently during development, not just between each code commit,4 but multiple times during the development of each function.

Another important idea is that of regression testing. As noted previously, a regression is a defect that is introduced by a modification to the software. A natural consequence of maintaining and using a comprehensive suite of tests is that when such regressions occur, they should be detected almost immediately. When a bug does slip through without triggering a test failure, the TDD philosophy dictates that before it is fixed, one or more failing tests should be added to demonstrate the incorrect behaviour. By definition, when the bug is fixed, these new tests will pass unless they themselves contain errors.

Common Variations, Flavours and Implementations

A distinction is often made between unit tests and system tests (also known as integration tests). Unit tests are supposed to test low-level software units (such as individual functions, methods or classes). There is often a particular focus on these low-level unit tests, partly because they can often be made to run very quickly, and partly (I think) because there is an implicit belief or assumption that if each individual component is well tested, the whole system built out of those components is likely to be reliable. (Personally, I think this is a poor assumption.)

In contrast, system tests and integration tests exercise many parts of the system, often completing larger, more realistic tasks, and more often interfacing with external systems. Such tests are often slower and it can be hard to avoid their having side effects (such as updating entries in databases).

The distinction, however, between the different levels is somewhat subjective, and some organizations give more equal or greater weight to higher level tests. This will be an interesting issue as we consider how to move towards test-driven data analysis.

Another practice popular within some TDD schools is that of mocking. The general idea of mocking is to replace some functionality (such as a database lookup, a URL fetch, a disk write, a trigger event or a function call) with a simpler function call or a static value. This is done for two main reasons. First, if the mocked functionality is expensive, or has side effects, test code can often be made much faster and side-effect free if its execution is bypassed. Secondly, mocking allows a test to focus on the correctness of a particular aspect of functionality, without any dependence on the external part of the system being mocked out.

Other TDD practitioners are less keen on mocking, feeling that it leads to less complete and less realistic testing, and raises the risk of missing some kinds of defects. (Those who favour mocking also tend to place a strong emphasis on unit testing, and to argue that more expensive, non-mocked tests should form part of integration testing, rather than part of the more frequently run core unit test suite.)
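As a sketch of the mechanics, using Python's unittest.mock module (Python 3; under Python 2 the equivalent functionality was the separate mock package), with an invented rate-lookup service for illustration:

```python
from unittest import mock

class PriceReporter(object):
    def __init__(self, rate_source):
        self.rate_source = rate_source      # e.g. a client for a remote service

    def price_in_gbp(self, amount, currency):
        return amount * self.rate_source.rate(currency)

# In a test, the slow, side-effecting rate source is replaced by a mock that
# returns a canned value, so only PriceReporter's own logic is exercised.
fake_source = mock.Mock()
fake_source.rate.return_value = 0.5
reporter = PriceReporter(fake_source)

assert reporter.price_in_gbp(100, 'USD') == 50.0
fake_source.rate.assert_called_once_with('USD')
```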

While no special software is strictly required in order to follow a broadly test-driven approach to development, good tools are extremely helpful. There are standard libraries that support this for most mainstream programming languages. The xUnit family of test software (e.g. CUnit for C, JUnit for Java, unittest for Python) uses a common architecture designed by Kent Beck.2 It is worth noting that the RUnit package is such a system for use with the popular data analysis language R.


As an example, the following Python code tests a function f, as described above, using Python's unittest module. Even if you are completely unfamiliar with Python, you will be able to see the six crucial lines that implement exactly the six tests described in pseudo-code above, in this case through four separate test methods.

import sys
import unittest

def f(a, b):
    return a + b

class TestAddFunction(unittest.TestCase):
    def testNonNegatives(self):
        self.assertEqual(f(0, 0), 0)
        self.assertEqual(f(1, 7), 8)

    def testNegatives(self):
        self.assertEqual(f(-2, 17), 15)
        self.assertEqual(f(-3, +3), 0)

    def testStringInput(self):
        self.assertRaises(TypeError, f, "a", 7)

    def testOverflow(self):
        self.assertEqual(f(sys.float_info.max, sys.float_info.max),
                         float('inf'))

if __name__ == '__main__':
    unittest.main()

If this code is run, including the function definition for f, the output is as follows:

$ python
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK

Here, each dot signifies a passing test.

However, if this is run without defining f, the result is the following output:

$ python
EEEE
======================================================================
ERROR: testNegatives (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "", line 13, in testNegatives
    self.assertEqual(f(-2, 17), 15)
NameError: global name 'f' is not defined

======================================================================
ERROR: testNonNegatives (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "", line 9, in testNonNegatives
    self.assertEqual(f(0, 0), 0)
NameError: global name 'f' is not defined

======================================================================
ERROR: testOverflow (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "", line 20, in testOverflow
    self.assertEqual(f(sys.float_info.max, sys.float_info.max),
NameError: global name 'f' is not defined

======================================================================
ERROR: testStringInput (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "", line 17, in testStringInput
    self.assertRaises(TypeError, f, "a", 7)
NameError: global name 'f' is not defined

----------------------------------------------------------------------
Ran 4 tests in 0.000s

FAILED (errors=4)

Here the four E's at the top of the output represent errors when running the tests. If a dummy definition of f is provided, such as:

def f(a, b):
    return 0

the tests will fail, producing F's rather than raising the errors that result in E's.

Benefits of Test-Driven Development

Correctness. The most obvious reason to adopt test-driven development is the pursuit of higher software quality. TDD proponents certainly feel that there is considerable benefit to maintaining a broad and rich set of tests that can be run automatically. There is rather more debate about how important it is to write the tests strictly before the code they are designed to test. I would say that to qualify as test-driven development, the tests should be produced no later than immediately after each piece of functionality is implemented, but purists would take a stricter view.

Regression detection. The second benefit of TDD is in the detection of regressions, i.e. failures of code in areas that previously ran successfully. In practice, regression testing is even more powerful than it sounds because not only can many different failure modes be detected by a single test, but experience shows that there are often areas of code that are susceptible to similar breakages from many different causes and disturbances. (This can be seen as a rare case of combinatorial explosion working to our advantage: there are many ways to get code wrong, and far fewer to get it right, so a single test can catch many different potential failures.)

Specification, Design and Documentation. One of the stronger reasons for writing tests before the functions they are designed to verify is that the test code then forms a concrete specification. In order even to write the test, a certain degree of clarity has to be brought to the question of precisely what the function that is being written is supposed to do. This is the key insight that leads towards the idea of TDD as test-driven design over test-driven development. A useful side effect of the test suite is that it also forms a precise and practical form of documentation as to exactly how the code can be used successfully, and one that, by definition, has to be kept up to date—a perennial problem for documentation.

Refactoring. The benefits listed so far are relatively unsurprising. The fourth is more profound. In many software projects, particularly large and complex ones, once the software is deemed to be working acceptably well, some areas of the code come to be regarded as too dangerous to modify, even when problems are discovered. Developers (and managers) who know how much pain and effort was required to make something work (or more-or-less work) become fearful that the risks associated with fixing or upgrading code are simply too high. In this way, code becomes brittle and neglected and thus essentially unmaintainable.

In my view, the single biggest benefit of test-driven development is that it goes a long way to eliminating this syndrome, allowing us to rewrite, simplify and extend code safely, confident in the knowledge that if the tests continue to function, it is unlikely that anything very bad has happened to the code. The recommended practice of refactoring code as soon as the tests pass is one aspect of this, but the larger benefit of maintaining a comprehensive set of tests is that such refactoring can be performed at any time.

These are just the most important and widely recognized benefits of TDD. Additional benefits include the ability to check that code is working correctly on new machines or systems, or in any other new context, providing a useful baseline of performance (if timed and recorded) and providing an extremely powerful resource if code needs to be ported or reimplemented.

  1. A software regression is a bug in a later version of software that was not present in a previous version of the software. It contrasts with bugs that may always have been present but were not detected. 

  2. Kent Beck, Test-Driven Development: By Example, Addison Wesley, 2003. 

  3. Allen Holub, Test-Driven Design, Dr. Dobb's Journal, May 5th 2014. 

  4. Most non-trivial software development uses a so-called revision control system to provide a comprehensive history of versions of the code. Developers normally run code frequently, and typically commit changes to the revision-controlled repository somewhat less frequently (though still, perhaps, many times a day). With TDD, the tests form an integral part of the code base, and it is common good practice to require that code is only committed when the tests pass. Sometimes this requirement is merely a rule or convention, while in other cases systems are set up in such a way as to enable code to be committed only when all of its associated tests pass. 

Why Test-Driven Data Analysis?

Posted on Thu 05 November 2015 in TDDA • Tagged with questions, tdda, tdd

OK, everything you need to know about TeX has been explained—unless you happen to be fallible. If you don't plan to make any errors, don't bother to read this chapter.

— The TeXbook, Chapter 27, Recovery from Errors. Donald E. Knuth.1

The concept of test-driven data analysis seeks to improve the answers to two sets of questions, which are defined with reference to an "analytical process".

Figure 1: A typical analytical process

The questions assume that you have used the analytical process at least once, with one or more specific collections of inputs, and that you are ready to use, share, deliver or simply believe the results.

The questions in the first group concern the implementation of your analytical process:

Implementation Questions

  1. How confident are you that the outputs produced by the analytical process, with the input data you have used, are correct?

  2. How confident are you that the outputs would be the same if the analytical process were repeated using the same input data?

  3. Does your answer change if you repeat the process using different hardware, or after upgrading the operating system or other software?

  4. Would the analytical process generate any warning or error if its results were different from when you first ran it and satisfied yourself with the results?

  5. If the analytical process relies on any reference data, how confident are you that you would know if that reference data changed or became corrupted?

  6. If the analytical process were run with different input data, how confident are you that the output would be correct on that data?

  7. If corrupt or invalid input data were used, how confident are you that the process would detect this and raise an appropriate warning, error or failure?

  8. Would someone else be able reliably to produce the same results as you from the same inputs, given detailed instructions and access?

  9. Corollary: do such detailed instructions exist? If you were knocked down by the proverbial bus, how easily could someone else use the analytical process?

  10. If someone developed an equivalent analytical process, and their results were different, how confident are you that yours would prove to be correct?
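Some of these questions (4, 5 and 7 in particular) admit partly mechanical answers. As a minimal sketch, with hypothetical function names, a checksum pinned in the analytical process detects when reference data has changed or become corrupted:

```python
import hashlib

def file_sha256(path):
    """Return the SHA-256 hex digest of the file at path."""
    h = hashlib.sha256()
    with open(path, 'rb') as fin:
        # Hash in chunks so that large reference files are handled too.
        for chunk in iter(lambda: fin.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

def check_reference_data(path, expected_digest):
    """Raise ValueError if the reference data does not match the
    known-good digest recorded when the results were last verified."""
    actual = file_sha256(path)
    if actual != expected_digest:
        raise ValueError('Reference data %s has changed: expected %s, got %s'
                         % (path, expected_digest, actual))
```

Run at the start of the analytical process, such a check turns a silent change in reference data (question 5) into an explicit error (question 4), at the cost of re-recording the digest whenever the reference data is deliberately updated.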

These questions are broadly similar to the questions addressed by test-driven development, set in the specific context of data analysis.

The questions in our second group are concerned with the meaning of the analysis, and a larger, more important sense of correctness:

Interpretation Questions

  1. Is the input data2 correct?3
  2. Is your interpretation of the input data correct?
  3. Are the algorithms you are applying to the data meaningful and appropriate?
  4. Are the results plausible?
  5. Is your interpretation of the results correct?
  6. More generally, what are you missing?

These questions are less clear cut than the implementation questions, but they are at least as important, and in some ways more so. If the implementation questions are about producing the right answers, the interpretation questions are about asking the right questions, and understanding the answers.

Over the coming posts, we will seek to shape a coherent methodology and set of tools to help us provide better answers to both sets of questions—implementational and interpretational. If we succeed, the result should be something worthy of the name test-driven data analysis.

  1. Donald E. Knuth, The TeXbook, Chapter 27, Recovery from Errors. Addison Wesley (Reading Mass) 1984. 

  2. I am aware that, classically, data is the plural of datum, and that purists would prefer my question to be phrased as "Are the data correct?" If the use of 'data' in the singular offends your sensibilities, I apologise. 

  3. When adding Error of Implementation and Error of Interpretation to the glossary, we decided that this first question really pertained to a third category of error, namely an Error of Process. 

Test-Driven Data Analysis

Posted on Thu 05 November 2015 in TDDA • Tagged with motivation

A dozen or so years ago I stumbled across the idea of test-driven development from reading various posts by Tim Bray on his Ongoing blog. It was obvious that this was a significant idea, and I adopted it immediately. It has since become an integral part of the software development processes at Stochastic Solutions, where we develop our own analytical software (Miró and the Artists Suite) and custom solutions for clients. But software development is only part of what we do at the company: the larger part of our work consists of actually doing data analysis for clients. This has a rather different dynamic.

Fast forward to 2012, and a conversation with my long-term collaborator and friend, Patrick Surry, during which he said something to the effect of:

So what about test-driven data analysis?

— Patrick Surry, c. 2012

The phrase resonated instantly, but neither of us entirely knew what it meant. It has lurked in my brain ever since, a kind of proto-meme, attempting to inspire and attach itself to a concept worthy of the name.

For the last fifteen months, my colleagues—Sam Rhynas and Simon Brown—and I have been feeling our way towards an answer to the question

What is test-driven data analysis?

We haven't yet pulled all the pieces together into coherent methodology, but we have assembled a set of useful practices, tools and processes that feel as if they are part of the answer.

A few weeks ago, my friend and ex-colleague Greg Wilson was in town for Edinburgh Parallel Computing Centre's twenty-fifth birthday bash. Greg is a computer scientist and former lecturer from the University of Toronto. He now spends most of his time teaching scientists key ideas from software engineering through his Software Carpentry organization. He lamented that while he has no trouble persuading scientists of the benefits of adopting ideas such as version control, he finds them almost completely unreceptive when he champions software testing. I was initially rather shocked by this, since I routinely say that test-driven development is the most significant idea in software in the last thirty or forty years. Thinking about it more, however, I suspect the reasons for the resistance Greg encounters are similar to the reasons we have found it harder than we expected to take mainstream ideas from test-driven development and apply them in the rather specialized area of data analysis. Testing scientific code is more like testing analysis processes than it is like testing software per se.

As I reflected further on what Greg had said, I experienced a moment of clarity. The new insight is that while we have a lot of useful components for test-driven data analysis, including some useful fragments of a methodology, we really don't have appropriate tools: the xUnit frameworks and their ilk are excellent for test-driven development, but don't provide specific support for the patterns we tend to need in analysis, and address only a subset of the issues we should want test-driven data analysis to cover.

The purpose of this new blog is to think out loud as we—in partnership with one of our key clients, Skyscanner—try to develop tools and methodologies to form a coherent framework and support system for a more systematic approach to data science—a test-driven approach to data analysis.

So watch this space.

If you want to subscribe, this site has RSS and ATOM feeds, and also offers email subscriptions.1 We'll be tweeting on @tdda0 whenever there are new posts. Twitter is also probably the best way to send feedback, since we haven't plumbed in comments at this time: we'd love to hear what you think.

  1. through MailChimp; thanks, MailChimp!