Test-Driven Data Analysis

Best Practices for Notebook Users

2024-12-17T16:00:00+00:00

In a previous post, I discussed some of the dangers of challenges, dangers and weaknesses of Jupyter Notebooks, JupyterLabs and their ilk. I used The Parables of Anne and Beth as a device to illustrate what I think of as good and bad practices for data science. A reasonable criticism of this was that it did not really offer anything to help people who might wish to continue using computational notebooks, but to work in such a way as to limit the harms identified.

Although it probably rings slightly hollow, my goal is absolutely to improve the quality of data science, data analyis, data engineering, and really all data work, and I very much see the attractions and strengths of Jupyter, despite being critical of certain dark patterns I see around their use.

Here are some suggesting best practices for notebook users that I hope might be helpful an constructive. It's true that if you adopt all of them, I might have succeeded in prising your Notebook from your hands, but if you adopt any of them, as you use notebooks, I think you will be safer and more successful. I'm very much in favour of half a loaf.

Subject to the vaguaries of the web, the checkboxes, online, should be clickable, should you find it useful as an actual checklist, and there's a PDF version available too.

Notebook Best Practices

NBP1 Ensure that your Notebook runs correctly after completion
- Develop the Notebook
- Take a temporary copy of the Notebook
- Clear the Notebook
- Run and confirm the results match the temporary copy
- Clear the new Notebook again
- Commit the new Notebook to version control
- Rerun (so the Notebook contains the results)
- Delete the temporary copy
NBP2 Store the (cleared) Notebook in version control (see NBP1)
NBP3 Parameterize inputs and outputs at the top of the Notebook

e.g. Set and use variables such as INPATH and OUTPATH
Write the most important outputs to file (if not already done)

NBP4 Replace some individual cells or groups of cells with functions

Move them into an importable file, import and use
Prioritize potentially re-usable code and code requiring testing

NBP5 Write some tests

Create a reference (regression) test for the whole process
Create unit tests for the individual functions

NBP6 Consider restructuring/extracting the code as a standalone script
NBP7 Allow the parameters to be set from the command line

Alternatively, read from a configuration file (e.g. .json or .toml)

NBP8 Consider using safer alternatives like Marimo and Quarto

NBP Version 1.0.

A printable PDF copy is available.

Log Graphs and Grokkability

2024-12-12T16:00:00+00:00

In his novel Stranger in a Strange Land, Robert Heinlein¹ introduced the word grok. It is used all the time in the computing sphere, but rarely, as far as I know, outside it. The definition that seems to me most closely to match its usage is:

grok (transitive verb).

To understand profoundly through intuition or empathy.

grok (verb).

To have or to have acquired an intuitive understanding of; to know (something) without having to think (such as knowing the number of objects in a collection without needing to count them: see subitize).
To fully and completely understand something in all its details and intricacies.
To get the meaning of something.

— The American Heritage Dictionary of the English Language, 5th edition.²

Graphs and other visualizations are among the most powerful tools we have for identifying and elucidating patterns in data—for helping us to grok data. A good graph can be extremely dense in information and easy to understand. By the same token, a bad graph can—deliberately or inadvently—be a powerful tool for misleading and spreading confusion and misunderstanding. Data professionals who subscribe to the Hippocratic Oath, “First, do no harm”, should avoid poor graphing practices as a matter of a high priority.

Log Scales

As I was writing a chapter on graphs in a book on TDDA, I happened upon the graph below from Our World in Data (OWID).

Graph from our world in data on Bluesky.

It uses data from the Luxembourg Income Study. and shows 14 base numbers—the post-tax incomes of the 10th percentile and 90th percentile citizens of seven countries (poorer and richer groups, respectively). The most common way of measuring income inequality uses the Gini coefficient,³ which is a single number that is powerful but rather abstract. The OWID graph is much more intuitive, focusing as it does on the ratio of incomes between a richer and poorer group, each defined by its position in the income distribution. The raw post-tax incomes and the derived multiples (ratios) are shown in this table:

The Data Shown in the OWID Graph.⁴

Country	South Africa	Brazil	China	Uraguy	UK	US	Norway
Year	2017	2022	2018	2022	2021	2022	2021
90th percentile ($USD)	2,480	1,650	1,900	2,220	4,100	6,830	5,130
10th percentile ($USD)	110	195	250	395	1,080	1,170	1,670
Multiple	22⨉	8.4⨉	7.8⨉	5.6⨉	3.8⨉	6⨉	3.1⨉

Before discussing weaknesses with the graph, consider a few of its exemplary features. The graph minimizes chart junk⁵ while clearly explaining the findings with various direct annotations. Labels are close to the data, the numbers are easy to read, and different weights and colours (shades) of text are used to emphasize and de-emphasize information. The seven numbers they most wish to focus on are the income ratios for each country, which are shown clearly. The notes at the bottom both specify the source of the data and provide useful guidance about interpreting the numbers. The use of colour is effective, and the colours used appear to have been chosen to work for readers with colour blindness as well as everyone else. I think the graph draws the reader in, and is much more likely to cause someone to stop and study it than the dry table of numbers alone would be, particularly in social media (where I saw this).

Despite these many merits, I think the graph only partially succeeds as a graph. My own experience was that I did get a feel for the information (I believe) the graph was trying to communicate by reading it (which did not take too long—it is, after all, only 21 numbers), but to gain that understanding I did have to read all 21 numbers—the 7 multipliers from the labels, and the 14 incomes by looking across at the y-axis. Even when I had done this, I still did not have the same intuitive, almost visceral understanding that comes from a really effective graph. The immediate reason for this is the log scale: humans simply do not have the same effortless ability to grok information presented using log-scaled lengths as we do with proportionately scaled lengths.

My first instinct was the that log scale was unnecessary: the range of values is not too great to show comfortably on a linear scale. So I plotted the top graph of the pair above, which comfortably displays all the incomes, albeit with the qualification that differences in the incomes of the poorer groups in first four countries are harder to differentiate. (Smaller markers would help here, but I wanted to change as little as possible from the OWID graph, where the marker size is not a problem.) I contend that the redrawn graph makes it easier to compare incomes across the seven different countries visually. Unfortunately, however, it is hopeless for conveying the income ratios, which are surely the quantities OWID most wanted to communicate. In the redrawn graph, a larger separation between the rich and poor groups does not mean that the income ratio is larger, still less is the separation proportional to the ratio: the absolute difference between incomes in the US is more that twice that in South Africa, because incomes are so much higher in the US, so its multiplier of 6⨉ is represented by a much longer line than is the 22⨉ multiplier for South Africa. The log plot cleverly resolves this because addition of logs is equivalent to multiplication of the original quantities. As a result, differences on a log plot are the logs of the ratios, and do rank the ratios correctly.⁶

The cleverness of the presentation chosen by OWID is that it allows a single graph, with a single scale, to show both the relative incomes among countries and the ratios between incomes of the the richer and poorer groups in each country. But the cost of this approach is that not only the income scale itself, but also the multipliers are shown on log scales which—to labour the point—are hard to grok.

The bottom graph in the redrawn figure shows the multipliers themselves on a linear scale. I think this is a significantly better (more grokkable) way to visualize them. There is, however, a final subtlety. In general, ratios of positive quantities can have any positive value. But in this case, the multiplier constructed is the ratio of a larger income (the 90th percentile) and a smaller one (the 10th percentile). Manifestly, this cannot be smaller than 1—the value it would take if everyone in the country had the same post-tax income. It is for this reason that I have drawn the connecting lines on the last plot from 1⨉ to the multiplier, rather than from 0, and have chosen not to present this as a conventional bar graph. One is the effective zero for these particular multipliers.⁷ A country with a perfectly even income distribution would have no distance between the 10th and 90th percentiles—once more emphasizing how clever the device of using a log scale for this data is, even if, as I contend, it was ultimately a poor choice for communcation.

Another way of saying this is that OWID (I believe) was trying to show two different things, lying on naturally different scales, on a single plot. OWID found a clever technical solution that allowed them to do this, but at the cost of grokkability. As I deconstructed it, I realised I needed two plots to show the data in ways that I think are much easier to understand. You, of course, must form you own judgement.

Heinlein, Robert A. (1961). Stranger in a Strange Land. G. P. Putnam’s Sons. ↩
The American Heritage Dictionary of the English Language (2022). 5th ed. Random House Inc. ↩
Hasell, Joe (2023). Measuring inequality: what is the Gini coefficient? In: Our World in Data. https://ourworldindata.org/what-is-the-gini-coefficient. ↩
The data was reconstructed from the OWID graph, so there will be minor deviations from the original Luxembough Income Study data. ↩
The Visual Display of Quantitative Information, Edward R. Tufte, Graphics Press, 1984. ↩
since logarithms are monotonic, increasing functions. ↩
because log 1 = 0 ↩

Jupyter Notebooks Considered Harmful: The Parables of Anne and Beth

2024-11-14T18:30:00+00:00

I have long considered writing a post about the various problems I see with computational notebooks such as Jupyter Notebooks. As part of a book I am writing on TDDA, I created four parables about good and bad development practices for analytical workflows. They were not intended to form this post; but they way they turned out fits the theme quite well.

Situation Report

Anne and Beth are data scientists, working in parallel roles in different organizations. Each was previously tasked with analysing data up to the end of the second quarter of 2024. Their analyses were successful and popular. Even though there had never been any suggestion that the analysis would need to be updated, on the last Friday of October Anne and Beth each receive an urgent request to “rerun the numbers using data up to the end of Q3”. The following parables show four different ways this might play out for our two protagonists from this common initial situation.

Parable #1: The Parable of Parameterization

Beth

Beth locates the Jupyter Notebook she used to run the numbers previously and copies it with a new name ending Q3. She changes the copy to use the new data and tries to run it but discovers that the Notebook does not work if run from start to finish. (During development, Beth jumped about in the Notebook, changing steps and rerunning cells out of order as she worked until the answers looked plausible.)

Beth spends the rest of Friday trying to make the analysis work on the new data, cursing her former self and trying to remember exactly what she did previously.

At 16:30, Beth texts her partner saying she might be a little late home.

Anne

Anne begins by typing

make test

to run the tests she created based on her Q2 analysis. They pass. She then puts the data in the data subdirectory, calling it data-2024Q3, and types

make analysis-2024Q3.pdf

Her Makefile’s pattern rule matches the Q3 data to the target output and runs her parameterized script, which performs the analysis using the new data. It produces the PDF, and issues a message confirming that consistency checks on the outputs passed. After checking the document, Anne issues it.

At 09:30, Anne starts to plan the rest of her Friday.

Computational notebooks, such as Jupyter Notebooks,¹ have taken the data science community by storm, to the point that it is now often assumed that analyses will be performed in a Notebook. Although I almost never use them myself, I do use a web interface (Salvador) to my own Miró that from a distance looks a version of Jupyter from a different solar system. Notebooks are excellent tools for ad hoc analysis, particularly data exploration, and offer clear benefits including the ability to embed graphical output, easy shareability, web approaches to handling wide tables, and facilitation of annotation of analysis in a spirit somewhat akin to literate programming. I do not wish to take away anyone's Notebooks, notwithstanding the title of this post. I do, however, see several key problems with the way Notebooks are used and abused. Briefly, these are:

Lack of Parameterization. I see Notebook users constantly copying Notebooks and editing them to work with new data, instead of writing parameterized code that handles different inputs. Anne's process uses the same program to process the Q2 and Q3 data. Beth's process uses a modified copy, which is significantly less manageable and offers more scope for error (particularly errors of process).
Lack of Automated testing. While it is possible to write tests for Notebooks, and some tools and guides exist e.g. Remedios 2021,² in my experience it is rare for this to be done even by the standards of data science, where testing is less common than I would like it to be.
Out-of-order execution. In Notebooks, individual cells can be executed in any order, and state is maintained between execution steps. Cells may fail to work as intended (or at all) if the state has not been set up correctly before they are run. When this happens, other cells can be executed to patch up the state and then the failing cell can be run again. Not only can critical setup code end up lower down a Notebook than code that uses it, causing a problem if the Notebook is cleared and re-run: the key setup code can be altered or deleted after it has been used to set the state. This is my most fundamental reservation about Notebooks, and it not merely a theoretical concern. I have known many analysts who routinely leave Notebooks in inconsistent states that prevent them from running straight through to produce the results. Notebooks are fragile.
Interaction with Source Control Systems. Notebooks can be stored in source control systems like Git, but some care is needed. Again, in my experience, Notebooks tend not to under version control, with the copy-paste-edit pattern (for whole Notebooks) being more common.

In my view, Notebooks should be treated as prototypes to be converted to parameterized, tested scripts immediately after development. This will often involve converting the code in cells (or groups of cells) into parameterized functions, something else that, Notebooks seem to discourage. This is probably because cells provide a subset of the benefits of a callable function by visually grouping a block of code and allowing it to be executed in isolation. Cells do not, however, provide other key benefits of functions and classes, such as separate scopes, parameters, enhanced reusability, enhanced testability, and abstraction.

Anne's process has four key features that differ from Beth's.

Anne runs her code from a command line using make. If you are not familiar with the make utility, it is well worth learning about, but the critical point here is that Anne's setup allows her to use her process on new data without editing any code: in this case her make command (is intended to) get expanded to python analyse.py 2024Q3 and uses the parameter 2024Q3 both to locate the input data and to name the matching report generated.
Anne also benefits from tests she previously wrote, so has confidence that her code is behaving as expected on known data. This is the essence of what we mean by reference testing. While you might think that that if Anne has not changed anything since she last ran the tests, they are bound to pass (as they do in this parable), this is not necessarily the case.
Anne's code also includes computational checks on the outputs. Of course, such checks can be included in a Notebook just as easily as they can be in scripts. The reason they are not is entirely because I am making one analyst a paragon of good practice and the other a caricature of sloppiness.
Finally, unlike Beth, Anne takes the time to check her outputs before sending them on. Once again, this is because Anne cares about getting correct results, and wants to find any problems herself, not because she does not use a Notebook.

Parable #2: The Parable of Testing

Beth

Beth copies, renames and edits her previous Notebook and is pleased to see it runs without error on the Q3 data. She issues the results and plans the rest of her Friday.

The following week, Beth's inbox is flooded with people saying her results are “obviously wrong”. Beth is surprised since she merely copied the Q2 analysis, updated the input and output file paths, and ran it. She opens her old Q2 Notebook and reruns all cells. She is dismayed to see all the values and graphs in the second half of the Notebook change.

Beth has some remedial work to do.

Anne

Anne runs her tests but one fails. On examining the failing test, Anne realises that a change she made to one her helper libraries means that a transformation she had previous applied in the main analysis is now done by the library, so should be removed from her analysis code.

After making this change, the tests (which were all based on the Q2 data) pass. Anne commits the change to source control before typing

make analysis-2024Q3.pdf

to analyse the new data. After sense checking the results, Anne issues them.

In the first parable, Beth's code would not run from start to finish; this time, it runs but produces different answers from when she ran it before using the Q2 data. This could be because she had failed to clear and re-run the Notebook to generate her final Q2 results, but here I am assuming that her results changed for the same reason as Anne's: they had both updated a helper library that their code used. Whereas Anne's tests detected the fact that her previous results had changed, Beth only discovered this when other people noticed her Q3 results did not look right (though had she checked her results, she might have noticed that something looked wrong.) Anne is in a slightly better position than Beth to diagnose what went wrong, because her “correct” (previous) results are stored as part of her tests. Now that Beth has updated her Notebook, it may be harder for her to recover the old results. Even if she has access to the old and new results, Beth is probably is less good position than Anne because Anne has at least one test highlighting how the result has changed. This should allow her to make faster progress and gain confidence that her fix is correct more easily.

Parable #3: The Parable of Units

Beth

Again, Beth copies, renames and updates her previous Notebook and is happy to see it runs straight through on the Q3 data. She issues the results and looks forward to a low-stress day.

Around 16:00, Beth's phone rings and a stressed executive tells her the answers “can't be right” and need to be fixed quickly. Beth is puzzled. She opens her Q2 Notebook, re-runs it and the output is stable. That, at least, is good.

Beth now compares the Q2 and Q3 datasets and notices that the values in the PurchasePrice column are some three orders of magnitude larger in Q3 than in Q2, as if the data is in different units. She checks with her data supplier to confirm that this is the case, then sends some rebarbative emails, with the subject Garbage In, Garbage Out! Beth grumpily adds a cell to her Q3 notebook dividing the relevant column by 1,000. She then adds _fixed to the Q3 notebook's name to encourage her to copy that one next time. She wonders why everyone else is so lazy and incompetent.

Anne

As usual, Anne first runs her tests, which pass. She then runs the analysis on the Q3 data by issuing the command

make analysis-2024Q3.pdf

The code stops with an error:

Input Data Check failed for: PurchasePrice_kGBP
Max expected: GBP 10.0k
Max found: GBP 7,843.21k

(Anne's code creates a renamed copy of the column after load because she had noticed while analysing Q2, that the prices were in thousands of pounds.)

Anne checks with her data supplier, who confirms a change of units, which will continue going forward. Anne persuades her data provider to change the field name for clarity, and to reissue the data.

Anne adds different code paths based on the supplied column names and adds tests for the new case. Once they pass, and she has received the updated data, Anne commits the change, runs and checks the analysis and issues the results.

This time, Anne is saved by virtue of having added checks to the input data, which Beth clearly did not (though, again, such checks could easily be included in a Notebook). This builds directly on the ideas of other articles in this blog, whether implemented through TDDA-style constraints or more directly as explicit checks on input values in the code.

Anne (being the personification of good practice) also noticed the ambiguity in the PurchasePrice variable and created a renamed copy of it for clarity. Note, however, that her check would have worked if she had not created a renamed variable.

A third difference is that Anne has effected a systematic improvement in her data feed by getting the supplier to rename the field. This reduces the likelihood that the unit will be changed without flagging it, decreases chances of its being misinterpreted, and allows Anne to have two paths through her single script, coping with data in either format safely. By re-sourcing the updated data, Anne also confirms that the promised change has actually been made, and that the new data looks correct.

Finally, Beth now has different copies of her code and has to be careful to copy the right one next time (hence _fixed). Anne's old code only exists in the version control system, and crucially, her new code safely handles both cases.

Parable #4: The Parable of Applicability

Beth

Beth dusts off her Jupyter Notebook and, as usual, copies it with a new name ending Q3. She makes the necessary changes to use the new data but it crashes with the error:

ZeroDivisionError: division by zero

After a few hours tracing through the code, Beth eventually realises that there is no data in Q3 for a category that had been quite numerous in the Q2 data. Her Notebook indexes other categories against the missing category by calculating their ratios. On checking with her data provider, Beth confirms that the data is correct, so adds extra code to the Q3 version of the Notebook to handle the case. She also makes a mental note to try to remember to copy the Q3 notebook in future.

Anne

Anne runs her tests by typing

make test

and enjoys watching the progress of the “green line of goodness”. She then runs the analysis on the Q3 data by typing

make analysis-2024Q3.pdf

but it stops with an error message: There is no data for Reference Store 001. If this is right, you need to choose a different Reference Store. After establishing that the data is indeed correct, Anne updates the code to handle this situation, checks that the existing tests pass, adds a couple regression tests to make sure that it copes not only with the default reference store having no data, but also alternative reference stores. When all tests pass, she runs the analysis in the usual way, checks it, commits her updated code and issues the results.

As in the previous parable, the most important difference between Beth's approach and Anne's is that Beth's fix for their common problem is ad hoc and leads to a further code proliferation³ and its concomitant risks if the analysis is run again. In contrast, Anne's code becomes more general and robust as it handles the new case along with the old and she adds new extra tests (regression tests) to try to ensure that nothing breaks the handling of this case in future. The “green line of goodness” mentioned is the name some testers use for the line of dots (sometimes green) many test frameworks issue each time a test passes.

So there we have it. On a more constructive note, I have been following the progress of Quarto with interest. Quarto is a development of RMarkdown, a different style of computational document popular in the R and Reproducibile Research communities.⁴ To my mind it has fewer of the problems highlighted here. It also supports Python and much of the Python data stack as first-class citizens, and in fact integrates closely with Jupyter, which it uses behind the scenes for many Python-based workflows. I have been using it over the last couple of days, and though it is still distinctly rough around the edges, I think it offers a very promising way forward, with excellent output options that include PDF (via LaTeX), HTML, Word documents and many other formats. It's both an interesting alternative to Notebooks and (perhaps more realistically) a reasonable target for migrating code from Notebook prototypes. I use most of the cells to call functions imported at the start, promoting code re-use and parametization, which avoids another of the pitfalls of Notebooks (in practice) discussed above.

I am using the term Jupyter Notebook to cover both what are now called Jupyter Notebooks “The Classic Notebook Interface”) and JupyterLabs (the “Next-Generation Notebook Interface”). This is both because most people I know continue to call them Jupyter Notebooks, even when using JupyterLab, and because “Notebooks” reads better in the text. I will capitalize Notebook when it refers to a computational notebook, as opposed to a paper notebook. ↩
Remedios 2021, How to Test Jupyter Notebooks with Pytest and Nbmake, 2021-12-14. ↩
analysis_fixed_fixed2_final.ipynb ↩
Reproducible research is very much a sister movement to TDDA—a much larger sister movement. Its goals are similar and it is wholly congruent with all the ideas of TDDA. To the extent that there is divergence, some of it simply arises from separate evolution, and some from the fact that focus of reproducible research is more allowing other people to access your code and data, to run it themselves and verify the outputs, or to write their own analysis to verify your results even more strongly, or to use your code on their data as a different sort of validation. I sometimes call TDDA “reproducible research for solipsists”, because of its greater focus on testing, and helping to discover and eliminate problems even if no second person is involved. Another related area I have recently become aware of is Veridical data science, as developed by Bin Yu and Rececca Barter. The link is to their book of that name. ↩

An Adware Malware Story Featuring Safari, Notification Centre, and Box Plots

2024-09-22T16:00:00+01:00

This is not, primarily, an article about TDDA, but I thought it was worth publishing here anyway. Itʼs a story about a kind of adware/malware incident I had this morning—with digressions about box plots.

Disgression

I was doing some research for a book (on TDDA), looking up information on box plots, also known as box-and-whisker diagrams. When I first came across box plots, I assumed the “box” in the name was a reference to the literal “box” part of a traditional box plot. If you are not familiar with box plots, they typically look like the ones shown in Wikipedia:¹:

There are variations, but typically the central line represents the median, the “box” delineates the interquartile range, the whiskers extend to either the minimum and maximum or, sometimes, other percentiles, such as 1 and 99. When the minimum and maximum are not used, outliers beyond those extents can be shown as individual points, as seen here for experiments 1 and 3.

At some point after learning about box plots, I became aware of the statistician George Box—he of “All models are wrong, but some models are useful”² fame, and ended up believing that box plots had in fact been invented by him (and should, therefore, arguably be called “Box plots” rather than “box plots”). Whether someone misinformed me or my brain simply put 2 and 2 together to make about 15, Tufte⁵ (who advocates his own “reduced box plot”, in line with his principle of maximizing data ink and minimizing chart junk) states definitively that the box plot was in fact a refinement by John Tukey³ of Mary Eleanor Spearʼs “range bars”⁴. So I was wrong.

Back to the malware

Anyway, back to the malware. I was clicking about on image search results for searchs like box plot "George Box" and hit a site that gave one of the ubiquitous “Are you a human?” prompts that sometimes, but not always, act as a gateway to solving CAPTCHAs to train AI models. But this one didnʼt seem to work. I closed the tab and moved on, but soon after started getting highly suspicious looking pop-up notifications like these:

These are comically obviously not legitimate warnings from anything that I would knowingly allow on a computer, which made me less alarmed that I might otherwise have been. But clearly something had happened as a result of clicking an image search result and an “I am not a robot” dialogue.

I wonʼt bore you with a blow-by-blow account of what I did, but the key points are that

Killing that tab did not stop the notifications.
Nor stopping and restarting Safari but bringing back all the old windows.
Nor did stopping and restarting Safari without beinging back any tabs or windows.
Nor did deleting todayʼs history from Safari.

So I did some web searches, almost all of the results of which advocated downloading sketchy “anti-malware” products to clean my Mac, which I was never going to do. Eventually, I came across the suggestion that it might be a site that had requested and been given permission to put notifications in Notification Centre. I think I was only half-aware that this was a possible behaviour, but it turns out that (on MacOS Ventura 13.6.9, with Safari 17.6) Safari → Settings → Websites has a Notifications section on the left that looks like this:

I must have been aware of this at various points, because I had set the four websites at the bottom of the list to Deny, but I had not noticed the Allow websites to ask for permission to send notifications checkbox, which was enabled (but is now disabled). The top one—looks suspicious, dunnit?—was set to Allow when I went in. I have a strong suspicion that the site I tricked me into giving it permission by getting me to click something that did not appear to be asking for such permission. I suspect it hides its URL by using a newline or a lot of whitespace, which is why it does not show up in the screenshot above.

Setting that top (blank-looking) site to Deny and (as a belt-and-braces and preventative measure) unchecking the checkbox so that sites are not even allowed to ask for permission to put popups in Notification Centre had the desired effect of making popups stop. I believe this consitutes a full fix and that no data was being exfiltrated from the Mac, despite the malicious notification. I will probably also Remove at least that top site (with the Remove button in the future) but will leave it there for now in case Apple (or anyone else) can tell me how to find out what site it comes from.

I also found (but cannot now find again) an option to reset the notications from those sites. This was the extremely confusing dialogue for the site in question.

I think whatʼs going on here is that some text that the site is using to identify itself to Safari when asking for permission consists of the following text, incuding the new lines:

ask you

Confirm that you're not a robot, you need to click Allow

This makes reading the dialogue quite hard and confusing. Looking more carefully at Notification Centre, I also see this:

I don't quite understand whether this is an image forming a notification, or an image included in some other notification, but offset, or something else. Whatever it is, it consists of (or includes) white text saying

ask you

Confirm that you're not a
robot, you need to click Allow

with a little bit of light grey background around the letters.

I donʼt entirely understand why the site would have used barely readable white text on a light grey background like this, but I presume somehow this text was involved in getting me to click the “I am not a robot” dialogue (which I believe to be the only click I performed on the site).

Anyway, the long and the short of it is that if anyone else runs into this, my recommendations (which do not come from a security expert, so use your own judgement) are:

Donʼt download a random binary from the internet to remove spyware.
Try to find the Safari Preference for Notifications under Websites and see if you have a sketchy-looking entry like mine. If so, set that to Deny
Probably also remove that site with the Remove buttom
Consider turning off the ability for sites to request permission to put notifications in Notification Centre if this is not something you want, or that no site you care about needs.

Taken from Wikipedia entry on box plots, own work by Wikipediaʼs User:Schutz (public domain). ↩
George Box, (1919-2013): a wit, a kind man and a statistician., Obituary by Julian Champkin, 4 April 2013 https://significancemagazine.com/george-box-1919-2013-a-wit-a-kind-man-and-a-statistician-2/ ↩
Exploratory data analysis, John Tukey, Reading/Addison-Wesley, 1977. ↩
Charting Statistics, Mary Eleanor Spear, McGraw Hill, 1952. ↩
The Visual Display of Quantitative Information, Edward R. Tufte, Graphics Press, 1984. ↩

PyData London 2024 TDDA Tutorial

2024-07-21T16:00:00+01:00

PyData London had its tenth conference in 2024, and it was excellent.

I gave a tutorial on TDDA, and the video is available on YouTube and below:

The slides are also available here.

Learning the Hard Way: Regression to the Mean

2024-06-20T20:00:00+01:00

I was at the tenth PyData London Conference last weekend, which was excellent, as always. One of the keynote speakers was Rebecca Bilbro who gave a rather brilliant (and cleverly titled) talk called Mistakes Were Made: Data Science 10 Years In.

The title is, of course, a reference to the tendency many of us have to be more willing to admit that mistakes were made, than to say "I made mistakes". So I thought I'd share a mistake I made fairly early in my data science career, probably around 1996 or 1997. This is not one of those interview-style "I-sometimes-worry-I'm-a-bit-too-much-of-a-perfectionist"-style admissions that we have all heard; this one was bad.

My company at the time, Quadstone, was under contract to analyse a large retailer's customer base for relationship marketing, using loyalty-card data. We had done all sorts of work in the area with this retailer, and one day the relationship manager we were working with decided that it would be good to incentivise more spending among the retailer's less active, lower spending customers. This is fairly standard. The idea was to set a reasonably high, but achievable, target spend level for each of these customers over a period of a few weeks. Those customers who hit their individual target would receive a large number of loyalty points worth a reasonable amount of money.

We had been tracking spend carefully, placing customers on a behavioural segmentation, and had enough data that we felt relatively confident we knew what be good indivualized stretch goals for customers (wrongly, as events would prove). We set the targets at levels such that retailer should break even if people just met them (foregoing profit, but not losing much money), and estimated how many people we thought would hit the target if the campaign did not have much effect, and then estimated volumes and costs for various higher levels of campaign success.

I'm sure many of you can already see how this will go, and even more of you will have been attuned to the problem by the title of this post. We, however, we had not seen the title of this post, and although I knew about the phenomenon of regression to the mean, I had not really internalized it. I didn't know it in my bones. I had not been bitten by it. I did not see the trap we were walking into.

As Confucious apparently did not say:

I hear and I forget. I see and I remember. I do and I understand.

— probably not Confucious; possibly Xunzi.

Well, I certainly now understand.

On the positive side, our treated group increased its level of spend by a decent amount, and a large number of the group earned many extra loyalty points. I don't believe we had developed uplift modelling at the time of this work, we were very aware that we needed a randomized control group in order to understand the behaviour change we had driven, and we had kept one. To our dismay, the level of spend in the control group, though lower than that in the treated group, also increased quite considerably. In fact, it increased enough that the return on investment for the activity was negative, rather than positive. It was at this point (just before admitting to the client what had happened, and negotiating with them about exactly who should shoulder the loss¹) a little voice in my head started saying regression to the mean, regression to the mean, regression to the mean, almost like a more analytical version of Long John Silver's parrot.

So (for those of you who don't know), what is regression to the mean? And why did it occur in this case? And why should we, in fact, have predicted that?

Allow me to lead you through the gory details.

Background: Control Groups

We all know that marketers can't honestly claim the credit for all the sales from people included in a direct marketing campaign, because (in almost all circumstances) some of them would have bought anyway. As with randomized control trials in medicine, in order to understand the true effect of our campaign, we need to divide our target population, uniformly at random,² into a treatment group, who receive the marketing treatment in question, and a control group, who remain untreated. The two groups do not need to be the same size, but both need to be big enough to allow us to measure the outcome accurately, and indeed to measure the difference between the behaviour of the two groups. This is slightly problematical, because we don't know the effect size before we take action. Happily, however, if the effect is too small to measure, it is pretty much guaranteed to be uninteresting and not to achieve a meaningful return on investment, so we can size the two groups by calculating the minimum effect we need to be able to detect in order to achieve a sufficiently positive ROI.

The effect size is the difference between the outcome in the treated group and the control group—usually a difference in response rate, for a binary outcome, or a difference in a continuous variable such as revenue. Things become more interesting when there are negative effects in play, which is sometimes the case with intrusive marketing or when retention activity is being undertaken. There can be negative effects for a subpopulation or, in the worst cases, for the population as a whole. When these happen, a company is literally spending money to drive customers away, which is usually undesirable.

Let's suppose, for simplicity, that we have selected an ideal target population of 2 million and we mail half of them (chosen on the toss of a fair coin) and keep the other 1 million as controls. If we then send a motivational mailing to the 1 million encouraging them to spend more, with or without an incentive to do so, we can measure their average weekly spend in a pre-period (say six weeks) and their average weekly send in a post-period, which for simplicity we will also take to be six weeks. In this case, we will assume that there was no financial incentive: it was simply a motivation mail along the lines of "we're really great: come and give us more of your money". (Good creatives would craft the message more attractively than this.) Let's suppose we do this and that the results for the treated group of one million are as follows:

Before	After
£50	£60

This is not enough information to say whether the campaign worked, and the reason has nothing to with statistical errors: at the scale of 1 million, you can guarantee the errors will be insignificant. It's also not primarily because we are measuring revenue rather than profit, nor because we haven't taken into account the cost of action (though those are things we should do). We can see that our 1 million customers spent, on average £10 per week more in the post-period than in the pre-period (a cool £60m in increased revenue over six weeks), but we don't know about causality: did our marketing campaign cause the effect?

To answer this, we need to look at what happened in the control group.

	Before	After
Mailed (Treated)	£50	£60
Unmailed (Control)	£50	£55

We immediately see that the spend in the pre-period was the same in both groups, as must be the case for a proper treatment-control split. We also seee that the spend in the control group rose to £55.

We now have enough evidence to say for with very high degree of confidence that the treatment caused a £5 per week increase in spend, but that some other factor—perhaps seasonality, TV ads, or a mis-step by our competitors—caused the other £5 increase per week across the larger population of 2 million customers. I should emphasize that this is a valid conclusion regardless of how the population of 2 million was chosen, as long as the treatment-control split was unbiased.

Behavioural Segmentation

Now let us take a similar treatment group drawn uniformly, at random, from a larger population of 10 million, We segment the treatment population by average weekly spend in the pre-period and plit the increase or decrease in spend, between the pre- and post-periods in each segment for our treatment group. The graph below shows a possible outcome.

For people who are not steeped in regression to mean, this graph may appear somewhat alarming. Depending on the distribution of the population, this might well represent an overall increase in spending (since probably more of the people are on the left of the graph, where the change in spend is positive). But I can almost guarantee that any marketing director would declare this to be disaster, saying (possibly more colourful language) "Look at the damage you have wreaked on my best customers!"

But would this be a reasonable reaction? Would it, in fact, be accurate? At this point, we have no idea whether the campaign caused the higher-spending customers' spend to decline, or whether something else did. To assess that, we need once more to look at the same information for the control group (9 million people, in this case). That's shown below.

What we clearly see is that in every segment the change in spend was either more positive or less negative in the treated group than in the control group. So the campaign did have a positive effect in every segment. Not so embarrassing. (If only this had been our case!)

Regression to the mean

To understand more clearly what's going on here, it's helpful to look at the same data but focus only the control group.

Remember, this is the control group: we have not done anything to this population. This is a classic case of regression to the mean. I would confidently predict that for almost any customer base, if we allocate people to segments on the basis of a behavioural characteristic over some period, then measure that same characteristic for the same people, using the same segment allocations, at a later period, we would see a pattern like this: the segments with low rates of the behaviour in question in the first place would increase that behaviour (at least, relative to the population as a whole), and the people in the segments that exhibited the behaviour more would fall back, on average.

Why?

Mixing Effects

When you segment a population on the basis of a behaviour, many of the people are captured exhibiting their typical behaviours. But inevitably, you capture some of the people exhibiting behaviour that is for them atypical.

Consider the first bar in the example above—people who spent nothing in the six weeks before the mailing. Ignoring the possibility of people returning goods, it is impossible for the average spend of this group to decline. In fact, if even a single person from this group buys something in the post-campaign period, the average spend for that segment will increase. In terms of the mixing I am talking about, some of the people will have completely lapsed, and will never spend again, while others were in an atypically low spending period for them: maybe they were on holiday, or trying out a competitor or didn't use their loyalty card and so their spending was not tracked. The thing that's special about this first group is that they literally cannot be in an atypically high spending period when they were assigned to segments, because they weren't spending anything.

It's less clear-cut, but a similar argument pertains to the group on the far right of the graph. Some of those are people who routinely spend over £90 a week at this retailer. But others will have had atypically high spend when we assigned them to segments: maybe they had a huge party and bought lots of alcohol for it, or maybe they shopped for someone else over that period. With the highest-spending group, there will probably be a small number of people whose spend was atypically low during the period we assigned them to segments, but there are likely to be far more people for whom this spend was atypically high at the right side of the distribution. So in this case, we can see it's likely that the average spend of these higher-spending segments will decline (relative to the population as a whole) if we measure them at a later time period.³

For people in the middle of the distribution, the story is similar but more balanced. Some people will have their typical spend where we measured it, and there will be others whom we captured at atypically high or atypically low spending periods, but those will tend to cancel out.

These mixing effects give the best explanation I know of the phenomenon of regression to the mean. It is always something to look for when you assign people to segments based a behaviour and then look for changes in the people in those segments a later time.

So how did we lose so much money?

The reason our campaign worked out so poorly was that we did not take into account regression to the mean when we set the targets, because we didn't think of it. Because we targeted more people with below-median spend than above-median spend, regression to the mean meant that although spend increased quite strongly among our treated customers, it also increased quite strongly for the control group in each segment. In that regard, the uplifts were less similar across the spend segments that I have shown here; something, in fact, that I now know to be characteristic of retail unike in many other areas. The most active shoppers are often also the most responsive to marketing campaigns.

The campaign produced an uplift in all segments, but much more of the increase in spend than we expected was due to regression to the mean in the population we targeted, with the result that value of the loyalty points given out significantly exceeded the incremental profit contribution from the people awarded those points.

This was a tough one. But at least I will remember it forever.

Without getting too specific, the loss was a six-figure sterling sum, back when that was an more significant amount of money than it is today. It was not really a material amount for the retailer, which had a significant fraction of the UK population as regular customers; but it was a highly material amount for Quadstone: more, in fact, than the four founders had invested in the company, the probably less than our entire first-round funding. And retailers don't get to be big and dominant by treating six-figure losses with equinimity. ↩
Uniformly at random means that, conceptually, a coin is tossed or a die is rolled to determine whether someone is allocated to the control group or the treated group. The coin or die does not need to be fair: it's fine to allocate all the 1's on the die to control and all the 2-6's to treated, or to use a weighted coin, as long as the procedure does use any other information to determine the allocation. For example, choosing to put two thirds of the men in control (chosen randomly) and only one third of the women in control is no good, because now, if there's a difference, it is hard to easily separate out the effect of the treatment from the effect of sex. (If the volume suffices, you could assess the uplift independently for men and women, in this particular case, but that quickly gets complicated, and there is more than enough scope for errors without such complications.) ↩
One possible confusion here is that I'm describing a static, rather than a dynamic, segmentation here: people are allocated to a segment on the basis of the spend in the pre-period and remain in that segment when assessed in the post-period. If we reassigned people on the basis of their later behaviour, we would not see this effect if the spend distribution were static. ↩

Name Styles

2024-03-04T16:00:00+00:00

This is just a bit of fun, but I've always been interested in the different kinds of names allowed, encouraged, and used in different areas of computing and data.

A few years ago, I tweeted some well-known naming styles and a collection of lesser-known naming styles. I was playing about with the same idea while thinking about metadata standards today and came up with this. Just as I often think one of the boxes on the uniqitous 2x2 "Boston-Box"-style matrices makes no sense, I think some of the boxes on the evil-good-lawful-chaotic breakdown (which I gather comes from Dungeons and Dragons make little sense, so forgive me if some of this looks slightly forced. But I think it's fun.

TOMLParams: TOML-based parameter files made better

2023-07-16T16:00:00+01:00

TOMLParams is a new open-source library that helps Python developers to externalize parameters in TOML files. This post will explain why storing parameters in non-code files is beneficial (including for reproducibility), why TOML was chosen, and some of the useful features of the library, which include structured sets of parameters using TOML tables, hierarchical inclusion with overriding, default values, parameter (key) checking, optional type checking, and features to help use across programs, including built-in support for setting parameters using environment variables.

The Benefits of Externalizing Parameters

Almost all software can do more than one thing, and has various parameters that are used to control exactly what it does. Some of these parameters are set once and never changed (typically configuration parameters, that each user or installation chooses), while others may be changed more often, perhaps from run to run. Command-line tools often accept some parameters on the command line itself, most obviously input and output files and core parameters such as search terms for search commands. On Unix and Linux systems, it's also common to use command line "switches" (also called flags or options) to refine behaviour. So for example, the Unix/Linux grep tool might be used in any of the following ways:

grep time            # find all lines including 'time' on stdin
grep time p*.txt     # ... on .txt files starting with 'p'
grep -i time p*.txt  # ... ignoring capitalization (-i switch)

All of time, p*.txt and -i are examples of command-line parameters.

Many tools also use configuration files to control the behaviour of the software. On Unix and Linux, these are typically stored in the user's home directory, often in 'dot' files, such as ~/.bashrc for the Bash shell, ~/.emacs for the Emacs editor and ~/.zshrc for the Z Shell. These files are sometimes in propriery formats, but increasly often are in "standard" formats such as .ini (MS-DOS initialization files), JSON (JavaScript Object Notation), YAML (YAML Ain't Markup Language), or TOML (Tom's Obvious Minimal Language).

When writing code, it's often tempting just to set parameters in code.

The quickest, dirtiest practice is just to have parameter values "hard-wired" as literals in code where they are used, possibly with the same values being repeated in different places. This is generally frowned upon for a few reasons:
- Hard-wired parameters are hard to find, and therefore hard to update and inspect;
- There may be repetition, violating the don't repeat yourself (DRY) principle, and leading to possible inconsistency (though see also the counterveiling rule of three (ROT) and the write everything in threes (WET) principle.
- If you want to change parameters, you have to go to many places;
- There may be no name associated with the parameter, making it hard to know what it means;
So implementing an energy calculation using Einstein's formula E = mc², we would probably prefer either:

def energy(mass):
    """Return energy in joules from mass in kilograms"""
    return mass * constants.speed_of_light * constants.speed_of_light

def energy(m):
    """Return energy in joules from mass m in kilograms"""
    c = constants.c  # the speed of light
    return m * c ** 2

(for a physicist) to

def energy(m):
    return m * 3e8 * 3e8

or (worse!)

def energy(m):
    return m * 9e16

For something like the speed of light (which is after all, a universal constant) in code is probably appropriate, so we might have a file called constants.py containing something like:

# constants.py: physical constants

speed_of_light = 299_792_458   # metres per second, in vacuuo. (c. 300 000 km/s)
h_cross = 1.054_571_817e-34    # reduced Planck constant in joule seconds (= h/2π)

For parameters that change in different situations, or for different users, it's usually much better practice to store the parameters in a separate file, ideally in a common format for which is good library support. Some of the advantages of this include:
- Changing parameters then does not require changing code. Code may be unavilable to the user—e.g. write locked, compiled or running on a remote server through an API. Moreover, editing code often requires more expertise and confidence than editing a parameter file.
- The same code can be run multiple times using different parameters at the same time; this can be more challenging if the parameters are "baked into" the code.
- It becomes easy to maintain multiple sets of parameters for different situations, allowing a user simply to choose a named set on each occasion (usually through the parameter file's name). Of course, the code itself can have different named sets of parameters, but this is less common, and usually less transparent to the user, and harder to maintain.

Why TOML?

There is nothing particularly special about TOML, but it is well suited as a format for parameter files, with a sane and pragmatic set of features, good library support in most modern programming languages and rapidly increasing adoption.

Here's a simple example of a TOML file:

# A TOML file (This is a comment)

start_date = 2024-01-01   # This will produce a datetime.date in Python
run_days = 366            # an integer (int)
tolerance = 0.0001        # a floating-point vlaue (float)
locale = "en_GB"          # a string value
title = "2024"            # also a string valuue
critical_event_time = 2024-07-31T03:22:22+03:00  # a datetime.datetime
fermi_approximate = true  # a boolean value (bool)


[logging]   # A TOML "table", which acts like a section, or group

format = ".csv"           # Yet another string
events = [                # An array (list)
    "financial",
    "telecoms",
]
title = "2024 events"     # The same key as before, but this one is
                          # in the logging table, so does not conflict
                          # with the previous title

Strengths of TOML as a parameter/configuration format

Ease of reading, writing and editing ("obviousness").
Lack of surprises: there are few if any weirdnesses in TOML (which is less true of the other main competing formats).
Good coverage of the main expected types: TOML supports (and differentiates between) booleans, integers, floating point values, strings, ISO8601-formatted dates and timestamps (with and without timezones, and mapping to datetime.date and datetime.datetime in Python), arrays (lists), key-value pairs (dictionaries), and hierarchical sections.
It supports comments (which begin with a hash mark #)
Support in most modern langauges, including C#, C++, Clojure, Dart, Elixir, Erlang, Go, Haskell, Java, Javascript, Lua, Objective-C, Perl, PHP, Python, Ruby, Rust, Swift, Scala
Flexible quoting with support for multiline strings, and no use of bare (unquoted) strings (avoiding ambiguity)
Well-defined semantics without becoming fussy and awkward to edit (e.g., being unfussy about trailing commas in arrays).
TOML is specifically designed to be a configuration file format. Quoting the front page of the website

TOML aims to be a minimal configuration file format that: is easy to read due to obvious semantics maps unambiguously to a hash table is easy to parse into data structures in a wide variety of languages"

In the context of Python, TOML is already being adopted widely, most notably in the increasingly ubiquitous pyproject.toml files. There are good libraries available for reading (tomli) and writing (tomli_w) TOML files, although these are not part of the standard library. (It appears that Python 3.11 does include tomllib, for reading TOML files, in the standard library.)

Weaknesses of TOML as a parameter/configuration format

We don't think TOML has any major weaknesses for this purpose, but points that might count against it for some include:

TOML does not have a null (NULL/nil/None/undefined) value. (TOMLParams could address this, but has no current plans to do so.)
Hierarchical sections ('tables') in TOML are not nested. So if you want So if you want a section/table called behaviours and subsections/subtables called personal and business, in TOML this might be represented by something like the excerpt below (possibly with a [behaviours] table with its own parameters as well).

Some people, however, really don't like TOML.

[behaviours.personal]

frequency = 3

[behaviours.business]

frequency = 3

Why not JSON, YAML, .ini...

We chose TOML because we think, overall, it is better than each of the obvious competing formats, but the purpose of this post isn't to do down those formats, which all have their established places. But we will comment briefly on the most obvious alternatives.

JSON, while good as a transfer format and very popular in web services, is tricky for humans to write correctly, even with editor support, because it requires correct nesting and refuses to accept trailing commas in lists and dictionaries. The lack of support for dates and timestamps is also a frequent source of frustration, with quoted strings typically being used instead, with all the concomitant problems of that approach.

At first glance, YAML appears more suitable as a configuration/parameter-file format, but is the opposite of obvious and often produces unexpected results (in practice). Sources of frustration include no requirement to quote strings, "magic values" like yes (which maps to true) and no which maps to false (much to the annoyance of Norwegians and anyone using the NO country code), inadvent coercison of numbers with leading zeroes to octal (in YAML 1.1), whitespace sensitivity, and issues around multi-line strings.¹ ² ³ ⁴

.ini files look a lot like TOML (I would guess they a major inspiration for it) but are much simpler and less rich, have less well-defined syntax, have fewer types and don't require quoting of strings, leading to ambiguity.

What are the Key Features of TOMLParams?

Simple externalization of parameters in one or more TOML files

You write your parameters in a TOML file, pass the name of the TOML file an instance of the TOMLParams class and it reads the parameters from the file and makes them available as attributes on the object, but also makes them available using dictionary-style look-up, i.e. if you TOMLParams instance is p and you have a parameter startdate you can access it as p.startdate or p['startdate'] (and, more pertinently, also as p[k] if k is set to 'startdate').

If you use tables, you can "dot" you way throught to the parameters (p.behaviours.personal.frequency) or use repeated dictionary lookups (p['behaviours']['personal']['frequency']).

Loading, saving, default values, parameter name checking and optional parameter type checking

The parameters that exist are defined by default value with TOMLParams. You can either store you defaults in a TOML file (e.g. defaults.toml) or pass them to the TOMParams initializer as a dictionary.

If you choose a different TOML file, all the parameter values are first set to their default values, and then any parameters set in the file you specify override those. New parameters (i.e. any not listed in defaults) raise an error.

If you wish turn on type checking, the library will check that the all parameter values provided match those of the defaults, and you choose whether these cases result in a warning (which doesn't raise an exception) or an error (which does).

Hierarchical file inclusion with overriding

Perhaps the most powerful feature of TOMLParams is the ability for one parameters file to 'include' one or more others. If you use TOMLParams specifying the parameters file name as base, and the first line of base.toml is

include = 'uk'

then parameters from uk.toml will be read and processed before those from base.toml. So all parameters will first be set to default values, then anything in uk.toml will override those values, and finally any values in base.toml will override those.

The include statement can also be a list

include = ['uk', 'metric', '2023']

Inclusions are processed left to right (i.e. uk.toml is processed, before metric.toml, followed by 2023.toml), followed by the parameters in the including file itself. So in this case, if defaults are in defaults.toml and the TOML file specified is base.toml, the full sequence is

defaults.toml
uk.toml
metric.toml
2023.toml
base.toml

Of course, the included files can themselves include others, but the library keeps track of this and each file is only processed once (the first time it is encountered), which prevents infinite recursion and repeated setting.

This makes it very easy to maintain different kinds and groups of parameters in different files, and to create a variation of a set of parameters simply by making the first line include whatever TOML file specifies the parameters you want to start from, and then override the specific parameter or parameters you want to be different in your new file.

** Support for writing consolidated parameters as TOML after hierarchical inclusion and resolution**

The library can also write the final parameters used out as a single, consolidated TOML file, which is useful when hierarchical inclusion and overriding are used, and keeps a record of the final values of all parameters. This helps with reproducibility and logging.

Support for using environment variables to select parameter set (as well as API)

You can choose how to specify the name of the parameters file to be used, and the name to which it should default. If you create a TOMLParams instance with:

params = TOMLParams(
    defaults='defaults',
    name='newparams',
    standard_params_dir='.'
)

default parameters will be read from ./defaults.toml and then ./newparams.toml will be processed, overriding default values.

If you want to specify the name of the parameters file to use on the command line, the usual pattern would be something like:

import sys
from tomlparams import TOMLParams

class Simulate:
    def __init__(params, ...)


if __name__ == '__main__:
    name = sys.argv[1] if len(sys.argv) > 1 else 'base'
    params = TOMLParams('defaults.toml', name)
    sim = Simulate(params, ...)

Sometimes, however, it's convenient to use an environment variable to set the name of the parameters file, particularly if you want to use the same parameters in multiple programs, or run from a shell script or a Makefile. You can specify an environment variable to use for this and TOMLParams will inspect that environment variable if no name is passed. If you choose SIMPARAMS for this and say

params = TOMLParams('defaults.toml', env_var='SIMPARAMS')

it will look for a name in SIMPARAMS in the environment which you can set with

SIMPARAMS="foo" python pythonfile.py

export SIMPARAMS="foo"
python pythonfile.py

If it's not set, it will use base.toml as the file name, or something else you choose with the base_params_stem argument to TOMLParams.

Check it out

You can install TOMLParams from PyPI in the usual way, e.g.

python -m pip -U tomlparams

The source is available from Github (under an MIT license), at github.com/smartdatafoundry/tomlparams

There's a README.md and documentation is available on ReAd the Docs at tomlparams.readthedocs.io.

You can get help within Python on the TOMLParams class with

>>> import tomlparams
>>> help(tomlparams)

After installing tomlparams, you will find you have a tomlparams command, which you can use to copy example code from the README

$ tomlparams examples
Examples copied to ./tomlparams_examples.

You can also get help from with tomlparams help:

$ tomlparams help
TOMLParams

USAGE:
    tomlparams help      — show this message
    tomlparams version   — report version number
    tomlparams examples  — copy the examples to ./tomlparams_examples
    tomlparams test      — run the tomlparams tests

Documentation: https://tomlparams.readthedocs.io/
Source code:   https://github.com/smartdatafoundry.com/tomlparams
Website:       https://tomlparams.com

Installation:

    python -m pip install -U tomlparams

TDDA on the Coding for Thought Podcast

2023-07-11T16:00:00+01:00

I had the pleasure of discussing TDDA with Peter Schmidt on his Coding for Thought podcast.

I think it came out really well, so this might be a nice way for people to learn about the ideas and motivations for the ideas and the library, which Simon Brown, Sam Rhynas and I developed over some years at Stochastic Solutions.

The podcast should be available by search in any podcast player for 'Code for Thought'.

Direct links are:

Overcast Logged-in iCloud Users: Self-Selection Bias and Customer Stickiness

2023-01-08T16:00:00+00:00

On Episode 258 of Marco Arment and “Underscore” David Smith’s podcast Under the Radar, and then on Episode 516 of Marco & co’s Accidental Tech Podcast, Marco describes the fact that his data suggests that about 12% of his users don’t have logged-in iCloud accounts with iCloud Drive enabled, which was a significant obstacle to moving his sync system for Overcast to use Apple’s CloudKit, which requires both.

This was a surprise to Marco, who had expected that the figure would be closer to 1%, and led Casey Liss to worry aloud about producing an app that depended on CloudKit.

Marco, entirely correctly, suggested that his user base may be non-representative, and pointed out that

Many people listen to podcasts at work and when commuting, so may be more that usually likely to use work-issued phones; such devices are often locked out of use of iCloud in general, and iCloud Drive, most specifically.
His users are almost certainly biased towards geeks.

It may not be immediately obvious, but there is also a strong “statistical” explanation for why Marco is almost certainly right, and Casey’s fears are likely to be somewhere between exaggerated and unfounded.

TL;DR: Self-Selection Bias and Customer Stickiness

Overcast is unusual (possibly unique) in offering a non-CloudKit-based sync system. Users who need non-CloudKit-based podcast syncing have a limited choice of options, possibly Hobson’s Choice. So it is extremely likely that Overcast has a disproportionately high number of users who can’t/don’t use iCloud Drive. Interestingly these people might also be exceptionally loyal, because they (perhaps) have nowhere else to go.

The Long Version

Let’s do some Fermi Estimation.

Let’s suppose Apple has 1 billion iOS users.
Let’s suppose 10% of them use a podcast player. That’s 100 million people.
Let’s suppose (consistent with Marco’s assumption) that 1% of those don’t have a logged-in iCloud account with iCloud Drive (forthwith to be refered to as “iCloudless”). That’s 1 million people.
Let’s suppose Overcast has 1% market share (1 million active users).

Understandably, Marco doesn’t release user stats per se, but did say:

“So, to give you some idea of what I mean by ’hardly anybody uses it’, I’m looking at a few hundred people who use the website, and that is not a large portion of the user base. And this is ... per day. ... It’s under 1,000 people. ... snd that’s ... well under 1% ... [I]t’s a very small portion of the user base.”

— Marco Arment, Accidential Tech Podcast #516, One of My Fits of Outrage, 3rd January 2023, from 31:16 (listen here).

That sounds consist with a million users, and certainly says the number significantly exceeds 100,000. Further, BuzzSprout’s Global Player stats for December 2022 put Overcast’s Market Share at 1%,¹ with that number pegged at 1,134,026² (users, presumably).

Based on these estimates, there are the same number of iCloudless podcast listeners—one million—as there are Overcast users, and their only choices are * to use Overcast, * to find another podcast player that offers non-iCloud-Drive-based sync (if there is one), or, * to forego sync. This is the self-selection bias. Just as restaurants with good vegan food probably have a disportionately high number of vegan customers (and much higher than ones that don't cater for vegans), and buildings with proper disabled access probably have disportionately high numbers of disabled customers (and much higher than those that don't offer disabled access), a podcast player that supports iCloudless sync will almost inevitably have a disproportionately high number of iCloudless users. In principle, if the estimates are reasonable, Marco could see up to 100% of his users in this iCloudless category without running out of iCloudless users (though this would plainly be ridiculous).

If these estimates are the right order of magnitude, 12% seems like a very plausible figure for what Marco would see in his stats. It means only 12% of people who might benefit from the sync service are using Overcast, but that’s an order of magnitude more than its estimated market share, which is pretty good. It also means that 88% of podcast listeners who might benefit from iCloudless sync are not currently using Overcast.

As I said, I don’t know whether any other podcast players do have an non-iCloud sync service, but either way, it also suggests that now that Marco has put his plan to discontinue it on ice, it might not be a terrible idea for him to lean into it. Maybe his marketing should specifically try to target users who want to sync podcasts across devices (and the web!) but are “iCloudless". After all, it’s very plausible that there are a million of them out there; and it would be really hard for Marco’s competitors to respond. As noted above, these also might be exceptionally loyal/sticky customers, because there may be nowhere else for them to go. It’s one of Marco's stronger competitive moats.

amazingly, given that I made the estimate before trying to look up any stats, and as you can see, my numbers really are Fermi Estimates. ↩
even more amazingly, this number of users is also within 2% of my Fermi Estimate. It’s a just a coincidence, but a very happy one! ↩

Gentest Talk at 2022 Toronto Workshop on Reproducibility

2022-02-25T16:00:00+00:00

We released version 2.0 of the Python TDDA library this week. The radical new feature of the 2.0 release is Gentest, a command-line tool for automatically generating tests for more-or-less any code that you can run from a command line.

Gentest was introduced at the 2022 Toronto Workshop on Reproducibility yesterday (24th February), where demonstrations included using it to write tests for three increasing complex R programs. This was to emphasize that Gentest is useful for much more than just testing Python code. Our (only slightly facetious) strapline is

Gentest writes tests, so you don't have to.™

Slides from the talk are available here:

Gentest: Automatic Test Generation for Data Science (SLIDES)

And here's the video:

We'll be posting more detail about Gentest here over the coming weeks.

Another major upgrade in the 2.0 TDDA release is the documentation. We've made much more effort to separate out the command-line uses of the TDDA library

constraint generation
data verification
data validation
inference of regular expressions, with Rexpy
(now) automatic test generation, with Gentest

from the API documentation, which is only really relevant to Python users.

The documentation is available at Read The Docs.

Unix & Linux Survival Guide for Data Science etc.

2022-02-21T21:00:00+00:00

PDF Version (A4)

PDF Version (Letter)

One Tiny Bug Fix etc.

2022-02-16T16:00:00+00:00

Why Code Rusts

2022-02-07T16:00:00+00:00

or Why Tests Spontanously Fail

You might think that if you write a program, and don't change anything, then come back a day later (or a decade later) and run it with the same inputs, it would produce the same output. At their core, reference tests exist because this isn't true, and it's useful to find out if code you wrote in the past no longer does the same thing it used to. This post collects together some of reasons the behaviour of code changes over time.¹

The Environment Has Changed

E1 You updated your compiler/interpreter (Python/R etc.)
E2 You updated libraries used in your code (e.g. from PyPI/CRAN).
E3 You updated the operating system of the machine you're running on.
E4 Someone else updated the operating system or library/compiler etc.
E5 Your code uses some other software on your machine (or another) machine that has been updated (e.g. a database).
E6 Your codes uses an external service whose behaviour has changed (e.g. calling a web service to get/do something).
E7 You have updated/replaced your hardware.
E8 You run it on different hardware (another machine or OS or OS version or under a different compiler or...)
E9 You move the code to a different location in the file system.
E10 You have changed something in the file system that messes up the code e.g.

deleting a file the code uses
renaming a file the code uses
editing a file the code uses
removing or renaming a directory the code uses
changing permissions on a file or directory the code uses
creating a file or directory that the code expects to create and is now unable to, e.g. because of permissions.

E11 You run as a different user.
E12 You run from a different directory while leaving the code in the same place.
E13 You run the code in a different way (e.g. from a script instead of interactively, or in a scheduler).
E14 A disk fills or some other resource becomes full or depleted.
E15 The load on the machine is higher, and the code runs out of memory or disk or some other source; or has a subtle timing dependency or assumption that fails under load.
E15a The load on the machine is lower, meaning part of your code runs faster, causing a race condition to behave differently. [Added 2022-02-17]
E16 The hardware has developed a fault.
E17 A systems manager has changed some limits e.g. disk quotas, allowed nice levels, a directory service, some permissions or groups...
E18 A shell variable or changed or was created or destroyed.
E19 The locale in which the machine is running changed.
E20 You changed your PYTHONPATH or equivalent.
E21 A new library that you don't use (or didn't think you used) has appeared in a site-packages or similar location, and was picked up by your code or something else your code uses.
E22 You updated your editor/IDE and now whenever you load a file it gets changes in some subtle way that matters (e.g. line endings, blank lines at the of files, encoding, tabs vs. spaces).
E23 The physical environment has changed in some way that affects the machine you are running on (e.g. causing it to slow down).
E24 A file has been touched² and the software determines order of processing by last update date.
E25 The code uses a password or key that is changed, expires or is revoked.
E26 The code requires network access and the network is unavailable, slow, or unreliable at the time the test is run.
E27 Almost any of the above (or below), but for a dependency of your code rather than your code itself, e.g. something in a data centre or library.
E28 Your PATH (the list of locations checks for executables) has changed, or an alias has changed so that the executable you run is different from before. [Added 2022-02-11]
E29 A different disk or share is mounted, so that even though you specify the same path, some file that you are using is different from before. [Added 2022-02-11]
E30 You run the code under a different shell or changed something in a shell startup file. [Added 2022-02-17]

Many of these are illuminated by one of my favourite quote from Beth Andres-Beck:

Mocking in unit tests makes the tests more stable because they don’t break when your code breaks.
— @bethcodes, 2020-12-29T01:26:00Z https://twitter.com/bethcodes/status/1343730015851069440

The Code Has, in Fact, Changed

C1 You think you didn't change the code, but actually you did.
C2 You did change the code, but only in a way that couldn't possibly change the behaviour in the case you're testing.
C3 You didn't change the code, you fixed a bug.
C4 You didn't change the code, but someone else did.
C5 You didn't change the code, but disk corruption did.
C6 You didn't change the code, but you did update some data it uses.
C7 You pulled the code again from a source-code repository but

someone else had pushed a change
you checked out a different branch
you pulled from the wrong repository.

C8 You're on the wrong branch.
C9 The system was restored from backup and you lost changes.
C10 You used a hard link to a file and didn't change the file here but did change it in one of the other linked locations.
C11 You used symbolic links and though your symbolic link didn't change, the code (or other file or files) it symbolically linked did.
C12 You used a diff tool to compare files, but a difference that does matter to your code was not detected by the diff tool (e.g. line endings or capitalization or whitespace).
C13 You are in fact running more tests than previously, or different tests from the ones you ran previously, without realising it.
C14 You reformatted your code thinking that you were only making changes to appearance.
C15 You ran a code formatter/beautifier/coding standard enforcement tool that had a bug in it and changed the meaning.
C16 You believe nothing has changed because git status tells you nothing has changed, but you are using files that aren't tracked or are ignored.
C17 You think a file hasn't changed because of its timestamp, but the timestamp is wrong or doesn't mean what you think it means.
C18 A hidden file changed (e.g. a dotfile).
C19 A file that doesn't match a glob pattern you use changed.

Also from Beth Andres-Beck:

If you have 100% test coverage and your tests use mocks, no you don’t.
— @bethcodes, 2020-12-29T01:51:00Z https://twitter.com/bethcodes/status/1343736477839020032

You Aren't Running the Code You Think You Are

There is another set of problems that aren't strictly causes of code rusting, but which help to explain a set of related situations every developer has probably experienced, which all fall under the general heading of you aren't running the code you think you are.

M1 The code you're running is not the the version you think it is (e.g. you're in the wrong directory).
M2 You are running the code on a different server from the one you think you are (e.g. you haven't realised you're ssh'd in to a different machine or editing a file over a network).
M3 You're editing the code in one place but running it in another.
M4 You have cross-mounted a file system and it's the wrong file system or you think you are/aren't using it when you actually aren't/are (respectively).
M5 Something (e.g. a browser) is caching your code (or some CSS or an image or something).
M6 The code has in fact run correctly (tests have passed) but you're look at the wrong output (wrong directory, wrong tab, wrong URL, wrong window, wrong machine...)
M7 Your compiled code is out-of-sync with your source code, so you're not running what you think you are.
M8 You're running (or not running) a virtual environment when you think you are not (or are), respectively.
M9 You're running a virtual environment and not understanding how it's doing its magic, with the result that you're not using the libraries/code you think you are.
M10 You use a package manager that's installed the right libraries into a different Python (or whatever) from the one you think it has.³
M11 You think you haven't changed the code/libraries/Python you're using, but in fact you did when you updated (what you thought was) a different virtual (or non-virtual) environment.
M12 You have a conflict between different import directories (e.g. a local site-packages and a system site-packages), with different versions of the same library, and aren't importing the one you think you are.
M13 You think the code hasn't changed because you recorded the version number, but there was a code change that didn't cause the version number to be changed, or the code has multiple version numbers, or the code is reporting its version number wrongly, or the version number actually refers to a number of slightly different builds that are supposed to have the same behaviour, but don't.
M14 You have defined the same class or method or function or variable more than once in a language that doesn't mind such things, and are looking at (and possibly) editing a copy of the relevant function/callable/object that is masked by the later definition. [Added 2022-09-14]
M15 A web server or application server has your code in memory and changing or recompiling your code won't have any effect until you restart that web server or application server. This is really a variation of M5, but is subtly different because you wouldn't normally think of this as caching. [Added 2024-03-30]

These are the ones that make you question your sanity.

TIP If what's happening can't be happening, trying introducing a clear syntax error or debug statement or some other change you should be able to see. Then check that it shows up as expected when you're running your code.

Almost every time I think I'm losing my mind when coding, it's because I'm editing and running different code (or viewing results from different code).

Time has Moved On

T1 Your code has a (usually implicit) date/time dependence in it, e.g.

it uses 2-digit dates
it assumes it's running in 2022
it assumes it's not 29th February, or 1st January, or isn't a weekend...
it assumes something else that's not true about (computer) time (no leap seconds, no daylight savings times, no time-zones, no half-hour-aligned timezones...)
it uses 2-digit dates with a pivot year and time (or some computed time the code uses) moves past the pivot year.

T2 Time is 'bigger' in some material way that causes a problem, e.g.

Y2K
Unix 2038 (when the numner of seconds from 1 Jan 1970 overflows 32-bit integers)
Number of days since the code was written needs more digits (10, 100, 1000).

T3 While the code is running, daylight savings time starts or stops, and a measured (local) time interval goes negative.
T4 Your code uses Excel to interpret data and today's a special date that Excel doesn't (or more likely does) recognize.
T5 The system clock is wrong (perhaps badly wrong); or the system clock was wrong when you ran it before and is now right.

Resources Used by the Code Have Changed

R1 A resource your code uses (a database, a reference file, a page on the internet, a web service) returns different data from the data it always previously returned.
R2 A resource your code uses returns data in a different format e.g. a different text encoding, different precision, different line endings (Unix vs. PC vs. Mac), presence or absence of a byte-order marker (BOM) in UTF-8, presence of new characters in Unicode, different normalization of unicode, indented or unindented JSON/XML, different sort order etc.
R3 A resource you depend on returns “the same” data as expected but something about the interaction is different, e.g. a different status code or some extra data you can ignore, or some redundant data you use has been removed.

Stochastic and Indeterminate Effects

S1 Your code uses random numbers and doesn't fix the seed.
S2 Your code uses random numbers and does fix the main seed but not other seeds that get used (e.g. the the seed for numpy is different from Python's main seed).
S3 A cosmic ray hits the machine and causes a bit flip.
S4 The code is running on a GPU (or even CPU) that does not, in fact, always produce the same answer.
S5 The code is running on a parallel, distributed, or multi-threaded system and there is inderminacy, a race condition, possible deadlock or livelock, or any number of other things that might cause indeterminate behaviour.
S6 Your code assumes something is deterministic or has specified behaviour that is in fact not determinisic or specified, especially if that result is the same most but not all of the time, e.g. tie-breaking in sorts, order of extraction from sets or (unordered) dictionaries, or the order in which results arrive from asynchronous calls.⁴
S7 Your code relies on something likely but not certain, e.g. that two randomly-generated, fairly long IDs will be different from each other.
S8 Your code uses random numbers and does fix the main seed, but the sequence of random numbers has changed. This has happened with NumPy, where they realised that one of the sampling functions was drawing unnecessary samples from the PRNG. In making the sampler more efficient, they changed the samples that were returned for the same PRNG seed. [Contributed by Rob Moss (@rob_models and @rob_models@mas.to), who "had a quick search for the relevant issue/changelog item, but it was a long time ago (~NumPy 1.7, maybe)." He "couldn't find the original NumPy issue, but here's a similar one: https://github.com/numpy/numpy/issues/14522". Thanks, Rob!]

It Never Worked (or didn't work when you thought it did)

[Added 2024-07-19]

I realised there's another whole class of errors of process/errors of interpretation that could lead us to think that code has “rusted” despite not having been changed. These are all broadly the same as one of the explanations offered before, but now for the original run when you thought it worked, rather than for the current or new run, when it fails.

N1 You thought you ran the code before, and that it worked correctly, but you are mistaken: you didn't run it at all, or it in fact failed but you did not notice.
N2 You did run the code before, but picked up the output from a previous state, before you broke it, when it did work.
N3 You did run the code before, and it did produce the wrong output then as now, but you used a defective procedure or tool to examine the output then, and failed to realise it was wrong/failing.
N4 You did run the code before, and it did pass, but you passed the wrong parameters/inputs/whatever and are now passing the correct (or different) parameters/inputs/whatever so it now fails as it would have done then if you had done the same.

If you have think of other reasons code rusts, do let me know and I'll be happy to expand this list (and attribute, of course) ↩
Touching a file (the unix touch command) updates the last update date on a file without changing its contents. ↩
For this reason, a lot of people prefer to run python -m pip rather than pip, because this way you can have greater confidence that the module is getting installed in the site-packages for the version of python you're actually running. ↩
Most of these kinds of indeterminacy will, in fact, usually be stable given identical inputs on the same machine running the same software, but it can take very little to change that, and should not be relied upon. ↩

Flat Files (a.k.a. CSV files)

2021-07-16T18:00:00+01:00

This week, a client I'm working for received a large volume of data, and as usual the data was sent as "flat" files—or CSV (comma-separated values¹) files, as they are more often called. Everyone hates CSV files, because they are badly specified, contain little metadata and are generally an unreliable way to transfer information accurately. They continue to be used, of course, because they are the lowest-common denominator format and just about everything can read and write them in some fashion.

Some of the problems with CSV files are well captured in a pithy blog post by Jesse Donat entitled Falsehoods Programmers Believe about CSVs.

Among other things, the data we received this week featured:

unescaped commas in unquoted (comma-separated) values;
an unspecified non-UTF-8 encoding that also did not appear to be iso-8859-1 ("latin-1" to its friends), nor indeed iso-8859-15 ("latin-9");
different null markers in different fields, and some cases, different null markers in a single field;²
field names (column headers) that included spaces, apostrophes, dashes and (in at least one case) a non-ASCII non-alphanumeric character;
multiple date formats, even within a single field, including some dates with three-digit years.

All of this is a bit frustrating, but far from unusual, and only one of these problems was actually fatal—the use of unquoted, unescaped separators in values, which makes the file inherently ambiguous. I'm almost sure this data was written but not read or validated, because I don't believe the supplier would have been able to read it reliably either.

Metadata

In an ideal world, we'd move away from CSV files, but we also need to recognise not only that this probably won't happen, but that the universality, plain-text nature, grokkability and simplicity of CSV files are all strengths; for all that we might gain using fancier, better-specified formats, we would lose quite a lot too, not least the utility of awk, split, grep and friends in many cases.

So if we can't get away from CSV files, how can we increase reliability when using them? Standardizing might be good, but again, this is going to be hard to achieve. What we might be able to do, however, is to work towards a way of specifying flat files that at least allows a receiver of them to know what to expect, or a generator to know what to write. I've been involved with a few such ideas over the years, and the software my company produces (Miró) used its own non-standard, XML-based way of describing flat files.

What I'm thinking about is trying to produce something more general, less opinionated, and more modern (think JSON, rather than XML, for starters) that addresses more issues. The initial goal would be simply descriptive—to allow a metadata file to be created that accurately describes the specific features of a given flat file so that a reader (human or machine) knows how to interpret it. Over time, this might grow into something bigger. I think obvious things to do after the format is created include:

In my case, getting Miró to accept these in place of³ its current XML-based files when reading (or writing) flat files. (Initially, at least, Miró would not be able to read or write all files that could be specified in this way, but could at least warn the user when it couldn't.)
Also getting the Python tdda library to be able to use this when using CSV files for input (and perhaps also for output).
Writing an "argument generator" for some of the standard (Python) CSV readers and writers to set the read/write options to be consistent with a given metadata description, and then probably to provide wrapped versions of those readers/writers that can accept a path for a CSV file and a path to a metadata file and use the underlying CSV reader or writer to read or write the file using that specification.
Writing (yet another) "smart" reader to try to read any old CSV files (using heuristics) and write out a metadata file that appears to match the data provided. This could not possibly work completely reliably because of all inherent ambiguity in flat files already alluded to, but an "80%" solution for real-world files should certainly be achievable as many programs make a reasonable job of handling arbitrary CSV files already.
Writing a validator to confirm whether a given CSV file is consistent with the specification in the metadata file.
Incorporating such a flat-file validator into TDDA so that it can check not only the (semantic) content of a dataset, but also the syntactic/formatting validity of data, confirming that it has been or can be read correctly.⁴

Together, a smart reader that generates a metadata file for a CSV file (item 4 above) and a validator that validates a CSV file against such a metadata specification (item 5) are very analogous to the current constraint discovery and data verification, respectively, but in the space of CSV files—roughly, "syntactic" conformance—rather than data (or "semantic") correctness.

Miró's Flat File Description format (XMD Files)

Here is an example, from its documentation, of the XMD data files that Miró uses.

<?xml version="1.0" encoding="UTF-8"?>
<dataformat>
    <sep>,</sep>                     <!-- field separator -->
    <null></null>                    <!-- NULL marker -->
    <quoteChar>"</quoteChar>         <!-- Quotation mark -->
    <encoding>UTF-8</encoding>       <!-- any python coding name -->
    <allowApos>True</allowApos>      <!-- allow apostophes in strings -->
    <skipHeader>False</skipHeader>   <!-- ignore the first line of file -->
    <pc>False</pc>                   <!-- Convert 1.2% to 0.012 etc. -->
    <excel>False</excel>             <!-- pad short lines with NULLs -->
    <dateFormat>eurodt</dateFormat>  <!-- Miró date format name -->
    <fields>
        <field extname="ext1" name="int2" type="string"/>
        <field extname="ext2" name="int2" type="int"/>
        <field name="int3" type="real"/>
        <field extname="ext4" type="date"/>
        <field extname="ext5" type="date" format="rdt"/>
        <field extname="ext6" type="date" format="rd"/>
        <field extname="ext7" type="bool"/>
    </fields>
    <requireAllFields>False</requireAllFields>
    <banExtraFields>False</banExtraFields>
</dataformat>

Three things to note immediately about this:

I'm not presenting this as the solution: the XMD format is now rather out of vogue and there are a number of things I would definitely do differently fifteen years on (such as using more standard names for types and more standard date format specifiers).
The XMD format is slightly more than just a flat file description, in that it contains a couple of things that are more about how to interpret and handle the data after reading, rather than simply describing the data.
The XMD file supports the notion of two different names for a field. The extname is the name in the CSV file (the external name), while the name is the name for Miró to use for the field. The semantics of this are slightly complicated, but allow for renaming of fields on import, and for naming of fields where there is no external name, or external names are repeated, or the external name is otherwise unusable by Miró. If the CSV file has a header and each field has a different name in the header, the order of the fields int he XMD file does not matter, but if there are missing or repeated field names, Miró will use the field order in the XMD file.

Notwithstanding the amazing variety seen in CSV files, as illuminated by Jesse Donat's aforementioned blogpost, most CSV files from mature systems vary only in the ways covered by a few of the items described in the CSV file. The most important things to know about a flat file overall are normally:

Encoding. The file encoding—these days, most commonly UTF-8.
Separator. The separator character—most commonly a comma (,), but pipe (|), tab and semicolon (;) are also frequently used.
Quoting. What character is used to quote strings (if any). There are quite number of subtleties here (not all capable of being expressed in the XMD file) including:
- Are all strings quoted or just some (e.g. ones containing the field separator)?
- Are non-string values (e.g. numbers) quoted too?⁵
- Are missing values (NULL) quoted?⁶
Missing Values. How are missing values (NULLs) denoted in the file, should there be any?
Escaping. How are characters "escaped"? This really covers a set of different issues, and the XMD file is not rich enough to cover all possibilities. One aspect is, when strings are quoted, how are quotes in the string handled? The most common answers are either by preceding them with an escape character, usually backslash (\), e.g.
```
"This is an escaped \" character in a string"
```
or by stuttering:
```
"This is a stuttered "" character in a string"
```
Escaping is also a way of including the separator in non-quoted values, like these display prices:
```
Price,DisplayPrice
100.0,£100.00
1000.0,£1\,000.00
1000000,£1\,000\,000.00
```
Escaping is also a way of specifying some special characters, e.g. \n for a newline, \t for a tab etc., and as a result when an actually backslash is required it is self-escaped (as \\).
Row Truncation after the last non-null value. Are rows in which the last value is missing truncated? Like many CSV writers, Excel writes missing values as blanks so that 1,,3 is read as 1 for the first field, a missing value for the second field and 3 for the third field. More quirkily, when Excel writes out CSV files, if there are n columns and the last m of them on a row are missing, Excel will write out only the non-missing values, and no further separators, so that there will be only n – m values on that line and only n – m – 1 separators. This behaviour is hard to describe and (as far as I know) unique to Excel, so in the XMD file this is simply marked as <excel>True</excel>.⁷
Header handling. Although the common case is for CSV files to have a single line at the start with the field names, sometimes there is no such line, and sometimes there are multiple lines before the data (one or more of which many specify the field names). As a minimum, a metadata description needs to be able to specify whether there is a header line, and ideally how many such lines there are and how headers should be extracted from them. If there are no headers, the specification should probably specify the field names. (Miró imaginatively calls the fields Field1 to FieldN if no fieldnames are available in the flat file or any XMD file.)

Per-Field Information

It's always useful and sometimes necessary to specify field types, and as discussed above, sometimes field names. Typing is almost always ambigous, and such ambiguity is increased if there are any bad values in the data. Moreover, in some cases (especially dates and timestamps), it is useful to specify the date format. Although good flat-file readers generally make a reasonable job of inferring types, and often date formats too, it is clearly helpful for a metadata specification to include these.

Just as date formats can vary between fields, other things can vary too, most obviously null indicators (missing value information), quoting and escaping. Moreover, if numeric data is formatted (e.g. including currency indicators, thousand separators etc.) these can all usefully be specified.

Required/Allowed Fields

The final pair of settings in the XMD file look slightly different from the others, partly because they are phrased as directives rather than descriptions. requireAllFields, when set, is a directive to Miró to raise a warning or an error if any of the fields in the XMD file are not present in the CSV file. Similarly, banExtraFields is a directive to raise such a warning or error if any fields are found in the CSV file that are not listed in the XMD file. Miró has several ways to specify whether infringements result in warnings or errors.

These directives can, however, be recast as declarations. The banExtraFields directive, when true, can equally be thought of as a declaration the field list is complete. Similarly, the requireAllFields directive, when true, can be thought of as a declaration that the field list is not just describing types that and formats for fields that might be in the CSV files, but rather that all fields listed are actually in the file.⁸

In principle, I think it would probably be better if these descriptions were more obviously descriptive or declarative, but I am struggling to find a pair of words/phrases that would capture that elegantly. At this point I am tempted to retain their imperative nature but make them slightly more symmetrical, perhaps with:

"require-all-fields": true,
"allow-extra-fields": false

Alternatively a more declarative syntax might be something like:

"csv-file-might-omit-fields": false,
"csv-file-might-include-extra-fields": false

The reader might wonder why the fields in the metadata file would ever not correspond exactly to those in file. In practice, it is not uncommon when dealing with relatively "good" CSV files to write an XMD file that specifies types and formats only for fields that trip up the flat-file reader. Conversely, it can be useful to have XMD files that describe a variety of possible files that share field names and types; in those cases, the extra ones do no harm.

What Might a Metadata File Look Like?

The XMD file gets quite a lot of things right:

As XML, it's a standard format that's easy to read, though today JSON is clearly more popular for this sort of use. (It would be fairly easy to allow a common format to be expressed in JSON, XML or YAML, but there's something to be said for a single format, probably JSON.)
All of the most fundamental overall properties are represented—encoding, separator, null marker, escape characters, and date format.
There's a separation between the overall file properties and the per-field properties, with the ability to specify the actual fieldname in the file, the field type and, in the case of date fields, custom formats on a per-field basis, if necessary.
It can give enough enough information to allow Excel-style truncated lines can be read successfully.

There are also a few major shortcomings:

The single escape chaaracter specification covers multiple things.
There is no explicit support for quote stuttering (which is fairly common).
The format does not recognise multiple headers.
The format does not provide any way to specify non-date field formats such as boolean specifiers, possible thousand separators and decimal point markers.
The format assume a single NULL indicator for all fields and assumes that there is only one kind of missing value/missing value.
The date formats supported are not comprehensive and are not expressed in a standard way.
Type specifiers are also somewhat non-standard.
XMD files fail to recognize the possibility that null markers are quoted, and implicitly assume that any empty string is distinct from a missing string value. This is probably too opinionated.

Some of these shortcomings reflect the fact that the XMD format was conceived less as a general-purpose flat-file descriptor than a specification as to how Miró should read or write a given flat file, and also a way for Miró to specify how it has written a flat file.

Essentially, I think a good flat-file description format would preserve the good aspects and remedy the faults identified, as well as providing a mechanism for specifying some more esoteric possibilies not mentioned so far.

I'll propose something concrete in subsequent posts.

Sometimes the separator in a flat file is a character other than a comma, and you occasionally see .tsv used an extension when the separator is a tab character, or .psv when the separator is a pipe character (|). Often, however, a csv extension is still used, and as result the acronym CSV is sometimes restyled as character-separated values. I had always heard this extension attributed to Microsoft, but have been unable to verify this. ↩
To be fair, the notion of different kinds of missing values is reasonable—missing because it wasn't recorded, missing because it was unreadable, missing because it's an undefined result (e.g. mean of no values) etc. But this wasn't that: it was just multiple ways of denoting generic missing values. ↩
by which, of course, I mean as well as ... ↩
There's an interesting question as to whether the CSV format specification should be incorporated as an optional part of a TDDA file, and if so, whether it should simply be a nested section or whether the field-specific components should be merged with TDDA's field sections. There are pros and cons. ↩
Yes, some systems do this. ↩
I know, madness! But such practices occur! ↩
Maybe it should have been called quirks mode ↩
Miró's slightly extended version of TDDA files includes lists of required and allowed fields, which serve a similar purpose to these settings. ↩

Sharing Tests across Implementations by Externalizing Test Data

2020-08-30T17:30:00+01:00

I've been dabbling in Swift—Apple's new-ish programming language—recently. One of the things I often do when learning a new language is either to take an existing project in a language I know (usually, Python) and translate it to the new one, or (better) to try a new project, first writing it in Python then translating it. This allows me to separate out debugging the algorithm from debugging my understanding of the new language, and also give me something to test against.

I have a partially finished Python project for analysing chords that I've been starting to translate, and this has led me to begin to experiment with some new extensions to the TDDA library (not yet pushed/published).

It's a bit fragmented and embryonic, but this what I'm thinking about.

Sharing test data between languages

Many tests boil down to "check that passing these inputs to this function¹ produces this result". There would be some benefits in sharing the inputs and expected outputs between implementations:

DRY principle (don't repeat yourself);
reducing the chances of things getting out of sync;
more confidence that the two implementations really do the same thing;
less typing / less code.

** Looping over test cases **

Standard unit-testing dogma tends to focus on the idea of testing small units using many tests, each containing a single assertion, usually as the last statement in the test.² The benefit of using a single assertion is that when there's a failure it's very clear what it was, and an earlier failure doesn't prevent a later check (assertion) from being carried out: you get all your failures in one go. Less importantly, it also means that the number of tests executed is the same as the number of assertions tested, which might be useful and psychologically satisfying.

On the other hand, it is extremely common to want to test multiple input-output pairs and it is natural and convenient to collect those together and loop over them. I do this all the time, and the reference testing capability in the TDDA library already helps mitigate some downsides of this approach in some situations.

A common way I do this is to loop over a dictionary or a list of tuples specifying input-output pairs. For example, if I were testing a function that did string slicing from the left in python (string[:n]) I might use something like

cases = {
    ('Catherine', 4): 'Cath',
    ('Catherine', -6): 'Cath',  # deliberately wrong, for illustration
    ('', 7): '',
    ('Miró forever', 4): 'Miró',
    ('Miró forever', 0): ' '    # also deliberately wrong
}
for (text, n), expected in cases.items():
    self.assertEqual(left_string(text, n), expected)

In Python this is fine, because tuples, being hashable, can be used as dictionary keys, and there's something quite intuitive and satisfying about the cases being presented as lines of the form input: expected output. But I also often just use nested tuples or lists, partly as a hangover from older versions of Python in which dictionaries weren't sorted.³ Here's a full example using tuples:

from tdda.referencetest import ReferenceTestCase

def left_string(s, n):
    return s[:n]


class TestLeft(ReferenceTestCase):
    def testLeft(self):
        cases = (
            (('Catherine', 4), 'Cath'),
            (('Catherine', -6), 'Cath'),    # deliberately wrong, for illustration
            (('', 7), ''),
            (('Miró forever', 4), 'Miró'),
            (('Miró forever', 0), ' ')      # also deliberately wrong
        )
        for (text, n), expected in cases:
            self.assertEqual(left_string(text, n), expected)


if __name__ == '__main__':
    ReferenceTestCase.main()

As noted above, two problems with this are:

if one test case fails, it's not necessarily easy to figure out which one it was, especially if expected values (e.g. 'Cath') are repeated.
an earlier failure prevents later cases from running.

We can see both of these problems if we run this:

$ python3 looptest.py
F
======================================================================
FAIL: testLeft (__main__.TestLeft)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "looptest.py", line 18, in testLeft
    self.assertEqual(left_string(text, n), expected)
AssertionError: 'Cat' != 'Cath'
- Cat
+ Cath
?    +


----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=1)

It's actually the second case that failed, and the fifth case would also fail if it ran (since it should produce an empty string, not a space).

A technique I've long used to address the first problem is to include the test case in the equality assertion, replacing

    self.assertEqual(actual, expected)

with

    self.assertEqual((case, actual), (case, expected))

like so:

def testLeft(self):
    cases = (
        (('Catherine', 4), 'Cath'),
        (('Catherine', -6), 'Cath'),
        (('', 7), ''),
        (('Miró forever', 4), 'Miró'),
        (('Miró forever', 0), ' ')
        )
    for case, expected in cases:
        (text, n) = case
        self.assertEqual((case, left_string(text, n)),
                         (case, expected))

Now when a case fails, we see what the failure is more easily:

$ python3 looptest2.py
F
======================================================================
FAIL: testLeft (__main__.TestLeft)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "looptest2.py", line 20, in testLeft
    (case, expected))
AssertionError: Tuples differ: (('Catherine', -6), 'Cat') != (('Catherine', -6), 'Cath')

First differing element 1:
'Cat'
'Cath'

- (('Catherine', -6), 'Cat')
+ (('Catherine', -6), 'Cath')
?                         +


----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (failures=1)

I wouldn't call it beautiful, but it does the job, at least when the inputs and outputs are of a manageable size.

This still leaves the problem that the failure of an earlier case prevents later cases from running. The TDDA library already addresses this in the case of file checks, by providing the assertFilesCorrect (plural) assertion in addition to the assertFileCorrect (singular); we'll come back to it later.

Externalizing Test Data

Returning to the main theme of this post, when there are multiple implementations of software, potentially in different languages, there is some attraction to being able to share the test data—ideally, both the inputs being tested and the expected results.

The project I'm translating is a chord analysis tool focused on jazz guitar chords, especially moveable ones with no root. It has various classes, functions and structures concerned with musical notes, scales, abstract chords, tunings, chord shapes, chord names and so forth. It includes an easy-to-type text format that uses # as the sharp sign and b as the flat sign, though on output, these are usually translated to ♯ and ♭. Below are two simple tests from the Python code.

For those interested, the first tests a function transpose that transposes a note by an number of semitones. There's an optional key parameter which, when provided, is used to decide whether to express the result as a sharp or flat note (when appropriate).

def testTranspose(self):
    self.assertEqual(transpose('C', 0), 'C')

    self.assertEqual(transpose('C', 1), 'C#')
    self.assertEqual(transpose('C', 1, key='F'), 'Db')
    self.assertEqual(transpose('C#', -1), 'C')
    self.assertEqual(transpose('Db', -1), 'C')

    self.assertEqual(transpose('C', 2), 'D')
    self.assertEqual(transpose('D', -2), 'C')

    self.assertEqual(transpose('C', 3, key='A'), 'D#')
    self.assertEqual(transpose('C', 3), 'Eb')
    self.assertEqual(transpose('C', 3, key='Bb'), 'Eb')

    self.assertEqual(transpose('D#', -3), 'C')
    self.assertEqual(transpose('Eb', -3), 'C')

    self.assertEqual(transpose('C', -1), 'B')
    self.assertEqual(transpose('B', 1), 'C')

    self.assertEqual(transpose('C', -2), 'Bb')
    self.assertEqual(transpose('C', -2, 'E'), 'A#')
    self.assertEqual(transpose('Bb', 2), 'C')
    self.assertEqual(transpose('A#', 2), 'C')

    self.assertEqual(transpose('C', -3), 'A')
    self.assertEqual(transpose('A', 3), 'C')

    self.assertEqual(transpose('G', 4), 'B')
    self.assertEqual(transpose('B', -4), 'G')

    self.assertEqual(transpose('F#', 4), 'Bb')
    self.assertEqual(transpose('F#', 4, 'E'), 'A#')
    self.assertEqual(transpose('Bb', -4), 'F#')
    self.assertEqual(transpose('A#', -4), 'F#')
    self.assertEqual(transpose('Bb', -4, 'F'), 'Gb')
    self.assertEqual(transpose('A#', -4, 'Eb'), 'Gb')

    self.assertEqual(transpose('G', 4), 'B')
    self.assertEqual(transpose('F#', 4), 'Bb')
    self.assertEqual(transpose('F#', 4, 'E'), 'A#')
    self.assertEqual(transpose('B', -4), 'G')
    self.assertEqual(transpose('Bb', -4), 'F#')
    self.assertEqual(transpose('A#', -4), 'F#')
    self.assertEqual(transpose('Bb', -4, 'F'), 'Gb')
    self.assertEqual(transpose('A#', -4, 'F'), 'Gb')

Clearly, this test does not use looping, but does combine some 36 test cases in a single test (dogma be damned!)

A second test is for a function to_flat_equiv, which (again, for those interested) accepts chord names (in various forms) and—where the chord's key is sharp, as written—converts them to the equivalent flat form. (Here, o is one of the ways to indicate a diminished chord (e.g. Dº) and M is on of the ways of describing a major chord (also maj or Δ). The function also accepts None as an input (returned unmodified) and R as an abstract chord with no key specified (also unmodified).⁴

def test_to_flat_equiv(self):
    cases = (a
        ('C', 'C'),
        ('C#m', 'Dbm'),
        ('Db7', 'Db7'),
        ('C#M7', 'DbM7'),
        ('Do', 'Do'),
        ('D#M', 'EbM'),
        ('E9', 'E9'),
        ('FmM7', 'FmM7'),
        ('F#mM7', 'GbmM7'),
        ('G', 'G'),
        ('G#11', 'Ab11'),
        ('Ab11', 'Ab11'),
        ('Am11', 'Am11'),
        ('A#+', 'Bb+'),
        ('Bb+', 'Bb+'),
        ('A♯+', 'B♭+'),
        ('B♭+', 'B♭+'),

        (None, None),
        ('R', 'R'),
        ('R#', 'R#'),
        ('Rm', 'Rm'),
        )
    for k, v in cases:
        self.assertEqual(to_flat_equiv(k), v)

    for letter in 'BEPQaz@':
        self.assertRaises(NoteError, to_flat_equiv, letter + '#')

This function uses two loops within the test, one for the good cases and another for eight illegal input cases that raise exceptions. The looping has a clear benefit, but there's no reason to have combined the good and bad test cases in a single test function other than laziness.

In 2020, if we're going to share the test data between implementations, it hard to look beyond JSON. Here's an extract from a file scale-tests.json that encapsulates the inputs and expected outputs for all the tests above:

{
    "transpose": [
        [["C", 0], "C"],

        [["C", 1], "C#"],
        [["C", 1, "F"], "Db"],
        [["C#", -1], "C"],
        [["Db", -1], "C"],

        [["C", 2], "D"],
        [["D", -2], "C"],

        [["C", 3, "A"], "D#"],
        [["C", 3], "Eb"],
        [["C", 3, "Bb"], "Eb"],

        [["D#", -3], "C"],
        [["Eb", -3], "C"],

        [["C", -1], "B"],
        [["B", 1], "C"],

        [["C", -2], "Bb"],
        [["C", -2, "E"], "A#"],
        [["Bb", 2], "C"],
        [["A#", 2], "C"],

        [["C", -3], "A"],
        [["A", 3], "C"],

        [["G", 4], "B"],
        [["B", -4], "G"],

        [["F#", 4], "Bb"],
        [["F#", 4, "E"], "A#"],
        [["Bb", -4], "F#"],
        [["A#", -4], "F#"],
        [["Bb", -4, "F"], "Gb"],
        [["A#", -4, "Eb"], "Gb"],

        [["G", 4], "B"],
        [["F#", 4], "Bb"],
        [["F#", 4, "E"], "A#"],
        [["B", -4], "G"],
        [["Bb", -4], "F#"],
        [["A#", -4], "F#"],
        [["Bb", -4, "F"], "Gb"],
        [["A#", -4, "F"], "Gb"],
    ],
    "flat_equivs": [
        ["C", "C"],
        ["C#m", "Dbm"],
        ["Db7", "Db7"],
        ["C#M7", "DbM7"],
        ["Do", "Do"],
        ["D#M", "EbM"],
        ["E9", "E9"],
        ["FmM7", "FmM7"],
        ["F#mM7", "GbmM7"],
        ["G", "G"],
        ["G#11", "Ab11"],
        ["Ab11", "Ab11"],
        ["Am11", "Am11"],
        ["A#+", "Bb+"],
        ["Bb+", "Bb+"],
        ["A♯+", "B♭+"],
        ["B♭+", "B♭+"],

        [null, null],

        ["R", "R"],
        ["R#", "R#"],
        ["Rm", "Rm"]
    ],
    "flat_equiv_bads": "BEPQaz@"
}

I have a function that reads this uses json.load to read this and other test data, storing the results in an object with a .scale attribute, like so:

>>> from moveablechords.utils import ReadJSONTestData
>>> from pprint import pprint
>>> TestData = ReadJSONTestData()
>>> pprint(TestData.scale['transpose'])
[[['C', 0], 'C'],
 [['C', 1], 'C#'],
 [['C', 1, 'F'], 'Db'],
 [['C#', -1], 'C'],
 [['Db', -1], 'C'],
 [['C', 2], 'D'],
 [['D', -2], 'C'],
 [['C', 3, 'A'], 'D#'],
 [['C', 3], 'Eb'],
 [['C', 3, 'Bb'], 'Eb'],
 [['D#', -3], 'C'],
 [['Eb', -3], 'C'],
 [['C', -1], 'B'],
 [['B', 1], 'C'],
 [['C', -2], 'Bb'],
 [['C', -2, 'E'], 'A#'],
 [['Bb', 2], 'C'],
 [['A#', 2], 'C'],
 [['C', -3], 'A'],
 [['A', 3], 'C'],
 [['G', 4], 'B'],
 [['B', -4], 'G'],
 [['F#', 4], 'Bb'],
 [['F#', 4, 'E'], 'A#'],
 [['Bb', -4], 'F#'],
 [['A#', -4], 'F#'],
 [['Bb', -4, 'F'], 'Gb'],
 [['A#', -4, 'Eb'], 'Gb'],
 [['G', 4], 'B'],
 [['F#', 4], 'Bb'],
 [['F#', 4, 'E'], 'A#'],
 [['B', -4], 'G'],
 [['Bb', -4], 'F#'],
 [['A#', -4], 'F#'],
 [['Bb', -4, 'F'], 'Gb'],
 [['A#', -4, 'F'], 'Gb']]

I have also made it so you can get entries using attribute lookup on the objects, i.e. TestData.scale.transpose rather than TestData.scale['transpose'], just because it looks more elegant and readable to me.

A straightforward refactoring of the testTranspose function to use the JSON-loaded data in TestData.scale would be

def testTranspose2(self):
    for (case, expected) in TestData.scale.transpose:
        if len(case) == 2:
            (note, offset) = case
            self.assertEqual((case, transpose(note, offset)),
                             (case, expected))
        else:
            (note, offset, key) = case
            self.assertEqual((case, transpose(note, offset, key=key)),
                             (case, expected))

In case this isn't self-explanatory

The loop runs over the cases and expected values, so on the first iteration case is ["C", 0] and expected is 'C';
The assignments set the note and offset variables; if the list is of length three the key variable is also set;
As discussed above, rather than just using things like self.assertEqual(transpose(note, offset), expected), we're including the case (the tuple of input parameters) on both sides of the assertion so that if there's a failure, we can see which case is failing.

We can simplify this further since the transpose function has only one optional (keyword) argument, key, which can also be provided as a third positional argument. Assuming we don't specifically need to test the handling of key as a keyword argument, we can combine the two branches as follows:

def testTranspose3(self):
    for (case, expected) in TestData.scale.transpose:
        self.assertEqual((case, transpose(*case)),
                         (case, expected))

Here, we're using the * operator to unpack⁵ case into an argument list for the transpose function.

Adding TDDA Support

It probably hasn't escaped your attention that this third version of testTranspose is rather generic: the same structure would work for any function f and list of input-output pairs Pairs:

def testAnyOldFunction_f(self):
    for (case, expected) in Pairs:
        self.assertEqual((case, f(*case)), (case, expected))

This makes it fairly easy to add TDDA support. I added prototype support for this that allows us to use an even shorter version of the test:

def testTranspose4(self):
    self.checkFunctionByArgs(transpose, TestData.scale.transpose)

This new checkFunctionByArgs takes a function to test and a list of input output pairs and runs a slightly fancier version of testAnyOldFunction. I'll go into extensions in another post, but the most important difference is that it will report all failures rather than stopping at the first one.

We can illustrate this by changing the last first and last cases in TestData.scale['transpose'] to be incorrect, say:

   [[['C', 0], 'Z'],
     ...
    [["A#", -4, "F"], "Zb"]

If we run testTranspose3 using this modified test data, we get only the first failing case, and although the test case is listed in the output, the output isn't particularly easy to grok.

$ python3 testscale.py
.....F.......
======================================================================
FAIL: testTranspose2 (__main__.TestScale)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "testscale.py", line 22, in testTranspose2
    (case, expected))
AssertionError: Tuples differ: (['C', 0], 'C') != (['C', 0], 'Z')

First differing element 1:
'C'
'Z'

- (['C', 0], 'C')
?             ^

+ (['C', 0], 'Z')
?             ^


----------------------------------------------------------------------
Ran 13 tests in 0.002s

FAILED (failures=1)

But if we use the TDDA's prototype checkFunctionByArgs functionality, we see both failures and it shows them in a more digestible format:

$ python3 testscale.py
.....

Case transpose('C', 0): failure.
    Actual: 'C'
  Expected: 'Z'


Case transpose('A#', -4, 'F'): failure.
    Actual: 'Gb'
  Expected: 'Zb'
F.......
======================================================================
FAIL: testTranspose4 (__main__.TestScale)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "testscale.py", line 15, in testTranspose4
    self.checkFunctionByArgs(transpose, TestData.scale.transpose)
  File "/Users/njr/python/tdda/tdda/referencetest/referencetest.py", line 899, in checkFunctionByArgs
    self._check_failures(failures, msgs)
  File "/Users/njr/python/tdda/tdda/referencetest/referencetest.py", line 919, in _check_failures
    self.assert_fn(failures == 0, msgs.message())
AssertionError: False is not true :

Case transpose('C', 0): failure.
    Actual: 'C'
  Expected: 'Z'


Case transpose('A#', -4, 'F'): failure.
    Actual: 'Gb'
  Expected: 'Zb'

----------------------------------------------------------------------
Ran 13 tests in 0.001s

FAILED (failures=1)

The failures currently get shown twice, once during execution of the tests and again at the end in the summary, and the test just counts this as a single failure, though these are both things that could be changed.

There are variant forms of the prototype checking function above to handle keyword arguments only and mixed positional and keyword argmuents. There's also a version specifically for single-argument functions, where it's natural not to write the arguments as a tuple, but a simple value.

Is this a Good Idea?

I think the potential benefits of sharing data between different implementations of the same project are pretty clear. I haven't actually modified the Swift implementation to use the JSON, but I'm sure doing so will be easy and a clear win. I hope the example above also illustrates that good support from testing frameworks can significantly mitigate the downsides of looping over test cases within a single test function. But there are other potential downsides.

The most obvious problem, to me, is that the separation of the test data from the test it makes it harder to see what's being tested (and perhaps means you have to trust the framework more, though that is quite easy to check). Arguably, this is even more true when the test is reduced to the one-line form in testTranpose4, rather than longer form in testTranspose2, where the function arguments are unpacked and named, so that you can see a bit more of what is actually being passed into the function.

There's a broader point about the utility of tests as a form of documentation. A web search for externalizing test data uncovered this post from Arvind Patil in 2005 in which he proposes something like scheme here for Java (with XML taking the place of JSON, in 2005, of course). Three replies to the post are quite hostile, including the first for Irakli Nadareishvili, who says:

sorry, but this is a quite dangerous anti-pattern. Unit-tests are not simply for testing a piece of code. They carry several, additional, very important roles. One of them is - documentation.

In a well-tested code, unit-tests are the first examples of API usage (API that they test). A TDD-experienced developer can learn a lot about the API, looking at its unit-tests. For the readability and clarity of what unit-test tests, it is very important that test data is in the code and the reader does not have to consistently hop from a configuration file to the test code.

Also, usually boundary conditions for a code (which is what test data commonly is) almost never change, so there is more harm in this "pattern" than gain, indeed.

This is definitely a reasonable concern. Even if code has good documentation, it is all too common for it to become out of date, whereas (passing) tests, almost by definition, tend to stay up-to-date with API changes. We could mitigate this issue quite a lot by hooking into verbose mode (-v or --verbose) and having it show each call as well as the test function being run, which seems like a good idea anyway. At the moment, if you run the scale tests with -v on my chord project like this you get output like this:

$ python3 testscale.py -v
testAsSmallestIntervals (__main__.TestScale) ... ok
testDeMinorMajors (__main__.TestScale) ... ok
testFretForNoteOnString (__main__.TestScale) ... ok
testNotePairIntervals (__main__.TestScale) ... ok
testRelMajor (__main__.TestScale) ... ok
testTranspose (__main__.TestScale) ... ok
test_are_not_same (__main__.TestScale) ... ok
test_are_same (__main__.TestScale) ... ok
test_are_same_invalids (__main__.TestScale) ... ok
test_flat_equiv (__main__.TestScale) ... ok
test_flat_equiv_bads (__main__.TestScale) ... ok
test_preferred_equiv (__main__.TestScale) ... ok
test_preferred_equiv_bads (__main__.TestScale) ... ok

----------------------------------------------------------------------
Ran 13 tests in 0.002s

OK

but we could (probably) extend this to something more like:⁶

$ python3 testscale.py -v
testAsSmallestIntervals (__main__.TestScale) ... ok
testDeMinorMajors (__main__.TestScale) ... ok
testFretForNoteOnString (__main__.TestScale) ... ok
testNotePairIntervals (__main__.TestScale) ... ok
testRelMajor (__main__.TestScale) ... ok
testTranspose (__main__.TestScale) ...
    transpose('C', 0): OK
    transpose('C', 1): OK
    transpose('C', 1, 'F'): OK
    transpose('C#', -1): OK
    transpose('Db', -1): OK
    transpose('C', 2): OK
    transpose('D', -2): OK
    transpose('C', 3, 'A'): OK
    transpose('C', 3): OK
    transpose('C', 3, 'Bb'): OK
    transpose('D#', -3): OK
    transpose('Eb', -3): OK
    transpose('C', -1): OK
    transpose('B', 1): OK
    transpose('C', -2): OK
    transpose('C', -2, 'E'): OK
    transpose('Bb', 2): OK
    transpose('A#', 2): OK
    transpose('C', -3): OK
    transpose('A', 3): OK
    transpose('G', 4): OK
    transpose('B', -4): OK
    transpose('F#', 4): OK
    transpose('F#', 4, 'E'): OK
    transpose('Bb', -4): OK
    transpose('A#', -4): OK
    transpose('Bb', -4, 'F'): OK
    transpose('A#', -4, 'Eb'): OK
    transpose('G', 4): OK
    transpose('F#', 4): OK
    transpose('F#', 4, 'E'): OK
    transpose('B', -4): OK
    transpose('Bb', -4): OK
    transpose('A#', -4): OK
    transpose('Bb', -4, 'F'): OK
    transpose('A#', -4, 'F'): OK
... testTranspose (__main__.TestScale): 36 tests: ... ok
test_are_not_same (__main__.TestScale) ... ok
test_are_same (__main__.TestScale) ... ok
test_are_same_invalids (__main__.TestScale) ... ok
test_flat_equiv (__main__.TestScale) ... ok
test_flat_equiv_bads (__main__.TestScale) ... ok
test_preferred_equiv (__main__.TestScale) ... ok
test_preferred_equiv_bads (__main__.TestScale) ... ok

----------------------------------------------------------------------
Ran 49 test cases across 13 tests in 0.002s

OK

I also found this post from Jeremy Wadhams in 2015 on the subject of Sharing unit tests between several language implementations of one spec. It discusses JsonLogic:

JsonLogic is a data format (built on top of JSON) for storing and sharing rules between front-end and back-end code. It's essential that the same rule returns the same result whether executed by the JavaScript client or the PHP client.

Currently the JavaScript client has tests in QUnit, and the PHP client has tests in PHPunit. The vast majority of tests are "given these inputs (rule and data), assert the output equals the expected result."

Jeremy also suggests something very like the scheme above, again using JSON.

Conclusion

I think this has been quite a promising experiment. It reduced the length of testscale.py from 223 lines to 75, which wasn't an aim (and carries the potential issues noted above), but which does make the scope and structure of the tests easier to understand. It also achieved the primary goal of allowing test data to be shared between implementations, which seems like a valuable prize. Eventually, the project might gain a command line in both implementations, and and that will potentially enable my favourite mode of testing—pairs of input command lines and expected output. But this is a useful start.

Meanwhile, I will probably refine (and document and test!) the prototype implementations a bit more and then release it.

If you have thoughts, do get in touch.

or, more generally, this callable. ↩
other than, perhaps, and manual teardown in a try...finally block. ↩
From Python 3.8 on, all Python dictionaries are ordered. This is also the case in CPython implementations from 3.6 onwards. ↩
The function does not accept B# or E#, even though musically these can be used as alternatives to C and F respectively. That is outside the scope of this function. ↩
this operation is sometimes called splatting, and sometimes unsplatting or desplatting. ↩
Would I seem like a very old fuddy-duddy if I ask "who writes 'ok' in lower case anyway?" ↩

Reference Testing Exercise 2 (pytest flavour)

2019-10-31T09:30:00+00:00

This exercise (video 2m 58s) shows a powerful way to run only a single test, or some subset of tests, by using the @tag decorator available in the TDDA library. This is useful for speeding up the test cycle and allowing you to focus on a single test, or a few tests. We will also see, in the next exercise, how it can be used to update test results more easily and safely when expected behaviour changes.

(If you do not currently use pytest for writing tests, you might prefer the unittest-flavoured version of this exercise, since unittest is in Python's standard library.)

Prerequisites

★ You need to have the TDDA Python library (version 1.0.31 or newer) installed see installation. Use

tdda version

to check the version that you have.

Step 1: Copy the exercises (if you don't already have them)

You need to change to some directory in which you're happy to create three directories with data. We are use ~/tmp for this. Then copy the example code.

$ cd ~/tmp
$ tdda examples    # copy the example code

Step 2: Go the exercise files and examine them:

$ cd referencetest_examples/exercises-pytest/exercise2  # Go to exercise2

As in the first exercise, you should have at least the following four files:

$ ls
conftest.py expected.html   generators.py   test_all.py

conftest.html is configuration to extend pytest with referencetest capabilities,
expected.html contains the expected output from one test,
generators.py contains the code to be tested,
test_all.py contains the tests.

If you look at test_all.py, you'll see it contains five test functions. Only one of the tests is useful (testExampleStringGeneration) with all the others making manifestly true assertions and most of them deliberately wasting time to simulate annoyingly slow tests.

from generators import generate_string

def testZero():
    assert True

def testOne():
    time.sleep(1)
    assert 1 == 1

def testExampleStringGeneration(ref):
    actual = generate_string()
    ref.assertStringCorrect(actual, 'expected.html')

def testTwo():
    time.sleep(2)
    assert 2 == 2

def testThree():
    time.sleep(3)
    assert 3 == 3

Step 3: Run the tests, which should be slow and produce one failure

$ pytest           #  This will work with Python 3 or Python2

When you run the tests, you should get a single failure, that being the non-trivial test testExampleStringGeneration.

The output will be something like:

============================= test session starts ==============================
test_all.py ..F..

[...details of test failure...]

====================== 1 failed, 4 passed in 6.17 seconds ======================

We get a test failure because we haven't added the ignore_substrings parameter that we saw in Exercise 1 is needed for it to pass.

The tests should take slightly over 6 seconds in total to run, because of the three annoyingly slow tests with sleep statements in them—testOne, testTwo and testThree. (If you're not annoyed by a 6-second delay, increase the sleep time in one of the "sleepy" tests until you are annoyed!)

The point of this exercise is to show some simple but very useful functionality for running only tests on which we wish to focus, such as our failing test.

Step 4: Tag the failing test using @tag

The TDDA library includes a function called tag; this is a decorator function¹ that we can put before individual tests, to mark them as being of special interest temporarily.

Edit test_all.py to decorate the failing test by an import statement to bring in tag from the TDDA library, and then decorate the definition of testStringFunction by preceding it with @tag as follows:

from tdda.referencetest import tag

def testZero():
    assert True

def testOne():
    time.sleep(1)
    assert 1 == 1

@tag
def testExampleStringGeneration(ref):
    actual = generate_string()
    ref.assertStringCorrect(actual, 'expected.html')

Step 5: Run only the tagged test

Having tagged the failing test, if we run the tests again adding --tagged to the command, it will run only the tagged test, and take hardly any time. The (abbreviated) output should be something like

============================= test session starts ==============================
$ pytest --tagged
test_all.py F

[...details of test failure...]

=========================== 1 failed in 0.16 seconds ===========================

We can tag as many tests as we like, across any number of test files, to run a subset of tests, rather than a single one.

Step 6: Locating @tag decorators

Although it's not hard to use grep or grep -r to find them, the library can actually do this for you. If you use the --istagged flag instead of running the tests, the library will report which test classes in which files have tagged tests. So in our case:

$ pytest --istagged
============================= test session starts ==============================
platform darwin -- Python 3.7.3, pytest-4.4.0, py-1.8.0, pluggy-0.9.0
rootdir: /Users/njr/tmp/referencetest_examples/exercises-pytest/exercise2
collecting ...
test_all.testExampleStringGeneration
collected 5 items

========================= no tests ran in 0.01 seconds =========================

Obviously, in the case of a single test file, this is not a big deal, but if you have dozens or hundreds of source files, in a directory hierarchy, and have tagged a few functions across them, it becomes significantly more helpful.

Recap: What we have seen

This simple exercise has shown how we can easily run subsets of tests by tagging them and then using --tagged to run only tagged tests.

In this case, the motivation was simply to save time and reduce clutter in the output, focusing on one test, or a small number of tests.

In the Exercise 3, we will see how this combines with the ability to automatically regenerate updated reference outputs to make for a safe and efficient way to update tests after code changes.

Decorator functions in Python are functions that are used to transform other functions: they take a function as an argument and return a new function that modifies the original in some way. Out decorator function tag is called by writing @tag on the line before function (or class) definition, and the effect of this is that the function returned by @tag replaces the function (or class) it precedes. In our case, all @tag does is set an attribute on the function in question so that the TDDA reference test framework can identify it as a tagged function, and choose to run only tagged tests when so requested. ↩

Reference Testing Exercise 2 (unittest flavour)

2019-10-30T08:30:00+00:00

This exercise (video 3m 34s) shows a powerful way to run only a single test, or some subset of tests, by using the @tag decorator available in the TDDA library. This is useful for speeding up the test cycle and allowing you to focus on a single test, or a few tests. We will also see, in the next exercise, how it can be used to update test results more easily and safely when expected behaviour changes.

(If you use pytest for writing tests, you might prefer the pytest-flavoured version of this exercise.)

Prerequisites

★ You need to have the TDDA Python library (version 1.0.31 or newer) installed see installation. Use

tdda version

to check the version that you have.

Step 1: Copy the exercises (if you don't already have them)

You need to change to some directory in which you're happy to create three directories with data. We are use ~/tmp for this. Then copy the example code.

$ cd ~/tmp
$ tdda examples    # copy the example code

Step 2: Go the exercise files and examine them:

$ cd referencetest_examples/exercises-unittest/exercise2  # Go to exercise2

As in the first exercise, you should have at least the following three files

$ ls
expected.html   generators.py   test_all.py

expected.html contains the expected output from one test,
generators.py contains the code to be tested,
test_all.py contains the tests.

If you look at test_all.py, you'll see it contains two test classes with five tests between them. Only one of the tests is useful (testExampleStringGeneration) with all the others making manifestly true assertions and deliberately wasting time to simulate annoyingly slow tests.

import time
from tdda.referencetest import ReferenceTestCase, tag
from generators import generate_string

class TestQuickThings(ReferenceTestCase):

    def testExampleStringGeneration(self):
        actual = generate_string()
        self.assertStringCorrect(actual, 'expected.html')

    def testZero(self):
        self.assertIsNone(None)


class TestSuperSlowThings(ReferenceTestCase):

    def testOne(self):
        time.sleep(1)
        self.assertEqual(1, 1)

    def testTwo(self):
        time.sleep(2)
        self.assertEqual(2, 2)

    def testThree(self):
        time.sleep(3)
        self.assertEqual(3, 3)

Step 3: Run the tests, which should be slow and produce one failure

$ python test_all.py   #  This will work with Python 3 or Python2

When you run the tests, you should get a single failure, that being the non-trivial test testExampleStringGeneration from the class TestQuickThings.

The output will be:

F....

[...details of test failure...]

Ran 5 tests in 6.007s
FAILED (failures=1)

We get a test failure because we haven't added the ignore_substrings parameter that we saw in Exercise 1 is needed for it to pass.

The tests should take slightly over 6 seconds in total to run, because of the annoyingly slow tests in TestSuperSlowThings. (If you're not annoyed by a 6-second delay, increase the sleep time in one of the "slow" tests until you are annoyed!)

The point of this exercise is to show some simple but very useful functionality for running only tests on which we wish to focus, such as our failing test.

Step 4: Tag the failing test using @tag

If you look at the import statements, you'll see that as well as ReferenceTestCase we also import tag. This is a decorator function¹ that we can put before individual tests, or test classes, to indicate that they are of special interest temporarily.

Edit test_all.py to decorate the failing test by adding @tag on the line before it, thus:

class TestQuickThings(ReferenceTestCase):

    @tag
    def testExampleStringGeneration(self):
        actual = generate_string()
        self.assertStringCorrect(actual, 'expected.html')

    def testZero(self):
        self.assertIsNone(None)

Step 5: Run only the tagged test

Having tagged the failing test, if we run the tests again adding -1 (the digit one, for "single",not the letter ell) to the command, it will run only the tagged test, and take hardly any time. The (abbreviated) output should be something like

$ python test_all.py -1
F

[...details of test failure...]

Ran 1 tests in 0.006s
FAILED (failures=1)

You can also use --tagged instead of -1 if you like more descriptive flags.

We can tag as many tests as we like, across any number of test files, and we can also tag whole classes by placing the @tag decorator before a test class definition. So if we instead use:

@tag
class TestQuickThings(ReferenceTestCase):

    def testExampleStringGeneration(self):
        actual = generate_string()
        self.assertStringCorrect(actual, 'expected.html')

    def testZero(self):
        self.assertIsNone(None)

and run the tests with -1, we will get output more like:

$ python test_all.py -1
F.

[...details of test failure...]

Ran 2 tests in 0.006s
FAILED (failures=1)

In this case, both the tests in our first test class were run, but no others (and, in particular, not our painfully slow tests!)

Step 6: Locating @tag decorators

In a typical debugging or test development cycle in which you have been using the @tag decorator to focus on just a few failing tests, you might end up with @tag decorations scattered across several files, perhaps in multiple directories. (We're assuming here you have test_all.py or similar that imports all the other test classes so you can easily run them all together.)

Although it's not hard to use grep or grep -r to find them, the library can actually do this for you. If you use the -0 flag (the digit zero, for "no tests"), or the --istagged flag, instead of running the tests, the library will report which test classes in which files have tagged tests. So in our case:

$ python test_all.py -0

produces:

__main__.TestQuickThings

Here, __main__ stands for the current file; other files would be referenced by their imported name.

Recap: What we have seen

This simple exercise has shown how we can easily run subsets of tests by tagging them and then using the -1 flag (or --tagged) to run only tagged tests.

In this case, the motivation was simply to save time and reduce clutter in the output, focusing on one test, or a small number of tests.

In the Exercise 3, we will see how this combines with the ability to automatically regenerate updated reference outputs to make for a safe and efficient way to update tests after code changes.

Decorator functions in Python are functions that are used to transform other functions: they take a function as an argument and return a new function that modifies the original in some way. Out decorator function tag is called by writing @tag on the line before function (or class) definition, and the effect of this is that the function returned by @tag replaces the function (or class) it precedes. In our case, all @tag does is set an attribute on the function in question so that the TDDA reference test framework can identify it as a tagged function, and choose to run only tagged tests when so requested. ↩

Reference Testing Exercise 1 (pytest flavour)

2019-10-29T12:00:00+00:00

This exercise (video 8m 54s) shows how to migrate a test from using pytest directly to the exploiting the referencetest capabilities in the TDDA library. (If you do not currently use pytest for writing tests, you might prefer the unittest-flavoured version of this exercise, since unittest is in Python's standard library.)

We will see how even simple use of referencetest

makes it much easier to see how tests have failed when complex outputs are generated
helps us to update reference outputs (the expected values) when we have verified that a new behaviour is correct
allows us easily to write tests of code whose outputs are not identical from run to run. We do this by specifying exclusions from the comparisons used in assertions.

Prerequisites

★ You need to have the TDDA Python library installed (version 1.0.31 or newer) see installation. Use

tdda version

to check the version that you have.

Step 1: Copy the exercises

You need to change to some directory in which you're happy to create three new directories with data. We are use ~/tmp for this. Then copy the example code.

$ cd ~/tmp
$ tdda examples    # copy the example code

Step 2: Go the exercise files and examine them:

$ cd referencetest_examples/exercises-pytest/exercise1  # Go to exercise1

You should have at least the following four files:

$ ls
conftest.py expected.html   generators.py   test_all.py

generators.py contains a function called generate_string that, when called, returns HTML text suitable for viewing as a web page.
expected.html is the result of calling that function, saved to file
test_all.py contains a single unittest-based test of that file.
conftest.py imports key referencetest functionality from the tdda library into pytest.

It's probably useful to look at the web page expected.html in a browser, either by navigating to it in a file browser and double clicking it, or by using

open expected.html

if your OS supports this. As you can see, it's just some text and an image. The image is an inline SVG vector image, generated along with the text.

Also have a look at the test code. The core part of it is very short:

from generators import generate_string

def testExampleStringGeneration():
    actual = generate_string()
    with open('expected.html') as f:
        expected = f.read()
    assert actual == expected

The code

calls generate_string() to create the content
stores its output in the variable actual
reads the expected content into the variable expected
asserts that the two strings are the same.

Step 3. Run the test, which should fail

$ pytest      #  This will whether pytest uses Python2 or Python3

You should get a failure, and pytest tries quite hard to show what's causing the failure:

=================================== FAILURES ===================================
_________________________ testExampleStringGeneration __________________________

    def testExampleStringGeneration():
        actual = generate_string()
        with open('expected.html') as f:
            expected = f.read()
>       assert actual == expected
E       AssertionError: assert '<!DOCTYPE ht...y>\n</html>\n' == '<!DOCTYPE htm...y>\n</html>\n'
E         Skipping 69 identical leading characters in diff, use -v to show
E         -  Solutions, 2016
E         +  Solutions Limited, 2016
E         ?           ++++++++
E         -     Version 1.0.0
E         ?             ^
E         +     Version 0.0.0...
E
E         ...Full output truncated (31 lines hidden), use '-vv' to show

test_all.py:24: AssertionError
=========================== 1 failed in 0.11 seconds ===========================

You can certainly see that there's a different in the Version number in the output and also a line including 2016 (a copyright notice, in fact).

But it also says:

...Full output truncated (31 lines hidden), use '-vv' to show

and if you do that, the output becomes a bit overwhelming.

We'll convert the test to use the TDDA libraries referencetest and see how that helps.

Step 4. Change the code to use referencetest.

The key change we need to make is the to the assertion, which will now be:

ref.assertStringCorrect(actual, 'expected.html')

ref is object made available by conftest.py, and is passed into our test function by pytest. We therefore need to change the function declaration to take ref as an argument:

def testExampleStringGeneration(ref):

Finally, because assertStringCorrect compares a string in memory to content from a file, we don't need the lines in the middle that read the file:

* Delete the middle two lines of the test function.

The result is:

from generators import generate_string

def testExampleStringGeneration(ref):
    actual = generate_string()
    ref.assertStringCorrect(actual, 'expected.html')

Step 5. Run the modified test

$ pytest

You should see very different output, that includes, near the end, something like this:

E       AssertionError: 2 lines are different, starting at line 5
E       Expected file expected.html
E       Compare raw with:
E           diff /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html expected.html
E
E       Compare post-processed with:
E           diff /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-expected.html /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/expected-expected.html

/Users/njr/python/tdda/tdda/referencetest/referencepytest.py:187: AssertionError

(You will probably need to scroll right to see all of the message on this page.)

Because the test failed, the TDDA library has written a copy of the actual ouput to file to make it easy for us to examine it and to use diff commands to see how it actually differs from what we expected. (In fact, it's written out two copies, a "raw" and a "post-precocessed" one, but we haven't used any processing, so they will be the same in our case. So we ignore the second diff command suggested for now.)

It's also given us the precise diff command we need to see the differences between our actual and expected output.

Step 6. Copy the first diff command and run it. You should see something similar to this:

$ diff /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html expected.html
5,6c5,6
<     Copyright (c) Stochastic Solutions, 2016
<     Version 1.0.0
—
>     Copyright (c) Stochastic Solutions Limited, 2016
>     Version 0.0.0
35c35
< </html>
\ No newline at end of file
—
> </html>

(If you have a visual diff tool, can also use that. For example, on a Mac, if you have Xcode installed, you should have the opendiff command available.)

The diff makes it clear that there are three differences:

The copyright notice has changed slightly
The version number has changed
The string doesn't have a newline at the end, whereas the file does.

The Copyright and version numbers lines are both in comments in the HTML, so don't affect the rendering at all. You might want to confirm that if you look at the actual file it saved (/var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html, the first file in the diff command), you should see that it looks identical.

In this case, therefore, we might now feel that we should simply update expected.html with what generate_string() is now producing. It would be (by design) extremely easy to change the diff in the command it gave is to cp to achieve that.

However, there's better thing we can do in this case.

Step 7. Specify exclusions

Standing back, it seems obvious likely that periodically the version number and Copyright line written to comments in the HTML will change. If the only difference between out expected output and what we actually generate are those, we'd probably prefer the test didn't fail.

The ref.assertStringCorrect function from referencetest gives us several mechanisms for specifying changes that can be ignored when checking whether a string is correct. The simplest one, which will be enough for our example, is just to specify strings which, if they occur on a line in the output, case differences in those lines to be ignored, so that the assertion doesn't fail.

** Step 7a. Add the ignore_substrings parameter to assertStringCorrect as follows:**

        ref.assertStringCorrect(actual, 'expected.html',
                                ignore_substrings=['Copyright', 'Version'])

Step 7b. Run the test again. It should now pass:

$ pytest
============================= test session starts ==============================

test_all.py .                                                            [100%]

=========================== 1 passed in 0.04 seconds ===========================

Recap: What we have seen

We've seen

Converting standard pytest-based tests to use referencetestcase is straightfoward.
When we do that, we gain access to powerful new kinds of assertion such as assertStringCorrect. Among the immediate benefits:
- When there is failure, this saves the failing output to a temporary file
- It tells you the exact diff command you need to see be able to see differences
- This also makes it very easy to copy the new "known good" answer into place if you've verified that the new answer is now correct. (In fact, the library also has a more powerful way to do this, as we'll see in a later exercise).
The ref.assertStringCorrect fucntion also has a number of mechanisms for allowing specific expected differences to occur without causing the test to fail. The simplest of these mechanisms is the ignore_substrings keyword argument we used here.

Reference Testing Exercise 1 (unittest flavour)

2019-10-28T08:15:00+00:00

This exercise (video 8m 53s) shows how to migrate a test from using unittest directly to the exploiting the referencetest capabilities in the TDDA library. (If you use pytest for writing tests, you might prefer the pytest-flavoured version of this exercise.)

We will see how even simple use of referencetest

makes it much easier to see how tests have failed when complex outputs are generated
helps us to update reference outputs (the expected values) when we have verified that a new behaviour is correct
allows us easily to write tests of code whose outputs are not identical from run to run. We do this by specifying exclusions from the comparisons used in assertions.

Prerequisites

★ You need to have the TDDA Python library (version 1.0.31 or newer) installed see installation. Use

tdda version

to check the version that you have.

Step 1: Copy the exercises

You need to change to some directory in which you're happy to create three new directories with data. We are use ~/tmp for this. Then copy the example code.

$ cd ~/tmp
$ tdda examples    # copy the example code

Step 2: Go the exercise files and examine them:

$ cd referencetest_examples/exercises-unittest/exercise1  # Go to exercise1

You should have at least the following three files:

$ ls
expected.html   generators.py   test_all.py

generators.py contains a function called generate_string that, when called, returns HTML text suitable for viewing as a web page.
expected.html is the result of calling that function, saved to file
test_all.py contains a single unittest-based test of that file.

It's probably useful to look at the web page expected.html in a browser, either by navigating to it in a file browser and double clicking it, or by using

open expected.html

if your OS supports this. As you can see, it's just some text and an image. The image is an inline SVG vector image, generated along with the text.

Also have a look at the test code. The core part of it is very short:

import unittest

from generators import generate_string

class TestFileGeneration(unittest.TestCase):
    def testExampleStringGeneration(self):
        actual = generate_string()
        with open('expected.html') as f:
            expected = f.read()
        self.assertEqual(actual, expected)

if __name__ == '__main__':
    unittest.main()

The code

calls generate_string() to create the content
stores its output in the variable actual
reads the expected content into the variable expected
asserts that the two strings are the same.

Step 3. Run the test, which should fail

$ python test_all.py   #  This will work with Python 3 or Python2

You should get a failure, but it will probably be quite hard to see exactly what the differences are.

We'll convert the test to use the TDDA libraries referencetest and see how that helps.

Step 4. Change the code to use referencetest.

First we need our test to use ReferenceTestCase from tdda.referencetest instead of unittest.TestCase. ReferenceTestCase is a subclass of unittest.TestCase.

Change the import statement to from tdda.referencetest import ReferenceTestCase
Replace unittest.TestCase with ReferenceTestCase in the class declaration
Replace unittest.main() with ReferenceTestCase.main()

The result is:

from tdda.referencetest import ReferenceTestCase

from generators import generate_string

class TestFileGeneration(ReferenceTestCase):
    def testExampleStringGeneration(self):
        actual = generate_string()
        with open('expected.html') as f:
            expected = f.read()
        self.assertEqual(actual, expected)

if __name__ == '__main__':
    ReferenceTestCase.main()

If you run this, it's behaviour should be exactly the same, because we haven't used any of the extra features of tdda.referencetest yet.

Step 5. Change the assertion to use assertStringCorrect

TDDA's ReferenceTestCase provides the assertStringCorrect method, which expects as its first positional arguments an actual string and the path to a file containing the expected result. So:

Change assertEqual to assertStringCorrect
Change expected to expected.html as the second argument to the assertion

Delete the two lines reading the file and assigning to expected as we no longer need that.

def testExampleStringGeneration(self):
    actual = generate_string()
    self.assertStringCorrect(actual, 'expected.html')

Step 6. Run the modified test

$ python test_all.py

You should see very different output, that includes, near the end, something like this:

Expected file expected.html
Compare raw with:
    diff /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html expected.html

Compare post-processed with:
    diff /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-expected.html /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/expected-expected.html

It's also given us the precise diff command we need to see the differences between our actual and expected output.

Step 6a. Copy the first diff command and run it. You should see something similar to this:

$ diff /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html expected.html
5,6c5,6
<     Copyright (c) Stochastic Solutions, 2016
<     Version 1.0.0
—
>     Copyright (c) Stochastic Solutions Limited, 2016
>     Version 0.0.0
35c35
< </html>
\ No newline at end of file
—
> </html>

(If you have a visual diff tool, can also use that. For example, on a Mac, if you have Xcode installed, you should have the opendiff command available.)

The diff makes it clear that there are three differences:

The copyright notice has changed slightly
The version number has changed
The string doesn't have a newline at the end, whereas the file does.

However, there's better thing we can do in this case.

Step 7. Specify exclusions

The assertStringCorrect method from referencetest gives us several mechanisms for specifying changes that can be ignored when checking whether a string is correct. The simplest one, which will be enough for our example, is just to specify strings which, if they occur on a line in the output, case differences in those lines to be ignored, so that the assertion doesn't fail.

** Step 7a. Add the ignore_substrings parameter to assertStringCorrect as follows:**

        self.assertStringCorrect(actual, 'expected.html',
                                 ignore_substrings=['Copyright', 'Version'])

Step 7b. Run the test again. It should now pass:

$ python3 test_all.py
.
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK

Recap: What we have seen

We've seen

Converting unittest-based tests to use ReferenceTestCase is straightfoward.
When we do that, we gain access to powerful new assert methods such as assertStringCorrect. Among the immediate benefits:
- When there is failure, this method saves the failing output to a temporary file
- It tells you the exact diff command you need to see be able to see differences
- This also makes it very easy to copy the new "known good" answer into place if you've verified that the new answer is now correct. (In fact, the library also has a more powerful way to do this, as we'll see in a later exercise).
The assertStringCorrect method also has a number of mechanisms for allowing specific expected differences to occur without causing the test to fail. The simplest of these mechanisms is the ignore_substrings keyword argument we used here.

Screencasts and Exercises

2019-10-25T15:00:00+01:00

We've started producing a series of exercises for various aspects of TDDA, available on the blog, with follow-along screencasts.

There will be a series of posts about these, starting on Monday (28th October). There's a YouTube channel as well, if you want to subscribe.

The goal has been for each exercise to be as short and simple as it can reasonably be while still covering useful aspects.

The first set of exercises will cover the reference testing capabilities of TDDA, and at least some of them will be available both as unittest-favoured versions and pytest variants. If you don't currently use either, you probably want to follow the unittest variants, since unittest is part of Python's standard library.

There's a page for the exercises at:

tdda.info/exercises

which we'll try to keep up-to-date as we add more.

Please note: if you want to do the exercises, you'll need the latest TDDA release, and as we add more (unfortunately) you'll probably need to update each time we add new exercises, with something like

pip install -U tdda

python3 -m pip install -U tdda

depending on your setup. See the installation instructions for details.

Installation

2019-10-24T15:00:00+01:00

This post is a standing post that we plan to try to keep up to date, describing options for obtaining the open-source Python TDDA library that we maintain.

Using pip from PyPI

If you don't need source, and have Python installed, the easiest way to get the TDDA library is from the Python package index PyPI using the pip utility.

Assuming you have a working pip setup, you should be able to install the tdda library by typing:

pip install tdda

or, if your permissions don't allow use in this mode

sudo pip install tdda

If pip isn't working, or is associated with a different Python from the one you are using, try:

python -m pip install tdda

sudo python -m pip install tdda

The tdda library supports both Python 3 (tested with 3.6 and 3.7) and Python 2 (tested with 2.7). (We'll start testing against 3.8 real soon!)

Upgrading

If you have a version of the tdda library installed and want to upgrade it with pip, add -U to one of the command above, i.e. use whichever of the following you need for your setup:

pip install -U tdda
sudo pip install -U tdda
python -m pip install -U tdda
sudo python -m pip install -U tdda

Installing from Source

The source for the tdda library is available from Github and can be cloned with

git clone https://github.com/tdda/tdda.git

git clone git@github.com:tdda/tdda.git

When installing from source, if you want the command line tdda utility to be available, you need to run

python setup.py install

from the top-level tdda directory after downloading it.

Documentation

The main documentation for the tdda library is available on Read the Docs.

You can also build it youself if you have downloaded the source from Github. In order to do this, you will need an installation of Sphinx. The HTML documentation is built, starting from the top-level tdda directory by running:

cd doc
make html

Running TDDA's tests

Once you have installed TDDA (whether using pip or from source), you can run its tests by typing

tdda test

If you have all the dependencies, including optional dependencies, installed, you should get a line of dots and the message OK at the end, something like this:

$ tdda test
........................................................................................................................
----------------------------------------------------------------------
Ran 122 tests in 3.251s

OK

If you don't have some of the optional dependencies installed, some of the dots will be replaced by the letter 's'. For example:

$ tdda test
.................................................................s.............................s........................
----------------------------------------------------------------------
Ran 120 tests in 3.221s

OK (skipped=2)

This does not indicate a problem, and simply means there will be some of the functionality unavailable (e.g. usually one or more database types).

Using the TDDA examples

The tdda library includes three sets of examples, covering reference testing, automatic constraint discovery and verification, and Rexpy (discovery of regular expressions from examples, outside the context of constraints).

The tdda command line can be used to copy the relevant files into place. To get the examples, first change to a directory where you would like them to be placed, and then use the command:

tdda examples

This should produce the following output:

Copied example files for tdda.referencetest to ./referencetest-examples
Copied example files for tdda.constraints to ./constraints-examples
Copied example files for tdda.rexpy to ./rexpy-examples

Quick Reference Guides

There is a quick reference guides available for the TDDA library. These are often a little behind the current release, but are usually still quite helpful.

These are available from here.

Online Tutorials

Various videos of tutorials, and accompanying slides, are available online. Exercises with screencasts are under development, and we hope to begin to release these shortly.

Rexpy for Generating Regular Expressions: Postcodes

2019-02-20T18:20:00+00:00

Rexpy is a powerful tool we created that generates regular expressions from examples. It's available online at https://rexpy.herokuapp.com and forms part of our open-source TDDA library.

Miró users can use the built-in rex command.

This post illustrates using Rexpy to find regular expressions for UK postcodes.

A regular expression for Postcodes

If someone asked you what a UK postcode looks like, and you don't live in London, you'd probably say something like:

A couple of letters, then a number then a space, then a number then a couple of letters.

About the simplest way to get Rexpy to generate a regular expression is to give it at least two examples. You can do this online at https://rexpy.herokuapp.com or using the open-source TDDA library.

If you give it EH1 3LH and BB2 5NR, Rexpy generates [A-Z]{2}\d \d[A-Z]{2}, as illustrated here, using the online version of rexpy:

This is the regular-expression equivalent of what we said:

[A-Z]{2} means exactly two ({2}) characters from the range [A-Z], i.e. two capital letters
\d means a digit (which is the same as [0-9]—two characters from the range 0 to 9)
the gap () is a space character
\d is another digit
[A-Z]{2} is two more letters.

This doesn't cover all postcodes, but it's a good start.

Other cases

Any easy way to try out the regular expression we generated is to use the grep command¹. This is built into all Unix and Linux systems, and is available on Windows if you install a Linux distribution under WSL.

If we try matching a few postcodes using this regular expression, we'll see that many—but not all—postcodes match the pattern.

On Linux, the particular variant of grep we need is grep -P, to tell it we're using Perl-style regular expressions.
On Unix (e.g. Macintosh), we need to use grep -E (or egrep) to tell it we're using "extended" regular expressions

If we write a few postcodes to a file:

$ cat > postcodes
HA2 6QD
IP4 2LS
PR1 9BW
BB2 5NR
G1 9PU
DH9 6DU
RG22 4EX
EC1A 1AB
OL14 8DQ
CT2 7UD

we can then use grep to find the lines that match:

$ grep -E '[A-Z]{2}\d \d[A-Z]{2}' postcodes
HA2 6QD
IP4 2LS
PR1 9BW
BB2 5NR
DH9 6DU
CT2 7UD

(Use -P instead of -E on Linux.)

More relevantly, for present purposes, we can also add the -v flag, to ask the match to be "inVerted", i.e. to show lines that fail to match:

$ grep -v -E '[A-Z]{2}\d \d[A-Z]{2}' postcodes
G1 9PU
RG22 4EX
EC1A 1AB
OL14 8DQ

The first of these, a Glasgow postcode, fails because it only has a single letter at the start.
The second and fourth fail because they have two digits after the letters.
The third fails because it's a London postcode with an extra letter, A after the EC1.

Let's add an example of each in turn:

If we first add the Glasgow postcode, Rexpy generates ^[A-Z]{1,2}\d \d[A-Z]{2}$.

Here [A-Z]{1,2} in brackets means 1–2 capital letters, and we've checked the anchor checkbox, to get it to add in ^ at the start and $ at the end of the regular expression.² If we use this with our grep command, we get:

$ grep -v -E '^[A-Z]{1,2}\d \d[A-Z]{2}$' postcodes
RG22 4EX
EC1A 1AB
OL14 8DQ

If we now add in an example with two digits in the first part of the postcode—say RG22 4EX—rexpy further refines the expression to ^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$, which is good for all(?) non-London postcodes. If we repeat the grep with this new pattern:

$ grep -v -E '^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$' postcodes
EC1A 1AB

only the London example now fails.

In a perfect world, just by adding EC1A 1AB, Rexpy would produce our ideal regular expression—something like ^[A-Z]{1,2}\d[A-Z]? \d[A-Z]{2}$. (Here, the ? is the equivalent to {0,1}, meaning that the term before can occur zero times or once, i.e. it is optional.)

Unfortunately, that's not what happens. Instead, Rexpy produces:

^[A-Z0-9]{2,4} \d[A-Z]{2}$

Unfortunately, Rexpy has concluded that the first part is just a jumble of capital letters and numbers and is saying that the first part can be any mixture of 2-4 letters and numbers.

In this case, we'd probably fix up the regular expression by hand, or separately pass in the special Central London postcodes and all the rest. If we feed in a few London postcodes on their own, we get:

^[A-Z]{2}\d[A-Z] \d[A-Z]{2}$

which is also a useful start.

Have fun with Rexpy!

By the way: if you're in easy reach of Edinburgh, we're running a training course on the TDDA library as part of the Fringe of the Edinburgh DataFest, on 20th March. This will include use of Rexpy. You should come!

grep stands for global regular expression print, and the e in egrep stands for extended. ↩
Sometimes, regular expressions match any line that contains the pattern anywhere in them, rather than requiring the pattern to match the whole line. In such cases, using the anchored form of the regular expression, ^[A-Z]{2}\d \d[A-Z]{2}$, means that matching lines must not contain anything before or after the text that matches the regular expression. (You can think of ^ as matching the start of the string, or line, and $ as matching the end.) ↩

Tagging PyTest Tests

2018-05-22T12:00:00+01:00

A recent post described the new ability to run a subset of ReferenceTest tests from the tdda library by tagging tests or test classes with the @tag decorator. Initially, this ability was only available for unittest-based tests. From version 1.0 of the tdda library, now available, we have extended this capability to work with pytest.

This post is very similar to the previous one on tagging unittest-based tests, but adapted for pytest.

Overview

A decorator called tag can be imported and used to decorate individual tests or whole test classes (by preceding the test function or class with @tag).
When pytest is run using the --tagged option, only tagged tests and tests from tagged test classes will be run.
There is a second new option, --istagged. When this is used, the software will report which test classes are tagged, or contain tests that are tagged, but will not actually run any tests. This is helpful if you have a lot of test classes, spread across different files, and want to change the set of tagged tests.

Benefits

The situations where we find this particularly helpful are:

Fixing a broken test or working on a new feature or dataset. We often find ourselves with a small subset of tests failing (perhaps, a single test) either because we're adding a new feature, or because something has changed, or because we are working with data that has slightly different characteristics. If the tests of interest run in a few seconds, but the whole test suite takes minutes or hours to run, we can iterate dramatically faster if we have an easy way to run only the subset of tests currently failing.
Re-writing test output. The tdda library provides the ability to re-write the expected ("reference") output from tests with the actual result from the code, using the --write-all command-line flag. If it's only a subset of the tests that have failed, there is real benefit in re-writing only their output. This is particularly true if the reference outputs contain some differences each time (version numbers, dates etc.) that are being ignored using the ignore-lines or ignore-patterns options provided by the library. If we regenerate all the test outputs, and then look at which files have changed, we might see differences in many reference files. In contrast, if we only regenerate the tests that need to be updated, we avoid committing unnecessary changes and reduce the likelihood of overlooking changes that may actually be incorrect.

Prerequisites

In order to use the reference test functionality with pytest, you have always needed to add some boilerplate code to conftest.py in the directory from which you are running pytest. To use the tagging capability, you need to add one more function definition, pytest_collection_modifyitems.

The recommended imports in conftest.py are now:

from tdda.referencetest.pytestconfig import (pytest_addoption,
                                             pytest_collection_modifyitems,
                                             set_default_data_location,
                                             ref)

conftest.py is also a good place to set the reference file location if you want to do so using set_default_data_location.

Example

We'll illustrate this with a simple example. The code below implements four trivial tests, two in a class and two as plain functions.

Note the import of the tag decorator function near the top, and that test_a and the class TestClassA are decorated with the @tag decorator.

### test_all.py

from tdda.referencetest import tag

@tag
def test_a(ref):
    assert 'a' == 'a'

def test_b(ref):
    assert 'b' == 'b'

@tag
class TestClassA:
    def test_x(self):
        assert 'x' * 2 == 'x' + 'x'

    def test_y(self):
        assert 'y' > 'Y'

If we run this as normal, all four tests run and pass:

$ pytest
============================= test session starts ==============================
platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 4 items

test_all.py ....

=========================== 4 passed in 0.02 seconds ===========================

But if we add the –tagged flag, only three tests run:

$ pytest --tagged
============================= test session starts ==============================
platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 4 items

test_all.py ...

=========================== 3 passed in 0.02 seconds ===========================

Adding the –-verbose flag confirms that these three are the tagged test and the tests in the tagged class, as expected:

$ pytest --tagged --verbose
============================= test session starts ==============================
platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0 -- /usr/local/Cellar/python/3.5.1/bin/python3.5
cachedir: .cache
rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 4 items

test_all.py::test_a PASSED
test_all.py::TestClassA::test_x PASSED
test_all.py::TestClassA::test_y PASSED

=========================== 3 passed in 0.01 seconds ===========================

Finally, if we want to find out which classes include tagged tests, we can use the --istagged flag:

pytest --istagged
============================= test session starts ==============================
platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 4 items

test_all.test_a
test_all.TestClassA

========================= no tests ran in 0.01 seconds =========================

This is particularly helpful when our tests are spread across multiple files, as the filenames are then shown as well as the class names.

Installation

Information about installing the library is available in this post.

Other Features

Other features of the ReferenceTest capabilities of the tdda library are described in this post. Its capabilities in the area of constraint discovery and verification are discussed in this post, and this post.

Detecting Bad Data and Anomalies with the TDDA Library (Part I)

2018-05-04T10:30:00+01:00

The test-driven data analysis library, tdda, has two main kinds of functionality

support for testing complex analytical processes with unittest or pytest
support for verifying data against constraints, and optionally for discovering such constraints from example data.

Until now, however, the verification process has only reported which constraints failed to be satisfied by a dataset.

We have now extended the tdda library to allow identification of individual failing records, allowing it to act as a general purpose anomaly detection framework.

The new functionality is available through a new detect_df API call, and from the command line with the new tdda detect command.

The diagram shows conceptually how detection works, separating out anomalous records from the rest.

With the TDDA framework, anomalous simply means fails at least one constraint. We'll discuss cases in which the constraints have been developed to try to model some subset of data of interest (defects, high-risk applicants, heart arrythmias, flawless diamonds, model patients etc.) in part II of this post. In those cases, we start to be able to discuss classifications such as true and false positives, and true and false negatives.

Example Usage from the Command Line

Suppose we have a simple transaction stream with just three fields, id, category and price, like this:

       id    category     price
710316821       QT       150.39
516025643       AA       346.69
414345845       QT       205.83
590179892       CB        55.61
117687080       QT       142.03
684803436       AA       152.10
611205703       QT        39.65
399848408       AA       455.67
289394404       AA       102.61
863476710       AA       297.82
534170200       KA        80.96
898969231       QT        81.39
255672456       QT        71.67
133111344       TB       229.19
763476994       CB       338.40
769595502       QT       310.19
464477044       QT        54.41
675155634       QT       199.07
483511995       QT       209.53
416094320       QT        83.31

and the following constraints (which might have been created by hand, or generated using the tdda discover command).

id (integer): Identifier for item. Should not be null, and should be unique in the table
category (string): Should be one of “AA”, “CB”, “QT”, “KA” or “TB”
price (floating point value): unit price in pounds sterling. Should be non-negative and no more than 1,000.00.

This would be represented in a TDDA file with the following constraints.

{
  "fields": {
    "id": {
      "type": "int",
      "max_nulls": 0,
      "no_duplicates": true
    },
    "category": {
      "type": "string",
      "max_nulls": 0,
      "allowed_values":
        ["AA", "CB", "QT", "KA", "TB"]
    },
    "price": {
      "type": "real",
      "min": 0.0,
      "max": 1000.0,
      "max_nulls": 0
    }
  }
}

We can use the tdda verify command to verify a CSV file or a feather file¹ against these files, and get a summary of which constraints pass and fail. If our data is in the file items.feather and the JSON constraints are in constraints.tdda, and there are some violations we will get output exemplified by the following:

$ tdda verify items.feather constraints.tdda
FIELDS:

id: 1 failure  2 passes  type ✓  max_nulls ✓  no_duplicates ✗

category: 1 failure  2 passes  type ✓  max_nulls ✓  allowed_values ✗

price: 2 failures  2 passes  type ✓  min ✓  max ✗  max_nulls ✗

SUMMARY:

Constraints passing: 6
Constraints failing: 4

The new tdda detect command allows us to go further and find which individual records fail.

We can use the following command to write out a CSV file, bads.csv, containing the records that fail constraints:

$ tdda detect items.feather constraints.tdda bads.csv --per-constraint --output-fields

The flag --per-constraint tells the software to write out a boolean column for each constraint, indicating whether the record passed, and the --output-fields tells the software to include all the input fields in the output.

The result is the following CSV file:

id,category,price,id_nodups_ok,category_values_ok,price_max_ok,price_nonnull_ok,n_failures
113791348,TQ,318.63,true,false,true,true,1
102829374,AA,65.24,false,true,true,true,1
720313295,TB,1004.72,true,true,false,true,1
384044032,QT,478.65,false,true,true,true,1
602948968,TB,209.31,false,true,true,true,1
105983384,AA,8.95,false,true,true,true,1
444140832,QT,1132.87,true,true,false,true,1
593548725,AA,282.58,false,true,true,true,1
545398672,QT,1026.4,true,true,false,true,1
759425162,CB,1052.72,true,true,false,true,1
452691252,AA,1028.19,true,true,false,true,1
105983384,QT,242.64,false,true,true,true,1
102829374,KA,71.64,false,true,true,true,1
105983384,AA,10.24,false,true,true,true,1
405321922,QT,85.23,false,true,true,true,1
102829374,,100000.0,false,false,false,true,3
872018391,QT,51.69,false,true,true,true,1
862101984,QT,158.53,false,true,true,true,1
274332319,AA,1069.25,true,true,false,true,1
827919239,QT,1013.0,true,true,false,true,1
105983384,QT,450.68,false,true,true,true,1
102829374,,100000.0,false,false,false,true,3
872018391,QT,199.37,false,true,true,true,1
602948968,KA,558.73,false,true,true,true,1
328073211,CB,1031.67,true,true,false,true,1
405321922,TB,330.97,false,true,true,true,1
334193154,QT,1032.31,true,true,false,true,1
194125540,TB,,true,true,,false,1
724692620,TB,1025.81,true,true,false,true,1
862101984,QT,186.76,false,true,true,true,1
593548725,QT,196.56,false,true,true,true,1
384044032,AA,157.25,false,true,true,true,1

which, we can read a bit more easily if we reformat this (using · to denote nulls) as:

       id  category        price        id  category  price     price  n_failures
                                   _nodups   _values   _max  _nonnull
                                       _ok       _ok    _ok       _ok
113791348        TQ       318.63      true     false   true      true           1
102829374        AA        65.24     false      true   true      true           1
720313295        TB     1,004.72      true      true  false      true           1
384044032        QT       478.65     false      true   true      true           1
602948968        TB       209.31     false      true   true      true           1
105983384        AA         8.95     false      true   true      true           1
444140832        QT     1132.87       true      true  false      true           1
593548725        AA       282.58     false      true   true      true           1
545398672        QT     1,026.40      true      true  false      true           1
759425162        CB     1,052.72      true      true  false      true           1
452691252        AA     1,028.19      true      true  false      true           1
105983384        QT       242.64     false      true   true      true           1
102829374        KA        71.64     false      true   true      true           1
105983384        AA        10.24     false      true   true      true           1
405321922        QT        85.23     false      true   true      true           1
102829374         ·   100,000.00     false     false  false      true           3
872018391        QT        51.69     false      true   true      true           1
862101984        QT       158.53     false      true   true      true           1
274332319        AA     1,069.25      true      true  false      true           1
827919239        QT     1,013.00      true      true  false      true           1
105983384        QT       450.68     false      true   true      true           1
102829374         ·   100,000.00     false     false  false      true           3
872018391        QT       199.37     false      true   true      true           1
602948968        KA       558.73     false      true   true      true           1
328073211        CB     1,031.67      true      true  false      true           1
405321922        TB       330.97     false      true   true      true           1
334193154        QT     1,032.31      true      true  false      true           1
194125540        TB            ·      true      true      ·     false           1
724692620        TB     1,025.81      true      true  false      true           1
862101984        QT       186.76     false      true   true      true           1
593548725        QT       196.56     false      true   true      true           1
384044032        AA       157.25     false      true   true      true           1

Command Line Syntax

The basic form of the command-line command is:

$ tdda detect INPUT CONSTRAINTS OUTPUT

where

INPUT is normally either a .csv file, in a suitable format, or a .feather file¹ containing a DataFrame, preferably with an accompanying .pmm file¹
CONSTRAINTS is a JSON file containing constraints, usually with a .tdda suffix. This can be created by the tdda discover command, or edited by hand.
OUTPUT is again either a .csv or .feather file to be created with the output rows. If the pmmif library is installed, a .pmm metadata file will be generated alongside the .feather file, when .feather output is requested.

Options

Several command line options are available to control the detailed behaviour:

defaults: If no command-line options are supplied:
- only failing records will be written
- only a record identifier and the number of failing constraints will be written.
  - When the input is Pandas, the record identifier will be the index for the failing records;
  - when the input is a CSV file, the record identifier will be the row number, with the first row after the header being numbered 1.
--per-constraint when this is added, an _ok column will also be written for every constraint that has any failures, with true for rows that satisfy the contraint, false for rows that do not satisfy the constraint and a missing value where the constraint is inapplicable (which does not count as a failure).
--output-fields [FIELD1 FIELD2 ...] If the --output-fields flag is used without specifying any fields, all fields from the input will be included in the output. Alternatively, a space-separated list of fields may be provided, in which case only those will be included. Whenever this option is used, no index or row-number is written unless specifically requested
--write-all If this flag is used, all records from the input will be included in the output, including those that have no constraint failures.
--index This flag forces the writing of the index (for DataFrame inputs) or row number (for CSV inputs).
--int When writing boolean values to CSV files (either from input data or as per-constraint output fields), use 1 for true and 0 for false.

API Access

The detection functionality is also available through the TDDA library's API with a new detect_df function, which takes similar parameters to the command line. The corresponding call, with a DataFrame df in memory, would be:

from tdda.constraints import detect_df

verification = detect_df(df, 'constraints.tdda', per_constraint=True,
                         output_fields=[])
bads_df = verification.detected()

The feather file format is an interoperable way to save DataFrames from Pandas or R. Its aim is to preserve metadata better and be faster than CSV files. It has a few issues, particularly around types and nulls, and when available, we save a secondary .pmm file alongside .feather files which makes reading and writing them more robust when our extensions in the pmmif library are used. We'll do a future blog post about this, but if you install both feather and pmmif with pip install feather and pip install pmmif, and use featherpmm.write_dataframe, imported from pmmif, rather than feather.write_dataframe, you should get more robust behaviour. ↩↩↩

Saving Time Running Subsets of Tests with Tagging

2018-05-01T10:30:00+01:00

It is common, when working with tests for analytical processes, for test suites to take non-trivial amount of time to run. It is often helpful to have a convenient way to execute a subset of tests, or even a single test.

We have added a simple mechanism for allowing this to unittest-based tests in the ReferenceTest functionality of the tdda Python library. It is based on tagging tests.

The quick summary is:

A decorator called tag can be imported and used to decorate individual tests or whole test classes (by preceding the test function or class with @tag).
When a script calling ReferenceTest.main() is run, if the flag --tagged (or –1, the digit one) is used on the command line, only tagged tests and tests from tagged test classes will be run.
There is a second new option, --istagged (or –0, the digit zero). When this is used, the software will report only which test classes are tagged, or contain tests that are tagged, and will not actually run any tests. This is helpful if you have a lot of test classes, spread across different files, and want to change the set of tagged tests.

Benefits

The situations where we find this particularly helpful are:

Fixing a broken test or working on a new feature or dataset. We often find ourselves with a small subset of tests failing (perhaps, a single test case), either because we're adding a new feature, or because something has changed or we are working with data that has slightly different characteristics. If the tests of interest run in a few seconds, but the whole test suite takes minutes or hours to run, we can iterate dramatically faster if we have an easy way to run only the subset of tests currently failing.
Re-writing test output. The tdda library provides the ability to re-write the expected ("reference") output from tests with whatever the code is currently generating, using the --write-all command-line flag. If it's only a subset of the tests that have failed, there is real benefit in re-writing only the output for the previously failing tests, rather than for all tests. This is particularly true if the reference outputs contain some differences each time (version numbers, dates etc.) that are being ignored using the ignore-lines or ignore-patterns options provided by the library. If we re-write all the tests, and then look at which files have changed, we might see differences in all reference files, whereas if we only regenerate the tests with meaningful changes, we will avoid committing changes that were not required.

Example

We'll illustrate this with a simple example. The code below implements four trivial tests across two classes.

Note the import of the tag decorator function near the top, and that the two of the tests—testTwo and testThree in the class Tests—are decorated with the @tag decorator, as is the entire test class MoreTests.

# tests.py
from tdda.referencetest import ReferenceTestCase, tag

class Tests(ReferenceTestCase):
    def testOne(self):
        self.assertEqual(1, 1)

    @tag
    def testTwo(self):
        self.assertEqual(2, 2)

    @tag
    def testThree(self):
        self.assertEqual(3, 3)

    def testFour(self):
        self.assertEqual(4, 4)


@tag
class MoreTests(ReferenceTestCase):
    def testFive(self):
        self.assertEqual(5, 5)

    def testSix(self):
        self.assertEqual(6, 6)


if __name__ == '__main__':
     ReferenceTestCase.main()

If we run this as normal, all six tests run and pass:

$ python tests.py
......
----------------------------------------------------------------------
Ran 6 tests in 0.000s

OK

But if we add the –1 flag, only four tests run:

$ python tests.py -1
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK

Adding the –v (verbose) flag confirms that these four are the tagged tests, as expected:

$ python tests.py -1 -v
testFive (__main__.MoreTests) ... ok
testSix (__main__.MoreTests) ... ok
testThree (__main__.Tests) ... ok
testTwo (__main__.Tests) ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK

Finally, if we want to find out which classes include tagged tests, we can use the –0 flag:

$ python tests.py -0
__main__.MoreTests
__main__.Tests

----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK

This is particularly helpful when our tests are spread across multiple files, as the filenames are then shown as well as the class names.

Installation

Information about installing the library is available in this post.

Other Features

Our Approach to Data Provenance

2017-12-12T15:30:00+00:00

Our previous post introduced the idea of data provenance (a.k.a. data lineage), which has been discussed on a couple of podcasts recently. This is an issue that is close to our hearts at Stochastic Solutions. Here, we'll talk about how we handle this issue, both methodologically and in our Miró software.

We'll focus on seven key ideas from our approach to data provenance:

Automatic Logging
Audit trail stored in datasets
Recording of field definitions and contexts
Constraint generation and verification
Data Dictionary Generation & Import
Data signatures (hashing)
Comparing datasets (diff commands)

Automatic Logging

Automatic logging provides a concrete record of all analysis performed in the software.

Our analysis software, Miró, is normally used through a scripting interface that automatically writes several detailed logs. Of these, the most important are:

A log of all commands executed (in editable, re-executable form)
A full interleaved log of commands issued and the resulting output, in several forms (including plain text and rich HTML).

Logs build up in a year/month/day/session hierarchy indefinitely, providing a comprehensive (and searchable) record of the analysis that has been performed.

Even when the software is used through the API, the sequence of operations is recorded in the log, though in that case the ability to re-execute the operations is normally lost.

Audit trail

The audit trail in a dataset tracks changes to the data and metadata across sessions, users, and machines, making it possible to see the sequence of operations that led to the current state of the data.

Like a table in a database, or a DataFrame in Pandas or R, Miró's most important data structure is a dataset—a tabular structure with named fields as columns and different observations (records) stored in rows. These form a column-based data store, and datasets can be saved to disk with a .miro extension—a folder that contains the typed data together with rich metadata.

Every time a change is made to a dataset, the operation that caused the change is recorded in the Audit Trail section of the dataset. This is true both for changes to data and to metadata:

Examples of changes to data in a dataset include creating new fields, deleting fields, filtering out records, appending new records and (more rarely) changing the original values in the data.¹
Miró maintains many types of metadata, including, field descriptions, field and dataset tags, binnings on fields, long and short names, custom colours and labels, a current record selection (filter) and various formatting information.

In this post, we'll illustrate the most concept using the following small dataset containing transactions for four different customers, identified by id:

id	date	categ	amount	days-since-prev
1	2009-01-31 00:00:00	A	1,000.00	∅
2	2009-02-02 00:00:00	A	2,000.00	∅
2	2009-02-02 22:22:22	B	2,222.22	0.93
3	2009-03-03 00:00:00	A	1,000.00	∅
3	2009-03-03 13:33:33	B	3,000.00	0.56
3	2009-03-03 23:33:33	B	3,333.33	0.42
4	2009-04-04 00:00:00	A	1,000.00	∅
4	2009-04-04 04:44:44	B	1,111.11	0.20
4	2009-04-04 14:44:44	B	∅	0.42
4	2009-04-04 20:44:44	B	4,444.44	0.25

Here is the audit trail recorded in that dataset:

Date	Description	Command	Host	Session	Line
2017/12/07 14:05:30	Load from flat file /Users/njr/python/artists/miro/testdata/trans.csv	load testdata/trans.csv	godel.local	/Users/njr/miro/log/2017/12/02/session142	1
2017/12/07 14:05:30	Save as /miro/datasets/trans.miro	save trans	godel.local	/Users/njr/miro/log/2017/12/02/session142	3
2017/12/11 12:48:39	Set dataset description to "Some demo transactions".	description -d "Some demo transactions"	bartok.local	/Users/njr/miro/log/2017/12/11/session013	2
2017/12/11 12:48:39	Set description for field amount to "transaction value (GBP)".	description amount "transaction value (GBP)"	bartok.local	/Users/njr/miro/log/2017/12/11/session013	3
2017/12/11 12:48:39	Defined field days-since-prev as (/ (- date (prev-by date id)) 24 60 60)	def days-since-prev (/ (- date (prev-by date id)) 24 60 60)	bartok.local	/Users/njr/miro/log/2017/12/11/session013	4
2017/12/11 12:48:39	Tag field days-since-prev with L="Time Since Previous Transaction (days)".	tag L="Time Since Previous Transaction (days)" days-since-prev	bartok.local	/Users/njr/miro/log/2017/12/11/session013	5
2017/12/11 12:48:39	Save as /miro/datasets/trans.miro	save	bartok.local	/Users/njr/miro/log/2017/12/11/session013	6

In this case, the history is quite short, but includes information about

where the data originally came from (first line)
when the data has been saved (second and seventh lines)
metadata changes (third, fourth and sixth lines)
changes to the data content (in this case, creation of a new field, days-since-prev).
detail about when (column 1) and where (column 4) changes were made, in what session these occurred (column 5, linking to the logs), what commands were used (column 3) and where in the log files to find those commands and context (column 6).

Field Definitions and Contexts

Fields remember their definitions, including—where relevant—the context in which they were created. This allows us to understand where any value in a dataset came from, as a sequence of transformations of input values.

In the previous section, we saw that the audit trail contained information about a derived field in the data, including the expression used to derive it. That information is also available as a basic property of the field:

[2]> ls -D days-since-prev
          Field                                 Definition
days-since-prev    (/ (- date (prev-by date id)) 24 60 60)

In other cases, entire datasets are created by taking "measurements" from a base dataset, such as the transactional data shown above. For example, we might want to create a dataset that measures how many transactions each customer has, and their total value.

One way of doing this is Miró is with the following command:

[4]> xtab -d -R MEASURES count sum amount by id
MEASURES: 4 records; 4 (100%) selected; 3 fields.

which creates a new dataset. As you might expect, this is the result:

id	count	sumamount
1	1	1,000.00
2	2	4,222.22
3	3	7,333.33
4	4	6,555.55

The field definitions in the dataset created by default) are attached to the fields, as we can see:

[6]> MEASURES.ls -D
    Field                                  Definition
       id             Rollup variable [in trans.miro]
    count         count by id [in dataset trans.miro]
sumamount    sum amount by id [in dataset trans.miro]

Of course, the audit trail for MEASURES also contains this information, together with more detailed information about the session or sessions in which the relevant commands were issued.

Constraint Generation and Verification

Automatically generated constraints can be used to identify anomalous and possibly incorrect data within a source dataset. The can also be used to check that new data with the same structure has similar or expected properties.

We've talked in previous posts (here, here, here, and here) and in a white paper about automatically generating constraints that characterize either all of, or an inferred "good" subset of, the data in a dataset. Such constraints are useful for finding bad and anomalous data in the original dataset, and also for checking ("verifying") new data as it comes in.

We won't go over all the details of constraint generation and verification in this post, but do note that this relates to Roger Peng's idea, discussed in the last post, of tracking changes to data tests as a surrogate for tracking changes to the actual data. Obviously, having generated constraints, it's a good idea to store the constraints under version control, to facilitate such tracking. More directly, the results of verification allow you to see some changes to data directly.

Data Dictionary

The data dictionary provides a useful reference for any user of the data particularly when it includes helpful (and up-to-date) annotations. By making it easily editable, we encourage users to record useful information alongside the data.

Miró can generate a data dictionary from the data. This contains:

Summary information about the overall dataset
Per-field information, including
- basic metadata about each field, including name and type
- some information about the range of each field, including minimum and maximum values, null count etc.
- further characterization information, including whether there are duplicate values,
- any annotations that have been added to the dataset (such as descriptions, alternate names, tags etc.)

Here's an example of the first part of the data dictionary for our transaction dataset:

Name	trans.miro
Path	/miro/datasets/trans.miro
Host	bartok.local
Hash	0971dde52de7bc2fb2ad2282a572f6ca295c33fae105d8b0fab7a618f4c70b71
Description	Some demo transactions
Tags
Creation Date	2017-12-11 13:33:36
Number of Records	10
Number of Fields	5

And here's the second part, with an entry for each field:

Name	Type	Min	Max	Min Length	Max Length	# nulls	# positive	# negative	Any duplicates	Values	Description	Long Name	Definition
id	int	1	4			0	10	0	yes
date	date	2009-01-31T00:00:00	2009-04-04T20:44:44			0	10	0	no
categ	string	A	B	1	1	0			yes	"A" "B"
amount	real	1,000.00	4,444.44			1	9	0	yes		transaction value (GBP)
days-since-prev	real	0.20	0.93			4	6	0	yes			Time Since Previous Transaction (days)	(/ (- date (prev-by date id)) 24 60 60)

While some of the information in the data dictionary is derived directly from the data, other parts (descriptions, alternate names, and tags) are created by annotation actions, whether by humans, bots or scripts. Although there are Miró commands for setting all the (editable) metadata properties, to encourage maximum use, Miró can also export the metadata to a spreadsheet, where users can update it, and then the appropriate parts of the metadata can be re-imported using Miró's importmetadata command.

Data signatures

A data signature is a very compact way to summarize a dataset. This allows quick and efficient checking that analysis is being performed using the correct data.

One of the properties reported in the data dictionary (the first table above) is a hash. If you're not familiar with hashing, a hash function is one that takes a (typically large) input and converts it to a much smaller output. Hash functions are designed so that different inputs tend (but are not guaranteed) to map to different outputs. So if you store the hashes of two large objects and they are different, you can be certain that the objects are different. If the hashes are the same, this does not absolutely guarantee that the objects are the same, but hash functions are designed to make so-called "hash collisions" extremely rare, especially between similar inputs. For most practical purposes, therefore, we can safely assume that if two hashes are the same, the inputs were the same.²

The point of storing the hash is that it acts as a much smaller, very efficient proxy for our original data, and if we want to know whether some dataset we have lying around contains the same data as the one used to generate the data dictionary, all we have to do is compare the hashes.

The hashes that Miró constructs for fields use only the data values, not any metadata, as inputs.³ This is a choice, of course, and we could also hash some or all of the metadata, but our primary concern here is whether we have the same underlying data or not, so we view it as an advantage that the hash is based solely on the data values.

There is an option to store hashes for individual fields, as well, which we have not used in generating the data dictionary shown.

Comparing Datasets (diff commands)

The ability to compare two datasets, and when they are different, to see clearly what the differences are, is as fundamental as the ability to compare two files in Unix or git, or to perform a track changes operation on a Word document.

If we want to compare two datasets to see if they are the same, comparing hashes is a very efficient way to do so.

If, however, we actually want to understand the differences between datasets, we need something more like Unix's diff command, which was discussed in the previous post, or Microsoft Word's Compare Documents functionality.

Miró includes a ddiff command for comparing two datasets. Let's look at an example.

Here's our transaction data again, without the derived days-since-prev field:

id	date	categ	amount
1	2009-01-31 00:00:00	A	1,000.00
2	2009-02-02 00:00:00	A	2,000.00
2	2009-02-02 22:22:22	B	2,222.22
3	2009-03-03 00:00:00	A	1,000.00
3	2009-03-03 13:33:33	B	3,000.00
3	2009-03-03 23:33:33	B	3,333.33
4	2009-04-04 00:00:00	A	1,000.00
4	2009-04-04 04:44:44	B	1,111.11
4	2009-04-04 14:44:44	B	∅
4	2009-04-04 20:44:44	B	4,444.44

and here's a variant of it:

id	date	categ	amount
1	2009-01-31 00:00:00	A	1,000.00
2	2009-02-02 00:00:00	A	2,000.00
2	2009-02-02 22:22:22	B	2,222.22
3	2009-03-03 00:00:00	A	1,000.00
3	2009-03-03 13:33:33	B	3,000.00
3	2009-03-03 23:33:33	B	3,333.33
4	2009-04-04 00:00:00	A	1,000.00
4	2009-04-04 04:44:44	B	1,111.11
4	2009-04-04 14:44:44	B	3,874.18
4	2009-04-04 20:44:44	A	4,444.44

These datasets are small enough that you can probably see the differences by inspection (though it might take a little while to be confident that you've spotted them all), but when there are millions of rows and hundreds of columns, that becomes less easy.

Using the hash trick we talked about previously, we can see whether there are any differences, and slightly more besides. Assuming the current working dataset in Miró is the first, and the second is in TRANS2, we can hash them both:

[46]>  hash
19451c13321284f2b0dd7736b75b443945fac1eae08a8118d600ec0d49b6bf87 id
f3c08f1d2d23abaa06da0529237f63bf8099053b9088328dfd5642d9b06e8f6a date
3a0070ae42b9f341e7e266a18ea1c78c7d8be093cb628c7be06b6175c8b09f23 categ
f5b3f6284f7d510f22df32daac2784597122d5c14b89b5355464ce05f84ce120 amount

b89ae4b74f95187ecc5d49ddd7f45a64849a603539044ae318a06c2dc7292cf9 combined

[47]>  TRANS2.hash
19451c13321284f2b0dd7736b75b443945fac1eae08a8118d600ec0d49b6bf87 id
f3c08f1d2d23abaa06da0529237f63bf8099053b9088328dfd5642d9b06e8f6a date
fcd11dbd69eee0bf6d2a405a6e4ef9227bb3f0279d9cc7866e2efe5b4c97112c categ
64c5c97e9e9676ec085b522303d75ff11b0ebe01a1ceebaf003719b3718f12bb amount

2e171d2a24183e5e25bbcc50d9cd99ad8b4ca48ee7e1abfa6027edd291a22584 combined

We can see immediately that these are different, but that the individual hashes for the id and date fields are the same, indicating that their content is (almost certainly) the same. It's the categ and amount fields that differ between the two datasets.

We can use the ddiff command to get a more detailed diagnosis:

[49]>  ddiff -P TRANS2

Number of differences       Field Pair
           0:                   id : id-2
           0:                 date : date-2
           1:                categ : categ-2
           1:               amount : amount-2

Diff fields:
           1: diff-amount
           1: diff-categ

Total number of differences found: 2

The output here confirms that there are no differences between the id and date fields in the two datasets, but that one value differs for each of the categ and amount fields. The -P flag that we passed to the ddiff command told it to preserve information about the differences, and if we now look at the data, we see five extra fields (on the first dataset)—the fields as they were in the other dataset, TRANS2. Miró also creates an overall diff field showing whether each record has any differences across the two datasets.

[50] show

id	date	categ	categ-2	diff-categ	amount	amount-2	diff-amount	diff
1	2009-01-31 00:00:00	A	A	0	1,000.00	1,000.00	0	0
2	2009-02-02 00:00:00	A	A	0	2,000.00	2,000.00	0	0
2	2009-02-02 22:22:22	B	B	0	2,222.22	2,222.22	0	0
3	2009-03-03 00:00:00	A	A	0	1,000.00	1,000.00	0	0
3	2009-03-03 13:33:33	B	B	0	3,000.00	3,000.00	0	0
3	2009-03-03 23:33:33	B	B	0	3,333.33	3,333.33	0	0
4	2009-04-04 00:00:00	A	A	0	1,000.00	1,000.00	0	0
4	2009-04-04 04:44:44	B	B	0	1,111.11	1,111.11	0	0
4	2009-04-04 14:44:44	B	B	0	∅	3,874.18	1	1
4	2009-04-04 20:44:44	B	A	1	4,444.44	4,444.44	0	1

This makes is easy to identify and select only those fields or records with differences, which is one of the key tasks when trying to track data lineage.

As powerful as Miró's ddiff and related commands are, there is also much more that we would like (and plan) to support. Our comparison is fundamentally based on joining the two datasets (either on one or more nominated key fields, or, as in this case, implicitly on record number). When we are using a join key, it's quite easy to deal with row addition and deletion, but that is harder when we are just joining on record number. It would be useful to have a Unix diff-like ability to spot single rows or groups of rows that have been inserted, deleted, or re-ordered, but we don't have that today. In certain cases, spotting other kinds of systematic edits would be interesting—for example, thinking of the table in speadsheet-like terms, it would be useful to spot cases in which blocks of cells shift up, down, left or right. This situation is not very common in the data we most commonly work with, but there are domains in which those sorts of changes might be frequent.

What Next?

We surveyed a few of the ways we think about and implement features in our software (and workflows) to help track data provenance and data lineage. There's a great deal more we could do, and over time we will almost certainly add more. Hopefully these ideas will prove useful and interesting, and obviously any of you fortunate enough to use Miró can try them out.

We'll keep you posted as we extend our thinking.

Credits

Thanks to our Social Media Manager, for actively policing our content even as it was being created.

Changing (original) values in data is actually so rare that Miró does provides very few facilities for doing so directly, but data can effectively be changed through deleting one field or record and adding another, and there are a couple of operations that can change values in place—primarily anonymization operations and functions that update special, automatically created fields, albeit usually by deletion and regeneration. In fact, one of the rules on our "unwritten list" of good practices is never to replace an input field with an edited copy, but instead always to derive a new field with a variant name, so that when we see a field from source data we know its values have not been altered. ↩
A familiar example comes from git, which uses hashes to compare the contents of files efficiently, and also uses a SHA-1 hash to identify a commit, by hashing all of the important information about that commit. ↩
The overall hash depends on the name of the fields and their order, as well as the data in the fields, but the individual field hashes do not. As a result, two fields containing the same values (in the same order) will receive the same hash, but datasets in which fields have been renamed or reordered will have different overall hashes. We should also note that missing values (NULLs) also contribute to the field hashes. ↩

Data Provenance and Data Lineage: the View from the Podcasts

2017-11-30T15:30:00+00:00

In Episode 49 of the Not So Standard Deviations podcast, the final segment (starting at 59:32) discusses data lineage, after Roger Peng listened to the September 3rd (2017) episode of another podcast, Linear Digressions, which discussed that subject.

This is a topic very close to our hearts, and I thought it would be useful to summarize the discussions on both podcasts, as a precursor to writing up how we approach some of these issues at Stochastic Solutions—in our work, in our Miró software and through the TDDA approaches discussed on this blog.

It probably makes sense to begin by summarizing the Linear Digressions, episode, in which Katie Malone explains the ideas of data lineage (also known as data provenance) as the tracking of changes to a dataset.

Any dataset starts from one or more "original sources"—the sensors that first measured the quantities recorded, or the system that generated the original data. In almost all cases, a series of transformations is then applied to the data before it is finally used in some given application. For example, in machine learning, Katie describes typical transformation stages as:

Cleaning the data
Making additions, subtractions and merges to the dataset
Aggregating the data in some way
Imputing missing values

She describes this as the process view of data lineage.

An alternative perspective focuses less on the processes than the resulting sequence of datasets, as snapshots. As her co-host, Ben Jaffe points out, this is more akin to the way version control systems view files or collections of files, and diff tools (see below) effectively reconstruct the process view¹ from the data.

In terms of tooling, Katie reckons that the tools for tracking data lineage are relatively well developed (but specialist) in large scientific collaborations such as particle physics (she used to work on the LHC at CERN) and genomics, but her sense is tools are less well developed in many business contexts.

She then describes five reasons to care about data provenance:

(To improve/track/ensure) data quality
(To provide) an audit trail
(To aid) replication (her example was enabling the rebuilding of a predictive model that had been lost if you knew how it had been produced, subject obviously, not only to having the data but full details of the parameters, training regime and any random number seeds used)
(To support) attribution (e.g. providing a credit to the original data collector when publishing a paper)
Informational, (i.e. to keep track of and aid navigation within large collection of datasets).

I recommend listening to the episode.

In Not-so-Standard Deviations 49, Roger Peng introduces the idea of tracking and versioning data, as a result of listening to Katie and Ben discuss the issue on their podcast. Roger argues that while you can stick a dataset into Git or other version control software, doing so it not terribly helpful because, most of the time the dataset acts essentially as a blob,² rather than as a structured entity that diff tools help you to understand.

In this, he is exactly right. In case you're not familar with version control and diff tools, let me illustrate the point. In a previous post on Rexpy, I added a link to a page about our Miró software between two edits. If I run the relevant git diff command, this is its output:

git diff of two versions of the markdown for a blog post

As you can see, this shows pretty clearly what's changed. Using a visual diff tool, we get an even clearer picture of the changes (especially when the changes are more numerous and complex):

visual diff (opendiff) of two versions of the markdown for a blog post

In contrast, if I do a diff on two Excel files stored in Git, I get the following:

git diff b1b85ddc448a723845c36688480cfe5072f28c1a -- test-excel-sheet1.xlsx 
diff --git a/testdata/test-excel-sheet1.xlsx b/testdata/test-excel-sheet1.xlsx
index 0bd63cb..91e5b0e 100644
Binary files a/testdata/test-excel-sheet1.xlsx and b/testdata/test-excel-sheet1.xlsx differ

This is better than nothing, but gives no insight into what has changed. (In fact, it's worse than it looks, because even changes to the metadata inside an Excel file--such as the location of the selected cell—will cause the files to be shown as different. As a result, there are many hard-to-detect false positives when using diff commands with binary files.)

Going back to the podcast, Hilary Parker then talked about the difficulty of the idea of data provenance in the context of streaming data, but argued there was a bit more hope with batch processes, since at least the datasets used then are "frozen".

Roger then argued that there are good custom tools used in particular places like CERN, but those are not tools he can just pick up and use. He then wondered aloud whether such tools cannot really exist, because they require too much understanding on analytical goals. (I don't agree with this, as the next post will show.) He then rowed back slightly, saying that maybe it's too hard for general data, but more reasonable in a narrower context such as "tidy data".

If you aren't familiar with the term "tidy data", it really just refers to data stored in the same way that relational databases store data in tables, with columns corresponding to variables, rows corresponding to whatever the items being measured are (the observations), consistent types for all the values in a column and a resulting regular, grid-like structure. (This contrasts with, say, JSON data, which is hierarchical, and with many spreadsheets, in which different parts of the sheet are used for different things, and with data in which different observations are in columns and the different variables are in rows.) So "tidy data" is an extremely large subset of the data we use in structured data analysis, at least after initial regularization.

Hilary then mentioned an R package called testdat, that she had had worked on at an "unconference". This aimed to test things check things like missing values and contiguity of dates in datasets. These ideas are very similar to those of constraint generation and verification, which we frequently discuss on this blog, as supported in the TDDA package. But Hilary (and others?) concluded that the package was not really useful, and that what was more important was tools to make writing tests for data easier. (I guess we disagree that general tools like this aren't useful, but very much support the latter point.)

Roger then raised the idea of tracking changes to such tests as a kind of proxy for tracking changes to the data, though clearly this is a very partial solution.

Both podcasts are interesting and worth a listen, but both of them seemed to feel that there is very little standardization in this area, and that it's really hard.

We have a lot of processes, software and ideas addressing many aspects of these issues, and I'll discuss them and try to relate them to the various points raised here in a subsequent post, fairly soon, I hope.

Strictly speaking, diff tools cannot know what the actual processes used to transform the data were, but construct a set of atomic changes (known as patches) that are capable of transforming the old dataset into the new one. ↩
a blob, in version control systems like Git, is a Binary Large OBject. Git allows you to store blobs, and will track different versions of them, but all you can really see is whether two versions are the same, whereas for "normal" files (text files), visual diff tools normally allow you to see easily exactly what changes have been made, much like the track changes feature in Word documents. ↩

Automatic Constraint Generation and Verification White Paper

2017-10-06T15:30:00+01:00

We have a new White Paper available:

Automatic Constraint Generation and Verification

Abstract

Correctness is a key problem at every stage of data science projects: completing an entire analysis without a serious error at some stage is surprisingly hard. Even errors that reverse or completely invalidate the analysis can be hard to detect. Test-Driven Data Analysis (TDDA) attempts to identify, reduce, and aid correction of such errors. A core tool that we use in TDDA is Automatic Constraint Discovery and Verification, the focus of this paper.

Download White Paper here

Constraint Generation in the Presence of Bad Data

2017-09-21T17:30:00+01:00

Bad data is widespread and pervasive.¹

Only datasets and analytical processes that have been subject to rigorous and sustained quality assurance processes are typically capable of achieving low or zero error rates. "Badness" can take many forms and have various aspects, including incorrect values, missing values, duplicated entries, misencoded values, values that are inconsistent with other entries in the same dataset, and values that are inconsistent with those in other datasets, to name but a few.

We have previously discussed automatic constraint discovery and verification as a mechanism for finding, specifying, and executing checks on data, and have advocated using these to verify input, output, and intermediate datasets. Until now, however, such constraint discovery has been based on the assumption that the datasets provided to the discovery algorithm contain only good data—a significant limitation.

We have recently been thinking about ways to extend our approach to constraint generation to handle cases in which this assumption is relaxed, so that the "discovery" process can be applied to datasets that include bad data. This article discusses and illustrates one such approach, which we have prototyped, as usual, in our own Miró software. We plan to bring the same approach to the open-source Python tdda library as soon as we have gained further experience using it and convinced ourselves we are going in a good direction.

The most obvious benefit of extending constraint generation to function usefully even when the dataset used contains some bad data is increased applicability. Perhaps a more important benefit is that it means that, in general, more and tighter constraints will be generated, potentially increasing their utility and more clearly highlighting areas that should be of concern.

The Goal and Pitfalls: "Seeing Through" Bad Data

Our aim is to add to constraint generation a second mode of operation whose goal is not to discover constraints that are true over our example data, but rather to generate constraints for which there is reasonable evidence, even if some of them are not actually satisfied by all of the example data. This is not so much constraint discovery as constraint suggestion. We want the software to attempt to "see through" the bad data to find constraints similar to those that would have been discovered had there been only good data. This is hard (and ill specified) because the bad data values are not identified, but we shall not be deterred.

For constraints generated in this way, the role of a human supervisor is particularly important: ideally, the user would look at the constraints produced—both the ordinary ones actually satisfied by the data and the more numerous, tighter suggestions typically produced by the "assuming bad data" version—and decide which to accept on a case-by-case basis.

Extending constraint generation with the ability to "see through" bad data is very powerful, but carries obvious risks: as an example, constraint suggestion might look at lending data and conclude that defaults (non-payment of loans—hopefully a small proportion of customers) are probably outliers that should be excluded from the data. Clearly, that would be the wrong conclusion.

Notwithstanding the risks, our initial experiences with generation in the presence of bad data have been rather positive: the process has generated constraints that have identified problems of which we had been blissfully unaware. The very first time we used it highlighted:

a nearly but not-quite unique identifier (that should, of course, have been unique)
a number of low-frequency, bad categorical values in a field
several numeric fields with much smaller true good ranges than we had realized.

Additionally, even when the constraints discovered go beyond identifying incorrect data, they often highlight data that we are happy to have flagged because they do represent outliers deserving of scrutiny.

Our Approach

Our initial attempt at extending constraint discovery to the case in which there is (or might be) bad data is straightforward: we simply allow that a proportion of the data might be bad, and look for the best constraints we can find on that basis.

With our current implementation, in Miró, the user has to supply an upper limit on the proportion, p, of values that are thought likely to be bad. Eventually, we expect to be able to determine this proportion heuristically, at least by default. In practice, for datasets of any reasonable size, we are mostly just using 1%, which seems to be working fairly well.

We should emphasize that the "bad" proportion p provided is not an estimate of how much of the data is bad, but an upper limit on how much we believe is reasonably likely to be bad.² As a result, when we use 1%, our goal is not to find the tighest possible constraints that are consistent with 99% of data, but rather to work with the assumption that at least 99% of the data values are good, and then to find constraints that separate out values that look very different from the 99%. If this turns out to be only 0.1%, or 0.001%, or none at all, so much the better.

The way this plays out is different for different kinds of constraint. Let's work through some examples.

Minimum and Maximum Constraints

Looking for minimum and maximum constraints in the presence of possible bad values is closely connected to univariate outlier detection. We cannot, however, sensibly make any assumptions about the shape of the data in the general case, so parametric statistics (means, variances etc.) are not really appropriate. We also need to be clear that what we are looking for is not the tails of some smooth distribution, but rather values that look as if they might have been generated by some completely different process, or drawn from some other distribution.

An example of the kind of thing what we are looking for is something qualitatively like the situation in the top diagram (though perhaps considerably more extreme), where the red and blue sections of the distribution look completely different from the main body, and we want to find cutoffs somewhere around the values shown. This is the sort of thing we might see, for example, if the main body of values are prices in one currency and the outliers, in contrast, are the similar prices, but measured in two other currencies. In contrast, we are not aiming to cut off the tails of a smooth, regular distribution, such as the normal distribution shown in the lower diagram.

In our initial implementation we have used the possible bad proportion p to define an assumed "main body" of the distribution as running between the quantiles at (1 – p) and p (from the 1st percentile to the 99th, when p = 1%). We assume all values in this range are good. We then search out from this main body of the distribution, looking for a gap between successive values that is at least α times the (1% – 99%) interquantile range. We are using α=0.5 for now. Once we have identified a gap, we then pick a value within it, currently favouring the end nearer the main distribution and also favouring "round" numbers.

Let's look at an example.

We have a dataset with 12,680,141 records and 3 prices (floating-point values), Price1, Price2 and Price3, in sterling. If we run ordinary TDDA constraint discovery on this, we get the following results:

Individual Field Constraints
Name	Type Allowed	Min Allowed	Max Allowed
Price1	real	0.0	198,204.34
Price2	real	–468,550,685.56	2,432,595.87
Price3	real	0.0	1,390,276,267.42

With ordinary constraint discovery, the min and max constraints are just set to the largest and smallest values in the dataset, so this tells us that the largest value for Price1 in the dataset is nearly £200,000, that Price2 ranges from close to –£500m to about +£2.5m, and that Price3 runs up to about £1.4bn.

If we rerun the constraint generation allowing for up to 1% bad data, we get the following instead.

Individual Field Constraints
Name	Type Allowed	Min Allowed	Max Allowed
Price1	real	0.0	38,000.0
Price2	real	–3,680.0	3,870.0
Price3	real	0.0	125,000.0

Let's look at why these values were chosen, and see whether they are reasonable.

Starting with Price1, let's get a few more statistics.

min Price1	percentile 1 Price1	median Price1	percentile 99 Price1	max Price1
0.00	24.77	299.77	2,856.04	198,204.34

So here our median value is just under £300, while our first percentile value is about £25 and our 99th percentile is a little under £3,000, giving us an interquantile range also somewhat under £3,000 (£2,831.27, to be precise).

The way our algorithm currently works, this interquantile range defines the scale for the gap we need to define something as an outlier—in this case, half of £2,831.27, which is about £1,415.

If we were just looking to find the tighest maximum constraint consistent with 1% of the data being bad (with respect to this constraint), we would obviously just set the upper limit to fractionally above £2,856.04. But this would be fairly nonsensensical as we can see if we look at the data around that threshold. It is a feature of this particular data that most values are duplicated a few times (typically between 3 and 10 times), so we first deduplicate them. Here are the first 20 distinct values above £2,855:

2,855.02   2,855.42   2,855.80   2,856.04
2,855.16   2,855.50   2,855.82   2,856.08
2,855.22   2,855.64*  2,855.86   2,856.13
2,855.38   2,855.76   2,855.91   2,856.14

Obviously, there's nothing special about £2,854.04 (marked with *), and there is no meaningful gap after it, so that would be a fairly odd place to cut off.

Now let's look at the (distinct) values above £30,000:

30,173.70   31,513.21   39,505.67   44,852.01   67,097.01    72,082.60
30,452.45   32,358.52   39,703.93   45,944.21   60,858.88    72,382.57
30,562.55   32,838.23   40,888.79   47,026.70   60,911.84    72,388.55
30,586.96   33,906.98   40,999.04   47,058.99   63,160.28    72,657.04
30,601.67   34,058.27   41,252.50   47,126.60   63,984.64    72,760.90
30,620.78   35,302.33   41,447.00   49,827.28   64,517.16    90,713.28
31,118.10   36,472.53   41,513.95   51,814.22   67,256.27   103,231.44
31,139.86   36,601.86   42,473.36   53,845.76   68,081.68   123,629.97
31,206.71   36,941.76*  43,384.46   55,871.84   70,782.28   198,204.34
31,449.57   39,034.38   43,510.03   56,393.14   71,285.95

The cutoff the TDDA algorithm has chosen (£38,000) is just after the value marked *, which is the start of the first sizable gap. We are not claiming that there is anything uniquely right or "optimal" about the cutoff that has been chosen in the context of this data, but it looks reasonable. The first few values here are still comparatively close together—with gaps of only a few hundred between each pair of successive value—and the last ones are almost an order of magnitude bigger.

We won't look in detail at Price2 and Price3, where the algorithm has narrowed the range rather more, except to comment briefly on the negative values for Price2, which you might have expected to be excluded. This table shows why:³

countnegative Price2	countzero Price2	countpositive Price2	countnull Price2	countnonnull Price2
1,300	51,640	39,750	15,880	92,690

Although negative prices are not very common, they do account for about 1.2% of the data (critically, more than 1%). Additionally, nearly 15% of the values for Price2 are missing, and we exclude nulls from this calculation, so the truly relevant figure is that 1,300 of 92,690 non-null values are negative, or about 1.4%.

Other Constraints

We will now use a slightly wider dataset, with more field types, to look at how constraint generation works for other kinds of constraints in the presence of bad data. Here are the first 10 records from a dataset with 108,559 records, including a similar set of three price fields (with apologies for the slightly unintuitive colouring of the Colour field).

ID	Date	Price1	Price2	Price3	Code	nItems	Parts	Colour
af0370b4-16e4-1891-8b30-cbb075438394	2016-11-01 00:38:34	830.25	∅	830.25	NLOM	1	62444	red
126edf08-16e5-1891-851c-d926416373f2	2016-11-01 00:41:21	983.08	0.00	983.08	QLDZ	1	62677	yellow
73462586-16ed-1891-b164-7da16012a3ab	2016-11-01 01:41:20	540.82	0.00	540.82	TKNX	1	62177 62132	red
1cd55aac-170e-1891-aec3-0ffc0ab020e7	2016-11-01 05:35:08	398.89	0.00	398.89	PKWI	1	61734	red
68f74486-170e-1891-8b30-cbb075438394	2016-11-01 05:37:16	314.54	8.21	314.54	PLUA	1	62611	red
28dff654-16fa-1891-bd69-03d25ee96dee	2016-11-01 03:12:18	479.75	∅	479.75	UHEG	1	62128	red
14321670-1703-1891-ab75-77d18aced2f5	2016-11-01 04:16:09	733.41	0.00	733.41	RTKT	2	61829	red
60957d68-1717-1891-a5b2-67fe1e248d60	2016-11-01 06:41:27	537.81	0.00	537.81	OBFZ	1	61939 62371	red
4fea2bb6-171d-1891-80a6-2f322b85f525	2016-11-01 07:23:56	132.60	2.42	135.02	TACG	2	62356	red
204ed3d6-1725-1891-a1fb-15b6ffac52ce	2016-11-01 08:19:52	866.32	∅	866.32	XHLE	2	61939	red

If we run ordinary constraint discovery on this (not allowing for any bad data), these are the results:

Individual Field Constraints
Name	Type Allowed	Min Allowed	Max Allowed	Sign Allowed	Nulls Allowed	Duplicates Allowed	Values Allowed	# Regular Expressions
ID	string	length 32	length 36		0	-		2
Date	date	1970-01-01	2017-06-03 23:59:06		0
Date:time-before-now	timedelta	107 days, 11:28:05	17428 days, 11:27:11	positive
Price1	real	0.0	19,653,405.06	non-negative	0
Price2	real	-4,331,261.54	589,023.50
Price3	real	0.0	20,242,428.57	non-negative	0
Code	string	length 4	length 4		0	-		3
nItems	int	1	99	positive		-
Parts	string	length 5	length 65		1	-		5
Colour	string	length 3	length 6		0	-	6 values	2

There are a few points to note:

To make the table more manageable, the regular expressions and field values are not shown, by default. We'll see them later.
The third line of constraints is generated is slightly different from the others, and looks not at the actual values of the dates, but rather than how far in the past or future they are. This is something else we are experimenting with, but doesn't matter too much for this example.

Let's repeat the process telling the software that up to 1% of the data might be bad.

Individual Field Constraints
Name	Type Allowed	Min Allowed	Max Allowed	Sign Allowed	Nulls Allowed	Duplicates Allowed	Values Allowed	# Regular Expressions
ID	string	length 32	length 36		0	no		1
Date	date	1970-01-01 01:23:20	2017-06-03 23:59:06		0
Date:time-before-now	timedelta	107 days, 11:55:27	17428 days, 11:54:33	positive
Price1	real	0.0	25000.0	non-negative	0
Price2	real	-520.0	790.0
Price3	real	0.0	24000.0	non-negative	0
Code	string	length 4	length 4		0	-		2
nItems	int	1	10	positive	0	-
Parts	string	length 5	length 65		0	-		2
Colour	string	length 3	length 6		0	-	3 values	1

Now let's consider the (other) constraint kinds in turn.

Sign

Nothing has changed for any of the sign constraints. In general, all that we do in this case is see whether less than our nominated proportion p is negative, or less than our nominated proportion p is positive, and if so write a constraint to this effect, but in the example nothing has changed.

Nulls

The nulls allowed constraint puts an upper limit on the number of nulls allowed in a field. The only two values we ever use are 0 and 1, with 0 meaning that no nulls are allowed and 1 meaning that a single null is allowed. The Parts field originally had a value of 1 here, meaning that there is a single null in this field in the data, and therefore the software wrote a constraint that a maxumim of 1 null is permitted. When the software is allowed to assume that 1% of the data might be bad, a more natural constraint is not to allow nulls at all.

No constraint on nulls was produced for the field nItems originally, but now we have one. If we check the number of nulls in nItems, it turns out to be just 37 or around 0.03%. Since that is well below our 1% limit, a "no-nulls" constraint has been generated.

Duplicates

In the original constraint discovery, no constraints banning duplicates were generated, whereas in this case ID did get a "no duplicates" constraint. If we count the number of distinct values in ID, there are in fact, 108,569 for 108,570 records, so there is clearly one duplicate, affecting 2 records. Since 2/108,570 is again well below our 1% limit (more like 0.002%), the software generates this constraint.

Values allowed.

In the original dataset, a constraint on the values for the field Colour was generated. Specifically, the six values it allowed were:

"red" "yellow" "blue" "green" "CC1734" "???"

When we ran the discovery process allowing for bad data, the result was that only 3 values were allowed, which turn out to be "red", "yellow", and "blue". If we look at the breakdown of the field, we will see why.

Colour	count
red	88,258
yellow	10,979
blue	8,350
green	978
CC1734	4
???	1

The two values we might have picked out as suspicious (or at least, having a different form from the others) are obviously CC1734 and ???, and constraint generation did exclude those, but also green, which we probably would not have done. There are 978 green values (about 0.9%), which is slighty under out 1% cutoff, so it can be excluded by the algorithm, and is. The algorithm simply works through the values in order, starting with the least numerous one, and removes values until the maximum possible bad proportion p is reached (cumulatively). Values with the same frequency are treated together, which means that if we had had another value (say "purple") with exactly the same frequency as green (978), neither would have been excluded.

Regular Expressions

Sets of regular expressions were generated to characterize each of the four string fields, and in every case the number generated was smaller when assuming the possible presence of bad data than when not.

For the ID field, two regular expressions were generated:

^[0-9a-f]{32}$
^[0-9a-f]{8}\-[0-9a-f]{4}\-1891\-[0-9a-f]{4}\-[0-9a-f]{12}$

The first of these just corresponds to a 32-digit hex number, while the second is a 32-digit hex number broken into five groups of 8, 4, 4, 4, and 12 digits, separated by dashes—i.e. a UUID.⁴

If we get Rexpy to give us the coverage information We see that all but three of the IDs are properly formatted UUIDs, with just three being plain 32-digit numbers.

Regular Expression	Incremental Coverage
^[0-9a-f]{8}\-[0-9a-f]{4}\-1891\-[0-9a-f]{4}\-[0-9a-f]{12}$	108,567
^[0-9a-f]{32}$	3

and indeed, it is the 3 records that are excluded by the constraint generation assuming bad data.

We won't go through all the cases, but will look at one more. The field Parts is a list of 5-digit numbers, separated by spaces, with a single one being generated most commonly. Here are the regular expressions that Rexpy generated, together with the coverage information.

Regular Expression	Incremental Coverage
^\d{5}$	99,255
^\d{5} \d{5}$	8,931
^\d{5} \d{5} \d{5}$	378
^\d{5} \d{5} \d{5} \d{5}$	4
^61000 61000 61000 61000 61000 61000 61000 61000 61000 61000 61000$	1

As you can see, Rexpy has generated five separate regular expressions, where in an ideal world we might have preferred it produced a single one:

^\d{5}( \d{5})*$

In fact, however, the fact it has produced separate expressions for five clearly distinguishable cases turns out to be very helpful for TDDA purposes.

In this case, we can see that the vast bulk of the cases have either one or two 5-digit codes (which are the two regular expressions retained by constraint generation), but we would almost certainly consider the 382 cases with three and four codes to be also correct. The last one is more interesting. First, the number of codes (eleven) is noticably larger than for any other record. Secondly, the fact that it is eleven copies of a single code, that is a relatively round number is suspicious. (Looking at the data, it turns out that in no other case are any codes repeated when there are multiple codes, suggesting even more strongly that there's something not right with this record.)

Verifying the Constraint Generation Data Against the Suggested Constraints

With the previous approach to constraint discovery, in which we assume that the example data given to us contains only good data, it should always be the case that if we verify the example data against constraints generated against it, they will all pass.⁵ With the new approach, this is no longer the case, because our hope is that the constraints will help us to identify bad data. We show below the result of running verification against the generated contraints for the example we have been looking at:

Constraints
47

Records
108,570

Fields
10

Values
1,085,700

Failing Constraints
15

Failing Records
1,722

Failing Fields
10

Failing Values
1,887

We can also see a little more information about where the failures were in the table below.

Name	Failures		Type			Minimum			Maximum			Sign			Max Nulls			Duplicates			Values			Rex
Name	Values	Constraints	Allowed	Actual	✓	Allowed	Actual	✓	Allowed	Actual	✓	Allowed	Actual	✓	Allowed	Actual	✓	Allowed	Actual	✓	Allowed	Actual	✓	Allowed	Actual	✓
Colour	983	2	string	string	✓	length 3	length 3	✓	length 6	length 6	✓	-	-	-	0	0	✓	-	-	-	3 values	e.g. "green"	✗	1 pattern	e.g. "???"	✗
Parts	384	2	string	string	✓	length 5	length 5	✓	length 65	length 65	✓	-	-	-	0	1	✗	-	-	-	-	-	-	2 patterns	e.g. "62702 62132 62341"	✗
Price2	281	2	real	real	✓	-520.00	-4,331,261.54	✗	790.00	589,023.50	✗	-	-	-	-	15880	-	-	-	-	-	-	-	-	-	-
Price3	92	1	real	real	✓	0.00	0.00	✓	24,000.00	20,242,428.57	✗	≥ 0	✓	✓	0	0	✓	-	-	-	-	-	-	-	-	-
Price1	91	1	real	real	✓	0.00	0.00	✓	25,000.00	19,653,405.06	✗	≥ 0	✓	✓	0	0	✓	-	-	-	-	-	-	-	-	-
nItems	44	2	int	int	✓	1	1	✓	10	99	✗	> 0	✓	✓	0	37	✗	-	-	-	-	-	-	-	-	-
Code	4	1	string	string	✓	length 4	length 4	✓	length 4	length 4	✓	-	-	-	0	0	✓	-	-	-	-	-	-	2 patterns	e.g. "A__Z"	✗
ID	3	2	string	string	✓	length 32	length 32	✓	length 36	length 36	✓	-	-	-	0	0	✓	no	e.g. "374e4f9816e51891b1647da16012a3ab"	✗	-	-	-	1 pattern	e.g. "374e4f9816e51891b1647da16012a3ab"	✗
Date	3	1	date	date	✓	1970-01-01 01:23:20	1970-01-01	✗	2017-06-03 23:59:06	2017-06-03 23:59:06	✓	-	-	-	0	0	✓	-	-	-	-	-	-	-	-	-
Date:time-before-now	2	1	-	-	-	109 days, 16:27:17	109 days, 16:28:09	✓	17430 days, 16:26:23	17430 days, 16:27:15	✗	> 0	✓	✓	-	0	-	-	-	-	-	-	-	-	-	-

Because the verification fails in this way, and in doing so creates indicator fields for each failing constraint, and an overall field with the number of failures for each record, it is then easy to narrow down to the data being flagged by these constraints to see whether the constraints are useful or over zealous, and adjust them as necessary.

Summary

In this post, we've shown how we're extending Automatic Constraint Generation in TDDA to cover cases where the datasets used are not assumed to be perfect. We think this is quite a significant development. We'll use it a bit more, and when it's solid, extend the open-source tdda library to include this functionality.

As previously, I am aware that, classically, data is the plural of datum, and that purists would prefer the assertion to be "bad data are widespread and pervastive." I apologise to anyone whose sensibilities are offended by my use of the word data in the singular. ↩
As a further clarification, the proportion p is a used for each potential constraint on each field separately, so it's not "1% of all values" might be bad, but rather "1% of the Ages might be higher than the maximum constraint we generate" and so on. Obviously, we could generalize this approach to allow different possible propotions for each constraint type, or each field, or both, at the cost of increasing the number of free parameters. ↩
Like you, I look at those figures and am immediately suspicious that all the counts shown are multiples of 10. But this is one of those cases our suspicions are wrong: this is a random sample from larger data, but the roundness of these numbers is blind chance. ↩
In fact, looking carefully the third group of digits is fixed and starts with 1, indicating this is a UUID-1, which is something we hadn't noticed in this data until we got Rexpy to generate a regular expression for us, as part of TDDA. ↩
With the time-delta constraints we are now generating, this is not strictly true, but this need not concern us in this case. ↩

Obtaining the Python tdda Library

2017-09-14T15:30:00+01:00

This post is a standing post that we plan to try to keep up to date, describing options for obtaining the open-source Python TDDA library that we maintain.

Using pip from PyPI

Assuming you have a working pip setup, you should be able to install the tdda library by typing:

pip install tdda

or, if your permissions don't allow use in this mode

sudo pip install tdda

If pip isn't working, or is associated with a different Python from the one you are using, try:

python -m pip install tdda

sudo python -m pip install tdda

The tdda library supports both Python 3 (tested with 3.6 and 3.7) and Python 2 (tested with 2.7). (We'll start testing against 3.8 real soon!)

Upgrading

If you have a version of the tdda library installed and want to upgrade it with pip, add -U to one of the command above, i.e. use whichever of the following you need for your setup:

pip install -U tdda
sudo pip install -U tdda
python -m pip install -U tdda
sudo python -m pip install -U tdda

Installing from Source

The source for the tdda library is available from Github and can be cloned with

git clone https://github.com/tdda/tdda.git

git clone git@github.com:tdda/tdda.git

When installing from source, if you want the command line tdda utility to be available, you need to run

python setup.py install

from the top-level tdda directory after downloading it.

Documentation

The main documentation for the tdda library is available on Read the Docs.

cd doc
make html

Running TDDA's tests

Once you have installed TDDA (whether using pip or from source), you can run its tests by typing

tdda test

If you have all the dependencies, including optional dependencies, installed, you should get a line of dots and the message OK at the end, something like this:

$ tdda test
........................................................................................................................
----------------------------------------------------------------------
Ran 122 tests in 3.251s

OK

If you don't have some of the optional dependencies installed, some of the dots will be replaced by the letter 's'. For example:

$ tdda test
.................................................................s.............................s........................
----------------------------------------------------------------------
Ran 120 tests in 3.221s

OK (skipped=2)

This does not indicate a problem, and simply means there will be some of the functionality unavailable (e.g. usually one or more database types).

Using the TDDA examples

The tdda command line can be used to copy the relevant files into place. To get the examples, first change to a directory where you would like them to be placed, and then use the command:

tdda examples

This should produce the following output:

Copied example files for tdda.referencetest to ./referencetest-examples
Copied example files for tdda.constraints to ./constraints-examples
Copied example files for tdda.rexpy to ./rexpy-examples

Quick Reference Guides

There is a quick reference guides available for the TDDA library. These are often a little behind the current release, but are usually still quite helpful.

These are available from here.

Tutorial from PyData London

There is a video online of a workshop at PyData London 2017. Watching a video of a workshop probably isn't ideal, but it does have a fairly detailed and gentle introduction to using the library, so if you are struggling, it might be a good place to start.

GDPR, Consent and Microformats: A Half-Baked Idea

2017-09-08T11:00:00+01:00

Last night I went to The Protectors of Data Scotland Meetup on the subject of Marketing and GDPR. If you're not familiar with Europe's fast-approaching General Data Protection Regulation, and you keep or process any personal data about humans,¹, you probably ought to learn about it. A good place to start is episode 202 of Horace Dediu's The Critical Path podcast, in which he interviews Tim Walters.

During the meeting, I had an idea, and though it is rather less than half-baked, right now it seems just about interesting enough that I thought I'd record it.

One of the key provisions in GDPR is that data processing generally requires consent of the data subject, and that consent is defined as

any freely given, specific, informed and unambiguous indication of his or her wishes by which the data subject, either by a statement or by a clear affirmative action, signifies agreement to personal data relating to them being processed

This is further clarified as follows:

This could include ticking a box when visiting an Internet website, choosing technical settings for information society services or by any other statement or conduct which clearly indicates in this context the data subject’s acceptance of the proposed processing of their personal data.

Silence, pre-ticked boxes or inactivity should therefore not constitute consent.

The Idea in a Nutshell

Briefly, the idea is this:

Websites (and potentially apps) requesting consents should include a digital specification of that consent in a standardized format to be defined (probably either an HTML microformat or a standardized JSON bundle).
This would allow software to understand the consents being requested unambiguously and present them in a standardized, uniform, easy-to-understand format. It would also encourages businesses and other organizations to standardize the forms of consent they request. I imagine that if this happened, initially browser extensions and special apps such as password managers would learn to read the format and present the information clearly, but if were successful, eventually web browsers themselves would do this.
Software could also allow people to create one or more templates or default responses, allowing, for example, someone who never wants to receive marketing to make this their default response, and someone who wants as many special offers as possible to have settings that reflect that. Obviously, you might want several different formats for organizations towards which you have different feelings.
A very small extension to the idea would extend the format to record the choices made, allowing password managers, browsers, apps etc. to record for the user exactly what consents were given.²

Benefits

I believe such a standard has potential benefits for all parties—businesses and other organizations requesting consent, individuals giving consent, regulators and courts:

Businesses and other data processing/capturing organizations would benefit from a clear set of consent kinds, each of which could have a detailed description (perhaps on an EU or W3C document) that could be referenced by a specific label (e.g. marketing_contact_email_organization). Best practice would hopefully quickly move to using these standard categories.
Software could present information in a standardized, clear way to users, highlighting non-standard provisions (preferably with standard symbols, a bit like the Creative Commons Symbols
By using template responses, users could more easily complete consent forms with less effort and less fear of ticking the wrong boxes.

Digital Specification? Microformat? JSON?

What are we actually talking about here?

The main (textual) content of a web page consists of the actual human-readable text together with annotation ("markup") to specify formatting and layout, as well as special features like the sort of checkboxes used to request consent. In older versions of the web, a web page was literally a text file in the a special format originally defined by Tim Berners-Lee (HTML).

For example, in HTML, this one-sentence paragraph with the word very in bold might be written as follows:

<p>For example, in HTML, this one-sentence paragraph with the
word <b>very</b> in bold might be written as follows:</p>

Since the advent of "Web 2.0", many web pages are generated dynamically, with much of the data being sent in a format called JSON. A simple example of some JSON for describing (say) an element in the periodic might be

{
    "name": "Lithium",
    "atomicnumber": 3,
    "metal": true,
    "period": 2,
    "group": 1,
    "etymology": "Greek lithos"
}

It would be straightforward³ to develop a format for allowing all the common types of marketing consent (and indeed, many other kinds of processing consent) to be expressed either in JSON or an HTML microformat (which might not be rendered directly by the webpage). As a sketch, the kind of thing a marketing consent request might look like in JSON could be:

{
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requesting_organization": "Stochastic Solutions Limited",
    "requesting_organization_data_protection_policy_page:
        "https://stochasticsolutions.com/privacy.html",
    "requesting_organization_partners": [],
    "requested_consents": [
        "marketing_contact_email_organization",
        "marketing_contact_mobile_phone_organization",
        "marketing_contact_physical_mail_organization"
    ],
    "request_date": "2017-09-08",
    "request_url": "https://StochasticSolutios.com/give-us-your-data"
}

Key features I am trying to illustrate here are:

The format would include details about the organization making the request
The format would have the capacity to list partner organizations in cases in which consent for partner marketing or processing was also requested
The format would be granular with a taxonomy of known kinds of consents. These might be parameterized, rather than being simple strings. In this case, I've included a few different contact mechanisms and the suffix "organization" to indicate this is consent for the organization itself, rather than any partners or other randoms.

Undoubtedly, a real implementation would end up a bit bigger than this, and perhaps more hierarchical, but hopefully not too much bigger.

The format could be extended very simply to include the response, which could then be sent back to the site and also made available on the page to the browser/password manager/apps etc. Here is an augmentation of the request format that also captures the responses:

{
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requesting_organization": "Stochastic Solutions Limited",
    "requesting_organization_data_protection_policy_page:
        "https://stochasticsolutions.com/privacy.html",
    "requesting_organization_partners": [],
    "requested_consents": {
        "marketing_contact_email_organization": false,
        "marketing_contact_mobile_phone_organization": false
        "marketing_contact_physical_mail_organization: true
    },
    "request_date": "2017-09-08",
    "request_url": "https://StochasticSolutions.com/give-us-your-data"
}

This example indicates consent to marketing contact by paper mail from the organization, but not by phone or email.

Exactly the same could be achieved with an HTML Microformat, perhaps with something like this:

<div id="GDPR-Marketing-Consent-Bundle">
    <span id="format_version">1.0</span>
    <span id="requesting_organization">
        Stochastic Solutions Limited
    </span>
    <span id="requesting_organization_data_protection_policy_page">
        "https://stochasticsolutions.com/privacy.html"
    </span>
    <ol id="requesting_organization_partners">
    </old>
    <ol id="requested_consents">
        <li>marketing_contact_email_organization</li>
        <li>marketing_contact_mobile_phone_organization<li>
        <li>marketing_contact_physical_mail_organization</li>
    </ol>
    <span id="request_date">2017-09-08</span>
    <span id="request_url">
        https://StochasticSolutions.com/give-us-your-data
    </span>
</div>

(Again, I've no idea whether this is actually what HTML-based microformats typically look like; this is purely illustrative.)

Useful?

I don't know whether this idea is useful or feasible, nor whether it is merely a half-baked version of something that an phalanx of people in Brussels has already specified, though I did perform many seconds of arduous due dilligence in the form of a web searches for terms like "marketing consent microformat" without turning up anything obviously relevant.

It seems to me that if something like this were created and adopted, it might help make GDPR and web/app-based consent avoid the ignominious fate of the cookie pop-ups that were so well intentioned but such a waste of time in practice. Ideally, some kind of collaboration between the relevant part of the EU and either W3C would produce (or at least endorse) any format.

Do get in touch through any of the channels (@tdda0, mail to info@ this domain etc.) if you have thoughts.

So-called personally identifiable information (PII). ↩
Possibly even on a blockchain, if you want to be terribly au courrant and have the possibility of cryptographic verification. ↩
technically straightforward; obviously this would require much work and hammering out of special cases and mechanisms for non-standard requirements. ↩

Quick Reference for TDDA Library

2017-05-04T15:30:00+01:00

A quick-reference guide ("cheat sheet") is now available for the Python TDDA library. This is linked in the sidebar and available here.

We will try to keep it up-to-date as the library evolves.

See you all at PyData London 2017 this weekend (5-6 May 2017), where we'll be running a TDDA tutorial on Friday.

Improving Rexpy

2017-03-09T15:30:00+00:00

Today we are announcing some enhancements to Rexpy, the tdda tool for finding regular expressions from examples. In short, the new version often finds more precise regular expressions than was previously the case, with the only downside being a modest increase in run-time.

Background on Rexpy is available in two previous posts:

This post introduced the concept and Python library
This post discussed the addition of coverage information—statistics about which how many examples each regular expression matched.

Rexpy is also available online at https://rexpy.herokuapp.com.

Weaknesses addressed

Rexpy is not intended to be an entirely general-purpose tool: it is specifically focused on the case of trying to find regular expressions to characterize the sort of structured textual data we most often see in database and datasets. We are very interested in characterizing things like identifiers, zip codes, phone numbers, URLs, UUIDs, social security numbers and (string) bar codes, and much less interested in characterizing things like sentences, tweets, programs and encrypted text.

Within this focus, there were some obvious shortcomings of Rexpy, a significant subset of which the current release (tdda version 0.3.0) now addresses.

Example 1: Postcodes

Rexpy's tests have always included postcodes, but Rexpy never did a very good job with them. Here is the output from using tdda version 0.2.7:

$ python rexpy.py
EH1 1AA
B2 8EA

^[A-Z0-9]{2,3}\ [0-9A-F]{3}$

Rexpy's result is completely valid, but not very specific. It has correctly identified that there are two main parts, separated by a space, and that the first part is a mixture of two or three characters, each a capital letter or a number, and that the second part is exactly three characters, again all capital letters or numbers. However, it has failed to notice that the first group starts with a letter and follows this with a single digit and that the second group is one digit followed by two letters. What a human would probably have written is something more like:

^[A-Z]{1,2}[0-9]\ [0-9][A-F]{2}$

Let's try Rexpy 0.3.0.

^[A-Z]{1,2}\d\ \d[A-Z]{2}$

Now Rexpy does exactly what we would probably have wanted it to do, and actually written it slightly more compactly—\d is any digit, i.e. it is precisely equivalent to [0-9].

With a few more examples, it still does the perfect thing (in 0.3.0).

$ rexpy
EC1 1BB
W1 0AX
M1 1AE
B33 8TH
CR2 6XH
DN55 1PT

^[A-Z]{1,2}\d{1,2}\ \d[A-Z]{2}$

(Note that the 0.3.0 release of TDDA includes wrapper scripts, rexpy and tdda that allow the main functions to be used directly from command line. These are installed when you pip install tdda. So the rexpy above is exactly equivalent to running python rexpy.py.)

We should note, however, that it still doesn't work perfectly if we include general London postcodes (even with 0.3.0):

$ rexpy
EC1A 1BB
W1A 0AX
M1 1AE
B33 8TH
CR2 6XH
DN55 1PT

^[A-Z0-9]{2,4}\ \d[A-Z]{2}$

In this case, the addition of the final letter in the first block for EC1A and W1A has convinced the software that the first block is just a jumble of capital letters and numbers. We might hope that examples like these (at least, if expanded) would result in something like:

^[A-Z]{1,2}\d{1,2}A?\ \d[A-Z]{2}$

though the real pattern for postcodes is actually quite complex, with only certain London postal areas being allowed a trailing letter, and only in cases where there is a single digit in the first group, and that letter can actually be A, C, P or W.

So while it isn't perfect, Rexpy is doing fairly well with postcodes now.

Let's look at another couple of examples.

Example 2: Toy Examples

Looking at the logs from the Rexpy online, it is clear that a lot of people (naturally) start by trying the sorts of examples commonly used for teaching regular expressions. Here are some examples motivated by what we tend to see in logs.

First, let's try a common toy example in the old version of Rexpy (0.2.7):

$ python rexpy.py
ab
abb
abbb
abbbb
abbbbb
abbbbbb

^[a-z]+$

Not so impressive.

Now in the new version (0.3.0):

$ rexpy
ab
abb
abbb
abbbb
abbbbb
abbbbbb

^ab+$

That's more like it!

Example 3: Names

Here's another example it's got better at. First under the old version:

$ python rexpy.py
Albert Einstein
Rosalind Franklin
Isaac Newton

^[A-Za-z]+\ [A-Za-z]{6,8}$

Again, this is not wrong, but Rexpy has singularly failed to notice the pattern of capitalization.

Now, under the new version:

$ rexpy
Albert Einstein
Rosalind Franklin
Isaac Newton

^[A-Z][a-z]+\ [A-Z][a-z]{5,7}$

Better.

Incidentally, it's not doing anything special with the first character of groups. Here are some related examples:

$ rexpy
AlbertEinstein
RosalindFranklin
IsaacNewton

^[A-Z][a-z]+[A-Z][a-z]{5,7}$

Example 4: Identifiers

Some of the examples we used previously were like this (same result under both versions):

$ rexpy
123-AA-22
576-KY-18
989-BR-93

^\d{3}\-[A-Z]{2}\-\d{2}$

What worked less well in the old version were examples like these:

$ python rexpy.py
123AA22
576KY18
989BR93

^[A-Z0-9]{7}$

These work much better under the 0.3.0:

$ rexpy
123AA22
576KY18
989BR93

^\d{3}[A-Z]{2}\d{2}$

Some Remaining Areas for Improvement

The changes that we've made in this release of Rexpy appear to be almost unambiguous improvements. Both from trying examples, and from understanding the underlying code changes, we can find almost no cases in which the changes make the results worse, and a great number where the results are improved. Of course, that's not to say that there don't remain areas that could be improved.

Here we summarize a few of the things we still hope to improve:

Alternations of Whole Groups

Rexpy isn't very good at generating alternations at the moment, either at a character or group level. So for example, you might have hoped that in the following example, Rexpy would notice that the middle two letters are always AA or BB (or, possibly, that the letter is repeated).

$ rexpy
123-AA-321
465-BB-763
777-AA-81
434-BB-987
101-BB-773
032-BB-881
094-AA-662

^\d{3}\-[A-Z]{2}\-\d{2,3}$

Unfortunately, it does not. (This probably won't be very hard to change.)

Alternations within Groups

Similarly, you might hope that it would do rather better than this:

$ rexpy
Roger
Boger
Coger

^[A-Z][a-z]{4}$

Clearly, we would like this to produce

^[A-Z]oger$

and you might think from the previous examples that it would do this, but it can't combine the fixed oger with the adjacent letter range.

Too Many Expressions (or Combining Results)

Rexpy also produces rather too many regular expressions in many cases, particularly by failing to use optionals when it could. For example:

$ rexpy
Angela Carter
Barbara Kingsolver
Socrates
Cher
Martin Luther King
James Clerk Maxwell

^[A-Z][a-z]+$
^[A-Z][a-z]{5,6}\ [A-Z][a-z]+$
^[A-Z][a-z]{4,5}\ [A-Z][a-z]{4,5}\ [A-Z][a-z]+$

At least in some circumstances, we might prefer that this would produce a single result such as:

^[A-Z][a-z]+((\ [A-Z][a-z]+(\ [A-Z][a-z]+)?)?$

or, ever better:

^[A-Z][a-z]+(\ [A-Z][a-z])*$

Although we would definitely like Rexpy to be able to produce one of these results, we don't necessarily always want this behaviour. It transpires that in a TDDA context, producing different expressions for the different cases is very often useful. So if we do crack the "combining" problem, we'll probably make it an option (perhaps with levels); that will just leave the issue of deciding on a default!

Plans

We have ideas on how to address all of these, albeit not perfectly, so expect further improvements.

If you use Rexpy and have feedback, do let us know. You can reach us on Twitter at (@tdda0), and there's also a TDDA Slack (#TDDA) that we'd be happy to invite you to.

An Error of Process

2017-03-08T13:00:00+00:00

Yesterday, email subscribers to the blog, and some RSS/casual viewers, will have seen a half-finished (in fact, abandoned) post that began to try to characterize success and failure on the crowd-funding platform Kickstarter.

The post was abandoned because I didn't believe its first conclusion, but unfortunately was published by mistake yesterday.

This post explains what happened and tries to salvage a "teachable moment" out of this minor fiasco.

The problem the post was trying to address

Kickstarter is a crowd-funding platform that allows people to back creative projects, usually in exchange for rewards of various kinds. Projects set a funding goal and backers only pay out if the aggregate pledges made match or exceed the funding goal during a funding period—usually 30 days.

Project Phoenix on Kickstarter, from The Icon Factory, seeks to fund the development of a new version of Twitterrific for Mac. Twitterrific was the first independent Twitter client, and was responsible for many of the things that define Twitter today.¹ (You were, and are, cordially invited to break off from reading this post to go and back the project before reading on.)

Ollie is the bird in Twitterrific's icon.

At the time I started the post, the project had pledges of $63,554 towards a funding goal of $75,000 (84%) after 13 days, with 17 days to go. This is what the amount raised over time looked like (using data from Kicktraq):

Given that the amount raised was falling each day, and looked asymptotic, questions I was interested in were:

How likely was the project to succeed (i.e. to reach its funding goal by day 30? (In fact, it is now fully funded.)
How much was the project likely to raise?
How likely was the project to reach its stretch goal of $100,000?

The idea was to use some open data from Kickstarter and simple assumptions to try to find out what successful and unsuccessful projects look like.

Data and Assumptions

[This paragraph is unedited from the post yesterday, save that I have made the third item below bold.]

Kickstarter does not have a public API, but is scrapable. The site Web Robots makes available a series of roughly monthly scrapes of Kickstarter data from October 2015 to present; as well as seven older datasets. We have based our analysis on this data, making the following assumptions:

The data is correct and covers all Kickstarter Projects
That we are interpreting the fields in the data correctly
Most critically: if any projects are missing from this data, the missing projects are random. Our analysis is completely invalid if failing projects are removed from the datasets.

[That last point, heavily signalled as critical, turned out not to be the case. As soon as I saw the 99.9% figure below, I went to try to validate that projects didn't go missing from month to month in the scraped data. In fact, they do, all the time, and when I realised this, I abandoned the post. There would have been other ways to try to make the prediction, but they would have been less reliable and required much more work.]

We started with the latest dataset, from 15th February 2017. This included data about 175,085 projects, which break down as follows.

Only projects with a 30-day funding period were included in the comparison, and only those for which we knew the final outcome.

count           is 30 day?
state           no     yes    TOTAL
failed      41,382  42,134   83,516
successful  44,071  31,142   75,213
canceled     6,319   5,463   11,782
suspended      463     363      826
live         2,084   1,664    3,748
TOTAL       94,319  80,766  175,085
-----------------------------------
less live:           1,664
-----------------------------------
Universe            79,102

The table showed that 80,766 of the projects are 30-day, and of these, 79,102 are not live. So this is our starting universe for analysis. NOTE: We deliberately did not exclude suspended or canceled projects, since doing so would have biased our results.

Various fields are available in the JSON data provided by Web Robots. The subset we have used are listed below, together with our interpretation of the meaning of each field:

launched_at — Unix timestamp (seconds since 1 January 1970) for the start of the funding period
deadline — Unix timestamp for the end of the funding period
state — (see above)
goal — the amount required to be raised for the project to be funded
pledged — the total amount of pledges (today); pledges can only be made during the funding period
currency — the currency in which the goal and pledges are made.
backers_count — the number of people who have pledged money.

Overall Statistics for 30-day, non-live projects

These are the overall statistics for our 30-day, non-live projects:

succeeded    count        %
no          47,839   60.48%
yes         31,263   39.52%
TOTAL       79,102  100.00%

Just under 40% of them succeed.

But what proportion reach 84% and still fail to reach 100%? According to the detailed data, the answer was just 0.10%, suggesting 99.90% of 30-day projects that reached 84% of their funding goal, at any stage of the campaign went on to be fully funded.

That looked wildly implausible to me, and immediately made me question whether the data I was trying to use was capable of supporting this analysis. In particular, my immediate worry was that projects that looked like they were not going to reach their goal might end up being removed—for whatever reason—more often that those that were on track. Although I have not proved that this is the case, it is clear that projects do quite often disappear between successive scrapes.

To check this, I went back over all the earlier datasets I had collected and extracted the projects that were live in those datasets. There were 47,777 such projects. I then joined those onto the latest dataset to see how many of them were in the latest dataset. 15,276 (31.97%) of the once-live projects were not in the latest data (based on joining on id).

It was at this point I abandoned the blog post.

Error of Process

So what did we learn?

The whole motivation for test-driven data analysis is the observation that data analysis is hard to get right, and most of us make mistakes all the time. We have previously classified these mistakes as

errors of interpretation (where we or a consumer of our analysis misunderstand the data, the methods, or our results)
errors of implementation (bugs)
errors of process (where we make a mistake in using our analytical process, and this leads to a false result being generated or propagated)
errors of applicability (where we use an analytical process with data that does not satisfy the requirements or assumptions (explicit or implicit) of the analysis).

We are trying to develop methodologies and tools to reduce the likelihood and impact of each of these kinds of errors.

While we wouldn't normally regard this blog as an analytical process, it's perhaps close enough that we can view this particular error through the TDDA lens. I was writing up the analysis as I did it, fully expecting to generate a useful post. Although I got as far as writing into the entry the (very dubious) 99.9% of 30-day projects that reach 84% funding at any stage on Kickstarter go on to be fully funded, that result immediately smelled wrong and I went off to try to see whether my assumptions about the data were correct. So I was trying hard to avoid an error of interpretation.

But an error of process occurred. This blog is published using Pelican, a static site generator that I mostly quite like. The way Pelican works is posts are (usually) written in Markdown with some metadata at the top. One of the bits of metadata is a Status field, which can either be set to Draft or Published.

When writing the posts, before publishing, you can either run a local webserver to view the output, or actually post them to the main site (on Github Pages, in this case). As long as their status is set to Draft, the posts don't show up as part of the blog in either site (local or on Github), but have to be accessed through a special draft URL. Unfortunately, the draft URL is a little hard to guess, so I generally work with posts with status set to Published until I push them to Github to allow other people to review them before setting them live.

What went wrong here is that the abandoned post had its status left as Published, which was fine until I started the next post (due tomorrow) and pushed that (as draft) to Github. Needless to say, a side-effect of pushing the site with a draft of tomorrow's post was that the abandoned post got pushed too, with its status as Public. Oops!

So the learning for me is that I either have to be more careful with Status (which is optimistic) or I need to add some protection in the publishing process to stop this happening. Realistically, that probably means creating a new Status—Internal—which will get the make process to transmogrify into Published when compiling locally, and Draft when pushing to Github. That should avoid repeats of this particular error of process.

good things, like birds and @names and retweeting; not the abuse. ↩

Errors of Interpretation: Bad Graphs with Dual Scales

2017-02-20T15:30:00+00:00

It is a primary responsibility of analysts to present findings and data clearly, in ways to minimize the likelihood of misinterpretation. Graphs should help this, but all too often, if drawn badly (whether deliberately or through oversight) they can make misinterpretation highly likely. I want to illustrate this danger with a unifortunate graph I came across recently in a very interesting—and good, and insightful—article on the US Election.

Take a look at this graph, taken from an article called The Road to Trumpsville: The Long, Long Mistreatment of the American Working Class, by Jeremy Grantham.¹

In the article, this graph ("Exhibit 1") is described as follows by Grantham:

The combined result is shown in Exhibit 1: the share of GDP going to labor hit historical lows as recently as 2014 and the share going to corporate profits hit a simultaneous high.

Is that what you interpret from the graph? I agree with these words, but they don't really sum up my first reading of the graph. Rather, I think the natural reading of the graph is as follows:

Wow: Labor's share and Capital's share of GDP crossed over, apparently for good, around 2002. Before then, Capital's share was mostly materially lower than Labor's (though they were nearly equal, briefly, in 1965, and crossed for a for a few years in 1995), but over the 66-year period shown Capital's share increased while Labor's fell, until now is taking about four times as much as Labor.

I think something like that is what most people will read from the graph, unless they read it particularly carefully.

But that is not what this graph is saying. In fact, this is one of the most misleading graphs I have ever come across.

If you look carefully, the two lines use different scales: the red one, for Labor, uses the scale on the right, which runs from 23% to about 34%, whereas the blue line for Capital, uses the scale on the left, which runs from 3% to 11%.

Dual-scale graphs are always difficult to read; so difficult, in fact, that my personal recommendation is

Never plot data on two different scales on the same graph.

Not everyone agrees with this, but most people accept that dual-scale graphs are confusing and hard to read. Even, however, by the standards of dual scale graphs, this is bad.

Here are the problems, in roughly decreasing order of importance:

The two lines are showing commensurate² figures of roughly the same order of magnitude, so could and should have been on the same scale: this isn't a case of showing price against volume, where the units are different, or even a case in which one size in millimetres and another in miles: these are both percentages, of the same thing, all between 4% and 32%.
The graphs cross over when the data doesn't. The very strong suggestion from the graphs that we go from Labor's share of GDP exceeding that of Capital to being radically lower than that of Capital is entirely false.
Despite measuring the same quantity, the magnification is different on the two axes (i.e. the distance on the page between ticks is different, and the percentage-point gap represented by ticks on the two scales is different). As a consequence slopes (gradients) are not comparable.
Neither scale goes to zero.
The position of the two series relative to their scales is inconsistent: the Labor graph goes right down to the x-axis at its minimum (23%) while the Capital graph—whose minimum is also very close to an integer percentage—does not. This adds further to the impression that Labor's share has been absolutely annihilated.
There are no gridlines to help you read the data. (Sure, gridlines are chart junk³, but are especially important when different scales are used, so you have some hope of reading the values.)

I want to be clear: I am not accusing Jeremy Grantham of deliberately plotting the data in a misleading way. I do not believe he intended to distort or manipulate. I suspect he's plotted it this way because stock graphs, which may well be the graphs he most often looks at,⁴ are frequently plotted with false zeros. Despite this, he has unfortunately plotted the graphs in a way⁵ that visually distorts the data in almost exactly the way I would choose to do if I wanted to make the points he is making appear even stronger than they are.

I don't have the source numbers, so I have gone through a rather painful exercise, of reading the numbers off the graph (at slightly coarser granularity) so that I can replot the graph as it should, in my opinion, have been plotted in the first place. (I apologise if I have misread any values; transcribing numbers from graphs is tedious and error-prone.) This is the result:

Even after I'd looked carefully at the scales and appreciated all the distortions in the original graph, I was quite shocked to see the data presented neutrally. To be clear: Grantham's textual summary of the data is accurate: a few years ago, Capital's share of GDP (from his figures) were at an all time—albeit not dramatically higher than in 1949 or about 1966—and Labor's share of GDP, a few years ago, was at an all-time low around 23%, down from 30%. But the true picture just doesn't look like the graph Gratham showed. (Again: I feel a bit bad about going on about this graph from such a good article; but the graph encapsulates a number of problematical practices that it makes a perfect illustration.)

How to Lie with Statistics

In 1954, Darrell Huff published a book called How to Lie with Statistics⁶. Chapter 5 is called The Gee Wizz Graph. His first example is the following graph (neutrally presented) graph:

As Huff says:

That is very well if all you want to do is convey information. But suppose you wish to win an argument, shock a reader, move him into action, sell him something. For that, this chart lacks schmaltz. Chop off the bottom.

Huff continues:

Thats more like it. (You've saved paper⁷ too, something to point out if any carping fellow objects to your misleading graphics.)

But there's more, folks:

Now that you have practised to deceive, why stop with truncating? You have one more trick available that's worth a dozen of that. It will make your modest rise of ten per cent look livelier than one hundred percent is entitled to look. Simply change the proportion between the ordinate and the abscissa:

Both of these unfortunate practices are present in Exhibit 1, and that's before we even get to dual scales.

Errors of Interpretation

In our various overviews of test-driven data analysis, (e.g., this summary) we have described four major classes of errors:

errors of interpretation
errors of implementation (bugs)
errors of process
errors of applicability

Errors of interpretation can occur at any point in the process: not only are we, the analysts, susceptible to misinterpreting our inputs, our methods, our intermediate results and our outputs, but the recipients of our insights and analyses are in even greater danger of misinterpreting our results, because they have not worked through the process and seen all that we did. As analysts, we have a special responsibility to make our results as clear as possible, and hard to misinterpret. We should assume not that the reader will be diligent, unhurried and careful, reading every number and observing every subtlety, but that she or he will be hurried and will rely on us to have brought out the salient points and to have helped the reader towards the right conclusions.

The purpose of a graph is to bring allow a reader to assimilate large quantities of data, and to understand patterns therein, more quickly and more easily than is possible from tables of numbers. There are strong conventions about how to do that, based on known human strengths and weaknesses as well as commonsense "fair treatment" of different series.

However well intentioned, Exhibit 1 fails in every respect: I would guess very few casual readers would get an accurate impression from the data as presented.

If data scientists had the equivalent of a Hippocratic Oath, it would be something like:

First, do not mislead.

The Road to Trumpsville: The Long, Long Mistreatment of the American Working Class, by Jeremy Grantham, in the GMO Quarterly Newsletter, 4Q, 2016. https://www.gmo.com/docs/default-source/public-commentary/gmo-quarterly-letter.pdf ↩
two variables are commensurate if they are measured in the same units and it is meaningful to make a direct comparison between them. ↩
Tufte describes all ink on a graph that is not actually plotting data "chart junk", and advocates "maximizing data ink" (the amount of the ink on a graph actually devoted to plotting the data points) and minimizing chart junk. These are excellent principles. The Visual Display of Quantitative Information, Edward R. Tufte, Graphics Press (Cheshire, Connecticut) 1983. ↩
Mr Grantham works for GMO, a "global investment management firm". https://gmo.com ↩
chosen to use a plot, if he isn't responsible for the plot ↩
How to Lie with Statistics, Darrell Huff, published Victor Gollancz, 1954. Republished, 1973, by Pelican Books. ↩
Obviously the "saving paper" argument had more force in 1954, and the constant references to "him", "he" and "fellows" similarly stood out less than they do today. ↩

TDDA 1-pager

2017-02-10T15:00:00+00:00

We have written a 1-page summary of some of the core ideas in TDDA.

It is available as a PDF from stochasticsolutions.com/pdf/TDDA-One-Pager.pdf.

Coverage information for Rexpy

2017-01-31T15:00:00+00:00

Rexpy Stats

We previously added rexpy to the Python tdda module. Rexpy is used to find regular expressions from example strings.

One of the most common requests from Rexpy users has been for information regarding how many examples each resulting regular expression matches.

We have now added a few methods to Rexpy to support this.

Currently this is only available in the Python library for Rexpy, available as part of the tdda module, with either

pip install tdda

git clone https://github.com/tdda/tdda.git

Needless to say, we also plan to use this functionality in the online version of Rexpy in the future.

Rexpy: Quick Recap

The following example shows simple use of Rexpy from Python:

$ python
>>> from tdda import rexpy
>>>
>>> corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
>>> for r in rexpy.extract(corpus):
>>>    print(r)
>>>
^\d{1,3}\-[A-Z]{2}\-\d{3}$

In this case, Rexpy found a single regular expression that matched all the strings, but in many cases it returns a list of regular expressions, each covering some subset of the examples.

The way the algorithm currently works, in most cases¹ each example will match only one regular expression, but in general, some examples might match more than one pattern. So we've designed the new functionality to work even when this is the case. We've provided three new methods on the Extractor class, which gives a more powerful API than the simple extract function.

Here's an example based on one of Rexpy's tests:

>>> urls2 = [
    'stochasticsolutions.com/',
    'apple.com/',
    'stochasticsolutions.com/',  # actual duplicate
    'https://www.stochasticsolutions.co.uk/',
    'https://www.google.co.uk/',
    'https://www.google.com',
    'https://www.google.com/',
    'https://www.guardian.co.uk/',
    'https://www.guardian.com',
    'https://www.guardian.com/',
    'https://www.stochasticsolutions.com',
    'web.stochasticsolutions.com',
    'https://www.stochasticsolutions.com',
    'tdda.info',
    'gov.uk',
    'https://web.web',
]
>>> x = rexpy.Extractor(urls2)
>>> for r in x.results.rex:
>>>    print(r)

^[a-z]{3,4}\.[a-z]{2,4}$
^[a-z]+\.com\/$
^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
^http\:\/\/www\.[a-z]{6,8}\.com\/$
^http\:\/\/www\.[a-z]+\.co\.uk\/$

As you can see, Rexpy has produced six different regular expressions, some of which should probably be collapsed together. The Extractor object we have created has three new methods available.

The New Coverage Methods

The simplest new method is coverage(dedup=False), which returns a list of the number of matches for each regular expression returned, in the same order as the regular expressions in x.results.rex. So:

>>> print(x.coverage())
[2, 3, 2, 4, 2, 3]

is the list of frequencies for the six regular expressions given, in order. So the pairings are illustrated by:

>>> for k, n in zip(x.results.rex, x.coverage()):
        print('%d examples are matched by %s' % (n, k))

2 examples are matched by ^[a-z]{3,4}\.[a-z]{2,4}$
3 examples are matched by ^[a-z]+\.com\/$
2 examples are matched by ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
4 examples are matched by ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
2 examples are matched by ^http\:\/\/www\.[a-z]{6,8}\.com\/$
3 examples are matched by ^http\:\/\/www\.[a-z]+\.co\.uk\/$

The optional dedup parameter, when set to True, requests deduplicated frequencies, i.e. ignoring any duplicate strings passed in (remembering that Rexpy strips whitespace from both ends of input strings). In this case, there is just one duplicate string (stochasticsolutions.com/). So:

>>> print(x.coverage(dedup=True))
[2, 2, 2, 4, 2, 3]

where the second number (the matches for ^[a-z]+\.com\/$) is now 2, because stochasticsolutions.com/ has been deduplicated.

We can also find the total number of examples, with or without duplicates, by calling the n_examples(dedup=False) method:

>>> print(x.n_examples())
16
>>> print(x.n_examples(dedup=True))
15

But what we will probably normally be most interested in doing is sorting the regular expressions from highest to lowest coverage, ignoring any examples matched by an earlier pattern in cases where they do overlap. That's exactly what the incremental_coverage(dedup=False) method does for us. It returns an ordered dictionary.

>>> for (k, n) in x.incremental_coverage().items():
        print('%d: %s' % (n, k))
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^[a-z]+\.com\/$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$

This is our sixteen input strings (including duplicates), and the number of examples matched by this expression, not matched by any previous expression. (As noted earlier, that caveat probably won't make any difference at the moment, but it will in future versions.) So, to be explicit, this is saying:

The regular expression that matches most examples is: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$ which matches 4 of the 16 strings.
Of the remaining 12 examples, 3 are matched by ^[a-z]+\.com\/$.
Of the remaining 9 examples, 3 more are matched by ^http\:\/\/www\.[a-z]+\.co\.uk\/$
and so on.

Note, that in the case of ties, Rexpy sorts regular expressions as strings to break ties.

We can get the deduplicated numbers if we prefer:

>>> for (k, n) in x.incremental_coverage(dedup=True).items():
        print('%d: %s' % (n, k))
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]+\.com\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$

That's all the new functionality for now. Let us know how you get on, and if you find any problems. And tweet your email address to @tdda0 if you want to join the TDDA Slack to discuss anything around the subject of test-driven data analysis.

[NOTE: This post was updated on 10.2.2017 after an update to the rexpy library changed function and attribute names from "sequential" (which was not very descriptive) to "incremental", which is better.]

In fact, probably in all cases, currently ↩

The New ReferenceTest class for TDDA

2017-01-26T15:30:00+00:00

Since the last post, we have extended the reference test functionality in the Python tdda library. Major changes (as of version 0.2.5, at the time of writing) include:

Introduction of a new ReferenceTest class that has significantly more functionality from the previous (now deprecated) WritableTestCase.
Support for pytest as well as unittest.
Available from PyPI with pip install tdda, as well as from Github.
Support for comparing CSV files.
Support for comparing pandas DataFrames.
Support for preprocessing results before comparison (beyond simply dropping lines) in reference tests.
Greater consistency between parameters and options for all comparison methods
Support for categorizing kinds of reference data and rewriting only nominated categories (with -w or --write)
More (meta) tests of the reference test functionality.

Background: reference tests and WritableTestCase

We previously introduced the idea of a reference test with an example in the post First Test, and then when describing the WritableTestCase library. A reference test is essentially the TDDA equivalent of a software system test (or integration test), and is characterized by:

normally testing a relatively large unit of analysis functionality (up to and including whole analytical processes)
normally generating one or more large or complex outputs that are complex to verify (e.g. datasets, tables, graphs, files etc.)
sometimes featuring unimportant run-to-run differences that mean testing equality of actual and expected output will often fail (e.g. files may contain date stamps, version numbers, or random identifiers)
often being impractical to generate by hand
often needing to be regenerated automatically (after verification!) when formats change, when bugs are fixed or when understanding changes.

The old WritableTestCase class that we made available in the tdda library provided support for writing, running and updating such reference tests in Python by extending the unittest.TestCase class and providing methods for writing reference tests and commands for running (and, where necessary, updating the reference results used by) reference tests.

Deprecating WritableTestCase in Favour of ReferenceTest

Through our use of reference testing in various contexts and projects at Stochastic Solutions, we have ended up producing three different implementations of reference-test libraries, each with different capabilities. We also become aware that an increasing number of Python developers have a marked preference for pytest over unittest, and wanted to support that more naturally. The new ReferenceTest class brings together all the capabilities we have developed, standardizes them and fills in missing combinations while providing idiomatic patterns for using it both in the context of unittest and pytest.

We have no immediate plans to remove WritableTestCase from the tdda library (and, indeed, continue use it extensively ourselves), but encourage people to adopt ReferenceTest instead as we believe it is superior in all respects.

Availability and Installation

You can now install the tdda module with pip:

pip install tdda

and the source remains available under an MIT licence from github with

git clone https://github.com/tdda/tdda.git

The tdda library works under Python 2.7 and Python 3 and includes reference test functionality mentionedabove, constraint discovery and verification, including a Pandas version and automatic discovery of regular expressions from examples (rexpy, also available online at here).

After installation, you can run TDDA's tests as follows:

$ python -m tdda.testtdda
............................................................................
----------------------------------------------------------------------
Ran 76 tests in 0.279s

Getting example code

However you have obtained the tdda module, you can get a copy of its reference test examples by running the command

$ python -m tdda.referencetest.examples

which will place them in a referencetest-examples subdirectory of your current directory. Alternatively, you can specify that you want them in a particular place with

$ python -m tdda.referencetest.examples /path/to/particular/place

There are variations for getting examples for the constraint generation and validation functionality, and for regular expression extraction (rexpy)

$ python -m tdda.constraints.examples [/path/to/particular/place]
$ python -m tdda.rexpy.examples [/path/to/particular/place]

Example use of ReferenceTest from unittest

Here is a cut-down example of how to use the ReferenceTest class with Python's unittest, based on the example in referencetest-examples/unittest. For those who prefer pytest, there is a similar pytest-ready example in referencetest-examples/pytest.

from __future__ import unicode_literals

import os
import sys
import tempfile

from tdda.referencetest import ReferenceTestCase

# ensure we can import the generators module in the directory above,
# wherever that happens to be
FILE_DIR = os.path.abspath(os.path.dirname(__file__))
PARENT_DIR = os.path.dirname(FILE_DIR)
sys.path.append(PARENT_DIR)

from generators import generate_string, generate_file

class TestExample(ReferenceTestCase):
    def testExampleStringGeneration(self):
        actual = generate_string()
        self.assertStringCorrect(actual, 'string_result.html',
                                 ignore_substrings=['Copyright', 'Version'])

    def testExampleFileGeneration(self):
        outdir = tempfile.gettempdir()
        outpath = os.path.join(outdir, 'file_result.html')
        generate_file(outpath)
        self.assertFileCorrect(outpath, 'file_result.html',
                               ignore_patterns=['Copy(lef|righ)t',
                                                'Version'])

TestExample.set_default_data_location(os.path.join(PARENT_DIR, 'reference'))


if __name__ == '__main__':
    ReferenceTestCase.main()

Notes

These tests illustrate comparing a string generated by some code to a reference string (stored in a file), and testing a file generated by code to a reference file.
We need to tell the ReferenceTest class where to find reference files used for comparison. The call to set_default_data_location, straight after defining the class, does this.
The first test generates the string actual and compares it to the contents of the file string_result.html in the data location specified (../reference). The ignore_substrings parameter specifies strings which, when encountered, cause these lines to be omitted from the comparison.
The second test instead writes a file to a temporary directory (but using the same name as the reference file). In this case, rather than ignore_strings we have used an ignore_patterns parameter to specify regular expressions which, when matched, cause lines to be disregarded in comparisons.
There are a number of other parameters that can be added to the various assert... methods to allow other kinds of discrepancies between actual and generated files to be disregarded.

Running the reference tests: Success, Failure and Rewriting

If you just run the code above, or the file in the examples, you should see output like this:

$ python test_using_referencetestcase.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

or, if you use pytest, like this:

$ pytest
============================= test session starts ==============================
platform darwin -- Python 2.7.11, pytest-3.0.2, py-1.4.31, pluggy-0.3.1
rootdir: /Users/njr/tmp/referencetest-examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 2 items

test_using_referencepytest.py ..

=========================== 2 passed in 0.01 seconds ===========================

If you then edit generators.py in the directory above and make some change to the HTML in the generate_string and generate_file functions preferably non-semantic changes, like adding an extra space before the > in </html>) and then rerun the tests, you should get failures. Changing just the generate_string function:

$ python unittest/test_using_referencetestcase.py
.1 line is different, starting at line 33
Expected file /Users/njr/tmp/referencetest-examples/reference/string_result.html
Note exclusions:
    Copyright
    Version
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
/Users/njr/tmp/referencetest-examples/reference/string_result.html".
F
======================================================================
FAIL: testExampleStringGeneration (__main__.TestExample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "unittest/test_using_referencetestcase.py", line 62, in testExampleStringGeneration
    ignore_substrings=['Copyright', 'Version'])
  File "/Users/njr/python/tdda/tdda/referencetest/referencetest.py", line 527, in assertStringCorrect
    self.check_failures(failures, msgs)
  File "/Users/njr/python/tdda/tdda/referencetest/referencetest.py", line 709, in check_failures
    self.assert_fn(failures == 0, '\n'.join(msgs))
AssertionError: 1 line is different, starting at line 33
Expected file /Users/njr/tmp/referencetest-examples/reference/string_result.html
Note exclusions:
    Copyright
    Version
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
/Users/njr/tmp/referencetest-examples/reference/string_result.html".


    ----------------------------------------------------------------------
    Ran 2 tests in 0.005s

As expected, the string test now fails, and the ReferenceTest library suggests a command you can use to diff the output: because the test failed, it wrote the actual output to a temporary file. (It reports the failure twice, once as it occurs and once at the end. This is deliberate as it's convenient to see it when it happens if the tests take any non-trivial amount of time to run, and convenient to collect together all the failures at the end too.)

Because these are HTML files, I would probably instead open them both (using the open command on Mac OS) and visually inspect them. In these case, the pages look identical, and diff will confirm that the changes are only those we expect:

$ diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
/Users/njr/tmp/referencetest-examples/reference/string_result.html
5,6c5,6
<     Copyright (c) Stochastic Solutions, 2016
<     Version 1.0.0
—
>     Copyright (c) Stochastic Solutions Limited, 2016
>     Version 0.0.0
33c33
< </html >
—
> </html>

In this case

We see that the copyright and version lines are different, but we used ignore_strings to avoid say that's OK
It shows us the extra space before the close tag.

If we are happy that the new output is OK and should replace the previous reference test, you can rerun with the -W (or --write-all).

$ python unittest/test_using_referencetestcase.py -W
Written /Users/njr/tmp/referencetest-examples/reference/file_result.html
.Written /Users/njr/tmp/referencetest-examples/reference/string_result.html
.
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

Now if you run them again without -W, the tests should all pass.

You can do the same with pytest, except that in this case you need to use

pytest --write-all

because pytest does not allow short flags.

Other kinds of tests

Since this post is already quite long, we won't go through all the other options, parameters and kinds of tests in detail, but will mention a few other points:

In addition to assertStringCorrect and assertFileCorrect there are various other methods available:
- assertFilesCorrect for checking that multiple files are as expected
- assertCSVFileCorrect for checking a single CSV file
- assertCSVFilesCorrect for checking multiple CSV files
- assertDataFramesEqual to check equality of Pandas DataFrames
- assertDataFrameCorrect to check a data frame matches data in a CSV file.
Where appropriate, assert methods accept the various optional parameters, including:
- lstrip — ignore whitespace at start of files/strings (default False)
- rstrip — ignore whitespace at end of files/strings (default False)
- kind — optional category for test; these can be used to allow selective rewriting of test output with the -w/--write flag
- preprocess — function to call on expected and actual data before comparing results

We'll add more detail in future posts.

If you'd like to join the slack group where we discuss TDDA and related topics, DM your email address to @tdda on Twitter and we'll send you an invitation.

Introducing Rexpy: Automatic Discovery of Regular Expressions

2016-11-11T15:00:00+00:00

Motivation

There's a Skyscanner data feed we have been working with for a year or so. It's produced some six million records so far, each of which has a transaction ID consisting of three parts—a four-digit alphanumeric transaction type, a numeric timestamp and a UUID, with the three parts separated by hyphens. Things like this:

adyt-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044
ooqt-1466602219-012da468-a820-11e6-8ba1-b8f6b118f191
z65e-1448755954-2d677190-ecda-4279-acb2-31a31ec8e86e

The only thing that really matters is that the transaction IDs are unique, but if everything is working correctly, the three parts should have the right structure and match data that we have in other fields in the feed.

We're pretty familiar with this data; or so we thought . . .

We've added a command to our data analysis software—Miró—for characterizing the patterns in string fields. The command is rex and when we run it on the field (tid), by saying:

[1] load skyscanner/transactions.miro
transactions.miro: 6,020,946 records; 6,020,946 (100%) selected; 57 fields.

[2] rex tid

this is the output:

^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-$
^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$
^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
^q\_\_s\-\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$

The rex command has found four patterns that, between them, characterize all the values in the field, and it has expressed these in the form of regular expressions. (If you aren't familiar with regular expressions, you might want to read the linked Wikipedia article.)

The third pattern in the list is more-or-less the one we thought would characterize all the transaction IDs. It reads:

start¹ (^)
then 2--4 letters or numbers ([A-Za-z0-9]{2,4})
then a mixture one to three hyphens and underscores ([\-\_]{1,3})
then 10 digits (\d{10})
then a hyphen (\-)
then a UUID, which is a hyphen-separated collection of 28 hex digits, in groups of 8, 4, 4, 4 and 12 ([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})
and end² ($).

The only difference between this and what we expected is that it turns out that some of the "alphanumeric transaction types" end with two underscores rather than being stricly alphanumeric, and the rex command has expressed this as "2-4 alphanumeric characters followed by 1-3 hyphens or underscores", rather than "2-4 characters that are alphanumeric or underscores, followed by a hyphen", which would have been more like the way we think about it.

What are the other three expressions?

The first one is the same, but without the UUID. Occasionally the UUID is missing (null), and when this happens the UUID portion of the tid is blank. It is possible to write a regular expression that combines these two cases, but rex doesn't quite yet know how to do this. The way to do it would be to make the UUID optional by enclosing it in parentheses and following it by a question mark:

([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})?

which reads zero or one UUIDs, given that we know the pattern inside the parentheses corresponds to a UUID.

The last pattern is another one we could unify with the other two: it is the same except that it identifies a particular transaction type that again uses underscores, but now in the middle: q__s. So we could replace those three (and might, in an ideal world, want rex to find)

^[A-Za-z0-9\_]{4}[\-]\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$

This is exactly what we described, except with the inclusion of _ as an allowable character in the 4-character transaction type at the start.

But what about the second pattern?

^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$

It's actually a single, completely specific transaction ID, that matches the main pattern except for omitting the hyphens in the UUID. This shouldn't be possible—by which I mean, the UUID generator should never omit the hyphens. But clearly either it did for this one transaction, or something else stripped them out later. Either way, it shows that among the several million transactions, there a bad transaction ID.

A further check in Miró shows that this occurs just a single time (i.e. it is not duplicated):

[4]>  select tid = "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c"
transactions.miro: 6,020,946 records; 1 (0%) selected; 57 fields.
Selection:
(= tid "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c")

So we've learnt something interesting and useful about our data, and Miró's rex command has helped us produce regular expressions that characterize all our transaction IDs. It hasn't done a perfect job, but it was pretty useful, and it's easy for us to merge the three main patterns by hand. We plan to extend the functionality to cover these cases better over coming weeks.

Rexpy outside Miró

If you don't happen to have Miró, or want to find regular expressions from data outside the context of a Miró dataset, you can use equivalent functionality directly from the rexpy module of our open-source, MIT-licenced tdda package, available from Github:

git clone https://github.com/tdda/tdda.git

This provides both a Python API for finding regular expressions from example data, and a command line tool.

The Rexpy Command Line

If you just run rexpy.py from the command line, with no arguments, it expects you to type (or paste) a set of strings, and when you finish (with CTRL-D on unix-like systems, or CTRL-Z on Windows systems) it will spit out the regular expressions it thinks you need to match them: For example:

$ python rexpy.py
EH1 7JQ
WC1 4AA
G2 3PQ
^[A-Za-z0-9]{2,3}\ [A-Za-z0-9]{3}$

Or, of course, you can pipe input into it. If, for example, I do that in a folder full of photos from a Nikon camera, I get

$ ls *.* | python ~/python/tdda/rexpy/rexpy.py
^DSC\_\d{4}\.[A-Za-z]{3}$

(because these are all files like DSC_1234.NEF or DSC_9346.xmp).

You can also give it a filename as the first argument, in which case it will read strings (one per line) from a file.

So given the file ids.txt (which is in the rexpy/examples subdirectory in the tdda repository), containing:

123-AA-971
12-DQ-802
198-AA-045
1-BA-834

we can use rexpy on it by saying:

$ python rexpy.py examples/ids.txt
^\d{1,3}\-[A-Z]{2}\-\d{3}$

and if there is a header line you want to skip, you can add either -h or --header to tell Rexpy to skip that.

You can also give a filename as a second command line argument, in which case Rexpy will write the results (one per line) to that file.

Motivation for Rexpy

There's an old joke among programmers, generally attributed to Jamie Zawinski that bears repeating:

Some people, when confronted with a problem, think "I know, I'll use regular expressions."

Now they have two problems.

— Jamie Zawinski

As powerful as regular expressions are, even their best friends would probably concede that they can be hard to write, harder to read and harder still to debug.

Despite this, regular expressions are an attractive way to specify constraints on string fields for two main reasons:

First, regular expressions constitute a fast, powerful, and near-ubiquitous mechanism for describing a wide variety of possible structures in string data.
Secondly, regular expressions include the concept of capture groups. You may recall that in a grouped³ regular expression, one or more subcomponents is enclosed in parentheses and its matched value can be extracted. As we will see below, this is particulary interesting in the context of test-driven data analysis.

For concreteness, suppose we have a field containing strings such as:

123-AA-971
12-DQ-802
198-AA-045
1-BA-834

One obvious regular expression to describe these would be:

^\d+\-[A-Z]{2}\-\d{3}$

(start [^] with one or more digits [\d+], then a hyphen [\-], then two capital letters [A-Z]{2}, then another hyphen [\-], then three digits [\d{3}] then stop [$]).

But it is in the nature of regular expressions that there are both more and less specific formulations we could use to describe the same strings, ranging from the fairly specific:

^1[2389]{0,2}\-(AA|DQ|BA)\-\d{3}$

(start [^] with a 1 [1], followed by up to two digits chosen from {2, 3, 8, 9} [[2389]{0,2}], followed by a hyphen [\-] then AA, DA or BA [(AA|DQ|BA)], followed by another hypen [\-] then three digits [\d{3}] and finish [$])

to the fully general

^.*$

(start [^], then zero or more characters [.*], then stop [$]).

Going back to our first formulation

^\d+\-[A-Z]{2}\-\d{3}$

one possible grouped equivalent is

^(\d+)\-([A-Z]{2})\-(\d{3})$

The three parenthesized sections are known as groups, and regular expression implementations usually provide a way of looking up the value of these groups when a particular string is matched by it. For example, in Python we might say:

import re

pattern = re.compile(r'^(\d+)\-([A-Z]{2})\-(\d{3})$')
m = re.match(pattern, '123-ZQ-987')
if m:
    print('1: "%s"  2: "%s"  3: "%s"' % (m.group(1), m.group(2),
                                         m.group(3)))

m = re.match(pattern, '00-FT-020')
if m:
    print('1: "%s"  2: "%s"  3: "%s"' % (m.group(1), m.group(2),
                                         m.group(3)))

If we run this, the output is:

1: "123"  2: "ZQ"  3: "987"
1: "00"  2: "FT"  3: "020"

The Big Idea: Automatic Discovery of Constraints on String Structure

In the context of test-driven data analysis, the idea is probably obvious: for string fields with some kind of structure—telephone numbers, post codes, zip codes, UUIDs, more-or-less any kind of structured identifier, airline codes, airport codes, national insurance numbers, credit card numbers, bank sort codes, social security numbers—we would like to specify constraints on the values in the field using regular expressions. A natural extension to the TDDA constraints file format introduced in the last post would be something along the lines of:⁴

"regex": "^\\d+\\-[A-Z]{2}\\-\\d{3}$"

if there is a single regular expression that usefully matches all the allowed field values. If a field contains strings in multiple formats that are so different that using a single regular expression would be unhelpful, we might instead provide a list of regular expressions, such as:

"regex": ["^\\d+\\-[A-Z]{2}\\-\\d{3}$", "^[A-Z]{5}\\+\\d{5}$"]

which would mean each field values should match at least one of the regular expressions in the list.

Just as with the automatic discovery of other types of constraints, we want the TDDA constraint discovery library to be able to suggest suitable regular expression constraints on string fields, where appropriate. This is where Rexpy comes in:

Rexpy is a library for finding regular expressions that usefully characterize a given corpus of strings.

We are choosing to build Rexpy as a stand-alone module because it has clear utility outside the context of constraint generation.

The Other Idea: Automatically discovered Quasi-Fields

We can imagine going beyond simply using (and automatically discovering) regular expressions to describe constraints on string data. Once we have useful regular expressions that characterize some string data—and more particularly, in cases where we have a single regular expression that usefully describes the structure of the string—we can tag meaningful subcomponents. In the example we used above, we had three groups:

(\d+) — the digits at the start of the identifier
([A-Z]{2}) — the pair of letters in the middle
(\d{3}) — the three digits at the end

It's not totally trivial to work out which subcomponents are useful to tag, but I think we could probably find pretty good heuristics that would do a reasonable job, at least in simple cases. Once we know the groups, we can potentially start to treat them as quasi-fields in their own right. So in this case, if we had a field ID containing string identifiers like those shown, we might create from that three quasi fields as follows:

ID_qf1, of type int, values 123, 12, 198, and 1
ID_qf2, of type string, values AA, DQ, AA, and BA
ID_qf3, of type int, values 971, 802, 45 and 834

Once we have these quasi fields, we can potentially subject them to the usual TDDA constraint generation process, which might suggest extra, stronger constraints. For example, we might find the the numbers in ID_qf3 are unique, or form a contiguous sequence, and we might find that although our regular expression only specified that there were two letters in the middle, in fact the only combinations found in the (full) data were AA, DQ, BA and BB.

I don't want to suggest that all of this is easy: there are three or four non-trivial steps to get from where Rexpy is today to this full vision:

First, it has to get better at merging related regular expressions into a useful single regular expression with optional components and alternations than it is today.
Secondly, it would have to be able to identify good subcomponents for grouping.
Thirdly, it would have to do useful type inference on the groups it identifies.
Finally, it would have to be extended to create the quasi fields and apply the TDDA discovery process to them.

But none of this seems hopelessly difficult. So continue to watch this space.

The Rexpy API

Assuming you have cloned the TDDA library somewhere on your PYTHONPATH, you should then be able to use it through the API as follows:

from tdda import rexpy

corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
results = rexpy.extract(corpus)
print('Number of regular expressions found: %d' % len(results))
for rex in results:
    print('   ' +  rex)

which produces:

$ python ids.py
Number of regular expressions found: 1
   ^\d{1,3}\-[A-Z]{2}\-\d{3}$

In general, Rexpy returns a list of regular expressions, and at the moment it is not very good at merging them. There's quite a lot of unused code that, I hope, will soon allow it to do so. But even as it is, it can do a reasonable job of characterizing simple strings. Within reason, the more examples you give it, the better it can do, and it is reasonably performant with hundreds of thousands or even millions of strings.

Rexpy: the Pandas interface

There's also a Pandas binding. You can say:

from tdda import rexpy
results = rexpy.pdextract(df['A'])

to find regular expressions for the strings in column A of a dataframe df, or

from tdda import rexpy
results = rexpy.pdextract([df['A'], df['B'], df['C']])

to find a single set of regular expressions that match all the strings from columns A, B and C. In all cases, null (pandas.np.NaN) values are ignored. The results are returned as a (Python) list of regular expressions, as strings.

Final Words

Take it for a spin and let us know how you get on.

As always, follow us or tweet at us (@tdda0) if you want to hear more, and watch out for the TDDA Slack team, which will be opening up very soon.

More precisely, ^ matches start of line; by default, rex always starts regular expressions with ^ and finishes them with $ on the assumption that strings will be presented one-per-line ↩
Again, more precisely, $ matches end of line. ↩
Grouped regular expressions are also referred to variously as marked or tagged regular expressions, and the groups are also sometimes known as subexpressions. ↩
One thing to notice here is that in JSON we need extra backslashes in the regular expression. This is because regular expressions themselves make fairly liberal use of backslashes, and JSON uses backslash as an escape character. We could avoid this in Python by using raw strings, which are introduced with an r prefix (e.g. '^(\d+)\-([A-Z]{2})\-(\d{3})$'). In such strings, backslashes are not treated in any special way. Since JSON has no equivalent mechanism, we have to escape all our backslashes, leading to the ugliness above. ↩

The TDDA Constraints File Format

2016-11-04T14:30:00+00:00

Background

We recently extended the tdda library to include support for automatic discovery of constraints from datasets, and for verification of datasets against constraints. Yesterday's post—Constraint Discovery and Verification for Pandas DataFrames—describes these developments and the API.

The library we published is intended to be a base for producing various implementations of the constraint discovery and verification process, and uses a JSON file format (extension .tdda) to save constraints in a form that should be interchangable between implementations. We currently have two compatible implementations—the open-source Pandas code in the library and the implementation in our own analytical software, Miró.

This post describes the .tdda JSON file format. The bulk of it is merely a snapshot of the documentation shipped with the library in the Github repository (visible on Github). We intend to keep that file up to date as we expand the format.

The TDDA JSON File Format

The TDDA constraints library (Repository https://github.com/tdda/tdda, module constraints) uses a JSON file to store constraints.

This document describes that file format.

Purpose

TDDA files describe constraints on a dataset, with a view to verifying the dataset to check whether any or all of the specified constraints are satisfied.

A dataset is assumed to consist of one or more fields (also known as columns), each of which has a (different) name¹ and a well-defined type.² Each field has a value for each of a number of records (also known as rows). In some cases, values may be null (or missing).³ Even a field consisting entirely of nulls can be considered to have a type.

Familiar examples of datasets include:

tables in relational databases
DataFrames in (Pandas and R)
flat ("CSV") files (subject to type inference or assigment)
sheets in spreadsheets, or areas within sheets, if the columns have names, are not merged, and have values with consistent meanings and types over an entire column
more generally, many forms of tabular data.

In principle, TDDA files are intended to be capable of supporting any kind of constraint regarding datasets. Today, we are primarily concerned with * field types * minimum and maximum values (or in the case of string fields, minumum and maximum string lengths) * whether nulls are allowed, * whether duplicate values are allowed within a field * the allowed values for a field.

The format also has support for describing relations between fields.

Future extensions we can already foresee include:

dataset-level constraints (e.g. numbers of records; required or disallowed fields)
sortedness of fields, of field values or both
regular expressions to which string fields should conform
constraints on subsets of the data (e.g. records dated after July 2016 should not have null values for the ID field)
constraints on substructure within fields (e.g. constraints on tagged subexpressions from regular expressions to which string fields are expected to conform)
potentially checksums (though this is more suitable for checking the integreity of transfer of a specific dataset, than for use across multiple related datasets)
constraints between datasets, most obviously key relations (e.g. every value field KA in dataset A should also occur in field KB in dataset B).

The motivation for generating, storing and verifying datasets against such sets of constraints is that they can provide a powerful way of detecting bad or unexpected inputs to, or outputs from, a data analysis process. They can also be valuable as checks on intermediate results. While manually generating constraints can be burdensome, automatic discovery of constraints from example datasets, potentially followed by manual removal of over-specific constraints, provides a good cost/benefit ratio in many situations.

Filename and encoding

The preferred extension for TDDA Constraints files is .tdda.
TDDA constraints files must be encoded as UTF-8.
TDDA files must be valid JSON.

Example

This is an extremely simple example TDDA file:

{
    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        },
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [
                "one",
                "two"
            ]
        }
    }
}

General Structure

A TDDA file is a JSON dictionary. There are currently two supported top-level keys:

fields: constraints for individual fields, keyed on the field name. (In TDDA, we generally refer to dataset columns as fields.)
field_groups: constraints specifying relations between multiple fields (two, for now). field_groups constraints are keyed on a comma-separated list of the names of the fields to which they relate, and order is significant.

Both top-level keys are optional.

In future, we expect to add further top-level keys (e.g. for possible constraints on the number of rows, required or disallowed fields etc.)

The order of constraints in the file is immaterial (of course; this is JSON), though writers may choose to present fields in a particular order, e.g. dataset order or sorted on fieldname.

Field Constraints

The value of a field constraints entry (in the fields section) is a dictionary keyed on constraint kind. For example, the constraints on field a in the example above are specified as:

"a": {
    "type": "int",
    "min": 1,
    "max": 9,
    "sign": "positive",
    "max_nulls": 0,
    "no_duplicates": true
}

The TDDA library currently recognizes the following kinds of constraints:

type
min
max
min_length
max_length
sign
sign
max_nulls
no_duplicates
allowed_values

Other constraint libraries are free to define their own, custom kinds of constraints. We will probably recommend that non-standard constraints have names beginning with colon-terminated prefix. For example, if we wanted to support more specific Pandas type constraints, we would probably use a key such as pandas:type for this.

The value of a constraint is often simply a scalar value, but can be a list or a dictionary; when it is a dictionary, it should include a key value, with the principle value associated with the constraint (true, if there is no specific value beyond the name of the constraint).

If the value of a constraint (the scalar value, or the value key if the value is a dictionary) is null, this is taken to indicate the absence of a constraint. A constraint with value null should be completely ignored, so that a constraints file including null-valued constraints should produce identical results to one omitting those constraints. (This obviously means that we are discouraging using null as a meaningful constraint value, though a string "null" is fine, and in fact we use this for sign constraints.)

The semantics and values of the standard field constraint types are as follows:

type: the allowed (standard, TDDA) type of the field. This can be a single value from bool (boolean), int (integer; whole-numbered); real (floating point values); string (unicode in Python3; byte string in Python2) or date (any kind of date or date time, with or without timezone information). It can also be a list of such allowed values (in which case, order is not significant).

It is up to the generation and verification libraries to map between the actual types in whatever dataset/dataframe/table/... object is used and these TDDA constraint types, though over time we may provide further guidance.

Examples:
- {"type": "int"}
- {"type": ["int", "real"]}
min: the minimum allowed value for a field. This is often a simple value, but in the case of real fields, it can be convenient to specify a level of precision. In particular, a minimum value can have precision (default: fuzzy):
- closed: all non-null values in the field must be greater than or equal to the value specified.
- open: all non-null values in the field must be strictly greater than the value specified.
- fuzzy: when the precision is specified as fuzzy, the verifier should allow a small degree of violation of the constraint without generating a failure. Verifiers take a parameter, epsilon, which specifies how the fuzzy constraints should be taken to be: epsilon is a fraction of the constraint value by which field values are allowed to exceed the constraint without being considered to fail the constraint. This defaults to 0.01 (i.e. 1%). Notice that this means that constraint values of zero are never fuzzy.
Examples are:
- {"min": 1},
- {"min": 1.2},
- {"min": {"value": 3.4}, {"precision": "fuzzy"}}.
JSON, does not—of course—have a date type. TDDA files specifying dates should use string representations in one of the following formats:
- YYYY-MM-DD for dates without times
- YYYY-MM-DD hh:mm:ss for date-times without timezone
- YYYY-MM-DD hh:mm:ss [+-]ZZZZ for date-times with timezone.
We recommend that writers use precisely these formats, but that readers offer some flexibility in reading, e.g. accepting / as well as - to separate date components, and T as well as space to separate the time component from the date.
max: the maximum allowed value for a field. Much like min, but for maximum values. Examples are:
- {"max": 1},
- {"max": 1.2},
- {"max": {"value": 3.4}, {"precision": "closed"}}.
Dates should be formatted as for min values.
min_length: the minimum allowed length of strings in a string field. How unicode strings are counted is up to the implementation. Example:
- {"min_length": 2}
max_length: the minimum allowed length of strings in a string field. How unicode strings are counted is up to the implementation.
- {"max_length": 22}
sign: For numeric fields, the allowed sign of (non-null) values. Although this overlaps with minimum and maximum values, it it often useful to have a separate sign constraint, which carries semantically different information. Allowed values are:
- positive: All values must be greater than zero
- non-negative: No value may be less than zero
- zero: All values must be zero
- non-positive: No value may be greater than zero
- negative: All values must be negative
- null: No signed values are allowed, i.e. the field must be entirely null.
Example:
- {"sign": "non-negative"}
max_nulls: The maximum number of nulls allowed in the field. This can be any non-negative value. We recommend only writing values of zero (no nulls values are allowed) or 1 (At most a single null is allowed) into this constraint, but checking against any value found.

Example:
- {"max_nulls": 0}
no_duplicates: When this constraint is set on a field (with value true), it means that each non-null value must occur only once in the field. The current implementation only uses this constraint for string fields.

Example:
- {"no_duplicates": true}
allowed_values: The value of this constraint is a list of allowed values for the field. The order is not significant.

Example:
- {"allowed_values": ["red", "green", "blue"]}

MultiField Constraints

Multifield constraints are not yet being generated by this implementation, though our (proprietary) Miró implementation does produce them. The currently planned constraint types for field relations cover field equality and inequality for pairs of fields, with options to specify null relations too.

A simple example would be:

"field_groups": {
    "StartDate,EndDate": {"lt": true}
}

This is a less-than constraint, to be interpreted as

StartDate < EndDate wherever StartDate and EndDate and both non-null.

The plan is to support the obvious five equality and inequality relations:

lt: first field value is strictly less than the second field value for each record
lte: first field value is less than or equal to the second field value for each record
eq: first field value is equal to the second field value for each record
gte: first field value is greater than or equal to the second field value for each record
gt: first field value is strictly greater than the second field value for each record.

In the case of equality (only), we will probably also support a precision parameter with values fuzzy or precise.

There should probably also be an option to specify relations between null values in pairs of columns, either as a separate constraint or as a quality on each of the above.

Pandas, of course, allows multiple columns to have the same name. This format makes no concessions to such madness, though there is nothing to stop a verifier or generator sharing constraints across all columns with the same name. The Pandas generators and verifiers in this library do not currently attempt to do this. ↩
Pandas also allows columns of mixed type. Again, this file format does not recognize such columns, and it would probably be sensible not to use type constraints for columns of mixed type. ↩
Pandas uses not-a-number (pandas.np.NaN) to represent null values for numeric, string and boolean fields; it uses a special not-a-time (pandas.NaT) value to represent null date (Timestamp) values. ↩

Constraint Discovery and Verification for Pandas DataFrames

2016-11-03T15:30:00+00:00

Background

In a previous post, Constraints and Assertions, we introduced the idea of using constraints to verify input, output and intermediate datasets for an analytical process. We also demonstrated that candidate constraints can be automatically generated from example datasets. We prototyped this in our own software (Miró) expressing constraints as lisp S-expressions.

Improving and Extending the Approach: Open-Source Pandas Code

We have now taken the core ideas, polished them a little and made them available through an open-source library, currently on Github. We will push it to PyPI when it has solidified a little further.

The constraint code I'm referring to is available in the constraints module of the tdda repository for the tdda user on github. So if you issue the command

git clone https://github.com/tdda/tdda.git

in a directory somewhere on your PYTHONPATH, this should enable you to use it.

The TDDA library:

is under an MIT licence;
runs under Python2 and Python3;
other than Pandas itself, has no dependencies outside the standard library unless you want to use feather files (see below);
includes a base layer to help with building constraint verification and discovery libraries for various systems;
includes Pandas implementations of constraint discovery and verification through a (Python) API;
uses a new JSON format (normally stored in .tdda files) for saving constraints;
also includes a prototype command-line tool for verifying a dataframe stored in feather format against a .tdda file of constraints. Feather is a file format developed by Wes McKinney (the original creator of Pandas) and Hadley Wickham (of ggplot and tidyverse fame) for dataframes that allows interchange between R and Pandas while preserving type information and exact values. It is based on the Apache Arrow project. It can be used directly or using our ugly-but-useful extension library pmmif, which allows extra metadata (including extended type information) to be saved alongside a .feather file, in a companion .pmm file.

Testing

All the constraint-handling code is in the constraints module within the TDDA repository.

After you've cloned the repository, it's probably a good idea to run the tests. There are two sets, and both should run under Python2 or Python3.

$ cd tdda/constraints
$ python testbase.py
.....
----------------------------------------------------------------------
Ran 5 tests in 0.003s

OK

$ python testpdconstraints.py
.....................
----------------------------------------------------------------------
Ran 21 tests in 0.123s

OK

There is example code (which we'll walk through below) in the examples subdirectory of constraints.

Basic Use

Constraint Discovery

Here is some minimal code for getting the software to discover constraints satisfied by a Pandas DataFrame:

import pandas as pd
from tdda.constraints.pdconstraints import discover_constraints

df = pd.DataFrame({'a': [1,2,3], 'b': ['one', 'two', pd.np.NaN]})
constraints = discover_constraints(df)
with open('/tmp/example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

(This is the core of the example code in tdda/constraints/examples/simple_discovery.py, and is included in the docstring for the discover_constraints function.)

Hopefully the code is fairly self-explanatory, but walking through the lines after the imports:

We first generate a trivial 3-row DataFrame with an integer column a and a string column b. The string column includes a Pandas null (NaN).
We then pass that DataFrame to the discover_constraints function from tdda.constraints.pdconstraints, and it returns a DatasetConstraints object, which is defined in tdda.constraints.base.
The resulting constraints object has a to_json() method which converts the structured constraints into a JSON string.
In the example, we write that to /tmp/example_constraints.tdda; we encourage everyone to use the .tdda extension for these JSON constraint files.

This is what happens if we run the example file:

$ cd tdda/constraints/examples

$ python simple_discovery.py
Written /tmp/example_constraints.tdda successfully.

$ cat /tmp/example_constraints.tdda
{
    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        },
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [
                "one",
                "two"
            ]
        }
    }
}

As you can see, in this case, the system has 'discovered' six constraints for each field, and it's rather easy to read what they are and at least roughly what they mean. We'll do a separate post describing the .tdda JSON file format, but it's documented in tdda/constraints/tdda_json_file_format.md in the repository (which—through almost unfathomable power of Github—means you can see it formatted here).

Constraint Verification

Now that we have a .tdda file, we can use it to verify a DataFrame.

First, let's look at code that should lead to a successful verification (this code is in tdda/constraints/examples/simple_verify_pass.py).

import pandas as pd
from tdda.constraints.pdconstraints import verify_df

df = pd.DataFrame({'a': [2, 4], 'b': ['one', pd.np.NaN]})
v = verify_df(df, 'example_constraints.tdda')

print('Passes: %d' % v.passes)
print('Failures: %d\n\n\n' % v.failures)
print(str(v))
print('\n\n')
print(v.to_frame())

Again, hopefully the code is fairly self-explanatory, but:

df is a DataFrame that is different from the one we used to generate the constraints, but is nevertheless consistent with the constraints.
The verify_df function from tdda.constraints.pdconstraints takes a DataFrame and the location of a .tdda file and verifies the DataFrame against the constraints in the file. The function returns a PandasVerification object. The PandasVerification class is a subclass of Verification from tdda.constraints.base, adding the ability to turn the verification object into a DataFrame.
All verification objects include passes and failures attributes, which respectively indicate the number of constraints that passed and the number that failed. So the simplest complete verification is simply to check that v.failures == 0.
Verification methods also include a __str__ method. Its output currently includes a section for fields and a summary.
the .to_frame() method converts a PandasVerification into a Pandas DataFrame.

If we run this, the result is as follows:

$ python simple_verify_pass.py
Passes: 12
Failures: 0



FIELDS:

a: 0 failures  6 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓  no_duplicates ✓

b: 0 failures  6 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓  allowed_values ✓

SUMMARY:

Passes: 12
Failures: 0



  field  failures  passes  type   min min_length   max max_length  sign  \
0     a         0       6  True  True        NaN  True        NaN  True
1     b         0       6  True   NaN       True   NaN       True   NaN

  max_nulls no_duplicates allowed_values
0      True          True            NaN
1      True          True           True

In the DataFrame produced from the Verification

True indicates a constraint that was satisfied in the dataset;
False indicates a constraint that was not satisfied in the dataset;
NaN (null) indicates a constraint that was not present for that field.

As you would expect, the field, failures and passes columns are, respectively, the name of the field, the number of failures and the number of passes for that field.

If we now change the DataFrame definition to:

df = pd.DataFrame({'a': [0, 1, 2, 10, pd.np.NaN],
                   'b': ['one', 'one', 'two', 'three', pd.np.NaN]})

(as is the case in tdda/constraints/examples/simple_verify_fail.py), we now expect some constraint failures. If we run this, we see:

$ python simple_verify_fail.py
Passes: 5
Failures: 7



FIELDS:

a: 4 failures  2 passes  type ✓  min ✗  max ✗  sign ✗  max_nulls ✗  no_duplicates ✓

b: 3 failures  3 passes  type ✓  min_length ✓  max_length ✗  max_nulls ✓  noh_duplicates ✗  allowed_values ✗

SUMMARY:

Passes: 5
Failures: 7



  field  failures  passes  type    min min_length    max max_length   sign  \
0     a         4       2  True  False        NaN  False        NaN  False
1     b         3       3  True    NaN       True    NaN      False    NaN

  max_nulls no_duplicates allowed_values
0     False          True            NaN
1      True         False          False

Final Notes

There are more options and there's more to say about the Pandas implementation, but that's probably enough for one post. We'll have follow-ups on the file format, more options, and the foibles of Pandas.

If you want to hear more, follow us on twitter at @tdda0.

WritableTestCase: Example Use

2016-09-18T15:30:00+01:00

In my PyCon UK talk yesterday I promised to update the and document the copy of writabletestcase.WritableTestCase on GitHub.

The version I've put up is not quite as powerful as the example I showed in the talk—that will follow—but has the basic functionality.

I've now added examples to the repository and, below, show how these work.

The library is available with

git clone https://github.com/tdda/tdda.git

WritableTestCase extends unittest.TestCase, from the Python's standard library, in three main ways:

It provides methods for testing strings produced in memory or files written to disk against reference results in files. When a test fails, rather than just showing a hard-to-read difference, it writes the actual result to file (if necessary) and then shows the diff command needed to compare it—something like this:
```
Compare with "diff /path/to/actual-output /path/to/expected-output"
```
Obviously, the diff command can be replaced with a graphical diff tool, an open command or whatever.

Although this shouldn't be necessary (see below), you also have the option, after verification, or replacing diff with cp to copy the actual output as the new reference output.
Secondly, the code supports excluding lines of the output contain nominated strings. This is often handy for excluding things like date stamps, version numbers, copyright notices etc. These often appear in output, and vary, without affecting the semantics.

(The version of the library I showed at PyCon had more powerful variants of this, which I'll add later.)
Thirdly, if you verify that the new output is correct, the library supports re-running with the -w flag to overwrite the expected ("reference") results with the ones generated by the code.

Obviously, if this feature is abused, the value of the tests will be lost, but provided you check the output carefully before re-writing, this is a significant convenience.

The example code is in the examples subdirectory, called test_using_writabletestcase.py. It has two test functions, one of which generates HTML output as a string, and the other of which produces some slightly different HTML output as a file. In each case, the output produced by the function is not identical to the expected "reference" output (in examples/reference), but differs only on lines containing "Copyright" and "Version". Since these are passed into the test as exclusions, the tests should pass.

Here is the example code:

# -*- coding: utf-8 -*-
"""
test_using_writabletestcase.py: A simple example of how to use
tdda.writabletestcase.WritableTestCase.

Source repository: https://github.com/tdda/tdda

License: MIT

Copyright (c) Stochastic Solutions Limited 2016
"""
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import tempfile

from tdda import writabletestcase
from tdda.examples.generators import generate_string, generate_file


class TestExample(writabletestcase.WritableTestCase):
    def testExampleStringGeneration(self):
        """
        This test uses generate_string() from tdda.examples.generators
        to generate some HTML as a string.

        It is similar to the reference HTML in
        tdda/examples/reference/string_result.html except that the
        Copyright and version lines are slightly different.

        As shipped, the test should pass, because the ignore_patterns
        tell it to ignore those lines.

        Make a change to the generation code in the generate_string
        function in generators.py to change the HTML output.

        The test should then fail and suggest a diff command to run
        to see the difference.

        Rerun with

            python test_using_writabletestcase.py -w

        and it should re-write the reference output to match your
        modified results.
        """
        actual = generate_string()
        this_dir = os.path.abspath(os.path.dirname(__file__))
        expected_file = os.path.join(this_dir, 'reference',
                                     'string_result.html')
        self.check_string_against_file(actual, expected_file,
                                       ignore_patterns=['Copyright',
                                                        'Version'])


    def testExampleFileGeneration(self):
        """
        This test uses generate_file() from tdda.examples.generators
        to generate some HTML as a file.

        It is similar to the reference HTML in
        tdda/examples/reference/file_result.html except that the
        Copyright and version lines are slightly different.

        As shipped, the test should pass, because the ignore_patterns
        tell it to ignore those lines.

        Make a change to the generation code in the generate_file function
        in generators.py to change the HTML output.

        The test should then fail and suggest a diff command to run
        to see the difference.

        Rerun with

            python test_using_writabletestcase.py -w

        and it should re-write the reference output to match your
        modified results.
        """
        outdir = tempfile.gettempdir()
        outpath = os.path.join(outdir, 'file_result.html')
        generate_file(outpath)
        this_dir = os.path.abspath(os.path.dirname(__file__))
        expected_file = os.path.join(this_dir, 'reference',
                                     'file_result.html')
        self.check_file(outpath, expected_file,
                        ignore_patterns=['Copyright', 'Version'])


if __name__ == '__main__':
    writabletestcase.main(argv=writabletestcase.set_write_from_argv())

If you download it, and try running it, you should output similar to the following:

$ python test_using_writabletestcase.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.004s

OK

The reference output files it compares against are:

examples/reference/string_result.html
examples/reference/file_result.html

To see what happens when there's a difference, try editing one or both of the main functions that generate the HTML in generators.py. They're most just using explicit strings, so the simplest thing is just to change a word or something in the output.

If I change It's to It is in the generate_string() function and rerun, I get this output:

$ python test_using_writabletestcase.py
.
File check failed.
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html".

Note exclusions:
Copyright
Version
F
======================================================================
FAIL: testExampleStringGeneration (__main__.TestExample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_using_writabletestcase.py", line 55, in testExampleStringGeneration
    'Version'])
  File "/Users/njr/python/tdda/writabletestcase.py", line 294, in check_string_against_file
    self.assertEqual(failures, 0)
AssertionError: 1 != 0

----------------------------------------------------------------------
Ran 2 tests in 0.005s

FAILED (failures=1)
1 godel:$

If I then run the diff command it suggests, the output is:

$ diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html
5,6c5,6
<     Copyright (c) Stochastic Solutions, 2016
<     Version 1.0.0
—
>     Copyright (c) Stochastic Solutions Limited, 2016
>     Version 0.0.0
29c29
<     It is not terribly exciting.
—
>     It's not terribly exciting.

Here you can see the differences that are excluded, and the change I actually made.

(The version I showed at PyCon has an option to see the only the non-excluded differences, but this version doesn't; that will come!)

If I now run again using -w, to re-write the reference output, it shows:

$ python test_using_writabletestcase.py -w
.Expected file /Users/njr/python/tdda/examples/reference/string_result.html written.
.
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

And, of course, if I run a third time, without -w, the test now passes:

$ python test_using_writabletestcase.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

So that's a quick overview of it works.

Slides and Rough Transcript of TDDA talk from PyCon UK 2016

2016-09-17T15:30:00+01:00

Python UK 2016, Cardiff.

I gave a talk on test-driven data analysis at PyCon UK 2016, Cardiff, today.

The slides (which are kind-of useless without the words) are available here.

More usefully, a rough transcript, with thumbnail slides, is available here.

Extracting More Apple Health Data

2016-04-20T15:30:00+01:00

The first version of the Python code for extracting data from the XML export from the Apple Health on iOS neglected to extract Activity Summaries and Workout data. We will now fix that.

As usual, I'll remind you how to get the code, if you want, then discuss the changes to the code, the reference test and the unit tests. Then in the next post, we'll actually start looking at the data.

The Updated Code

As before, you can get the code from Github with

$ git clone https://github.com/tdda/applehealthdata.git

or if you have pulled it before, with

$ git pull --tags

This version of the code is tagged with v1.3, so if it has been updated by the time you read this, get that version with

$ git checkout v1.3

I'm not going to list all the code here, but will pull out a few key changes as we discuss them.

Changes

Change 1: Change FIELDS to handle three different field structures.

The first version of the extraction code wrote only Records, which contain the granular activity data (which is the vast bulk of it, by volume).

Now I want to extend the code to handle the other two main kinds of data it writes—ActivitySummary and Workout elements in the XML.

The three different element types contain different XML attributes, which correspond to different fields in the CSV, and although they overlap, I think the best approach is to have separate record structures for each, and then to create a dictionary mapping the element kind to its field information.

Accordingly, the code that sets FIELDS changes to become:

RECORD_FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('type', 's'),
    ('unit', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('value', 'n'),
))

ACTIVITY_SUMMARY_FIELDS = OrderedDict((
    ('dateComponents', 'd'),
    ('activeEnergyBurned', 'n'),
    ('activeEnergyBurnedGoal', 'n'),
    ('activeEnergyBurnedUnit', 's'),
    ('appleExerciseTime', 's'),
    ('appleExerciseTimeGoal', 's'),
    ('appleStandHours', 'n'),
    ('appleStandHoursGoal', 'n'),
))

WORKOUT_FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('workoutActivityType', 's'),
    ('duration', 'n'),
    ('durationUnit', 's'),
    ('totalDistance', 'n'),
    ('totalDistanceUnit', 's'),
    ('totalEnergyBurned', 'n'),
    ('totalEnergyBurnedUnit', 's'),
))

FIELDS = {
    'Record': RECORD_FIELDS,
    'ActivitySummary': ACTIVITY_SUMMARY_FIELDS,
    'Workout': WORKOUT_FIELDS,
}

and we have to change references (in both the main code and the test code) to refer to RECORD_FIELDS where previously there were references to FIELDS.

Change 2: Add a Workout to the test data

There was a single workout in the data I exported from the phone (a token one I performed primarily to generate a record of this type). I simply used grep to extract that line from export.xml and poked it into the test data `testdata/export6s3sample.xml'.

Change 3: Update the tag and field counters

The code for counting record types previously considered only nodes of type Record. Now we also want to handle Workout and ActivitySummary elements. Workouts do come in different types (they have a workoutActivityType field), so it may be that we will want to write out different workout types into different CSV files, but since I have only, so far, seen a single workout, I don't really want to do this. So instead, we'll write all Workout elements to a corresponding Workout.csv file, and all ActivitySummary elements to an ActivitySummary.csv file.

Accordingly, the count_record_types method now uses an extra Counter attribute, other_types to count the number of each of these elements, keyed on their tag (i.e. Workout or ActivitySummary).

Change 4: Update the test results to reflect the new behaviour

Two of the unit tests introduced last time need to be updated to reflect this Change 3. First, the field counts change, and secondly we need reference values for the other_types counts. Hence the new section in test_extracted_reference_stats:

    expectedOtherCounts = [
       ('ActivitySummary', 2),
       ('Workout', 1),
    ]
    self.assertEqual(sorted(data.other_types.items()),
                     expectedOtherCounts)

Change 5: Open (and close) files for Workouts and ActivitySummaries

We need to open new files for Workout.csv and ActivitySummary.csv if we have any such records. This is handled in the open_for_writing method.

Change 6: Write records for Workouts and ActivitySummaries

There are minor changes to the write_records method to allow it to handle writing Workout and ActivitySummary records. The only real difference is that the different CSV files have different fields, so we need to look up the right values, in the order specified by the header for each kind. The new code does that:

def write_records(self):
    kinds = FIELDS.keys()
    for node in self.nodes:
        if node.tag in kinds:
            attributes = node.attrib
            kind = attributes['type'] if node.tag == 'Record' else node.tag
            values = [format_value(attributes.get(field), datatype)
                      for (field, datatype) in FIELDS[node.tag].items()]
            line = encode(','.join(values) + '\n')
            self.handles[kind].write(line)

Change 7: Update the reference test

Finally, the reference test itself now generates two more files, so I've added reference copies of those to the testdata subdirectory and changed the test to loop over all four files:

def test_tiny_reference_extraction(self):
    path = copy_test_data()
    data = HealthDataExtractor(path, verbose=VERBOSE)
    data.extract()
    for kind in ('StepCount', 'DistanceWalkingRunning',
                 'Workout', 'ActivitySummary'):
        self.check_file('%s.csv' % kind)

Mission Accomplished

We've now extracted essentially all the data from the export.xml file from the Apple Health app, and created various tests for that extraction process. We'll start to look at the data in future posts. There is one more component in my extract—another XML file called export_cda.xml. This contains a ClinicalDocument, apparently conforming to a standard from (or possibly administered by) Health Level Seven International. It contains heart-rate data from my Apple Watch. I probably will extract it and publish the code for doing so, but later.

Unit Tests

2016-04-19T21:05:00+01:00

In the last post, we presented some code for implementing a "reference" test for the code for extracting CSV files from the XML dump that the Apple Health app on iOS can produce.

We will now expand that test with a few other, smaller and more conventional unit tests. Each unit test focuses on a smaller block of functionality.

The Test Code

As before, you can get the code from Github with

$ git clone https://github.com/tdda/applehealthdata.git

or if you have pulled it previously, with

$ git pull

This version of the code is tagged with v1.2, so if it has been updated by the time you read this, get that version with

$ git checkout v1.2

Here is the updated test code.

# -*- coding: utf-8 -*-
"""
testapplehealthdata.py: tests for the applehealthdata.py

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import shutil
import sys
import unittest

from collections import Counter


from applehealthdata import (HealthDataExtractor,
                             format_freqs, format_value,
                             abbreviate, encode)

CLEAN_UP = True
VERBOSE = False


def get_base_dir():
    """
    Return the directory containing this test file,
    which will (normally) be the applyhealthdata directory
    also containing the testdata dir.
    """
    return os.path.split(os.path.abspath(__file__))[0]


def get_testdata_dir():
    """Return the full path to the testdata directory"""
    return os.path.join(get_base_dir(), 'testdata')


def get_tmp_dir():
    """Return the full path to the tmp directory"""
    return os.path.join(get_base_dir(), 'tmp')


def remove_any_tmp_dir():
    """
    Remove the temporary directory if it exists.
    Returns its location either way.
    """
    tmp_dir = get_tmp_dir()
    if os.path.exists(tmp_dir):
        shutil.rmtree(tmp_dir)
    return tmp_dir


def make_tmp_dir():
    """
    Remove any existing tmp directory.
    Create empty tmp direcory.
    Return the location of the tmp dir.
    """
    tmp_dir = remove_any_tmp_dir()
    os.mkdir(tmp_dir)
    return tmp_dir


def copy_test_data():
    """
    Copy the test data export6s3sample.xml from testdata directory
    to tmp directory.
    """
    tmp_dir = make_tmp_dir()
    name = 'export6s3sample.xml'
    in_xml_file = os.path.join(get_testdata_dir(), name)
    out_xml_file = os.path.join(get_tmp_dir(), name)
    shutil.copyfile(in_xml_file, out_xml_file)
    return out_xml_file


class TestAppleHealthDataExtractor(unittest.TestCase):
    @classmethod
    def tearDownClass(cls):
        """Clean up by removing the tmp directory, if it exists."""
        if CLEAN_UP:
            remove_any_tmp_dir()

    def check_file(self, filename):
        expected_output = os.path.join(get_testdata_dir(), filename)
        actual_output = os.path.join(get_tmp_dir(), filename)
        with open(expected_output) as f:
            expected = f.read()
        with open(actual_output) as f:
            actual = f.read()
        self.assertEqual(expected, actual)

    def test_tiny_reference_extraction(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)
        data.extract()
        self.check_file('StepCount.csv')
        self.check_file('DistanceWalkingRunning.csv')

    def test_format_freqs(self):
        counts = Counter()
        self.assertEqual(format_freqs(counts), '')
        counts['one'] += 1
        self.assertEqual(format_freqs(counts), 'one: 1')
        counts['one'] += 1
        self.assertEqual(format_freqs(counts), 'one: 2')
        counts['two'] += 1
        counts['three'] += 1
        self.assertEqual(format_freqs(counts),
                         '''one: 2
three: 1
two: 1''')

    def test_format_null_values(self):
        for dt in ('s', 'n', 'd', 'z'):
            # Note: even an illegal type, z, produces correct output for
            # null values.
            # Questionable, but we'll leave as a feature
            self.assertEqual(format_value(None, dt), '')

    def test_format_numeric_values(self):
        cases = {
            '0': '0',
            '3': '3',
            '-1': '-1',
            '2.5': '2.5',
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 'n')), (k, v))

    def test_format_date_values(self):
        hearts = 'any string not need escaping or quoting; even this: ♥♥'
        cases = {
            '01/02/2000 12:34:56': '01/02/2000 12:34:56',
            hearts: hearts,
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 'd')), (k, v))

    def test_format_string_values(self):
        cases = {
            'a': '"a"',
            '': '""',
            'one "2" three': r'"one \"2\" three"',
            r'1\2\3': r'"1\\2\\3"',
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 's')), (k, v))

    def test_abbreviate(self):
        changed = {
            'HKQuantityTypeIdentifierHeight': 'Height',
            'HKQuantityTypeIdentifierStepCount': 'StepCount',
            'HK*TypeIdentifierStepCount': 'StepCount',
            'HKCharacteristicTypeIdentifierDateOfBirth': 'DateOfBirth',
            'HKCharacteristicTypeIdentifierBiologicalSex': 'BiologicalSex',
            'HKCharacteristicTypeIdentifierBloodType': 'BloodType',
            'HKCharacteristicTypeIdentifierFitzpatrickSkinType':
                                                    'FitzpatrickSkinType',
        }
        unchanged = [
            '',
            'a',
            'aHKQuantityTypeIdentifierHeight',
            'HKQuantityTypeIdentityHeight',
        ]
        for (k, v) in changed.items():
            self.assertEqual((k, abbreviate(k)), (k, v))
            self.assertEqual((k, abbreviate(k, False)), (k, k))
        for k in unchanged:
            self.assertEqual((k, abbreviate(k)), (k, k))

    def test_encode(self):
        # This test looks strange, but because of the import statments
        #     from __future__ import unicode_literals
        # in Python 2, type('a') is unicode, and the point of the encode
        # function is to ensure that it has been converted to a UTF-8 string
        # before writing to file.
        self.assertEqual(type(encode('a')), str)

    def test_extracted_reference_stats(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)

        self.assertEqual(data.n_nodes, 19)
        expectedRecordCounts = [
           ('DistanceWalkingRunning', 5),
           ('StepCount', 10),
        ]
        self.assertEqual(sorted(data.record_types.items()),
                         expectedRecordCounts)

        expectedTagCounts = [
           ('ActivitySummary', 2),
           ('ExportDate', 1),
           ('Me', 1),
           ('Record', 15),
        ]
        self.assertEqual(sorted(data.tags.items()),
                         expectedTagCounts)
        expectedFieldCounts = [
            ('HKCharacteristicTypeIdentifierBiologicalSex', 1),
            ('HKCharacteristicTypeIdentifierBloodType', 1),
            ('HKCharacteristicTypeIdentifierDateOfBirth', 1),
            ('HKCharacteristicTypeIdentifierFitzpatrickSkinType', 1),
            ('activeEnergyBurned', 2),
            ('activeEnergyBurnedGoal', 2),
            ('activeEnergyBurnedUnit', 2),
            ('appleExerciseTime', 2),
            ('appleExerciseTimeGoal', 2),
            ('appleStandHours', 2),
            ('appleStandHoursGoal', 2),
            ('creationDate', 15),
            ('dateComponents', 2),
            ('endDate', 15),
            ('sourceName', 15),
            ('startDate', 15),
            ('type', 15),
            ('unit', 15),
            ('value', 16),
        ]
        self.assertEqual(sorted(data.fields.items()),
                         expectedFieldCounts)


if __name__ == '__main__':
    unittest.main()

Notes

We're not going to discuss every part of the code, but will point out a few salient features.

I've added a coding line at the top of both the test script and the main applehealthdata.py script. This tells Python (and my editor, Emacs) the encoding of the file on disk (UTF-8). This is now relevant because one of the new tests (test_format_date_values) features a non-ASCII character in a string literal.
The previous test method test_tiny_fixed_extraction has been renamed test_tiny_reference_extraction, but is otherwise unchanged.
Several of the tests loop over dictionaries or lists of input-output pairs, with an assertion of some kind in the main body. Some people don't like this, and prefer one assertion per test. I don't really agree with that, but do think it's important to be able to see easily which assertion fails. An idiom I often use to assist this is to include the input on both sides of the test. For example, in test_abbreviate, when checking the abbreviation of items that should change, the code reads:
```
for (k, v) in changed.items():
    self.assertEqual((k, abbreviate(k)), (k, v))
```
rather than
```
for (k, v) in changed.items():
    self.assertEqual(abbreviate(k), v)
```
This makes it easy to tell which input fails, if one does, even in cases in which the main values being compared (abbreviate(k) and v, in this case) are long, complex or repeated across different inputs. It doesn't actually make much difference in these examples, but in general I find it helpful.
The test test_extracted_reference_stats checks that three counters used by the code work as expected. Some people would definitely advocate splitting this into three tests, but, even though it's quick, it seems more natural to test these together to me. This also means we don't have to process the XML file three times. There are other ways of achieving the same end, and this approach has the potential disadvantage that the later cases won't be run if the first one fails.

The other point to note here is that the Counter objects are unordered, so I've sorted the expected results on their keys in the expected values, and then used Python's sorted function, which returns a generator to return the values of a list (or other iterable) in sorted order. We could avoid the sort by constructing sets or a dictionaries from the Counter objects and checking those instead, but the sort here is not expensive, and this approach is probably simpler.
I haven't bothered to write a separate test for the extraction phase (checking that it writes the right CSV files) because that seems to me to add almost nothing over the existing reference test (test_tiny_reference_extraction).

Closing

That's it for this post. The unit tests are not terribly exciting, but they will prove useful as we extend the extraction code, which we'll start to do in the next post.

In a few posts' time, we will start analysing the data extracted from the app; it will be interesting to see whether, at that stage, we discover any more serious problems with the extraction code. Experience teaches that we probably will.

First Test

2016-04-18T16:20:00+01:00

In the last post, I presented some code for extracting (some of) the data from the XML file exported by the Apple Health app on iOS, but—almost comically, given this blog's theme—omitted to include any tests. This post and the next couple (in quick succession) will aim to fix that.

This post begins to remedy that by writing a single "reference" test. To recap: a reference test is a test that tests a whole analytical process, checking that the known inputs produce the expected outputs. So far, our analytical process is quite small, consisting only of data extraction, but this will still prove very worthwhile.

Dogma

While the mainstream TDD dogma states that tests should be written before the code, it is far from uncommon to write them afterwards, and in the context of test-driven data analysis I maintain that this is usually preferable. Regardless, when you find yourself in a situation in which you have written some code and possess any reasonable level of belief that it might be right,¹ an excellent starting point is simply to capture the input(s) that you have already used, together with the output that it generates, and write a test that checks that the input you provided produces the expected output. That's exactly the procedure I advocated for TDDA, and that's how we shall start here.

Test Data

The only flies in the ointment in this case are

the input data I used initially was quite large (5.5MB compressed; 109MB uncompressed), leading to quite a slow test;
the data is somewhat personal.

For both these reasons, I have decided to reduce it so that it will be more manageable, run more quickly, and be more suitable for public sharing.

So I cut down the data to contain only the DTD header, the Me record, ten StepCount records, and five DistanceWalkingRunning records. That results in a small, valid XML file (under 7K) containing exactly 100 lines. It's in the testdata subdirectory of the repository, and if I run it (which you probably don't want do, at least in situ, as that will trample over the reference output), the following output is produced:

$ python applehealthdata/applehealthdata.py testdata/export6s3sample.xml
Reading data from testdata/export6s3sample.xml . . . done

Tags:
ActivitySummary: 2
ExportDate: 1
Me: 1
Record: 15

Fields:
HKCharacteristicTypeIdentifierBiologicalSex: 1
HKCharacteristicTypeIdentifierBloodType: 1
HKCharacteristicTypeIdentifierDateOfBirth: 1
HKCharacteristicTypeIdentifierFitzpatrickSkinType: 1
activeEnergyBurned: 2
activeEnergyBurnedGoal: 2
activeEnergyBurnedUnit: 2
appleExerciseTime: 2
appleExerciseTimeGoal: 2
appleStandHours: 2
appleStandHoursGoal: 2
creationDate: 15
dateComponents: 2
endDate: 15
sourceName: 15
startDate: 15
type: 15
unit: 15
value: 16

Record types:
DistanceWalkingRunning: 5
StepCount: 10

Opening /Users/njr/qs/testdata/StepCount.csv for writing
Opening /Users/njr/qs/testdata/DistanceWalkingRunning.csv for writing
Written StepCount data.
Written DistanceWalkingRunning data.

The two CSV files it writes, which are also in the testdata subdirectory in the repository, are as follows:

$ cat testdata/StepCount.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329
"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283
"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282

and

$ cat testdata/DistanceWalkingRunning.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:28 +0100,2014-09-20 10:41:30 +0100,0.00288
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:30 +0100,2014-09-20 10:41:33 +0100,0.00284
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:33 +0100,2014-09-20 10:41:36 +0100,0.00142
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:54 +0100,2014-09-20 10:43:56 +0100,0.00639
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:59 +0100,2014-09-20 10:44:01 +0100,0.0059

Reference Test

The code for a single reference test is below. It's slightly verbose, because it tries to use sensible locations for everything, but not complex.

As before, you can get the code from Github with

$ git clone https://github.com/tdda/applehealthdata.git

or if you have pulled it previously, you can update it with

$ git pull

This version of the code is tagged with v1.1, so if it has been updated by the time you read this, get that version with

$ git checkout v1.1

Here is the code:

"""
testapplehealthdata.py: tests for the applehealthdata.py

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import shutil
import sys
import unittest

from applehealthdata import HealthDataExtractor

CLEAN_UP = True
VERBOSE = False


def get_base_dir():
    """
    Return the directory containing this test file,
    which will (normally) be the applyhealthdata directory
    also containing the testdata dir.
    """
    return os.path.split(os.path.abspath(__file__))[0]


def get_testdata_dir():
    """Return the full path to the testdata directory"""
    return os.path.join(get_base_dir(), 'testdata')


def get_tmp_dir():
    """Return the full path to the tmp directory"""
    return os.path.join(get_base_dir(), 'tmp')


def remove_any_tmp_dir():
    """
    Remove the temporary directory if it exists.
    Returns its location either way.
    """
    tmp_dir = get_tmp_dir()
    if os.path.exists(tmp_dir):
        shutil.rmtree(tmp_dir)
    return tmp_dir


def make_tmp_dir():
    """
    Remove any existing tmp directory.
    Create empty tmp direcory.
    Return the location of the tmp dir.
    """
    tmp_dir = remove_any_tmp_dir()
    os.mkdir(tmp_dir)
    return tmp_dir


def copy_test_data():
    """
    Copy the test data export6s3sample.xml from testdata directory
    to tmp directory.
    """
    tmp_dir = make_tmp_dir()
    name = 'export6s3sample.xml'
    in_xml_file = os.path.join(get_testdata_dir(), name)
    out_xml_file = os.path.join(get_tmp_dir(), name)
    shutil.copyfile(in_xml_file, out_xml_file)
    return out_xml_file


class TestAppleHealthDataExtractor(unittest.TestCase):
    @classmethod
    def tearDownClass(cls):
        """Clean up by removing the tmp directory, if it exists."""
        if CLEAN_UP:
            remove_any_tmp_dir()

    def check_file(self, filename):
        expected_output = os.path.join(get_testdata_dir(), filename)
        actual_output = os.path.join(get_tmp_dir(), filename)
        with open(expected_output) as f:
            expected = f.read()
        with open(actual_output) as f:
            actual = f.read()
        self.assertEqual(expected, actual)

    def test_tiny_fixed_extraction(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)
        data.extract()
        self.check_file('StepCount.csv')
        self.check_file('DistanceWalkingRunning.csv')


if __name__ == '__main__':
    unittest.main()

Running the Test

This is what I get if I run it:

$ python testapplehealthdata.py
.
----------------------------------------------------------------------
Ran 1 test in 0.007s

OK
$

That's encouraging, but not particularly informative. If we change the value of VERBOSE at the top of the test file to True, we see slightly more reassuring output:

$ python testapplehealthdata.py
Reading data from /Users/njr/qs/applehealthdata/tmp/export6s3sample.xml . . . done
Opening /Users/njr/qs/applehealthdata/tmp/StepCount.csv for writing
Opening /Users/njr/qs/applehealthdata/tmp/DistanceWalkingRunning.csv for writing
Written StepCount data.
Written DistanceWalkingRunning data.
.
----------------------------------------------------------------------
Ran 1 test in 0.006s

NOTE: The tearDownClass method is a special Python class method that the unit testing framework runs after executing all the tests in the class, regardless of whether they pass, fail or produce errors. I use it to remove the tmp directory containing any test output, which is normally good practice. In a later post, we'll either modify this to leave the output around if any tests fail, or make some other change to make it easier to diagnose what's gone wrong. In the meantime, if you change the value of CLEAN_UP, towards the top of the code, to False, it will leave the tmp directory around, allowing you to examine the files it has produced.

Overview

The test itself is in the 5-line method test_tiny_fixed_extraction. Here's what the five lines do:

Copy the input XML file from the testdata directory to the tmp directory. The Github repository contains the 100-line input XML file together with the expected output in the testdata subdirectory. Because the data extractor writes the CSV files next to the input data, the cleanest thing for us to do is to take a copy of the input data, write it into a new directory (applehealthdata/tmp) and also to use that directory as the location for the output CSV files. The copy_test_data function removes any existing tmp directory it finds, creates a fresh one, copies the input test data into it and returns the path to the test data file. The only "magic" here is that the get_base_dir function figures out where to locate everything by using __file__, which is the location of the source file being executed by Python.
Create a HealthDataExtractor object, using the location of the copy of the input data returned by copy_test_data(). Note that it sets verbose to False, making the test silent, and allowing the line of dots from a successful test run (in this case, a single dot) to be presented without interruption.
Extract the data. This writes two output files to the applehealthdata/tmp directory.
Check that the contents of tmp/StepCount.csv match the reference output in testdata/StepCount.csv.
Check that the contents of tmp/DistanceWalkingRunning.csv match the reference output in testdata/DistanceWalkingRunning.csv.

Write-Test-Break-Run-Repair-Rerun

In cases in which the tests are written after the code, it's important to check that they really are running correctly. My usual approach to that is to write the test, and if appears to pass first time,² to break it deliberately to verify that it fails when it should, before repairing it. In this case, the simplest way to break the test is to change the reference data temporarily. This will also reveal a weakness in the current check_file function.

We'll try three variants of this:

Variant 1: Break the StepCount.csv reference data.

First, I add a Z to the end of testdata/StepCount.csv and re-run the tests:

$ python testapplehealthdata.py
F
======================================================================
FAIL: test_tiny_fixed_extraction (__main__.TestAppleHealthDataExtractor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "testapplehealthdata.py", line 98, in test_tiny_fixed_extraction
    self.check_file('StepCount.csv')
  File "testapplehealthdata.py", line 92, in check_file
    self.assertEqual(expected, actual)
AssertionError: 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\nZ' != 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n'

----------------------------------------------------------------------
Ran 1 test in 0.005s

FAILED (failures=1)
$

That causes the expected failure. Because, however, we've compared the entire contents of the two CSV files, it's hard to see what's actually gone wrong. We'll address this by improving the check_file method in a later post.

Variant 2: Break the DistanceWalkingRunning.csv reference data.

After restoring the StepCount.csv data, I modify the reference testdata/DistanceWalkingRunning.csv data. This time, I'll change Health to Wealth throughout.

$ python testapplehealthdata.py
F
======================================================================
FAIL: test_tiny_fixed_extraction (__main__.TestAppleHealthDataExtractor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "testapplehealthdata.py", line 99, in test_tiny_fixed_extraction
    self.check_file('DistanceWalkingRunning.csv')
  File "testapplehealthdata.py", line 92, in check_file
    self.assertEqual(expected, actual)
AssertionError: 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:28 +0100,2014-09-20 10:41:30 +0100,0.00288\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:30 +0100,2014-09-20 10:41:33 +0100,0.00284\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:33 +0100,2014-09-20 10:41:36 +0100,0.00142\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:54 +0100,2014-09-20 10:43:56 +0100,0.00639\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:59 +0100,2014-09-20 10:44:01 +0100,0.0059\n' != 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:28 +0100,2014-09-20 10:41:30 +0100,0.00288\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:30 +0100,2014-09-20 10:41:33 +0100,0.00284\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:33 +0100,2014-09-20 10:41:36 +0100,0.00142\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:54 +0100,2014-09-20 10:43:56 +0100,0.00639\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:59 +0100,2014-09-20 10:44:01 +0100,0.0059\n'

----------------------------------------------------------------------
Ran 1 test in 0.005s

FAILED (failures=1)
$

The story is very much the same: the test has failed, which is good, but again the source of difference is hard to discern.

Variant 3: Break the input XML Data.

After restoring DistanceWalkingRunning.csv, I modify the input XML file. In this case, I'll just change the first step count to be 330 instead of 329:

$ python testapplehealthdata.py
F
======================================================================
FAIL: test_tiny_fixed_extraction (__main__.TestAppleHealthDataExtractor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "testapplehealthdata.py", line 98, in test_tiny_fixed_extraction
    self.check_file('StepCount.csv')
  File "testapplehealthdata.py", line 92, in check_file
    self.assertEqual(expected, actual)
AssertionError: 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n' != 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,330\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n'

----------------------------------------------------------------------
Ran 1 test in 0.005s

FAILED (failures=1)
$

Again, we get the expected failure, and again it's hard to see what it is. (We really will need to improve check_file.)

Enough

That's enough for this post. We've successfully added a single "reference" test to the code, which should at least make sure that if we break it during further enhancements, we will notice. It will also check that it is working correctly on other platforms (e.g., yours).

We haven't done anything to check the the CSV files produced are genuinely right beyond the initial eye-balling I did on first extracting the data before. But if we see problems when we start doing proper analysis, it will be easy to correct the expected output to keep the test running. And in the meantime, we'll notice if we make changes to the code that result in different output when it wasn't meant to do so. This is one part of the pragmatic essence of basic TDDA.

We also haven't written any unit tests at all for the extraction code; we'll do that in a later post.

For example, you might have already blogged about it and pushed it to a public repository on Github ↩
Which is not always the case ↩

In Defence of XML: Exporting and Analysing Apple Health Data

2016-04-15T15:35:00+01:00

I'm going to present a series of posts based around the sort of health and fitness data that can now be collected by some phones and dedicated fitness trackers. Not all of these will be centrally on topic for test-driven data analysis, but I think they'll provide an interesting set of data for discussing many issues of relevance, so I hope readers will forgive me to the extent that these stray from the central theme.

The particular focus for this series will be the data available from an iPhone and the Apple Health app, over a couple of different phones, and with a couple of different devices paired to them.

In particular, the setup will be:

Apple iPhone 6s (November 2015 to present)
Apple iPhone 5s (with fitness data from Sept 2014 to Nov 2015)
Several Misfit Shine activity trackers (until early March 2016)
An Apple Watch (about a month of data, to date)

Getting data out of Apple Health (The Exploratory Version)

I hadn't initially spotted a way to get the data out of Apple's Health app, but a quick web search¹ turned up this very helpful article: https://www.idownloadblog.com/2015/06/10/how-to-export-import-health-data. It turns out there is a properly supported way to export granular data from Apple Health, described in detail in the post. Essentially:

Open the Apple Health App.
Navigate to the Health Data section (left icon at the bottom)
Select All from the list of categories
There is a share icon at the top right (a vertical arrow sticking up from a square)
Tap that to export all data
It thinks for a while (quite a while, in fact) and then offers you various export options, which for me included Airdrop, email and handing the data to other apps. I used Airdrop to dump it onto a Mac.

The result is a compressed XML file called export.zip. For me, this was about 5.5MB, which expanded to 109MB when unzipped. (Interestingly, I started this with an earlier export a couple of weeks ago, when the zipped file was about 5MB and the expanded version was 90MB, so it is growing fairly quickly, thanks to the Watch.)

As helpful as the iDownloadBlog article is, I have to comment on its introduction to exporting data, which reads

There are actually two ways to export the data from your Health app. The first way, is one provided by Apple, but it is virtually useless.

To be fair to iDownloadBlog, an XML file like this probably is useless to the general reader, but it builds on a meme fashionable among developers and data scientists to the effect of "XML is painful to process, verbose and always worse than JSON", and I think this is somewhat unfair.

Let's explore export.xml using Python and the ElementTree library. Although the decompressed file is quite large (109MB), it's certainly not problematically large to read into memory on a modern machine, so I'm not going to worry about reading it in bits: I'm just going to find out as quickly as possible what's in it.

The first thing to do, of course, is simply to look at the file, probably using either the more or less command, assuming you are on some flavour of Unix or Linux. Let's look at the top of my export.xml:

$ head -79 export6s3/export.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HealthData [
<!-- HealthKit Export Version: 3 -->
<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout|ActivitySummary)*)>
<!ATTLIST HealthData
  locale CDATA #REQUIRED
>
<!ELEMENT ExportDate EMPTY>
<!ATTLIST ExportDate
  value CDATA #REQUIRED
>
<!ELEMENT Me EMPTY>
<!ATTLIST Me
  HKCharacteristicTypeIdentifierDateOfBirth         CDATA #REQUIRED
  HKCharacteristicTypeIdentifierBiologicalSex       CDATA #REQUIRED
  HKCharacteristicTypeIdentifierBloodType           CDATA #REQUIRED
  HKCharacteristicTypeIdentifierFitzpatrickSkinType CDATA #REQUIRED
>
<!ELEMENT Record (MetadataEntry*)>
<!ATTLIST Record
  type          CDATA #REQUIRED
  unit          CDATA #IMPLIED
  value         CDATA #IMPLIED
  sourceName    CDATA #REQUIRED
  sourceVersion CDATA #IMPLIED
  device        CDATA #IMPLIED
  creationDate  CDATA #IMPLIED
  startDate     CDATA #REQUIRED
  endDate       CDATA #REQUIRED
>
<!-- Note: Any Records that appear as children of a correlation also appear as top-level records in this document. -->
<!ELEMENT Correlation ((MetadataEntry|Record)*)>
<!ATTLIST Correlation
  type          CDATA #REQUIRED
  sourceName    CDATA #REQUIRED
  sourceVersion CDATA #IMPLIED
  device        CDATA #IMPLIED
  creationDate  CDATA #IMPLIED
  startDate     CDATA #REQUIRED
  endDate       CDATA #REQUIRED
>
<!ELEMENT Workout ((MetadataEntry|WorkoutEvent)*)>
<!ATTLIST Workout
  workoutActivityType   CDATA #REQUIRED
  duration              CDATA #IMPLIED
  durationUnit          CDATA #IMPLIED
  totalDistance         CDATA #IMPLIED
  totalDistanceUnit     CDATA #IMPLIED
  totalEnergyBurned     CDATA #IMPLIED
  totalEnergyBurnedUnit CDATA #IMPLIED
  sourceName            CDATA #REQUIRED
  sourceVersion         CDATA #IMPLIED
  device                CDATA #IMPLIED
  creationDate          CDATA #IMPLIED
  startDate             CDATA #REQUIRED
  endDate               CDATA #REQUIRED
>
<!ELEMENT WorkoutEvent EMPTY>
<!ATTLIST WorkoutEvent
  type CDATA #REQUIRED
  date CDATA #REQUIRED
>
<!ELEMENT ActivitySummary EMPTY>
<!ATTLIST ActivitySummary
  dateComponents           CDATA #IMPLIED
  activeEnergyBurned       CDATA #IMPLIED
  activeEnergyBurnedGoal   CDATA #IMPLIED
  activeEnergyBurnedUnit   CDATA #IMPLIED
  appleExerciseTime        CDATA #IMPLIED
  appleExerciseTimeGoal    CDATA #IMPLIED
  appleStandHours          CDATA #IMPLIED
  appleStandHoursGoal      CDATA #IMPLIED
>
<!ELEMENT MetadataEntry EMPTY>
<!ATTLIST MetadataEntry
  key   CDATA #REQUIRED
  value CDATA #REQUIRED
>
]>

This is immediately encouraging: Apple has provided DOCTYPE (DTD) information, which even though slightly old fashioned, tells us what we should expect to find in the file. DTD's are awkward to use, and when coming from untrusted sources, can leave the user potentially vulnerable to malicious attacks, but despite this, they are quite expressive and helpful, even just as plain-text documentation.

Roughly speaking, the lines:

<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout)*)>
<!ATTLIST HealthData
  locale CDATA #REQUIRED
>

say

that the top element will be a HealthData element
that this HealthData element will contain
- an ExportDate element
- a Me element
- zero or more elements of type Record, Correlation or Workout
and that the HealthData element will have an attribute locale (which is mandatory).

The rest of this DTD section describes each kind of record in more detail.

The next 6 lines in my XML file are as follows (spread out for readability):

<HealthData locale="en_GB">
 <ExportDate value="2016-04-15 07:27:26 +0100"/>
 <Me HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"
     HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
     HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet"
     HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
 <Record type="HKQuantityTypeIdentifierHeight"
         sourceName="Health"
         sourceVersion="9.2"
         unit="cm"
         creationDate="2016-01-02 09:45:10 +0100"
         startDate="2016-01-02 09:44:00 +0100"
         endDate="2016-01-02 09:44:00 +0100"
         value="194">
  <MetadataEntry key="HKWasUserEntered" value="1"/>
 </Record>

As you can see, the export format is verbose, but extremely comprehensible and comprehensive. It's also very easy to read into Python and explore.

Let's do that, here with an interactive python:

>>> from xml.etree import ElementTree as ET
>>> with open('export.xml') as f:
...     data = ET.parse(f)
... 
>>> data
<xml.etree.ElementTree.ElementTree object at 0x107347a50>

The ElementTree module turns each XML element into an Element object, described by its tag, with a few standard attributes.

Inspecting the data object, we find:

>>> data.__dict__
{'_root': <Element 'HealthData' at 0x1073c2050>}

i.e., we have a single entry in data—a root element called HealthData.

Like all Element objects, it has the four standard attributes:²

>>> root = data._root
>>> root.__dict__.keys()
['text', 'attrib', 'tag', '_children']

These are:

>>> root.attrib
{'locale': 'en_GB'}

>>> root.text
'\n '

>>> root.tag
'HealthData'

>>> len(root._children)
446702

So nothing much apart from an encoding and a whole lot of child nodes. Let's inspect the first few of them:

>>> nodes = root._children
>>> nodes[0]
<Element 'ExportDate' at 0x1073c2090>

>>> ET.dump(nodes[0])
<ExportDate value="2016-04-15 07:27:26 +0100" />

>>> nodes[1]
<Element 'Me' at 0x1073c2190>
>>> ET.dump(nodes[1])
<Me HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
    HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet"
    HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"
    HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet" />

>>> nodes[2]
<Element 'Record' at 0x1073c2410>
>>> ET.dump(nodes[2])
<Record creationDate="2016-01-02 09:45:10 +0100"
        endDate="2016-01-02 09:44:00 +0100"
        sourceName="Health"
        sourceVersion="9.2"
        startDate="2016-01-02 09:44:00 +0100"
        type="HKQuantityTypeIdentifierHeight"
        unit="cm"
        value="194">
  <MetadataEntry key="HKWasUserEntered" value="1" />
 </Record>

>>> nodes[3]
<Element 'Record' at 0x1073c2550>
>>> nodes[4]
<Element 'Record' at 0x1073c2650>

So, exactly as the DTD indicated, we have an ExportDate node, a Me node and then what looks like a great number of records. Let's confirm that:

>>> set(node.tag for node in nodes[2:])
set(['Record', 'Workout', 'ActivitySummary'])

So in fact, there are three kinds of nodes after the ExportDate and Me records. Let's count them:

>>> records = [node for node in nodes if node.tag == 'Record']
>>> len(records)
446670

These records are ones like the Height record we saw above, though in fact most of them are not Height but either StepCount, CaloriesBurned or DistanceWalkingRunning, e.g.:

>>> ET.dump(nodes[100000])
<Record creationDate="2015-01-11 07:40:15 +0000"
        endDate="2015-01-10 13:39:35 +0000"
        sourceName="njr iPhone 6s"
        startDate="2015-01-10 13:39:32 +0000"
        type="HKQuantityTypeIdentifierStepCount"
        unit="count"
        value="4" />

There is also one activity summary per day (since I got the watch).

>>> acts = [node for node in nodes if node.tag == 'ActivitySummary']
>>> len(acts)
29

The first one isn't very exciting:

>>> ET.dump(acts[0])
<ActivitySummary activeEnergyBurned="0"
                 activeEnergyBurnedGoal="0"
                 activeEnergyBurnedUnit="kcal"
                 appleExerciseTime="0"
                 appleExerciseTimeGoal="30"
                 appleStandHours="0"
                 appleStandHoursGoal="12"
                 dateComponents="2016-03-18" />

but they get better:

>>> ET.dump(acts[2])
<ActivitySummary activeEnergyBurned="652.014"
                 activeEnergyBurnedGoal="500"
                 activeEnergyBurnedUnit="kcal"
                 appleExerciseTime="77"
                 appleExerciseTimeGoal="30"
                 appleStandHours="17"
                 appleStandHoursGoal="12"
                 dateComponents="2016-03-20" />

Finally, there is a solitary Workout record.

>>> ET.dump(workouts[0])
<Workout creationDate="2016-04-02 11:12:57 +0100"
         duration="31.73680251737436"
         durationUnit="min"
         endDate="2016-04-02 11:12:22 +0100"
         sourceName="NJR Apple&#160;Watch"
         sourceVersion="2.2"
         startDate="2016-04-02 10:40:38 +0100"
         totalDistance="0"
         totalDistanceUnit="km"
         totalEnergyBurned="139.3170000000021"
         totalEnergyBurnedUnit="kcal"
         workoutActivityType="HKWorkoutActivityTypeOther" />

So there we have it.

Getting data out of Apple Health (The Code)

Given this exploration, we can take a first shot at writing an exporter for Apple Health Data. I'm going to ignore the activity summaries and workout(s) for now, and concentrate on the main records. (We'll get to the others in a later post.)

Here is the code:

"""
applehealthdata.py: Extract data from Apple Health App's export.xml.

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import sys

from xml.etree import ElementTree
from collections import Counter, OrderedDict

__version__ = '1.0'

FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('type', 's'),
    ('unit', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('value', 'n'),
))

PREFIX_RE = re.compile('^HK.*TypeIdentifier(.+)$')
ABBREVIATE = True
VERBOSE = True

def format_freqs(counter):
    """
    Format a counter object for display.
    """
    return '\n'.join('%s: %d' % (tag, counter[tag])
                     for tag in sorted(counter.keys()))


def format_value(value, datatype):
    """
    Format a value for a CSV file, escaping double quotes and backslashes.

    None maps to empty.

    datatype should be
        's' for string (escaped)
        'n' for number
        'd' for datetime
    """
    if value is None:
        return ''
    elif datatype == 's':  # string
        return '"%s"' % value.replace('\\', '\\\\').replace('"', '\\"')
    elif datatype in ('n', 'd'):  # number or date
        return value
    else:
        raise KeyError('Unexpected format value: %s' % datatype)


def abbreviate(s):
    """
    Abbreviate particularly verbose strings based on a regular expression
    """
    m = re.match(PREFIX_RE, s)
    return m.group(1) if ABBREVIATE and m else s


def encode(s):
    """
    Encode string for writing to file.
    In Python 2, this encodes as UTF-8, whereas in Python 3,
    it does nothing
    """
    return s.encode('UTF-8') if sys.version_info.major < 3 else s



class HealthDataExtractor(object):
    """
    Extract health data from Apple Health App's XML export, export.xml.

    Inputs:
        path:      Relative or absolute path to export.xml
        verbose:   Set to False for less verbose output

    Outputs:
        Writes a CSV file for each record type found, in the same
        directory as the input export.xml. Reports each file written
        unless verbose has been set to False.
    """
    def __init__(self, path, verbose=VERBOSE):
        self.in_path = path
        self.verbose = verbose
        self.directory = os.path.abspath(os.path.split(path)[0])
        with open(path) as f:
            self.report('Reading data from %s . . . ' % path, end='')
            self.data = ElementTree.parse(f)
            self.report('done')
        self.root = self.data._root
        self.nodes = self.root.getchildren()
        self.n_nodes = len(self.nodes)
        self.abbreviate_types()
        self.collect_stats()

    def report(self, msg, end='\n'):
        if self.verbose:
            print(msg, end=end)
            sys.stdout.flush()

    def count_tags_and_fields(self):
        self.tags = Counter()
        self.fields = Counter()
        for record in self.nodes:
            self.tags[record.tag] += 1
            for k in record.keys():
                self.fields[k] += 1

    def count_record_types(self):
        self.record_types = Counter()
        for record in self.nodes:
            if record.tag == 'Record':
                self.record_types[record.attrib['type']] += 1

    def collect_stats(self):
        self.count_record_types()
        self.count_tags_and_fields()

    def open_for_writing(self):
        self.handles = {}
        self.paths = []
        for kind in self.record_types:
            path = os.path.join(self.directory, '%s.csv' % abbreviate(kind))
            f = open(path, 'w')
            f.write(','.join(FIELDS) + '\n')
            self.handles[kind] = f
            self.report('Opening %s for writing' % path)

    def abbreviate_types(self):
        """
        Shorten types by removing common boilerplate text.
        """
        for node in self.nodes:
            if node.tag == 'Record':
                if 'type' in node.attrib:
                    node.attrib['type'] = abbreviate(node.attrib['type'])


    def write_records(self):
        for node in self.nodes:
            if node.tag == 'Record':
                attributes = node.attrib
                kind = attributes['type']
                values = [format_value(attributes.get(field), datatype)
                          for (field, datatype) in FIELDS.items()]
                line = encode(','.join(values) + '\n')
                self.handles[kind].write(line)

    def close_files(self):
        for (kind, f) in self.handles.items():
            f.close()
            self.report('Written %s data.' % abbreviate(kind))

    def extract(self):
        self.open_for_writing()
        self.write_records()
        self.close_files()

    def report_stats(self):
        print('\nTags:\n%s\n' % format_freqs(self.tags))
        print('Fields:\n%s\n' % format_freqs(self.fields))
        print('Record types:\n%s\n' % format_freqs(self.record_types))


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('USAGE: python applehealthdata.py /path/to/export.xml',
              file=sys.stderr)
        sys.exit(1)
    data = HealthDataExtractor(sys.argv[1])
    data.report_stats()
    data.extract()

To run this code, clone the repo from github.com/tdda/applehealthdata with:

$ git clone https://github.com/tdda/applehealthdata.git

or save the text from this post as healthdata.py. At the time of posting, the code is consistent with this, but this commit is also tagged with the version number, v1.0, so if you check it out later and want to use this version, check out that version by saying:

$ git checkout v1.0

If your data is in the same directory as the code, then simply run:

$ python healthdata.py export.xml

and, depending on size, wait a few minutes while it runs. The code runs under both Python 2 and Python 3.

When I do this, the output is as follows:

$ python applehealthdata/applehealthdata.py export6s3/export.xml
Reading data from export6s3/export.xml . . . done

Tags:
ActivitySummary: 29
ExportDate: 1
Me: 1
Record: 446670
Workout: 1

Fields:
HKCharacteristicTypeIdentifierBiologicalSex: 1
HKCharacteristicTypeIdentifierBloodType: 1
HKCharacteristicTypeIdentifierDateOfBirth: 1
HKCharacteristicTypeIdentifierFitzpatrickSkinType: 1
activeEnergyBurned: 29
activeEnergyBurnedGoal: 29
activeEnergyBurnedUnit: 29
appleExerciseTime: 29
appleExerciseTimeGoal: 29
appleStandHours: 29
appleStandHoursGoal: 29
creationDate: 446671
dateComponents: 29
device: 84303
duration: 1
durationUnit: 1
endDate: 446671
sourceName: 446671
sourceVersion: 86786
startDate: 446671
totalDistance: 1
totalDistanceUnit: 1
totalEnergyBurned: 1
totalEnergyBurnedUnit: 1
type: 446670
unit: 446191
value: 446671
workoutActivityType: 1

Record types:
ActiveEnergyBurned: 19640
AppleExerciseTime: 2573
AppleStandHour: 479
BasalEnergyBurned: 26414
BodyMass: 155
DistanceWalkingRunning: 196262
FlightsClimbed: 2476
HeartRate: 3013
Height: 4
StepCount: 195654

Opening /Users/njr/qs/export6s3/BasalEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/HeartRate.csv for writing
Opening /Users/njr/qs/export6s3/BodyMass.csv for writing
Opening /Users/njr/qs/export6s3/DistanceWalkingRunning.csv for writing
Opening /Users/njr/qs/export6s3/AppleStandHour.csv for writing
Opening /Users/njr/qs/export6s3/StepCount.csv for writing
Opening /Users/njr/qs/export6s3/Height.csv for writing
Opening /Users/njr/qs/export6s3/AppleExerciseTime.csv for writing
Opening /Users/njr/qs/export6s3/ActiveEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/FlightsClimbed.csv for writing
Written BasalEnergyBurned data.
Written HeartRate data.
Written BodyMass data.
Written DistanceWalkingRunning data.
Written ActiveEnergyBurned data.
Written StepCount data.
Written Height data.
Written AppleExerciseTime data.
Written AppleStandHour data.
Written FlightsClimbed data.
$

As a quick preview of one of the files, here is the top of the second biggest output fiele, StepCount.csv:

$ head -5 StepCount.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:27:54 +0000,2014-09-13 09:27:59 +0000,329
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:34:09 +0000,2014-09-13 09:34:14 +0000,283
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:39:29 +0000,2014-09-13 09:39:34 +0000,426
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:48 +0000,2014-09-13 09:45:36 +0000,2014-09-13 09:45:41 +0000,61

You may need to scroll right to see all of it, or expand your browser window.

This blog post is long enough already, so I'll discuss (and plot) the contents of the various output files in later posts.

Notes on the Output

Format: The code writes CSV files including a header record with field names. Since the fields are XML attributes, which get read into a dictionary, they are unordered so the code sorts them alphabetically, which isn't optimal, but is at least consistent. Nulls are written as empty spaces, strings are quoted with double quotes, double quotes in strings are escaped with backslash and backslash is itself escaped with backslash. The output encoding is UTF-8.

Filenames: One file is written per record type, and the names is just the record type with extension .csv, except for record types including HK...TypeIdentifier, which is excised.

Summary Stats: Summary stats about the various CSV files are printed before the main extraction occurs.

Overwriting: Any existings CSV files are silently overwritten, so if you have multiple health data export files in the same directory, take care.

Data Sanitization: The code is almost completely opinionless, and with one exception simply flattens the data in the XML file into a collection of CSV files. The exception concerns file names and the type field file. Apple uses extraordinarily verbose and ugly names like HKQuantityTypeIdentifierStepCount and HKQuantityTypeIdentifierHeight to describe the contents of each record: the abbreviate function in the code uses a regular expression to strip off the nonsense, resulting in nicer, shorter, more comprehensible file names and record types. However, if you prefer to get your data verbatim, simply change the value of ABBREVIATE to False near the top of the file and all your HealthKit prefixes will be preserved, at the cost of a non-trivial expansion of the output file sizes.

Notes on the code: Wot, no tests?

The first thing to say about the code is that there are no tests provided with it, which is—cough—slightly ironic, given the theme of this blog. This isn't because I've written them but am holding them back for pedagogical reasons, or as an ironical meta-commentary on the whole test-driven movement, but merely because I haven't written any yet. Happily, writing tests is a good way of documenting and explaining code, so another post will follow, in which I will present some tests, possibly correct myriad bugs, and explain more about what the code is doing.

I almost said 'I googled "Apple Health export"', but the more accurate statement would be that 'I DuckDuckGoed "Apple Health export"', but there are so many problems with DuckDuckGo as a verb, even in the present tense, let alone in the past as DuckDuckGod. Maybe I should propose the neologism "to DDGoogle". Or as Greg Wilson suggested, "to Duckle". Or maybe not . . . ↩
The ElementTree structure in Python 3 is slightly different in this respect: this exploration was carried out with Python 2. However, the main code presented later in the post works under Python 2 and 3. ↩

Lessons Learned: Bad Data and other SNAFUs

2016-02-15T13:30:00+00:00

My first paid programming job was working for my local education authority during the summer. The Advisory Unit for Computer-Based Education (AUCBE), run by a fantastic visionary and literal "greybeard" called Bill Tagg, produced software for schools in Hertfordshire and environs, and one of their products was a simple database called Quest. At this time (the early 1980s), two computers dominated UK schools—the Research Machines 380Z, a Zilog Z-80-based machine running CP/M, and the fantastic, new BBC Micro, 6502-based machine produced by Acorn, to specification agreed with the British Broadcasting Corporation. I was familiar with both, as my school had a solitary 380Z, and I had harangued my parents into getting me a BBC Model B,¹ which was the joy of my life.

The Quest database existed in two data-compatible forms. Peter Andrews had written a machine code implementation for the 380Z, and Bill Tagg himself had written an implementation in BBC Basic for the BBC Micro. They shared an interface and a manual, and my job was to produce a 6502 version that would also share that manual. Every deviation from the documented and actual behaviour of the BBC Basic implementation had to be personally signed off by Bill Tagg.

Writing Quest was a fantastic project for me, and the most highly constrained I have ever done: every aspect of it was pinned down by a combination of manuals, existing data files, specified interfaces, existing users and reference implementations. Peter Andrews was very generous in writing out, in fountain pen, on four A4 pages, a suggested implementation plan, which I followed scrupulously. That plan probably made the difference between my successfully completing the project and flailing endlessly, and the project was a success.

I learned an enormous amount writing Quest, but the path to success was not devoid of bumps in the road.

Once I had implemented enough of Quest for it to be worth testing, I took to delivering versions to Bill periodically. This was the early 1980s, so he didn't get them by pulling from Github, nor even by FTP or email; rather, I handed him floppy disks,² in the early days, and later on, EPROMs—Erasable, Programmable Read-Only Memory chips that he could plug into the Zero-Insertion Force ("ZIF") socket³ on the side of his machine. (Did I mention how cool the BBC Micro was?)

Towards the end of my development of the 6502 implementation of Quest, I proudly handed over a version to Bill, and was slightly disappointed when he complained that it didn't work with one of his database files. In fact, his database file caused it to hang. He gave me a copy of his data and I set about finding the problem. It goes without saying that a bug that caused the software to hang was pretty bad, so it was clearly important to find it.

It was hard to track down. As I recall, it took me the best part of two solid days to find the problem. When I eventually did find it, it turned out to be a "bad data" problem. If I remember correctly, Quest saved data as flat files using the pipe character "|" to separate fields. The dataset Bill had given me had an extra pipe separator on one line, and was therefore not compliant with the data format. My reaction to this discovery was to curse Bill for sending me on a 2-day wild goose chase, and the following day I marched into AUCBE and told him—with the righteousness that only an arrogant teenager can muster—that it was his data that was at fault, not my beautiful code.

. . . to which Bill, of course, countered:

"And why didn't your beautiful code detect the bad data and report it, rather than hanging?"

Oops.

Introducing SNAFU of the Week

Needless to say, Bill was right. Even if my software was perfect and would never write invalid data (which might not have been the case), and even if data could never become corrupt through disk errors (which was demonstrably not the case), that didn't mean it would never encounter bad data. So the software had to deal with invalid inputs rather better than going into an infinite loop (which is exactly what it did—nothing a hard reset wouldn't cure!)

And so it is with data analysis.

Obviously, there is such a thing as good data—perfectly formatted, every value present and correct; it's just that it is almost never safe to assume that data your software will receive will be good. Rather, we almost always need to perform checks to validate it, and to give various levels of warnings when things are not as they should be. Hanging or crashing on bad data is obviously bad, but in some ways, it is less bad than reading it without generating a warning or error. The hierarchy of evils for analytical software runs something like this:

(Worst) Producing plausible but materially incorrect results from good inputs.
Producing implausible, materially incorrect results from good inputs (generally less bad, because these are much less likely to go unnoticed, though obviously they can be even more serious if they do).
(Least serious) Hanging or crashing (embarrassing and inconvenient, but not actively misleading).

In this spirit, we are going to introduce "SNAFU of the Week", which will be a (not-necessarily weekly) series of examples of kinds of things that can go wrong with data (especially data feeds), analysis, and analytical software, together with a discussion of whether and how it was, or could have been detected and what lessons we might learn from them.

BBC Micro Image: Dave Briggs, https://www.flickr.com/photos/theclosedcircle/3349126651/ under CC-BY-2.0. ↩
Floppy disks were like 3D-printed versions of the save icon still used in much software, and in some cases could store over half a megabyte of data. Of course, the 6502 was a 16-bit processor, that could address a maximum of 64K of RAM. In the case of the BBC micro, a single program could occupy at most 16K, so a massive floppy disk could store many versions of Quest together with enormous database files. ↩
Zero-Insertion Force Socket: Windell Oskay, https://www.flickr.com/photos/oskay/2226425940 under CC-BY-2.0. ↩

How far in advance are flights cheapest? An error of interpretation

2016-01-06T15:00:00+00:00

Guest Post by Patrick Surry, Chief Data Scientist, Hopper

Every year, Expedia and ARC collaborate to publish some annual statistics about domestic airfare, including their treatment of the perennial question "How far in advance should you book your flight?" Here's what they presented in their report last year:

Although there are a lot of things wrong with this picture (including the callout not being at the right spot on the x-axis, and the $496 average appearing above $500 . . .), the most egregious is a more subtle error of interpretation. The accompanying commentary reads:

Still, the question remains: How early should travelers book? . . . Data collected by ARC indicates that the lowest average ticket price, about US$401, can be found 57 days in advance.

While that statement is presumably mathematically correct, it's completely misleading. The chart is drawn by calculating the average price of all domestic roundtrip tickets sold at each advance. That answers the question "how far in advance is the average ticket sold on the day lowest?" but is mistakenly interpreted as answering "how far in advance is a typical ticket cheapest?". That's a completely different question, because the mix of tickets changes with advance. Indeed, travelers tend to book more expensive trips earlier, and cheaper trips later. In fact, for most markets, prices are fairly flat at long advances, and then rise more or less steeply at some point before departure. As a simplification, assume there are only two domestic markets, a short, cheap trip, and a long, expensive one. Both have prices that are flat at long advances, and which start rising about 60 days before departure:

Now let's assume that the relative demand is directly proportional to advance, i.e. 300 days ahead, all tickets sold are for FarFarAway, and 0 days ahead, all tickets sold are for StonesThrow, and let's calculate the price of the average ticket sold as a function of advance:

What do you know? The average price declines as demand switches from more expensive to cheaper tickets, with a minimum coincidentally just less than 60 days in advance. To get a more meaningful answer to the question "how far in advance is the typical ticket cheapest?", we should instead simply calculate separate advance curves for each market, and then combine them based on the total number (or value) of tickets sold in each market. In our simple example, if we assume the two markets have equal overall weight, we get a much more intuitive result, with prices flat up to 60 days, and then rising towards departure:

All this goes to show how important it is that we frame our analytical questions (and answers!) carefully. When the traveller asks: "How far in advance should I book my flight?", it's our responsibility as analysts to recognize that they mean

How far in advance is any given ticket cheapest?

rather than

How far in advance is the average price of tickets sold that day lowest?

Even a correct answer to the latter is dangerously misleading because the traveller is unlikely to recognize the distinction and take it as the (wrong!) answer to their real question.

Tools and Tooling

2015-12-16T08:30:00+00:00

Good tools for testing matter because the temptation to skimp on testing is real even for true believers: anything that reduces the friction and pain associated with actually adding tests therefore has a disproportionate effect on adoption and implementation rates.

I think there are several reasons the temptation to forego writing tests seems to be strong for most people.

The first is that code is sometimes implemented correctly without writing any tests. There's no real temptation to miss out mandatory boiler-plate code at the top of a Java program because we know it can never work without it. The fact that it is possible to produce correct programs (and correct analytical processes) without writing tests means that even if you are intellectually convinced, and conditioned by experience, to believe that it is ultimately faster to produce reliable results by following a test-driven methodology, there's always the lingering memory of those rare occasions when things did just work first time without the "extra" work of writing tests.

The second reason it's tempting to skimp on writing tests is that this is a support activity rather than the main task at hand, making the testing part seem less glamorous and more of a chore. (Everyone is familiar with Abraham Lincoln's "Give me six hours to chop down a tree and I will spend the first four sharpening the axe", but how many adopt his principle?)

Perhaps the final reason for skimping on test code is that large programs and processes typically start life as small programs that we may not intend to use repeatedly. This is especially true in data analysis. While no task is so simple it cannot be botched, the benefits of and need for systematic testing grow with project size. The dynamic of "knocking up a script to calculate something", then later finding yourself using and modifying that script regularly, gradually extending it to handle ever more cases, while very typical for some kinds of development and analysis, carries a high risk. The frequent result is that by the time anyone realises that it really needs a test suite, the code has become grown complex, undocumented and hard to back-fill with tests.

The approaches we propose for test-driven data analysis are deliberately compatible with retrofitting, because in practice so much code develops in this organic way. If we ignore this reality, we will automatically exclude a large proportion—perhaps a majority—of analytical processes actually deployed. Given that we believe the test-driven approach to data analysis has much to offer, we are keen to bring its benefits as widely as possible, so we need to have regard to the case of analytical processes developed in the real world without TDDA as a primary focus.

Understanding that there is a natural temptation not to develop tests helps us to realise that anything we can do to make the process of testing less painful to implement and easier to retrofit when it has been neglected is likely to help adoption.

With these thoughts in mind, over the coming weeks and months, we will have further posts on specific aspects of tooling for TDDA. The broad plan is to discuss ideas we have already implemented in our own Miró data analysis suite, some as extensions to Python's built-in unittest framework, and to start to extract and publish core functionality from there in forms that are applicable to the broader Python (and perhaps, in some cases R) data analysis toolsets.

Generalized Overfitting: Errors of Applicability

2015-12-14T11:00:00+00:00

Everyone building predictive models or performing statistical fitting knows about overfitting. This arises when the function represented by the model includes components or aspects that are overly specific to the particularities of the sample data used for training the model, and that are not general features of datasets to which the model might reasonably be applied. The failure mode associated with overfitting is that the performance of the model on the data we used to train it is significantly better than the performance when we apply the model to other data.

Figure: Overfitting. Points drawn from sin(x) + Gaussian noise. Left: Polynomial fit, degree 3 (cubic; good fit). Right: Polynomial fit, degree 10 (overfit).

Statisticians use the term cross-validation to describe the process of splitting the training data into two (or more) parts, and using one part to fit the model, and the other to assess whether or not it exhibits overfitting. In machine learning, this is more often referred to as a "test-training" approach.

A special form of this approach is longitudinal validation, in which we build the model on data from one time period and then check its performance against data from a later time period, either by partitioning the data available at build time into older and newer data, or by using outcomes collected after the model was built for validation. With longitudinal validation, we seek to verify not only that we did not overfit the characteristics of a particular data sample, but also that the patterns we model are stable over time.

Validating against data for which the outcomes were not known when the model was developed has the additional benefit of eliminating a common class of errors that arises when secondary information about validation outcomes "leaks" during the model building process. Some degree of such leakage—sometimes known as contaminating the validation data—is quite common.

Generalized Overfitting

As its name suggests, overfitting as normally conceived is a failure mode specific to model building, arising when we fit the training data "too well". Here, we are are going to argue that overfitting is an example of a more general failure mode that can be present in any analytical process, especially if we use the process with data other than that used to build it. Our suggested name for this broader class of failures is errors of applicability.

Here are some of the failure modes we are thinking about:

Changes in Distributions of Inputs (and Outputs)

New categories. When we develop the analytical process, we see only categories A, B and C in some (categorical) input or output. In operation, we also see category D. At this point our process may fail completely ("crash"), produce meaningless outputs or merely produce less good results.
Missing categories. The converse can be a problem too: what if a category disappears? Most prosaically, this might lead to a divide-by-zero error if we've explicitly used each category frequency in a denominator. Subtler errors can also creep in.
Extended ranges. For numeric and other ordered data, the equivalent of new categories is values outside the range we saw in the development data. Even if the analysis code runs without incident, the process will be being used in a way that may be quite outside that considered and tested during development, so this can be dangerous.
Distributions. More generally, even if the range of the input data doesn't change, its distribution may, either slowly or abruptly. At the very least, this indicates the process is being used in unfamiliar territory.
Nulls. Did nulls appear in any fields where there were none when we developed the process? Does the process cater for this appropriately? And are such nulls valid?
Higher Dimensional Shifts. Even if the the data ranges and distribution for individual fields don't change, their higher dimensional distributions (correlations) can change significantly. The pair of 2-dimensional distributions below illustrates this point in an extreme way. The distributions of both x and y values on the left and right are identical. But clearly, in 2 dimensions, we see that the space occupied by the two datasets is actually non-overlapping, and on the left x and y are negatively correlated, while on the right they are positively correlated.

Figure: The same x and y values are shared between these two plots (i.e. the disibution of x and y is identical in each case). However, the pairing of x and y coordinates is different. A model or other analytical process built with with negatively correlated data like that on the left might not work well for positively correlated data like that on the right. Even if it does work well, you may want to detect and report a fundamental change like this.
Time (always marching on). Times and dates are notoriously problematical. There are many issues around date and time formats, many specifically around timezones (and the difference between a local times and times in a fixed time zone, such as GMT or UTC).

For now, let's assume that we have an input that is a well-defined time, correctly read and analysed in a known timezone—say UTC.¹ Obviously, new data will tend to have later times—sometimes non-overlapping later times. Most often, we need to change these to intervals measured with respect to a moving date (possibly today, or some variable event date, e.g. days since contact). But in other cases, absolute times, or times in a cycle matter. For example, season, time of month or time of day may matter—the last two, probably in local time rather than UTC.

In handling time, we have to be careful about binnings, about absolute vs. relative measurement (2015-12-11T11:00:00 vs. 299 hours after the start of the current month), universal vs. local time, and appropriate bin boundaries that move or expand with the analytic time window being considered.

Time is not unique in the way that its range and maximum naturally increase with each new data sample. Most obviously, other counters (such as customer number) and sum-like aggregates may have this same monotonically increasing character, meaning that it should be expected that new, higher (but perhaps not new lower) values will be present in newer data.

Concrete and Abstract Definitions

There's a general issue with choosing values based on data used during development. This concerns the difference between what we will term concrete and abstract values, and what it means to perform "the same" operation on different datasets.

Suppose we decide to handle outliers differently from the rest of the data in a dataset, at least for some part of the analysis. For example, suppose we're looking at flight prices in Sterling and we see the following distribution.

Figure: Ticket prices, in £100 bins to £1,000, then doubling widths to £256,000, with one final bin for prices above £256,000. (On the graph, the £100-width bins are red; the rest are blue.)

On the basis of this, we see that well over 99% of the data has prices under £4,000, and also that while there are a few thousand ticket prices in the £4,000–£32,000 range (most of which are probably real) the final few thousand probably contain bad data, perhaps as a result of currency conversion errors.

We may well want to choose one or more threshold values from the data—say £4,000 in this case—to specify some aspect of our analytical process. We might, for example, use this threshold in the analysis for filtering, outlier reporting, setting a final bin boundary or setting the range for the axes of a graph.

The crucial question here is: How do we specify and represent our threshold value?

Concrete Value: Our concrete value is £4,000. In the current dataset there are 60,995 ticket prices (0.55%) above this value and 10,807,905 (99.45%) below. (There are no prices of exactly £4,000.) Obviously, if we specify our threshold using this concrete value—£4,000—it will be the same for any dataset we use with the process.
Abstract Value: Alternatively, we might specify the value indirectly, as a function of the input data. One such abstract specification is the price P below which which 99.45% of ticket prices the dataset lie. If we specify a threshold using this abstract definition, it will vary according to the content of the dataset.
- In passing, 99.45% is not precise: if we select the bottom 99.45% of this dataset by price we get 10,808,225 records with a maximum price of £4,007.65. The more precise specification is that 99.447046% of the dataset has prices under £4,000.
- Of course, being human, if we were specifying the value in this way, we would probably round the percentage to 99.5%, and if we did that we would find that we shifted the threshold so that the maximum price below it was £4,186.15, and the minimum price above was £4,186.22.
Alternative Abstract Specifications: Of course, if we want to specify this threshold abstractly, there are countless other ways we might do it, some fraught with danger.

Two things we should definitely avoid when working with data like this are means and variances across the whole column, because they will be rendered largely meaningless by outliers. If we blindly calculate the mean, μ, and standard deviation, σ, in this dataset, we get μ=£2,009.85 and σ=£983,956.28. That's because, as we noted previously, there are a few highly questionable ticket prices in the data, including a maximum of £1,390,276,267.42.² Within the main body of the data—the ~99.45% with prices below £4,000.00—the corresponding values are μ=£462.09 and σ=£504.82. This emphasizes how dangerous it would be to base a definition on full-field moments³ such as mean or variance.

In contrast, the median is much less affected by outliers. In the full dataset, for example the median ticket price is £303.77, while the median of those under £4,000.00 is £301.23. So another reasonably stable abstract definition of a threshold around £4,000.00 would be something like 13 times the median.

The reason for labouring this point around abstract vs. concrete definitions is that it arises very commonly and it is not always obvious which is preferable. Concrete definitions have the advantage of (numeric) consistency between analyses, but may result in analyses that are not well suited to a later dataset, because different choices would have been made if that later data had been considered by the developer of the process. Conversely, abstract definitions often make it easier to ensure that analyses are suitable for a broader range of input datasets, but can make comparability more difficult; they also tend to make it harder to get "nice" human-centric scales, bin boundaries and thresholds (because you end up, as we saw above, with values like £4,186.22, rather than £4,000).

Making a poor choice between abstract and concrete specifications of any data-derived values can lead to large sections of the data being omitted (if filtering is used), or made invisible (if used for axis boundaries), or conversely can lead to non-comparability between results or miscomputations if values are associated with bins having different boundaries in different datasets.

NOTE: A common source of the leakage of information from validation data into training data, as discussed above, is the use of the full dataset to make decisions about thresholds such as those discussed here. To get the full benefit of cross-validation, all modelling decisions need to be made solely on the basis of the training data; even feeding back performance information from the validation data begins to contaminate that data.

Data-derived thresholds and other values can occur almost anywhere in an analytical process, but specific dangers include:

Selections (Filters). In designing analytical processes, we may choose to filter values, perhaps to removing outliers or nonsensical values. Over time, the distribution may shift, and these filters may become less appropriate and remove ever-increasing proportions of the data.

A good example of this we have seen recently involves negative charges. In early versions of ticket price information, almost all charges were positive, and those that were negative were clearly erroneous, so we added a filter to remove all negative charges from the dataset. Later, we started seeing data in which there were many more, and less obviously erroneous negative charges. It turned out that a new data source generated valid negative charges, but we were misled in our initial analysis and the process we built was unsuitable for the new context.
Binnings (Bandings, Buckets). Binning data is very common, and it is important to think carefully about when you want bin boundaries to be concrete (common across datasets) and when they should be abstract (computed, and therefore different for different datasets). Quantile binnings (such as deciles), of course, are intrinsically adaptive, though if those are used you have to be aware that any given bin in one dataset may have different boundaries from the "same" bin in another dataset.
Statistics. As noted above, some care has to be taken when any statistic is used in the dataset to determine whether it should be recorded algorithmically (as an abstract value) in analysis or numerically (as a concrete value), and particular care should be taken with statistics that are sensitive to outliers.

Other Challenges to Applicability

In addition to the common sources of errors of applicability we have outlined above, we will briefly mention a few more.

Non-uniqueness. Is a value that was different for each record in the input data now non-unique?
Crazy outliers. Are there (crazy) outliers in fields where there were none before?
Actually wrong. Are there detectable data errors in the operational data that were not seen during development?
New data formats. Have formats changed, leading to misinterpretation of values?
New outcomes. Even more problematical than new input categories or ranges are new outcome categories or a larger range of output values. When we see this, we should almost always re-evaluate our analytical processes.

Four Kinds of Analytical Errors

In the overview of TDDA we published in Predictive Analytic Times (available here), we made an attempt to summarize how the four main classes of errors arise with the following diagram:

While this was always intended to be a simplification, a particular problem is that it suggests there's no room for errors of interpretation in the operationalization phase, which is far from the case.⁴ Probably a better representation is as follows:

UTC is the curious abbreviation (malacronym?) used for coordinated universal time, which is the standardized version of Greenwich Mean Time now defined by the scientific community. It is the time at 0º longitude, with no "daylight saving" (British Summer Time) adjustment. ↩
This is probably the result of a currency conversion error. ↩
Statistical moments are the characterizations of distributions starting with mean and variance, and continuing with skewness and kurtosis. ↩
It's never too late to misinterpret data or results. ↩

Overview of TDDA in Predictive Analytics Times

2015-12-11T16:20:00+00:00

We have an overview piece in Predictive Analytics Times this week.

You can find it here.

Anomaly Detection

2015-12-01T08:30:00+00:00

The Broader Process: Anomaly detection and Alerting.

The fourth major area we will focus on as we develop the ideas of test-driven data analysis is the correctness of the broader process. This relates partly to some of the ideas about consistency checking discussed earlier, but goes further.

A common situation with analysis processes is that they are used repeatedly on some kind of feed of data. When this is the case, in addition to checking the internal consistency of data, we have the opportunity to compare the current dataset with the previous datasets that have been seen with a view to detecting and reporting sudden and potentially unexpected changes. Simple examples might include changes in data volumes, shifts in distributions, increases in missing data rates and new or disappearing categories. More complex examples are multivariate, involving changes in the relationships between variables over time.

While this can be a complex topic, simply tracking a time series of summary stats about various data (especially inputs) and setting thresholds for deviations between the current data and what's come before can catch a good many problems. A more sophisticated and ambitious approach might involve trying to do general automatic anomaly detection on the incoming data, again using information about data previously seen as a reference point.

Depending on how automated the process is, it might be appropriate for the result of such anomaly detection to be a simple section or note in the output, or the creation of some kind of alert (such as a triggered email).

Unit Testing

2015-11-28T08:30:00+00:00

Systematic Unit Tests, System Tests and Reference Tests

The third major idea in test-driven data analysis is the one most directly taken from test-driven development, namely systematically developing both unit tests for small components of the analytical process and carefully constructed, specific tests for the whole system or larger components.

With regression testing, we emphasized the idea of taking whole-system analyses that have already been performed and turning those into overall system tests. In contrast, here we are talking about creating (or selecting) specific patterns in the input data for particular functions that exercise both core functionality (the main code paths) and so-called edge cases (the less-trodden paths). This is definitely significant extra work, and may (particularly if retrofitted) require either restructuring of code or use of some kind of mocking.

When we talk about "edge cases", we are referring to patterns, code-paths or functionality that is used only a minority of the time, or perhaps only on rare occasions. Examples might include handling missing values, extreme values, small data volumes and so forth. Some questions that might help to illuminate common edge cases include:

What happens if we use a standard deviation and there is only one value (so that the standard deviation is not defined)?
What happens if there are no records?
What happens if some or all of the data is null?
What happens if a few extreme (and possibly erroneous) values occur: will this, for example, cause a mean to have an non-useful value or bin boundaries to be set to useless values?
What happens if two fields are perfectly correlated: will this cause instability or errors when performing, for example, a statistical regression?
What happens if there are invalid characters in string data, especially field names or file names?
What happens if input data formats (e.g. string encodings, date formats, separators) are not as expected.
Are the various styles of line-endings (e.g. Unix vs. pc.) handled correctly?
What happens if an external system (such as a database) is running in some unexpected mode, such as with a different locale?

This idea of performing checks throughout the software does not really have an analogue in mainstream TDD, but is certainly good software engineering practice—see, e.g. the timeless Little Bobby Tables XKCD https://xkcd.com/327/—and relates fairly closely to the software engineering practice of defensive programming https://en.wikipedia.org/wiki/Defensive_programming. ↩
Obviously, in many situations, it's fine for identifiers or keys to be repeated, but it is also often the case that in a particular table a field value must be unique, typically when the records act as master records, defining the entities that exist in some category. Such tables are often referred to as master tables in database contexts https://encyclopedia2.thefreedictionary.com/master+file. ↩

Constraints and Assertions

2015-11-26T11:00:00+00:00

Consistency Checking of Inputs, Outputs and Intermediates

While the idea of regression testing comes straight from test-driven development, the next idea we want to discuss is associated more with general defensive progamming than TDD. The idea is consistency checking, i.e. verifying that what might otherwise be implicit assumptions are in fact met by adding checks at various points in the process.¹

Initially, we will assume that we are working with tabular data, but the ideas can be extended to other kinds of data.

Inputs. It is useful to perform some basic checks on inputs. Typical things to consider include:

Are the names and types fields in the input data as we expect? In most cases, we also expect field names to be distinct, and perhaps to conform to some rules.
Is the distribution of values in the fields reasonable? For example, are the minimum and maximum values reasonable?
Are there nulls (missing values) in the data, and if so, are they permitted where they occur? If so, are there any restrictions (e.g. may all the value for a field or record be null?)
Is the volume of data reasonable (exactly as expected, if there is a specific size expectation, or plausible, if the volume is variable)?
Is any required metadata² included?

In addition to basic sense checks like these, we can also often formulate self-consistency checks on the data. For example:

Are any of the fields identifiers or keys for which every value should occur only once?³
Are there row-level identities that should be true? For example, we might have a set of category counts and an overall total, and expect the category totals to sum to the overall total:
```
nCategoryA + nCategoryB + nCategoryC = nOverall
```
For categorical data, are all the values found in the data allowed, and are any required values missing?
If the data has a time structure, are the times and dates self-consistent? For example, do any end dates precede start dates? Are there impossible future dates?
Are there any ordering constraints on the data, and if so are they respected?

Our goal in formulating TDDA is pragmatic: we are not suggesting it is necessary to check for every possible inconsistency in the input data. Rather, we propose that even one or two simple, well-chosen checks can catch a surprising number of problems. As with regression testing, an excellent time to add new checks is when you discover problems. If you add a consistency check every time you discover bad inputs that such a test would have caught, you might quickly build up a powerful, well-targeted set of diagnostics and verification procedures. As we will see below, there is also a definite possibility of tool support in this area.

Intermediate Results and Outputs. Checking intermediates and outputs is very similar to checking inputs, and all same kinds of tests can be applied. Some further questions to consider in these contexts include:

If we look up reference data, do (and should) all the lookups succeed? And are failed lookups handled appropriately?
If we calculate a set of results that should exhibit some identity properties, do those hold? Just as physics has conservation laws for quantities such as energy and momentum, there are similar conservation principles in some analytical calculations. As a simple example, if we categorize spending into different, non-overlapping categories, the sum of the category totals should usually equal the sum of all the transactions, as long as we are careful about things like non-categorized values.
If we build predictive models, do they cross-validate correctly (if we split the data into a training subset and a validation subset)? And, ideally, do they also validate longitudinally (i.e., on later data, if this is available)?

Transfer checks. With data analysis, our inputs are frequently generated by some other system or systems. Often those systems already perform some checking or reporting of the data they produce. If any information about checks or statistics from source systems is available, it is useful to verify that equivalent statistics calculated over the input data produce the same results. If our input data is transactional, maybe the source system reports (or can report) the number or value of transactions over some time period. Perhaps it breaks things down by category. Maybe we know other summary statistics or there are checksums available that can be verified.

The value of checking that the data received is the same as the data the source system was supposed to send is self-evident, and can help us to detect a variety of problems including data loss, data duplication, data corruption, encoding issues, truncation errors and conversion errors, to name but a few.

Tool Support: Automatic constraint suggestion

A perennial obstacle to better testing is the perception that it is a "nice to have", rather than a sine qua non, and that implementing it will require much tedious work. Because of this, any automated support that tools could provide would seem to be especially valuable.

Fortunately, there is low-hanging fruit in many areas, and one of of our goals with this blog is to explore various tool enhancements. We will do this first in our own Miró software and then, as we find things that work, will try to produce some examples, libraries and tools for broader use, probably focused around the Python data analysis stack.

In the spirit of starting simply, we're first going to look at what might be possible by way of automatic input checking.

One characteristic of data analysis is that we often start by trying to get a result from some particular dataset, rather than setting out to implement an analytical process to be used repeatedly with different inputs. In fact, when we start, we may not even have a very specific analytical goal in mind: we may simply have some data (perhaps poorly documented) and perform exploratory analysis with some broad analytical goal in mind. Perhaps we will stop when we have a result that seems useful, and which we have convinced ourselves is plausible. At some later point, we may get a similar dataset (possibly pertaining to a later period, or a different entity) and need to perform a similar analysis. It's at this point we may go back to see whether we can re-use our previous, embryonic analytical process, in whatever form it was recorded.

Let's assume, for simplicity, that the process at least exists as some kind of executable script, but that it's "hard-wired" to the previous data. We then have three main choices.

Edit the script (in place) to make it work with the new input data.
Take a copy of the script and make that work with the new input data.
Modify the script to allow it to work with either the new or the old data, by parameterizing and generalizing it.

Presented like this, the last sounds like the only sensible approach, and in general it is the better way forward. However, we've all taken the other paths from time to time, often because under pressure just changing a few old hard-wired values to new hard-wired values seems as if it will get us to our immediate result faster.⁴

The problem is that even if were very diligent when preparing the first script, in the context of the original analysis, it is easy for there to be subtle differences in a later dataset that might compromise or invalidate the analysis, and it's hard to force ourselves to be as vigilent the second (third, fourth, ...) time around.

A simple thing that can help is to generate statements about the original dataset and record these as constraints. If a later dataset violates these constraints, it doesn't necessarily mean that anything is wrong, but being alerted to the difference at least offers us an opportunity to consider whether this difference might be significant or problematical, and indeed, whether it might indicate a problem with the data.

Concretely: let's think about what we probably know about a dataset whenever we work with it. We'll use the Periodic Table as an example dataset, based on a snapshot of data I extracted from Wikipedia a few years ago. This is how Miró summarizes the dataset if we ask for a "long" listing of the fields with its ls -l command:

             Field      Type                        Min                         Max    Nulls
                 Z       int                          1                          92       0
              Name    string                   Actinium                   Zirconium       0
            Symbol    string                         Ac                          Zr       0
            Period       int                          1                           7       0
             Group       int                          1                          18      18
    ChemicalSeries    string                   Actinoid            Transition metal       0
      AtomicWeight      real                       1.01                      238.03       0
         Etymology    string                      Ceres                      zircon       0
RelativeAtomicMass      real                       1.01                      238.03       0
     MeltingPointC      real                    -258.98                    3,675.00       1
MeltingPointKelvin      real                      14.20                    3,948.00       1
     BoilingPointC      real                    -268.93                    5,596.00       0
     BoilingPointF      real                    -452.07                   10,105.00       0
           Density      real                       0.00                       22.61       0
       Description    string                        0.2            transition metal      40
            Colour    string    a soft silver-white ...    yellowish green or g ...      59

For each field, we have a name, a type, minimum and maximum values and a count of the number of missing values. [Scroll sideways if your window is too narrow to see the Nulls column on the right.] We also have, implicitly, the field order.

This immediately suggests a set of constraints we might want to construct. We've added an experimental command to Miró for generating constraints based on the field metadata shown earlier and a few other statistics. First, here's the "human-friendly" view that Miró produces if we use its autoconstraints -l command.

In this table, the green cells represent constraints the system suggests for fields, and the orange cells show areas in which potential constraints were not constructed, though they would have been had the data been different. Darker shades of orange indicate constraints that were closer to be met within the data.

In addition to this human-friendly view, Miró generates out a set of declarations, which can be thought of as candidate assertions. Specifically, they are statements that are true in the current dataset, and therefore constitute potential checks we might want to carry out on any future input datasets we are using for the same analytical process.

Here they are:

declare (>= (min Z) 1)
declare (<= (max Z) 92)
declare (= (countnull Z) 0)
declare (non-nulls-unique Z)

declare (>= (min (length Name)) 3)
declare (<= (min (length Name)) 12)
declare (= (countnull Name) 0)
declare (non-nulls-unique Name)

declare (>= (min (length Symbol)) 1)
declare (<= (min (length Symbol)) 2)
declare (= (countnull Symbol) 0)
declare (non-nulls-unique Symbol)

declare (>= (min Period) 1)
declare (<= (max Period) 7)
declare (= (countnull Period) 0)

declare (>= (min Group) 1)
declare (<= (max Group) 18)

declare (>= (min (length ChemicalSeries)) 7)
declare (<= (min (length ChemicalSeries)) 20)
declare (= (countnull ChemicalSeries) 0)
declare (= (countzero
            (or (isnull ChemicalSeries)
                (in ChemicalSeries (list "Actinoid" "Alkali metal"
                                         "Alkaline earth metal"
                                         "Halogen" "Lanthanoid"
                                         "Metalloid" "Noble gas"
                                         "Nonmetal" "Poor metal"
                                         "Transition metal"))))
           0)

declare (>= (min AtomicWeight) 1.007946)
declare (<= (max AtomicWeight) 238.028914)
declare (= (countnull AtomicWeight) 0)
declare (> (min AtomicWeight) 0)

declare (>= (min (length Etymology)) 4)
declare (<= (min (length Etymology)) 39)
declare (= (countnull Etymology) 0)

declare (>= (min RelativeAtomicMass) 1.007946)
declare (<= (max RelativeAtomicMass) 238.028914)
declare (= (countnull RelativeAtomicMass) 0)
declare (> (min RelativeAtomicMass) 0)

declare (>= (min MeltingPointC) -258.975000)
declare (<= (max MeltingPointC) 3675.0)

declare (>= (min MeltingPointKelvin) 14.200000)
declare (<= (max MeltingPointKelvin) 3948.0)
declare (> (min MeltingPointKelvin) 0)

declare (>= (min BoilingPointC) -268.930000)
declare (<= (max BoilingPointC) 5596.0)
declare (= (countnull BoilingPointC) 0)

declare (>= (min BoilingPointF) -452.070000)
declare (<= (max BoilingPointF) 10105.0)
declare (= (countnull BoilingPointF) 0)

declare (>= (min Density) 0.000089)
declare (<= (max Density) 22.610001)
declare (= (countnull Density) 0)
declare (> (min Density) 0)

declare (>= (min (length Description)) 1)
declare (<= (min (length Description)) 83)

declare (>= (min (length Colour)) 4)
declare (<= (min (length Colour)) 80)

Each green entry in the table maps to a declaration in this list. Let's look at a few:

Min and Max. Z is the atomic number. Each element has an atomic number, which is the number of protons in the nucleus, and each is unique. Hydrogen has the smallest number of protons, 1, and in this dataset, Uranium has the largest number—92. So the first suggested constraints are that these values should be in the observed range. These show up as the first two declarations:
```
declare (>= (min Z) 1)
declare (<= (max Z) 92)
```
We should say a word about how these constraints are expressed. Miró includes expression language called (lisp-like), (because it's essentially a dialect of Lisp). Lisp is slightly unusual in that instead of writing f(x, y) you write (f x y). So the first expression would be more commonly expressed as
```
min(Z) >= 1
```
in regular ("infix") languages.

Lisp weirdness aside, are these sensible constraints? Well, the first certainly is. Even if we find some elements beyond Uranium (which we will, below), we certainly don't expect them to have zero or negative numbers of protons, so the first constraint seems like a keeper.

The second constraint is much less sensible. In fact, given that we know the dataset includes every value of Z from 1 to 92, we confidently expect that any future revisions of the periodic table will include values higher than 92. So we would probably discard that constraint.

The crucial point is that no one wants to sit down and write out a bunch of constraints by hand (and anyway, "why have a dog an bark yourself?"). People are generally much more willing to review a list of suggested constraints and delete the ones that don't make sense, or modify them so that they do.
Nulls. The next observation about Z is that it contains no nulls. This turns into the (lisp-like) constraint:
```
declare (= (countnull Z) 0)
```
This is also almost certainly a keeper: we'd probably be pretty unhappy if we received a Periodic Table with missing Z values for any elements.

(Here, (countnull Z) just counts the number of nulls in field Z, and = tests for equality, so the expression reads "the number of nulls in Z is equal to zero".)
Sign. The sign column is more interesting. Here, we have recorded the fact that all the values in Z are positive. Clearly, this is a logically implied by the fact that the minimum value for Z is 1, but we think it's useful to record two separate observations about the field—first, that its minimum value is 1, and secondly that it is always strictly positive. In cases where the minimum is 1, for an integer field, these statements are entirely equivalent, but if the minimum had been (say) 3, they would be different. The value of recording these observations separately arises if at some later stage the minimum changes, while remaining positive. In that case, we might want to discard the specific minimum constraint, but leave in place the constraint on the sign.

Although we record the sign as a separate constraint in the table, in this case it does not generate a separate declaration, as it would be identical to the constraint on the minimum that we already have.

In contrast, AtomicWeight, has a minimum value around 1.008, so it does get a separate sign constraint:
```
declare (> (min AtomicWeight) 0)
```
Uniqueness of Values. The next thing our autoconstraints framework has noticed about Z is that none of its values is repeated in the data—that all are unique (a.k.a. distinct). The table reports this as yes (the values are unique) and 92/92 (100%), meaning that there are 92 distinct values and 92 non-null values in total, so that 100% of values are unique. Other fields, such as Etymology, have potential constraints that are not quite met: Etymology has 89 different values in the field, so the ratio of distinct values to values is about 97%.

NOTE: in considering this, we ignore nulls if there are any. You can see this if you look at the Unique entry for the field Group: here there are 18 different (non-null) values for Group, and 74 records have non-null values for Group.

There is a dedicated function in (lisp-like) for checking whether the non-null values in a field are all distinct, so the expression in the declaration is just:
```
(non-nulls-unique Z)
```
which evaluates to true⁵ or false.
Min and Max for String Fields. For string fields, the actual minimum and maximum values are usually less interesting. (Indeed, there are lots of reasonable alternative sort orders for strings, given choices such as case sensitivity, whether embedded numbers should be sorted numerically or alphanumerically, how spaces and punctuation should be handled, what to do with accents etc.) In the initial implementation, instead of using any min and max string values as the basis of constraints, we suggest constraints based on string length.

For the string fields here, none of the constraints is particularly compelling, though a minimum length of 1 might be interesting and you might even think that a maximum length of 2 is sensible the symbol is useful. But in many cases they will be. One common case is fixed-length strings, such as the increasingly ubiquitous UUIDs,⁶ where the minimum and maximum values would both 36 if they are canonically formatted. (Of course, we can add much stronger constraints if we know all the strings in a field are UUIDs.)
Categorical Values. The last kind of automatically generated constraint we will discuss today is a restriction of the values in a field to be chosen from some fixed set. In this case, Miró has noticed that there are only 10 different non-null values for ChemicalSeries, so has suggested a constraint to capture that reality. The slightly verbose way this currently gets expressed as a constraint is:
```
declare (= (countzero
            (or (isnull ChemicalSeries)
                (in ChemicalSeries
                    (list "Actinoid" "Alkali metal"
                          "Alkaline earth metal"
                          "Halogen" "Lanthanoid" "Metalloid" "Noble gas"
                          "Nonmetal" "Poor metal" "Transition metal"))))
           0)
```
(The or statement starting on the second line is true for field values that are either in the list or null. The countzero function, when applied to booleans, counts false values, so this is saying that none of the results of the or statement should be false, i.e. all values should be null or in the list. This would be more elegantly expressed with an (all ...) statement; we will probably change it to that formulation soon, though the current version is more useful for reporting failures.)

The current implementation generates these constraints only when the number of distinct values it sees is 20 or fewer, only for string fields, and only when not all the values in the field are distinct, but all of these aspect can probably be improved, and the user can override the number of categories to allow.

In addition to these constraints, we should also probably generate constraints on the field types and, as we will discuss in future articles, dataset-level constraints.

Tool Support: Using the Declarations

Obviously, if we run test the constraints against the same dataset we used to generate them, all the constraints should be (and are!) satisfied. Things are slightly more interesting if we run them against a different dataset. In this case, we excluded transuranic elements from the dataset we used to generate the constraints. But we can add them in. If we do so, and then execute a script (e92.miros) containing the autogenerated constraints, we get the following output:

$ miro
This is Miro, version 2.1.90.
Copyright © Stochastic Solutions 2008-2015.
Seed: 1463187505
Logs started at 2015/11/25 17:08:23 host tdda.local.
Logging to /Users/njr/miro/log/2015/11/25/session259.

[1]> load elements
elements.miro: 118 records; 118 (100%) selected; 16 fields.
[2]> . e92
[3]> # Autoconstraints for dataset elements92.miro.
[4]> # Generated from session /Users/njr/miro/log/2015/11/25/session256.miros
[5]> declare (>= (min Z) 1)
[6]> declare (<= (max Z) 92)

Miro Warning: Declaration failed: (<= (max Z) 92)

[7]> declare (= (countnull Z) 0)
[8]> declare (non-nulls-unique Z)
[9]> declare (>= (min (length Name)) 3)
[10]> declare (<= (min (length Name)) 12)
[11]> declare (= (countnull Name) 0)
[12]> declare (non-nulls-unique Name)
[13]> declare (>= (min (length Symbol)) 1)
[14]> declare (<= (min (length Symbol)) 2)
[15]> declare (= (countnull Symbol) 0)
[16]> declare (non-nulls-unique Symbol)
[17]> declare (>= (min Period) 1)
[18]> declare (<= (max Period) 7)
[19]> declare (= (countnull Period) 0)
[20]> declare (>= (min Group) 1)
[21]> declare (<= (max Group) 18)
[22]> declare (>= (min (length ChemicalSeries)) 7)
[23]> declare (<= (min (length ChemicalSeries)) 20)
[24]> declare (= (countnull ChemicalSeries) 0)
[25]> declare (= (countzero
                  (or (isnull ChemicalSeries)
                      (in ChemicalSeries (list "Actinoid" "Alkali metal"
                                               "Alkaline earth metal"
                                               "Halogen" "Lanthanoid"
                                               "Metalloid" "Noble gas"
                                               "Nonmetal" "Poor metal"
                                               "Transition metal"))))
                 0)
[26]> declare (>= (min AtomicWeight) 1.007946)
[27]> declare (<= (max AtomicWeight) 238.028914)

Miro Warning: Declaration failed: (<= (max AtomicWeight) 238.028914)

[28]> declare (= (countnull AtomicWeight) 0)

Miro Warning: Declaration failed: (= (countnull AtomicWeight) 0)

[29]> declare (> (min AtomicWeight) 0)
[30]> declare (>= (min (length Etymology)) 4)
[31]> declare (<= (min (length Etymology)) 39)
[32]> declare (= (countnull Etymology) 0)

Miro Warning: Declaration failed: (= (countnull Etymology) 0)

[33]> declare (>= (min RelativeAtomicMass) 1.007946)
[34]> declare (<= (max RelativeAtomicMass) 238.028914)

Miro Warning: Declaration failed: (<= (max RelativeAtomicMass) 238.028914)

[35]> declare (= (countnull RelativeAtomicMass) 0)

Miro Warning: Declaration failed: (= (countnull RelativeAtomicMass) 0)

[36]> declare (> (min RelativeAtomicMass) 0)
[37]> declare (>= (min MeltingPointC) -258.975000)
[38]> declare (<= (max MeltingPointC) 3675.0)
[39]> declare (>= (min MeltingPointKelvin) 14.200000)
[40]> declare (<= (max MeltingPointKelvin) 3948.0)
[41]> declare (> (min MeltingPointKelvin) 0)
[42]> declare (>= (min BoilingPointC) -268.930000)
[43]> declare (<= (max BoilingPointC) 5596.0)
[44]> declare (= (countnull BoilingPointC) 0)

Miro Warning: Declaration failed: (= (countnull BoilingPointC) 0)

[45]> declare (>= (min BoilingPointF) -452.070000)
[46]> declare (<= (max BoilingPointF) 10105.0)
[47]> declare (= (countnull BoilingPointF) 0)

Miro Warning: Declaration failed: (= (countnull BoilingPointF) 0)

[48]> declare (>= (min Density) 0.000089)
[49]> declare (<= (max Density) 22.610001)

Miro Warning: Declaration failed: (<= (max Density) 22.610001)

[50]> declare (= (countnull Density) 0)

Miro Warning: Declaration failed: (= (countnull Density) 0)

[51]> declare (> (min Density) 0)
[52]> declare (>= (min (length Description)) 1)
[53]> declare (<= (min (length Description)) 83)
[54]> declare (>= (min (length Colour)) 4)
[55]> declare (<= (min (length Colour)) 80)

10 warnings and 0 errors generated.

Job completed after a total of 10.2801 seconds.
Logs closed at 2015/11/25 17:08:23 host tdda.local.
Logs written to /Users/njr/miro/log/2015/11/25/session259.

By default, Miró generates warnings when declared constraints are violated. In this case, ten of the declared constraints were not met, so there were ten warnings. We can also set the declarations to generate errors rather than warnings, allowing us to stop execution of a script if the data fails to meet our declared expectations.

In this case, the failed declarations are mostly unsurprising and untroubling. The maximum values for Z, AtomicWeight, RelativeAtomicMass, and Density all increase in this version of the data, which is expected given that all the new elements are heavier than those in the initial analysis set. Equally, while the fields AtomicWeight, RelativeAtomicMass, Etymology, BoilingPointC, BoilingPointF and Density were all populated in the original dataset, each now contains nulls. Again, this is unsurprising in this case, but in other contexts, detecting these sorts of changes in a feed of data might be important. Specifically, we should always be interested in unexpected differences between the datasets used to develop an analytical process, and ones for which that process is used at a later time: it is very possible that they will not be handled correctly if they were not seen or considered when the process was developed.

There are many further improvements we could make to the current state of the autoconstraint generation, and there are other kinds of constraints it can generate that we will discuss in later posts. But as simple as it is, this level of checking has already identified a number of problems in the work we have been carrying out with Skyscanner and other clients.

We will return to this topic, including discussing how we might add tool support for revising constraint sets in the light of failures, merging different sets of constraints and adding constraints that are true only of subsets of the data.

Parting thoughts

Outputs and Intermediates. While developing the ideas about automatically generating constraints, our focus was mostly on input datasets. But in fact, most of the ideas are almost as applicable to intermediate results and outputs (which, after all, often form the inputs to the next stage of an analysis pipeline). We haven't performed any analysis in this post, but if we had, there might be similar value in generating constraints for the outputs as well.

Living Constraints and Type Systems. In this article, we've also focused on checking constraints at particular points in the process—after loading data, or after generating results. But it's not too much of a stretch to think of constraints as statements that should always be true of data, even as we append records, redefine fields etc. We might call these living or perpetual constraints. If we do this, individual field constraints become more like types. This idea, together with dimensional analysis, will be discussed in future posts.

See e.g. the timeless Little Bobby Tables XKCD https://xkcd.com/327/ and the Wikipedia entry on Defensive Programming. ↩
Metadata is data about data. In the context of tabular data, the simplest kinds of metadata are the field names and types. Any statistics we can compute are another form of metadata, e.g. minimum and maximum values, averages, null counts, values present etc. There is literally no limit to what metadata can be associated with an underlying dataset. ↩
Obviously, in many situations, it's fine for identifiers or keys to be repeated, but it is also often the case that in a particular table a field value must be unique, typically when the records act as master records, defining the entities that exist in some category. Such tables are often referred to as master tables in database contexts https://encyclopedia2.thefreedictionary.com/master+file. ↩
We're not saying this conviction is wrong: it is typically quicker just to whack in the new values each time. Our contention is that this is a more error-prone, less systematic approach. ↩
(lisp-like) actually follows an amalgam of Lisp conventions, using t to represent True, like Common Lisp, and f for False, which is more like Scheme or Clojure. But it doesn't really matter here. ↩
A so-called "universally unique identifier" (UUID) is a 128-bit number, usually formatted as a string of 32 hex digits separated into blocks of 8, 4, 4, 4, and 12 digits by hyphens—for example 12345678-1234-1234-1234-123456789abc. They are also known as globally unique identifiers (GUIDs) and are usually generated randomly, sometimes basing some bits on device and time to reduce the probability of collisions. Although fundamentally numeric in nature, it is fairly common for them to be stored and manipulated as strings. Wikipedia entry. ↩

Site News: Glossary; Table of Contents; Feeds

2015-11-23T11:40:00+00:00

The site now has a glossary, and also a table of contents, both linked from the side panel (which is at the top on mobile). The plan, obviously, is to keep these up-to-date as we discuss more topics. The table of contents is similar to the archives link at the top, but is chronological, rather than reverse-chronological, and has a short description of each article.

While writing the glossary, we decided that, in addition to the two classes of errors we discussed in Why Test-Driven Data Analysis—errors of implementation and errors of interpretation—we should probably break out a third category, namely errors of process. The first of the "interpretation" questions we listed was "Is the input data correct?". Presenting incorrect data to an analytical process certainly seems more like an error of process than an error of interpretation (though as we will discuss in one of the next posts, arguably the process should detect at least some kinds of input errors). We will certainly discuss other examples of process errors in future posts. We'll probably update the Why... post with at least with a footnote describing new interpretation.

We were also informed that some of the links to the RSS and Atom feeds were broken, even though the feeds themselves were OK. Apologies for this. As far as we can tell, they're all OK now. Please let us know if you try them and find they're not OK, or indeed if you find any other problems or errors.

Infinite Gain: The First Test

2015-11-16T19:00:00+00:00

The first idea we want to appropriate from test-driven development is that of regression testing, and our specific analytical variant of this, the idea of a reference test.

We propose a "zeroth level" of test-driven data analysis as recording one or more specific sets of inputs to an analytical process, together with the corresponding outputs generated, and ensuring that the process can be re-run using those recorded inputs. The first test can then simply be checking that the results remain the same if the analysis is re-run.

In the language of test-driven development, this is a regression test, because it tests that no regressions have occurred, i.e. the results are the same now as previously. It is also a system test, in the sense that it checks the functioning of the whole system (the analytical process), rather than one or more specific subunits, as is the case with unit tests.

In our work with Skyscanner, Stochastic Solutions maintains a number of tests of this type for each of our major analytical processes. They help to ensure that as we make changes to the analysis scripts, and any of the software they depend on, we don't break anything without noticing. We also run them whenever we install new versions on Skyscanner servers, to check that we get identical results on their platforms as on our own development systems. We call these whole-system regression tests reference tests, and run them as part of the special commit process we use each time we update the version number of the software. In fact, our process only allows the version number to be updated if the relevant tests—including the relevant reference tests—pass.

Some practical considerations

Stochastic (Randomized) Analyses

We assume that our analytical process is deterministic. If it involves a random component, we can make it deterministic by fixing the seed (or seeds) used by the random number generators. Any seeds should be treated as input parameters; if the process seeds itself (e.g. from the clock), it is important it writes out the seeds to allow the analysis to be re-run.
Correctness

We also assume that the analyst has performed some level of checking of the results to convince herself that they are correct. In the worst case, this may consist of nothing more than verifying that the program runs to completion and produces output of the expected form that is not glaringly obviously incorrect.

Needless to say, it is vastly preferable if more diligent checking than this has been carried out, but even if the level of initial checking of results is superficial, regression tests deliver value by allowing us to verify the impact of changes to the system. Specifically, they allow us to detect situations in which a result is unexpectedly altered by some modification of the process—direct or indirect—that was thought to be innocuous (see below).
Size / Time

Real analysis input datasets can be large, as can outputs, and complex analyses can take a long time. If the data is "too large" or the run-time excessive, it is quite acceptable (and in various ways advantageous) to cut it down. This should obviously be done with a view to maintaining the richness and variability of the inputs. Indeed, the data can also be changed to include more "corner cases", or, for example, to anonymize it, if it is sensitive.

The main reason we are not specifically advocating cutting down the data is that we want to make the overhead of implementing a reference test as low as possible.
Feeds

If the analytical process directly connects to some dynamic data feed, it will be desirable (and possibly necessary) to replace that feed with a static input source, usually consisting of a snapshot of the input data. Obviously, in some circumstances, this might be onerous, though in our experience it is usually not very hard.
Time-dependent analysis.

Another factor that can cause analysis of fixed input data, with a fixed analytical process, to produce different results is explicit or implicit time-dependence in the analysis. For example, the analysis might convert an input that is a date stamp to something like "number of whole days before today", or the start of the current month. Obviously, such transformations produce different results when run on different days. As with seeds, if there are such transformations in the analysis code, they need to handled. To cope with this sort of situation, we typically look up any reference values such as today early in the analytical process, and allow optional override parameters to be provided. Thus, in ordinary use we might run an analysis script by saying:
```
  python analysis_AAA.py
```
but in testing replace this by something like
```
  AAA_TODAY="2015/11/01" python analysis_AAA.py
```
to set the environment variable AAA_TODAY to an override value, or with a command such as
```
 python analysis_AAA.py -d 2015/11/01
```
to pass in the date as a command-line option to our script.
Numerical Precision.

Computers are basically deterministic, and, regardless of what numerical accuracy they achieve, if they are asked to perform the same operations, on the same inputs, in the same order, they will normally produce identical results every time. Thus even if our outputs are floating-point values, there is no intrinsic problem with testing them for exact equality. The only thing we really need to be careful about is that we don't perform an equality test between a rounded output value and an floating-point value held internally without rounding (or, more accurately, held as an IEEE floating point value, rather than a decimal value of given precision). In practice, when comparing floating-point values, we either need to compare formatted string output, rounded in some fixed manner, or compare to values to some fixed level of precision. In most cases, the level of precision will not matter very much, though in particular domains we may want to exercise more care in choosing this.

To make this distinction clear, look at the following Python code:
```
  $ python
  Python 2.7.10 (default, Jul 14 2015, 19:46:27)
  [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> from __future__ import division
  >>> a = 1/3
  >>> b = 1/3
  >>> print a
  0.333333333333
  >>> a == 0.333333333333
  False
  >>> a == b
  True
  >>> round(a, 12) == round(0.333333333333, 12)
  True
  >>> str(a) == '0.333333333333'
  True
  >>> '%.12f' % a == '0.333333333333'
  True
```
In this code fragment,
- The first line tells Python to return floating-point values from integer division (always a good idea).
- The next two lines just assign a and b each to be a third.
- The following line confirms the result of this is, as we'd expect 0.3333... But, crucially, this value is not exact. If we print it to 60 decimal places, we see:
```
>>> print "%.60f" % a
0.333333333333333314829616256247390992939472198486328125000000
```
- Unsurprisingly, therefore, when in the next statement we ask Python whether a is equal to 0.333333333333, the result is False.
- After this, as expected, we confirm that a == b is True.
- We then confirm that if we round a to 12 decimal places, the result is exactly round(0.333333333333, 12). Do we need the round on the right-hand side? Probably not, but be aware that 0.333333333333 is not a value that can be stored exactly in binary, so:
```
>>> print '%.60f' % 0.333333333333
0.333333333333000025877623784253955818712711334228515625000000
```
  It's probably, therefore, both clearer to round both sides, or to use string comparisons.
- Finally, we perform two string comparisons. The first relies on Python's default string formatting rules, and the second is more explicit.
NOTE: When it comes to actually writing tests, Python's unittest module includes an assertAlmostEqual method, that takes a number of decimal places, so if a function f(x) is expected to return the result 1/3 when x = 1, the usual way to test this to 12dp is with the following code fragment:
```
  def testOneThird(self):
      self.assertAlmostEqual(f(1), 0.333333333333, 12)
```
Parallel Processing.

Another factor that can cause differences in results is parallel execution, which can often result in subtle changes of detailed sequence of operations carried out. A simple example would be a task farm in which each of a number of workers calculates a result. If those results are then summed by the controller process in the order they are returned, rather than in a predefined sequence, numerical rounding errors may result in different answers. Thus, more care has to be taken in these sorts of cases.
Variable output.

A final implementation detail is that we sometimes have to be careful about simply comparing output logs, graph files etc. It is very common for output to include things that may vary from run-to-run, such as timestamps, version information or sequence numbers (run 1, run 2...) In these cases, the comparison process needs to make suitable affordances. We will discuss some methods for handling this in a future article.

Reasons a Regression Test Might Fail

Changes to the system not intended to change the result, but sometimes doing so, can take many forms. For example:

We might extend our analysis code to accommodate some variation in the input data handled.
We might add an extra parameter or code path to allow some variation in the analysis performed.
We might upgrade some software, e.g. the operating system, libraries, the analysis software or the environment in which the software runs.
We might upgrade the hardware (e.g. adding memory, processing capacity or GPUs), potentially causing different code paths to be followed.
We might run the analysis on a different machine.
We might change the way in which the input data is stored, retrieved or presented to the software.
Hardware and software can develop faults, and data corruption can and does occur.

The Law of Software Regressions

Experience shows that regression tests are a very powerful tool for identifying unexpected changes, and that such changes occur more often than anyone expects. In fact writing this reminds me of the self-referential law¹ proposed by Doug Hofstadter:

Hofstadter's Law:

It always takes longer than you expect, even when you take into account Hofstadter's Law.

— Gödel, Esher Bach: An Eternal Golden Braid, Douglas R. Hofstadter.

In a similar vein, we might coin a Law of Software Regressions:

The Law of Software Regressions:

Software regressions happen more often than expected, even when you take into account the Law of Software Regressions.

Douglas R. Hofstadter, Gödel, Esher Bach: An Eternal Golden Braid, p. 152. Penguin Books (Harmondsworth) 1980. ↩

How is this Misleading Data Misleading Me?

2015-11-13T17:15:00+00:00

"Why is this lying bastard lying to me?"

Louis Heren,¹ often attributed to Jeremy Paxman.

In a previous post, we made a distinction between two kinds of errors—implementation errors and errors of interpretation. I want to amplify that today, focusing specifically on interpretation.

The most important question to keep in mind at all times is not whether the analysis is computing the thing we wanted it to compute, but rather whether the result we have produced means what we think it means. The distinction is crucial.

As a simple example, let's suppose we specify the goal of our analysis as calculating the mean of a set of numbers. We can test that by adding them up and dividing by the number of items. But if we think the goal is to characterize a typical transaction size, we have to ask whether the arithmetic mean is the right metric for understanding that. As we move more towards a business or conceptual goal, rather than a mathematical or algorithmic formulation of a calculation, we have more complex and nuanced considerations, such as:

Do we believe the inputs are correct?
Is our chosen metric capable of addressing our underlying need (in this case, determining a typical transaction size)?
How do we handle nulls (missing values)?
Will outliers (perhaps extremely large values) or invalid inputs (perhaps negative values) invalidate the calculation?
If the values have dimensionality,² do all of the values have the same dimensionality, and in the same units (e.g. all money and all in pounds sterling, or all distances and all measured in miles).
For that matter, are the inputs even commensurate, i.e. do they quantify sufficiently similar things that calculating their mean is even meaningful?

Paxman/Heren's constant question quoted above—Why is this lying bastard lying to me?—will serve as an excellent question to keep in mind every time we view an analytical result, perhaps recast as how is this misleading data misleading me? There is a great temptation to believe beautifully formatted, painstakingly calculated results produced by the almost unfathomable power of modern computers. In fact, there is much to be said for thinking of the combination of data and processing as an adversary constantly trying to fool you into drawing false conclusions.

The questions of implementation are concerned with checking that the data received as input to the analytical process has been faithfully transmitted from the source systems, and that the calculations and manipulations performed in the analysis correctly implement the algorithms we intended to use. In contrast, as we outlined previously, the questions of interpretation emphasize that we need to be ever vigilent, asking ourselves:

Is the input data correct?
Is our interpretation of the input data correct?
Are the algorithms we are applying to the data meaningful and appropriate?
Is our interpretation of the results we produce correct?
Are the results plausible?
What am I missing?
How is this misleading data misleading me?

This quote is usually attributed to Jeremy Paxman, as noted in The Guardian article Paxman answers the questions https://www.theguardian.com/media/2005/jan/31/mondaymediasection.politicsandthemedia of 31st January 2005. According to the article, however, the true origin is a former deputy editor of the Times, Louis Heren, in his memoirs, with the full quote being "When a politician tells you something in confidence, always ask yourself: 'Why is this lying bastard lying to me?'" Still other reports, however, say Heren himself, was merely quoting advice he was given. Melvin J. Lasky writes in Profanity, Obscenity and the Media, Transaction Publishers (New Brunswick) 2005:

"Find out why the lying bastards are lying!" This is the famous phrase of an editor of the Times, Louis Heren, who received it as "advice given him early in his career by ... a correspondent of the Daily Worker [the Communist daily in London]: 'Always ask yourself why these lying bastards are lying to you.'"

↩
Here, we use dimensionality in the sense of Dimensional Analysis, which allows us to make inferences about the results of calculations based on classifying the inputs by category. For example, we would distinguish lengths, from times from quantities of money and so forth. We would also treat separately dimensionless quantities, such as counts or ratios of quantitities of the same dimension (e.g. a ratio of two lengths lengths). ↩

Test-Driven Development: A Review

2015-11-09T10:30:00+00:00

Since a key motivation for developing test-driven data analysis (TDDA) has been test-driven development (TDD), we need to conduct a lightning tour of TDD before outlining how we see TDDA developing. If you are already familiar with test-driven development, this may not contain too much that is new for you, though we will present it with half an eye to the repurposing of it that we plan as we move towards test-driven data analysis.

Test-driven development (TDD) has gained notable popularity as an approach to software engineering, both in its own right and as a key component of the Agile development methodology. Its benefits, as articulated by its adherents, include higher software quality, greater development speed, improved flexibility during development (i.e., more ability to adjust course during development), earlier detection of bugs and regressions¹ and an increased ability to restructure ("refactor") code.

The Core Idea of Test-Driven Development

Automation + specification + verification + refactoring

The central idea in test-driven development is that of using a comprehensive suite of automated tests to specify the desired behaviour of a program and to verify that it is working correctly. The goal is to have enough, sufficiently detailed tests to ensure that when they all pass we feel genuine confidence that the system is functioning correctly.

The canonical test-driven approach to software development consists of the following stages:

First, a suite of tests is written specifying the correct behaviour of a software system. As a trivial example, if we are implementing a function, f, to compute the sum of two inputs, a and b, we might specify a set of correct input-output pairs. In TDD, we structure our tests as a series of assertions, each of which is a statement that must be satisfied in order for the test to pass. In this case, some possible assertions, expressed in pseudo-code, would be:
```
assert f( 0,  0)  =  0
assert f( 1,  7)  =  8
assert f(-2, 17)  = 15
assert f(-3, +3)  =  0
```
Importantly, the tests should also, in general, check and specify the generation of errors and the handling of so-called edge cases. Edge cases are atypical but valid cases, which might include extreme input values, handling of null values and handling of empty datasets. For example:
```
assert f("a", 7) --> TypeError
assert f(MAX_FLOAT, MAX_FLOAT) = Infinity
```

NOTE This is not a comprehensive set of tests for f. We'll talk more about what might be considered adequate for this function in later posts. The purpose of this example is simply to show the general structure of typical tests.

An important aspect of testing frameworks is that they allow tests to take the form of executable code that can be run even before the functionality under test has been written. At this stage, since we have not even defined f, we expect the tests not to pass, but to produce errors such as "No such function: f". Once a minimal definition for f has been provided, such as one that always returns 0, or that returns no result, the errors should turn into failures, i.e. assertions that are not true.
When we have a suite of failing tests, software is written with the goal of making all the tests pass.
Once all the tests pass, TDD methodology dictates that coding should stop because if the test suite is adequate (and free of errors) we have now demonstrated that the software is complete and correct. Part of the TDD philosophy is that if more functionality is required, one or more further tests should be written to specify and demonstrate the need for more (or different) code.
There is one more important stage in test-driven development, namely refactoring. This is the process of restructuring, simplifying or otherwise improving code while maintaining its functionality (i.e., keeping the tests passing). It is widely accepted that complexity is one of the biggest problems in software, and simplifying code as soon as the tests pass allows us to attempt to reduce complexity as early as possible. It is a recognition of the fact that the first successful implementation of some feature will typically not be the most direct and straightforward.

The philosophy of writing tests before the code they are designed to validate leads some to suggest that the second "D" in TDD (development) should really stand for design (e.g. Allen Houlob³). This idea grows out of the observation that with TDD, testing is moved from its traditional place towards the end of the development cycle to a much earlier and more prominent position where specification and design would traditionally occur.

TDD advocates tend to argue for making tests very quick to run (preferably mere seconds for the entire suite) so that there is no impediment to running them frequently during development, not just between each code commit,⁴ but multiple times during the development of each function.

Another important idea is that of regression testing. As noted previously a regression is a defect that is introduced by a modification to the software. A natural consequence of maintaining and using a comprehensive suite of tests is that when such a regressions occur, they should be detected almost immediately. When a bug does slip through without triggering a test failure, the TDD philosophy dictates that before it is fixed, one or more failing tests should be added to demonstrate the incorrect behaviour. By definition, when the bug is fixed, these new tests will pass unless they themselves contain errors.

Common Variations, Flavours and Implementations

A distinction is often made between unit tests and system tests (also known as integration tests). Unit tests are supposed to test low-level software units (such individual functions, methods or classes). There is often a particular focus on these low-level unit tests, partly because these can often be made to run very quickly, and partly (I think) because there is an implicit belief or assumption that if each individual component is well tested, the whole system built out of those components is likely to be reliable. (Personally, I think this is a poor assumption.)

In contrast, system tests and integration tests exercise many parts of the system, often completing larger, more realistic tasks, and more often interfacing with external systems. Such tests are often slower and it can be hard to avoid their having side effects (such as updating entries in databases).

The distinction, however, between the different levels is somewhat subjective, and some organizations give more equal or greater weight to higher level tests. This will be an interesting issue as we consider how to move towards test-driven data analysis.

Another practice popular within some TDD schools is that of mocking. The general idea of mocking is to replace some functionality (such as a database lookup, a URL fetch, a disk write, a trigger event or a function call) with a simpler function call or a static value. This is done for two main reasons. First, if the mocked functionality is expensive, or has side effects, test code can often be made much faster and side-effect free if its execution is bypassed. Secondly, mocking allows a test to focus on the correctness of a particular aspect of functionality, without any dependence on the external part of the system being mocked out.

Other TDD practitioners are less keen on mocking, feeling that it leads to less complete and less realistic testing, and raises the risk of missing some kinds of defects. (Those who favour mocking also tend to place a strong emphasis on unit testing, and to argue that more expensive, non-mocked tests should form part of integration testing, rather than part of the more frequently run core unit test suite.)

While no special software is strictly required in order to follow a broadly test-driven approach to development, good tools are extremely helpful. There are standard libraries that support of this for most mainstream programming languages. The xUnit family of test software (e.g. CUnit for C, jUnit for Java, unittest for Python), uses a common architecture designed by Kent Beck.² It is worth noting that the rUnit package is such a system for use with the popular data analysis package R.

Example

As an example, the following Python code tests a function f, as described above, using Python's unittest module. Even if you are completely unfamilar with Python, you will be able to see the six crucial lines that implement exactly the six tests described in pseudo-code above, in this case through four separate test methods.

import sys
import unittest


def f(a, b):
    return a + b


class TestAddFunction(unittest.TestCase):
    def testNonNegatives(self):
        self.assertEqual(f(0, 0), 0)
        self.assertEqual(f(1, 7), 8)

    def testNegatives(self):
        self.assertEqual(f(-2, 17), 15)
        self.assertEqual(f(-3, +3), 0)

    def testStringInput(self):
        self.assertRaises(TypeError, f, "a", 7)

    def testOverflow(self):
        self.assertEqual(f(sys.float_info.max, sys.float_info.max),
                         float('inf'))

if __name__ == '__main__':
    unittest.main()

If this code is run, including the function definition for f, the output is as follows:

$ python add_function.py
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK

Here, each dot signifies a passing test.

However, if this is run without defining f, the result is the following output:

$ python add_function.py
EEEE
======================================================================
ERROR: testNegatives (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "add_function.py", line 13, in testNegatives
    self.assertEqual(f(-2, 17), 15)
NameError: global name 'f' is not defined

======================================================================
ERROR: testNonNegatives (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "add_function.py", line 9, in testNonNegatives
    self.assertEqual(f(0, 0), 0)
NameError: global name 'f' is not defined

======================================================================
ERROR: testOverflow (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "add_function.py", line 20, in testOverflow
    self.assertEqual(f(sys.float_info.max, sys.float_info.max),
NameError: global name 'f' is not defined

======================================================================
ERROR: testStringInput (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "add_function.py", line 17, in testStringInput
    self.assertRaises(TypeError, f, "a", 7)
NameError: global name 'f' is not defined

----------------------------------------------------------------------
Ran 4 tests in 0.000s

FAILED (errors=4)

Here the four E's at the top of the output represent errors when running the tests. If a dummy definition of f is provided, such as:

def f(a, b):
    return 0

the tests will fail, producing F, rather than raising the errors that result in E's.

Benefits of Test-Driven Development

Correctness. The most obvious reason to adopt test-driven development is the pursuit of higher software quality. TDD proponents certainly feel that there is considerable benefit to maintaining a broad and rich set of tests that can be run automatically. There is rather more debate about how important it is to write the tests strictly before the code it is designed to test. I would say that to qualify as test-driven development, the tests should be produced no later than immediately after each piece of functionality is implemented, but purists would take a stricter view.

Regression detection. The second benefit of TDD is in the detection of regressions, i.e. failures of code in areas that previously ran successfully. In practice, regression testing is even more powerful than it sounds because not only can many different failure modes be detected by a single test, but experience shows that there are often areas of code that are susceptible to similar breakages from many different causes and disturbances. (This can be seen as a rare case of combinatorial explosion working to our advantage: there are many ways to get code wrong, and far fewer to get it right, so a single test can catch many different potential failures.)

Specification, Design and Documentation. One of the stronger reasons for writing tests before the functions they are designed to verify is that the test code then forms a concrete specification. In order even to write the test, a certain degree of clarity has to be brought to the question of precisely what the function that is being written is supposed to do. This is the key insight that leads towards the idea of TDD as test-driven design over test-driven development. A useful side effect of the test suite is that it also forms a precise and practical form of documentation as to exactly how the code can be used successfully, and one that, by definition, has to be kept up to date—a perenial problem for documentation.

Refactoring. The benefits listed so far are relatively unsurprising. The fourth is more profound. In many software projects, particularly large and complex ones, once the software is deemed to be working acceptably well, some areas of the code come to be regarded as too dangerous to modify, even when problems are discovered. Developers (and managers) who know how much pain and effort was required to make something work (or more-or-less work) become fearful that the risks associated with fixing or upgrading code are simply too high. In this way, code becomes brittle and neglected and thus essentially unmaintainable.

In my view, the single biggest benefit of test-driven development is that it goes a long way to eliminating this syndrome, allowing us to re-write, simplify and extend code safely, confident in the knowledge that if the tests continue to function, it is unlikely that anything very bad has happened to the code. The recommended practice of refactoring code as soon as the tests pass is one aspect of this, but the larger benefit of maintaining comprehensive set of tests is that such refactoring can be performed at any time.

These are just the most important and widely recognized benefits of TDD. Additional benefits include the ability to check that code is working correctly on new machines or systems, or in any other new context, providing a useful baseline of performance (if timed and recorded) and providing an extremely powerful resource if code needs to be ported or reimplemented.

A software regression is a bug in a later version of software that was not present in a previous version of the software. It contrasts with bugs that may always have been present but were not detected. ↩
Kent Beck, Test-Driven Development, Addison Wesley (Vaseem) 2003. ↩
Allen Houlob, Test-Driven Design, Dr. Dobbs Journal, May 5th 2014. https://www.drdobbs.com/architecture-and-design/test-driven-design/240168102. ↩
Most non-trivial software development uses a so-called revision control system to provide a comprehensive history of versions of the code. Developers normally run code frequently, and typically commit changes to the revision-controlled repository somewhat less frequently (though still, perhaps, many times a day). With TDD, the tests form an integral part of the code base, and it is common good practice to require that code is only committed when the tests pass. Sometimes this requirement is merely a rule or convention, while in other cases systems are set up in such a way as to enable code to be committed only when all of its associated tests pass. ↩

Why Test-Driven Data Analysis?

2015-11-05T08:42:00+00:00

OK, everything you need to know about TeX has been explained—unless you happen to be fallible. If you don't plan to make any errors, don't bother to read this chapter.

— The TeXbook, Chapter 27, Recovery from Errors. Donald E. Knuth.¹

The concept of test-driven data analysis seeks to improve the answers to two sets of questions, which are defined with reference to an "analytical process".

The questions assume that you have used the analytical process at least once, with one or more specific collections of inputs, and that you are ready to use, share, deliver or simply believe the results.

The questions in the first group concern the implementation of your analytical process:

Implementation Questions

How confident are you that the outputs produced by the analytical process, with the input data you have used, are correct?
How confident are you that the outputs would be the same if the analytical process were repeated using the same input data?
Does your answer change if you repeat the process using different hardware, or after upgrading the operating system or other software?
Would the analytical process generate any warning or error if its results were different from when you first ran it and satisfied yourself with the results?
If the analytical process relies on any reference data, how confident are you that you would know if that reference data changed or became corrupted?
If the analytical process were run with different input data, how confident are you that the output would be correct on that data?
If corrupt or invalid input data were used, how confident are you that the process would detect this and raise an appropriate warning, error or failure?
Would someone else be able reliably to produce the same results as you from the same inputs, given detailed instructions and access?
Corollary: do such detailed instructions exist? If you were knocked down by the proverbial bus, how easily could someone else use the analytical process?
If someone developed an equivalent analytical process, and their results were different, how confident are you that yours would prove to be correct?

These questions are broadly similar to the questions addressed by test-driven development, set in the specific context of data analysis.

The questions in our second group are concerned with the meaning of the analysis, and a larger, more important sense of correctness:

** Interpretation Questions**

Is the input data² correct?³
Is your interpretation of the input data correct?
Are the algorithms you are applying to the data meaningful and appropriate?
Are the results plausible?
Is your interpretation of the results correct?
More generally, what are you missing?

These questions are less clear cut than the implementation questions, but are at least as important, and in some ways are more important. If the implementation questions are about producing the right answers, the interpretation questions are about asking the right questions, and understanding the answers.

Over the coming posts, we will seek to shape a coherent methodology and set of tools to help us provide better answers to both sets of questions—implementational and interpretational. If we succeed, the result should be something worthy of the name test-driven data analysis.

Donald E. Knuth, The TeXbook, Chapter 27, Recovery from Errors. Addison Wesley (Reading Mass) 1984. ↩
I am aware that, classically, data is the plural of datum, and that purists would prefer my question to be phrased as "Are the data correct?" If the use of 'data' in the singular offends your sensibilities, I apologise. ↩
When adding Error of Implementation and Error of Interpretation to the glossary, we decided that this first question really pertained to a third category of error, namely an Error of Process. ↩

Test-Driven Data Analysis

2015-11-05T08:30:00+00:00

A dozen or so years ago I stumbled across the idea of test-driven development from reading various posts by Tim Bray on his Ongoing blog. It was obvious that this was a significant idea, and I adopted it immediately. It has since become an integral part of the software development processes at Stochastic Solutions, where we develop our own analytical software (Miró and the Artists Suite) and custom solutions for clients. But software development is only part of what we do at the company: the larger part of our work consists of actually doing data analysis for clients. This has a rather different dynamic.

Fast forward to 2012, and a conversation with my long-term collaborator and friend, Patrick Surry, during which he said something to the effect of:

So what about test-driven data analysis?

— Patrick Surry, c. 2012

The phrase resonated instantly, but neither of us entirely knew what it meant. It has lurked in my brain ever since, a kind of proto-meme, attempting to inspire and attach itself to a concept worthy of the name.

For the last fifteen months, my colleagues—Sam Rhynas and Simon Brown—and I have been feeling our way towards an answer to the question

What is test-driven data analysis?

We haven't yet pulled all the pieces together into coherent methodology, but we have assembled a set of useful practices, tools and processes that feel as if they are part of the answer.

A few weeks ago, my friend and ex-colleague Greg Wilson was in town for Edinburgh Parallel Computing Centre's twenty-fifth birthday bash. Greg is a computer scientist and former lecturer from University of Toronto. He now spends most of his time teaching scientists key ideas from software engineering through his Software Carpentry organization. He lamented that while he has no trouble persuading scientists of the benefits of adopting ideas such as version control, he finds them almost completely unreceptive when he champions software testing. I was initially rather shocked by this, since I routinely say that test-driven development is the most significant idea in software in the last thirty or forty years. Thinking about it more, however, I suspect the reasons for the resistance Greg encounters are similar to the reasons we have found it harder than we expected to take mainstream ideas from test-driven development and apply them in the rather specialized area of data analysis. Testing scientific code is more like testing analysis processes than it is like testing software per se.

As I reflected further on what Greg had said, I experienced a moment of clarity. The new insight it that while we have a lot useful components for test-driven data analysis, including some useful fragments of a methodology, we really don't have appropriate tools: the xUnit frameworks and their ilk are excellent for test-driven development, but don't provide specific support for the patterns we tend to need in analysis, and address only a subset of the issues we should want test-driven data analysis to cover.

The purpose of this new blog is to think out loud as we—in partnership with one of our key clients, Skyscanner—try to develop tools and methodologies to form coherent framework and support system for a more systematic approach to data science—a test-driven approach to data analysis.

So watch this space.

If you want to subscribe, this site has RSS and ATOM feeds, and also offers email subscriptions.¹ We'll be tweeting on @tdda0 whenever there are new posts. Twitter is also probably the best to send feedback, since we haven't plumbed in comments at this time: we'd love to hear what you think.

through MailChimp; thanks, MailChimp! ↩