Test-Driven Data Analysis

2015-11-05: Test-Driven Data Analysis Motivation for test-driven data analysis

2015-11-05: Why Test-Driven Data Analysis? Two sets of questions that inspire test-driven data analysis. The first set focuses on correctness of implementation ("getting the right answers"). The second set focuses on correctness of interpretation ("asking the right questions").

2015-11-09: Test-Driven Development: A Review A brief overview of test-driven development and its benefits.

2015-11-13: How is this Misleading Data Misleading Me? More on errors of interpretation.

2015-11-16: Infinite Gain: The First Test Regression tests for TDDA: Introducing reference tests and a Law of Software Regressions.

2015-11-23: Site News: Glossary; Table of Contents; Feeds A few site changes and plans

2015-11-26: Constraints and Assertions Automatic generation and verification of constraints on datasets.

2015-12-11: Overview of TDDA in Predictive Analytics Times

2015-12-14: Generalized Overfitting: Errors of Applicability On the many ways analytical processes can overfit data even when they do not involve predictive modelling. Topics include concrete and abstract specification.

2016-01-06: How far in advance are flights cheapest? An error of interpretation On the important-but-subtle difference between the questions "How far in advance is any given ticket cheapest?" and "How far in advance is the average price of tickets sold that day lowest?"

2016-02-15: Lessons Learned: Bad Data and other SNAFUs

2016-04-15: In Defence of XML: Exporting and Analysing Apple Health Data Extracting CSV files from the export.xml file written by the Apple Health app on iOS.

2016-04-18: First Test On writing a first "reference" test

2016-04-19: Unit Tests On adding some unit tests

2016-04-20: Extracting More Apple Health Data On extending the iOS Apple Health App data extractor

2016-09-17: Slides and Rough Transcript of TDDA talk from PyCon UK 2016 Test-Driven Data Analysis Talk (slides and transcript) from

2016-09-18: WritableTestCase: Example Use Example of how to use writabletestcase.WritableTestCase

2016-11-03: Constraint Discovery and Verification for Pandas DataFrames Introducing the TDDA constraints library with Pandas bindings.

2016-11-04: The TDDA Constraints File Format The .tdda Constraints File Format

2016-11-11: Introducing Rexpy: Automatic Discovery of Regular Expressions Regular expressions are powerful pattern-matching rules for strings. They are fast and widely supported but hard to write and harder to read and debug. Rexpy is a library that aims to take the pain out of producing useful, correct regular expressions by finding them automatically from the collection of strings that are to be matched.

2017-01-26: The New ReferenceTest class for TDDA The Python tdda module has been extended with a new ReferenceTest class, which supersedes WritableTestCase and has many more features. The tdda library is also now available using pip from PyPI.

2017-01-31: Coverage information for Rexpy The tdda library's regular-expression discovery functionality has been extended to provide information about how many examples each resulting regular expression matches ("covers"). There are new methods for getting various information about such coverage.

2017-02-10: TDDA 1-pager A 1-page summary of TDDA is available.

2017-02-20: Errors of Interpretation: Bad Graphs with Dual Scales It is a primary responsibility of analysts to present findings and data clearly, in ways that minimize the likelihood of misinterpretation. Graphs should help with this but, all too often, if drawn badly (whether deliberately or through oversight), they can make misinterpretation highly likely.

2017-03-08: An Error of Process Yesterday, email subscribers to the blog, and some RSS/casual viewers, will have seen a half-finished (in fact, abandoned) post that began to try to characterize success and failure on the crowd-funding platform Kickstarter. This post explains what happened and tries to salvage a "teachable moment" out of this minor fiasco.

2017-03-09: Improving Rexpy Rexpy is an open-source Python library and online tool for finding regular expressions from examples. It focuses on regular expressions for structured data (such as those used for things like identifiers, postcodes, URLs and telephone numbers) rather than free text or toy examples. A new release significantly improves the algorithm used for finding regular expressions, often resulting in more precise regular expressions while degrading performance in very few places.

2017-05-04: Quick Reference for TDDA Library A quick-reference guide ("cheat sheet") for the TDDA library is now available.

2017-09-08: GDPR, Consent and Microformats: A Half-Baked Idea The General Data Protection Regulation (GDPR) is coming. This post outlines an idea for a way to make it more workable by using a "microformat" (or similar) to specify consent requests and responses in a simple digital form, on websites and in apps, that would be more precise, consistent and verifiable for all sides.

2017-09-14: Obtaining the Python tdda Library Reference information about how to obtain/install/use the TDDA library

2017-09-21: Constraint Generation in the Presence of Bad Data Relaxing the requirement that datasets used for algorithmic constraint generation contain only good data.

2017-10-06: Automatic Constraint Generation and Verification White Paper Correctness is a key problem at every stage of data science projects: completing an entire analysis without a serious error at some stage is surprisingly hard. Even errors that reverse or completely invalidate the analysis can be hard to detect. Test-Driven Data Analysis (TDDA) attempts to identify, reduce, and aid correction of such errors. A core tool that we use in TDDA is Automatic Constraint Discovery and Verification. The paper linked from this post describes the approach in detail.

2017-11-30: Data Provenance and Data Lineage: the View from the Podcasts Summary of a couple of podcast discussions of data provenance and data lineage

2017-12-12: Our Approach to Data Provenance The last post introduced the idea of data provenance (a.k.a. data lineage), as outlined on a couple of podcasts. This post explains how we approach this issue at Stochastic Solutions, both from a methodological and a software perspective.

2018-05-01: Saving Time Running Subsets of Tests with Tagging When your tests take any non-trivial amount of time to run, as is reasonably common with analytical tests, it is useful to have a convenient way to run a subset of them, or a single test. We have added this capability to unittest-based tests through the new tagging mechanism in the TDDA library.

2018-05-04: Detecting Bad Data and Anomalies with the TDDA Library (Part I) The data verification capabilities of the TDDA library have been extended to allow the identification of individual records failing constraints, optionally with detailed diagnosis of how they fail. There is a new API call and a new command-line primitive, detect, and this starts to allow the TDDA library to be used as a general-purpose anomaly detection system.

2018-05-22: Tagging PyTest Tests We recently introduced the @tag decorator to tag a subset of tests or test classes to be run, but this was available only for unittest. This has now been extended to work under pytest.

2019-02-20: Rexpy for Generating Regular Expressions: Postcodes Rexpy generates regular expressions from examples. This post illustrates how Rexpy can help in a simple case.

2019-10-24: Installation Reference information about how to obtain/install/use the TDDA library

2019-10-25: Screencasts and Exercises Introducing the TDDA exercises

2019-10-28: Reference Testing Exercise 1 (unittest flavour) This describes the unit-test flavoured version of Exercise 1 for reference testing and shows some of the immediate benefits that are available by switching to use tdda.referencetest, including easier diagnosis of failures, easier updating of tests when (correct) results change, and the ability easily to write tests that allow known variations in output while failing when true regressions occur.

2019-10-29: Reference Testing Exercise 1 (pytest flavour) This describes the pytest-flavoured version of Exercise 1 for reference testing and shows some of the immediate benefits that are available by switching to use tdda.referencetest, including easier diagnosis of failures, easier updating of tests when (correct) results change, and the ability easily to write tests that allow known variations in output while failing when true regressions occur.

2019-10-30: Reference Testing Exercise 2 (unittest flavour) This describes the unit-test flavoured version of Exercise 2 for reference testing and shows how TDDA's @tag decorator can easily be used to run only a single test, or a subset of tests.

2019-10-31: Reference Testing Exercise 2 (pytest flavour) This describes the pytest-flavoured version of Exercise 2 for reference testing and shows how TDDA's @tag decorator can easily be used to run only a single test, or a subset of tests.

2020-08-30: Sharing Tests across Implementations by Externalizing Test Data Thinking about separating test data from tests

2021-07-16: Flat Files (a.k.a. CSV files) Towards a flat-file metadata format

2022-02-07: Why Code Rusts Reasons Code Rusts and Tests Start Failing

2022-02-16: One Tiny Bug Fix etc.

2022-02-21: Unix & Linux Survival Guide for Data Science etc.

2022-02-25: Gentest Talk at 2022 Toronto Workshop on Reproducibility Version 2.0 of the TDDA Library has been released, and includes Gentest: "Gentest writes tests, so you don't need to"™. It was launched with a demonstration at Rohan Alexander's 2022 Toronto Workshop on Reproducibility; the video and slides are available.

2023-01-08: Overcast Logged-in iCloud Users: Self-Selection Bias and Customer Stickiness Why Marco is right that his users’ iCloud usage is probably atypical

2023-07-11: TDDA on the Coding for Thought Podcast TDDA on the Coding for Thought Podcast

2023-07-16: TOMLParams: TOML-based parameter files made better TOMLParams is an open-source Python library for managing parameters stored in TOML files. It supports hierarchy, inheritance, default values and type checking, and makes it easy to change the behaviour of software without editing code, which has multiple benefits. It also supports writing out the parameters used, which helps with reproducibility.

2024-03-04: Name Styles Evil-Good-Lawful-Chaotic classification of conventions and allowability of different styles of names as identifiers in data and computing.

2024-06-20: Learning the Hard Way: Regression to the Mean Regression to the Mean is a widespread statistical phenomenon that often trips up analysts. This post describes how I was badly tripped up by it over 25 years ago, and the painful lesson I learned.

2024-07-21: PyData London 2024 TDDA Tutorial The video and slides from the TDDA Tutorial at PyData London 2024.

2024-09-22: An Adware Malware Story Featuring Safari, Notification Centre, and Box Plots I had a kind-of malware incident on the Mac today. This describes it, together with some digression on box plots.

2024-11-14: Jupyter Notebooks Considered Harmful: The Parables of Anne and Beth Jupyter and other Computational Notebooks have spread like wildfire through the data science community. Unfortunately, they seem to encourage a number of problematical workflows. This article outlines some of the problems through the Parables of Anne and Beth.

2024-12-12: Log Graphs and Grokkability Our World in Data produced a very interesting graph of income inequality in seven countries based on data from the Luxembourg Income Study. This post discusses its merits and demerits, attempts to re-draw it using linear scales, and discusses the central importance of grokkability of graphs.

2024-12-17: Best Practices for Notebook Users In a previous post, I discussed some of the challenges, dangers and weaknesses of Jupyter Notebooks, JupyterLab and their ilk. Here, on a more constructive note, I suggest some best practices for using Notebooks productively and safely.

2024-12-23: TDDA and Quality for LLMs Whither LLMs?

2025-06-23: tdda.serial: Metadata for Flat Files (CSV Files) In a previous post, I discussed metadata for flat files (CSV etc.), and promised further posts (nearly four years ago). The tdda.serial format, and developing support for it in the tdda software, is the next stage in this process.

2025-09-02: Test-Driven Document Development Computational documents are those that include the results of computer code and have a mechanism for ensuring that when the code changes, the results also change. They may also include (in the formatted output) the generating code. Familiar examples include Quarto, cog, and RMarkdown. Computational notebooks, such as Jupyter Notebooks, JupyterLab and marimo, though slightly different, are also computational documents. This post briefly describes them, together with a key danger with them, that of co-rusting. We will then describe test-driven document development, which counters this danger by including tests that will fail if the outputs fail.

2026-05-19: TDDA: The Book, the 3.0 Library, and the PyData London 2026 Tutorial The book, Test-Driven Data Analysis, is published today. The 3.0 version of the tdda library and tools was published yesterday. And there'll be a tutorial on TDDA on Friday 5th June 2026 at PyData London.

2026-05-20: Reference Testing with TDDA: A Guide for LLMs and Coding Bots A guide to tdda.referencetest for LLMs and coding agents — written by an LLM, reviewed by LLMs. Reference testing is a regression technique where you capture a program's actual output, verify it once, and thereafter test that nothing has changed. This post covers the library, the correct workflow, and the critical -W trap — using --write-all to make failing tests pass without verifying the new output is correct — which LLMs routinely fall into. Distilled from the book, source code, and documentation.

2026-05-21: CSV Metadata and tdda.serial: A Guide for LLMs and Coding Agents A guide to tdda.serial and CSV metadata for LLMs and coding agents. CSV files lose type information in transit: nulls become strings, integers become floats, dates stay strings, and the Pandas index silently adds a column. Companion metadata files fix this. This post explains what tdda.serial is, when it's worth using, and how to use it — including comparison with CSVW and Frictionless. Distilled from the book, source code, and documentation.

2026-05-22: Data Validation with tdda Constraints: A Guide for LLMs and Coding Agents A guide to tdda constraint discovery and validation for LLMs and coding agents. tdda discovers what "good" looks like in your data, encodes it as constraints in a JSON file, and validates new data against those constraints. The workflow has two phases: development (discover, read, adapt, validate against holdout) and deployment (verify, monitor, refine). Skipping the development phase generates many more false negatives (bad data passes through undetected), and this is the dominant, dangerous failure mode. The library has deliberately few constraint types; the design philosophy is to bring data to the constraints by deriving columns, computing roll-ups, and regularizing measurements. Distilled from the book, source code, and documentation.

2026-05-27: TDDA Book Online Serialization The TDDA Book is being published online for free, with a chapter released each week from 25th May to 14th September 2026.

2026-06-19: PyData London 2026 TDDA Tutorial The video and slides from the TDDA Tutorial at PyData London 2026.