Test-Driven Data Analysis

TDDA: The Book, the 3.0 Library, and the PyData London 2026 Tutorial

Posted on Tue 19 May 2026 in TDDA • Tagged with library, talk, book

This blog has been quite quiet, but there is a great deal of news and it may be less quiet for a while.

The Book

Today, 19th May 2026, sees the world-wide release of Test-Driven Data Analysis, from CRC Press.

It is available from all good booksellers and all sellers of good books, and until 31 Dec 2026 the code 26ESA3 will give a 20% discount from the publisher's site.

The book covers:

the TDDA methodology
- including areas not obviously amenable to software support, such as errors of interpretation, errors of applicability, errors of process, and errors of judgement
the TDDA command-line tools for
- data validation,
- reference-test generation with Gentest (test for code in any language),
- a diff tool for on-disk data frames (as parquet files and flat files)
- tools for working with the tdda.serial format and also with CSVW (CSV on the Web) and Frictionless.
Reference testing with tdda.referencetest under unittest or pytest
Test-Driven Document Development (TDDD)
APIs for all functionality

Resources from the book are available at book.tdda.info, including

22 Checklists
All figures
Glossary
Data Profiles
Data Dictionaries
TDDD tests for the book.

Examples from the book are available from the tdda library by using the tdda command:

tdda examples book

The whole of TDDA is really built around the encapsulation of the data-analysis cycle shown below, and the diagram shows how the book covers these ideas.

The main part of the diagram consists of six circles from
left to right.
The first five circles have failure mode text
under them and an error class below that.
1. CHOOSE APPROACH.
Failure: 'Fail to understand data, problem domain, or methods',
ERROR OF INTERPRETATION (error of formulation).
Ch 13.
2. DEVELOP ANALYTICAL PROCESS.
Failure: 'Mistakes during coding' and the associated
ERROR OF IMPLEMENTATION (bug).
Ch 9-12.
3. RUN ANALYTICAL PROCESS.
Failure: 'Use the software incorrectly'
ERROR OF PROCESS (operator error).
Ch 16.
4. PRODUCE ANALYTICAL RESULTS
Failure 'Mismatch between development data or assumptions
and deployment data'
ERROR OF APPLICABILITY (category error).
Ch 1-7 & 17.
5. INTERPRET ANALYTICAL RESULTS
Failure 'Misinterpret the results'
ERROR OF INTERPRETATION (communication error).
Ch 14 & 15.
6. `First, Do No Harm'.
ERROR OF JUDGEMENT.
Ch 17.
Arrows lead to FAILURE and SUCCESS boxes.
Remedies and book chapters sit underneath the main diagram.

The TDDA Library, Version 3.0

Top Line: Three Machines illustrating
1. constraint discover and data validation: an input hopper takes training
data and produces constraints, or training data + constraints to produce
data validations at the output chute.
2. Rexpy, which takes strings in its input hopper and produces
regular expressions at the output chute,
3. TDDA gentest, which takes code in the input hopper and produces a Python
reference-test script as output.
Bottom Line: 4. tdda diff which compares data in flat files and parquet
files to detect (semantic) differences.
5. tdda.serial, which is a format for describing flat-file formats and
a suite of tools for working with tdda.serial, CSVW, and Frictionless
6. tdda.referencetest, for semantic testing of complex analytical results.

Version 3.0 of the library and command-line tools is a major upgrade.

All the main features have upgrades:

Data validation using constraints, which can be generated from training data.
Inference of regular expressions from example strings.
Automatic generation of tests for almost any non-GUI code in any language (Gentest).
"Gentest writes tests so you don't have to."™
Enhanced test support for complex results in both Python's unittest and in pytest with reference testing.

New features include:

Support for Pandas 3.0, including all three backends (original, numpy_nullable, and pyarrow).
Support for Polars DataFrames in most areas of the library.
Comprehensive Parquet support, replacing feather format.
tdda diff: find and visualize differences between datasets in flat files (like CSV files) and parquet files, with control over specificity and scope.
Flat-file metadata support: the new tdda.serial format allows the format of CSV and other flat files to be described for accurate reading across libraries. This includes inference of flat-file formats, Python code generation, helper functions for reading and writing flat files with metadata, and conversion between tdda.serial, CSVW (CSV on the Web), and Frictionless.
Text utilities for Unicode, including glyph counting and extended normalization forms beyond canonical composition and decomposition (NFC, NFD), and kompatibility normalization (NFKC and NFKD). Form NFTK performs further kompatibility normalization including accent stripping.
Man pages for all commands
Upgraded documentation for command line tools and the API.

PyData London TDDA Tutorial, 5th June 2026, 14:10

I'll be giving a 90-minute hands-on tutorial on TDDA on 5th June 2026 at PyData London. Do come along if you can. PyData is always great, for experts and novices and all levels of technical interest and proficiency. It would be great to see you there.

Get tickets from PyData.

And if you have something to share, prepare a 5-minute Lightning Talk. They are always a highlight of the conference.

Test-Driven Document Development

Posted on Tue 02 September 2025 in TDDA • Tagged with TDDD

Summary

Computational documents attempt to guarantee that results included within them—such as graphs—correspond to the code and data claimed to generate them. They typically achieve this by generating the outputs from the code at the time the document is generated or viewed. This solves significant problems, including those of code rusting (exhibiting changed behaviour) and of unintentional inclusion of stale, incorrect, or unvalidated results. There is, however, a danger of what I term co-rusting, whereby the code and its outputs drift away from correctness (rust) together, without the author realizing. This is likely if the code continues to generate output (i.e., does not crash or report an error).

Computational documents are an important part of reproducible research, within which the main approach to avoiding co-rusting tends to be the use of reproducible environments, which aim to prevent rusting by pinning down as much of the computational environment as possible.

Test-Driven Document Development (TDDD) builds on computational documents by adding automated tests that fail when results change (materially). If these tests are run as part of the build process for the document, the possibily of co-rusting is reduced or eliminated. TDDD can be viewed as the application of test-driven data analysis (TDDA) to the process of document creation, essentially considering the generation of a document as an analytical process that should be supported by reference tests.

The tests can be created by hand, but the Gentest functionality of the tdda tool turns out to be powerful for implementing the tests needed by TDDD, whatever language is used to generate the results.

Background: Computational Documents

Computational documents include one or more results generated by computer code, and provide some guarantee that each result matches its generating code. This is usually achieved by including the code in the document and generating the output either as part of document production (compilation, e.g., Quarto, or in a more limited way, cog) or on-the-fly, for computational notebooks (interpretation, like Jupyter Notebooks / JupyterLab and marimo).

Here is a simple Quarto computational document that calculates the number of potential UK postcodes as defined by a regular expression describing valid ones.¹ This number is quoted in a book I am writing on TDDA. Prior to today, it was pasted into the book by copying the output from an interactive Python session where I calculated it. I probably inserted the thousand separators by hand (another error-prone process). Today I not only changed the number to be included from a calculation when the book is compiled, but also added reference tests to detect if it changes. (source)

---
title: "Quarto Postcodes (inline)"
format:
  html:
    code-fold: true
  pdf:
    toc: false
jupyter: python3
---

```{python}
from letters import nL

RE = r'^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$'

def n_poss_postcodes_for_re():
    """
    Number of strings matching:
      ^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$
    """
    n_postal_areas = nL + nL * nL  # 1 or two letters
    n_postal_districts = 10 + 100  # Any one or two digit number
                                   # 0 and 0x aren't used, but match the regex
    n_subdistricts = nL + 1        # Not all letters are used,
                                   # and only for some London codes,
                                   # but for our regex...
                                   # The +1 is for ones not using a subdistrict

    n_outcodes = n_postal_areas * n_postal_districts * n_subdistricts
    n_incodes = 10 * nL * nL       # Digit then two letters
    n_postcodes = n_outcodes * n_incodes

    return n_postcodes


if __name__ == '__main__':
    n = n_poss_postcodes_for_re()
```
The number of postcode-like strings matching

    `{python} RE`

is `{python} f'{n:,}'`

This document is written in a dialect of Markdown defined by Quarto. It has a header at the top, containing metadata, then a fenced Markdown Python block containing (which defines two variables used later in the document), and some text that uses those two variables (RE and n_formatted) to say how many postcodes match. It has a confected dependency on an another Python file, letters.py defining the number of letters, nL, in English:

nL = 26

It can be compiled with:

    quarto render postcodes1.qmd

producing this page and this document. This rather simple computational document, which shows the code and one important output number that is “guaranteed” to be generated from the code shown. It would be usual to includes graphs or tables of some sort, but this is minimal example so I really wanted only a single number.

The version of the code actually used to generate the number in the book, does not import nL from letters.py, but includes the line nL = 26 in the main program. That's because I'm not trying to make it fail in the book. I've written in this way for the post to give me an easy way to demonstrate co-rusting, which is a entirely real phenomenon. A change in a dependency is a common reason for rusting. (If you do not believe in code rusting or co-rusting, try reading Why Code Rusts; if that doesn't convince you, this article may not be for you.)

Writing Tests For the Code

We will begin by writing tests for essentially the same code, just written as a standalone Python program rather than embedded in a Quarto document.

Here is same code as an actual python script postcodes.py, together with some slightly different behaviour after calling the postcode-counting function.

import json
from letters import nL
from tdda.utils import dict_to_tex_macros

RE = r'^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$'

def n_poss_postcodes_for_re():
    """
    Number of strings matching:
      ^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$
    """
    n_postal_areas = nL + nL * nL  # 1 or two letters
    n_postal_districts = 10 + 100  # Any one or two digit number
                                   # 0 and 0x aren't used, but match the regex
    n_subdistricts = nL + 1        # Not all letters are used,
                                   # and only for some London codes,
                                   # but for our regex...
                                   # The +1 is for ones not using a subdistrict

    n_outcodes = n_postal_areas * n_postal_districts * n_subdistricts
    n_incodes = 10 * nL * nL       # Digit then two letters
    n_postcodes = n_outcodes * n_incodes

    return n_postcodes


if __name__ == '__main__':
    n = n_poss_postcodes_for_re()
    d = {'n': n, 'n_str': f'{n:,}', 'postcodeRE': RE}
    json_path = 'postcodes.json'
    with open(json_path, 'w') as f:
        json.dump(d, f, indent=4)
    dict_to_tex_macros(d, 'postcodes-defs.tex', verbose=False)

If we run this code, it produces no output but writes two files. The first is a JSON file, postcodes.json,)

{
    "n": 434464659200,
    "n_str": "434,464,659,200",
    "postcodeRE": "^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$"
}

We have chosen to write into this the values we might want in the document (in this case, both the number as a number, as the formatted number, as well as the relevant regular expression).

There's a second file, postcodes-defs.tex, which we will use later when we use LaTeX as a TDDD engine. This contains the same values, but now as TeX macros:

\def\n{434464659200}
\def\nStr{434,464,659,200}
\def\postcodeRE{\^[A-Z]\{1,2\}[0-9]\{1,2\}[A-Z]? [0-9][A-Z]\{2\}\$}

If you have the tdda library installed, you have as part of it a tool called Gentest, which can write tests in Python for essentially any command-line program, script, or command, in any language.

The line below instructs Gentest to generate tests for running the Python program postcodes.py.

$ tdda gentest 'python postcodes.py'

This produces the following output:


Running command 'python postcodes.py' to generate output (run 1 of 2).
Saved (empty) output to stdout to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/STDOUT.
Saved (empty) output to stderr to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/STDERR.
Copied $(pwd)/postcodes-defs.tex to $(pwd)/ref/python_postcodes_py/postcodes-defs.tex
Copied $(pwd)/postcodes.json to $(pwd)/ref/python_postcodes_py/postcodes.json

Running command 'python postcodes.py' to generate output (run 2 of 2).
Saved (empty) output to stdout to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/2/STDOUT.
Saved (empty) output to stderr to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/2/STDERR.
Copied $(pwd)/postcodes-defs.tex to $(pwd)/ref/python_postcodes_py/2/postcodes-defs.tex
Copied $(pwd)/postcodes.json to $(pwd)/ref/python_postcodes_py/2/postcodes.json

Test script written as /Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py
Command execution took: 0.44s

SUMMARY:

Directory to run in:        /Users/njr/blogs/tdda-code/tddd-postcodes
Shell command:              python postcodes.py
Test script generated:      /Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py
Reference files:
    $(pwd)/postcodes-defs.tex
    $(pwd)/postcodes.json
Check stdout:               yes (was empty)
Check stderr:               yes (was empty)
Expected exit code:         0
Clobbering permitted:       yes
Number of times script ran: 2
Number of tests written:    6

If you run tdda gentest without specifying a command, you get a wizard, which asks what command to run and also gives you various other options that can alternatively be passed on the command line.

The output is intended to be self explanatory, but to elaborate, what Gentest has done is:

Run the command twice;
Recorded what was printed (both on the normal output stream stdout and also, separately, what was printed on the error output stream stderr;
Taken copies of any files created—in our case case, the .json and .tex files.
Noted the exit code from the program (here 0, indicating successful completion);
Looked to see whether there were any differences between the two runs, and whether anything in the output looked highly dependent on the environment or context. Here nothing did, but if it had Gentest would have generated tests that attempted to factor out things that look as if they might vary from run to run. (Examples include timestamps, run durations, hostnames etc.);
Written a test script, test_python_postcodes_py.py. When run, this executes the command under test and compares its behaviour and outputs to those it collected when generating the tests. The tests only pass if the behaviour and outputs were identical other than anything Gentest decided was not fixed. In this case, there was nothing Gentest thought classes as not fixed.

The code generated is in test_python_postcodes_py.py

If we run this test script, thus:

$ python test_python_postcodes_py.py

we get

......
----------------------------------------------------------------------
Ran 6 tests in 0.439s

OK

which shows that our tests have passed, meaning that the output is unchanged. I'm not going to go through the tests, but by all means look at them.

Simulated Co-Rusting

Let's look at what happens if our code's behaviour changes as a result of rusting. We will simulate this by replacing letters.py with letters52.py, which records the number of upper- and lower-case letters in English.²

    cp letters52.py letters.py

if we do this and run the tests again we get two test failures and some suggested diff commands to run to understand them,

..2 lines are different, starting at line 1
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes-defs.tex /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes-defs.tex

F2 lines are different, starting at line 2
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes.json /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes.json

F..
======================================================================
FAIL: test_postcodes_defs_tex (__main__.Test_PYTHON_POSTCODES.test_postcodes_defs_tex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py", line 52, in test_postcodes_defs_tex
    self.assertTextFileCorrect(os.path.join(self.cwd, 'postcodes-defs.tex'),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               os.path.join(self.refdir, 'postcodes-defs.tex'),
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               encoding='ascii')
                               ^^^^^^^^^^^^^^^^^
AssertionError: False is not true : 2 lines are different, starting at line 1
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes-defs.tex /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes-defs.tex


======================================================================
FAIL: test_postcodes_json (__main__.Test_PYTHON_POSTCODES.test_postcodes_json)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py", line 57, in test_postcodes_json
    self.assertTextFileCorrect(os.path.join(self.cwd, 'postcodes.json'),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               os.path.join(self.refdir, 'postcodes.json'),
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               encoding='ascii')
                               ^^^^^^^^^^^^^^^^^
AssertionError: False is not true : 2 lines are different, starting at line 2
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes.json /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes.json


----------------------------------------------------------------------
Ran 6 tests in 0.443s

FAILED (failures=2)

and if we run the second suggested diff command (on the JSON files), we see:

2,3c2,3
<     "n": 434464659200,
<     "n_str": "434,464,659,200",
---
>     "n": 14094194400,
>     "n_str": "14,094,194,400",

This is showing us that, with the changed dependency, the code is now producing well over 400 million potential postcodes, rather than th 14 million we expected. (The lack of a newline at the end of stdout is not significant, and is ignored by the test.) So as we hoped, the test detected the rusting of our code, and the co-rusting of its output.

The second diff command shows exactly the same differences in the TeX macros written:

1,2c1,2
< \def\n{434464659200}
< \def\nStr{434,464,659,200}
---
> \def\n{14094194400}
> \def\nStr{14,094,194,400}

If we run the Quarto file postcodes1.qmd with the change, there is no obvious problem: the code and the result continue to match, but are now different from what I intended and orginally validated. Here are the html and pdf

A TDDD Version of the Quarto Doc

We can make the Quarto document more robust (and have the benefit of keeping the code in a script, rather than forcing it into the document) by using this Quarto file, postcodes2.qmd.

---
title: "Quarto Postcodes (with inclusion)"
format:
  html:
    code-fold: true
  pdf:
    toc: false
jupyter: python3
---

{{< include _postcodes.py.qmd >}}
```{python}
with open('ref/python_postcodes_py/postcodes.json') as f:
    ref = json.load(f)
assert d == ref
```
The number of postcode-like strings matching

    `{python} ref['postcodeRE']`

is `{python} ref['n_str']`

The include line at the top imports the file _postcodes.py.qmd. This file is just our script, in Quarto Markdown fences, with a underscore filename, which Quarto requires for inclusions for some reason. We construct the file automatically as part of the build process (in the Makefile).

After the inclusion, we read the JSON file that Gentest saved in its reference directory into Python as a dictionary called ref and then, check that thi refernece dictionary is equal to the one we generated when we ran the code as part of the Quarto rendering process. The Makefile runs the tests (outside Quarto) immediately before rendering so if the assertion passes, we actually know two things:

The tests passed when we ran them outside Quarto (showing that the produce the results we previously validated as OK), and
When we ran the same code inside Quarto, its results (or at least, the results in the dictionary) were also the same as the reference results in the test.

The rest of the Quarto document is the same as the first version except that use the results from the dictionary (since those are validated) and choose to use the preformatted string ref['n_str'] rather than formatting it inline. (This makes no difference.)

In this case, and many others, it makes no difference whether we use ref (the results read from the refernece JSON file) or d as the source of our values, because the assertion checked that they were identical. The reason I've used ref is that in some other cases, the we allow non-material differences between the actual and reference results, typically things like datestamps indicating run-time, machine names etc. (If those are different, we need to use a slightly different assertion.) By using the reference results, we ensure that the document does not change each time we compile it if there are no material differences.

Discussion

Look at the JSON and TeX macros
Change the letters to be 52
Show the test failing
Show how to use the script code in Quarto
Do the LaTeX version.

All current valid postcodes match this expression, but many string that match it do not exist and some would probably not be considered valid. ↩
By way of full disclosure, when I actually replaced letters.py with letters52.py and ran the tests they passed, to my dismay. This happened not because of a problem with the tests, but because I created letters52.py and letters26.py by copying letters.py and failed to update the contents of th letters52.py. If you were were to look back in the Git history for the repo, you'd see that. I mention this simply as a further demonstration that all humans are prone to error, which is some of the reason TDDD and TDDA are helpful! Of course, some humans are less errir-prone than others! ↩

tdda.serial: Metadata for Flat Files (CSV Files)

Posted on Mon 23 June 2025 in misc

Almost all data scientists and data engineers have to work with flat files (CSV files) from time to time. Despite their many problems, CSVs are too ubiquitous, too universal, and (whisper it) have too many strengths for them to be likely to disappear. Even if they did, they would quickly be reinvented. The problems with them are widely known and discussed, and will be familar to almost everyone who works with them. They include issues with encodings, types, quoting, nulls, headers, and with dates and times. My favourite summary of them remains Jesse Donat's Falsehoods Programmers Believe about CSVs. I wrote about them on this blog nearly four years ago (Flat Files).

Over the last year or so I've been writing a book on test-driven data analysis. The only remaining chapter without a full draft discusses the same topics as this post—metadata for CSV files and new parts of the TDDA software that assist with its creation and use. This post documents my current thinking, plans and ambitions in this area, and shows some of what is already implemented.¹

A Metadata Format for Flat Files: tdda.serial

The core of the new work is a new format, tdda.serial, for describing data in CSV files.

The previous post showed an example (“XMD”) metadata file used by the Miró software from my company Stochastic Solutions, which was as follows:²

    <?xml version="1.0" encoding="UTF-8"?>
    <dataformat>
        <sep>,</sep>                     <!-- field separator -->
        <null></null>                    <!-- NULL marker -->
        <quoteChar>"</quoteChar>         <!-- Quotation mark -->
        <encoding>UTF-8</encoding>       <!-- any python coding name -->
        <allowApos>True</allowApos>      <!-- allow apostophes in strings -->
        <skipHeader>False</skipHeader>   <!-- ignore the first line of file -->
        <pc>False</pc>                   <!-- Convert 1.2% to 0.012 etc. -->
        <excel>False</excel>             <!-- pad short lines with NULLs -->
        <dateFormat>eurodt</dateFormat>  <!-- Miró date format name -->
        <fields>
            <field extname="mc id" name="ID" type="string"/>
            <field extname="mc nm" name="MachineName" type="int"/>
            <field extname="secs" name="TimeToManufacture" type="real"/>
            <field extname="commission date" name="DateOfCommission"
                   type="date"/>
            <field extname="mc cp" name="Completion Time" type="date"
                   format="rdt"/>
            <field extname="sh dt" name="ShipDate" type="date" format="rd"/>
            <field extname="qa passed?" name="Passed QA" type="bool"/>
        </fields>
        <requireAllFields>False</requireAllFields>
        <banExtraFields>False</banExtraFields>
    </dataformat>

Here is one equivalent way of expressing essentially the same information in the (evolving) tdda.serial format:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "writer": "tdda.serial-2.2.15",
    "tdda.serial": {
        "encoding": "UTF-8",
        "delimiter": "|",
        "quote_char": "\"",
        "escape_char": "\\",
        "stutter_quotes": false,
        "null_indicators": "",
        "accept_percentages_as_floats": false,
        "header_row_count": 1,
        "map_missing_trailing_cols_to_null": false,
        "fields": {
            "mc id": {
                "name": "ID",
                "fieldtype": "int"
            },
            "mc nm": {
                "name": "Name",
                "fieldtype": "string"
            },
            "secs": {
                "name": "TimeToManufacture",
                "fieldtype": "int"
            },
            "commission date": {
                "name": "DateOfCommission",
                "fieldtype": "date",
                "format": "iso8601date"
            },
            "mc cp": {
                "name": "CompletionTime",
                "fieldtype": "datetime",
                "format": "iso8601datetime"
            },
            "sh dt":  {
                "name": "ShipDate",
                "fieldtype": "date",
                "format": "iso8601date"
            },
            "qa passed?": {
                "name": "PassedQA",
                "fieldtype": "bool",
                "true_values": "yes",
                "false_values": "no"
            }
        }
    }
}

The details don't matter too much at this stage, and may yet change, but briefly here we see the file (typically with a .serial extension), describing:

the text encoding used for the data (UTF-8);
the field separator (pipe, |);
the quote character (double quote, ");
the escape character (\), which is used to escape double quotes in double-quoted strings, among other things;
whether quotes are stuttered or escaped within quoted strings;
the string used to denote null values (this can be a single string or a list);
the number of header rows;
an explicit note not to accept percentages in the file as floating-point values;
whether or not lines with too few fields should be regarded as having nulls for the apparently missing fields. (Excel usually does not write values after the last non-empty cell in each row on a worksheet.)
information about individual fields. In this case, a dictionary is used to map names in the flat file to names to be used in the dataset. Numbers can also be used to indicate column position, particularly if there is no header, though they have to be quoted because this is JSON. Field types are also specified, together with any extra information required, e.g. the non-standard true and false values for the boolean field collected? (in the file), which becomes HasBeenCollected once read. Formats for the date and time fields are also specified here.

When the fields are presented as a dictionary, as here, this allows for the possibility that there are other fields in the file, for which metadata is not provided. If a list is used instead, the field list is taken to be complete. In this case, external names can be provided using an csvname attribute, if they are different.

Pretty much everything is optional, and, where appropriate, defaults can be put in the main section and over-ridden on a per-field basis. This is useful if, for example, one or two fields use different null markers from the default, or if multiple date formats are used. (The format key will probably change to dateformat and boolformat to make this overriding work better.)

Here is a simple example of its use with Pandas. Suppose we have the following pipe-separated flat file, with the name machines.psv.

mc id|mc nm|secs|commission date|mc cp|sh dt|qa passed?
1111111|"Machine 1"|86400|2025-06-01|2025-06-07T12:34:56|2025-06-21|yes
2222222|"Machine 2"||2025-06-02|2025-06-08T12:34:57|2025-06-22
3333333|"Machine 3"|86399|2025-06-03|2025-06-09T12:34:55|2025-06-22|no

Then we can use the following Python code to load the data, informed by the metadata in machines.serial (the example shown above).

from tdda.serial import csv_to_pandas

df = csv_to_pandas('machines.psv', 'machines.serial')
print(df, '\n')
df.info()

This produces the following output:

$ python pd-read-machines.py
        ID       Name  TimeToManufacture DateOfCommission      CompletionTime   ShipDate  PassedQA
0  1111111  Machine 1              86400       2025-06-01 2025-06-07 12:34:56 2025-06-21      True
1  2222222  Machine 2               <NA>       2025-06-02 2025-06-08 12:34:57 2025-06-22      <NA>
2  3333333  Machine 3              86399       2025-06-03 2025-06-09 12:34:55 2025-06-22     False

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   ID                 3 non-null      Int64
 1   Name               3 non-null      string
 2   TimeToManufacture  2 non-null      Int64
 3   DateOfCommission   3 non-null      datetime64[ns]
 4   CompletionTime     3 non-null      datetime64[ns]
 5   ShipDate           3 non-null      datetime64[ns]
 6   PassedQA           2 non-null      boolean
dtypes: Int64(2), boolean(1), datetime64[ns](3), string(1)
memory usage: 288.0 bytes

There's nothing particularly special here, but Pandas has read the file correctly using the metadata to understand

the pipe separator;
the date and time formats;
the yes/no format of PassedQA;
the null indicator;
the intended, more usable internal field names;
field types, here defaulting to nullable types.

As with the pandas.read_csv, we can choose whether to prefer nullable types, but the default using tdda.serial is to do so. In this case, the date formats and null indicators would be fine anyway, with Pandas defaults, but here we could instead have specified, say, European dates and ? for nulls.

This code:

from tdda.serial load_metadata, serial_to_pandas_read_csv_args
from rich import print as rprint

md = load_metadata('machines.serial')
kwargs = serial_to_pandas_read_csv_args(md)
rprint(kwargs)

shows the parameters actually passed to pandas.read_csv:

{
    'dtype': {'ID': 'Int64', 'Name': 'string', 'TimeToManufacture': 'Int64', 'PassedQA': 'boolean'},
    'date_format': {'DateOfCommission': 'ISO8601', 'CompletionTime': 'ISO8601', 'ShipDate': 'ISO8601'},
    'parse_dates': ['DateOfCommission', 'CompletionTime', 'ShipDate'],
    'sep': '|',
    'encoding': 'UTF-8',
    'escapechar': '\\',
    'quotechar': '"',
    'doublequote': False,
    'na_values': [''],
    'keep_default_na': False,
    'names': ['ID', 'Name', 'TimeToManufacture', 'DateOfCommission', 'CompletionTime', 'ShipDate', 'PassedQA'],
    'header': 0,
    'true_values': ['yes'],
    'false_values': ['no']
}

We can do the very similar things using Polars (and “soon”, other libraries). Here's a way to read the file with Polars:

from tdda.serial import csv_to_polars

df = csv_to_polars('machines.psv', 'machines.serial',
                   map_other_bools_to_string=True)
print(df)

which produces:

This does mostly the same thing as the Pandas version, but issues two warnings. The first is because an escape character is specified, which the Polars CSV reader doesn't really understand. The second warning is because the Polars CSV reader can't handle non-standard booleans. By default, when these are specified for Polars, tdda.serial will issue a warning but still call polars.read_csv to read the file, because they might not, in fact, be used. The parameter passed in the Python code above (map_other_bools_to_string=True) tells tdda.serial to direct Polars to read this column as a string instead (as it would if we didn't specify a type). Of course, it would be possible to have the reader then go through and turn the strings into booleans after reading, but that feels like more a metadata library should do.

The warnings helpfully tell you what to look out for as possible issues when the file is read. This as an example of a principle I'm trying to use throughout tdda.serial: when there's something in the serial metadata that a given reader might not be able to handle correctly, issue a warning, and possibly provide an option to control that behaviour.

We can do the same thing as we did for Pandas and look at the arguments generated for Polars, using the following, very similar, Python code:

from tdda.serial import (load_metadata, serial_to_polars_read_csv_args)
from rich import print as rprint

md = load_metadata('machines.serial')
kwargs = serial_to_polars_read_csv_args(md, map_other_bools_to_string=True)
rprint(kwargs)

This produces

{
    'separator': '|',
    'quote_char': '"',
    'null_values': [''],
    'encoding': 'UTF-8',
    'schema': {
        'ID': Int64,
        'Name': String,
        'TimeToManufacture': Int64,
        'DateOfCommission': Datetime,
        'CompletionTime': Datetime,
        'ShipDate': Datetime,
        'PassedQA': String
    },
    'new_columns': [
        'ID',
        'Name',
        'TimeToManufacture',
        'DateOfCommission',
        'CompletionTime',
        'ShipDate',
        'PassedQA'
    ]
}

The only subtlety here is that the types in Schema are actual polars types (pl.Int64 etc.) rather than strings, hence their not being quoted. (They're not prefixed because repr(pl.Int64) is the string "Int64", which prints as Int64.) The library can also write a tdda.serial file containing the Polars arguments explicitly. It looks like this:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "writer": "tdda.serial-2.2.15",
    "polars.read_csv": {
        "separator": "|",
        "quote_char": "\"",
        "null_values": [
            ""
        ],
        "encoding": "UTF-8",
        "schema": {
            "ID": "Int64",
            "Name": "String",
            "TimeToManufacture": "Int64",
            "DateOfCommission": "Datetime",
            "CompletionTime": "Datetime",
            "ShipDate": "Datetime",
            "PassedQA": "String"
        },
        "new_columns": [
            "ID",
            "Name",
            "TimeToManufacture",
            "DateOfCommission",
            "CompletionTime",
            "ShipDate",
            "PassedQA"
        ]
    }
}

Here, because we need to serialize the tdda.serial file as JSON, the polars types are mapped to their string names. The tdda library takes care of the conversion in both directions.

A single .serial file can contain multiple flavours of metadata—tdda.serial, polars.read_csv, pandas.read_csv etc. When it does, a call to load_metadata can specify a preferred flavour, or let the library choose. My hope, however, is that in most cases the tdda.serial section will contain enough information to work as well as a library-specific specification.

Goals for tdda.serial

Image showing a circle with tdda.serial in the middle and arrows leading in and out for three formats (CSVW, tdda.serial, and Frictionless), five libraries (DuckDB, Python csv, Pandas, Polars, and Apache Arrow) and Excel. Pandas, CSVW, tdda.serial and Polars are bold for both input and output.

When I went to write down the goals for tdda.serial, I was surprised at how long the list was. Not all of this is implemented but here is the current state of the goals for tdda.serial. (The image above shows the vision for it, with the bold parts mostly implemented, and the rest currently only planned.)

Describe Flat File Formats. Allow accurate representation, full or partial, of flat-file formats used (or potentially used) by one or more concrete flat files. or .
- It primarily targets comma-separated values (.csv) and related formats (tab-separated, pipe-separated etc.), but also potentially other tabular data. It could, for example, be used to describe things like date formats and numeric subtypes for tabular data stored in JSON or JSON Lines.
- Full or partial is important. When reading data, it is often convenient only to specify things that are causing problems. On write, fuller specifications are, of course, desirable.
Read Flat Files. Assist with reading flat files correctly, based on metadata in .serial files and other formats (like CSVW), primarily using data in the "tdda.serial" format.
- Convert metadata currently stored as tdda.serial to dictionaries of arguments for other libraries that work with CSVs.
- Provide an API to get such libraries to read flat-file data correctly, guided by the metadata
- Generate code to get such libraries to read flat-file data correctly, guided by the metadata. Assist with writing flat files in documented formats.
- Interoperate, where possible, with other metadata formats like CSVW and Frictionless.
Generate tdda.serial Metadata Files. Assist with generating metadata describing the format of CSV files based on the write arguments provided to the writing software.
Write Flat Files. Assist with getting libraries to write CSV files using a format specified in a tdda.serial file.
- This provides a second way of increasing interoperability: we can help readers to read from a specific format, and writers to write to that same format.
Assist/Support other Software Reading, Writing, and otherwise handling Flat Files.
- DataFrame Libraries
  - Pandas
  - Polars
  - Apache Arrow
- Databases
  - DuckDB
  - SQLite
  - Postgres
  - MySQL
- Miscellaneous
  - Python csv
  - tdda
Support Library-specific Read/Write Metadata. Provide a mechanism for documenting library-specific read/write parameters for CSV files explicitly:
- For storing the library-specific write parameters used with pandas.to_csv, polars.write_csv in .serial files (and the ability to use such parameters)
- For storing the library-specific read parameters required to read a flat file with high fidelity using, e.g. pandas.read_csv , polars.read_csv etc.
Assist with Format Choice. Provide a mechanism for helping to choose a good CSV format for a concrete dataset to be written, e.g. choosing null indicators that are not likely to be confused with serialized non-null values in the dataset.
SERDE Verification. Provide mechanisms for checking whether a dataset can be round-tripped successfully to a flat file (i.e. that the same library, at least, can write data to a flat file, read it back, and recover identical, equivalent, or similar data).³
CLI Tools. Through the associated command-line tool, tdda diff, and equivalent API functions, to check whether two datasets are equivalent.
- In the case of the command-line tool this is two datasets on disk (flat files, parquet files etc.). It might also be possible to compare two database tables, in the same or different RDBMS instances, or data in a database table and in a file on disk, though this is not yet implemented. (The next post will discuss tdda diff further.)
- In the case of the API, this can also include in-memory data structures such as data frames.
Provide Metadata Format Conversions. Provide mechanisms for converting between different library-specific flat-file parameters and tdda's tdda.serial format, as well as between the tdda.serial format, csvw, and (perhaps) frictionless.
Generate Validation Statistics and Validate using them. (Potentially) write additional data for a concrete dataset that can be used for further validation that it has been read correctly, e.g. summary statistics, checksums etc.

Discussion

The usual observation when proposing something new like this is that the last thing the world needs is another “standard”. As Randall Munro puts it: (https://imgs.xkcd.com/927):

HOW STANDARDS PROLIFERATE: (See A/C chargers, character encodings, instant messaging etc. Cartoon. Panel 1: SITUATION: There are 14 competing standards. Panel 2: (Conversation between two people.) 14? Ridiculous! We need to develop one universal standard that covers everyone's use cases. (Yeah.) Panel 3 (SOON:): SITUATION: There are 15 competing standards.

In this case, however, I don't think there are all that many recognized ways of describing flat-file formats. I was involved in one (the .fdd flat-file description data format) while at Quadstone, and I currently use the XMD format above at Stochastic Solutions, but pretty-much no one else does. While working with a friend, Neil Skilling, he ran across the CSVW standard, developed under the auspices of W3C, and that led to my finding the Python frictionless project. At first I thought one of those might be the solution I was looking for, but in fact they have goals and desgins that are different enough that they don't quite fulfill the most important goals for tdda.serial, as impressive as both projects are. Reluctantly, therefore, I began working on tdda.serial, which aims to interoperate with and support CSVW, (and to some extent, frictionless), but also to handle other cases.

The biggest single difference between the focus of tdda.serial and the CSVW is that tdda.serial is primarily concerned with documenting a format that might be used by many flat files (different concrete datasets sharing the same sttructure and formatting) whereas CSVW is primarily concerned with documenting either a single specific CSV file or a specific collection of CSV files, usually each having different structure. This seems like a rather subtle difference, but in fact turns out to be quite consequential.

Here's the first example CSVW file from csvw.org:

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "tables": [{
    "url": "http://opendata.leeds.gov.uk/downloads/gritting/grit_bins.csv",
    "tableSchema": {
      "columns": [
      {
        "name": "location",
        "datatype": "integer"
      },
      {
        "name": "easting",
        "datatype": "decimal",
        "propertyUrl": "http://data.ordnancesurvey.co.uk/ontology/spatialrelations/easting"
      },
      {
        "name": "northing",
        "datatype": "decimal",
        "propertyUrl": "http://data.ordnancesurvey.co.uk/ontology/spatialrelations/northing"
      }
      ],
      "aboutUrl": "#{location}"
    }
  }],
  "dialect": {
    "header": false
  }
}

Notice that the CSVW file caters for multiple CSV files (a list of tables in the tables element), and that the location of the table is provided as a URL (which is a required element in CSVW). In the context of CSV on the web, this makes complete sense. It's specified as being URL, but can be a file: URL, or a simple path. One convention, fora CSVW file documenting a single dataset, seems to be that the metadata for grit_bins.csv is stored in grit_bins-metadata.json in the same directory as the CSV file itself (locally, or on the web).

What is significant, however, is that this establishes either a one-to-one relationship between CSV files and CSVW metadata files or, if the CSVW file contains metadata about several files, a one-to-one relationship between CSVW files and metadata tables in a CSVW file. Here, for example, is Example 5 from the CSVW Primer:

{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "countries.csv"
  }, {
    "url": "country-groups.csv"
  }, {
    "url": "unemployment.csv"
  }]
}

The metadata “knows” the data file (or data files) that it describes. In contrast, the main concern of tdda.serial is to describe a format and structure that might well be used for many specific (“concrete”) flat files. The relationship is almost reversed as shown here:

Left: The CSVW file above, containing three CSV URLS, having arrows from each filename (URL) to that CSV file, as a named icon. Right: Three csv filesn named machines1.csv, machines2.csv, and machines3.csv, each with arrows to a single tdda.serial file (the one shown above).

Even though the URL (url) is a mandatory parameter in CSVW, there is nothing to prevent us from taking a CSVW file (particularly one describing a single table) and using its metadata to define a format to be used with other flat files. In doing, however, we would clearly be going against the grain of the design of CSVW. As an example of how it then does not quite fit, sometimes we want the metadata to describe exactly the fields in the data, and other times we want it to be a partial specification. In the XMD file, there are explicit parameters to say whether or not extra fields are allowed, and whether all fields are required. In the case of the tdda.serial file, we use a list of fields when we are describing all the fields allowed and required in a flat file, and a dictionary when we are providing information only on a subset, not necessarily in order.⁴ This sort of flexibility is harder in CSVW, which always uses a list to specify the fields. I could propose and use extensions, or try to get extensions added to the standard, but the former seem undesirable, and the latter hard an unlikely. (It does not look as if there have been and revisions to CSVW since 2022.) There are, in fact, many details of CSVW that are problematical for even the first two libaries I've looked at (Pandas and Polars), so unfortunately I think something different is needed.

Library-specific Support in tdda.serial

Another goal for tdda.serial is that it should be useful even for people who are only using a single library—e.g. Pandas. In such cases, there is typically a function or method for writing CSV files (pandas.DataFrame.to_csv), and another for reading them (pandas.read_csv). Both typically have many optional arguments, and in keeping with Postel's Law (the Robustness Principle), they typically have more flexibility in read formats than in write formats. In the case of Pandas, the read function's signature is:

pandas.read_csv(
    filepath_or_buffer, *, sep=<no_default>,
    delimiter=None, header='infer', names=<no_default>, index_col=None,
    usecols=None, dtype=None, engine=None, converters=None,
    true_values=None, false_values=None, skipinitialspace=False,
    skiprows=None, skipfooter=0, nrows=None, na_values=None,
    keep_default_na=True, na_filter=True, verbose=<no_default>,
    skip_blank_lines=True, parse_dates=None,
    infer_datetime_format=<no_default>, keep_date_col=<no_default>,
    date_parser=<no_default>, date_format=None, dayfirst=False,
    cache_dates=True, iterator=False, chunksize=None,
    compression='infer', thousands=None, decimal='.',
    lineterminator=None, quotechar='"', quoting=0, doublequote=True,
    escapechar=None, comment=None, encoding=None,
    encoding_errors='strict', dialect=None, on_bad_lines='error',
    delim_whitespace=<no_default>, low_memory=True, memory_map=False,
    float_precision=None, storage_options=None,
  dtype_backend=<no_default>
)

(49 parameters), while the write method's signature is:

DataFrame.to_csv(
    path_or_buf=None, *, sep=',', na_rep='',
    float_format=None, columns=None, header=True, index=True,
    index_label=None, mode='w', encoding=None, compression='infer',
    quoting=None, quotechar='"', lineterminator=None, chunksize=None,
    date_format=None, doublequote=True, escapechar=None, decimal='.',
    errors='strict', storage_options=None
)

(22 parameters).

The tdda library's command-line tools allow a tdda.serial specification to be converted to parameters for pandas.read_csv, returning them as a dictionary that can be passed in using **kargs. It can also generate python code to do the read using pandas.read_csv or directly perform the read, saving the result to parquet.

Similarly, the library can take a set of arguments for DataFrame.to_csv and create a tdda.serial file describing the format used (or write the data and metadata together).

For a user working with a single library, however, converting to and from tdda.serial's metadata description might be unnecessarily cumbersome and may work imperfectly. This is because different libraries represent data differently, and are based on slighlty different conceptions of CSV files. While I am going to make some effort to allow tdda.serial universal, it is likely that there will always be some cases in which there is a loss of fidelity moving between any specific library's arguments and the .serial representation.

For these reasons, the tdda library also supports directly writing arguments for a given library. That is why the tdda.serial metadata description is one level down inside the tdda.serial file, under a tdda.serial key. It is also possible to have sections for pandas.read_csv, polars.read_csv with exactly the arguments they need.

The functionality used on this post is not in the release version of the tdda library, but is there on a branch called detectreport, so can be accessed if anyone it particulary keen. ↩
In fact, in writing this post, I updated the previous one to use a slightly more sensible example that previously; this is the new, slightly more useful example. ↩
CSV is not a very suitable format for perfect round-tripping of data for reasons including numeric rounding, multiple types for the same data, and equivalent representations such as string and categoricals. Even using a typed format such as parquet, some of these details may change on round-tripping and most software needs a library-specific format in order to achieve perfect fidelity when serializing and deserializing data. ↩
This precise mechanism may change, but it is important for tdda.serial's purpose that is supports both full and partial field schema specification. ↩

TDDA and Quality for LLMs

Posted on Mon 23 December 2024 in misc

It is December 2024 as I write, and large language models (LLMs) are having an extended moment as I have been writing a book on tet-driven data analysis. Several people have suggested that I should write about LLMs or artificial intelligence (AI), a term that for many people now means either LLMs or LLMs and other the other forms of generative AI.

Training Inference

Size Training Data

Inputs

Goal

First do no harm.

Strong AI.

Beliefs. Hallucinations.

Stochastic hypothesis generators.

Rhydwaith

LLMs are neural networks that (loosely) predict the next word.*
Given some text, they predict the next word
You sentences by appending each predicted word to the input and iterating.

Mary had a -> little Mary had a little -> lamb, Mary had a little lamb, -> its Mary had a little lamb, its -> fleece

Mary had a -> seizure Mary had a seizure -> last Mary had a seizure last -> night

LLMs are trained on unimaginably large corpuses of data, mainly from the web.
LLMs have trillions of parameters—knobs that can be set to different values
With any given parameter settings, the LLMs will predict the next word
Some knob settings match the next-word associations better than others
Training an LLMs consists of optimizing the knob settings
(Most of) the parameters (knobs) are called ``weights''.
During training, the current weights are used to predict the next word
- When it is ``wrong'' (differs from the input), the weights are adjusted
- Even when it is ``right'', the weights are usually adjusted
- The raw prediction is not a single word, but probabilities for possible words
- There is always an error, which can always be reduced.
- The adjustments are calculated to try to reduce the errors over time.

Best Practices for Notebook Users

Posted on Tue 17 December 2024 in misc

In a previous post, I discussed some of the dangers of challenges, dangers and weaknesses of Jupyter Notebooks, JupyterLabs and their ilk. I used The Parables of Anne and Beth as a device to illustrate what I think of as good and bad practices for data science. A reasonable criticism of this was that it did not really offer anything to help people who might wish to continue using computational notebooks, but to work in such a way as to limit the harms identified.

Although it probably rings slightly hollow, my goal is absolutely to improve the quality of data science, data analyis, data engineering, and really all data work, and I very much see the attractions and strengths of Jupyter, despite being critical of certain dark patterns I see around their use.

Here are some suggesting best practices for notebook users that I hope might be helpful an constructive. It's true that if you adopt all of them, I might have succeeded in prising your Notebook from your hands, but if you adopt any of them, as you use notebooks, I think you will be safer and more successful. I'm very much in favour of half a loaf.

Subject to the vaguaries of the web, the checkboxes, online, should be clickable, should you find it useful as an actual checklist, and there's a PDF version available too.

Notebook Best Practices

Jupyter logo with checkmark

NBP1 Ensure that your Notebook runs correctly after completion
- Develop the Notebook
- Take a temporary copy of the Notebook
- Clear the Notebook
- Run and confirm the results match the temporary copy
- Clear the new Notebook again
- Commit the new Notebook to version control
- Rerun (so the Notebook contains the results)
- Delete the temporary copy
NBP2 Store the (cleared) Notebook in version control (see NBP1)
NBP3 Parameterize inputs and outputs at the top of the Notebook

e.g. Set and use variables such as INPATH and OUTPATH
Write the most important outputs to file (if not already done)

NBP4 Replace some individual cells or groups of cells with functions

Move them into an importable file, import and use
Prioritize potentially re-usable code and code requiring testing

NBP5 Write some tests

Create a reference (regression) test for the whole process
Create unit tests for the individual functions

NBP6 Consider restructuring/extracting the code as a standalone script
NBP7 Allow the parameters to be set from the command line

Alternatively, read from a configuration file (e.g. .json or .toml)

NBP8 Consider using safer alternatives like Marimo and Quarto

NBP Version 1.0.

A printable PDF copy is available.

Older Posts Newer Posts