Data Validation with tdda Constraints: A Guide for LLMs and Coding Agents

Posted on Fri 22 May 2026 in TDDA • Tagged with data validation, constraints, LLMs, coding bots, tdda, pandas, parquet

Raw markdown for direct use in LLM context windows.

This post is addressed primarily to LLMs and coding agents. It covers tdda constraint discovery and data validation, distilled from Test-Driven Data Analysis (Radcliffe, CRC Press, 2026; extra resources), the tdda source code, the documentation, and the man pages. If you are a human who wants a dense, no-nonsense introduction to data validation with tdda, you are welcome too.

If you are a coding agent, the fast path is:

  1. Profile the data (profiling is not part of tdda; the example profiles at book.tdda.info/profiles show the sort of thing you need); discuss findings with the human.
  2. Generate constraints from known-good data with tdda discover -x data.csv constraints.tdda
  3. Read the generated .tdda file; edit it by hand. Discovery generates constraints that fit the training data—some will be too tight (stale date ranges, over-fitted min/max), some too loose (weak regex patterns), and some redundant or overlapping. Only you, in consultation with your tame human, can fix this. If you skip it, you will get both false positives and, worse, false negatives.
  4. Find and write failing records with tdda detect data.csv constraints.tdda bad.csv --key id_field (or use tdda verify data.csv constraints.tdda for a quick field-level summary without writing individual failures).
  5. Never skip step 3. If you skip it, you will miss bad data. See The Workflow.

Data Validation with Constraints in TDDA

The tdda constraint discovery tool examines a dataset, finds patterns that hold across it, and encodes them as a JSON file (.tdda). You then validate new data against that file. Think of it as unit tests for data: you capture what "good" looks like once, and thereafter test that new data matches.

The core tool is the tdda command-line utility, with three subcommands:

  • tdda discover—generate constraints from training data
  • tdda verify—check whether data satisfies constraints (summary)
  • tdda detect—find the individual records that fail constraints

Install:

pip install tdda
tdda examples  # download example data and code

The Workflow

There are two phases: development and deployment. Going straight from discovery to deployment, skipping the rest of the development phase, is the single most common mistake and is covered separately below.

Development phase (training data, then holdout data)

Step 0: Decide the operating point. Before doing anything else, discuss the false-positive/false-negative trade-off with a suitable human. In a safety-critical pipeline you may want to work hard to avoid false negatives even at the cost of more alerts; in a high-volume low-stakes pipeline the balance may be very different. This decision should drive every subsequent adaptation choice—it is not something to assume a default for.

Step 1: Profile and discuss. Before discovering constraints, produce a data profile—frequency distributions, null counts, summary statistics, outlier analysis. Use whatever tools you have (ydata-profiling, custom pandas code, etc.); the profiles at book.tdda.info/profiles show the sort of thing you need. Ideally, discuss the profile and data with a suitable human expert. The profile helps you understand what "valid" looks like before formalizing it. Profiling is not part of tdda.

Step 2: Discover. Run tdda discover on known-good training data to generate a .tdda constraints file automatically. The -x flag enables regex generation for string fields; -G suppresses grouping (usually produces simpler patterns). Both together: -xG.

tdda discover -xG data.csv constraints.tdda

Step 3: Read. Read the generated .tdda file. This is a named step, not a preamble to editing. Understand what was discovered before touching it.

Step 4: Adapt. Edit the constraints by hand. The vocabulary of adaptation is: Tighten / Relax / Add / Delete / Choose Among. This step is not optional. Auto-generated constraints are always a first draft—they will have stale date ranges, unnecessary no_duplicates constraints, and over-fitted or under-specified regex patterns. See Hand-Editing the .tdda File.

Step 5: Validate against holdout. Apply the adapted constraints to holdout data—data not used in discovery. Adapt further as needed. This is where you discover that your constraints are too tight (false positives on valid data) or too loose (missing real problems).

Deployment phase (operational data)

Step 6: Verify. Run tdda verify on each incoming batch of operational data. Fast and terse—reports which constraints fail and for how many records.

Step 7: Monitor. Classify failures:

  • True positives—bad data caught correctly. Act: reject the batch, fix the root cause, or improve normalisation, cleansing, or the upstream pipeline.
  • False positives—valid data flagged wrongly. Relax or remove the offending constraints.
  • False negatives—bad data that passed through. Tighten or add constraints.

Step 8: Refine. Adapt the constraints based on what monitoring reveals (same vocabulary as step 4). Loop back to step 7. Data changes, pipelines change, and edge cases surface over time — constraints must evolve with them. Alert fatigue is a real risk: too many false positives desensitise reviewers. Filter recurring known-benign failures, but don't suppress so aggressively that real problems hide.

What happens when you skip the development phase

The reduced process skips steps 3–5: you discover, then go straight to deployment without reading, adapting, or validating against holdout.

The result:

  • Many more false negatives. This is the dominant failure mode. Constraints were generated mechanically from imperfect training data and never tightened. Bad data that wasn't in the training set passes through undetected. You are systematically blind in the more dangerous direction.
  • More false positives. Training data rarely covers the full breadth of valid values, so valid operational data trips constraints that were set too tight against the training sample.

False positives are annoying. False negatives are bad data propagating downstream. The reduced process makes both worse, but the false-negative problem is structurally larger because nothing in the reduced process ever tightens the constraints.

A Worked Example: Elements 92 to 118

The periodic table makes a good illustration because everyone knows the domain. The tdda examples command installs sample datasets; one of them is elements92.csv — the first 92 elements.

Run discovery on the 92-element training set:

tdda discover -xG elements92.csv elements92.tdda

The -xG flags enable regex generation for string fields and suppress grouping, which usually keeps the generated patterns simpler. Three fields from the result:

"Z": {
    "type": "int", "min": 1, "max": 92,
    "sign": "positive", "max_nulls": 0, "no_duplicates": true
},
"ChemicalSeries": {
    "type": "string", "min_length": 7, "max_length": 20,
    "max_nulls": 0,
    "allowed_values": ["Actinoid", "Alkali metal", "Alkaline earth metal",
                       "Halogen", "Lanthanoid", "Metalloid", "Noble gas",
                       "Nonmetal", "Poor metal", "Transition metal"],
    "rex": ["^[A-Z][a-z]+$", "^[A-Z][a-z]+ [a-z]{3,5}$",
            "^Alkaline earth metal$"]
},
"AtomicWeight": {
    "type": "real", "min": 1.007947, "max": 238.028913,
    "sign": "positive", "max_nulls": 0
}

Now verify against elements118.csv, which holds all 118 elements, including the heavy synthetic ones beyond element 92:

tdda verify -f elements118.csv elements92.tdda
Z:             1 failure   max 
Symbol:        2 failures  max_length   rex 
AtomicWeight:  2 failures  max   max_nulls 
...
Failing Fields: 11/16   Failing Constraints: 17/80

Seventeen constraints fail — all training-data artefacts. Here is what to do with the three fields shown:

Z. The atomic number running to 92 is an artefact of the training set. Remove max entirely, or set it to something like 200 if you want a sanity-check upper bound. Keep sign, min, max_nulls, and no_duplicates — those are domain facts.

ChemicalSeries. This field has both allowed_values and rex. The set of chemical series is closed — new elements join existing series — so allowed_values is exactly right. Remove rex, min_length, and max_length: they add nothing when allowed_values is present and will only generate false positives if a value is ever formatted slightly differently.

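AtomicWeight. The max of 238 is uranium's weight, again a training artefact. Oganesson (element 118) weighs ~294. Remove max or set a generous upper bound. The max_nulls failure means that at least one element in elements118.csv has no atomic weight recorded; decide with your human whether missing weights are acceptable (relax max_nulls) or a data problem to fix upstream. Keep sign: that is a genuine domain constraint (atomic weights are positive) and acts as a safeguard even if min/max are later adjusted.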

After adapting, verify again. The 17 failures should drop to zero on the holdout data — and that result is meaningful because you reviewed each change rather than just deleting constraints to make the number go down.
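
The three adaptations described above can also be applied programmatically. The following is a minimal sketch using only the standard json module. The fields dictionary simply restates the constraints shown earlier, and deleting max outright (rather than setting a generous bound) is just one of the two options discussed.

```python
import json

# The three discovered fields shown earlier, as they would appear
# under the "fields" key of elements92.tdda.
fields = {
    "Z": {
        "type": "int", "min": 1, "max": 92,
        "sign": "positive", "max_nulls": 0, "no_duplicates": True,
    },
    "ChemicalSeries": {
        "type": "string", "min_length": 7, "max_length": 20,
        "max_nulls": 0,
        "allowed_values": ["Actinoid", "Alkali metal", "Alkaline earth metal",
                           "Halogen", "Lanthanoid", "Metalloid", "Noble gas",
                           "Nonmetal", "Poor metal", "Transition metal"],
        "rex": ["^[A-Z][a-z]+$", "^[A-Z][a-z]+ [a-z]{3,5}$",
                "^Alkaline earth metal$"],
    },
    "AtomicWeight": {
        "type": "real", "min": 1.007947, "max": 238.028913,
        "sign": "positive", "max_nulls": 0,
    },
}

# Z: the max of 92 is a training artefact; keep everything else.
fields["Z"].pop("max", None)

# ChemicalSeries: the set is closed, so keep allowed_values and drop
# the constraints that add nothing alongside it.
for name in ("rex", "min_length", "max_length"):
    fields["ChemicalSeries"].pop(name, None)

# AtomicWeight: uranium's weight is not a real upper bound.
fields["AtomicWeight"].pop("max", None)

adapted = json.dumps({"fields": fields}, indent=4)
print(adapted)
```

In a real file you would load the whole .tdda document with json.load, edit its "fields" entry in the same way, and write it back, leaving the other top-level sections untouched.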

Reading and Editing the .tdda File

A .tdda file is JSON. The top-level structure:

{
    "creation_metadata": { ... },
    "dataset": {
        "required_fields": ["*"],
        "allowed_fields": []
    },
    "fields": {
        "field_name": { ... },
        ...
    }
}

The dataset section controls which fields must be present (required_fields) and which extra fields are permitted (allowed_fields). Wildcards * and ? are supported. The default required_fields: ["*"] means all fields listed in fields are required. allowed_fields: [] means no extra fields are permitted.

Per-field constraints:

| Constraint | Types | Notes |
|---|---|---|
| type | all | int, real, string, bool, date |
| min | int, real, date | date as ISO 8601 string |
| max | int, real, date | same |
| sign | int, real | positive, non-negative, zero, non-positive, negative |
| max_nulls | all | 0 = no nulls allowed |
| no_duplicates | all | true if values must be unique |
| min_length | string | length in Unicode code points |
| max_length | string | same |
| allowed_values | string | generated when ≤ 20 distinct values in training data |
| rex | string | list of regex patterns; value must match at least one |

type: date covers both dates and datetimes. Naive only (no timezone). Dates stored as ISO 8601 strings in the JSON.
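
To make the semantics in the table concrete, here is a small, purely illustrative checker for a single value. It is not tdda's implementation: check_value is a hypothetical helper, it ignores type and no_duplicates, and it treats max_nulls only in the "0 means no nulls" case.

```python
import re

SIGN_CHECKS = {
    "positive": lambda v: v > 0,
    "non-negative": lambda v: v >= 0,
    "zero": lambda v: v == 0,
    "non-positive": lambda v: v <= 0,
    "negative": lambda v: v < 0,
}

def check_value(value, constraints):
    """Return the names of the constraints that `value` fails."""
    failures = []
    if value is None:
        # max_nulls = 0 means no nulls are allowed in the field
        if constraints.get("max_nulls") == 0:
            failures.append("max_nulls")
        return failures
    if "min" in constraints and value < constraints["min"]:
        failures.append("min")
    if "max" in constraints and value > constraints["max"]:
        failures.append("max")
    if "sign" in constraints and not SIGN_CHECKS[constraints["sign"]](value):
        failures.append("sign")
    # len() on a Python str counts Unicode code points
    if "min_length" in constraints and len(value) < constraints["min_length"]:
        failures.append("min_length")
    if "max_length" in constraints and len(value) > constraints["max_length"]:
        failures.append("max_length")
    if "allowed_values" in constraints and value not in constraints["allowed_values"]:
        failures.append("allowed_values")
    # rex: the value must match at least one pattern in the list
    if "rex" in constraints and not any(re.search(p, value) for p in constraints["rex"]):
        failures.append("rex")
    return failures

z = {"type": "int", "min": 1, "max": 92, "sign": "positive", "max_nulls": 0}
series = {"allowed_values": ["Halogen", "Noble gas"],
          "rex": ["^[A-Z][a-z]+$", "^[A-Z][a-z]+ [a-z]{3,5}$"]}

print(check_value(118, z))               # ['max']
print(check_value(None, z))              # ['max_nulls']
print(check_value("Noble gas", series))  # []
print(check_value("noble gas", series))  # ['allowed_values', 'rex']
```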

Editing the file

Auto-generated constraints are always a first draft. Always edit before deploying. For LLMs this is the step most likely to be skipped and most likely to matter.

The vocabulary: Tighten / Relax / Add / Delete / Choose Among.

min and max on dates. Remove max dates that will become stale as new data arrives (open_date max lags permanently). Set min from domain knowledge (e.g. bank founding date), not just from training data.

min and max on numeric fields. Adjust to domain-meaningful bounds. If account numbers are always 8-digit and start with 1, set "min": 10000000, "max": 19999999. The constraint should reflect what is valid, not just what happened to appear in training data.

sign. Keep it even when min/max make it redundant—it acts as a safeguard if those are later loosened or removed.

no_duplicates. Remove if duplicates are legitimately possible (shared phone numbers, email addresses across accounts, etc.).

allowed_values vs rex. Don't keep both. If the set of values is closed, use allowed_values and remove rex, min_length, max_length. If open-ended, write a tighter regex.
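
Some of these review points can be surfaced mechanically before you start editing. The sketch below is an assumption-laden aid, not part of tdda: review_hints is a hypothetical helper, the example fields are invented, and the hints only point at constraints to look at. The decisions still need domain knowledge.

```python
def review_hints(fields):
    """Yield (field, hint) pairs for constraints that usually need a human look."""
    for name, c in fields.items():
        if c.get("type") == "date" and "max" in c:
            yield name, "max date may go stale as new data arrives"
        if "allowed_values" in c and "rex" in c:
            yield name, "has both allowed_values and rex; keep one"
        if c.get("no_duplicates"):
            yield name, "no_duplicates: confirm duplicates are never legitimate"

# Invented constraints in the shape of the "fields" section of a .tdda file.
fields = {
    "open_date": {"type": "date", "min": "2001-03-04", "max": "2025-11-30"},
    "status": {"type": "string", "allowed_values": ["closed", "open"],
               "rex": ["^[a-z]+$"]},
    "email": {"type": "string", "no_duplicates": True},
    "balance": {"type": "real", "sign": "non-negative"},
}

hints = list(review_hints(fields))
for field, hint in hints:
    print(f"{field}: {hint}")
```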

rex patterns. Auto-generated patterns are often too loose. Replace with domain-specific patterns:

"rex": ["^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$"]

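Before putting a hand-written pattern into the .tdda file, it is worth checking it against values you expect to pass and to fail. A minimal sketch with Python's re module follows. The sample values are chosen for illustration, and reading the pattern above as a UK-style postcode is an assumption about its intent.

```python
import re

# The pattern shown above, unchanged.
pattern = re.compile(r"^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$")

should_match = ["EH1 1AA", "SW1A 2AA", "M1 1AE"]   # illustrative sample values
should_fail = ["eh1 1aa", "EH11AA", "12345", ""]   # lower case, no space, wrong shape, empty

matched = [v for v in should_match if pattern.search(v)]
rejected = [v for v in should_fail if not pattern.search(v)]

print(matched == should_match, rejected == should_fail)
```

If either comparison is False, the pattern is looser or tighter than you intended and needs another pass before deployment.
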
Precautionary principle. Every adaptation decision should be guided by the false-positive/false-negative trade-off established in Step 0. When uncertain, prefer tightening: in most contexts bad data propagating downstream undetected is worse than a spurious alert. But in high-volume pipelines where alert fatigue is a real risk, or where false positives have operational cost, the balance shifts. Apply the trade-off your human set; don't invent one.

The Three CLI Commands

CSV and flat-file input

CSV files are read using csv_to_pandas, which handles type inference, date parsing, and null markers. If the file has a companion .serial metadata file (or CSVW or Frictionless metadata), appending : to the filename tells tdda to find and use it automatically—giving accurate types rather than guessed ones. See post 077 for full details.

tdda discover data.csv: constraints.tdda   # auto-find metadata
tdda verify   data.csv: constraints.tdda
tdda detect   data.csv: constraints.tdda bad.csv --key id

tdda discover

Generates constraints from training data and writes a .tdda file.

# Basic (no regex generation)
tdda discover data.csv constraints.tdda

# With regex generation, ungrouped (recommended for string fields)
tdda discover -xG data.csv constraints.tdda

# From Parquet
tdda discover -xG data.parquet constraints.tdda

# From database
tdda discover -xG postgres:tablename constraints.tdda

# Write to stdout
tdda discover -xG data.csv

# Also write an HTML report
tdda discover -xG data.csv constraints.tdda -r html -o constraints

Key flags:

  • -x / --rex: enable regex generation for string fields
  • -G / --no-group-rex: do not group patterns (usually simpler output; without this flag, patterns are grouped)
  • --no-md: omit creation metadata from the output file
  • --no-ar: omit allowed_fields and required_fields from the dataset section
  • -r FORMAT: also write a report in html, md, txt, json, yaml, or toml

tdda verify

Checks whether data satisfies constraints. Reports at the field level—how many records failed each constraint. Does not identify which records failed.

tdda verify data.csv constraints.tdda

# Show only fields with failures
tdda verify -f data.csv constraints.tdda

# Use companion metadata (.serial, CSVW or Frictionless) if present
tdda verify data.csv: constraints.tdda

Key flags:

  • -f / --fields—report only fields with failures
  • -a / --all—report all fields including those with no failures
  • --epsilon E—tolerance for floating-point comparisons (default: 1e-6)

tdda detect

Finds and writes the individual records that fail constraints. Use this when you need to identify and act on specific failing records.

tdda detect data.csv constraints.tdda bad.csv --key account_id

# With text report alongside the output CSV
tdda detect data.csv constraints.tdda bad.csv -r txt --key account_id

# From/to Parquet
tdda detect data.parquet constraints.tdda bad.parquet -r txt --key id

The output file contains all failing records. By default it includes all original columns plus an n_failures count per row and, unless --no-per-constraint is given, one flag column per failing constraint. The --key flag adds named field(s) to the text report for identification.

Key flags:

  • --key FIELD [FIELD ...]—key fields for the text report
  • -r FORMAT—report format(s): html, md, txt, json, yaml, toml
  • --per-constraint—write one flag column per failing constraint (default: on)
  • --no-per-constraint—omit per-constraint flag columns
  • --output-fields [FIELD ...]—original columns to include; no args = all
  • --write-all-records—include passing records in the output
  • --index—include row-number index in output

If no records fail, no output file is created (and any existing file at that path is deleted).
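
In a pipeline, the "no failures means no output file" behaviour gives a simple way to branch. The sketch below wraps the CLI with subprocess, using only the flags documented above. The helper names are hypothetical, and it deliberately does not interpret the command's exit status, which is not described here.

```python
import os
import subprocess

def build_detect_command(data, constraints, output, key=None, report=None):
    """Assemble a `tdda detect` command line from the flags described above."""
    cmd = ["tdda", "detect", data, constraints, output]
    if report:
        cmd += ["-r", report]
    if key:
        # --key accepts one or more field names, so it goes last
        cmd += ["--key", *key]
    return cmd

def failing_records_written(data, constraints, output, **kwargs):
    """Run detect and report whether a file of failing records now exists.

    Relies on the behaviour stated above: if no records fail, no output
    file is created and any existing file at that path is deleted.
    """
    subprocess.run(build_detect_command(data, constraints, output, **kwargs))
    return os.path.exists(output)

command = build_detect_command("data.csv", "constraints.tdda", "bad.csv",
                               key=["account_id"], report="txt")
print(" ".join(command))
```

A caller would then do something like: if failing_records_written(...) is true, quarantine the batch and review bad.csv; otherwise let the batch through.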

The Design Philosophy: Bring Data to the Constraints

The tdda library has a deliberately small set of constraint types. It does not have cross-column constraints, aggregate constraints, or constraints on non-tabular data. This is a design choice.

The answer to "how do I constrain X" is almost always: derive a column or take a measurement that reduces X to something tdda can handle natively. There are three patterns.

Pattern 1: Derived columns for cross-column constraints

For constraints that involve more than one column, compute a new column that captures the constraint, then discover and adapt constraints on that column. Two approaches:

Boolean column (convention: True = bad):

df['no_tel'] = df['home_tel'].isnull() & df['mobile_tel'].isnull()
# Constraint: max_nulls = 0, allowed_values = [False]
# Or cast to int and constrain sign = zero:
df['no_tel'] = df['no_tel'].astype(int)
# Constraint: sign = zero

Numeric column (constrain with sign):

import datetime
now = datetime.datetime.now()
df['open_secs_in_future'] = (
    (df['open_date'] - now).dt.total_seconds()
)
# Constraint: sign = negative (open dates must be in the past)

After adding derived columns, run tdda discover on the augmented DataFrame, then edit the generated constraints—keep type, max_nulls, sign; remove training-specific min/max on the derived columns.

Pattern 2: Roll-up constraints for aggregate checks

Problems invisible at the individual-record level often show up in counts, sums, and proportions per group. Compute the aggregates, write them as a small dataset, and discover constraints on that.

import pandas as pd
from tdda.serial.io import read_df, write_df

df = read_df('data.parquet')

# Whole-table statistics
stats = pd.DataFrame({'n_records': [len(df)]})
write_df(stats, 'stats.csv')
# tdda discover stats.csv stats.tdda

# Grouped statistics
dfg = (df.groupby('region', observed=True)
         .agg(count=('id', 'count'),
              total=('amount', 'sum'))
         .reset_index())
write_df(dfg, 'regional_stats.csv')
# tdda discover regional_stats.csv regional_stats.tdda

This catches fraud patterns (abnormally high counts per entity), data drift (proportions shifting), and coverage gaps (expected groups missing).

Pattern 3: Regularizing measurements for non-tabular data

Constraint discovery works on tabular data. For anything else—transaction logs, images, JSON documents, text files, time series—the approach is to extract a tabular dataset of measurements from the source data and validate that.

This is not a workaround. It is the intended design. The key insight is that most data quality problems manifest as anomalies in well-chosen measurements, and a small set of measurements often catches a large fraction of real problems.

Transaction logs → customer-level features:

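import pandas as pd

# transactions: a DataFrame with customer_id, id, amount and a datetime 'date' column
today = pd.Timestamp.today()

features = (transactions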
    .groupby('customer_id')
    .agg(
        n_transactions=('id', 'count'),
        total_spend=('amount', 'sum'),
        max_transaction=('amount', 'max'),
        days_since_last=('date', lambda x: (today - x.max()).days),
    )
    .reset_index())
# discover constraints on features

Other sources:

  • Arrays / time series: extract min, max, mean, null count, trend sign
  • Images: extract EXIF metadata, pixel statistics, checksum
  • JSON / XML: flatten to tabular using pandas json_normalize, or extract specific fields by path
  • Text files: line count, word count, pattern match counts, encoding checks

Even crude measurements catch many real problems. Start simple. The goal is not to capture every possible constraint—it is to catch most real failures with the least machinery.
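
As an illustration of the text-file case, the following sketch reduces each file to one row of measurements and writes the rows as a CSV that tdda discover could then be run on. It uses only the standard library. The function names, the chosen measurements, and the default ERROR pattern are all assumptions made for the example.

```python
import csv
import re
import tempfile
from pathlib import Path

def measure_text_file(path, pattern=r"\bERROR\b"):
    """Reduce one text file to a flat row of simple measurements."""
    raw = Path(path).read_bytes()
    try:
        text = raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        # Still measure the file, but record that it is not valid UTF-8
        text = raw.decode("utf-8", errors="replace")
        valid_utf8 = False
    return {
        "filename": Path(path).name,
        "n_bytes": len(raw),
        "n_lines": len(text.splitlines()),
        "n_words": len(text.split()),
        "n_pattern_matches": len(re.findall(pattern, text)),
        "valid_utf8": valid_utf8,
    }

def write_measurements(paths, out_csv):
    """Write one row of measurements per input file and return the rows."""
    rows = [measure_text_file(p) for p in paths]
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return rows

# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.log"
    sample.write_bytes(b"ok line\nERROR bad thing\n")
    rows = write_measurements([sample], Path(tmp) / "measurements.csv")

print(rows[0])
```

Run over a set of known-good files, the resulting measurements.csv becomes the training data for discovery. Constraints such as sign = zero on n_pattern_matches are then a matter of adaptation, as for any other dataset.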

Python API

For pipeline integration, use the Python API directly.

import pandas as pd
from tdda.constraints.pd.constraints import discover_df, verify_df, detect_df

df = pd.read_parquet('data.parquet')

# Discover
constraints = discover_df(df)
constraints.write_constraints_file('constraints.tdda')

# Verify
result = verify_df(df, 'constraints.tdda')
print(result)

# Detect—returns DataFrame of failing records
failures = detect_df(df, 'constraints.tdda', output_fields=[])
# output_fields=[] includes all original columns
# output_fields=None includes only index and failure columns

For CSV/Parquet I/O with metadata support (see post 077):

from tdda.serial.io import read_df, write_df
df = read_df('data.csv:')        # auto-find .serial metadata
write_df(df, 'output.parquet')

Database Support

The tdda CLI connects to PostgreSQL, MySQL, SQLite, and MongoDB. Database tables work throughout: use DBTYPE:tablename anywhere you would use a CSV or Parquet path. Connection parameters go in ~/.tdda_db_conn_DBTYPE (a JSON file):

{
    "dbtype": "postgres",
    "db": "mydb",
    "host": "localhost",
    "port": "5432",
    "user": "myuser",
    "password": "secret"
}

Use password_env_var instead of password to avoid cleartext credentials. Set file permissions to 600.
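
A minimal sketch of writing such a file with owner-only permissions is shown below. It assumes a POSIX system, and it assumes that password_env_var holds the name of an environment variable; TDDA_PG_PASSWORD is a made-up name. The demonstration writes into a temporary directory rather than your home directory.

```python
import json
import os
import stat
import tempfile

def write_conn_file(path, params):
    """Write connection parameters as JSON, readable only by the owner."""
    if "password" in params:
        raise ValueError("use password_env_var rather than a cleartext password")
    # Create the file with mode 600 from the start ...
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(params, f, indent=4)
    # ... and tighten it explicitly in case the file already existed.
    os.chmod(path, 0o600)

params = {
    "dbtype": "postgres",
    "db": "mydb",
    "host": "localhost",
    "port": "5432",
    "user": "myuser",
    "password_env_var": "TDDA_PG_PASSWORD",  # hypothetical variable name
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, ".tdda_db_conn_postgres")
    write_conn_file(path, params)
    mode = stat.S_IMODE(os.stat(path).st_mode)
    with open(path) as f:
        written = json.load(f)

print(oct(mode))
```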

Reference tables as DBTYPE:tablename or DBTYPE:schema.tablename:

tdda discover -x postgres:accounts constraints.tdda
tdda verify -f postgres:accounts constraints.tdda
tdda detect postgres:accounts constraints.tdda bad.csv

For custom derived-column constraints on database tables, create a SQL view with the derived columns and run tdda against the view.

Checklist

Profile and discuss before discovering. Understand what "valid" means before encoding it. Involve a human at this stage.

Discover on known-good data. Remove known anomalies from training data before running discover. The better the training data, the better the starting constraints.

Read the generated .tdda file. Not skimming—reading. Before touching it.

Adapt before deploying. Tighten/Relax/Add/Delete/Choose Among. Auto-generated constraints are a first draft, not a finished product.

Validate against holdout data. Apply adapted constraints to data not used in discovery. Adapt further. Do not skip this.

Never skip the development phase. Discover → deploy without read/adapt/validate generates many more false negatives. Bad data propagates downstream undetected. The constraints were never tight enough and nothing in the reduced process fixes that.

Know your trade-off. Discuss false-positive/false-negative tolerance with your human before adapting. Safety-critical pipelines minimise false negatives at the cost of more alerts; high-volume pipelines may need the opposite. Apply that decision consistently; don't assume a default.

Use derived columns for cross-column constraints. Boolean (True = bad) or numeric (constrain with sign). Discover on the augmented DataFrame.

Use roll-up constraints for aggregate checks. Compute counts, sums, and proportions per group; discover on the result. Problems invisible at record level often show up here.

Take measurements to validate non-tabular data. Extract a tabular dataset of measurements. Start simple—even crude measurements catch most real problems.

Classify monitoring failures. True positives: act/reject/fix. False positives: relax constraints. False negatives: tighten or add constraints.

Refine over time. Data changes, pipelines change, edge cases surface. Keep monitoring and adapting.

Beware alert fatigue. Filter recurring known-benign failures. Don't suppress so aggressively that real problems hide.

CSV Metadata and tdda.serial: A Guide for LLMs and Coding Agents

Posted on Thu 21 May 2026 in TDDA • Tagged with tdda.serial, CSV, metadata, LLMs, pandas, polars, frictionless, csvw

Raw markdown for direct use in LLM context windows.

This post is addressed primarily to LLMs and coding agents. It covers the tdda.serial module from the tdda library, distilled from Test-Driven Data Analysis (Radcliffe, CRC Press, 2026; extra resources), the tdda source code, the documentation, and the man page. If you are a human wanting a practical reference, you are welcome too.

If you are a coding agent: (1) Look for a companion .serial file. (2) If present, treat it as authoritative — do not infer types that contradict it. (3) Use csv_to_pandas('foo.csv:') or csv_to_polars('foo.csv:'). (4) If no metadata exists, generate one with tdda serial --generate or write one by hand. (5) When writing data that will be reused, use pandas_to_csv(df, 'output.csv:').

What CSV Data Loses in Transit

CSV files are untyped. When a reader opens one, it must infer types from the formatted values. This inference is often wrong, and different libraries make different inference decisions.

Common failures:

  • Integer columns with nulls become floats. Even in Pandas 3.0, with its two nullable backends, the default read_csv promotes a nullable integer column to float64 because NaN is a float. A column of 1, 2, null becomes 1.0, 2.0, NaN.
  • Non-standard null markers are read as strings. A column using - to indicate missing values produces a string column rather than a nullable int or float.
  • Dates become strings. Unless the reader is told to parse dates, a date column like 2024-03-15 is read as a str or object column.
  • The Pandas index round-trip. df.to_csv() writes the Pandas index as an unnamed first column by default. pd.read_csv() does not restore it as an index. The resulting DataFrame has an extra unnamed column.
  • Different libraries, different results. The same CSV file may read as different types in Pandas and Polars, or between Pandas with the original backend and Pandas with the numpy_nullable backend.

A minimal demonstration: write a single-column Pandas DataFrame with a nullable integer and read it back using all defaults.

import pandas as pd

df = pd.DataFrame({'a': pd.array([1, 2, None], dtype='Int64')})
df.to_csv('test.csv')
df2 = pd.read_csv('test.csv')
print(df2.dtypes)   # a is float64, there's also an unnamed index column
print(df2)

Output:

Unnamed: 0    int64
a           float64
dtype: object
   Unnamed: 0    a
0           0  1.0
1           1  2.0
2           2  NaN

The round-trip has lost type information and added a spurious column. This happens with default settings, not unusual data.

When Metadata Is Worth Using

CSV metadata — a companion file that records the format used — addresses these problems. It is not always worth the overhead.

Use metadata when:

  • The CSV uses non-standard formats: an unusual null indicator, a non-ISO date format, a non-comma separator, a non-UTF-8 encoding.
  • The data will be read by code written separately from the code that wrote it, or by a different person or system.
  • The data will be read again some time after it was written.
  • You are writing data-producing code and the format is non-obvious. The marginal cost of writing a .serial file alongside the write code is small, and it serves as machine-readable documentation.

Don't use metadata when:

  • The CSV is clean, standard, and will be read once by the same code that wrote it.
  • It is a simple file that whatever reads it handles correctly in practice.
  • It is a throwaway file that nothing downstream will consume.

A note for LLMs: the cost of generating a .serial file alongside data-writing code is near zero at code-generation time. The threshold for "worth it" is therefore lower for LLM-generated code than for humans typing it out. Lean toward including metadata when the format is non-standard or the data leaves the immediate script.

Three Metadata Formats

Three metadata formats are in use for CSV files. Understanding their philosophical differences helps you choose and explains why they don't translate perfectly to each other.

tdda.serial (.serial files)

A tdda.serial file describes a format, not a specific file. One .serial file can apply to any number of CSV files that share the same format. It is not URL-centric. It can be hand-written or generated. It has a strong, flexible date format system. It is the native format of the tdda.serial module. You can use @ as a glob wildcard in .serial filenames to indicate which files a metadata file is intended for — sales_@.serial would match sales_2024.csv, sales_2025.csv etc.

A .serial file is a JSON file. By convention, it has the same stem name as the data file it accompanies (foo.serial for foo.csv), but this is not required. Any .serial file can be applied to any compatible flat file.

CSVW

CSVW is a W3C standard for describing CSV files on the web. It is designed around a one-to-one relationship between a metadata file and a specific named data file, identified by URL. A CSVW file that describes foo.csv contains a url pointing to that specific file.

CSVW is comprehensive and has W3C backing, but tooling is sparse and fragmented — the tools listed on the CSVW site are mostly RDF-focused, not CSV-focused. It is a heavy format, and date handling is less flexible than tdda.serial. If you receive a CSVW file, tdda.serial can use it; if you are creating new metadata, tdda.serial is a simpler and more practical choice.

Frictionless

Frictionless is a data packaging ecosystem with good Python tooling (pip install frictionless). Its primary abstractions are resources (a single dataset plus its metadata) and packages (a collection of resources). This makes it well-suited to supplying data as a self-described package, but less suited to describing a shared format applied to many files. A Frictionless schema is reusable but is not commonly used standalone. If you are distributing a dataset for others to consume, Frictionless is a reasonable choice. If you are describing an internal format, use tdda.serial.

Interoperability

The tdda.serial library reads and writes all three formats, and they can be used interchangeably in most tdda contexts. If you receive a CSVW or Frictionless file, you can pass it to csv_to_pandas exactly as you would a .serial file. The tdda serial command converts between formats.

The .serial File Format

A .serial file is a JSON object. Here is the full top-level structure:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "writer": "tdda.serial-3.0.0",
    "tdda.serial": { ... },
    "pandas.read_csv": { ... },
    "pandas.DataFrame.to_csv": { ... },
    "polars.read_csv": { ... }
}

The format key is required; all others are optional. Only the tdda.serial section is described here — library-specific sections contain verbatim keyword arguments for the corresponding function and are generated by tdda serial --to pd.r etc.

Dataset-level keys in tdda.serial

All are optional. Omitted keys fall back to library defaults.

| Key | Type | Description |
|---|---|---|
| encoding | string | Text encoding, e.g. "UTF-8", "latin-1" |
| delimiter | string | Field separator, e.g. ",", "\|", ";" |
| quote_char | string | Quote character, almost always "\"" or "'" |
| escape_char | string | Escape character; "\\" means backslash |
| stutter_quotes | bool | If true, embedded quotes are doubled ("") |
| null_indicator | string or array | Null marker(s), e.g. "", "-", "NULL" |
| date_format | string | Default format for date fields |
| datetime_format | string | Default format for datetime fields |
| header_row_count | int | Number of header rows (default: 1) |
| header_row | int | Zero-based index of the column-name row (default: 0) |
| decimal_point | string | Decimal point character (default: ".") |
| thou_sep | string | Thousands separator, e.g. "," |
| true_values | string or array | Values interpreted as true for bool fields |
| false_values | string or array | Values interpreted as false for bool fields |
| quoting | string | Quoting style (see below) |
| fields | array or object | Per-field descriptions |

The quoting field accepts Python csv module constants (QUOTE_ALL, QUOTE_MINIMAL, QUOTE_NONNUMERIC, QUOTE_NONE, QUOTE_NOTNULL, QUOTE_STRINGS) and also QUOTE_STRINGS_ONLY, which quotes only string values (not nulls, numbers, dates, or booleans). QUOTE_STRINGS_ONLY is similar to what JSON does.
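
For readers unfamiliar with the Python csv constants, this short standard-library sketch shows how three of them change the written output for the same row. It uses Python's csv module directly rather than tdda.serial; the row is made up, and QUOTE_STRINGS_ONLY is not shown because it is not part of the csv module.

```python
import csv
import io

def render(row, quoting):
    """Write one row with the given quoting style and return the line."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(row)
    return buf.getvalue().rstrip("\n")

row = ["Helium", 2, 4.0026, "Noble gas, inert"]

minimal = render(row, csv.QUOTE_MINIMAL)        # quote only where needed
nonnumeric = render(row, csv.QUOTE_NONNUMERIC)  # quote every non-numeric field
everything = render(row, csv.QUOTE_ALL)         # quote every field

print(minimal)      # Helium,2,4.0026,"Noble gas, inert"
print(nonnumeric)   # "Helium",2,4.0026,"Noble gas, inert"
print(everything)   # "Helium","2","4.0026","Noble gas, inert"
```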

The fields entry

Fields can be specified as an array (complete and ordered) or an object/dictionary (partial, keyed by the CSV column name).

Array form — used when the complete field list is known and ordered:

"fields": [
    {"name": "id",    "fieldtype": "int"},
    {"name": "price", "fieldtype": "float"},
    {"name": "date",  "fieldtype": "date"}
]

Object form — used for partial specifications or when internal names differ from CSV column names:

"fields": {
    "commission date": {"name": "DateOfCommission", "fieldtype": "date"},
    "passed qa?":      {"name": "PassedQA", "fieldtype": "bool",
                        "true_values": "yes", "false_values": "no"}
}

In object form the dictionary key is the name as it appears in the CSV; the optional name key gives the internal (DataFrame column) name.
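
The mapping implied by the object form can be sketched in a few lines of plain Python. This is an illustration of the naming rule just described, not tdda.serial's own code; rename_map and types_by_internal_name are hypothetical helpers, and the "site" entry is an invented field added to show the no-rename case.

```python
fields = {
    "commission date": {"name": "DateOfCommission", "fieldtype": "date"},
    "passed qa?": {"name": "PassedQA", "fieldtype": "bool",
                   "true_values": "yes", "false_values": "no"},
    "site": {"fieldtype": "string"},   # invented field with no internal rename
}

def rename_map(fields):
    """Map each CSV column name to its internal (DataFrame) name."""
    return {csv_name: spec.get("name", csv_name)
            for csv_name, spec in fields.items()}

def types_by_internal_name(fields):
    """Map each internal name to its declared fieldtype."""
    return {spec.get("name", csv_name): spec["fieldtype"]
            for csv_name, spec in fields.items() if "fieldtype" in spec}

renames = rename_map(fields)
types = types_by_internal_name(fields)
print(renames)
print(types)
```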

Per-field keys

| Key | Description |
|---|---|
| name | Internal name (DataFrame column name). Required in array form. |
| fieldtype | Type of the field (see table below) |
| csvname | CSV column name when different from name (array form) |
| format | Date/datetime format for this field; overrides dataset-level |
| null_indicator | Null marker(s) for this field; overrides dataset-level |
| true_values | True value(s) for bool fields |
| false_values | False value(s) for bool fields |
| description | Human-readable description |

Field types

| Value | Description |
|---|---|
| bool | Boolean |
| int | Integer |
| float | Floating-point |
| number | Unspecified numeric |
| string | Text |
| date | Date (no time component) |
| datetime | Date and time |
| datetime_tz | Date and time with timezone |
| time | Time only |
| iso8601 | ISO 8601 date or datetime (unspecified) |

Date format specifications

Four forms are accepted:

  1. Named ISO 8601 formats: iso8601-date (2000-12-31), iso8601-datetime (2000-12-31T12:34:56), iso8601-datetime-tz (2000-12-31T12:34:56+00:00), iso8601 (any of the above). These are the recommended choices for new data.

  2. YYYY/MM/DD-style specifiers: Tokens: YYYY, YY, MM (month or minute, by context), DD, HH, SS, SS.S (fractional), MON (Jan), MONTH (January), +ZZ:ZZ (timezone), AM/PM. Examples: YYYY-MM-DD, DD/MM/YYYY HH:MM:SS, MM/DD/YY, YYYY-MM-DDTHH:MM:SS.S+ZZ:ZZ.

  3. Unambiguous literal examples: any actual date/time value where the day is ≥ 13 and the year is either 4 digits or ≥ 60. So 2000-12-31T12:34:56 is accepted; 01/02/2000 is not (ambiguous: day-first or month-first?).

  4. Python strftime strings: %Y-%m-%dT%H:%M:%S etc.
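
Why a literal example needs a day above 12 can be seen with the standard library alone. The sketch below is an illustration, not tdda's detection logic, and possible_readings is a hypothetical name. It tries a slash-separated date as both day-first and month-first and collects the distinct dates that result.

```python
from datetime import datetime


def possible_readings(text):
    """Return the distinct dates `text` could mean, day-first or month-first."""
    readings = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            readings.add(datetime.strptime(text, fmt).date())
        except ValueError:
            pass  # this ordering is impossible for the given text
    return readings


ambiguous = possible_readings("01/02/2000")    # 1 Feb or 2 Jan
unambiguous = possible_readings("31/12/2000")  # only day-first is valid
print(len(ambiguous), len(unambiguous))
```

A value with two readings cannot tell the reader which format is intended; a value with one reading can.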

A complete example

This .serial file describes a CSV with non-standard settings — a hyphen as null indicator and ISO 8601 dates — matching the elements3-old.csv file distributed with the tdda library:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "UTF-8",
        "delimiter": ",",
        "quote_char": "\"",
        "escape_char": "\\",
        "stutter_quotes": false,
        "null_indicator": "-",
        "date_format": "YYYY-MM-DD",
        "header_row_count": 1,
        "fields": [
            {"name": "Z",               "fieldtype": "int"},
            {"name": "Name",            "fieldtype": "string"},
            {"name": "Symbol",          "fieldtype": "string"},
            {"name": "Period",          "fieldtype": "int"},
            {"name": "Group",           "fieldtype": "int"},
            {"name": "AtomicWeight",    "fieldtype": "float"},
            {"name": "ApproxDiscovery", "fieldtype": "date"}
        ]
    }
}

An LLM that knows a CSV file's format can write a .serial file like this directly, without needing to run inference. This is usually faster and more reliable than --generate when you can examine the data. Metadata describes the intended format. Values that do not conform are data errors, not type-inference hints.
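
Writing the file by hand needs nothing beyond the json module. The sketch below uses only keys that appear in the example above; the temporary directory is a stand-in for wherever the CSV file lives.

```python
import json
import os
import tempfile

metadata = {
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "UTF-8",
        "delimiter": ",",
        "null_indicator": "-",
        "date_format": "YYYY-MM-DD",
        "fields": [
            {"name": "Z", "fieldtype": "int"},
            {"name": "Name", "fieldtype": "string"},
            {"name": "Symbol", "fieldtype": "string"},
            {"name": "Period", "fieldtype": "int"},
            {"name": "Group", "fieldtype": "int"},
            {"name": "AtomicWeight", "fieldtype": "float"},
            {"name": "ApproxDiscovery", "fieldtype": "date"},
        ],
    },
}

# A temporary directory stands in for the directory holding the CSV file.
outdir = tempfile.mkdtemp()
serial_path = os.path.join(outdir, "elements3-old.serial")
with open(serial_path, "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=4)

# Read it back to confirm that the file is valid JSON.
with open(serial_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded["tdda.serial"]["null_indicator"])
```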

Reading with tdda.serial

Reading a format you know

When you have a .serial file (or can write one), use csv_to_pandas or csv_to_polars:

from tdda.serial import csv_to_pandas, csv_to_polars

# Explicit metadata path
df = csv_to_pandas('elements3-old.csv', md_path='elements3-old.serial')
df = csv_to_polars('elements3-old.csv', md_path='elements3-old.serial')

# Auto-locate metadata (same stem name, same directory)
df = csv_to_pandas('elements3-old.csv', find_md=True)
df = csv_to_polars('elements3-old.csv', find_md=True)

# Colon suffix — equivalent to find_md=True
df = csv_to_pandas('elements3-old.csv:')
df = csv_to_polars('elements3-old.csv:')

# Colon with explicit metadata path
df = csv_to_pandas('elements3-old.csv:elements3-old.serial')

The auto-locate (find_md=True / colon suffix) searches for metadata in priority order: foo.csv.serial, foo.serial, wildcard matches using @ as a glob character (e.g. @.serial), then CSVW and Frictionless naming conventions.

Pandas backends: csv_to_pandas defaults to the numpy_nullable backend. The backend parameter overrides this:

df = csv_to_pandas('foo.csv:', backend='original')    # traditional Pandas dtypes
df = csv_to_pandas('foo.csv:', backend='pyarrow')     # Arrow-backed dtypes
df = csv_to_pandas('foo.csv:', backend='numpy_nullable')  # default

Polars note: polars.read_csv is less flexible than pandas.read_csv for unusual formats. In particular, Polars can only parse ISO 8601 dates directly. csv_to_polars works around this by reading problematic fields as strings and converting them in a post-processing step.

Reading an unfamiliar format

When you don't know a CSV file's format, use tdda serial --generate to infer it:

tdda serial --generate foo.csv foo.serial

This reads foo.csv, applies heuristics, and writes foo.serial. The result is a starting point, not a guarantee — inspect and correct it before relying on it. Key override switches:

--sep C              Set field delimiter to C
--nulls S            Set null indicator(s)
--date-format FMT    Set default date format
--quote-char Q       Set quote character
--escape             Use backslash escaping
--stutter            Use quote stuttering
--encoding ENC       Set encoding
--sample-lines N     Use N lines for inference (default: 1000)

For LLMs: if you can read the CSV file directly, you can often write the .serial by hand more quickly and reliably than inference. Use --generate when the format is complex or uncertain.

After generating or writing the .serial, read with csv_to_pandas or csv_to_polars as shown above.

Writing with tdda.serial

The write wrapper is currently available for Pandas only.

pandas_to_csv

from tdda.serial import pandas_to_csv

# Write CSV and generate accompanying .serial metadata automatically
info = pandas_to_csv(df, 'output.csv', auto_md_outpath=True)
# Writes output.csv and output.serial

# Colon suffix — equivalent
info = pandas_to_csv(df, 'output.csv:')

# Explicit metadata output path
info = pandas_to_csv(df, 'output.csv', md_outpath='output.serial')

# Use an existing .serial to specify the write format
info = pandas_to_csv(df, 'output.csv', md_inpath='format.serial')

# Use an existing .serial for write format and write a .serial for readers
info = pandas_to_csv(df, 'output.csv',
                     md_inpath='shared-format.serial',
                     md_outpath='output.serial')

The return value is a WriteInfo object showing the path written, the metadata output path, and the keyword arguments passed to to_csv.

Any keyword arguments you pass that to_csv accepts (such as sep, na_rep, encoding, date_format) are forwarded to to_csv and also reflected in the written .serial file. For example:

info = pandas_to_csv(df, 'output.psv',
                     auto_md_outpath=True,
                     sep='|',
                     na_rep='NULL',
                     encoding='latin-1')

writes a pipe-separated file with NULL as the null marker and records these settings in output.serial.

pandas_to_csv sets index=False by default — it does not write the Pandas index as a column. This is almost always what you want.

Writing from Polars

There is currently no polars_to_csv wrapper, though one is planned. For now, write using the native df.write_csv() method and generate or write the .serial file separately:

# Write the CSV
df.write_csv('output.csv')

# Generate a .serial from the written file
# (run in shell or via subprocess)
# tdda serial --generate output.csv output.serial

# Or write the .serial by hand if you know the format

If you need the Pandas write behaviour for a Polars DataFrame, convert first: pandas_to_csv(df.to_pandas(), 'output.csv:').

Writing a format-only .serial (no field info)

A .serial file can record only the format conventions without any field-level detail. This is useful as a shared "house format" that specifies separator, encoding, null indicator etc., leaving field names and types to be inferred by the reader. Generate one with:

tdda serial --generate "" format.serial --sep "|" --nulls "NULL" --encoding "latin-1"

(Empty filename generates a fieldless metadata file.)

Or write it by hand — it is a small JSON file:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "latin-1",
        "delimiter": "|",
        "null_indicator": "NULL"
    }
}

Use md_inpath='format.serial' when writing any file in this format.

The tdda serial CLI: Conversion and Code Generation

The tdda serial command converts between metadata formats and generates Python code for reading files without requiring the tdda library.

Format conversion

tdda serial infile outfile [--to FORMAT]

Format is inferred from filename when it follows conventions; use --to when it doesn't. Format abbreviations:

Short Long form
. tdda.serial (default)
pd.r pandas.read_csv
pd.w pandas.DataFrame.to_csv
pl.r polars.read_csv
pl.w polars.DataFrame.write_csv
csvw CSVW
fl Frictionless
fl.r Frictionless resource
fl.p Frictionless package

Examples:

# Convert between formats (inferred from filenames)
tdda serial foo.serial foo-metadata.json        # tdda.serial → CSVW
tdda serial foo-metadata.json foo.serial        # CSVW → tdda.serial

# Explicit format
tdda serial --to csvw foo.serial foo-out.json
tdda serial --to pl.r foo.serial foo-pl.serial  # add Polars read_csv section

# Generate from a CSV file
tdda serial --generate foo.csv foo.serial

# Pandas backend when converting to Pandas sections
tdda serial --to pd.r --backend a foo.serial foo-pdr.serial  # PyArrow dtypes

Generating Python code

Use a .py output extension to generate a standalone read_data() function that does not require tdda to be installed:

tdda serial foo.serial foo_reader.py --to pd.r

This produces something like:

import pandas as pd

def read_data(inpath):
    return pd.read_csv(inpath, sep=',', encoding='UTF-8',
        escapechar='\\', quotechar='"',
        dtype={'id': 'Int64', 'price': 'Float64'},
        na_values='-', keep_default_na=False)

This is useful when sharing code with users who do not have tdda installed, or when you want to hard-wire the read parameters.

The --for FILE flag sets the data path in CSVW or Frictionless output (CSVW requires a url):

tdda serial --to csvw foo.serial foo-metadata.json --for foo.csv

Using Metadata with Other tdda Tools

The colon syntax works with all tdda command-line tools that accept CSV files. Adding : to a CSV path tells tdda to find and use metadata automatically; adding :path specifies the metadata explicitly.

# tdda verify — validate data against constraints
tdda verify foo.csv elements118.tdda              # no metadata: types may be wrong
tdda verify foo.csv: elements118.tdda             # auto-find metadata
tdda verify foo.csv:foo.serial elements118.tdda   # explicit metadata

# tdda diff — compare two datasets
tdda diff old.csv: new.csv:                       # use metadata for both

# tdda discover — generate constraints
tdda discover foo.csv: foo.tdda

Without metadata, type mismatches cause spurious failures in verify and diff. A Group column read as string (because the null marker isn't recognised) fails type, min, max, and sign constraints — four failures from one formatting oversight.
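
The mechanism is easy to reproduce with a toy type-inference function. This is a deliberately simplified sketch, not the inference that tdda or Pandas performs, and infer_type is a hypothetical name. It shows how one unrecognised null marker turns an integer column into a string column.

```python
def infer_type(values, null_indicator=None):
    """Return 'int' if every non-null value parses as an integer, else 'string'."""
    for value in values:
        if value == null_indicator:
            continue  # recognised null: does not affect the type
        try:
            int(value)
        except ValueError:
            return "string"
    return "int"


group_column = ["1", "18", "-", "2"]  # "-" marks a missing Group
without_metadata = infer_type(group_column)
with_metadata = infer_type(group_column, null_indicator="-")
print(without_metadata, with_metadata)
```

Once the column is a string, every numeric constraint on it fails, which is why a single missing setting produces several reported failures.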

CSVW and Frictionless files work in the same positions as .serial files. A CSVW file that contains the data path can be specified instead of the CSV file itself:

tdda verify foo-metadata.json elements118.tdda

When you encounter a CSV with a .serial file alongside it, use csv_to_pandas('foo.csv:') or csv_to_polars('foo.csv:') rather than bare pd.read_csv.

When writing CSV data that will be shared or reused, use pandas_to_csv(df, 'output.csv:') to write a .serial alongside it. If the format has non-standard settings, pass them as keyword arguments and they will be recorded in the .serial.

When writing data-producing code, consider whether the CSV will be read later or by other code. If so, add auto_md_outpath=True. For LLM-generated code the overhead is negligible.

When reading an unfamiliar CSV, run tdda serial --generate foo.csv foo.serial, inspect the result, correct if needed, then use csv_to_pandas('foo.csv:'). If you can read the file directly, writing the .serial by hand is often faster.

When running tdda verify or tdda diff on CSV files, add the colon suffix: foo.csv: rather than foo.csv, if a suitably-named metadata file is available. Without it, type mismatches produce spurious failures.

CSVW and Frictionless files are accepted wherever .serial files are. If you receive data with CSVW or Frictionless metadata, pass it directly to csv_to_pandas or use the colon syntax.

Don't add metadata to simple throwaway scripts where the CSV is standard (comma, UTF-8, no dates, no unusual nulls) and nothing downstream will read it. The overhead is not worth it.

To share read code without requiring tdda, generate a standalone Python reader: tdda serial foo.serial foo_reader.py --to pd.r.

To convert between metadata formats (tdda.serial ↔ CSVW ↔ Frictionless), use tdda serial infile outfile.


Reference Testing with TDDA: A Guide for LLMs and Coding Bots

Posted on Wed 20 May 2026 in TDDA • Tagged with reference testing, LLMs, coding bots, gentest, pytest, unittest

Raw markdown for direct use in LLM context windows.

This post is addressed primarily to LLMs and coding agents—it was distilled by Claude from Test-Driven Data Analysis (Radcliffe, CRC Press, 2026; extra resources), the tdda source code, the documentation, and the man pages (also available as markdown). If you are a human who wants a dense, no-nonsense introduction to reference testing with TDDA, you are welcome too. It was written in response to repeated direct observations of coding agents—including Claude—using -W to make failing tests pass without verifying that the new output was correct.

What Is a Reference Test?

A reference test is a semantic regression test. 'Semantic' here means the comparison understands equivalence: outputs that differ in timestamps, row order, version strings, hostnames, file paths, or other legitimately variable content can still pass — only meaningful changes cause failures. You run your analytical process or function, capture its outputs, inspect them and decide they are correct, then store them as reference outputs. From that point on, the test reruns the process and checks that the outputs still match the reference. If something changes—a library update, a refactoring, a subtle bug—the test catches it.

The tdda.referencetest library makes this easy for Python code. It extends unittest (and works with pytest) and provides assertion methods for:

  • strings and text files (assertStringCorrect, assertTextFileCorrect)
  • data frames (assertDataFrameCorrect, assertDataFramesEquivalent)
  • binary files (assertBinaryFileCorrect)

It also provides tdda gentest, a command-line tool that watches your process run and generates a test suite automatically.

The key insight: reference testing is particularly well suited to analytical code because the outputs are often large, complex, and hard to specify in advance. You develop the process, carefully inspect and verify the output is right once, store it, and thereafter the test does the checking.

The -W (--write-all) Trap

Do not blindly run -W (or --write-all) merely to make failing tests pass.

This is the most important thing in this document.

When a reference test fails, tdda.referencetest tells you what differs between the actual output and the stored reference. It also offers you a way to update the reference to match the new output: pass -W (or --write-all under pytest) when running the tests.

Running -W overwrites the reference files with whatever the code currently produces. After this, the tests pass. This tells you nothing about whether the output is correct. The tests will pass even if the output is completely wrong, because you just told them that the wrong output is the new reference.

The correct workflow when tests fail is:

  1. Read the failure message. It tells you what changed.
  2. Run the diff command suggested in the failure output.
  3. Look at the actual differences. Are they expected? Are they correct?
  4. Only if you have verified the new output is correct: update the references. With unittest, use -1W (tagged tests only, recommended) or -W (all); with pytest, use --tagged --write-all -s or --write-all -s. See Running a Subset of Tests with Tags for the full commands. If you have a tame human to hand, this is a good moment to involve them—humans are often better at judging whether output is actually correct, and can get quite sweary when you overwrite correct reference results with nonsense.
  5. Run the tests again (without -W) to confirm they pass.
  6. Check the updated reference files into version control.

If you skip step 3 and go straight to -W, you have not tested anything. You have merely synchronized the reference to whatever the code happens to produce right now.

Safe use of -W: the git audit pattern

If the reference files are clean in git before you run -W, you can use git diff afterwards to inspect exactly what changed, and git checkout -- path/to/testdata/ to revert all reference changes at once if anything looks wrong. This makes -W a controlled and auditable operation—but only if the working tree was clean before you ran it. Always check before running -W, not after.
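
The pre-check can be automated. The sketch below is one possible way to do it and is not part of tdda; the function names are hypothetical. It asks git status --porcelain whether anything under the reference directory is modified or untracked, with the parsing kept separate so it can be checked without a repository.

```python
import subprocess


def is_clean(porcelain_output):
    """True if `git status --porcelain` reported no changes."""
    return porcelain_output.strip() == ""


def reference_files_clean(path):
    """Ask git whether anything under `path` is modified or untracked."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", path],
        capture_output=True, text=True, check=True,
    )
    return is_clean(result.stdout)


# Sample porcelain output: empty, then one modified reference file.
print(is_clean(""), is_clean(" M testdata/expected.txt\n"))
```

Call reference_files_clean('path/to/testdata/') before any rewrite and stop if it returns False.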

Unit-Enhanced Reference Tests

The test code and command-line flags differ between reference tests built on unittest and those built on pytest. The sections below cover the unittest variants first; see Writing Reference Tests with pytest for the pytest equivalents. Where flags differ, both are given.

Task                    unittest                 pytest
Run all tests           python tests.py -F       pytest tests/ --log-failures
Run only tagged tests   python tests.py -F -1    pytest tests/ --log-failures --tagged
Rewrite all references  -W                       --write-all -s
Rewrite tagged only     -1W                      --tagged --write-all -s

Full syntax and explanations for each are in the sections below; this table is a quick reference.

A partial structural defence against careless -W use is unit-enhanced reference tests: after the reference assertion, add one or more specific assertions about things that must be true regardless (shown here in unittest style):

def test_output(self):
    result = run_my_process(input_data)
    self.assertStringCorrect(result, 'expected.txt')
    # These survive a careless -W rewrite:
    self.assertIn('Total: 42 records', result)
    self.assertTrue(result.strip().endswith('OK'))

The reference assertion runs first. If it fails, tdda writes the actual output and suggests a diff command—the normal workflow. If you then carelessly rewrite with -W, the subsequent assertions will still fail if the output is wrong in ways they cover.

This is not a complete defence—you have to choose the assertions carefully—but it makes it much harder to accidentally accept a broken result. Choose assertions that reflect the core correctness property the test was designed to verify.

This pattern emerged from the author's direct experience of coding agents (including Claude) repeatedly using -W to make tests pass without verifying the results. It is recommended for any test where the reference output has semantic structure that can be spot-checked.

The -F (--log-failures) Flag

Always pass the log-failures flag when running tests. It logs the IDs of any failing tests to a timestamped file (YYYY-MM-DDTHHMMSS-failing-tests.txt) in your system temp directory (overridable with $TDDA_FAIL_DIR). This enables the tdda tag workflow: tdda tag reads the most recent such file and adds @tag to the failing tests, so you can re-run and regenerate references for just those tests.

Without the flag, no failures file is written and tdda tag has nothing to work with.
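
If you want to look at the failures file yourself, it can be located from the naming scheme described above. This is a sketch, not how tdda tag is implemented: latest_failures_file is a hypothetical helper, and nothing is assumed about the file's contents. It relies only on the fact that a YYYY-MM-DDTHHMMSS prefix sorts chronologically as text.

```python
import glob
import os
import tempfile


def latest_failures_file(directory=None):
    """Return the newest *-failing-tests.txt file in `directory`, or None."""
    if directory is None:
        directory = os.environ.get("TDDA_FAIL_DIR", tempfile.gettempdir())
    pattern = os.path.join(directory, "*-failing-tests.txt")
    # The YYYY-MM-DDTHHMMSS prefix sorts chronologically as plain text.
    candidates = sorted(glob.glob(pattern))
    return candidates[-1] if candidates else None


# Demonstration with two empty stand-in files in a scratch directory.
demo_dir = tempfile.mkdtemp()
for stamp in ("2026-05-19T101500", "2026-05-20T143201"):
    open(os.path.join(demo_dir, stamp + "-failing-tests.txt"), "w").close()
latest = latest_failures_file(demo_dir)
print(os.path.basename(latest))
```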

With unittest (running directly with Python)

Pass -F (or --log-failures):

python tests/test_mycode.py -F

With pytest

Pass --log-failures:

pytest tests/ --log-failures

Permanent default

To avoid passing the flag every time, add this to ~/.tdda.toml:

[referencetest]
log_failures = true

This modifies the user's global configuration. Consult your human before doing it.

The Kicker: the -W Problem is Not Restricted to TDDA

The anti-pattern described above — rewriting expected outputs to make tests pass without verifying the new output is correct — applies far beyond tdda.referencetest. LLM coding agents routinely treat passing tests as the goal rather than as evidence of correctness. Whether rewriting a reference file, updating a pytest snapshot, regenerating Jest snapshots, or changing a hardcoded expected value in an assertion, the same question applies first: is the new result actually correct?

Green tests after any kind of expected-value rewrite tell you nothing about correctness. They tell you only that the code now matches whatever you told it to match.

The correct workflow is the same regardless of framework:

  1. A test fails. Read the failure. What changed?
  2. Is the change correct, or is it a bug?
  3. Only if correct: update the expected value.
  4. If you're not sure: ask your human.

The specific value of tdda.referencetest is that it makes step 1 easy — the diff tooling is built in, and -F/tdda tag/-1W limit the blast radius. But the discipline is universal.

Running a Subset of Tests with Tags

To run only some tests, use the @tag decorator:

from tdda.referencetest import ReferenceTestCase, tag

class TestMyProcess(ReferenceTestCase):

    @tag
    def test_main_output(self):
        result = run_my_process()
        self.assertStringCorrect(result, 'expected_output.txt')

    def test_other_thing(self):
        ...

@tag can decorate individual test methods or entire test classes. The flags to run only tagged tests differ between unittest and pytest.

With unittest (running directly with Python)

python tests/test_mycode.py -F -1        # run only tagged tests
python tests/test_mycode.py -F -1W       # regenerate references for tagged tests only

-1W combines -1 and -W (--write-all). This is the safe way to regenerate, because it limits the blast radius to tests you have explicitly chosen and tagged.

With pytest

pytest tests/ --log-failures --tagged               # run only tagged tests
pytest tests/ --log-failures --tagged --write-all -s  # regenerate references for tagged tests only

Pass -s to prevent pytest from capturing output, so that tdda can report which reference files were written.

The full workflow with tdda tag

The -F / tdda tag / -1W workflow lets you rewrite only the references that actually failed, without manually deciding which tests to tag:

  1. Run tests with -F (or --log-failures) to record failing test IDs
  2. Run tdda tag to add @tag to those tests automatically
  3. Inspect the diffs to verify the new output is correct
  4. Run -1W (or --tagged --write-all -s) to rewrite only those references
  5. Run make untag (or the sed command below) to remove the tags

This is always preferable to bare -W, which rewrites every reference file regardless of whether the test failed.

Removing stale tags

Before adding new tags, remove any stale @tag decorators from previous sessions. There is usually a make untag target that does this, or you can use:

# macOS (BSD sed):
sed -i '' '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py

# Linux (GNU sed):
sed -i '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py

Writing Reference Tests with unittest

A minimal test file:

import os
from tdda.referencetest import ReferenceTestCase, tag

TESTDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                       'testdata')

class TestMyProcess(ReferenceTestCase):

    def test_output(self):
        result = run_my_process(input_data)
        self.assertStringCorrect(result,
                                 os.path.join(TESTDIR, 'expected.txt'))

    def test_dataframe(self):
        df = produce_dataframe()
        self.assertDataFrameCorrect(df,
                                    os.path.join(TESTDIR, 'expected.csv'))

if __name__ == '__main__':
    ReferenceTestCase.main()

When running under pytest, the if __name__ == '__main__': block is simply ignored—the same test file works with both runners unchanged.

Run it:

python test_myprocess.py -F           # run all tests
python test_myprocess.py -F -1        # run only tagged tests
python test_myprocess.py -F -1W       # regenerate references for tagged tests

The first time you run with -1W after writing a new test (and tagging it, since -1W only touches tagged tests), it writes the reference file. Subsequent runs compare against it.

After writing references with -1W, always inspect the files that were written. The fact that the test now passes means only that the reference matches the output. It says nothing about whether either is correct.

Writing Reference Tests with pytest

The same test classes work under pytest, with different flags:

pytest tests/ --log-failures                           # run all tests
pytest tests/ --log-failures --tagged                  # run only tagged tests
pytest tests/ --log-failures --tagged --write-all -s   # regenerate references for tagged tests

Note:

  • Use --write-all instead of -W.
  • Use --tagged instead of -1.
  • Pass -s to prevent pytest from capturing output, so that tdda can report which reference files were written.
  • The short flags -W and -1 are tdda extensions; they only work when running the test file directly with Python, not under pytest.

Assertion API: Text and Strings

assertStringCorrect(string, ref_path, ...) Check an in-memory string against a reference text file.

assertTextFileCorrect(actual_path, ref_path, ...) Check a text file on disk against a reference text file.

assertTextFilesCorrect(actual_paths, ref_paths, ...) Check multiple text files against corresponding reference files.

All three share these optional parameters for handling variable output:

Parameter Effect
lstrip=True Strip leading whitespace from each line before comparing
rstrip=True Strip trailing whitespace from each line before comparing
ignore_substrings=['foo','bar'] Ignore any line in the expected file containing one of these substrings; the corresponding actual line can be anything
ignore_patterns=[r'pattern'] Lines differing only in substrings matching these regexes pass; text outside the match must be identical in both
remove_lines=['foo'] Remove lines containing these substrings from both actual and expected before comparing
preprocess=fn Apply fn(list_of_lines) to both actual and expected (as lists of strings) before comparing
max_permutation_cases=N Pass if lines differ only in order, up to N permutations; None = unlimited

ignore_substrings—ignore whole lines by substring

Lines in the expected output containing the substring are skipped. The match is against the expected file only—the actual output can have anything on those lines (or nothing):

# Reference file contains:
#   Copyright (c) Stochastic Solutions Limited, 2016
#   Version 0.0.0
# Actual output has current year and version—but we don't care:
self.assertStringCorrect(actual, 'expected.html',
    ignore_substrings=['Copyright', 'Version'])

ignore_patterns—ignore variable substrings within a line

Lines pass if they differ only in parts matching the regex. Everything outside the match must be identical in both files:

# Actual:   "Generated: 2026-05-20T14:32:01 by pipeline v2.3.1"
# Expected: "Generated: 2024-01-15T09:00:00 by pipeline v1.0.0"
# Both lines still match with:
self.assertStringCorrect(actual, 'expected.txt',
    ignore_patterns=[
        r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}',
        r'v\d+\.\d+\.\d+',
    ])

ignore_patterns is stricter than ignore_substrings: the non-matching parts of each line must agree exactly, so you cannot accidentally mask a real change in the surrounding text.
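
The behaviour described above can be modelled in a few lines. This is a simplified model for intuition, not tdda's implementation, which may differ in detail; lines_match is a hypothetical name. It blanks out every pattern match on both sides and then requires the remainder to be identical.

```python
import re


def lines_match(actual, expected, ignore_patterns):
    """True if the lines are identical once every pattern match is blanked out."""
    for pattern in ignore_patterns:
        actual = re.sub(pattern, "", actual)
        expected = re.sub(pattern, "", expected)
    return actual == expected


patterns = [r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", r"v\d+\.\d+\.\d+"]
expected = "Generated: 2024-01-15T09:00:00 by pipeline v1.0.0"
same_text = lines_match(
    "Generated: 2026-05-20T14:32:01 by pipeline v2.3.1", expected, patterns)
changed_text = lines_match(
    "Generated: 2026-05-20T14:32:01 by workflow v2.3.1", expected, patterns)
print(same_text, changed_text)
```

The second comparison fails because "workflow" lies outside both patterns, which is exactly the kind of change ignore_substrings would have hidden.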

remove_lines—strip lines from both files

Lines containing the substring are removed from both actual and expected before comparing. Use this for optional or ephemeral lines that should not appear in the reference at all:

# Both files have lines like "WARNING: cache miss" that are
# present sometimes and absent other times:
self.assertStringCorrect(actual, 'expected.txt',
    remove_lines=['WARNING: cache miss'])

Unlike ignore_substrings, remove_lines strips from both sides, so the reference file also need not contain these lines.

preprocess—transform both files before comparing

Takes a function that accepts a list of strings (lines) and returns a transformed list. Applied to both actual and expected:

def strip_timestamps(lines):
    # remove leading timestamp prefix "2026-05-20 14:32:01 " from each line
    return [line[20:] if len(line) > 20 else line for line in lines]

self.assertStringCorrect(actual, 'expected.txt',
    preprocess=strip_timestamps)

max_permutation_cases—allow reordered lines

Pass if the lines are a permutation of each other, up to the given number of permutations. Use None for unlimited:

# Output order is non-deterministic, but the set of lines is fixed:
self.assertStringCorrect(actual, 'expected.txt',
    max_permutation_cases=None)
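
Conceptually, the unlimited case amounts to comparing the two outputs as multisets of lines. The sketch below illustrates that idea only; it is not tdda's algorithm and does not model the limit N. The name same_lines_any_order is hypothetical.

```python
from collections import Counter


def same_lines_any_order(actual, expected):
    """True if both texts contain the same lines the same number of times."""
    return Counter(actual.splitlines()) == Counter(expected.splitlines())


reordered = same_lines_any_order("b\na\nc\n", "a\nb\nc\n")
different = same_lines_any_order("a\na\nc\n", "a\nb\nc\n")
print(reordered, different)
```

Counting occurrences matters: a duplicated line in place of a missing one is a real difference, not a reordering.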

Assertion API: DataFrames

The DataFrame assertion methods work with Pandas 2.x and 3.x (all three backends: numpy_nullable, pyarrow, and original) and with Polars. You can even compare DataFrames across engines—e.g. a Pandas actual against a Polars reference—with the engine parameter if needed.

assertDataFramesEquivalent(df, ref_df, ...) Compare two in-memory DataFrames (Pandas or Polars).

assertDataFrameCorrect(df, ref_path, ...) Compare an in-memory DataFrame against a reference file (CSV or Parquet).

assertStoredDataFrameCorrect(actual_path, ref_path, ...) Compare two DataFrames both stored on disk.

assertStoredDataFramesCorrect(actual_paths, ref_paths, ...) Compare multiple pairs of on-disk DataFrames.

check_data and check_types—exclude columns

The most common use is excluding columns whose values are legitimately variable (random seeds, run IDs, timestamps):

# Exclude the 'random' column from both value and type checks:
columns = self.all_fields_except(['random'])
self.assertDataFrameCorrect(df, 'expected.csv',
                            check_data=columns,
                            check_types=columns)

check_data, check_types, and check_order all accept the same forms:

  • None or True: check all fields (default)
  • False: skip entirely
  • a list of field names to check
  • a function taking a DataFrame and returning a list of field names

sortby—sort before comparing

Use when row order is non-deterministic:

self.assertDataFrameCorrect(df, 'expected.csv',
                            sortby=['country', 'date'])

condition—filter rows before comparing

Use when only a subset of rows is relevant to the test:

# Only compare rows where status is 'complete':
self.assertDataFrameCorrect(df, 'expected.csv',
    condition=lambda df: df['status'] == 'complete')

precision—floating-point tolerance

Default is 7 decimal places. Loosen it when values come via CSV (which can lose precision):

self.assertDataFrameCorrect(df, 'expected.csv', precision=5)
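
The effect of a lossy text round trip can be seen without any DataFrame library. In the sketch below, comparison by rounding both values is an assumption made for illustration; tdda's exact tolerance rule may differ. The six-decimal format stands in for a CSV written with limited precision, and agree_to is a hypothetical name.

```python
def agree_to(a, b, places):
    """Compare two floats after rounding both to `places` decimal places."""
    return round(a, places) == round(b, places)


computed = 1 / 3
stored = float(format(computed, ".6f"))  # as if written to CSV with 6 decimals

exact = computed == stored
at_default = agree_to(computed, stored, 7)
loosened = agree_to(computed, stored, 5)
print(exact, at_default, loosened)
```

Only the loosened comparison treats the two values as equal, so choose a precision no finer than what the stored file actually preserves.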

type_matching—dtype strictness

  • 'strict' (default for Parquet): dtypes must be identical
  • 'medium' (default for CSV): same underlying type (int, float, datetime) but different bit width or nullability allowed
  • 'loose': anything Pandas can compare

# CSV round-trips can change int64 to float64—use medium:
self.assertDataFrameCorrect(df, 'expected.csv', type_matching='medium')

fuzzy_nulls—treat different null types as equal

# np.nan and None treated as equivalent:
self.assertDataFramesEquivalent(df, ref_df, fuzzy_nulls=True)

engine—Pandas or Polars

Inferred automatically from the DataFrames. Only needed when comparing across types (a Pandas actual against a Polars reference or vice versa):

self.assertDataFramesEquivalent(pandas_df, polars_df, engine='pandas')

tdda diff—Understanding DataFrame Failures

When a DataFrame assertion fails, the failure message suggests one or more diff commands. For tabular data, it often suggests both a raw diff and a tdda diff:

Compare with:
    diff /tmp/actual-expected.csv /path/to/testdata/expected.csv
Compare with:
    tdda diff /tmp/actual-expected.csv /path/to/testdata/expected.csv

tdda diff uses the same comparison logic as the assertion methods and produces a structured summary: which columns differ, how many rows, and a table showing the differing values side by side. It is much easier to read than raw diff for anything beyond a handful of rows. Always prefer it for DataFrame failures. Example output:

Columns with differences: 1 / 12
Rows with differences:    3 / 1000

Values:
  Row   Column    Actual    Expected
   42   revenue   1500.50   1500.00
  108   revenue      0.00       NaN
  731   revenue    999.99   1000.00

It accepts the same field-selection flags as the assertion methods:

tdda diff actual.csv expected.csv --xfields random,run_id
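For intuition about what a cell-level summary reports, here is a rough standard-library sketch. It is not how tdda diff is implemented: it compares cells as plain strings, assumes both files have the same columns and row count, and the function name cell_differences and the sample data are made up for illustration.

```python
import csv
import io

def cell_differences(actual_text, expected_text):
    """Return (row_index, column, actual, expected) for every differing cell.

    Assumes both CSVs have the same columns and the same number of rows.
    """
    actual = list(csv.DictReader(io.StringIO(actual_text)))
    expected = list(csv.DictReader(io.StringIO(expected_text)))
    return [
        (i, column, a[column], e[column])
        for i, (a, e) in enumerate(zip(actual, expected))
        for column in a
        if a[column] != e[column]
    ]

# Hypothetical data: two revenue cells differ.
actual_csv = "id,revenue,region\n1,1500.50,UK\n2,20.00,FR\n3,999.99,DE\n"
expected_csv = "id,revenue,region\n1,1500.00,UK\n2,20.00,FR\n3,1000.00,DE\n"

diffs = cell_differences(actual_csv, expected_csv)
columns = sorted({column for _, column, _, _ in diffs})
rows = sorted({row for row, _, _, _ in diffs})
print(f'Columns with differences: {len(columns)} / 3')
print(f'Rows with differences:    {len(rows)} / 3')
for row, column, a, e in diffs:
    print(row, column, a, e)
```

A string comparison like this would flag 1500.0 against 1500.00 as different, which is one reason to prefer a type-aware tool for real data.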

Assertion API: Binary Files

assertBinaryFileCorrect(actual_path, ref_path)

Check that a binary file is byte-for-byte identical to a reference file. There are no options for partial matching; if you need that, extract the relevant data and use a string or DataFrame assertion instead.
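As a point of reference, a byte-for-byte check can be sketched with the standard library alone. This is only an illustration of the idea, not tdda's implementation; files_identical is a hypothetical helper.

```python
import filecmp
import os
import tempfile

def files_identical(path_a, path_b):
    """True only if both files contain exactly the same bytes."""
    # shallow=False forces a content comparison rather than trusting os.stat().
    return filecmp.cmp(path_a, path_b, shallow=False)

with tempfile.TemporaryDirectory() as tmp:
    ref = os.path.join(tmp, 'ref.bin')
    same = os.path.join(tmp, 'same.bin')
    changed = os.path.join(tmp, 'changed.bin')
    payloads = [(ref, b'\x00\x01\x02'), (same, b'\x00\x01\x02'), (changed, b'\x00\x01\x03')]
    for path, payload in payloads:
        with open(path, 'wb') as f:
            f.write(payload)
    identical_result = files_identical(same, ref)
    changed_result = files_identical(changed, ref)

print(identical_result, changed_result)  # True False
```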

Generating Tests Automatically with Gentest

If you have a command-line process—a script, a shell command, an R program—tdda gentest can generate a test suite for it:

tdda gentest 'python my_analysis.py input.csv' testsuite.py

Gentest runs the command multiple times, captures all outputs (stdout, stderr, exit code, any files written), detects which parts vary between runs, and writes a test script that checks the stable parts. The generated script uses tdda.referencetest and can be run and maintained like any other reference test.
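The run-twice-and-compare idea can be sketched in a few lines of standard-library Python. This is a simplified illustration, not Gentest's algorithm: it looks only at stdout, compares line by line, and the command under test is a made-up one-liner.

```python
import subprocess
import sys

def run(command):
    """Run a command (as an argument list) and capture stdout, stderr and exit code."""
    done = subprocess.run(command, capture_output=True, text=True)
    return done.stdout.splitlines(), done.stderr.splitlines(), done.returncode

def stable_lines(first, second):
    """Keep lines identical in both runs; mark varying lines as None.

    zip() stops at the shorter output, so extra trailing lines are ignored.
    """
    return [a if a == b else None for a, b in zip(first, second)]

# Hypothetical command under test: prints a fixed line and a time-dependent one.
script = "import time; print('rows: 3'); print('ns:', time.time_ns())"
command = [sys.executable, '-c', script]

out1, err1, code1 = run(command)
out2, err2, code2 = run(command)
print(stable_lines(out1, out2))  # e.g. ['rows: 3', None]
```

A generated test would then assert on the stable line and skip or pattern-match the varying one.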

Inspect the generated test and the reference outputs before trusting them. Gentest is good at generating structurally correct tests; you still need to verify that the reference outputs are actually correct.

The Reference Test Checklist

Create at least one reference test for every analytical process you write.
Run tests before making changes, so you know the baseline.
Run tests after making changes, before assuming they worked.
When a test fails, read the diff before doing anything else.
Never run -W without first verifying the new output is correct.
Prefer -1W (or --tagged --write-all -s) over bare -W—rewrite only the references that actually failed.
Use -F and tdda tag to automatically tag failing tests for targeted reruns and rewrites.
After writing references, inspect the files. Tests passing after -W or --write-all is not evidence of correctness.
Ensure reference files are clean in git before running -W, so you can use git diff to review changes and revert with git checkout -- testdata/ if needed.
Consider unit-enhanced reference tests for anything with checkable semantic structure.
Add a regression test for every bug you fix.
1 test vs. 0 tests is a bigger difference than 100 vs. 1.


TDDA: The Book, the 3.0 Library, and the PyData London 2026 Tutorial

Posted on Tue 19 May 2026 in TDDA • Tagged with library, talk, book

This blog has been quite quiet, but there is a great deal of news and it may be less quiet for a while.

The Book

Today, 19th May 2026, sees the world-wide release of Test-Driven Data Analysis, from CRC Press.

The cover of the book Test-Driven Data Analysis by Nicholas J. Radcliffe. It is published by Chapman and Hall, part of CRC Press, from Taylor & Francis Group, and is part of the DATA SCIENCE SERIES. The cover is black with mostly white text and a white graphic. The graphic is a 3-row by 4-column grid of squares. Each square contains a number of dots laid out on a regular 32x32 grid. The top-left square has 1024 dots (“full”) and working along each row in turn, the number of dots roughly halves each time, apparently at random (and, actually, pseudo-randomly). The last row’s boxes have six, two, two, and one dot.

It is available from all good booksellers and all sellers of good books, and until 30th June 2026 the code 26SMA1 will give a 20% discount from the publisher's site.

The book covers:

  • the TDDA methodology
    • including areas not obviously amenable to software support, such as errors of interpretation, errors of applicability, errors of process, and errors of judgement
  • the TDDA command-line tools for
    • data validation,
    • reference-test generation with Gentest (tests for code in any language),
    • a diff tool for on-disk data frames (as parquet files and flat files),
    • tools for working with the tdda.serial format and also with CSVW (CSV on the Web) and Frictionless.
  • Reference testing with tdda.referencetest under unittest or pytest
  • Test-Driven Document Development (TDDD)
  • APIs for all functionality

Resources from the book are available at book.tdda.info, including

  • 22 Checklists
  • All figures
  • Glossary
  • Data Profiles
  • Data Dictionaries
  • TDDD tests for the book.

Examples from the book are available from the tdda library by using the tdda command:

tdda examples book

The whole of TDDA is really built around the encapsulation of the data-analysis cycle shown below, and the diagram shows how the book covers these ideas.

The main part of the diagram consists of six circles from left to right. The first five circles have failure-mode text under them and an error class below that.

  1. CHOOSE APPROACH. Failure: 'Fail to understand data, problem domain, or methods'. ERROR OF INTERPRETATION (error of formulation). Ch 13.
  2. DEVELOP ANALYTICAL PROCESS. Failure: 'Mistakes during coding'. ERROR OF IMPLEMENTATION (bug). Ch 9-12.
  3. RUN ANALYTICAL PROCESS. Failure: 'Use the software incorrectly'. ERROR OF PROCESS (operator error). Ch 16.
  4. PRODUCE ANALYTICAL RESULTS. Failure: 'Mismatch between development data or assumptions and deployment data'. ERROR OF APPLICABILITY (category error). Ch 1-7 & 17.
  5. INTERPRET ANALYTICAL RESULTS. Failure: 'Misinterpret the results'. ERROR OF INTERPRETATION (communication error). Ch 14 & 15.
  6. 'First, Do No Harm'. ERROR OF JUDGEMENT. Ch 17.

Arrows lead to FAILURE and SUCCESS boxes. Remedies and book chapters sit underneath the main diagram.

The TDDA Library, Version 3.0

Top line: three machines illustrating

  1. constraint discovery and data validation: an input hopper takes training data and produces constraints, or training data + constraints to produce data validations at the output chute;
  2. Rexpy, which takes strings in its input hopper and produces regular expressions at the output chute;
  3. TDDA Gentest, which takes code in the input hopper and produces a Python reference-test script as output.

Bottom line:

  4. tdda diff, which compares data in flat files and parquet files to detect (semantic) differences;
  5. tdda.serial, which is a format for describing flat-file formats and a suite of tools for working with tdda.serial, CSVW, and Frictionless;
  6. tdda.referencetest, for semantic testing of complex analytical results.

Version 3.0 of the library and command-line tools is a major upgrade.

All the main features have upgrades:

  • Data validation using constraints, which can be generated from training data.

  • Inference of regular expressions from example strings.

  • Automatic generation of tests for almost any non-GUI code in any language (Gentest).
    "Gentest writes tests so you don't have to."™

  • Enhanced test support for complex results in both Python's unittest and in pytest with reference testing.

New features include:

  • Support for Pandas 3.0, including all three backends (original, numpy_nullable, and pyarrow).

  • Support for Polars DataFrames in most areas of the library.

  • Comprehensive Parquet support, replacing feather format.

  • tdda diff: find and visualize differences between datasets in flat files (like CSV files) and parquet files, with control over specificity and scope.

  • Flat-file metadata support: the new tdda.serial format allows the format of CSV and other flat files to be described for accurate reading across libraries. This includes inference of flat-file formats, Python code generation, helper functions for reading and writing flat files with metadata, and conversion between tdda.serial, CSVW (CSV on the Web), and Frictionless.

  • Text utilities for Unicode, including glyph counting and extended normalization forms beyond canonical composition and decomposition (NFC, NFD), and kompatibility normalization (NFKC and NFKD). Form NFTK performs further kompatibility normalization including accent stripping.

  • Man pages for all commands

  • Upgraded documentation for command line tools and the API.

PyData London TDDA Tutorial, 5th June 2026, 14:10

I'll be giving a 90-minute hands-on tutorial on TDDA on 5th June 2026 at PyData London. Do come along if you can. PyData is always great, for experts and novices and all levels of technical interest and proficiency. It would be great to see you there.

Get tickets from PyData.

And if you have something to share, prepare a 5-minute Lightning Talk. They are always a highlight of the conference.


Test-Driven Document Development

Posted on Tue 02 September 2025 in TDDA • Tagged with TDDD

Summary

Computational documents attempt to guarantee that results included within them—such as graphs—correspond to the code and data claimed to generate them. They typically achieve this by generating the outputs from the code at the time the document is generated or viewed. This solves significant problems, including those of code rusting (exhibiting changed behaviour) and of unintentional inclusion of stale, incorrect, or unvalidated results. There is, however, a danger of what I term co-rusting, whereby the code and its outputs drift away from correctness (rust) together, without the author realizing. This is likely if the code continues to generate output (i.e., does not crash or report an error).

Computational documents are an important part of reproducible research, within which the main approach to avoiding co-rusting tends to be the use of reproducible environments, which aim to prevent rusting by pinning down as much of the computational environment as possible.

Test-Driven Document Development (TDDD) builds on computational documents by adding automated tests that fail when results change (materially). If these tests are run as part of the build process for the document, the possibility of co-rusting is reduced or eliminated. TDDD can be viewed as the application of test-driven data analysis (TDDA) to the process of document creation, essentially considering the generation of a document as an analytical process that should be supported by reference tests.

The tests can be created by hand, but the Gentest functionality of the tdda tool turns out to be powerful for implementing the tests needed by TDDD, whatever language is used to generate the results.

Background: Computational Documents

Computational documents include one or more results generated by computer code, and provide some guarantee that each result matches its generating code. This is usually achieved by including the code in the document and generating the output either as part of document production (compilation, e.g., Quarto, or in a more limited way, cog) or on-the-fly, for computational notebooks (interpretation, like Jupyter Notebooks / JupyterLab and marimo).

Here is a simple Quarto computational document that calculates the number of potential UK postcodes as defined by a regular expression describing valid ones.[1] This number is quoted in a book I am writing on TDDA. Prior to today, it was pasted into the book by copying the output from an interactive Python session where I calculated it. I probably inserted the thousand separators by hand (another error-prone process). Today I not only changed the number to be included from a calculation when the book is compiled, but also added reference tests to detect if it changes. (source)

---
title: "Quarto Postcodes (inline)"
format:
  html:
    code-fold: true
  pdf:
    toc: false
jupyter: python3
---

```{python}
from letters import nL

RE = r'^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$'

def n_poss_postcodes_for_re():
    """
    Number of strings matching:
      ^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$
    """
    n_postal_areas = nL + nL * nL  # 1 or two letters
    n_postal_districts = 10 + 100  # Any one or two digit number
                                   # 0 and 0x aren't used, but match the regex
    n_subdistricts = nL + 1        # Not all letters are used,
                                   # and only for some London codes,
                                   # but for our regex...
                                   # The +1 is for ones not using a subdistrict

    n_outcodes = n_postal_areas * n_postal_districts * n_subdistricts
    n_incodes = 10 * nL * nL       # Digit then two letters
    n_postcodes = n_outcodes * n_incodes

    return n_postcodes


if __name__ == '__main__':
    n = n_poss_postcodes_for_re()
```
The number of postcode-like strings matching

    `{python} RE`

is `{python} f'{n:,}'`

This document is written in a dialect of Markdown defined by Quarto. It has a header at the top, containing metadata, then a fenced Python block containing the code (which defines the two variables used later in the document), and some text that uses those two variables (RE and n) to say how many postcodes match. It has a confected dependency on another Python file, letters.py, defining the number of letters, nL, in English:

nL = 26

It can be compiled with:

    quarto render postcodes1.qmd

producing this page and this document. This is a rather simple computational document, which shows the code and one important output number that is “guaranteed” to be generated from the code shown. It would be usual to include graphs or tables of some sort, but this is a minimal example, so I really wanted only a single number.

The version of the code actually used to generate the number in the book does not import nL from letters.py, but includes the line nL = 26 in the main program. That's because I'm not trying to make it fail in the book. I've written it this way for the post to give me an easy way to demonstrate co-rusting, which is an entirely real phenomenon. A change in a dependency is a common reason for rusting. (If you do not believe in code rusting or co-rusting, try reading Why Code Rusts; if that doesn't convince you, this article may not be for you.)
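As a sanity check on the counting argument in the code above, here is a small sketch that compares the closed-form count with brute-force enumeration on a deliberately tiny alphabet (two letters and one digit), where enumeration is feasible. The function names are my own for this illustration, and the reduced regular expression is an assumption that mirrors the structure of the real one.

```python
import itertools
import re

def n_matching(n_letters, n_digits):
    """Closed-form count for ^[L]{1,2}[D]{1,2}[L]? [D][L]{2}$."""
    areas = n_letters + n_letters ** 2          # one or two letters
    districts = n_digits + n_digits ** 2        # one or two digits
    subdistricts = n_letters + 1                # optional trailing letter
    incodes = n_digits * n_letters ** 2         # digit then two letters
    return areas * districts * subdistricts * incodes

def brute_force(letters, digits):
    """Count matches by enumerating every candidate string of length 6 to 9."""
    pattern = re.compile(
        f'[{letters}]{{1,2}}[{digits}]{{1,2}}[{letters}]? [{digits}][{letters}]{{2}}'
    )
    alphabet = letters + digits + ' '
    return sum(
        1
        for length in range(6, 10)
        for chars in itertools.product(alphabet, repeat=length)
        if pattern.fullmatch(''.join(chars))
    )

# Two letters and one digit keep the search space small (about 350,000 strings).
print(brute_force('AB', '0'), n_matching(2, 1))  # 144 144
print(f'{n_matching(26, 10):,}')                 # 14,094,194,400
```

The product formula works because the character classes alternate (letters, digits, letter, space, digit, letters), so each matching string can be split into its parts in only one way.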

Writing Tests For the Code

We will begin by writing tests for essentially the same code, just written as a standalone Python program rather than embedded in a Quarto document.

Here is the same code as an actual Python script, postcodes.py, together with some slightly different behaviour after calling the postcode-counting function.

import json
from letters import nL
from tdda.utils import dict_to_tex_macros

RE = r'^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$'

def n_poss_postcodes_for_re():
    """
    Number of strings matching:
      ^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$
    """
    n_postal_areas = nL + nL * nL  # 1 or two letters
    n_postal_districts = 10 + 100  # Any one or two digit number
                                   # 0 and 0x aren't used, but match the regex
    n_subdistricts = nL + 1        # Not all letters are used,
                                   # and only for some London codes,
                                   # but for our regex...
                                   # The +1 is for ones not using a subdistrict

    n_outcodes = n_postal_areas * n_postal_districts * n_subdistricts
    n_incodes = 10 * nL * nL       # Digit then two letters
    n_postcodes = n_outcodes * n_incodes

    return n_postcodes


if __name__ == '__main__':
    n = n_poss_postcodes_for_re()
    d = {'n': n, 'n_str': f'{n:,}', 'postcodeRE': RE}
    json_path = 'postcodes.json'
    with open(json_path, 'w') as f:
        json.dump(d, f, indent=4)
    dict_to_tex_macros(d, 'postcodes-defs.tex', verbose=False)

If we run this code, it produces no output but writes two files. The first is a JSON file, postcodes.json:

{
    "n": 14094194400,
    "n_str": "14,094,194,400",
    "postcodeRE": "^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$"
}

We have chosen to write into this the values we might want in the document (in this case, the number both as a number and as a formatted string, as well as the relevant regular expression).

There's a second file, postcodes-defs.tex, which we will use later when we use LaTeX as a TDDD engine. This contains the same values, but now as TeX macros:

\def\n{14094194400}
\def\nStr{14,094,194,400}
\def\postcodeRE{\^[A-Z]\{1,2\}[0-9]\{1,2\}[A-Z]? [0-9][A-Z]\{2\}\$}
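To make the mapping from dictionary to macros concrete, here is a rough sketch of that kind of transformation. It is inferred purely from the output shown above (snake_case keys become camelCase macro names, and the characters ^ { } $ are backslash-escaped); it is not the implementation of dict_to_tex_macros, and all the function names and sample values are hypothetical.

```python
import re

def macro_name(key):
    """Turn a snake_case key into a camelCase TeX macro name (n_str -> nStr)."""
    first, *rest = key.split('_')
    return first + ''.join(part[:1].upper() + part[1:] for part in rest)

def tex_escape(value):
    """Backslash-escape the characters ^ { } $ in the string form of value."""
    return re.sub(r'([\^{}$])', r'\\\1', str(value))

def to_tex_macros(d):
    """Render each key/value pair as one TeX definition line."""
    return [f'\\def\\{macro_name(k)}{{{tex_escape(v)}}}' for k, v in d.items()]

# Hypothetical values, chosen only to exercise the naming and escaping rules.
for line in to_tex_macros({'n_rows': 1000, 'idRE': '^[A-Z]{2}$'}):
    print(line)
```

A real implementation would need to handle more cases, such as keys containing digits (which are not allowed in TeX macro names) and other special characters in values.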

If you have the tdda library installed, you have as part of it a tool called Gentest, which can write tests in Python for essentially any command-line program, script, or command, in any language.

The line below instructs Gentest to generate tests for running the Python program postcodes.py.

$ tdda gentest 'python postcodes.py'

This produces the following output:


Running command 'python postcodes.py' to generate output (run 1 of 2).
Saved (empty) output to stdout to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/STDOUT.
Saved (empty) output to stderr to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/STDERR.
Copied $(pwd)/postcodes-defs.tex to $(pwd)/ref/python_postcodes_py/postcodes-defs.tex
Copied $(pwd)/postcodes.json to $(pwd)/ref/python_postcodes_py/postcodes.json

Running command 'python postcodes.py' to generate output (run 2 of 2).
Saved (empty) output to stdout to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/2/STDOUT.
Saved (empty) output to stderr to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/2/STDERR.
Copied $(pwd)/postcodes-defs.tex to $(pwd)/ref/python_postcodes_py/2/postcodes-defs.tex
Copied $(pwd)/postcodes.json to $(pwd)/ref/python_postcodes_py/2/postcodes.json

Test script written as /Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py
Command execution took: 0.44s

SUMMARY:

Directory to run in:        /Users/njr/blogs/tdda-code/tddd-postcodes
Shell command:              python postcodes.py
Test script generated:      /Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py
Reference files:
    $(pwd)/postcodes-defs.tex
    $(pwd)/postcodes.json
Check stdout:               yes (was empty)
Check stderr:               yes (was empty)
Expected exit code:         0
Clobbering permitted:       yes
Number of times script ran: 2
Number of tests written:    6

If you run tdda gentest without specifying a command, you get a wizard, which asks what command to run and also gives you various other options that can alternatively be passed on the command line.

The output is intended to be self-explanatory, but to elaborate, what Gentest has done is:

  • Run the command twice;
  • Recorded what was printed (both on the normal output stream, stdout, and also, separately, what was printed on the error output stream, stderr);
  • Taken copies of any files created (in our case, the .json and .tex files);
  • Noted the exit code from the program (here 0, indicating successful completion);
  • Looked to see whether there were any differences between the two runs, and whether anything in the output looked highly dependent on the environment or context. Here nothing did, but if it had, Gentest would have generated tests that attempted to factor out things that look as if they might vary from run to run (examples include timestamps, run durations, hostnames etc.);
  • Written a test script, test_python_postcodes_py.py. When run, this executes the command under test and compares its behaviour and outputs to those it collected when generating the tests. The tests only pass if the behaviour and outputs are identical other than anything Gentest decided was not fixed. In this case, there was nothing that Gentest classed as not fixed.
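To give a feel for what "factoring out" run-dependent content can mean, here is a minimal sketch that masks timestamps and durations before comparing two outputs. The patterns and sample strings are my own assumptions for illustration; Gentest's actual detection and the assertions it generates may work quite differently.

```python
import re

# Hypothetical patterns for parts of the output that change from run to run.
VOLATILE = [
    (re.compile(r'\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}'), '<TIMESTAMP>'),
    (re.compile(r'\b\d+(\.\d+)?s\b'), '<DURATION>'),
]

def normalize(text):
    """Replace volatile fragments with fixed placeholders before comparing."""
    for pattern, placeholder in VOLATILE:
        text = pattern.sub(placeholder, text)
    return text

run1 = 'Finished 2025-09-02 10:15:01 in 0.44s; 6 tests written'
run2 = 'Finished 2025-09-02 10:15:09 in 0.47s; 6 tests written'

print(run1 == run2)                        # False
print(normalize(run1) == normalize(run2))  # True
```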

The code generated is in test_python_postcodes_py.py.

If we run this test script, thus:

$ python test_python_postcodes_py.py

we get

......
----------------------------------------------------------------------
Ran 6 tests in 0.439s

OK

which shows that our tests have passed, meaning that the output is unchanged. I'm not going to go through the tests, but by all means look at them.

Simulated Co-Rusting

Let's look at what happens if our code's behaviour changes as a result of rusting. We will simulate this by replacing letters.py with letters52.py, which records the number of upper- and lower-case letters in English.[2]

    cp letters52.py letters.py

If we do this and run the tests again, we get two test failures and some suggested diff commands to run to understand them:

..2 lines are different, starting at line 1
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes-defs.tex /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes-defs.tex

F2 lines are different, starting at line 2
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes.json /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes.json

F..
======================================================================
FAIL: test_postcodes_defs_tex (__main__.Test_PYTHON_POSTCODES.test_postcodes_defs_tex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py", line 52, in test_postcodes_defs_tex
    self.assertTextFileCorrect(os.path.join(self.cwd, 'postcodes-defs.tex'),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               os.path.join(self.refdir, 'postcodes-defs.tex'),
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               encoding='ascii')
                               ^^^^^^^^^^^^^^^^^
AssertionError: False is not true : 2 lines are different, starting at line 1
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes-defs.tex /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes-defs.tex


======================================================================
FAIL: test_postcodes_json (__main__.Test_PYTHON_POSTCODES.test_postcodes_json)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py", line 57, in test_postcodes_json
    self.assertTextFileCorrect(os.path.join(self.cwd, 'postcodes.json'),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               os.path.join(self.refdir, 'postcodes.json'),
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               encoding='ascii')
                               ^^^^^^^^^^^^^^^^^
AssertionError: False is not true : 2 lines are different, starting at line 2
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes.json /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes.json


----------------------------------------------------------------------
Ran 6 tests in 0.443s

FAILED (failures=2)

and if we run the second suggested diff command (on the JSON files), we see:

2,3c2,3
<     "n": 434464659200,
<     "n_str": "434,464,659,200",
---
>     "n": 14094194400,
>     "n_str": "14,094,194,400",

This shows us that, with the changed dependency, the code is now producing well over 400 billion potential postcodes, rather than the 14 billion we expected. (The lack of a newline at the end of stdout is not significant, and is ignored by the test.) So, as we hoped, the test detected the rusting of our code, and the co-rusting of its output.
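The two figures in the diff can be reproduced directly from the counting formula. This sketch restates the calculation from postcodes.py with the number of letters as a parameter (the function name n_poss_postcodes is mine, not from the script):

```python
def n_poss_postcodes(n_letters):
    """Count strings matching ^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$
    for an alphabet of n_letters letters and the ten decimal digits."""
    n_outcodes = (n_letters + n_letters ** 2) * (10 + 100) * (n_letters + 1)
    n_incodes = 10 * n_letters ** 2
    return n_outcodes * n_incodes

print(f'{n_poss_postcodes(26):,}')  # 14,094,194,400
print(f'{n_poss_postcodes(52):,}')  # 434,464,659,200
```

Doubling the alphabet multiplies the count by roughly 31, because the number of letters enters the formula through several factors at once.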

The first suggested diff command (on the TeX files) shows exactly the same differences in the TeX macros written:

1,2c1,2
< \def\n{434464659200}
< \def\nStr{434,464,659,200}
---
> \def\n{14094194400}
> \def\nStr{14,094,194,400}

If we run the Quarto file postcodes1.qmd with the change, there is no obvious problem: the code and the result continue to match, but are now different from what I intended and originally validated. Here are the HTML and PDF.

A TDDD Version of the Quarto Doc

We can make the Quarto document more robust (and have the benefit of keeping the code in a script, rather than forcing it into the document) by using this Quarto file, postcodes2.qmd.

---
title: "Quarto Postcodes (with inclusion)"
format:
  html:
    code-fold: true
  pdf:
    toc: false
jupyter: python3
---

{{< include _postcodes.py.qmd >}}
```{python}
with open('ref/python_postcodes_py/postcodes.json') as f:
    ref = json.load(f)
assert d == ref
```
The number of postcode-like strings matching

    `{python} ref['postcodeRE']`

is `{python} ref['n_str']`


The include line at the top includes the file _postcodes.py.qmd. This file is just our script, in Quarto Markdown fences, with a filename beginning with an underscore, which Quarto requires for inclusions for some reason. We construct the file automatically as part of the build process (in the Makefile).
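I do this step in the Makefile, but the same wrapping can be sketched in Python. This is a hypothetical equivalent, not the build rule actually used; the function name and the stand-in script contents are invented for the example.

```python
import os
import tempfile

FENCE = '`' * 3  # built this way so the example can itself sit inside a fenced block

def wrap_script_for_quarto(script_path, qmd_path):
    """Write the script to qmd_path inside a Quarto {python} code fence."""
    with open(script_path) as f:
        code = f.read().rstrip('\n')
    with open(qmd_path, 'w') as f:
        f.write(f'{FENCE}{{python}}\n{code}\n{FENCE}\n')

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, 'postcodes.py')
    include = os.path.join(tmp, '_postcodes.py.qmd')
    with open(script, 'w') as f:
        f.write("print('hello')\n")  # stand-in for the real script
    wrap_script_for_quarto(script, include)
    with open(include) as f:
        wrapped = f.read()

print(wrapped)
```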

After the inclusion, we read the JSON file that Gentest saved in its reference directory into Python as a dictionary called ref and then check that this reference dictionary is equal to the one we generated when we ran the code as part of the Quarto rendering process. The Makefile runs the tests (outside Quarto) immediately before rendering so if the assertion passes, we actually know two things:

  1. The tests passed when we ran them outside Quarto (showing that the code produces the results we previously validated as OK), and

  2. When we ran the same code inside Quarto, its results (or at least, the results in the dictionary) were also the same as the reference results in the test.

The rest of the Quarto document is the same as the first version except that we use the results from the dictionary (since those are validated) and choose to use the preformatted string ref['n_str'] rather than formatting it inline. (This makes no difference.)

In this case, and many others, it makes no difference whether we use ref (the results read from the reference JSON file) or d as the source of our values, because the assertion checked that they were identical. The reason I've used ref is that in some other cases, we allow non-material differences between the actual and reference results, typically things like datestamps indicating run-time, machine names etc. (If those are different, we need to use a slightly different assertion.) By using the reference results, we ensure that the document does not change each time we compile it if there are no material differences.
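As an illustration of what such a "slightly different assertion" might look like, here is a minimal sketch that compares two result dictionaries while ignoring keys treated as non-material. The key names run_at and host and the helper material are hypothetical; they are not part of the postcodes example.

```python
def material(d, volatile_keys):
    """Return a copy of d without the keys treated as non-material."""
    return {k: v for k, v in d.items() if k not in volatile_keys}

# Hypothetical results: 'run_at' and 'host' stand in for run-dependent values.
VOLATILE_KEYS = {'run_at', 'host'}
ref = {'n': 14094194400, 'n_str': '14,094,194,400',
       'run_at': '2025-09-02T10:15:01', 'host': 'laptop'}
d = {'n': 14094194400, 'n_str': '14,094,194,400',
     'run_at': '2025-09-03T08:00:00', 'host': 'build-server'}

assert d != ref                                                    # strict equality fails
assert material(d, VOLATILE_KEYS) == material(ref, VOLATILE_KEYS)  # material parts agree
```

With an assertion like this, taking the displayed values from ref rather than d keeps the rendered document byte-for-byte stable between builds.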

Discussion

Next:

  • Look at the JSON and TeX macros
  • Change the letters to be 52
  • Show the test failing
  • Show how to use the script code in Quarto
  • Do the LaTeX version.

  1. All current valid postcodes match this expression, but many strings that match it do not exist and some would probably not be considered valid. 

  2. By way of full disclosure, when I actually replaced letters.py with letters52.py and ran the tests they passed, to my dismay. This happened not because of a problem with the tests, but because I created letters52.py and letters26.py by copying letters.py and failed to update the contents of letters52.py. If you were to look back in the Git history for the repo, you'd see that. I mention this simply as a further demonstration that all humans are prone to error, which is some of the reason TDDD and TDDA are helpful! Of course, some humans are less error-prone than others!