Reference Testing with TDDA: A Guide for LLMs and Coding Bots

Posted on Wed 20 May 2026 in TDDA

Raw markdown for direct use in LLM context windows.

This post is addressed primarily to LLMs and coding agents—it was distilled by Claude from Test-Driven Data Analysis (Radcliffe, CRC Press, 2026; extra resources), the tdda source code, the documentation, and the man pages (also available as markdown). If you are a human who wants a dense, no-nonsense introduction to reference testing with TDDA, you are welcome too. It was written in response to repeated direct observations of coding agents—including Claude—using -W to make failing tests pass without verifying that the new output was correct.

What Is a Reference Test?

A reference test is a semantic regression test. 'Semantic' here means the comparison understands equivalence: outputs that differ in timestamps, row order, version strings, hostnames, file paths, or other legitimately variable content can still pass — only meaningful changes cause failures. You run your analytical process or function, capture its outputs, inspect them and decide they are correct, then store them as reference outputs. From that point on, the test reruns the process and checks that the outputs still match the reference. If something changes—a library update, a refactoring, a subtle bug—the test catches it.

The tdda.referencetest library makes this easy for Python code. It extends unittest (and works with pytest) and provides assertion methods for:

  • strings and text files (assertStringCorrect, assertTextFileCorrect)
  • data frames (assertDataFrameCorrect, assertDataFramesEquivalent)
  • binary files (assertBinaryFileCorrect)

It also provides tdda gentest, a command-line tool that watches your process run and generates a test suite automatically.

The key insight: reference testing is particularly well suited to analytical code because the outputs are often large, complex, and hard to specify in advance. You develop the process, carefully inspect and verify the output is right once, store it, and thereafter the test does the checking.

The -W (--write-all) Trap

Do not blindly run -W (or --write-all) merely to make failing tests pass.

This is the most important thing in this document.

When a reference test fails, tdda.referencetest tells you what differs between the actual output and the stored reference. It also offers you a way to update the reference to match the new output: pass -W (or --write-all under pytest) when running the tests.

Running -W overwrites the reference files with whatever the code currently produces. After this, the tests pass. This tells you nothing about whether the output is correct. The tests will pass even if the output is completely wrong, because you just told them that the wrong output is the new reference.

The correct workflow when tests fail is:

  1. Read the failure message. It tells you what changed.
  2. Run the diff command suggested in the failure output.
  3. Look at the actual differences. Are they expected? Are they correct?
  4. Only if you have verified the new output is correct: update the references. With unittest, use -1W (tagged tests only, recommended) or -W (all); with pytest, use --tagged --write-all -s or --write-all -s. See Running a Subset of Tests with Tags for the full commands. If you have a tame human to hand, this is a good moment to involve them—humans are often better at judging whether output is actually correct, and can get quite sweary when you overwrite correct reference results with nonsense.
  5. Run the tests again (without -W) to confirm they pass.
  6. Check the updated reference files into version control.

If you skip step 3 and go straight to -W, you have not tested anything. You have merely synchronized the reference to whatever the code happens to produce right now.

Safe use of -W: the git audit pattern

If the reference files are clean in git before you run -W, you can use git diff afterwards to inspect exactly what changed, and git checkout -- path/to/testdata/ to revert all reference changes at once if anything looks wrong. This makes -W a controlled and auditable operation—but only if the working tree was clean before you ran it. Always check before running -W, not after.
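The "check before, not after" step can be scripted. The following is a minimal sketch, assuming git is on the PATH; the helper names and the tests/testdata path are hypothetical and not part of tdda.

```python
import subprocess


def porcelain_is_clean(porcelain_output: str) -> bool:
    """Return True if `git status --porcelain` reported no changes."""
    return porcelain_output.strip() == ""


def references_are_clean(ref_dir: str) -> bool:
    """Check that ref_dir has no modified, staged, or untracked files."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", ref_dir],
        capture_output=True, text=True, check=True,
    )
    return porcelain_is_clean(result.stdout)
```

Calling references_are_clean('tests/testdata') before a rewrite, and refusing to continue when it returns False, keeps every -W run reviewable with git diff.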

Unit-Enhanced Reference Tests

The test code and command-line flags differ between reference tests built on unittest and those built on pytest. The sections below cover the unittest variants first; see Writing Reference Tests with pytest for the pytest equivalents. Where flags differ, both are given.

| Task | unittest | pytest |
| --- | --- | --- |
| Run all tests | python tests.py -F | pytest tests/ --log-failures |
| Run only tagged tests | python tests.py -F -1 | pytest tests/ --log-failures --tagged |
| Rewrite all references | -W | --write-all -s |
| Rewrite tagged only | -1W | --tagged --write-all -s |

Full syntax and explanations for each are in the sections below; this table is a quick reference.

A partial structural defence against careless -W use is unit-enhanced reference tests: after the reference assertion, add one or more specific assertions about things that must be true regardless (shown here in unittest style):

def test_output(self):
    result = run_my_process(input_data)
    self.assertStringCorrect(result, 'expected.txt')
    # These survive a careless -W rewrite:
    self.assertIn('Total: 42 records', result)
    self.assertTrue(result.strip().endswith('OK'))

The reference assertion runs first. If it fails, tdda writes the actual output and suggests a diff command—the normal workflow. If you then carelessly rewrite with -W, the subsequent assertions will still fail if the output is wrong in ways they cover.

This is not a complete defence—you have to choose the assertions carefully—but it makes it much harder to accidentally accept a broken result. Choose assertions that reflect the core correctness property the test was designed to verify.
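Why the extra assertions survive a rewrite can be shown without tdda at all. The sketch below uses plain string equality as a stand-in for the reference comparison (an assumption for illustration, not tdda's implementation) and simulates a careless rewrite of the reference.

```python
def reference_matches(actual: str, reference: str) -> bool:
    """Stand-in for a reference comparison (not tdda's implementation)."""
    return actual == reference


def spot_checks_pass(actual: str) -> bool:
    """Independent checks that do not depend on the stored reference."""
    return "Total: 42 records" in actual and actual.strip().endswith("OK")


# The process is broken and reports the wrong total.
broken_output = "Total: 0 records\nOK\n"

# A careless rewrite makes the reference equal to the broken output...
rewritten_reference = broken_output

# ...so the reference comparison now passes, but the spot checks do not.
reference_ok = reference_matches(broken_output, rewritten_reference)
spot_ok = spot_checks_pass(broken_output)
```

Here reference_ok is True while spot_ok is False: the rewritten reference hides the bug, and the independent assertion still exposes it.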

This pattern emerged from the author's direct experience of coding agents (including Claude) repeatedly using -W to make tests pass without verifying the results. It is recommended for any test where the reference output has semantic structure that can be spot-checked.

The -F (--log-failures) Flag

Always pass the log-failures flag when running tests. It logs the IDs of any failing tests to a timestamped file (YYYY-MM-DDTHHMMSS-failing-tests.txt) in your system temp directory (overridable with $TDDA_FAIL_DIR). This enables the tdda tag workflow: tdda tag reads the most recent such file and adds @tag to the failing tests, so you can re-run and regenerate references for just those tests.

Without the flag, no failures file is written and tdda tag has nothing to work with.
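To confirm that a failures file was actually written, you can look for it yourself. This is a minimal sketch based only on the naming and location described above; it is not tdda's own lookup code, the function name is hypothetical, and it makes no assumption about the file's contents.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_failures_file() -> Optional[Path]:
    """Return the newest *-failing-tests.txt file, or None if there is none."""
    # $TDDA_FAIL_DIR overrides the system temp directory.
    fail_dir = Path(os.environ.get("TDDA_FAIL_DIR", tempfile.gettempdir()))
    # The YYYY-MM-DDTHHMMSS prefix sorts chronologically as plain text.
    candidates = sorted(fail_dir.glob("*-failing-tests.txt"))
    return candidates[-1] if candidates else None
```

If this returns None after a failing run, the log-failures flag was probably not passed.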

With unittest (running directly with Python)

Pass -F (or --log-failures):

python tests/test_mycode.py -F

With pytest

Pass --log-failures:

pytest tests/ --log-failures

Permanent default

To avoid passing the flag every time, add this to ~/.tdda.toml:

[referencetest]
log_failures = true

This modifies the user's global configuration. Consult your human before doing it.

The Kicker: the -W Problem is Not Restricted to TDDA

The anti-pattern described above — rewriting expected outputs to make tests pass without verifying the new output is correct — applies far beyond tdda.referencetest. LLM coding agents routinely treat passing tests as the goal rather than as evidence of correctness. Whether rewriting a reference file, updating a pytest snapshot, regenerating Jest snapshots, or changing a hardcoded expected value in an assertion, the same question applies first: is the new result actually correct?

Green tests after any kind of expected-value rewrite tell you nothing about correctness. They tell you only that the code now matches whatever you told it to match.

The correct workflow is the same regardless of framework:

  1. A test fails. Read the failure. What changed?
  2. Is the change correct, or is it a bug?
  3. Only if correct: update the expected value.
  4. If you're not sure: ask your human.

The specific value of tdda.referencetest is that it makes step 1 easy — the diff tooling is built in, and -F/tdda tag/-1W limit the blast radius. But the discipline is universal.

Running a Subset of Tests with Tags

To run only some tests, use the @tag decorator:

from tdda.referencetest import ReferenceTestCase, tag

class TestMyProcess(ReferenceTestCase):

    @tag
    def test_main_output(self):
        result = run_my_process()
        self.assertStringCorrect(result, 'expected_output.txt')

    def test_other_thing(self):
        ...

@tag can decorate individual test methods or entire test classes. The flags to run only tagged tests differ between unittest and pytest.

With unittest (running directly with Python)

python tests/test_mycode.py -F -1        # run only tagged tests
python tests/test_mycode.py -F -1W       # regenerate references for tagged tests only

-1W combines -1 and -W (--write-all). This is the safe way to regenerate, because it limits the blast radius to tests you have explicitly chosen and tagged.

With pytest

pytest tests/ --log-failures --tagged               # run only tagged tests
pytest tests/ --log-failures --tagged --write-all -s  # regenerate references for tagged tests only

Pass -s to prevent pytest from capturing output, so that tdda can report which reference files were written.

The full workflow with tdda tag

The -F / tdda tag / -1W workflow lets you rewrite only the references that actually failed, without manually deciding which tests to tag:

  1. Run tests with -F (or --log-failures) to record failing test IDs
  2. Run tdda tag to add @tag to those tests automatically
  3. Inspect the diffs to verify the new output is correct
  4. Run -1W (or --tagged --write-all -s) to rewrite only those references
  5. Run make untag (or the sed command below) to remove the tags

This is always preferable to bare -W, which rewrites every reference file regardless of whether the test failed.

Removing stale tags

Before adding new tags, remove any stale @tag decorators from previous sessions. There is usually a make untag target that does this, or you can use:

# macOS (BSD sed):
sed -i '' '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py

# Linux (GNU sed):
sed -i '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py

Writing Reference Tests with unittest

A minimal test file:

import os
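from tdda.referencetest import ReferenceTestCase, tag

# run_my_process, input_data and produce_dataframe stand for your own code;
# import or define them here.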

TESTDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                       'testdata')

class TestMyProcess(ReferenceTestCase):

    def test_output(self):
        result = run_my_process(input_data)
        self.assertStringCorrect(result,
                                 os.path.join(TESTDIR, 'expected.txt'))

    def test_dataframe(self):
        df = produce_dataframe()
        self.assertDataFrameCorrect(df,
                                    os.path.join(TESTDIR, 'expected.csv'))

if __name__ == '__main__':
    ReferenceTestCase.main()

When running under pytest, the if __name__ == '__main__': block is simply ignored—the same test file works with both runners unchanged.

Run it:

python test_myprocess.py -F           # run all tests
python test_myprocess.py -F -1        # run only tagged tests
python test_myprocess.py -F -1W       # regenerate references for tagged tests

The first time you run with -1W after writing a new test (which must carry @tag for -1 to select it), it writes the reference file. Subsequent runs without -W compare against it.

After writing references with -1W, always inspect the files that were written. The fact that the test now passes means only that the reference matches the output. It says nothing about whether either is correct.

Writing Reference Tests with pytest

The same test classes work under pytest, with different flags:

pytest tests/ --log-failures                           # run all tests
pytest tests/ --log-failures --tagged                  # run only tagged tests
pytest tests/ --log-failures --tagged --write-all -s   # regenerate references for tagged tests

Note:

  • Use --write-all instead of -W.
  • Use --tagged instead of -1.
  • Pass -s to prevent pytest from capturing output, so that tdda can report which reference files were written.
  • The short flags -W and -1 are tdda extensions; they only work when running the test file directly with Python, not under pytest.

Assertion API: Text and Strings

assertStringCorrect(string, ref_path, ...) Check an in-memory string against a reference text file.

assertTextFileCorrect(actual_path, ref_path, ...) Check a text file on disk against a reference text file.

assertTextFilesCorrect(actual_paths, ref_paths, ...) Check multiple text files against corresponding reference files.

All three share these optional parameters for handling variable output:

| Parameter | Effect |
| --- | --- |
| lstrip=True | Strip leading whitespace from each line before comparing |
| rstrip=True | Strip trailing whitespace from each line before comparing |
| ignore_substrings=['foo','bar'] | Ignore any line in the expected file containing one of these substrings; the corresponding actual line can be anything |
| ignore_patterns=[r'pattern'] | Lines differing only in substrings matching these regexes pass; text outside the match must be identical in both |
| remove_lines=['foo'] | Remove lines containing these substrings from both actual and expected before comparing |
| preprocess=fn | Apply fn(list_of_lines) to both actual and expected (as lists of strings) before comparing |
| max_permutation_cases=N | Pass if lines differ only in order, up to N permutations; None = unlimited |

ignore_substrings—ignore whole lines by substring

Lines in the expected output containing the substring are skipped. The match is against the expected file only—the actual output can have anything on those lines (or nothing):

# Reference file contains:
#   Copyright (c) Stochastic Solutions Limited, 2016
#   Version 0.0.0
# Actual output has current year and version—but we don't care:
self.assertStringCorrect(actual, 'expected.html',
    ignore_substrings=['Copyright', 'Version'])

ignore_patterns—ignore variable substrings within a line

Lines pass if they differ only in parts matching the regex. Everything outside the match must be identical in both files:

# Actual:   "Generated: 2026-05-20T14:32:01 by pipeline v2.3.1"
# Expected: "Generated: 2024-01-15T09:00:00 by pipeline v1.0.0"
# Both lines still match with:
self.assertStringCorrect(actual, 'expected.txt',
    ignore_patterns=[
        r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}',
        r'v\d+\.\d+\.\d+',
    ])

ignore_patterns is stricter than ignore_substrings: the non-matching parts of each line must agree exactly, so you cannot accidentally mask a real change in the surrounding text.
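The rule "differ only in parts matching the regex" can be approximated by masking every match in both lines and comparing what is left. This is a sketch of the described behaviour using only the standard library; it is not tdda's implementation, whose exact rules may differ, and lines_match is a hypothetical name.

```python
import re

PLACEHOLDER = "\x00"  # stands in for any text matched by a pattern


def lines_match(actual: str, expected: str, patterns: list[str]) -> bool:
    """True if the lines are equal once pattern matches are masked out."""
    for pattern in patterns:
        actual = re.sub(pattern, PLACEHOLDER, actual)
        expected = re.sub(pattern, PLACEHOLDER, expected)
    return actual == expected
```

With the two patterns from the example above, the "Generated:" lines compare equal, while changing "Generated" to "Created" on one side makes the comparison fail because that text lies outside every match.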

remove_lines—strip lines from both files

Lines containing the substring are removed from both actual and expected before comparing. Use this for optional or ephemeral lines that should not appear in the reference at all:

# Both files have lines like "WARNING: cache miss" that are
# present sometimes and absent other times:
self.assertStringCorrect(actual, 'expected.txt',
    remove_lines=['WARNING: cache miss'])

Unlike ignore_substrings, remove_lines strips from both sides, so the reference file also need not contain these lines.

preprocess—transform both files before comparing

Takes a function that accepts a list of strings (lines) and returns a transformed list. Applied to both actual and expected:

def strip_timestamps(lines):
    # remove leading timestamp prefix "2026-05-20 14:32:01 " from each line
    return [line[20:] if len(line) > 20 else line for line in lines]

self.assertStringCorrect(actual, 'expected.txt',
    preprocess=strip_timestamps)
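A fixed-width slice also trims the first 20 characters of any long line that has no timestamp. A slightly more defensive variant, sketched here with a regex, removes the prefix only when it is really there; the function name is hypothetical and the timestamp layout is the one assumed in the example above.

```python
import re

# Matches a leading "YYYY-MM-DD HH:MM:SS " prefix and nothing else.
TIMESTAMP_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ")


def strip_timestamp_prefix(lines):
    """Drop a leading timestamp from each line; leave other lines as-is."""
    return [TIMESTAMP_PREFIX.sub("", line) for line in lines]
```

It takes and returns a list of strings, so it can be passed as preprocess=strip_timestamp_prefix in the same way as strip_timestamps.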

max_permutation_cases—allow reordered lines

Pass if the lines are a permutation of each other, up to the given number of permutations. Use None for unlimited:

# Output order is non-deterministic, but the set of lines is fixed:
self.assertStringCorrect(actual, 'expected.txt',
    max_permutation_cases=None)
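For the unlimited case, "a permutation of each other" means the two texts contain the same lines the same number of times. The sketch below illustrates that idea with a multiset comparison; it is not tdda's implementation and does not model the limit N, and the function name is hypothetical.

```python
from collections import Counter


def same_lines_any_order(actual: str, expected: str) -> bool:
    """True if both texts contain the same lines, ignoring order."""
    return Counter(actual.splitlines()) == Counter(expected.splitlines())
```

Counting occurrences rather than using a plain set means a duplicated or dropped line is still detected.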

Assertion API: DataFrames

The DataFrame assertion methods work with Pandas 2.x and 3.x (all three backends: numpy_nullable, pyarrow, and original) and with Polars. You can even compare DataFrames across engines—e.g. a Pandas actual against a Polars reference—with the engine parameter if needed.

assertDataFramesEquivalent(df, ref_df, ...) Compare two in-memory DataFrames (Pandas or Polars).

assertDataFrameCorrect(df, ref_path, ...) Compare an in-memory DataFrame against a reference file (CSV or Parquet).

assertStoredDataFrameCorrect(actual_path, ref_path, ...) Compare two DataFrames both stored on disk.

assertStoredDataFramesCorrect(actual_paths, ref_paths, ...) Compare multiple pairs of on-disk DataFrames.

check_data and check_types—exclude columns

The most common use is excluding columns whose values are legitimately variable (random seeds, run IDs, timestamps):

# Exclude the 'random' column from both value and type checks:
columns = self.all_fields_except(['random'])
self.assertDataFrameCorrect(df, 'expected.csv',
                            check_data=columns,
                            check_types=columns)

check_data, check_types, and check_order all accept the same forms:

  • None or True: check all fields (default)
  • False: skip entirely
  • a list of field names to check
  • a function taking a DataFrame and returning a list of field names

sortby—sort before comparing

Use when row order is non-deterministic:

self.assertDataFrameCorrect(df, 'expected.csv',
                            sortby=['country', 'date'])

condition—filter rows before comparing

Use when only a subset of rows is relevant to the test:

# Only compare rows where status is 'complete':
self.assertDataFrameCorrect(df, 'expected.csv',
    condition=lambda df: df['status'] == 'complete')

precision—floating-point tolerance

Default is 7 decimal places. Loosen it when values come via CSV (which can lose precision):

self.assertDataFrameCorrect(df, 'expected.csv', precision=5)

type_matching—dtype strictness

  • 'strict' (default for Parquet): dtypes must be identical
  • 'medium' (default for CSV): same underlying type (int, float, datetime) but different bit width or nullability allowed
  • 'loose': anything Pandas can compare

# CSV round-trips can change int64 to float64, so use medium:
self.assertDataFrameCorrect(df, 'expected.csv', type_matching='medium')

fuzzy_nulls—treat different null types as equal

# NaN and None treated as equivalent:
self.assertDataFramesEquivalent(df, ref_df, fuzzy_nulls=True)

engine—Pandas or Polars

Inferred automatically from the DataFrames. Only needed when comparing across types (a Pandas actual against a Polars reference or vice versa):

self.assertDataFramesEquivalent(pandas_df, polars_df, engine='pandas')

tdda diff—Understanding DataFrame Failures

When a DataFrame assertion fails, the failure message suggests one or more diff commands. For tabular data, it often suggests both a raw diff and a tdda diff:

Compare with:
    diff /tmp/actual-expected.csv /path/to/testdata/expected.csv
Compare with:
    tdda diff /tmp/actual-expected.csv /path/to/testdata/expected.csv

tdda diff uses the same comparison logic as the assertion methods and produces a structured summary: which columns differ, how many rows, and a table showing the differing values side by side. It is much easier to read than raw diff for anything beyond a handful of rows. Always prefer it for DataFrame failures. Example output:

Columns with differences: 1 / 12
Rows with differences:    3 / 1000

Values:
  Row   Column    Actual    Expected
   42   revenue   1500.50   1500.00
  108   revenue      0.00       NaN
  731   revenue    999.99   1000.00

It accepts the same field-selection flags as the assertion methods:

tdda diff actual.csv expected.csv --xfields random,run_id

Assertion API: Binary Files

assertBinaryFileCorrect(actual_path, ref_path) Check that a binary file is byte-for-byte identical to a reference file. No options for partial matching—if you need that, extract the relevant data and use a string or DataFrame assertion instead.

Generating Tests Automatically with Gentest

If you have a command-line process—a script, a shell command, an R program—tdda gentest can generate a test suite for it:

tdda gentest 'python my_analysis.py input.csv' testsuite.py

Gentest runs the command multiple times, captures all outputs (stdout, stderr, exit code, any files written), detects which parts vary between runs, and writes a test script that checks the stable parts. The generated script uses tdda.referencetest and can be run and maintained like any other reference test.

Inspect the generated test and the reference outputs before trusting them. Gentest is good at generating structurally correct tests; you still need to verify that the reference outputs are actually correct.

The Reference Test Checklist

  • Create at least one reference test for every analytical process you write.
  • Run tests before making changes, so you know the baseline.
  • Run tests after making changes, before assuming they worked.
  • When a test fails, read the diff before doing anything else.
  • Never run -W without first verifying the new output is correct.
  • Prefer -1W (or --tagged --write-all -s) over bare -W: rewrite only the references that actually failed.
  • Use -F and tdda tag to automatically tag failing tests for targeted reruns and rewrites.
  • After writing references, inspect the files. Tests passing after -W or --write-all is not evidence of correctness.
  • Ensure reference files are clean in git before running -W, so you can use git diff to review changes and revert with git checkout -- testdata/ if needed.
  • Consider unit-enhanced reference tests for anything with checkable semantic structure.
  • Add a regression test for every bug you fix.
  • 1 test vs. 0 tests is a bigger difference than 100 vs. 1.

Further Reading