Reference Testing with TDDA: A Guide for LLMs and Coding Bots
Posted on Wed 20 May 2026 in TDDA • Tagged with reference testing, LLMs, coding bots, gentest, pytest, unittest
Raw markdown for direct use in LLM context windows.
This post is addressed primarily to LLMs and coding agents—it was
distilled by Claude from Test-Driven Data Analysis
(Radcliffe, CRC Press, 2026; extra resources),
the tdda source code,
the documentation, and the
man pages
(also available as markdown).
If you are a human who wants a dense, no-nonsense introduction to
reference testing with TDDA, you are welcome too. It was written in
response to repeated direct observations of coding agents—including
Claude—using -W to make failing tests pass without verifying that
the new output was correct.
What Is a Reference Test?
A reference test is a semantic regression test. 'Semantic' here means the comparison understands equivalence: outputs that differ in timestamps, row order, version strings, hostnames, file paths, or other legitimately variable content can still pass — only meaningful changes cause failures. You run your analytical process or function, capture its outputs, inspect them and decide they are correct, then store them as reference outputs. From that point on, the test reruns the process and checks that the outputs still match the reference. If something changes—a library update, a refactoring, a subtle bug—the test catches it.
The tdda.referencetest library makes this easy for Python code.
It extends unittest (and works with pytest) and provides
assertion methods for:
- strings and text files (assertStringCorrect, assertTextFileCorrect)
- data frames (assertDataFrameCorrect, assertDataFramesEquivalent)
- binary files (assertBinaryFileCorrect)
It also provides tdda gentest, a command-line tool that watches
your process run and generates a test suite automatically.
The key insight: reference testing is particularly well suited to analytical code because the outputs are often large, complex, and hard to specify in advance. You develop the process, carefully inspect and verify the output is right once, store it, and thereafter the test does the checking.
The -W (--write-all) Trap
Do not blindly run -W (or --write-all) merely to make failing tests pass.
This is the most important thing in this document.
When a reference test fails, tdda.referencetest tells you what
differs between the actual output and the stored reference. It also
offers you a way to update the reference to match the new output:
pass -W (or --write-all under pytest) when running the tests.
Running -W overwrites the reference files with whatever the code
currently produces. After this, the tests pass. This tells you
nothing about whether the output is correct. The tests will pass
even if the output is completely wrong, because you just told them
that the wrong output is the new reference.
The correct workflow when tests fail is:
1. Read the failure message. It tells you what changed.
2. Run the diff command suggested in the failure output.
3. Look at the actual differences. Are they expected? Are they correct?
4. Only if you have verified the new output is correct: update the references. With unittest, use -1W (tagged tests only, recommended) or -W (all); with pytest, use --tagged --write-all -s or --write-all -s. See Running a Subset of Tests with Tags for the full commands. If you have a tame human to hand, this is a good moment to involve them: humans are often better at judging whether output is actually correct, and can get quite sweary when you overwrite correct reference results with nonsense.
5. Run the tests again (without -W) to confirm they pass.
6. Check the updated reference files into version control.
If you skip step 3 and go straight to -W, you have not tested
anything. You have merely synchronized the reference to whatever
the code happens to produce right now.
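The trap is easy to reproduce without tdda at all. The sketch below is a toy stand-in, not tdda's API: `check_against_reference` and `buggy_total` are hypothetical names, and `write=True` plays the role of a blind -W. It shows a deliberately broken function failing against a verified reference, then "passing" once the reference has been overwritten with its own wrong output.

```python
import tempfile
from pathlib import Path

def check_against_reference(actual: str, ref_path: Path, write: bool = False) -> bool:
    """Toy reference check; write=True overwrites the reference (like a blind -W)."""
    if write:
        ref_path.write_text(actual)
        return True
    return ref_path.read_text() == actual

def buggy_total(values):
    # Deliberate bug: the last value is dropped from the sum.
    return f"Total: {sum(values[:-1])}\n"

with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "expected.txt"
    ref.write_text("Total: 6\n")                 # reference a human verified
    actual = buggy_total([1, 2, 3])              # wrong output
    failed_before = not check_against_reference(actual, ref)
    check_against_reference(actual, ref, write=True)   # blind rewrite
    passes_after = check_against_reference(actual, ref)
    reference_now = ref.read_text()              # the wrong output is now "expected"
```

After the rewrite the check is green, yet the stored reference holds the buggy total. Nothing about the code improved; only the yardstick changed.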
Safe use of -W: the git audit pattern
If the reference files are clean in git before you run -W, you can use
git diff afterwards to inspect exactly what changed, and
git checkout -- path/to/testdata/ to revert all reference changes
at once if anything looks wrong. This makes -W a controlled and
auditable operation—but only if the working tree was clean before you
ran it. Always check before running -W, not after.
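One way to enforce the "clean before -W" rule is to ask git directly. The following is a minimal sketch, assuming the reference files live under a single directory tracked by git; the helper names are hypothetical, while `git status --porcelain` is the standard machine-readable status output (two status characters, a space, then the path).

```python
import subprocess

def dirty_paths(porcelain_output: str) -> list[str]:
    """Extract the paths from `git status --porcelain` output."""
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]

def reference_dir_dirty_paths(testdata_dir: str) -> list[str]:
    """Return modified or untracked paths under testdata_dir (empty list = clean)."""
    out = subprocess.run(
        ["git", "status", "--porcelain", "--", testdata_dir],
        capture_output=True, text=True, check=True,
    ).stdout
    return dirty_paths(out)
```

A wrapper script could call `reference_dir_dirty_paths('path/to/testdata/')` and refuse to pass -W on to the tests unless the result is empty, so that a later git diff shows only what the rewrite changed.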
Unit-Enhanced Reference Tests
The test code and command-line flags differ between reference tests
built on unittest and those built on pytest. The sections below
cover the unittest variants first; see
Writing Reference Tests with pytest
for the pytest equivalents. Where flags differ, both are given.
| Task | unittest | pytest |
|---|---|---|
| Run all tests | python tests.py -F | pytest tests/ --log-failures |
| Run only tagged tests | python tests.py -F -1 | pytest tests/ --log-failures --tagged |
| Rewrite all references | -W | --write-all -s |
| Rewrite tagged only | -1W | --tagged --write-all -s |
Full syntax and explanations for each are in the sections below; this table is a quick reference.
A partial structural defence against careless -W use is
unit-enhanced reference tests: after the reference assertion, add
one or more specific assertions about things that must be true
regardless (shown here in unittest style):
def test_output(self):
    result = run_my_process(input_data)
    self.assertStringCorrect(result, 'expected.txt')
    # These survive a careless -W rewrite:
    self.assertIn('Total: 42 records', result)
    self.assertTrue(result.strip().endswith('OK'))
The reference assertion runs first. If it fails, tdda writes the
actual output and suggests a diff command—the normal workflow. If
you then carelessly rewrite with -W, the subsequent assertions will
still fail if the output is wrong in ways they cover.
This is not a complete defence—you have to choose the assertions carefully—but it makes it much harder to accidentally accept a broken result. Choose assertions that reflect the core correctness property the test was designed to verify.
This pattern emerged from the author's direct experience of coding
agents (including Claude) repeatedly using -W to make tests pass
without verifying the results. It is recommended for any test where
the reference output has semantic structure that can be spot-checked.
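A spot check is stronger when the expected value is derived from the input rather than copied from the output. The sketch below assumes a report containing a line of the form "Total: N records", as in the example above; `reported_total` and the sample report are hypothetical.

```python
import re

def reported_total(report: str) -> int:
    """Parse N from a 'Total: N records' line, failing loudly if it is absent."""
    match = re.search(r"Total: (\d+) records", report)
    if match is None:
        raise AssertionError("no 'Total: N records' line in report")
    return int(match.group(1))

# Hypothetical output and an independently known fact about the input.
report = "Region A: 30\nRegion B: 12\nTotal: 42 records\nOK\n"
input_row_count = 42

total_ok = reported_total(report) == input_row_count
ends_ok = report.strip().endswith("OK")
```

In a real test the two checks would be assertions placed after the reference assertion, so they keep failing after a careless rewrite whenever the total no longer matches the input.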
The -F (--log-failures) Flag
Always pass the log-failures flag when running tests. It logs the IDs
of any failing tests to a timestamped file
(YYYY-MM-DDTHHMMSS-failing-tests.txt) in your system temp directory
(overridable with $TDDA_FAIL_DIR). This enables the tdda tag
workflow: tdda tag reads the most recent such file and adds @tag
to the failing tests, so you can re-run and regenerate references for
just those tests.
Without the flag, no failures file is written and tdda tag has
nothing to work with.
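When checking by hand whether a failures file was written, the filename convention makes the newest one easy to find. This is a rough sketch and not how tdda tag itself locates the file; it assumes only the naming pattern and the $TDDA_FAIL_DIR override described above, and `latest_failures_file` is a hypothetical helper.

```python
import os
import re
import tempfile
from pathlib import Path

# Filename pattern described above: YYYY-MM-DDTHHMMSS-failing-tests.txt
FAILURES_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{6}-failing-tests\.txt$")

def latest_failures_file(directory=None):
    """Return the newest failures log in `directory`, or None if there is none.

    Assumption: with no directory given, use $TDDA_FAIL_DIR if set,
    otherwise the system temp directory.
    """
    if directory is None:
        directory = os.environ.get("TDDA_FAIL_DIR") or tempfile.gettempdir()
    candidates = [p for p in Path(directory).iterdir()
                  if FAILURES_RE.match(p.name)]
    # Timestamped names sort chronologically as plain strings.
    return max(candidates, key=lambda p: p.name, default=None)
```

If this returns None after a failing run, the log-failures flag was probably not passed.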
With unittest (running directly with Python)
Pass -F (or --log-failures):
python tests/test_mycode.py -F
With pytest
Pass --log-failures:
pytest tests/ --log-failures
Permanent default
To avoid passing the flag every time, add this to ~/.tdda.toml:
[referencetest]
log_failures = true
This modifies the user's global configuration. Consult your human before doing it.
The Kicker: the -W Problem is Not Restricted to TDDA
The anti-pattern described above — rewriting expected outputs to make
tests pass without verifying the new output is correct — applies far
beyond tdda.referencetest. LLM coding agents routinely treat passing
tests as the goal rather than as evidence of correctness. Whether
rewriting a reference file, updating a pytest snapshot, regenerating
Jest snapshots, or changing a hardcoded expected value in an assertion,
the same question applies first: is the new result actually correct?
Green tests after any kind of expected-value rewrite tell you nothing about correctness. They tell you only that the code now matches whatever you told it to match.
The correct workflow is the same regardless of framework:
- A test fails. Read the failure. What changed?
- Is the change correct, or is it a bug?
- Only if correct: update the expected value.
- If you're not sure: ask your human.
The specific value of tdda.referencetest is that it makes step 1
easy — the diff tooling is built in, and -F/tdda tag/-1W limit
the blast radius. But the discipline is universal.
Running a Subset of Tests with Tags
To run only some tests, use the @tag decorator:
from tdda.referencetest import ReferenceTestCase, tag

class TestMyProcess(ReferenceTestCase):
    @tag
    def test_main_output(self):
        result = run_my_process()
        self.assertStringCorrect(result, 'expected_output.txt')

    def test_other_thing(self):
        ...
@tag can decorate individual test methods or entire test classes.
The flags to run only tagged tests differ between unittest and pytest.
With unittest (running directly with Python)
python tests/test_mycode.py -F -1 # run only tagged tests
python tests/test_mycode.py -F -1W # regenerate references for tagged tests only
-1W combines -1 and -W (--write-all). This is the safe way to
regenerate, because it limits the blast radius to tests you have
explicitly chosen and tagged.
With pytest
pytest tests/ --log-failures --tagged # run only tagged tests
pytest tests/ --log-failures --tagged --write-all -s # regenerate references for tagged tests only
Pass -s to prevent pytest from capturing output, so that tdda
can report which reference files were written.
The full workflow with tdda tag
The -F → tdda tag → -1W workflow lets you rewrite only the
references that actually failed, without manually deciding which tests
to tag:
1. Run tests with -F (or --log-failures) to record failing test IDs
2. Run tdda tag to add @tag to those tests automatically
3. Inspect the diffs to verify the new output is correct
4. Run -1W (or --tagged --write-all -s) to rewrite only those references
5. Run make untag (or the sed command below) to remove the tags
This is always preferable to bare -W, which rewrites every reference
file regardless of whether the test failed.
Removing stale tags
Before adding new tags, remove any stale @tag decorators from
previous sessions. There is usually a make untag target that does
this, or you can use:
# macOS (BSD sed):
sed -i '' '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py
# Linux (GNU sed):
sed -i '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py
Writing Reference Tests with unittest
A minimal test file:
import os
from tdda.referencetest import ReferenceTestCase, tag

TESTDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                       'testdata')

class TestMyProcess(ReferenceTestCase):
    def test_output(self):
        result = run_my_process(input_data)
        self.assertStringCorrect(result,
                                 os.path.join(TESTDIR, 'expected.txt'))

    def test_dataframe(self):
        df = produce_dataframe()
        self.assertDataFrameCorrect(df,
                                    os.path.join(TESTDIR, 'expected.csv'))

if __name__ == '__main__':
    ReferenceTestCase.main()
When running under pytest, the if __name__ == '__main__': block is
simply ignored—the same test file works with both runners unchanged.
Run it:
python test_myprocess.py -F # run all tests
python test_myprocess.py -F -1 # run only tagged tests
python test_myprocess.py -F -1W # regenerate references for tagged tests
The first time you run with -1W after writing a new test, it
writes the reference file (the new test must carry @tag, because -1
runs only tagged tests). Subsequent runs without -W compare against it.
After writing references with -1W, always inspect the files that
were written. The fact that the test now passes means only that the
reference matches the output. It says nothing about whether either is
correct.
Writing Reference Tests with pytest
The same test classes work under pytest, with different flags:
pytest tests/ # run all tests
pytest tests/ --tagged # run only tagged tests
pytest tests/ --tagged --write-all -s # regenerate references for tagged tests
Note:
- Use --write-all instead of -W.
- Use --tagged instead of -1.
- Pass -s to prevent pytest from capturing output, so that tdda
can report which reference files were written.
- The short flags -W and -1 are tdda extensions; they only work
when running the test file directly with Python, not under pytest.
Assertion API: Text and Strings
assertStringCorrect(string, ref_path, ...)
Check an in-memory string against a reference text file.
assertTextFileCorrect(actual_path, ref_path, ...)
Check a text file on disk against a reference text file.
assertTextFilesCorrect(actual_paths, ref_paths, ...)
Check multiple text files against corresponding reference files.
All three share these optional parameters for handling variable output:
| Parameter | Effect |
|---|---|
| lstrip=True | Strip leading whitespace from each line before comparing |
| rstrip=True | Strip trailing whitespace from each line before comparing |
| ignore_substrings=['foo','bar'] | Ignore any line in the expected file containing one of these substrings; the corresponding actual line can be anything |
| ignore_patterns=[r'pattern'] | Lines differing only in substrings matching these regexes pass; text outside the match must be identical in both |
| remove_lines=['foo'] | Remove lines containing these substrings from both actual and expected before comparing |
| preprocess=fn | Apply fn(list_of_lines) to both actual and expected (as lists of strings) before comparing |
| max_permutation_cases=N | Pass if lines differ only in order, up to N permutations; None = unlimited |
ignore_substrings—ignore whole lines by substring
Lines in the expected output containing the substring are skipped. The match is against the expected file only—the actual output can have anything on those lines (or nothing):
# Reference file contains:
# Copyright (c) Stochastic Solutions Limited, 2016
# Version 0.0.0
# Actual output has current year and version—but we don't care:
self.assertStringCorrect(actual, 'expected.html',
                         ignore_substrings=['Copyright', 'Version'])
ignore_patterns—ignore variable substrings within a line
Lines pass if they differ only in parts matching the regex. Everything outside the match must be identical in both files:
# Actual: "Generated: 2026-05-20T14:32:01 by pipeline v2.3.1"
# Expected: "Generated: 2024-01-15T09:00:00 by pipeline v1.0.0"
# Both lines still match with:
self.assertStringCorrect(actual, 'expected.txt',
                         ignore_patterns=[
                             r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}',
                             r'v\d+\.\d+\.\d+',
                         ])
ignore_patterns is stricter than ignore_substrings: the non-matching
parts of each line must agree exactly, so you cannot accidentally mask
a real change in the surrounding text.
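The rule can be pictured as "mask every match on both sides, then demand exact equality". The sketch below illustrates that reading with the standard re module only; it is not tdda's implementation, whose matching details may differ, and `lines_match_ignoring` is a hypothetical name.

```python
import re

def lines_match_ignoring(actual: str, expected: str, patterns) -> bool:
    """Illustrative only: mask each regex match, then compare what is left."""
    for pattern in patterns:
        actual = re.sub(pattern, "<IGNORED>", actual)
        expected = re.sub(pattern, "<IGNORED>", expected)
    return actual == expected

patterns = [r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", r"v\d+\.\d+\.\d+"]
actual = "Generated: 2026-05-20T14:32:01 by pipeline v2.3.1"
expected = "Generated: 2024-01-15T09:00:00 by pipeline v1.0.0"

# Only the timestamp and version differ, so the lines are treated as equal.
same = lines_match_ignoring(actual, expected, patterns)
# A change outside the masked parts is still detected.
changed = lines_match_ignoring(actual.replace("pipeline", "workflow"),
                               expected, patterns)
```

Here `same` is True and `changed` is False: a change in the surrounding text cannot hide behind the patterns.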
remove_lines—strip lines from both files
Lines containing the substring are removed from both actual and expected before comparing. Use this for optional or ephemeral lines that should not appear in the reference at all:
# Both files have lines like "WARNING: cache miss" that are
# present sometimes and absent other times:
self.assertStringCorrect(actual, 'expected.txt',
                         remove_lines=['WARNING: cache miss'])
Unlike ignore_substrings, remove_lines strips from both sides, so
the reference file also need not contain these lines.
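The effect of stripping from both sides can be shown with plain lists. This is a minimal sketch of the described behaviour rather than tdda's code; `without_lines` and the sample lines are hypothetical.

```python
def without_lines(lines, substrings):
    """Drop every line that contains any of the given substrings."""
    return [line for line in lines if not any(s in line for s in substrings)]

actual = ["step 1 done", "WARNING: cache miss", "step 2 done"]
expected = ["step 1 done", "step 2 done"]     # reference has no warning line
noise = ["WARNING: cache miss"]

# The warning appears on one side only, yet the filtered lists agree.
matches = without_lines(actual, noise) == without_lines(expected, noise)
# Without filtering, the extra line would make the comparison fail.
raw_matches = actual == expected
```

Because the filter is applied to both lists, it does not matter which side the ephemeral line appears on, or whether it appears at all.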
preprocess—transform both files before comparing
Takes a function that accepts a list of strings (lines) and returns a transformed list. Applied to both actual and expected:
def strip_timestamps(lines):
    # remove leading timestamp prefix "2026-05-20 14:32:01 " from each line
    return [line[20:] if len(line) > 20 else line for line in lines]

self.assertStringCorrect(actual, 'expected.txt',
                         preprocess=strip_timestamps)
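Fixed-offset slicing also truncates any long line that has no timestamp. A possible alternative, sketched here on the assumption that the prefix has the form "YYYY-MM-DD HH:MM:SS " at the start of a line, uses a regex so that only real prefixes are removed; `strip_timestamp_prefix` is a hypothetical name.

```python
import re

# Assumed prefix format: "YYYY-MM-DD HH:MM:SS " at the start of a line.
TIMESTAMP_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ")

def strip_timestamp_prefix(lines):
    """Remove a leading timestamp where present; leave other lines untouched."""
    return [TIMESTAMP_PREFIX.sub("", line) for line in lines]

cleaned = strip_timestamp_prefix([
    "2026-05-20 14:32:01 loaded 1000 rows",
    "Summary: all checks passed in section 7",   # no timestamp: kept whole
])
```

Like any function that takes and returns a list of lines, it could be supplied as preprocess=strip_timestamp_prefix.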
max_permutation_cases—allow reordered lines
Pass if the lines are a permutation of each other, up to the given
number of permutations. Use None for unlimited:
# Output order is non-deterministic, but the set of lines is fixed:
self.assertStringCorrect(actual, 'expected.txt',
                         max_permutation_cases=None)
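"Differ only in order" means the two outputs contain the same lines the same number of times. The sketch below illustrates that condition with a multiset comparison; it does not model the limit N, and it is not a description of how tdda performs the check. `is_permutation` is a hypothetical helper.

```python
from collections import Counter

def is_permutation(actual_lines, expected_lines) -> bool:
    """True if both lists hold the same lines with the same multiplicities."""
    return Counter(actual_lines) == Counter(expected_lines)

expected = ["alpha,1", "beta,2", "gamma,3"]
reordered = is_permutation(["gamma,3", "alpha,1", "beta,2"], expected)
# A dropped line and a duplicated line are real differences, not reorderings.
dropped = is_permutation(["alpha,1", "beta,2"], expected)
duplicated = is_permutation(["alpha,1", "alpha,1", "gamma,3"], expected)
```

Only the first comparison succeeds, which is why allowing permutations does not hide missing or repeated lines.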
Assertion API: DataFrames
The DataFrame assertion methods work with Pandas 2.x and 3.x (all three
backends: numpy_nullable, pyarrow, and original) and with Polars.
You can even compare DataFrames across engines—e.g. a Pandas actual
against a Polars reference—with the engine parameter if needed.
assertDataFramesEquivalent(df, ref_df, ...)
Compare two in-memory DataFrames (Pandas or Polars).
assertDataFrameCorrect(df, ref_path, ...)
Compare an in-memory DataFrame against a reference file (CSV or Parquet).
assertStoredDataFrameCorrect(actual_path, ref_path, ...)
Compare two DataFrames both stored on disk.
assertStoredDataFramesCorrect(actual_paths, ref_paths, ...)
Compare multiple pairs of on-disk DataFrames.
check_data and check_types—exclude columns
The most common use is excluding columns whose values are legitimately variable (random seeds, run IDs, timestamps):
# Exclude the 'random' column from both value and type checks:
columns = self.all_fields_except(['random'])
self.assertDataFrameCorrect(df, 'expected.csv',
                            check_data=columns,
                            check_types=columns)
check_data, check_types, and check_order all accept the same forms:
- None or True: check all fields (default)
- False: skip entirely
- a list of field names to check
- a function taking a DataFrame and returning a list of field names
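The function form is handy when the set of columns is not known until the test runs. A minimal sketch follows, with hypothetical column names and a stand-in object in place of a real DataFrame so that it runs without Pandas or Polars installed.

```python
from types import SimpleNamespace

VOLATILE = {"random", "run_id"}   # hypothetical names of columns to skip

def stable_fields(df):
    """Return the names of all columns except the volatile ones."""
    return [name for name in df.columns if name not in VOLATILE]

# Stand-in exposing only a `.columns` attribute, as DataFrames do.
fake_df = SimpleNamespace(
    columns=["country", "date", "revenue", "random", "run_id"])
fields = stable_fields(fake_df)
```

Given the forms listed above, such a function could be passed as check_data=stable_fields and check_types=stable_fields, and would keep working if new stable columns were added later.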
sortby—sort before comparing
Use when row order is non-deterministic:
self.assertDataFrameCorrect(df, 'expected.csv',
                            sortby=['country', 'date'])
condition—filter rows before comparing
Use when only a subset of rows is relevant to the test:
# Only compare rows where status is 'complete':
self.assertDataFrameCorrect(df, 'expected.csv',
                            condition=lambda df: df['status'] == 'complete')
precision—floating-point tolerance
Default is 7 decimal places. Loosen it when values come via CSV (which can lose precision):
self.assertDataFrameCorrect(df, 'expected.csv', precision=5)
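To see why a CSV round trip can call for a looser precision, consider a value written with six decimal places. The sketch below uses one plausible reading of "decimal places" (equality after rounding); tdda's exact comparison rule may differ, and `agree_to` is a hypothetical helper.

```python
def agree_to(a: float, b: float, places: int) -> bool:
    """One plausible reading of precision: equal after rounding to `places`."""
    return round(a, places) == round(b, places)

computed = 0.1234567891
# Simulate a CSV writer that keeps six decimals: the text is "0.123457".
from_csv = float(f"{computed:.6f}")

strict = agree_to(computed, from_csv, 7)   # differs in the seventh decimal
loose = agree_to(computed, from_csv, 5)    # agrees to five decimals
```

The value is unchanged in any meaningful sense, but it fails at seven places and passes at five. Loosen the precision only as far as the storage format requires.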
type_matching—dtype strictness
- 'strict' (default for Parquet): dtypes must be identical
- 'medium' (default for CSV): same underlying type (int, float, datetime) but different bit width or nullability allowed
- 'loose': anything Pandas can compare
# CSV round-trips can change int64 to float64—use medium:
self.assertDataFrameCorrect(df, 'expected.csv', type_matching='medium')
fuzzy_nulls—treat different null types as equal
# NaN (np.nan) and None treated as equivalent:
self.assertDataFramesEquivalent(df, ref_df, fuzzy_nulls=True)
engine—Pandas or Polars
Inferred automatically from the DataFrames. Only needed when comparing across types (a Pandas actual against a Polars reference or vice versa):
self.assertDataFramesEquivalent(pandas_df, polars_df, engine='pandas')
tdda diff—Understanding DataFrame Failures
When a DataFrame assertion fails, the failure message suggests one or
more diff commands. For tabular data, it often suggests both a raw
diff and a tdda diff:
Compare with:
diff /tmp/actual-expected.csv /path/to/testdata/expected.csv
Compare with:
tdda diff /tmp/actual-expected.csv /path/to/testdata/expected.csv
tdda diff uses the same comparison logic as the assertion methods and
produces a structured summary: which columns differ, how many rows, and
a table showing the differing values side by side. It is much easier to
read than raw diff for anything beyond a handful of rows. Always prefer
it for DataFrame failures. Example output:
Columns with differences: 1 / 12
Rows with differences: 3 / 1000
Values:
Row Column Actual Expected
42 revenue 1500.50 1500.00
108 revenue 0.00 NaN
731 revenue 999.99 1000.00
It accepts the same field-selection flags as the assertion methods:
tdda diff actual.csv expected.csv --xfields random,run_id
Assertion API: Binary Files
assertBinaryFileCorrect(actual_path, ref_path)
Check that a binary file is byte-for-byte identical to a reference file.
No options for partial matching—if you need that, extract the relevant
data and use a string or DataFrame assertion instead.
Generating Tests Automatically with Gentest
If you have a command-line process—a script, a shell command, an R
program—tdda gentest can generate a test suite for it:
tdda gentest 'python my_analysis.py input.csv' testsuite.py
Gentest runs the command multiple times, captures all outputs (stdout,
stderr, exit code, any files written), detects which parts vary between
runs, and writes a test script that checks the stable parts. The
generated script uses tdda.referencetest and can be run and maintained
like any other reference test.
Inspect the generated test and the reference outputs before trusting them. Gentest is good at generating structurally correct tests; you still need to verify that the reference outputs are actually correct.
The Reference Test Checklist
☐ Create at least one reference test for every analytical process you write.
☐ Run tests before making changes, so you know the baseline.
☐ Run tests after making changes, before assuming they worked.
☐ When a test fails, read the diff before doing anything else.
☐ Never run -W without first verifying the new output is correct.
☐ Prefer -1W (or --tagged --write-all -s) over bare -W—rewrite only the references that actually failed.
☐ Use -F and tdda tag to automatically tag failing tests for targeted reruns and rewrites.
☐ After writing references, inspect the files. Tests passing after -W or --write-all is not evidence of correctness.
☐ Ensure reference files are clean in git before running -W, so you can use git diff to review changes and revert with git checkout -- testdata/ if needed.
☐ Consider unit-enhanced reference tests for anything with checkable semantic structure.
☐ Add a regression test for every bug you fix.
☐ 1 test vs. 0 tests is a bigger difference than 100 vs. 1.
Further Reading
- TDDA library documentation
- Reference test examples
- man tdda, man tdda-gentest, man tdda-diff
- Test-Driven Data Analysis (Radcliffe, CRC Press, 2026), Part II, Chapters 9–12
- Book resources




