<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Test-Driven Data Analysis</title><link href="https://tdda.info/" rel="alternate"></link><link href="https://tdda.info/feeds/all.atom.xml" rel="self"></link><id>https://tdda.info/</id><updated>2026-05-20T12:00:00+01:00</updated><entry><title>Reference Testing with TDDA: A Guide for LLMs and Coding Bots</title><link href="https://tdda.info/reference-testing-with-tdda-a-guide-for-llms-and-coding-bots.html" rel="alternate"></link><published>2026-05-20T12:00:00+01:00</published><updated>2026-05-20T12:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2026-05-20:/reference-testing-with-tdda-a-guide-for-llms-and-coding-bots.html</id><summary type="html">&lt;p&gt;&lt;em&gt;&lt;a href="https://tdda.info/md/076-reference-testing-with-tdda-a-guide-for-llms-and-coding-bots.md"&gt;Raw markdown&lt;/a&gt; for direct use in LLM context windows.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This post is addressed primarily to LLMs and coding agents—it was
distilled by Claude from &lt;a href="https://www.routledge.com/Test-Driven-Data-Analysis/Radcliffe/p/book/9781032897158"&gt;&lt;em&gt;Test-Driven Data Analysis&lt;/em&gt;&lt;/a&gt;
(Radcliffe, CRC Press, 2026; &lt;a href="https://book.tdda.info"&gt;extra resources&lt;/a&gt;),
the &lt;a href="https://github.com/tdda/tdda"&gt;tdda source code&lt;/a&gt;,
the &lt;a href="https://tdda.readthedocs.io"&gt;documentation&lt;/a&gt;, and the
&lt;a href="https://tdda.readthedocs.io/en/latest/cli.html"&gt;man pages&lt;/a&gt;
(also available as &lt;a href="https://github.com/tdda/tdda/blob/master/doc/source/cli.md"&gt;markdown&lt;/a&gt;).
If you …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;&lt;a href="https://tdda.info/md/076-reference-testing-with-tdda-a-guide-for-llms-and-coding-bots.md"&gt;Raw markdown&lt;/a&gt; for direct use in LLM context windows.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This post is addressed primarily to LLMs and coding agents—it was
distilled by Claude from &lt;a href="https://www.routledge.com/Test-Driven-Data-Analysis/Radcliffe/p/book/9781032897158"&gt;&lt;em&gt;Test-Driven Data Analysis&lt;/em&gt;&lt;/a&gt;
(Radcliffe, CRC Press, 2026; &lt;a href="https://book.tdda.info"&gt;extra resources&lt;/a&gt;),
the &lt;a href="https://github.com/tdda/tdda"&gt;tdda source code&lt;/a&gt;,
the &lt;a href="https://tdda.readthedocs.io"&gt;documentation&lt;/a&gt;, and the
&lt;a href="https://tdda.readthedocs.io/en/latest/cli.html"&gt;man pages&lt;/a&gt;
(also available as &lt;a href="https://github.com/tdda/tdda/blob/master/doc/source/cli.md"&gt;markdown&lt;/a&gt;).
If you are a human who wants a dense, no-nonsense introduction to
reference testing with TDDA, you are welcome too. It was written in
response to repeated direct observations of coding agents—including
Claude—using &lt;code&gt;-W&lt;/code&gt; to make failing tests pass without verifying that
the new output was correct.&lt;/p&gt;
&lt;h3 id="what-is-a-reference-test"&gt;What Is a Reference Test?&lt;/h3&gt;
&lt;p&gt;A reference test is a semantic regression test. 'Semantic' here
means the comparison understands equivalence: outputs that differ
in timestamps, row order, version strings, hostnames, file paths,
or other legitimately variable content can still pass — only
meaningful changes cause failures. You run your
analytical process or function, capture its outputs, inspect them
and decide they are correct, then store them as &lt;em&gt;reference outputs&lt;/em&gt;.
From that point on, the test reruns the process and checks that the
outputs still match the reference. If something changes—a library
update, a refactoring, a subtle bug—the test catches it.&lt;/p&gt;
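&lt;p&gt;As a rough illustration of what a semantic comparison means, the sketch
below normalizes timestamps before comparing two outputs. It is plain Python
and not how &lt;code&gt;tdda.referencetest&lt;/code&gt; is implemented; the names
&lt;code&gt;TIMESTAMP&lt;/code&gt;, &lt;code&gt;normalize&lt;/code&gt; and &lt;code&gt;semantically_equal&lt;/code&gt;
are hypothetical, and only one kind of variable content is handled.&lt;/p&gt;

```python
import re

# Hypothetical pattern for ISO-style timestamps such as 2026-05-20T12:00:00.
TIMESTAMP = re.compile(r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}')


def normalize(text):
    """Replace every timestamp with a fixed placeholder."""
    return TIMESTAMP.sub('TIMESTAMP', text)


def semantically_equal(actual, reference):
    """Compare two outputs, ignoring differences in timestamps only."""
    return normalize(actual) == normalize(reference)


reference = 'Run at 2026-05-20T12:00:00\nTotal: 42 records\n'
same_but_later = 'Run at 2026-06-01T09:30:15\nTotal: 42 records\n'
different_total = 'Run at 2026-06-01T09:30:15\nTotal: 41 records\n'

print(semantically_equal(same_but_later, reference))   # True
print(semantically_equal(different_total, reference))  # False
```

&lt;p&gt;The later run still matches because only the timestamp moved; the changed
record count is a meaningful difference and fails the comparison.&lt;/p&gt;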
&lt;p&gt;The &lt;code&gt;tdda.referencetest&lt;/code&gt; library makes this easy for Python code.
It extends &lt;code&gt;unittest&lt;/code&gt; (and works with &lt;code&gt;pytest&lt;/code&gt;) and provides
assertion methods for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;strings and text files (&lt;code&gt;assertStringCorrect&lt;/code&gt;, &lt;code&gt;assertTextFileCorrect&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;data frames (&lt;code&gt;assertDataFrameCorrect&lt;/code&gt;, &lt;code&gt;assertDataFramesEquivalent&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;binary files (&lt;code&gt;assertBinaryFileCorrect&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It also provides &lt;code&gt;tdda gentest&lt;/code&gt;, a command-line tool that watches
your process run and generates a test suite automatically.&lt;/p&gt;
&lt;p&gt;The key insight: reference testing is particularly well suited to
analytical code because the outputs are often large, complex, and
hard to specify in advance. You develop the process, inspect the output carefully and verify
&lt;em&gt;once&lt;/em&gt; that it is right, store it, and thereafter the test does the checking.&lt;/p&gt;
&lt;h3 id="the-w-write-all-trap"&gt;The &lt;code&gt;-W&lt;/code&gt; (&lt;code&gt;--write-all&lt;/code&gt;) Trap&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Do not blindly run &lt;code&gt;-W&lt;/code&gt; (or &lt;code&gt;--write-all&lt;/code&gt;) merely to make failing tests pass.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is the most important thing in this document.&lt;/p&gt;
&lt;p&gt;When a reference test fails, &lt;code&gt;tdda.referencetest&lt;/code&gt; tells you what
differs between the actual output and the stored reference. It also
offers you a way to update the reference to match the new output:
pass &lt;code&gt;-W&lt;/code&gt; (or &lt;code&gt;--write-all&lt;/code&gt; under &lt;code&gt;pytest&lt;/code&gt;) when running the tests.&lt;/p&gt;
&lt;p&gt;Running &lt;code&gt;-W&lt;/code&gt; overwrites the reference files with whatever the code
currently produces. After this, the tests pass. &lt;strong&gt;This tells you
nothing about whether the output is correct.&lt;/strong&gt; The tests will pass
even if the output is completely wrong, because you just told them
that the wrong output is the new reference.&lt;/p&gt;
&lt;p&gt;The correct workflow when tests fail is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read the failure message. It tells you what changed.&lt;/li&gt;
&lt;li&gt;Run the &lt;code&gt;diff&lt;/code&gt; command suggested in the failure output.&lt;/li&gt;
&lt;li&gt;Look at the actual differences. Are they expected? Are they correct?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Only if you have verified the new output is correct:&lt;/strong&gt; update
   the references. With unittest, use &lt;code&gt;-1W&lt;/code&gt; (tagged tests only,
   recommended) or &lt;code&gt;-W&lt;/code&gt; (all); with pytest, use
   &lt;code&gt;--tagged --write-all -s&lt;/code&gt; or &lt;code&gt;--write-all -s&lt;/code&gt;. See
   &lt;a href="#running-a-subset-of-tests-with-tags"&gt;Running a Subset of Tests with Tags&lt;/a&gt;
   for the full commands. If you have a tame human to hand, this is a
   good moment to involve them—humans are often better at judging
   whether output is actually correct, and can get quite sweary when
   you overwrite correct reference results with nonsense.&lt;/li&gt;
&lt;li&gt;Run the tests again (without &lt;code&gt;-W&lt;/code&gt;) to confirm they pass.&lt;/li&gt;
&lt;li&gt;Check the updated reference files into version control.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you skip step 3 and go straight to &lt;code&gt;-W&lt;/code&gt;, you have not tested
anything. You have merely synchronized the reference to whatever
the code happens to produce right now.&lt;/p&gt;
&lt;h4 id="safe-use-of-w-the-git-audit-pattern"&gt;Safe use of &lt;code&gt;-W&lt;/code&gt;: the git audit pattern&lt;/h4&gt;
&lt;p&gt;If the reference files are clean in git before you run &lt;code&gt;-W&lt;/code&gt;, you can use
&lt;code&gt;git diff&lt;/code&gt; afterwards to inspect exactly what changed, and
&lt;code&gt;git checkout -- path/to/testdata/&lt;/code&gt; to revert all reference changes
at once if anything looks wrong. This makes &lt;code&gt;-W&lt;/code&gt; a controlled and
auditable operation, but only if the working tree was clean before you
ran it. Check that it is clean (for example with &lt;code&gt;git status&lt;/code&gt;) before running &lt;code&gt;-W&lt;/code&gt;, not after.&lt;/p&gt;
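&lt;p&gt;A minimal sketch of that pre-flight check, assuming the reference files
live under a directory tracked by git. The function names are hypothetical;
&lt;code&gt;git status --porcelain&lt;/code&gt; prints one line per modified or untracked
path and nothing at all when the path is clean.&lt;/p&gt;

```python
import subprocess


def is_clean(porcelain_output):
    """True if `git status --porcelain` reported no changes."""
    return porcelain_output.strip() == ''


def reference_dir_is_clean(path):
    """Ask git whether anything under `path` is modified or untracked."""
    result = subprocess.run(
        ['git', 'status', '--porcelain', '--', path],
        capture_output=True, text=True, check=True,
    )
    return is_clean(result.stdout)
```

&lt;p&gt;Call &lt;code&gt;reference_dir_is_clean('tests/testdata')&lt;/code&gt; (the path is
an example) and proceed to a rewrite only if it returns &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;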
&lt;h3 id="unit-enhanced-reference-tests"&gt;Unit-Enhanced Reference Tests&lt;/h3&gt;
&lt;p&gt;The test code and command-line flags differ between reference tests
built on &lt;code&gt;unittest&lt;/code&gt; and those built on &lt;code&gt;pytest&lt;/code&gt;. The sections below
cover the &lt;code&gt;unittest&lt;/code&gt; variants first; see
&lt;a href="#writing-reference-tests-with-pytest"&gt;Writing Reference Tests with pytest&lt;/a&gt;
for the pytest equivalents. Where flags differ, both are given.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;unittest&lt;/th&gt;
&lt;th&gt;pytest&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Run all tests&lt;/td&gt;
&lt;td&gt;&lt;code&gt;python tests.py -F&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pytest tests/ --log-failures&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run only tagged tests&lt;/td&gt;
&lt;td&gt;&lt;code&gt;python tests.py -F -1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pytest tests/ --log-failures --tagged&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rewrite all references&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-W&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;--write-all -s&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rewrite tagged only&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-1W&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;--tagged --write-all -s&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Full syntax and explanations for each are in the sections below; this
table is a quick reference.&lt;/p&gt;
&lt;p&gt;A partial structural defence against careless &lt;code&gt;-W&lt;/code&gt; use is
&lt;em&gt;unit-enhanced reference tests&lt;/em&gt;: after the reference assertion, add
one or more specific assertions about things that must be true
regardless (shown here in &lt;code&gt;unittest&lt;/code&gt; style):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run_my_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# These survive a careless -W rewrite:&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertIn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Total: 42 records&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertTrue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;OK&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The reference assertion runs first. If it fails, &lt;code&gt;tdda&lt;/code&gt; writes the
actual output and suggests a diff command—the normal workflow. If
you then carelessly rewrite with &lt;code&gt;-W&lt;/code&gt;, the subsequent assertions will
still fail if the output is wrong in ways they cover.&lt;/p&gt;
&lt;p&gt;This is not a complete defence—you have to choose the assertions
carefully—but it makes it much harder to accidentally accept a broken
result. Choose assertions that reflect the core correctness property
the test was designed to verify.&lt;/p&gt;
&lt;p&gt;This pattern emerged from the author's direct experience of coding
agents (including Claude) repeatedly using &lt;code&gt;-W&lt;/code&gt; to make tests pass
without verifying the results. It is recommended for any test where
the reference output has semantic structure that can be spot-checked.&lt;/p&gt;
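&lt;p&gt;The reasoning can be simulated without &lt;code&gt;tdda&lt;/code&gt; at all. In the
toy sketch below, &lt;code&gt;reference_check&lt;/code&gt; is a hypothetical stand-in for a
reference assertion (it is not tdda's implementation) and
&lt;code&gt;invariant_check&lt;/code&gt; plays the role of the extra unit assertion.&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def reference_check(actual, ref_path, rewrite=False):
    """Toy stand-in for a reference assertion (not tdda's implementation)."""
    if rewrite:
        # Simulates a careless rewrite: the reference becomes whatever we got.
        ref_path.write_text(actual)
    return ref_path.read_text() == actual


def invariant_check(actual):
    """Spot-check that must hold whatever the reference file says."""
    return 'Total: 42 records' in actual


with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / 'expected.txt'
    ref.write_text('Total: 42 records\nOK\n')
    broken = 'Total: 0 records\nOK\n'

    before = reference_check(broken, ref)               # False
    after = reference_check(broken, ref, rewrite=True)  # True
    still_caught = not invariant_check(broken)          # True
```

&lt;p&gt;After the rewrite the reference comparison passes on broken output, but
the invariant still rejects it.&lt;/p&gt;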
&lt;h3 id="the-f-log-failures-flag"&gt;The &lt;code&gt;-F&lt;/code&gt; (&lt;code&gt;--log-failures&lt;/code&gt;) Flag&lt;/h3&gt;
&lt;p&gt;Always pass the log-failures flag when running tests. It logs the IDs
of any failing tests to a timestamped file
(&lt;code&gt;YYYY-MM-DDTHHMMSS-failing-tests.txt&lt;/code&gt;) in your system temp directory
(overridable with &lt;code&gt;$TDDA_FAIL_DIR&lt;/code&gt;). This enables the &lt;code&gt;tdda tag&lt;/code&gt;
workflow: &lt;code&gt;tdda tag&lt;/code&gt; reads the most recent such file and adds &lt;code&gt;@tag&lt;/code&gt;
to the failing tests, so you can re-run and regenerate references for
just those tests.&lt;/p&gt;
&lt;p&gt;Without the flag, no failures file is written and &lt;code&gt;tdda tag&lt;/code&gt; has
nothing to work with.&lt;/p&gt;
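&lt;p&gt;To see which failures file would be picked up, a sketch like the
following may help. It assumes only the file naming and the
&lt;code&gt;$TDDA_FAIL_DIR&lt;/code&gt; override described above; that the default
location equals Python's &lt;code&gt;tempfile.gettempdir()&lt;/code&gt; is an assumption,
and both function names are hypothetical.&lt;/p&gt;

```python
import os
import tempfile
from pathlib import Path


def failures_dir():
    """Directory assumed to hold the failure logs."""
    return Path(os.environ.get('TDDA_FAIL_DIR', tempfile.gettempdir()))


def latest_failures_file(directory):
    """Return the newest *-failing-tests.txt file in directory, or None."""
    candidates = sorted(Path(directory).glob('*-failing-tests.txt'))
    # The timestamp prefix sorts lexicographically in chronological order.
    return candidates[-1] if candidates else None
```

&lt;p&gt;&lt;code&gt;latest_failures_file(failures_dir())&lt;/code&gt; returns
&lt;code&gt;None&lt;/code&gt; when no run with the flag has recorded failures yet.&lt;/p&gt;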
&lt;h4 id="with-unittest-running-directly-with-python"&gt;With unittest (running directly with Python)&lt;/h4&gt;
&lt;p&gt;Pass &lt;code&gt;-F&lt;/code&gt; (or &lt;code&gt;--log-failures&lt;/code&gt;):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;python tests/test_mycode.py -F
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id="with-pytest"&gt;With pytest&lt;/h4&gt;
&lt;p&gt;Pass &lt;code&gt;--log-failures&lt;/code&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pytest tests/ --log-failures
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id="permanent-default"&gt;Permanent default&lt;/h4&gt;
&lt;p&gt;To avoid passing the flag every time, add this to &lt;code&gt;~/.tdda.toml&lt;/code&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;[referencetest]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;log_failures&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This modifies the user's global configuration. Consult your human
before doing it.&lt;/p&gt;
&lt;h3 id="the-kicker-the-w-problem-is-not-restricted-to-tdda"&gt;The Kicker: the &lt;code&gt;-W&lt;/code&gt; Problem is Not Restricted to TDDA&lt;/h3&gt;
&lt;p&gt;The anti-pattern described above — rewriting expected outputs to make
tests pass without verifying the new output is correct — applies far
beyond &lt;code&gt;tdda.referencetest&lt;/code&gt;. LLM coding agents routinely treat passing
tests as the goal rather than as evidence of correctness. Whether
rewriting a reference file, updating a pytest snapshot, regenerating
Jest snapshots, or changing a hardcoded expected value in an assertion,
the same question applies first: is the new result actually correct?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Green tests after any kind of expected-value rewrite tell you nothing
about correctness. They tell you only that the code now matches
whatever you told it to match.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The correct workflow is the same regardless of framework:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A test fails. Read the failure. What changed?&lt;/li&gt;
&lt;li&gt;Is the change correct, or is it a bug?&lt;/li&gt;
&lt;li&gt;Only if correct: update the expected value.&lt;/li&gt;
&lt;li&gt;If you're not sure: ask your human.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The specific value of &lt;code&gt;tdda.referencetest&lt;/code&gt; is that it makes step 1
easy — the diff tooling is built in, and &lt;code&gt;-F&lt;/code&gt;/&lt;code&gt;tdda tag&lt;/code&gt;/&lt;code&gt;-1W&lt;/code&gt; limit
the blast radius. But the discipline is universal.&lt;/p&gt;
&lt;h3 id="running-a-subset-of-tests-with-tags"&gt;Running a Subset of Tests with Tags&lt;/h3&gt;
&lt;p&gt;To run only some tests, use the &lt;code&gt;@tag&lt;/code&gt; decorator:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.referencetest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestMyProcess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="nd"&gt;@tag&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_main_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run_my_process&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected_output.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_other_thing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;@tag&lt;/code&gt; can decorate individual test methods or entire test classes.
The flags to run only tagged tests differ between unittest and pytest.&lt;/p&gt;
&lt;h4 id="with-unittest-running-directly-with-python_1"&gt;With unittest (running directly with Python)&lt;/h4&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;test_mycode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mh"&gt;1&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;only&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;test_mycode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mh"&gt;1&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;regenerate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;references&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;only&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;-1W&lt;/code&gt; combines &lt;code&gt;-1&lt;/code&gt; and &lt;code&gt;-W&lt;/code&gt; (&lt;code&gt;--write-all&lt;/code&gt;). This is the safe way to
regenerate, because it limits the blast radius to tests you have
explicitly chosen and tagged.&lt;/p&gt;
&lt;h4 id="with-pytest_1"&gt;With pytest&lt;/h4&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="p"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;only&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;regenerate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;references&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;only&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Pass &lt;code&gt;-s&lt;/code&gt; to prevent pytest from capturing output, so that &lt;code&gt;tdda&lt;/code&gt;
can report which reference files were written.&lt;/p&gt;
&lt;h4 id="the-full-workflow-with-tdda-tag"&gt;The full workflow with &lt;code&gt;tdda tag&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;-F&lt;/code&gt; → &lt;code&gt;tdda tag&lt;/code&gt; → &lt;code&gt;-1W&lt;/code&gt; workflow lets you rewrite only the
references that actually failed, without manually deciding which tests
to tag:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run tests with &lt;code&gt;-F&lt;/code&gt; (or &lt;code&gt;--log-failures&lt;/code&gt;) to record failing test IDs&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;tdda tag&lt;/code&gt; to add &lt;code&gt;@tag&lt;/code&gt; to those tests automatically&lt;/li&gt;
&lt;li&gt;Inspect the diffs to verify the new output is correct&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;-1W&lt;/code&gt; (or &lt;code&gt;--tagged --write-all -s&lt;/code&gt;) to rewrite only those references&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;make untag&lt;/code&gt; (or the sed command below) to remove the tags&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is always preferable to bare &lt;code&gt;-W&lt;/code&gt;, which rewrites every reference
file regardless of whether the test failed.&lt;/p&gt;
&lt;h4 id="removing-stale-tags"&gt;Removing stale tags&lt;/h4&gt;
&lt;p&gt;Before adding new tags, remove any stale &lt;code&gt;@tag&lt;/code&gt; decorators from
previous sessions. If the project provides a &lt;code&gt;make untag&lt;/code&gt; target, use
it; otherwise you can use:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;macOS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BSD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;sed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/^[[:space:]]*@tag[[:space:]]*$/d&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;test_mycode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Linux&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GNU&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;sed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/^[[:space:]]*@tag[[:space:]]*$/d&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;test_mycode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="writing-reference-tests-with-unittest"&gt;Writing Reference Tests with &lt;code&gt;unittest&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;A minimal test file:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.referencetest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;

&lt;span class="n"&gt;TESTDIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vm"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                       &lt;span class="s1"&gt;&amp;#39;testdata&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestMyProcess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run_my_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                 &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TESTDIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;produce_dataframe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertDataFrameCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TESTDIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When running under &lt;code&gt;pytest&lt;/code&gt;, the &lt;code&gt;if __name__ == '__main__':&lt;/code&gt; block is
simply ignored—the same test file works with both runners unchanged.&lt;/p&gt;
&lt;p&gt;Run it:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_myprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="p"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_myprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mh"&gt;1&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;only&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_myprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mh"&gt;1&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;regenerate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;references&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first time you run with &lt;code&gt;-1W&lt;/code&gt; after writing a new test, it
&lt;em&gt;writes&lt;/em&gt; the reference file. Subsequent runs without &lt;code&gt;-W&lt;/code&gt; compare against it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;After writing references with &lt;code&gt;-1W&lt;/code&gt;, always inspect the files that
were written.&lt;/strong&gt; The fact that the test now passes means only that the
reference matches the output. It says nothing about whether either is
correct.&lt;/p&gt;
&lt;h3 id="writing-reference-tests-with-pytest"&gt;Writing Reference Tests with &lt;code&gt;pytest&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The same test classes work under &lt;code&gt;pytest&lt;/code&gt;, with different flags:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt;                           &lt;/span&gt;&lt;span class="p"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt;                  &lt;/span&gt;&lt;span class="p"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;only&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;regenerate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;references&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;--write-all&lt;/code&gt; instead of &lt;code&gt;-W&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;--tagged&lt;/code&gt; instead of &lt;code&gt;-1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pass &lt;code&gt;-s&lt;/code&gt; to prevent &lt;code&gt;pytest&lt;/code&gt; from capturing output, so that &lt;code&gt;tdda&lt;/code&gt;
  can report which reference files were written.&lt;/li&gt;
&lt;li&gt;The short flags &lt;code&gt;-W&lt;/code&gt; and &lt;code&gt;-1&lt;/code&gt; are &lt;code&gt;tdda&lt;/code&gt; extensions; they only work
  when running the test file directly with Python, not under &lt;code&gt;pytest&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="assertion-api-text-and-strings"&gt;Assertion API: Text and Strings&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;assertStringCorrect(string, ref_path, ...)&lt;/code&gt;&lt;/strong&gt;
Check an in-memory string against a reference text file.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;assertTextFileCorrect(actual_path, ref_path, ...)&lt;/code&gt;&lt;/strong&gt;
Check a text file on disk against a reference text file.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;assertTextFilesCorrect(actual_paths, ref_paths, ...)&lt;/code&gt;&lt;/strong&gt;
Check multiple text files against corresponding reference files.&lt;/p&gt;
&lt;p&gt;All three share these optional parameters for handling variable output:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;lstrip=True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Strip leading whitespace from each line before comparing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rstrip=True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Strip trailing whitespace from each line before comparing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ignore_substrings=['foo','bar']&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ignore any line in the &lt;em&gt;expected&lt;/em&gt; file containing one of these substrings; the corresponding actual line can be anything&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ignore_patterns=[r'pattern']&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lines differing only in substrings matching these regexes pass; text outside the match must be identical in both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;remove_lines=['foo']&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Remove lines containing these substrings from &lt;em&gt;both&lt;/em&gt; actual and expected before comparing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;preprocess=fn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Apply &lt;code&gt;fn(list_of_lines)&lt;/code&gt; to both actual and expected (as lists of strings) before comparing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_permutation_cases=N&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pass if lines differ only in order, up to N permutations; &lt;code&gt;None&lt;/code&gt; = unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4 id="ignore_substringsignore-whole-lines-by-substring"&gt;&lt;code&gt;ignore_substrings&lt;/code&gt;—ignore whole lines by substring&lt;/h4&gt;
&lt;p&gt;Lines in the expected output containing the substring are skipped.
The match is against the expected file only—the actual output can
have anything on those lines (or nothing):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Reference file contains:&lt;/span&gt;
&lt;span class="c1"&gt;#   Copyright (c) Stochastic Solutions Limited, 2016&lt;/span&gt;
&lt;span class="c1"&gt;#   Version 0.0.0&lt;/span&gt;
&lt;span class="c1"&gt;# Actual output has current year and version—but we don&amp;#39;t care:&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ignore_substrings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Copyright&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Version&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id="ignore_patternsignore-variable-substrings-within-a-line"&gt;&lt;code&gt;ignore_patterns&lt;/code&gt;—ignore variable substrings within a line&lt;/h4&gt;
&lt;p&gt;Lines pass if they differ only in parts matching the regex.
Everything outside the match must be identical in both files:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Actual:   &amp;quot;Generated: 2026-05-20T14:32:01 by pipeline v2.3.1&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;# Expected: &amp;quot;Generated: 2024-01-15T09:00:00 by pipeline v1.0.0&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;# Both lines still match with:&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ignore_patterns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;\d&lt;/span&gt;&lt;span class="si"&gt;{4}&lt;/span&gt;&lt;span class="s1"&gt;-\d&lt;/span&gt;&lt;span class="si"&gt;{2}&lt;/span&gt;&lt;span class="s1"&gt;-\d&lt;/span&gt;&lt;span class="si"&gt;{2}&lt;/span&gt;&lt;span class="s1"&gt;T\d&lt;/span&gt;&lt;span class="si"&gt;{2}&lt;/span&gt;&lt;span class="s1"&gt;:\d&lt;/span&gt;&lt;span class="si"&gt;{2}&lt;/span&gt;&lt;span class="s1"&gt;:\d&lt;/span&gt;&lt;span class="si"&gt;{2}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;v\d+\.\d+\.\d+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;ignore_patterns&lt;/code&gt; is stricter than &lt;code&gt;ignore_substrings&lt;/code&gt;: the non-matching
parts of each line must agree exactly, so you cannot accidentally mask
a real change in the surrounding text.&lt;/p&gt;
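&lt;p&gt;The rule can be pictured with a small standalone sketch. This is not
&lt;code&gt;tdda&lt;/code&gt;'s implementation, only an illustration of the idea that text outside
the matches must agree; the helper names &lt;code&gt;mask&lt;/code&gt; and &lt;code&gt;lines_equivalent&lt;/code&gt;
and the &lt;code&gt;changed&lt;/code&gt; line are hypothetical:&lt;/p&gt;

```python
import re

PLACEHOLDER = "\x00"  # stand-in for any ignored match; assumed absent from real text


def mask(line, patterns):
    """Replace every regex match in line with a fixed placeholder."""
    for pattern in patterns:
        line = re.sub(pattern, lambda match: PLACEHOLDER, line)
    return line


def lines_equivalent(actual, expected, patterns):
    """True if the two lines are identical once pattern matches are masked."""
    return mask(actual, patterns) == mask(expected, patterns)


patterns = [r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", r"v\d+\.\d+\.\d+"]
actual = "Generated: 2026-05-20T14:32:01 by pipeline v2.3.1"
expected = "Generated: 2024-01-15T09:00:00 by pipeline v1.0.0"
changed = "Created: 2026-05-20T14:32:01 by pipeline v2.3.1"  # hypothetical edit

print(lines_equivalent(actual, expected, patterns))   # True
print(lines_equivalent(changed, expected, patterns))  # False
```

&lt;p&gt;The second comparison fails because the word before the timestamp changed,
which is exactly the kind of difference that should not be masked.&lt;/p&gt;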
&lt;h4 id="remove_linesstrip-lines-from-both-files"&gt;&lt;code&gt;remove_lines&lt;/code&gt;—strip lines from both files&lt;/h4&gt;
&lt;p&gt;Lines containing the substring are removed from both actual and expected
before comparing. Use this for optional or ephemeral lines that should
not appear in the reference at all:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Both files have lines like &amp;quot;WARNING: cache miss&amp;quot; that are&lt;/span&gt;
&lt;span class="c1"&gt;# present sometimes and absent other times:&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;remove_lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;WARNING: cache miss&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Unlike &lt;code&gt;ignore_substrings&lt;/code&gt;, &lt;code&gt;remove_lines&lt;/code&gt; strips from both sides, so
the reference file also need not contain these lines.&lt;/p&gt;
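&lt;p&gt;A minimal sketch of the "strip from both sides" idea, assuming simple
substring matching on each line. The helper &lt;code&gt;drop_lines&lt;/code&gt; and the sample
lines are hypothetical and are not part of &lt;code&gt;tdda&lt;/code&gt;:&lt;/p&gt;

```python
def drop_lines(lines, substrings):
    """Return the lines that contain none of the given substrings."""
    return [line for line in lines if not any(s in line for s in substrings)]


actual = ["loading data", "WARNING: cache miss", "rows: 100"]
expected = ["loading data", "rows: 100"]  # the reference never had the warning
removable = ["WARNING: cache miss"]

# Filter both sides in the same way, then compare what is left.
same = drop_lines(actual, removable) == drop_lines(expected, removable)
print(same)  # True
```

&lt;p&gt;Because the filter is applied to both sides, the comparison succeeds whether
or not the warning appears in either file.&lt;/p&gt;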
&lt;h4 id="preprocesstransform-both-files-before-comparing"&gt;&lt;code&gt;preprocess&lt;/code&gt;—transform both files before comparing&lt;/h4&gt;
&lt;p&gt;Takes a function that accepts a list of strings (lines) and returns
a transformed list. Applied to both actual and expected:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;strip_timestamps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# remove leading timestamp prefix &amp;quot;2026-05-20 14:32:01 &amp;quot; from each line&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;strip_timestamps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
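
&lt;p&gt;Slicing by position removes the first 20 characters of any long line, even one
with no timestamp. A slightly more defensive variant, sketched here on the
assumption that the prefix always has the form &lt;code&gt;YYYY-MM-DD HH:MM:SS&lt;/code&gt; followed
by a space, strips only when the prefix really is present. The function name
&lt;code&gt;strip_timestamp_prefix&lt;/code&gt; is hypothetical:&lt;/p&gt;

```python
import re

# Assumed prefix format: "YYYY-MM-DD HH:MM:SS " at the start of a line.
TIMESTAMP_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ")


def strip_timestamp_prefix(lines):
    """Remove a leading timestamp from each line; leave other lines untouched."""
    return [TIMESTAMP_PREFIX.sub("", line) for line in lines]


lines = ["2026-05-20 14:32:01 job started", "a long line with no timestamp at all"]
print(strip_timestamp_prefix(lines))
# ['job started', 'a long line with no timestamp at all']
```

&lt;p&gt;It takes and returns a list of strings, so it has the shape described above
for a &lt;code&gt;preprocess&lt;/code&gt; function.&lt;/p&gt;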

&lt;h4 id="max_permutation_casesallow-reordered-lines"&gt;&lt;code&gt;max_permutation_cases&lt;/code&gt;—allow reordered lines&lt;/h4&gt;
&lt;p&gt;Pass if the lines are a permutation of each other, up to the given
number of permutations. Use &lt;code&gt;None&lt;/code&gt; for unlimited:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Output order is non-deterministic, but the set of lines is fixed:&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_permutation_cases&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
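
&lt;p&gt;What "a permutation of each other" means can be sketched without
&lt;code&gt;tdda&lt;/code&gt;: the two texts must contain the same lines the same number of times.
This sketch does not model the limit that &lt;code&gt;max_permutation_cases=N&lt;/code&gt; imposes,
and the function name is hypothetical:&lt;/p&gt;

```python
from collections import Counter


def same_lines_any_order(actual, expected):
    """True if both texts contain the same lines the same number of times."""
    return Counter(actual.splitlines()) == Counter(expected.splitlines())


print(same_lines_any_order("b\na\nc\n", "a\nb\nc\n"))  # True
print(same_lines_any_order("a\na\nb\n", "a\nb\nb\n"))  # False
```

&lt;p&gt;The second case fails because duplicates count: both texts use the same
distinct lines, but in different quantities.&lt;/p&gt;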

&lt;h3 id="assertion-api-dataframes"&gt;Assertion API: DataFrames&lt;/h3&gt;
&lt;p&gt;The DataFrame assertion methods work with Pandas 2.x and 3.x (all three
backends: &lt;code&gt;numpy_nullable&lt;/code&gt;, &lt;code&gt;pyarrow&lt;/code&gt;, and &lt;code&gt;original&lt;/code&gt;) and with Polars.
You can even compare DataFrames across engines—e.g. a Pandas actual
against a Polars reference—with the &lt;code&gt;engine&lt;/code&gt; parameter if needed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;assertDataFramesEquivalent(df, ref_df, ...)&lt;/code&gt;&lt;/strong&gt;
Compare two in-memory DataFrames (Pandas or Polars).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;assertDataFrameCorrect(df, ref_path, ...)&lt;/code&gt;&lt;/strong&gt;
Compare an in-memory DataFrame against a reference file (CSV or Parquet).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;assertStoredDataFrameCorrect(actual_path, ref_path, ...)&lt;/code&gt;&lt;/strong&gt;
Compare two DataFrames both stored on disk.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;assertStoredDataFramesCorrect(actual_paths, ref_paths, ...)&lt;/code&gt;&lt;/strong&gt;
Compare multiple pairs of on-disk DataFrames.&lt;/p&gt;
&lt;h4 id="check_data-and-check_typesexclude-columns"&gt;&lt;code&gt;check_data&lt;/code&gt; and &lt;code&gt;check_types&lt;/code&gt;—exclude columns&lt;/h4&gt;
&lt;p&gt;The most common use is excluding columns whose values are legitimately
variable (random seeds, run IDs, timestamps):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Exclude the &amp;#39;random&amp;#39; column from both value and type checks:&lt;/span&gt;
&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;all_fields_except&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;random&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertDataFrameCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;check_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;check_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;check_data&lt;/code&gt;, &lt;code&gt;check_types&lt;/code&gt;, and &lt;code&gt;check_order&lt;/code&gt; all accept the same forms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;None&lt;/code&gt; or &lt;code&gt;True&lt;/code&gt;: check all fields (default)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;False&lt;/code&gt;: skip entirely&lt;/li&gt;
&lt;li&gt;a list of field names to check&lt;/li&gt;
&lt;li&gt;a function taking a DataFrame and returning a list of field names&lt;/li&gt;
&lt;/ul&gt;
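&lt;p&gt;The function form might look like the sketch below. The column names are
hypothetical, and a stand-in object with a &lt;code&gt;columns&lt;/code&gt; attribute is used so
that the example runs without Pandas or Polars installed:&lt;/p&gt;

```python
from types import SimpleNamespace


def non_volatile_fields(df):
    """Return the column names to check, skipping assumed-volatile ones."""
    volatile = {"random", "run_id", "created_at"}  # hypothetical column names
    return [name for name in df.columns if name not in volatile]


# Stand-in with a .columns attribute, so the sketch runs without pandas.
fake_df = SimpleNamespace(columns=["country", "date", "revenue", "run_id"])
print(non_volatile_fields(fake_df))  # ['country', 'date', 'revenue']
```

&lt;p&gt;A function of this shape could then be passed as
&lt;code&gt;check_data=non_volatile_fields&lt;/code&gt;, which keeps the exclusion rule in one
place if several tests share it.&lt;/p&gt;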
&lt;h4 id="sortbysort-before-comparing"&gt;&lt;code&gt;sortby&lt;/code&gt;—sort before comparing&lt;/h4&gt;
&lt;p&gt;Use when row order is non-deterministic:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertDataFrameCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;sortby&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;country&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id="conditionfilter-rows-before-comparing"&gt;&lt;code&gt;condition&lt;/code&gt;—filter rows before comparing&lt;/h4&gt;
&lt;p&gt;Use when only a subset of rows is relevant to the test:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Only compare rows where status is &amp;#39;complete&amp;#39;:&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertDataFrameCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;status&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;complete&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id="precisionfloating-point-tolerance"&gt;&lt;code&gt;precision&lt;/code&gt;—floating-point tolerance&lt;/h4&gt;
&lt;p&gt;Default is 7 decimal places. Loosen it when values come via CSV
(which can lose precision):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertDataFrameCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id="type_matchingdtype-strictness"&gt;&lt;code&gt;type_matching&lt;/code&gt;—dtype strictness&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;'strict'&lt;/code&gt; (default for Parquet): dtypes must be identical&lt;/li&gt;
&lt;li&gt;&lt;code&gt;'medium'&lt;/code&gt; (default for CSV): same underlying type (int, float, datetime)
  but different bit width or nullability allowed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;'loose'&lt;/code&gt;: anything Pandas can compare&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# CSV round-trips can change int64 to float64—use medium:&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertDataFrameCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;type_matching&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;medium&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id="fuzzy_nullstreat-different-null-types-as-equal"&gt;&lt;code&gt;fuzzy_nulls&lt;/code&gt;—treat different null types as equal&lt;/h4&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# pd.NaN and None treated as equivalent:&lt;/span&gt;
&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertDataFramesEquivalent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fuzzy_nulls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4 id="enginepandas-or-polars"&gt;&lt;code&gt;engine&lt;/code&gt;—Pandas or Polars&lt;/h4&gt;
&lt;p&gt;Inferred automatically from the DataFrames. Only needed when comparing
across types (a Pandas actual against a Polars reference or vice versa):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertDataFramesEquivalent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pandas_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;polars_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pandas&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="tdda-diffunderstanding-dataframe-failures"&gt;&lt;code&gt;tdda diff&lt;/code&gt;—Understanding DataFrame Failures&lt;/h3&gt;
&lt;p&gt;When a DataFrame assertion fails, the failure message suggests one or
more &lt;code&gt;diff&lt;/code&gt; commands. For tabular data, it often suggests both a raw
&lt;code&gt;diff&lt;/code&gt; and a &lt;code&gt;tdda diff&lt;/code&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Compare with:
    diff /tmp/actual-expected.csv /path/to/testdata/expected.csv
Compare with:
    tdda diff /tmp/actual-expected.csv /path/to/testdata/expected.csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;tdda diff&lt;/code&gt; uses the same comparison logic as the assertion methods and
produces a structured summary: which columns differ, how many rows, and
a table showing the differing values side by side. It is much easier to
read than raw &lt;code&gt;diff&lt;/code&gt; for anything beyond a handful of rows. Always prefer
it for DataFrame failures. Example output:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Columns with differences: 1 / 12
Rows with differences:    3 / 1000

Values:
  Row   Column    Actual    Expected
   42   revenue   1500.50   1500.00
  108   revenue      0.00       NaN
  731   revenue    999.99   1000.00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It accepts the same field-selection flags as the assertion methods:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;tdda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;diff&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;actual&lt;/span&gt;.&lt;span class="nv"&gt;csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;expected&lt;/span&gt;.&lt;span class="nv"&gt;csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nv"&gt;xfields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;random&lt;/span&gt;,&lt;span class="nv"&gt;run_id&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="assertion-api-binary-files"&gt;Assertion API: Binary Files&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;assertBinaryFileCorrect(actual_path, ref_path)&lt;/code&gt;&lt;/strong&gt;
Check that a binary file is byte-for-byte identical to a reference file.
No options for partial matching—if you need that, extract the relevant
data and use a string or DataFrame assertion instead.&lt;/p&gt;
&lt;h3 id="generating-tests-automatically-with-gentest"&gt;Generating Tests Automatically with Gentest&lt;/h3&gt;
&lt;p&gt;If you have a command-line process—a script, a shell command, an R
program—&lt;code&gt;tdda gentest&lt;/code&gt; can generate a test suite for it:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;tdda gentest &amp;#39;python my_analysis.py input.csv&amp;#39; testsuite.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Gentest runs the command multiple times, captures all outputs (stdout,
stderr, exit code, any files written), detects which parts vary between
runs, and writes a test script that checks the stable parts. The
generated script uses &lt;code&gt;tdda.referencetest&lt;/code&gt; and can be run and maintained
like any other reference test.&lt;/p&gt;
&lt;p&gt;Inspect the generated test and the reference outputs before trusting
them. Gentest is good at generating structurally correct tests; you
still need to verify that the reference outputs are actually correct.&lt;/p&gt;
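&lt;p&gt;The idea of detecting which parts vary between runs can be illustrated with
a toy sketch. This is not Gentest's algorithm; it assumes every run produces
the same number of lines and simply reports the line positions that differ.
The function name and sample outputs are hypothetical:&lt;/p&gt;

```python
def varying_line_numbers(runs):
    """Return 0-based indexes of lines that differ between any two runs.

    Assumes every run produced the same number of lines.
    """
    first = runs[0]
    return [i for i, line in enumerate(first)
            if any(run[i] != line for run in runs[1:])]


run_a = ["report v1", "started 09:00:01", "total: 42"]
run_b = ["report v1", "started 09:00:07", "total: 42"]
print(varying_line_numbers([run_a, run_b]))  # [1]
```

&lt;p&gt;Lines flagged this way are candidates for &lt;code&gt;ignore_patterns&lt;/code&gt; or
&lt;code&gt;ignore_substrings&lt;/code&gt;; the stable lines are the ones a reference test can
check exactly.&lt;/p&gt;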
&lt;h3 id="the-reference-test-checklist"&gt;The Reference Test Checklist&lt;/h3&gt;
&lt;p&gt;☐ &lt;strong&gt;Create at least one reference test&lt;/strong&gt; for every analytical process you write.&lt;br&gt;
☐ &lt;strong&gt;Run tests before making changes&lt;/strong&gt;, so you know the baseline.&lt;br&gt;
☐ &lt;strong&gt;Run tests after making changes&lt;/strong&gt;, before assuming they worked.&lt;br&gt;
☐ &lt;strong&gt;When a test fails&lt;/strong&gt;, read the diff before doing anything else.&lt;br&gt;
☐ &lt;strong&gt;Never run &lt;code&gt;-W&lt;/code&gt; without first verifying the new output is correct.&lt;/strong&gt;&lt;br&gt;
☐ &lt;strong&gt;Prefer &lt;code&gt;-1W&lt;/code&gt; (or &lt;code&gt;--tagged --write-all -s&lt;/code&gt;)&lt;/strong&gt; over bare &lt;code&gt;-W&lt;/code&gt;—rewrite only the references that actually failed.&lt;br&gt;
☐ &lt;strong&gt;Use &lt;code&gt;-F&lt;/code&gt; and &lt;code&gt;tdda tag&lt;/code&gt;&lt;/strong&gt; to automatically tag failing tests for targeted reruns and rewrites.&lt;br&gt;
☐ &lt;strong&gt;After writing references, inspect the files.&lt;/strong&gt; Tests passing after &lt;code&gt;-W&lt;/code&gt; or &lt;code&gt;--write-all&lt;/code&gt; is not evidence of correctness.&lt;br&gt;
☐ &lt;strong&gt;Ensure reference files are clean in git&lt;/strong&gt; before running &lt;code&gt;-W&lt;/code&gt;, so you can use &lt;code&gt;git diff&lt;/code&gt; to review changes and revert with &lt;code&gt;git checkout -- testdata/&lt;/code&gt; if needed.&lt;br&gt;
☐ &lt;strong&gt;Consider unit-enhanced reference tests&lt;/strong&gt; for anything with checkable semantic structure.&lt;br&gt;
☐ &lt;strong&gt;Add a regression test for every bug you fix.&lt;/strong&gt;&lt;br&gt;
☐ &lt;strong&gt;1 test vs. 0 tests is a bigger difference than 100 vs. 1.&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id="further-reading"&gt;Further Reading&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://tdda.readthedocs.io/"&gt;TDDA library documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/tdda/tdda/tree/master/tdda/examples/referencetest_examples"&gt;Reference test examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;man tdda&lt;/code&gt;, &lt;code&gt;man tdda-gentest&lt;/code&gt;, &lt;code&gt;man tdda-diff&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.routledge.com/Test-Driven-Data-Analysis/Radcliffe/p/book/9781032897158"&gt;&lt;em&gt;Test-Driven Data Analysis&lt;/em&gt;&lt;/a&gt;
   (Radcliffe, CRC Press, 2026), Part II, Chapters 9–12&lt;/li&gt;
&lt;li&gt;&lt;a href="https://book.tdda.info"&gt;Book resources&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="TDDA"></category><category term="reference testing"></category><category term="LLMs"></category><category term="coding bots"></category><category term="gentest"></category><category term="pytest"></category><category term="unittest"></category></entry><entry><title>TDDA: The Book, the 3.0 Library, and the PyData London 2026 Tutorial</title><link href="https://tdda.info/tdda-the-book-the-30-library-and-the-pydata-london-2026-tutorial.html" rel="alternate"></link><published>2026-05-19T18:00:00+01:00</published><updated>2026-05-19T18:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2026-05-19:/tdda-the-book-the-30-library-and-the-pydata-london-2026-tutorial.html</id><summary type="html">&lt;p&gt;This blog has been quite quiet, but there is a great deal of news and
it may be less quiet for a while.&lt;/p&gt;
&lt;h3 id="the-book"&gt;The Book&lt;/h3&gt;
&lt;p&gt;Today, 19th May 2026, sees the world-wide release of Test-Driven Data Analysis,
from CRC Press.&lt;/p&gt;
&lt;center&gt;
&lt;a href="https://www.routledge.com/Test-Driven-Data-Analysis/Radcliffe/p/book/9781032897158"&gt;
          &lt;img id="book-cover" src="/images/book-cover.png" alt="The cover of the book Test-Driven Data Analysis by Nicholas J. Radcliffe. It is published by Chapman and Hall, part of CRC Press, from Taylor &amp;amp; Francis Group, and is part of the DATA SCIENCE SERIES. The cover is black with mostly white text and a white graphic. The graphic is a 3-row by 4-column grid of squares. Each square contains a number of dots laid out on a regular 32x32 grid. The top-left square has 1024 dots (&amp;ldquo;full&amp;rdquo;) and working along each row in turn, the number of dots roughly halves each time, apparently at random (and, actually, pseudo-randomly). The last row&amp;rsquo;s boxes have six, two, two, and one dot." style="width:100%;max-width:400px;display:block;margin:0.3em auto 0.3em;"/&gt;&lt;/a&gt;
&lt;/center&gt;

&lt;p&gt;It is available from all good booksellers and all sellers …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This blog has been quite quiet, but there is a great deal of news and
it may be less quiet for a while.&lt;/p&gt;
&lt;h3 id="the-book"&gt;The Book&lt;/h3&gt;
&lt;p&gt;Today, 19th May 2026, sees the world-wide release of Test-Driven Data Analysis,
from CRC Press.&lt;/p&gt;
&lt;center&gt;
&lt;a href="https://www.routledge.com/Test-Driven-Data-Analysis/Radcliffe/p/book/9781032897158"&gt;
          &lt;img id="book-cover" src="/images/book-cover.png" alt="The cover of the book Test-Driven Data Analysis by Nicholas J. Radcliffe. It is published by Chapman and Hall, part of CRC Press, from Taylor &amp;amp; Francis Group, and is part of the DATA SCIENCE SERIES. The cover is black with mostly white text and a white graphic. The graphic is a 3-row by 4-column grid of squares. Each square contains a number of dots laid out on a regular 32x32 grid. The top-left square has 1024 dots (&amp;ldquo;full&amp;rdquo;) and working along each row in turn, the number of dots roughly halves each time, apparently at random (and, actually, pseudo-randomly). The last row&amp;rsquo;s boxes have six, two, two, and one dot." style="width:100%;max-width:400px;display:block;margin:0.3em auto 0.3em;"/&gt;&lt;/a&gt;
&lt;/center&gt;

&lt;p&gt;It is available from all good booksellers and all sellers of good books,
and until 30th June 2026 the code &lt;strong&gt;26SMA1&lt;/strong&gt; will give a 20% discount
from &lt;a href="https://www.routledge.com/Test-Driven-Data-Analysis/Radcliffe/p/book/9781032897158"&gt;the publisher's site&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The book covers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the TDDA methodology&lt;ul&gt;
&lt;li&gt;including areas not obviously amenable to software support,
  such as errors of interpretation, errors of applicability,
  errors of process, and errors of judgement&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;the TDDA command-line tools for&lt;ul&gt;
&lt;li&gt;data validation,&lt;/li&gt;
&lt;li&gt;reference-test generation with Gentest (tests for code in any language),&lt;/li&gt;
&lt;li&gt;a &lt;code&gt;diff&lt;/code&gt; tool for on-disk data frames (as parquet files and flat files),&lt;/li&gt;
&lt;li&gt;tools for working with the &lt;code&gt;tdda.serial&lt;/code&gt; format and also with CSVW
  (CSV on the Web) and Frictionless.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Reference testing with &lt;code&gt;tdda.referencetest&lt;/code&gt; under &lt;code&gt;unittest&lt;/code&gt; or &lt;code&gt;pytest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Test-Driven Document Development (TDDD)&lt;/li&gt;
&lt;li&gt;APIs for all functionality&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Resources from the book are available at &lt;a href="https://book.tdda.info"&gt;book.tdda.info&lt;/a&gt;,
including&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;22 Checklists&lt;/li&gt;
&lt;li&gt;All figures&lt;/li&gt;
&lt;li&gt;Glossary&lt;/li&gt;
&lt;li&gt;Data Profiles&lt;/li&gt;
&lt;li&gt;Data Dictionaries&lt;/li&gt;
&lt;li&gt;TDDD tests for the book.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Examples from the book are available from the tdda library by using
the &lt;code&gt;tdda&lt;/code&gt; command:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;tdda examples book
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The whole of TDDA is really built around the encapsulation of the data-analysis
cycle shown below, and the diagram shows how the book covers these ideas.&lt;/p&gt;
&lt;center&gt;
&lt;img src="images/analysis-cycle-with-remedies.png" alt="The main part of the diagram consists of six circles from
left to right.
The first five circles have failure mode text
under them and an error class below that.
1. CHOOSE APPROACH.
Failure: 'Fail to understand data, problem domain, or methods',
ERROR OF INTERPRETATION (error of formulation).
Ch 13.
2. DEVELOP ANALYTICAL PROCESS.
Failure: 'Mistakes during coding' and the associated
ERROR OF IMPLEMENTATION (bug).
Ch 9-12.
3. RUN ANALYTICAL PROCESS.
Failure: 'Use the software incorrectly'
ERROR OF PROCESS (operator error).
Ch 16.
4. PRODUCE ANALYTICAL RESULTS
Failure 'Mismatch between development data or assumptions
and deployment data'
ERROR OF APPLICABILITY (category error).
Ch 1-7 &amp; 17.
5. INTERPRET ANALYTICAL RESULTS
Failure 'Misinterpret the results'
ERROR OF INTERPRETATION (communication error).
Ch 14 &amp; 15.
6. `First, Do No Harm'.
ERROR OF JUDGEMENT.
Ch 17.
Arrows lead to FAILURE and SUCCESS boxes.
Remedies and book chapters sit underneath the main diagram."/&gt;
&lt;/center&gt;

&lt;h3 id="the-tdda-library-version-30"&gt;The TDDA Library, Version 3.0&lt;/h3&gt;
&lt;center&gt;
&lt;img src="images/tdda-six-features.png" alt="Top Line: Three Machines illustrating
1. constraint discovery and data validation: an input hopper takes training
data and produces constraints, or training data + constraints to produce
data validations at the output chute.
2. Rexpy, which takes strings in its input hopper and produces
regular expressions at the output chute,
3. TDDA gentest, which takes code in the input hopper and produces a Python
reference-test script as output.
Bottom Line: 4. tdda diff which compares data in flat files and parquet
files to detect (semantic) differences.
5. tdda.serial, which is a format for describing flat-file formats and
a suite of tools for working with tdda.serial, CSVW, and Frictionless
6. tdda.referencetest, for semantic testing of complex analytical results."/&gt;
&lt;/center&gt;

&lt;p&gt;Version 3.0 of the library and command-line tools is a major upgrade.&lt;/p&gt;
&lt;p&gt;All the main features have upgrades:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Data validation using constraints, which can be generated from training
   data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Inference of regular expressions from example strings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Automatic generation of tests for almost any non-GUI code in any language
   (Gentest).&lt;br&gt;
&lt;em&gt;"Gentest writes tests so you don't have to."™&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Enhanced test support for complex results
   in both Python's unittest and in pytest with reference testing.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;New features include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Support for Pandas 3.0, including all three backends
   (&lt;code&gt;original&lt;/code&gt;, &lt;code&gt;numpy_nullable&lt;/code&gt;, and &lt;code&gt;pyarrow&lt;/code&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Support for Polars DataFrames in most areas of the library.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Comprehensive Parquet support, replacing feather format.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;tdda diff&lt;/code&gt;: find and visualize differences between datasets
   in flat files (like CSV files) and parquet files,
   with control over specificity and scope.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Flat-file metadata support: the new tdda.serial format allows the
   format of CSV and other flat files to be described for accurate
   reading across libraries. This includes inference of flat-file
   formats, Python code generation, helper functions for reading and
   writing flat files with metadata, and conversion between
   tdda.serial, CSVW (CSV on the Web), and Frictionless.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Text utilities for Unicode, including glyph counting and extended
   normalization forms beyond canonical composition and decomposition
   (NFC, NFD), and kompatibility normalization (NFKC and NFKD). Form
   NFTK performs further kompatibility normalization including accent
   stripping.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Man pages for all commands&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upgraded documentation for command line tools and the API.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
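&lt;p&gt;The standard Unicode normalization forms named in the list above can be
tried out with Python's built-in &lt;code&gt;unicodedata&lt;/code&gt; module. The sketch below
does not use the tdda text utilities (their API is not shown in this post), and the
&lt;code&gt;strip_accents&lt;/code&gt; helper is a hypothetical illustration of accent stripping
in general, not a definition of NFTK.&lt;/p&gt;

```python
import unicodedata

# "cafe" with a precomposed e-acute, a space, then "fine" using the fi ligature
text = "caf\u00e9 \ufb01ne"

nfc = unicodedata.normalize("NFC", text)    # canonical composition
nfd = unicodedata.normalize("NFD", text)    # canonical decomposition
nfkc = unicodedata.normalize("NFKC", text)  # compatibility composition
nfkd = unicodedata.normalize("NFKD", text)  # compatibility decomposition


def strip_accents(s):
    """Decompose with NFKD, then drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# The lengths differ because decomposition splits the accented letter
# and the compatibility forms expand the ligature.
print(len(nfc), len(nfd), len(nfkc), len(nfkd))
print(strip_accents(text))
```

&lt;p&gt;The canonical forms leave the ligature alone, while the compatibility forms
replace it with two separate letters.&lt;/p&gt;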
&lt;h3 id="pydata-london-tdda-tutorial-5th-june-2026-1410"&gt;PyData London TDDA Tutorial, 5th June 2026, 14:10&lt;/h3&gt;
&lt;p&gt;I'll be giving a 90-minute hands-on tutorial on TDDA on 5th June 2026
at PyData London. Do come along if you can. PyData is always great,
for experts and novices and all levels of technical interest and
proficiency. It would be great to see you there.&lt;/p&gt;
&lt;p&gt;Get tickets from &lt;a href="https://pydata.org/london2026/"&gt;PyData&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And if you have something to share, prepare a 5-minute Lightning Talk.
They are always a highlight of the conference.&lt;/p&gt;</content><category term="TDDA"></category><category term="library"></category><category term="talk"></category><category term="book"></category></entry><entry><title>Test-Driven Document Development</title><link href="https://tdda.info/test-driven-document-development.html" rel="alternate"></link><published>2025-09-02T16:00:00+01:00</published><updated>2025-09-02T16:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2025-09-02:/test-driven-document-development.html</id><summary type="html">&lt;h3 id="summary"&gt;Summary&lt;/h3&gt;
&lt;p&gt;Computational documents attempt to guarantee that results included
within them—such as graphs—correspond to the code and data
claimed to generate them. They typically achieve this by generating
the outputs from the code at the time the document is generated
or viewed.  This solves significant problems, including those …&lt;/p&gt;</summary><content type="html">&lt;h3 id="summary"&gt;Summary&lt;/h3&gt;
&lt;p&gt;Computational documents attempt to guarantee that results included
within them—such as graphs—correspond to the code and data
claimed to generate them. They typically achieve this by generating
the outputs from the code at the time the document is generated
or viewed.  This solves significant problems, including those of code
&lt;a href="https://tdda.info/why-code-rusts"&gt;rusting&lt;/a&gt; (exhibiting changed
behaviour) and of unintentional inclusion of stale, incorrect, or
unvalidated results.  There is, however, a danger of what I term
&lt;em&gt;co-rusting&lt;/em&gt;, whereby the code and its outputs drift away from
correctness (&lt;em&gt;rust&lt;/em&gt;) together, without the author realizing. This is likely if
the code continues to generate output (i.e., does not crash or report
an error).&lt;/p&gt;
&lt;p&gt;Computational documents are an important part of &lt;a href="https://en.wikipedia.org/wiki/Reproducibility"&gt;reproducible
research&lt;/a&gt;,
within which the main approach to avoiding co-rusting tends to be the use of
&lt;a href="https://en.wikipedia.org/wiki/Reproducible_builds"&gt;reproducible environments&lt;/a&gt;,
which aim to prevent rusting by pinning down as much of the computational
environment as possible.&lt;/p&gt;
&lt;p&gt;Test-Driven Document Development (TDDD) builds on computational
documents by adding automated tests that fail when results change
(materially). If these tests are run as part of the build process for the
document, the possibility of co-rusting is reduced or eliminated. TDDD can be
viewed as the application of test-driven data analysis (TDDA) to
the process of document creation, essentially considering
the generation of a document as an analytical process
that should be supported by reference tests.&lt;/p&gt;
&lt;p&gt;The tests can be created by hand, but the &lt;a href="tdda-gentest-toronto-2022"&gt;Gentest&lt;/a&gt;
functionality of the
&lt;a href="https://tdda.readthedocs.io/"&gt;tdda&lt;/a&gt;
tool turns out to be powerful for implementing the tests needed by TDDD,
whatever language is used to generate the results.&lt;/p&gt;
&lt;h3 id="background-computational-documents"&gt;Background: Computational Documents&lt;/h3&gt;
&lt;p&gt;Computational documents include one or more results generated
by computer code, and provide some guarantee that each result
matches its generating code.
This is usually achieved by
including the code in the document and generating the output
either as part of document production
(compilation, e.g., &lt;a href="https://quarto.org"&gt;Quarto&lt;/a&gt;,
or in a more limited way,
&lt;a href="https://nedbatchelder.com/code/cog/index.html"&gt;cog&lt;/a&gt;)
or on-the-fly, for computational notebooks
(interpretation, like
&lt;a href="https://jupyter-notebook.readthedocs.io/"&gt;Jupyter Notebooks&lt;/a&gt;
/ &lt;a href="https://jupyterlab.readthedocs.io/"&gt;JupyterLab&lt;/a&gt;
and &lt;a href="https://marimo.io"&gt;marimo&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Here is a simple Quarto computational document that calculates
the number of potential UK postcodes
as defined by a regular expression describing valid ones.&lt;sup id="fnref:valid"&gt;&lt;a class="footnote-ref" href="#fn:valid"&gt;1&lt;/a&gt;&lt;/sup&gt;
This number is quoted in a book I am writing on TDDA. Prior to today,
it was pasted into the book by copying the output from
an interactive Python session where I calculated it. I probably
inserted the thousand separators by hand (another error-prone process).
Today I not only changed the number to be included from a calculation
when the book is
compiled, but also added reference tests to detect if it changes.
(&lt;a href="https://github.com/njr0/tddd-postcodes/blob/main/postcodes1.qmd"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;pre id="~/blogs/tdda-code/tddd-postcodes/postcodes1.qmd"&gt;
&lt;/pre&gt;

&lt;p&gt;This document is written in a dialect of Markdown defined by Quarto.
It has a header at the top, containing metadata,
then a fenced Markdown Python block (which defines
two variables used later in the document), and some text that
uses those two variables (&lt;code&gt;RE&lt;/code&gt; and &lt;code&gt;n_formatted&lt;/code&gt;) to say how many
postcodes match. It has a confected dependency on another Python
file, &lt;a href="https://github.com/njr0/tddd-postcodes/blob/main/letters.py"&gt;letters.py&lt;/a&gt;
defining the number of letters, &lt;code&gt;nL&lt;/code&gt;, in English:&lt;/p&gt;
&lt;pre id="~/blogs/tdda-code/tddd-postcodes/letters.py"&gt;
&lt;/pre&gt;

&lt;p&gt;It can be compiled with:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    quarto render postcodes1.qmd
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;producing &lt;a href="tddd/postcodes1.html"&gt;this page&lt;/a&gt;
and &lt;a href="tddd/postcodes1.pdf"&gt;this document&lt;/a&gt;. This
rather simple computational document, which shows
the code and one important output number that is “guaranteed”
to be generated from the code shown. It would be usual to includes
graphs or tables of some sort, but this is minimal example so
I really wanted only a single number.&lt;/p&gt;
&lt;p&gt;The version of the code actually used to generate the number in the
book does not import &lt;code&gt;nL&lt;/code&gt; from &lt;code&gt;letters.py&lt;/code&gt;, but includes the line
&lt;code&gt;nL = 26&lt;/code&gt; in the main program. That's because I'm not trying to make
it fail in the book. I've written it this way for the post to give me
an easy way to demonstrate co-rusting, which is an entirely real
phenomenon. A change in a dependency is a common reason for rusting.
(If you do not believe in code rusting or co-rusting,
try reading &lt;a href="https://tdda.info/why-code-rusts"&gt;Why Code Rusts&lt;/a&gt;;
if that doesn't convince you, this article may not be for you.)&lt;/p&gt;
&lt;h3 id="writing-tests-for-the-code"&gt;Writing Tests For the Code&lt;/h3&gt;
&lt;p&gt;We will begin by writing tests for essentially the same code, just written
as a standalone Python program rather than embedded in a Quarto document.&lt;/p&gt;
&lt;p&gt;Here is the same code as an actual Python script
&lt;a href="https://github.com/njr0/tddd-postcodes/blob/main/postcodes.py"&gt;postcodes.py&lt;/a&gt;,
together with some slightly different behaviour after calling the
postcode-counting function.&lt;/p&gt;
&lt;pre id="~/blogs/tdda-code/tddd-postcodes/postcodes.py"&gt;
&lt;/pre&gt;

&lt;p&gt;If we run this code, it produces no output but writes two files.
The first is a JSON file, &lt;a href="https://github.com/njr0/tddd-postcodes/blob/main/postcodes.json"&gt;postcodes.json&lt;/a&gt;:&lt;/p&gt;
&lt;pre id="~/blogs/tdda-code/tddd-postcodes/postcodes.json"&gt;
&lt;/pre&gt;

&lt;p&gt;We have chosen to write into this the values we might want in
the document (in this case, the number both as a number and as a formatted
string, as well as the relevant regular expression).&lt;/p&gt;
&lt;p&gt;There's a second file, &lt;a href="https://github.com/njr0/tddd-postcodes/blob/main/postcodes.json"&gt;postcodes-defs.tex&lt;/a&gt;,
which we will use later when we use LaTeX
as a TDDD engine. This contains the same values, but now as TeX macros:&lt;/p&gt;
&lt;pre id="~/blogs/tdda-code/tddd-postcodes/postcodes-defs.tex"&gt;
&lt;/pre&gt;

&lt;p&gt;If you have the &lt;code&gt;tdda&lt;/code&gt; library installed, you have as part of it a tool called
Gentest, which can write tests in Python for essentially any command-line
program, script, or command, in any language.&lt;/p&gt;
&lt;p&gt;The line below instructs Gentest to generate tests for running
the Python program &lt;code&gt;postcodes.py&lt;/code&gt;.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ tdda gentest &lt;span class="s1"&gt;&amp;#39;python postcodes.py&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This produces the following output:&lt;/p&gt;
&lt;pre id="~/blogs/tdda-code/tddd-postcodes/gentest-output.txt"&gt;
&lt;/pre&gt;

&lt;p&gt;If you run &lt;code&gt;tdda gentest&lt;/code&gt; without specifying a command, you get a wizard,
which asks what command to run and also gives you various other options
that can alternatively be passed on the command line.&lt;/p&gt;
&lt;p&gt;The output is intended to be self-explanatory, but to elaborate,
what Gentest has done is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the command twice;&lt;/li&gt;
&lt;li&gt;Recorded what was printed (both on the normal output stream &lt;code&gt;stdout&lt;/code&gt;
   and also, separately, what was printed on the error output stream
   &lt;code&gt;stderr&lt;/code&gt;);&lt;/li&gt;
&lt;li&gt;Taken copies of any files created (in our case, the &lt;code&gt;.json&lt;/code&gt;
   and &lt;code&gt;.tex&lt;/code&gt; files);&lt;/li&gt;
&lt;li&gt;Noted the exit code from the program (here 0, indicating successful
   completion);&lt;/li&gt;
&lt;li&gt;Looked to see whether there were any differences between the two runs,
   and whether anything in the output looked highly dependent on the
   environment or context. Here nothing did, but if it had Gentest would
   have generated tests that attempted to factor out things that look
   as if they might vary from run to run. (Examples include timestamps,
   run durations, hostnames etc.);&lt;/li&gt;
&lt;li&gt;Written a test script, &lt;code&gt;test_python_postcodes_py.py&lt;/code&gt;.
   When run, this executes the command under test and compares its behaviour
   and outputs to those it collected when generating the tests.
   The tests only pass if the behaviour
   and outputs were identical other than anything Gentest decided
   was not fixed. In this case, there was nothing that Gentest
   classed as not fixed.&lt;/li&gt;
&lt;/ul&gt;
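&lt;p&gt;The first few steps in the list above can be illustrated with a short sketch.
This is not Gentest's implementation, only a minimal illustration in Python of the
idea of running a command twice, capturing its output streams and exit code, and
checking whether the two runs agree. The helper names (&lt;code&gt;run_once&lt;/code&gt;,
&lt;code&gt;runs_agree&lt;/code&gt;) and the trivial command are hypothetical.&lt;/p&gt;

```python
import subprocess
import sys


def run_once(cmd):
    """Run cmd (a list of arguments) and capture what it prints and returns."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "exit_code": proc.returncode,
    }


def runs_agree(cmd):
    """Run cmd twice; return the first run and whether the second matched it."""
    first = run_once(cmd)
    second = run_once(cmd)
    return first, first == second


# Hypothetical stand-in for the command under test.
command = [sys.executable, "-c", "print('hello')"]
reference, stable = runs_agree(command)
print(reference["exit_code"], stable)
```

&lt;p&gt;Gentest does considerably more than this sketch: as described above, it also
copies any files created and looks for content that seems to depend on the
environment or context.&lt;/p&gt;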
&lt;p&gt;The code generated is in &lt;a href="https://github.com/njr0/tddd-postcodes/blob/main/test_python_postcodes_py.py"&gt;test_python_postcodes_py.py&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If we run this test script, thus:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python test_python_postcodes_py.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;we get&lt;/p&gt;
&lt;pre id="~/blogs/tdda-code/tddd-postcodes/test-output.txt"&gt;
&lt;/pre&gt;
&lt;p&gt;which shows that our tests have passed, meaning that the output
is unchanged. I'm not going to go through the tests, but by all
means look at them.&lt;/p&gt;
&lt;h3 id="simulated-co-rusting"&gt;Simulated Co-Rusting&lt;/h3&gt;
&lt;p&gt;Let's look at what happens if our code's behaviour changes as a result
of rusting. We will simulate this by replacing letters.py with
letters52.py, which records the number of upper- and lower-case letters
in English.&lt;sup id="fnref:whoops"&gt;&lt;a class="footnote-ref" href="#fn:whoops"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    cp letters52.py letters.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If we do this and run the tests again we get two test failures
and some suggested diff commands to run to understand them,&lt;/p&gt;
&lt;pre id="~/blogs/tdda-code/tddd-postcodes/test-output-fail.txt"&gt;
&lt;/pre&gt;

&lt;p&gt;and if we run the second suggested diff command (on the JSON files), we see:&lt;/p&gt;
&lt;pre id="~/blogs/tdda-code/tddd-postcodes/diff-json-output.txt"&gt;
&lt;/pre&gt;

&lt;p&gt;This is showing us that, with the changed dependency, the code is now
producing well over 400 million potential postcodes, rather than the 14
million we expected. (The lack of a newline at the end of &lt;code&gt;stdout&lt;/code&gt; is
not significant, and is ignored by the test.) So as we hoped, the test
detected the rusting of our code, and the co-rusting of its output.&lt;/p&gt;
&lt;p&gt;The other suggested diff command shows exactly the same differences in the TeX
macros written:&lt;/p&gt;
&lt;pre id="~/blogs/tdda-code/tddd-postcodes/diff-tex-output.txt"&gt;
&lt;/pre&gt;

&lt;p&gt;If we run the Quarto file &lt;a href="https://github.com/njr0/tddd-postcodes/blob/main/postcodes1.qmd"&gt;postcodes1.qmd&lt;/a&gt;
with the change, there is no obvious problem:
the code and the result continue to match, but are now different from
what I intended and originally validated. Here are the
&lt;a href="tddd/postcodes1bad.html"&gt;html&lt;/a&gt;
and
&lt;a href="tddd/postcodes1bad.pdf"&gt;pdf&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="a-tddd-version-of-the-quarto-doc"&gt;A TDDD Version of the Quarto Doc&lt;/h3&gt;
&lt;p&gt;We can make the Quarto document more robust (and have the benefit of
keeping the code in a script, rather than forcing it into the document)
by using this Quarto file, &lt;a href="https://github.com/njr0/tddd-postcodes/blob/main/postcodes1.qmd"&gt;postcodes2.qmd&lt;/a&gt;.&lt;/p&gt;
&lt;pre id="~/blogs/tdda-code/tddd-postcodes/postcodes2.qmd"&gt;
&lt;/pre&gt;

&lt;p&gt;The include line at the top imports the file
&lt;a href="https://github.com/njr0/tddd-postcodes/blob/main/postcodes1.qmd2"&gt;_postcodes.py.qmd&lt;/a&gt;.
This file is just
our script, in Quarto Markdown fences, with an underscore-prefixed filename,
which Quarto requires for inclusions for some reason.
We construct the file automatically as part of the build process
(in the &lt;a href="https://github.com/njr0/tddd-postcodes/blob/main/Makefile"&gt;Makefile&lt;/a&gt;).&lt;/p&gt;
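&lt;p&gt;The Makefile recipe itself is not reproduced here. As a rough illustration of
the same step, the following Python sketch wraps a script in a Quarto Python fence
and writes it to an underscore-prefixed include file. The function names and the
two-line stand-in script are hypothetical, and the real build may do this
differently.&lt;/p&gt;

```python
import os
import tempfile

FENCE = "`" * 3  # three backticks, built here to avoid nesting literal fences


def wrap_script_as_qmd(script_text):
    """Return script_text wrapped in a Quarto Python code fence."""
    body = script_text.rstrip("\n")
    return f"{FENCE}{{python}}\n{body}\n{FENCE}\n"


def write_include(script_path, out_dir):
    """Write the wrapped script to out_dir as _NAME.qmd and return its path."""
    name = "_" + os.path.basename(script_path) + ".qmd"
    out_path = os.path.join(out_dir, name)
    with open(script_path) as f:
        wrapped = wrap_script_as_qmd(f.read())
    with open(out_path, "w") as f:
        f.write(wrapped)
    return out_path


# Demonstrate with a tiny stand-in script in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "postcodes.py")
    with open(script, "w") as f:
        f.write("nL = 26\nprint(nL)\n")
    include_path = write_include(script, tmp)
    include_name = os.path.basename(include_path)
    with open(include_path) as f:
        include_text = f.read()

print(include_name)
```

&lt;p&gt;Generating the include file from the script on every build keeps the code in
one place, so the document cannot drift away from the script that the tests
exercise.&lt;/p&gt;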
&lt;p&gt;After the inclusion, we read the JSON file that Gentest saved in its
reference directory into Python as a dictionary called &lt;code&gt;ref&lt;/code&gt; and then
check that this reference dictionary is equal to the one we generated
when we ran the code as part of the Quarto rendering process.
The Makefile runs the tests (outside Quarto) immediately before rendering
so if the assertion passes, we actually know two things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The tests passed when we ran them outside Quarto
    (showing that the code produces the results we previously validated as OK),
    and&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When we ran the same code inside Quarto, its results (or at least,
    the results in the dictionary) were also the same as the reference
    results in the test.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The rest of the Quarto document is the same as the first version except
that we use the results from the dictionary (since those are validated)
and choose to use the preformatted string &lt;code&gt;ref['n_str']&lt;/code&gt; rather than
formatting it inline. (This makes no difference.)&lt;/p&gt;
&lt;p&gt;In this case, and many others, it makes no difference whether we use &lt;code&gt;ref&lt;/code&gt;
(the results read from the reference JSON file) or &lt;code&gt;d&lt;/code&gt; as the source
of our values, because the assertion checked that they were identical.
The reason I've used &lt;code&gt;ref&lt;/code&gt; is that in some other cases, we allow
non-material differences between the actual and reference results,
typically things like datestamps indicating run-time, machine names etc.
(If those are different, we need to use a slightly different assertion.)
By using the reference results, we ensure that the document does not change
each time we compile it if there are no material differences.&lt;/p&gt;
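&lt;p&gt;The "slightly different assertion" mentioned above might look something like
the following sketch, which compares two result dictionaries while ignoring keys
considered non-material. Everything here is an assumption for illustration: the
helper &lt;code&gt;material_equal&lt;/code&gt;, the keys &lt;code&gt;n&lt;/code&gt; and &lt;code&gt;run_at&lt;/code&gt;,
and the values are all hypothetical. Only &lt;code&gt;n_str&lt;/code&gt;, &lt;code&gt;ref&lt;/code&gt; and
&lt;code&gt;d&lt;/code&gt; are names used in the post.&lt;/p&gt;

```python
import json
import os
import tempfile

_MISSING = object()  # sentinel so that a missing key never equals a real value


def material_equal(actual, reference, ignore=()):
    """Compare two result dictionaries, skipping keys deemed non-material."""
    keys = (set(actual) | set(reference)) - set(ignore)
    return all(
        actual.get(k, _MISSING) == reference.get(k, _MISSING) for k in keys
    )


# Hypothetical reference results, and fresh results that differ only in a timestamp.
reference_results = {"n": 1000, "n_str": "1,000", "run_at": "2025-09-01T10:00"}
d = {"n": 1000, "n_str": "1,000", "run_at": "2025-09-02T16:00"}

# Round-trip the reference through a JSON file, as the document does.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reference.json")
    with open(path, "w") as f:
        json.dump(reference_results, f)
    with open(path) as f:
        ref = json.load(f)

assert d != ref                                   # strict equality fails
assert material_equal(d, ref, ignore=["run_at"])  # material content agrees
print(ref["n_str"])                               # use the validated value
```

&lt;p&gt;Taking the displayed value from &lt;code&gt;ref&lt;/code&gt; rather than &lt;code&gt;d&lt;/code&gt; means
the ignored fields never leak into the rendered document.&lt;/p&gt;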
&lt;h3 id="discussion"&gt;Discussion&lt;/h3&gt;
&lt;p&gt;Next:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Look at the JSON and TeX macros&lt;/li&gt;
&lt;li&gt;Change the letters to be 52&lt;/li&gt;
&lt;li&gt;Show the test failing&lt;/li&gt;
&lt;li&gt;Show how to use the script code in Quarto&lt;/li&gt;
&lt;li&gt;Do the LaTeX version.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:valid"&gt;
&lt;p&gt;All current valid postcodes match this expression, but many
strings that match it do not exist and some would probably not be considered
valid.&amp;#160;&lt;a class="footnote-backref" href="#fnref:valid" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:whoops"&gt;
&lt;p&gt;By way of full disclosure, when I actually replaced &lt;code&gt;letters.py&lt;/code&gt;
with &lt;code&gt;letters52.py&lt;/code&gt; and ran the tests they passed, to my dismay.
This happened not because of a problem with the tests, but because I created
&lt;code&gt;letters52.py&lt;/code&gt; and &lt;code&gt;letters26.py&lt;/code&gt; by copying &lt;code&gt;letters.py&lt;/code&gt; and failed
to update the contents of &lt;code&gt;letters52.py&lt;/code&gt;. If you were to look
back in the Git history for the repo, you'd see that.
I mention this simply as a further demonstration that all humans are prone
to error, which is some of the reason TDDD and TDDA are helpful!
Of course, some humans are less error-prone than others!&amp;#160;&lt;a class="footnote-backref" href="#fnref:whoops" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="TDDD"></category></entry><entry><title>tdda.serial: Metadata for Flat Files (CSV Files)</title><link href="https://tdda.info/tddaserial-metadata-for-flat-files-csv-files.html" rel="alternate"></link><published>2025-06-23T10:00:00+01:00</published><updated>2025-06-23T10:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2025-06-23:/tddaserial-metadata-for-flat-files-csv-files.html</id><summary type="html">&lt;p&gt;Almost all data scientists and data engineers have to work with flat files
(CSV files) from time to time. Despite their many problems,
CSVs are too ubiquitous, too universal, and (whisper it) have too many
strengths for them to be likely to disappear. Even if they did,
they would quickly …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Almost all data scientists and data engineers have to work with flat files
(CSV files) from time to time. Despite their many problems,
CSVs are too ubiquitous, too universal, and (whisper it) have too many
strengths for them to be likely to disappear. Even if they did,
they would quickly be reinvented. The problems
with them are widely known and discussed, and will be familiar to almost
everyone who works with them. They include issues with encodings,
types, quoting, nulls, headers, and with dates and times.
My favourite summary of them remains Jesse Donat's
&lt;em&gt;&lt;a href="https://donatstudios.com/Falsehoods-Programmers-Believe-About-CSVs"&gt;Falsehoods Programmers Believe about CSVs&lt;/a&gt;&lt;/em&gt;. I wrote about them on this
blog nearly four years ago
(&lt;em&gt;&lt;a href="https://www.tdda.info/flat-files-aka-csv-files"&gt;Flat Files&lt;/a&gt;&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;Over the last year or so I've been writing a book on test-driven data
analysis. The only remaining chapter without a full draft discusses the same
topics as this post—metadata for CSV files and new parts of the TDDA
software that assist with its creation and use.
This post documents my current thinking, plans and ambitions in this
area, and shows some of what is already implemented.&lt;sup id="fnref:dev"&gt;&lt;a class="footnote-ref" href="#fn:dev"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h3 id="a-metadata-format-for-flat-files-tddaserial"&gt;A Metadata Format for Flat Files: tdda.serial&lt;/h3&gt;
&lt;p&gt;The core of the new work is a new format, &lt;code&gt;tdda.serial&lt;/code&gt;, for describing
data in CSV files.&lt;/p&gt;
&lt;p&gt;The previous post showed an example (“XMD”) metadata file used by the Miró
software from my company Stochastic Solutions, which was as follows:&lt;sup id="fnref:infact"&gt;&lt;a class="footnote-ref" href="#fn:infact"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="cp"&gt;&amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;dataformat&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;sep&amp;gt;&lt;/span&gt;,&lt;span class="nt"&gt;&amp;lt;/sep&amp;gt;&lt;/span&gt;                     &lt;span class="cm"&gt;&amp;lt;!-- field separator --&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;null&amp;gt;&amp;lt;/null&amp;gt;&lt;/span&gt;                    &lt;span class="cm"&gt;&amp;lt;!-- NULL marker --&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;quoteChar&amp;gt;&lt;/span&gt;&amp;quot;&lt;span class="nt"&gt;&amp;lt;/quoteChar&amp;gt;&lt;/span&gt;         &lt;span class="cm"&gt;&amp;lt;!-- Quotation mark --&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;encoding&amp;gt;&lt;/span&gt;UTF-8&lt;span class="nt"&gt;&amp;lt;/encoding&amp;gt;&lt;/span&gt;       &lt;span class="cm"&gt;&amp;lt;!-- any python coding name --&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;allowApos&amp;gt;&lt;/span&gt;True&lt;span class="nt"&gt;&amp;lt;/allowApos&amp;gt;&lt;/span&gt;      &lt;span class="cm"&gt;&amp;lt;!-- allow apostophes in strings --&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;skipHeader&amp;gt;&lt;/span&gt;False&lt;span class="nt"&gt;&amp;lt;/skipHeader&amp;gt;&lt;/span&gt;   &lt;span class="cm"&gt;&amp;lt;!-- ignore the first line of file --&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;pc&amp;gt;&lt;/span&gt;False&lt;span class="nt"&gt;&amp;lt;/pc&amp;gt;&lt;/span&gt;                   &lt;span class="cm"&gt;&amp;lt;!-- Convert 1.2% to 0.012 etc. --&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;excel&amp;gt;&lt;/span&gt;False&lt;span class="nt"&gt;&amp;lt;/excel&amp;gt;&lt;/span&gt;             &lt;span class="cm"&gt;&amp;lt;!-- pad short lines with NULLs --&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;dateFormat&amp;gt;&lt;/span&gt;eurodt&lt;span class="nt"&gt;&amp;lt;/dateFormat&amp;gt;&lt;/span&gt;  &lt;span class="cm"&gt;&amp;lt;!-- Miró date format name --&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;fields&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;mc id&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;ID&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;string&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;mc nm&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;MachineName&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;secs&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;TimeToManufacture&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;real&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;commission date&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DateOfCommission&amp;quot;&lt;/span&gt;
                   &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;date&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;mc cp&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Completion Time&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;date&amp;quot;&lt;/span&gt;
                   &lt;span class="na"&gt;format=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;rdt&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sh dt&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;ShipDate&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;date&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;format=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;rd&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;qa passed?&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Passed QA&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bool&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/fields&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;requireAllFields&amp;gt;&lt;/span&gt;False&lt;span class="nt"&gt;&amp;lt;/requireAllFields&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;banExtraFields&amp;gt;&lt;/span&gt;False&lt;span class="nt"&gt;&amp;lt;/banExtraFields&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/dataformat&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here is one equivalent way of expressing essentially the same information
in the (evolving) &lt;code&gt;tdda.serial&lt;/code&gt; format:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;format&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;http://tdda.info/ns/tdda.serial&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;writer&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tdda.serial-2.2.15&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;tdda.serial&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;encoding&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;UTF-8&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;delimiter&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;|&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;quote_char&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;\&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;escape_char&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;\\&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;stutter_quotes&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;null_indicators&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;accept_percentages_as_floats&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;header_row_count&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;map_missing_trailing_cols_to_null&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;fields&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;mc id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ID&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;fieldtype&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;mc nm&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;fieldtype&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;string&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;secs&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;TimeToManufacture&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;fieldtype&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;commission date&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;DateOfCommission&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;fieldtype&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;date&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;format&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;iso8601date&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;mc cp&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CompletionTime&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;fieldtype&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;datetime&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;format&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;iso8601datetime&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;sh dt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ShipDate&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;fieldtype&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;date&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;format&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;iso8601date&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;qa passed?&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;PassedQA&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;fieldtype&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bool&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;true_values&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;yes&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;false_values&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;no&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The details don't matter too much at this stage, and may yet change,
but briefly here we see the file
(typically with a &lt;code&gt;.serial&lt;/code&gt; extension), describing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the text encoding used for the data (&lt;code&gt;UTF-8&lt;/code&gt;);&lt;/li&gt;
&lt;li&gt;the field separator (pipe, &lt;code&gt;|&lt;/code&gt;);&lt;/li&gt;
&lt;li&gt;the quote character (double quote, &lt;code&gt;"&lt;/code&gt;);&lt;/li&gt;
&lt;li&gt;the escape character (&lt;code&gt;\&lt;/code&gt;), which is used to escape double quotes
   in double-quoted strings, among other things;&lt;/li&gt;
&lt;li&gt;whether quotes are stuttered or escaped within quoted strings;&lt;/li&gt;
&lt;li&gt;the string used to denote null values
   (this can be a single string or a list);&lt;/li&gt;
&lt;li&gt;the number of header rows;&lt;/li&gt;
&lt;li&gt;an explicit note not to accept percentages in the file as floating-point
   values;&lt;/li&gt;
&lt;li&gt;whether or not lines with too few fields should be regarded as
   having nulls for the apparently missing fields.
   (Excel usually does not write values after the last non-empty cell
    in each row on a worksheet.)&lt;/li&gt;
&lt;li&gt;information about individual fields.
   In this case, a dictionary is used to map names in the flat file
   to names to be used in the dataset. Numbers can also be used
   to indicate column position, particularly if there is no header,
   though they have to be quoted because this is JSON.
   Field types are also specified, together with any extra information
   required, e.g. the non-standard &lt;em&gt;true&lt;/em&gt; and &lt;em&gt;false&lt;/em&gt; values for the
   boolean field &lt;code&gt;qa passed?&lt;/code&gt; (in the file), which becomes
   &lt;code&gt;PassedQA&lt;/code&gt; once read. Formats for the
   date and time fields are also specified here.&lt;/li&gt;
&lt;/ul&gt;
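&lt;p&gt;To make the meaning of these settings concrete, here is a minimal sketch
that applies a cut-down copy of the metadata using only the Python standard
library. It is not how &lt;code&gt;tdda.serial&lt;/code&gt; is implemented; the names
&lt;code&gt;METADATA&lt;/code&gt;, &lt;code&gt;SAMPLE&lt;/code&gt;, &lt;code&gt;as_list&lt;/code&gt;,
&lt;code&gt;convert&lt;/code&gt; and &lt;code&gt;read_rows&lt;/code&gt; are hypothetical, and the
sketch assumes ISO 8601 dates and no short lines.&lt;/p&gt;

```python
import csv
import io
from datetime import date, datetime

# Cut-down copy of the metadata above (only the keys this sketch uses).
METADATA = {
    "delimiter": "|",
    "quote_char": '"',
    "escape_char": "\\",
    "stutter_quotes": False,
    "null_indicators": "",
    "header_row_count": 1,
    "fields": {
        "mc id": {"name": "ID", "fieldtype": "int"},
        "mc nm": {"name": "Name", "fieldtype": "string"},
        "secs": {"name": "TimeToManufacture", "fieldtype": "int"},
        "commission date": {"name": "DateOfCommission", "fieldtype": "date"},
        "mc cp": {"name": "CompletionTime", "fieldtype": "datetime"},
        "sh dt": {"name": "ShipDate", "fieldtype": "date"},
        "qa passed?": {"name": "PassedQA", "fieldtype": "bool",
                       "true_values": "yes", "false_values": "no"},
    },
}

# Hypothetical sample data: one header row and two complete data rows.
SAMPLE = (
    'mc id|mc nm|secs|commission date|mc cp|sh dt|qa passed?\n'
    '1111111|"Machine 1"|86400|2025-06-01|2025-06-07T12:34:56|2025-06-21|yes\n'
    '2222222|"Machine 2"||2025-06-02|2025-06-08T12:34:57|2025-06-22|no\n'
)


def as_list(value):
    """Accept either a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def convert(text, spec, nulls):
    """Convert one raw string to a Python value according to its field spec."""
    if text in nulls:
        return None
    kind = spec.get("fieldtype", "string")
    if kind == "int":
        return int(text)
    if kind == "date":
        return date.fromisoformat(text)
    if kind == "datetime":
        return datetime.fromisoformat(text)
    if kind == "bool":
        if text == spec.get("true_values"):
            return True
        if text == spec.get("false_values"):
            return False
        raise ValueError(f"unexpected boolean value: {text!r}")
    return text


def read_rows(text, md):
    """Parse delimited text into a list of dicts keyed by internal field names."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=md["delimiter"],
        quotechar=md["quote_char"],
        escapechar=md["escape_char"],
        doublequote=md["stutter_quotes"],  # stuttered quotes are doubled quotes
    )
    rows = list(reader)
    header = rows[0]  # this sketch assumes exactly one header row
    nulls = as_list(md["null_indicators"])
    records = []
    for raw in rows[md["header_row_count"]:]:
        record = {}
        for extname, value in zip(header, raw):
            # Fields without metadata keep their external name and stay strings.
            spec = md["fields"].get(extname, {"name": extname})
            record[spec["name"]] = convert(value, spec, nulls)
        records.append(record)
    return records


records = read_rows(SAMPLE, METADATA)
for record in records:
    print(record)
```

&lt;p&gt;The empty &lt;code&gt;secs&lt;/code&gt; value in the second row comes back as
&lt;code&gt;None&lt;/code&gt; because it matches the null indicator, and the keys of each
record are the internal names rather than the names in the file.&lt;/p&gt;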
&lt;p&gt;When the fields are presented as a dictionary, as here, this allows
for the possibility that there are other fields in the file, for
which metadata is not provided. If a list is used instead, the
field list is taken to be complete. With a list, external names
can be provided using a &lt;code&gt;csvname&lt;/code&gt; attribute, if they are different.&lt;/p&gt;
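&lt;p&gt;As a rough illustration of the two forms, the sketch below builds the same
name mapping from a dictionary of fields and from a list of fields. The exact
layout of the list form is an assumption based only on the description above,
and &lt;code&gt;rename_map&lt;/code&gt; and &lt;code&gt;is_complete&lt;/code&gt; are hypothetical
helpers, not part of &lt;code&gt;tdda.serial&lt;/code&gt;.&lt;/p&gt;

```python
# Dictionary form: keys are the names used in the flat file.
fields_as_dict = {
    "mc id": {"name": "ID", "fieldtype": "int"},
    "mc nm": {"name": "Name", "fieldtype": "string"},
    "secs": {"name": "secs", "fieldtype": "int"},
}

# Assumed list form: each entry gives the internal name, with an optional
# "csvname" when the name in the flat file is different.
fields_as_list = [
    {"name": "ID", "csvname": "mc id", "fieldtype": "int"},
    {"name": "Name", "csvname": "mc nm", "fieldtype": "string"},
    {"name": "secs", "fieldtype": "int"},  # same name in file and dataset
]


def rename_map(fields):
    """Map external (flat-file) names to internal (dataset) names."""
    if isinstance(fields, dict):
        return {ext: spec.get("name", ext) for ext, spec in fields.items()}
    return {spec.get("csvname", spec["name"]): spec["name"] for spec in fields}


def is_complete(fields):
    """A list is taken to be the complete field list; a dictionary is not."""
    return isinstance(fields, list)


print(rename_map(fields_as_dict))
print(rename_map(fields_as_list))
print(is_complete(fields_as_dict), is_complete(fields_as_list))
```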
&lt;p&gt;Pretty much everything is optional, and, where appropriate,
defaults can be put in the main section and overridden on
a per-field basis. This is useful if, for example, one or two fields
use different null markers from the default, or if multiple date formats
are used. (The &lt;code&gt;format&lt;/code&gt; key will probably change to &lt;code&gt;dateformat&lt;/code&gt; and
&lt;code&gt;boolformat&lt;/code&gt; to make this overriding work better.)&lt;/p&gt;
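&lt;p&gt;One way to picture the default-and-override behaviour is a lookup that
prefers a per-field value and falls back to the main section. This is a sketch
of the idea only: &lt;code&gt;effective&lt;/code&gt; is a hypothetical helper, and using
the same key names (&lt;code&gt;null_indicators&lt;/code&gt;, &lt;code&gt;format&lt;/code&gt;) at both
levels is an assumption, particularly as the keys may yet change.&lt;/p&gt;

```python
# Hypothetical metadata: main-section defaults plus two per-field overrides.
md = {
    "null_indicators": "",
    "format": "iso8601date",
    "fields": {
        "secs": {"name": "TimeToManufacture", "fieldtype": "int",
                 "null_indicators": "?"},
        "sh dt": {"name": "ShipDate", "fieldtype": "date"},
        "mc cp": {"name": "CompletionTime", "fieldtype": "datetime",
                  "format": "iso8601datetime"},
    },
}


def effective(md, extname, key, fallback=None):
    """Return the per-field value for key if present, else the main default."""
    field = md.get("fields", {}).get(extname, {})
    if key in field:
        return field[key]
    return md.get(key, fallback)


print(repr(effective(md, "secs", "null_indicators")))   # per-field override
print(repr(effective(md, "sh dt", "null_indicators")))  # main-section default
print(effective(md, "sh dt", "format"))
print(effective(md, "mc cp", "format"))
```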
&lt;p&gt;Here is a simple example of its use with Pandas.
Suppose we have the following pipe-separated flat file,
with the name &lt;code&gt;machines.psv&lt;/code&gt;.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;mc id&lt;span class="p"&gt;|&lt;/span&gt;mc nm&lt;span class="p"&gt;|&lt;/span&gt;secs&lt;span class="p"&gt;|&lt;/span&gt;commission date&lt;span class="p"&gt;|&lt;/span&gt;mc cp&lt;span class="p"&gt;|&lt;/span&gt;sh dt&lt;span class="p"&gt;|&lt;/span&gt;qa passed?
&lt;span class="m"&gt;1111111&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Machine 1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;2025&lt;/span&gt;-06-01&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;2025&lt;/span&gt;-06-07T12:34:56&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;2025&lt;/span&gt;-06-21&lt;span class="p"&gt;|&lt;/span&gt;yes
&lt;span class="m"&gt;2222222&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Machine 2&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;||&lt;/span&gt;&lt;span class="m"&gt;2025&lt;/span&gt;-06-02&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;2025&lt;/span&gt;-06-08T12:34:57&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;2025&lt;/span&gt;-06-22
&lt;span class="m"&gt;3333333&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Machine 3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;86399&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;2025&lt;/span&gt;-06-03&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;2025&lt;/span&gt;-06-09T12:34:55&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="m"&gt;2025&lt;/span&gt;-06-22&lt;span class="p"&gt;|&lt;/span&gt;no
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then we can use the following Python code to load the data,
informed by the metadata in &lt;code&gt;machines.serial&lt;/code&gt; (the example
shown above).&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.serial&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv_to_pandas&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv_to_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;machines.psv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;machines.serial&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This produces the following output:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python pd-read-machines.py
        ID       Name  TimeToManufacture DateOfCommission      CompletionTime   ShipDate  PassedQA
0  1111111  Machine 1              86400       2025-06-01 2025-06-07 12:34:56 2025-06-21      True
1  2222222  Machine 2               &amp;lt;NA&amp;gt;       2025-06-02 2025-06-08 12:34:57 2025-06-22      &amp;lt;NA&amp;gt;
2  3333333  Machine 3              86399       2025-06-03 2025-06-09 12:34:55 2025-06-22     False

&amp;lt;class &amp;#39;pandas.core.frame.DataFrame&amp;#39;&amp;gt;
RangeIndex: 3 entries, 0 to 2
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   ID                 3 non-null      Int64
 1   Name               3 non-null      string
 2   TimeToManufacture  2 non-null      Int64
 3   DateOfCommission   3 non-null      datetime64[ns]
 4   CompletionTime     3 non-null      datetime64[ns]
 5   ShipDate           3 non-null      datetime64[ns]
 6   PassedQA           2 non-null      boolean
dtypes: Int64(2), boolean(1), datetime64[ns](3), string(1)
memory usage: 288.0 bytes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There's nothing particularly special here, but Pandas has
read the file correctly using the metadata to understand&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the pipe separator;&lt;/li&gt;
&lt;li&gt;the date and time formats;&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;yes&lt;/code&gt;/&lt;code&gt;no&lt;/code&gt; format of &lt;code&gt;PassedQA&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;the null indicator;&lt;/li&gt;
&lt;li&gt;the intended, more usable internal field names;&lt;/li&gt;
&lt;li&gt;field types, here defaulting to nullable types.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As with &lt;code&gt;pandas.read_csv&lt;/code&gt;, we can choose whether to prefer
nullable types, but the default using &lt;code&gt;tdda.serial&lt;/code&gt; is to do so.
In this case, the date formats and null indicators would be fine anyway,
with Pandas defaults, but here we could instead have specified, say,
European dates and &lt;code&gt;?&lt;/code&gt; for nulls.&lt;/p&gt;
&lt;p&gt;This code:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.serial&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serial_to_pandas_read_csv_args&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;rich&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rprint&lt;/span&gt;

&lt;span class="n"&gt;md&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;machines.serial&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serial_to_pandas_read_csv_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;shows the parameters actually passed to &lt;code&gt;pandas.read_csv&lt;/code&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;dtype&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ID&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Int64&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;string&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;TimeToManufacture&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Int64&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;PassedQA&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;boolean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;date_format&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DateOfCommission&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ISO8601&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;CompletionTime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ISO8601&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ShipDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ISO8601&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;parse_dates&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DateOfCommission&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;CompletionTime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ShipDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;sep&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;|&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;encoding&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;UTF-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;escapechar&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;quotechar&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;doublequote&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;na_values&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;keep_default_na&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;names&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ID&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;TimeToManufacture&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;DateOfCommission&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;CompletionTime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ShipDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;PassedQA&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;header&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;true_values&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;yes&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;false_values&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;no&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can do very similar things using Polars (and “soon”, other
libraries).
Here's a way to read the file with Polars:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.serial&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv_to_polars&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv_to_polars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;machines.psv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;machines.serial&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;map_other_bools_to_string&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;which produces:
&lt;img src="https://stochasticsolutions.com/image/polars-machines-output.png" width="1245" alt="Output from polars. There are two warnings (about polars not understanding escaping or alternate bool values, and PassedQA being read as a string, because that was specified in the parameters). There's then the data table showing the types as i64, str, i64, three datetimes (with microsecond resolution) and PassedQA as str. Nulls are shown for the second row for TimeToManufacture and PassedQA. The transformed field names are used."/&gt;&lt;/p&gt;
&lt;p&gt;This does mostly the same thing as the Pandas version, but
issues two warnings. The first is because an
escape character is specified, which the Polars CSV reader
doesn't really understand.
The second warning is because the Polars CSV reader can't handle non-standard
booleans. By default, when these are specified for Polars,
&lt;code&gt;tdda.serial&lt;/code&gt; will issue a warning but still call &lt;code&gt;polars.read_csv&lt;/code&gt;
to read the file, because they might not, in fact, be used.
The parameter passed in the Python code above
(&lt;code&gt;map_other_bools_to_string=True&lt;/code&gt;) tells &lt;code&gt;tdda.serial&lt;/code&gt; to direct Polars to read
this column as a string instead (as it would if we didn't specify a type).
Of course, it would be possible to have the reader then go through and
turn the strings into booleans after reading, but that feels like
more than a metadata library should do.&lt;/p&gt;
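&lt;p&gt;For readers who do want booleans, the post-read conversion is short. The following is a minimal, library-agnostic sketch in plain Python rather than anything provided by &lt;code&gt;tdda.serial&lt;/code&gt;: the helper &lt;code&gt;strings_to_bools&lt;/code&gt; and the sample data are hypothetical, and only the &lt;code&gt;yes&lt;/code&gt;/&lt;code&gt;no&lt;/code&gt;/empty-string conventions are taken from the metadata shown earlier.&lt;/p&gt;

```python
import csv
import io


# Hypothetical helper, not part of tdda.serial: map non-standard boolean
# strings to Python booleans after a reader has returned them as strings.
def strings_to_bools(values, true_values=("yes",), false_values=("no",),
                     null_values=("",)):
    converted = []
    for value in values:
        if value is None or value in null_values:
            converted.append(None)       # keep nulls as nulls
        elif value in true_values:
            converted.append(True)
        elif value in false_values:
            converted.append(False)
        else:
            # Fail loudly rather than silently guessing
            raise ValueError(f"Unexpected boolean value: {value!r}")
    return converted


# Illustrative pipe-separated data, loosely modelled on machines.psv
SAMPLE = "ID|PassedQA\n1|yes\n2|\n3|no\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="|"))
passed_qa = strings_to_bools([row["PassedQA"] for row in rows])
print(passed_qa)
```

&lt;p&gt;Raising on an unexpected value is a deliberate choice in this sketch: a value that is neither a declared true value, a declared false value, nor a null usually means the metadata and the file disagree.&lt;/p&gt;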
&lt;p&gt;The warnings helpfully tell you what to look out for as possible issues
when the file is read.
This is an example of a principle I'm trying to use
throughout tdda.serial:
when there's something in the serial metadata that a given reader
might not be able to handle correctly, issue a warning, and possibly provide
an option to control that behaviour.&lt;/p&gt;
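&lt;p&gt;The warn-and-offer-an-option principle can be sketched with the standard &lt;code&gt;warnings&lt;/code&gt; module. This is an illustration only, not the &lt;code&gt;tdda.serial&lt;/code&gt; implementation: the capability table, the function &lt;code&gt;check_metadata&lt;/code&gt;, and the metadata keys are all assumptions made for the example, with only the name &lt;code&gt;map_other_bools_to_string&lt;/code&gt; borrowed from the code above.&lt;/p&gt;

```python
import warnings

# Hypothetical capability table: which metadata features a reader handles.
# The feature names and contents are illustrative, not taken from tdda.serial.
READER_SUPPORT = {
    "pandas": {"escape_char", "true_values", "false_values"},
    "polars": set(),
}


def check_metadata(metadata, reader, map_other_bools_to_string=False):
    """Warn about metadata the reader may not honour; return adjusted metadata."""
    adjusted = dict(metadata)
    supported = READER_SUPPORT[reader]
    if "escape_char" in metadata and "escape_char" not in supported:
        warnings.warn(f"{reader} may not handle escape_char correctly")
    has_bools = "true_values" in metadata or "false_values" in metadata
    if has_bools and "true_values" not in supported:
        warnings.warn(f"{reader} cannot read non-standard booleans")
        if map_other_bools_to_string:
            # The option controls the fallback behaviour
            adjusted["read_bools_as"] = "string"
    return adjusted


metadata = {"escape_char": "\\", "true_values": ["yes"], "false_values": ["no"]}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = check_metadata(metadata, "polars", map_other_bools_to_string=True)
print(len(caught), result["read_bools_as"])
```

&lt;p&gt;The read still goes ahead in both cases; the warnings only flag what to check afterwards.&lt;/p&gt;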
&lt;p&gt;We can do the same thing as we did for Pandas and look at the
arguments generated for Polars, using the following, very similar,
Python code:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.serial&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load_metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serial_to_polars_read_csv_args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;rich&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rprint&lt;/span&gt;

&lt;span class="n"&gt;md&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;machines.serial&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serial_to_polars_read_csv_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;map_other_bools_to_string&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This produces&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;separator&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;|&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;quote_char&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;null_values&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;encoding&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;UTF-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;schema&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;ID&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Int64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;TimeToManufacture&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Int64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;DateOfCommission&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;CompletionTime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;ShipDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;PassedQA&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;new_columns&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;ID&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Name&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;TimeToManufacture&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;DateOfCommission&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;CompletionTime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;ShipDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;PassedQA&amp;#39;&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The only subtlety here is that the types in &lt;code&gt;schema&lt;/code&gt; are actual Polars types
(&lt;code&gt;pl.Int64&lt;/code&gt; etc.) rather than strings, hence their not being quoted.
(They're not prefixed because &lt;code&gt;repr(pl.Int64)&lt;/code&gt; is the string &lt;code&gt;"Int64"&lt;/code&gt;,
which prints as &lt;code&gt;Int64&lt;/code&gt;.)
The library can also write a &lt;code&gt;tdda.serial&lt;/code&gt; file containing the Polars
arguments explicitly. It looks like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;format&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;http://tdda.info/ns/tdda.serial&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;writer&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tdda.serial-2.2.15&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;polars.read_csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;separator&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;|&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;quote_char&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;\&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;null_values&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;encoding&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;UTF-8&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;schema&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;ID&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Int64&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;String&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;TimeToManufacture&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Int64&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;DateOfCommission&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Datetime&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;CompletionTime&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Datetime&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;ShipDate&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Datetime&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;PassedQA&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;String&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;new_columns&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ID&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;TimeToManufacture&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;DateOfCommission&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CompletionTime&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ShipDate&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;PassedQA&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here, because we need to serialize the &lt;code&gt;tdda.serial&lt;/code&gt; file as JSON, the
polars types are mapped to their string names. The &lt;code&gt;tdda&lt;/code&gt; library
takes care of the conversion in both directions.&lt;/p&gt;
&lt;p&gt;A single &lt;code&gt;.serial&lt;/code&gt; file can contain multiple &lt;em&gt;flavours&lt;/em&gt; of
metadata—&lt;code&gt;tdda.serial&lt;/code&gt;, &lt;code&gt;polars.read_csv&lt;/code&gt;, &lt;code&gt;pandas.read_csv&lt;/code&gt; etc.
When it does, a call to &lt;code&gt;load_metadata&lt;/code&gt; can specify a preferred flavour,
or let the library choose. My hope, however, is that in most cases
the &lt;code&gt;tdda.serial&lt;/code&gt; section will contain enough information to work
as well as a library-specific specification.&lt;/p&gt;
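&lt;p&gt;To make the idea of flavours concrete, here is a rough sketch of selecting one from a parsed &lt;code&gt;.serial&lt;/code&gt; document. It does not show how &lt;code&gt;load_metadata&lt;/code&gt; works internally: the function &lt;code&gt;pick_flavour&lt;/code&gt;, the fallback policy, and the contents of the &lt;code&gt;tdda.serial&lt;/code&gt; section (a placeholder, not the real schema) are assumptions for illustration.&lt;/p&gt;

```python
import json

# Illustrative .serial content with two flavours. The top-level keys follow
# the examples in this post; the "tdda.serial" section is only a placeholder.
SERIAL_TEXT = """
{
    "format": "http://tdda.info/ns/tdda.serial",
    "writer": "tdda.serial-2.2.15",
    "tdda.serial": {"placeholder": true},
    "polars.read_csv": {"separator": "|"}
}
"""

# Top-level keys that describe the file itself rather than a flavour
NON_FLAVOUR_KEYS = {"format", "writer"}


def pick_flavour(document, preferred=None):
    """Return (name, section) for the preferred flavour, else a default."""
    flavours = {k: v for k, v in document.items() if k not in NON_FLAVOUR_KEYS}
    if not flavours:
        raise ValueError("No metadata flavours found")
    if preferred is not None:
        if preferred not in flavours:
            raise KeyError(preferred)
        return preferred, flavours[preferred]
    # Assumed policy: prefer the generic tdda.serial section when present
    if "tdda.serial" in flavours:
        return "tdda.serial", flavours["tdda.serial"]
    name = next(iter(flavours))
    return name, flavours[name]


document = json.loads(SERIAL_TEXT)
print(pick_flavour(document, preferred="polars.read_csv"))
print(pick_flavour(document)[0])
```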
&lt;h3 id="goals-for-tddaserial"&gt;Goals for tdda.serial&lt;/h3&gt;
&lt;p&gt;&lt;img src="https://stochasticsolutions.com/image/tdda-serial-io.png"
     alt="Image showing a circle with tdda.serial in the middle and arrows leading in and out for three formats (CSVW, tdda.serial, and Frictionless), five libraries (DuckDB, Python csv, Pandas, Polars, and Apache Arrow) and Excel. Pandas, CSVW, tdda.serial and Polars are bold for both input and output."
      width="851.5"/&gt;&lt;/p&gt;
&lt;p&gt;When I went to write down the goals for &lt;code&gt;tdda.serial&lt;/code&gt;, I was surprised
at how long the list was. Not all of this is implemented but here is the
current state of the goals for &lt;code&gt;tdda.serial&lt;/code&gt;. (The image above shows
the vision for it, with the bold parts mostly implemented, and the
rest currently only planned.)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Describe Flat File Formats&lt;/strong&gt;.
  Allow accurate representation, full or partial, of flat-file formats
  used (or potentially used) by one or more concrete flat files.
  &lt;ul&gt;
&lt;li&gt;It primarily targets comma-separated values (&lt;code&gt;.csv&lt;/code&gt;) and related formats
  (tab-separated, pipe-separated etc.), but also potentially
  other tabular data. It could, for example, be used to describe
  things like date formats and numeric subtypes for tabular data
  stored in JSON or &lt;a href="https://jsonlines.org"&gt;JSON Lines&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Full or partial&lt;/em&gt; is important. When reading data, it is often
  convenient only to specify things that are causing problems.
  On write, fuller specifications are, of course, desirable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read Flat Files&lt;/strong&gt;.
  Assist with &lt;em&gt;reading&lt;/em&gt; flat files correctly, based on metadata in
  &lt;code&gt;.serial&lt;/code&gt; files and other formats (like &lt;a href="https://csvw.org"&gt;CSVW&lt;/a&gt;),
  primarily using data in the &lt;code&gt;"tdda.serial"&lt;/code&gt; format.&lt;ul&gt;
&lt;li&gt;Convert metadata currently stored as &lt;code&gt;tdda.serial&lt;/code&gt; to
  dictionaries of arguments for other libraries that work with CSVs.&lt;/li&gt;
&lt;li&gt;Provide an API to get such libraries to read flat-file data
  correctly, guided by the metadata.&lt;/li&gt;
&lt;li&gt;Generate code to get such libraries to read flat-file data
  correctly, guided by the metadata.  Assist with writing flat files
  in documented formats.&lt;/li&gt;
&lt;li&gt;Interoperate, where possible, with other metadata formats like
  &lt;a href="https://csvw.org"&gt;CSVW&lt;/a&gt;
  and &lt;a href="https://pypi.org/project/frictionless/"&gt;Frictionless&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate tdda.serial Metadata Files&lt;/strong&gt;.
  Assist with generating metadata describing the format of CSV
  files based on the write arguments provided to the &lt;em&gt;writing&lt;/em&gt; software.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write Flat Files&lt;/strong&gt;.
  Assist with getting libraries to write CSV files using a format specified
  in a &lt;code&gt;tdda.serial&lt;/code&gt; file.&lt;ul&gt;
&lt;li&gt;&lt;em&gt;This provides a second way of increasing interoperability:
  we can help readers to read from a specific format, and writers
  to write to that same format.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Assist/Support other Software Reading, Writing, and otherwise
  handling Flat Files&lt;/strong&gt;.&lt;ul&gt;
&lt;li&gt;DataFrame Libraries&lt;ul&gt;
&lt;li&gt;Pandas&lt;/li&gt;
&lt;li&gt;Polars&lt;/li&gt;
&lt;li&gt;Apache Arrow&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Databases&lt;ul&gt;
&lt;li&gt;DuckDB&lt;/li&gt;
&lt;li&gt;SQLite&lt;/li&gt;
&lt;li&gt;Postgres&lt;/li&gt;
&lt;li&gt;MySQL&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Miscellaneous&lt;ul&gt;
&lt;li&gt;Python &lt;code&gt;csv&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;tdda&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support Library-specific Read/Write Metadata&lt;/strong&gt;.
  Provide a mechanism for documenting library-specific read/write
  parameters for CSV files explicitly:&lt;ul&gt;
&lt;li&gt;For storing the library-specific write parameters used with
  &lt;code&gt;pandas.to_csv&lt;/code&gt;, &lt;code&gt;polars.write_csv&lt;/code&gt; in &lt;code&gt;.serial&lt;/code&gt; files (and the
  ability to use such parameters)&lt;/li&gt;
&lt;li&gt;For storing the library-specific read parameters required to
  read a flat file with high fidelity using,
  e.g. &lt;code&gt;pandas.read_csv&lt;/code&gt;, &lt;code&gt;polars.read_csv&lt;/code&gt; etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Assist with Format Choice&lt;/strong&gt;.
  Provide a mechanism for helping to choose a good CSV format for a
  concrete dataset to be written, e.g. choosing null indicators that
  are not likely to be confused with serialized non-null values in
  the dataset.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SERDE Verification&lt;/strong&gt;.
  Provide mechanisms for checking whether a dataset can be
  round-tripped successfully to a flat file (i.e. that the same
  library, at least, can write data to a flat file, read it back, and
  recover identical, equivalent, or similar data).&lt;sup id="fnref:identical"&gt;&lt;a class="footnote-ref" href="#fn:identical"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CLI Tools&lt;/strong&gt;.
  Provide, through the associated command-line tool, &lt;code&gt;tdda diff&lt;/code&gt;, and equivalent
  API functions, a way to check whether two datasets are equivalent.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the case of the command-line tool this is two datasets on
  disk (flat files, parquet files etc.). It might also be possible
  to compare two database tables, in the same or different RDBMS
  instances, or data in a database table and in
  a file on disk, though this is not yet implemented.
  (The next post will discuss &lt;code&gt;tdda diff&lt;/code&gt; further.)&lt;/li&gt;
&lt;li&gt;In the case of the API, this can also include in-memory data
  structures such as data frames.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provide Metadata Format Conversions.&lt;/strong&gt;
  Provide mechanisms for converting between different library-specific
  flat-file parameters and tdda's &lt;code&gt;tdda.serial&lt;/code&gt; format, as well
  as between the &lt;code&gt;tdda.serial&lt;/code&gt; format, &lt;code&gt;csvw&lt;/code&gt;, and (perhaps) &lt;code&gt;frictionless&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate Validation Statistics and Validate using them&lt;/strong&gt;.
  (Potentially) write additional data for a concrete dataset that
  can be used for further validation that it has been read correctly,
  e.g. summary statistics, checksums etc.&lt;/li&gt;
&lt;/ul&gt;
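&lt;p&gt;Two of the goals above, &lt;em&gt;Assist with Format Choice&lt;/em&gt; and &lt;em&gt;SERDE Verification&lt;/em&gt;, lend themselves to a small illustration. The sketch below uses only Python's standard &lt;code&gt;csv&lt;/code&gt; module and handles string values only; the function names &lt;code&gt;choose_null_marker&lt;/code&gt; and &lt;code&gt;round_trips&lt;/code&gt; and the candidate list are hypothetical and are not the &lt;code&gt;tdda.serial&lt;/code&gt; API.&lt;/p&gt;

```python
import csv
import io

# Candidate null markers, in order of preference (an illustrative list)
CANDIDATE_NULLS = ["", "NULL", "N/A"]


def choose_null_marker(rows, candidates=CANDIDATE_NULLS):
    """Pick the first candidate that never appears as a real (non-null) value."""
    used = {str(v) for row in rows for v in row if v is not None}
    for marker in candidates:
        if marker not in used:
            return marker
    raise ValueError("No safe null marker among candidates")


def round_trips(rows, null_marker, delimiter="|"):
    """Write rows of strings/None as delimited text, read back, and compare."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter)
    for row in rows:
        writer.writerow([null_marker if v is None else v for v in row])
    buffer.seek(0)
    reader = csv.reader(buffer, delimiter=delimiter)
    recovered = [[None if v == null_marker else v for v in row]
                 for row in reader]
    return recovered == [list(row) for row in rows]


# The data contains a genuine empty string and the literal text "NULL",
# so neither is safe to use as the null indicator.
rows = [["1", "alpha", ""], ["2", None, "NULL"]]
marker = choose_null_marker(rows)
print(repr(marker), round_trips(rows, marker), round_trips(rows, ""))
```

&lt;p&gt;With the empty string as the null marker, the genuine empty string in the first row comes back as a null, so the round trip fails; with the chosen marker it succeeds. A fuller check would also compare types, which is where the metadata comes in.&lt;/p&gt;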
&lt;h3 id="discussion"&gt;Discussion&lt;/h3&gt;
&lt;p&gt;The usual observation when proposing something new like this is
that the last thing the world needs is another “standard”.
As Randall Munroe puts it (&lt;a href="https://imgs.xkcd.com/927"&gt;https://imgs.xkcd.com/927&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://imgs.xkcd.com/comics/standards.png" width="500"
     alt="HOW STANDARDS PROLIFERATE: (See A/C chargers, character encodings, instant messaging etc.) Cartoon. Panel 1: SITUATION: There are 14 competing standards. Panel 2: (Conversation between two people.) 14? Ridiculous! We need to develop one universal standard that covers everyone's use cases. (Yeah.) Panel 3 (SOON:): SITUATION: There are 15 competing standards."/&gt;&lt;/p&gt;
&lt;p&gt;In this case, however, I don't think there are all that many
recognized ways of describing flat-file formats. I was involved in one
(the &lt;code&gt;.fdd&lt;/code&gt; flat-file description data format) while at Quadstone, and
I currently use the XMD format above at Stochastic Solutions, but
pretty much no one else does. A friend I was working with, Neil
Skilling, ran across the &lt;a href="https://csvw.org"&gt;CSVW standard&lt;/a&gt;,
developed under the auspices of W3C, and that led to my finding the
Python &lt;a href="https://framework.frictionlessdata.io"&gt;frictionless&lt;/a&gt; project.
At first I thought one of those might be the solution I was looking
for, but in fact they have goals and designs that are different enough
that they don't quite fulfill the most important goals for
&lt;code&gt;tdda.serial&lt;/code&gt;, as impressive as both projects are.
Reluctantly, therefore, I began working on &lt;code&gt;tdda.serial&lt;/code&gt;,
which aims to interoperate with and support CSVW (and, to some extent,
frictionless), but also to handle other cases.&lt;/p&gt;
&lt;p&gt;The biggest single difference between the focus of &lt;code&gt;tdda.serial&lt;/code&gt;
and CSVW is that &lt;code&gt;tdda.serial&lt;/code&gt; is primarily concerned with documenting
a format that might be used by many flat files (different concrete
datasets sharing the same structure and formatting) whereas CSVW
is primarily concerned with documenting either
a single specific CSV file or a specific collection of CSV files,
usually each having different structure. This seems like a rather
subtle difference, but in fact turns out to be quite consequential.&lt;/p&gt;
&lt;p&gt;Here's the first example CSVW file from &lt;a href="https://csvw.org"&gt;csvw.org&lt;/a&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;@context&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;http://www.w3.org/ns/csvw&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;@language&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;en&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;tables&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;url&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;http://opendata.leeds.gov.uk/downloads/gritting/grit_bins.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;tableSchema&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;columns&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;location&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;datatype&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;integer&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;easting&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;datatype&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;decimal&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;propertyUrl&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;http://data.ordnancesurvey.co.uk/ontology/spatialrelations/easting&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;northing&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;datatype&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;decimal&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;propertyUrl&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;http://data.ordnancesurvey.co.uk/ontology/spatialrelations/northing&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;aboutUrl&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;#{location}&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;dialect&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;header&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice that the CSVW file caters for multiple CSV files (a &lt;em&gt;list&lt;/em&gt; of tables
in the &lt;code&gt;tables&lt;/code&gt; element), and that the location of the table is provided
as a URL (which is a required element in CSVW).
In the context of &lt;em&gt;CSV on the web&lt;/em&gt;, this makes complete sense.
It's specified as being a URL, but can be a &lt;code&gt;file:&lt;/code&gt; URL, or
a simple path. One convention, for a CSVW file documenting a single
dataset, seems to be that the metadata for
&lt;code&gt;grit_bins.csv&lt;/code&gt; is stored in &lt;code&gt;grit_bins-metadata.json&lt;/code&gt; in the same
directory as the CSV file itself (locally, or on the web).&lt;/p&gt;
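&lt;p&gt;As a minimal sketch of that naming convention (the function name
&lt;code&gt;csvw_metadata_path&lt;/code&gt; is my own, hypothetical, and not part of any
CSVW tool), the metadata path can be derived from the CSV path like this:&lt;/p&gt;

```python
from pathlib import Path


def csvw_metadata_path(csv_path):
    """Return the conventional metadata path for a CSV file.

    Follows the convention described in the text: grit_bins.csv is
    documented by grit_bins-metadata.json in the same directory.
    """
    path = Path(csv_path)
    # Drop the .csv suffix, then append -metadata.json to the stem
    return path.with_name(path.stem + "-metadata.json")


print(csvw_metadata_path("data/grit_bins.csv"))
```

&lt;p&gt;This only handles local paths; a URL would need the same treatment applied
to its final path segment.&lt;/p&gt;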
&lt;p&gt;What is significant, however, is that this establishes either a one-to-one
relationship between CSV files and CSVW metadata files or,
if the CSVW file contains metadata about several files,
a one-to-one relationship between CSV files and metadata tables in
a CSVW file. Here, for example, is Example 5 from the
&lt;a href="https://w3c.github.io/csvw/primer/#example-5"&gt;CSVW Primer&lt;/a&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;@context&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;http://www.w3.org/ns/csvw&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;tables&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;url&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;countries.csv&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;url&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;country-groups.csv&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;url&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;unemployment.csv&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The metadata “knows” the data file
(or data files) that it describes. In contrast, the main concern of
&lt;code&gt;tdda.serial&lt;/code&gt; is to describe a format and structure that might well be
used for many specific (“concrete”) flat files. The relationship is
almost reversed as shown here:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://stochasticsolutions.com/image/csvw-vs-tddaserial.png"
     alt="Left: The CSVW file above, containing three CSV URLs, having arrows from each filename (URL) to that CSV file, as a named icon. Right: Three CSV files named machines1.csv, machines2.csv, and machines3.csv, each with arrows to a single tdda.serial file (the one shown above)." width="960"/&gt;&lt;/p&gt;
&lt;p&gt;Even though the URL (&lt;code&gt;url&lt;/code&gt;) is a mandatory parameter in CSVW, there is
nothing to prevent us from taking a CSVW file (particularly one
describing a single table) and using its metadata to define a format
to be used with other flat files. In doing so, however, we would clearly
be going against the grain of the design of CSVW. As an example
of how it then does not quite fit, sometimes we want the metadata to
describe exactly the fields in the data, and other times we want it to
be a partial specification. In the XMD file, there are explicit
parameters to say whether or not extra fields are allowed, and whether
all fields are required. In the case of the &lt;code&gt;tdda.serial&lt;/code&gt; file, we use
a list of fields when we are describing all the fields allowed and
required in a flat file, and a dictionary when we are providing
information only on a subset, not necessarily in order.&lt;sup id="fnref:maychange"&gt;&lt;a class="footnote-ref" href="#fn:maychange"&gt;4&lt;/a&gt;&lt;/sup&gt;
This sort of flexibility is harder in CSVW, which always uses a list
to specify the fields. I could propose and use extensions, or try to
get extensions added to the standard, but the former seems undesirable,
and the latter hard and unlikely. (It does not look as if there have
been any revisions to CSVW since 2022.)
There are, in fact, many details of CSVW that are problematical
for even the first two libraries I've looked at (Pandas and Polars),
so unfortunately I think something different is needed.&lt;/p&gt;
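&lt;p&gt;To illustrate the list-versus-dictionary distinction, here is a hedged sketch
of a checker. The function &lt;code&gt;check_fields&lt;/code&gt; and the shape of each field
entry (a dictionary with a &lt;code&gt;name&lt;/code&gt; key) are assumptions made for
illustration, not the tdda implementation; I also assume that every field named
in a partial (dictionary) specification must be present.&lt;/p&gt;

```python
def check_fields(spec_fields, actual_names):
    """Return a list of problems found when checking column names.

    spec_fields is either a list (full, ordered specification) or a
    dict (partial specification, in any order). Hypothetical sketch.
    """
    actual = list(actual_names)
    problems = []
    if isinstance(spec_fields, dict):
        # Partial spec: described fields must exist; extra fields are allowed
        for name in spec_fields:
            if name not in actual:
                problems.append("missing field: " + name)
    else:
        # Full spec: names must match exactly, and in the same order
        expected = [field["name"] for field in spec_fields]
        if expected != actual:
            problems.append("fields differ: expected %s, got %s" % (expected, actual))
    return problems


full_spec = [{"name": "id"}, {"name": "name"}]
partial_spec = {"name": {"fieldtype": "string"}}

print(check_fields(full_spec, ["id", "name"]))
print(check_fields(partial_spec, ["id", "name", "extra"]))
```

&lt;p&gt;The type of the container carries the intent, so no separate flags are needed
to say whether extra fields are allowed.&lt;/p&gt;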
&lt;h3 id="library-specific-support-in-tddaserial"&gt;Library-specific Support in tdda.serial&lt;/h3&gt;
&lt;p&gt;Another goal for &lt;code&gt;tdda.serial&lt;/code&gt; is that it should be
useful even for people who are only using a single library—e.g.
Pandas. In such cases, there is typically a function or
method for writing CSV files (&lt;code&gt;pandas.DataFrame.to_csv&lt;/code&gt;), and another for
reading them (&lt;code&gt;pandas.read_csv&lt;/code&gt;). Both typically have many
optional arguments, and in keeping with Postel's Law (the
&lt;a href="https://en.wikipedia.org/wiki/Robustness_principle"&gt;Robustness Principle&lt;/a&gt;),
they typically have more flexibility in read formats than in write formats.
In the case of Pandas, the read function's signature is:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;filepath_or_buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;no_default&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;infer&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;no_default&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;usecols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;converters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;true_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;false_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skipinitialspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;skiprows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skipfooter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nrows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;na_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;keep_default_na&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;na_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;no_default&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;skip_blank_lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;infer_datetime_format&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;no_default&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keep_date_col&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;no_default&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_parser&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;no_default&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dayfirst&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunksize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;compression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;infer&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thousands&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lineterminator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quotechar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quoting&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doublequote&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;escapechar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;encoding_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;strict&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dialect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on_bad_lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;error&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;delim_whitespace&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;no_default&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;low_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;float_precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;dtype_backend&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;no_default&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(49 parameters), while the write method's signature is:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;path_or_buf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;na_rep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;float_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;index_label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;infer&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quoting&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quotechar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lineterminator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunksize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doublequote&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;escapechar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;strict&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(22 parameters).&lt;/p&gt;
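&lt;p&gt;Counts like these can be reproduced with the standard library's
&lt;code&gt;inspect.signature&lt;/code&gt;. The helper &lt;code&gt;count_parameters&lt;/code&gt; and the
stand-in function &lt;code&gt;demo&lt;/code&gt; below are hypothetical names; the actual
numbers depend on the installed version of Pandas, and counting an unbound method
such as &lt;code&gt;DataFrame.to_csv&lt;/code&gt; includes &lt;code&gt;self&lt;/code&gt;.&lt;/p&gt;

```python
import inspect


def count_parameters(func):
    """Count the parameters in a callable's signature."""
    return len(inspect.signature(func).parameters)


def demo(path, *, sep=",", header=True):
    """Stand-in function so the sketch runs without pandas installed."""


print(count_parameters(demo))  # 3

try:
    import pandas as pd
except ImportError:
    pd = None

if pd is not None:
    # Results vary between pandas versions; to_csv's count includes self
    print(count_parameters(pd.read_csv))
    print(count_parameters(pd.DataFrame.to_csv))
```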
&lt;p&gt;The &lt;code&gt;tdda&lt;/code&gt; library's command-line tools allow a &lt;code&gt;tdda.serial&lt;/code&gt;
specification to be converted to parameters for &lt;code&gt;pandas.read_csv&lt;/code&gt;,
returning them as a dictionary that can be passed in using
&lt;code&gt;**kwargs&lt;/code&gt;. It can also generate Python
code to do the read using &lt;code&gt;pandas.read_csv&lt;/code&gt;
or directly perform the read, saving the result to Parquet.&lt;/p&gt;
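&lt;p&gt;The idea of turning a format description into keyword arguments can be
sketched as follows. The input keys (&lt;code&gt;delimiter&lt;/code&gt;,
&lt;code&gt;quote_char&lt;/code&gt;, &lt;code&gt;null_markers&lt;/code&gt; and so on) and the function
&lt;code&gt;to_read_csv_kwargs&lt;/code&gt; are invented for illustration and are not the
&lt;code&gt;tdda.serial&lt;/code&gt; schema or API; the output keys are real
&lt;code&gt;pandas.read_csv&lt;/code&gt; parameters.&lt;/p&gt;

```python
def to_read_csv_kwargs(fmt):
    """Map a small, hypothetical format description to read_csv kwargs."""
    kwargs = {
        "sep": fmt.get("delimiter", ","),
        "encoding": fmt.get("encoding", "utf-8"),
        "quotechar": fmt.get("quote_char", '"'),
    }
    # header=0: first line holds the names; header=None: no header row
    kwargs["header"] = 0 if fmt.get("header", True) else None
    if "null_markers" in fmt:
        kwargs["na_values"] = list(fmt["null_markers"])
    return kwargs


fmt = {"delimiter": "|", "header": False, "null_markers": ["", "NULL"]}
kwargs = to_read_csv_kwargs(fmt)
print(kwargs)
# With pandas installed: df = pandas.read_csv("machines1.csv", **kwargs)
```

&lt;p&gt;Because the result is a plain dictionary, the same description can be
applied to any number of files that share the format.&lt;/p&gt;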
&lt;p&gt;Similarly, the library can take a set of arguments for &lt;code&gt;DataFrame.to_csv&lt;/code&gt;
and create a &lt;code&gt;tdda.serial&lt;/code&gt; file describing the format used (or
write the data and metadata together).&lt;/p&gt;
&lt;p&gt;For a user working with a single library, however, converting to and from
&lt;code&gt;tdda.serial&lt;/code&gt;'s metadata description might be unnecessarily cumbersome and
may work imperfectly. This is because different libraries represent
data differently, and are based on slightly different conceptions of CSV files.
While I am going to make some effort to make &lt;code&gt;tdda.serial&lt;/code&gt; universal,
it is likely that there will always be some cases in which there is
a loss of fidelity moving between any specific library's arguments
and the &lt;code&gt;.serial&lt;/code&gt; representation.&lt;/p&gt;
&lt;p&gt;For these reasons, the &lt;code&gt;tdda&lt;/code&gt; library also supports directly writing
arguments for a given library. That is why the &lt;code&gt;tdda.serial&lt;/code&gt; metadata
description is one level down inside the &lt;code&gt;tdda.serial&lt;/code&gt; file, under
a &lt;code&gt;tdda.serial&lt;/code&gt; key. It is also possible to have sections for
&lt;code&gt;pandas.read_csv&lt;/code&gt; and &lt;code&gt;polars.read_csv&lt;/code&gt;, with exactly the arguments
they need.&lt;/p&gt;
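&lt;p&gt;A reader could prefer a library-specific section when one is present and
fall back to the generic description otherwise. This is a sketch only: it assumes
the file is JSON, the section contents are made up, and
&lt;code&gt;reader_section&lt;/code&gt; is a hypothetical helper rather than part of the
&lt;code&gt;tdda&lt;/code&gt; library.&lt;/p&gt;

```python
import json


def reader_section(serial_text, library_key):
    """Return (key, section) for the section a reader should use.

    Prefers the library-specific section if present; otherwise falls
    back to the generic description under the "tdda.serial" key.
    """
    doc = json.loads(serial_text)
    if library_key in doc:
        return library_key, doc[library_key]
    return "tdda.serial", doc["tdda.serial"]


# Hypothetical file contents, for illustration only
example = json.dumps({
    "tdda.serial": {"delimiter": "|"},
    "pandas.read_csv": {"sep": "|", "header": 0},
})

print(reader_section(example, "pandas.read_csv"))
print(reader_section(example, "polars.read_csv"))
```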
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:dev"&gt;
&lt;p&gt;The functionality used in this post is not in the release
version of the tdda library, but is there on a branch called
&lt;code&gt;detectreport&lt;/code&gt;, so can be accessed if anyone is particularly keen.&amp;#160;&lt;a class="footnote-backref" href="#fnref:dev" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:infact"&gt;
&lt;p&gt;In fact, in writing this post, I updated the
&lt;a href="https://www.tdda.info/flat-files-aka-csv-files"&gt;previous one&lt;/a&gt;
to use a slightly more sensible example than previously; this is the
new, slightly more useful example.&amp;#160;&lt;a class="footnote-backref" href="#fnref:infact" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:identical"&gt;
&lt;p&gt;CSV is not a very suitable format for perfect round-tripping
of data for reasons including numeric rounding, multiple types for
the same data, and equivalent representations such as strings and categoricals.
Even using a typed format such as parquet, some of these details may
change on round-tripping and most software needs a library-specific format
in order to achieve perfect fidelity when serializing and deserializing
data.&amp;#160;&lt;a class="footnote-backref" href="#fnref:identical" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:maychange"&gt;
&lt;p&gt;This precise mechanism may change, but it is important for
&lt;code&gt;tdda.serial&lt;/code&gt;'s purpose that it supports both full and partial field
schema specification.&amp;#160;&lt;a class="footnote-backref" href="#fnref:maychange" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="misc"></category></entry><entry><title>TDDA and Quality for LLMs</title><link href="https://tdda.info/tdda-and-quality-for-llms.html" rel="alternate"></link><published>2024-12-23T16:00:00+00:00</published><updated>2024-12-23T16:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2024-12-23:/tdda-and-quality-for-llms.html</id><summary type="html">&lt;p&gt;It is December 2024 as I write, and large language models (LLMs)
are having an extended moment as I have been writing a book on
test-driven data analysis. Several people have suggested that I
should write about LLMs or &lt;em&gt;artificial intelligence&lt;/em&gt; (AI),
a term that for many people now means …&lt;/p&gt;</summary><content type="html">&lt;p&gt;It is December 2024 as I write, and large language models (LLMs)
are having an extended moment as I have been writing a book on
test-driven data analysis. Several people have suggested that I
should write about LLMs or &lt;em&gt;artificial intelligence&lt;/em&gt; (AI),
a term that for many people now means either LLMs or LLMs and
the other forms of generative AI.&lt;/p&gt;
&lt;p&gt;Training
Inference&lt;/p&gt;
&lt;p&gt;Size
Training Data&lt;/p&gt;
&lt;p&gt;Inputs&lt;/p&gt;
&lt;p&gt;Goal&lt;/p&gt;
&lt;p&gt;First do no harm.&lt;/p&gt;
&lt;p&gt;Strong AI.&lt;/p&gt;
&lt;p&gt;Beliefs.
Hallucinations.&lt;/p&gt;
&lt;p&gt;Stochastic hypothesis generators.&lt;/p&gt;
&lt;p&gt;Rhydwaith&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;LLMs are neural networks that (loosely) predict the next word.*&lt;/li&gt;
&lt;li&gt;Given some text, they predict the next word&lt;/li&gt;
&lt;li&gt;You build sentences by appending each predicted word to the input and iterating.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Mary had a -&amp;gt; little
   Mary had a little -&amp;gt; lamb,
   Mary had a little lamb, -&amp;gt; its
   Mary had a little lamb, its -&amp;gt; fleece&lt;/p&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;p&gt;Mary had a -&amp;gt; seizure
   Mary had a seizure -&amp;gt; last
   Mary had a seizure last -&amp;gt; night&lt;/p&gt;
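&lt;p&gt;The append-and-iterate loop can be sketched with a toy lookup table standing
in for a trained model. The table &lt;code&gt;NEXT_WORD&lt;/code&gt; and the function
&lt;code&gt;generate&lt;/code&gt; are illustrative inventions; a real LLM produces
probabilities over possible next words rather than a fixed answer.&lt;/p&gt;

```python
# Toy stand-in for a trained model: maps the text so far to the next word
NEXT_WORD = {
    "Mary had a": "little",
    "Mary had a little": "lamb,",
    "Mary had a little lamb,": "its",
    "Mary had a little lamb, its": "fleece",
}


def generate(prompt, max_words=10):
    """Repeatedly append the predicted next word to the input."""
    text = prompt
    for _ in range(max_words):
        word = NEXT_WORD.get(text)
        if word is None:  # no prediction available for this context
            break
        text = text + " " + word
    return text


print(generate("Mary had a"))  # Mary had a little lamb, its fleece
```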
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;LLMs are trained on unimaginably large corpuses of data,
   mainly from the web.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;LLMs have trillions of parameters—knobs that can be set to different
   values&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;With any given parameter settings, an LLM will predict the next word&lt;/li&gt;
&lt;li&gt;Some knob settings match the next-word associations better than others&lt;/li&gt;
&lt;li&gt;Training an LLM consists of optimizing the knob settings&lt;/li&gt;
&lt;li&gt;(Most of) the parameters (knobs) are called “weights”.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;During training, the current weights are used to predict the next word&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When it is “wrong” (differs from the input), the weights are adjusted&lt;/li&gt;
&lt;li&gt;Even when it is “right”, the weights are usually adjusted&lt;/li&gt;
&lt;li&gt;The raw prediction is not a single word, but probabilities
     for possible words&lt;/li&gt;
&lt;li&gt;There is always an error, which can always be reduced.&lt;/li&gt;
&lt;li&gt;The adjustments are calculated to try to reduce the errors over time.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;</content><category term="misc"></category></entry><entry><title>Best Practices for Notebook Users</title><link href="https://tdda.info/best-practices-for-notebook-users.html" rel="alternate"></link><published>2024-12-17T16:00:00+00:00</published><updated>2024-12-17T16:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2024-12-17:/best-practices-for-notebook-users.html</id><summary type="html">&lt;p&gt;In a &lt;a href="https://www.tdda.info/jupyter-notebooks-considered-harmful-the-parables-of-anne-and-beth"&gt;previous post&lt;/a&gt;,
I discussed some of the challenges, dangers and weaknesses
of Jupyter Notebooks, JupyterLab and their ilk.
I used
&lt;a href="https://www.tdda.info/jupyter-notebooks-considered-harmful-the-parables-of-anne-and-beth"&gt;The Parables of Anne and Beth&lt;/a&gt; as a device to illustrate what I think of as good and
bad practices for data science. A reasonable criticism …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In a &lt;a href="https://www.tdda.info/jupyter-notebooks-considered-harmful-the-parables-of-anne-and-beth"&gt;previous post&lt;/a&gt;,
I discussed some of the challenges, dangers and weaknesses
of Jupyter Notebooks, JupyterLab and their ilk.
I used
&lt;a href="https://www.tdda.info/jupyter-notebooks-considered-harmful-the-parables-of-anne-and-beth"&gt;The Parables of Anne and Beth&lt;/a&gt; as a device to illustrate what I think of as good and
bad practices for data science. A reasonable criticism of this was that it did not
really offer anything to help people who might wish to continue using computational
notebooks, but to work in such a way as to limit the harms identified.&lt;/p&gt;
&lt;p&gt;Although it probably rings slightly hollow, my goal is absolutely to improve the
quality of data science, data analysis, data engineering, and really all data work,
and I very much see the attractions and strengths of Jupyter, despite being critical
of certain dark patterns I see around their use.&lt;/p&gt;
&lt;p&gt;Here are some suggested best practices for notebook users that I hope might be
helpful and constructive. It's true that if you adopt all of them, I might have
succeeded in prising your Notebook from your hands, but if you adopt any of them,
as you use notebooks, I think you will be safer and more successful. I'm very much
in favour of half a loaf.&lt;/p&gt;
&lt;p&gt;Subject to the vagaries of the web, the checkboxes, online, should be clickable,
should you find it useful as an actual checklist, and there's a
&lt;a href="https://stochasticsolutions.com/pdf/nbp.pdf"&gt;PDF version available&lt;/a&gt; too.&lt;/p&gt;
&lt;h2 id="notebook-best-practices"&gt;Notebook Best Practices&lt;/h2&gt;
&lt;p&gt;&lt;img src="images/checkmark-jupyter.png" width="300px" alt="Jupyter logo with checkmark" style="padding-left: 20px;"/&gt;&lt;/p&gt;
&lt;ul style='list-style-type: ""'&gt;
  &lt;li&gt; NBP1 &lt;input type="checkbox"/&gt; Ensure that your Notebook runs correctly after completion
        &lt;ul style='list-style-type: ""; margin-left:2em;'&gt;
          &lt;li&gt; &lt;input type="checkbox"/&gt; Develop the Notebook&lt;/li&gt;
          &lt;li&gt; &lt;input type="checkbox"/&gt; Take a temporary copy of the Notebook&lt;/li&gt;
          &lt;li&gt; &lt;input type="checkbox"/&gt; Clear the Notebook&lt;/li&gt;
          &lt;li&gt; &lt;input type="checkbox"/&gt; Run and confirm the results match the temporary copy&lt;/li&gt;
          &lt;ul style='list-style-type: ""; margin-left:-1em;'&gt;
              &lt;li&gt;  &lt;input type="checkbox"/&gt; Fix (if this is not the case)&lt;/li&gt;
          &lt;/ul&gt;
          &lt;li&gt;  &lt;input type="checkbox"/&gt; Clear the new Notebook again&lt;/li&gt;
          &lt;li&gt;  &lt;input type="checkbox"/&gt; Commit the new Notebook to version control&lt;/li&gt;
          &lt;li&gt;  &lt;input type="checkbox"/&gt; Rerun (so the Notebook contains the results)&lt;/li&gt;
          &lt;li&gt;  &lt;input type="checkbox"/&gt; Delete the temporary copy&lt;/li&gt;
        &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt; NBP2 &lt;input type="checkbox"/&gt; Store the (cleared) Notebook in version control (see NBP1)&lt;/li&gt;
  &lt;li&gt; NBP3 &lt;input type="checkbox"/&gt; Parameterize inputs and outputs at the top of the Notebook&lt;/li&gt;
        &lt;ul style='list-style-type: ""; margin-left:2em;'&gt;
          &lt;li&gt;  &lt;input type="checkbox"/&gt; e.g. Set and use variables
                such as &lt;span style="font-family: monospace"&gt;INPATH&lt;/span&gt; and &lt;span style="font-family: monospace"&gt;OUTPATH&lt;/span&gt;&lt;/li&gt;
          &lt;li&gt;  &lt;input type="checkbox"/&gt; Write the most important outputs to file (if not already done)&lt;/li&gt;
        &lt;/ul&gt;
  &lt;li&gt; NBP4 &lt;input type="checkbox"/&gt;  Replace some individual cells or groups of cells with functions&lt;/li&gt;
      &lt;ul style='list-style-type: ""; margin-left:2em;'&gt;
          &lt;li&gt;  &lt;input type="checkbox"/&gt; Move them into an importable file, import and use&lt;/li&gt;
          &lt;li&gt;  &lt;input type="checkbox"/&gt; Prioritize potentially re-usable code
                and code requiring testing&lt;/li&gt;
      &lt;/ul&gt;
  &lt;li&gt; NBP5 &lt;input type="checkbox"/&gt; Write some tests&lt;/li&gt;
          &lt;ul style='list-style-type: ""; margin-left:2em;'&gt;
            &lt;li&gt;  &lt;input type="checkbox"/&gt; Create a reference (regression) test for the whole process&lt;/li&gt;
            &lt;li&gt;  &lt;input type="checkbox"/&gt; Create unit tests for the individual functions&lt;/li&gt;
           &lt;/ul&gt;
  &lt;li&gt; NBP6 &lt;input type="checkbox"/&gt; Consider restructuring/extracting the code as a standalone script&lt;/li&gt;
  &lt;li&gt; NBP7 &lt;input type="checkbox"/&gt; Allow the parameters to be set from the command line&lt;/li&gt;
        &lt;ul style='list-style-type: ""; margin-left:2em;'&gt;
            &lt;li&gt;  &lt;input type="checkbox"/&gt; Alternatively, read from a configuration file
                  (e.g. &lt;span style="font-family: monospace"&gt;.json&lt;/span&gt; or &lt;span style="font-family: monospace"&gt;.toml&lt;/span&gt;)&lt;/li&gt;
        &lt;/ul&gt;
  &lt;li&gt; NBP8 &lt;input type="checkbox"/&gt; Consider using safer alternatives like &lt;a href="https://marimo.io"&gt;Marimo&lt;/a&gt;
  and &lt;a href="https://quarto.org"&gt;Quarto&lt;/a&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NBP Version 1.0.&lt;/p&gt;
&lt;p&gt;A &lt;a href="https://stochasticsolutions.com/pdf/nbp.pdf"&gt;printable PDF copy&lt;/a&gt; is available.&lt;/p&gt;</content><category term="misc"></category></entry><entry><title>Log Graphs and Grokkability</title><link href="https://tdda.info/log-graphs-and-grokkability.html" rel="alternate"></link><published>2024-12-12T16:00:00+00:00</published><updated>2024-12-12T16:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2024-12-12:/log-graphs-and-grokkability.html</id><summary type="html">&lt;p&gt;In his novel &lt;em&gt;Stranger in a Strange Land,&lt;/em&gt;
Robert Heinlein&lt;sup id="fnref:Heinlein"&gt;&lt;a class="footnote-ref" href="#fn:Heinlein"&gt;1&lt;/a&gt;&lt;/sup&gt;
introduced the word &lt;em&gt;grok&lt;/em&gt;.
It is used all the time in the computing sphere, but rarely,
as far as I know, outside it.
The definition that seems to me most closely to match its usage is:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;grok&lt;/strong&gt; (&lt;em&gt;transitive verb …&lt;/em&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;In his novel &lt;em&gt;Stranger in a Strange Land,&lt;/em&gt;
Robert Heinlein&lt;sup id="fnref:Heinlein"&gt;&lt;a class="footnote-ref" href="#fn:Heinlein"&gt;1&lt;/a&gt;&lt;/sup&gt;
introduced the word &lt;em&gt;grok&lt;/em&gt;.
It is used all the time in the computing sphere, but rarely,
as far as I know, outside it.
The definition that seems to me most closely to match its usage is:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;grok&lt;/strong&gt; (&lt;em&gt;transitive verb&lt;/em&gt;).&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;To understand profoundly through intuition or empathy.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;grok&lt;/strong&gt; (&lt;em&gt;verb&lt;/em&gt;).&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;To have or to have acquired an
   &lt;em&gt;intuitive understanding&lt;/em&gt; of; to &lt;em&gt;know&lt;/em&gt; (something) without
   having to &lt;em&gt;think&lt;/em&gt; (such as knowing the number of objects in a
   collection without needing to count them: see &lt;em&gt;subitize&lt;/em&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To fully and completely understand something in
   all its details and intricacies.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To get the meaning of something.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;— &lt;em&gt;The American Heritage Dictionary of the English Language&lt;/em&gt;,
5th edition.&lt;sup id="fnref:AHDEL2012"&gt;&lt;a class="footnote-ref" href="#fn:AHDEL2012"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Graphs and other visualizations are among the most powerful tools we
have for identifying and elucidating patterns in data—for helping us
to &lt;em&gt;grok&lt;/em&gt; data. A good graph can be extremely dense in information and
easy to understand. By the same token, a bad graph can, deliberately
or inadvertently, be a powerful tool for misleading and spreading confusion
and misunderstanding.  Data professionals who subscribe to the Hippocratic
Oath, “First, do no harm”, should avoid poor graphing practices as a
matter of a high priority.&lt;/p&gt;
&lt;h3 id="log-scales"&gt;Log Scales&lt;/h3&gt;
&lt;p&gt;As I was writing a chapter on graphs in a book on TDDA,
I happened upon the graph below from &lt;a href="https://bsky.app/profile/ourworldindata.org/post/3lcphnfv77s2q"&gt;Our World in Data (OWID)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://stochasticsolutions.com/image/ourworldindataincomeinequality-2024-12-07.png" alt="A graph showing the incomes in the table on a y-axis with a log scale and tick lines at $100, $200, $500, $1k, $2k, $5k and $10k. The seven countries are displayed on the x-axis as an orange dot for the 10th percentile income and a blue dot for the 90th percentile income. There is a vertical line between these two numbers for each country, annotated with the ratio of the two incomes." width="1000"/&gt;&lt;/p&gt;
&lt;p&gt;Graph from Our World in Data on &lt;a href="https://bsky.app/profile/ourworldindata.org/post/3lcphnfv77s2q"&gt;Bluesky&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It uses data from the &lt;a href="https://www.lisdatacenter.org"&gt;Luxembourg Income Study&lt;/a&gt;
and shows 14 base numbers: the
post-tax incomes of the 10th percentile and 90th percentile citizens
of seven countries (poorer and richer groups, respectively).  The most
common way of measuring income inequality uses the
&lt;a href="https://ourworldindata.org/grapher/economic-inequality-gini-index"&gt;Gini coefficient&lt;/a&gt;,&lt;sup id="fnref:Hasell2023"&gt;&lt;a class="footnote-ref" href="#fn:Hasell2023"&gt;3&lt;/a&gt;&lt;/sup&gt; which is a single number that is powerful but
rather abstract.  The OWID graph is much more intuitive,
focusing as it does on the ratio of incomes between a richer and poorer
group, each defined by its position in the income distribution.
The raw post-tax incomes and the derived multiples (ratios)
are shown in this table:&lt;/p&gt;
&lt;p&gt;The Data Shown in the OWID Graph.&lt;sup id="fnref:data"&gt;&lt;a class="footnote-ref" href="#fn:data"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;&lt;strong&gt;Country&lt;/strong&gt;&lt;/th&gt;
&lt;th style="text-align: right;"&gt;&lt;strong&gt;South Africa&lt;/strong&gt;&lt;/th&gt;
&lt;th style="text-align: right;"&gt;&lt;strong&gt;Brazil&lt;/strong&gt;&lt;/th&gt;
&lt;th style="text-align: right;"&gt;&lt;strong&gt;China&lt;/strong&gt;&lt;/th&gt;
&lt;th style="text-align: right;"&gt;&lt;strong&gt;Uruguay&lt;/strong&gt;&lt;/th&gt;
&lt;th style="text-align: right;"&gt;&lt;strong&gt;UK&lt;/strong&gt;&lt;/th&gt;
&lt;th style="text-align: right;"&gt;&lt;strong&gt;US&lt;/strong&gt;&lt;/th&gt;
&lt;th style="text-align: right;"&gt;&lt;strong&gt;Norway&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;&lt;strong&gt;Year&lt;/strong&gt;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2017&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2022&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2018&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2022&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2021&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2022&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2021&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;&lt;strong&gt;90th percentile ($USD)&lt;/strong&gt;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2,480&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,650&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,900&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2,220&lt;/td&gt;
&lt;td style="text-align: right;"&gt;4,100&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6,830&lt;/td&gt;
&lt;td style="text-align: right;"&gt;5,130&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;&lt;strong&gt;10th percentile ($USD)&lt;/strong&gt;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;110&lt;/td&gt;
&lt;td style="text-align: right;"&gt;195&lt;/td&gt;
&lt;td style="text-align: right;"&gt;250&lt;/td&gt;
&lt;td style="text-align: right;"&gt;395&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,080&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,170&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1,670&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;&lt;strong&gt;Multiple&lt;/strong&gt;&lt;/td&gt;
&lt;td style="text-align: right;"&gt;22⨉&lt;/td&gt;
&lt;td style="text-align: right;"&gt;8.4⨉&lt;/td&gt;
&lt;td style="text-align: right;"&gt;7.8⨉&lt;/td&gt;
&lt;td style="text-align: right;"&gt;5.6⨉&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.8⨉&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6⨉&lt;/td&gt;
&lt;td style="text-align: right;"&gt;3.1⨉&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Before discussing weaknesses with the graph, consider a few of its
exemplary features. The graph minimizes chart junk&lt;sup id="fnref:vdqi"&gt;&lt;a class="footnote-ref" href="#fn:vdqi"&gt;5&lt;/a&gt;&lt;/sup&gt; while clearly
explaining the findings with various direct annotations. Labels are
close to the data, the numbers are easy to read, and different weights
and colours (shades) of text are used to emphasize and de-emphasize
information. The seven numbers they most wish to focus on are the
income ratios for each country, which are shown clearly.  The notes at
the bottom both specify the source of the data and provide useful
guidance about interpreting the numbers. The use of
colour is effective, and the colours used appear to have been chosen
to work for readers with colour blindness as well as everyone else. I
think the graph draws the reader in, and is much more likely to cause
someone to stop and study it than the dry table of numbers alone would be,
particularly in social media (where I saw this).&lt;/p&gt;
&lt;p&gt;Despite these many merits, I think the graph only partially succeeds
&lt;em&gt;as a graph.&lt;/em&gt; My own experience was that I
did get a feel for the information (I believe) the graph was trying to
communicate by reading it (which did not take too long—it is,
after all, only 21 numbers), but to gain that understanding I did have
to &lt;em&gt;read&lt;/em&gt; all 21 numbers—the 7 multipliers from the labels,
and the 14 incomes by looking across at the &lt;em&gt;y&lt;/em&gt;-axis.
Even when I had done this, I still did not have the same intuitive,
almost visceral understanding that comes from a really effective graph.
The immediate reason for this is the log scale: humans simply
do not have the same effortless ability to &lt;em&gt;grok&lt;/em&gt; information
presented using log-scaled lengths as we do with proportionately
scaled lengths.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://stochasticsolutions.com/image/OurWorldInData-LuxembourghIncomeStudy2024-Redrawn.png" alt="Top: the previous graph redrawn in much the same manner but using a linear scale for the $y$-axis. Bottom: the ratios from the original graph plotted in a similar style, but with the poor group in each country at value 1 (1 times), and the rich group at whatever the income ratio is for that country." width="1000"/&gt;&lt;/p&gt;
&lt;p&gt;My first instinct was that the log scale was unnecessary: the range of
values is not too great to show comfortably on a linear scale. So I
plotted the top graph of the pair above, which comfortably displays
all the incomes, albeit with the qualification that differences in the
incomes of the poorer groups in the first four countries are harder to
differentiate.  (Smaller markers would help here, but I wanted to
change as little as possible from the OWID graph, where the marker
size is not a problem.) I contend that the redrawn graph makes it
easier to compare incomes across the seven different countries
&lt;em&gt;visually&lt;/em&gt;. Unfortunately, however, it is hopeless for conveying the
income ratios, which are surely the quantities OWID most wanted to
communicate.  In the redrawn graph, a larger separation between the
rich and poor groups does &lt;em&gt;not&lt;/em&gt; mean that the income ratio is larger,
still less is the separation proportional to the ratio: the absolute
difference between incomes in the US is more than twice that in South
Africa, because incomes are so much higher in the US, so its
multiplier of 6⨉ is represented by a much longer line than is the 22⨉
multiplier for South Africa. The log plot cleverly resolves this
because &lt;em&gt;addition&lt;/em&gt; of logs is equivalent to multiplication of the original
quantities.  As a result, differences on a log plot are the logs of
the ratios, and do &lt;em&gt;rank&lt;/em&gt; the ratios correctly.&lt;sup id="fnref:logs"&gt;&lt;a class="footnote-ref" href="#fn:logs"&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
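&lt;p&gt;The point about ranking can be checked numerically. The following is a minimal Python sketch using the incomes reconstructed in the table above (so the computed ratios differ slightly from the labels on the OWID graph); the helper name &lt;code&gt;ranked&lt;/code&gt; is purely illustrative.&lt;/p&gt;

```python
import math

# Incomes in USD, reconstructed from the OWID graph (see the table above):
# country: (10th percentile, 90th percentile)
incomes = {
    "South Africa": (110, 2480),
    "Brazil": (195, 1650),
    "China": (250, 1900),
    "Uruguay": (395, 2220),
    "UK": (1080, 4100),
    "US": (1170, 6830),
    "Norway": (1670, 5130),
}


def ranked(values):
    """Return the country names ordered from largest to smallest value."""
    return sorted(values, key=values.get, reverse=True)


ratios = {c: hi / lo for c, (lo, hi) in incomes.items()}
linear_gaps = {c: hi - lo for c, (lo, hi) in incomes.items()}
log_gaps = {c: math.log10(hi) - math.log10(lo)
            for c, (lo, hi) in incomes.items()}

# A separation on a log axis is the log of the ratio
for c in incomes:
    assert math.isclose(log_gaps[c], math.log10(ratios[c]))

print("by ratio:     ", ranked(ratios))
print("by log gap:   ", ranked(log_gaps))     # same order as by ratio
print("by linear gap:", ranked(linear_gaps))  # a different order
```

&lt;p&gt;On these figures the log-axis separations put South Africa first, as the ratios do, whereas the linear separations put the US first.&lt;/p&gt;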
&lt;p&gt;The cleverness of the presentation chosen by OWID is that it
allows a single graph, with a single scale, to show both the relative
incomes among countries and the ratios between incomes of the
richer and poorer groups in each country.  But the cost of this
approach is that not only the income scale itself, but also the
multipliers are shown on log scales which—to labour the point—are
&lt;em&gt;hard to grok&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The bottom graph in the redrawn figure shows the multipliers
themselves on a linear scale. I think this is a significantly better
(more grokkable) way to visualize them. There is, however, a final
subtlety. In general, ratios of positive quantities can have any
positive value. But in this case, the multiplier constructed is the
ratio of a &lt;em&gt;larger&lt;/em&gt; income (the 90th percentile) and a &lt;em&gt;smaller&lt;/em&gt; one (the
10th percentile). Manifestly, this cannot be smaller than 1—the value
it would take if everyone in the country had the same post-tax income.
It is for this reason that I have drawn the connecting lines on the last plot
from 1⨉ to the multiplier, rather than from 0, and have chosen not to
present this as a conventional bar graph. One &lt;em&gt;is&lt;/em&gt; the effective zero for these
particular multipliers.&lt;sup id="fnref:log1"&gt;&lt;a class="footnote-ref" href="#fn:log1"&gt;7&lt;/a&gt;&lt;/sup&gt; A country with a perfectly even income
distribution would have no distance between the 10th and 90th
percentiles—once more emphasizing how clever the device of using
a log scale for this data is, even if, as I contend, it was ultimately
a poor choice for communication.&lt;/p&gt;
&lt;p&gt;Another way of saying this is that OWID (I believe)
was trying to show two different things, lying on naturally
different scales, on a single plot. OWID found a clever
technical solution that allowed them to do this, but at the
cost of grokkability. As I deconstructed it, I realised I needed
two plots to show the data in ways that I think are much easier
to understand. You, of course, must form your own judgement.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:Heinlein"&gt;
&lt;p&gt;Heinlein, Robert A. (1961). Stranger in a Strange Land. G. P. Putnam’s Sons.&amp;#160;&lt;a class="footnote-backref" href="#fnref:Heinlein" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:AHDEL2012"&gt;
&lt;p&gt;The American Heritage Dictionary of the English Language (2022). 5th ed. Random House Inc.&amp;#160;&lt;a class="footnote-backref" href="#fnref:AHDEL2012" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:Hasell2023"&gt;
&lt;p&gt;Hasell, Joe (2023). &lt;em&gt;Measuring inequality: what is the Gini coefficient?&lt;/em&gt;
In: Our World in Data. https://ourworldindata.org/what-is-the-gini-coefficient.&amp;#160;&lt;a class="footnote-backref" href="#fnref:Hasell2023" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:data"&gt;
&lt;p&gt;The data was reconstructed from the OWID graph, so there
will be minor deviations from the original Luxembourg Income Study data.&amp;#160;&lt;a class="footnote-backref" href="#fnref:data" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:vdqi"&gt;
&lt;p&gt;&lt;em&gt;The Visual Display of Quantitative Information&lt;/em&gt;,
     Edward R. Tufte, Graphics Press, 1984.&amp;#160;&lt;a class="footnote-backref" href="#fnref:vdqi" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:logs"&gt;
&lt;p&gt;since logarithms are monotonic, increasing functions.&amp;#160;&lt;a class="footnote-backref" href="#fnref:logs" title="Jump back to footnote 6 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:log1"&gt;
&lt;p&gt;because log 1 = 0&amp;#160;&lt;a class="footnote-backref" href="#fnref:log1" title="Jump back to footnote 7 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="misc"></category></entry><entry><title>Jupyter Notebooks Considered Harmful: The Parables of Anne and Beth</title><link href="https://tdda.info/jupyter-notebooks-considered-harmful-the-parables-of-anne-and-beth.html" rel="alternate"></link><published>2024-11-14T18:30:00+00:00</published><updated>2024-11-14T18:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2024-11-14:/jupyter-notebooks-considered-harmful-the-parables-of-anne-and-beth.html</id><summary type="html">&lt;p&gt;I have long considered writing a post about the various problems I see
with computational notebooks such as Jupyter Notebooks. As part of a book
I am writing on TDDA, I created four parables about good and bad
development practices for analytical workflows. They were not intended
to form this …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I have long considered writing a post about the various problems I see
with computational notebooks such as Jupyter Notebooks. As part of a book
I am writing on TDDA, I created four parables about good and bad
development practices for analytical workflows. They were not intended
to form this post; but the way they turned out fits the theme quite well.&lt;/p&gt;
&lt;h3 id="situation-report"&gt;Situation Report&lt;/h3&gt;
&lt;p&gt;Anne and Beth are data scientists, working in parallel roles in
different organizations. Each was previously tasked with analysing
data up to the end of the second quarter of 2024. Their analyses
were successful and popular. Even though there had never been any
suggestion that the analysis would need to be updated,
on the last Friday of October Anne and Beth each receive an urgent request
to “rerun the numbers using data up to the
end of Q3”. The following parables show four different ways this
might play out for our two protagonists from this common initial
situation.&lt;/p&gt;
&lt;h3 id="parable-1-the-parable-of-parameterization"&gt;Parable #1: The Parable of Parameterization&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Beth&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Beth locates the Jupyter Notebook she used to run the numbers
previously and copies it with a new name ending &lt;code&gt;Q3&lt;/code&gt;. She changes the
copy to use the new data and tries to run it but discovers that the
Notebook does not work if run from start to finish. (During
development, Beth jumped about in the Notebook, changing steps and
rerunning cells out of order as she worked until the answers looked
plausible.)&lt;/p&gt;
&lt;p&gt;Beth spends the rest of Friday trying to make the analysis work on
the new data, cursing her former self and trying to remember exactly
what she did previously.&lt;/p&gt;
&lt;p&gt;At 16:30, Beth texts her partner saying she might be a little late
home.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Anne&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Anne begins by typing&lt;/p&gt;
&lt;p&gt;&lt;code&gt;make test&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;to run the tests she created based on her Q2 analysis. They pass. She
then puts the data in the &lt;code&gt;data&lt;/code&gt; subdirectory, calling it &lt;code&gt;data-2024Q3&lt;/code&gt;,
and types&lt;/p&gt;
&lt;p&gt;&lt;code&gt;make analysis-2024Q3.pdf&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Her &lt;code&gt;Makefile&lt;/code&gt;’s pattern rule matches the Q3 data to the target output
and runs her parameterized script, which performs the analysis using
the new data. It produces the PDF, and issues a message confirming
that consistency checks on the outputs passed. After checking the
document, Anne issues it.&lt;/p&gt;
&lt;p&gt;At 09:30, Anne starts to plan the rest of her Friday.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Computational notebooks, such as Jupyter Notebooks,&lt;sup id="fnref:notebook"&gt;&lt;a class="footnote-ref" href="#fn:notebook"&gt;1&lt;/a&gt;&lt;/sup&gt; have
taken the data science community by storm, to the point that it is now
often assumed that analyses will be performed in a Notebook. Although
I almost never use them myself, I do use a web interface (Salvador) to
my own &lt;a href="https://stochasticsolutions.com/miro"&gt;Miró&lt;/a&gt; that from a
distance looks like a version of Jupyter from a different solar system.
Notebooks are excellent tools for &lt;em&gt;ad hoc&lt;/em&gt; analysis, particularly data
exploration, and offer clear benefits including the ability to embed
graphical output, easy shareability, web approaches to handling wide
tables, and facilitation of annotation of analysis in a spirit
somewhat akin to &lt;em&gt;literate programming&lt;/em&gt;.  I do not wish to take away
anyone's Notebooks, notwithstanding the title of this post.  I do,
however, see several key problems with the way Notebooks are used and
abused. Briefly, these are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Lack of Parameterization.&lt;/em&gt;
   I see Notebook users constantly copying
   Notebooks and editing them to work with new data, instead of
   writing parameterized code that handles different inputs.  Anne's
   process uses the same program to process the Q2 and Q3 data.
   Beth's process uses a modified copy, which is significantly less
   manageable and offers more scope for error (particularly
   &lt;a href="pages/glossary.html#error-of-process"&gt;errors of process&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Lack of Automated testing.&lt;/em&gt; While it is possible to write tests
    for Notebooks, and some tools and guides exist
    e.g. &lt;a href="https://semaphoreci.com/blog/test-jupyter-notebooks-with-pytest-and-nbmake"&gt;Remedios 2021&lt;/a&gt;,&lt;sup id="fnref:Remedios"&gt;&lt;a class="footnote-ref" href="#fn:Remedios"&gt;2&lt;/a&gt;&lt;/sup&gt;
    in my experience it is rare for this to be
    done even by the standards of data science, where testing is less
    common than I would like it to be.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Out-of-order execution.&lt;/em&gt; In Notebooks, individual cells can be
    executed in any order, and state is maintained between execution
    steps. Cells may fail to work as intended (or at all) if the state
    has not been set up correctly before they are run.  When this
    happens, other cells can be executed to patch up the state and
    then the failing cell can be run again.  Not only can critical
    setup code end up lower down a Notebook than code that uses it,
    causing a problem if the Notebook is cleared and re-run: the key
    setup code can be altered or deleted after it has been used to set
    the state. This is my most fundamental reservation about
    Notebooks, and it is not merely a theoretical concern. I have known
    many analysts who routinely leave Notebooks in inconsistent states
    that prevent them from running straight through to produce the
    results.  Notebooks are &lt;em&gt;fragile.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Interaction with Source Control Systems.&lt;/em&gt; Notebooks can be stored
    in source control systems like Git, but some care is
    needed. Again, in my experience, Notebooks tend not to be under
    version control, with the &lt;em&gt;copy-paste-edit&lt;/em&gt; pattern
    (for whole Notebooks) being more common.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In my view, Notebooks should be treated as prototypes to be converted
to parameterized, tested scripts immediately after development.  This
will often involve converting the code in cells (or groups of cells)
into parameterized functions, something else that Notebooks seem to
discourage.  This is probably because cells provide a subset of the
benefits of a callable function by visually grouping a block of code
and allowing it to be executed in isolation. Cells do not, however,
provide other key benefits of functions and classes, such as separate
scopes, parameters, enhanced reusability, enhanced testability, and
abstraction.&lt;/p&gt;
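&lt;p&gt;As a rough illustration of that conversion, the Python sketch below contrasts a cell-style block, which depends on whatever state other cells have left behind, with an equivalent parameterized function. The names &lt;code&gt;prices&lt;/code&gt;, &lt;code&gt;scale&lt;/code&gt; and &lt;code&gt;scale_prices&lt;/code&gt; and the data are hypothetical.&lt;/p&gt;

```python
# Notebook-style "cell": relies on names that other cells happened to define.
# If an earlier cell is edited, deleted or run out of order, this block may
# fail or quietly compute something different.
prices = [1200.0, 850.0, 430.0]   # hypothetical data set in "another cell"
scale = 1000.0                    # hypothetical setting from "another cell"
scaled = [p / scale for p in prices]


def scale_prices(prices, scale):
    """Return prices divided by scale; every input is an explicit parameter."""
    if scale == 0:
        raise ValueError("scale must be non-zero")
    return [p / scale for p in prices]


# The function has its own scope, so it can be moved to an importable file,
# reused with other inputs, and unit-tested in isolation.
result = scale_prices([1200.0, 850.0, 430.0], 1000.0)
print(result)
```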
&lt;p&gt;Anne's process has four key features that differ from Beth's.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Anne runs her code from a command line using &lt;code&gt;make&lt;/code&gt;. If you
   are not familiar with the
   &lt;a href="https://www.gnu.org/software/make/"&gt;make utility&lt;/a&gt;, it is well worth
   learning about, but the critical point here is
   that Anne's setup allows her to use her process on new data without
   editing any code: in this case her &lt;code&gt;make&lt;/code&gt; command
   (is intended to) get expanded to &lt;code&gt;python analyse.py 2024Q3&lt;/code&gt;
   and uses the parameter
   &lt;code&gt;2024Q3&lt;/code&gt; both to locate the input data and to name the
   matching report generated.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Anne also benefits from tests she previously wrote, so has
   confidence that her code is behaving as expected on known data.
   This is the essence of what we mean by
   &lt;a href="pages/glossary.html#reference-test"&gt;reference testing&lt;/a&gt;.
   While you might think that if Anne has not changed anything
   since she last ran the tests, they are bound to pass (as they do in
   this parable), this is not necessarily the case.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Anne's code also includes computational checks on the outputs.  Of
   course, such checks can be included in a Notebook just as easily as
   they can be in scripts. The reason they are not is entirely because
   I am making one analyst a paragon of good practice and the other a
   caricature of sloppiness.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Finally, unlike Beth, Anne takes the time to check her outputs
   before sending them on. Once again, this is because Anne &lt;em&gt;cares
   about getting correct results, and wants to find any problems
   herself&lt;/em&gt;, not because she does not use a Notebook.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
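&lt;p&gt;The parable does not show Anne's script, but the first point can be sketched in Python under some assumptions: that &lt;code&gt;analyse.py&lt;/code&gt; takes the period as its only argument, and that inputs and outputs follow the naming seen in the parable (&lt;code&gt;data/data-2024Q3&lt;/code&gt; and &lt;code&gt;analysis-2024Q3.pdf&lt;/code&gt;). The function names are hypothetical and the analysis itself is omitted.&lt;/p&gt;

```python
import argparse
from pathlib import Path


def paths_for(period, data_dir="data"):
    """Derive the input and output paths from a period label like '2024Q3'."""
    inpath = Path(data_dir) / f"data-{period}"
    outpath = Path(f"analysis-{period}.pdf")
    return inpath, outpath


def main(argv=None):
    """Parse the period from the command line and locate inputs and outputs."""
    parser = argparse.ArgumentParser(description="Run the quarterly analysis")
    parser.add_argument("period", help="period label, e.g. 2024Q3")
    args = parser.parse_args(argv)
    inpath, outpath = paths_for(args.period)
    # A real script would read inpath, run the analysis and write outpath here
    print(f"Would analyse {inpath} and write {outpath}")
    return inpath, outpath


# Example call; a real script would instead call main() under the usual
# __main__ guard so that the period comes from the command line.
main(["2024Q3"])
```

&lt;p&gt;Because the period is the only thing that varies, the same code serves Q2, Q3 and any later quarter without being edited.&lt;/p&gt;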
&lt;h3 id="parable-2-the-parable-of-testing"&gt;Parable #2: The Parable of Testing&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Beth&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Beth copies, renames and edits her previous Notebook
and is pleased to see it runs without error on the Q3 data.
She issues the results and plans the rest of her Friday.&lt;/p&gt;
&lt;p&gt;The following week, Beth's inbox is flooded with
people saying her results are “obviously wrong”.
Beth is surprised since she merely copied the Q2 analysis,
updated the input and output file paths, and ran it.
She opens her old Q2 Notebook and reruns all cells. She is
dismayed to see all the values and graphs in the second
half of the Notebook change.&lt;/p&gt;
&lt;p&gt;Beth has
some remedial work to do.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Anne&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Anne runs her tests but one fails.
On examining the failing test, Anne realises
that a change she made to one of her helper libraries means that
a transformation she had previously applied in the main analysis
is now done by the library, so should be removed from her analysis
code.&lt;/p&gt;
&lt;p&gt;After making this change, the tests (which
were all based on the Q2 data) pass. Anne commits the
change to source control before typing&lt;/p&gt;
&lt;p&gt;&lt;code&gt;make analysis-2024Q3.pdf&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;to analyse the new data.
After sense checking the results,
Anne issues them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In the first parable, Beth's code would not run from start to finish;
this time, it runs but produces different answers from when she ran it
before using the Q2 data.  This could be because she had failed to
clear and re-run the Notebook to generate her final Q2 results, but
here I am assuming that her results changed for the same reason as
Anne's: they had both updated a helper library that their code used.
Whereas Anne's tests detected the fact that her previous results had
changed, Beth only discovered this when other people noticed her Q3
results did not look right (though had she checked her results, she
might have noticed that something looked wrong).  Anne is in a slightly
better position than Beth to diagnose what went wrong, because her
“correct” (previous) results are stored as part of her tests. Now
that Beth has updated her Notebook, it may be harder for her to
recover the old results. Even if she has access to the old and new
results, Beth is probably in a less good position than Anne because Anne
has at least one test highlighting how the result has changed. This
should allow her to make faster progress and gain confidence that her
fix is correct more easily.&lt;/p&gt;
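&lt;p&gt;A test of the kind that protects Anne here can be very small. The sketch below uses only Python's standard &lt;code&gt;unittest&lt;/code&gt; module; &lt;code&gt;summarise&lt;/code&gt; is a hypothetical stand-in for the analysis, and the reference values are invented for illustration rather than taken from any real run.&lt;/p&gt;

```python
import unittest


def summarise(prices):
    """Hypothetical stand-in for the analysis: returns a small summary."""
    return {
        "n": len(prices),
        "total": round(sum(prices), 2),
        "max": max(prices),
    }


# Results captured from a run that was checked and accepted. In practice
# these would normally be stored in a file under version control.
Q2_INPUT = [1.25, 3.5, 2.0]
Q2_REFERENCE = {"n": 3, "total": 6.75, "max": 3.5}


class TestQ2Reference(unittest.TestCase):
    def test_q2_results_unchanged(self):
        # Fails if a change to the code, or to a library it uses,
        # alters the previously accepted results
        self.assertEqual(summarise(Q2_INPUT), Q2_REFERENCE)
```

&lt;p&gt;Running this before touching new data is what tells Anne that her known results have moved, and the failing assertion shows which values changed.&lt;/p&gt;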
&lt;h3 id="parable-3-the-parable-of-units"&gt;Parable #3: The Parable of Units&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Beth&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Again, Beth copies, renames and updates her previous Notebook
and is happy to see it runs straight through on the Q3 data.
She issues the results and looks forward to a low-stress day.&lt;/p&gt;
&lt;p&gt;Around 16:00, Beth's phone rings and a stressed executive
tells her the answers “can't be right” and need to be fixed quickly.
Beth is puzzled. She opens her Q2 Notebook, re-runs it and the output
is stable. That, at least, is good.&lt;/p&gt;
&lt;p&gt;Beth now compares the Q2 and Q3 datasets
and notices that the values in the
&lt;code&gt;PurchasePrice&lt;/code&gt; column are some three orders of magnitude larger
in Q3 than in Q2, as if the data is in different units.
She checks with her data supplier to confirm that this is the case,
then sends some rebarbative emails, with the subject
&lt;em&gt;Garbage In, Garbage Out!&lt;/em&gt;
Beth grumpily adds a cell to her Q3 notebook dividing the
relevant column by 1,000.
She then adds &lt;code&gt;_fixed&lt;/code&gt; to the Q3 notebook's name
to encourage her to copy that one next time.
She wonders why everyone else is so lazy and incompetent.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Anne&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As usual, Anne first runs her tests, which pass.
She then runs the analysis on the Q3 data by issuing
the command&lt;/p&gt;
&lt;p&gt;&lt;code&gt;make analysis-2024Q3.pdf&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The code stops with an error:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Input Data Check failed for: PurchasePrice_kGBP&lt;/code&gt;
&lt;br/&gt;
   &lt;code&gt;Max expected: GBP 10.0k&lt;/code&gt;
&lt;br/&gt;
   &lt;code&gt;Max found:    GBP 7,843.21k&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;(Anne's code creates a renamed copy of the column after load because
she had noticed, while analysing Q2, that the prices were in thousands
of pounds.)&lt;/p&gt;
&lt;p&gt;Anne checks with her data supplier, who confirms a change of units,
which will continue going forward.
Anne persuades her data provider to change the field name for clarity,
and to reissue the data.&lt;/p&gt;
&lt;p&gt;Anne adds different code paths based on the supplied column names
and adds tests for the new case.
Once they pass, and she has received the updated data,
Anne commits the change,
runs and checks the analysis and issues the results.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This time, Anne is saved by virtue of having added checks to the input
data, which Beth clearly did not (though, again, such checks could easily be
included in a Notebook).  This builds directly on the ideas of
other articles in this blog, whether implemented through TDDA-style
constraints or more directly as explicit checks on input values in the
code.&lt;/p&gt;
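&lt;p&gt;An explicit check of this kind might look like the following Python sketch. The column name and the limit of 10.0 (thousand pounds) come from the error message in the parable; the exception class, the function &lt;code&gt;check_max&lt;/code&gt; and the data values are hypothetical.&lt;/p&gt;

```python
class InputDataCheckError(ValueError):
    """Raised when input data falls outside the expected range."""


def check_max(values, name, max_expected):
    """Stop the analysis if any value exceeds max_expected."""
    max_found = max(values)
    if max_found > max_expected:
        raise InputDataCheckError(
            f"Input Data Check failed for: {name}\n"
            f"   Max expected: GBP {max_expected:,}k\n"
            f"   Max found:    GBP {max_found:,}k"
        )
    return max_found


# Hypothetical values, both nominally in thousands of pounds
q2_prices = [3.2, 7.5, 9.9]
q3_prices = [3210.0, 7843.21, 512.4]

check_max(q2_prices, "PurchasePrice_kGBP", 10.0)  # within range: no error
try:
    check_max(q3_prices, "PurchasePrice_kGBP", 10.0)
except InputDataCheckError as err:
    print(err)  # the Q3 values exceed the limit, so the check fails
```

&lt;p&gt;The limit encodes what was learned from the Q2 data, so a silent change of units stops the run instead of flowing through to the report.&lt;/p&gt;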
&lt;p&gt;Anne (being the personification of good practice) also noticed
the ambiguity in the &lt;code&gt;PurchasePrice&lt;/code&gt; variable and created
a renamed copy of it for clarity. Note, however, that her check
would have worked if she had not created a renamed variable.&lt;/p&gt;
&lt;p&gt;A third difference is that Anne has effected a systematic
improvement in her data feed by getting the supplier to rename
the field. This reduces the likelihood that the unit will be
changed without flagging it, decreases chances of its being
misinterpreted, and allows Anne to have two paths through her
single script, coping with data in either format safely.
By re-sourcing the updated data, Anne also confirms
that the promised change has actually been made,
and that the new data looks correct.&lt;/p&gt;
&lt;p&gt;Finally, Beth now has different copies
of her code and has to be careful to copy the right one next time
(hence &lt;code&gt;_fixed&lt;/code&gt;). Anne's old code
only exists in the version control system, and crucially, her new
code safely handles both cases.&lt;/p&gt;
&lt;h3 id="parable-4-the-parable-of-applicability"&gt;Parable #4: The Parable of Applicability&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Beth&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Beth dusts off her Jupyter Notebook
and, as usual, copies it with a new name ending Q3.
She makes the necessary changes to use the new data
but it crashes with the error:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;ZeroDivisionError: division by zero&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;After a few hours tracing through the code,
Beth eventually realises that there is no data in Q3 for
a category that had been quite numerous in the Q2 data.
Her Notebook indexes other categories against the missing category
by calculating their ratios. On checking with her data provider,
Beth confirms that the data is correct, so adds extra code to the
Q3 version of the Notebook to handle the case.
She also makes a mental note to try to remember to copy the Q3 notebook
in future.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Anne&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Anne runs her tests by typing&lt;/p&gt;
&lt;p&gt;&lt;code&gt;make test&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;and enjoys watching the progress of the “green line of goodness”.
She then runs the analysis on the Q3 data by typing&lt;/p&gt;
&lt;p&gt;&lt;code&gt;make analysis-2024Q3.pdf&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;but it stops with an error message: &lt;em&gt;There is
no data for Reference Store 001. If this is right, you
need to choose a different Reference Store.&lt;/em&gt;
After establishing that the data is indeed correct,
Anne updates the code to handle this situation, checks
that the existing tests pass, and adds a couple of regression tests
to make sure that it copes not only with the default reference
store having no data, but also alternative reference stores.
When all tests pass, she runs the analysis
in the usual way, checks it, commits her updated code
and issues the results.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As in the previous parable, the most important difference between
Beth's approach and Anne's is that Beth's fix for their common problem
is &lt;em&gt;ad hoc&lt;/em&gt; and leads to further code proliferation&lt;sup id="fnref:fixed2"&gt;&lt;a class="footnote-ref" href="#fn:fixed2"&gt;3&lt;/a&gt;&lt;/sup&gt; and its
concomitant risks if the analysis is run again.  In contrast, Anne's
code becomes more general and robust as it handles the new case along
with the old, and she adds extra tests (&lt;em&gt;regression tests&lt;/em&gt;)
to try to ensure that nothing breaks the handling of this case in
future. The “green line of goodness” mentioned is the name some
testers use for the line of dots (sometimes green) many test
frameworks issue each time a test passes.&lt;/p&gt;
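&lt;p&gt;The kind of guard that turns a bare &lt;code&gt;ZeroDivisionError&lt;/code&gt; into
a message like Anne's might be sketched as follows. The function name, the
dictionary layout, and the store names other than Reference Store 001 are
hypothetical; this is not Anne's actual code, only one way the idea could be
expressed.&lt;/p&gt;

```python
def index_against_reference(totals, reference):
    """Express each category's total as a ratio to the reference category.

    Hypothetical helper for illustration. `totals` maps a category name to
    a numeric total. A missing or zero reference raises a descriptive
    error instead of a bare ZeroDivisionError.
    """
    reference_total = totals.get(reference, 0)
    if reference_total == 0:
        raise ValueError(
            f"There is no data for {reference}. If this is right, "
            "you need to choose a different reference."
        )
    return {name: total / reference_total for name, total in totals.items()}


# Q2-style data: the default reference store has data.
q2_totals = {"Store 001": 200.0, "Store 002": 300.0}
q2_index = index_against_reference(q2_totals, "Store 001")

# Q3-style data: the default reference store is absent.
q3_totals = {"Store 002": 330.0, "Store 003": 110.0}
try:
    index_against_reference(q3_totals, "Store 001")
    q3_error = None
except ValueError as err:
    q3_error = str(err)

# Choosing an alternative reference store works as before.
q3_index = index_against_reference(q3_totals, "Store 003")
```

&lt;p&gt;The last three cases correspond to the regression tests Anne adds: the
default reference store with data, the default with no data, and an alternative
reference store.&lt;/p&gt;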
&lt;hr&gt;
&lt;p&gt;So there we have it. On a more constructive note, I have been
following the progress of &lt;a href="https://quarto.org"&gt;Quarto&lt;/a&gt; with interest.
Quarto is a development of RMarkdown, a different style of computational
document popular in the R and &lt;a href="https://www.ncbi.nlm.nih.gov/books/NBK547546/"&gt;Reproducible Research&lt;/a&gt; communities.&lt;sup id="fnref:sister"&gt;&lt;a class="footnote-ref" href="#fn:sister"&gt;4&lt;/a&gt;&lt;/sup&gt;
To my mind it has fewer of the problems highlighted here.
It also supports Python and much of the Python data stack as first-class
citizens, and in fact integrates closely with Jupyter, which it uses
behind the scenes for many Python-based workflows.
I have been using it over the last couple of days, and though it is
still distinctly rough around the edges, I think it offers a very
promising way forward, with excellent output options that include
PDF (via LaTeX), HTML, Word documents and many other formats.
It's both an interesting alternative to Notebooks and
(perhaps more realistically) a reasonable target for migrating
code from Notebook prototypes. I use most of the cells to
call functions imported at the start, promoting code re-use
and parameterization, which avoids another of the pitfalls of
Notebooks (in practice) discussed above.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:notebook"&gt;
&lt;p&gt;I am using the term &lt;em&gt;Jupyter Notebook&lt;/em&gt; to
cover both what are now called &lt;em&gt;Jupyter Notebooks&lt;/em&gt; (“The
Classic Notebook Interface”) and &lt;em&gt;JupyterLab&lt;/em&gt; (the &lt;a href="https://jupyter.org"&gt;“Next-Generation Notebook Interface”&lt;/a&gt;). This
is both because most people I know continue to call them Jupyter
Notebooks, even when using JupyterLab, and because “Notebooks” reads
better in the text. I will capitalize Notebook when it refers to
a computational notebook, as opposed to a paper notebook.&amp;#160;&lt;a class="footnote-backref" href="#fnref:notebook" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:Remedios"&gt;
&lt;p&gt;Remedios 2021, &lt;a href="https://semaphoreci.com/blog/test-jupyter-notebooks-with-pytest-and-nbmake"&gt;How to Test Jupyter Notebooks with Pytest and Nbmake&lt;/a&gt;, 2021-12-14.&amp;#160;&lt;a class="footnote-backref" href="#fnref:Remedios" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:fixed2"&gt;
&lt;p&gt;&lt;code&gt;analysis_fixed_fixed2_final.ipynb&lt;/code&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:fixed2" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:sister"&gt;
&lt;p&gt;Reproducible research is very much a sister
movement to TDDA—a much larger sister movement. Its goals are similar
and it is wholly congruent with all the ideas of TDDA. To the extent
that there is divergence, some of it simply arises from separate evolution,
and some from the fact that the focus of reproducible research is more on
allowing other people to access your code and data, to run it themselves
and verify the outputs, or to write their own analysis to verify your
results even more strongly, or to use your code on their data as a
different sort of validation. I sometimes call TDDA “reproducible
research for solipsists”, because of its greater focus on testing,
and helping to discover and eliminate problems even if no second
person is involved. Another related area I have recently become
aware of is &lt;a href="https://vdsbook.com"&gt;Veridical data science&lt;/a&gt;, as developed
by Bin Yu and Rebecca Barter. The link is to their book of that name.&amp;#160;&lt;a class="footnote-backref" href="#fnref:sister" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="TDDA"></category><category term="reproducibility"></category><category term="process"></category></entry><entry><title>An Adware Malware Story Featuring Safari, Notification Centre, and Box Plots</title><link href="https://tdda.info/an-adware-malware-story-featuring-safari-notification-centre-and-box-plots.html" rel="alternate"></link><published>2024-09-22T16:00:00+01:00</published><updated>2024-09-22T16:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2024-09-22:/an-adware-malware-story-featuring-safari-notification-centre-and-box-plots.html</id><summary type="html">&lt;p&gt;This is not, primarily,
an article about TDDA, but I thought it was worth publishing
here anyway. Itʼs a story about a kind of adware/malware incident I had
this morning—with digressions about box plots.&lt;/p&gt;
&lt;h3 id="disgression"&gt;Digression&lt;/h3&gt;
&lt;p&gt;I was doing some research for a book (on TDDA), looking up information …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This is not, primarily,
an article about TDDA, but I thought it was worth publishing
here anyway. Itʼs a story about a kind of adware/malware incident I had
this morning—with digressions about box plots.&lt;/p&gt;
&lt;h3 id="disgression"&gt;Digression&lt;/h3&gt;
&lt;p&gt;I was doing some research for a book (on TDDA), looking up information
on box plots, also known as box-and-whisker diagrams.
When I first came across box plots, I assumed the “box”
in the name was a reference to the literal “box” part of a traditional
box plot. If you are not familiar with box plots, they typically
look like the ones shown in Wikipedia&lt;sup id="fnref:wikibox"&gt;&lt;a class="footnote-ref" href="#fn:wikibox"&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Box plots for the Michelson-Morley Experiment"
     src="https://njr.prose.sh/wikipedia-box-plot.png"
     width="546"/&gt;&lt;/p&gt;
&lt;p&gt;There are variations, but typically the central line represents
the median, the “box” delineates the interquartile range,
the whiskers extend to either the minimum and maximum
or, sometimes, other percentiles, such as 1 and 99. When the
minimum and maximum are not used, outliers beyond those extents
can be shown as individual points, as seen here for experiments 1
and 3.&lt;/p&gt;
&lt;p&gt;At some point after learning about box plots,
I became aware of the statistician George Box—he of
“All models are wrong, but some models are useful”&lt;sup id="fnref:boxquote"&gt;&lt;a class="footnote-ref" href="#fn:boxquote"&gt;2&lt;/a&gt;&lt;/sup&gt;
fame, and ended up believing that box plots had in fact been invented
by him (and should, therefore, arguably be called “Box plots” rather
than “box plots”). Whether someone misinformed me or my brain simply
put 2 and 2 together to make about 15, Tufte&lt;sup id="fnref:tuftevdqa"&gt;&lt;a class="footnote-ref" href="#fn:tuftevdqa"&gt;5&lt;/a&gt;&lt;/sup&gt;
(who advocates his own “reduced box plot”, in line with his principle of
maximizing data ink and minimizing chart junk)
states definitively that the box plot
was in fact a refinement by John Tukey&lt;sup id="fnref:tukeybox"&gt;&lt;a class="footnote-ref" href="#fn:tukeybox"&gt;3&lt;/a&gt;&lt;/sup&gt; of Mary Eleanor
Spearʼs “range bars”&lt;sup id="fnref:spearbox"&gt;&lt;a class="footnote-ref" href="#fn:spearbox"&gt;4&lt;/a&gt;&lt;/sup&gt;. So I was wrong.&lt;/p&gt;
&lt;h3 id="back-to-the-malware"&gt;Back to the malware&lt;/h3&gt;
&lt;p&gt;Anyway, back to the malware. I was clicking about on image search results
for searches like &lt;code&gt;box plot "George Box"&lt;/code&gt; and hit a site that gave one of
the ubiquitous “Are you a human?” prompts that sometimes, but not always,
act as a gateway to solving CAPTCHAs to train AI models. But this one
didnʼt seem to work. I closed the tab and moved on, but soon after started
getting highly suspicious looking pop-up notifications like these:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Malware/ad-ware notifications from ask you"
     src="https://njr.prose.sh/askyou-popups.png"
     width="356"/&gt;&lt;/p&gt;
&lt;p&gt;These are comically obviously not legitimate warnings from anything
that I would knowingly allow on a computer, which made me less alarmed
than I might otherwise have been. But clearly something had happened
as a result of clicking an image search result and an
“I am not a robot” dialogue.&lt;/p&gt;
&lt;p&gt;I wonʼt bore you with a blow-by-blow account of what I did, but the
key points are that&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Killing that tab did not stop the notifications.&lt;/li&gt;
&lt;li&gt;Nor did stopping and restarting Safari while bringing back all the old windows.&lt;/li&gt;
&lt;li&gt;Nor did stopping and restarting Safari without bringing back
   any tabs or windows.&lt;/li&gt;
&lt;li&gt;Nor did deleting todayʼs history from Safari.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So I did some web searches, almost all of the results of which
advocated downloading sketchy “anti-malware” products to clean my Mac,
which I was never going to do. Eventually, I came across the
suggestion that it might be a site that had requested and been given
permission to put notifications in Notification Centre. I think I was
only half-aware that this was a possible behaviour, but it turns out
that (on MacOS Ventura 13.6.9, with Safari 17.6) Safari → Settings →
Websites has a Notifications section on the left that looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Safari Websites settings showing the Notifications section"
     src="https://njr.prose.sh/safari-notifications.png"
     width="948"/&gt;&lt;/p&gt;
&lt;p&gt;I must have been aware of this at various points,
because I had set the four websites at the bottom of the list
to &lt;code&gt;Deny&lt;/code&gt;,
but I had not noticed the
&lt;code&gt;Allow websites to ask for permission to send notifications&lt;/code&gt;
checkbox, which was enabled (but is now disabled).
The top one—looks suspicious, dunnit?—was set to &lt;code&gt;Allow&lt;/code&gt;
when I went in. I have a strong suspicion that the site
tricked me into giving it permission by getting me to click
something that did not appear to be asking for such permission.
I suspect it hides its URL by using a newline or a lot of whitespace,
which is why it does not show up in the screenshot above.&lt;/p&gt;
&lt;p&gt;Setting that top (blank-looking) site to &lt;code&gt;Deny&lt;/code&gt; and
(as a belt-and-braces and preventative measure)
unchecking the checkbox so that sites are not even allowed to ask for
permission to put popups in Notification Centre had the desired
effect of making popups stop.
I believe this constitutes a full fix and that no data was being
exfiltrated from the Mac, despite the malicious notification.
I will probably also remove at least that top site (with the
Remove button) in the future, but will leave it there for now
in case Apple (or anyone else) can tell me how to find out what
site it comes from.&lt;/p&gt;
&lt;p&gt;I also found (but cannot now find again) an option to reset
the notifications from those sites. This was the extremely confusing
dialogue for the site in question.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Reset notifications dialogue for 'ask you'"
     src="https://njr.prose.sh/reset-notifications-ask.png"
     width="500"/&gt;&lt;/p&gt;
&lt;p&gt;I think whatʼs going on here is that some text that the site is using
to identify itself to Safari when asking for permission consists
of the following text, &lt;em&gt;including the newlines&lt;/em&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;ask you

Confirm that you&amp;#39;re not a robot, you need to click Allow
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This makes reading the dialogue quite hard and confusing.
Looking more carefully at Notification Centre, I also see this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Ask you permission which text request"
     src="https://njr.prose.sh/askyou-permission.png"
     width="240"/&gt;&lt;/p&gt;
&lt;p&gt;I don't quite understand whether this is an image forming a notification,
or an image included in some other notification, but offset, or something
else. Whatever it is, it consists of (or includes) white text saying&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;ask you

Confirm that you&amp;#39;re not a
robot, you need to click Allow
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;with a little bit of light grey background around the letters.&lt;/p&gt;
&lt;p&gt;I donʼt entirely understand why the site would have used barely readable
white text on a light grey background like this, but I presume somehow
this text was involved in getting me to click the “I am not a robot”
dialogue (which I believe to be the only click I performed on the site).&lt;/p&gt;
&lt;p&gt;Anyway, the long and the short of it is that if anyone else runs
into this, my recommendations (which do not come from a security expert,
so use your own judgement) are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Donʼt download a random binary from the internet to remove spyware.&lt;/li&gt;
&lt;li&gt;Try to find the Safari Preference for &lt;code&gt;Notifications&lt;/code&gt; under &lt;code&gt;Websites&lt;/code&gt;
   and see if you have a sketchy-looking entry like mine. If so, set
   that to Deny.&lt;/li&gt;
&lt;li&gt;Probably also remove that site with the &lt;code&gt;Remove&lt;/code&gt; button.&lt;/li&gt;
&lt;li&gt;Consider turning off the ability for sites to request permission
   to put notifications in Notification Centre if this is not something
   you want, or that no site you care about needs.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:wikibox"&gt;
&lt;p&gt;Taken from &lt;a href="https://en.wikipedia.org/wiki/Box_plot#/media/File:Michelsonmorley-boxplot.svg"&gt;Wikipedia&lt;/a&gt; entry on &lt;a href="https://en.wikipedia.org/wiki/Box_plot"&gt;box plots&lt;/a&gt;, own work by Wikipediaʼs &lt;a href="https://commons.wikimedia.org/wiki/User:Schutz"&gt;User:Schutz&lt;/a&gt; (public domain).&amp;#160;&lt;a class="footnote-backref" href="#fnref:wikibox" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:boxquote"&gt;
&lt;p&gt;&lt;em&gt;George Box, (1919-2013): a wit, a kind man and a statistician.&lt;/em&gt;,
Obituary by Julian Champkin, 4 April 2013
&lt;a href="https://significancemagazine.com/george-box-1919-2013-a-wit-a-kind-man-and-a-statistician-2/"&gt;https://significancemagazine.com/george-box-1919-2013-a-wit-a-kind-man-and-a-statistician-2/&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:boxquote" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:tukeybox"&gt;
&lt;p&gt;&lt;em&gt;Exploratory data analysis&lt;/em&gt;, John Tukey, Reading/Addison-Wesley, 1977.&amp;#160;&lt;a class="footnote-backref" href="#fnref:tukeybox" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:spearbox"&gt;
&lt;p&gt;&lt;em&gt;Charting Statistics&lt;/em&gt;, Mary Eleanor Spear, McGraw Hill, 1952.&amp;#160;&lt;a class="footnote-backref" href="#fnref:spearbox" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:tuftevdqa"&gt;
&lt;p&gt;&lt;em&gt;The Visual Display of Quantitative Information&lt;/em&gt;,
Edward R. Tufte, Graphics Press, 1984.&amp;#160;&lt;a class="footnote-backref" href="#fnref:tuftevdqa" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="misc"></category></entry><entry><title>PyData London 2024 TDDA Tutorial</title><link href="https://tdda.info/pydata-london-2024-tdda-tutorial.html" rel="alternate"></link><published>2024-07-21T16:00:00+01:00</published><updated>2024-07-21T16:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2024-07-21:/pydata-london-2024-tdda-tutorial.html</id><content type="html">&lt;p&gt;PyData London had its tenth conference in 2024, and it was excellent.&lt;/p&gt;
&lt;p&gt;I gave a tutorial on TDDA, and the video is available &lt;a href="https://www.youtube.com/watch?v=iM89ZoJYdwE"&gt;on YouTube&lt;/a&gt; and below:&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/iM89ZoJYdwE?si=CzpcTeNunR-MAM1t" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;The slides are also available &lt;a href="https://stochasticsolutions.com/pdf/tdda-london-2024.pdf"&gt;here&lt;/a&gt;.&lt;/p&gt;</content><category term="TDDA"></category><category term="TDDA"></category><category term="tutorial"></category></entry><entry><title>Learning the Hard Way: Regression to the Mean</title><link href="https://tdda.info/learning-the-hard-way-regression-to-the-mean.html" rel="alternate"></link><published>2024-06-20T20:00:00+01:00</published><updated>2024-06-20T20:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2024-06-20:/learning-the-hard-way-regression-to-the-mean.html</id><summary type="html">&lt;p&gt;I was at the tenth PyData London Conference last weekend, which was excellent,
as always. One of the keynote speakers was
&lt;a href="https://rotational.io/authors/rebecca-bilbro/"&gt;Rebecca Bilbro&lt;/a&gt;
who gave a rather brilliant (and cleverly titled) talk called
&lt;a href="https://www.slideshare.net/slideshow/pydata-london-2024-mistakes-were-made-dr-rebecca-bilbro/269771696"&gt;Mistakes Were Made: Data Science 10 Years In&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The title is, of course, a reference to the …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I was at the tenth PyData London Conference last weekend, which was excellent,
as always. One of the keynote speakers was
&lt;a href="https://rotational.io/authors/rebecca-bilbro/"&gt;Rebecca Bilbro&lt;/a&gt;
who gave a rather brilliant (and cleverly titled) talk called
&lt;a href="https://www.slideshare.net/slideshow/pydata-london-2024-mistakes-were-made-dr-rebecca-bilbro/269771696"&gt;Mistakes Were Made: Data Science 10 Years In&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The title is, of course, a reference to the tendency many of us have
to be more willing to admit that &lt;em&gt;mistakes were made&lt;/em&gt;, than to say "I
made mistakes". So I thought I'd share a mistake I made fairly early
in my data science career, probably around 1996 or 1997. This is not
one of those interview-style
"I-sometimes-worry-I'm-a-bit-too-much-of-a-perfectionist"-style
admissions that we have all heard; this one was bad.&lt;/p&gt;
&lt;p&gt;My company at the time, Quadstone, was under contract to analyse
a large retailer's customer base for relationship marketing,
using loyalty-card data.  We had done all sorts of work in the area
with this retailer, and one day the relationship manager we were
working with decided that it would be good to incentivise more
spending among the retailer's less active, lower spending customers. This is
fairly standard.  The idea was to set a reasonably high, but
achievable, target spend level for each of these customers over a period of a
few weeks. Those customers who hit their individual target would receive a
large number of loyalty points worth a reasonable amount of money.&lt;/p&gt;
&lt;p&gt;We had been tracking spend carefully, placing customers on a
behavioural segmentation, and had enough data that we felt relatively
confident we knew what would be good individualized stretch goals for
customers (wrongly, as events would prove). We set the
targets at levels such that the retailer should break even if people just met
them (foregoing profit, but not losing much money), and estimated how
many people we thought would hit the target if the campaign did not
have much effect, and then estimated volumes and costs for various
higher levels of campaign success.&lt;/p&gt;
&lt;p&gt;I'm sure many of you can already see how this will go, and even more
of you will have been attuned to the problem by the title of this post.
We, however, had not seen the title of this post, and although
I knew about the phenomenon of &lt;em&gt;regression to the mean&lt;/em&gt;, I had not really
internalized it. I didn't know it in my bones. I had not been bitten by it.
I did not see the trap we were walking into.&lt;/p&gt;
&lt;p&gt;As Confucius apparently &lt;a href="https://english.stackexchange.com/questions/226886/origin-of-i-hear-and-i-forget-i-see-and-i-remember-i-do-and-i-understand"&gt;did not say&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I hear and I forget. I see and I remember. I do and I understand.&lt;/p&gt;
&lt;p&gt;— probably not Confucius; possibly Xunzi.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Well, I certainly now understand.&lt;/p&gt;
&lt;p&gt;On the positive side, our treated group increased its
level of spend by a decent amount, and a large number of the group earned
many extra loyalty points. Although I don't believe we had developed
&lt;a href="https://stochasticsolutions.com/uplift/"&gt;uplift modelling&lt;/a&gt; at the
time of this work,
we were very aware that we needed a randomized control group
in order to understand the behaviour change we had driven,
and we had kept one.
To our dismay, the level of spend in the control group,
though lower than that in the treated group, also increased
quite considerably.
In fact, it increased enough that the return on investment
for the activity was negative, rather than positive. It was
at this point (just before admitting to the client what had happened,
and negotiating with them about exactly who should shoulder the loss&lt;sup id="fnref:loss"&gt;&lt;a class="footnote-ref" href="#fn:loss"&gt;1&lt;/a&gt;&lt;/sup&gt;)
that a little voice in my head started saying &lt;em&gt;regression to the mean, regression to the mean, regression to the mean,&lt;/em&gt; almost like a more analytical version
of Long John Silver's parrot.&lt;/p&gt;
&lt;p&gt;So (for those of you who don't know), what is regression to the mean?
And why did it occur in this case?
And why should we, in fact, have predicted that?&lt;/p&gt;
&lt;p&gt;Allow me to lead you through the gory details.&lt;/p&gt;
&lt;h3 id="background-control-groups"&gt;Background: Control Groups&lt;/h3&gt;
&lt;p&gt;We all know that marketers can't honestly claim the credit for all the sales
from people included in a direct marketing campaign, because (in almost all
circumstances) some of them would have bought anyway.
As with randomized control trials in medicine,
in order to understand the true effect of our campaign,
we need to divide our target population, uniformly at random,&lt;sup id="fnref:uniform"&gt;&lt;a class="footnote-ref" href="#fn:uniform"&gt;2&lt;/a&gt;&lt;/sup&gt;
into a treatment group, who receive the marketing treatment in question,
and a control group, who remain untreated.
The two groups do not need to be the same size, but both
need to be big enough to allow us to measure the outcome accurately,
and indeed to measure the difference between the behaviour of the
two groups.
This is slightly problematical, because
we don't know the effect size before we take action.
Happily, however, if the effect is too small to measure, it is
pretty much guaranteed to be uninteresting and not to achieve
a meaningful return on investment, so we can size the two groups
by calculating the minimum effect we need to be able to detect
in order to achieve a sufficiently positive ROI.&lt;/p&gt;
&lt;p&gt;The effect size is the difference between the outcome in the treated
group and the control group—usually a difference in response rate,
for a binary outcome, or a difference in a continuous variable such as
revenue. Things become more interesting when there are negative
effects in play, which is sometimes the case with intrusive marketing
or when retention activity is being undertaken.  There can be negative
effects for a subpopulation or, in the worst cases, for the population
as a whole. When these happen, a company is literally spending money
to drive customers away, which is usually undesirable.&lt;/p&gt;
&lt;p&gt;Let's suppose, for simplicity, that we have selected an ideal target
population of 2 million and we mail half of them (chosen on the toss
of a fair coin) and keep the other 1 million as controls.
If we then send a motivational mailing to the 1 million encouraging
them to spend more, with or without an incentive to do so, we can
measure their average weekly spend in a pre-period (say six weeks)
and their average weekly spend in a post-period, which for simplicity
we will also take to be six weeks. In this case, we will assume that
there was no financial incentive: it was simply a motivation mail along
the lines of "we're really great: come and give us more of your money".
(Good creatives would craft the message more attractively than this.)
Let's suppose we do this and that the results
for the treated group of one million are as follows:&lt;/p&gt;
&lt;center&gt;
  &lt;table&gt;
    &lt;tr&gt;&lt;th&gt;Before&lt;/th&gt;&lt;th&gt;After&lt;/th&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td style="text-align: right;"&gt;£50&lt;/td&gt;
        &lt;td style="text-align: right;"&gt;£60&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;This is not enough information to say whether the campaign worked,
and the reason has nothing to do with statistical errors:
at the scale of 1 million, you can
guarantee the errors will be insignificant.
It's also not primarily because we are measuring revenue
rather than profit, nor because we haven't taken into account the
cost of action (though those are things we should do).
We can see that our 1 million customers spent, on average,
£10 per week more in the post-period than in the pre-period
(a cool £60m in increased revenue over six weeks), but we don't
know about causality: did our marketing campaign cause the effect?&lt;/p&gt;
&lt;p&gt;To answer this, we need to look at what happened in the control group.&lt;/p&gt;
&lt;center&gt;
  &lt;table&gt;
    &lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;Before&lt;/th&gt;&lt;th&gt;After&lt;/th&gt;&lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;b&gt;Mailed (Treated)&lt;/b&gt;&lt;/td&gt;
        &lt;td style="text-align: right;"&gt;£50&lt;/td&gt;
        &lt;td style="text-align: right;"&gt;£60&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;b&gt;Unmailed (Control)&lt;/b&gt;&lt;/td&gt;
        &lt;td style="text-align: right;"&gt;£50&lt;/td&gt;
        &lt;td style="text-align: right;"&gt;£55&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;We immediately see that the spend in the pre-period was the same
in both groups,
as must be the case for a proper treatment-control split.
We also see that the spend in the control group rose to £55.&lt;/p&gt;
&lt;p&gt;We now have enough evidence to say with a very high degree of
confidence that the treatment caused a £5 per week increase in spend,
but that some other factor—perhaps seasonality, TV ads, or a mis-step
by our competitors—caused the other £5 increase per week across the
larger population of 2 million customers. I should emphasize that this
is a valid conclusion regardless of how the population of 2 million
was chosen, as long as the treatment-control split was unbiased.&lt;/p&gt;
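&lt;p&gt;The arithmetic behind that conclusion can be written out as a short sketch.
The function and key names are illustrative assumptions; the spend figures,
group size and six-week period are those of the example above.&lt;/p&gt;

```python
def split_spend_change(treated_before, treated_after,
                       control_before, control_after):
    """Split the treated group's change in spend into two parts.

    The control group's change estimates what would have happened anyway;
    the remainder is the change attributable to the treatment.
    """
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return {
        "treated_change": treated_change,
        "background_change": control_change,
        "incremental_effect": treated_change - control_change,
    }


# Average weekly spend per customer, in pounds, from the table above.
result = split_spend_change(
    treated_before=50, treated_after=60,
    control_before=50, control_after=55,
)

# Scale the per-customer weekly effect up to the whole treated group.
treated_customers = 1_000_000
weeks = 6
incremental_revenue = result["incremental_effect"] * treated_customers * weeks
```

&lt;p&gt;With these figures the treated group's spend rose by £10 per week, of which
£5 also appears in the control group, leaving £5 per week attributable to the
mailing: £30m of the £60m increase in revenue over the six weeks.&lt;/p&gt;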
&lt;h3 id="behavioural-segmentation"&gt;Behavioural Segmentation&lt;/h3&gt;
&lt;p&gt;Now let us take a similar treatment group drawn uniformly
at random from a larger population of 10 million.
We segment the treatment population by
average weekly spend in the pre-period
and plot the &lt;em&gt;increase or decrease&lt;/em&gt; in spend
between the pre- and post-periods in each segment for our treatment group.
The graph below shows a possible outcome.&lt;/p&gt;
&lt;center&gt;
&lt;img src="https://stochasticsolutions.com/image/t-spend-change.png"
     width="600"
     alt="A bar graph showing a split of the treated population
          using average spend bands for the pre-period
          of £0, and £0.01 to £10, and ten-pound intervals
          up to £90, and finally a bar for over £90.
          The vertical scale is the change in spend between
          the pre- and post-periods, quantified by the difference
          between them (post-spend minus pre-spend).
          The bars decrease monotonically, with the £0 group
          increasing spend by about £15, and these increases
          dropping to zero for the £60--70 per week group,
          and being negative to the tune of about £10 a week
          for those spending over £90 in the pre-period."/&gt;
&lt;/center&gt;

&lt;p&gt;For people who are not steeped in regression to the mean,
this graph may appear somewhat alarming. Depending on the
distribution of the population, this might well represent an overall
increase in spending (since probably more of the people are on
the left of the graph, where the change in spend is positive).
But I can almost guarantee
that any marketing director would declare this to be a disaster,
saying (possibly in more colourful language)
"Look at the damage you have wreaked on my best customers!"&lt;/p&gt;
&lt;p&gt;But would this be a reasonable reaction?
Would it, in fact, be accurate?
At this point, we have no idea whether the campaign
caused the higher-spending customers' spend to decline,
or whether something else did.
To assess that, we need once more to look at the same information for the
control group (9 million people, in this case).
That's shown below.&lt;/p&gt;
&lt;center&gt;
&lt;img src="https://stochasticsolutions.com/image/tc-spend-change.png"
     width="600"
     alt="The same graph as above, but now showing the change in
          spend for the control group as well as for the treated group.
          The same general pattern is seen in the control group,
          but the increases are smaller in the control group
          (starting at £8 for the group that had spent £0 in the
          six weeks before the mailing, and going down to --£15
          for the group spending over £90 in the pre-period).
          So change in spend is more positive, or less negative,
          in the treated group than in the control group in every
          behavioural segment."/&gt;
&lt;/center&gt;

&lt;p&gt;What we clearly see is that in every segment the change in spend
was either more positive or less negative in the treated group than in
the control group. So the campaign &lt;em&gt;did&lt;/em&gt; have a positive effect in
every segment. Not so embarrassing. (If only this had been our case!)&lt;/p&gt;
&lt;h3 id="regression-to-the-mean"&gt;Regression to the mean&lt;/h3&gt;
&lt;p&gt;To understand more clearly what's going on here, it's helpful to look at
the same data but focus only on the control group.&lt;/p&gt;
&lt;center&gt;
&lt;img src="https://stochasticsolutions.com/image/c-spend-change.png"
     width="600"
     alt="The same graph as the last, but now with the treated
          group removed."/&gt;
&lt;/center&gt;

&lt;p&gt;Remember, this is the control group: we have not done &lt;em&gt;anything&lt;/em&gt; to
this population. This is a classic case of &lt;em&gt;regression to the mean.&lt;/em&gt; I
would confidently predict that for almost any customer base, if we
allocate people to segments on the basis of a behavioural
characteristic over some period, then measure that same characteristic
for the same people, using the same segment allocations,
at a later period, we would see a
pattern like this: the segments with low rates of the behaviour in
question in the first place would increase that behaviour (at least,
relative to the population as a whole), and the people in the segments
that exhibited the behaviour more would fall back, on average.&lt;/p&gt;
&lt;p&gt;Why?&lt;/p&gt;
&lt;h3 id="mixing-effects"&gt;Mixing Effects&lt;/h3&gt;
&lt;p&gt;When you segment a population on the basis of a behaviour, many of the
people are captured exhibiting their typical behaviours.
But inevitably, you capture
some of the people exhibiting behaviour that is for them &lt;em&gt;atypical&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Consider the first bar in the example above—people who spent nothing
in the six weeks before the mailing.  Ignoring the possibility of
people returning goods, it is impossible for the average spend of this
group to decline. In fact, if even a single person from this group
buys something in the post-campaign period, the average spend for that
segment will increase.  In terms of the &lt;em&gt;mixing&lt;/em&gt; I am talking about,
some of the people will have completely lapsed, and will never spend
again, while others were in an atypically low spending period for
them: maybe they were on holiday, or trying out a competitor or didn't
use their loyalty card and so their spending was not tracked.
The thing that's special about this first group is that they
literally cannot be in an atypically high spending period when
they were assigned to segments,
because they weren't spending anything.&lt;/p&gt;
&lt;p&gt;It's less clear-cut, but a similar argument pertains to the group on
the far right of the graph. Some of those are people who routinely
spend over £90 a week at this retailer. But others will have had
atypically high spend when we assigned them to segments: maybe they
had a huge party and bought lots of alcohol for it, or maybe they
shopped for someone else over that period. With the highest-spending
group, there will probably be a small number of people whose spend was
atypically low during the period we assigned them to segments, but
there are likely to be far more people for whom this spend was
atypically high at the right side of the distribution.  So in this
case, we can see it's likely that the average spend of these
higher-spending segments will decline (relative to the population as a
whole) if we measure them at a later time period.&lt;sup id="fnref:static"&gt;&lt;a class="footnote-ref" href="#fn:static"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;For people in the middle of the distribution, the story is similar
but more balanced. Some people will have their typical spend where
we measured it, and there will be others whom we captured at
atypically high or atypically low spending periods, but those will
tend to cancel out.&lt;/p&gt;
&lt;p&gt;These mixing effects give the best explanation I know of the
phenomenon of regression to the mean. It is always something
to look for when you assign people to segments based on a behaviour
and then look for changes in the people in those segments at a later time.&lt;/p&gt;
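&lt;p&gt;A small simulation can make the mixing argument concrete. This is a sketch under invented assumptions, not the retailer's data: each customer has a fixed typical weekly spend drawn uniformly between £0 and £100, observed spend in each period is that typical value plus independent Gaussian noise (standard deviation £20, floored at zero), and segments are simple ten-pound bands with everything from £90 upwards in the top band. Nobody is treated. The function name &lt;code&gt;simulate_control_group&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
import random


def simulate_control_group(n_customers=50_000, seed=0):
    """Return the mean change in spend (post minus pre) for each pre-period band.

    Customers are assigned to a band using pre-period spend only and stay in
    that band when measured again (a static segmentation).
    """
    rng = random.Random(seed)
    totals = [0.0] * 10  # running sum of (post - pre) for each band
    counts = [0] * 10    # number of customers in each band
    for _ in range(n_customers):
        typical = rng.uniform(0, 100)                # stable underlying behaviour
        pre = max(0.0, typical + rng.gauss(0, 20))   # observed pre-period spend
        post = max(0.0, typical + rng.gauss(0, 20))  # observed post-period spend
        band = min(int(pre // 10), 9)                # band 9 holds spend of 90 and over
        totals[band] += post - pre
        counts[band] += 1
    return [total / count for total, count in zip(totals, counts)]


for band, mean_change in enumerate(simulate_control_group()):
    print(f"band {band}: mean change {mean_change:+.2f}")
```

&lt;p&gt;Under these assumptions the lowest band shows a positive mean change and the highest band a negative one, even though the underlying behaviour of every customer is unchanged between the two periods.&lt;/p&gt;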
&lt;h3 id="so-how-did-we-lose-so-much-money"&gt;So how did we lose so much money?&lt;/h3&gt;
&lt;p&gt;The reason our campaign worked out so poorly was that we did not
take into account regression to the mean when we set the targets,
because we didn't think of it.
Because we targeted more people with below-median spend than
above-median spend, regression to the mean meant that although
spend increased quite strongly among our treated customers, it also
increased quite strongly for the control group in each segment.
In that regard, the uplifts were less similar across the spend segments
than I have shown here; something, in fact, that I now know to
be characteristic of retail, unlike in many other areas. The most active
shoppers are often also the most responsive to marketing campaigns.&lt;/p&gt;
&lt;p&gt;The campaign produced an uplift in all segments, but much more
of the increase in spend than we expected was due to regression to the mean
in the population we targeted,
with the result that the value of the loyalty points given out significantly
exceeded the incremental profit contribution from the people
awarded those points.&lt;/p&gt;
&lt;p&gt;This was a tough one. But at least I will remember it forever.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:loss"&gt;
&lt;p&gt;Without getting too specific, the loss was a six-figure sterling
sum, back when that was a more significant amount of money than it is today.
It was not really a material amount for the retailer, which had a significant
fraction of the UK population as regular customers; but it was a highly
material amount for Quadstone: more, in fact, than the four founders
had invested in the company, though probably less than our entire
first-round funding. And retailers don't get to be big and dominant
by treating six-figure losses with equanimity.&amp;#160;&lt;a class="footnote-backref" href="#fnref:loss" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:uniform"&gt;
&lt;p&gt;&lt;em&gt;Uniformly&lt;/em&gt; at random means that, conceptually, a coin
is tossed or a die is rolled to determine whether someone is allocated
to the control group or the treated group. The coin or die does not need
to be fair: it's fine to allocate all the 1's on the die to control and
all the 2-6's to treated, or to use a weighted coin,
as long as the procedure does not use any other
information to determine the allocation. For example, choosing to put
two thirds of the men in control (chosen randomly) and only one third
of the women in control is no good, because now, if there's a difference,
it is hard to separate out the effect of the treatment from the effect
of sex. (If the volume suffices, you could assess the uplift independently
for men and women, in this particular case, but that quickly gets complicated,
and there is more than enough scope for errors without such complications.)&amp;#160;&lt;a class="footnote-backref" href="#fnref:uniform" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:static"&gt;
&lt;p&gt;One possible confusion here is that I'm describing a &lt;em&gt;static,&lt;/em&gt;
rather than a dynamic, segmentation here: people are allocated to a segment
on the basis of the spend in the pre-period and remain in that segment
when assessed in the post-period. If we reassigned people on the basis
of their later behaviour, we would not see this effect if the
spend distribution were static.&amp;#160;&lt;a class="footnote-backref" href="#fnref:static" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="TDDA"></category><category term="reproducibility"></category><category term="errors"></category><category term="interpretation"></category></entry><entry><title>Name Styles</title><link href="https://tdda.info/name-styles.html" rel="alternate"></link><published>2024-03-04T16:00:00+00:00</published><updated>2024-03-04T16:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2024-03-04:/name-styles.html</id><summary type="html">&lt;p&gt;This is just a bit of fun, but I've always been
interested in the different kinds of names allowed,
encouraged, and used in different areas of computing
and data.&lt;/p&gt;
&lt;p&gt;A few years ago, I tweeted some
&lt;a href="https://twitter.com/njr0/status/1368856623062675456"&gt;well-known naming styles&lt;/a&gt;
and a collection of &lt;a href="https://twitter.com/njr0/status/1368857275113308162"&gt;lesser-known naming styles&lt;/a&gt;.
I was playing about …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This is just a bit of fun, but I've always been
interested in the different kinds of names allowed,
encouraged, and used in different areas of computing
and data.&lt;/p&gt;
&lt;p&gt;A few years ago, I tweeted some
&lt;a href="https://twitter.com/njr0/status/1368856623062675456"&gt;well-known naming styles&lt;/a&gt;
and a collection of &lt;a href="https://twitter.com/njr0/status/1368857275113308162"&gt;lesser-known naming styles&lt;/a&gt;.
I was playing about with the same idea while thinking about metadata
standards today and came up with this.
Just as I often think one of the boxes on the ubiquitous
2x2 "Boston-Box"-style matrices makes no sense, I think some of the boxes
on the evil-good-lawful-chaotic breakdown
(which I gather comes from
&lt;a href="https://en.wikipedia.org/wiki/Alignment_(Dungeons_%26_Dragons)"&gt;Dungeons and Dragons&lt;/a&gt;)
make little sense, so forgive me if some of this looks slightly forced.
But I think it's fun.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.tdda.info/images/name-styles-evil-good-lawful-chaotic-1.0-2860x2860.png" alt="Evil-Good-Lawful-Chaotic 3x3 matrix classification of name styles. LAWFUL GOOD: CamelCase, dromedaryCase, snake_case.  NEUTRAL GOOD: kebab-case, SCREAMING_SNAKE_CASE.  CHAOTIC GOOD: Train-Case, SCREAMING-KEBAB-CASE.  LAWFUL NEUTRAL: Pascal_Snake_Case, camel Snake Case, flatcase, UPPERFLATCASE.  NEUTRAL: reservedcase_, private ish case.  CHAOTIC NEUTRAL: space case.  LAWFUL EVIL: double quoted case, single quoted case, __dunder_case__.  NEUTRAL EVIL: path/case.extended, colon:caseflatcase, path/case, endash-kebab-case, quoted embedded newline case.  CHAOTIC EVIL: teRRorIsT nOTe CAse, alternating_separator-case, curly double quoted case, curly single quoted case, unquoted embedded, newline case."/&gt;&lt;/p&gt;</content><category term="TDDA"></category><category term="TDDA"></category><category term="names"></category></entry><entry><title>TOMLParams: TOML-based parameter files made better</title><link href="https://tdda.info/tomlparams-toml-based-parameter-files-made-better.html" rel="alternate"></link><published>2023-07-16T16:00:00+01:00</published><updated>2023-07-16T16:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2023-07-16:/tomlparams-toml-based-parameter-files-made-better.html</id><summary type="html">&lt;p&gt;TOMLParams is a new open-source library that helps Python developers
to externalize parameters in &lt;a href="https://toml.io/"&gt;TOML&lt;/a&gt; files.
This post will explain why storing parameters in non-code files
is beneficial (including for reproducibility), why TOML was chosen,
and some of the useful features of the library, which include
structured sets of parameters …&lt;/p&gt;</summary><content type="html">&lt;p&gt;TOMLParams is a new open-source library that helps Python developers
to externalize parameters in &lt;a href="https://toml.io/"&gt;TOML&lt;/a&gt; files.
This post will explain why storing parameters in non-code files
is beneficial (including for reproducibility), why TOML was chosen,
and some of the useful features of the library, which include
structured sets of parameters using TOML tables,
hierarchical inclusion with overriding,
default values,
parameter (key) checking,
optional type checking,
and features to help use across programs, including
built-in support for setting parameters using environment variables.&lt;/p&gt;
&lt;h1 id="the-benefits-of-externalizing-parameters"&gt;The Benefits of Externalizing Parameters&lt;/h1&gt;
&lt;p&gt;Almost all software can do more than one thing, and has various
&lt;em&gt;parameters&lt;/em&gt; that are used to control exactly what it does.  Some of
these parameters are set once and never changed (typically
&lt;em&gt;configuration&lt;/em&gt; parameters, that each user or installation chooses),
while others may be changed more often, perhaps from run to
run. Command-line tools often accept some parameters on the command
line itself, most obviously input and output files and core parameters such
as search terms for search commands. On Unix and Linux systems, it's
also common to use command line "switches" (also called &lt;em&gt;flags&lt;/em&gt; or
&lt;em&gt;options&lt;/em&gt;) to refine behaviour. So for example, the Unix/Linux grep tool
might be used in any of the following ways:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;grep &lt;span class="nb"&gt;time&lt;/span&gt;            &lt;span class="c1"&gt;# find all lines including &amp;#39;time&amp;#39; on stdin&lt;/span&gt;
grep &lt;span class="nb"&gt;time&lt;/span&gt; p*.txt     &lt;span class="c1"&gt;# ... on .txt files starting with &amp;#39;p&amp;#39;&lt;/span&gt;
grep -i &lt;span class="nb"&gt;time&lt;/span&gt; p*.txt  &lt;span class="c1"&gt;# ... ignoring capitalization (-i switch)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;All of &lt;code&gt;time&lt;/code&gt;, &lt;code&gt;p*.txt&lt;/code&gt; and &lt;code&gt;-i&lt;/code&gt; are examples of command-line parameters.&lt;/p&gt;
&lt;p&gt;Many tools also use configuration files to control the behaviour of
the software. On Unix and Linux, these are typically stored in the user's
home directory, often in 'dot' files,
such as &lt;code&gt;~/.bashrc&lt;/code&gt; for the Bash shell,
&lt;code&gt;~/.emacs&lt;/code&gt; for the Emacs editor and &lt;code&gt;~/.zshrc&lt;/code&gt; for the Z Shell.
These files are sometimes in proprietary formats, but increasingly
often are in "standard" formats such as
&lt;a href="https://en.wikipedia.org/wiki/INI_file"&gt;&lt;code&gt;.ini&lt;/code&gt;&lt;/a&gt; (MS-DOS initialization files),
&lt;a href="https://www.json.org/json-en.html"&gt;JSON&lt;/a&gt; (&lt;em&gt;JavaScript Object Notation&lt;/em&gt;),
&lt;a href="https://yaml.org"&gt;YAML&lt;/a&gt; (&lt;em&gt;YAML Ain't Markup Language&lt;/em&gt;),
or &lt;a href="https://toml.io/en/"&gt;TOML&lt;/a&gt; (&lt;em&gt;Tom's Obvious Minimal Language&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;When writing code, it's often tempting just to set parameters in code.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The quickest, dirtiest practice is just to have parameter &lt;em&gt;values&lt;/em&gt;
   "hard-wired" as literals in code where they are used,
   possibly with the same values being repeated in different places.
   This is generally frowned upon for a few reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hard-wired parameters are hard to find, and therefore hard to update
   and inspect;&lt;/li&gt;
&lt;li&gt;There may be repetition, violating the
   &lt;a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself"&gt;&lt;em&gt;don't repeat yourself&lt;/em&gt;&lt;/a&gt;
   (DRY) principle, and leading to possible inconsistency (though see also
   the countervailing &lt;a href="https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)"&gt;&lt;em&gt;rule of three&lt;/em&gt;&lt;/a&gt; (ROT)
   and the &lt;a href="https://erock.prose.sh/anti-pattern"&gt;&lt;em&gt;write everything in threes&lt;/em&gt;&lt;/a&gt; (WET) principle);&lt;/li&gt;
&lt;li&gt;If you want to change parameters, you have to go to many places;&lt;/li&gt;
&lt;li&gt;There may be no name associated with the parameter, making it hard
   to know what it means.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So implementing an energy calculation using Einstein's formula
   E = mc², we would probably prefer either:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mass&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return energy in joules from mass in kilograms&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mass&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;constants&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speed_of_light&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;constants&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speed_of_light&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return energy in joules from mass m in kilograms&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;constants&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;  &lt;span class="c1"&gt;# the speed of light&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(for a physicist)
to&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;3e8&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;3e8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or (worse!)&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;9e16&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;For something like the speed of light (which is, after all, a universal
   constant), keeping the value in code is probably appropriate, so we might have a file
   called &lt;code&gt;constants.py&lt;/code&gt; containing something like:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# constants.py: physical constants&lt;/span&gt;

&lt;span class="n"&gt;speed_of_light&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;299_792_458&lt;/span&gt;   &lt;span class="c1"&gt;# metres per second, in vacuo. (c. 300 000 km/s)&lt;/span&gt;
&lt;span class="n"&gt;h_cross&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.054_571_817e-34&lt;/span&gt;    &lt;span class="c1"&gt;# reduced Planck constant in joule seconds (= h/2π)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;For parameters that change in different situations, or for different users,
   it's usually much better practice to store the parameters in a separate file,
   ideally in a common format for which there is good library support.
   Some of the advantages of this include:&lt;ul&gt;
&lt;li&gt;Changing parameters then does not require changing code.
   Code may be unavailable to the user (e.g. write locked, compiled
   or running on a remote server through an API).
   Moreover, editing code often requires more expertise
   and confidence than editing a parameter file.&lt;/li&gt;
&lt;li&gt;The same code can be run multiple times using different parameters
   at the same time; this can be more challenging if the parameters
   are "baked into" the code.&lt;/li&gt;
&lt;li&gt;It becomes easy to maintain multiple sets of parameters for different
   situations, allowing a user simply to choose a named set on each occasion
   (usually through the parameter file's name). Of course, the code itself
   can have different named sets of parameters, but this is less common,
   and usually less transparent to the user, and harder to maintain.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="why-toml"&gt;Why TOML?&lt;/h1&gt;
&lt;p&gt;There is nothing particularly special about TOML, but it is
well suited as a format for parameter files, with a sane
and pragmatic set of features, good library support in most modern
programming languages and rapidly increasing adoption.&lt;/p&gt;
&lt;p&gt;Here's a simple example of a TOML file:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# A TOML file (This is a comment)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2024-01-01&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;# This will produce a datetime.date in Python&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;run_days&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;366&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="c1"&gt;# an integer (int)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;tolerance&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0001&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;# a floating-point value (float)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;locale&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;en_GB&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="c1"&gt;# a string value&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;2024&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="c1"&gt;# also a string value&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;critical_event_time&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2024-07-31T03:22:22+03:00&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# a datetime.datetime&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;fermi_approximate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# a boolean value (bool)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;


&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;# A TOML &amp;quot;table&amp;quot;, which acts like a section, or group&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;.csv&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="c1"&gt;# Yet another string&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="c1"&gt;# An array (list)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;financial&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;telecoms&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;2024 events&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="c1"&gt;# The same key as before, but this one is&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="c1"&gt;# in the logging table, so does not conflict&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="c1"&gt;# with the previous title&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
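&lt;p&gt;As a rough illustration of how those TOML types arrive in Python, here is a minimal sketch that parses a cut-down version of the file above. It assumes Python 3.11 or later for the standard-library &lt;code&gt;tomllib&lt;/code&gt; module (falling back to the third-party &lt;code&gt;tomli&lt;/code&gt; package if that is installed instead), and it uses the plain parser directly rather than TOMLParams. The variable names are invented for the example.&lt;/p&gt;

```python
import datetime

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party package with the same loads() function

TOML_TEXT = """
start_date = 2024-01-01
run_days = 366
tolerance = 0.0001
title = "2024"
fermi_approximate = true

[logging]
format = ".csv"
events = [
    "financial",
    "telecoms",
]
title = "2024 events"
"""

params = tomllib.loads(TOML_TEXT)

# Report the Python type that each top-level TOML value was converted to.
for key, value in params.items():
    print(f"{key}: {type(value).__name__}")

# A TOML date becomes a datetime.date, not a string.
start_is_date = isinstance(params["start_date"], datetime.date)
# The table has its own namespace, so the two titles do not collide.
titles = (params["title"], params["logging"]["title"])
print(start_is_date, titles)
```

&lt;p&gt;Note that the trailing comma in the &lt;code&gt;events&lt;/code&gt; array is accepted, and that the &lt;code&gt;[logging]&lt;/code&gt; table comes back as a nested dictionary.&lt;/p&gt;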

&lt;p&gt;&lt;strong&gt;Strengths of TOML as a parameter/configuration format&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ease of reading, writing and editing ("obviousness").&lt;/li&gt;
&lt;li&gt;Lack of surprises: there are few if any weirdnesses in TOML (which
   is less true of the other main competing formats).&lt;/li&gt;
&lt;li&gt;Good coverage of the main expected types: TOML supports (and
   differentiates between) booleans,
   integers, floating point values, strings,
   &lt;a href="https://en.wikipedia.org/wiki/ISO_8601"&gt;ISO8601&lt;/a&gt;-formatted
   dates and timestamps (with and without timezones,
   and mapping to &lt;code&gt;datetime.date&lt;/code&gt; and &lt;code&gt;datetime.datetime&lt;/code&gt; in Python),
   arrays (lists), key-value pairs (dictionaries),
   and hierarchical sections.&lt;/li&gt;
&lt;li&gt;It supports comments (which begin with a hash mark &lt;code&gt;#&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Support in most modern languages, including C#, C++, Clojure,
   Dart, Elixir, Erlang, Go, Haskell, Java, Javascript, Lua,
   Objective-C, Perl, PHP, Python, Ruby, Rust, Swift, Scala&lt;/li&gt;
&lt;li&gt;Flexible quoting with support for multiline strings, and no
   use of bare (unquoted) strings (avoiding ambiguity)&lt;/li&gt;
&lt;li&gt;Well-defined semantics without becoming fussy and awkward to edit
   (e.g., being unfussy about trailing commas in arrays).&lt;/li&gt;
&lt;li&gt;TOML is specifically designed to be a configuration file format.
   Quoting the front page of &lt;a href="https://toml.io/en/"&gt;the website&lt;/a&gt;:&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;TOML aims to be a minimal configuration file format that:
is easy to read due to obvious semantics
maps unambiguously to a hash table
is easy to parse into data structures in a wide variety of languages&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;In the context of Python, TOML is already being adopted widely,
   most notably in the increasingly ubiquitous &lt;code&gt;pyproject.toml&lt;/code&gt;
   files. There are good libraries available for reading (&lt;code&gt;tomli&lt;/code&gt;)
   and writing (&lt;code&gt;tomli_w&lt;/code&gt;) TOML files, although these are not
   part of the standard library. (Python 3.11 and later do
   include &lt;code&gt;tomllib&lt;/code&gt;, for reading TOML files, in the standard library.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Weaknesses of TOML as a parameter/configuration format&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We don't think TOML has any major weaknesses for this purpose, but
points that might count against it for some include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;TOML does not have a null (NULL/nil/None/undefined) value.
   (&lt;code&gt;TOMLParams&lt;/code&gt; could address this, but has no current plans to do so.)&lt;/li&gt;
&lt;li&gt;Hierarchical sections ('tables') in TOML are not nested.
   So if you want a section/table called &lt;code&gt;behaviours&lt;/code&gt; and
   subsections/subtables called &lt;code&gt;personal&lt;/code&gt; and &lt;code&gt;business&lt;/code&gt;, in TOML this
   might be represented by something like the excerpt below (possibly
   with a &lt;code&gt;[behaviours]&lt;/code&gt; table with its own parameters as well).&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;[behaviours.personal]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;frequency&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="k"&gt;[behaviours.business]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;frequency&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Some people, however, &lt;a href="https://hitchdev.com/strictyaml/why-not/toml/"&gt;really don't like TOML&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not JSON, YAML, .ini...&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We chose TOML because we think, overall, it is better than each of the
obvious competing formats, but the purpose of this post isn't to
do down those formats, which all have their established places.
But we will comment briefly on the most obvious alternatives.&lt;/p&gt;
&lt;p&gt;JSON, while good as a transfer format and very popular in web services,
is tricky for humans to write correctly, even with editor support,
because it requires correct nesting and refuses to accept trailing commas
in lists and dictionaries.
The lack of support for dates and timestamps is also a frequent source of frustration,
with quoted strings typically being used instead, with all the concomitant problems
of that approach.&lt;/p&gt;
&lt;p&gt;At first glance, YAML appears more suitable as a
configuration/parameter-file format, but in practice it is the opposite of
obvious and often produces unexpected results. Sources of frustration
include no requirement to quote strings, "magic values" like &lt;code&gt;yes&lt;/code&gt;
(which maps to &lt;code&gt;true&lt;/code&gt;) and &lt;code&gt;no&lt;/code&gt; (which maps to &lt;code&gt;false&lt;/code&gt;, much to the
annoyance of Norwegians and anyone using the NO country code),
inadvertent coercion of numbers with leading zeroes to octal (in YAML 1.1),
whitespace sensitivity, and issues around multi-line strings.&lt;sup id="fnref:7yaml"&gt;&lt;a class="footnote-ref" href="#fn:7yaml"&gt;1&lt;/a&gt;&lt;/sup&gt;
&lt;sup id="fnref:10yaml"&gt;&lt;a class="footnote-ref" href="#fn:10yaml"&gt;2&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:yamldochell"&gt;&lt;a class="footnote-ref" href="#fn:yamldochell"&gt;3&lt;/a&gt;&lt;/sup&gt; &lt;sup id="fnref:ifhy"&gt;&lt;a class="footnote-ref" href="#fn:ifhy"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;.ini&lt;/code&gt; files look a lot like TOML (I would guess they were a major inspiration
for it) but are much simpler and less rich, have less well-defined syntax,
have fewer types and don't require quoting of strings, leading to ambiguity.&lt;/p&gt;
&lt;h1 id="what-are-the-key-features-of-tomlparams"&gt;What are the Key Features of TOMLParams?&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Simple externalization of parameters in one or more TOML files&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You write your parameters in a TOML file and pass the name of the
TOML file when creating an instance of the &lt;code&gt;TOMLParams&lt;/code&gt; class.
It reads the parameters from the file
and makes them available as attributes on the object, and also
using dictionary-style look-up, i.e. if your
TOMLParams instance is &lt;code&gt;p&lt;/code&gt; and you have a parameter &lt;code&gt;startdate&lt;/code&gt;
you can access it as &lt;code&gt;p.startdate&lt;/code&gt; or &lt;code&gt;p['startdate']&lt;/code&gt; (and, more pertinently,
also as &lt;code&gt;p[k]&lt;/code&gt; if &lt;code&gt;k&lt;/code&gt; is set to &lt;code&gt;'startdate'&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;If you use tables, you can "dot" your way through to the parameters
(&lt;code&gt;p.behaviours.personal.frequency&lt;/code&gt;) or use repeated dictionary lookups
(&lt;code&gt;p['behaviours']['personal']['frequency']&lt;/code&gt;).&lt;/p&gt;
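&lt;p&gt;The dual access style can be illustrated without the library. The sketch
below uses a hypothetical &lt;code&gt;ParamsView&lt;/code&gt; class, written only for this
example; it is not how &lt;code&gt;TOMLParams&lt;/code&gt; is implemented, just one way the
behaviour described above could be obtained over a nested dictionary.&lt;/p&gt;

```python
class ParamsView:
    """Hypothetical wrapper giving attribute and dictionary access to nested parameters."""

    def __init__(self, values):
        self._values = values

    def __getitem__(self, key):
        value = self._values[key]
        # Wrap nested tables so that lookups can be chained.
        return ParamsView(value) if isinstance(value, dict) else value

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


p = ParamsView({
    "startdate": "2024-01-01",
    "behaviours": {"personal": {"frequency": 3}},
})

k = "startdate"
print(p.startdate, p["startdate"], p[k])
print(p.behaviours.personal.frequency)
print(p["behaviours"]["personal"]["frequency"])
```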
&lt;p&gt;&lt;strong&gt;Loading, saving, default values, parameter name checking and optional parameter type checking&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With &lt;code&gt;TOMLParams&lt;/code&gt;, the set of parameters that exist is defined by the defaults.
You can either store your defaults in a TOML file (e.g. &lt;code&gt;defaults.toml&lt;/code&gt;)
or pass them to the &lt;code&gt;TOMLParams&lt;/code&gt; initializer as a dictionary.&lt;/p&gt;
&lt;p&gt;If you choose a different TOML file, all the parameter values are first set
to their default values, and then any parameters set in the file you specify
override those. New parameters (i.e. any not listed in defaults)
raise an error.&lt;/p&gt;
&lt;p&gt;If you turn on type checking, the library will check that the types of
all parameter values provided match those of the defaults, and you choose
whether mismatches result in a warning (which doesn't raise an exception)
or an error (which does).&lt;/p&gt;
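&lt;p&gt;The rules above (defaults first, overrides second, unknown names rejected,
optional type checking) can be sketched in a few lines. The helper
&lt;code&gt;apply_overrides&lt;/code&gt; and its &lt;code&gt;check_types&lt;/code&gt; values are
hypothetical, chosen for this illustration; they are not the library's API,
and the sketch handles flat parameters only.&lt;/p&gt;

```python
import warnings


def apply_overrides(defaults, overrides, check_types="warn"):
    """Return defaults updated with overrides (flat parameters only).

    Hypothetical helper: unknown names raise KeyError; type mismatches
    warn or raise TypeError, depending on check_types
    ('warn', 'error' or 'off').
    """
    params = dict(defaults)
    for name, value in overrides.items():
        if name not in defaults:
            raise KeyError(f"unknown parameter: {name}")
        if check_types != "off" and type(value) is not type(defaults[name]):
            message = (f"{name}: expected {type(defaults[name]).__name__}, "
                       f"got {type(value).__name__}")
            if check_types == "error":
                raise TypeError(message)
            warnings.warn(message)
        params[name] = value
    return params


defaults = {"n_years": 5, "rate": 0.05, "region": "uk"}
print(apply_overrides(defaults, {"rate": 0.07}))
```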
&lt;p&gt;&lt;strong&gt;Hierarchical file inclusion with overriding&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Perhaps the most powerful feature of TOMLParams is the ability for one
parameters file to 'include' one or more others. If you use TOMLParams
specifying the parameters file name as &lt;code&gt;base&lt;/code&gt;, and the first line of
&lt;code&gt;base.toml&lt;/code&gt; is&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;uk&amp;#39;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;then parameters from &lt;code&gt;uk.toml&lt;/code&gt; will be read and processed before
those from &lt;code&gt;base.toml&lt;/code&gt;. So all parameters will first be set to
default values, then anything in &lt;code&gt;uk.toml&lt;/code&gt; will override those values,
and finally any values in &lt;code&gt;base.toml&lt;/code&gt; will override those.&lt;/p&gt;
&lt;p&gt;The include statement can also be a list:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;uk&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;metric&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;2023&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Inclusions are processed left to right (i.e. &lt;code&gt;uk.toml&lt;/code&gt; is processed
before &lt;code&gt;metric.toml&lt;/code&gt;, followed by &lt;code&gt;2023.toml&lt;/code&gt;), followed by the parameters
in the including file itself. So in this case, if defaults are in
&lt;code&gt;defaults.toml&lt;/code&gt; and the TOML file specified is &lt;code&gt;base.toml&lt;/code&gt;, the full
sequence is&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;defaults.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;uk.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;metric.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;2023.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;base.toml&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of course, the included files can themselves include others, but the
library keeps track of this and each file is only processed once
(the first time it is encountered), which prevents infinite recursion
and repeated setting.&lt;/p&gt;
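&lt;p&gt;The processing order can be sketched as a depth-first walk that remembers
which files it has already seen. This is an illustration under stated
assumptions, not the library's code: the function &lt;code&gt;resolve&lt;/code&gt; is
hypothetical, and the "files" are in-memory dictionaries standing in for
parsed TOML files, with made-up parameter values.&lt;/p&gt;

```python
def resolve(name, files, seen=None, order=None):
    """Return the order in which parameter files would be processed.

    Hypothetical sketch: 'files' maps a file stem to its parsed contents,
    and 'include' may be a single stem or a list of stems.
    """
    seen = set() if seen is None else seen
    order = [] if order is None else order
    if name in seen:           # each file is processed once only
        return order
    seen.add(name)
    include = files[name].get("include", [])
    if isinstance(include, str):
        include = [include]
    for included in include:   # left to right, before the including file
        resolve(included, files, seen, order)
    order.append(name)
    return order


files = {
    "base": {"include": ["uk", "metric", "2023"], "rate": 0.07},
    "uk": {"currency": "GBP"},
    "metric": {"units": "km"},
    "2023": {"include": "uk", "year": 2023},   # repeated include is skipped
}
order = resolve("base", files)
print(order)

# Start from the defaults, then let each file override in turn.
params = {"rate": 0.05, "units": "miles", "currency": "USD", "year": 2022}
for stem in order:
    params.update({k: v for k, v in files[stem].items() if k != "include"})
print(params)
```

&lt;p&gt;Marking a file as seen before following its own includes is what keeps a
circular chain of includes from recursing forever.&lt;/p&gt;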
&lt;p&gt;This makes it very easy to maintain different kinds and groups of
parameters in different files, and to create a variation of a set
of parameters simply by making the first line &lt;code&gt;include&lt;/code&gt; whatever
TOML file specifies the parameters you want to start from, and then
override the specific parameter or parameters you want to be different
in your new file.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Support for writing consolidated parameters as TOML after
hierarchical inclusion and resolution&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The library can also write the final parameters used out as a single,
consolidated TOML file, which is useful when hierarchical inclusion
and overriding are used, and keeps a record of the final values of all
parameters. This helps with reproducibility and logging.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Support for using environment variables to select parameter set (as
well as API)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You can choose how to specify the name of the parameters file to be used,
and the name to which it should default. If you create a &lt;code&gt;TOMLParams&lt;/code&gt; instance
with:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TOMLParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;defaults&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;defaults&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;newparams&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;standard_params_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;default parameters will be read from &lt;code&gt;./defaults.toml&lt;/code&gt;
and then &lt;code&gt;./newparams.toml&lt;/code&gt; will be processed, overriding default values.&lt;/p&gt;
&lt;p&gt;If you want to specify the name of the parameters file to use on the
command line, the usual pattern would be something like:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tomlparams&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TOMLParams&lt;/span&gt;

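&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Simulate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Use the first command-line argument as the parameters name, else &amp;#39;base&amp;#39;&lt;/span&gt;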
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;base&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TOMLParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;defaults.toml&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Simulate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Sometimes, however, it's convenient to use an environment variable
to set the name of the parameters file, particularly if you want
to use the same parameters in multiple programs, or run from a
shell script or a &lt;code&gt;Makefile&lt;/code&gt;. You can specify an environment variable
to use for this and &lt;code&gt;TOMLParams&lt;/code&gt; will inspect that environment variable
if no name is passed. If you choose &lt;code&gt;SIMPARAMS&lt;/code&gt; for this and say&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TOMLParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;defaults.toml&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env_var&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;SIMPARAMS&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;it will look for a name in &lt;code&gt;SIMPARAMS&lt;/code&gt; in the environment which you can
set with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;SIMPARAMS=&amp;quot;foo&amp;quot; python pythonfile.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SIMPARAMS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;foo&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pythonfile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If it's not set, it will use &lt;code&gt;base.toml&lt;/code&gt; as the file name, or something
else you choose with the &lt;code&gt;base_params_stem&lt;/code&gt; argument to &lt;code&gt;TOMLParams&lt;/code&gt;.&lt;/p&gt;
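&lt;p&gt;The precedence described above (an explicit name, then the environment
variable, then the base stem) can be sketched as follows. The function
&lt;code&gt;choose_params_name&lt;/code&gt; is hypothetical and exists only for this
illustration; its argument names mirror those mentioned in the text, but it is
not the library's implementation.&lt;/p&gt;

```python
import os


def choose_params_name(name=None, env_var=None, base_params_stem="base"):
    """Hypothetical helper: explicit name, else environment variable, else base stem."""
    if name:
        return name
    if env_var and os.environ.get(env_var):
        return os.environ[env_var]
    return base_params_stem


os.environ.pop("SIMPARAMS", None)
unset_choice = choose_params_name(env_var="SIMPARAMS")

os.environ["SIMPARAMS"] = "foo"
env_choice = choose_params_name(env_var="SIMPARAMS")
explicit_choice = choose_params_name("newparams", env_var="SIMPARAMS")

print(unset_choice, env_choice, explicit_choice)
```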
&lt;h1 id="check-it-out"&gt;Check it out&lt;/h1&gt;
&lt;p&gt;You can install &lt;code&gt;TOMLParams&lt;/code&gt; from PyPI in the usual way, e.g.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;python -m pip install -U tomlparams
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The source is available from GitHub (under an MIT license), at
&lt;a href="https://github.com/smartdatafoundry/tomlparams"&gt;github.com/smartdatafoundry/tomlparams&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There's a &lt;code&gt;README.md&lt;/code&gt; and documentation is available on Read the Docs at
&lt;a href="https://tomlparams.readthedocs.io"&gt;tomlparams.readthedocs.io&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can get help within Python on the &lt;code&gt;TOMLParams&lt;/code&gt; class with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tomlparams&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tomlparams&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;After installing tomlparams, you will find you have a &lt;code&gt;tomlparams&lt;/code&gt; command,
which you can use to copy example code from the &lt;code&gt;README&lt;/code&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ tomlparams examples
Examples copied to ./tomlparams_examples.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can also get help with &lt;code&gt;tomlparams help&lt;/code&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ tomlparams &lt;span class="nb"&gt;help&lt;/span&gt;
TOMLParams

USAGE:
    tomlparams &lt;span class="nb"&gt;help&lt;/span&gt;      — show this message
    tomlparams version   — report version number
    tomlparams examples  — copy the examples to ./tomlparams_examples
    tomlparams &lt;span class="nb"&gt;test&lt;/span&gt;      — run the tomlparams tests

Documentation: https://tomlparams.readthedocs.io/
Source code:   https://github.com/smartdatafoundry.com/tomlparams
Website:       https://tomlparams.com

Installation:

    python -m pip install -U tomlparams
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:7yaml"&gt;
&lt;p&gt;&lt;a href="https://www.infoworld.com/article/3669238/7-yaml-gotchas-to-avoidand-how-to-avoid-them.html"&gt;7 YAML gotchas to avoid—and how to avoid them&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:7yaml" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:10yaml"&gt;
&lt;p&gt;&lt;a href="https://www.redhat.com/sysadmin/yaml-tips"&gt;10 YAML tips for people who hate YAML&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:10yaml" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:yamldochell"&gt;
&lt;p&gt;&lt;a href="https://ruudvanasseldonk.com/2023/01/11/the-yaml-document-from-hell"&gt;https://ruudvanasseldonk.com/2023/01/11/the-yaml-document-from-hell&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:yamldochell" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:ifhy"&gt;
&lt;p&gt;&lt;a href="https://www.reddit.com/r/programminghorror/comments/i0cnog/i_fucking_hate_yaml/"&gt;I fucking hate YAML&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:ifhy" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="TDDA"></category><category term="reproducibility"></category></entry><entry><title>TDDA on the Coding for Thought Podcast</title><link href="https://tdda.info/tdda-on-the-coding-for-thought-podcast.html" rel="alternate"></link><published>2023-07-11T16:00:00+01:00</published><updated>2023-07-11T16:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2023-07-11:/tdda-on-the-coding-for-thought-podcast.html</id><summary type="html">&lt;p&gt;I had the pleasure of discussing TDDA with Peter Schmidt on his
Code for Thought podcast.&lt;/p&gt;
&lt;p&gt;I think it came out really well, so this might be a nice way for people
to learn about the ideas and motivations for the ideas and the library,
which Simon Brown, Sam Rhynas …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I had the pleasure of discussing TDDA with Peter Schmidt on his
Code for Thought podcast.&lt;/p&gt;
&lt;p&gt;I think it came out really well, so this might be a nice way for people
to learn about the ideas behind TDDA, the motivations for them, and the library,
which Simon Brown, Sam Rhynas and I developed over some years at
&lt;a href="https://stochasticsolutions.com"&gt;Stochastic Solutions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The podcast should be available by search in any podcast player for
'Code for Thought'.&lt;/p&gt;
&lt;p&gt;Direct links are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://podcasts.apple.com/gb/podcast/en-test-driven-data-analysis-with-nick-radcliffe/id1548426989?i=1000620643209"&gt;Apple Podcast Directory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://soundcloud.com/code4thought-615691925/en-test-driven-data-analysis?si=bbec1b46f8ce4cbea4cc31f5999ff135&amp;amp;utm_source=clipboard&amp;amp;utm_medium=text&amp;amp;utm_campaign=social_sharing"&gt;Soundcloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://codeforthought.buzzsprout.com/1326658/13192819-en-test-driven-data-analysis-with-nick-radcliffe"&gt;Buzzsprout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=kSf6xZvYyFE"&gt;YouTube (no video!)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="TDDA"></category><category term="podcast"></category><category term="TDDA"></category></entry><entry><title>Overcast Logged-in iCloud Users: Self-Selection Bias and Customer Stickiness</title><link href="https://tdda.info/overcast-logged-in-icloud-users-self-selection-bias-and-customer-stickiness.html" rel="alternate"></link><published>2023-01-08T16:00:00+00:00</published><updated>2023-01-08T16:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2023-01-08:/overcast-logged-in-icloud-users-self-selection-bias-and-customer-stickiness.html</id><summary type="html">&lt;p&gt;On &lt;a href="https://podcasts.apple.com/gb/podcast/a-less-cloudy-outlook/id1055685246?i=1000590953199"&gt;Episode 258&lt;/a&gt;
of
&lt;a href="https://marco.org"&gt;Marco Arment&lt;/a&gt;
and
“Underscore” &lt;a href="https://www.david-smith.org/blog/2021/04/08/watchsmith-2-0"&gt;David Smith&lt;/a&gt;’s
podcast
&lt;a href="https://podcasts.apple.com/gb/podcast/under-the-radar/id1055685246"&gt;Under the Radar&lt;/a&gt;,
and then on &lt;a href="https://atp.fm/516"&gt;Episode 516&lt;/a&gt;
of Marco &amp;amp; co’s &lt;a href="https://atp.fm"&gt;Accidental Tech Podcast&lt;/a&gt;,
Marco describes the fact that his data suggests that about 12% of his users
don’t have logged-in iCloud accounts with iCloud Drive …&lt;/p&gt;</summary><content type="html">&lt;p&gt;On &lt;a href="https://podcasts.apple.com/gb/podcast/a-less-cloudy-outlook/id1055685246?i=1000590953199"&gt;Episode 258&lt;/a&gt;
of
&lt;a href="https://marco.org"&gt;Marco Arment&lt;/a&gt;
and
“Underscore” &lt;a href="https://www.david-smith.org/blog/2021/04/08/watchsmith-2-0"&gt;David Smith&lt;/a&gt;’s
podcast
&lt;a href="https://podcasts.apple.com/gb/podcast/under-the-radar/id1055685246"&gt;Under the Radar&lt;/a&gt;,
and then on &lt;a href="https://atp.fm/516"&gt;Episode 516&lt;/a&gt;
of Marco &amp;amp; co’s &lt;a href="https://atp.fm"&gt;Accidental Tech Podcast&lt;/a&gt;,
Marco describes the fact that his data suggests that about 12% of his users
don’t have logged-in iCloud accounts with iCloud Drive enabled, which was a significant
obstacle to moving his sync system for &lt;a href="https://overcast.fm"&gt;Overcast&lt;/a&gt;
to use Apple’s &lt;a href="https://developer.apple.com/icloud/cloudkit/"&gt;CloudKit&lt;/a&gt;, which requires both.&lt;/p&gt;
&lt;p&gt;This was a surprise to Marco, who had expected that the figure would be closer to 1%,
and led &lt;a href="https://www.caseyliss.com"&gt;Casey Liss&lt;/a&gt;
to worry aloud about producing an app that depended on CloudKit.&lt;/p&gt;
&lt;p&gt;Marco, &lt;em&gt;entirely correctly&lt;/em&gt;, suggested that his user base may be non-representative,
and pointed out that&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Many people listen to podcasts at work and when commuting,
    so may be more likely than usual to use work-issued phones; such devices are often
    locked out of use of iCloud in general, and iCloud Drive more specifically.&lt;/li&gt;
&lt;li&gt;His users are almost certainly biased towards geeks.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It may not be immediately obvious, but there is also a strong “statistical” explanation for why
Marco is almost certainly right, and Casey’s fears are likely to be somewhere between
&lt;em&gt;exaggerated&lt;/em&gt; and &lt;em&gt;unfounded&lt;/em&gt;.&lt;/p&gt;
&lt;h1 id="tldr-self-selection-bias-and-customer-stickiness"&gt;TL;DR: Self-Selection Bias and Customer Stickiness&lt;/h1&gt;
&lt;p&gt;Overcast is unusual (possibly unique) in offering a non-CloudKit-based sync system.
Users who need non-CloudKit-based podcast syncing have a limited choice of options,
possibly &lt;a href="https://en.wikipedia.org/wiki/Hobson%27s_choice"&gt;Hobson’s Choice&lt;/a&gt;.
So it is extremely likely that Overcast has a disproportionately high number of users who
can’t/don’t use iCloud Drive. Interestingly these people might also be exceptionally loyal,
because they (perhaps) have &lt;em&gt;nowhere else to go&lt;/em&gt;.&lt;/p&gt;
&lt;h1 id="the-long-version"&gt;The Long Version&lt;/h1&gt;
&lt;p&gt;Let’s do some &lt;a href="https://en.wikipedia.org/wiki/Fermi_problem"&gt;Fermi Estimation&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Let’s suppose Apple has 1 billion iOS users.&lt;/li&gt;
&lt;li&gt;Let’s suppose 10% of them use a podcast player. That’s 100 million people.&lt;/li&gt;
&lt;li&gt;Let’s suppose (consistent with Marco’s assumption) that 1% of those don’t have a logged-in iCloud account with iCloud Drive (henceforth referred to as “iCloudless”). That’s 1 million people.&lt;/li&gt;
&lt;li&gt;Let’s suppose Overcast has 1% market share (1 million active users).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Understandably, Marco doesn’t release user stats &lt;em&gt;per se&lt;/em&gt;, but did say:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“So, to give you some idea of what I mean by ’hardly anybody uses it’,
I’m looking at a few hundred people who use the website, and that is not
a large portion of the user base. And this is ... per day. ... It’s under
1,000 people. ... and that’s ... well under 1% ... [I]t’s a very small portion
of the user base.”&lt;/p&gt;
&lt;p&gt;— Marco Arment, &lt;strong&gt;Accidental Tech Podcast #516&lt;/strong&gt;,
     &lt;em&gt;One of My Fits of Outrage&lt;/em&gt;, 3rd January 2023,
     from &lt;a href="https://overcast.fm/+R7DXe40g4/31:16"&gt;31:16 (listen here)&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That sounds consistent with a million users,
and certainly says the number significantly exceeds 100,000.
Further, &lt;a href="https://www.buzzsprout.com"&gt;BuzzSprout&lt;/a&gt;’s
&lt;a href="https://www.buzzsprout.com/global_stats"&gt;Global Player stats&lt;/a&gt;
for &lt;a href="https://www.buzzsprout.com/global_stats?date=2022-12-01"&gt;December 2022&lt;/a&gt;
put Overcast’s Market Share at 1%,&lt;sup id="fnref:amazingly"&gt;&lt;a class="footnote-ref" href="#fn:amazingly"&gt;1&lt;/a&gt;&lt;/sup&gt;
with that number pegged at 1,134,026&lt;sup id="fnref:evenmoreamazingly"&gt;&lt;a class="footnote-ref" href="#fn:evenmoreamazingly"&gt;2&lt;/a&gt;&lt;/sup&gt; (users, presumably).&lt;/p&gt;
&lt;p&gt;Based on these estimates, there are the same number of iCloudless podcast
listeners—one million—as there are Overcast users, and their only choices
are&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;to use Overcast,&lt;/li&gt;
&lt;li&gt;to find another podcast player that offers non-iCloud-Drive-based
sync (if there is one), or&lt;/li&gt;
&lt;li&gt;to forgo sync.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is the &lt;em&gt;self-selection bias&lt;/em&gt;.
Just as restaurants with good vegan food probably have a disproportionately high number
of vegan customers (and much higher than ones that don't cater for vegans),
and buildings with proper disabled access probably have disproportionately high numbers
of disabled customers (and much higher than those that don't offer disabled access),
a podcast player that supports iCloudless sync will almost inevitably have
a disproportionately high number of iCloudless users.
In principle, if the estimates are reasonable, Marco could see up to 100%
of his users in this iCloudless category without running out of iCloudless users
(though this would plainly be ridiculous).&lt;/p&gt;
&lt;p&gt;If these estimates are the right order of magnitude, 12% seems like a very plausible figure
for what Marco would see in his stats. It means only 12% of people who might benefit
from the sync service are using Overcast, but that’s an order of magnitude more than
its estimated market share, which is pretty good. It also means that 88% of podcast listeners
who might benefit from iCloudless sync &lt;em&gt;are not&lt;/em&gt; currently using Overcast.&lt;/p&gt;
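&lt;p&gt;For anyone who wants to check the arithmetic, the estimates above can be
written out as a short calculation. The inputs are the rough Fermi figures
from the text plus Marco's reported 12%, not measurements, and the variable
names are chosen only for this sketch.&lt;/p&gt;

```python
# Fermi estimates from the text (orders of magnitude, not measurements).
ios_users = 1_000_000_000
podcast_listeners = ios_users // 10               # 10% use a podcast player
icloudless_listeners = podcast_listeners // 100   # 1% are iCloudless
overcast_users = podcast_listeners // 100         # 1% market share

# Reported figure: about 12% of Overcast users are iCloudless.
icloudless_overcast = overcast_users * 12 // 100

# Both shares expressed as whole percentages.
share_of_icloudless = 100 * icloudless_overcast // icloudless_listeners
market_share = 100 * overcast_users // podcast_listeners

print(icloudless_overcast)                  # 120000
print(share_of_icloudless, market_share)    # 12 1
print(share_of_icloudless // market_share)  # 12
```

&lt;p&gt;On these figures, Overcast's share among iCloudless listeners is about
twelve times its overall market share.&lt;/p&gt;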
&lt;p&gt;As I said, I don’t know whether any other podcast players do have a non-iCloud sync service,
but either way, it also suggests that now that Marco has put his plan to discontinue it on ice,
it might not be a terrible idea for him to lean into it. Maybe his marketing should specifically
try to target users who want to sync podcasts across devices (and the web!) but are “iCloudless”.
After all, it’s very plausible that there are a million of them out there; and it would be really
hard for Marco’s competitors to respond. As noted above, these also might be exceptionally
loyal/sticky customers, because there may be nowhere else for them to go. It’s one of Marco's
stronger competitive moats.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:amazingly"&gt;
&lt;p&gt;&lt;em&gt;amazingly,&lt;/em&gt; given that I made the estimate before trying to look up any stats,
and as you can see, my numbers really are Fermi Estimates.&amp;#160;&lt;a class="footnote-backref" href="#fnref:amazingly" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:evenmoreamazingly"&gt;
&lt;p&gt;&lt;em&gt;even more amazingly,&lt;/em&gt; this number of users is also within 2% of my Fermi Estimate. It’s just a coincidence, but a very happy one!&amp;#160;&lt;a class="footnote-backref" href="#fnref:evenmoreamazingly" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="stats"></category><category term="bias"></category><category term="interpretation"></category></entry><entry><title>Gentest Talk at 2022 Toronto Workshop on Reproducibility</title><link href="https://tdda.info/tdda-gentest-toronto-2022.html" rel="alternate"></link><published>2022-02-25T16:00:00+00:00</published><updated>2022-02-25T16:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2022-02-25:/tdda-gentest-toronto-2022.html</id><summary type="html">&lt;p&gt;We released version 2.0 of the Python
&lt;a href="https://github.com/tdda/tdda"&gt;TDDA library&lt;/a&gt;
this week. The radical new feature of the 2.0 release
is &lt;em&gt;Gentest&lt;/em&gt;, a command-line tool for automatically
generating tests for more-or-less any code that you can
run from a command line.&lt;/p&gt;
&lt;p&gt;Gentest was introduced at the &lt;a href="https://canssiontario.utoronto.ca/toronto_workshop_on_reproducibility_2022/"&gt;2022 Toronto Workshop …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;We released version 2.0 of the Python
&lt;a href="https://github.com/tdda/tdda"&gt;TDDA library&lt;/a&gt;
this week. The radical new feature of the 2.0 release
is &lt;em&gt;Gentest&lt;/em&gt;, a command-line tool for automatically
generating tests for more-or-less any code that you can
run from a command line.&lt;/p&gt;
&lt;p&gt;Gentest was introduced at the &lt;a href="https://canssiontario.utoronto.ca/toronto_workshop_on_reproducibility_2022/"&gt;2022 Toronto Workshop on
Reproducibility&lt;/a&gt; yesterday
(24th February), where demonstrations included using it
to write tests for three increasingly complex R programs.
This was to emphasize
that Gentest is useful for much more than just testing
Python code. Our (only slightly facetious) strapline is&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Gentest writes tests, so you don't have to.&lt;/em&gt;™&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Slides from the talk are available here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://stochasticsolutions.com/pdf/toronto2022-tdda-gentest.pdf"&gt;Gentest: Automatic Test Generation for Data Science
   (SLIDES)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And here's the video:&lt;/p&gt;
&lt;iframe width="779" height="438" src="https://www.youtube.com/embed/qx_n8mPLpnw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;We'll be posting more detail about Gentest here over the coming weeks.&lt;/p&gt;
&lt;p&gt;Another major upgrade in the 2.0 TDDA release is the documentation.
We've made much more effort to separate out the command-line
uses of the TDDA library&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;constraint generation&lt;/li&gt;
&lt;li&gt;data verification&lt;/li&gt;
&lt;li&gt;data validation&lt;/li&gt;
&lt;li&gt;inference of regular expressions, with Rexpy&lt;/li&gt;
&lt;li&gt;(now) automatic test generation, with Gentest&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;from the API documentation, which is only really relevant to Python users.&lt;/p&gt;
&lt;p&gt;The documentation is available &lt;a href="https://tdda.readthedocs.io"&gt;at Read The Docs&lt;/a&gt;.&lt;/p&gt;</content><category term="TDDA"></category><category term="tests"></category><category term="reference tests"></category><category term="gentest"></category></entry><entry><title>Unix &amp; Linux Survival Guide for Data Science etc.</title><link href="https://tdda.info/2.html" rel="alternate"></link><published>2022-02-21T21:00:00+00:00</published><updated>2022-02-21T21:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2022-02-21:/2.html</id><content type="html">&lt;center&gt;
&lt;img src="https://www.tdda.info/images/cartoons/etc2-unix-survival-ds.png" width="100%"
     alt="Cheat-sheet for unix and linux"/&gt;
&lt;/center&gt;

&lt;p&gt;&lt;a href="https://www.tdda.info/pdf/etc2-unix-survival-ds-A4.pdf"&gt;PDF Version (A4)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.tdda.info/pdf/etc2-unix-survival-ds-Letter.pdf"&gt;PDF Version (Letter)&lt;/a&gt;&lt;/p&gt;</content><category term="TDDA"></category><category term="tests"></category><category term="cartoon"></category></entry><entry><title>One Tiny Bug Fix etc.</title><link href="https://tdda.info/1.html" rel="alternate"></link><published>2022-02-16T16:00:00+00:00</published><updated>2022-02-16T16:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2022-02-16:/1.html</id><content type="html">&lt;center&gt;
&lt;img src="https://www.tdda.info/images/cartoons/etc1-one-tiny-bug-fix.png" width="910"
     alt="White Cat: The tests have failed again. Black Cat: Did you change the code? White Cat: No! Black Cat: Really? White Cat: I just fixed one TINY BUG in a COMPLETELY DIFFERENT part of the code. There's NO WAY that could cause this! Black Cat: Not again..."/&gt;
&lt;/center&gt;</content><category term="TDDA"></category><category term="tests"></category><category term="cartoon"></category></entry><entry><title>Why Code Rusts</title><link href="https://tdda.info/why-code-rusts.html" rel="alternate"></link><published>2022-02-07T16:00:00+00:00</published><updated>2022-02-07T16:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2022-02-07:/why-code-rusts.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;&lt;em&gt;or&lt;/em&gt; Why Tests Spontanously Fail&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You might think that if you write a program, and don't change
anything, then come back a day later (or a decade later) and run it
with the same inputs, it would produce the same output. At their core,
reference tests exist because this isn't …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;&lt;em&gt;or&lt;/em&gt; Why Tests Spontaneously Fail&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You might think that if you write a program, and don't change
anything, then come back a day later (or a decade later) and run it
with the same inputs, it would produce the same output. At their core,
reference tests exist because this isn't true, and it's useful to find
out if code you wrote in the past no longer does the same thing it
used to. This post collects together some of the reasons the behaviour of
code changes over time.&lt;sup id="fnref:contributions"&gt;&lt;a class="footnote-ref" href="#fn:contributions"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h2 id="the-environment-has-changed"&gt;The Environment Has Changed&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;E1&lt;/strong&gt; You updated your compiler/interpreter (Python/R etc.)&lt;br&gt;
&lt;strong&gt;E2&lt;/strong&gt; You updated libraries used in your code (e.g. from PyPI/CRAN).&lt;br&gt;
&lt;strong&gt;E3&lt;/strong&gt; You updated the operating system of the machine you're running on.&lt;br&gt;
&lt;strong&gt;E4&lt;/strong&gt; Someone else updated the operating system or library/compiler etc.&lt;br&gt;
&lt;strong&gt;E5&lt;/strong&gt; Your code uses some other software on your machine (or another
       machine) that has been updated (e.g. a database).&lt;br&gt;
&lt;strong&gt;E6&lt;/strong&gt; Your code uses an external service whose behaviour has changed
       (e.g. calling a web service to get/do something).&lt;br&gt;
&lt;strong&gt;E7&lt;/strong&gt; You have updated/replaced your hardware.&lt;br&gt;
&lt;strong&gt;E8&lt;/strong&gt; You run it on different hardware (another machine or OS or OS version
       or under a different compiler or...)&lt;br&gt;
&lt;strong&gt;E9&lt;/strong&gt; You move the code to a different location in the file system.&lt;br&gt;
&lt;strong&gt;E10&lt;/strong&gt; You have changed something in the file system that messes up the code
        e.g.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;deleting a file the code uses&lt;/li&gt;
&lt;li&gt;renaming a file the code uses&lt;/li&gt;
&lt;li&gt;editing a file the code uses&lt;/li&gt;
&lt;li&gt;removing or renaming a directory the code uses&lt;/li&gt;
&lt;li&gt;changing permissions on a file or directory the code uses&lt;/li&gt;
&lt;li&gt;creating a file or directory that the code expects to create
  and is now unable to, e.g. because of permissions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;E11&lt;/strong&gt; You run as a different user.&lt;br&gt;
&lt;strong&gt;E12&lt;/strong&gt; You run from a different directory while leaving the code in the
        same place.&lt;br&gt;
&lt;strong&gt;E13&lt;/strong&gt; You run the code in a different way (e.g. from a script instead
        of interactively, or in a scheduler).&lt;br&gt;
&lt;strong&gt;E14&lt;/strong&gt; A disk fills or some other resource becomes full or depleted.&lt;br&gt;
&lt;strong&gt;E15&lt;/strong&gt; The load on the machine is higher, and the code runs out of
        memory or disk or some other resource; or has a subtle timing
        dependency or assumption that fails under load.&lt;br&gt;
&lt;strong&gt;E15a&lt;/strong&gt; The load on the machine is &lt;em&gt;lower,&lt;/em&gt; meaning part of your code
         runs faster, causing a race condition to behave differently.
         [Added 2022-02-17]&lt;br&gt;
&lt;strong&gt;E16&lt;/strong&gt; The hardware has developed a fault.&lt;br&gt;
&lt;strong&gt;E17&lt;/strong&gt; A systems manager has changed some limits e.g. disk quotas,
allowed nice levels, a directory service, some permissions or groups...&lt;br&gt;
&lt;strong&gt;E18&lt;/strong&gt; A shell variable changed, or was created or destroyed.&lt;br&gt;
&lt;strong&gt;E19&lt;/strong&gt; The locale in which the machine is running changed.&lt;br&gt;
&lt;strong&gt;E20&lt;/strong&gt; You changed your PYTHONPATH or equivalent.&lt;br&gt;
&lt;strong&gt;E21&lt;/strong&gt; A new library that you don't use (or didn't think you used)
has appeared in a &lt;code&gt;site-packages&lt;/code&gt; or similar location, and was picked
up by your code or something else your code uses.&lt;br&gt;
&lt;strong&gt;E22&lt;/strong&gt; You updated your editor/IDE and now whenever you load a file it
gets changed in some subtle way that matters (e.g. line endings, blank
lines at the end of files, encoding, tabs vs. spaces).&lt;br&gt;
&lt;strong&gt;E23&lt;/strong&gt; The physical environment has changed in some way that affects
the machine you are running on (e.g. causing it to slow down).&lt;br&gt;
&lt;strong&gt;E24&lt;/strong&gt; A file has been touched&lt;sup id="fnref:touch"&gt;&lt;a class="footnote-ref" href="#fn:touch"&gt;2&lt;/a&gt;&lt;/sup&gt; and the software determines order
of processing by last update date.&lt;br&gt;
&lt;strong&gt;E25&lt;/strong&gt; The code uses a password or key that is changed, expires or
is revoked.&lt;br&gt;
&lt;strong&gt;E26&lt;/strong&gt; The code requires network access and the network is unavailable, slow, or unreliable at the time the test is run.&lt;br&gt;
&lt;strong&gt;E27&lt;/strong&gt; Almost any of the above (or below), but for a dependency of your code rather than your code itself, e.g. something in a data centre or library.&lt;br&gt;
&lt;strong&gt;E28&lt;/strong&gt; Your &lt;code&gt;PATH&lt;/code&gt; (the list of locations checked for executables) has
changed, or an alias has changed so that the executable you run is
different from before. [Added 2022-02-11]&lt;br&gt;
&lt;strong&gt;E29&lt;/strong&gt; A different disk or share is mounted, so that even though you
specify the same path, some file that you are using is different from
before. [Added 2022-02-11]&lt;br&gt;
&lt;strong&gt;E30&lt;/strong&gt; You run the code under a different shell or changed something in a shell startup file.  [Added 2022-02-17]  &lt;/p&gt;
&lt;p&gt;Many of these are illuminated by one of my favourite quotes from
Beth Andres-Beck:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mocking in unit tests makes the tests more stable because they
don’t break when your code breaks.&lt;br&gt;
— @bethcodes, 2020-12-29T01:26:00Z
&lt;a href="https://twitter.com/bethcodes/status/1343730015851069440"&gt;https://twitter.com/bethcodes/status/1343730015851069440&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="the-code-has-in-fact-changed"&gt;The Code Has, in Fact, Changed&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;C1&lt;/strong&gt; You think you didn't change the code, but actually you did.&lt;br&gt;
&lt;strong&gt;C2&lt;/strong&gt; You did change the code, but only in a way that &lt;em&gt;couldn't
       possibly&lt;/em&gt; change the behaviour in the case you're testing.&lt;br&gt;
&lt;strong&gt;C3&lt;/strong&gt; You didn't change the code, you &lt;em&gt;fixed a bug&lt;/em&gt;.&lt;br&gt;
&lt;strong&gt;C4&lt;/strong&gt; You didn't change the code, but someone else did.&lt;br&gt;
&lt;strong&gt;C5&lt;/strong&gt; You didn't change the code, but disk corruption did.&lt;br&gt;
&lt;strong&gt;C6&lt;/strong&gt; You didn't change the code, but you did update some data it uses.&lt;br&gt;
&lt;strong&gt;C7&lt;/strong&gt; You pulled the code again from a source-code repository but&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;someone else had pushed a change&lt;/li&gt;
&lt;li&gt;you checked out a different branch&lt;/li&gt;
&lt;li&gt;you pulled from the wrong repository.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;C8&lt;/strong&gt; You're on the wrong branch.&lt;br&gt;
&lt;strong&gt;C9&lt;/strong&gt; The system was restored from backup and you lost changes.&lt;br&gt;
&lt;strong&gt;C10&lt;/strong&gt; You used a hard link to a file and didn't change the file here but
did change it in one of the other linked locations.&lt;br&gt;
&lt;strong&gt;C11&lt;/strong&gt; You used symbolic links and though your symbolic link
didn't change, the code (or other file or files) it symbolically linked did.&lt;br&gt;
&lt;strong&gt;C12&lt;/strong&gt; You used a diff tool to compare files, but a difference that
&lt;em&gt;does&lt;/em&gt; matter to your code was not detected by the diff tool (e.g. line
endings or capitalization or whitespace).&lt;br&gt;
&lt;strong&gt;C13&lt;/strong&gt; You are in fact running more tests than previously, or different tests
from the ones you ran previously, without realising it.&lt;br&gt;
&lt;strong&gt;C14&lt;/strong&gt; You reformatted your code thinking that you were only making
changes to appearance.&lt;br&gt;
&lt;strong&gt;C15&lt;/strong&gt; You ran a code formatter/beautifier/coding standard enforcement
tool that had a bug in it and changed the meaning.&lt;br&gt;
&lt;strong&gt;C16&lt;/strong&gt; You believe nothing has changed because &lt;code&gt;git status&lt;/code&gt; tells you nothing
has changed, but you are using files that aren't tracked or are ignored.&lt;br&gt;
&lt;strong&gt;C17&lt;/strong&gt; You think a file hasn't changed because of its timestamp, but the
timestamp is wrong or doesn't mean what you think it means.&lt;br&gt;
&lt;strong&gt;C18&lt;/strong&gt; A hidden file changed (e.g. a dotfile).&lt;br&gt;
&lt;strong&gt;C19&lt;/strong&gt; A file that doesn't match a glob pattern you use changed.&lt;br&gt;
&lt;strong&gt;C20&lt;/strong&gt;  The file is in a cloud linked folder (e.g. Dropbox)
         and changed remotely. [Added 2025-01-31]&lt;br&gt;
&lt;strong&gt;C21&lt;/strong&gt;  A coding bot changed your code [Added 2025-06-02]  &lt;/p&gt;
&lt;p&gt;Also from Beth Andres-Beck:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you have 100% test coverage and your tests use mocks, no you don’t.&lt;br&gt;
— @bethcodes, 2020-12-29T01:51:00Z
&lt;a href="https://twitter.com/bethcodes/status/1343736477839020032"&gt;https://twitter.com/bethcodes/status/1343736477839020032&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="you-arent-running-the-code-you-think-you-are"&gt;You Aren't Running the Code You Think You Are&lt;/h2&gt;
&lt;p&gt;There is another set of problems that aren't strictly causes of code
rusting, but which help to explain a set of related situations every
developer has probably experienced, which all fall under the general
heading of &lt;em&gt;you aren't running the code you think you are&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;M1&lt;/strong&gt; The code you're running is not the version you think it is
       (e.g. you're in the wrong directory).&lt;br&gt;
&lt;strong&gt;M2&lt;/strong&gt; You are running the code on a different server from the one you think
       you are (e.g. you haven't realised you're ssh'd in to a different
       machine or editing a file over a network).&lt;br&gt;
&lt;strong&gt;M3&lt;/strong&gt; You're editing the code in one place but running it in another.&lt;br&gt;
&lt;strong&gt;M4&lt;/strong&gt; You have cross-mounted a file system and it's the wrong file
       system or you think you are/aren't using it when you actually
       aren't/are (respectively).&lt;br&gt;
&lt;strong&gt;M5&lt;/strong&gt; Something (e.g. a browser) is caching your code (or some CSS
or an image or something).&lt;br&gt;
&lt;strong&gt;M6&lt;/strong&gt; The code &lt;em&gt;has&lt;/em&gt; in fact run correctly (tests have passed)
but you're looking at the wrong output (wrong directory, wrong tab,
wrong URL, wrong window, wrong machine...)&lt;br&gt;
&lt;strong&gt;M7&lt;/strong&gt; Your compiled code is out-of-sync with your source code, so you're
not running what you think you are.&lt;br&gt;
&lt;strong&gt;M8&lt;/strong&gt; You're running (or not running) a virtual environment when you
think you are not (or are), respectively.&lt;br&gt;
&lt;strong&gt;M9&lt;/strong&gt; You're running a virtual environment and not understanding how
it's doing its magic, with the result that you're not using the libraries/code
you think you are.&lt;br&gt;
&lt;strong&gt;M10&lt;/strong&gt; You use a package manager that's installed the right libraries
into a different Python (or whatever) from the one you think it has.&lt;sup id="fnref:pip"&gt;&lt;a class="footnote-ref" href="#fn:pip"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;br&gt;
&lt;strong&gt;M11&lt;/strong&gt; You think you haven't changed the code/libraries/Python you're
using, but in fact you did when you updated (what you thought was)
a different virtual (or non-virtual) environment.&lt;br&gt;
&lt;strong&gt;M12&lt;/strong&gt; You have a conflict between different import directories (e.g.
a local &lt;code&gt;site-packages&lt;/code&gt; and a system &lt;code&gt;site-packages&lt;/code&gt;), with different
versions of the same library, and aren't importing the one you think you are.&lt;br&gt;
&lt;strong&gt;M13&lt;/strong&gt; You think the code hasn't changed because you recorded the
version number, but there was a code change that didn't cause the
version number to be changed, or the code has multiple version
numbers, or the code is reporting its version number wrongly, or the
version number actually refers to a number of slightly different
builds that are supposed to have the same behaviour, but don't.&lt;br&gt;
&lt;strong&gt;M14&lt;/strong&gt; You have defined the same class or method or function or variable
more than once in a language that doesn't mind such things, and are looking
at (and possibly editing) a copy of the relevant function/callable/object
that is masked by the later definition. [Added 2022-09-14]&lt;br&gt;
&lt;strong&gt;M15&lt;/strong&gt; A web server or application server has your code in memory and
changing or recompiling your code won't have any effect until you restart
that web server or application server. This is really a variation of &lt;strong&gt;M5&lt;/strong&gt;,
but is subtly different because you wouldn't normally think of this as
&lt;em&gt;caching.&lt;/em&gt; [Added 2024-03-30]  &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;These are the ones that make you question your sanity.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;TIP&lt;/strong&gt; If what's happening &lt;em&gt;can't&lt;/em&gt; be happening, try introducing
a clear syntax error or debug statement or some other change you
should be able to see. Then check that it shows
up as expected when you're running your code.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Almost every time I think I'm losing my mind when coding, it's
because I'm editing and running different code
(or viewing results from different code).&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
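&lt;p&gt;For Python code, several of the cases above (particularly &lt;strong&gt;M8&lt;/strong&gt; to
&lt;strong&gt;M12&lt;/strong&gt;) can be checked by asking the running interpreter what it is and
where a module was actually imported from. The following is a minimal sketch, not part of
the TDDA library; the function name &lt;code&gt;describe_runtime&lt;/code&gt; is made up for
illustration, and the virtual-environment check assumes an environment created with the
standard &lt;code&gt;venv&lt;/code&gt; mechanism.&lt;/p&gt;

```python
import importlib
import sys


def describe_runtime(module_name):
    """Report which interpreter, and which copy of a module, is in use.

    describe_runtime is a hypothetical helper for illustration only.
    """
    module = importlib.import_module(module_name)
    return {
        # The Python binary that is actually executing this code.
        "executable": sys.executable,
        # With a standard venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # The file the module was really loaded from (None for some built-ins).
        "module_file": getattr(module, "__file__", None),
    }


for key, value in describe_runtime("json").items():
    print(f"{key}: {value}")
```

&lt;p&gt;Running this from the same place, and in the same way, as the failing code is the
point: the answer may differ between an interactive shell, a script and a scheduler.&lt;/p&gt;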
&lt;h2 id="time-has-moved-on"&gt;Time has Moved On&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;T1&lt;/strong&gt; Your code has a (usually implicit) date/time dependence in it, e.g.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it uses 2-digit dates&lt;/li&gt;
&lt;li&gt;it assumes it's running in 2022&lt;/li&gt;
&lt;li&gt;it assumes it's not 29th February, or 1st January, or isn't a weekend...&lt;/li&gt;
&lt;li&gt;it assumes something else that's not true about (computer) time (no
leap seconds, no daylight savings times, no time-zones, no half-hour-aligned
timezones...)&lt;/li&gt;
&lt;li&gt;it uses 2-digit dates with a pivot year and time (or some computed time
the code uses) moves past the pivot year.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;T2&lt;/strong&gt; Time is 'bigger' in some material way that causes a problem, e.g.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Y2K&lt;/li&gt;
&lt;li&gt;Unix 2038 (when the number of seconds from 1 Jan 1970 overflows
  signed 32-bit integers)&lt;/li&gt;
&lt;li&gt;Number of days since the code was written needs more digits (10, 100, 1000).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;T3&lt;/strong&gt; While the code is running, daylight savings time starts or stops,
and a measured (local) time interval goes negative.&lt;br&gt;
&lt;strong&gt;T4&lt;/strong&gt; Your code uses Excel to interpret data and today's a special date
that Excel doesn't (or more likely does) recognize.&lt;br&gt;
&lt;strong&gt;T5&lt;/strong&gt; The system clock is wrong (perhaps badly wrong); or the system
clock was wrong when you ran it before and is now right.  &lt;/p&gt;
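&lt;p&gt;To make &lt;strong&gt;T3&lt;/strong&gt; concrete, here is a small sketch using only Python's
standard library. The date is the night UK clocks went back in 2022; the two fixed-offset
zones labelled &lt;code&gt;BST&lt;/code&gt; and &lt;code&gt;GMT&lt;/code&gt; are a simplification chosen so that
the example does not depend on a time-zone database being installed.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-ins for UK local time either side of the clock change.
BST = timezone(timedelta(hours=1), "BST")
GMT = timezone(timedelta(0), "GMT")

# A job starts at 01:50 local time and finishes 20 minutes later,
# by which point the clocks have gone back an hour.
start = datetime(2022, 10, 30, 1, 50, tzinfo=BST)
end = datetime(2022, 10, 30, 1, 10, tzinfo=GMT)

# Subtracting wall-clock readings (ignoring offsets) gives a negative interval.
naive_elapsed = end.replace(tzinfo=None) - start.replace(tzinfo=None)

# Subtracting offset-aware values gives the true elapsed time.
aware_elapsed = end - start

print(naive_elapsed.total_seconds() / 60)  # -40.0
print(aware_elapsed.total_seconds() / 60)  # 20.0
```

&lt;p&gt;For timing code, a monotonic clock such as &lt;code&gt;time.monotonic()&lt;/code&gt; avoids the
problem altogether, because it is unaffected by changes to local time.&lt;/p&gt;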
&lt;h2 id="resources-used-by-the-code-have-changed"&gt;Resources Used by the Code Have Changed&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;R1&lt;/strong&gt; A resource your code uses (a database, a reference file, a page
on the internet, a web service) returns different data
from the data it always previously returned.&lt;br&gt;
&lt;strong&gt;R2&lt;/strong&gt; A resource your code uses returns data in a different format
e.g. a different text encoding, different precision, different line endings
(Unix vs. PC vs. Mac), presence or absence of a byte-order marker (BOM) in UTF-8, presence of new characters in Unicode, different normalization of unicode, indented or unindented JSON/XML, different sort order etc.&lt;br&gt;
&lt;strong&gt;R3&lt;/strong&gt; A resource you depend on returns “the same” data as expected but
something about the interaction is different, e.g. a different status
code or some extra data you can ignore, or some redundant data you
use has been removed.  &lt;/p&gt;
&lt;h2 id="stochastic-and-indeterminate-effects"&gt;Stochastic and Indeterminate Effects&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;S1&lt;/strong&gt; Your code uses random numbers and doesn't fix the seed.&lt;br&gt;
&lt;strong&gt;S2&lt;/strong&gt; Your code uses random numbers and &lt;em&gt;does&lt;/em&gt; fix the main seed
but not other seeds that get used (e.g. the seed for numpy is
different from Python's main seed).&lt;br&gt;
&lt;strong&gt;S3&lt;/strong&gt; A cosmic ray hits the machine and causes a bit flip.&lt;br&gt;
&lt;strong&gt;S4&lt;/strong&gt; The code is running on a GPU (or even CPU) that does not,
in fact, always produce the same answer (order of execution).&lt;br&gt;
&lt;strong&gt;S5&lt;/strong&gt; The code is running on a parallel, distributed, or multi-threaded
system and there is indeterminacy, a race condition, possible deadlock
or livelock, or any number of other things that might cause indeterminate
behaviour.&lt;br&gt;
&lt;strong&gt;S6&lt;/strong&gt; Your code assumes something is deterministic or has specified
behaviour that is in fact not deterministic or specified, especially
if that result is the same most but not all of the time, e.g. tie-breaking
in sorts, order of extraction from sets or (unordered) dictionaries,
or the order in which results arrive from asynchronous calls.&lt;sup id="fnref:indeterminacy"&gt;&lt;a class="footnote-ref" href="#fn:indeterminacy"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;br&gt;
&lt;strong&gt;S7&lt;/strong&gt; Your code relies on something likely but not certain, e.g. that
two randomly-generated, fairly long IDs will be different from each other.&lt;br&gt;
&lt;strong&gt;S8&lt;/strong&gt; Your code uses random numbers and &lt;em&gt;does&lt;/em&gt; fix the main seed, but
the sequence of random numbers has changed. This has happened with
NumPy, where they realised that one of the sampling functions was
drawing unnecessary samples from the PRNG. In making the sampler more
efficient, they changed the samples that were returned for the same
PRNG seed. [Contributed by Rob Moss
(&lt;a href="https://twitter.com/rob_models"&gt;@rob_models&lt;/a&gt; and
&lt;a href="https://mas.to/@rob_models"&gt;@rob_models@mas.to&lt;/a&gt;), who "had a quick
search for the relevant issue/changelog item, but it was a long time
ago (~NumPy 1.7, maybe)." He "couldn't find the original NumPy issue,
but here's a similar one: &lt;a href="https://github.com/numpy/numpy/issues/14522"&gt;https://github.com/numpy/numpy/issues/14522&lt;/a&gt;".
Thanks, Rob!]&lt;/p&gt;
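&lt;p&gt;As an illustration of guarding against &lt;strong&gt;S1&lt;/strong&gt;, &lt;strong&gt;S2&lt;/strong&gt; and
&lt;strong&gt;S6&lt;/strong&gt;, the sketch below draws a sample using a private, explicitly seeded
generator and a sorted copy of a set, so that neither global random state nor set iteration
order can affect the result. The function &lt;code&gt;sample_ids&lt;/code&gt; is hypothetical, and
nothing here protects against &lt;strong&gt;S8&lt;/strong&gt;: a library remains free to change the
sequence produced for a given seed.&lt;/p&gt;

```python
import random


def sample_ids(ids, k, seed):
    """Choose k items reproducibly from a collection of IDs.

    sample_ids is a hypothetical helper for illustration only.
    """
    # A private generator: unaffected by random.seed() calls elsewhere.
    rng = random.Random(seed)
    # Sets have no specified iteration order, so impose one before sampling.
    ordered = sorted(ids)
    return rng.sample(ordered, k)


first = sample_ids({"c", "a", "d", "b"}, 2, seed=42)

# Disturb the global generator, then repeat with the set built differently.
random.seed(1)
random.random()
second = sample_ids({"b", "d", "a", "c"}, 2, seed=42)

print(first == second)  # True
```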
&lt;h2 id="it-never-worked-or-didnt-work-when-you-thought-it-did"&gt;It Never Worked (or didn't work when you thought it did)&lt;/h2&gt;
&lt;p&gt;[Added 2024-07-19]&lt;/p&gt;
&lt;p&gt;I realised there's another whole class of errors of process/errors
of interpretation that could lead us to think that code has “rusted”
despite not having been changed. These are all broadly the same as one
of the explanations offered before, but now for the original run
when you thought it worked, rather than for the current or new run,
when it fails.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;N1&lt;/strong&gt; You thought you ran the code before, and that it worked correctly,
but you are mistaken: you didn't run it at all, or it in fact failed
but you did not notice.&lt;br&gt;
&lt;strong&gt;N2&lt;/strong&gt; You did run the code before, but picked up the output from
a previous state, before you broke it, when it did work.&lt;br&gt;
&lt;strong&gt;N3&lt;/strong&gt; You did run the code before, and it did produce the
wrong output then as now, but you used a defective procedure
or tool to examine the output then, and failed to realise
it was wrong/failing.&lt;br&gt;
&lt;strong&gt;N4&lt;/strong&gt; You did run the code before, and it did pass, but you
passed the wrong parameters/inputs/whatever and are now passing
the correct (or different) parameters/inputs/whatever so it now
fails as it would have done then if you had done the same.  &lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:contributions"&gt;
&lt;p&gt;If you think of other reasons code rusts,
do let me know and I'll be happy to expand this list (and attribute,
of course)&amp;#160;&lt;a class="footnote-backref" href="#fnref:contributions" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:touch"&gt;
&lt;p&gt;&lt;em&gt;Touching&lt;/em&gt; a file (the unix &lt;code&gt;touch&lt;/code&gt; command) updates the last
update date on a file without changing its contents.&amp;#160;&lt;a class="footnote-backref" href="#fnref:touch" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:pip"&gt;
&lt;p&gt;For this reason, a lot of people prefer to run &lt;code&gt;python -m pip&lt;/code&gt;
rather than &lt;code&gt;pip&lt;/code&gt;, because this way you can have greater confidence that
the module is getting installed in the &lt;code&gt;site-packages&lt;/code&gt; for the version
of &lt;code&gt;python&lt;/code&gt; you're actually running.&amp;#160;&lt;a class="footnote-backref" href="#fnref:pip" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:indeterminacy"&gt;
&lt;p&gt;Most of these kinds of indeterminacy will, in fact,
usually be stable given identical inputs on the same machine running
the same software, but it can take very little to change that, so this
stability should not be relied upon.&amp;#160;&lt;a class="footnote-backref" href="#fnref:indeterminacy" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tests"></category><category term="reference tests"></category><category term="rust"></category></entry><entry><title>Flat Files (a.k.a. CSV files)</title><link href="https://tdda.info/flat-files-aka-csv-files.html" rel="alternate"></link><published>2021-07-16T18:00:00+01:00</published><updated>2021-07-16T18:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2021-07-16:/flat-files-aka-csv-files.html</id><summary type="html">&lt;p&gt;This week, a client I'm working for received a large volume of data, and
as usual the data was sent as "flat" files—or CSV (&lt;em&gt;comma-separated
values&lt;/em&gt;&lt;sup id="fnref:CSV"&gt;&lt;a class="footnote-ref" href="#fn:CSV"&gt;1&lt;/a&gt;&lt;/sup&gt;) files, as they are more often called. Everyone hates CSV files,
because they are badly specified, contain little metadata and are
generally …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This week, a client I'm working for received a large volume of data, and
as usual the data was sent as "flat" files—or CSV (&lt;em&gt;comma-separated
values&lt;/em&gt;&lt;sup id="fnref:CSV"&gt;&lt;a class="footnote-ref" href="#fn:CSV"&gt;1&lt;/a&gt;&lt;/sup&gt;) files, as they are more often called. Everyone hates CSV files,
because they are badly specified, contain little metadata and are
generally an unreliable way to transfer information accurately. They continue
to be used, of course, because they are the lowest-common denominator
format and just about everything can read and write them in some fashion.&lt;/p&gt;
&lt;p&gt;Some of the problems with CSV files are well captured in a pithy blog
post by Jesse Donat entitled &lt;em&gt;&lt;a href="https://donatstudios.com/Falsehoods-Programmers-Believe-About-CSVs"&gt;Falsehoods Programmers Believe about CSVs&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Among other things, the data we received this week featured:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;unescaped commas in unquoted (comma-separated) values;&lt;/li&gt;
&lt;li&gt;an unspecified non-UTF-8 encoding that also did not appear to be iso-8859-1
   ("latin-1" to its friends), nor indeed iso-8859-15 ("latin-9");&lt;/li&gt;
&lt;li&gt;different null markers in different fields, and in some cases, different
   null markers in a single field;&lt;sup id="fnref:notwrong"&gt;&lt;a class="footnote-ref" href="#fn:notwrong"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;field names (column headers) that included spaces, apostrophes,
   dashes and (in at least one case) a non-ASCII non-alphanumeric character;&lt;/li&gt;
&lt;li&gt;multiple date formats, even within a single field, including some dates
   with &lt;em&gt;three&lt;/em&gt;-digit years.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of this is a bit frustrating, but far from unusual, and only one
of these problems was actually fatal—the use of unquoted,
unescaped separators in values, which makes the file inherently
ambiguous.
I'm almost sure this data was written but not read or validated,
because I don't believe the supplier
would have been able to read it reliably either.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In an ideal world, we'd move away from CSV files, but we also need to
recognise not only that this probably won't happen, but that the universality,
plain-text nature,
&lt;a href="https://www.merriam-webster.com/dictionary/grok"&gt;grokkability&lt;/a&gt;
and simplicity of CSV files are all strengths;
for all that we might gain using fancier, better-specified formats,
we would lose quite a lot too, not least the utility of awk, split, grep
and friends in many cases.&lt;/p&gt;
&lt;p&gt;So if we can't get away from CSV files, how can we increase reliability
when using them? Standardizing might be good, but again, this is going
to be hard to achieve. What we might be able to do, however, is to work
towards a way of specifying flat files that at least allows a receiver
of them to know what to expect, or a generator to know what to write.
I've been involved with a few such ideas over the years, and the software
my company produces (&lt;a href="https://stochasticsolutions.com/miro"&gt;Miró&lt;/a&gt;) used its
own non-standard, XML-based way of describing flat files.&lt;/p&gt;
&lt;p&gt;What I'm thinking about is trying to produce something more general,
less opinionated, and more modern (think JSON, rather than XML, for starters)
that addresses more issues. The initial goal would be simply descriptive—to
allow a metadata file to be created that accurately describes the
specific features of a given flat file so that a reader (human or machine)
knows how to interpret it. Over time, this might grow into something bigger.
I think obvious things to do after the format is created include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In my case, getting Miró to accept these in place of&lt;sup id="fnref:aswellas"&gt;&lt;a class="footnote-ref" href="#fn:aswellas"&gt;3&lt;/a&gt;&lt;/sup&gt; its current
   XML-based files when reading (or writing) flat files. (Initially, at least,
   Miró would not be able to read or write all files that could be
   specified in this way, but could at least warn the user when it couldn't.)&lt;/li&gt;
&lt;li&gt;Also getting the Python &lt;a href="https://github.com/tdda/tdda"&gt;&lt;code&gt;tdda&lt;/code&gt;&lt;/a&gt; library
   to be able to use this when using CSV files for input
   (and perhaps also for output).&lt;/li&gt;
&lt;li&gt;Writing an "argument generator" for some of the standard (Python)
   CSV readers and writers to set the read/write options to be
   consistent with a given metadata description, and then probably
   to provide wrapped versions of those readers/writers that can
   accept a path for a CSV file and a path to a metadata file and
   use the underlying CSV reader or writer to read or write the file
   using that specification.&lt;/li&gt;
&lt;li&gt;Writing (yet another) "smart" reader to try to read any old CSV files (using
   heuristics) and write out a metadata file that appears to match
   the data provided. This could not possibly work completely reliably
   because of all the inherent ambiguity in flat files already alluded to,
   but an "80%" solution for real-world files should certainly be achievable
   as many programs make a reasonable job of handling arbitrary CSV files
   already.&lt;/li&gt;
&lt;li&gt;Writing a validator to confirm whether a given CSV file is consistent
   with the specification in the metadata file.&lt;/li&gt;
&lt;li&gt;Incorporating such a flat-file validator into TDDA so that it can check
   not only the (semantic) content of a dataset, but also the
   syntactic/formatting validity of data, confirming that it has been
   or can be read correctly.&lt;sup id="fnref:tdda"&gt;&lt;a class="footnote-ref" href="#fn:tdda"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Together, a smart reader that generates a metadata file for a CSV
file (item 4 above) and a validator that validates a CSV file against such a
metadata specification (item 5) are closely analogous to TDDA's current
constraint discovery and data verification, respectively,
but in the space of CSV files—roughly, "syntactic" conformance—rather than
data (or "semantic") correctness.&lt;/p&gt;
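&lt;p&gt;As a rough illustration of the "argument generator" idea (item 3 above),
here is a minimal sketch that maps a metadata description onto keyword
arguments for Python's standard &lt;code&gt;csv.reader&lt;/code&gt;. The metadata keys
(&lt;code&gt;separator&lt;/code&gt;, &lt;code&gt;quote-char&lt;/code&gt;, &lt;code&gt;escape-char&lt;/code&gt;,
&lt;code&gt;stutter-quotes&lt;/code&gt;) and the function names are invented for this
example; no such format has been defined yet.&lt;/p&gt;

```python
import csv
import io

# Hypothetical metadata keys, invented for this sketch only.
DEFAULTS = {
    "separator": ",",
    "quote-char": '"',
    "escape-char": None,
    "stutter-quotes": True,
}


def csv_reader_kwargs(meta):
    """Translate a metadata description into keyword arguments for csv.reader."""
    m = {**DEFAULTS, **meta}
    return {
        "delimiter": m["separator"],
        "quotechar": m["quote-char"],
        "escapechar": m["escape-char"],
        "doublequote": m["stutter-quotes"],
    }


def read_rows(text, meta):
    """Read CSV text into a list of rows using the options implied by meta."""
    return list(csv.reader(io.StringIO(text), **csv_reader_kwargs(meta)))


# A pipe-separated file that uses stuttered quotes inside quoted strings.
meta = {"separator": "|", "stutter-quotes": True}
rows = read_rows('id|note\n1|"say ""hi"""\n', meta)
```

&lt;p&gt;A wrapped reader along these lines would take a path to the CSV file and a
path to the metadata file; the sketch reads from a string only to stay
self-contained.&lt;/p&gt;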
&lt;p&gt;&lt;strong&gt;Miró's Flat File Description format (XMD Files)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here is an example, from its documentation, of the XMD data files
that Miró uses.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;dataformat&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;sep&amp;gt;&lt;/span&gt;,&lt;span class="nt"&gt;&amp;lt;/sep&amp;gt;&lt;/span&gt;                     &lt;span class="cm"&gt;&amp;lt;!-- field separator --&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;null&amp;gt;&amp;lt;/null&amp;gt;&lt;/span&gt;                    &lt;span class="cm"&gt;&amp;lt;!-- NULL marker --&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;quoteChar&amp;gt;&lt;/span&gt;&amp;quot;&lt;span class="nt"&gt;&amp;lt;/quoteChar&amp;gt;&lt;/span&gt;         &lt;span class="cm"&gt;&amp;lt;!-- Quotation mark --&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;encoding&amp;gt;&lt;/span&gt;UTF-8&lt;span class="nt"&gt;&amp;lt;/encoding&amp;gt;&lt;/span&gt;       &lt;span class="cm"&gt;&amp;lt;!-- any python coding name --&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;allowApos&amp;gt;&lt;/span&gt;True&lt;span class="nt"&gt;&amp;lt;/allowApos&amp;gt;&lt;/span&gt;      &lt;span class="cm"&gt;&amp;lt;!-- allow apostrophes in strings --&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;skipHeader&amp;gt;&lt;/span&gt;False&lt;span class="nt"&gt;&amp;lt;/skipHeader&amp;gt;&lt;/span&gt;   &lt;span class="cm"&gt;&amp;lt;!-- ignore the first line of file --&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;pc&amp;gt;&lt;/span&gt;False&lt;span class="nt"&gt;&amp;lt;/pc&amp;gt;&lt;/span&gt;                   &lt;span class="cm"&gt;&amp;lt;!-- Convert 1.2% to 0.012 etc. --&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;excel&amp;gt;&lt;/span&gt;False&lt;span class="nt"&gt;&amp;lt;/excel&amp;gt;&lt;/span&gt;             &lt;span class="cm"&gt;&amp;lt;!-- pad short lines with NULLs --&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;dateFormat&amp;gt;&lt;/span&gt;eurodt&lt;span class="nt"&gt;&amp;lt;/dateFormat&amp;gt;&lt;/span&gt;  &lt;span class="cm"&gt;&amp;lt;!-- Miró date format name --&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;fields&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;mc id&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;ID&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;string&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;mc nm&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;MachineName&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;secs&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;TimeToManufacture&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;real&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;commission date&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DateOfCommission&amp;quot;&lt;/span&gt;
               &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;date&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;mc cp&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Completion Time&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;date&amp;quot;&lt;/span&gt;
               &lt;span class="na"&gt;format=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;rdt&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sh dt&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;ShipDate&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;date&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;format=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;rd&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;field&lt;/span&gt; &lt;span class="na"&gt;extname=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;qa passed?&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Passed QA&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bool&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/fields&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;requireAllFields&amp;gt;&lt;/span&gt;False&lt;span class="nt"&gt;&amp;lt;/requireAllFields&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;banExtraFields&amp;gt;&lt;/span&gt;False&lt;span class="nt"&gt;&amp;lt;/banExtraFields&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dataformat&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Three things to note immediately about this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I'm not presenting this as the solution: the XMD format is now rather
   out of vogue and there are a number of things I would definitely
   do differently fifteen years on (such as using more standard names
   for types and more standard date format specifiers).&lt;/li&gt;
&lt;li&gt;The XMD format is slightly more than just a flat file &lt;em&gt;description&lt;/em&gt;,
   in that it contains a couple of things that are more about how to
   interpret and handle the data after reading, rather than simply
   describing the data.&lt;/li&gt;
&lt;li&gt;The XMD file supports the notion of two different names for a field.
   The &lt;code&gt;extname&lt;/code&gt; is the name in the CSV file (the &lt;em&gt;external&lt;/em&gt; name),
   while the &lt;code&gt;name&lt;/code&gt; is the name
   for Miró to use for the field. The semantics of this are slightly
   complicated, but allow for renaming of fields on import, and for naming
   of fields where there is no external name,
   or external names are repeated,
   or the external name is otherwise unusable by Miró.
   If the CSV file has a header and each field has a different name in the
   header, the order of the fields in the XMD file does not matter, but if
   there are missing or repeated field names, Miró will use the field
   order in the XMD file.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notwithstanding the amazing variety seen in CSV files, as
illuminated by Jesse Donat's
&lt;a href="https://donatstudios.com/Falsehoods-Programmers-Believe-About-CSVs"&gt;aforementioned blogpost&lt;/a&gt;,
most CSV files from mature systems vary only in the ways covered by a few of
the items described in the XMD file. The most important things to know
about a flat file overall are normally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Encoding.&lt;/em&gt;
    The file encoding—these days, most commonly UTF-8.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Separator.&lt;/em&gt;
    The separator character—most commonly a comma (&lt;code&gt;,&lt;/code&gt;), but pipe (&lt;code&gt;|&lt;/code&gt;), tab
    and semicolon (&lt;code&gt;;&lt;/code&gt;) are also frequently used.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Quoting.&lt;/em&gt;
    What character is used to quote strings (if any).
    There are quite a number
    of subtleties here (not all capable of being expressed in the XMD file)
    including:&lt;ul&gt;
&lt;li&gt;Are all strings quoted or just some (e.g. ones containing the field
    separator)?&lt;/li&gt;
&lt;li&gt;Are non-string values (e.g. numbers) quoted too?&lt;sup id="fnref:quotingnonstrings"&gt;&lt;a class="footnote-ref" href="#fn:quotingnonstrings"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;Are missing values (NULL) quoted?&lt;sup id="fnref:quotingnulls"&gt;&lt;a class="footnote-ref" href="#fn:quotingnulls"&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Missing Values.&lt;/em&gt;
    How are missing values (NULLs) denoted in the file, should there be any?&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Escaping.&lt;/em&gt;
    How are characters "escaped"? This really covers a set of different
    issues, and the XMD file is not rich enough to cover all possibilities.
    One aspect is, when strings are quoted, how are quotes in the
    string handled? The most common answers are either by preceding
    them with an escape character, usually backslash (&lt;code&gt;\&lt;/code&gt;), e.g.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;quot;This is an escaped \&amp;quot; character in a string&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or by stuttering:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;quot;This is a stuttered &amp;quot;&amp;quot; character in a string&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Escaping is also a way of including the separator in non-quoted
values, like these display prices:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Price,DisplayPrice
100.0,£100.00
1000.0,£1\,000.00
1000000,£1\,000\,000.00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Escaping is also a way of specifying some special characters,
e.g. &lt;code&gt;\n&lt;/code&gt; for a newline, &lt;code&gt;\t&lt;/code&gt; for a tab etc., and as a result
when an actual backslash is required it is self-escaped
(as &lt;code&gt;\\&lt;/code&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Row Truncation after the last non-null value.&lt;/em&gt;
    Are rows in which the last value is missing truncated?
    Like many CSV writers, Excel writes missing values as blanks
    so that &lt;code&gt;1,,3&lt;/code&gt; is read as &lt;code&gt;1&lt;/code&gt; for the first field, a missing
    value for the second field and &lt;code&gt;3&lt;/code&gt; for the third field.
    More quirkily, when Excel writes out CSV files, if there are
    &lt;em&gt;n&lt;/em&gt; columns and the last &lt;em&gt;m&lt;/em&gt; of them on a row are missing,
    Excel will write out only the non-missing values, and no further
    separators, so that there will be only &lt;em&gt;n – m&lt;/em&gt; values on that line
    and only &lt;em&gt;n – m&lt;/em&gt;  – 1 separators.
    This behaviour is hard to describe and (as far as I know) unique
    to Excel, so in the
    XMD file this is simply marked as &lt;code&gt;&amp;lt;excel&amp;gt;True&amp;lt;/excel&amp;gt;&lt;/code&gt;.&lt;sup id="fnref:quirks"&gt;&lt;a class="footnote-ref" href="#fn:quirks"&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Header handling.&lt;/em&gt;
    Although the common case is for CSV files to have a single line at
    the start with the field names, sometimes there is no such line,
    and sometimes there are multiple lines before the data (one or more
    of which may specify the field names). As a minimum, a metadata
    description needs to be able to specify whether there is a header
    line, and ideally how many such lines there are and how headers
    should be extracted from them.
    If there are no headers, the specification should probably specify the
    field names.
    (Miró imaginatively calls the fields &lt;code&gt;Field1&lt;/code&gt; to &lt;code&gt;FieldN&lt;/code&gt; if no fieldnames
    are available in the flat file or any XMD file.)&lt;/li&gt;
&lt;/ul&gt;
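&lt;p&gt;To make the escaping, stuttering and row-truncation cases above concrete,
here is a small sketch using Python's standard &lt;code&gt;csv&lt;/code&gt; module. The
helper names &lt;code&gt;parse&lt;/code&gt; and &lt;code&gt;pad_short_rows&lt;/code&gt; are made up
for illustration, and the padding function is only one plausible way to
handle Excel-style truncated rows, not a description of what Miró does.&lt;/p&gt;

```python
import csv
import io


def parse(text, **options):
    """Parse CSV text with the given csv.reader options."""
    return list(csv.reader(io.StringIO(text), **options))


# Quote inside a quoted string, escaped with a backslash.
escaped = parse('"An escaped \\" character"\n',
                escapechar="\\", doublequote=False)

# Quote inside a quoted string, stuttered (doubled).
stuttered = parse('"A stuttered "" character"\n', doublequote=True)

# Separator inside an unquoted value, escaped with a backslash.
prices = parse("1000.0,£1\\,000.00\n", escapechar="\\")


def pad_short_rows(rows, n_fields, null=""):
    """Pad rows that were truncated after their last non-null value.

    Rows that already have n_fields (or more) values are left unchanged.
    """
    return [row + [null] * (n_fields - len(row)) for row in rows]


# Three columns; the second and third data rows have been truncated.
padded = pad_short_rows(parse("1,,3\n4\n5,6\n"), 3)
```

&lt;p&gt;The same bytes parse differently under the wrong settings, which is exactly
why a metadata description needs to say which convention a file uses.&lt;/p&gt;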
&lt;p&gt;&lt;strong&gt;Per-Field Information&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It's always useful and sometimes necessary to specify field types,
and as discussed above, sometimes field names.
Typing is almost always ambiguous, and such ambiguity is increased if
there are any bad values in the data.
Moreover, in some cases (especially dates and timestamps), it is
useful to specify the date format.
Although good flat-file readers generally make a reasonable job
of inferring types, and often date formats too, it is clearly helpful
for a metadata specification to include these.&lt;/p&gt;
&lt;p&gt;Just as date formats can vary between fields, other things can vary too,
most obviously null indicators (missing value information), quoting
and escaping. Moreover, if numeric data is formatted (e.g. including
currency indicators, thousand separators etc.) these can all usefully be
specified.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Required/Allowed Fields&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The final pair of settings in the XMD file look slightly different from
the others, partly because they are phrased as directives rather than
descriptions. &lt;code&gt;requireAllFields&lt;/code&gt;, when set, is a directive to Miró
to raise a warning or an error if any of the fields in the XMD file are
not present in the CSV file. Similarly, &lt;code&gt;banExtraFields&lt;/code&gt; is a directive
to raise such a warning or error if any fields are found in the CSV file
that are not listed in the XMD file. Miró has several ways to specify
whether infringements result in warnings or errors.&lt;/p&gt;
&lt;p&gt;These directives can, however, be recast as declarations. The
&lt;code&gt;banExtraFields&lt;/code&gt; directive, when true, can equally be thought of as a
declaration that the field list is complete.  Similarly, the
&lt;code&gt;requireAllFields&lt;/code&gt; directive, when true, can be thought of as a
declaration that the field list is not just describing types and
formats for fields that &lt;em&gt;might&lt;/em&gt; be in the CSV files, but rather that all fields
listed are &lt;em&gt;actually&lt;/em&gt; in the file.&lt;sup id="fnref:mirotdda"&gt;&lt;a class="footnote-ref" href="#fn:mirotdda"&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;In principle, I think it would probably be better if these descriptions
were more obviously descriptive or declarative, but I am struggling to
find a pair of words/phrases that would capture that elegantly.
At this point I am tempted to retain their imperative nature but
make them slightly more symmetrical, perhaps with:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;quot;require-all-fields&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;quot;allow-extra-fields&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Alternatively a more declarative syntax might be something like:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;quot;csv-file-might-omit-fields&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;quot;csv-file-might-include-extra-fields&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The reader might wonder why the fields in the metadata file would
ever not correspond exactly to those in the file. In practice, it is not
uncommon when dealing with relatively "good" CSV files to write an XMD
file that specifies types and formats only for fields that trip up the
flat-file reader. Conversely, it can be useful to have XMD files that
describe a variety of possible files that share field names and types;
in those cases, the extra ones do no harm.&lt;/p&gt;
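&lt;p&gt;Whichever phrasing is chosen, the check itself is simple. The sketch below
compares the field names in a CSV header with those listed in a metadata
file; the function name &lt;code&gt;check_fields&lt;/code&gt; and its parameters are
hypothetical, loosely following the &lt;code&gt;require-all-fields&lt;/code&gt; and
&lt;code&gt;allow-extra-fields&lt;/code&gt; names floated above.&lt;/p&gt;

```python
def check_fields(header, described,
                 require_all_fields=False, allow_extra_fields=True):
    """Compare a CSV header with the fields described in a metadata file.

    Returns a list of problem messages; an empty list means the header is
    consistent with the description under the given settings.
    """
    problems = []
    missing = [name for name in described if name not in header]
    extra = [name for name in header if name not in described]
    if require_all_fields and missing:
        problems.append("missing fields: " + ", ".join(missing))
    if extra and not allow_extra_fields:
        problems.append("unexpected fields: " + ", ".join(extra))
    return problems


header = ["mc id", "secs", "colour"]
described = ["mc id", "mc nm", "secs"]

# Lenient: the description covers only some fields, and extras are tolerated.
lenient = check_fields(header, described)

# Strict: the description is declared to be complete and exact.
strict = check_fields(header, described,
                      require_all_fields=True, allow_extra_fields=False)
```

&lt;p&gt;Whether a non-empty result should be a warning or an error is a separate
policy decision, as it is in Miró.&lt;/p&gt;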
&lt;p&gt;&lt;strong&gt;What Might a Metadata File Look Like?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The XMD file gets quite a lot of things right:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;As XML, it's a standard format that's easy to read, though today
   JSON is clearly more popular for this sort of use.
   (It would be fairly easy to allow a common format
   to be expressed in JSON, XML or YAML, but there's something to be said
   for a single format, probably JSON.)&lt;/li&gt;
&lt;li&gt;All of the most fundamental overall properties are represented—encoding,
   separator, null marker, escape characters, and date format.&lt;/li&gt;
&lt;li&gt;There's a separation between the overall file properties and the per-field
   properties, with the ability to specify the actual fieldname in the file,
   the field type and, in the case of date fields, custom formats on a
   per-field basis, if necessary.&lt;/li&gt;
&lt;li&gt;It can give enough information to allow Excel-style truncated
   lines to be read successfully.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are also a few major shortcomings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The single escape character specification covers multiple things.&lt;/li&gt;
&lt;li&gt;There is no explicit support for quote stuttering (which is fairly common).&lt;/li&gt;
&lt;li&gt;The format does not recognise multiple headers.&lt;/li&gt;
&lt;li&gt;The format does not provide any way to specify non-date field formats
   such as boolean specifiers, possible thousand separators and decimal
   point markers.&lt;/li&gt;
&lt;li&gt;The format assumes a single NULL indicator for all fields and
   assumes that there is only one kind of missing value.&lt;/li&gt;
&lt;li&gt;The date formats supported are not comprehensive and are not expressed
   in a standard way.&lt;/li&gt;
&lt;li&gt;Type specifiers are also somewhat non-standard.&lt;/li&gt;
&lt;li&gt;XMD files fail to recognize the possibility that null markers are
   quoted, and implicitly assume that any empty string is distinct from
   a missing string value. This is probably too opinionated.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some of these shortcomings reflect the fact that the XMD format was
conceived less as a general-purpose flat-file descriptor than a
specification as to how Miró should read or write a given flat file,
and also a way for Miró to specify how it &lt;em&gt;has&lt;/em&gt; written a flat file.&lt;/p&gt;
&lt;p&gt;Essentially, I think a good flat-file description format would preserve
the good aspects and remedy the faults identified, as well as providing
a mechanism for specifying some more esoteric possibilities not mentioned
so far.&lt;/p&gt;
&lt;p&gt;I'll propose something concrete in subsequent posts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;UPDATE&lt;/strong&gt; The example metadata was updated on 2025-06-23, to be slightly
more interesting and realistic. This coincides with the post
&lt;a href="https://tdda.info/tddaserial-metadata-for-flat-files-csv-files-csv"&gt;tdda.serial: Metadata for Flat Files (CSV Files)&lt;/a&gt;.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:CSV"&gt;
&lt;p&gt;Sometimes the separator in a flat file is a character other than a
comma, and you occasionally see &lt;code&gt;.tsv&lt;/code&gt; used as an extension when the separator
is a tab character, or &lt;code&gt;.psv&lt;/code&gt; when the separator is a pipe character (&lt;code&gt;|&lt;/code&gt;).
Often, however, a &lt;code&gt;csv&lt;/code&gt; extension is still used, and as a result the acronym
CSV is sometimes restyled as &lt;code&gt;character-separated values&lt;/code&gt;. I had always
heard this extension attributed to Microsoft, but have been unable
to verify this.&amp;#160;&lt;a class="footnote-backref" href="#fnref:CSV" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:notwrong"&gt;
&lt;p&gt;To be fair, the notion of different kinds of missing values is reasonable—missing because it wasn't recorded, missing because it was unreadable, missing because it's an undefined result (e.g. mean of no values) etc. But this wasn't that: it was just multiple ways of denoting generic missing values.&amp;#160;&lt;a class="footnote-backref" href="#fnref:notwrong" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:aswellas"&gt;
&lt;p&gt;by which, of course, I mean as well as ...&amp;#160;&lt;a class="footnote-backref" href="#fnref:aswellas" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:tdda"&gt;
&lt;p&gt;There's an interesting question as to whether the CSV format
specification should be incorporated as an optional part of a TDDA file,
and if so, whether it should simply be a nested section or whether
the field-specific components should be merged with TDDA's field sections.
There are pros and cons.&amp;#160;&lt;a class="footnote-backref" href="#fnref:tdda" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:quotingnonstrings"&gt;
&lt;p&gt;Yes, some systems do this.&amp;#160;&lt;a class="footnote-backref" href="#fnref:quotingnonstrings" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:quotingnulls"&gt;
&lt;p&gt;I know, madness! But such practices occur!&amp;#160;&lt;a class="footnote-backref" href="#fnref:quotingnulls" title="Jump back to footnote 6 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:quirks"&gt;
&lt;p&gt;Maybe it should have been called &lt;a href="https://en.wikipedia.org/wiki/Quirks_mode"&gt;quirks mode&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:quirks" title="Jump back to footnote 7 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:mirotdda"&gt;
&lt;p&gt;Miró's slightly extended version of TDDA
files includes lists of &lt;em&gt;required&lt;/em&gt; and &lt;em&gt;allowed&lt;/em&gt; fields, which serve a
similar purpose to these settings.&amp;#160;&lt;a class="footnote-backref" href="#fnref:mirotdda" title="Jump back to footnote 8 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="data"></category></entry><entry><title>Sharing Tests across Implementations by Externalizing Test Data</title><link href="https://tdda.info/sharing-tests-across-implementations-by-externalizing-test-data.html" rel="alternate"></link><published>2020-08-30T17:30:00+01:00</published><updated>2020-08-30T17:30:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2020-08-30:/sharing-tests-across-implementations-by-externalizing-test-data.html</id><summary type="html">&lt;p&gt;I've been dabbling in &lt;a href="https://swift.org"&gt;Swift&lt;/a&gt;—Apple's new-ish
programming language—recently. One of the things I often do when
learning a new language is either to take an existing project in a
language I know (usually, Python) and translate it to the new one,
or (better) to try a new project …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I've been dabbling in &lt;a href="https://swift.org"&gt;Swift&lt;/a&gt;—Apple's new-ish
programming language—recently. One of the things I often do when
learning a new language is either to take an existing project in a
language I know (usually, Python) and translate it to the new one,
or (better) to try a new project, first writing it in Python then
translating it. This allows me to separate out debugging the algorithm
from debugging my understanding of the new language, and also gives
me something to test against.&lt;/p&gt;
&lt;p&gt;I have a partially finished Python project for analysing chords that
I've been starting to translate, and this has led me to begin to experiment
with some new extensions to the TDDA library (not yet pushed/published).&lt;/p&gt;
&lt;p&gt;It's a bit fragmented and embryonic, but this is what I'm thinking about.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sharing test data between languages&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Many tests boil down to "check that passing these inputs to this
function&lt;sup id="fnref:callable"&gt;&lt;a class="footnote-ref" href="#fn:callable"&gt;1&lt;/a&gt;&lt;/sup&gt; produces this result". There would be some
benefits in sharing the inputs and expected outputs between
implementations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself"&gt;DRY&lt;/a&gt; principle
    (don't repeat yourself);&lt;/li&gt;
&lt;li&gt;reducing the chances of things getting out of sync;&lt;/li&gt;
&lt;li&gt;more confidence that the two implementations really do the same thing;&lt;/li&gt;
&lt;li&gt;less typing / less code.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Looping over test cases&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Standard unit-testing dogma tends to focus on the idea of testing small
units using many tests, each containing a single assertion,
usually as the last statement in the test.&lt;sup id="fnref:teardown"&gt;&lt;a class="footnote-ref" href="#fn:teardown"&gt;2&lt;/a&gt;&lt;/sup&gt;
The benefit of using  a single assertion
is that when there's a failure it's very clear what it was, and an
earlier failure doesn't prevent a later check (assertion) from being
carried out: you get all your failures in one go.
Less importantly, it also means that the number of tests executed is the same
as the number of assertions tested, which might be useful and
psychologically satisfying.&lt;/p&gt;
&lt;p&gt;On the other hand, it is extremely common to want to test
multiple input-output pairs and it is natural and convenient to collect
those together and loop over them. I do this &lt;em&gt;all the time&lt;/em&gt;, and the
reference testing capability in the TDDA library already
helps mitigate some downsides of this approach in some situations.&lt;/p&gt;
&lt;p&gt;A common way I do this is to loop over a dictionary or a list of tuples
specifying input-output pairs. For example, if I were testing a function
that did string slicing from the left in Python (&lt;code&gt;string[:n]&lt;/code&gt;)
I might use something like&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Catherine&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Cath&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Catherine&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Cath&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# deliberately wrong, for illustration&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Miró forever&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Miró&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Miró forever&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;    &lt;span class="c1"&gt;# also deliberately wrong&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In Python this is fine, because tuples, being hashable, can be used
as dictionary keys, and there's something quite intuitive and satisfying
about the cases being presented as lines of the form
&lt;code&gt;input: expected output&lt;/code&gt;. But I also often just use nested tuples or lists,
partly as a hangover from older versions of Python in which dictionaries
weren't ordered.&lt;sup id="fnref:ordereddict"&gt;&lt;a class="footnote-ref" href="#fn:ordereddict"&gt;3&lt;/a&gt;&lt;/sup&gt; Here's a full example using tuples:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.referencetest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;left_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestLeft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testLeft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Catherine&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Cath&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Catherine&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Cath&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    &lt;span class="c1"&gt;# deliberately wrong, for illustration&lt;/span&gt;
            &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Miró forever&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Miró&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Miró forever&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# also deliberately wrong&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As noted above, two problems with this are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if one test case fails, it's not necessarily easy to figure out which
    one it was, especially if expected values (e.g. &lt;code&gt;'Cath'&lt;/code&gt;) are repeated.&lt;/li&gt;
&lt;li&gt;an earlier failure prevents later cases from running.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can see both of these problems if we run this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python3 looptest.py
&lt;span class="nv"&gt;F&lt;/span&gt;
&lt;span class="o"&gt;======================================================================&lt;/span&gt;
FAIL: testLeft &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestLeft&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;looptest.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;18&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; testLeft
    self.assertEqual&lt;span class="o"&gt;(&lt;/span&gt;left_string&lt;span class="o"&gt;(&lt;/span&gt;text, n&lt;span class="o"&gt;)&lt;/span&gt;, expected&lt;span class="o"&gt;)&lt;/span&gt;
AssertionError: &lt;span class="s1"&gt;&amp;#39;Cat&amp;#39;&lt;/span&gt; !&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Cath&amp;#39;&lt;/span&gt;
- Cat
+ Cath
?    +


----------------------------------------------------------------------
Ran &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.000s

FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It's actually the second case that failed, and the fifth case would
also fail if it ran (since it should produce an empty string, not a space).&lt;/p&gt;
&lt;p&gt;A technique I've long used to address the first problem is to include the
test case in the equality assertion, replacing&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;like so:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testLeft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Catherine&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Cath&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Catherine&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Cath&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Miró forever&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Miró&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Miró forever&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now when a case fails, the message includes the inputs, so it is easier to see which case failed:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python3 looptest2.py
&lt;span class="nv"&gt;F&lt;/span&gt;
&lt;span class="o"&gt;======================================================================&lt;/span&gt;
FAIL: testLeft &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestLeft&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;looptest2.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;20&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; testLeft
    &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;, expected&lt;span class="o"&gt;))&lt;/span&gt;
AssertionError: Tuples differ: &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Catherine&amp;#39;&lt;/span&gt;, -6&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;Cat&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; !&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Catherine&amp;#39;&lt;/span&gt;, -6&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;Cath&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

First differing element &lt;span class="m"&gt;1&lt;/span&gt;:
&lt;span class="s1"&gt;&amp;#39;Cat&amp;#39;&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;Cath&amp;#39;&lt;/span&gt;

- &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Catherine&amp;#39;&lt;/span&gt;, -6&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;Cat&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
+ &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Catherine&amp;#39;&lt;/span&gt;, -6&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;Cath&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
?                         +


----------------------------------------------------------------------
Ran &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.001s

FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I wouldn't call it beautiful, but it does the job, at least when
the inputs and outputs are of a manageable size.&lt;/p&gt;
&lt;p&gt;This still leaves the problem that the failure of an earlier case prevents
later cases from running. The TDDA library already addresses this
in the case of file checks, by providing the
&lt;code&gt;assertFilesCorrect&lt;/code&gt; (plural) assertion in addition to the
&lt;code&gt;assertFileCorrect&lt;/code&gt; (singular); we'll come back to it later.&lt;/p&gt;
&lt;h3 id="externalizing-test-data"&gt;Externalizing Test Data&lt;/h3&gt;
&lt;p&gt;Returning to the main theme of this post, when there are multiple
implementations of software, potentially in different languages,
there is some attraction to being able to share the test data—ideally,
both the inputs being tested and the expected results.&lt;/p&gt;
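&lt;p&gt;As a rough sketch of the idea, the cases for &lt;code&gt;left_string&lt;/code&gt; could
live in a language-neutral file that every implementation reads. The JSON layout,
the field names &lt;code&gt;text&lt;/code&gt;, &lt;code&gt;n&lt;/code&gt; and &lt;code&gt;expected&lt;/code&gt;,
and the helper &lt;code&gt;check_cases&lt;/code&gt; are all assumptions chosen for
illustration. The data is held in a string here only to keep the example
self-contained.&lt;/p&gt;

```python
import json

# Hypothetical shared test data: the layout is an assumption made purely for
# illustration. Each case holds the inputs and the expected result.
CASES_JSON = '''
[
  {"text": "Catherine", "n": 4, "expected": "Cath"},
  {"text": "", "n": 7, "expected": ""},
  {"text": "Miró forever", "n": 4, "expected": "Miró"},
  {"text": "Miró forever", "n": 0, "expected": ""}
]
'''


def left_string(s, n):
    return s[:n]


def check_cases(func, cases_json):
    """Return a list of (case, actual) pairs for every case that fails."""
    failures = []
    for case in json.loads(cases_json):
        actual = func(case['text'], case['n'])
        if actual != case['expected']:
            failures.append((case, actual))
    return failures
```

&lt;p&gt;Calling &lt;code&gt;check_cases(left_string, CASES_JSON)&lt;/code&gt; returns an empty
list when every case passes. An implementation in another language would parse
the same data and run the same comparison.&lt;/p&gt;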
&lt;p&gt;The project I'm translating is a chord analysis tool focused on
jazz guitar chords, especially moveable ones with no root.
It has various classes, functions and structures concerned with musical
notes, scales, abstract chords, tunings, chord shapes, chord names and
so forth. It includes an easy-to-type text format that
uses &lt;code&gt;#&lt;/code&gt; as the sharp sign and &lt;code&gt;b&lt;/code&gt; as the flat sign, though on output,
these are usually translated to &lt;code&gt;♯&lt;/code&gt; and &lt;code&gt;♭&lt;/code&gt;. Below are two simple tests
from the Python code.&lt;/p&gt;
&lt;p&gt;For those interested, the first tests a function &lt;code&gt;transpose&lt;/code&gt; that
transposes a note by a number of semitones. There's an optional &lt;code&gt;key&lt;/code&gt;
parameter which, when provided, is used to decide whether to express
the result as a sharp or flat note (when appropriate).&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testTranspose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Db&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Db&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;D&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;D&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;D#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Eb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Eb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;D#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Eb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;E&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;E&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Gb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Eb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Gb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;E&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Gb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Gb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Clearly, this test does not use looping, but does combine some
36 test cases in a single test (dogma be damned!)&lt;/p&gt;
&lt;p&gt;A second test is for a function &lt;code&gt;to_flat_equiv&lt;/code&gt;, which (again, for
those interested) accepts chord names (in various forms) and—where
the chord's key is sharp, as written—converts them to the equivalent
flat form.  (Here, &lt;code&gt;o&lt;/code&gt; is one of the ways to indicate a diminished
chord (e.g. Dº) and &lt;code&gt;M&lt;/code&gt; is one of the ways of describing a major chord
(also &lt;code&gt;maj&lt;/code&gt; or &lt;code&gt;Δ&lt;/code&gt;).)  The function also accepts &lt;code&gt;None&lt;/code&gt; as an input
(returned unmodified) and &lt;code&gt;R&lt;/code&gt; as an abstract chord with no key
specified (also unmodified).&lt;sup id="fnref:nobsharp"&gt;&lt;a class="footnote-ref" href="#fn:nobsharp"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_to_flat_equiv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C#m&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Dbm&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Db7&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Db7&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C#M7&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;DbM7&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Do&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Do&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;D#M&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;EbM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;E9&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;E9&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;FmM7&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;FmM7&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#mM7&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;GbmM7&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;G#11&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Ab11&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Ab11&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Ab11&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Am11&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Am11&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bb+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bb+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A♯+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B♭+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B♭+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B♭+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;

        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;R&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;R&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;R#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;R#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rm&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Rm&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_flat_equiv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;letter&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;BEPQaz@&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertRaises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NoteError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_flat_equiv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;letter&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This test uses two loops, one for the
good cases and another for seven illegal input cases that raise exceptions.
The looping has a clear benefit, but there's no reason to have combined
the good and bad test cases in a single test function other than laziness.&lt;/p&gt;
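&lt;p&gt;For illustration, here is a minimal sketch of how the good and bad cases
might be split into two test methods, using &lt;code&gt;subTest&lt;/code&gt; from the standard
&lt;code&gt;unittest&lt;/code&gt; module so that a failure reports which input caused it.
The &lt;code&gt;to_flat_equiv&lt;/code&gt; and &lt;code&gt;NoteError&lt;/code&gt; shown here are simplified stand-ins,
written only so that the example runs on its own; they are assumptions, not
the real implementation, and handle only ASCII &lt;code&gt;#&lt;/code&gt; sharps.&lt;/p&gt;

```python
import unittest


class NoteError(Exception):
    """Stand-in for the exception raised for an invalid note name."""


# Hypothetical stand-in mapping, NOT the real implementation:
# just enough to make the sketch below runnable.
SHARP_TO_FLAT = {'C#': 'Db', 'D#': 'Eb', 'F#': 'Gb', 'G#': 'Ab', 'A#': 'Bb'}


def to_flat_equiv(chord):
    """Simplified stand-in: convert an ASCII sharp chord name to flat form."""
    if chord is None or chord.startswith('R'):
        return chord  # None and abstract 'R' chords pass through unmodified
    if chord[1:2] == '#':
        root = chord[:2]
        if root not in SHARP_TO_FLAT:
            raise NoteError(chord)  # e.g. 'B#', 'E#', or a non-note letter
        return SHARP_TO_FLAT[root] + chord[2:]
    return chord  # natural or already-flat chords are unchanged


class TestToFlatEquiv(unittest.TestCase):
    def test_valid_inputs(self):
        # A few of the good cases from the original test
        cases = (
            ('C', 'C'),
            ('C#m', 'Dbm'),
            ('Db7', 'Db7'),
            ('F#mM7', 'GbmM7'),
            ('A#+', 'Bb+'),
            (None, None),
            ('R#', 'R#'),
        )
        for chord, expected in cases:
            with self.subTest(chord=chord):
                self.assertEqual(to_flat_equiv(chord), expected)

    def test_invalid_inputs(self):
        # The illegal inputs now live in their own test method
        for letter in 'BEPQaz@':
            with self.subTest(letter=letter):
                self.assertRaises(NoteError, to_flat_equiv, letter + '#')
```

&lt;p&gt;Saved as, say, &lt;code&gt;test_flat.py&lt;/code&gt; (a hypothetical name), this can be run with
&lt;code&gt;python -m unittest test_flat&lt;/code&gt;. With this layout a regression in the error
handling shows up as a failure in &lt;code&gt;test_invalid_inputs&lt;/code&gt; alone, and each
&lt;code&gt;subTest&lt;/code&gt; failure names the offending input.&lt;/p&gt;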
&lt;p&gt;In 2020, if we're going to share the test data between implementations,
it's hard to look beyond JSON. Here's an extract from a file
&lt;code&gt;scale-tests.json&lt;/code&gt; that encapsulates the inputs and expected outputs
for all the tests above:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;transpose&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Db&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Db&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;D&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;D&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;D#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Eb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Eb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;D#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Eb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;B&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;B&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;E&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;G&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;B&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;B&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;G&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;E&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Gb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Eb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Gb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;G&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;B&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;E&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;B&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;G&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Gb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Gb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;flat_equivs&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C#m&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Dbm&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Db7&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Db7&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C#M7&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;DbM7&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Do&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Do&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;D#M&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;EbM&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;E9&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;E9&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;FmM7&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;FmM7&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F#mM7&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;GbmM7&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;G&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;G&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;G#11&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Ab11&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Ab11&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Ab11&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Am11&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Am11&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A#+&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb+&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb+&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bb+&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A♯+&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;B♭+&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;B♭+&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;B♭+&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;R&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;R&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;R#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;R#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Rm&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Rm&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;flat_equiv_bads&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;BEPQaz@&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I have a function that uses &lt;code&gt;json.load&lt;/code&gt; to read this and
other test data, storing the results in an object with a &lt;code&gt;.scale&lt;/code&gt; attribute,
like so:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;moveablechords.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReadJSONTestData&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pprint&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pprint&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TestData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ReadJSONTestData&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TestData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;transpose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;[[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Db&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Db&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;D&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;D&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;D#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Eb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Eb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;D#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Eb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;E&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;E&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Gb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Eb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Gb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;E&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Gb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Gb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I have also made it so you can get entries using attribute lookup on
the objects, i.e. &lt;code&gt;TestData.scale.transpose&lt;/code&gt; rather than
&lt;code&gt;TestData.scale['transpose']&lt;/code&gt;, just because it looks more elegant
and readable to me.&lt;/p&gt;
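&lt;p&gt;The implementation of &lt;code&gt;ReadJSONTestData&lt;/code&gt; isn't shown here, but one way to get attribute lookup on JSON-loaded data is a small &lt;code&gt;dict&lt;/code&gt; subclass passed as the &lt;code&gt;object_hook&lt;/code&gt; to the &lt;code&gt;json&lt;/code&gt; module. The following is a minimal sketch under that assumption; the name &lt;code&gt;AttrDict&lt;/code&gt; and the inline JSON string are hypothetical and purely for illustration:&lt;/p&gt;

```python
import json


class AttrDict(dict):
    """Hypothetical helper: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # so dict methods such as .keys() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# A tiny stand-in for the real test data file (an assumption for this sketch).
raw = '{"scale": {"transpose": [[["C", 0], "C"], [["C", 1, "F"], "Db"]]}}'

# object_hook is applied to every JSON object, including nested ones.
data = json.loads(raw, object_hook=AttrDict)

same = data.scale.transpose == data["scale"]["transpose"]
first_case, first_expected = data.scale.transpose[0]
```

&lt;p&gt;Because &lt;code&gt;AttrDict&lt;/code&gt; is still a dictionary, both spellings work on the same object, so existing code that uses subscripts need not change.&lt;/p&gt;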
&lt;p&gt;A straightforward refactoring of the &lt;code&gt;testTranspose&lt;/code&gt; function to use the
JSON-loaded data in &lt;code&gt;TestData.scale&lt;/code&gt; would be&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testTranspose2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TestData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                             &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                             &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In case this isn't self-explanatory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The loop runs over the cases and expected values, so on the first
   iteration &lt;code&gt;case&lt;/code&gt; is &lt;code&gt;["C", 0]&lt;/code&gt; and &lt;code&gt;expected&lt;/code&gt; is &lt;code&gt;'C'&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;The assignments set the &lt;code&gt;note&lt;/code&gt; and &lt;code&gt;offset&lt;/code&gt; variables; if the
   list is of length three the &lt;code&gt;key&lt;/code&gt; variable is also set;&lt;/li&gt;
&lt;li&gt;As discussed above, rather than just using things like
   &lt;code&gt;self.assertEqual(transpose(note, offset), expected)&lt;/code&gt;,
   we're including the &lt;code&gt;case&lt;/code&gt; (the tuple of input parameters) on both sides
   of the assertion so that if there's a failure, we can see which
   case is failing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can simplify this further since the &lt;code&gt;transpose&lt;/code&gt; function has only
one optional (keyword) argument, &lt;code&gt;key&lt;/code&gt;, which can also be provided as a third
positional argument. Assuming we don't specifically need to test the
handling of &lt;code&gt;key&lt;/code&gt; as a keyword argument, we can combine the two branches
as follows:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testTranspose3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TestData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here, we're using the &lt;code&gt;*&lt;/code&gt; operator to unpack&lt;sup id="fnref:splat"&gt;&lt;a class="footnote-ref" href="#fn:splat"&gt;5&lt;/a&gt;&lt;/sup&gt; &lt;code&gt;case&lt;/code&gt; into an argument
list for the &lt;code&gt;transpose&lt;/code&gt; function.&lt;/p&gt;
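&lt;p&gt;To see why one call covers both branches, here is a small sketch using a hypothetical stand-in function, &lt;code&gt;describe&lt;/code&gt;, that has the same shape of signature as described for &lt;code&gt;transpose&lt;/code&gt; (two required arguments and an optional &lt;code&gt;key&lt;/code&gt;) and simply returns what it received:&lt;/p&gt;

```python
def describe(note, offset, key=None):
    """Hypothetical stand-in for transpose: reports the arguments it was given."""
    return (note, offset, key)


two = ["C", 1]
three = ["C", 1, "F"]

# Unpacking a two-element list leaves key at its default.
with_default = describe(*two)    # equivalent to describe("C", 1)

# Unpacking a three-element list supplies key positionally.
with_key = describe(*three)      # equivalent to describe("C", 1, "F")
```

&lt;p&gt;The length of each case therefore decides whether &lt;code&gt;key&lt;/code&gt; is passed, which is what the explicit &lt;code&gt;if&lt;/code&gt;/&lt;code&gt;else&lt;/code&gt; was doing before.&lt;/p&gt;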
&lt;h3 id="adding-tdda-support"&gt;Adding TDDA Support&lt;/h3&gt;
&lt;p&gt;It probably hasn't escaped your attention that this third version of
&lt;code&gt;testTranspose&lt;/code&gt; is rather generic: the same structure would work
for &lt;em&gt;any&lt;/em&gt; function &lt;code&gt;f&lt;/code&gt; and list of input-output pairs &lt;code&gt;Pairs&lt;/code&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testAnyOldFunction_f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Pairs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This makes it fairly easy to add TDDA support. I added prototype
support for this that allows us to use an even shorter version of the
test:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testTranspose4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;checkFunctionByArgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TestData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This new &lt;code&gt;checkFunctionByArgs&lt;/code&gt; takes a function to test and a list of
input-output pairs and runs a slightly fancier version of
&lt;code&gt;testAnyOldFunction_f&lt;/code&gt;. I'll go into extensions in another post, but the
most important difference is that it will report all failures rather
than stopping at the first one.&lt;/p&gt;
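&lt;p&gt;The idea of collecting every failure before reporting can be sketched in a few lines. This is not TDDA's implementation of &lt;code&gt;checkFunctionByArgs&lt;/code&gt;; the helper names &lt;code&gt;collect_failures&lt;/code&gt; and &lt;code&gt;format_failures&lt;/code&gt;, and the toy function &lt;code&gt;add&lt;/code&gt;, are assumptions made for illustration only:&lt;/p&gt;

```python
def collect_failures(f, pairs):
    """Run f on every case and return (case, actual, expected) for each mismatch."""
    failures = []
    for case, expected in pairs:
        actual = f(*case)
        if actual != expected:
            # Record the failure and carry on, rather than raising immediately.
            failures.append((case, actual, expected))
    return failures


def format_failures(name, failures):
    """Render the collected failures as one readable report string."""
    lines = []
    for case, actual, expected in failures:
        args = ", ".join(repr(a) for a in case)
        lines.append(f"Case {name}({args}): failure.")
        lines.append(f"    Actual: {actual!r}")
        lines.append(f"  Expected: {expected!r}")
    return "\n".join(lines)


def add(a, b=0):
    """Toy function under test (hypothetical)."""
    return a + b


# The second and third expected values are deliberately wrong.
pairs = [[[1, 2], 3], [[1], 99], [[2, 2], 5]]
failures = collect_failures(add, pairs)
report = format_failures("add", failures)
```

&lt;p&gt;Inside a &lt;code&gt;unittest&lt;/code&gt; test method, a single assertion that &lt;code&gt;failures&lt;/code&gt; is empty, with &lt;code&gt;report&lt;/code&gt; as the message, would then surface all the failing cases at once.&lt;/p&gt;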
&lt;p&gt;We can illustrate this by changing the first and last cases
in &lt;code&gt;TestData.scale['transpose']&lt;/code&gt; to be incorrect, say:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;[[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Z&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;A#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;F&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Zb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If we run &lt;code&gt;testTranspose2&lt;/code&gt; using this modified test data,
we get only the first failing case,
and although the test case is listed in the output,
the output isn't particularly easy to
&lt;a href="https://en.wikipedia.org/wiki/Grok"&gt;grok&lt;/a&gt;.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python3 testscale.py
.....F.......
&lt;span class="o"&gt;======================================================================&lt;/span&gt;
FAIL: testTranspose2 &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;testscale.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;22&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; testTranspose2
    &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;, expected&lt;span class="o"&gt;))&lt;/span&gt;
AssertionError: Tuples differ: &lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; !&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;Z&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

First differing element &lt;span class="m"&gt;1&lt;/span&gt;:
&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;Z&amp;#39;&lt;/span&gt;

- &lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
?             ^

+ &lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;Z&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
?             ^


----------------------------------------------------------------------
Ran &lt;span class="m"&gt;13&lt;/span&gt; tests &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.002s

FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But if we use TDDA's prototype &lt;code&gt;checkFunctionByArgs&lt;/code&gt; functionality,
we see both failures, shown in a more digestible format:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python3 testscale.py
.....

Case transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: failure.
    Actual: &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;
  Expected: &lt;span class="s1"&gt;&amp;#39;Z&amp;#39;&lt;/span&gt;


Case transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;, -4, &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: failure.
    Actual: &lt;span class="s1"&gt;&amp;#39;Gb&amp;#39;&lt;/span&gt;
  Expected: &lt;span class="s1"&gt;&amp;#39;Zb&amp;#39;&lt;/span&gt;
F.......
&lt;span class="o"&gt;======================================================================&lt;/span&gt;
FAIL: testTranspose4 &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;testscale.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;15&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; testTranspose4
    self.checkFunctionByArgs&lt;span class="o"&gt;(&lt;/span&gt;transpose, TestData.scale.transpose&lt;span class="o"&gt;)&lt;/span&gt;
  File &lt;span class="s2"&gt;&amp;quot;/Users/njr/python/tdda/tdda/referencetest/referencetest.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;899&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; checkFunctionByArgs
    self._check_failures&lt;span class="o"&gt;(&lt;/span&gt;failures, msgs&lt;span class="o"&gt;)&lt;/span&gt;
  File &lt;span class="s2"&gt;&amp;quot;/Users/njr/python/tdda/tdda/referencetest/referencetest.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;919&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; _check_failures
    self.assert_fn&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;, msgs.message&lt;span class="o"&gt;())&lt;/span&gt;
AssertionError: False is not &lt;span class="nb"&gt;true&lt;/span&gt; :

Case transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: failure.
    Actual: &lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;
  Expected: &lt;span class="s1"&gt;&amp;#39;Z&amp;#39;&lt;/span&gt;


Case transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;, -4, &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: failure.
    Actual: &lt;span class="s1"&gt;&amp;#39;Gb&amp;#39;&lt;/span&gt;
  Expected: &lt;span class="s1"&gt;&amp;#39;Zb&amp;#39;&lt;/span&gt;

----------------------------------------------------------------------
Ran &lt;span class="m"&gt;13&lt;/span&gt; tests &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.001s

FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The failures currently get shown twice, once during execution of the
tests and again at the end in the summary, and the test just counts this
as a single failure, though these are both things that could be changed.&lt;/p&gt;
&lt;p&gt;There are variant forms of the prototype checking function above to handle
keyword arguments only and mixed positional and keyword arguments.
There's also a version specifically for single-argument functions,
where it's natural to write the argument as a simple value rather than as a tuple.&lt;/p&gt;
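&lt;p&gt;The idea behind these checking functions can be sketched in a few lines: run every case, collect the mismatches, and report them together. This is a minimal illustration, not the code in the TDDA library; the names &lt;code&gt;check_by_args&lt;/code&gt;, &lt;code&gt;check_by_kwargs&lt;/code&gt; and &lt;code&gt;add&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
def check_by_args(func, cases):
    """Call func(*args) for every (args, expected) pair; collect all failures."""
    failures = []
    for args, expected in cases:
        actual = func(*args)
        if actual != expected:
            failures.append((tuple(args), actual, expected))
    return failures


def check_by_kwargs(func, cases):
    """Keyword-only variant: each case is a (kwargs dict, expected) pair."""
    failures = []
    for kwargs, expected in cases:
        actual = func(**kwargs)
        if actual != expected:
            failures.append((dict(kwargs), actual, expected))
    return failures


def add(a, b=0):
    # Hypothetical function under test.
    return a + b


positional_failures = check_by_args(add, [([1, 2], 3), ([2, 2], 5), ([0], 1)])
keyword_failures = check_by_kwargs(add, [({"a": 1, "b": 1}, 2), ({"a": 1}, 0)])

# Report every failing positional case, not just the first one.
for args, actual, expected in positional_failures:
    shown = ", ".join(repr(a) for a in args)
    print(f"Case add({shown}): failure.")
    print(f"    Actual: {actual!r}")
    print(f"  Expected: {expected!r}")
```

&lt;p&gt;A real test would finish by asserting that the list of failures is empty, so that the whole batch still registers as a single pass or fail.&lt;/p&gt;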
&lt;h3 id="is-this-a-good-idea"&gt;Is this a Good Idea?&lt;/h3&gt;
&lt;p&gt;I think the potential benefits of sharing data between different
implementations of the same project are pretty clear. I haven't
actually modified the Swift implementation to use the JSON, but I'm
sure doing so will be easy and a clear win. I hope the example above
also illustrates that good support from testing frameworks can
significantly mitigate the downsides of looping over test cases within
a single test function. But there are other potential downsides.&lt;/p&gt;
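&lt;p&gt;To make the sharing concrete, here is a rough sketch of reading cases from JSON, assuming each case is stored as a pair of an argument list and an expected value, as in the test data shown earlier. The &lt;code&gt;rotate&lt;/code&gt; function and the inline JSON string are stand-ins invented for this example; a real project would read the JSON from a file used by every implementation.&lt;/p&gt;

```python
import json

# Hypothetical shared test data: a list of [args, expected] pairs.
SHARED_CASES = """
[
    [["abc", 1], "bca"],
    [["abc", 0], "abc"],
    [["abc", -1], "cab"]
]
"""


def rotate(text, n):
    """Hypothetical function under test: rotate text left by n places."""
    n = n % len(text)
    return text[n:] + text[:n]


cases = json.loads(SHARED_CASES)
# Unpack each argument list into the call and compare with the expected value.
results = [rotate(*args) == expected for args, expected in cases]
print(all(results))
```

&lt;p&gt;Because JSON has only lists, strings, numbers, booleans and null, this layout can be read by any language with a JSON parser, at the cost of restricting arguments and results to those types.&lt;/p&gt;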
&lt;p&gt;The most obvious problem, to me, is that the separation of the test
data from the test makes it harder to see what's being tested (and
perhaps means you have to trust the framework more, though that is
quite easy to check). Arguably, this is even more true when the test
is reduced to the one-line form in &lt;code&gt;testTranspose4&lt;/code&gt;, rather than the longer
form in &lt;code&gt;testTranspose2&lt;/code&gt;, where the function arguments are unpacked
and named, so that you can see a bit more of what is actually being
passed into the function.&lt;/p&gt;
&lt;p&gt;There's a broader point about the utility of tests as a form of
documentation. A web search for &lt;a href="https://duckduckgo.com/?q=externalizing+test+data"&gt;externalizing test
data&lt;/a&gt; uncovered
&lt;a href="https://www.theserverside.com/discussions/thread/36356.html"&gt;this
post&lt;/a&gt;
from Arvind Patil in 2005 in which he proposes something like the scheme
here for Java (with XML taking the place of JSON, in 2005, of course).
Three replies to the post are quite hostile, including the first, from
Irakli Nadareishvili, who says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;sorry, but this is a quite dangerous anti-pattern. Unit-tests are
not simply for testing a piece of code. They carry several,
additional, very important roles. One of them is - documentation.&lt;/p&gt;
&lt;p&gt;In a well-tested code, unit-tests are the first examples of API
usage (API that they test). A TDD-experienced developer can learn
a lot about the API, looking at its unit-tests. For the
readability and clarity of what unit-test tests, it is very
important that test data is in the code and the reader does not
have to consistently hop from a configuration file to the test
code.&lt;/p&gt;
&lt;p&gt;Also, usually boundary conditions for a code (which is what test
data commonly is) almost never change, so there is more harm in
this "pattern" than gain, indeed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is definitely a reasonable concern. Even if code has good documentation,
it is all too common for it to become out of date, whereas (passing) tests,
almost by definition, tend to stay up-to-date with API changes.
We could mitigate this issue quite a lot by hooking into verbose mode
(&lt;code&gt;-v&lt;/code&gt; or &lt;code&gt;--verbose&lt;/code&gt;) and having it show each call as well as the test
function being run, which seems like a good idea anyway. At the moment, if
you run the &lt;code&gt;scale&lt;/code&gt; tests in my chord project with &lt;code&gt;-v&lt;/code&gt;, you get
output like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python3 testscale.py -v
testAsSmallestIntervals &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
testDeMinorMajors &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
testFretForNoteOnString &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
testNotePairIntervals &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
testRelMajor &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
testTranspose &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_are_not_same &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_are_same &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_are_same_invalids &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_flat_equiv &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_flat_equiv_bads &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_preferred_equiv &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_preferred_equiv_bads &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok

----------------------------------------------------------------------
Ran &lt;span class="m"&gt;13&lt;/span&gt; tests &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.002s

OK
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;but we could (probably) extend this to something more like:&lt;sup id="fnref:OK"&gt;&lt;a class="footnote-ref" href="#fn:OK"&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python3 testscale.py -v
testAsSmallestIntervals &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
testDeMinorMajors &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
testFretForNoteOnString &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
testNotePairIntervals &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
testRelMajor &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
testTranspose &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ...
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;1&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C#&amp;#39;&lt;/span&gt;, -1&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Db&amp;#39;&lt;/span&gt;, -1&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;D&amp;#39;&lt;/span&gt;, -2&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;3&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;3&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;D#&amp;#39;&lt;/span&gt;, -3&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Eb&amp;#39;&lt;/span&gt;, -3&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, -1&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, -2&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, -2, &lt;span class="s1"&gt;&amp;#39;E&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;, -3&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;, -4&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;4&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;E&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;, -4&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;, -4&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;, -4, &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;, -4, &lt;span class="s1"&gt;&amp;#39;Eb&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;G&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F#&amp;#39;&lt;/span&gt;, &lt;span class="m"&gt;4&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;E&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;, -4&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;, -4&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;, -4&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Bb&amp;#39;&lt;/span&gt;, -4, &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
    transpose&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A#&amp;#39;&lt;/span&gt;, -4, &lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;: OK
... testTranspose &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt;: &lt;span class="m"&gt;36&lt;/span&gt; tests: ... ok
test_are_not_same &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_are_same &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_are_same_invalids &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_flat_equiv &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_flat_equiv_bads &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_preferred_equiv &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok
test_preferred_equiv_bads &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestScale&lt;span class="o"&gt;)&lt;/span&gt; ... ok

----------------------------------------------------------------------
Ran &lt;span class="m"&gt;49&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; cases across &lt;span class="m"&gt;13&lt;/span&gt; tests &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.002s

OK
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I also found &lt;a href="https://softwareengineering.stackexchange.com/questions/301117/sharing-unit-tests-between-several-language-implementations-of-one-spec"&gt;this
post&lt;/a&gt;
from Jeremy Wadhams in 2015 on the subject of &lt;em&gt;Sharing unit tests
between several language implementations of one spec&lt;/em&gt;. It discusses
&lt;a href="https://github.com/jwadhams/json-logic-js/blob/master/tests/tests.js"&gt;JsonLogic&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;JsonLogic is a data format (built on top of JSON) for storing and
sharing rules between front-end and back-end code. It's essential
that the same rule returns the same result whether executed by the
JavaScript client or the PHP client.&lt;/p&gt;
&lt;p&gt;Currently the JavaScript client has tests in QUnit, and the PHP
client has tests in PHPunit. The vast majority of tests are "given
these inputs (rule and data), assert the output equals the expected
result."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Jeremy also &lt;a href="https://softwareengineering.stackexchange.com/a/301118"&gt;suggests&lt;/a&gt; something very like the scheme above, again using JSON.&lt;/p&gt;
&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;I think this has been quite a promising experiment.
It reduced the length of &lt;code&gt;testscale.py&lt;/code&gt; from 223 lines to 75, which wasn't
an aim (and carries the potential issues noted above), but which does
make the scope and structure of the tests easier to understand.
It also achieved the primary goal of allowing test data to be shared
between implementations, which seems like a valuable prize.
Eventually, the project might gain a command line in both implementations,
and that will potentially enable my favourite mode of testing: pairs
of input command lines and expected output. But this is a useful start.&lt;/p&gt;
&lt;p&gt;Meanwhile, I will probably refine (and document and test!) the prototype
implementations a bit more and then release them.&lt;/p&gt;
&lt;p&gt;If you have thoughts, do get in touch.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:callable"&gt;
&lt;p&gt;or, more generally, this callable.&amp;#160;&lt;a class="footnote-backref" href="#fnref:callable" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:teardown"&gt;
&lt;p&gt;other than, perhaps, any manual teardown in a &lt;code&gt;try...finally&lt;/code&gt;
block.&amp;#160;&lt;a class="footnote-backref" href="#fnref:teardown" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:ordereddict"&gt;
&lt;p&gt;From Python 3.7 on, all Python dictionaries are guaranteed by the language to preserve insertion order.
This was already the case in CPython from 3.6 onwards, as an implementation detail.&amp;#160;&lt;a class="footnote-backref" href="#fnref:ordereddict" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:nobsharp"&gt;
&lt;p&gt;The function does not accept &lt;code&gt;B#&lt;/code&gt; or &lt;code&gt;E#&lt;/code&gt;, even though
musically these can be used as alternatives to &lt;code&gt;C&lt;/code&gt; and &lt;code&gt;F&lt;/code&gt; respectively.
That is outside the scope of this function.&amp;#160;&lt;a class="footnote-backref" href="#fnref:nobsharp" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:splat"&gt;
&lt;p&gt;this operation is sometimes called &lt;a href="https://stackoverflow.com/questions/2322355/proper-name-for-python-operator"&gt;&lt;em&gt;splatting&lt;/em&gt;&lt;/a&gt;, and sometimes
unsplatting or desplatting.&amp;#160;&lt;a class="footnote-backref" href="#fnref:splat" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:OK"&gt;
&lt;p&gt;Would I seem like a very old fuddy-duddy if I ask "who writes 'ok' in lower case anyway?"&amp;#160;&lt;a class="footnote-backref" href="#fnref:OK" title="Jump back to footnote 6 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tests"></category><category term="reference tests"></category><category term="data"></category></entry><entry><title>Reference Testing Exercise 2 (pytest flavour)</title><link href="https://tdda.info/reference-testing-exercise-2-pytest-flavour.html" rel="alternate"></link><published>2019-10-31T09:30:00+00:00</published><updated>2019-10-31T09:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2019-10-31:/reference-testing-exercise-2-pytest-flavour.html</id><summary type="html">&lt;iframe width="1038" height="649" src="https://www.youtube.com/embed/pwLRUbAcUDU" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;This exercise (video 2m 58s)
shows a powerful way to run only a single test, or some subset of tests,
by using the &lt;code&gt;@tag&lt;/code&gt; decorator available in the TDDA library.
This is useful for speeding up the test cycle and allowing you to focus
on a single test, or a …&lt;/p&gt;</summary><content type="html">&lt;iframe width="1038" height="649" src="https://www.youtube.com/embed/pwLRUbAcUDU" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;This exercise (video 2m 58s)
shows a powerful way to run only a single test, or some subset of tests,
by using the &lt;code&gt;@tag&lt;/code&gt; decorator available in the TDDA library.
This is useful for speeding up the test cycle and allowing you to focus
on a single test, or a few tests.
We will also see, in the next exercise, how it can be used to update
test results more easily and safely when expected behaviour changes.&lt;/p&gt;
&lt;p&gt;(If you do not currently use &lt;code&gt;pytest&lt;/code&gt; for writing
tests, you might prefer the
&lt;a href="/ref/exercise2u"&gt;unittest-flavoured version&lt;/a&gt;
of this exercise, since &lt;code&gt;unittest&lt;/code&gt; is in Python's standard library.)&lt;/p&gt;
&lt;h3 id="prerequisites"&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;★ You need to have the TDDA Python library
(version 1.0.31 or newer)
installed; see &lt;a href="/pages/installation"&gt;installation&lt;/a&gt;.
Use&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;tdda version
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;to check the version that you have.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Copy the exercises (if you don't already have them)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You need to change to some directory in which you're happy to create three
directories with data. We use &lt;code&gt;~/tmp&lt;/code&gt; for this. Then copy the
example code.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ &lt;span class="nb"&gt;cd&lt;/span&gt; ~/tmp
$ tdda examples    &lt;span class="c1"&gt;# copy the example code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Go to the exercise files and examine them:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ &lt;span class="nb"&gt;cd&lt;/span&gt; referencetest_examples/exercises-pytest/exercise2  &lt;span class="c1"&gt;# Go to exercise2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As in the first exercise, you should have at least the following
four files:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ ls
conftest.py expected.html   generators.py   test_all.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;conftest.py&lt;/code&gt; is configuration to extend &lt;code&gt;pytest&lt;/code&gt; with
   &lt;code&gt;referencetest&lt;/code&gt; capabilities,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;expected.html&lt;/code&gt; contains the expected output from one test,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;generators.py&lt;/code&gt; contains the code to be tested,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;test_all.py&lt;/code&gt; contains the tests.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you look at &lt;code&gt;test_all.py&lt;/code&gt;, you'll see it contains five test functions.
Only one of the tests is useful
(&lt;code&gt;testExampleStringGeneration&lt;/code&gt;) with all the others making
manifestly true assertions and most of them deliberately
wasting time to simulate annoyingly slow tests.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;generators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testZero&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testOne&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testTwo&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testThree&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Run the tests, which should be slow and produce one failure&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ pytest           &lt;span class="c1"&gt;#  This will work with Python 3 or Python 2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When you run the tests, you should get a single failure, that being
the non-trivial test &lt;code&gt;testExampleStringGeneration&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The output will be something like:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;=============================&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; session &lt;span class="nv"&gt;starts&lt;/span&gt; &lt;span class="o"&gt;==============================&lt;/span&gt;
test_all.py ..F..

&lt;span class="o"&gt;[&lt;/span&gt;...details of &lt;span class="nb"&gt;test&lt;/span&gt; failure...&lt;span class="o"&gt;]&lt;/span&gt;

&lt;span class="o"&gt;======================&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; failed, &lt;span class="m"&gt;4&lt;/span&gt; passed &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;.17 &lt;span class="nv"&gt;seconds&lt;/span&gt; &lt;span class="o"&gt;======================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We get a test failure because we haven't added the &lt;code&gt;ignore_substrings&lt;/code&gt;
parameter that, as we saw in &lt;a href="/ref/exercise1p"&gt;Exercise 1&lt;/a&gt;,
is needed for it to pass.&lt;/p&gt;
&lt;p&gt;The tests should take slightly over 6 seconds in total to run,
because of the three annoyingly slow tests with sleep statements
in them—&lt;code&gt;testOne&lt;/code&gt;, &lt;code&gt;testTwo&lt;/code&gt; and &lt;code&gt;testThree&lt;/code&gt;.
(If you're not annoyed by a 6-second delay, increase the sleep time in
one of the "sleepy" tests until you are annoyed!)&lt;/p&gt;
&lt;p&gt;The point of this exercise is to show some simple but very useful
functionality for running only tests on which we wish to focus,
such as our failing test.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4: Tag the failing test using @tag&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The TDDA library includes a function called &lt;code&gt;tag&lt;/code&gt;;
this is a decorator function&lt;sup id="fnref:decorator"&gt;&lt;a class="footnote-ref" href="#fn:decorator"&gt;1&lt;/a&gt;&lt;/sup&gt;
that we can put before individual tests,
to mark them as being of special interest temporarily.&lt;/p&gt;
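&lt;p&gt;To make the idea of a tagging decorator concrete, here is a minimal sketch of how such a decorator could work. It is not the TDDA implementation: the name &lt;code&gt;sketch_tag&lt;/code&gt;, the attribute name &lt;code&gt;_tagged&lt;/code&gt; and the helper &lt;code&gt;is_tagged&lt;/code&gt; are all assumptions made purely for illustration.&lt;/p&gt;

```python
def sketch_tag(obj):
    """Hypothetical stand-in for a tagging decorator (not TDDA's code)."""
    obj._tagged = True   # attribute name is an assumption for this sketch
    return obj           # hand back the same function (or class), now marked


@sketch_tag
def test_of_interest():
    assert 1 + 1 == 2


def test_other():
    assert True


def is_tagged(obj):
    # Untagged objects simply lack the marker attribute
    return getattr(obj, "_tagged", False)


# A runner could use the marker to pick out only the tests of interest
tagged_names = [
    f.__name__ for f in (test_of_interest, test_other) if is_tagged(f)
]
print(tagged_names)
```

&lt;p&gt;Because the decorator returns the object it was given, the decorated test behaves exactly as before; only the extra attribute distinguishes it.&lt;/p&gt;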
&lt;p&gt;Edit &lt;code&gt;test_all.py&lt;/code&gt; to add an &lt;code&gt;import&lt;/code&gt;
statement to bring in &lt;code&gt;tag&lt;/code&gt; from the TDDA library,
and then decorate the definition of &lt;code&gt;testExampleStringGeneration&lt;/code&gt; by preceding it
with &lt;code&gt;@tag&lt;/code&gt; as follows:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.referencetest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testZero&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testOne&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="nd"&gt;@tag&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Run only the tagged test&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Having tagged the failing test, if we run the tests again adding
&lt;code&gt;--tagged&lt;/code&gt; to the command, it will run only the tagged test, and
take hardly any time. The (abbreviated) output should be something like&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;=============================&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; session &lt;span class="nv"&gt;starts&lt;/span&gt; &lt;span class="o"&gt;==============================&lt;/span&gt;
$ pytest --tagged
test_all.py F

&lt;span class="o"&gt;[&lt;/span&gt;...details of &lt;span class="nb"&gt;test&lt;/span&gt; failure...&lt;span class="o"&gt;]&lt;/span&gt;

&lt;span class="o"&gt;===========================&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; failed &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.16 &lt;span class="nv"&gt;seconds&lt;/span&gt; &lt;span class="o"&gt;===========================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can tag as many tests as we like, across any number of test files,
to run a subset of tests, rather than a single one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 6: Locating @tag decorators&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In a typical debugging or test development cycle in which you have
been using the &lt;code&gt;@tag&lt;/code&gt; decorator to focus on just a few failing tests,
you might end up with &lt;code&gt;@tag&lt;/code&gt; decorations scattered across several
files, perhaps in multiple directories.&lt;/p&gt;
&lt;p&gt;Although it's not hard to use &lt;code&gt;grep&lt;/code&gt; or &lt;code&gt;grep -r&lt;/code&gt; to find them, the library
can do this for you. If you add the &lt;code&gt;--istagged&lt;/code&gt; flag, then
instead of running the tests, the library will report which tests in
which files are tagged. So in our case:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;$&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;istagged&lt;/span&gt;
&lt;span class="o"&gt;=============================&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="n"&gt;starts&lt;/span&gt; &lt;span class="o"&gt;==============================&lt;/span&gt;
&lt;span class="n"&gt;platform&lt;/span&gt; &lt;span class="n"&gt;darwin&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="mf"&gt;3.7.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;4.4.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.8.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pluggy&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.9.0&lt;/span&gt;
&lt;span class="n"&gt;rootdir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;njr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;referencetest_examples&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;exercises&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;exercise2&lt;/span&gt;
&lt;span class="n"&gt;collecting&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;test_all&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testExampleStringGeneration&lt;/span&gt;
&lt;span class="n"&gt;collected&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;

&lt;span class="o"&gt;=========================&lt;/span&gt; &lt;span class="n"&gt;no&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt; &lt;span class="n"&gt;ran&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt; &lt;span class="o"&gt;=========================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Obviously, in the case of a single test file, this is not a big deal,
but if you have dozens or hundreds of source files, in a directory
hierarchy, and have tagged a few functions across them, it becomes
significantly more helpful.&lt;/p&gt;
&lt;h3 id="recap-what-we-have-seen"&gt;Recap: What we have seen&lt;/h3&gt;
&lt;p&gt;This simple exercise has shown how we can easily run subsets of tests
by &lt;em&gt;tagging&lt;/em&gt; them and then using &lt;code&gt;--tagged&lt;/code&gt; to run only tagged tests.&lt;/p&gt;
&lt;p&gt;In this case, the motivation was simply to save time and reduce clutter
in the output, focusing on one test, or a small number of tests.&lt;/p&gt;
&lt;p&gt;In Exercise 3, we will see how this combines with the ability
to automatically regenerate updated reference outputs to make for
a safe and efficient way to update tests after code changes.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:decorator"&gt;
&lt;p&gt;Decorator functions in Python are functions that are used
to transform other functions: they take a function as an argument and
return a new function that modifies the original in some way. Our decorator
function &lt;code&gt;tag&lt;/code&gt; is called by writing &lt;code&gt;@tag&lt;/code&gt; on the line before a function
(or class) definition, and the effect of this is that the function returned
by &lt;code&gt;@tag&lt;/code&gt; replaces the function (or class) it precedes. In our case, all
&lt;code&gt;@tag&lt;/code&gt; does is set an attribute on the function in question so that the
TDDA reference test framework can identify it as a &lt;code&gt;tagged&lt;/code&gt; function,
and choose to run only tagged tests when so requested.&amp;#160;&lt;a class="footnote-backref" href="#fnref:decorator" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="reference test"></category><category term="exercise"></category><category term="screencast"></category><category term="video"></category><category term="pytest"></category></entry><entry><title>Reference Testing Exercise 2 (unittest flavour)</title><link href="https://tdda.info/reference-testing-exercise-2-unittest-flavour.html" rel="alternate"></link><published>2019-10-30T08:30:00+00:00</published><updated>2019-10-30T08:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2019-10-30:/reference-testing-exercise-2-unittest-flavour.html</id><summary type="html">&lt;iframe width="1038" height="649" src="https://www.youtube.com/embed/JD_Ke-oweA8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;This exercise (video 3m 34s)
shows a powerful way to run only a single test, or some subset of tests,
by using the &lt;code&gt;@tag&lt;/code&gt; decorator available in the TDDA library.
This is useful for speeding up the test cycle and allowing you to focus
on a single test, or a …&lt;/p&gt;</summary><content type="html">&lt;iframe width="1038" height="649" src="https://www.youtube.com/embed/JD_Ke-oweA8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;This exercise (video 3m 34s)
shows a powerful way to run only a single test, or some subset of tests,
by using the &lt;code&gt;@tag&lt;/code&gt; decorator available in the TDDA library.
This is useful for speeding up the test cycle and allowing you to focus
on a single test, or a few tests.
We will also see, in the next exercise, how it can be used to update
test results more easily and safely when expected behaviour changes.&lt;/p&gt;
&lt;p&gt;(If you use &lt;code&gt;pytest&lt;/code&gt; for writing tests,
you might prefer the
&lt;a href="/ref/exercise1p"&gt;pytest-flavoured version&lt;/a&gt;
of this exercise.)&lt;/p&gt;
&lt;h3 id="prerequisites"&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;★ You need to have the TDDA Python library
(version 1.0.31 or newer)
installed; see &lt;a href="/pages/installation"&gt;installation&lt;/a&gt;.
Use&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;tdda version
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;to check the version that you have.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Copy the exercises (if you don't already have them)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You need to change to some directory in which you're happy to create three
directories with data. We use &lt;code&gt;~/tmp&lt;/code&gt; for this. Then copy the
example code.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ &lt;span class="nb"&gt;cd&lt;/span&gt; ~/tmp
$ tdda examples    &lt;span class="c1"&gt;# copy the example code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Go to the exercise files and examine them:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ &lt;span class="nb"&gt;cd&lt;/span&gt; referencetest_examples/exercises-unittest/exercise2  &lt;span class="c1"&gt;# Go to exercise2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As in the first exercise, you should have at least the following
three files:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ ls
expected.html   generators.py   test_all.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;expected.html&lt;/code&gt; contains the expected output from one test,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;generators.py&lt;/code&gt; contains the code to be tested,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;test_all.py&lt;/code&gt; contains the tests.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you look at &lt;code&gt;test_all.py&lt;/code&gt;, you'll see it contains two test classes
with five tests between them. Only one of the tests is useful
(&lt;code&gt;testExampleStringGeneration&lt;/code&gt;) with all the others making
manifestly true assertions and deliberately wasting time
to simulate annoyingly slow tests.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.referencetest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;generators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestQuickThings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testZero&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertIsNone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestSuperSlowThings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testOne&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testTwo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testThree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Run the tests, which should be slow and produce one failure&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python test_all.py   &lt;span class="c1"&gt;#  This will work with Python 3 or Python2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When you run the tests, you should get a single failure, that being
the non-trivial test &lt;code&gt;testExampleStringGeneration&lt;/code&gt; from the class
&lt;code&gt;TestQuickThings&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The output will be:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;F....

&lt;span class="o"&gt;[&lt;/span&gt;...details of &lt;span class="nb"&gt;test&lt;/span&gt; failure...&lt;span class="o"&gt;]&lt;/span&gt;

Ran &lt;span class="m"&gt;5&lt;/span&gt; tests &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;.007s
FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We get a test failure because we haven't added the &lt;code&gt;ignore_substrings&lt;/code&gt;
parameter that, as we saw in &lt;a href="/ref/exercise1u"&gt;Exercise 1&lt;/a&gt;,
is needed for it to pass.&lt;/p&gt;
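&lt;p&gt;As a reminder of why ignoring substrings helps, the following sketch compares two multi-line strings while skipping lines that mention volatile text. It is only an illustration of the idea, not TDDA's implementation: the function &lt;code&gt;strings_match&lt;/code&gt;, the sample strings and the substring &lt;code&gt;Version&lt;/code&gt; are all hypothetical.&lt;/p&gt;

```python
def strings_match(actual, expected, ignore_substrings=()):
    """Line-by-line comparison that skips lines containing volatile text.

    Illustrative only; the real library's matching rules may differ.
    """
    actual_lines = actual.splitlines()
    expected_lines = expected.splitlines()
    if len(actual_lines) != len(expected_lines):
        return False
    for a, e in zip(actual_lines, expected_lines):
        # Skip any line whose expected text mentions an ignored substring
        if any(s in e for s in ignore_substrings):
            continue
        if a != e:
            return False
    return True


# Hypothetical reference output and a fresh run that differs only in version
expected = "Report\nVersion 1.0.0\nTotal: 42"
actual = "Report\nVersion 1.0.1\nTotal: 42"

strict = strings_match(actual, expected)
relaxed = strings_match(actual, expected, ignore_substrings=["Version"])
print(strict, relaxed)
```

&lt;p&gt;The strict comparison fails on the version line, while the relaxed one passes because that line is excluded from the comparison.&lt;/p&gt;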
&lt;p&gt;The tests should take slightly over 6 seconds in total to run,
because of the annoyingly slow tests in &lt;code&gt;TestSuperSlowThings&lt;/code&gt;.
(If you're not annoyed by a 6-second delay, increase the sleep time in
one of the "slow" tests until you are annoyed!)&lt;/p&gt;
&lt;p&gt;The point of this exercise is to show some simple but very useful
functionality for running only tests on which we wish to focus,
such as our failing test.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4: Tag the failing test using @tag&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you look at the &lt;code&gt;import&lt;/code&gt; statements, you'll see that as well as
&lt;code&gt;ReferenceTestCase&lt;/code&gt; we also import &lt;code&gt;tag&lt;/code&gt;.
This is a decorator function&lt;sup id="fnref:decorator"&gt;&lt;a class="footnote-ref" href="#fn:decorator"&gt;1&lt;/a&gt;&lt;/sup&gt;
that we can put before individual tests, or test classes,
to indicate that they are of special interest temporarily.&lt;/p&gt;
&lt;p&gt;Edit &lt;code&gt;test_all.py&lt;/code&gt; to decorate the failing test by adding &lt;code&gt;@tag&lt;/code&gt; on the
line before it, thus:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestQuickThings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="nd"&gt;@tag&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testZero&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertIsNone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Run only the tagged test&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Having tagged the failing test, if we run the tests again adding
&lt;code&gt;-1&lt;/code&gt; (the digit one, for "single", not the letter ell)
to the command, it will run only the tagged test, and
take hardly any time. The (abbreviated) output should be something like&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python test_all.py -1
F

&lt;span class="o"&gt;[&lt;/span&gt;...details of &lt;span class="nb"&gt;test&lt;/span&gt; failure...&lt;span class="o"&gt;]&lt;/span&gt;

Ran &lt;span class="m"&gt;1&lt;/span&gt; tests &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.006s
FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can also use &lt;code&gt;--tagged&lt;/code&gt; instead of &lt;code&gt;-1&lt;/code&gt; if you like more descriptive
flags.&lt;/p&gt;
&lt;p&gt;We can tag as many tests as we like, across any number of test files,
and we can also tag whole classes
by placing the &lt;code&gt;@tag&lt;/code&gt; decorator before a test class definition.
So if we instead use:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@tag&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestQuickThings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testZero&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertIsNone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and run the tests with &lt;code&gt;-1&lt;/code&gt;, we will get output more like:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python test_all.py -1
F.

&lt;span class="o"&gt;[&lt;/span&gt;...details of &lt;span class="nb"&gt;test&lt;/span&gt; failure...&lt;span class="o"&gt;]&lt;/span&gt;

Ran &lt;span class="m"&gt;2&lt;/span&gt; tests &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.006s
FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this case, both the tests in our first test class were run,
but no others (and, in particular, not our painfully slow tests!)&lt;/p&gt;
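&lt;p&gt;The selection behaviour described above, where a tagged method selects just that test and a tagged class selects all of its tests, can be sketched with the standard &lt;code&gt;unittest&lt;/code&gt; module alone. This is a hedged illustration rather than how TDDA does it: &lt;code&gt;sketch_tag&lt;/code&gt;, the &lt;code&gt;_tagged&lt;/code&gt; attribute, &lt;code&gt;tagged_suite&lt;/code&gt; and the test classes are hypothetical names.&lt;/p&gt;

```python
import unittest


def sketch_tag(obj):
    """Hypothetical tagging decorator: marks a test method or class."""
    obj._tagged = True
    return obj


class TestQuick(unittest.TestCase):
    @sketch_tag
    def testA(self):
        self.assertTrue(True)

    def testB(self):
        self.assertIsNone(None)


@sketch_tag
class TestWholeClass(unittest.TestCase):
    def testC(self):
        self.assertEqual(1, 1)

    def testD(self):
        self.assertEqual(2, 2)


def tagged_suite(*classes):
    """Build a suite holding only tagged tests, or all tests of tagged classes."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for cls in classes:
        class_tagged = getattr(cls, "_tagged", False)
        for name in loader.getTestCaseNames(cls):
            method = getattr(cls, name)
            if class_tagged or getattr(method, "_tagged", False):
                suite.addTest(cls(name))
    return suite


suite = tagged_suite(TestQuick, TestWholeClass)
# Record the selected test names before running the suite
selected = sorted(test.id().split(".")[-1] for test in suite)

result = unittest.TestResult()
suite.run(result)
print(selected, result.testsRun)
```

&lt;p&gt;Here &lt;code&gt;testB&lt;/code&gt; is left out because neither it nor its class is tagged, mirroring the way the slow tests are skipped in the exercise.&lt;/p&gt;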
&lt;p&gt;&lt;strong&gt;Step 6: Locating @tag decorators&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In a typical debugging or test development cycle in which you have
been using the &lt;code&gt;@tag&lt;/code&gt; decorator to focus on just a few failing tests,
you might end up with &lt;code&gt;@tag&lt;/code&gt; decorations scattered across several
files, perhaps in multiple directories. (We're assuming here you have
&lt;code&gt;test_all.py&lt;/code&gt; or similar that imports all the other test classes so
you can easily run them all together.)&lt;/p&gt;
&lt;p&gt;Although it's not hard to use &lt;code&gt;grep&lt;/code&gt; or &lt;code&gt;grep -r&lt;/code&gt; to find them, the library
can do this for you. If you use the &lt;code&gt;-0&lt;/code&gt; flag (the digit zero,
for "no tests"), or the &lt;code&gt;--istagged&lt;/code&gt; flag, then
instead of running the tests, the library will report which test classes in
which files have tagged tests. So in our case:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python test_all.py -0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;produces:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;__main__.TestQuickThings
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here, &lt;code&gt;__main__&lt;/code&gt; stands for the current file; other files would be
referenced by their imported name.&lt;/p&gt;
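&lt;p&gt;The &lt;code&gt;__main__&lt;/code&gt; prefix comes from standard Python behaviour: a class defined in the file being run directly has its &lt;code&gt;__module__&lt;/code&gt; set to &lt;code&gt;__main__&lt;/code&gt;, while a class defined in an imported file carries that module's name. The sketch below shows one plausible way to build such a label; the helper &lt;code&gt;qualified_name&lt;/code&gt; is hypothetical, and whether TDDA constructs its report this way is an assumption.&lt;/p&gt;

```python
class TestQuickThings:
    """Placeholder class standing in for a real test class."""


def qualified_name(cls):
    # Join the defining module's name and the class name with a dot
    return f"{cls.__module__}.{cls.__name__}"


label = qualified_name(TestQuickThings)
print(label)
```

&lt;p&gt;Run directly as a script, this prints a label beginning with &lt;code&gt;__main__&lt;/code&gt;; imported from another file, the prefix would be that file's module name instead.&lt;/p&gt;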
&lt;h3 id="recap-what-we-have-seen"&gt;Recap: What we have seen&lt;/h3&gt;
&lt;p&gt;This simple exercise has shown how we can easily run subsets of tests
by &lt;em&gt;tagging&lt;/em&gt; them and then using the &lt;code&gt;-1&lt;/code&gt; flag (or &lt;code&gt;--tagged&lt;/code&gt;)
to run only tagged tests.&lt;/p&gt;
&lt;p&gt;In this case, the motivation was simply to save time and reduce clutter
in the output, focusing on one test, or a small number of tests.&lt;/p&gt;
&lt;p&gt;In Exercise 3, we will see how this combines with the ability
to automatically regenerate updated reference outputs to make for
a safe and efficient way to update tests after code changes.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:decorator"&gt;
&lt;p&gt;Decorator functions in Python are functions that are used
to transform other functions: they take a function as an argument and
return a new function that modifies the original in some way. Our decorator
function &lt;code&gt;tag&lt;/code&gt; is called by writing &lt;code&gt;@tag&lt;/code&gt; on the line before a function
(or class) definition, and the effect of this is that the function returned
by &lt;code&gt;@tag&lt;/code&gt; replaces the function (or class) it precedes. In our case, all
&lt;code&gt;@tag&lt;/code&gt; does is set an attribute on the function in question so that the
TDDA reference test framework can identify it as a &lt;code&gt;tagged&lt;/code&gt; function,
and choose to run only tagged tests when so requested.&amp;#160;&lt;a class="footnote-backref" href="#fnref:decorator" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="reference test"></category><category term="exercise"></category><category term="screencast"></category><category term="video"></category><category term="unittest"></category></entry><entry><title>Reference Testing Exercise 1 (pytest flavour)</title><link href="https://tdda.info/reference-testing-exercise-1-pytest-flavour.html" rel="alternate"></link><published>2019-10-29T12:00:00+00:00</published><updated>2019-10-29T12:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2019-10-29:/reference-testing-exercise-1-pytest-flavour.html</id><summary type="html">&lt;iframe width="1038" height="649" src="https://www.youtube.com/embed/HSQxKKgiCEU" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;This exercise (video 8m 54s) shows how to migrate a test from using &lt;code&gt;pytest&lt;/code&gt;
directly to exploiting the &lt;code&gt;referencetest&lt;/code&gt; capabilities in
the TDDA library.
(If you do not currently use &lt;code&gt;pytest&lt;/code&gt; for writing
tests, you might prefer the
&lt;a href="/ref/exercise1u"&gt;unittest-flavoured version&lt;/a&gt;
of this exercise, since &lt;code&gt;unittest&lt;/code&gt; is in Python's standard …&lt;/p&gt;</summary><content type="html">&lt;iframe width="1038" height="649" src="https://www.youtube.com/embed/HSQxKKgiCEU" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;This exercise (video 8m 54s) shows how to migrate a test from using &lt;code&gt;pytest&lt;/code&gt;
directly to exploiting the &lt;code&gt;referencetest&lt;/code&gt; capabilities in
the TDDA library.
(If you do not currently use &lt;code&gt;pytest&lt;/code&gt; for writing
tests, you might prefer the
&lt;a href="/ref/exercise1u"&gt;unittest-flavoured version&lt;/a&gt;
of this exercise, since &lt;code&gt;unittest&lt;/code&gt; is in Python's standard library.)&lt;/p&gt;
&lt;p&gt;We will see how even simple use of &lt;code&gt;referencetest&lt;/code&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;makes it much easier to see how tests have failed when complex
   outputs are generated&lt;/li&gt;
&lt;li&gt;helps us to update reference outputs (the expected values)
   when we have verified that a new behaviour is correct&lt;/li&gt;
&lt;li&gt;allows us easily to write tests of code whose outputs are not
   identical from run to run. We do this by specifying exclusions
   from the comparisons used in assertions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="prerequisites"&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;★ You need to have the TDDA Python library installed
(version 1.0.31 or newer);
see &lt;a href="/pages/installation"&gt;installation&lt;/a&gt;.
Use&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;tdda version
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;to check the version that you have.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Copy the exercises&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You need to change to some directory in which you're happy to create three
new directories with data. We use &lt;code&gt;~/tmp&lt;/code&gt; for this. Then copy the
example code.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ &lt;span class="nb"&gt;cd&lt;/span&gt; ~/tmp
$ tdda examples    &lt;span class="c1"&gt;# copy the example code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Go to the exercise files and examine them:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ &lt;span class="nb"&gt;cd&lt;/span&gt; referencetest_examples/exercises-pytest/exercise1  &lt;span class="c1"&gt;# Go to exercise1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You should have at least the following four files:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ ls
conftest.py expected.html   generators.py   test_all.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;generators.py&lt;/code&gt; contains a function called &lt;code&gt;generate_string&lt;/code&gt;
   that, when called, returns HTML text suitable for viewing
   as a web page.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;expected.html&lt;/code&gt; is the result of calling that function, saved
   to file&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;test_all.py&lt;/code&gt; contains a single &lt;code&gt;pytest&lt;/code&gt;-based test of that function.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;conftest.py&lt;/code&gt; imports key &lt;code&gt;referencetest&lt;/code&gt; functionality from the &lt;code&gt;tdda&lt;/code&gt;
   library into &lt;code&gt;pytest&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It's probably useful to look at the web page &lt;code&gt;expected.html&lt;/code&gt; in a browser,
either by navigating to it in a file browser and double clicking it,
or by using&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;open expected.html
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;if your OS supports this. As you can see, it's just some text and an
image. The image is an inline SVG vector image, generated along with
the text.&lt;/p&gt;
&lt;p&gt;Also have a look at the test code. The core part of it is very short:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;generators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The code&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;calls &lt;code&gt;generate_string()&lt;/code&gt; to create the content&lt;/li&gt;
&lt;li&gt;stores its output in the variable &lt;code&gt;actual&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;reads the &lt;em&gt;expected&lt;/em&gt; content into the variable &lt;code&gt;expected&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;asserts that the two strings are the same.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Step 3. Run the test, which should fail&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ pytest      &lt;span class="c1"&gt;#  This will work whether pytest uses Python 2 or Python 3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You should get a failure, and pytest tries quite hard to show what's causing
the failure:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;===================================&lt;/span&gt; &lt;span class="nv"&gt;FAILURES&lt;/span&gt; &lt;span class="o"&gt;===================================&lt;/span&gt;
_________________________ testExampleStringGeneration __________________________

    def testExampleStringGeneration&lt;span class="o"&gt;()&lt;/span&gt;:
        &lt;span class="nv"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; generate_string&lt;span class="o"&gt;()&lt;/span&gt;
        with open&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; as f:
            &lt;span class="nv"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; f.read&lt;span class="o"&gt;()&lt;/span&gt;
&amp;gt;       assert &lt;span class="nv"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; expected
E       AssertionError: assert &lt;span class="s1"&gt;&amp;#39;&amp;lt;!DOCTYPE ht...y&amp;gt;\n&amp;lt;/html&amp;gt;\n&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;lt;!DOCTYPE htm...y&amp;gt;\n&amp;lt;/html&amp;gt;\n&amp;#39;&lt;/span&gt;
E         Skipping &lt;span class="m"&gt;69&lt;/span&gt; identical leading characters &lt;span class="k"&gt;in&lt;/span&gt; diff, use -v to show
E         -  Solutions, &lt;span class="m"&gt;2016&lt;/span&gt;
E         +  Solutions Limited, &lt;span class="m"&gt;2016&lt;/span&gt;
E         ?           ++++++++
E         -     Version &lt;span class="m"&gt;1&lt;/span&gt;.0.0
E         ?             ^
E         +     Version &lt;span class="m"&gt;0&lt;/span&gt;.0.0...
E
E         ...Full output truncated &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;31&lt;/span&gt; lines hidden&lt;span class="o"&gt;)&lt;/span&gt;, use &lt;span class="s1"&gt;&amp;#39;-vv&amp;#39;&lt;/span&gt; to show

test_all.py:24: &lt;span class="nv"&gt;AssertionError&lt;/span&gt;
&lt;span class="o"&gt;===========================&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; failed &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.11 &lt;span class="nv"&gt;seconds&lt;/span&gt; &lt;span class="o"&gt;===========================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can certainly see that there's a difference in the version number in
the output and also in a line including &lt;code&gt;2016&lt;/code&gt; (a copyright notice, in fact).&lt;/p&gt;
&lt;p&gt;But it also says:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...&lt;span class="nv"&gt;Full&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;output&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;truncated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;lines&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;hidden&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;use&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;-vv&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;show&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and if you do that, the output becomes a bit overwhelming.&lt;/p&gt;
&lt;p&gt;We'll convert the test to use the TDDA library's &lt;code&gt;referencetest&lt;/code&gt; and
see how that helps.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4. Change the code to use referencetest.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The key change we need to make is to the assertion, which will now be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ref.assertStringCorrect(actual, 'expected.html')&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;ref&lt;/code&gt; is an object made available by &lt;code&gt;conftest.py&lt;/code&gt;, and is passed into our test
function by &lt;code&gt;pytest&lt;/code&gt;. We therefore need to change the function declaration to
take &lt;code&gt;ref&lt;/code&gt; as an argument:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;def testExampleStringGeneration(ref):&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, because &lt;code&gt;assertStringCorrect&lt;/code&gt; compares a string in memory
to content from a file, we don't need the lines in the middle that
read the file:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Delete the middle two lines of the test function.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;generators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 5. Run the modified test&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ pytest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You should see very different output that includes, near the end,
something like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;AssertionError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;different&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;starting&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Expected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Compare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;folders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;zv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;xvhmvpj0216687_pk__2f5h0000gn&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Compare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;folders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;zv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;xvhmvpj0216687_pk__2f5h0000gn&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;folders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;zv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;xvhmvpj0216687_pk__2f5h0000gn&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;njr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;referencetest&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;referencepytest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;187&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AssertionError&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(You will probably need to scroll right to see all of the message on this page.)&lt;/p&gt;
&lt;p&gt;Because the test failed, the TDDA library has written a copy of the
actual output to a file to make it easy for us to examine it and to use &lt;code&gt;diff&lt;/code&gt;
commands to see how it actually differs from what we expected. (In fact,
it has written out two copies, a "raw" one and a "post-processed" one, but we
haven't used any processing, so they will be the same in our case. We
therefore ignore the second diff command suggested for now.)&lt;/p&gt;
&lt;p&gt;It's also given us the precise &lt;code&gt;diff&lt;/code&gt; command we need to see the differences
between our actual and expected output.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 6. Copy the first &lt;code&gt;diff&lt;/code&gt; command and run it. You should see something similar to this:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ diff /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html expected.html
&lt;span class="m"&gt;5&lt;/span&gt;,6c5,6
&amp;lt;     Copyright &lt;span class="o"&gt;(&lt;/span&gt;c&lt;span class="o"&gt;)&lt;/span&gt; Stochastic Solutions, &lt;span class="m"&gt;2016&lt;/span&gt;
&amp;lt;     Version &lt;span class="m"&gt;1&lt;/span&gt;.0.0
---
&amp;gt;     Copyright &lt;span class="o"&gt;(&lt;/span&gt;c&lt;span class="o"&gt;)&lt;/span&gt; Stochastic Solutions Limited, &lt;span class="m"&gt;2016&lt;/span&gt;
&amp;gt;     Version &lt;span class="m"&gt;0&lt;/span&gt;.0.0
35c35
&amp;lt; &amp;lt;/html&amp;gt;
&lt;span class="se"&gt;\ &lt;/span&gt;No newline at end of file
---
&amp;gt; &amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(If you have a visual diff tool, you can also use that. For example,
on a Mac, if you have Xcode installed, you should have the
&lt;code&gt;opendiff&lt;/code&gt; command available.)&lt;/p&gt;
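&lt;p&gt;If no &lt;code&gt;diff&lt;/code&gt; command is to hand, a similar comparison can be sketched with Python's standard &lt;code&gt;difflib&lt;/code&gt; module. This is only an illustration and is not part of the TDDA library; the helper name &lt;code&gt;diff_text&lt;/code&gt; and the short sample strings are assumptions made for this example.&lt;/p&gt;

```python
import difflib


def diff_text(actual, expected):
    """Return unified-diff lines describing how actual differs from expected.

    Hypothetical helper for illustration; not part of the tdda library.
    """
    return list(difflib.unified_diff(
        actual.splitlines(keepends=True),
        expected.splitlines(keepends=True),
        fromfile="actual",
        tofile="expected",
    ))


# Short stand-ins for the generated string and the reference file content.
actual = "Title\nVersion 1.0.0\nBody\n"
expected = "Title\nVersion 0.0.0\nBody\n"

# Lines only in actual are prefixed with "-", lines only in expected with "+".
print("".join(diff_text(actual, expected)))
```

&lt;p&gt;To apply this to the exercise, you would read the two files named in the suggested &lt;code&gt;diff&lt;/code&gt; command and pass their contents to the helper.&lt;/p&gt;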
&lt;p&gt;The diff makes it clear that there are three differences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The copyright notice has changed slightly&lt;/li&gt;
&lt;li&gt;The version number has changed&lt;/li&gt;
&lt;li&gt;The string doesn't have a newline at the end, whereas the file does.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The copyright and version number lines are both in comments in the HTML,
so they don't affect the rendering at all. You might want to confirm this: if
you look at the actual file it saved (&lt;code&gt;/var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html&lt;/code&gt;, the first file in the diff command),
you should see that it looks identical.&lt;/p&gt;
&lt;p&gt;In this case, therefore, we might now feel that we should simply
update &lt;code&gt;expected.html&lt;/code&gt; with what &lt;code&gt;generate_string()&lt;/code&gt; is now
producing. It would be (by design) extremely easy to change the &lt;code&gt;diff&lt;/code&gt;
in the command it gave us to &lt;code&gt;cp&lt;/code&gt; to achieve that.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;However&lt;/strong&gt;, there's a better thing we can do in this case.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 7. Specify exclusions&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Standing back, it seems likely that the version number and copyright
line written to comments in the HTML will change periodically.
If those are the only differences between our expected output and what we
actually generate, we'd probably prefer the test didn't fail.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ref.assertStringCorrect&lt;/code&gt; function from &lt;code&gt;referencetest&lt;/code&gt; gives us
several mechanisms for specifying changes that can be ignored when
checking whether a string is correct. The simplest one, which will be
enough for our example, is just to specify strings which, if they
occur on a line in the output, cause differences in those lines to be
ignored, so that the assertion doesn't fail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 7a. Add the &lt;code&gt;ignore_substrings&lt;/code&gt; parameter to &lt;code&gt;assertStringCorrect&lt;/code&gt; as follows:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;ignore_substrings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Copyright&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Version&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
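&lt;p&gt;To make the idea of excluding lines concrete, here is a minimal pure-Python sketch of a line-by-line comparison that tolerates differences on lines containing given substrings. It is not the TDDA library's implementation: the function name &lt;code&gt;differing_lines&lt;/code&gt;, the rule that a match in either line excuses the difference, and the shortened sample strings are all assumptions made for illustration.&lt;/p&gt;

```python
def differing_lines(actual, expected, ignore_substrings=()):
    """Return 1-based numbers of lines that differ and are not excused.

    Hypothetical helper for illustration; not part of the tdda library.
    """
    actual_lines = actual.splitlines()
    expected_lines = expected.splitlines()
    if len(actual_lines) != len(expected_lines):
        raise ValueError("sketch only handles texts with equal line counts")
    failures = []
    for number, (a, e) in enumerate(zip(actual_lines, expected_lines), start=1):
        if a == e:
            continue
        # Assumption: a difference is excused when either line contains
        # one of the ignored substrings.
        if any(s in a or s in e for s in ignore_substrings):
            continue
        failures.append(number)
    return failures


# Shortened stand-ins for the generated string and the reference file.
actual = "Copyright (c) Stochastic Solutions, 2016\nVersion 1.0.0\nHello\n"
expected = "Copyright (c) Stochastic Solutions Limited, 2016\nVersion 0.0.0\nHello\n"

print(differing_lines(actual, expected))                            # [1, 2]
print(differing_lines(actual, expected, ["Copyright", "Version"]))  # []
```

&lt;p&gt;With no exclusions, both changed lines are reported; with the two substrings supplied, nothing is reported, which is the behaviour we want from the modified assertion.&lt;/p&gt;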

&lt;p&gt;&lt;strong&gt;Step 7b. Run the test again. It should now pass:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ &lt;span class="nv"&gt;pytest&lt;/span&gt;
&lt;span class="o"&gt;=============================&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; session &lt;span class="nv"&gt;starts&lt;/span&gt; &lt;span class="o"&gt;==============================&lt;/span&gt;

test_all.py .                                                            &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;%&lt;span class="o"&gt;]&lt;/span&gt;

&lt;span class="o"&gt;===========================&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; passed &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.04 &lt;span class="nv"&gt;seconds&lt;/span&gt; &lt;span class="o"&gt;===========================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="recap-what-we-have-seen"&gt;Recap: What we have seen&lt;/h3&gt;
&lt;p&gt;We've seen&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Converting standard &lt;code&gt;pytest&lt;/code&gt;-based tests to use &lt;code&gt;referencetest&lt;/code&gt; is
     straightforward.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When we do that, we gain access to powerful new kinds of assertion
     such as &lt;code&gt;assertStringCorrect&lt;/code&gt;. Among the immediate benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When there is a failure, it saves the failing output
     to a temporary file&lt;/li&gt;
&lt;li&gt;It tells you the exact &lt;code&gt;diff&lt;/code&gt; command you need to
     see the differences&lt;/li&gt;
&lt;li&gt;This also makes it very easy to copy the new "known good"
     answer into place if you've verified that the new answer
     is now correct. (In fact, the library also has a more powerful
     way to do this, as we'll see in a later exercise).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;ref.assertStringCorrect&lt;/code&gt; function also has a number of mechanisms
     for allowing specific expected differences to occur without
     causing the test to fail. The simplest of these mechanisms
     is the &lt;code&gt;ignore_substrings&lt;/code&gt; keyword argument we used here.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;</content><category term="TDDA"></category><category term="reference test"></category><category term="exercise"></category><category term="screencast"></category><category term="video"></category><category term="pytest"></category></entry><entry><title>Reference Testing Exercise 1 (unittest flavour)</title><link href="https://tdda.info/reference-testing-exercise-1-unittest-flavour.html" rel="alternate"></link><published>2019-10-28T08:15:00+00:00</published><updated>2019-10-28T08:15:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2019-10-28:/reference-testing-exercise-1-unittest-flavour.html</id><summary type="html">&lt;iframe width="1038" height="649" src="https://www.youtube.com/embed/WUwCEZ6ufN8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;This exercise (video 8m 53s)
shows how to migrate a test from using &lt;code&gt;unittest&lt;/code&gt;
directly to exploiting the &lt;code&gt;referencetest&lt;/code&gt; capabilities in
the TDDA library.
(If you use &lt;code&gt;pytest&lt;/code&gt; for writing tests,
you might prefer the
&lt;a href="/ref/exercise1p"&gt;pytest-flavoured version&lt;/a&gt;
of this exercise.)&lt;/p&gt;
&lt;p&gt;We will see how even simple use of &lt;code&gt;referencetest …&lt;/code&gt;&lt;/p&gt;</summary><content type="html">&lt;iframe width="1038" height="649" src="https://www.youtube.com/embed/WUwCEZ6ufN8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;This exercise (video 8m 53s)
shows how to migrate a test from using &lt;code&gt;unittest&lt;/code&gt;
directly to exploiting the &lt;code&gt;referencetest&lt;/code&gt; capabilities in
the TDDA library.
(If you use &lt;code&gt;pytest&lt;/code&gt; for writing tests,
you might prefer the
&lt;a href="/ref/exercise1p"&gt;pytest-flavoured version&lt;/a&gt;
of this exercise.)&lt;/p&gt;
&lt;p&gt;We will see how even simple use of &lt;code&gt;referencetest&lt;/code&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;makes it much easier to see how tests have failed when complex
   outputs are generated&lt;/li&gt;
&lt;li&gt;helps us to update reference outputs (the expected values)
   when we have verified that a new behaviour is correct&lt;/li&gt;
&lt;li&gt;allows us easily to write tests of code whose outputs are not
   identical from run to run. We do this by specifying exclusions
   from the comparisons used in assertions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="prerequisites"&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;★ You need to have the TDDA Python library
(version 1.0.31 or newer)
installed; see &lt;a href="/pages/installation"&gt;installation&lt;/a&gt;.
Use&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;tdda version
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;to check the version that you have.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Copy the exercises&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You need to change to some directory in which you're happy to create three
new directories with data. We use &lt;code&gt;~/tmp&lt;/code&gt; for this. Then copy the
example code.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ &lt;span class="nb"&gt;cd&lt;/span&gt; ~/tmp
$ tdda examples    &lt;span class="c1"&gt;# copy the example code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Go to the exercise files and examine them:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ &lt;span class="nb"&gt;cd&lt;/span&gt; referencetest_examples/exercises-unittest/exercise1  &lt;span class="c1"&gt;# Go to exercise1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You should have at least the following three files:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ ls
expected.html   generators.py   test_all.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;generators.py&lt;/code&gt; contains a function called &lt;code&gt;generate_string&lt;/code&gt;
   that, when called, returns HTML text suitable for viewing
   as a web page.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;expected.html&lt;/code&gt; is the result of calling that function, saved
   to file&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;test_all.py&lt;/code&gt; contains a single &lt;code&gt;unittest&lt;/code&gt;-based test of that function.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It's probably useful to look at the web page &lt;code&gt;expected.html&lt;/code&gt; in a browser,
either by navigating to it in a file browser and double clicking it,
or by using&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;open expected.html
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;if your OS supports this. As you can see, it's just some text and an
image. The image is an inline SVG vector image, generated along with
the text.&lt;/p&gt;
&lt;p&gt;Also have a look at the test code. The core part of it is very short:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;unittest&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;generators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestFileGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unittest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;unittest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The code&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;calls &lt;code&gt;generate_string()&lt;/code&gt; to create the content&lt;/li&gt;
&lt;li&gt;stores its output in the variable &lt;code&gt;actual&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;reads the &lt;em&gt;expected&lt;/em&gt; content into the variable &lt;code&gt;expected&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;asserts that the two strings are the same.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Step 3. Run the test, which should fail&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python test_all.py   &lt;span class="c1"&gt;#  This will work with Python 3 or Python2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You should get a failure, but it will probably be quite hard to see exactly
what the differences are.&lt;/p&gt;
&lt;p&gt;We'll convert the test to use the TDDA library's &lt;code&gt;referencetest&lt;/code&gt; module and
see how that helps.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4. Change the code to use referencetest.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;First we need our test to use &lt;code&gt;ReferenceTestCase&lt;/code&gt; from &lt;code&gt;tdda.referencetest&lt;/code&gt;
instead of &lt;code&gt;unittest.TestCase&lt;/code&gt;. &lt;code&gt;ReferenceTestCase&lt;/code&gt; is a subclass of
&lt;code&gt;unittest.TestCase&lt;/code&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Change the import statement to &lt;code&gt;from tdda.referencetest import ReferenceTestCase&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;unittest.TestCase&lt;/code&gt; with &lt;code&gt;ReferenceTestCase&lt;/code&gt; in the class declaration&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;unittest.main()&lt;/code&gt; with &lt;code&gt;ReferenceTestCase.main()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.referencetest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;generators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestFileGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you run this, its behaviour should be exactly the same, because
we haven't used any of the extra features of &lt;code&gt;tdda.referencetest&lt;/code&gt; yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 5. Change the assertion to use &lt;code&gt;assertStringCorrect&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;TDDA's &lt;code&gt;ReferenceTestCase&lt;/code&gt; provides the &lt;code&gt;assertStringCorrect&lt;/code&gt; method,
which expects as its first positional arguments an &lt;em&gt;actual&lt;/em&gt; string
and the path to a file containing the &lt;em&gt;expected&lt;/em&gt; result. So:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Change &lt;code&gt;assertEqual&lt;/code&gt; to &lt;code&gt;assertStringCorrect&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Change &lt;code&gt;expected&lt;/code&gt; to &lt;code&gt;expected.html&lt;/code&gt; as the second argument
    to the assertion&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delete the two lines reading the file and assigning to &lt;code&gt;expected&lt;/code&gt;
    as we no longer need that.&lt;/strong&gt;&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Step 6. Run the modified test&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python test_all.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You should see very different output, which includes, near the end,
something like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Expected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;Compare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;folders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;zv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;xvhmvpj0216687_pk__2f5h0000gn&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Compare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;folders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;zv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;xvhmvpj0216687_pk__2f5h0000gn&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;folders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;zv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;xvhmvpj0216687_pk__2f5h0000gn&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Because the test failed, the TDDA library has written a copy of the
actual output to file to make it easy for us to examine it and to use &lt;code&gt;diff&lt;/code&gt;
commands to see how it actually differs from what we expected. (In fact,
it has written out two copies, a "raw" one and a "post-processed" one, but we
haven't used any processing, so they will be the same in our case. So
we can ignore the second suggested diff command for now.)&lt;/p&gt;
&lt;p&gt;It's also given us the precise &lt;code&gt;diff&lt;/code&gt; command we need to see the differences
between our actual and expected output.&lt;/p&gt;
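&lt;p&gt;The mechanism just described can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's implementation: the function name &lt;code&gt;check_string_against_file&lt;/code&gt; and the &lt;code&gt;actual-&lt;/code&gt; file-name prefix are hypothetical, chosen for the example.&lt;/p&gt;

```python
import os
import tempfile


def check_string_against_file(actual, expected_path):
    """Return None if actual matches the file, else a message with a diff command.

    Hypothetical helper: a simplified illustration, not tdda's own code.
    """
    with open(expected_path) as f:
        expected = f.read()
    if actual == expected:
        return None
    # On failure, save the actual output so it can be inspected and diffed.
    actual_path = os.path.join(
        tempfile.gettempdir(), "actual-" + os.path.basename(expected_path)
    )
    with open(actual_path, "w") as f:
        f.write(actual)
    return "Compare with:\n    diff %s %s" % (actual_path, expected_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        expected_path = os.path.join(tmpdir, "expected.html")
        with open(expected_path, "w") as f:
            f.write("Version 0.0.0\n")
        # The strings differ, so a suggested diff command is printed.
        print(check_string_against_file("Version 1.0.0\n", expected_path))
```

&lt;p&gt;Writing the failing output to a file, rather than only printing it, is what makes the follow-up &lt;code&gt;diff&lt;/code&gt; (or a later &lt;code&gt;cp&lt;/code&gt;) a one-line operation.&lt;/p&gt;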
&lt;p&gt;&lt;strong&gt;Step 6a. Copy the first &lt;code&gt;diff&lt;/code&gt; command and run it. You should see something similar to this:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ diff /var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html expected.html
&lt;span class="m"&gt;5&lt;/span&gt;,6c5,6
&amp;lt;     Copyright &lt;span class="o"&gt;(&lt;/span&gt;c&lt;span class="o"&gt;)&lt;/span&gt; Stochastic Solutions, &lt;span class="m"&gt;2016&lt;/span&gt;
&amp;lt;     Version &lt;span class="m"&gt;1&lt;/span&gt;.0.0
---
&amp;gt;     Copyright &lt;span class="o"&gt;(&lt;/span&gt;c&lt;span class="o"&gt;)&lt;/span&gt; Stochastic Solutions Limited, &lt;span class="m"&gt;2016&lt;/span&gt;
&amp;gt;     Version &lt;span class="m"&gt;0&lt;/span&gt;.0.0
35c35
&amp;lt; &amp;lt;/html&amp;gt;
&lt;span class="se"&gt;\ &lt;/span&gt;No newline at end of file
---
&amp;gt; &amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(If you have a visual diff tool, you can also use that. For example,
on a Mac, if you have Xcode installed, you should have the
&lt;code&gt;opendiff&lt;/code&gt; command available.)&lt;/p&gt;
&lt;p&gt;The diff makes it clear that there are three differences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The copyright notice has changed slightly&lt;/li&gt;
&lt;li&gt;The version number has changed&lt;/li&gt;
&lt;li&gt;The string doesn't have a newline at the end, whereas the file does.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The copyright and version-number lines are both in comments in the HTML,
so they don't affect the rendering at all. You might want to confirm this by
looking at the actual file it saved (&lt;code&gt;/var/folders/zv/3xvhmvpj0216687_pk__2f5h0000gn/T/actual-raw-expected.html&lt;/code&gt;, the first file in the diff command):
it should look identical when rendered.&lt;/p&gt;
&lt;p&gt;In this case, therefore, we might now feel that we should simply
update &lt;code&gt;expected.html&lt;/code&gt; with what &lt;code&gt;generate_string()&lt;/code&gt; is now
producing. It would be (by design) extremely easy to change the &lt;code&gt;diff&lt;/code&gt;
in the command it gave us to &lt;code&gt;cp&lt;/code&gt; to achieve that.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;However&lt;/strong&gt;, there's a better thing we can do in this case.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 7. Specify exclusions&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Standing back, it seems likely that the version number and copyright
line written to comments in the HTML will change periodically.
If the only differences between our expected output and what we actually
generate are those, we'd probably prefer the test didn't fail.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;assertStringCorrect&lt;/code&gt; method from &lt;code&gt;referencetest&lt;/code&gt; gives us several
mechanisms for specifying changes that can be ignored when checking whether
a string is correct. The simplest one, which will be enough for our example,
is just to specify strings which, if they occur on a line in the output,
cause differences in those lines to be ignored, so that the assertion
doesn't fail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 7a. Add the &lt;code&gt;ignore_substrings&lt;/code&gt; parameter to &lt;code&gt;assertStringCorrect&lt;/code&gt; as follows:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;expected.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                 &lt;span class="n"&gt;ignore_substrings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Copyright&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Version&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 7b. Run the test again. It should now pass:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python3 test_all.py
.
----------------------------------------------------------------------
Ran &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.002s

OK
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
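&lt;p&gt;To make the effect of &lt;code&gt;ignore_substrings&lt;/code&gt; concrete, here is a minimal pure-Python sketch of a line-by-line comparison that skips any line containing one of the given substrings. It is a simplified illustration, not the library's implementation: the function name &lt;code&gt;differing_lines&lt;/code&gt; is hypothetical, and checking both the actual and the expected line for the substring is an assumption made for the example.&lt;/p&gt;

```python
from itertools import zip_longest


def differing_lines(actual, expected, ignore_substrings=()):
    """Return (line_number, actual_line, expected_line) for each differing line.

    Lines containing any of ignore_substrings are not reported.
    Hypothetical helper: a simplified illustration, not tdda's own code.
    """
    # splitlines() means a missing final newline does not count as a difference.
    actual_lines = actual.splitlines()
    expected_lines = expected.splitlines()
    diffs = []
    pairs = zip_longest(actual_lines, expected_lines)  # None marks a missing line
    for n, (a, e) in enumerate(pairs, start=1):
        if a == e:
            continue
        if a is not None and e is not None and any(
            s in a or s in e for s in ignore_substrings
        ):
            continue  # a difference on a line we agreed to ignore
        diffs.append((n, a, e))
    return diffs


if __name__ == "__main__":
    actual = "Copyright (c) Stochastic Solutions, 2016\nVersion 1.0.0\nbody"
    expected = (
        "Copyright (c) Stochastic Solutions Limited, 2016\nVersion 0.0.0\nbody\n"
    )
    print(differing_lines(actual, expected))
    print(differing_lines(actual, expected, ["Copyright", "Version"]))
```

&lt;p&gt;With no exclusions, the copyright and version lines are reported as differences; with both substrings supplied, nothing is reported, which mirrors the test going from failing to passing.&lt;/p&gt;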

&lt;h3 id="recap-what-we-have-seen"&gt;Recap: What we have seen&lt;/h3&gt;
&lt;p&gt;We've seen&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Converting &lt;code&gt;unittest&lt;/code&gt;-based tests to use &lt;code&gt;ReferenceTestCase&lt;/code&gt; is
     straightforward.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When we do that, we gain access to powerful new assert methods
     such as &lt;code&gt;assertStringCorrect&lt;/code&gt;. Among the immediate benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When there is a failure, this method saves the failing output
     to a temporary file&lt;/li&gt;
&lt;li&gt;It tells you the exact &lt;code&gt;diff&lt;/code&gt; command you need to
     see the differences&lt;/li&gt;
&lt;li&gt;This also makes it very easy to copy the new "known good"
     answer into place if you've verified that the new answer
     is now correct. (In fact, the library also has a more powerful
     way to do this, as we'll see in a later exercise).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;assertStringCorrect&lt;/code&gt; method also has a number of mechanisms
     for allowing specific expected differences to occur without
     causing the test to fail. The simplest of these mechanisms
     is the &lt;code&gt;ignore_substrings&lt;/code&gt; keyword argument we used here.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;</content><category term="TDDA"></category><category term="reference test"></category><category term="exercise"></category><category term="screencast"></category><category term="video"></category><category term="unittest"></category></entry><entry><title>Screencasts and Exercises</title><link href="https://tdda.info/screencasts-and-exercises.html" rel="alternate"></link><published>2019-10-25T15:00:00+01:00</published><updated>2019-10-25T15:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2019-10-25:/screencasts-and-exercises.html</id><summary type="html">&lt;p&gt;We've started producing a series of exercises for various aspects
of TDDA, available on the blog, with follow-along screencasts.&lt;/p&gt;
&lt;p&gt;There will be a series of posts about these,
starting on Monday (28th October).
There's a &lt;a href="https://www.youtube.com/channel/UCAwK_xYqaEL3lEOz4YUZmZw"&gt;YouTube channel&lt;/a&gt; as well, if you want to subscribe.&lt;/p&gt;
&lt;p&gt;The goal has been for each …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We've started producing a series of exercises for various aspects
of TDDA, available on the blog, with follow-along screencasts.&lt;/p&gt;
&lt;p&gt;There will be a series of posts about these,
starting on Monday (28th October).
There's a &lt;a href="https://www.youtube.com/channel/UCAwK_xYqaEL3lEOz4YUZmZw"&gt;YouTube channel&lt;/a&gt; as well, if you want to subscribe.&lt;/p&gt;
&lt;p&gt;The goal has been for each exercise to be as short and simple as it
can reasonably be while still covering useful aspects.&lt;/p&gt;
&lt;p&gt;The first set of exercises will cover the &lt;em&gt;reference testing&lt;/em&gt; capabilities
of TDDA, and at least some of them will be available both as
unittest-flavoured versions and pytest variants. If you don't currently
use either, you probably want to follow the &lt;code&gt;unittest&lt;/code&gt; variants, since
&lt;code&gt;unittest&lt;/code&gt; is part of Python's standard library.&lt;/p&gt;
&lt;p&gt;There's a page for the exercises at:&lt;/p&gt;
&lt;p&gt;&lt;a href="/exercises"&gt;tdda.info/exercises&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;which we'll try to keep up-to-date as we add more.&lt;/p&gt;
&lt;p&gt;Please note: if you want to do the exercises, you'll need the latest TDDA
release and, unfortunately, you'll probably need to update
each time we add new exercises, with something like&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip install -U tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;python3 -m pip install -U tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;depending on your setup. See the &lt;a href="/installation"&gt;installation instructions&lt;/a&gt; for details.&lt;/p&gt;</content><category term="TDDA"></category><category term="tests"></category><category term="screencast"></category><category term="video"></category><category term="exercises"></category></entry><entry><title>Installation</title><link href="https://tdda.info/installation.html" rel="alternate"></link><published>2019-10-24T15:00:00+01:00</published><updated>2019-10-24T15:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2019-10-24:/installation.html</id><summary type="html">&lt;p&gt;This post is a standing post that we plan to try to keep up to date,
describing options for obtaining the open-source Python
TDDA library that we maintain.&lt;/p&gt;
&lt;h2 id="using-pip-from-pypi"&gt;Using pip from PyPI&lt;/h2&gt;
&lt;p&gt;If you don't need source, and have Python installed, the easiest way
to get the TDDA library is …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This post is a standing post that we plan to try to keep up to date,
describing options for obtaining the open-source Python
TDDA library that we maintain.&lt;/p&gt;
&lt;h2 id="using-pip-from-pypi"&gt;Using pip from PyPI&lt;/h2&gt;
&lt;p&gt;If you don't need source, and have Python installed, the easiest way
to get the TDDA library is from the Python package index
&lt;a href="https://pypi.python.org/pypi/tdda"&gt;PyPI&lt;/a&gt;
using the &lt;a href="https://pip.pypa.io/en/stable/installing/"&gt;pip&lt;/a&gt;
utility.&lt;/p&gt;
&lt;p&gt;Assuming you have a working pip setup, you should be able to install
the tdda library by typing:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip install tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or, if your permissions don't allow use in this mode&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sudo pip install tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If &lt;code&gt;pip&lt;/code&gt; isn't working, or is associated with a different Python from the one you are using, try:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;python -m pip install tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sudo python -m pip install tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The tdda library supports both Python 3 (tested with 3.6 and 3.7) and Python 2 (tested with 2.7). (We'll start testing against 3.8 real soon!)&lt;/p&gt;
&lt;h2 id="upgrading"&gt;Upgrading&lt;/h2&gt;
&lt;p&gt;If you have a version of the &lt;code&gt;tdda&lt;/code&gt; library installed and want to upgrade it with pip, add &lt;code&gt;-U&lt;/code&gt; to one of the commands above, i.e. use whichever of the following you need for your setup:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip install -U tdda
sudo pip install -U tdda
python -m pip install -U tdda
sudo python -m pip install -U tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
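&lt;p&gt;If you are unsure whether an install or upgrade reached the Python you actually use, one way to check is to ask that interpreter which version of the package it can see. The sketch below uses the standard library's &lt;code&gt;importlib.metadata&lt;/code&gt;, so it assumes Python 3.8 or later (it will not work on Python 2); &lt;code&gt;installed_version&lt;/code&gt; is a hypothetical helper name used for illustration.&lt;/p&gt;

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string if tdda is installed for this interpreter,
    # or None if it is not.
    print(installed_version("tdda"))
```

&lt;p&gt;Run it with the same &lt;code&gt;python&lt;/code&gt; command you use for your tests, since a different interpreter may have a different set of installed packages.&lt;/p&gt;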

&lt;h2 id="installing-from-source"&gt;Installing from Source&lt;/h2&gt;
&lt;p&gt;The source for the tdda library is available from Github and can be
cloned with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;git clone https://github.com/tdda/tdda.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;clone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="nv"&gt;@github&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;com&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When installing from source, if you want the command line &lt;code&gt;tdda&lt;/code&gt; utility
to be available, you need to run&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;python setup.py install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;from the top-level tdda directory after downloading it.&lt;/p&gt;
&lt;h2 id="documentation"&gt;Documentation&lt;/h2&gt;
&lt;p&gt;The main documentation for the &lt;code&gt;tdda&lt;/code&gt; library is available on
&lt;a href="https://tdda.readthedocs.org"&gt;Read the Docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can also build it yourself if you have downloaded the source from Github.
In order to do this, you will need an installation of
&lt;a href="https://pypi.python.org/pypi/Sphinx"&gt;Sphinx&lt;/a&gt;.
The HTML documentation is built, starting from the top-level
&lt;code&gt;tdda&lt;/code&gt; directory by running:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;cd doc
make html
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="running-tddas-tests"&gt;Running TDDA's tests&lt;/h2&gt;
&lt;p&gt;Once you have installed TDDA (whether using pip or from source), you can
run its tests by typing&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;tdda test
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you have all the dependencies, including optional dependencies, installed,
you should get a line of dots and the message OK at the end, something like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ tdda test&lt;/span&gt;
&lt;span class="nt"&gt;........................................................................................................................&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 122 tests in 3&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;251s&lt;/span&gt;

&lt;span class="c"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you don't have some of the optional dependencies installed, some of the dots will be replaced by the letter 's'. For example:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ tdda test&lt;/span&gt;
&lt;span class="nt"&gt;.................................................................&lt;/span&gt;&lt;span class="c"&gt;s&lt;/span&gt;&lt;span class="nt"&gt;.............................&lt;/span&gt;&lt;span class="c"&gt;s&lt;/span&gt;&lt;span class="nt"&gt;........................&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 120 tests in 3&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;221s&lt;/span&gt;

&lt;span class="c"&gt;OK (skipped=2)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This does not indicate a problem; it simply means that some
functionality will be unavailable (usually support for one or more database types).&lt;/p&gt;
&lt;h2 id="using-the-tdda-examples"&gt;Using the TDDA examples&lt;/h2&gt;
&lt;p&gt;The tdda library includes three sets of examples, covering
&lt;a href="https://www.tdda.info/the-new-referencetest-class-for-tdda"&gt;reference testing&lt;/a&gt;,
&lt;a href="https://www.tdda.info/constraint-discovery-and-verification-for-pandas-dataframes"&gt;automatic constraint discovery and verification&lt;/a&gt;,
and
&lt;a href="https://www.tdda.info/introducing-rexpy-automatic-discovery-of-regular-expressions"&gt;Rexpy&lt;/a&gt;
(discovery of regular expressions from examples,
outside the context of constraints).&lt;/p&gt;
&lt;p&gt;The tdda command line can be used to copy the relevant files into place.
To get the examples, first change to a directory where you would like
them to be placed, and then use the command:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;tdda examples
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This should produce the following output:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Copied&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;referencetest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;./&lt;/span&gt;&lt;span class="n"&gt;referencetest&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;Copied&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;./&lt;/span&gt;&lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;Copied&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rexpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;./&lt;/span&gt;&lt;span class="n"&gt;rexpy&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="quick-reference-guides"&gt;Quick Reference Guides&lt;/h2&gt;
&lt;p&gt;There are quick reference guides available for the TDDA library.
These are often a little behind the current release, but are usually
still quite helpful.&lt;/p&gt;
&lt;p&gt;These are available from &lt;a href="https://www.tdda.info/pdf/tdda-quickref.pdf"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="online-tutorials"&gt;Online Tutorials&lt;/h2&gt;
&lt;p&gt;Various videos of tutorials, and accompanying slides, are available
&lt;a href="https://stochasticsolutions.com/talks/"&gt;online&lt;/a&gt;. Exercises with
screencasts are under development, and we hope to begin to release
these shortly.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="python"></category><category term="installation"></category></entry><entry><title>Rexpy for Generating Regular Expressions: Postcodes</title><link href="https://tdda.info/rexpy-for-generating-regular-expressions-postcodes.html" rel="alternate"></link><published>2019-02-20T18:20:00+00:00</published><updated>2019-02-20T18:20:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2019-02-20:/rexpy-for-generating-regular-expressions-postcodes.html</id><summary type="html">&lt;p&gt;Rexpy is a powerful tool we created
that generates regular expressions from examples.
It's available online at
&lt;a href="https://rexpy.herokuapp.com"&gt;https://rexpy.herokuapp.com&lt;/a&gt;
and forms part of our
&lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;open-source TDDA library&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://stochasticsolutions.com"&gt;Miró&lt;/a&gt;
users can use the built-in &lt;code&gt;rex&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;This post illustrates using Rexpy to find regular expressions for UK postcodes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;Rexpy is a powerful tool we created
that generates regular expressions from examples.
It's available online at
&lt;a href="https://rexpy.herokuapp.com"&gt;https://rexpy.herokuapp.com&lt;/a&gt;
and forms part of our
&lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;open-source TDDA library&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://stochasticsolutions.com"&gt;Miró&lt;/a&gt;
users can use the built-in &lt;code&gt;rex&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;This post illustrates using Rexpy to find regular expressions for UK postcodes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A regular expression for Postcodes&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If someone asked you what a UK postcode looks like,
and you don't live in London,
you'd probably say something like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A couple of letters, then a number then a space, then a number then a couple of letters.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;About the simplest way to get Rexpy to generate a regular expression
is to give it at least two examples. You can do this online
at &lt;a href="https://rexpy.herokuapp.com"&gt;https://rexpy.herokuapp.com&lt;/a&gt;
or using the &lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;open-source TDDA library&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you give it &lt;code&gt;EH1 3LH&lt;/code&gt; and &lt;code&gt;BB2 5NR&lt;/code&gt;, Rexpy generates
&lt;code&gt;[A-Z]{2}\d \d[A-Z]{2}&lt;/code&gt;, as illustrated here,
using the &lt;a href="https://rexpy.herokuapp.com"&gt;online version of rexpy&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.tdda.info/images/2postcodes.png" alt="Rexpy online, with EH1 3LH and BB2 5NR as inputs, produces [A-Z]{2}\d \d[A-Z]{2}"/&gt;&lt;/p&gt;
&lt;p&gt;This is the regular-expression equivalent of what we said:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;[A-Z]{2}&lt;/code&gt; means exactly two (&lt;code&gt;{2}&lt;/code&gt;) characters from the range &lt;code&gt;[A-Z]&lt;/code&gt;,
    i.e. two capital letters&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\d&lt;/code&gt; means a digit (which is the same as &lt;code&gt;[0-9]&lt;/code&gt;, i.e. one character from
    the range 0 to 9)&lt;/li&gt;
&lt;li&gt;the gap (&lt;code&gt; &lt;/code&gt;) is a space character&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\d&lt;/code&gt; is another digit&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[A-Z]{2}&lt;/code&gt; is two more letters.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This doesn't cover all postcodes, but it's a good start.&lt;/p&gt;
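&lt;p&gt;The same pattern can be tried out from Python. This is a minimal sketch
that assumes only the standard-library &lt;code&gt;re&lt;/code&gt; module (Rexpy itself is not
called here); the third postcode is one that the pattern does not cover.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import re

# Pattern that Rexpy generated from EH1 3LH and BB2 5NR.
pattern = re.compile(r'[A-Z]{2}\d \d[A-Z]{2}')

for postcode in ['EH1 3LH', 'BB2 5NR', 'G1 9PU']:
    # fullmatch succeeds only if the whole string fits the pattern.
    print(postcode, bool(pattern.fullmatch(postcode)))

# EH1 3LH True
# BB2 5NR True
# G1 9PU False
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
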
&lt;p&gt;&lt;strong&gt;Other cases&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;An easy way to try out the regular expression we generated
is to use the &lt;code&gt;grep&lt;/code&gt; command&lt;sup id="fnref:grep"&gt;&lt;a class="footnote-ref" href="#fn:grep"&gt;1&lt;/a&gt;&lt;/sup&gt;.
This is built into all Unix and Linux systems, and is available on
Windows if you install a Linux distribution under
&lt;a href="https://docs.microsoft.com/en-us/windows/wsl/install-win10"&gt;WSL&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If we try matching a few postcodes using this regular expression, we'll
see that many—but not all—postcodes match the pattern.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;On Linux, the particular variant of &lt;code&gt;grep&lt;/code&gt; we need is &lt;code&gt;grep -P&lt;/code&gt;,
    to tell it we're using &lt;code&gt;Perl&lt;/code&gt;-style regular expressions.&lt;/li&gt;
&lt;li&gt;On Unix (e.g. Macintosh), we need to use &lt;code&gt;grep -E&lt;/code&gt; (or &lt;code&gt;egrep&lt;/code&gt;)
    to tell it we're using "extended" regular expressions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If we write a few postcodes to a file:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ cat &amp;gt; postcodes
HA2 6QD
IP4 2LS
PR1 9BW
BB2 5NR
G1 9PU
DH9 6DU
RG22 4EX
EC1A 1AB
OL14 8DQ
CT2 7UD
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;we can then use &lt;code&gt;grep&lt;/code&gt; to find the lines that match:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ grep -E &lt;span class="s1"&gt;&amp;#39;[A-Z]{2}\d \d[A-Z]{2}&amp;#39;&lt;/span&gt; postcodes
HA2 6QD
IP4 2LS
PR1 9BW
BB2 5NR
DH9 6DU
CT2 7UD
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(Use &lt;code&gt;-P&lt;/code&gt; instead of &lt;code&gt;-E&lt;/code&gt; on Linux.)&lt;/p&gt;
&lt;p&gt;More relevantly, for present purposes, we can also add the &lt;code&gt;-v&lt;/code&gt; flag,
to ask for the match to be "inVerted", i.e. to show lines that fail to match:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ grep -v -E &lt;span class="s1"&gt;&amp;#39;[A-Z]{2}\d \d[A-Z]{2}&amp;#39;&lt;/span&gt; postcodes
G1 9PU
RG22 4EX
EC1A 1AB
OL14 8DQ
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The first of these, a Glasgow postcode, fails because it only has a
   single letter at the start.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The second and fourth fail because they have two digits after the letters.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The third fails because it's a London postcode with an extra letter, &lt;code&gt;A&lt;/code&gt;
   after the &lt;code&gt;EC1&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let's add an example of each in turn:&lt;/p&gt;
&lt;p&gt;If we first add the Glasgow postcode, Rexpy generates
&lt;code&gt;^[A-Z]{1,2}\d \d[A-Z]{2}$&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.tdda.info/images/3postcodes.png" alt="Rexpy online, adding G1 9PU, produces [A-Z]{1,2}\d \d[A-Z]{2}"/&gt;&lt;/p&gt;
&lt;p&gt;Here &lt;code&gt;[A-Z]{1,2}&lt;/code&gt; means one or two capital letters,
and we've checked the &lt;code&gt;anchor&lt;/code&gt; checkbox, to get it to add in &lt;code&gt;^&lt;/code&gt;
at the start and &lt;code&gt;$&lt;/code&gt; at the end of the regular expression.&lt;sup id="fnref:anchor"&gt;&lt;a class="footnote-ref" href="#fn:anchor"&gt;2&lt;/a&gt;&lt;/sup&gt;
If we use this with our &lt;code&gt;grep&lt;/code&gt; command, we get:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ grep -v -E &lt;span class="s1"&gt;&amp;#39;^[A-Z]{1,2}\d \d[A-Z]{2}$&amp;#39;&lt;/span&gt; postcodes
RG22 4EX
EC1A 1AB
OL14 8DQ
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If we now add in an example with two digits in the first part of the
postcode—say &lt;code&gt;RG22 4EX&lt;/code&gt;—rexpy further refines the expression to
&lt;code&gt;^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$&lt;/code&gt;, which is good for all(?) non-London
postcodes. If we repeat the &lt;code&gt;grep&lt;/code&gt; with this new pattern:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ grep -v -E &lt;span class="s1"&gt;&amp;#39;^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$&amp;#39;&lt;/span&gt; postcodes
EC1A 1AB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;only the London example now fails.&lt;/p&gt;
&lt;p&gt;In a perfect world, just by adding &lt;code&gt;EC1A 1AB&lt;/code&gt;,
Rexpy would produce our ideal regular expression—something like
&lt;code&gt;^[A-Z]{1,2}\d[A-Z]? \d[A-Z]{2}$&lt;/code&gt;.
(Here, the &lt;code&gt;?&lt;/code&gt; is equivalent to &lt;code&gt;{0,1}&lt;/code&gt;, meaning that the
term before can occur zero times or once, i.e. it is optional.)&lt;/p&gt;
&lt;p&gt;Unfortunately, that's not what happens.
Instead, Rexpy produces:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;Z0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Rexpy has concluded that the first part is just a jumble
of capital letters and numbers, and is saying that it can
be any mixture of 2-4 letters and numbers.&lt;/p&gt;
&lt;p&gt;In this case, we'd probably fix up the regular expression by hand,
or separately pass in the special Central London postcodes and all
the rest. If we feed in a few London postcodes on their own, we get:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;which is also a useful start.&lt;/p&gt;
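&lt;p&gt;One way to use the two expressions together is to accept a postcode if
it matches either of them. The sketch below assumes Python's standard
&lt;code&gt;re&lt;/code&gt; module; the helper name &lt;code&gt;unmatched&lt;/code&gt; is ours, for illustration,
and it behaves like &lt;code&gt;grep -v&lt;/code&gt; over a list of patterns.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import re

# Both patterns are copied from the text above: the general form and
# the form Rexpy produced from London-only examples.
GENERAL = re.compile(r'^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$')
LONDON = re.compile(r'^[A-Z]{2}\d[A-Z] \d[A-Z]{2}$')

POSTCODES = ['HA2 6QD', 'IP4 2LS', 'PR1 9BW', 'BB2 5NR', 'G1 9PU',
             'DH9 6DU', 'RG22 4EX', 'EC1A 1AB', 'OL14 8DQ', 'CT2 7UD']


def unmatched(postcodes, patterns):
    """Return the postcodes that match none of the given patterns."""
    return [p for p in postcodes
            if not any(rx.match(p) for rx in patterns)]


print(unmatched(POSTCODES, [GENERAL]))          # ['EC1A 1AB']
print(unmatched(POSTCODES, [GENERAL, LONDON]))  # []
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
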
&lt;p&gt;Have fun with Rexpy!&lt;/p&gt;
&lt;p&gt;By the way: if you're in easy reach of Edinburgh, we're running a
training course on the TDDA library as part of the Fringe of the
Edinburgh DataFest, on 20th March. This will include use of Rexpy.
You should come!&lt;/p&gt;
&lt;center&gt;
&lt;a href="https://StochasticSolutions.com/training"&gt;
&lt;img src="https://www.tdda.info/images/DataFestTDDATraining2.png" alt="Training Course on Testing Data and Data Processes" width=400/&gt;
&lt;/a&gt;
&lt;/center&gt;

&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:grep"&gt;
&lt;p&gt;&lt;code&gt;grep&lt;/code&gt; stands for &lt;em&gt;global regular expression print&lt;/em&gt;, and the &lt;code&gt;e&lt;/code&gt;
in &lt;code&gt;egrep&lt;/code&gt; stands for &lt;em&gt;extended&lt;/em&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:grep" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:anchor"&gt;
&lt;p&gt;Sometimes, regular expressions match any line that &lt;em&gt;contains&lt;/em&gt; the
pattern anywhere in them, rather than requiring the pattern to
match the whole line. In such cases, using the &lt;em&gt;anchored&lt;/em&gt; form
of the regular expression, &lt;code&gt;^[A-Z]{2}\d \d[A-Z]{2}$&lt;/code&gt;, means that matching
lines must not contain anything before or after the text that matches
the regular expression. (You can think of &lt;code&gt;^&lt;/code&gt; as matching the start
of the string, or line, and &lt;code&gt;$&lt;/code&gt; as matching the end.)&amp;#160;&lt;a class="footnote-backref" href="#fnref:anchor" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="regular expressions"></category><category term="rexpy"></category><category term="tdda"></category></entry><entry><title>Tagging PyTest Tests</title><link href="https://tdda.info/tagging-pytest-tests.html" rel="alternate"></link><published>2018-05-22T12:00:00+01:00</published><updated>2018-05-22T12:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2018-05-22:/tagging-pytest-tests.html</id><summary type="html">&lt;p&gt;A &lt;a href="https://tdda.info/saving-time-running-subsets-of-tests-with-tagging"&gt;recent post&lt;/a&gt;
described the new ability to run a subset of &lt;code&gt;ReferenceTest&lt;/code&gt; tests
from the &lt;a href="https://tdda.info/obtaining-the-python-tdda-library"&gt;tdda library&lt;/a&gt;
by &lt;em&gt;tagging&lt;/em&gt; tests or test classes
with the &lt;code&gt;@tag&lt;/code&gt; decorator.
Initially, this ability was only available for &lt;code&gt;unittest&lt;/code&gt;-based tests.
From version 1.0 of the tdda library,
&lt;a href="https://tdda.info/obtaining-the-python-tdda-library"&gt;now available&lt;/a&gt;,
we have …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A &lt;a href="https://tdda.info/saving-time-running-subsets-of-tests-with-tagging"&gt;recent post&lt;/a&gt;
described the new ability to run a subset of &lt;code&gt;ReferenceTest&lt;/code&gt; tests
from the &lt;a href="https://tdda.info/obtaining-the-python-tdda-library"&gt;tdda library&lt;/a&gt;
by &lt;em&gt;tagging&lt;/em&gt; tests or test classes
with the &lt;code&gt;@tag&lt;/code&gt; decorator.
Initially, this ability was only available for &lt;code&gt;unittest&lt;/code&gt;-based tests.
From version 1.0 of the tdda library,
&lt;a href="https://tdda.info/obtaining-the-python-tdda-library"&gt;now available&lt;/a&gt;,
we have extended this capability to work with &lt;code&gt;pytest&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This post is very similar to the
&lt;a href="https://tdda.info/saving-time-running-subsets-of-tests-with-tagging"&gt;previous one&lt;/a&gt;
on tagging &lt;code&gt;unittest&lt;/code&gt;-based tests, but adapted for &lt;code&gt;pytest&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id="overview"&gt;Overview&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A decorator called &lt;code&gt;tag&lt;/code&gt; can be imported and used to decorate
    individual tests or whole test classes (by preceding the test function
    or class with &lt;code&gt;@tag&lt;/code&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When &lt;code&gt;pytest&lt;/code&gt; is run using the &lt;code&gt;--tagged&lt;/code&gt; option,
    only tagged tests and tests from tagged test classes will be run.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There is a second new option, &lt;code&gt;--istagged&lt;/code&gt;.
    When this is used, the software will report which test classes
    are tagged, or contain tests that are tagged, but will not actually
    run any tests.
    This is helpful if you have a lot of test classes, spread across
    different files, and want to change the set of tagged tests.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="benefits"&gt;Benefits&lt;/h3&gt;
&lt;p&gt;The situations where we find this particularly helpful are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fixing a broken test or working on a new feature or dataset.&lt;/strong&gt;
    We often find ourselves with a small subset of tests failing
    (perhaps, a single test) either because we're adding
    a new feature, or because something has changed, or because
    we are working with data that has slightly different characteristics.
    If the tests of interest run in a few seconds, but the whole test
    suite takes minutes or hours to run, we can iterate
    dramatically faster if we have an easy way to run only
    the subset of tests currently failing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Re-writing test output.&lt;/strong&gt;
    The &lt;code&gt;tdda&lt;/code&gt; library provides the ability to re-write the expected
    ("reference") output from tests with the actual result from the
    code, using the &lt;code&gt;--write-all&lt;/code&gt; command-line flag.  If it's only a
    subset of the tests that have failed, there is real benefit in
    re-writing only their output. This is particularly true if the
    reference outputs contain some differences each time (version
    numbers, dates etc.)  that are being ignored using the
    &lt;code&gt;ignore-lines&lt;/code&gt; or &lt;code&gt;ignore-patterns&lt;/code&gt; options provided by the
    library. If we regenerate all the test outputs, and then look at
    which files have changed, we might see differences in many
    reference files. In contrast, if we only regenerate the tests that
    need to be updated, we avoid committing unnecessary changes and
    reduce the likelihood of overlooking changes that may actually be
    incorrect.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="prerequisites"&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;In order to use the reference test functionality with &lt;code&gt;pytest&lt;/code&gt;,
you have always needed to add some boilerplate code to
&lt;code&gt;conftest.py&lt;/code&gt; in the directory from which you are running &lt;code&gt;pytest&lt;/code&gt;.
To use the tagging capability, you need to add one more function definition,
&lt;code&gt;pytest_collection_modifyitems&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The recommended imports in &lt;code&gt;conftest.py&lt;/code&gt; are now:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.referencetest.pytestconfig&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pytest_addoption&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                             &lt;span class="n"&gt;pytest_collection_modifyitems&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                             &lt;span class="n"&gt;set_default_data_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                             &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;conftest.py&lt;/code&gt; is also a good place to set the reference file location
if you want to do so using &lt;code&gt;set_default_data_location&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id="example"&gt;Example&lt;/h3&gt;
&lt;p&gt;We'll illustrate this with a simple example.
The code below implements four trivial tests, two in a class
and two as plain functions.&lt;/p&gt;
&lt;p&gt;Note the import of the &lt;code&gt;tag&lt;/code&gt; decorator function near the top,
and that &lt;code&gt;test_a&lt;/code&gt; and the class &lt;code&gt;TestClassA&lt;/code&gt;
are decorated with the &lt;code&gt;@tag&lt;/code&gt; decorator.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;### test_all.py&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.referencetest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;

&lt;span class="nd"&gt;@tag&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_a&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_b&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;

&lt;span class="nd"&gt;@tag&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestClassA&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_x&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_y&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;y&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Y&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If we run this as normal, all four tests run and pass:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;pytest
&lt;span class="go"&gt;============================= test session starts ==============================&lt;/span&gt;
&lt;span class="go"&gt;platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0&lt;/span&gt;
&lt;span class="go"&gt;rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:&lt;/span&gt;
&lt;span class="go"&gt;plugins: hypothesis-3.4.2&lt;/span&gt;
&lt;span class="go"&gt;collected 4 items&lt;/span&gt;

&lt;span class="go"&gt;test_all.py ....&lt;/span&gt;

&lt;span class="go"&gt;=========================== 4 passed in 0.02 seconds ===========================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But if we add the &lt;code&gt;--tagged&lt;/code&gt; flag, only three tests run:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;pytest --tagged
&lt;span class="go"&gt;============================= test session starts ==============================&lt;/span&gt;
&lt;span class="go"&gt;platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0&lt;/span&gt;
&lt;span class="go"&gt;rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:&lt;/span&gt;
&lt;span class="go"&gt;plugins: hypothesis-3.4.2&lt;/span&gt;
&lt;span class="go"&gt;collected 4 items&lt;/span&gt;

&lt;span class="go"&gt;test_all.py ...&lt;/span&gt;

&lt;span class="go"&gt;=========================== 3 passed in 0.02 seconds ===========================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Adding the &lt;code&gt;--verbose&lt;/code&gt; flag confirms that these three are the tagged
test and the tests in the tagged class, as expected:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;pytest --tagged --verbose
&lt;span class="go"&gt;============================= test session starts ==============================&lt;/span&gt;
&lt;span class="go"&gt;platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0 -- /usr/local/Cellar/python/3.5.1/bin/python3.5&lt;/span&gt;
&lt;span class="go"&gt;cachedir: .cache&lt;/span&gt;
&lt;span class="go"&gt;rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:&lt;/span&gt;
&lt;span class="go"&gt;plugins: hypothesis-3.4.2&lt;/span&gt;
&lt;span class="go"&gt;collected 4 items&lt;/span&gt;

&lt;span class="go"&gt;test_all.py::test_a PASSED&lt;/span&gt;
&lt;span class="go"&gt;test_all.py::TestClassA::test_x PASSED&lt;/span&gt;
&lt;span class="go"&gt;test_all.py::TestClassA::test_y PASSED&lt;/span&gt;

&lt;span class="go"&gt;=========================== 3 passed in 0.01 seconds ===========================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
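&lt;p&gt;The selection rule itself is simple: a test is kept if it is tagged or if
its class is tagged. The sketch below illustrates that rule in plain Python.
It is not the tdda implementation: the &lt;code&gt;tag&lt;/code&gt; defined here is a
hypothetical stand-in that just sets a marker attribute, and &lt;code&gt;selected&lt;/code&gt;
is an illustrative helper, not part of &lt;code&gt;pytest&lt;/code&gt; or &lt;code&gt;tdda&lt;/code&gt;.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;def tag(obj):
    # Hypothetical stand-in for tdda's decorator: mark a function or class.
    obj._tagged = True
    return obj


@tag
def test_a():
    assert 'a' == 'a'


def test_b():
    assert 'b' == 'b'


@tag
class TestClassA:
    def test_x(self):
        assert 'x' * 2 == 'x' + 'x'

    def test_y(self):
        assert 'y' != 'Y'


def selected(candidates):
    """Return names of tests that are tagged or belong to a tagged class.

    candidates is a list of (name, function, owning_class_or_None) tuples.
    """
    return [name for name, func, cls in candidates
            if getattr(func, '_tagged', False)
            or getattr(cls, '_tagged', False)]


candidates = [
    ('test_a', test_a, None),
    ('test_b', test_b, None),
    ('TestClassA.test_x', TestClassA.test_x, TestClassA),
    ('TestClassA.test_y', TestClassA.test_y, TestClassA),
]
print(selected(candidates))
# ['test_a', 'TestClassA.test_x', 'TestClassA.test_y']
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
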

&lt;p&gt;Finally, if we want to find out which classes include tagged tests,
we can use the &lt;code&gt;--istagged&lt;/code&gt; flag:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;pytest --istagged
&lt;span class="go"&gt;============================= test session starts ==============================&lt;/span&gt;
&lt;span class="go"&gt;platform darwin -- Python 3.5.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0&lt;/span&gt;
&lt;span class="go"&gt;rootdir: /Users/njr/tmp/referencetest_examples/pytest, inifile:&lt;/span&gt;
&lt;span class="go"&gt;plugins: hypothesis-3.4.2&lt;/span&gt;
&lt;span class="go"&gt;collected 4 items&lt;/span&gt;

&lt;span class="go"&gt;test_all.test_a&lt;/span&gt;
&lt;span class="go"&gt;test_all.TestClassA&lt;/span&gt;

&lt;span class="go"&gt;========================= no tests ran in 0.01 seconds =========================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is particularly helpful when our tests are spread across multiple
files, as the filenames are then shown as well as the class names.&lt;/p&gt;
&lt;h3 id="installation"&gt;Installation&lt;/h3&gt;
&lt;p&gt;Information about installing the library is available
in &lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;this post&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="other-features"&gt;Other Features&lt;/h3&gt;
&lt;p&gt;Other features of the &lt;code&gt;ReferenceTest&lt;/code&gt; capabilities of the
&lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;tdda library&lt;/a&gt;
are described in &lt;a href="https://www.tdda.info/the-new-referencetest-class-for-tdda"&gt;this post&lt;/a&gt;.
Its capabilities in the area of constraint discovery and verification
are discussed
in &lt;a href="https://www.tdda.info/constraint-discovery-and-verification-for-pandas-dataframes"&gt;this post&lt;/a&gt;,
and &lt;a href="https://www.tdda.info/the-tdda-constraints-file-format"&gt;this post&lt;/a&gt;.&lt;/p&gt;</content><category term="TDDA"></category><category term="tests"></category><category term="tagging"></category></entry><entry><title>Detecting Bad Data and Anomalies with the TDDA Library (Part I)</title><link href="https://tdda.info/detecting-bad-data-and-anomalies-with-the-tdda-library-part-i.html" rel="alternate"></link><published>2018-05-04T10:30:00+01:00</published><updated>2018-05-04T10:30:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2018-05-04:/detecting-bad-data-and-anomalies-with-the-tdda-library-part-i.html</id><summary type="html">&lt;p&gt;The test-driven data analysis library, &lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;tdda&lt;/a&gt;, has two main kinds of functionality&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;support for testing complex analytical processes
    with &lt;code&gt;unittest&lt;/code&gt; or &lt;code&gt;pytest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;support for verifying data against constraints, and optionally for
    discovering such constraints from example data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Until now, however, the verification process has only reported
which constraints failed to …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The test-driven data analysis library, &lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;tdda&lt;/a&gt;, has two main kinds of functionality&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;support for testing complex analytical processes
    with &lt;code&gt;unittest&lt;/code&gt; or &lt;code&gt;pytest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;support for verifying data against constraints, and optionally for
    discovering such constraints from example data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Until now, however, the verification process has only reported
which constraints failed to be satisfied by a dataset.&lt;/p&gt;
&lt;p&gt;We have now extended the &lt;code&gt;tdda&lt;/code&gt; library to allow identification of
individual failing records, allowing it to act as a general purpose
anomaly detection framework.&lt;/p&gt;
&lt;p&gt;The new functionality is available through a new &lt;code&gt;detect_df&lt;/code&gt; API call,
and from the command line with the new &lt;code&gt;tdda detect&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;The diagram shows conceptually how detection works, separating out
&lt;em&gt;anomalous&lt;/em&gt; records from the rest.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.tdda.info/images/SimpleAnomalyDetection.png" width="250"
     alt="A simple anomaly detection process, splitting input data into anomalous and non-anomalous records"/&gt;&lt;/p&gt;
&lt;p&gt;With the TDDA framework, &lt;em&gt;anomalous&lt;/em&gt; simply means &lt;em&gt;fails at least one
constraint.&lt;/em&gt; We'll discuss cases in which the constraints have been
developed to try to model some subset of data of interest (defects,
high-risk applicants, heart arrhythmias, flawless diamonds, model
patients etc.) in part II of this post.  In those cases, we start to
be able to discuss classifications such as true and false positives,
and true and false negatives.&lt;/p&gt;
&lt;h3 id="example-usage-from-the-command-line"&gt;Example Usage from the Command Line&lt;/h3&gt;
&lt;p&gt;Suppose we have a simple transaction stream with just three fields,
&lt;code&gt;id&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt; and &lt;code&gt;price&lt;/code&gt;, like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;       id    category     price
710316821       QT       150.39
516025643       AA       346.69
414345845       QT       205.83
590179892       CB        55.61
117687080       QT       142.03
684803436       AA       152.10
611205703       QT        39.65
399848408       AA       455.67
289394404       AA       102.61
863476710       AA       297.82
534170200       KA        80.96
898969231       QT        81.39
255672456       QT        71.67
133111344       TB       229.19
763476994       CB       338.40
769595502       QT       310.19
464477044       QT        54.41
675155634       QT       199.07
483511995       QT       209.53
416094320       QT        83.31
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and the following constraints (which might have been created by hand,
or generated using the &lt;code&gt;tdda discover&lt;/code&gt; command).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;id&lt;/code&gt; (integer): Identifier for item.
    Should not be null, and should be unique in the table&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;category&lt;/code&gt; (string): Should be one of “AA”, “CB”, “QT”, “KA” or “TB”&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;price&lt;/code&gt; (floating point value): unit price in pounds sterling.
    Should be non-negative and no more than 1,000.00.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This would be represented in a TDDA file with the following constraints.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;{
  &amp;quot;fields&amp;quot;: {
    &amp;quot;id&amp;quot;: {
      &amp;quot;type&amp;quot;: &amp;quot;int&amp;quot;,
      &amp;quot;max_nulls&amp;quot;: 0,
      &amp;quot;no_duplicates&amp;quot;: true
    },
    &amp;quot;category&amp;quot;: {
      &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
      &amp;quot;max_nulls&amp;quot;: 0,
      &amp;quot;allowed_values&amp;quot;:
        [&amp;quot;AA&amp;quot;, &amp;quot;CB&amp;quot;, &amp;quot;QT&amp;quot;, &amp;quot;KA&amp;quot;, &amp;quot;TB&amp;quot;]
    },
    &amp;quot;price&amp;quot;: {
      &amp;quot;type&amp;quot;: &amp;quot;real&amp;quot;,
      &amp;quot;min&amp;quot;: 0.0,
      &amp;quot;max&amp;quot;: 1000.0,
      &amp;quot;max_nulls&amp;quot;: 0
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can use the &lt;code&gt;tdda verify&lt;/code&gt; command to verify a CSV file or a feather
file&lt;sup id="fnref:feather"&gt;&lt;a class="footnote-ref" href="#fn:feather"&gt;1&lt;/a&gt;&lt;/sup&gt; against these constraints, and get a summary of which constraints pass
and fail. If our data is in the file &lt;code&gt;items.feather&lt;/code&gt;, the JSON
constraints are in &lt;code&gt;constraints.tdda&lt;/code&gt;, and there are some violations,
we will get output like the following:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;tdda verify items.feather constraints.tdda
&lt;span class="go"&gt;FIELDS:&lt;/span&gt;

&lt;span class="go"&gt;id: 1 failure  2 passes  type ✓  max_nulls ✓  no_duplicates ✗&lt;/span&gt;

&lt;span class="go"&gt;category: 1 failure  2 passes  type ✓  max_nulls ✓  allowed_values ✗&lt;/span&gt;

&lt;span class="go"&gt;price: 2 failures  2 passes  type ✓  min ✓  max ✗  max_nulls ✗&lt;/span&gt;

&lt;span class="go"&gt;SUMMARY:&lt;/span&gt;

&lt;span class="go"&gt;Constraints passing: 6&lt;/span&gt;
&lt;span class="go"&gt;Constraints failing: 4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The new &lt;code&gt;tdda detect&lt;/code&gt; command allows us to go further and find which
individual records fail.&lt;/p&gt;
&lt;p&gt;We can use the following command to write out a CSV file, &lt;code&gt;bads.csv&lt;/code&gt;,
containing the records that fail constraints:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;tdda detect items.feather constraints.tdda bads.csv --per-constraint --output-fields
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;--per-constraint&lt;/code&gt; flag tells the software to write out a boolean
column for each constraint, indicating whether the record
passed, and the &lt;code&gt;--output-fields&lt;/code&gt; flag tells the software to include all
the input fields in the output.&lt;/p&gt;
&lt;p&gt;The result is the following CSV file:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;id,category,price,id_nodups_ok,category_values_ok,price_max_ok,price_nonnull_ok,n_failures
113791348,TQ,318.63,true,false,true,true,1
102829374,AA,65.24,false,true,true,true,1
720313295,TB,1004.72,true,true,false,true,1
384044032,QT,478.65,false,true,true,true,1
602948968,TB,209.31,false,true,true,true,1
105983384,AA,8.95,false,true,true,true,1
444140832,QT,1132.87,true,true,false,true,1
593548725,AA,282.58,false,true,true,true,1
545398672,QT,1026.4,true,true,false,true,1
759425162,CB,1052.72,true,true,false,true,1
452691252,AA,1028.19,true,true,false,true,1
105983384,QT,242.64,false,true,true,true,1
102829374,KA,71.64,false,true,true,true,1
105983384,AA,10.24,false,true,true,true,1
405321922,QT,85.23,false,true,true,true,1
102829374,,100000.0,false,false,false,true,3
872018391,QT,51.69,false,true,true,true,1
862101984,QT,158.53,false,true,true,true,1
274332319,AA,1069.25,true,true,false,true,1
827919239,QT,1013.0,true,true,false,true,1
105983384,QT,450.68,false,true,true,true,1
102829374,,100000.0,false,false,false,true,3
872018391,QT,199.37,false,true,true,true,1
602948968,KA,558.73,false,true,true,true,1
328073211,CB,1031.67,true,true,false,true,1
405321922,TB,330.97,false,true,true,true,1
334193154,QT,1032.31,true,true,false,true,1
194125540,TB,,true,true,,false,1
724692620,TB,1025.81,true,true,false,true,1
862101984,QT,186.76,false,true,true,true,1
593548725,QT,196.56,false,true,true,true,1
384044032,AA,157.25,false,true,true,true,1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;which we can read a bit more easily if we reformat it (using · to denote
nulls) as:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;       id  category        price        id  category  price     price  n_failures
                                   _nodups   _values   _max  _nonnull
                                       _ok       _ok    _ok       _ok
113791348        TQ       318.63      true     false   true      true           1
102829374        AA        65.24     false      true   true      true           1
720313295        TB     1,004.72      true      true  false      true           1
384044032        QT       478.65     false      true   true      true           1
602948968        TB       209.31     false      true   true      true           1
105983384        AA         8.95     false      true   true      true           1
444140832        QT     1,132.87      true      true  false      true           1
593548725        AA       282.58     false      true   true      true           1
545398672        QT     1,026.40      true      true  false      true           1
759425162        CB     1,052.72      true      true  false      true           1
452691252        AA     1,028.19      true      true  false      true           1
105983384        QT       242.64     false      true   true      true           1
102829374        KA        71.64     false      true   true      true           1
105983384        AA        10.24     false      true   true      true           1
405321922        QT        85.23     false      true   true      true           1
102829374         ·   100,000.00     false     false  false      true           3
872018391        QT        51.69     false      true   true      true           1
862101984        QT       158.53     false      true   true      true           1
274332319        AA     1,069.25      true      true  false      true           1
827919239        QT     1,013.00      true      true  false      true           1
105983384        QT       450.68     false      true   true      true           1
102829374         ·   100,000.00     false     false  false      true           3
872018391        QT       199.37     false      true   true      true           1
602948968        KA       558.73     false      true   true      true           1
328073211        CB     1,031.67      true      true  false      true           1
405321922        TB       330.97     false      true   true      true           1
334193154        QT     1,032.31      true      true  false      true           1
194125540        TB            ·      true      true      ·     false           1
724692620        TB     1,025.81      true      true  false      true           1
862101984        QT       186.76     false      true   true      true           1
593548725        QT       196.56     false      true   true      true           1
384044032        AA       157.25     false      true   true      true           1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
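&lt;p&gt;Because the output is an ordinary CSV file, it is easy to summarise with a few lines of code. The following sketch uses only the Python standard library and a handful of rows copied from the output above; the function name &lt;code&gt;count_failures&lt;/code&gt; is hypothetical, and it assumes the default &lt;code&gt;true&lt;/code&gt;/&lt;code&gt;false&lt;/code&gt; representation of booleans rather than the &lt;code&gt;1&lt;/code&gt;/&lt;code&gt;0&lt;/code&gt; form.&lt;/p&gt;

```python
import csv
import io

# A few rows copied from the bads.csv output shown above.
BADS_CSV = """\
id,category,price,id_nodups_ok,category_values_ok,price_max_ok,price_nonnull_ok,n_failures
113791348,TQ,318.63,true,false,true,true,1
720313295,TB,1004.72,true,true,false,true,1
102829374,,100000.0,false,false,false,true,3
194125540,TB,,true,true,,false,1
"""


def count_failures(csv_text):
    """Count 'false' entries in each per-constraint (_ok) column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ok_columns = [name for name in reader.fieldnames if name.endswith("_ok")]
    counts = dict.fromkeys(ok_columns, 0)
    for row in reader:
        for name in ok_columns:
            # Empty cells mean the constraint was inapplicable, not failed.
            if row[name] == "false":
                counts[name] += 1
    return counts


failure_counts = count_failures(BADS_CSV)
print(failure_counts)
```

&lt;p&gt;For these four rows, the counts are 1 for &lt;code&gt;id_nodups_ok&lt;/code&gt;, 2 for &lt;code&gt;category_values_ok&lt;/code&gt;, 2 for &lt;code&gt;price_max_ok&lt;/code&gt; and 1 for &lt;code&gt;price_nonnull_ok&lt;/code&gt;. The empty &lt;code&gt;price_max_ok&lt;/code&gt; cell on the last row is not counted, since the maximum check does not apply to a null price.&lt;/p&gt;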

&lt;h3 id="command-line-syntax"&gt;Command Line Syntax&lt;/h3&gt;
&lt;p&gt;The basic form of the command-line command is:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;tdda detect INPUT CONSTRAINTS OUTPUT
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;where&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;INPUT&lt;/code&gt; is normally either a &lt;code&gt;.csv&lt;/code&gt; file, in a suitable format,
    or a &lt;code&gt;.feather&lt;/code&gt; file&lt;sup id="fnref2:feather"&gt;&lt;a class="footnote-ref" href="#fn:feather"&gt;1&lt;/a&gt;&lt;/sup&gt; containing a DataFrame,
    preferably with an accompanying &lt;code&gt;.pmm&lt;/code&gt; file&lt;sup id="fnref3:feather"&gt;&lt;a class="footnote-ref" href="#fn:feather"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CONSTRAINTS&lt;/code&gt; is a JSON file containing constraints, usually with
    a &lt;code&gt;.tdda&lt;/code&gt; suffix. This can be created by the &lt;code&gt;tdda discover&lt;/code&gt; command,
    or edited by hand.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;OUTPUT&lt;/code&gt; is again either a &lt;code&gt;.csv&lt;/code&gt; or &lt;code&gt;.feather&lt;/code&gt; file to be created
    with the output rows. If the &lt;code&gt;pmmif&lt;/code&gt; library is installed, a &lt;code&gt;.pmm&lt;/code&gt;
    metadata file will be generated alongside the &lt;code&gt;.feather&lt;/code&gt; file,
    when &lt;code&gt;.feather&lt;/code&gt; output is requested.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Options&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Several command line options are available to control the detailed
behaviour:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;defaults&lt;/strong&gt;: If no command-line options are supplied:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;only failing records will be written&lt;/li&gt;
&lt;li&gt;only a record identifier and the number of failing constraints
    will be written.&lt;ul&gt;
&lt;li&gt;When the input is a Pandas DataFrame, the record identifier
    will be the index for the failing records;&lt;/li&gt;
&lt;li&gt;when the input is a CSV file, the record identifier will
    be the row number, with the first row after the header
    being numbered 1.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;--per-constraint&lt;/code&gt; When this is added, an &lt;code&gt;_ok&lt;/code&gt; column will also
    be written for every constraint that has any failures, with &lt;code&gt;true&lt;/code&gt;
    for rows that satisfy the constraint, &lt;code&gt;false&lt;/code&gt; for rows that do not
    satisfy the constraint and a missing value where the constraint is
    inapplicable (which does not count as a failure).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;--output-fields [FIELD1 FIELD2 ...]&lt;/code&gt; If the &lt;code&gt;--output-fields&lt;/code&gt; flag
    is used without specifying any fields, all fields from the input
    will be included in the output. Alternatively, a space-separated
    list of fields may be provided, in which case only those will be
    included. Whenever this option is used, no index or row number is
    written unless specifically requested with &lt;code&gt;--index&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;--write-all&lt;/code&gt; If this flag is used, all records from the input
    will be included in the output, including those that have no
    constraint failures.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;--index&lt;/code&gt; This flag forces the writing of the index (for DataFrame
    inputs) or row number (for CSV inputs).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;--int&lt;/code&gt; When writing boolean values to CSV files (either from input
    data or as per-constraint output fields), use &lt;code&gt;1&lt;/code&gt; for &lt;code&gt;true&lt;/code&gt;
    and &lt;code&gt;0&lt;/code&gt; for &lt;code&gt;false&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="api-access"&gt;API Access&lt;/h3&gt;
&lt;p&gt;The detection functionality is also available through the TDDA library's
API with a new &lt;code&gt;detect_df&lt;/code&gt; function, which takes similar parameters
to the command line. The corresponding call, with a DataFrame &lt;code&gt;df&lt;/code&gt; in
memory, would be:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.constraints&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;detect_df&lt;/span&gt;

&lt;span class="n"&gt;verification&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detect_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;constraints.tdda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;per_constraint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;output_fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="n"&gt;bads_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;verification&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;detected&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:feather"&gt;
&lt;p&gt;The &lt;a href="https://wesmckinney.com/blog/feather-its-the-metadata/"&gt;feather&lt;/a&gt; file
format is an interoperable way to save DataFrames from Pandas or
R. Its aim is to preserve metadata better and be faster than CSV
files. It has a few issues, particularly around types and nulls, and
when available, we save a secondary &lt;code&gt;.pmm&lt;/code&gt; file alongside &lt;code&gt;.feather&lt;/code&gt;
files which makes reading and writing them more robust when our
extensions in the &lt;code&gt;pmmif&lt;/code&gt; library are used. We'll do a future blog
post about this, but if you install both &lt;code&gt;feather&lt;/code&gt; and &lt;code&gt;pmmif&lt;/code&gt; with
&lt;code&gt;pip install feather&lt;/code&gt; and &lt;code&gt;pip install pmmif&lt;/code&gt;,
and use &lt;code&gt;featherpmm.write_dataframe&lt;/code&gt;, imported from &lt;code&gt;pmmif&lt;/code&gt;, rather than
&lt;code&gt;feather.write_dataframe&lt;/code&gt;, you should get more robust behaviour.&amp;#160;&lt;a class="footnote-backref" href="#fnref:feather" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;a class="footnote-backref" href="#fnref2:feather" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;a class="footnote-backref" href="#fnref3:feather" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tests"></category><category term="anomaly detection"></category><category term="bad data"></category></entry><entry><title>Saving Time Running Subsets of Tests with Tagging</title><link href="https://tdda.info/saving-time-running-subsets-of-tests-with-tagging.html" rel="alternate"></link><published>2018-05-01T10:30:00+01:00</published><updated>2018-05-01T10:30:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2018-05-01:/saving-time-running-subsets-of-tests-with-tagging.html</id><summary type="html">&lt;p&gt;It is common, when working with tests for analytical processes,
for test suites to take a non-trivial amount of time to run.
It is often helpful to have a convenient way to execute a
subset of tests, or even a single test.&lt;/p&gt;
&lt;p&gt;We have added a simple mechanism for allowing this …&lt;/p&gt;</summary><content type="html">&lt;p&gt;It is common, when working with tests for analytical processes,
for test suites to take a non-trivial amount of time to run.
It is often helpful to have a convenient way to execute a
subset of tests, or even a single test.&lt;/p&gt;
&lt;p&gt;We have added a simple mechanism for doing this with &lt;code&gt;unittest&lt;/code&gt;-based tests
in the &lt;code&gt;ReferenceTest&lt;/code&gt; functionality of the
&lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;tdda&lt;/a&gt; Python library.
It is based on &lt;em&gt;tagging&lt;/em&gt; tests.&lt;/p&gt;
&lt;p&gt;The quick summary is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A decorator called &lt;code&gt;tag&lt;/code&gt; can be imported and used to decorate
    individual tests or whole test classes (by preceding the test function
    or class with &lt;code&gt;@tag&lt;/code&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When a script calling &lt;code&gt;ReferenceTestCase.main()&lt;/code&gt; is run, if the flag
    &lt;code&gt;--tagged&lt;/code&gt; (or &lt;code&gt;-1&lt;/code&gt;, the digit one) is used on the command line,
    only tagged tests and tests from tagged test classes will be run.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There is a second new option, &lt;code&gt;--istagged&lt;/code&gt; (or &lt;code&gt;-0&lt;/code&gt;, the digit zero).
    When this is used, the software will report only which test classes
    are tagged, or contain tests that are tagged, and will not actually
    run any tests.
    This is helpful if you have a lot of test classes, spread across
    different files, and want to change the set of tagged tests.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="benefits"&gt;Benefits&lt;/h3&gt;
&lt;p&gt;The situations where we find this particularly helpful are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fixing a broken test or working on a new feature or dataset.&lt;/strong&gt;
    We often find ourselves with a small subset of tests failing
    (perhaps, a single test case), either because we're adding
    a new feature, or because something has changed or we are working
    with data that has slightly different characteristics.
    If the tests of interest run in a few seconds, but the whole test
    suite takes minutes or hours to run, we can iterate
    dramatically faster if we have an easy way to run only
    the subset of tests currently failing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Re-writing test output.&lt;/strong&gt;
    The &lt;code&gt;tdda&lt;/code&gt; library provides the ability to re-write the expected
    ("reference") output from tests with whatever the code is currently
    generating, using the &lt;code&gt;--write-all&lt;/code&gt; command-line flag.
    If it's only a subset of the tests that have failed, there is real
    benefit in re-writing only the output for the previously failing tests,
    rather than for all tests. This is particularly true if the reference
    outputs contain some differences each time (version numbers, dates etc.)
    that are being ignored using the &lt;code&gt;ignore-lines&lt;/code&gt; or &lt;code&gt;ignore-patterns&lt;/code&gt;
    options provided by the library. If we re-write all the tests,
    and then look at which files have changed, we might see differences in
    all reference files, whereas if we only regenerate the tests with
    meaningful changes, we will avoid committing changes that were not required.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="example"&gt;Example&lt;/h3&gt;
&lt;p&gt;We'll illustrate this with a simple example.
The code below implements six trivial tests across two classes.&lt;/p&gt;
&lt;p&gt;Note the import of the &lt;code&gt;tag&lt;/code&gt; decorator function near the top,
and that two of the tests—&lt;code&gt;testTwo&lt;/code&gt; and &lt;code&gt;testThree&lt;/code&gt; in the class
&lt;code&gt;Tests&lt;/code&gt;—are decorated with the &lt;code&gt;@tag&lt;/code&gt; decorator, as is the entire
test class &lt;code&gt;MoreTests&lt;/code&gt;.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# tests.py&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.referencetest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Tests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testOne&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@tag&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testTwo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@tag&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testThree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testFour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@tag&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MoreTests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testFive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testSix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
     &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If we run this as normal, all six tests run and pass:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;python tests.py
&lt;span class="go"&gt;......&lt;/span&gt;
&lt;span class="go"&gt;----------------------------------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;Ran 6 tests in 0.000s&lt;/span&gt;

&lt;span class="go"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But if we add the &lt;code&gt;-1&lt;/code&gt; flag, only four tests run:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;python tests.py -1
&lt;span class="go"&gt;....&lt;/span&gt;
&lt;span class="go"&gt;----------------------------------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;Ran 4 tests in 0.000s&lt;/span&gt;

&lt;span class="go"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Adding the &lt;code&gt;-v&lt;/code&gt; (&lt;em&gt;verbose&lt;/em&gt;) flag confirms that these four are the tagged
tests, as expected:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;python tests.py -1 -v
&lt;span class="go"&gt;testFive (__main__.MoreTests) ... ok&lt;/span&gt;
&lt;span class="go"&gt;testSix (__main__.MoreTests) ... ok&lt;/span&gt;
&lt;span class="go"&gt;testThree (__main__.Tests) ... ok&lt;/span&gt;
&lt;span class="go"&gt;testTwo (__main__.Tests) ... ok&lt;/span&gt;

&lt;span class="go"&gt;----------------------------------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;Ran 4 tests in 0.000s&lt;/span&gt;

&lt;span class="go"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Finally, if we want to find out which classes include tagged tests,
we can use the &lt;code&gt;-0&lt;/code&gt; flag:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;python tests.py -0
&lt;span class="go"&gt;__main__.MoreTests&lt;/span&gt;
&lt;span class="go"&gt;__main__.Tests&lt;/span&gt;

&lt;span class="go"&gt;----------------------------------------------------------------------&lt;/span&gt;
&lt;span class="go"&gt;Ran 0 tests in 0.000s&lt;/span&gt;

&lt;span class="go"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is particularly helpful when our tests are spread across multiple
files, as the filenames are then shown as well as the class names.&lt;/p&gt;
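&lt;p&gt;For readers curious how this kind of tagging can work, here is a minimal sketch built only on the standard &lt;code&gt;unittest&lt;/code&gt; module. It is not the implementation used in the &lt;code&gt;tdda&lt;/code&gt; library: the &lt;code&gt;_tagged&lt;/code&gt; attribute and the &lt;code&gt;tagged_suite&lt;/code&gt; helper are hypothetical names chosen for illustration, and this &lt;code&gt;tag&lt;/code&gt; is a stand-in for the real decorator.&lt;/p&gt;

```python
import unittest


def tag(obj):
    """Mark a test method or a whole TestCase class as tagged."""
    obj._tagged = True
    return obj


def tagged_suite(*test_classes):
    """Build a suite containing only tagged tests from the given classes."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for cls in test_classes:
        class_tagged = getattr(cls, "_tagged", False)
        for name in loader.getTestCaseNames(cls):
            method = getattr(cls, name)
            # Keep the test if its class or the method itself is tagged.
            if class_tagged or getattr(method, "_tagged", False):
                suite.addTest(cls(name))
    return suite


class Tests(unittest.TestCase):
    def testOne(self):
        self.assertEqual(1, 1)

    @tag
    def testTwo(self):
        self.assertEqual(2, 2)


@tag
class MoreTests(unittest.TestCase):
    def testFive(self):
        self.assertEqual(5, 5)


suite = tagged_suite(Tests, MoreTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

&lt;p&gt;Here the suite contains &lt;code&gt;testTwo&lt;/code&gt; (tagged directly) and &lt;code&gt;testFive&lt;/code&gt; (from a tagged class), while the untagged &lt;code&gt;testOne&lt;/code&gt; is skipped.&lt;/p&gt;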
&lt;h3 id="installation"&gt;Installation&lt;/h3&gt;
&lt;p&gt;Information about installing the library is available
in &lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;this post&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="other-features"&gt;Other Features&lt;/h3&gt;
&lt;p&gt;Other features of the &lt;code&gt;ReferenceTest&lt;/code&gt; capabilities of the tdda library
are described in &lt;a href="https://www.tdda.info/the-new-referencetest-class-for-tdda"&gt;this post&lt;/a&gt;.
Its capabilities in the area of constraint discovery and verification
are discussed
in &lt;a href="https://www.tdda.info/constraint-discovery-and-verification-for-pandas-dataframes"&gt;this post&lt;/a&gt;,
and &lt;a href="https://www.tdda.info/the-tdda-constraints-file-format"&gt;this post&lt;/a&gt;.&lt;/p&gt;</content><category term="TDDA"></category><category term="tests"></category><category term="tagging"></category></entry><entry><title>Our Approach to Data Provenance</title><link href="https://tdda.info/our-approach-to-data-provenance.html" rel="alternate"></link><published>2017-12-12T15:30:00+00:00</published><updated>2017-12-12T15:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-12-12:/our-approach-to-data-provenance.html</id><summary type="html">&lt;p&gt;&lt;img src="https://www.tdda.info/images/Results2017_final_FINAL3-revised.png" width="875"
     alt="NEW DATA GOVERNANCE RULES: — We need to track data provenance. — No problem! We do that already! — We do? — We do! — (thinks) Results2017_final_FINAL3-revised.xlsx"/&gt;&lt;/p&gt;
&lt;p&gt;Our &lt;a href="https://www.tdda.info/data-provenance-and-data-lineage-the-view-from-the-podcasts.html#data-provenance-and-data-lineage-the-view-from-the-podcasts"&gt;previous post&lt;/a&gt;
introduced the idea of data provenance (a.k.a. data lineage),
which has been discussed on a
&lt;a href="https://nssdeviations.com/49-baltimore-is-the-home-of-cloud-computing"&gt;couple&lt;/a&gt;
of
&lt;a href="https://lineardigressions.com/episodes/2017/9/3/data-lineage"&gt;podcasts&lt;/a&gt;
recently.
This is an issue that is close to our hearts at Stochastic Solutions.
Here, we'll talk about how we handle this issue, both methodologically
and in …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img src="https://www.tdda.info/images/Results2017_final_FINAL3-revised.png" width="875"
     alt="NEW DATA GOVERNANCE RULES: — We need to track data provenance. — No problem! We do that already! — We do? — We do! — (thinks) Results2017_final_FINAL3-revised.xlsx"/&gt;&lt;/p&gt;
&lt;p&gt;Our &lt;a href="https://www.tdda.info/data-provenance-and-data-lineage-the-view-from-the-podcasts.html#data-provenance-and-data-lineage-the-view-from-the-podcasts"&gt;previous post&lt;/a&gt;
introduced the idea of data provenance (a.k.a. data lineage),
which has been discussed on a
&lt;a href="https://nssdeviations.com/49-baltimore-is-the-home-of-cloud-computing"&gt;couple&lt;/a&gt;
of
&lt;a href="https://lineardigressions.com/episodes/2017/9/3/data-lineage"&gt;podcasts&lt;/a&gt;
recently.
This is an issue that is close to our hearts at Stochastic Solutions.
Here, we'll talk about how we handle this issue, both methodologically
and in our &lt;a href="https://stochasticsolutions.com/#miro"&gt;Miró&lt;/a&gt; software.&lt;/p&gt;
&lt;p&gt;We'll focus on seven key ideas from our approach to data provenance:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Automatic Logging&lt;/li&gt;
&lt;li&gt;Audit trail stored in datasets&lt;/li&gt;
&lt;li&gt;Recording of field definitions and contexts&lt;/li&gt;
&lt;li&gt;Constraint generation and verification&lt;/li&gt;
&lt;li&gt;Data Dictionary Generation &amp;amp; Import&lt;/li&gt;
&lt;li&gt;Data signatures (hashing)&lt;/li&gt;
&lt;li&gt;Comparing datasets (diff commands)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="automatic-logging"&gt;Automatic Logging&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Automatic logging provides a concrete record of all analysis
performed in the software.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Our analysis software, Miró, is normally used through a scripting interface
that automatically writes several detailed logs. Of these,
the most important are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A log of all commands executed (in editable, re-executable form)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A full interleaved log of commands issued and the resulting output,
    in several forms (including plain text and rich HTML).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Logs build up in a year/month/day/session hierarchy indefinitely,
providing a comprehensive (and searchable) record of the analysis
that has been performed.&lt;/p&gt;
&lt;p&gt;Even when the software is used through the API, the sequence of operations
is recorded in the log, though in that case the ability to re-execute
the operations is normally lost.&lt;/p&gt;
&lt;h2 id="audit-trail"&gt;Audit trail&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;The audit trail in a dataset tracks changes to the data and
metadata across sessions, users, and machines, making it possible
to see the sequence of operations that led to the current
state of the data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Like a &lt;em&gt;table&lt;/em&gt; in a database, or a &lt;em&gt;DataFrame&lt;/em&gt; in Pandas or R,
Miró's most important data structure is a &lt;em&gt;dataset&lt;/em&gt;—a tabular
structure with named &lt;em&gt;fields&lt;/em&gt; as columns
and different observations (&lt;em&gt;records&lt;/em&gt;) stored in rows.
These form a column-based data store, and datasets can be saved to
disk with a &lt;code&gt;.miro&lt;/code&gt; extension—a folder that contains the typed data
together with rich metadata.&lt;/p&gt;
&lt;p&gt;Every time a change is made to a dataset, the operation that caused
the change is recorded in the &lt;em&gt;Audit Trail&lt;/em&gt; section of the dataset.
This is true both for changes to data and to metadata:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Examples of changes to data in a dataset include creating new fields,
    deleting fields, filtering out records, appending new records and
    (more rarely) changing the original values in the data.&lt;sup id="fnref:change"&gt;&lt;a class="footnote-ref" href="#fn:change"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Miró maintains many types of metadata, including field descriptions,
    field and dataset tags,
    binnings on fields, long and short names, custom colours
    and labels, a current record selection (filter)
    and various formatting information.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post, we'll illustrate most of these concepts using the following small
dataset containing transactions for four different customers, identified
by &lt;code&gt;id&lt;/code&gt;:&lt;/p&gt;
&lt;table class="solid sortable"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" title="id"&gt;id&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="date"&gt;date&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="categ"&gt;categ&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="amount"&gt;amount&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="Time Since Previous Transaction (days)"&gt;days-since-prev&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#F7DBD8"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-01-31 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#E9F2D7"&gt;1,000.00&lt;/td&gt;
        &lt;td style="color: #C0C0C0;"&gt;∅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#EFB8B3"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-02-02 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#D4E5B2"&gt;2,000.00&lt;/td&gt;
        &lt;td style="color: #C0C0C0;"&gt;∅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#EFB8B3"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-02-02 22:22:22&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#CFE2AA"&gt;2,222.22&lt;/td&gt;
        &lt;td bgcolor="#60BF70"&gt;0.93&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#E9F2D7"&gt;1,000.00&lt;/td&gt;
        &lt;td style="color: #C0C0C0;"&gt;∅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 13:33:33&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#C0D88F"&gt;3,000.00&lt;/td&gt;
        &lt;td bgcolor="#97D8A2"&gt;0.56&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 23:33:33&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#B9D484"&gt;3,333.33&lt;/td&gt;
        &lt;td bgcolor="#B0E3B8"&gt;0.42&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#E9F2D7"&gt;1,000.00&lt;/td&gt;
        &lt;td style="color: #C0C0C0;"&gt;∅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 04:44:44&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#E7F1D3"&gt;1,111.11&lt;/td&gt;
        &lt;td bgcolor="#D8F1DC"&gt;0.20&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 14:44:44&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td style="color: #C0C0C0;"&gt;∅&lt;/td&gt;
        &lt;td bgcolor="#B0E3B8"&gt;0.42&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 20:44:44&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#A5C663"&gt;4,444.44&lt;/td&gt;
        &lt;td bgcolor="#CEEED3"&gt;0.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Here is the audit trail recorded in that dataset:&lt;/p&gt;
&lt;table class="solid"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Date&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Description&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Command&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Host&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Session&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Line&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;2017/12/07 14:05:30&lt;/td&gt;
        &lt;td&gt;Load from flat file /Users/njr/python/artists/miro/testdata/trans.csv&lt;/td&gt;
        &lt;td&gt;load testdata/trans.csv&lt;/td&gt;
        &lt;td&gt;godel.local&lt;/td&gt;
        &lt;td&gt;/Users/njr/miro/log/2017/12/02/session142&lt;/td&gt;
        &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;2017/12/07 14:05:30&lt;/td&gt;
        &lt;td&gt;Save as /miro/datasets/trans.miro&lt;/td&gt;
        &lt;td&gt;save trans&lt;/td&gt;
        &lt;td&gt;godel.local&lt;/td&gt;
        &lt;td&gt;/Users/njr/miro/log/2017/12/02/session142&lt;/td&gt;
        &lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;2017/12/11 12:48:39&lt;/td&gt;
        &lt;td&gt;Set dataset description to &amp;quot;Some demo transactions&amp;quot;.&lt;/td&gt;
        &lt;td&gt;description -d &amp;quot;Some demo transactions&amp;quot;&lt;/td&gt;
        &lt;td&gt;bartok.local&lt;/td&gt;
        &lt;td&gt;/Users/njr/miro/log/2017/12/11/session013&lt;/td&gt;
        &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;2017/12/11 12:48:39&lt;/td&gt;
        &lt;td&gt;Set description for field amount to &amp;quot;transaction value (GBP)&amp;quot;.&lt;/td&gt;
        &lt;td&gt;description amount &amp;quot;transaction value (GBP)&amp;quot;&lt;/td&gt;
        &lt;td&gt;bartok.local&lt;/td&gt;
        &lt;td&gt;/Users/njr/miro/log/2017/12/11/session013&lt;/td&gt;
        &lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;2017/12/11 12:48:39&lt;/td&gt;
        &lt;td&gt;Defined field days-since-prev as (/ (- date (prev-by date id)) 24 60 60)&lt;/td&gt;
        &lt;td&gt;def days-since-prev (/ (- date (prev-by date id)) 24 60 60)&lt;/td&gt;
        &lt;td&gt;bartok.local&lt;/td&gt;
        &lt;td&gt;/Users/njr/miro/log/2017/12/11/session013&lt;/td&gt;
        &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;2017/12/11 12:48:39&lt;/td&gt;
        &lt;td&gt;Tag field days-since-prev with L=&amp;quot;Time Since Previous Transaction (days)&amp;quot;.&lt;/td&gt;
        &lt;td&gt;tag L=&amp;quot;Time Since Previous Transaction (days)&amp;quot; days-since-prev&lt;/td&gt;
        &lt;td&gt;bartok.local&lt;/td&gt;
        &lt;td&gt;/Users/njr/miro/log/2017/12/11/session013&lt;/td&gt;
        &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;2017/12/11 12:48:39&lt;/td&gt;
        &lt;td&gt;Save as /miro/datasets/trans.miro&lt;/td&gt;
        &lt;td&gt;save&lt;/td&gt;
        &lt;td&gt;bartok.local&lt;/td&gt;
        &lt;td&gt;/Users/njr/miro/log/2017/12/11/session013&lt;/td&gt;
        &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;In this case, the history is quite short, but includes information about&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;where the data originally came from (first line)&lt;/li&gt;
&lt;li&gt;when the data has been saved (second and seventh lines)&lt;/li&gt;
&lt;li&gt;metadata changes (third, fourth and sixth lines)&lt;/li&gt;
&lt;li&gt;changes to the data content (fifth line: in this case, creation of a new field,
    &lt;code&gt;days-since-prev&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;detail about when (column 1)
    and where (column 4) changes were made,
    in what session these occurred (column 5, linking to the logs),
    what commands were used (column 3)
    and where in the log files to find those commands and context
    (column 6).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="field-definitions-and-contexts"&gt;Field Definitions and Contexts&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Fields remember their definitions, including—where relevant—the context
in which they were created. This allows us to understand where any value
in a dataset came from, as a sequence of transformations of
input values.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In the previous section, we saw that the audit trail contained information
about a derived field in the data, including the expression used to
derive it.
That information is also available as a basic property of the field:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;[2]&amp;gt; ls -D days-since-prev
          Field                                 Definition
days-since-prev    (/ (- date (prev-by date id)) 24 60 60)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
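&lt;p&gt;To make the definition concrete, here is a minimal Python sketch of the same calculation. It is not Miró code: the function name &lt;code&gt;days_since_prev&lt;/code&gt; is hypothetical, and it assumes that &lt;code&gt;prev-by date id&lt;/code&gt; yields the date of the previous record with the same &lt;code&gt;id&lt;/code&gt; (null for the first), which is how the values in the table read.&lt;/p&gt;

```python
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60


def days_since_prev(records):
    """Return, for each (id, date) record, the days since that id's previous
    record, or None when the id has no earlier record.

    Assumes records are already ordered by date within each id.
    """
    last_seen = {}  # id -> date of the most recent earlier record
    result = []
    for rec_id, date in records:
        prev = last_seen.get(rec_id)
        if prev is None:
            result.append(None)
        else:
            result.append((date - prev).total_seconds() / SECONDS_PER_DAY)
        last_seen[rec_id] = date
    return result


# The first six records of the example dataset (id and date only).
trans = [
    (1, datetime(2009, 1, 31, 0, 0, 0)),
    (2, datetime(2009, 2, 2, 0, 0, 0)),
    (2, datetime(2009, 2, 2, 22, 22, 22)),
    (3, datetime(2009, 3, 3, 0, 0, 0)),
    (3, datetime(2009, 3, 3, 13, 33, 33)),
    (3, datetime(2009, 3, 3, 23, 33, 33)),
]

gaps = days_since_prev(trans)
for (rec_id, date), gap in zip(trans, gaps):
    shown = "null" if gap is None else format(gap, ".2f")
    print(rec_id, date, shown)
```

&lt;p&gt;Rounded to two decimal places, the gaps for customers 2 and 3 agree with the &lt;code&gt;days-since-prev&lt;/code&gt; column in the table above.&lt;/p&gt;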

&lt;p&gt;In other cases, entire datasets are created by taking "measurements"
from a base dataset, such as the transactional data shown above.
For example, we might want to create a dataset that measures how many
transactions each customer has, and their total value.&lt;/p&gt;
&lt;p&gt;One way of doing this in Miró is with the following command:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="cp"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="cp"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;xtab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-R&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;MEASURES&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;amount&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="nt"&gt;MEASURES&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;records&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;100&lt;/span&gt;&lt;span class="o"&gt;%)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;selected&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;which creates a new dataset.
As you might expect, this is the result:&lt;/p&gt;
&lt;table class="solid sortable"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" title="id"&gt;id&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="count"&gt;count&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="sumamount"&gt;sumamount&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td bgcolor="#FCECDC"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#FCF8EB"&gt;1,000.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#F4CFCC"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#F9DABA"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#F4E2AE"&gt;4,222.22&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#EAA29C"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F5C799"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;7,333.33&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#EED384"&gt;6,555.55&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The field definitions in the dataset created are, by default,
attached to the fields, as we can see:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MEASURES&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="w"&gt;                                  &lt;/span&gt;&lt;span class="n"&gt;Definition&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="n"&gt;Rollup&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;variable&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;trans&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;miro&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;trans&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;miro&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;sumamount&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;trans&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;miro&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Of course, the audit trail for &lt;code&gt;MEASURES&lt;/code&gt; also contains this information,
together with more detailed information about the session or sessions in which
the relevant commands were issued.&lt;/p&gt;
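&lt;p&gt;For readers who think in Python, the rollup itself can be sketched as follows. This is an illustration rather than Miró's implementation: &lt;code&gt;count_and_sum_by&lt;/code&gt; is a hypothetical name, and skipping null amounts when summing is an assumption, though one consistent with the total shown for customer 4.&lt;/p&gt;

```python
from collections import defaultdict


def count_and_sum_by(records):
    """Roll up (id, amount) pairs into {id: (count, sum of non-null amounts)}."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for rec_id, amount in records:
        counts[rec_id] += 1        # every record counts, null amount or not
        if amount is not None:     # assumption: nulls are skipped when summing
            totals[rec_id] += amount
    return {k: (counts[k], round(totals[k], 2)) for k in sorted(counts)}


# The id and amount fields of the example transactions; None marks the null.
trans = [
    (1, 1000.00), (2, 2000.00), (2, 2222.22),
    (3, 1000.00), (3, 3000.00), (3, 3333.33),
    (4, 1000.00), (4, 1111.11), (4, None), (4, 4444.44),
]

measures = count_and_sum_by(trans)
for rec_id, (count, total) in measures.items():
    print(rec_id, count, format(total, ",.2f"))
```

&lt;p&gt;Unlike the Miró dataset, the dictionary returned here carries no record of how its values were derived, which is exactly the gap that field definitions and the audit trail fill.&lt;/p&gt;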
&lt;h2 id="constraint-generation-and-verification"&gt;Constraint Generation and Verification&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Automatically generated constraints can be used to identify anomalous
and possibly incorrect data within a source dataset. They can also
be used to check that new data with the same structure has similar
or expected properties.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We've talked in previous posts (&lt;a href="https://www.tdda.info/constraint-discovery-and-verification-for-pandas-dataframes" title="Constraint Discovery and Verification for Pandas DataFrames"&gt;here&lt;/a&gt;, &lt;a href="https://www.tdda.info/the-tdda-constraints-file-format" title="The TDDA Constraints File Format"&gt;here&lt;/a&gt;,
&lt;a href="https://www.tdda.info/introducing-rexpy-automatic-discovery-of-regular-expressions" title="Introducing Rexpy"&gt;here&lt;/a&gt;, and &lt;a href="https://www.tdda.info/constraint-generation-in-the-presence-of-bad-data" title="Constraint Generation in the Presence of Bad Data"&gt;here&lt;/a&gt;) and in a &lt;a href="https://www.tdda.info/pdf/tdda-constraint-generation-and-verification.pdf" title="White Paper: Automatic Constraint Generation and Verification"&gt;white paper&lt;/a&gt;
about automatically generating constraints that characterize either
all of, or an inferred "good" subset of, the data in a dataset.  Such
constraints are useful for finding bad and anomalous data in the
original dataset, and also for checking ("verifying") new data as it
comes in.&lt;/p&gt;
&lt;p&gt;We won't go over all the details of constraint generation and
verification in this post, but do note that this relates to Roger
Peng's idea, discussed in the &lt;a href="https://www.tdda.info/data-provenance-and-data-lineage-the-view-from-the-podcasts" title="Data Provenance and Data Lineage: the View from the Podcasts"&gt;last post&lt;/a&gt;, of tracking
changes to data tests as a surrogate for tracking changes to the
actual data. Obviously, having generated constraints, it's a good
idea to store the constraints under version control, to facilitate
such tracking.
More directly, the results of verification allow you to see some
changes to data directly.&lt;/p&gt;
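&lt;p&gt;As a reminder of the basic idea, here is a deliberately tiny Python sketch of generating constraints from one field and then verifying data against them. The functions &lt;code&gt;discover&lt;/code&gt; and &lt;code&gt;verify&lt;/code&gt; and the three constraint names are hypothetical, chosen for illustration; this is neither the tdda library's API nor its constraints file format, both of which cover far more.&lt;/p&gt;

```python
def discover(values):
    """Generate simple constraints from the values of one field."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "nulls_allowed": len(present) != len(values),
    }


def verify(values, constraints):
    """Return the names of the constraints that the values fail."""
    present = [v for v in values if v is not None]
    failures = []
    if present and constraints["min"] > min(present):
        failures.append("min")
    if present and max(present) > constraints["max"]:
        failures.append("max")
    if not constraints["nulls_allowed"] and len(present) != len(values):
        failures.append("nulls_allowed")
    return failures


# The amount field of the example transactions; None marks the null.
amounts = [1000.00, 2000.00, 2222.22, 1000.00, 3000.00,
           3333.33, 1000.00, 1111.11, None, 4444.44]

constraints = discover(amounts)
print(constraints)
print(verify(amounts, constraints))            # the source data checked against itself
print(verify([1500.00, -20.00], constraints))  # new data with a negative amount
```

&lt;p&gt;Because the constraints here are a plain dictionary, they could be serialized and kept under version control, which is what makes tracking changes to them practical.&lt;/p&gt;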
&lt;h2 id="data-dictionary"&gt;Data Dictionary&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;The data dictionary provides a useful reference for any user of the data,
particularly when it includes helpful (and up-to-date) annotations.
By making it easily editable, we encourage users to record useful
information alongside the data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Miró can generate a data dictionary from the data.
This contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Summary information about the overall dataset&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Per-field information, including&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;basic metadata about each field, including name and type&lt;/li&gt;
&lt;li&gt;some information about the range of each field, including minimum
    and maximum values, null count etc.&lt;/li&gt;
&lt;li&gt;further characterization information, including whether there
    are duplicate values&lt;/li&gt;
&lt;li&gt;any annotations that have been added to the dataset (such as
    descriptions, alternate names, tags etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's an example of the first part of the data dictionary for our
transaction dataset:&lt;/p&gt;
&lt;table class="solid"&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" align="right"&gt;Name&lt;/th&gt;
        &lt;td&gt;trans.miro&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" align="right"&gt;Path&lt;/th&gt;
        &lt;td&gt;/miro/datasets/trans.miro&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" align="right"&gt;Host&lt;/th&gt;
        &lt;td&gt;bartok.local&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" align="right"&gt;Hash&lt;/th&gt;
        &lt;td&gt;0971dde52de7bc2fb2ad2282a572f6ca295c33fae105d8b0fab7a618f4c70b71&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" align="right"&gt;Description&lt;/th&gt;
        &lt;td&gt;Some demo transactions&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" align="right"&gt;Tags&lt;/th&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" align="right"&gt;Creation Date&lt;/th&gt;
        &lt;td&gt;2017-12-11 13:33:36&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" align="right"&gt;Number of Records&lt;/th&gt;
        &lt;td&gt;10&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" align="right"&gt;Number of Fields&lt;/th&gt;
        &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;And here's the second part, with an entry for each field:&lt;/p&gt;
&lt;table class="solid"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Name&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Type&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Min&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Max&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Min Length&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Max Length&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;# nulls&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;# empty/zero&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;# positive&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;# negative&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Any duplicates&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Values&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Description&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Tags&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Long Name&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Short Name&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Definition&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;id&lt;/td&gt;
        &lt;td&gt;int&lt;/td&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td&gt;4&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;10&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;yes&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;date&lt;/td&gt;
        &lt;td&gt;date&lt;/td&gt;
        &lt;td&gt;2009-01-31T00:00:00&lt;/td&gt;
        &lt;td&gt;2009-04-04T20:44:44&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;10&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;no&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;categ&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td&gt;B&lt;/td&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td&gt;yes&lt;/td&gt;
        &lt;td&gt;&amp;quot;A&amp;quot; &amp;quot;B&amp;quot;&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;amount&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;1,000.00&lt;/td&gt;
        &lt;td&gt;4,444.44&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;9&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;yes&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;transaction value (GBP)&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;days-since-prev&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;0.20&lt;/td&gt;
        &lt;td&gt;0.93&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td&gt;4&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;6&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;yes&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td&gt;Time Since Previous Transaction (days)&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;(/ (- date (prev-by date id)) 24 60 60)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;While some of the information in the data dictionary is derived
directly from the data, other parts (descriptions, alternate names,
and tags) are created by annotation actions, whether by humans, bots or
scripts. Although there are Miró commands for setting all the
(editable) metadata properties, to encourage maximum use, Miró can
also export the metadata to a spreadsheet, where users can update it,
and then the appropriate parts of the metadata can be re-imported
using Miró's &lt;code&gt;importmetadata&lt;/code&gt; command.&lt;/p&gt;
&lt;h2 id="data-signatures"&gt;Data signatures&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;A data signature is a very compact way to summarize a dataset.
This allows quick and efficient checking that analysis is being
performed using the correct data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One of the properties reported in the data dictionary (the first table
above) is a &lt;em&gt;hash&lt;/em&gt;. If you're not familiar with hashing, a hash
function is one that takes a (typically large) input and converts it
to a much smaller output. Hash functions are designed so that
different inputs tend (but are not guaranteed) to map to
different outputs.  So if you store the hashes of two large objects
and they are different, you can be certain that the objects are
different. If the hashes are the same, this does not absolutely
guarantee that the objects are the same, but hash functions are
designed to make so-called "hash collisions" extremely rare,
especially between similar inputs. For most practical purposes,
therefore, we can safely assume that if two hashes are the same, the
inputs were the same.&lt;sup id="fnref:git"&gt;&lt;a class="footnote-ref" href="#fn:git"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The point of storing the hash is that it acts as a much smaller,
very efficient proxy for our original data, and if we want to know whether
some dataset we have lying around contains the same data as the one
used to generate the data dictionary, all we have to do is compare the hashes.&lt;/p&gt;
&lt;p&gt;The hashes that Miró constructs for fields use &lt;em&gt;only&lt;/em&gt; the data values,
not any metadata, as inputs.&lt;sup id="fnref:fieldnames"&gt;&lt;a class="footnote-ref" href="#fn:fieldnames"&gt;3&lt;/a&gt;&lt;/sup&gt;
This is a choice, of course, and we could also hash some or all of the
metadata, but our primary concern here is whether we have the same
underlying data or not, so we view it as an advantage that the hash
is based solely on the data values.&lt;/p&gt;
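&lt;p&gt;The following Python sketch illustrates a signature that depends only on data values. It is not Miró's hashing scheme: the use of SHA-256, the byte encoding of values and the function name &lt;code&gt;data_signature&lt;/code&gt; are all assumptions made for the example.&lt;/p&gt;

```python
import hashlib


def data_signature(columns):
    """SHA-256 hex digest over data values only, ignoring field names.

    columns: list of (field_name, list_of_values) pairs, in field order.
    """
    digest = hashlib.sha256()
    for _name, values in columns:       # the name is deliberately not hashed
        for value in values:
            digest.update(repr(value).encode("utf-8"))
            digest.update(b"\x1f")      # value separator, so 1,23 and 12,3 differ
        digest.update(b"\x1e")          # end-of-field marker
    return digest.hexdigest()


original = [("id", [1, 2, 2]), ("amount", [1000.00, 2000.00, 2222.22])]
renamed = [("customer", [1, 2, 2]), ("value", [1000.00, 2000.00, 2222.22])]
edited = [("id", [1, 2, 2]), ("amount", [1000.00, 2000.00, 2222.23])]

# Renaming fields leaves the signature unchanged; editing a value changes it.
print(data_signature(original) == data_signature(renamed))
print(data_signature(original) == data_signature(edited))
```

&lt;p&gt;Hashing the field names as well would be a one-line change, but it would make two datasets with identical values and different names look different, which is not the question being asked here.&lt;/p&gt;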
&lt;p&gt;There is also an option to store hashes for individual fields, which
we did not use when generating the data dictionary shown.&lt;/p&gt;
&lt;h2 id="comparing-datasets-diff-commands"&gt;Comparing Datasets (diff commands)&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;The ability to compare two datasets, and when they are different,
to see clearly what the differences are, is as fundamental as the
ability to compare two files in Unix or git, or to perform a
track changes operation on a Word document.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If we want to compare two datasets to see if they are the same, comparing
hashes is a very efficient way to do so.&lt;/p&gt;
&lt;p&gt;If, however, we actually want to understand the differences between datasets,
we need something more like Unix's &lt;code&gt;diff&lt;/code&gt; command,
which was discussed in the &lt;a href="https://www.tdda.info/data-provenance-and-data-lineage-the-view-from-the-podcasts" title="Data Provenance and Data Lineage: the View from the Podcasts"&gt;previous post&lt;/a&gt;,
or Microsoft Word's &lt;em&gt;Compare Documents&lt;/em&gt; functionality.&lt;/p&gt;
&lt;p&gt;Miró includes a &lt;code&gt;ddiff&lt;/code&gt; command for comparing two datasets.
Let's look at an example.&lt;/p&gt;
&lt;p&gt;Here's our transaction data again, without the derived
&lt;code&gt;days-since-prev&lt;/code&gt; field:&lt;/p&gt;
&lt;table class="solid sortable"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" title="id"&gt;id&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="date"&gt;date&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="categ"&gt;categ&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="amount"&gt;amount&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#F7DBD8"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-01-31 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#E9F2D7"&gt;1,000.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#EFB8B3"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-02-02 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#D4E5B2"&gt;2,000.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#EFB8B3"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-02-02 22:22:22&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#CFE2AA"&gt;2,222.22&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#E9F2D7"&gt;1,000.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 13:33:33&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#C0D88F"&gt;3,000.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 23:33:33&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#B9D484"&gt;3,333.33&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#E9F2D7"&gt;1,000.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 04:44:44&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#E7F1D3"&gt;1,111.11&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 14:44:44&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td style="color: #C0C0C0;"&gt;∅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 20:44:44&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#A5C663"&gt;4,444.44&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;and here's a variant of it:&lt;/p&gt;
&lt;table class="solid sortable"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" title="id"&gt;id&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="date"&gt;date&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="categ"&gt;categ&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="amount"&gt;amount&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#F7DBD8"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-01-31 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#E9F2D7"&gt;1,000.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#EFB8B3"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-02-02 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#D4E5B2"&gt;2,000.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#EFB8B3"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-02-02 22:22:22&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#CFE2AA"&gt;2,222.22&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#E9F2D7"&gt;1,000.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 13:33:33&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#C0D88F"&gt;3,000.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 23:33:33&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#B9D484"&gt;3,333.33&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#E9F2D7"&gt;1,000.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 04:44:44&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#E7F1D3"&gt;1,111.11&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 14:44:44&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#AFCD74"&gt;3,874.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 20:44:44&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#A5C663"&gt;4,444.44&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;These datasets are small enough that you can probably see the differences by
inspection (though it might take a little while to be confident that you've
spotted them all), but when there are millions of rows and
hundreds of columns, that becomes less easy.&lt;/p&gt;
&lt;p&gt;Using the &lt;code&gt;hash&lt;/code&gt; trick we talked about previously, we can see whether there
are any differences, and slightly more besides. Assuming the current
working dataset in Miró is the first, and the second is in &lt;code&gt;TRANS2&lt;/code&gt;, we can
hash them both:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;[46]&amp;gt;  hash
19451c13321284f2b0dd7736b75b443945fac1eae08a8118d600ec0d49b6bf87 id
f3c08f1d2d23abaa06da0529237f63bf8099053b9088328dfd5642d9b06e8f6a date
3a0070ae42b9f341e7e266a18ea1c78c7d8be093cb628c7be06b6175c8b09f23 categ
f5b3f6284f7d510f22df32daac2784597122d5c14b89b5355464ce05f84ce120 amount

b89ae4b74f95187ecc5d49ddd7f45a64849a603539044ae318a06c2dc7292cf9 combined

[47]&amp;gt;  TRANS2.hash
19451c13321284f2b0dd7736b75b443945fac1eae08a8118d600ec0d49b6bf87 id
f3c08f1d2d23abaa06da0529237f63bf8099053b9088328dfd5642d9b06e8f6a date
fcd11dbd69eee0bf6d2a405a6e4ef9227bb3f0279d9cc7866e2efe5b4c97112c categ
64c5c97e9e9676ec085b522303d75ff11b0ebe01a1ceebaf003719b3718f12bb amount

2e171d2a24183e5e25bbcc50d9cd99ad8b4ca48ee7e1abfa6027edd291a22584 combined
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can see immediately that these are different, but that the individual
hashes for the &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;date&lt;/code&gt; fields are the same, indicating that
their content is (almost certainly) the same. It's the &lt;code&gt;categ&lt;/code&gt; and &lt;code&gt;amount&lt;/code&gt;
fields that differ between the two datasets.&lt;/p&gt;
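&lt;p&gt;To make the idea concrete, here is a minimal Python sketch of per-field and
combined hashing. It is an illustration only, not Miró's actual algorithm: the
function names &lt;code&gt;field_hash&lt;/code&gt; and &lt;code&gt;dataset_hashes&lt;/code&gt;, the way values are
encoded, and the tiny example datasets are all assumptions made for this sketch.&lt;/p&gt;

```python
import hashlib


def field_hash(values):
    """Hash one field's values in order; None stands for a missing value."""
    h = hashlib.sha256()
    for v in values:
        # Tag values so that a missing value cannot collide with a real one.
        token = "N" if v is None else "V" + repr(v)
        h.update(token.encode("utf-8") + b"\x00")
    return h.hexdigest()


def dataset_hashes(columns):
    """columns maps field name to a list of values (dict order = field order)."""
    per_field = {name: field_hash(vals) for name, vals in columns.items()}
    combined = hashlib.sha256()
    for name, digest in per_field.items():
        # Unlike the field hashes, the combined hash includes names and order.
        combined.update((name + "\x00" + digest + "\x00").encode("utf-8"))
    return per_field, combined.hexdigest()


# Hypothetical miniature versions of the two datasets
first = {"id": [1, 2, 2], "categ": ["A", "A", "B"], "amount": [1000.0, 2000.0, None]}
second = {"id": [1, 2, 2], "categ": ["A", "A", "A"], "amount": [1000.0, 2000.0, 3874.18]}

h1, c1 = dataset_hashes(first)
h2, c2 = dataset_hashes(second)
differing = [name for name in first if h1[name] != h2[name]]
print(differing)  # ['categ', 'amount']
```

&lt;p&gt;Because the field hashes in this sketch ignore field names, renaming a field
leaves its hash unchanged while changing the combined hash, which mirrors the
behaviour described in the footnote on field names.&lt;/p&gt;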
&lt;p&gt;We can use the &lt;code&gt;ddiff&lt;/code&gt; command to get a more detailed diagnosis:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;[49]&amp;gt;  ddiff -P TRANS2

Number of differences       Field Pair
           0:                   id : id-2
           0:                 date : date-2
           1:                categ : categ-2
           1:               amount : amount-2

Diff fields:
           1: diff-amount
           1: diff-categ

Total number of differences found: 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The output here confirms that there are no differences between the
&lt;code&gt;id&lt;/code&gt; and &lt;code&gt;date&lt;/code&gt; fields in the two datasets, but that
one value differs for each of the &lt;code&gt;categ&lt;/code&gt; and &lt;code&gt;amount&lt;/code&gt; fields.
The &lt;code&gt;-P&lt;/code&gt; flag that we passed to the &lt;code&gt;ddiff&lt;/code&gt; command told it to
&lt;em&gt;preserve&lt;/em&gt; information about the differences, and if we now look at
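the data, we see five extra fields (on the first dataset). For each field
that differs, there is a copy of the field as it was in the other dataset,
&lt;code&gt;TRANS2&lt;/code&gt; (suffixed &lt;code&gt;-2&lt;/code&gt;), and a &lt;code&gt;diff-&lt;/code&gt; field flagging the
records on which the two differ. The fifth is an overall
&lt;code&gt;diff&lt;/code&gt; field showing whether each record has &lt;em&gt;any&lt;/em&gt; differences
across the two datasets.&lt;/p&gt;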
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;[&lt;span class="mi"&gt;50&lt;/span&gt;]&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;show&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;table class="solid sortable"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" title="id"&gt;id&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="date"&gt;date&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="categ"&gt;categ&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="categ 2"&gt;categ-2&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="diff categ"&gt;diff-categ&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="amount"&gt;amount&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="amount 2"&gt;amount-2&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="diff amount"&gt;diff-amount&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="diff"&gt;diff&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#F7DBD8"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-01-31 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td bgcolor="#D7F2E9"&gt;1,000.00&lt;/td&gt;
        &lt;td bgcolor="#D9F0F5"&gt;1,000.00&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#EFB8B3"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-02-02 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td bgcolor="#B2E5D4"&gt;2,000.00&lt;/td&gt;
        &lt;td bgcolor="#B6E2EB"&gt;2,000.00&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#EFB8B3"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#F2B67A"&gt;2009-02-02 22:22:22&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#A5C663"&gt;B&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td bgcolor="#AAE2CF"&gt;2,222.22&lt;/td&gt;
        &lt;td bgcolor="#AFDFE9"&gt;2,222.22&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td bgcolor="#D7F2E9"&gt;1,000.00&lt;/td&gt;
        &lt;td bgcolor="#D9F0F5"&gt;1,000.00&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 13:33:33&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#A5C663"&gt;B&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td bgcolor="#8FD8C0"&gt;3,000.00&lt;/td&gt;
        &lt;td bgcolor="#95D4E1"&gt;3,000.00&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E79890"&gt;3&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-03-03 23:33:33&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#A5C663"&gt;B&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td bgcolor="#84D4B9"&gt;3,333.33&lt;/td&gt;
        &lt;td bgcolor="#8AD0DE"&gt;3,333.33&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 00:00:00&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td bgcolor="#D7F2E9"&gt;1,000.00&lt;/td&gt;
        &lt;td bgcolor="#D9F0F5"&gt;1,000.00&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 04:44:44&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#A5C663"&gt;B&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td bgcolor="#D3F1E7"&gt;1,111.11&lt;/td&gt;
        &lt;td bgcolor="#D5EFF4"&gt;1,111.11&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 14:44:44&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td bgcolor="#A5C663"&gt;B&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td style="color: #C0C0C0;"&gt;∅&lt;/td&gt;
        &lt;td bgcolor="#7AC8D8"&gt;3,874.18&lt;/td&gt;
        &lt;td bgcolor="#70A7DF"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#B176EC"&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;4&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2009-04-04 20:44:44&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;B&lt;/td&gt;
        &lt;td&gt;A&lt;/td&gt;
        &lt;td bgcolor="#60BF70"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#63C6A5"&gt;4,444.44&lt;/td&gt;
        &lt;td bgcolor="#69C1D2"&gt;4,444.44&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td bgcolor="#B176EC"&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This makes it easy to identify and select only those fields
or records with differences, which is one of the key tasks
when trying to track data lineage.&lt;/p&gt;
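&lt;p&gt;The following Python sketch shows roughly what a record-by-record comparison
of this kind involves. It is not Miró's implementation: the function
&lt;code&gt;ddiff_by_position&lt;/code&gt; is hypothetical, it assumes both datasets have the same
fields and the same number of records, and it simply borrows the
&lt;code&gt;-2&lt;/code&gt; and &lt;code&gt;diff-&lt;/code&gt; naming seen in the output above.&lt;/p&gt;

```python
def ddiff_by_position(left, right):
    """Compare two datasets (dicts of equal-length lists) record by record."""
    n_records = len(next(iter(left.values())))
    counts = {}
    preserved = {}
    overall = [0] * n_records
    for name, left_vals in left.items():
        right_vals = right[name]
        flags = [0 if a == b else 1 for a, b in zip(left_vals, right_vals)]
        counts[name] = sum(flags)
        if counts[name]:
            # Keep the other dataset's values and a per-record flag field.
            preserved[name + "-2"] = right_vals
            preserved["diff-" + name] = flags
            overall = [max(o, f) for o, f in zip(overall, flags)]
    preserved["diff"] = overall
    return counts, preserved


# Hypothetical data modelled on the last three records of the tables above
first = {"id": [4, 4, 4], "categ": ["B", "B", "B"], "amount": [1111.11, None, 4444.44]}
second = {"id": [4, 4, 4], "categ": ["B", "B", "A"], "amount": [1111.11, 3874.18, 4444.44]}

counts, preserved = ddiff_by_position(first, second)
changed_records = [i for i, d in enumerate(preserved["diff"]) if d]
print(counts, changed_records)
```

&lt;p&gt;Selecting the records where the overall &lt;code&gt;diff&lt;/code&gt; flag is set then
isolates exactly the rows that need investigating.&lt;/p&gt;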
&lt;p&gt;As powerful as Miró's &lt;code&gt;ddiff&lt;/code&gt; and related commands are, there is also
much more that we would like (and plan) to support. Our comparison is
fundamentally based on &lt;em&gt;joining&lt;/em&gt; the two datasets (either on one or more
nominated key fields, or, as in this case, implicitly on record number).
When we are using a join key, it's quite easy to deal with row
addition and deletion, but that is harder when we are just joining on
record number.  It would be useful to have a Unix &lt;code&gt;diff&lt;/code&gt;-like ability to
spot single rows or groups of rows that have been inserted, deleted,
or re-ordered, but we don't have that today. In certain cases,
spotting other kinds of systematic edits would be interesting—for
example, thinking of the table in spreadsheet-like terms, it would be
useful to spot cases in which blocks of cells shift up, down, left or
right. This situation is not very common in the data we most commonly
work with, but there are domains in which those sorts of changes
might be frequent.&lt;/p&gt;
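&lt;p&gt;To illustrate why a join key makes row addition and deletion easy to handle,
here is a small Python sketch. The function &lt;code&gt;diff_by_key&lt;/code&gt; and the example
rows are hypothetical, and the sketch assumes the key is unique within each
dataset; it is not a description of how Miró does this.&lt;/p&gt;

```python
def diff_by_key(left_rows, right_rows, key):
    """Join two lists of row dicts on a unique key field and classify rows."""
    left = {row[key]: row for row in left_rows}
    right = {row[key]: row for row in right_rows}
    deleted = [k for k in left if k not in right]
    added = [k for k in right if k not in left]
    # For keys present on both sides, list the fields whose values differ.
    changed = {
        k: sorted(f for f in left[k] if left[k][f] != right[k].get(f))
        for k in left
        if k in right and left[k] != right[k]
    }
    return added, deleted, changed


old = [{"id": 1, "categ": "A"}, {"id": 2, "categ": "B"}, {"id": 3, "categ": "B"}]
new = [{"id": 1, "categ": "A"}, {"id": 3, "categ": "A"}, {"id": 4, "categ": "B"}]

added, deleted, changed = diff_by_key(old, new, "id")
print(added, deleted, changed)
```

&lt;p&gt;With a purely positional comparison, the deletion of the second row would
instead show up as spurious differences in every row after it.&lt;/p&gt;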
&lt;h2 id="what-next"&gt;What Next?&lt;/h2&gt;
&lt;p&gt;We surveyed a few of the ways we think about and implement features in
our software (and workflows) to help track data provenance and data lineage.
There's a great deal more we could do, and over time we will almost
certainly add more. Hopefully these ideas will prove useful and interesting,
and obviously any of you fortunate enough to use Miró can try them out.&lt;/p&gt;
&lt;p&gt;We'll keep you posted as we extend our thinking.&lt;/p&gt;
&lt;h2 id="credits"&gt;Credits&lt;/h2&gt;
&lt;p&gt;Thanks to our Social Media Manager, for actively policing our content
even as it was being created.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.tdda.info/images/social-media-manager.jpeg" width="600"
     alt="Alfie, our Social Media Manager, inspects progress on the blog post."/&gt;&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:change"&gt;
&lt;p&gt;Changing (original) values in data is actually so rare that
Miró provides very few facilities for doing so directly, but data
can effectively be changed through deleting one field or record and
adding another, and there are a couple of operations that can change
values in place—primarily anonymization operations and functions
that update special, automatically created fields, albeit usually by
deletion and regeneration. In fact, one of the rules on our "unwritten
list" of good practices is never to replace an input field with an
edited copy, but instead always to derive a new field with a variant name,
so that when we see a field from source data we know its values have
not been altered.&amp;#160;&lt;a class="footnote-backref" href="#fnref:change" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:git"&gt;
&lt;p&gt;A familiar example comes from git,
which uses hashes to compare the contents of files efficiently, and
also uses a SHA-1 hash to identify a commit, by hashing all of the
important information about that commit.&amp;#160;&lt;a class="footnote-backref" href="#fnref:git" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:fieldnames"&gt;
&lt;p&gt;The overall hash depends on the name of the fields and
their order, as well as the data in the fields, but the individual
field hashes do not. As a result, two fields containing the same values
(in the same order) will receive the same hash, but datasets in which
fields have been renamed or reordered will have different overall hashes.
We should note that missing values (NULLs) also contribute to
the field hashes.&amp;#160;&lt;a class="footnote-backref" href="#fnref:fieldnames" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="data lineage"></category><category term="data provenance"></category><category term="data governance"></category><category term="tdda"></category><category term="constraints"></category><category term="miro"></category></entry><entry><title>Data Provenance and Data Lineage: the View from the Podcasts</title><link href="https://tdda.info/data-provenance-and-data-lineage-the-view-from-the-podcasts.html" rel="alternate"></link><published>2017-11-30T15:30:00+00:00</published><updated>2017-11-30T15:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-11-30:/data-provenance-and-data-lineage-the-view-from-the-podcasts.html</id><summary type="html">&lt;p&gt;In Episode 49 of the
&lt;a href="https://nssdeviations.com/49-baltimore-is-the-home-of-cloud-computing"&gt;Not So Standard Deviations&lt;/a&gt;
podcast, the final segment (starting at 59:32)
discusses &lt;em&gt;data lineage,&lt;/em&gt;
after Roger Peng listened to the September 3rd (2017)
episode of another podcast, &lt;a href="https://lineardigressions.com/episodes/2017/9/3/data-lineage"&gt;Linear Digressions&lt;/a&gt;,
which discussed that subject.&lt;/p&gt;
&lt;p&gt;This is a topic very close to our hearts, and I …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In Episode 49 of the
&lt;a href="https://nssdeviations.com/49-baltimore-is-the-home-of-cloud-computing"&gt;Not So Standard Deviations&lt;/a&gt;
podcast, the final segment (starting at 59:32)
discusses &lt;em&gt;data lineage,&lt;/em&gt;
after Roger Peng listened to the September 3rd (2017)
episode of another podcast, &lt;a href="https://lineardigressions.com/episodes/2017/9/3/data-lineage"&gt;Linear Digressions&lt;/a&gt;,
which discussed that subject.&lt;/p&gt;
&lt;p&gt;This is a topic very close to our hearts, and I thought it would be useful
to summarize the discussions on both podcasts, as a precursor to writing
up how we approach some of these issues at
&lt;a href="https://stochasticsolutions.com"&gt;Stochastic Solutions&lt;/a&gt;—in our work,
in our Miró software and through the TDDA approaches
discussed on this blog.&lt;/p&gt;
&lt;p&gt;It probably makes sense to begin by summarizing the Linear Digressions
episode, in which Katie Malone explains the ideas of &lt;em&gt;data lineage&lt;/em&gt;
(also known as &lt;em&gt;data provenance&lt;/em&gt;)
as the tracking of changes to a dataset.&lt;/p&gt;
&lt;p&gt;Any dataset starts from one or more "original sources"—the sensors that
first measured the quantities recorded, or the system that generated
the original data. In almost all cases, a series of transformations
is then applied to the data before it is finally used in some given
application. For example, in machine learning, Katie describes typical
transformation stages as:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Cleaning the data&lt;/li&gt;
&lt;li&gt;Making additions, subtractions and merges to the dataset&lt;/li&gt;
&lt;li&gt;Aggregating the data in some way&lt;/li&gt;
&lt;li&gt;Imputing missing values&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;She describes this as the &lt;em&gt;process view&lt;/em&gt; of data lineage.&lt;/p&gt;
&lt;p&gt;An alternative perspective focuses less on the processes than the
resulting sequence of datasets, as snapshots.  As her co-host, Ben
Jaffe points out, this is more akin to the way version control systems
view files or collections of files, and &lt;em&gt;diff&lt;/em&gt; tools (see below)
effectively reconstruct the process view&lt;sup id="fnref:roughly"&gt;&lt;a class="footnote-ref" href="#fn:roughly"&gt;1&lt;/a&gt;&lt;/sup&gt; from the data.&lt;/p&gt;
&lt;p&gt;In terms of tooling, Katie reckons that the tools for tracking data lineage
are relatively well developed (but specialist) in large scientific
collaborations such as particle physics (she used to work on the LHC
at CERN) and genomics, but her sense is tools are less well developed
in many business contexts.&lt;/p&gt;
&lt;p&gt;She then describes five reasons to care about data provenance:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;(To improve/track/ensure) data quality&lt;/li&gt;
&lt;li&gt;(To provide) an audit trail&lt;/li&gt;
&lt;li&gt;(To aid) replication (her example was rebuilding a lost predictive model from knowledge of how it had been produced, subject, obviously, to having not only the data but also full details of the parameters, training regime and any random number seeds used)&lt;/li&gt;
&lt;li&gt;(To support) attribution (e.g. providing a credit to the original
   data collector when publishing a paper)&lt;/li&gt;
&lt;li&gt;Informational (i.e. to keep track of and aid navigation within large collections of datasets).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I recommend listening to the episode.&lt;/p&gt;
&lt;p&gt;In &lt;a href="https://nssdeviations.com/49-baltimore-is-the-home-of-cloud-computing"&gt;Not-so-Standard Deviations 49&lt;/a&gt;, Roger Peng introduces the idea of
tracking and versioning data, as a result of listening to Katie
and Ben discuss the issue on their podcast.
Roger argues that while you can stick a dataset into Git or other version
control software, doing so is not terribly helpful because, most of
the time, the dataset acts essentially as a blob,&lt;sup id="fnref:blob"&gt;&lt;a class="footnote-ref" href="#fn:blob"&gt;2&lt;/a&gt;&lt;/sup&gt; rather than as
a structured entity that diff tools help you to understand.&lt;/p&gt;
&lt;p&gt;In this, he is exactly right. In case you're not familiar with version
control and diff tools, let me illustrate the point.
In &lt;a href="https://www.tdda.info/introducing-rexpy-automatic-discovery-of-regular-expressions"&gt;a previous post on
Rexpy&lt;/a&gt;,
I added a link to a page about our Miró software between two edits. If I run
the relevant &lt;code&gt;git diff&lt;/code&gt; command, this is its output:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.tdda.info/images/cli-diff.png" width="578" alt="git diff of two versions of the markdown for a blog post"/&gt;&lt;/p&gt;
&lt;p&gt;As you can see, this shows pretty clearly what's changed.
Using a visual &lt;code&gt;diff&lt;/code&gt; tool, we get an even clearer picture of the changes
(especially when the changes are more numerous and complex):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.tdda.info/images/visual-diff.png" width="1145" alt="visual diff (opendiff) of two versions of the markdown for a blog post"/&gt;&lt;/p&gt;
&lt;p&gt;In contrast, if I do a &lt;code&gt;diff&lt;/code&gt; on two Excel files stored in Git, I get
the following:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;git diff b1b85ddc448a723845c36688480cfe5072f28c1a -- test-excel-sheet1.xlsx 
diff --git a/testdata/test-excel-sheet1.xlsx b/testdata/test-excel-sheet1.xlsx
index 0bd63cb..91e5b0e 100644
Binary files a/testdata/test-excel-sheet1.xlsx and b/testdata/test-excel-sheet1.xlsx differ
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is better than nothing, but gives no insight into &lt;em&gt;what&lt;/em&gt; has
changed.  (In fact, it's worse than it looks, because even changes to
the metadata inside an Excel file, such as the location of the
selected cell, will cause the files to be shown as different. As a
result, there are many hard-to-detect false positives when using
&lt;code&gt;diff&lt;/code&gt; commands with binary files.)&lt;/p&gt;
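&lt;p&gt;The contrast between the two situations can be sketched in a few lines of
Python using the standard-library &lt;code&gt;difflib&lt;/code&gt; module. The two versions of a
small CSV-like dataset below are invented for illustration. Treated as opaque
bytes, all we learn is that the versions differ; treated as lines of text, a
diff can say which line changed.&lt;/p&gt;

```python
import difflib

# Two hypothetical versions of a small text (CSV-style) dataset
old = ["id,categ,amount", "1,A,1000.00", "2,B,2222.22"]
new = ["id,categ,amount", "1,A,1000.00", "2,A,2222.22"]

# Blob view: a single yes/no answer, like "Binary files ... differ"
same_bytes = "\n".join(old).encode("utf-8") == "\n".join(new).encode("utf-8")

# Text view: a unified diff identifies the changed line
patch = list(difflib.unified_diff(old, new, "old.csv", "new.csv", lineterm=""))
changed = [
    line for line in patch
    if line[:1] in ("+", "-") and line[:3] not in ("+++", "---")
]
print(same_bytes)
print("\n".join(patch))
```

&lt;p&gt;Even the text diff works only at the level of whole lines, so it still knows
nothing about fields or records as such.&lt;/p&gt;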
&lt;p&gt;Going back to the podcast, Hilary Parker then talked about the difficulty
of the idea of data provenance in the context of streaming data,
but argued there was a bit more hope with batch processes,
since at least the datasets used then are "frozen".&lt;/p&gt;
&lt;p&gt;Roger then argued that there are good custom tools used in particular
places like CERN, but those are not tools he can just pick up and use.
He then wondered aloud whether such tools cannot really exist, because
they require too much understanding of analytical goals. (I don't agree with
this, as the next post will show.)  He then rowed back slightly,
saying that maybe it's too hard for general data, but more reasonable
in a narrower context such as "tidy data".&lt;/p&gt;
&lt;p&gt;If you aren't familiar with the term &lt;a href="https://vita.had.co.nz/papers/tidy-data.pdf"&gt;"tidy
data"&lt;/a&gt;, it really just
refers to data stored in the same way that relational databases store
data in tables, with columns corresponding to variables, rows
corresponding to whatever the items being measured are (the
&lt;em&gt;observations&lt;/em&gt;), consistent types for all the values in a column and a
resulting regular, grid-like structure. (This contrasts with, say,
JSON data, which is hierarchical, and with many spreadsheets, in which
different parts of the sheet are used for different things, and with data in
which different observations are in columns and the different
variables are in rows.)  So "tidy data" is an extremely large subset
of the data we use in structured data analysis,
at least after initial regularization.&lt;/p&gt;
&lt;p&gt;Hilary then mentioned an R package called
&lt;a href="https://github.com/ropensci/testdat"&gt;testdat&lt;/a&gt;, that she had
worked on at an "unconference".  This aimed to check things
like missing values and contiguity of dates in datasets. These ideas
are very similar to those of constraint generation and
verification, which we frequently discuss on this blog,
as supported in the
&lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;TDDA&lt;/a&gt;
package.  But Hilary (and others?) concluded that the package was not
really useful, and that what was more important was tools to make
writing tests for data easier.
(I guess we disagree that general tools like this aren't useful,
but very much support the latter point.)&lt;/p&gt;
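&lt;p&gt;As a rough illustration of the kind of checks mentioned (missing values and
contiguity of dates), here is a minimal Python sketch. It does not use or
reproduce the API of testdat (an R package) or of the tdda library; the
function names &lt;code&gt;missing_count&lt;/code&gt; and &lt;code&gt;date_gaps&lt;/code&gt; and the example values are
assumptions made purely for this example.&lt;/p&gt;

```python
from datetime import date, timedelta


def missing_count(values):
    """Number of missing (None) entries in a field."""
    return sum(1 for v in values if v is None)


def date_gaps(dates):
    """Days between the earliest and latest date that never appear."""
    present = set(dates)
    day, last = min(present), max(present)
    gaps = []
    while day != last:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Hypothetical example data
days = [date(2017, 11, 27), date(2017, 11, 28), date(2017, 11, 30)]
amounts = [10.0, None, 12.5]
print(missing_count(amounts), date_gaps(days))
```

&lt;p&gt;A test for a dataset could then assert that both results are empty or zero,
or that they stay within limits previously observed on good data.&lt;/p&gt;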
&lt;p&gt;Roger then raised the idea of tracking changes to such tests as a kind of
proxy for tracking changes to the data, though clearly this is a very
partial solution.&lt;/p&gt;
&lt;p&gt;Both podcasts are interesting and worth a listen,
but both of them seemed to feel that
there is very little standardization in this area, and that it's really hard.&lt;/p&gt;
&lt;p&gt;We have a lot of processes, software and ideas addressing many aspects
of these issues, and I'll discuss them and try to relate them to the
various points raised here in a subsequent post, fairly soon, I hope.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:roughly"&gt;
&lt;p&gt;Strictly speaking, diff tools cannot know what the actual
processes used to transform the data were, but construct a set of
atomic changes (known as &lt;em&gt;patches&lt;/em&gt;) that are capable of transforming
the old dataset into the new one.&amp;#160;&lt;a class="footnote-backref" href="#fnref:roughly" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:blob"&gt;
&lt;p&gt;A &lt;em&gt;blob&lt;/em&gt;, in version control systems like Git, is a Binary
Large OBject. Git allows you to store blobs, and will track different
versions of them, but all you can really see is whether two versions
are the same, whereas for "normal" files (text files), visual diff
tools normally allow you to see easily exactly what changes have been
made, much like the track changes feature in Word documents.&amp;#160;&lt;a class="footnote-backref" href="#fnref:blob" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="data lineage"></category><category term="data provenance"></category><category term="data governance"></category><category term="tdda"></category><category term="constraints"></category></entry><entry><title>Automatic Constraint Generation and Verification White Paper</title><link href="https://tdda.info/automatic-constraint-generation-and-verification-white-paper.html" rel="alternate"></link><published>2017-10-06T15:30:00+01:00</published><updated>2017-10-06T15:30:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-10-06:/automatic-constraint-generation-and-verification-white-paper.html</id><summary type="html">&lt;p&gt;We have a new White Paper available:&lt;/p&gt;
&lt;h2 id="automatic-constraint-generation-and-verification"&gt;Automatic Constraint Generation and Verification&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Correctness is a key problem at every stage of data science projects:
completing an entire analysis without a serious error at some stage is
surprisingly hard. Even errors that reverse or completely invalidate
the analysis can be …&lt;/p&gt;</summary><content type="html">&lt;p&gt;We have a new White Paper available:&lt;/p&gt;
&lt;h2 id="automatic-constraint-generation-and-verification"&gt;Automatic Constraint Generation and Verification&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Correctness is a key problem at every stage of data science projects:
completing an entire analysis without a serious error at some stage is
surprisingly hard. Even errors that reverse or completely invalidate
the analysis can be hard to detect. Test-Driven Data Analysis (TDDA)
attempts to identify, reduce, and aid correction of such errors. A
core tool that we use in TDDA is Automatic Constraint Discovery and
Verification, the focus of this paper.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.tdda.info/pdf/tdda-constraint-generation-and-verification.pdf"&gt;Download White Paper here&lt;/a&gt;&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="constraints"></category><category term="verification"></category><category term="bad data"></category></entry><entry><title>Constraint Generation in the Presence of Bad Data</title><link href="https://tdda.info/constraint-generation-in-the-presence-of-bad-data.html" rel="alternate"></link><published>2017-09-21T17:30:00+01:00</published><updated>2017-09-21T17:30:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-09-21:/constraint-generation-in-the-presence-of-bad-data.html</id><summary type="html">&lt;p&gt;Bad data is widespread and pervasive.&lt;sup id="fnref:sorry"&gt;&lt;a class="footnote-ref" href="#fn:sorry"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Only datasets and analytical processes that have been subject to
rigorous and sustained quality assurance processes are typically
capable of achieving low or zero error rates. "Badness" can take
many forms and have various aspects, including incorrect values,
missing values, duplicated entries, misencoded …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Bad data is widespread and pervasive.&lt;sup id="fnref:sorry"&gt;&lt;a class="footnote-ref" href="#fn:sorry"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Only datasets and analytical processes that have been subject to
rigorous and sustained quality assurance processes are typically
capable of achieving low or zero error rates. "Badness" can take
many forms and have various aspects, including incorrect values,
missing values, duplicated entries, misencoded values, values that are
inconsistent with other entries in the same dataset, and values that
are inconsistent with those in other datasets, to name but a few.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.tdda.info/images/NoKnownBadData1750x500.png" width="875"
     alt="Person A: 'After our data quality drive, I am happy to say we know of no remaining bad data.' Person B: 'But every day I find problems, and mail the bad data account. They never get fixed.' A: 'Sure, but that's an unmonitored account.' (Covers ears, closes eyes.) A: 'Like I said, we know of no remaining bad data.' B: (Thinks): 'What is this hell?'"/&gt;&lt;/p&gt;
&lt;p&gt;We have previously discussed
&lt;a href="https://www.tdda.info/constraint-discovery-and-verification-for-pandas-dataframes"&gt;automatic constraint discovery and verification&lt;/a&gt;
as a mechanism for finding, specifying, and executing checks on data,
and have advocated using these to verify input, output, and
intermediate datasets.  Until now, however, such constraint discovery has
been based on the assumption that the datasets provided to the
discovery algorithm contain only good data—a significant limitation.&lt;/p&gt;
&lt;p&gt;We have recently been thinking about ways to extend our approach to
constraint generation to handle cases in which this assumption is
relaxed, so that the "discovery" process can be applied to datasets
that include bad data.  This article discusses and illustrates one
such approach, which we have prototyped, as usual, in our own
&lt;a href="https://stochasticsolutions.com/miro.html"&gt;Miró&lt;/a&gt; software.  We plan to
bring the same approach to the open-source Python
&lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;tdda&lt;/a&gt; library
as soon as we have gained further experience using it and convinced
ourselves we are going in a good direction.&lt;/p&gt;
&lt;p&gt;The most obvious benefit of extending constraint generation to function
usefully even when the dataset used contains some bad data is
increased applicability.
Perhaps a more important benefit is that it means that, in general,
&lt;em&gt;more&lt;/em&gt; and &lt;em&gt;tighter&lt;/em&gt; constraints will be generated, potentially increasing
their utility and more clearly highlighting areas that should be of concern.&lt;/p&gt;
&lt;h2 id="the-goal-and-pitfalls-seeing-through-bad-data"&gt;The Goal and Pitfalls: "Seeing Through" Bad Data&lt;/h2&gt;
&lt;p&gt;Our aim is to add to constraint generation a second mode of operation
whose goal is not to &lt;em&gt;discover&lt;/em&gt; constraints that are true over our
example data, but rather to generate constraints for which there is
reasonable evidence, even if some of them are not actually satisfied
by all of the example data.  This is not so much constraint
&lt;em&gt;discovery&lt;/em&gt; as constraint &lt;em&gt;suggestion.&lt;/em&gt; We want the software
to attempt to "see through" the bad data to find constraints similar
to those that would have been discovered had there been only good data.
This is hard (and ill specified) because the bad data values are
not identified, but we shall not be deterred.&lt;/p&gt;
&lt;p&gt;For constraints generated in this way, the role of a human supervisor is
particularly important: ideally, the user would look at the constraints
produced—both the ordinary ones actually satisfied by the data and the
more numerous, tighter suggestions typically produced by the "assuming
bad data" version—and decide which to accept on a case-by-case basis.&lt;/p&gt;
&lt;p&gt;Extending constraint generation with the ability to "see through"
bad data is very powerful, but carries obvious risks: as an example,
constraint suggestion might look at lending data and conclude that
defaults (non-payment of loans—hopefully a small proportion of
customers) are probably outliers that should be excluded from the
data. Clearly, that would be the &lt;em&gt;wrong&lt;/em&gt; conclusion.&lt;/p&gt;
&lt;p&gt;Notwithstanding the risks, our initial experiences with generation in
the presence of bad data have been rather positive: the process has
generated constraints that have identified problems of which we had
been blissfully unaware. The very first time we used it highlighted:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a nearly but not-quite unique identifier (that should, of course,
    have been unique)&lt;/li&gt;
&lt;li&gt;a number of low-frequency, bad categorical values in a field&lt;/li&gt;
&lt;li&gt;several numeric fields with much smaller true good ranges than we had
    realized.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, even when the constraints
discovered go beyond identifying incorrect data, they often highlight
data that we are happy to have flagged because they do represent
outliers deserving of scrutiny.&lt;/p&gt;
&lt;h2 id="our-approach"&gt;Our Approach&lt;/h2&gt;
&lt;p&gt;Our initial attempt at extending constraint discovery to the case in
which there is (or might be) bad data is straightforward: we simply
allow that a proportion of the data might be bad, and look for the
best constraints we can find on that basis.&lt;/p&gt;
&lt;p&gt;With our current implementation, in Miró, the user has to supply an
upper limit on the proportion, &lt;em&gt;p,&lt;/em&gt; of values that are thought
likely to be bad. Eventually, we expect to be able to determine this
proportion heuristically, at least by default.  In practice, for
datasets of any reasonable size, we are mostly just using 1%, which
seems to be working fairly well.&lt;/p&gt;
&lt;p&gt;We should emphasize that the "bad" proportion &lt;em&gt;p&lt;/em&gt; provided is not an
estimate of how much of the data &lt;em&gt;is&lt;/em&gt; bad, but an upper limit on how
much we believe is reasonably likely to be bad.&lt;sup id="fnref:per-constraint"&gt;&lt;a class="footnote-ref" href="#fn:per-constraint"&gt;2&lt;/a&gt;&lt;/sup&gt; As a
result, when we use 1%, our goal is &lt;em&gt;not&lt;/em&gt; to find the tightest possible
constraints that are consistent with 99% of the data, but rather to work
with the assumption that &lt;em&gt;at least&lt;/em&gt; 99% of the data values are good,
and then to find constraints that separate out values that look
very different from that 99%. If the values separated out turn out to be only
0.1%, or 0.001%, of the data, or none at all, so much the better.&lt;/p&gt;
&lt;p&gt;The way this plays out is different for different kinds of constraint.
Let's work through some examples.&lt;/p&gt;
&lt;h2 id="minimum-and-maximum-constraints"&gt;Minimum and Maximum Constraints&lt;/h2&gt;
&lt;p&gt;Looking for minimum and maximum constraints in the presence of
possible bad values is closely connected to univariate outlier detection.
We cannot, however, sensibly make any assumptions about the shape of
the data in the general case, so parametric statistics (means,
variances etc.) are not really appropriate. We also need to be clear
that what we are looking for is not the tails of some smooth
distribution, but rather values that look as if they might have been
generated by some completely different process, or drawn from some
other distribution.&lt;/p&gt;
&lt;p&gt;An example of the kind of thing we are looking for is something
qualitatively like the situation in the top diagram (though perhaps
considerably more extreme), where the red and blue sections of the
distribution look completely different from the main body, and we want to
find cutoffs somewhere around the values shown.
This is the sort of thing we might see, for example, if the main body
of values consists of prices in one currency, while the outliers
are similar prices measured in two other currencies.
In contrast, we are not aiming to cut off the tails of a smooth,
regular distribution, such as the normal distribution shown in the lower
diagram.&lt;/p&gt;
&lt;div style="text-align:center;"&gt;
     &lt;img src="https://www.tdda.info/images/ContrastingOutliers.png"
          width="333px"
          alt="Top: a distribution with a double-peaked centre, and two small (coloured) peaks well to the left and right of the centre. Bottom: a normal distribution, with a small part of each tail coloured."/&gt;
&lt;/div&gt;

&lt;p&gt;In our initial implementation we have used the possible bad proportion
&lt;em&gt;p&lt;/em&gt; to define an assumed "main body" of the distribution as running
between the quantiles at &lt;em&gt;p&lt;/em&gt; and (&lt;em&gt;1 – p&lt;/em&gt;) (from the 1st
percentile to the 99th, when &lt;em&gt;p&lt;/em&gt; = 1%). We assume all
values in this range are good. We then search out from this main body
of the distribution, looking for a gap between successive values that
is at least &lt;em&gt;α&lt;/em&gt; times the (1% – 99%) interquantile range.  We are using
&lt;em&gt;α&lt;/em&gt;=0.5 for now. Once we have identified a gap, we then pick a value
within it, currently favouring the end nearer the main distribution
and also favouring "round" numbers.&lt;/p&gt;
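&lt;p&gt;As a rough illustration, the search just described might be sketched in Python as follows. This is not the Miró implementation: the function names, the interpolated quantile and the fallback used when the interquantile range is zero are all assumptions made for the example, and the preference for "round" numbers is left out, so the sketch simply returns the edge of the gap nearest the main body.&lt;/p&gt;

```python
import math


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already-sorted, non-empty list."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def first_gap_edge(candidates, start, threshold):
    """Walk outward from start; return the last value before the first big gap.

    candidates must be ordered from nearest the main body to furthest away.
    If no gap of at least threshold is found, the furthest value is returned
    (or start itself, if there are no candidates).
    """
    prev = start
    for v in candidates:
        if abs(v - prev) >= threshold:
            return prev
        prev = v
    return prev


def suggest_min_max(values, p=0.01, alpha=0.5):
    """Suggest (min, max) limits, allowing up to a proportion p of bad values."""
    vals = sorted(v for v in values if v is not None)  # nulls are ignored
    q_lo, q_hi = quantile(vals, p), quantile(vals, 1 - p)
    threshold = alpha * (q_hi - q_lo)  # gap size that counts as a break
    if threshold == 0:
        return vals[0], vals[-1]  # assumed fallback: no scale to judge gaps by
    below = [v for v in reversed(vals) if q_lo > v]  # nearest the body first
    above = [v for v in vals if v > q_hi]            # nearest the body first
    return (first_gap_edge(below, q_lo, threshold),
            first_gap_edge(above, q_hi, threshold))


data = list(range(1000)) + [100000]
print(suggest_min_max(data))  # (0, 999)
```

&lt;p&gt;Here the single distant value is separated from the rest by far more than half the interquantile range, so it falls outside the suggested maximum, while the evenly spaced values are all retained.&lt;/p&gt;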
&lt;p&gt;Let's look at an example.&lt;/p&gt;
&lt;p&gt;We have a dataset with 12,680,141 records and 3 prices (floating-point
values), Price1, Price2 and Price3, in sterling. If we run ordinary
TDDA constraint discovery on this, we get the following results:&lt;/p&gt;
&lt;table class="solid sortable"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th colspan="8" bgcolor="#A8A8A8"&gt;Individual Field Constraints&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Name&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Type Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Min Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Max Allowed&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;Price1&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;0.0&lt;/td&gt;
        &lt;td&gt;198,204.34&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price2&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;–468,550,685.56&lt;/td&gt;
        &lt;td&gt;2,432,595.87&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price3&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;0.0&lt;/td&gt;
        &lt;td&gt;1,390,276,267.42&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;With ordinary constraint discovery, the min and max constraints are just set
to the smallest and largest values in the dataset, respectively, so this tells us that
the largest value for Price1 in the dataset is nearly £200,000,
that Price2 ranges from close to –£500m to about +£2.5m,
and that Price3 runs up to about £1.4bn.&lt;/p&gt;
&lt;p&gt;If we rerun the constraint generation allowing for up to 1% bad data,
we get the following instead.&lt;/p&gt;
&lt;table class="solid sortable"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th colspan="8" bgcolor="#A8A8A8"&gt;Individual Field Constraints&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Name&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Type Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Min Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Max Allowed&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;Price1&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;0.0&lt;/td&gt;
        &lt;td&gt;38,000.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price2&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;–3,680.0&lt;/td&gt;
        &lt;td&gt;3,870.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price3&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;0.0&lt;/td&gt;
        &lt;td&gt;125,000.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Let's look at why these values were chosen, and see whether they
are reasonable.&lt;/p&gt;
&lt;p&gt;Starting with Price1, let's get a few more statistics.&lt;/p&gt;
&lt;table class="solid"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0"&gt;min Price1&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;percentile 1 Price1&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;median Price1&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;percentile 99 Price1&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;max Price1&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;0.00&lt;/td&gt;
        &lt;td&gt;24.77&lt;/td&gt;
        &lt;td&gt;299.77&lt;/td&gt;
        &lt;td&gt;2,856.04&lt;/td&gt;
        &lt;td&gt;198,204.34&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So here our median value is just under £300, while our first percentile
value is about £25 and our 99th percentile is a little under £3,000,
giving us an interquantile range also somewhat under £3,000
(£2,831.27, to be precise).&lt;/p&gt;
&lt;p&gt;The way our algorithm currently works, this interquantile range
defines the scale for the gap we need to define something as an
outlier—in this case, half of £2,831.27, which is about £1,415.&lt;/p&gt;
&lt;p&gt;If we were just looking to find the tightest maximum constraint
consistent with 1% of the data being bad (with respect to this
constraint), we would obviously just set the upper limit fractionally
above £2,856.04. But this would be fairly nonsensical, as we
can see if we look at the data around that threshold.
It is a feature of this particular data that most values are duplicated
a few times (typically between 3 and 10 times), so we first
deduplicate them. Here are the first 16 distinct values above £2,855:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.02&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.42&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.80&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;856.04&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.16&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.50&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.82&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;856.08&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.22&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.64&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.86&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;856.13&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.38&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.76&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;855.91&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;856.14&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Obviously, there's nothing special about the values around the
£2,856.04 threshold (including the one marked with *), and there is no
meaningful gap between any of them, so this would be a
fairly odd place to cut off.&lt;/p&gt;
&lt;p&gt;Now let's look at the (distinct) values above £30,000:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="mf"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;173.70&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;513.21&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;505.67&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;852.01&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;097.01&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;082.60&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;452.45&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;358.52&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;703.93&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;944.21&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;858.88&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;382.57&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;562.55&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;838.23&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;888.79&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;47&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;026.70&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;911.84&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;388.55&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;586.96&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;906.98&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;999.04&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;47&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;058.99&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;160.28&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;657.04&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;601.67&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;058.27&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;252.50&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;47&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;126.60&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;984.64&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;760.90&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;620.78&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;302.33&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;447.00&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;49&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;827.28&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;517.16&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;713.28&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;118.10&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;472.53&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;513.95&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;51&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;814.22&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;256.27&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;103&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;231.44&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;139.86&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;601.86&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;473.36&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;53&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;845.76&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;68&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;081.68&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;629.97&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;206.71&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;941.76&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mf"&gt;43&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;384.46&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;871.84&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;782.28&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;198&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;204.34&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;449.57&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;034.38&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;43&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;510.03&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;56&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;393.14&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;71&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;285.95&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The cutoff the TDDA algorithm has chosen (£38,000) is just after the
value marked *, which is the start of the first sizable gap.  We
are not claiming that there is anything uniquely right or "optimal"
about the cutoff that has been chosen in the context of this
data, but it looks reasonable. The first few values here are still
comparatively close together, with gaps of only a few hundred between
each pair of successive values, and the last ones are almost an order of
magnitude bigger.&lt;/p&gt;
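&lt;p&gt;The arithmetic behind this choice can be checked directly from the figures quoted above. The snippet below is only a check on the listed values, applying the gap rule as described (α = 0.5 times the 1% to 99% interquantile range); it takes on trust that no qualifying gap occurs below £30,000, since those values are not listed, and the variable names are purely illustrative.&lt;/p&gt;

```python
# Percentiles quoted in the text for Price1
p01, p99 = 24.77, 2856.04
alpha = 0.5
threshold = alpha * (p99 - p01)  # minimum gap that counts as a break

# Distinct values copied from the listing, from 30,173.70 up to 39,505.67
listed = [
    30173.70, 30452.45, 30562.55, 30586.96, 30601.67, 30620.78,
    31118.10, 31139.86, 31206.71, 31449.57, 31513.21, 32358.52,
    32838.23, 33906.98, 34058.27, 35302.33, 36472.53, 36601.86,
    36941.76, 39034.38, 39505.67,
]

# First pair of successive values separated by at least the threshold
gap = next((a, b) for a, b in zip(listed, listed[1:]) if b - a >= threshold)
print(f"{threshold:.1f}", gap)  # 1415.6 (36941.76, 39034.38)
```

&lt;p&gt;The first gap of at least the threshold among the listed values runs from £36,941.76 to £39,034.38, and the chosen cutoff of £38,000 lies inside it.&lt;/p&gt;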
&lt;p&gt;We won't look in detail at Price2 and Price3, where the algorithm has
narrowed the range rather more, except to comment briefly on the negative
values for Price2, which you might have expected to be excluded.
This table shows why:&lt;sup id="fnref:round"&gt;&lt;a class="footnote-ref" href="#fn:round"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;table class="solid"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0"&gt;countnegative Price2&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;countzero Price2&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;countpositive Price2&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;countnull Price2&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;countnonnull Price2&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;1,300&lt;/td&gt;
        &lt;td&gt;51,640&lt;/td&gt;
        &lt;td&gt;39,750&lt;/td&gt;
        &lt;td&gt;15,880&lt;/td&gt;
        &lt;td&gt;92,690&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Although negative prices are not very common, they do account for about
1.2% of the data (critically, more than 1%).
Additionally, nearly 15% of the values for Price2 are missing,
and we exclude nulls from this calculation, so the truly relevant
figure is that 1,300 of 92,690 non-null values are negative,
or about 1.4%.&lt;/p&gt;
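&lt;p&gt;These percentages can be reproduced from the counts in the table. The snippet below is just that arithmetic, with variable names chosen for illustration; both shares of negative values come out above the 1% limit used for &lt;em&gt;p&lt;/em&gt;, which is why the negative values are not separated out as bad.&lt;/p&gt;

```python
# Counts quoted in the table for Price2
negative, zero, positive, null = 1300, 51640, 39750, 15880
non_null = negative + zero + positive
total = non_null + null

share_of_all = negative / total          # negatives as a share of all records
share_of_non_null = negative / non_null  # negatives as a share of non-null values
null_share = null / total                # proportion of missing values

print(f"{share_of_all:.1%} {share_of_non_null:.1%} {null_share:.1%}")
# 1.2% 1.4% 14.6%
```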
&lt;h2 id="other-constraints"&gt;Other Constraints&lt;/h2&gt;
&lt;p&gt;We will now use a slightly wider dataset, with more field types, to
look at how constraint generation works for other kinds of constraints
in the presence of bad data.  Here are the first 10 records from a
dataset with 108,559 records, including a similar set of three price fields
(with apologies for the slightly unintuitive colouring of the Colour field).&lt;/p&gt;
&lt;table class="solid sortable"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0" title="ID"&gt;ID&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="Date"&gt;Date&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="Price 1"&gt;Price1&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="Price 2"&gt;Price2&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="Price 3"&gt;Price3&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="Code"&gt;Code&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="n Items"&gt;nItems&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="Parts"&gt;Parts&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0" title="Colour"&gt;Colour&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#FCF0EF"&gt;af0370b4-16e4-1891-8b30-cbb075438394&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2016-11-01 00:38:34&lt;/td&gt;
        &lt;td bgcolor="#EFD68A"&gt;830.25&lt;/td&gt;
        &lt;td style="color: #C0C0C0;"&gt;∅&lt;/td&gt;
        &lt;td bgcolor="#74C982"&gt;830.25&lt;/td&gt;
        &lt;td bgcolor="#EDF9F5"&gt;NLOM&lt;/td&gt;
        &lt;td bgcolor="#AFDFE9"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#EFF6FC"&gt;62444&lt;/td&gt;
        &lt;td bgcolor="#D7B8F5"&gt;red&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#F9E2E0"&gt;126edf08-16e5-1891-851c-d926416373f2&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2016-11-01 00:41:21&lt;/td&gt;
        &lt;td bgcolor="#ECCE76"&gt;983.08&lt;/td&gt;
        &lt;td&gt;0.00&lt;/td&gt;
        &lt;td bgcolor="#60BF70"&gt;983.08&lt;/td&gt;
        &lt;td bgcolor="#DBF4EB"&gt;QLDZ&lt;/td&gt;
        &lt;td bgcolor="#AFDFE9"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#E0ECF9"&gt;62677&lt;/td&gt;
        &lt;td bgcolor="#B176EC"&gt;yellow&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#F5D4D1"&gt;73462586-16ed-1891-b164-7da16012a3ab&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2016-11-01 01:41:20&lt;/td&gt;
        &lt;td bgcolor="#F4E4B1"&gt;540.82&lt;/td&gt;
        &lt;td&gt;0.00&lt;/td&gt;
        &lt;td bgcolor="#9FDCAA"&gt;540.82&lt;/td&gt;
        &lt;td bgcolor="#CAEEE2"&gt;TKNX&lt;/td&gt;
        &lt;td bgcolor="#AFDFE9"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#D1E3F5"&gt;62177 62132&lt;/td&gt;
        &lt;td bgcolor="#D7B8F5"&gt;red&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#F2C6C2"&gt;1cd55aac-170e-1891-aec3-0ffc0ab020e7&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2016-11-01 05:35:08&lt;/td&gt;
        &lt;td bgcolor="#F7EBC5"&gt;398.89&lt;/td&gt;
        &lt;td&gt;0.00&lt;/td&gt;
        &lt;td bgcolor="#B7E5BE"&gt;398.89&lt;/td&gt;
        &lt;td bgcolor="#BAE8D9"&gt;PKWI&lt;/td&gt;
        &lt;td bgcolor="#AFDFE9"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#C2DAF2"&gt;61734&lt;/td&gt;
        &lt;td bgcolor="#D7B8F5"&gt;red&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#EFB8B3"&gt;68f74486-170e-1891-8b30-cbb075438394&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2016-11-01 05:37:16&lt;/td&gt;
        &lt;td bgcolor="#F9EFD1"&gt;314.54&lt;/td&gt;
        &lt;td bgcolor="#A5C663"&gt;8.21&lt;/td&gt;
        &lt;td bgcolor="#C5EBCB"&gt;314.54&lt;/td&gt;
        &lt;td bgcolor="#AAE2CF"&gt;PLUA&lt;/td&gt;
        &lt;td bgcolor="#AFDFE9"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#B3D1EF"&gt;62611&lt;/td&gt;
        &lt;td bgcolor="#D7B8F5"&gt;red&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#ECABA5"&gt;28dff654-16fa-1891-bd69-03d25ee96dee&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2016-11-01 03:12:18&lt;/td&gt;
        &lt;td bgcolor="#F6E7BA"&gt;479.75&lt;/td&gt;
        &lt;td style="color: #C0C0C0;"&gt;∅&lt;/td&gt;
        &lt;td bgcolor="#A9E0B2"&gt;479.75&lt;/td&gt;
        &lt;td bgcolor="#9ADDC7"&gt;UHEG&lt;/td&gt;
        &lt;td bgcolor="#AFDFE9"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#A5C8EC"&gt;62128&lt;/td&gt;
        &lt;td bgcolor="#D7B8F5"&gt;red&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E99E97"&gt;14321670-1703-1891-ab75-77d18aced2f5&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2016-11-01 04:16:09&lt;/td&gt;
        &lt;td bgcolor="#F1DA97"&gt;733.41&lt;/td&gt;
        &lt;td&gt;0.00&lt;/td&gt;
        &lt;td bgcolor="#82CF8F"&gt;733.41&lt;/td&gt;
        &lt;td bgcolor="#8CD7BE"&gt;RTKT&lt;/td&gt;
        &lt;td bgcolor="#69C1D2"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#97C0E9"&gt;61829&lt;/td&gt;
        &lt;td bgcolor="#D7B8F5"&gt;red&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E6918A"&gt;60957d68-1717-1891-a5b2-67fe1e248d60&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2016-11-01 06:41:27&lt;/td&gt;
        &lt;td bgcolor="#F5E4B2"&gt;537.81&lt;/td&gt;
        &lt;td&gt;0.00&lt;/td&gt;
        &lt;td bgcolor="#A0DCAA"&gt;537.81&lt;/td&gt;
        &lt;td bgcolor="#7DD1B5"&gt;OBFZ&lt;/td&gt;
        &lt;td bgcolor="#AFDFE9"&gt;1&lt;/td&gt;
        &lt;td bgcolor="#8AB8E6"&gt;61939 62371&lt;/td&gt;
        &lt;td bgcolor="#D7B8F5"&gt;red&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#E2857C"&gt;4fea2bb6-171d-1891-80a6-2f322b85f525&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2016-11-01 07:23:56&lt;/td&gt;
        &lt;td bgcolor="#FCF8EB"&gt;132.60&lt;/td&gt;
        &lt;td bgcolor="#E2EECB"&gt;2.42&lt;/td&gt;
        &lt;td bgcolor="#E5F6E8"&gt;135.02&lt;/td&gt;
        &lt;td bgcolor="#70CBAD"&gt;TACG&lt;/td&gt;
        &lt;td bgcolor="#69C1D2"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#7CAFE2"&gt;62356&lt;/td&gt;
        &lt;td bgcolor="#D7B8F5"&gt;red&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td bgcolor="#DF7970"&gt;204ed3d6-1725-1891-a1fb-15b6ffac52ce&lt;/td&gt;
        &lt;td bgcolor="#F2B679"&gt;2016-11-01 08:19:52&lt;/td&gt;
        &lt;td bgcolor="#EED485"&gt;866.32&lt;/td&gt;
        &lt;td style="color: #C0C0C0;"&gt;∅&lt;/td&gt;
        &lt;td bgcolor="#6FC77E"&gt;866.32&lt;/td&gt;
        &lt;td bgcolor="#63C6A5"&gt;XHLE&lt;/td&gt;
        &lt;td bgcolor="#69C1D2"&gt;2&lt;/td&gt;
        &lt;td bgcolor="#70A7DF"&gt;61939&lt;/td&gt;
        &lt;td bgcolor="#D7B8F5"&gt;red&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;If we run ordinary constraint discovery on this (not allowing for any
bad data), these are the results:&lt;/p&gt;
&lt;table class="solid sortable"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th colspan="9" bgcolor="#A8A8A8"&gt;Individual Field Constraints&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Name&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Type Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Min Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Max Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Sign Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Nulls Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Duplicates Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Values Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;# Regular Expressions&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;ID&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;length 32&lt;/td&gt;
        &lt;td&gt;length 36&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Date&lt;/td&gt;
        &lt;td&gt;date&lt;/td&gt;
        &lt;td&gt;1970-01-01&lt;/td&gt;
        &lt;td&gt;2017-06-03 23:59:06&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Date:time-before-now&lt;/td&gt;
        &lt;td&gt;timedelta&lt;/td&gt;
        &lt;td&gt;107 days, 11:28:05&lt;/td&gt;
        &lt;td&gt;17428 days, 11:27:11&lt;/td&gt;
        &lt;td&gt;positive&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price1&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;0.0&lt;/td&gt;
        &lt;td&gt;19,653,405.06&lt;/td&gt;
        &lt;td&gt;non-negative&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price2&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;-4,331,261.54&lt;/td&gt;
        &lt;td&gt;589,023.50&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price3&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;0.0&lt;/td&gt;
        &lt;td&gt;20,242,428.57&lt;/td&gt;
        &lt;td&gt;non-negative&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Code&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;length 4&lt;/td&gt;
        &lt;td&gt;length 4&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;nItems&lt;/td&gt;
        &lt;td&gt;int&lt;/td&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td&gt;99&lt;/td&gt;
        &lt;td&gt;positive&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Parts&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;length 5&lt;/td&gt;
        &lt;td&gt;length 65&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Colour&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;length 3&lt;/td&gt;
        &lt;td&gt;length 6&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;6 values&lt;/td&gt;
        &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;There are a few points to note:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To make the table more manageable, the regular expressions and field
     values are not shown, by default. We'll see them later.&lt;/li&gt;
&lt;li&gt;The third line of constraints is generated slightly differently
     from the others: it looks not at the actual values of the dates,
     but rather at how far in the past or future they are. This is something
     else we are experimenting with, but it doesn't matter too much for this
     example.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let's repeat the process telling the software that up to 1% of the data
might be bad.&lt;/p&gt;
&lt;table class="solid sortable"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th colspan="9" bgcolor="#A8A8A8"&gt;Individual Field Constraints&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Name&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Type Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Min Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Max Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Sign Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Nulls Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Duplicates Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;Values Allowed&lt;/th&gt;
        &lt;th bgcolor="#A8A8A8"&gt;# Regular Expressions&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;ID&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;length 32&lt;/td&gt;
        &lt;td&gt;length 36&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;no&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Date&lt;/td&gt;
        &lt;td&gt;date&lt;/td&gt;
        &lt;td&gt;1970-01-01 01:23:20&lt;/td&gt;
        &lt;td&gt;2017-06-03 23:59:06&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Date:time-before-now&lt;/td&gt;
        &lt;td&gt;timedelta&lt;/td&gt;
        &lt;td&gt;107 days, 11:55:27&lt;/td&gt;
        &lt;td&gt;17428 days, 11:54:33&lt;/td&gt;
        &lt;td&gt;positive&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price1&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;0.0&lt;/td&gt;
        &lt;td&gt;25000.0&lt;/td&gt;
        &lt;td&gt;non-negative&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price2&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;-520.0&lt;/td&gt;
        &lt;td&gt;790.0&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price3&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;0.0&lt;/td&gt;
        &lt;td&gt;24000.0&lt;/td&gt;
        &lt;td&gt;non-negative&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Code&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;length 4&lt;/td&gt;
        &lt;td&gt;length 4&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;nItems&lt;/td&gt;
        &lt;td&gt;int&lt;/td&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td&gt;10&lt;/td&gt;
        &lt;td&gt;positive&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td/&gt;
        &lt;td/&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Parts&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;length 5&lt;/td&gt;
        &lt;td&gt;length 65&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Colour&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;length 3&lt;/td&gt;
        &lt;td&gt;length 6&lt;/td&gt;
        &lt;td/&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;3 values&lt;/td&gt;
        &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Now let's consider the (other) constraint kinds in turn.&lt;/p&gt;
&lt;h3 id="sign"&gt;Sign&lt;/h3&gt;
&lt;p&gt;Nothing has changed for any of the sign constraints.
In general, all that we do in this case
is check whether less than our nominated proportion &lt;em&gt;p&lt;/em&gt; of the values is negative,
or less than &lt;em&gt;p&lt;/em&gt; is positive, and if so write
a constraint to this effect; in this example, neither case arose.&lt;/p&gt;
&lt;h3 id="nulls"&gt;Nulls&lt;/h3&gt;
&lt;p&gt;The &lt;em&gt;nulls allowed&lt;/em&gt; constraint puts an upper limit on the number of nulls
allowed in a field. The only two values we ever use are 0 and 1, with 0
meaning that no nulls are allowed and 1 meaning that a single null is
allowed. The Parts field originally had a value of 1 here, meaning
that there is a single null in this field in the data, and therefore the
software wrote a constraint that a maximum of 1 null is permitted. When the
software is allowed to assume that 1% of the data might be bad, a more
natural constraint is not to allow nulls at all.&lt;/p&gt;
&lt;p&gt;No constraint on nulls was produced for the field nItems originally,
but now we have one. If we check the number of nulls in nItems, it turns
out to be just 37, or around 0.03%. Since that is well below our 1% limit,
a "no-nulls" constraint has been generated.&lt;/p&gt;
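&lt;p&gt;The two decisions above can be sketched in a few lines of Python. This is an illustrative sketch, not the tdda implementation: the function &lt;code&gt;nulls_allowed&lt;/code&gt; and its parameters are hypothetical, and whether the comparison with &lt;em&gt;p&lt;/em&gt; is strict at the boundary is an assumption. Only the counts (1 and 37 nulls in 108,570 records) and the 1% limit come from the example.&lt;/p&gt;

```python
def nulls_allowed(n_nulls, n_records, p=0.0):
    """Return the max-nulls limit (0 or 1) to write, or None for no constraint.

    Hypothetical helper: nulls are treated as bad data, and so banned,
    when their share of the records is no more than p.
    """
    if n_records == 0:
        return None
    if p >= n_nulls / n_records:
        return 0      # few enough nulls to count as bad data: ban them
    if n_nulls == 1:
        return 1      # exactly one null seen: allow at most one
    return None       # many nulls: write no constraint


print(nulls_allowed(1, 108570))            # Parts, no bad data assumed
print(nulls_allowed(1, 108570, p=0.01))    # Parts, up to 1% bad
print(nulls_allowed(37, 108570))           # nItems, no bad data assumed
print(nulls_allowed(37, 108570, p=0.01))   # nItems, up to 1% bad
```

&lt;p&gt;Under these assumptions the four calls give 1, 0, no constraint, and 0, matching the Nulls Allowed column for Parts and nItems in the two tables above.&lt;/p&gt;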
&lt;h3 id="duplicates"&gt;Duplicates&lt;/h3&gt;
&lt;p&gt;In the original constraint discovery, no constraints banning duplicates
were generated, whereas in this case ID did get a "no duplicates" constraint.
If we count the number of distinct values in ID, there are, in fact,
108,569 for 108,570 records, so there is clearly one duplicate, affecting
2 records. Since 2/108,570 is again well below our 1% limit (more like 0.002%),
the software generates this constraint.&lt;/p&gt;
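&lt;p&gt;A minimal sketch of this check, using only the Python standard library, is shown below. The helper &lt;code&gt;duplicated_fraction&lt;/code&gt; and the tiny &lt;code&gt;ids&lt;/code&gt; list are hypothetical stand-ins for illustration; the figures 2 and 108,570 are the ones quoted above.&lt;/p&gt;

```python
from collections import Counter


def duplicated_fraction(values):
    """Fraction of records whose value occurs more than once."""
    counts = Counter(values)
    n_dup_records = sum(c for c in counts.values() if c > 1)
    return n_dup_records / len(values)


# Tiny stand-in for the ID field: 6 records, 5 distinct values.
ids = ["a1", "b2", "c3", "b2", "d4", "e5"]
print(len(set(ids)), "distinct values for", len(ids), "records")
print(duplicated_fraction(ids))

# The figures from the text: one duplicated value affecting
# 2 of 108,570 records.
fraction = 2 / 108570
print(f"{fraction:.4%}")
print(0.01 >= fraction)   # True: within the 1% limit
```

&lt;p&gt;Counting records that share a value, rather than surplus copies, matches the "2 records" reading used in the text; either way the proportion is far below 1%.&lt;/p&gt;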
&lt;h3 id="values-allowed"&gt;Values Allowed&lt;/h3&gt;
&lt;p&gt;In the original dataset, a constraint on the values for the field Colour
was generated. Specifically, the six values it allowed were:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;quot;red&amp;quot; &amp;quot;yellow&amp;quot; &amp;quot;blue&amp;quot; &amp;quot;green&amp;quot; &amp;quot;CC1734&amp;quot; &amp;quot;???&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When we ran the discovery process allowing for bad data, the result was
that only 3 values were allowed, which turn out to be "red", "yellow",
and "blue". If we look at the breakdown of the field, we will see why.&lt;/p&gt;
&lt;table class="solid"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Colour&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;count&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;th&gt;red   &lt;/th&gt;
        &lt;td bgcolor="#DF7970"&gt;88,258&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th&gt;yellow&lt;/th&gt;
        &lt;td bgcolor="#FBEDEB"&gt;10,979&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th&gt;blue  &lt;/th&gt;
        &lt;td bgcolor="#FCF1F0"&gt; 8,350&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th&gt;green &lt;/th&gt;
        &lt;td bgcolor="#FFFDFD"&gt;   978&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th&gt;CC1734&lt;/th&gt;
        &lt;td&gt;     4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th&gt;???   &lt;/th&gt;
        &lt;td&gt;     1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The two values we might have picked out as suspicious (or at least,
having a different form from the others) are obviously &lt;code&gt;CC1734&lt;/code&gt; and &lt;code&gt;???&lt;/code&gt;,
and constraint generation did exclude those, but also &lt;code&gt;green&lt;/code&gt;, which
we probably would not have done. There are 978 &lt;code&gt;green&lt;/code&gt; values (about 0.9%),
which is slightly under our 1% cutoff, so it can be excluded by the
algorithm, and is. The algorithm simply works through the values in order,
starting with the least numerous one, and removes values until the maximum
possible bad proportion &lt;em&gt;p&lt;/em&gt; is reached (cumulatively).
Values with the same frequency are treated together, which means that if
we had had another value (say "purple") with exactly the same frequency
as green (978), neither would have been excluded.&lt;/p&gt;
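&lt;p&gt;The procedure just described can be sketched as follows. This is a reading of the prose above rather than the tdda source: the function &lt;code&gt;allowed_values&lt;/code&gt; is hypothetical, and stopping at the first tie group that would push the excluded share over &lt;em&gt;p&lt;/em&gt; is an assumption about how "until the maximum is reached" is applied. The counts are those in the breakdown table.&lt;/p&gt;

```python
from itertools import groupby


def allowed_values(counts, p):
    """Drop the rarest values while the dropped share stays within p.

    counts maps each value to its frequency. Values with equal
    frequency are kept or dropped together.
    """
    total = sum(counts.values())
    by_count = sorted(counts.items(), key=lambda kv: kv[1])
    dropped, n_dropped = set(), 0
    for _, group in groupby(by_count, key=lambda kv: kv[1]):
        group = list(group)
        group_size = sum(n for _, n in group)
        if (n_dropped + group_size) > p * total:
            break                  # this tie group would exceed the limit
        dropped.update(v for v, _ in group)
        n_dropped += group_size
    return [v for v in counts if v not in dropped]


colour = {"red": 88258, "yellow": 10979, "blue": 8350,
          "green": 978, "CC1734": 4, "???": 1}
print(allowed_values(colour, 0.01))   # ['red', 'yellow', 'blue']

# A hypothetical "purple" tied with green: 1 + 4 + 978 + 978 exceeds 1%,
# so both green and purple are retained.
tied = dict(colour, purple=978)
print(allowed_values(tied, 0.01))
```

&lt;p&gt;With the real counts, the cumulative exclusions are 1, 5 and then 983 of 108,570 records (about 0.905%), so &lt;code&gt;green&lt;/code&gt; is the last value removed before the 1% limit is hit.&lt;/p&gt;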
&lt;h3 id="regular-expressions"&gt;Regular Expressions&lt;/h3&gt;
&lt;p&gt;Sets of regular expressions were generated to characterize each of the
four string fields, and in every case the number generated was smaller
when assuming the possible presence of bad data than when not.&lt;/p&gt;
&lt;p&gt;For the ID field, two regular expressions were generated:&lt;/p&gt;
&lt;pre class="nocommand"&gt;&lt;span class="anchor"&gt;^&lt;/span&gt;&lt;span class="even"&gt;[0-9a-f]{32}&lt;/span&gt;&lt;span class="anchor"&gt;$&lt;/span&gt;
&lt;span class="anchor"&gt;^&lt;/span&gt;&lt;span class="even"&gt;[0-9a-f]{8}&lt;/span&gt;&lt;span class="odd"&gt;\-&lt;/span&gt;&lt;span class="even"&gt;[0-9a-f]{4}&lt;/span&gt;&lt;span class="odd"&gt;\-&lt;/span&gt;&lt;span class="even"&gt;1891&lt;/span&gt;&lt;span class="odd"&gt;\-&lt;/span&gt;&lt;span class="even"&gt;[0-9a-f]{4}&lt;/span&gt;&lt;span class="odd"&gt;\-&lt;/span&gt;&lt;span class="even"&gt;[0-9a-f]{12}&lt;/span&gt;&lt;span class="anchor"&gt;$&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;The first of these just corresponds to a 32-digit hex number,
while the second is a 32-digit hex number broken into five groups
of 8, 4, 4, 4, and 12 digits, separated by dashes—i.e. a
&lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier"&gt;UUID&lt;/a&gt;.&lt;sup id="fnref:uuid1"&gt;&lt;a class="footnote-ref" href="#fn:uuid1"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;If we get Rexpy to give us the coverage information, we see that all but
three of the IDs are properly formatted UUIDs, with just three being
plain 32-digit hex numbers.&lt;/p&gt;
&lt;table class="solid"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Regular Expression&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Incremental Coverage&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;
&lt;span class="anchor"&gt;^&lt;/span&gt;&lt;span class="even"&gt;[0-9a-f]{8}&lt;/span&gt;&lt;span class="odd"&gt;\-&lt;/span&gt;&lt;span class="even"&gt;[0-9a-f]{4}&lt;/span&gt;&lt;span class="odd"&gt;\-&lt;/span&gt;&lt;span class="even"&gt;1891&lt;/span&gt;&lt;span class="odd"&gt;\-&lt;/span&gt;&lt;span class="even"&gt;[0-9a-f]{4}&lt;/span&gt;&lt;span class="odd"&gt;\-&lt;/span&gt;&lt;span class="even"&gt;[0-9a-f]{12}&lt;/span&gt;&lt;span class="anchor"&gt;$&lt;/span&gt;       &lt;/td&gt;
        &lt;td&gt;108,567&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;
&lt;span class="anchor"&gt;^&lt;/span&gt;&lt;span class="even"&gt;[0-9a-f]{32}&lt;/span&gt;&lt;span class="anchor"&gt;$&lt;/span&gt;        &lt;/td&gt;
        &lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;And indeed, it is these 3 records that are excluded by constraint generation
when bad data is assumed.&lt;/p&gt;
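&lt;p&gt;To make "incremental coverage" concrete, here is a small sketch using Python's standard &lt;code&gt;re&lt;/code&gt; module rather than Rexpy itself. The function &lt;code&gt;incremental_coverage&lt;/code&gt; is a hypothetical helper, not part of tdda; the two patterns and the three sample IDs are taken from the tables in this post.&lt;/p&gt;

```python
import re

# The two patterns from the text, highest coverage first.
PATTERNS = [
    r"^[0-9a-f]{8}\-[0-9a-f]{4}\-1891\-[0-9a-f]{4}\-[0-9a-f]{12}$",
    r"^[0-9a-f]{32}$",
]


def incremental_coverage(patterns, strings):
    """For each pattern, in order, count the strings it matches that no
    earlier pattern matched. Also return the strings nothing matched."""
    remaining = list(strings)
    coverage = []
    for pattern in patterns:
        rex = re.compile(pattern)
        unmatched = [s for s in remaining if not rex.match(s)]
        coverage.append(len(remaining) - len(unmatched))
        remaining = unmatched
    return coverage, remaining


ids = [
    "4fea2bb6-171d-1891-80a6-2f322b85f525",
    "204ed3d6-1725-1891-a1fb-15b6ffac52ce",
    "374e4f9816e51891b1647da16012a3ab",
]
print(incremental_coverage(PATTERNS, ids))       # both patterns kept
print(incremental_coverage(PATTERNS[:1], ids))   # UUID pattern only
```

&lt;p&gt;With only the UUID pattern retained, the undashed ID is left unmatched, which is how such records would come to fail a regular-expression constraint.&lt;/p&gt;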
&lt;p&gt;We won't go through all the cases, but will look at one more.
The field Parts is a list of 5-digit numbers, separated by spaces,
with a single number being the most common case. Here are the regular
expressions that Rexpy generated, together with the coverage information.&lt;/p&gt;
&lt;table class="solid"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Regular Expression&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Incremental Coverage&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;
&lt;span class="anchor"&gt;^&lt;/span&gt;&lt;span class="even"&gt;\d{5}&lt;/span&gt;&lt;span class="anchor"&gt;$&lt;/span&gt;       &lt;/td&gt;
        &lt;td&gt;99,255&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;
&lt;span class="anchor"&gt;^&lt;/span&gt;&lt;span class="even"&gt;\d{5}&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;\d{5}&lt;/span&gt;&lt;span class="anchor"&gt;$&lt;/span&gt;      &lt;/td&gt;
        &lt;td&gt;8,931&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;
&lt;span class="anchor"&gt;^&lt;/span&gt;&lt;span class="even"&gt;\d{5}&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;\d{5}&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;\d{5}&lt;/span&gt;&lt;span class="anchor"&gt;$&lt;/span&gt;     &lt;/td&gt;
        &lt;td&gt;378&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;
&lt;span class="anchor"&gt;^&lt;/span&gt;&lt;span class="even"&gt;\d{5}&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;\d{5}&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;\d{5}&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;\d{5}&lt;/span&gt;&lt;span class="anchor"&gt;$&lt;/span&gt;        &lt;/td&gt;
        &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;
&lt;span class="anchor"&gt;^&lt;/span&gt;&lt;span class="even"&gt;61000&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;61000&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;61000&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;61000&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;61000&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;61000&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;61000&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;61000&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;61000&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;61000&lt;/span&gt;&lt;span class="odd"&gt; &lt;/span&gt;&lt;span class="even"&gt;61000&lt;/span&gt;&lt;span class="anchor"&gt;$&lt;/span&gt;     &lt;/td&gt;
        &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As you can see, Rexpy has generated five separate regular expressions,
where in an ideal world we might have preferred it produced a single one:&lt;/p&gt;
&lt;pre&gt;&lt;span class="anchor"&gt;^&lt;/span&gt;&lt;span class="even"&gt;\d{5}&lt;/span&gt;&lt;span class="odd"&gt;( \d{5})*&lt;/span&gt;&lt;span class="anchor"&gt;$&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;However, the fact that it has produced separate expressions for
five clearly distinguishable cases turns out to be very helpful for
TDDA purposes.&lt;/p&gt;
&lt;p&gt;In this case, we can see that the vast bulk of the cases have either
one or two 5-digit codes (which are the two regular expressions
retained by constraint generation), but we would almost certainly
consider the 382 cases with three and four codes to be also
correct. The last one is more interesting. First, the number
of codes (eleven) is noticeably larger than for any other record.
Secondly, the fact that it is eleven copies of a single code, and one that is
a relatively round number, is suspicious.
(Looking at the data, it turns out that in no other case are any
codes repeated when there are multiple codes, suggesting even more
strongly that there's something not right with this record.)&lt;/p&gt;
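&lt;p&gt;The checks described here are easy to reproduce by hand. The sketch below uses the single "ideal" pattern together with a count of codes and a test for repeats; &lt;code&gt;describe_parts&lt;/code&gt; is a hypothetical helper written for this illustration, and the sample strings are values that appear in this post.&lt;/p&gt;

```python
import re

# The single pattern we might have preferred Rexpy to produce.
PARTS_REX = re.compile(r"^\d{5}( \d{5})*$")


def describe_parts(parts):
    """Return (well_formed, n_codes, has_repeats) for one Parts string."""
    codes = parts.split(" ")
    well_formed = PARTS_REX.match(parts) is not None
    has_repeats = len(set(codes)) != len(codes)
    return well_formed, len(codes), has_repeats


print(describe_parts("62356"))
print(describe_parts("61939 62371"))
print(describe_parts("62702 62132 62341"))
print(describe_parts(" ".join(["61000"] * 11)))
```

&lt;p&gt;All four strings satisfy the single pattern, so on its own it would not single out the eleven-code record; the code count and the repeat check are what flag it, which is the information the separate expressions surface automatically.&lt;/p&gt;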
&lt;h2 id="verifying-the-constraint-generation-data-against-the-suggested-constraints"&gt;Verifying the Constraint Generation Data Against the Suggested Constraints&lt;/h2&gt;
&lt;p&gt;With the previous approach to constraint discovery, in which we assume
that the example data given to us contains only good data, it should
always be the case that if we verify the example data against
constraints generated against it, they will all pass.&lt;sup id="fnref:except"&gt;&lt;a class="footnote-ref" href="#fn:except"&gt;5&lt;/a&gt;&lt;/sup&gt; With
the new approach, this is no longer the case, because our hope is that
the constraints will help us to identify bad data.  We show below the
result of running verification against the generated constraints for
the example we have been looking at:&lt;/p&gt;
&lt;table class="tdver"&gt;
   &lt;tr&gt;
     &lt;td&gt;
       &lt;table class="tdvercell"&gt;
         &lt;tr&gt;&lt;th&gt;Constraints&lt;/th&gt;&lt;/tr&gt;
         &lt;tr&gt;&lt;td&gt;47&lt;/td&gt;&lt;/tr&gt;
       &lt;/table&gt;
     &lt;/td&gt;
     &lt;td&gt;
       &lt;table class="tdvercell"&gt;
         &lt;tr&gt;&lt;th&gt;Records&lt;/th&gt;&lt;/tr&gt;
         &lt;tr&gt;&lt;td&gt;108,570&lt;/td&gt;&lt;/tr&gt;
       &lt;/table&gt;
     &lt;/td&gt;
     &lt;td&gt;
       &lt;table class="tdvercell"&gt;
         &lt;tr&gt;&lt;th&gt;Fields&lt;/th&gt;&lt;/tr&gt;
         &lt;tr&gt;&lt;td&gt;10&lt;/td&gt;&lt;/tr&gt;
       &lt;/table&gt;
     &lt;/td&gt;
     &lt;td&gt;
       &lt;table class="tdvercell"&gt;
         &lt;tr&gt;&lt;th&gt;Values&lt;/th&gt;&lt;/tr&gt;
         &lt;tr&gt;&lt;td&gt;1,085,700&lt;/td&gt;&lt;/tr&gt;
       &lt;/table&gt;
     &lt;/td&gt;
   &lt;/tr&gt;
   &lt;tr&gt;
     &lt;td&gt;
       &lt;table class="tdvercell"&gt;
         &lt;tr&gt;&lt;th&gt;Failing Constraints&lt;/th&gt;&lt;/tr&gt;
         &lt;tr&gt;&lt;td class="tdred"&gt;15&lt;/td&gt;&lt;/tr&gt;
       &lt;/table&gt;
     &lt;/td&gt;
     &lt;td&gt;
       &lt;table class="tdvercell"&gt;
         &lt;tr&gt;&lt;th&gt;Failing Records&lt;/th&gt;&lt;/tr&gt;
         &lt;tr&gt;&lt;td class="tdred"&gt;1,722&lt;/td&gt;&lt;/tr&gt;
       &lt;/table&gt;
     &lt;/td&gt;
     &lt;td&gt;
       &lt;table class="tdvercell"&gt;
         &lt;tr&gt;&lt;th&gt;Failing Fields&lt;/th&gt;&lt;/tr&gt;
         &lt;tr&gt;&lt;td class="tdred"&gt;10&lt;/td&gt;&lt;/tr&gt;
       &lt;/table&gt;
     &lt;/td&gt;
     &lt;td&gt;
       &lt;table class="tdvercell"&gt;
         &lt;tr&gt;&lt;th&gt;Failing Values&lt;/th&gt;&lt;/tr&gt;
         &lt;tr&gt;&lt;td class="tdred"&gt;1,887&lt;/td&gt;&lt;/tr&gt;
       &lt;/table&gt;
     &lt;/td&gt;
   &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;We can also see a little more information about where the failures were
in the table below.&lt;/p&gt;
&lt;table class="solid sortable tdda"&gt;
    &lt;thead&gt;
    &lt;tr&gt;
        &lt;th colspan="1" bgcolor="#E0E0E0" rowspan="2"&gt;Name&lt;/th&gt;
        &lt;th colspan="2" bgcolor="#E0E0E0"&gt;Failures&lt;/th&gt;
        &lt;th colspan="3" bgcolor="#E0E0E0"&gt;Type&lt;/th&gt;
        &lt;th colspan="3" bgcolor="#E0E0E0"&gt;Minimum&lt;/th&gt;
        &lt;th colspan="3" bgcolor="#E0E0E0"&gt;Maximum&lt;/th&gt;
        &lt;th colspan="3" bgcolor="#E0E0E0"&gt;Sign&lt;/th&gt;
        &lt;th colspan="3" bgcolor="#E0E0E0"&gt;Max Nulls&lt;/th&gt;
        &lt;th colspan="3" bgcolor="#E0E0E0"&gt;Duplicates&lt;/th&gt;
        &lt;th colspan="3" bgcolor="#E0E0E0"&gt;Values&lt;/th&gt;
        &lt;th colspan="3" bgcolor="#E0E0E0"&gt;Rex&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Values&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Constraints&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Allowed&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Actual&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;✓&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Allowed&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Actual&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;✓&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Allowed&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Actual&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;✓&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Allowed&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Actual&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;✓&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Allowed&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Actual&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;✓&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Allowed&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Actual&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;✓&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Allowed&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Actual&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;✓&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Allowed&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;Actual&lt;/th&gt;
        &lt;th bgcolor="#E0E0E0"&gt;✓&lt;/th&gt;
    &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
    &lt;tr&gt;
        &lt;td&gt;Colour&lt;/td&gt;
        &lt;td class="tdred"&gt;983&lt;/td&gt;
        &lt;td class="tdred"&gt;2&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;length 3&lt;/td&gt;
        &lt;td&gt;length 3&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;length 6&lt;/td&gt;
        &lt;td&gt;length 6&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;3 values&lt;/td&gt;
        &lt;td class="tdred"&gt;e.g. &amp;quot;green&amp;quot;&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
        &lt;td&gt;1 pattern&lt;/td&gt;
        &lt;td class="tdred"&gt;e.g. &amp;quot;???&amp;quot;&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Parts&lt;/td&gt;
        &lt;td class="tdred"&gt;384&lt;/td&gt;
        &lt;td class="tdred"&gt;2&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;length 5&lt;/td&gt;
        &lt;td&gt;length 5&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;length 65&lt;/td&gt;
        &lt;td&gt;length 65&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td class="tdred"&gt;1&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;2 patterns&lt;/td&gt;
        &lt;td class="tdred"&gt;e.g. &amp;quot;62702 62132 62341&amp;quot;&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price2&lt;/td&gt;
        &lt;td class="tdred"&gt;281&lt;/td&gt;
        &lt;td class="tdred"&gt;2&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-520.00&lt;/td&gt;
        &lt;td class="tdred"&gt;-4,331,261.54&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
        &lt;td&gt;790.00&lt;/td&gt;
        &lt;td class="tdred"&gt;589,023.50&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;15880&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price3&lt;/td&gt;
        &lt;td class="tdred"&gt;92&lt;/td&gt;
        &lt;td class="tdred"&gt;1&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;0.00&lt;/td&gt;
        &lt;td&gt;0.00&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;24,000.00&lt;/td&gt;
        &lt;td class="tdred"&gt;20,242,428.57&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
        &lt;td&gt;≥ 0&lt;/td&gt;
        &lt;td&gt;✓&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Price1&lt;/td&gt;
        &lt;td class="tdred"&gt;91&lt;/td&gt;
        &lt;td class="tdred"&gt;1&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td&gt;real&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;0.00&lt;/td&gt;
        &lt;td&gt;0.00&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;25,000.00&lt;/td&gt;
        &lt;td class="tdred"&gt;19,653,405.06&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
        &lt;td&gt;≥ 0&lt;/td&gt;
        &lt;td&gt;✓&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;nItems&lt;/td&gt;
        &lt;td class="tdred"&gt;44&lt;/td&gt;
        &lt;td class="tdred"&gt;2&lt;/td&gt;
        &lt;td&gt;int&lt;/td&gt;
        &lt;td&gt;int&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;10&lt;/td&gt;
        &lt;td class="tdred"&gt;99&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
        &lt;td&gt;&amp;gt; 0&lt;/td&gt;
        &lt;td&gt;✓&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td class="tdred"&gt;37&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Code&lt;/td&gt;
        &lt;td class="tdred"&gt;4&lt;/td&gt;
        &lt;td class="tdred"&gt;1&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;length 4&lt;/td&gt;
        &lt;td&gt;length 4&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;length 4&lt;/td&gt;
        &lt;td&gt;length 4&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;2 patterns&lt;/td&gt;
        &lt;td class="tdred"&gt;e.g. &amp;quot;A__Z&amp;quot;&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;ID&lt;/td&gt;
        &lt;td class="tdred"&gt;3&lt;/td&gt;
        &lt;td class="tdred"&gt;2&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td&gt;string&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;length 32&lt;/td&gt;
        &lt;td&gt;length 32&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;length 36&lt;/td&gt;
        &lt;td&gt;length 36&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;no&lt;/td&gt;
        &lt;td class="tdred"&gt;e.g. &amp;quot;374e4f9816e51891b1647da16012a3ab&amp;quot;&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;1 pattern&lt;/td&gt;
        &lt;td class="tdred"&gt;e.g. &amp;quot;374e4f9816e51891b1647da16012a3ab&amp;quot;&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Date&lt;/td&gt;
        &lt;td class="tdred"&gt;3&lt;/td&gt;
        &lt;td class="tdred"&gt;1&lt;/td&gt;
        &lt;td&gt;date&lt;/td&gt;
        &lt;td&gt;date&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;1970-01-01 01:23:20&lt;/td&gt;
        &lt;td class="tdred"&gt;1970-01-01&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
        &lt;td&gt;2017-06-03 23:59:06&lt;/td&gt;
        &lt;td&gt;2017-06-03 23:59:06&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Date:time-before-now&lt;/td&gt;
        &lt;td class="tdred"&gt;2&lt;/td&gt;
        &lt;td class="tdred"&gt;1&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;109 days, 16:27:17&lt;/td&gt;
        &lt;td&gt;109 days, 16:28:09&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;17430 days, 16:26:23&lt;/td&gt;
        &lt;td class="tdred"&gt;17430 days, 16:27:15&lt;/td&gt;
        &lt;td class="tdred"&gt;✗&lt;/td&gt;
        &lt;td&gt;&amp;gt; 0&lt;/td&gt;
        &lt;td&gt;✓&lt;/td&gt;
        &lt;td class="tdgreen"&gt;✓&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
        &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Because a failed verification creates an indicator field for each failing
constraint, and an overall field with the number of failures for each
record, it is then easy to narrow down to the data being flagged by these
constraints, see whether the constraints are useful or overzealous,
and adjust them as necessary.&lt;/p&gt;
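&lt;p&gt;As a rough illustration of that narrowing-down step, here is a minimal sketch in plain Python. It does not use the tdda library, and it is not how our implementation works internally; the records, the two constraints and the field names (&lt;code&gt;Code_rex_ok&lt;/code&gt;, &lt;code&gt;Age_min_ok&lt;/code&gt;, &lt;code&gt;n_failures&lt;/code&gt;) are all hypothetical.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```python
# Hypothetical records and constraints; none of this is tdda API.
records = [
    {"Code": "AB1Z", "Age": 34},
    {"Code": "A__Z", "Age": 41},
    {"Code": "CD2Y", "Age": -3},
]

# Each constraint maps a field name to a predicate that is True when satisfied.
constraints = {
    "Code_rex_ok": lambda r: r["Code"].isalnum(),
    "Age_min_ok": lambda r: r["Age"] >= 0,
}


def add_indicators(record):
    """Return a copy with one pass/fail field per constraint and a failure count."""
    out = dict(record)
    for name, check in constraints.items():
        out[name] = check(record)
    out["n_failures"] = sum(1 for name in constraints if not out[name])
    return out


# Keep only the records that fail at least one constraint.
flagged = [r for r in map(add_indicators, records) if r["n_failures"]]
for r in flagged:
    print(r["Code"], r["n_failures"])
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Looking at the flagged records field by field is usually enough to decide whether a constraint is catching genuinely bad values or is simply too tight.&lt;/p&gt;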
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;p&gt;In this post, we've shown how we're extending Automatic Constraint
Generation in TDDA to cover cases where the datasets used are not
assumed to be perfect.  We think this is quite a significant
development.  We'll use it a bit more, and when it's solid, extend the
open-source
&lt;a href="https://www.tdda.info/obtaining-the-python-tdda-library"&gt;tdda&lt;/a&gt;
library to include this functionality.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:sorry"&gt;
&lt;p&gt;As
&lt;a href="https://www.tdda.info/why-test-driven-data-analysis#fn:SingularData"&gt;previously&lt;/a&gt;,
I am aware that, classically, data is the plural of datum, and that
purists would prefer the assertion to be "bad data are widespread and
pervasive." I apologise to anyone whose sensibilities are offended
by my use of the word &lt;em&gt;data&lt;/em&gt; in the singular.&amp;#160;&lt;a class="footnote-backref" href="#fnref:sorry" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:per-constraint"&gt;
&lt;p&gt;As a further clarification, the proportion &lt;em&gt;p&lt;/em&gt; is
used for each potential constraint on each field separately, so it's not
that "1% of all values" might be bad, but rather that "1% of the Ages might be
higher than the maximum constraint we generate" and so on.
Obviously, we could generalize this approach to allow different possible
proportions for each constraint type, or each field, or both, at the cost
of increasing the number of free parameters.&amp;#160;&lt;a class="footnote-backref" href="#fnref:per-constraint" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:round"&gt;
&lt;p&gt;Like you, I look at those figures and am immediately suspicious
that all the counts shown are multiples of 10. But this is one of those
cases where our suspicions are wrong: this is a random sample from larger data,
but the roundness of these numbers is blind chance.&amp;#160;&lt;a class="footnote-backref" href="#fnref:round" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:uuid1"&gt;
&lt;p&gt;In fact, looking carefully, the third group of digits is
fixed and starts with 1,
indicating this is a &lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_1_.28date-time_and_MAC_address.29"&gt;UUID-1&lt;/a&gt;, which is something
we hadn't noticed in this data until we got Rexpy to generate a regular
expression for us, as part of TDDA.&amp;#160;&lt;a class="footnote-backref" href="#fnref:uuid1" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:except"&gt;
&lt;p&gt;With the time-delta constraints we are now generating, this is
not strictly true, but this need not concern us in this case.&amp;#160;&lt;a class="footnote-backref" href="#fnref:except" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="constraints"></category><category term="discovery"></category><category term="verification"></category><category term="suggestion"></category><category term="cartoon"></category><category term="bad data"></category></entry><entry><title>Obtaining the Python tdda Library</title><link href="https://tdda.info/obtaining-the-python-tdda-library.html" rel="alternate"></link><published>2017-09-14T15:30:00+01:00</published><updated>2017-09-14T15:30:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-09-14:/obtaining-the-python-tdda-library.html</id><summary type="html">&lt;p&gt;This post is a standing post that we plan to try to keep up to date,
describing options for obtaining the open-source Python
TDDA library that we maintain.&lt;/p&gt;
&lt;h2 id="using-pip-from-pypi"&gt;Using pip from PyPI&lt;/h2&gt;
&lt;p&gt;Assuming you have a working pip setup, you should be able to install
the tdda library by typing …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This post is a standing post that we plan to try to keep up to date,
describing options for obtaining the open-source Python
TDDA library that we maintain.&lt;/p&gt;
&lt;h2 id="using-pip-from-pypi"&gt;Using pip from PyPI&lt;/h2&gt;
&lt;p&gt;Assuming you have a working pip setup, you should be able to install
the tdda library by typing:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip install tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or, if your permissions don't allow use in this mode&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sudo pip install tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If &lt;code&gt;pip&lt;/code&gt; isn't working, or is associated with a different Python from the one you are using, try:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;python -m pip install tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sudo python -m pip install tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
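
&lt;p&gt;If it is unclear which Python a given &lt;code&gt;pip&lt;/code&gt; belongs to, a short check run with the interpreter you actually intend to use can help. The following is only a sketch: it assumes Python 3.8 or later (for &lt;code&gt;importlib.metadata&lt;/code&gt;), and the helper name &lt;code&gt;installed_version&lt;/code&gt; is purely illustrative.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The interpreter running this script, and the tdda version it can see.
print("Python executable:", sys.executable)
print("tdda version:", installed_version("tdda"))
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If the version printed is &lt;code&gt;None&lt;/code&gt;, the library is not installed for that interpreter, which suggests installing with the matching &lt;code&gt;python -m pip&lt;/code&gt; form.&lt;/p&gt;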

&lt;p&gt;The tdda library supports both Python 3 (tested with 3.6 and 3.7) and Python 2 (tested with 2.7). (We'll start testing against 3.8 real soon!)&lt;/p&gt;
&lt;h2 id="upgrading"&gt;Upgrading&lt;/h2&gt;
&lt;p&gt;If you have a version of the &lt;code&gt;tdda&lt;/code&gt; library installed and want to upgrade it with pip, add &lt;code&gt;-U&lt;/code&gt; to one of the commands above, i.e. use whichever of the following you need for your setup:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip install -U tdda
sudo pip install -U tdda
python -m pip install -U tdda
sudo python -m pip install -U tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="installing-from-source"&gt;Installing from Source&lt;/h2&gt;
&lt;p&gt;The source for the tdda library is available from Github and can be
cloned with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;git clone https://github.com/tdda/tdda.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;clone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="nv"&gt;@github&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;com&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When installing from source, if you want the command line &lt;code&gt;tdda&lt;/code&gt; utility
to be available, you need to run&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;python setup.py install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;from the top-level tdda directory after downloading it.&lt;/p&gt;
&lt;h2 id="documentation"&gt;Documentation&lt;/h2&gt;
&lt;p&gt;The main documentation for the &lt;code&gt;tdda&lt;/code&gt; library is available on
&lt;a href="https://tdda.readthedocs.org"&gt;Read the Docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can also build it yourself if you have downloaded the source from Github.
In order to do this, you will need an installation of
&lt;a href="https://pypi.python.org/pypi/Sphinx"&gt;Sphinx&lt;/a&gt;.
The HTML documentation is built, starting from the top-level
&lt;code&gt;tdda&lt;/code&gt; directory by running:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;cd doc
make html
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="running-tddas-tests"&gt;Running TDDA's tests&lt;/h2&gt;
&lt;p&gt;Once you have installed TDDA (whether using pip or from source), you can
run its tests by typing&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;tdda test
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you have all the dependencies, including optional dependencies, installed,
you should get a line of dots and the message OK at the end, something like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ tdda test&lt;/span&gt;
&lt;span class="nt"&gt;........................................................................................................................&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 122 tests in 3&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;251s&lt;/span&gt;

&lt;span class="c"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you don't have some of the optional dependencies installed, some of the dots will be replaced by the letter 's'. For example:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ tdda test&lt;/span&gt;
&lt;span class="nt"&gt;.................................................................&lt;/span&gt;&lt;span class="c"&gt;s&lt;/span&gt;&lt;span class="nt"&gt;.............................&lt;/span&gt;&lt;span class="c"&gt;s&lt;/span&gt;&lt;span class="nt"&gt;........................&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 120 tests in 3&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;221s&lt;/span&gt;

&lt;span class="c"&gt;OK (skipped=2)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This does not indicate a problem; it simply means that some of the
functionality will be unavailable (usually support for one or more database types).&lt;/p&gt;
&lt;h2 id="using-the-tdda-examples"&gt;Using the TDDA examples&lt;/h2&gt;
&lt;p&gt;The tdda library includes three sets of examples, covering
&lt;a href="https://www.tdda.info/the-new-referencetest-class-for-tdda"&gt;reference testing&lt;/a&gt;,
&lt;a href="https://www.tdda.info/constraint-discovery-and-verification-for-pandas-dataframes"&gt;automatic constraint discovery and verification&lt;/a&gt;,
and
&lt;a href="https://www.tdda.info/introducing-rexpy-automatic-discovery-of-regular-expressions"&gt;Rexpy&lt;/a&gt;
(discovery of regular expressions from examples,
outside the context of constraints).&lt;/p&gt;
&lt;p&gt;The tdda command line can be used to copy the relevant files into place.
To get the examples, first change to a directory where you would like
them to be placed, and then use the command:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;tdda examples
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This should produce the following output:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Copied&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;referencetest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;./&lt;/span&gt;&lt;span class="n"&gt;referencetest&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;Copied&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;./&lt;/span&gt;&lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;Copied&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rexpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;./&lt;/span&gt;&lt;span class="n"&gt;rexpy&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="quick-reference-guides"&gt;Quick Reference Guides&lt;/h2&gt;
&lt;p&gt;There are quick reference guides available for the TDDA library.
These are often a little behind the current release, but are usually
still quite helpful.&lt;/p&gt;
&lt;p&gt;These are available from &lt;a href="https://www.tdda.info/pdf/tdda-quickref.pdf"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="tutorial-from-pydata-london"&gt;Tutorial from PyData London&lt;/h2&gt;
&lt;p&gt;There is a &lt;a href="https://www.youtube.com/watch?v=TGwZnZYg0jw&amp;amp;list=PLGVZCDnMOq0pAwbVAb1kUN3lV7ukhLL2k&amp;amp;index=7"&gt;video&lt;/a&gt; online of a workshop at
&lt;a href="https://pydata.org/london2017/"&gt;PyData London 2017&lt;/a&gt;.
Watching a video of a workshop probably isn't ideal,
but it does have a fairly detailed and gentle introduction to using
the library,
so if you are struggling, it might be a good place to start.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="python"></category></entry><entry><title>GDPR, Consent and Microformats: A Half-Baked Idea</title><link href="https://tdda.info/gdpr-consent-and-microformats-a-half-baked-idea.html" rel="alternate"></link><published>2017-09-08T11:00:00+01:00</published><updated>2017-09-08T11:00:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-09-08:/gdpr-consent-and-microformats-a-half-baked-idea.html</id><summary type="html">&lt;p&gt;Last night I went to
&lt;a href="https://www.meetup.com/Protectors-of-Data-Scotland-PODs/"&gt;The Protectors of Data Scotland&lt;/a&gt; Meetup
on the subject of
&lt;a href="https://www.meetup.com/Protectors-of-Data-Scotland-PODs/events/242504992/"&gt;Marketing and GDPR&lt;/a&gt;.
If you're not familiar with Europe's fast-approaching
General Data Protection Regulation, and you keep or process any personal data
about humans,&lt;sup id="fnref:pii"&gt;&lt;a class="footnote-ref" href="#fn:pii"&gt;1&lt;/a&gt;&lt;/sup&gt; you probably ought to learn about it.
A good place …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Last night I went to
&lt;a href="https://www.meetup.com/Protectors-of-Data-Scotland-PODs/"&gt;The Protectors of Data Scotland&lt;/a&gt; Meetup
on the subject of
&lt;a href="https://www.meetup.com/Protectors-of-Data-Scotland-PODs/events/242504992/"&gt;Marketing and GDPR&lt;/a&gt;.
If you're not familiar with Europe's fast-approaching
General Data Protection Regulation, and you keep or process any personal data
about humans,&lt;sup id="fnref:pii"&gt;&lt;a class="footnote-ref" href="#fn:pii"&gt;1&lt;/a&gt;&lt;/sup&gt; you probably ought to learn about it.
A good place to start is episode 202 of Horace Dediu's
&lt;a href="https://5by5.tv/criticalpath/202"&gt;The Critical Path podcast&lt;/a&gt;,
in which he interviews &lt;a href="https://twitter.com/tim_walters"&gt;Tim Walters&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;During the meeting, I had an idea, and though it is rather less than
half-baked, right now it seems just about interesting enough that I thought
I'd record it.&lt;/p&gt;
&lt;p&gt;One of the key provisions in GDPR is that data processing generally requires
&lt;a href="https://gdpr-legislation.co.uk/lawful-processing"&gt;consent of the data subject&lt;/a&gt;,
and that consent is
&lt;a href="https://gdpr-legislation.co.uk/lawful-processing"&gt;defined as&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;any freely given, specific, informed and unambiguous indication
of his or her wishes by which the data subject, either by a statement
or by a clear affirmative action, signifies agreement to personal
data relating to them being processed&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is further &lt;a href="https://gdpr-legislation.co.uk/lawful-processing"&gt;clarified&lt;/a&gt;
as follows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This could include ticking a box when visiting an Internet website,
choosing technical settings for information society services or
by any other statement or conduct which clearly indicates in this
context the data subject’s acceptance of the proposed processing
of their personal data.&lt;/p&gt;
&lt;p&gt;Silence, pre-ticked boxes or inactivity should therefore not
constitute consent.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="the-idea-in-a-nutshell"&gt;The Idea in a Nutshell&lt;/h2&gt;
&lt;p&gt;Briefly, the idea is this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Websites (and potentially apps) requesting consents should include a
    digital specification of that consent in a standardized format
    to be defined (probably either an HTML microformat or a standardized
    JSON bundle).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This would allow software to understand the consents being
    requested unambiguously and present them in a standardized, uniform,
    easy-to-understand format. It would also encourage businesses and
    other organizations to standardize the forms of consent they request.
    I imagine that if this happened, initially browser extensions and
    special apps such as password managers would learn to read the format
    and present the information clearly, but if it were successful,
    eventually web browsers themselves would do this.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Software could also allow people to create one or more templates
    or default responses, allowing, for example, someone who never wants
    to receive marketing to make this their default response, and someone
    who wants as many special offers as possible to have settings that
    reflect that. Obviously, you might want several different templates
    for organizations towards which you have different feelings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A very small extension to the idea would extend the format to record
    the choices made, allowing password managers, browsers, apps etc. to
    record for the user exactly what consents were given.&lt;sup id="fnref:blockchain"&gt;&lt;a class="footnote-ref" href="#fn:blockchain"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="benefits"&gt;Benefits&lt;/h2&gt;
&lt;p&gt;I believe such a standard has potential benefits for all
parties—businesses and other organizations requesting consent, individuals giving consent, regulators and courts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Businesses and other data processing/capturing organizations would
    benefit from a clear set of consent kinds, each of which could
    have a detailed description (perhaps on an EU or W3C document)
    that could be referenced by a specific label (e.g.
    &lt;code&gt;marketing_contact_email_organization&lt;/code&gt;). Best practice would
    hopefully quickly move to using these standard categories.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Software could present information in a standardized, clear way to
    users, highlighting non-standard provisions (preferably with
    standard symbols, a bit like the
    &lt;a href="https://creativecommons.org/licenses/"&gt;Creative Commons Symbols&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;By using template responses, users could more easily complete consent
    forms with less effort and less fear of ticking the wrong boxes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="digital-specification-microformat-json"&gt;Digital Specification? Microformat? JSON?&lt;/h2&gt;
&lt;p&gt;What are we actually talking about here?&lt;/p&gt;
&lt;p&gt;The main (textual) content of a web page consists of the actual human-readable
text together with annotation ("markup") to specify formatting and layout,
as well as special features like the sort of checkboxes used to request
consent.
In older versions of the web, a web page was literally a text file
in a special format originally defined by
&lt;a href="https://en.wikipedia.org/wiki/Tim_Berners-Lee"&gt;Tim Berners-Lee&lt;/a&gt;
(&lt;a href="https://en.wikipedia.org/wiki/HTML"&gt;HTML&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;For example, in HTML, this one-sentence paragraph with the word &lt;strong&gt;very&lt;/strong&gt;
in bold might be written as follows:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;For example, in HTML, this one-sentence paragraph with the
word &lt;span class="nt"&gt;&amp;lt;b&amp;gt;&lt;/span&gt;very&lt;span class="nt"&gt;&amp;lt;/b&amp;gt;&lt;/span&gt; in bold might be written as follows:&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Since the advent of &lt;a href="https://en.wikipedia.org/wiki/Web_2.0"&gt;"Web 2.0"&lt;/a&gt;,
many web pages are generated dynamically, with much of
the data being sent in a format called
&lt;a href="https://en.wikipedia.org/wiki/JSON"&gt;JSON&lt;/a&gt;. A simple example of some
JSON for describing (say) an element in the periodic table might be&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;{
    &amp;quot;name&amp;quot;: &amp;quot;Lithium&amp;quot;,
    &amp;quot;atomicnumber&amp;quot;: 3,
    &amp;quot;metal&amp;quot;: true,
    &amp;quot;period&amp;quot;: 2,
    &amp;quot;group&amp;quot;: 1,
    &amp;quot;etymology&amp;quot;: &amp;quot;Greek lithos&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It would be straightforward&lt;sup id="fnref:work"&gt;&lt;a class="footnote-ref" href="#fn:work"&gt;3&lt;/a&gt;&lt;/sup&gt; to develop a format for allowing all the
common types of marketing consent (and indeed, many other kinds of
processing consent) to be expressed either in JSON or an HTML microformat
(which might not be rendered directly by the webpage). As a sketch, the
&lt;em&gt;kind of thing&lt;/em&gt; a marketing consent request might look like in JSON could be:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;{
    &amp;quot;format&amp;quot;: &amp;quot;GDPR-Marketing-Consent-Bundle&amp;quot;,
    &amp;quot;format_version&amp;quot;: &amp;quot;1.0&amp;quot;,
    &amp;quot;requesting_organization&amp;quot;: &amp;quot;Stochastic Solutions Limited&amp;quot;,
    &amp;quot;requesting_organization_data_protection_policy_page&amp;quot;:
        &amp;quot;https://stochasticsolutions.com/privacy.html&amp;quot;,
    &amp;quot;requesting_organization_partners&amp;quot;: [],
    &amp;quot;requested_consents&amp;quot;: [
        &amp;quot;marketing_contact_email_organization&amp;quot;,
        &amp;quot;marketing_contact_mobile_phone_organization&amp;quot;,
        &amp;quot;marketing_contact_physical_mail_organization&amp;quot;
    ],
    &amp;quot;request_date&amp;quot;: &amp;quot;2017-09-08&amp;quot;,
    &amp;quot;request_url&amp;quot;: &amp;quot;https://StochasticSolutions.com/give-us-your-data&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Key features I am trying to illustrate here are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The format would include details about the organization making the
    request&lt;/li&gt;
&lt;li&gt;The format would have the capacity to list partner organizations
    in cases in which consent for partner marketing or processing
    was also requested&lt;/li&gt;
&lt;li&gt;The format would be granular with a taxonomy of known kinds of
    consents. These might be parameterized, rather than being simple
    strings. In this case, I've included a few different contact
    mechanisms and the suffix "organization" to indicate this is
    consent for the organization itself, rather than any partners
    or other randoms.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Undoubtedly, a real implementation would end up a bit bigger than this,
and perhaps more hierarchical, but hopefully not too much bigger.&lt;/p&gt;
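&lt;p&gt;To make the template idea from earlier concrete, here is a minimal sketch in Python of how software might answer a request from a user's default responses. It assumes the illustrative request structure above; the function name &lt;code&gt;apply_template&lt;/code&gt; and the template itself are hypothetical, and any label the template does not mention is refused, on the principle that silence is not consent.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```python
# Hypothetical request, following the illustrative bundle format above.
request = {
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requested_consents": [
        "marketing_contact_email_organization",
        "marketing_contact_mobile_phone_organization",
        "marketing_contact_physical_mail_organization",
    ],
}

# A user-defined template: labels not listed here are refused.
paper_mail_only = {"marketing_contact_physical_mail_organization": True}


def apply_template(request, template):
    """Answer every requested consent from the template, defaulting to False."""
    return {
        label: bool(template.get(label, False))
        for label in request["requested_consents"]
    }


responses = apply_template(request, paper_mail_only)
for label, given in responses.items():
    print(label, given)
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A real tool would presumably show the proposed answers to the user for confirmation rather than submitting them silently.&lt;/p&gt;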
&lt;p&gt;The format could be extended very simply to include the response,
which could then be sent back to the site and also made available
on the page to the browser/password manager/apps etc.
Here is an augmentation of the request format that also captures the
responses:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;{
    &amp;quot;format&amp;quot;: &amp;quot;GDPR-Marketing-Consent-Bundle&amp;quot;,
    &amp;quot;format_version&amp;quot;: &amp;quot;1.0&amp;quot;,
    &amp;quot;requesting_organization&amp;quot;: &amp;quot;Stochastic Solutions Limited&amp;quot;,
    &amp;quot;requesting_organization_data_protection_policy_page&amp;quot;:
        &amp;quot;https://stochasticsolutions.com/privacy.html&amp;quot;,
    &amp;quot;requesting_organization_partners&amp;quot;: [],
    &amp;quot;requested_consents&amp;quot;: {
        &amp;quot;marketing_contact_email_organization&amp;quot;: false,
        &amp;quot;marketing_contact_mobile_phone_organization&amp;quot;: false,
        &amp;quot;marketing_contact_physical_mail_organization&amp;quot;: true
    },
    &amp;quot;request_date&amp;quot;: &amp;quot;2017-09-08&amp;quot;,
    &amp;quot;request_url&amp;quot;: &amp;quot;https://StochasticSolutions.com/give-us-your-data&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This example indicates consent to marketing contact by paper mail from
the organization, but not by phone or email.&lt;/p&gt;
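&lt;p&gt;Reading such a record back is equally simple. The sketch below, again in Python and again assuming the purely illustrative format above, parses a stored response and lists the consents that were actually given; the function name &lt;code&gt;granted_consents&lt;/code&gt; is hypothetical.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```python
import json

# Hypothetical stored response, following the illustrative format above.
bundle_text = """
{
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requesting_organization": "Stochastic Solutions Limited",
    "requesting_organization_partners": [],
    "requested_consents": {
        "marketing_contact_email_organization": false,
        "marketing_contact_mobile_phone_organization": false,
        "marketing_contact_physical_mail_organization": true
    },
    "request_date": "2017-09-08"
}
"""


def granted_consents(text):
    """Return the sorted labels of the consents marked true in a bundle."""
    bundle = json.loads(text)
    if bundle.get("format") != "GDPR-Marketing-Consent-Bundle":
        raise ValueError("not a consent bundle")
    consents = bundle["requested_consents"]
    # Only an explicit JSON true counts as consent.
    return sorted(label for label, given in consents.items() if given is True)


granted = granted_consents(bundle_text)
print(granted)
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Treating only an explicit &lt;code&gt;true&lt;/code&gt; as consent is a design choice in this sketch, intended to mirror the requirement for a clear affirmative action.&lt;/p&gt;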
&lt;p&gt;Exactly the same could be achieved with an HTML Microformat, perhaps with
something like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;GDPR-Marketing-Consent-Bundle&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;format_version&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;1.0&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;requesting_organization&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        Stochastic Solutions Limited
    &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;requesting_organization_data_protection_policy_page&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        https://stochasticsolutions.com/privacy.html
    &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;ol&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;requesting_organization_partners&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/ol&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;ol&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;requested_consents&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;li&amp;gt;&lt;/span&gt;marketing_contact_email_organization&lt;span class="nt"&gt;&amp;lt;/li&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;li&amp;gt;&lt;/span&gt;marketing_contact_mobile_phone_organization&lt;span class="nt"&gt;&amp;lt;/li&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;li&amp;gt;&lt;/span&gt;marketing_contact_physical_mail_organization&lt;span class="nt"&gt;&amp;lt;/li&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/ol&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;request_date&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;2017-09-08&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;request_url&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        https://StochasticSolutions.com/give-us-your-data
    &lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(Again, I've no idea whether this is actually what HTML-based microformats
typically look like; this is purely illustrative.)&lt;/p&gt;
&lt;h2 id="useful"&gt;Useful?&lt;/h2&gt;
&lt;p&gt;I don't know whether this idea is useful or feasible, nor whether it
is merely a half-baked version of something that a phalanx of people
in Brussels has already specified, though I did perform many seconds
of arduous due diligence in the form of web searches for terms like
"marketing consent microformat" without turning up anything obviously
relevant.&lt;/p&gt;
&lt;p&gt;It seems to me that if something like this were created and adopted,
it might help GDPR and web/app-based consent avoid the ignominious
fate of the cookie pop-ups that were so well intentioned but such a waste
of time in practice. Ideally, some kind of collaboration between
the relevant part of the EU and the W3C would produce (or at least
endorse) any format.&lt;/p&gt;
&lt;p&gt;Do get in touch through any of the channels
(&lt;a href="https://twitter.com/tdda0"&gt;@tdda0&lt;/a&gt;, mail to &lt;code&gt;info@&lt;/code&gt; this domain etc.)
if you have thoughts.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:pii"&gt;
&lt;p&gt;So-called &lt;a href="https://en.wikipedia.org/wiki/Personally_identifiable_information"&gt;&lt;em&gt;personally identifiable information&lt;/em&gt; (PII)&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:pii" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:blockchain"&gt;
&lt;p&gt;Possibly even on a blockchain, if you want to be terribly
&lt;em&gt;au courant&lt;/em&gt; and have the possibility of cryptographic verification.&amp;#160;&lt;a class="footnote-backref" href="#fnref:blockchain" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:work"&gt;
&lt;p&gt;&lt;em&gt;technically&lt;/em&gt; straightforward; obviously this would require much
work and hammering out of special cases and mechanisms for non-standard
requirements.&amp;#160;&lt;a class="footnote-backref" href="#fnref:work" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category></entry><entry><title>Quick Reference for TDDA Library</title><link href="https://tdda.info/quick-reference-for-tdda-library.html" rel="alternate"></link><published>2017-05-04T15:30:00+01:00</published><updated>2017-05-04T15:30:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-05-04:/quick-reference-for-tdda-library.html</id><summary type="html">&lt;p&gt;A quick-reference guide ("cheat sheet") is now available for
the &lt;a href="https://pypi.python.org/pypi/tdda"&gt;Python TDDA library&lt;/a&gt;.
This is linked in the sidebar and available
&lt;a href="https://www.tdda.info/pdf/tdda-quickref.pdf"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We will try to keep it up-to-date as the library evolves.&lt;/p&gt;
&lt;p&gt;See you all at &lt;a href="https://pydata.org/london2017/"&gt;PyData London 2017&lt;/a&gt;
this weekend (5-6 May 2017),
where we'll be running a …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A quick-reference guide ("cheat sheet") is now available for
the &lt;a href="https://pypi.python.org/pypi/tdda"&gt;Python TDDA library&lt;/a&gt;.
This is linked in the sidebar and available
&lt;a href="https://www.tdda.info/pdf/tdda-quickref.pdf"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We will try to keep it up-to-date as the library evolves.&lt;/p&gt;
&lt;p&gt;See you all at &lt;a href="https://pydata.org/london2017/"&gt;PyData London 2017&lt;/a&gt;
this weekend (5-6 May 2017),
where we'll be running a &lt;a href="https://pydata.org/london2017/schedule/presentation/2/"&gt;TDDA tutorial&lt;/a&gt; on Friday.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category></entry><entry><title>Improving Rexpy</title><link href="https://tdda.info/improving-rexpy.html" rel="alternate"></link><published>2017-03-09T15:30:00+00:00</published><updated>2017-03-09T15:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-03-09:/improving-rexpy.html</id><summary type="html">&lt;p&gt;Today we are announcing some enhancements to Rexpy,
the &lt;a href="https://pypi.python.org/pypi/tdda"&gt;tdda&lt;/a&gt;
tool for finding regular expressions from examples.
In short, the new version often finds more precise regular expressions
than was previously the case, with the only downside being a modest
increase in run-time.&lt;/p&gt;
&lt;p&gt;Background on Rexpy is available in two …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Today we are announcing some enhancements to Rexpy,
the &lt;a href="https://pypi.python.org/pypi/tdda"&gt;tdda&lt;/a&gt;
tool for finding regular expressions from examples.
In short, the new version often finds more precise regular expressions
than was previously the case, with the only downside being a modest
increase in run-time.&lt;/p&gt;
&lt;p&gt;Background on Rexpy is available in two previous posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.tdda.info/introducing-rexpy-automatic-discovery-of-regular-expressions"&gt;This post&lt;/a&gt; introduced the concept and Python library&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tdda.info/coverage-information-for-rexpy"&gt;This post&lt;/a&gt;
    discussed the addition of coverage information—statistics
    about how many examples each regular expression matched.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Rexpy is also available online at &lt;a href="https://rexpy.herokuapp.com"&gt;https://rexpy.herokuapp.com&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="weaknesses-addressed"&gt;Weaknesses addressed&lt;/h2&gt;
&lt;p&gt;Rexpy is not intended to be an entirely general-purpose tool: it is
specifically focused on the case of trying to find regular expressions
to characterize the sort of structured textual data we most often see
in databases and datasets. We are very interested in characterizing
things like identifiers, zip codes, phone numbers, URLs, UUIDs,
social security numbers and (string) bar codes, and much less interested
in characterizing things like sentences, tweets, programs and encrypted text.&lt;/p&gt;
&lt;p&gt;Within this focus, there were some obvious shortcomings of Rexpy, a significant
subset of which the current release (tdda version 0.3.0) now addresses.&lt;/p&gt;
&lt;h3 id="example-1-postcodes"&gt;Example 1: Postcodes&lt;/h3&gt;
&lt;p&gt;Rexpy's tests have always included postcodes, but Rexpy never did a very good
job with them. Here is the output from using tdda version 0.2.7:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python rexpy.py
EH1 1AA
B2 8EA

^&lt;span class="o"&gt;[&lt;/span&gt;A-Z0-9&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;,3&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-9A-F&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Rexpy's result is completely valid, but not very specific.
It has correctly identified that there are two main parts,
separated by a space, and that the first part is a mixture of two or three
characters, each a capital letter or a number, and that the second part
is exactly three characters, again all capital letters or numbers.
However, it has failed to notice that the
first group &lt;em&gt;starts&lt;/em&gt; with a letter and follows this with a single digit
and that the second group is one digit followed by two letters.
What a human would probably have written is something more like:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;,2&lt;span class="o"&gt;}[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-9&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;-9&lt;span class="o"&gt;][&lt;/span&gt;A-F&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let's try Rexpy 0.3.0.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;,2&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\d\ \d&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now Rexpy does exactly what we would probably have wanted it to do,
and has actually written it slightly more compactly: &lt;code&gt;\d&lt;/code&gt; is any digit,
i.e. it is precisely equivalent to &lt;code&gt;[0-9]&lt;/code&gt;.&lt;/p&gt;
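&lt;p&gt;As a quick check, here is a minimal sketch using Python's standard
&lt;code&gt;re&lt;/code&gt; module (not Rexpy itself) to confirm that both expressions
accept the two example postcodes, and that the 0.3.0 expression is the more
specific of the two. The counter-example &lt;code&gt;123 4AA&lt;/code&gt; and the helper
&lt;code&gt;matches&lt;/code&gt; are our own inventions for illustration.&lt;/p&gt;

```python
import re

# Patterns copied from the Rexpy outputs shown above.
OLD = r'^[A-Z0-9]{2,3}\ [0-9A-F]{3}$'   # tdda 0.2.7
NEW = r'^[A-Z]{1,2}\d\ \d[A-Z]{2}$'     # tdda 0.3.0

examples = ['EH1 1AA', 'B2 8EA']


def matches(pattern, strings):
    """Return True if every string matches the pattern (hypothetical helper)."""
    return all(re.match(pattern, s) is not None for s in strings)


# Both expressions accept the training examples.
print(matches(OLD, examples))   # True
print(matches(NEW, examples))   # True

# An invented string that is clearly not postcode-shaped.
odd = '123 4AA'
print(re.match(OLD, odd) is not None)   # True: the old pattern is too loose
print(re.match(NEW, odd) is not None)   # False: the new pattern rejects it
```

&lt;p&gt;Both patterns are valid for the inputs; the difference is only in how
much else they let through.&lt;/p&gt;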
&lt;p&gt;With a few more examples, it still does the perfect thing (in 0.3.0).&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ rexpy
EC1 1BB
W1 0AX
M1 1AE
B33 8TH
CR2 6XH
DN55 1PT

^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;,2&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;,2&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\ \d&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(Note that the 0.3.0 release of TDDA includes wrapper scripts, &lt;code&gt;rexpy&lt;/code&gt;
and &lt;code&gt;tdda&lt;/code&gt; that allow the main functions to be used directly from
the command line. These are installed when you &lt;code&gt;pip install tdda&lt;/code&gt;.
So the &lt;code&gt;rexpy&lt;/code&gt; above is exactly equivalent to running &lt;code&gt;python rexpy.py&lt;/code&gt;.)&lt;/p&gt;
&lt;p&gt;We should note, however, that it still doesn't work perfectly
if we include general London postcodes (even with 0.3.0):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ rexpy
EC1A 1BB
W1A 0AX
M1 1AE
B33 8TH
CR2 6XH
DN55 1PT

^&lt;span class="o"&gt;[&lt;/span&gt;A-Z0-9&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;,4&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\ \d&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this case, the addition of the final letter in the first block for
&lt;code&gt;EC1A&lt;/code&gt; and &lt;code&gt;W1A&lt;/code&gt; has convinced the software that the first block is just
a jumble of capital letters and numbers. We might hope that examples
like these (at least, if expanded) would result in something like:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;,2&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;,2&lt;span class="o"&gt;}&lt;/span&gt;A?&lt;span class="se"&gt;\ \d&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;though the real pattern for postcodes is actually
&lt;a href="https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom"&gt;quite complex&lt;/a&gt;,
with only certain London postal areas being allowed a trailing letter,
and only in cases where there is a single digit in the first group,
and that letter can actually be &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;C&lt;/code&gt;, &lt;code&gt;P&lt;/code&gt; or &lt;code&gt;W&lt;/code&gt;.&lt;/p&gt;
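&lt;p&gt;To make the comparison concrete, the following sketch (again plain
&lt;code&gt;re&lt;/code&gt;, not Rexpy) checks the expression 0.3.0 actually produced and
the one we might hope for against the six examples. The string
&lt;code&gt;9Z9Z 1AA&lt;/code&gt; is an invented jumble, and &lt;code&gt;accepted&lt;/code&gt; is a
hypothetical helper.&lt;/p&gt;

```python
import re

ACTUAL = r'^[A-Z0-9]{2,4}\ \d[A-Z]{2}$'         # produced by 0.3.0
HOPED = r'^[A-Z]{1,2}\d{1,2}A?\ \d[A-Z]{2}$'    # what we might hope for

postcodes = ['EC1A 1BB', 'W1A 0AX', 'M1 1AE',
             'B33 8TH', 'CR2 6XH', 'DN55 1PT']
jumble = '9Z9Z 1AA'   # invented: a jumble of capitals and digits


def accepted(pattern, strings):
    """Return the strings that match the pattern (hypothetical helper)."""
    return [s for s in strings if re.match(pattern, s)]


# Both expressions cover every example...
print(len(accepted(ACTUAL, postcodes)))   # 6
print(len(accepted(HOPED, postcodes)))    # 6

# ...but only the looser one also lets the jumble through.
print(accepted(ACTUAL, [jumble]))   # ['9Z9Z 1AA']
print(accepted(HOPED, [jumble]))    # []
```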
&lt;p&gt;So while it isn't &lt;em&gt;perfect&lt;/em&gt;, Rexpy is doing fairly well with postcodes now.&lt;/p&gt;
&lt;p&gt;Let's look at another couple of examples.&lt;/p&gt;
&lt;h3 id="example-2-toy-examples"&gt;Example 2: Toy Examples&lt;/h3&gt;
&lt;p&gt;Looking at the logs from &lt;a href="https://rexpy.herokuapp.com"&gt;Rexpy online&lt;/a&gt;,
it is clear that a lot of people (naturally) start by trying the sorts
of examples commonly used for teaching regular expressions.
Here are some examples motivated by what we tend to see in logs.&lt;/p&gt;
&lt;p&gt;First, let's try a common toy example in the old version of Rexpy (0.2.7):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python rexpy.py
ab
abb
abbb
abbbb
abbbbb
abbbbbb

^&lt;span class="o"&gt;[&lt;/span&gt;a-z&lt;span class="o"&gt;]&lt;/span&gt;+$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Not so impressive.&lt;/p&gt;
&lt;p&gt;Now in the new version (0.3.0):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ rexpy
ab
abb
abbb
abbbb
abbbbb
abbbbbb

^ab+$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That's more like it!&lt;/p&gt;
&lt;h3 id="example-3-names"&gt;Example 3: Names&lt;/h3&gt;
&lt;p&gt;Here's another example it's got better at. First under the old version:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python rexpy.py
Albert Einstein
Rosalind Franklin
Isaac Newton

^&lt;span class="o"&gt;[&lt;/span&gt;A-Za-z&lt;span class="o"&gt;]&lt;/span&gt;+&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Za-z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;,8&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Again, this is not wrong, but Rexpy has singularly failed
to notice the pattern of capitalization.&lt;/p&gt;
&lt;p&gt;Now, under the new version:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ rexpy
Albert Einstein
Rosalind Franklin
Isaac Newton

^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]&lt;/span&gt;+&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;,7&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Better.&lt;/p&gt;
&lt;p&gt;Incidentally, it's not doing anything special with the first character of
groups. Here are some related examples:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ rexpy
AlbertEinstein
RosalindFranklin
IsaacNewton

^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]&lt;/span&gt;+&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;,7&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="example-4-identifiers"&gt;Example 4: Identifiers&lt;/h3&gt;
&lt;p&gt;Some of the examples we used previously were like this
(same result under both versions):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ rexpy
&lt;span class="m"&gt;123&lt;/span&gt;-AA-22
&lt;span class="m"&gt;576&lt;/span&gt;-KY-18
&lt;span class="m"&gt;989&lt;/span&gt;-BR-93

^&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\-\d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;What worked less well in the old version were examples like these:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python rexpy.py
123AA22
576KY18
989BR93

^&lt;span class="o"&gt;[&lt;/span&gt;A-Z0-9&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;These work much better under 0.3.0:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ rexpy
123AA22
576KY18
989BR93

^&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="o"&gt;}[&lt;/span&gt;A-Z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="some-remaining-areas-for-improvement"&gt;Some Remaining Areas for Improvement&lt;/h2&gt;
&lt;p&gt;The changes that we've made in this release of Rexpy appear to be
almost unambiguous improvements. Both from trying examples, and from
understanding the underlying code changes, we can find almost no
cases in which the changes make the results worse, and a great
number where the results are improved. Of course, that's not to say
that there don't remain areas that could be improved.&lt;/p&gt;
&lt;p&gt;Here we summarize a few of the things we still hope to improve:&lt;/p&gt;
&lt;h3 id="alternations-of-whole-groups"&gt;Alternations of Whole Groups&lt;/h3&gt;
&lt;p&gt;Rexpy isn't very good at generating &lt;em&gt;alternations&lt;/em&gt; at the moment, either
at a character or group level. So for example, you might have hoped
that in the following example, Rexpy would notice that the middle two
letters are always &lt;code&gt;AA&lt;/code&gt; or &lt;code&gt;BB&lt;/code&gt; (or, possibly, that the letter is
repeated).&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ rexpy
&lt;span class="m"&gt;123&lt;/span&gt;-AA-321
&lt;span class="m"&gt;465&lt;/span&gt;-BB-763
&lt;span class="m"&gt;777&lt;/span&gt;-AA-81
&lt;span class="m"&gt;434&lt;/span&gt;-BB-987
&lt;span class="m"&gt;101&lt;/span&gt;-BB-773
&lt;span class="m"&gt;032&lt;/span&gt;-BB-881
&lt;span class="m"&gt;094&lt;/span&gt;-AA-662

^&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\-\d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;,3&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Unfortunately, it does not.
(This probably won't be very hard to change.)&lt;/p&gt;
&lt;h3 id="alternations-within-groups"&gt;Alternations within Groups&lt;/h3&gt;
&lt;p&gt;Similarly, you might hope that it would do rather better than this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ rexpy
Roger
Boger
Coger

^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Clearly, we would like this to produce&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;]&lt;/span&gt;oger$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and you might think from the previous examples that it would do this,
but it can't combine the fixed &lt;code&gt;oger&lt;/code&gt; with the adjacent letter range.&lt;/p&gt;
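&lt;p&gt;For illustration only, here is one way the desired result could be
derived for this particular shape of input: factor out the literal suffix
shared by all examples and generalize the single leading capital. This is a
sketch under our own assumptions, not a description of how Rexpy works or
will work; &lt;code&gt;factor_common_suffix&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
import os
import re


def factor_common_suffix(examples):
    """Build a regex from single capital-letter heads plus a shared suffix.

    Hypothetical helper for illustration; not part of Rexpy.
    Returns None when the examples do not have that shape.
    """
    # commonprefix compares character by character, so reverse each
    # string to find the common suffix instead.
    suffix = os.path.commonprefix([s[::-1] for s in examples])[::-1]
    if not suffix:
        return None
    heads = [s[:len(s) - len(suffix)] for s in examples]
    # Only generalize when every head is exactly one ASCII capital letter.
    if all(len(h) == 1 and h.isascii() and h.isupper() for h in heads):
        return '^[A-Z]' + re.escape(suffix) + '$'
    return None


print(factor_common_suffix(['Roger', 'Boger', 'Coger']))   # ^[A-Z]oger$
```

&lt;p&gt;The hard part in practice is deciding when such a literal is a real
feature of the data rather than an accident of a small sample, which this
sketch makes no attempt to do.&lt;/p&gt;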
&lt;h3 id="too-many-expressions-or-combining-results"&gt;Too Many Expressions (or Combining Results)&lt;/h3&gt;
&lt;p&gt;Rexpy also produces rather too many regular expressions in many cases,
particularly by failing to use optionals when it could.
For example:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ rexpy
Angela Carter
Barbara Kingsolver
Socrates
Cher
Martin Luther King
James Clerk Maxwell

^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]&lt;/span&gt;+$
^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;,6&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]&lt;/span&gt;+$
^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;,5&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;,5&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]&lt;/span&gt;+$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;At least in some circumstances, we might prefer that this would produce
a single result such as:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]&lt;/span&gt;+&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]&lt;/span&gt;+&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]&lt;/span&gt;+&lt;span class="o"&gt;)&lt;/span&gt;?&lt;span class="o"&gt;)&lt;/span&gt;?$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or, even better:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;^&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]&lt;/span&gt;+&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;A-Z&lt;span class="o"&gt;][&lt;/span&gt;a-z&lt;span class="o"&gt;]&lt;/span&gt;+&lt;span class="o"&gt;)&lt;/span&gt;*$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Although we would definitely like Rexpy to be able to produce
one of these results, we don't necessarily &lt;em&gt;always&lt;/em&gt; want this behaviour.
It transpires that in a TDDA context, producing different expressions
for the different cases is very often useful. So if we do crack the
"combining" problem, we'll probably make it an option (perhaps with
levels);
that will just leave the issue of deciding on a default!&lt;/p&gt;
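&lt;p&gt;The trade-off can be seen in a small sketch using Python's
&lt;code&gt;re&lt;/code&gt; module. The three separate expressions are copied from the
output above; the combined one is our own rendering of a compact single
expression (one capitalized word followed by any number of
space-separated capitalized words). &lt;code&gt;first_match&lt;/code&gt; is a
hypothetical helper.&lt;/p&gt;

```python
import re

SEPARATE = [
    r'^[A-Z][a-z]+$',
    r'^[A-Z][a-z]{5,6}\ [A-Z][a-z]+$',
    r'^[A-Z][a-z]{4,5}\ [A-Z][a-z]{4,5}\ [A-Z][a-z]+$',
]
# Assumed combined form, written by hand for this sketch.
COMBINED = r'^[A-Z][a-z]+(\ [A-Z][a-z]+)*$'

names = ['Angela Carter', 'Barbara Kingsolver', 'Socrates',
         'Cher', 'Martin Luther King', 'James Clerk Maxwell']


def first_match(patterns, s):
    """Index of the first pattern that matches s, or None (hypothetical)."""
    for i, pattern in enumerate(patterns):
        if re.match(pattern, s):
            return i
    return None


# The single expression accepts everything...
print(all(re.match(COMBINED, n) for n in names))   # True

# ...while the separate expressions also say which shape each value has.
for name in names:
    print(first_match(SEPARATE, name), name)
```

&lt;p&gt;In the second loop, one-word, two-word and three-word names fall under
patterns 0, 1 and 2 respectively, which is the kind of breakdown that is
lost when everything is folded into one expression.&lt;/p&gt;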
&lt;h2 id="plans"&gt;Plans&lt;/h2&gt;
&lt;p&gt;We have ideas on how to address all of these, albeit not perfectly,
so expect further improvements.&lt;/p&gt;
&lt;p&gt;If you use Rexpy and have feedback, do let us know. You can reach us
on Twitter (&lt;a href="https://twitter.com/tdda0"&gt;@tdda0&lt;/a&gt;), and there's
also a TDDA Slack (#TDDA) that we'd be happy to invite you to.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="rexpy"></category><category term="regular expressions"></category></entry><entry><title>An Error of Process</title><link href="https://tdda.info/an-error-of-process.html" rel="alternate"></link><published>2017-03-08T13:00:00+00:00</published><updated>2017-03-08T13:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-03-08:/an-error-of-process.html</id><summary type="html">&lt;p&gt;Yesterday, email subscribers to the blog, and some RSS/casual viewers,
will have seen a half-finished (in fact, abandoned) post that began
to try to characterize success and failure on the crowd-funding
platform &lt;a href="https://kickstarter.com"&gt;Kickstarter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The post was abandoned because I didn't believe its first conclusion,
but unfortunately was published by …&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;Yesterday, email subscribers to the blog, and some RSS/casual viewers,
will have seen a half-finished (in fact, abandoned) post that began
to try to characterize success and failure on the crowd-funding
platform &lt;a href="https://kickstarter.com"&gt;Kickstarter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The post was abandoned because I didn't believe its first conclusion,
but unfortunately was published by mistake yesterday.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This post explains what happened and tries to salvage a "teachable
moment" out of this minor fiasco.&lt;/p&gt;
&lt;h2 id="the-problem-the-post-was-trying-to-address"&gt;The problem the post was trying to address&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://kickstarter.com"&gt;Kickstarter&lt;/a&gt; is a crowd-funding platform that
allows people to back creative projects, usually in exchange for
rewards of various kinds. Projects set a funding goal and backers
only pay out if the aggregate pledges made match or exceed the
funding goal during a funding period—usually 30 days.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.kickstarter.com/projects/1548768604/twitterrific-for-mac-project-phoenix"&gt;Project Phoenix&lt;/a&gt; on Kickstarter,
from &lt;a href="https://iconfactory.com"&gt;The Icon Factory&lt;/a&gt;,
seeks to fund the development of a new version of
&lt;a href="https://twitterrific.com/mac/"&gt;Twitterrific for Mac&lt;/a&gt;.
Twitterrific was the first independent Twitter client,
and was responsible for many of the things that define Twitter
today.&lt;sup id="fnref:nottheabuse"&gt;&lt;a class="footnote-ref" href="#fn:nottheabuse"&gt;1&lt;/a&gt;&lt;/sup&gt;
(You were, and are, cordially invited to break off from reading this post to
go and back the project before reading on.)&lt;/p&gt;
&lt;p&gt;Ollie is the bird in Twitterrific's icon.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.tdda.info/images/Ollie-512.png" width="256"/&gt;&lt;/p&gt;
&lt;p&gt;At the time I started the post, the project had pledges of $63,554 towards
a funding goal of $75,000 (84%) after 13 days, with 17 days to go.
This is what the amount raised over time looked like (using data
from &lt;a href="https://www.kicktraq.com/projects/1548768604/twitterrific-for-mac-project-phoenix/"&gt;Kicktraq&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.tdda.info/images/CumFunding-546x566.png" width="273"/&gt;&lt;/p&gt;
&lt;p&gt;Given that the amount raised was falling each day, and looked asymptotic,
questions I was interested in were:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How likely was the project to succeed (i.e. to reach its funding goal
    by day 30)? (In fact, it is now fully funded.)&lt;/li&gt;
&lt;li&gt;How much was the project likely to raise?&lt;/li&gt;
&lt;li&gt;How likely was the project to reach its stretch goal of $100,000?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The idea was to use some open data from Kickstarter and simple
assumptions to try to find out what successful and unsuccessful
projects look like.&lt;/p&gt;
&lt;h2 id="data-and-assumptions"&gt;Data and Assumptions&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;[This paragraph is unedited from the post yesterday, save that I have
made the third item below bold.]&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Kickstarter does not have a public API, but is scrapable.
The site &lt;a href="https://webrobots.io"&gt;Web Robots&lt;/a&gt; makes available
a series of roughly monthly scrapes of Kickstarter data from October 2015
to the present, as well as seven older datasets.
We have based our analysis on this data, making the following assumptions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The data is correct and covers all Kickstarter Projects&lt;/li&gt;
&lt;li&gt;That we are interpreting the fields in the data correctly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Most critically: if any projects are missing from this data,
     the missing projects are random. Our analysis is completely
     invalid if failing projects are removed from the datasets.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;[That last point, heavily signalled as critical, turned out not to be the case.
As soon as I saw the 99.9% figure below, I went to try to validate that
projects didn't go missing from month to month in the scraped data.
In fact, they do, all the time, and when I realised this, I abandoned the post.
There would have been other ways to try to make the prediction, but they would
have been less reliable and required much more work.]&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We started with the latest dataset, from 15th February 2017.
This included data about 175,085 projects, which break down as follows.&lt;/p&gt;
&lt;p&gt;Only projects with a 30-day funding period were included in the comparison,
and only those for which we knew the final outcome.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;count           is 30 day?
state           no     yes    TOTAL
failed      41,382  42,134   83,516
successful  44,071  31,142   75,213
canceled     6,319   5,463   11,782
suspended      463     363      826
live         2,084   1,664    3,748
TOTAL       94,319  80,766  175,085
-----------------------------------
less live:           1,664
-----------------------------------
Universe            79,102
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The table showed that 80,766 of the projects are 30-day, and of these, 79,102
are not live. So this is our starting universe for analysis.
NOTE: We deliberately did not exclude &lt;code&gt;suspended&lt;/code&gt; or &lt;code&gt;canceled&lt;/code&gt; projects,
since doing so would have biased our results.&lt;/p&gt;
&lt;p&gt;Various fields are available in the JSON data provided by Web Robots.
The subset we have used is listed below, together with our interpretation
of the meaning of each field:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;launched_at&lt;/code&gt; — Unix timestamp (seconds since 1 January 1970) for
    the start of the funding period&lt;/li&gt;
&lt;li&gt;&lt;code&gt;deadline&lt;/code&gt; — Unix timestamp for the end of the funding period&lt;/li&gt;
&lt;li&gt;&lt;code&gt;state&lt;/code&gt; — (see above)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;goal&lt;/code&gt; — the amount required to be raised for the project to be
    funded&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pledged&lt;/code&gt; — the total amount of pledges (today); pledges can only be
    made during the funding period&lt;/li&gt;
&lt;li&gt;&lt;code&gt;currency&lt;/code&gt; — the currency in which the goal and pledges are made.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;backers_count&lt;/code&gt; — the number of people who have pledged money.&lt;/li&gt;
&lt;/ul&gt;
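&lt;p&gt;As a rough illustration of how the field interpretations above translate into the selection of 30-day, non-live projects, here is a minimal sketch. The half-day tolerance, the function names and the three records are assumptions made purely for illustration; the post does not say exactly how a 30-day funding period was identified.&lt;/p&gt;

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def funding_days(project):
    """Length of the funding period in days (may be fractional)."""
    return (project["deadline"] - project["launched_at"]) / SECONDS_PER_DAY


def in_universe(project, days=30, tolerance_days=0.5):
    """True for non-live projects whose funding period is about `days` long.

    The half-day tolerance is an assumption, not something the post specifies.
    """
    is_right_length = tolerance_days >= abs(funding_days(project) - days)
    return is_right_length and project["state"] != "live"


# Three made-up records, purely to exercise the functions.
launch = int(datetime(2017, 1, 1, tzinfo=timezone.utc).timestamp())
examples = [
    {"launched_at": launch, "deadline": launch + 30 * SECONDS_PER_DAY, "state": "failed"},
    {"launched_at": launch, "deadline": launch + 45 * SECONDS_PER_DAY, "state": "successful"},
    {"launched_at": launch, "deadline": launch + 30 * SECONDS_PER_DAY, "state": "live"},
]
universe = [p for p in examples if in_universe(p)]
```

&lt;p&gt;Only the first record survives: the second runs for 45 days and the third is still live.&lt;/p&gt;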
&lt;h2 id="overall-statistics-for-30-day-non-live-projects"&gt;Overall Statistics for 30-day, non-live projects&lt;/h2&gt;
&lt;p&gt;These are the overall statistics for our 30-day, non-live projects:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;succeeded    count        %
no          47,839   60.48%
yes         31,263   39.52%
TOTAL       79,102  100.00%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Just under 40% of them succeed.&lt;/p&gt;
&lt;p&gt;But what proportion reach 84% and still fail to reach 100%?
According to the detailed data, the answer was just 0.10%,
suggesting 99.90% of 30-day projects that reached 84% of their
funding goal &lt;em&gt;at any stage of the campaign&lt;/em&gt; went on to be
fully funded.&lt;/p&gt;
&lt;p&gt;That looked wildly implausible to me, and immediately made me
question whether the data I was trying to use was capable of
supporting this analysis. In particular, my immediate worry
was that projects that looked like they were not going to reach
their goal might end up being removed—for whatever reason—more
often than those that were on track. Although I have not proved
that this is the case, it is clear that projects do quite often
disappear between successive scrapes.&lt;/p&gt;
&lt;p&gt;To check this, I went back over all the earlier datasets
I had collected and extracted the projects that were &lt;code&gt;live&lt;/code&gt;
in those datasets. There were 47,777 such projects.
I then joined those onto the latest dataset
to see how many of them were in the latest dataset.
15,276 (31.97%) of the once-live projects were not in the
latest data (based on joining on &lt;code&gt;id&lt;/code&gt;).&lt;/p&gt;
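&lt;p&gt;The check described above amounts to a set difference on &lt;code&gt;id&lt;/code&gt;. The following is a minimal sketch rather than the code actually used; the helper name and the tiny id lists are made up, and only the two counts at the end come from the text.&lt;/p&gt;

```python
def missing_fraction(once_live_ids, latest_ids):
    """Return (n_missing, fraction) of once-live ids absent from the latest data."""
    once_live = set(once_live_ids)
    missing = once_live - set(latest_ids)
    return len(missing), len(missing) / len(once_live)


# Tiny made-up id lists, only to show the mechanics.
n_missing, fraction = missing_fraction([1, 2, 3, 4], [2, 4, 5])

# The same arithmetic applied to the counts reported above.
reported = 15_276 / 47_777
print(f"{n_missing} missing ({fraction:.2%}); reported: {reported:.2%}")
```

&lt;p&gt;With real data the two arguments would be the ids of every project ever seen as &lt;code&gt;live&lt;/code&gt; and the ids in the latest scrape.&lt;/p&gt;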
&lt;p&gt;It was at this point I abandoned the blog post.&lt;/p&gt;
&lt;h2 id="error-of-process"&gt;Error of Process&lt;/h2&gt;
&lt;p&gt;So what did we learn?&lt;/p&gt;
&lt;p&gt;The whole motivation for test-driven data analysis is the observation
that data analysis is hard to get right, and most of us make mistakes
all the time. We have previously classified these mistakes as&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;errors of interpretation&lt;/em&gt; (where we or a consumer of our analysis
    misunderstand the data, the methods, or our results)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;errors of implementation&lt;/em&gt; (bugs)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;errors of process&lt;/em&gt; (where we make a mistake in using our analytical
    process, and this leads to a false result being generated or propagated)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;errors of applicability&lt;/em&gt; (where we use an analytical process
    with data that does not satisfy the requirements or assumptions
    (explicit or implicit) of the analysis).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are trying to develop methodologies and tools to reduce the likelihood
and impact of each of these kinds of errors.&lt;/p&gt;
&lt;p&gt;While we wouldn't normally regard this blog as an analytical process,
it's perhaps close enough that we can view this particular error through
the TDDA lens. I was writing up the analysis as I did it, fully expecting
to generate a useful post. Although I got as far as &lt;em&gt;writing&lt;/em&gt; into the
entry the (very dubious) &lt;em&gt;99.9% of 30-day projects that reach 84% funding at
any stage on Kickstarter go on to be fully funded&lt;/em&gt;, that result immediately
smelled wrong and I went off to try to see whether my assumptions about the
data were correct. So I was trying hard to avoid an error of interpretation.&lt;/p&gt;
&lt;p&gt;But an error of process occurred. This blog is published using
&lt;a href="https://blog.getpelican.com"&gt;Pelican&lt;/a&gt;, a static site generator
that I mostly quite like. In Pelican, posts are (usually)
written in &lt;a href="https://daringfireball.net/projects/markdown/"&gt;Markdown&lt;/a&gt;
with some metadata at the top. One of the bits of metadata is
a &lt;code&gt;Status&lt;/code&gt; field, which can either be set to &lt;code&gt;Draft&lt;/code&gt; or &lt;code&gt;Published&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;When writing the posts, before publishing, you can either run a local
webserver to view the output, or actually post them to the main site
(on Github Pages, in this case). As long as their status is set to
&lt;code&gt;Draft&lt;/code&gt;, the posts don't show up as part of the blog in either site (local
or on Github), but have to be accessed through a special draft URL.
Unfortunately, the draft URL is a little hard to guess, so I generally
work with posts with status set to &lt;code&gt;Published&lt;/code&gt; until I push them to
Github to allow other people to review them before setting them live.&lt;/p&gt;
&lt;p&gt;What went wrong here is that the abandoned post had its status left as
&lt;code&gt;Published&lt;/code&gt;, which was fine until I started the next post (due tomorrow)
and pushed that (as draft) to Github. Needless to say, a side-effect
of pushing the site with a draft of tomorrow's post was that the
abandoned post got pushed too, with its status as &lt;code&gt;Published&lt;/code&gt;. Oops!&lt;/p&gt;
&lt;p&gt;So the learning for me is that I either have to be more careful
with Status (which is optimistic) or I need to add some protection
in the publishing process to stop this happening. Realistically,
that probably means creating a new Status—Internal—which will
get the &lt;code&gt;make&lt;/code&gt; process to transmogrify into &lt;code&gt;Published&lt;/code&gt;
when compiling locally, and &lt;code&gt;Draft&lt;/code&gt; when pushing to Github.
That should avoid repeats of this particular &lt;em&gt;error of process.&lt;/em&gt;&lt;/p&gt;
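&lt;p&gt;One way the proposed protection might look is sketched below. It is only a sketch: &lt;code&gt;Internal&lt;/code&gt; is the status proposed above rather than one Pelican itself defines, and the function name, the target names and the idea of rewriting the metadata header before the build are all assumptions.&lt;/p&gt;

```python
def resolve_status(post_text, target):
    """Rewrite a 'Status: Internal' metadata line for the given build target.

    target is 'local' (becomes Published) or 'remote' (becomes Draft).
    Only the metadata header (lines before the first blank line) is examined.
    """
    replacement = {"local": "Published", "remote": "Draft"}[target]
    lines = post_text.split("\n")
    for i, line in enumerate(lines):
        if not line.strip():
            break  # end of the metadata header
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "status" and value.strip().lower() == "internal":
            lines[i] = f"{key}: {replacement}"
    return "\n".join(lines)


post = "Title: Abandoned post\nStatus: Internal\n\nStatus: Internal appears in the body too.\n"
local_version = resolve_status(post, "local")
remote_version = resolve_status(post, "remote")
```

&lt;p&gt;Because the rewrite depends only on the build target, a post left as &lt;code&gt;Internal&lt;/code&gt; can never reach the public site as &lt;code&gt;Published&lt;/code&gt; by accident.&lt;/p&gt;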
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:nottheabuse"&gt;
&lt;p&gt;good things, like birds and @names and retweeting;
            not the abuse.&amp;#160;&lt;a class="footnote-backref" href="#fnref:nottheabuse" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="errors of process"></category><category term="errors of interpretation"></category></entry><entry><title>Errors of Interpretation: Bad Graphs with Dual Scales</title><link href="https://tdda.info/errors-of-interpretation-bad-graphs-with-dual-scales.html" rel="alternate"></link><published>2017-02-20T15:30:00+00:00</published><updated>2017-02-20T15:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-02-20:/errors-of-interpretation-bad-graphs-with-dual-scales.html</id><summary type="html">&lt;p&gt;It is a primary responsibility of analysts to present findings and
data clearly, in ways to minimize the likelihood of misinterpretation.
Graphs should help this, but all too often, if drawn badly (whether
deliberately or through oversight) they can make misinterpretation
highly likely. I want to illustrate this danger with …&lt;/p&gt;</summary><content type="html">&lt;p&gt;It is a primary responsibility of analysts to present findings and
data clearly, in ways to minimize the likelihood of misinterpretation.
Graphs should help this, but all too often, if drawn badly (whether
deliberately or through oversight) they can make misinterpretation
highly likely. I want to illustrate this danger with an unfortunate
graph I came across recently in a very interesting—and good, and
insightful—article on the US Election.&lt;/p&gt;
&lt;p&gt;Take a look at this graph, taken from an article
called &lt;em&gt;The Road to Trumpsville: The Long, Long Mistreatment of the
American Working Class&lt;/em&gt;, by Jeremy Grantham.&lt;sup id="fnref:Grantham"&gt;&lt;a class="footnote-ref" href="#fn:Grantham"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Exhibit 1: Corporate Profits and Employee Compensation" src="https://www.tdda.info/images/DualScaleGraph.png"&gt;&lt;/p&gt;
&lt;p&gt;In the article, this graph ("Exhibit 1") is described as follows by Grantham:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The combined result is shown in Exhibit 1:  the share of GDP going
to labor hit historical lows as recently as 2014 and the share going
to corporate profits hit a simultaneous high.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Is that what you interpret from the graph?  I agree with these words,
but they don't really sum up my first reading of the graph.  Rather, I
think the natural reading of the graph is as follows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Wow: Labor's share and Capital's share of GDP crossed over,
apparently for good, around 2002. Before then, Capital's share was
mostly materially lower than Labor's (though they were nearly
equal, briefly, in 1965, and crossed for a few years in
1995), but over the 66-year period shown Capital's share increased
while Labor's fell, until Capital is now taking about four times as much
as Labor.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think something like that is what most people will read from the graph,
unless they read it particularly carefully.&lt;/p&gt;
&lt;p&gt;But that is &lt;em&gt;not&lt;/em&gt; what this graph is saying. In fact, this is one of the
most misleading graphs I have ever come across.&lt;/p&gt;
&lt;p&gt;If you look carefully, the two lines use different scales: the red one,
for Labor, uses the scale on the right, which runs from 23% to about
34%, whereas the blue line for Capital, uses the scale on the left,
which runs from 3% to 11%.&lt;/p&gt;
&lt;p&gt;Dual-scale graphs are always difficult to read; so difficult, in fact,
that my personal recommendation is&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Never plot data on two different scales on the same graph.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Not everyone agrees with this, but most people accept that dual-scale
graphs are confusing and hard to read. Even by the standards
of dual-scale graphs, however, this is bad.&lt;/p&gt;
&lt;p&gt;Here are the problems, in roughly decreasing order of importance:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The two lines are showing commensurate&lt;sup id="fnref:commensurate"&gt;&lt;a class="footnote-ref" href="#fn:commensurate"&gt;2&lt;/a&gt;&lt;/sup&gt; figures of
     roughly the same order of magnitude, so &lt;em&gt;could &lt;/em&gt;&lt;em&gt;and should&lt;/em&gt;&lt;em&gt;
     have been&lt;/em&gt; on the same scale: this isn't a case of showing price
     against volume, where the units are different, or even a case
     in which one size is in millimetres and another in miles:
     these are both percentages,
     of the same thing, all between 4% and 32%.&lt;/li&gt;
&lt;li&gt;The graphs cross over when the data doesn't. The very strong
     suggestion from the graphs that we go from Labor's share of GDP
     exceeding that of Capital to being radically lower than that
     of Capital is entirely false.&lt;/li&gt;
&lt;li&gt;Despite measuring the same quantity, the magnification is
     different on the two axes (i.e. the distance on the page
     between ticks is different, and the percentage-point gap
     represented by ticks on the two scales is different).
     As a consequence slopes (gradients) are not comparable.&lt;/li&gt;
&lt;li&gt;Neither scale goes to zero.&lt;/li&gt;
&lt;li&gt;The position of the two series relative to their scales is
     inconsistent: the Labor graph goes right down to the x-axis
     at its minimum (23%) while the Capital graph—whose minimum
     is also very close to an integer percentage—does not.
     This adds further to the impression that Labor's share has been
     absolutely annihilated.&lt;/li&gt;
&lt;li&gt;There are no gridlines to help you read the data. (Sure, gridlines
     are &lt;em&gt;chart junk&lt;/em&gt;&lt;sup id="fnref:chartjunk"&gt;&lt;a class="footnote-ref" href="#fn:chartjunk"&gt;3&lt;/a&gt;&lt;/sup&gt;, but are especially important when
     different scales are used, so you have some hope of reading the
     values.)&lt;/li&gt;
&lt;/ol&gt;
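&lt;p&gt;The crossover problem can be made concrete with a little arithmetic on the axis ranges quoted above. This is only a sketch: the helper name is made up, the upper limit of the Labor axis is approximate, and pairing Labor's 23% low with a Capital value of 4% is a deliberately extreme choice for illustration, not a pair of simultaneous readings from the graph.&lt;/p&gt;

```python
def page_height(value, axis_min, axis_max):
    """Fraction of the plot height at which `value` is drawn on a linear axis."""
    return (value - axis_min) / (axis_max - axis_min)


# Axis ranges as read from Exhibit 1 (the upper Labor limit is approximate).
LABOR_AXIS = (23.0, 34.0)    # right-hand scale, percent of GDP
CAPITAL_AXIS = (3.0, 11.0)   # left-hand scale, percent of GDP

# Illustrative pairing: Labor's low and the smallest percentage mentioned.
labor, capital = 23.0, 4.0

dual = (page_height(labor, *LABOR_AXIS), page_height(capital, *CAPITAL_AXIS))
shared = (page_height(labor, 0.0, 34.0), page_height(capital, 0.0, 34.0))

print(f"dual scales:  labor {dual[0]:.3f}, capital {dual[1]:.3f}")
print(f"shared scale: labor {shared[0]:.3f}, capital {shared[1]:.3f}")
```

&lt;p&gt;On the dual scales a Capital share of 4% is drawn above a Labor share of 23%, while on a shared zero-based scale the ordering on the page matches the ordering of the data. The same ranges also show why slopes are not comparable: one percentage point covers about 1/11 of the plot height on the Labor axis but 1/8 on the Capital axis.&lt;/p&gt;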
&lt;p&gt;I want to be clear: I am &lt;em&gt;not&lt;/em&gt; accusing Jeremy Grantham of
deliberately plotting the data in a misleading way. I do not believe
he intended to distort or manipulate. I suspect he's plotted it this
way because stock graphs, which may well be the graphs he most often
looks at,&lt;sup id="fnref:gmo"&gt;&lt;a class="footnote-ref" href="#fn:gmo"&gt;4&lt;/a&gt;&lt;/sup&gt; are frequently plotted with false zeros.
Despite this, he has unfortunately
plotted the graphs in a way&lt;sup id="fnref:or"&gt;&lt;a class="footnote-ref" href="#fn:or"&gt;5&lt;/a&gt;&lt;/sup&gt; that visually distorts the data in
almost exactly the way I would choose to do if I wanted to make the
points he is making appear even stronger than they are.&lt;/p&gt;
&lt;p&gt;I don't have the source numbers, so I have gone through a rather painful
exercise, of reading the numbers off the graph (at slightly coarser
granularity) so that I can replot the graph as it should, in my opinion,
have been plotted in the first place. (I apologise if I have misread any
values; transcribing numbers from graphs is tedious and error-prone.)
This is the result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Exhibit 1 (revised): Same Data, with single, zero-based scale (redrawn approximation)" src="https://www.tdda.info/images/SingleScaleGraph.png"&gt;&lt;/p&gt;
&lt;p&gt;Even after I'd looked carefully at the scales and appreciated all the
distortions in the original graph, I was quite shocked to see the data
presented neutrally. To be clear: Grantham's textual &lt;em&gt;summary&lt;/em&gt; of the
data is accurate: a few years ago, Capital's share of GDP (from his
figures) was at an all-time high, albeit not dramatically higher than in
1949 or about 1966, and Labor's share of GDP, a few years ago, was at an
all-time low around 23%, down from 30%. But the true picture just doesn't
look like the graph Grantham showed. (Again: I feel a bit bad about going
on about this graph from such a good article; but the graph encapsulates
so many problematic practices that it makes a perfect illustration.)&lt;/p&gt;
&lt;h2 id="how-to-lie-with-statistics"&gt;How to Lie with Statistics&lt;/h2&gt;
&lt;p&gt;In 1954, Darrell Huff published a book called &lt;em&gt;How to Lie with Statistics&lt;/em&gt;&lt;sup id="fnref:howtolie"&gt;&lt;a class="footnote-ref" href="#fn:howtolie"&gt;6&lt;/a&gt;&lt;/sup&gt;. Chapter 5 is called &lt;em&gt;The Gee-Whiz Graph&lt;/em&gt;.
His first example is the following (neutrally presented) graph:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Exhibit 2 (neutral): Sales Data, zero-based scale (redrawn from original)" src="https://www.tdda.info/images/SalesZeroBased.png"&gt;&lt;/p&gt;
&lt;p&gt;As Huff says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;That is very well if all you want to do is convey information.
But suppose you wish to win an argument, shock a reader, move
him into action, sell him something. For that, this chart lacks
schmaltz. Chop off the bottom.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Exhibit 2 (non-zero-based): Sales Data, non-zero-based scale (redrawn from original)" src="https://www.tdda.info/images/SalesNonZeroBased.png"&gt;&lt;/p&gt;
&lt;p&gt;Huff continues:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;That's more like it. (You've saved paper&lt;sup id="fnref:trees"&gt;&lt;a class="footnote-ref" href="#fn:trees"&gt;7&lt;/a&gt;&lt;/sup&gt; too, something to point
out if any carping fellow objects to your misleading graphics.)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;But there's more, folks:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now that you have practised to deceive, why stop with truncating?
You have one more trick available that's worth a dozen of that.
It will make your modest rise of ten per cent look livelier
than one hundred percent is entitled to look.
Simply change the proportion between the ordinate and the abscissa:&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Exhibit 2 (non-zero-based, expanded): Sales Data, non-zero-based scale, expanded effect (redrawn from original)" src="https://www.tdda.info/images/SalesNonZeroBasedExpanded.png"&gt;&lt;/p&gt;
&lt;p&gt;Both of these unfortunate practices are present in Exhibit 1, and that's
before we even get to dual scales.&lt;/p&gt;
&lt;h2 id="errors-of-interpretation"&gt;Errors of Interpretation&lt;/h2&gt;
&lt;p&gt;In our various overviews of &lt;em&gt;test-driven data analysis,&lt;/em&gt;
(e.g., &lt;a href="https://stochasticsolutions.com/pdf/TDDA-One-Pager.pdf"&gt;this summary&lt;/a&gt;)
we have described
four major classes of errors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;errors of interpretation&lt;/li&gt;
&lt;li&gt;errors of implementation (bugs)&lt;/li&gt;
&lt;li&gt;errors of process&lt;/li&gt;
&lt;li&gt;errors of applicability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Errors of interpretation can occur at any point in the process: not
only are we, the analysts, susceptible to misinterpreting our inputs,
our methods, our intermediate results and our outputs, but the
recipients of our insights and analyses are in even greater danger of
misinterpreting our results, because they have not worked through the
process and seen all that we did.  As analysts, we have a special
responsibility to make our results as clear as possible, and hard to
misinterpret. We should assume not that the reader will be diligent,
unhurried and careful, reading every number and observing every
subtlety, but that she or he will be hurried and will rely on us to
have brought out the salient points and to have helped the reader
towards the right conclusions.&lt;/p&gt;
&lt;p&gt;The purpose of a graph is to allow a reader to assimilate
large quantities of data, and to understand patterns therein,
more quickly and more easily than is possible from tables of numbers.
There are strong conventions about how to do that, based on known
human strengths and weaknesses as well as commonsense "fair treatment"
of different series.&lt;/p&gt;
&lt;p&gt;However well intentioned, Exhibit 1 fails in every respect: I would
guess very few casual readers would get an accurate impression
from the data as presented.&lt;/p&gt;
&lt;p&gt;If data scientists had the equivalent of a Hippocratic Oath, it would
be something like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;First, do not mislead.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:Grantham"&gt;
&lt;p&gt;The Road to Trumpsville: The Long, Long Mistreatment of
         the American Working Class, by Jeremy Grantham,
         in the GMO Quarterly Newsletter, 4Q, 2016.
         &lt;a href="https://www.gmo.com/docs/default-source/public-commentary/gmo-quarterly-letter.pdf"&gt;https://www.gmo.com/docs/default-source/public-commentary/gmo-quarterly-letter.pdf&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:Grantham" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:commensurate"&gt;
&lt;p&gt;two variables are &lt;em&gt;commensurate&lt;/em&gt; if they are measured
             in the same units and it is meaningful to make a direct
             comparison between them.&amp;#160;&lt;a class="footnote-backref" href="#fnref:commensurate" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:chartjunk"&gt;
&lt;p&gt;Tufte describes all ink on a graph that is not actually
          plotting data as "chart junk", and advocates "maximizing
          data ink" (the amount of the ink on a graph actually
          devoted to plotting the data points) and minimizing
          chart junk. These are excellent principles.
          The Visual Display of Quantitative Information,
          Edward R. Tufte, Graphics Press (Cheshire, Connecticut) 1983.&amp;#160;&lt;a class="footnote-backref" href="#fnref:chartjunk" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:gmo"&gt;
&lt;p&gt;Mr Grantham works for GMO, a "global investment management firm".
    &lt;a href="https://gmo.com"&gt;https://gmo.com&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:gmo" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:or"&gt;
&lt;p&gt;Or chosen to use the plot, if he isn't responsible for producing it.&amp;#160;&lt;a class="footnote-backref" href="#fnref:or" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:howtolie"&gt;
&lt;p&gt;&lt;em&gt;How to Lie with Statistics&lt;/em&gt;, Darrell Huff, published by Victor Gollancz, 1954. Republished, 1973, by Pelican Books.&amp;#160;&lt;a class="footnote-backref" href="#fnref:howtolie" title="Jump back to footnote 6 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:trees"&gt;
&lt;p&gt;Obviously the "saving paper" argument had more force in 1954,
      and the constant references to "him", "he" and "fellows" similarly
      stood out less than they do today.&amp;#160;&lt;a class="footnote-backref" href="#fnref:trees" title="Jump back to footnote 7 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="errors of interpretation"></category><category term="graphs"></category></entry><entry><title>TDDA 1-pager</title><link href="https://tdda.info/tdda-1-pager.html" rel="alternate"></link><published>2017-02-10T15:00:00+00:00</published><updated>2017-02-10T15:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-02-10:/tdda-1-pager.html</id><content type="html">&lt;p&gt;We have written a 1-page summary of some of the core ideas in TDDA.&lt;/p&gt;
&lt;p&gt;It is available as a PDF from
&lt;a href="https://stochasticsolutions.com/pdf/TDDA-One-Pager.pdf"&gt;stochasticsolutions.com/pdf/TDDA-One-Pager.pdf&lt;/a&gt;.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category></entry><entry><title>Coverage information for Rexpy</title><link href="https://tdda.info/coverage-information-for-rexpy.html" rel="alternate"></link><published>2017-01-31T15:00:00+00:00</published><updated>2017-01-31T15:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-01-31:/coverage-information-for-rexpy.html</id><summary type="html">&lt;h2 id="rexpy-stats"&gt;Rexpy Stats&lt;/h2&gt;
&lt;p&gt;We
&lt;a href="https://www.tdda.info/introducing-rexpy-automatic-discovery-of-regular-expressions"&gt;previously&lt;/a&gt;
added &lt;code&gt;rexpy&lt;/code&gt; to the Python &lt;code&gt;tdda&lt;/code&gt; module.  Rexpy is used to
find regular expressions from example strings.&lt;/p&gt;
&lt;p&gt;One of the most common requests from Rexpy users has been for information
regarding how many examples each resulting regular expression matches.&lt;/p&gt;
&lt;p&gt;We have now added a few methods …&lt;/p&gt;</summary><content type="html">&lt;h2 id="rexpy-stats"&gt;Rexpy Stats&lt;/h2&gt;
&lt;p&gt;We
&lt;a href="https://www.tdda.info/introducing-rexpy-automatic-discovery-of-regular-expressions"&gt;previously&lt;/a&gt;
added &lt;code&gt;rexpy&lt;/code&gt; to the Python &lt;code&gt;tdda&lt;/code&gt; module.  Rexpy is used to
find regular expressions from example strings.&lt;/p&gt;
&lt;p&gt;One of the most common requests from Rexpy users has been for information
regarding how many examples each resulting regular expression matches.&lt;/p&gt;
&lt;p&gt;We have now added a few methods to Rexpy to support this.&lt;/p&gt;
&lt;p&gt;Currently this is only available in the Python library for Rexpy,
available as part of the &lt;code&gt;tdda&lt;/code&gt; module, with either&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip install tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;git clone https://github.com/tdda/tdda.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Needless to say, we also plan to use this functionality in the
&lt;a href="https://rexpy.herokuapp.com"&gt;online version of Rexpy&lt;/a&gt;
in the future.&lt;/p&gt;
&lt;h2 id="rexpy-quick-recap"&gt;Rexpy: Quick Recap&lt;/h2&gt;
&lt;p&gt;The following example shows simple use of Rexpy from Python:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;$&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rexpy&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;corpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;123-AA-971&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;12-DQ-802&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;198-AA-045&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;1-BA-834&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rexpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;     &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;^&lt;/span&gt;\&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;\&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this case, Rexpy found a single regular expression that matched
all the strings, but in many cases it returns a list of regular expressions,
each covering some subset of the examples.&lt;/p&gt;
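&lt;p&gt;To make the idea of coverage concrete, the sketch below counts how many examples each regular expression matches, using only the standard &lt;code&gt;re&lt;/code&gt; module. The helper &lt;code&gt;count_matches&lt;/code&gt; is hypothetical and is not part of Rexpy; the corpus and the pattern are the ones shown above.&lt;/p&gt;

```python
import re


def count_matches(patterns, examples):
    """Map each pattern to the number of examples it fully matches."""
    return {p: sum(1 for s in examples if re.fullmatch(p, s)) for p in patterns}


corpus = ["123-AA-971", "12-DQ-802", "198-AA-045", "1-BA-834"]
patterns = [r"^\d{1,3}\-[A-Z]{2}\-\d{3}$"]

counts = count_matches(patterns, corpus)
for pattern, n in counts.items():
    print(f"{n} of {len(corpus)} examples match {pattern}")
```

&lt;p&gt;Here the single expression matches all four examples. With several expressions, an example that matches more than one of them is counted once for each.&lt;/p&gt;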
&lt;p&gt;The way the algorithm currently works, in most cases&lt;sup id="fnref:orall"&gt;&lt;a class="footnote-ref" href="#fn:orall"&gt;1&lt;/a&gt;&lt;/sup&gt; each example
will match only one regular expression, but in general, some examples
might match more than one pattern. So we've designed the new functionality
to work even when this is the case. We've provided three new methods
on the &lt;code&gt;Extractor&lt;/code&gt; class, which gives a more powerful API than the
simple &lt;code&gt;extract&lt;/code&gt; function.&lt;/p&gt;
&lt;p&gt;Here's an example based on one of Rexpy's tests:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;urls2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;stochasticsolutions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;apple&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;stochasticsolutions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;actual&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;duplicate&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;&lt;span class="o"&gt;://&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;stochasticsolutions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;co&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;uk&lt;/span&gt;&lt;span class="o"&gt;/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;&lt;span class="o"&gt;://&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;google&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;co&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;uk&lt;/span&gt;&lt;span class="o"&gt;/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;&lt;span class="o"&gt;://&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;google&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;&lt;span class="o"&gt;://&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;google&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;&lt;span class="o"&gt;://&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;guardian&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;co&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;uk&lt;/span&gt;&lt;span class="o"&gt;/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;&lt;span class="o"&gt;://&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;guardian&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;&lt;span class="o"&gt;://&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;guardian&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;&lt;span class="o"&gt;://&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;stochasticsolutions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;web&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;stochasticsolutions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;https&lt;/span&gt;&lt;span class="o"&gt;://&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;stochasticsolutions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;inf&lt;/span&gt;&lt;span class="nv"&gt;o&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;gov&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;uk&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;&lt;span class="o"&gt;://&lt;/span&gt;&lt;span class="nv"&gt;web&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;web&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;rexpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Extractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;urls2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;results&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;rex&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;co&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;uk&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, Rexpy has produced six different regular expressions,
some of which should probably be collapsed together. The &lt;code&gt;Extractor&lt;/code&gt;
object we have created has three new methods available.&lt;/p&gt;
&lt;h2 id="the-new-coverage-methods"&gt;The New Coverage Methods&lt;/h2&gt;
&lt;p&gt;The simplest new method is &lt;code&gt;coverage(dedup=False)&lt;/code&gt;, which returns a list
giving the number of examples matched by each regular expression, in the same
order as the regular expressions in &lt;code&gt;x.results.rex&lt;/code&gt;. So:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; print(x.coverage())
[2, 3, 2, 4, 2, 3]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;is the list of frequencies for the six regular expressions given, in order.
So the pairings are illustrated by:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;results&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;rex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;%d&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;examples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;matched&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;%s&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;k&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;examples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;matched&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;examples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;matched&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;examples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;matched&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;examples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;matched&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;examples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;matched&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;examples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;matched&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;co&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;uk&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
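&lt;p&gt;To make the counting concrete, here is a minimal sketch of the same idea using only Python's standard &lt;code&gt;re&lt;/code&gt; module. It is not Rexpy's implementation: the function name &lt;code&gt;coverage_sketch&lt;/code&gt; and the short example list are assumptions made for illustration, and only the two patterns are taken from the output above.&lt;/p&gt;

```python
import re


def coverage_sketch(patterns, examples):
    """Return, for each pattern in order, how many examples it matches.

    Illustrative only: this is not how Rexpy computes coverage internally.
    """
    compiled = [re.compile(p) for p in patterns]
    return [sum(1 for s in examples if c.match(s)) for c in compiled]


# Two of the patterns shown above, kept in the same relative order.
patterns = [
    r'^[a-z]{3,4}\.[a-z]{2,4}$',
    r'^[a-z]+\.com\/$',
]

# Hypothetical examples (one string appears twice on purpose).
examples = [
    'stochasticsolutions.com/',
    'example.com/',
    'stochasticsolutions.com/',
    'tdda.info',
    'gov.uk',
]

counts = coverage_sketch(patterns, examples)
for pattern, n in zip(patterns, counts):
    print('%d examples are matched by %s' % (n, pattern))
```

&lt;p&gt;With these five strings the counts come out as 2 and 3, because the repeated string is counted each time it appears.&lt;/p&gt;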

&lt;p&gt;The optional &lt;code&gt;dedup&lt;/code&gt; parameter, when set to &lt;code&gt;True&lt;/code&gt;, requests
deduplicated frequencies, i.e., frequencies that ignore any duplicate strings
passed in (remembering that Rexpy strips whitespace from both ends of input
strings). In this case, there is just one duplicate string
(&lt;code&gt;stochasticsolutions.com/&lt;/code&gt;). So:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; print(x.coverage(dedup=True))
[2, 2, 2, 4, 2, 3]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;where the second number (the matches for &lt;code&gt;^[a-z]+\.com\/$&lt;/code&gt;) is now 2, because
&lt;code&gt;stochasticsolutions.com/&lt;/code&gt; has been deduplicated.&lt;/p&gt;
&lt;p&gt;We can also find the total number of examples, with or without duplicates,
by calling the &lt;code&gt;n_examples(dedup=False)&lt;/code&gt; method:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; print(x.n_examples())
16
&amp;gt;&amp;gt;&amp;gt; print(x.n_examples(dedup=True))
15
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
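&lt;p&gt;The effect of &lt;code&gt;dedup&lt;/code&gt; on the totals can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Rexpy's code: &lt;code&gt;n_examples_sketch&lt;/code&gt; and the example strings are hypothetical, and the only behaviour borrowed from the text is that whitespace is stripped from both ends before strings are compared.&lt;/p&gt;

```python
def n_examples_sketch(examples, dedup=False):
    """Count examples, optionally treating repeated strings as one.

    Illustrative only: strings are stripped first, so values that differ
    only in surrounding whitespace count as duplicates.
    """
    cleaned = [s.strip() for s in examples]
    if dedup:
        return len(set(cleaned))
    return len(cleaned)


# Hypothetical input: the second string duplicates the first once stripped.
examples = ['example.com/', '  example.com/  ', 'tdda.info', 'gov.uk']

total = n_examples_sketch(examples)
distinct = n_examples_sketch(examples, dedup=True)
print(total)     # prints 4
print(distinct)  # prints 3
```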

&lt;p&gt;Usually, though, what we most want is the regular expressions sorted
from highest to lowest coverage, with each count ignoring any
examples already matched by an earlier pattern in cases where patterns overlap.
That's exactly what the &lt;code&gt;incremental_coverage(dedup=False)&lt;/code&gt; method does for us.
It returns an ordered dictionary mapping each regular expression to its incremental count.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incremental_coverage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;%d&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;%s&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;k&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;co&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;uk&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Together these counts account for all sixteen input strings (including
duplicates). Each number is the count of examples matched by that expression
and not matched by any previous expression. (As noted earlier, that caveat
probably won't make any difference at the moment, but it will in future
versions.) So, to be explicit, this is saying:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The regular expression that matches most examples is:
    &lt;code&gt;^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$&lt;/code&gt;
    which matches 4 of the 16 strings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Of the remaining 12 examples, 3 are matched by &lt;code&gt;^[a-z]+\.com\/$&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Of the remaining 9 examples, 3 more are matched by
    &lt;code&gt;^http\:\/\/www\.[a-z]+\.co\.uk\/$&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;and so on.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that when two regular expressions have the same count, Rexpy
breaks the tie by sorting them as strings.&lt;/p&gt;
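&lt;p&gt;The ordering described above can be sketched as a simple greedy loop. This is one plausible reading of the behaviour, not Rexpy's actual algorithm: the name &lt;code&gt;incremental_coverage_sketch&lt;/code&gt;, the patterns and the example strings are all assumptions chosen for illustration. At each step it picks the unused pattern that matches the most remaining examples, breaks ties by string order, and then removes the examples that pattern matched.&lt;/p&gt;

```python
import re
from collections import OrderedDict


def incremental_coverage_sketch(patterns, examples, dedup=False):
    """Greedy incremental coverage (illustrative, not Rexpy's implementation).

    Returns an OrderedDict mapping each pattern to the number of examples it
    matches that were not matched by any pattern chosen before it.
    """
    cleaned = [s.strip() for s in examples]
    if dedup:
        # dict.fromkeys drops repeats while keeping first-seen order.
        cleaned = list(dict.fromkeys(cleaned))
    compiled = {p: re.compile(p) for p in patterns}
    # Sorting first means max() below resolves ties in string order,
    # because max() returns the first of several equal maxima.
    unused = sorted(patterns)
    remaining = cleaned
    result = OrderedDict()
    while unused:
        counts = {
            p: sum(1 for s in remaining if compiled[p].match(s))
            for p in unused
        }
        best = max(unused, key=counts.get)
        result[best] = counts[best]
        remaining = [s for s in remaining if not compiled[best].match(s)]
        unused.remove(best)
    return result


# Hypothetical patterns; the third overlaps the second on 'gov.uk'.
patterns = [
    r'^[a-z]+\.com\/$',
    r'^[a-z]{3,4}\.[a-z]{2,4}$',
    r'^[a-z]+\.[a-z]{2}$',
]
examples = ['example.com/', 'example.com/', 'tdda.info', 'gov.uk', 'web.org']

for k, n in incremental_coverage_sketch(patterns, examples).items():
    print('%d: %s' % (n, k))
```

&lt;p&gt;Here the last pattern ends up with an incremental count of 0, because its only match (&lt;code&gt;gov.uk&lt;/code&gt;) has already been claimed by an earlier, higher-coverage pattern. Whether Rexpy itself reports zero-count patterns is not something this sketch establishes.&lt;/p&gt;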
&lt;p&gt;We can get the deduplicated numbers if we prefer:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incremental_coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;dedup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="nv"&gt;%d&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;%s&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;k&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;co&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;uk&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;\&lt;span class="o"&gt;:&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;www&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;com&lt;/span&gt;\&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That's all the new functionality for now. Let us know how you get on,
and if you find any problems. And tweet your email address to
&lt;a href="https://twitter.com/tdda0"&gt;@tdda0&lt;/a&gt; if you want to join the TDDA Slack
to discuss anything around the subject of test-driven data analysis.&lt;/p&gt;
&lt;p&gt;[&lt;strong&gt;NOTE&lt;/strong&gt;: This post was updated on 10.2.2017 after an update to the
rexpy library changed function and attribute names from "sequential"
(which was not very descriptive) to "incremental", which is better.]&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:orall"&gt;
&lt;p&gt;In fact, probably in all cases, currently&amp;#160;&lt;a class="footnote-backref" href="#fnref:orall" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="regular expressions"></category></entry><entry><title>The New ReferenceTest class for TDDA</title><link href="https://tdda.info/the-new-referencetest-class-for-tdda.html" rel="alternate"></link><published>2017-01-26T15:30:00+00:00</published><updated>2017-01-26T15:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2017-01-26:/the-new-referencetest-class-for-tdda.html</id><summary type="html">&lt;p&gt;Since the last post, we have extended the reference test functionality
in the Python &lt;a href="https://github.com/tdda/tdda"&gt;&lt;code&gt;tdda&lt;/code&gt;&lt;/a&gt; library.
Major changes (as of version 0.2.5, at the time of writing) include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Introduction of a new &lt;code&gt;ReferenceTest&lt;/code&gt; class that has significantly
    more functionality than the previous (now deprecated)
    &lt;code&gt;WritableTestCase&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Support for &lt;a href="https://docs.pytest.org/en/latest/"&gt;&lt;code&gt;pytest …&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;Since the last post, we have extended the reference test functionality
in the Python &lt;a href="https://github.com/tdda/tdda"&gt;&lt;code&gt;tdda&lt;/code&gt;&lt;/a&gt; library.
Major changes (as of version 0.2.5, at the time of writing) include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Introduction of a new &lt;code&gt;ReferenceTest&lt;/code&gt; class that has significantly
    more functionality than the previous (now deprecated)
    &lt;code&gt;WritableTestCase&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Support for &lt;a href="https://docs.pytest.org/en/latest/"&gt;&lt;code&gt;pytest&lt;/code&gt;&lt;/a&gt;
    as well as &lt;code&gt;unittest&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Available from &lt;a href="https://pypi.python.org/pypi/tdda/"&gt;PyPI&lt;/a&gt;
    with &lt;code&gt;pip install tdda&lt;/code&gt;, as well as from GitHub.&lt;/li&gt;
&lt;li&gt;Support for comparing CSV files.&lt;/li&gt;
&lt;li&gt;Support for comparing pandas DataFrames.&lt;/li&gt;
&lt;li&gt;Support for preprocessing results before comparison (beyond
    simply dropping lines) in reference tests.&lt;/li&gt;
&lt;li&gt;Greater consistency between parameters and options for all comparison
    methods&lt;/li&gt;
&lt;li&gt;Support for categorizing kinds of reference data and rewriting only
    nominated categories (with &lt;code&gt;-w&lt;/code&gt; or &lt;code&gt;--write&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;More (meta) tests of the reference test functionality.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="background-reference-tests-and-writabletestcase"&gt;Background: reference tests and WritableTestCase&lt;/h2&gt;
&lt;p&gt;We previously introduced the idea of a &lt;a href="https://www.tdda.info/pages/glossary.html#reference-test"&gt;reference test&lt;/a&gt;
with an example in the post &lt;a href="https://www.tdda.info/first-test"&gt;First Test&lt;/a&gt;,
and then when describing the
&lt;a href="https://www.tdda.info/writabletestcase-example-use"&gt;&lt;code&gt;WritableTestCase&lt;/code&gt;&lt;/a&gt;
library.
A reference test is essentially the TDDA equivalent of a software &lt;em&gt;system test&lt;/em&gt;
(or integration test), and is characterized by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;normally testing a relatively large unit of analysis functionality
    (up to and including whole analytical processes)&lt;/li&gt;
&lt;li&gt;normally generating one or more large or complex outputs that are
    hard to verify (e.g. datasets, tables, graphs, files etc.)&lt;/li&gt;
&lt;li&gt;sometimes featuring unimportant run-to-run differences that mean
    testing equality of actual and expected output will often fail
    (e.g. files may contain date stamps, version numbers, or random
    identifiers)&lt;/li&gt;
&lt;li&gt;often being impractical to generate by hand&lt;/li&gt;
&lt;li&gt;often needing to be regenerated automatically (after verification!)
    when formats change, when bugs are fixed or when understanding changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The old &lt;code&gt;WritableTestCase&lt;/code&gt; class that we made available in the &lt;code&gt;tdda&lt;/code&gt;
library provided support for writing, running and updating such reference
tests in Python by extending the &lt;code&gt;unittest.TestCase&lt;/code&gt; class and providing
methods for writing reference tests and commands for running (and,
where necessary, updating the reference results used by) reference tests.&lt;/p&gt;
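&lt;p&gt;The core idea of a reference test, comparing actual output with a stored
reference result and rewriting that result on request, can be sketched in a few
lines using only the standard library. This is a minimal illustration rather
than the &lt;code&gt;tdda&lt;/code&gt; implementation: the helper name
&lt;code&gt;check_against_reference&lt;/code&gt;, its &lt;code&gt;write&lt;/code&gt; parameter and the
file name &lt;code&gt;report.txt&lt;/code&gt; are all assumptions made for the example.&lt;/p&gt;

```python
import os
import tempfile


def check_against_reference(actual, ref_path, write=False):
    """Compare a string with a stored reference file (hypothetical helper).

    Returns True when the output matches, or when it has just been
    written as the new reference result.
    """
    if write:
        # Rewrite mode: accept the (already verified!) output as the
        # new reference result.
        with open(ref_path, 'w') as f:
            f.write(actual)
        return True
    with open(ref_path) as f:
        expected = f.read()
    return actual == expected


# Usage: create a reference result, then check matching and changed output.
ref_path = os.path.join(tempfile.mkdtemp(), 'report.txt')
check_against_reference('total: 42\n', ref_path, write=True)
matches = check_against_reference('total: 42\n', ref_path)
differs = check_against_reference('total: 43\n', ref_path)
print(matches, differs)
```

&lt;p&gt;A real reference-test library adds the parts that make this practical:
tolerating unimportant differences, reporting where outputs diverge, and
controlling which reference results are rewritten.&lt;/p&gt;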
&lt;h2 id="deprecating-writabletestcase-in-favour-of-referencetest"&gt;Deprecating WritableTestCase in Favour of ReferenceTest&lt;/h2&gt;
&lt;p&gt;Through our use of reference testing in various contexts and projects
at Stochastic Solutions, we have ended up producing three different
implementations of reference-test libraries, each with different
capabilities. We also became aware that an increasing number of Python
developers have a marked preference for &lt;code&gt;pytest&lt;/code&gt; over &lt;code&gt;unittest&lt;/code&gt;, and
wanted to support that more naturally.  The new &lt;code&gt;ReferenceTest&lt;/code&gt; class
brings together all the capabilities we have developed, standardizes
them and fills in missing combinations while providing idiomatic
patterns for using it both in the context of &lt;code&gt;unittest&lt;/code&gt; and &lt;code&gt;pytest&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We have no immediate plans to remove &lt;code&gt;WritableTestCase&lt;/code&gt; from the &lt;code&gt;tdda&lt;/code&gt;
library (and, indeed, continue to use it extensively ourselves), but encourage
people to adopt &lt;code&gt;ReferenceTest&lt;/code&gt; instead as we believe it is superior
in all respects.&lt;/p&gt;
&lt;h2 id="availability-and-installation"&gt;Availability and Installation&lt;/h2&gt;
&lt;p&gt;You can now install the &lt;code&gt;tdda&lt;/code&gt; module with pip:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip install tdda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and the source remains available under an MIT licence from GitHub
with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;git clone https://github.com/tdda/tdda.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;tdda&lt;/code&gt; library works under Python 2.7 and Python 3 and includes
the reference test functionality mentioned above,
&lt;a href="https://www.tdda.info/constraints-and-assertions"&gt;constraint discovery and verification&lt;/a&gt;
(including a &lt;a href="https://www.tdda.info/constraint-discovery-and-verification-for-pandas-dataframes"&gt;Pandas version&lt;/a&gt;),
and automatic discovery of regular expressions from examples
(&lt;a href="https://www.tdda.info/introducing-rexpy-automatic-discovery-of-regular-expressions.html"&gt;rexpy&lt;/a&gt;, also available online &lt;a href="https://rexpy.herokuapp.com"&gt;here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;After installation, you can run TDDA's tests as follows:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ python &lt;/span&gt;&lt;span class="nb"&gt;-&lt;/span&gt;&lt;span class="c"&gt;m tdda&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;testtdda&lt;/span&gt;
&lt;span class="nt"&gt;............................................................................&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 76 tests in 0&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;279s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="getting-example-code"&gt;Getting example code&lt;/h2&gt;
&lt;p&gt;However you have obtained the &lt;code&gt;tdda&lt;/code&gt; module, you can get a copy of its
reference test examples by running the command&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python -m tdda.referencetest.examples
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;which will place them in a &lt;code&gt;referencetest-examples&lt;/code&gt; subdirectory
of your current directory. Alternatively, you can specify that you want them
in a particular place with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python -m tdda.referencetest.examples /path/to/particular/place
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There are variations for getting examples for the constraint generation
and validation functionality, and for regular expression extraction
(&lt;a href="https://www.tdda.info/introducing-rexpy-automatic-discovery-of-regular-expressions"&gt;rexpy&lt;/a&gt;):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python -m tdda.constraints.examples &lt;span class="o"&gt;[&lt;/span&gt;/path/to/particular/place&lt;span class="o"&gt;]&lt;/span&gt;
$ python -m tdda.rexpy.examples &lt;span class="o"&gt;[&lt;/span&gt;/path/to/particular/place&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="example-use-of-referencetest-from-unittest"&gt;Example use of ReferenceTest from unittest&lt;/h2&gt;
&lt;p&gt;Here is a cut-down example of how to use the &lt;code&gt;ReferenceTest&lt;/code&gt; class
with Python's &lt;code&gt;unittest&lt;/code&gt;, based on the example in
&lt;code&gt;referencetest-examples/unittest&lt;/code&gt;. For those who prefer &lt;code&gt;pytest&lt;/code&gt;,
there is a similar &lt;code&gt;pytest&lt;/code&gt;-ready example in &lt;code&gt;referencetest-examples/pytest&lt;/code&gt;.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unicode_literals&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tempfile&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.referencetest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;

&lt;span class="c1"&gt;# ensure we can import the generators module in the directory above,&lt;/span&gt;
&lt;span class="c1"&gt;# wherever that happens to be&lt;/span&gt;
&lt;span class="n"&gt;FILE_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vm"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;PARENT_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FILE_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PARENT_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;generators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_file&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestExample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertStringCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;string_result.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                 &lt;span class="n"&gt;ignore_substrings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Copyright&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Version&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleFileGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;outdir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tempfile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gettempdir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;outpath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outdir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;file_result.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;generate_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outpath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertFileCorrect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outpath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;file_result.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;ignore_patterns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Copy(lef|righ)t&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;Version&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;TestExample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_default_data_location&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PARENT_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;reference&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ReferenceTestCase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;These tests illustrate comparing a string generated by some code to a
   reference string (stored in a file), and comparing a file generated by
   code to a reference file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We need to tell the &lt;code&gt;ReferenceTest&lt;/code&gt; class where to find reference
   files used for comparison. The call to &lt;code&gt;set_default_data_location&lt;/code&gt;,
   straight after defining the class, does this.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The first test generates the string &lt;code&gt;actual&lt;/code&gt; and compares it to
   the contents of the file &lt;code&gt;string_result.html&lt;/code&gt; in the data location
   specified (&lt;code&gt;../reference&lt;/code&gt;). The &lt;code&gt;ignore_substrings&lt;/code&gt; parameter
   specifies strings which, when encountered in a line, cause that line to be
   omitted from the comparison.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The second test instead writes a file to a temporary directory (but
   using the same name as the reference file). In this case, rather
   than &lt;code&gt;ignore_substrings&lt;/code&gt; we have used an &lt;code&gt;ignore_patterns&lt;/code&gt; parameter
   to specify regular expressions which, when matched, cause lines to
   be disregarded in comparisons.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There are a number of other parameters that can be added to the
   various &lt;code&gt;assert...&lt;/code&gt; methods to allow other kinds of discrepancies
   between actual and reference files to be disregarded.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
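&lt;p&gt;To make the effect of the two exclusion parameters concrete, here is a
standard-library sketch of line-level exclusion. It is an illustration under
stated assumptions, not how &lt;code&gt;tdda&lt;/code&gt; implements its comparisons: the
helper &lt;code&gt;comparable_lines&lt;/code&gt; is hypothetical, and it simply drops
excluded lines from each side before comparing what remains.&lt;/p&gt;

```python
import re


def comparable_lines(text, ignore_substrings=(), ignore_patterns=()):
    """Return the lines of text that should take part in a comparison."""
    kept = []
    for line in text.splitlines():
        if any(s in line for s in ignore_substrings):
            continue  # line contains an excluded substring
        if any(re.search(p, line) for p in ignore_patterns):
            continue  # line matches an excluded regular expression
        kept.append(line)
    return kept


actual = ('Copyright (c) Stochastic Solutions, 2016\n'
          'Version 1.0.0\n'
          'body\n')
expected = ('Copyright (c) Stochastic Solutions Limited, 2016\n'
            'Version 0.0.0\n'
            'body\n')

substrings = ['Copyright', 'Version']
patterns = ['Copy(lef|righ)t', 'Version']

same_with_substrings = (
    comparable_lines(actual, ignore_substrings=substrings)
    == comparable_lines(expected, ignore_substrings=substrings))
same_with_patterns = (
    comparable_lines(actual, ignore_patterns=patterns)
    == comparable_lines(expected, ignore_patterns=patterns))
same_without_exclusions = (
    comparable_lines(actual) == comparable_lines(expected))
print(same_with_substrings, same_with_patterns, same_without_exclusions)
```

&lt;p&gt;With either kind of exclusion the two texts compare equal, because only the
copyright and version lines differ; with no exclusions the comparison fails.&lt;/p&gt;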
&lt;h2 id="running-the-reference-tests-success-failure-and-rewriting"&gt;Running the reference tests: Success, Failure and Rewriting&lt;/h2&gt;
&lt;p&gt;If you just run the code above, or the file in the examples, you should
see output like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ python test_using_referencetestcase&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;py&lt;/span&gt;
&lt;span class="nt"&gt;..&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 2 tests in 0&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;003s&lt;/span&gt;

&lt;span class="c"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or, if you use pytest, like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ &lt;span class="nv"&gt;pytest&lt;/span&gt;
&lt;span class="o"&gt;=============================&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; session &lt;span class="nv"&gt;starts&lt;/span&gt; &lt;span class="o"&gt;==============================&lt;/span&gt;
platform darwin -- Python &lt;span class="m"&gt;2&lt;/span&gt;.7.11, pytest-3.0.2, py-1.4.31, pluggy-0.3.1
rootdir: /Users/njr/tmp/referencetest-examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected &lt;span class="m"&gt;2&lt;/span&gt; items

test_using_referencepytest.py ..

&lt;span class="o"&gt;===========================&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; passed &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.01 &lt;span class="nv"&gt;seconds&lt;/span&gt; &lt;span class="o"&gt;===========================&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you then edit &lt;code&gt;generators.py&lt;/code&gt; in the directory above and make some change
to the HTML in the &lt;code&gt;generate_string&lt;/code&gt; and &lt;code&gt;generate_file&lt;/code&gt; functions
(preferably non-semantic changes, like adding an extra space before
the &lt;code&gt;&amp;gt;&lt;/code&gt; in &lt;code&gt;&amp;lt;/html&amp;gt;&lt;/code&gt;) and then rerun the tests, you should get failures.
Changing just the &lt;code&gt;generate_string&lt;/code&gt; function:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python unittest/test_using_referencetestcase.py
.1 line is different, starting at line &lt;span class="m"&gt;33&lt;/span&gt;
Expected file /Users/njr/tmp/referencetest-examples/reference/string_result.html
Note exclusions:
    Copyright
    Version
Compare with &lt;span class="s2"&gt;&amp;quot;diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html&lt;/span&gt;
&lt;span class="s2"&gt;/Users/njr/tmp/referencetest-examples/reference/string_result.html&amp;quot;&lt;/span&gt;.
&lt;span class="nv"&gt;F&lt;/span&gt;
&lt;span class="o"&gt;======================================================================&lt;/span&gt;
FAIL: testExampleStringGeneration &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestExample&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;unittest/test_using_referencetestcase.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;62&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; testExampleStringGeneration
    &lt;span class="nv"&gt;ignore_substrings&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Copyright&amp;#39;&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;Version&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;
  File &lt;span class="s2"&gt;&amp;quot;/Users/njr/python/tdda/tdda/referencetest/referencetest.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;527&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; assertStringCorrect
    self.check_failures&lt;span class="o"&gt;(&lt;/span&gt;failures, msgs&lt;span class="o"&gt;)&lt;/span&gt;
  File &lt;span class="s2"&gt;&amp;quot;/Users/njr/python/tdda/tdda/referencetest/referencetest.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;709&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; check_failures
    self.assert_fn&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;, &lt;span class="s1"&gt;&amp;#39;\n&amp;#39;&lt;/span&gt;.join&lt;span class="o"&gt;(&lt;/span&gt;msgs&lt;span class="o"&gt;))&lt;/span&gt;
AssertionError: &lt;span class="m"&gt;1&lt;/span&gt; line is different, starting at line &lt;span class="m"&gt;33&lt;/span&gt;
Expected file /Users/njr/tmp/referencetest-examples/reference/string_result.html
Note exclusions:
    Copyright
    Version
Compare with &lt;span class="s2"&gt;&amp;quot;diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html&lt;/span&gt;
&lt;span class="s2"&gt;/Users/njr/tmp/referencetest-examples/reference/string_result.html&amp;quot;&lt;/span&gt;.


----------------------------------------------------------------------
Ran &lt;span class="m"&gt;2&lt;/span&gt; tests &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.005s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As expected, the string test now fails, and the &lt;code&gt;ReferenceTest&lt;/code&gt;
library suggests a command you can use to &lt;code&gt;diff&lt;/code&gt; the output: because
the test failed, it wrote the actual output to a temporary file.  (It
reports the failure twice, once as it occurs and once at the end.
This is deliberate as it's convenient to see it when it happens if the
tests take any non-trivial amount of time to run, and convenient to
collect together all the failures at the end too.)&lt;/p&gt;
&lt;p&gt;Because these are HTML files, I would probably instead open them both
(using the &lt;code&gt;open&lt;/code&gt; command on Mac OS) and visually inspect them. In
this case, the pages look identical, and &lt;code&gt;diff&lt;/code&gt; will confirm that the
changes are only those we expect:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
/Users/njr/tmp/referencetest-examples/reference/string_result.html
5,6c5,6
&lt;span class="nt"&gt;&amp;lt;     Copyright&lt;/span&gt; &lt;span class="err"&gt;(c)&lt;/span&gt; &lt;span class="err"&gt;Stochastic&lt;/span&gt; &lt;span class="err"&gt;Solutions,&lt;/span&gt; &lt;span class="err"&gt;2016&lt;/span&gt;
&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;     &lt;span class="err"&gt;Version&lt;/span&gt; &lt;span class="err"&gt;1.0.0&lt;/span&gt;
&lt;span class="err"&gt;---&lt;/span&gt;
&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;     Copyright (c) Stochastic Solutions Limited, 2016
&amp;gt;     Version 0.0.0
33c33
&lt;span class="err"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nt"&gt;&amp;lt;/html &amp;gt;&lt;/span&gt;
---
&amp;gt; &lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this case&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We see that the copyright and version lines are different, but we used
    &lt;code&gt;ignore_substrings&lt;/code&gt; to say that's OK.&lt;/li&gt;
&lt;li&gt;It shows us the extra space before the close tag.&lt;/li&gt;
&lt;/ul&gt;
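&lt;p&gt;The failure behaviour seen above, where the actual output is saved to a
temporary file and a &lt;code&gt;diff&lt;/code&gt; command is suggested, can be sketched as
follows. This is a hedged illustration only: the helper
&lt;code&gt;failure_message&lt;/code&gt; and the reference path are hypothetical, and the
&lt;code&gt;actual-&lt;/code&gt; prefix simply mirrors the file name shown in the output
above.&lt;/p&gt;

```python
import os
import tempfile


def failure_message(actual, ref_path):
    """Save actual output to a temp file and build a diff hint (hypothetical)."""
    actual_path = os.path.join(tempfile.gettempdir(),
                               'actual-' + os.path.basename(ref_path))
    with open(actual_path, 'w') as f:
        f.write(actual)
    hint = 'Compare with "diff %s %s".' % (actual_path, ref_path)
    return actual_path, hint


# The reference path below is a placeholder, not a real location.
actual_path, message = failure_message(
    'some generated output\n', '/path/to/reference/string_result.html')
print(message)
```

&lt;p&gt;Keeping the reference file's base name in the temporary file's name makes it
easy to see which reference result each saved output belongs to.&lt;/p&gt;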
&lt;p&gt;If you are happy that the new output is OK and should replace the previous
reference result, you can rerun with the &lt;code&gt;-W&lt;/code&gt; (or &lt;code&gt;--write-all&lt;/code&gt;) flag:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python unittest/test_using_referencetestcase.py -W
Written /Users/njr/tmp/referencetest-examples/reference/file_result.html
.Written /Users/njr/tmp/referencetest-examples/reference/string_result.html
.
----------------------------------------------------------------------
Ran &lt;span class="m"&gt;2&lt;/span&gt; tests &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.003s

OK
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now if you run them again without &lt;code&gt;-W&lt;/code&gt;, the tests should all pass.&lt;/p&gt;
&lt;p&gt;You can do the same with &lt;code&gt;pytest&lt;/code&gt;, except that in this case you need to use&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pytest --write-all
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;because &lt;code&gt;pytest&lt;/code&gt; does not allow short flags.&lt;/p&gt;
&lt;h2 id="other-kinds-of-tests"&gt;Other kinds of tests&lt;/h2&gt;
&lt;p&gt;Since this post is already quite long, we won't go through all the other
options, parameters and kinds of tests in detail, but will mention a few
other points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In addition to &lt;code&gt;assertStringCorrect&lt;/code&gt; and &lt;code&gt;assertFileCorrect&lt;/code&gt; there
    are various other methods available:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;assertFilesCorrect&lt;/code&gt; for checking that multiple files are as expected&lt;/li&gt;
&lt;li&gt;&lt;code&gt;assertCSVFileCorrect&lt;/code&gt; for checking a single CSV file&lt;/li&gt;
&lt;li&gt;&lt;code&gt;assertCSVFilesCorrect&lt;/code&gt; for checking multiple CSV files&lt;/li&gt;
&lt;li&gt;&lt;code&gt;assertDataFramesEqual&lt;/code&gt; to check equality of Pandas DataFrames&lt;/li&gt;
&lt;li&gt;&lt;code&gt;assertDataFrameCorrect&lt;/code&gt; to check a data frame matches data in
    a CSV file.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Where appropriate, &lt;code&gt;assert&lt;/code&gt; methods accept various optional
    parameters, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;lstrip&lt;/code&gt; — ignore whitespace at start of files/strings (default
    &lt;code&gt;False&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rstrip&lt;/code&gt; —  ignore whitespace at end of files/strings (default
    &lt;code&gt;False&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kind&lt;/code&gt; — optional category for test; these can be used to allow
    selective rewriting of test output with the &lt;code&gt;-w&lt;/code&gt;/&lt;code&gt;--write&lt;/code&gt; flag&lt;/li&gt;
&lt;li&gt;&lt;code&gt;preprocess&lt;/code&gt; — function to call on expected and actual data before
    comparing results&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
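&lt;p&gt;To make the effect of &lt;code&gt;lstrip&lt;/code&gt;, &lt;code&gt;rstrip&lt;/code&gt; and &lt;code&gt;preprocess&lt;/code&gt; concrete, here is a small stand-alone sketch of a comparison that honours the same three ideas. It is not tdda code: &lt;code&gt;strings_match&lt;/code&gt; and &lt;code&gt;mask_digits&lt;/code&gt; are hypothetical names, the sketch strips the whole string rather than each line, and it assumes a preprocess function that maps one string to one string; the real library may differ on all of these points.&lt;/p&gt;

```python
def strings_match(actual, expected, lstrip=False, rstrip=False,
                  preprocess=None):
    """Compare two strings after optional normalisation.

    Hypothetical helper for illustration; not part of tdda.
    """
    if preprocess is not None:
        # Assumption: preprocess maps one string to one string and is
        # applied to both sides before anything else happens.
        actual, expected = preprocess(actual), preprocess(expected)
    if lstrip:
        actual, expected = actual.lstrip(), expected.lstrip()
    if rstrip:
        actual, expected = actual.rstrip(), expected.rstrip()
    return actual == expected


def mask_digits(text):
    """Replace every digit with '0' so run-specific numbers compare equal."""
    return "".join("0" if ch.isdigit() else ch for ch in text)


print(strings_match("report\n\n", "report", rstrip=True))
print(strings_match("run 17 ok", "run 42 ok", preprocess=mask_digits))
print(strings_match("run 17 ok", "run 42 ok"))
```

&lt;p&gt;The first two comparisons succeed because the differences are normalised away; the third fails because nothing excuses the changed number.&lt;/p&gt;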
&lt;p&gt;We'll add more detail in future posts.&lt;/p&gt;
&lt;p&gt;If you'd like to join the slack group where we discuss TDDA and related topics,
DM your email address to &lt;code&gt;@tdda&lt;/code&gt; on Twitter and we'll send you an invitation.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="constraints"></category><category term="rexpy"></category></entry><entry><title>Introducing Rexpy: Automatic Discovery of Regular Expressions</title><link href="https://tdda.info/introducing-rexpy-automatic-discovery-of-regular-expressions.html" rel="alternate"></link><published>2016-11-11T15:00:00+00:00</published><updated>2016-11-11T15:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2016-11-11:/introducing-rexpy-automatic-discovery-of-regular-expressions.html</id><summary type="html">&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;There's a &lt;a href="https://skyscanner.net"&gt;Skyscanner&lt;/a&gt; data feed we have been
working with for a year or so.  It's produced some six million records
so far, each of which has a transaction ID consisting of three
parts: a four-character alphanumeric &lt;em&gt;transaction type&lt;/em&gt;, a numeric
timestamp and a UUID, with the three parts …&lt;/p&gt;</summary><content type="html">&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;There's a &lt;a href="https://skyscanner.net"&gt;Skyscanner&lt;/a&gt; data feed we have been
working with for a year or so.  It's produced some six million records
so far, each of which has a transaction ID consisting of three
parts: a four-character alphanumeric &lt;em&gt;transaction type&lt;/em&gt;, a numeric
timestamp and a UUID, with the three parts separated by
hyphens. Things like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;adyt-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044
ooqt-1466602219-012da468-a820-11e6-8ba1-b8f6b118f191
z65e-1448755954-2d677190-ecda-4279-acb2-31a31ec8e86e
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The only thing that really matters is that the transaction IDs are unique,
but if everything is working correctly, the three parts should have the
right structure and match data that we have in other fields in the feed.&lt;/p&gt;
&lt;p&gt;We're pretty familiar with this data; or so we thought . . .&lt;/p&gt;
&lt;p&gt;We've added a command to our data analysis
software—&lt;a href="https://stochasticsolutions.com/miro.html"&gt;Miró&lt;/a&gt;—for
characterizing the patterns in string fields. The command is &lt;code&gt;rex&lt;/code&gt;
and when we run it on the field (&lt;code&gt;tid&lt;/code&gt;), by saying:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;skyscanner&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;miro&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;miro&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;020&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;946&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;020&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;946&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;selected&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rex&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;this is the output:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;Za&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;\&lt;span class="nv"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nv"&gt;dckx&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1466604137&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nv"&gt;aada032aa7348e1ac0fcfdd02a80f9c&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;Za&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;\&lt;span class="nv"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span 
class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nv"&gt;q&lt;/span&gt;\&lt;span class="nv"&gt;_&lt;/span&gt;\&lt;span class="nv"&gt;_s&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span 
class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;rex&lt;/code&gt; command has found four patterns that, between them,
characterize all the values in the field, and it has expressed these in
the form of &lt;em&gt;regular expressions.&lt;/em&gt; (If you aren't familiar with
&lt;a href="https://en.wikipedia.org/wiki/Regular_expression"&gt;regular
expressions&lt;/a&gt;, you
might want to read the linked &lt;a href="https://en.wikipedia.org/wiki/Regular_expression"&gt;Wikipedia
article&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The third pattern in the list is more-or-less the one we thought would
characterize all the transaction IDs. It reads:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;start&lt;sup id="fnref:sol"&gt;&lt;a class="footnote-ref" href="#fn:sol"&gt;1&lt;/a&gt;&lt;/sup&gt; (&lt;code&gt;^&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;then 2-4 letters or numbers (&lt;code&gt;[A-Za-z0-9]{2,4}&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;then a mixture of one to three hyphens and underscores (&lt;code&gt;[\-\_]{1,3}&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;then 10 digits (&lt;code&gt;\d{10}&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;then a hyphen (&lt;code&gt;\-&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;then a &lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier"&gt;UUID&lt;/a&gt;,
    which is a hyphen-separated collection of 32 hex digits, in groups of
    8, 4, 4, 4 and 12
    (&lt;code&gt;[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;and end&lt;sup id="fnref:eol"&gt;&lt;a class="footnote-ref" href="#fn:eol"&gt;2&lt;/a&gt;&lt;/sup&gt; (&lt;code&gt;$&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
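&lt;p&gt;The reading above can be checked with Python's &lt;code&gt;re&lt;/code&gt; module. The sketch below copies the third pattern from the &lt;code&gt;rex&lt;/code&gt; output and tries it on the three sample transaction IDs shown at the start of this post; the variable names are ours.&lt;/p&gt;

```python
import re

# The third pattern reported by rex, split over two lines for readability.
TID_PATTERN = re.compile(
    r"^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-"
    r"[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$"
)

# The three sample IDs quoted earlier in the post.
examples = [
    "adyt-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044",
    "ooqt-1466602219-012da468-a820-11e6-8ba1-b8f6b118f191",
    "z65e-1448755954-2d677190-ecda-4279-acb2-31a31ec8e86e",
]

for tid in examples:
    print(tid, bool(TID_PATTERN.match(tid)))
```

&lt;p&gt;All three samples match, as expected for well-formed IDs.&lt;/p&gt;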
&lt;p&gt;The only difference between this and what we expected is that it turns
out that some of the "alphanumeric transaction types" end with two
underscores rather than being strictly alphanumeric, and the &lt;code&gt;rex&lt;/code&gt;
command has expressed this as "2-4 alphanumeric characters followed by
1-3 hyphens or underscores", rather than "2-4 characters that are
alphanumeric or underscores, followed by a hyphen", which would have
been more like the way we think about it.&lt;/p&gt;
&lt;p&gt;What are the other three expressions?&lt;/p&gt;
&lt;p&gt;The first one is the same, but without the UUID. Occasionally the UUID
is missing (null), and when this happens the UUID portion of the &lt;code&gt;tid&lt;/code&gt;
is blank. It is possible to write a regular expression that combines
these two cases, but &lt;code&gt;rex&lt;/code&gt; doesn't quite yet know how to do this. The
way to do it would be to make the UUID optional by enclosing it in
parentheses and following it by a question mark:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})?
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;which reads &lt;em&gt;zero or one UUIDs&lt;/em&gt;, given that we know the pattern inside the
parentheses corresponds to a UUID.&lt;/p&gt;
&lt;p&gt;The last pattern is another one we could unify with the other two:
it is the same except that it identifies a particular transaction type
that again uses underscores, but now in the middle: &lt;code&gt;q__s&lt;/code&gt;. So we could
replace those three (and might, in an ideal world, want &lt;code&gt;rex&lt;/code&gt; to find)&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;Za&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;z0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;\&lt;span class="nv"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span 
class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is exactly what we described, except with the inclusion of &lt;code&gt;_&lt;/code&gt; as an
allowable character in the 4-character transaction type at the start.&lt;/p&gt;
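&lt;p&gt;The two hand edits discussed here (allowing &lt;code&gt;_&lt;/code&gt; in the transaction type and making the UUID optional) can be combined and tried out in Python. This is a sketch only: the combined pattern is assembled from the fragments shown rather than produced by &lt;code&gt;rex&lt;/code&gt;, and the IDs marked "made up" were constructed for illustration by editing the first sample ID.&lt;/p&gt;

```python
import re

UUID = r"[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}"

# Transaction type may contain underscores; the UUID group is optional.
UNIFIED = re.compile(r"^[A-Za-z0-9\_]{4}\-\d{10}\-(" + UUID + r")?$")

tids = [
    "adyt-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044",  # sample from the feed
    "q__s-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044",  # made up: q__s type
    "ad__-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044",  # made up: trailing underscores
    "adyt-1466611238-",                                      # made up: missing UUID
    "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c",      # second rex pattern
]

results = [bool(UNIFIED.match(tid)) for tid in tids]
print(results)
```

&lt;p&gt;The first four IDs match the combined pattern; the last one, whose UUID has no hyphens, does not.&lt;/p&gt;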
&lt;p&gt;But what about the second pattern?&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nv"&gt;dckx&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1466604137&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nv"&gt;aada032aa7348e1ac0fcfdd02a80f9c&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It's actually a single, completely specific transaction ID, that matches
the main pattern except for omitting the hyphens in the UUID.
This shouldn't be possible—by which I mean, the UUID generator should
never omit the hyphens. But clearly either it did for this one transaction,
or something else stripped them out later. Either way, it shows that among
the several million transactions, there is a bad transaction ID.&lt;/p&gt;
&lt;p&gt;A further check in Miró shows that this occurs just a single time (i.e.
it is not duplicated):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="cp"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="cp"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;tid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="nt"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;miro&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;6&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nt"&gt;020&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nt"&gt;946&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;records&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;selected&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;57&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="nt"&gt;Selection&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="o"&gt;(=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;tid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
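&lt;p&gt;Outside Miró, the same two checks (which IDs fall outside the expected pattern, and how often each of them occurs) can be sketched with the standard library. The list &lt;code&gt;tids&lt;/code&gt; is a tiny stand-in for the real field, and the expected pattern is the hand-unified one rather than anything &lt;code&gt;rex&lt;/code&gt; produced.&lt;/p&gt;

```python
import re
from collections import Counter

EXPECTED = re.compile(
    r"^[A-Za-z0-9\_]{4}\-\d{10}\-"
    r"[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$"
)

# Tiny stand-in for the six million values of the tid field.
tids = [
    "adyt-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044",
    "ooqt-1466602219-012da468-a820-11e6-8ba1-b8f6b118f191",
    "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c",
    "z65e-1448755954-2d677190-ecda-4279-acb2-31a31ec8e86e",
]

# Count every ID, then keep only those that fail the expected pattern.
counts = Counter(tids)
bad = {tid: n for tid, n in counts.items() if not EXPECTED.match(tid)}
print(bad)
```

&lt;p&gt;A count of 1 for the malformed ID corresponds to the single selected record in the Miró output.&lt;/p&gt;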

&lt;p&gt;So we've learnt something interesting and useful about our data, and
Miró's &lt;code&gt;rex&lt;/code&gt; command has helped us produce regular expressions that
characterize all our transaction IDs. It hasn't done a perfect job,
but it was pretty useful, and it's easy for us to merge the three
main patterns by hand. We plan to extend the functionality to cover these
cases better over coming weeks.&lt;/p&gt;
&lt;h2 id="rexpy-outside-miro"&gt;Rexpy outside Miró&lt;/h2&gt;
&lt;p&gt;If
&lt;a href="https://stochasticsolutions.com/getmiro.html"&gt;you don't happen to have Miró&lt;/a&gt;,
or want to find regular expressions from
data outside the context of a Miró dataset, you can use equivalent
functionality directly from the &lt;code&gt;rexpy&lt;/code&gt; module of our open-source,
MIT-licenced &lt;code&gt;tdda&lt;/code&gt; package, available from GitHub:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;git clone https://github.com/tdda/tdda.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This provides both a Python API for finding regular expressions from
example data, and a command line tool.&lt;/p&gt;
&lt;h3 id="the-rexpy-command-line"&gt;The Rexpy Command Line&lt;/h3&gt;
&lt;p&gt;If you just run &lt;code&gt;rexpy.py&lt;/code&gt; from the command line, with no arguments,
it expects you to type (or paste) a set of strings, and when you finish
(with &lt;code&gt;CTRL-D&lt;/code&gt; on unix-like systems, or &lt;code&gt;CTRL-Z&lt;/code&gt; on Windows systems)
it will spit out the regular expressions it thinks you need to match them.
For example:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;python rexpy.py
&lt;span class="go"&gt;EH1 7JQ&lt;/span&gt;
&lt;span class="go"&gt;WC1 4AA&lt;/span&gt;
&lt;span class="go"&gt;G2 3PQ&lt;/span&gt;
&lt;span class="go"&gt;^[A-Za-z0-9]{2,3}\ [A-Za-z0-9]{3}$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Or, of course, you can pipe input into it.  If, for example, I do that
in a folder full of photos from a Nikon camera, I get&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;ls *.* &lt;span class="p"&gt;|&lt;/span&gt; python ~/python/tdda/rexpy/rexpy.py
&lt;span class="go"&gt;^DSC\_\d{4}\.[A-Za-z]{3}$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(because these are all files like &lt;code&gt;DSC_1234.NEF&lt;/code&gt; or &lt;code&gt;DSC_9346.xmp&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;You can also give it a filename as the first argument, in which case
it will read strings (one per line) from a file.&lt;/p&gt;
&lt;p&gt;So given the file &lt;code&gt;ids.txt&lt;/code&gt; (which is in the &lt;code&gt;rexpy/examples&lt;/code&gt; subdirectory
in the &lt;code&gt;tdda&lt;/code&gt; repository), containing:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="mf"&gt;123&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;AA&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;971&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;12&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;DQ&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;802&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;198&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;AA&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;045&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;BA&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;834&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;we can use rexpy on it by saying:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;python rexpy.py examples/ids.txt
&lt;span class="go"&gt;^\d{1,3}\-[A-Z]{2}\-\d{3}$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and if there is a header line you want to skip, you can add either &lt;code&gt;-h&lt;/code&gt;
or &lt;code&gt;--header&lt;/code&gt; to tell Rexpy to skip that.&lt;/p&gt;
&lt;p&gt;You can also give a filename as a second command line argument, in which
case Rexpy will write the results (one per line) to that file.&lt;/p&gt;
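&lt;p&gt;Once Rexpy has produced a pattern, it is easy to check a one-string-per-line file against it. The sketch below uses only the standard library; &lt;code&gt;non_matching&lt;/code&gt; is a hypothetical helper, and the header line &lt;code&gt;id&lt;/code&gt; is added to the sample data purely to show the effect of skipping a header.&lt;/p&gt;

```python
import re
import tempfile
from pathlib import Path

# The pattern Rexpy reported for ids.txt.
ID_PATTERN = r"^\d{1,3}\-[A-Z]{2}\-\d{3}$"


def non_matching(path, pattern, header=False):
    """Return the lines of a one-string-per-line file that fail the pattern.

    Hypothetical helper for illustration; not part of tdda.
    """
    lines = Path(path).read_text().splitlines()
    if header:
        lines = lines[1:]  # skip the header line
    regex = re.compile(pattern)
    return [line for line in lines if not regex.match(line)]


with tempfile.TemporaryDirectory() as folder:
    ids_file = Path(folder) / "ids.txt"
    # Same four IDs as ids.txt, with an illustrative header line added.
    ids_file.write_text("id\n123-AA-971\n12-DQ-802\n198-AA-045\n1-BA-834\n")
    failures = non_matching(ids_file, ID_PATTERN, header=True)
    unskipped = non_matching(ids_file, ID_PATTERN)

print(failures)
print(unskipped)
```

&lt;p&gt;With the header skipped every line matches; without skipping, the header itself is reported as the only failure.&lt;/p&gt;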
&lt;h2 id="motivation-for-rexpy"&gt;Motivation for Rexpy&lt;/h2&gt;
&lt;p&gt;There's an &lt;a href="https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/"&gt;old joke&lt;/a&gt; among programmers, generally attributed to
&lt;a href="https://mobile.twitter.com/jwz"&gt;Jamie Zawinski&lt;/a&gt;
that bears repeating:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Some people, when confronted with a problem, think&lt;/em&gt;
 &lt;em&gt;"I know, I'll use regular expressions."&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Now they have two problems.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;— Jamie Zawinski&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As powerful as regular expressions are, &lt;a href="https://en.wikipedia.org/wiki/Stephen_Cole_Kleene"&gt;even&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Ken_Thompson"&gt;their&lt;/a&gt;
&lt;a href="https://en.wikipedia.org/wiki/Brian_Kernighan"&gt;best&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Larry_Wall"&gt;friends&lt;/a&gt; would probably concede that they can be hard
to write, harder to read and harder still to debug.&lt;/p&gt;
&lt;p&gt;Despite this, regular expressions are an attractive way to specify
constraints on string fields for two main reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;First, regular expressions constitute a fast, powerful, and
     near-ubiquitous mechanism for describing a wide variety of
     possible structures in string data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Secondly, regular expressions include the concept of
     &lt;em&gt;capture groups&lt;/em&gt;. You may recall that in a grouped&lt;sup id="fnref:aka"&gt;&lt;a class="footnote-ref" href="#fn:aka"&gt;3&lt;/a&gt;&lt;/sup&gt;
     regular expression, one or more subcomponents is enclosed in
     parentheses and its matched value can be extracted.
     As we will see below, this is particularly interesting in
     the context of test-driven data analysis.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
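&lt;p&gt;As a quick reminder of how capture groups behave, here is a minimal Python sketch using the IDs from &lt;code&gt;ids.txt&lt;/code&gt;. The grouped pattern is our own illustration (parentheses added around each of the three parts), not output from Rexpy.&lt;/p&gt;

```python
import re

# Parentheses mark the three parts whose matched values we want back.
GROUPED = re.compile(r"^(\d+)\-([A-Z]{2})\-(\d{3})$")

values = ["123-AA-971", "12-DQ-802", "198-AA-045", "1-BA-834"]

# groups() returns the text matched by each parenthesised part, in order.
parts = [GROUPED.match(value).groups() for value in values]
for value, extracted in zip(values, parts):
    print(value, extracted)
```

&lt;p&gt;Each match yields a tuple of three strings, for example &lt;code&gt;('123', 'AA', '971')&lt;/code&gt; for the first ID.&lt;/p&gt;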
&lt;p&gt;For concreteness, suppose we have a field containing strings such as:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="mf"&gt;123&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;AA&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;971&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;12&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;DQ&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;802&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;198&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;AA&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;045&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;BA&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;834&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;One obvious regular expression to describe these would be:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;^&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(start [&lt;code&gt;^&lt;/code&gt;] with
one or more digits [&lt;code&gt;\d+&lt;/code&gt;],
then a hyphen [&lt;code&gt;\-&lt;/code&gt;],
then two capital letters [&lt;code&gt;[A-Z]{2}&lt;/code&gt;],
then another hyphen [&lt;code&gt;\-&lt;/code&gt;],
then three digits [&lt;code&gt;\d{3}&lt;/code&gt;],
then stop [&lt;code&gt;$&lt;/code&gt;]).&lt;/p&gt;
&lt;p&gt;But it is in the nature of regular expressions that there are both more
and less specific formulations we could use to describe the same strings,
ranging from the fairly specific:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2389&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;AA&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nv"&gt;DQ&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nv"&gt;BA&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(start [&lt;code&gt;^&lt;/code&gt;]
with a 1 [&lt;code&gt;1&lt;/code&gt;],
followed by up to two digits
chosen from {2, 3, 8, 9} [&lt;code&gt;[2389]{0,2}&lt;/code&gt;],
followed by a hyphen [&lt;code&gt;\-&lt;/code&gt;],
then AA, DQ or BA [&lt;code&gt;(AA|DQ|BA)&lt;/code&gt;],
followed by another hyphen [&lt;code&gt;\-&lt;/code&gt;],
then three digits [&lt;code&gt;\d{3}&lt;/code&gt;]
and finish [&lt;code&gt;$&lt;/code&gt;])&lt;/p&gt;
&lt;p&gt;to the fully general&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;^.*&lt;/span&gt;&lt;span class="p"&gt;$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(start [&lt;code&gt;^&lt;/code&gt;],
 then zero or more characters [&lt;code&gt;.*&lt;/code&gt;],
 then stop [&lt;code&gt;$&lt;/code&gt;]).&lt;/p&gt;
&lt;p&gt;Going back to our first formulation&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;^&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;one possible &lt;em&gt;grouped&lt;/em&gt; equivalent is&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nv"&gt;A&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;\&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;\&lt;span class="nv"&gt;d&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;})$&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The three parenthesized sections are known as &lt;em&gt;groups&lt;/em&gt;, and
regular expression implementations usually provide a way of looking up
the values of these groups when the expression matches a particular string.
For example, in Python we might say:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;^(\d+)\-([A-Z]&lt;/span&gt;&lt;span class="si"&gt;{2}&lt;/span&gt;&lt;span class="s1"&gt;)\-(\d&lt;/span&gt;&lt;span class="si"&gt;{3}&lt;/span&gt;&lt;span class="s1"&gt;)$&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;123-ZQ-987&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1: &amp;quot;&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;quot;  2: &amp;quot;&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;quot;  3: &amp;quot;&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;quot;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                         &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;00-FT-020&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1: &amp;quot;&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;quot;  2: &amp;quot;&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;quot;  3: &amp;quot;&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;quot;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                         &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If we run this, the output is:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;123&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;ZQ&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mf"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;987&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;00&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;FT&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mf"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;020&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="the-big-idea-automatic-discovery-of-constraints-on-string-structure"&gt;The Big Idea: Automatic Discovery of Constraints on String Structure&lt;/h2&gt;
&lt;p&gt;In the context of test-driven data analysis, the idea is probably
obvious: for string fields with some kind of structure—telephone
numbers, post codes, zip codes, UUIDs, more-or-less any kind of
structured identifier, airline codes, airport codes, national
insurance numbers, credit card numbers, bank sort codes, social
security numbers—we would like to specify constraints on the values
in the field using regular expressions.  A natural extension to the
&lt;a href="https://www.tdda.info/the-tdda-constraints-file-format"&gt;TDDA constraints file
format&lt;/a&gt;
introduced in the last post would be something along the lines
of:&lt;sup id="fnref:raw"&gt;&lt;a class="footnote-ref" href="#fn:raw"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="s"&gt;&amp;quot;regex&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;^&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;d+&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;-[A-Z]{2}&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;d{3}$&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;if there is a single regular expression that usefully matches all the
allowed field values. If a field contains strings in multiple
formats that are so different that using a single regular expression
would be unhelpful, we might instead provide a list of regular
expressions, such as:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="s"&gt;&amp;quot;regex&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;^&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;d+&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;-[A-Z]{2}&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;d{3}$&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;^[A-Z]{5}&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;d{5}$&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;which would mean &lt;em&gt;each field value should match at least one of the regular
expressions in the list&lt;/em&gt;.&lt;/p&gt;
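&lt;p&gt;As a rough illustration of those semantics (a sketch only, not the TDDA library's verification code), the snippet below parses a JSON fragment like the one above and checks that every value matches at least one expression in the list. The helper name &lt;code&gt;matches_any&lt;/code&gt; and the sample values are assumptions made for this example.&lt;/p&gt;

```python
import json
import re

# A JSON fragment in the style proposed above; note the doubled backslashes.
constraint_json = r'{"regex": ["^\\d+\\-[A-Z]{2}\\-\\d{3}$", "^[A-Z]{5}\\+\\d{5}$"]}'

# json.loads turns each escaped backslash pair back into a single backslash.
patterns = [re.compile(rex) for rex in json.loads(constraint_json)["regex"]]


def matches_any(value, patterns):
    """Return True if value matches at least one pattern (hypothetical helper)."""
    return any(p.match(value) for p in patterns)


# Sample values invented for this example.
values = ["123-AA-971", "ABCDE+12345", "bad value"]
failures = [v for v in values if not matches_any(v, patterns)]
print(failures)
```

&lt;p&gt;A value fails the constraint only if none of the listed expressions matches it, so adding an expression to the list can only loosen the constraint.&lt;/p&gt;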
&lt;p&gt;Just as with the &lt;a href="https://www.tdda.info/constraint-discovery-and-verification-for-pandas-dataframes"&gt;automatic discovery&lt;/a&gt;
of other types of constraints,
we want the TDDA constraint discovery library to be able to suggest
suitable regular expression constraints on string fields, where appropriate.
This is where Rexpy comes in:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Rexpy is a library for finding regular expressions that usefully&lt;/em&gt;
&lt;em&gt;characterize a given corpus of strings.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We are choosing to build Rexpy as a stand-alone module because it has
clear utility outside the context of constraint generation.&lt;/p&gt;
&lt;h2 id="the-other-idea-automatically-discovered-quasi-fields"&gt;The Other Idea: Automatically Discovered Quasi-Fields&lt;/h2&gt;
&lt;p&gt;We can imagine going beyond simply using (and automatically
discovering) regular expressions to describe constraints on string data.
Once we have useful regular expressions that characterize some
string data—and more particularly, in cases where we have a &lt;em&gt;single&lt;/em&gt;
regular expression that usefully describes the structure of the string—we
can tag meaningful subcomponents. In the example we used above, we had
three groups:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;(\d+)&lt;/code&gt; — the digits at the start of the identifier&lt;/li&gt;
&lt;li&gt;&lt;code&gt;([A-Z]{2})&lt;/code&gt; — the pair of letters in the middle&lt;/li&gt;
&lt;li&gt;&lt;code&gt;(\d{3})&lt;/code&gt; — the three digits at the end&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It's not totally trivial to work out which subcomponents are useful to tag,
but I think we could probably find pretty good heuristics that would do
a reasonable job, at least in simple cases. Once we know the groups, we
can potentially start to treat them as quasi-fields in their own right.
So in this case, if we had a field &lt;code&gt;ID&lt;/code&gt; containing string identifiers
like those shown, we might create from that three quasi fields as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ID_qf1&lt;/code&gt;, of type int, values &lt;code&gt;123&lt;/code&gt;, &lt;code&gt;12&lt;/code&gt;, &lt;code&gt;198&lt;/code&gt;, and &lt;code&gt;1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ID_qf2&lt;/code&gt;, of type string, values &lt;code&gt;AA&lt;/code&gt;, &lt;code&gt;DQ&lt;/code&gt;, &lt;code&gt;AA&lt;/code&gt;, and &lt;code&gt;BA&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ID_qf3&lt;/code&gt;, of type int, values &lt;code&gt;971&lt;/code&gt;, &lt;code&gt;802&lt;/code&gt;, &lt;code&gt;45&lt;/code&gt; and &lt;code&gt;834&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once we have these quasi fields, we can potentially subject &lt;em&gt;them&lt;/em&gt; to the
usual TDDA constraint generation process, which might suggest extra,
stronger constraints. For example, we might find that the numbers in
&lt;code&gt;ID_qf3&lt;/code&gt; are unique, or form a contiguous sequence, and we might find
that although our regular expression only specified that there were two
letters in the middle, in fact the only combinations found in the
(full) data were &lt;code&gt;AA&lt;/code&gt;, &lt;code&gt;DQ&lt;/code&gt;, &lt;code&gt;BA&lt;/code&gt; and &lt;code&gt;BB&lt;/code&gt;.&lt;/p&gt;
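&lt;p&gt;To make the quasi-field idea concrete, here is a minimal sketch using only Python's &lt;code&gt;re&lt;/code&gt; module. It is not part of Rexpy: the function &lt;code&gt;quasi_fields&lt;/code&gt; and its crude type rule (a group becomes an int field if every captured value is all digits) are assumptions made for illustration, and it assumes every value matches the pattern.&lt;/p&gt;

```python
import re

# Grouped pattern and sample identifiers from the example above.
pattern = re.compile(r'^(\d+)\-([A-Z]{2})\-(\d{3})$')
ids = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']


def quasi_fields(values, pattern, name='ID'):
    """Split each value into its groups, one quasi-field per group.

    Hypothetical helper: a group becomes an int field when every captured
    value consists only of digits, and stays a string field otherwise.
    Assumes every value matches the pattern.
    """
    rows = [pattern.match(v).groups() for v in values]
    fields = {}
    # zip(*rows) turns the per-value tuples into per-group columns.
    for i, column in enumerate(zip(*rows), start=1):
        if all(item.isdigit() for item in column):
            column = tuple(int(item) for item in column)
        fields['%s_qf%d' % (name, i)] = list(column)
    return fields


for field, column in quasi_fields(ids, pattern).items():
    print(field, column)
```

&lt;p&gt;Each resulting list could then be handed to the ordinary constraint discovery process as if it were a field in its own right.&lt;/p&gt;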
&lt;p&gt;I don't want to suggest that all of this is easy: there are three or four
non-trivial steps to get from where Rexpy is today to this full vision:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;First, Rexpy has to get better than it is today at merging related
    regular expressions into a useful single regular expression with
    optional components and alternations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Secondly, it would have to be able to identify good subcomponents
    for grouping.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Thirdly, it would have to do useful type inference on the
    groups it identifies.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Finally, it would have to be extended to create the quasi fields
    and apply the TDDA discovery process to them.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But none of this seems hopelessly difficult. So continue to watch this space.&lt;/p&gt;
&lt;h1 id="the-rexpy-api"&gt;The Rexpy API&lt;/h1&gt;
&lt;p&gt;Assuming you have cloned the TDDA library somewhere on your &lt;code&gt;PYTHONPATH&lt;/code&gt;,
you should then be able to use it through the API as follows:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rexpy&lt;/span&gt;

&lt;span class="n"&gt;corpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;123-AA-971&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;12-DQ-802&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;198-AA-045&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;1-BA-834&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rexpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Number of regular expressions found: &lt;/span&gt;&lt;span class="si"&gt;%d&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rex&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;   &amp;#39;&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;  &lt;span class="n"&gt;rex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;which produces:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;$ &lt;/span&gt;python ids.py
&lt;span class="go"&gt;Number of regular expressions found: 1&lt;/span&gt;
&lt;span class="go"&gt;   ^\d{1,3}\-[A-Z]{2}\-\d{3}$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In general, Rexpy returns a list of regular expressions, and at the moment
it is not very good at merging them. There's quite a lot of unused code
that, I hope, will soon allow it to do so. But even as it is, it can do
a reasonable job of characterizing simple strings. Within reason, the
more examples you give it, the better it can do, and it is reasonably
performant with hundreds of thousands or even millions of strings.&lt;/p&gt;
&lt;h2 id="rexpy-the-pandas-interface"&gt;Rexpy: the Pandas interface&lt;/h2&gt;
&lt;p&gt;There's also a Pandas binding. You can say:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rexpy&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rexpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdextract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;to find regular expressions for the strings in column &lt;code&gt;A&lt;/code&gt; of a dataframe &lt;code&gt;df&lt;/code&gt;,
or&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rexpy&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rexpy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdextract&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;to find a single set of regular expressions that match all the strings from
columns &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt; and &lt;code&gt;C&lt;/code&gt;. In all cases, &lt;code&gt;null&lt;/code&gt; (&lt;code&gt;numpy.nan&lt;/code&gt;) values
are ignored. The results are returned as a (Python) list of regular
expressions, as strings.&lt;/p&gt;
&lt;h2 id="final-words"&gt;Final Words&lt;/h2&gt;
&lt;p&gt;Take it for a spin and let us know how you get on.&lt;/p&gt;
&lt;p&gt;As always, follow us or tweet at us (&lt;a href="http://twitter.com/tdda0"&gt;@tdda0&lt;/a&gt;)
if you want to hear more, and watch out for the TDDA Slack
team, which will be opening up very soon.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:sol"&gt;
&lt;p&gt;More precisely, &lt;code&gt;^&lt;/code&gt; matches &lt;em&gt;start of line&lt;/em&gt;; by default,
   &lt;code&gt;rex&lt;/code&gt; always starts regular expressions with &lt;code&gt;^&lt;/code&gt; and finishes them
   with &lt;code&gt;$&lt;/code&gt; on the assumption that strings will be presented one-per-line.&amp;#160;&lt;a class="footnote-backref" href="#fnref:sol" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:eol"&gt;
&lt;p&gt;Again, more precisely, &lt;code&gt;$&lt;/code&gt; matches &lt;em&gt;end of line&lt;/em&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:eol" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:aka"&gt;
&lt;p&gt;Grouped regular expressions are also referred to variously as
&lt;em&gt;marked&lt;/em&gt; or &lt;em&gt;tagged&lt;/em&gt; regular expressions, and the &lt;em&gt;groups&lt;/em&gt; are
also sometimes known as &lt;em&gt;subexpressions&lt;/em&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:aka" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:raw"&gt;
&lt;p&gt;One thing to notice here is that in JSON we need extra backslashes
in the regular expression. This is because regular expressions themselves
make fairly liberal use of backslashes, and JSON uses backslash as
an escape character. We could avoid this in Python by using &lt;em&gt;raw strings&lt;/em&gt;,
which are introduced with an &lt;code&gt;r&lt;/code&gt; prefix
(e.g. &lt;code&gt;r'^(\d+)\-([A-Z]{2})\-(\d{3})$'&lt;/code&gt;).
In such strings, backslashes are not treated in any special way.
Since JSON has no equivalent mechanism, we have to escape all
our backslashes, leading to the ugliness above.&amp;#160;&lt;a class="footnote-backref" href="#fnref:raw" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="constraints"></category><category term="pandas"></category><category term="regular expressions"></category></entry><entry><title>The TDDA Constraints File Format</title><link href="https://tdda.info/the-tdda-constraints-file-format.html" rel="alternate"></link><published>2016-11-04T14:30:00+00:00</published><updated>2016-11-04T14:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2016-11-04:/the-tdda-constraints-file-format.html</id><summary type="html">&lt;h1 id="background"&gt;Background&lt;/h1&gt;
&lt;p&gt;We recently extended the &lt;a href="https://github.com/tdda/tdda"&gt;tdda library&lt;/a&gt;
to include support for automatic discovery of constraints from datasets,
and for verification of datasets against constraints. Yesterday's
post—&lt;a href="https://www.tdda.info/constraint-discovery-and-verification-for-pandas-dataframes"&gt;Constraint Discovery and Verification for Pandas DataFrames&lt;/a&gt;—describes these developments and the API.&lt;/p&gt;
&lt;p&gt;The library we published is intended to be a base for …&lt;/p&gt;</summary><content type="html">&lt;h1 id="background"&gt;Background&lt;/h1&gt;
&lt;p&gt;We recently extended the &lt;a href="https://github.com/tdda/tdda"&gt;tdda library&lt;/a&gt;
to include support for automatic discovery of constraints from datasets,
and for verification of datasets against constraints. Yesterday's
post—&lt;a href="https://www.tdda.info/constraint-discovery-and-verification-for-pandas-dataframes"&gt;Constraint Discovery and Verification for Pandas DataFrames&lt;/a&gt;—describes these developments and the API.&lt;/p&gt;
&lt;p&gt;The library we published is intended to be a base for producing various
implementations of the constraint discovery and verification process,
and uses a JSON file format (extension &lt;code&gt;.tdda&lt;/code&gt;) to save constraints in a form
that should be interchangeable between implementations.
We currently have two compatible implementations—the open-source
Pandas code in the library and the implementation in our own analytical
software, &lt;a href="https://StochasticSolutions.com/miro.html"&gt;Miró&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This post describes the &lt;code&gt;.tdda&lt;/code&gt; JSON file format. The bulk of it is
merely a snapshot of the documentation shipped with the library in the
GitHub repository (visible &lt;a href="https://github.com/tdda/tdda/blob/master/constraints/tdda_json_file_format.md"&gt;on
GitHub&lt;/a&gt;).
We intend to keep that file up to date as we expand the format.&lt;/p&gt;
&lt;h1 id="the-tdda-json-file-format"&gt;The TDDA JSON File Format&lt;/h1&gt;
&lt;p&gt;The TDDA constraints library (Repository
&lt;a href="https://github.com/tdda/tdda"&gt;https://github.com/tdda/tdda&lt;/a&gt;,
module &lt;a href="https://github.com/tdda/tdda/tree/master/constraints"&gt;constraints&lt;/a&gt;)
uses a JSON file to store constraints.&lt;/p&gt;
&lt;p&gt;This document describes that file format.&lt;/p&gt;
&lt;h1 id="purpose"&gt;Purpose&lt;/h1&gt;
&lt;p&gt;TDDA files describe &lt;em&gt;constraints&lt;/em&gt; on a &lt;em&gt;dataset,&lt;/em&gt; with a view to
&lt;em&gt;verifying&lt;/em&gt; the dataset to check whether any or all of the specified
constraints are satisfied.&lt;/p&gt;
&lt;p&gt;A dataset is assumed to consist of one or more &lt;em&gt;fields&lt;/em&gt; (also known
as columns), each of which has a (different) name&lt;sup id="fnref:PandasColNames"&gt;&lt;a class="footnote-ref" href="#fn:PandasColNames"&gt;1&lt;/a&gt;&lt;/sup&gt;
and a well-defined type.&lt;sup id="fnref:PandasTypes"&gt;&lt;a class="footnote-ref" href="#fn:PandasTypes"&gt;2&lt;/a&gt;&lt;/sup&gt;
Each field has a &lt;em&gt;value&lt;/em&gt; for each of a number of &lt;em&gt;records&lt;/em&gt; (also
known as rows). In some cases, values may be &lt;code&gt;null&lt;/code&gt; (or missing).&lt;sup id="fnref:PandasNulls"&gt;&lt;a class="footnote-ref" href="#fn:PandasNulls"&gt;3&lt;/a&gt;&lt;/sup&gt;
Even a field consisting entirely of nulls can be considered to have
a type.&lt;/p&gt;
&lt;p&gt;Familiar examples of datasets include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tables in relational databases&lt;/li&gt;
&lt;li&gt;DataFrames (in Pandas and R)&lt;/li&gt;
&lt;li&gt;flat ("CSV") files (subject to type inference or assignment)&lt;/li&gt;
&lt;li&gt;sheets in spreadsheets, or areas within sheets,
    if the columns have names, are not merged, and have values
    with consistent meanings and types over an entire column&lt;/li&gt;
&lt;li&gt;more generally, many forms of tabular data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In principle, TDDA files are intended to be capable of supporting any kind
of constraint regarding datasets. Today, we are primarily concerned with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;field types&lt;/li&gt;
&lt;li&gt;minimum and maximum values (or in the case of string fields,
    minimum and maximum string lengths)&lt;/li&gt;
&lt;li&gt;whether nulls are allowed&lt;/li&gt;
&lt;li&gt;whether duplicate values are allowed within a field&lt;/li&gt;
&lt;li&gt;the allowed values for a field.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The format also has support for describing relations between fields.&lt;/p&gt;
&lt;p&gt;Future extensions we can already foresee include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;dataset-level constraints (e.g. numbers of records;
    required or disallowed fields)&lt;/li&gt;
&lt;li&gt;sortedness of fields, of field values or both&lt;/li&gt;
&lt;li&gt;regular expressions to which string fields should conform&lt;/li&gt;
&lt;li&gt;constraints on subsets of the data (e.g. &lt;em&gt;records dated after July
    2016 should not have null values for the ID field&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;constraints on substructure within fields (e.g. constraints on tagged
    subexpressions from regular expressions to which string fields are
    expected to conform)&lt;/li&gt;
&lt;li&gt;potentially checksums (though this is more suitable for checking
    the integrity of transfer of a specific dataset, than for use
    across multiple related datasets)&lt;/li&gt;
&lt;li&gt;constraints between datasets, most obviously key relations (e.g.
    &lt;em&gt;every value in field KA in dataset A should also occur in field KB
    in dataset B&lt;/em&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The motivation for generating, storing and verifying datasets against
such sets of constraints is that they can provide a powerful way of
detecting bad or unexpected inputs to, or outputs from, a data analysis
process. They can also be valuable as checks on intermediate results.
While manually generating constraints can be burdensome, automatic
discovery of constraints from example datasets, potentially followed
by manual removal of over-specific constraints, provides a good cost/benefit
ratio in many situations.&lt;/p&gt;
&lt;h2 id="filename-and-encoding"&gt;Filename and encoding&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The preferred extension for TDDA Constraints files is &lt;code&gt;.tdda&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;TDDA constraints files must be encoded as UTF-8.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;TDDA files must be valid JSON.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
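&lt;p&gt;As a minimal sketch of what these three rules imply for a reader
(assuming Python 3; &lt;code&gt;load_tdda&lt;/code&gt; is a hypothetical helper, not part of
the library), a file can be opened as UTF-8 and parsed with the standard
&lt;code&gt;json&lt;/code&gt; module:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import json
import os
import tempfile


def load_tdda(path):
    """Read a TDDA constraints file: UTF-8 text that must parse as JSON."""
    with open(path, encoding="utf-8") as f:
        constraints = json.load(f)  # raises json.JSONDecodeError if invalid
    if not isinstance(constraints, dict):
        raise ValueError("a TDDA file should contain a JSON dictionary")
    return constraints


# Round-trip a tiny constraints file through a temporary directory.
example = {"fields": {"a": {"type": "int", "max_nulls": 0}}}
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.tdda")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(example, f, indent=4)
    loaded = load_tdda(path)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;After the round trip, &lt;code&gt;loaded&lt;/code&gt; is equal to &lt;code&gt;example&lt;/code&gt;.&lt;/p&gt;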
&lt;h2 id="example"&gt;Example&lt;/h2&gt;
&lt;p&gt;This is an extremely simple example TDDA file:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;{
    &amp;quot;fields&amp;quot;: {
        &amp;quot;a&amp;quot;: {
            &amp;quot;type&amp;quot;: &amp;quot;int&amp;quot;,
            &amp;quot;min&amp;quot;: 1,
            &amp;quot;max&amp;quot;: 9,
            &amp;quot;sign&amp;quot;: &amp;quot;positive&amp;quot;,
            &amp;quot;max_nulls&amp;quot;: 0,
            &amp;quot;no_duplicates&amp;quot;: true
        },
        &amp;quot;b&amp;quot;: {
            &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
            &amp;quot;min_length&amp;quot;: 3,
            &amp;quot;max_length&amp;quot;: 3,
            &amp;quot;max_nulls&amp;quot;: 1,
            &amp;quot;no_duplicates&amp;quot;: true,
            &amp;quot;allowed_values&amp;quot;: [
                &amp;quot;one&amp;quot;,
                &amp;quot;two&amp;quot;
            ]
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id="general-structure"&gt;General Structure&lt;/h2&gt;
&lt;p&gt;A TDDA file is a JSON dictionary.
There are currently two supported top-level keys:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;fields&lt;/code&gt;: constraints for individual fields, keyed on the field name.
   (In TDDA, we generally refer to dataset columns as &lt;em&gt;fields&lt;/em&gt;.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;field_groups&lt;/code&gt;: constraints specifying relations between multiple
   fields (two, for now). &lt;code&gt;field_groups&lt;/code&gt; constraints are keyed
   on a comma-separated list of the names of the fields to which they
   relate, and order is significant.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both top-level keys are optional.&lt;/p&gt;
&lt;p&gt;In future, we expect to add further top-level keys (e.g. for
possible constraints on the number of rows,
required or disallowed fields etc.)&lt;/p&gt;
&lt;p&gt;The order of constraints in the file is immaterial (of course; this is JSON),
though writers may choose to present fields in a particular order,
e.g. dataset order or sorted on fieldname.&lt;/p&gt;
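&lt;p&gt;The structure described above can be sketched as follows. The helper names
are hypothetical and not part of the library; the sketch also assumes that
field names do not themselves contain commas.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;KNOWN_TOP_LEVEL_KEYS = {"fields", "field_groups"}


def unknown_top_level_keys(constraints):
    """Return, sorted, any top-level keys this sketch does not recognize."""
    return sorted(set(constraints) - KNOWN_TOP_LEVEL_KEYS)


def field_group_names(key):
    """Split a field_groups key into its field names, preserving order."""
    return key.split(",")


constraints = {
    "fields": {"StartDate": {"type": "date"}, "EndDate": {"type": "date"}},
    "field_groups": {"StartDate,EndDate": {"lt": True}},
}
unknown = unknown_top_level_keys(constraints)
groups = [field_group_names(key) for key in constraints["field_groups"]]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here &lt;code&gt;unknown&lt;/code&gt; is empty and &lt;code&gt;groups&lt;/code&gt; holds the single pair
&lt;code&gt;["StartDate", "EndDate"]&lt;/code&gt;, in the order given in the key.&lt;/p&gt;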
&lt;h1 id="field-constraints"&gt;Field Constraints&lt;/h1&gt;
&lt;p&gt;The value of a field constraints entry (in the &lt;code&gt;fields&lt;/code&gt; section)
is a dictionary keyed on constraint &lt;em&gt;kind&lt;/em&gt;.
For example, the constraints on field &lt;code&gt;a&lt;/code&gt; in the example above are
specified as:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;quot;a&amp;quot;: {
    &amp;quot;type&amp;quot;: &amp;quot;int&amp;quot;,
    &amp;quot;min&amp;quot;: 1,
    &amp;quot;max&amp;quot;: 9,
    &amp;quot;sign&amp;quot;: &amp;quot;positive&amp;quot;,
    &amp;quot;max_nulls&amp;quot;: 0,
    &amp;quot;no_duplicates&amp;quot;: true
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The TDDA library currently recognizes the following &lt;em&gt;kind&lt;/em&gt;s of constraints:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;type&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;min&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;min_length&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_length&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sign&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_nulls&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;no_duplicates&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;allowed_values&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other constraint libraries are free to define their own, custom kinds
of constraints. We will probably recommend that non-standard
constraints have names beginning with a colon-terminated prefix. For
example, if we wanted to support more specific Pandas type
constraints, we would probably use a key such as &lt;code&gt;pandas:type&lt;/code&gt; for
this.&lt;/p&gt;
&lt;p&gt;The value of a constraint is often simply a scalar value, but can be a list
or a dictionary; when it is a dictionary, it should include a key &lt;code&gt;value&lt;/code&gt;,
with the principal value associated with the constraint (&lt;code&gt;true&lt;/code&gt;, if there
is no specific value beyond the name of the constraint).&lt;/p&gt;
&lt;p&gt;If the value of a constraint (the scalar value, or the &lt;code&gt;value&lt;/code&gt; key if the
value is a dictionary) is &lt;code&gt;null&lt;/code&gt;, this is taken to indicate
the absence of a constraint. A constraint with value &lt;code&gt;null&lt;/code&gt; should be
completely ignored, so that a constraints file including &lt;code&gt;null&lt;/code&gt;-valued
constraints should produce identical results to one omitting those constraints.
(This obviously means that we are discouraging using &lt;code&gt;null&lt;/code&gt; as a meaningful
constraint value, though a string &lt;code&gt;"null"&lt;/code&gt; is fine, and in fact we use this
for &lt;code&gt;sign&lt;/code&gt; constraints.)&lt;/p&gt;
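&lt;p&gt;The two rules above (dictionary-valued constraints carry their principal
value under &lt;code&gt;value&lt;/code&gt;, and &lt;code&gt;null&lt;/code&gt;-valued constraints are ignored) can
be sketched like this. The function names are hypothetical, and treating a
dictionary with no &lt;code&gt;value&lt;/code&gt; key as absent is an assumption of the sketch
rather than something the format specifies.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;def constraint_value(constraint):
    """Return the principal value of a constraint (scalar, list or dict)."""
    if isinstance(constraint, dict):
        return constraint.get("value")
    return constraint


def drop_null_constraints(field_constraints):
    """Remove constraints whose value is null (None once parsed from JSON)."""
    return {
        kind: constraint
        for kind, constraint in field_constraints.items()
        if constraint_value(constraint) is not None
    }


with_nulls = {
    "type": "real",
    "min": {"value": 3.4, "precision": "fuzzy"},
    "max": None,                   # JSON null: no constraint
    "sign": "null",                # the string "null": a real sign constraint
    "max_nulls": {"value": None},  # null inside a dictionary: no constraint
}
cleaned = drop_null_constraints(with_nulls)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Only &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;min&lt;/code&gt; and &lt;code&gt;sign&lt;/code&gt; survive in &lt;code&gt;cleaned&lt;/code&gt;;
the string &lt;code&gt;"null"&lt;/code&gt; is kept because it is a value, not the absence of one.&lt;/p&gt;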
&lt;p&gt;The semantics and values of the standard field constraint types are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;type&lt;/code&gt;: the allowed (standard, TDDA) type of the field.
    This can be a single value from &lt;code&gt;bool&lt;/code&gt; (boolean),
    &lt;code&gt;int&lt;/code&gt; (integer; whole-numbered); &lt;code&gt;real&lt;/code&gt; (floating point values);
    &lt;code&gt;string&lt;/code&gt; (unicode in Python3; byte string in Python2) or &lt;code&gt;date&lt;/code&gt;
    (any kind of date or date time, with or without timezone information).
    It can also be a list of such allowed values (in which case, order
    is not significant).&lt;/p&gt;
&lt;p&gt;It is up to the generation and verification libraries to map between
the actual types in whatever dataset/dataframe/table/... object is
used and these TDDA constraint types, though over time we may provide
further guidance.&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{"type": "int"}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{"type": ["int", "real"]}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;min&lt;/code&gt;: the minimum allowed value for a field. This is often
    a simple value, but in the case of real fields, it can be
    convenient to specify a level of precision. In particular,
    a minimum value can have &lt;code&gt;precision&lt;/code&gt; (default: &lt;code&gt;fuzzy&lt;/code&gt;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;closed&lt;/code&gt;: all non-null values in the field must be
    greater than or equal to the value specified.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;open&lt;/code&gt;: all non-null values in the field must be
    strictly greater than the value specified.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fuzzy&lt;/code&gt;: when the precision is specified as &lt;em&gt;fuzzy&lt;/em&gt;,
    the verifier should allow a small degree of violation
    of the constraint without generating a failure.
    Verifiers take a parameter,
    &lt;code&gt;epsilon&lt;/code&gt;, which specifies how fuzzy the constraints
    should be taken to be: epsilon is a fraction of the
    constraint value by which field values are allowed
    to exceed the constraint without being considered
    to fail the constraint. This defaults to 0.01 (i.e. 1%).
    Notice that this means that constraint values of zero
    are never fuzzy.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Examples are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{"min": 1}&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{"min": 1.2}&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{"min": {"value": 3.4, "precision": "fuzzy"}}&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;JSON does not, of course, have a date type.
TDDA files specifying dates should use string representations
in one of the following formats:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;YYYY-MM-DD&lt;/code&gt; for dates without times&lt;/li&gt;
&lt;li&gt;&lt;code&gt;YYYY-MM-DD hh:mm:ss&lt;/code&gt; for date-times without timezone&lt;/li&gt;
&lt;li&gt;&lt;code&gt;YYYY-MM-DD hh:mm:ss [+-]ZZZZ&lt;/code&gt; for date-times with timezone.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We recommend that writers use precisely these formats, but that
readers offer some flexibility in reading, e.g. accepting &lt;code&gt;/&lt;/code&gt;
as well as &lt;code&gt;-&lt;/code&gt; to separate date components, and
&lt;code&gt;T&lt;/code&gt; as well as space to separate the time component from the date.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;max&lt;/code&gt;: the maximum allowed value for a field. Much like &lt;code&gt;min&lt;/code&gt;,
    but for maximum values.
    Examples are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{"max": 1}&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{"max": 1.2}&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{"max": {"value": 3.4, "precision": "closed"}}&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dates should be formatted as for &lt;code&gt;min&lt;/code&gt; values.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;min_length&lt;/code&gt;: the minimum allowed length of strings in a string field.
    How unicode strings are counted is up to the implementation.
    Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{"min_length": 2}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;max_length&lt;/code&gt;: the maximum allowed length of strings in a string field.
    How unicode strings are counted is up to the implementation.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{"max_length": 22}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;sign&lt;/code&gt;: For numeric fields, the allowed sign of (non-null)
    values.  Although this overlaps with minimum and maximum
    values, it is often useful to have a separate sign constraint,
    which carries semantically different information. Allowed
    values are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;positive&lt;/code&gt;: All values must be greater than zero&lt;/li&gt;
&lt;li&gt;&lt;code&gt;non-negative&lt;/code&gt;: No value may be less than zero&lt;/li&gt;
&lt;li&gt;&lt;code&gt;zero&lt;/code&gt;: All values must be zero&lt;/li&gt;
&lt;li&gt;&lt;code&gt;non-positive&lt;/code&gt;: No value may be greater than zero&lt;/li&gt;
&lt;li&gt;&lt;code&gt;negative&lt;/code&gt;: All values must be negative&lt;/li&gt;
&lt;li&gt;&lt;code&gt;null&lt;/code&gt;: No signed values are allowed, i.e. the field must
     be entirely null.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{"sign": "non-negative"}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;max_nulls&lt;/code&gt;: The maximum number of nulls allowed in the field.
    This can be any non-negative value. We recommend only writing
    values of zero (no null values are allowed) or 1 (at most a
    single null is allowed) into this constraint, but checking
    against any value found.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{"max_nulls": 0}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;no_duplicates&lt;/code&gt;: When this constraint is set on a field
    (with value &lt;code&gt;true&lt;/code&gt;), it means that each non-null value must occur
    only once in the field. The current implementation only uses
    this constraint for string fields.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{"no_duplicates": true}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;allowed_values&lt;/code&gt;: The value of this constraint is a list of
    allowed values for the field. The order is not significant.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{"allowed_values": ["red", "green", "blue"]}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
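&lt;p&gt;Two of the rules in the list above, fuzzy bounds and date strings, can be
sketched as follows. Both helpers are hypothetical. In particular, taking the
tolerance as &lt;code&gt;epsilon&lt;/code&gt; times the magnitude of the constraint value is an
assumption about how negative bounds should behave, and the date parser only
illustrates the suggested flexibility for readers.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from datetime import datetime

# The three recommended formats, most specific first.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S %z", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def fuzzy_min_threshold(limit, epsilon=0.01):
    """Lowest value accepted by a fuzzy min constraint.

    Assumption: the tolerance is epsilon times the magnitude of the limit.
    """
    return limit - abs(limit) * epsilon


def parse_tdda_date(text):
    """Parse a date string, also accepting '/' and 'T' as separators."""
    normalized = text.strip().replace("/", "-").replace("T", " ")
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(normalized, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized date: %r" % text)


threshold_for_zero = fuzzy_min_threshold(0)
start = parse_tdda_date("2016/11/03")
stamp = parse_tdda_date("2016-11-03T15:30:00")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Because the tolerance is proportional to the constraint value,
&lt;code&gt;fuzzy_min_threshold(0)&lt;/code&gt; is still zero, which is the sense in which
constraint values of zero are never fuzzy.&lt;/p&gt;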
&lt;h1 id="multifield-constraints"&gt;MultiField Constraints&lt;/h1&gt;
&lt;p&gt;Multifield constraints are not yet being generated by this implementation,
though our (proprietary) &lt;a href="https://StochasticSolutions.com/miro.html"&gt;Miró&lt;/a&gt;
implementation does produce them. The
currently planned constraint types for field relations cover field
equality and inequality for pairs of fields, with options to specify
null relations too.&lt;/p&gt;
&lt;p&gt;A simple example would be:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;quot;field_groups&amp;quot;: {
    &amp;quot;StartDate,EndDate&amp;quot;: {&amp;quot;lt&amp;quot;: true}
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is a &lt;em&gt;less-than&lt;/em&gt; constraint, to be interpreted as&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;StartDate &amp;lt; EndDate&lt;/code&gt; wherever &lt;code&gt;StartDate&lt;/code&gt; and &lt;code&gt;EndDate&lt;/code&gt; are both
    non-null.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The plan is to support the obvious five equality and inequality relations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;lt&lt;/code&gt;: first field value is strictly less than the second field value
    for each record&lt;/li&gt;
&lt;li&gt;&lt;code&gt;lte&lt;/code&gt;: first field value is less than or equal to the second field value
    for each record&lt;/li&gt;
&lt;li&gt;&lt;code&gt;eq&lt;/code&gt;: first field value is equal to the second field value
    for each record&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gte&lt;/code&gt;: first field value is greater than or equal to the second field
    value for each record&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gt&lt;/code&gt;: first field value is strictly greater than the second field
    value for each record.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the case of equality (only), we will probably also support a &lt;code&gt;precision&lt;/code&gt;
parameter with values &lt;code&gt;fuzzy&lt;/code&gt; or &lt;code&gt;precise&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There should probably also be an option to specify relations between null
values in pairs of columns, either as a separate constraint or as a qualifier
on each of the above.&lt;/p&gt;
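&lt;p&gt;As an illustration only, since these constraints are still at the planning
stage, the five relations map naturally onto Python's &lt;code&gt;operator&lt;/code&gt; module.
The function below is a hypothetical sketch that represents nulls as
&lt;code&gt;None&lt;/code&gt; and, following the interpretation given for &lt;code&gt;lt&lt;/code&gt;, only
checks records in which both values are non-null.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import operator

RELATIONS = {
    "lt": operator.lt,
    "lte": operator.le,
    "eq": operator.eq,
    "gte": operator.ge,
    "gt": operator.gt,
}


def relation_holds(kind, first_values, second_values):
    """True if the relation holds wherever both values are non-null."""
    compare = RELATIONS[kind]
    return all(
        compare(first, second)
        for first, second in zip(first_values, second_values)
        if first is not None and second is not None
    )


# ISO-formatted date strings compare correctly as plain strings.
start_dates = ["2016-01-01", "2016-03-15", None]
end_dates = ["2016-02-01", "2016-03-16", "2016-12-31"]
ok = relation_holds("lt", start_dates, end_dates)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The third record is skipped because its start date is null, so &lt;code&gt;ok&lt;/code&gt;
is &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;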
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:PandasColNames"&gt;
&lt;p&gt;Pandas, of course, allows multiple columns to have
the same name. This format makes no concessions to such madness, though
there is nothing to stop a verifier or generator sharing constraints
across all columns with the same name. The Pandas generators and verifiers
in this library do not currently attempt to do this.&amp;#160;&lt;a class="footnote-backref" href="#fnref:PandasColNames" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:PandasTypes"&gt;
&lt;p&gt;Pandas also allows columns of mixed type. Again, this
file format does not recognize such columns, and it would probably be
sensible not to use type constraints for columns of mixed type.&amp;#160;&lt;a class="footnote-backref" href="#fnref:PandasTypes" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:PandasNulls"&gt;
&lt;p&gt;Pandas uses &lt;em&gt;not-a-number&lt;/em&gt; (&lt;code&gt;numpy.nan&lt;/code&gt;) to represent
&lt;code&gt;null&lt;/code&gt; values for numeric, string and boolean fields; it uses a special
&lt;em&gt;not-a-time&lt;/em&gt; (&lt;code&gt;pandas.NaT&lt;/code&gt;) value to represent null date (&lt;code&gt;Timestamp&lt;/code&gt;) values.&amp;#160;&lt;a class="footnote-backref" href="#fnref:PandasNulls" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="constraints"></category><category term="pandas"></category></entry><entry><title>Constraint Discovery and Verification for Pandas DataFrames</title><link href="https://tdda.info/constraint-discovery-and-verification-for-pandas-dataframes.html" rel="alternate"></link><published>2016-11-03T15:30:00+00:00</published><updated>2016-11-03T15:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2016-11-03:/constraint-discovery-and-verification-for-pandas-dataframes.html</id><summary type="html">&lt;h1 id="background"&gt;Background&lt;/h1&gt;
&lt;p&gt;In a previous post,
&lt;a href="https://www.tdda.info/constraints-and-assertions"&gt;Constraints and Assertions&lt;/a&gt;,
we introduced the idea of using constraints to verify
input, output and intermediate datasets for an analytical process.
We also demonstrated that candidate constraints can be automatically
generated from example datasets. We prototyped this in
our own software (&lt;a href="https://stochasticsolutions.com/miro.html"&gt;Miró&lt;/a&gt;)
expressing constraints as …&lt;/p&gt;</summary><content type="html">&lt;h1 id="background"&gt;Background&lt;/h1&gt;
&lt;p&gt;In a previous post,
&lt;a href="https://www.tdda.info/constraints-and-assertions"&gt;Constraints and Assertions&lt;/a&gt;,
we introduced the idea of using constraints to verify
input, output and intermediate datasets for an analytical process.
We also demonstrated that candidate constraints can be automatically
generated from example datasets. We prototyped this in
our own software (&lt;a href="https://stochasticsolutions.com/miro.html"&gt;Miró&lt;/a&gt;)
expressing constraints as lisp S-expressions.&lt;/p&gt;
&lt;h1 id="improving-and-extending-the-approach-open-source-pandas-code"&gt;Improving and Extending the Approach: Open-Source Pandas Code&lt;/h1&gt;
&lt;p&gt;We have now taken the core ideas, polished them a little and made them
available through an &lt;a href="https://github.com/tdda/tdda"&gt;open-source library&lt;/a&gt;,
currently on GitHub.
We will push it to PyPI when it has solidified a little further.&lt;/p&gt;
&lt;p&gt;The constraint code I'm referring to is available in the &lt;code&gt;constraints&lt;/code&gt;
module of the &lt;code&gt;tdda&lt;/code&gt; repository for the &lt;code&gt;tdda&lt;/code&gt; user on GitHub. So if you
issue the command&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;git clone https://github.com/tdda/tdda.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;in a directory somewhere on your &lt;code&gt;PYTHONPATH&lt;/code&gt;,
this should enable you to use it.&lt;/p&gt;
&lt;p&gt;The TDDA library:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;is under an &lt;a href="https://github.com/tdda/tdda/blob/master/LICENSE"&gt;MIT&lt;/a&gt;
    licence;&lt;/li&gt;
&lt;li&gt;runs under Python2 and Python3;&lt;/li&gt;
&lt;li&gt;other than Pandas itself, has no dependencies outside
    the standard library unless you want
    to use &lt;code&gt;feather&lt;/code&gt; files (see below);&lt;/li&gt;
&lt;li&gt;includes a base layer to help with building constraint verification
    and discovery libraries for various systems;&lt;/li&gt;
&lt;li&gt;includes Pandas implementations of constraint discovery and verification
    through a (Python) API;&lt;/li&gt;
&lt;li&gt;uses a new JSON format (normally stored in &lt;code&gt;.tdda&lt;/code&gt; files) for saving
    constraints;&lt;/li&gt;
&lt;li&gt;also includes a prototype command-line tool for verifying
    a dataframe stored in
    &lt;a href="https://pypi.python.org/pypi/feather-format/"&gt;feather&lt;/a&gt;
    format against a &lt;code&gt;.tdda&lt;/code&gt; file of constraints.
    Feather is a file format developed by
    &lt;a href="https://wesmckinney.com"&gt;Wes McKinney&lt;/a&gt;
    (the original creator of Pandas)
    and &lt;a href="https://hadley.nz"&gt;Hadley Wickham&lt;/a&gt; (of ggplot and tidyverse fame) for
    dataframes that allows interchange between R and Pandas while
    preserving type information and exact values. It is based on the
    &lt;a href="https://arrow.apache.org"&gt;Apache Arrow&lt;/a&gt; project.
    It can be used directly or using our
    ugly-but-useful extension library &lt;a href="https://github.com/tdda/pmmif"&gt;pmmif&lt;/a&gt;,
    which allows extra metadata (including extended type information)
    to be saved alongside a &lt;code&gt;.feather&lt;/code&gt; file, in a companion &lt;code&gt;.pmm&lt;/code&gt; file.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="testing"&gt;Testing&lt;/h1&gt;
&lt;p&gt;All the constraint-handling code is in the constraints module within
the TDDA repository.&lt;/p&gt;
&lt;p&gt;After you've cloned the repository, it's probably a good idea to run
the tests. There are two sets, and both should run under Python2 or Python3.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ cd tdda/constraints&lt;/span&gt;
&lt;span class="c"&gt;$ python testbase&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;py&lt;/span&gt;
&lt;span class="nt"&gt;.....&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 5 tests in 0&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;003s&lt;/span&gt;

&lt;span class="c"&gt;OK&lt;/span&gt;

&lt;span class="c"&gt;$ python testpdconstraints&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;py&lt;/span&gt;
&lt;span class="nt"&gt;.....................&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 21 tests in 0&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;123s&lt;/span&gt;

&lt;span class="c"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There is example code (which we'll walk through below) in the &lt;code&gt;examples&lt;/code&gt;
subdirectory of &lt;code&gt;constraints&lt;/code&gt;.&lt;/p&gt;
&lt;h1 id="basic-use"&gt;Basic Use&lt;/h1&gt;
&lt;h2 id="constraint-discovery"&gt;Constraint Discovery&lt;/h2&gt;
&lt;p&gt;Here is some minimal code for getting the software to &lt;em&gt;discover&lt;/em&gt; constraints
satisfied by a Pandas DataFrame:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.constraints.pdconstraints&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;discover_constraints&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;one&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;two&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;span class="n"&gt;constraints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;discover_constraints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/tmp/example_constraints.tdda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(This is the core of the example code in
&lt;code&gt;tdda/constraints/examples/simple_discovery.py&lt;/code&gt;, and is included in
the docstring for the &lt;code&gt;discover_constraints&lt;/code&gt; function.)&lt;/p&gt;
&lt;p&gt;Hopefully the code is fairly self-explanatory, but walking through
the lines after the imports:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We first generate a trivial 3-row DataFrame with an integer column &lt;code&gt;a&lt;/code&gt;
    and a string column &lt;code&gt;b&lt;/code&gt;. The string column includes a Pandas null (&lt;code&gt;NaN&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;We then pass that DataFrame to the &lt;code&gt;discover_constraints&lt;/code&gt; function
    from &lt;code&gt;tdda.constraints.pdconstraints&lt;/code&gt;, and it returns a
    &lt;code&gt;DatasetConstraints&lt;/code&gt; object, which is defined in &lt;code&gt;tdda.constraints.base&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The resulting constraints object has a &lt;code&gt;to_json()&lt;/code&gt; method which converts
    the structured constraints into a JSON string.&lt;/li&gt;
&lt;li&gt;In the example, we write that to &lt;code&gt;/tmp/example_constraints.tdda&lt;/code&gt;;
    we encourage everyone to use the &lt;code&gt;.tdda&lt;/code&gt; extension for these JSON
    constraint files.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is what happens if we run the example file:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ &lt;span class="nb"&gt;cd&lt;/span&gt; tdda/constraints/examples

$ python simple_discovery.py
Written /tmp/example_constraints.tdda successfully.

$ cat /tmp/example_constraints.tdda
&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;fields&amp;quot;&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;a&amp;quot;&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;type&amp;quot;&lt;/span&gt;: &lt;span class="s2"&gt;&amp;quot;int&amp;quot;&lt;/span&gt;,
            &lt;span class="s2"&gt;&amp;quot;min&amp;quot;&lt;/span&gt;: &lt;span class="m"&gt;1&lt;/span&gt;,
            &lt;span class="s2"&gt;&amp;quot;max&amp;quot;&lt;/span&gt;: &lt;span class="m"&gt;9&lt;/span&gt;,
            &lt;span class="s2"&gt;&amp;quot;sign&amp;quot;&lt;/span&gt;: &lt;span class="s2"&gt;&amp;quot;positive&amp;quot;&lt;/span&gt;,
            &lt;span class="s2"&gt;&amp;quot;max_nulls&amp;quot;&lt;/span&gt;: &lt;span class="m"&gt;0&lt;/span&gt;,
            &lt;span class="s2"&gt;&amp;quot;no_duplicates&amp;quot;&lt;/span&gt;: &lt;span class="nb"&gt;true&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;,
        &lt;span class="s2"&gt;&amp;quot;b&amp;quot;&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;type&amp;quot;&lt;/span&gt;: &lt;span class="s2"&gt;&amp;quot;string&amp;quot;&lt;/span&gt;,
            &lt;span class="s2"&gt;&amp;quot;min_length&amp;quot;&lt;/span&gt;: &lt;span class="m"&gt;3&lt;/span&gt;,
            &lt;span class="s2"&gt;&amp;quot;max_length&amp;quot;&lt;/span&gt;: &lt;span class="m"&gt;3&lt;/span&gt;,
            &lt;span class="s2"&gt;&amp;quot;max_nulls&amp;quot;&lt;/span&gt;: &lt;span class="m"&gt;1&lt;/span&gt;,
            &lt;span class="s2"&gt;&amp;quot;no_duplicates&amp;quot;&lt;/span&gt;: true,
            &lt;span class="s2"&gt;&amp;quot;allowed_values&amp;quot;&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;one&amp;quot;&lt;/span&gt;,
                &lt;span class="s2"&gt;&amp;quot;two&amp;quot;&lt;/span&gt;
            &lt;span class="o"&gt;]&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, in this case, the system has 'discovered' six constraints
for each field, and it's rather easy to read what they are and at least
roughly what they mean. We'll do a separate post describing the &lt;code&gt;.tdda&lt;/code&gt;
JSON file format, but it's documented in
&lt;code&gt;tdda/constraints/tdda_json_file_format.md&lt;/code&gt; in the repository
(which, through the almost unfathomable power of GitHub, means you can see it
formatted
&lt;a href="https://github.com/tdda/tdda/blob/master/constraints/tdda_json_file_format.md"&gt;here&lt;/a&gt;).&lt;/p&gt;
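&lt;p&gt;Since a &lt;code&gt;.tdda&lt;/code&gt; file is plain JSON, the count of constraints per
field can be checked with nothing more than the standard library. This small
sketch parses a copy of the output shown above, reformatted slightly to keep
it short:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import json

tdda_text = """
{
    "fields": {
        "a": {"type": "int", "min": 1, "max": 9, "sign": "positive",
              "max_nulls": 0, "no_duplicates": true},
        "b": {"type": "string", "min_length": 3, "max_length": 3,
              "max_nulls": 1, "no_duplicates": true,
              "allowed_values": ["one", "two"]}
    }
}
"""

constraints = json.loads(tdda_text)
# Each field maps constraint kind to value, so len() counts the constraints.
counts = {name: len(kinds) for name, kinds in constraints["fields"].items()}
print(counts)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Both &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; report six constraints.&lt;/p&gt;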
&lt;h2 id="constraint-verification"&gt;Constraint Verification&lt;/h2&gt;
&lt;p&gt;Now that we have a &lt;code&gt;.tdda&lt;/code&gt; file, we can use it to verify a DataFrame.&lt;/p&gt;
&lt;p&gt;First, let's look at code that should lead to a successful verification
(this code is in &lt;code&gt;tdda/constraints/examples/simple_verify_pass.py&lt;/code&gt;).&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.constraints.pdconstraints&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;verify_df&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;one&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;verify_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;example_constraints.tdda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Passes: &lt;/span&gt;&lt;span class="si"&gt;%d&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Failures: &lt;/span&gt;&lt;span class="si"&gt;%d&lt;/span&gt;&lt;span class="se"&gt;\n\n\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_frame&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Again, hopefully the code is fairly self-explanatory, but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;df&lt;/code&gt; is a DataFrame that is different from the one we used to
    generate the constraints, but is nevertheless consistent
    with the constraints.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;verify_df&lt;/code&gt; function from &lt;code&gt;tdda.constraints.pdconstraints&lt;/code&gt;
    takes a DataFrame and the location of a &lt;code&gt;.tdda&lt;/code&gt; file and verifies
    the DataFrame against the constraints in the file.
    The function returns a &lt;code&gt;PandasVerification&lt;/code&gt; object.
    The &lt;code&gt;PandasVerification&lt;/code&gt; class is a subclass of &lt;code&gt;Verification&lt;/code&gt;
    from &lt;code&gt;tdda.constraints.base&lt;/code&gt;, adding the ability to turn
    the verification object into a DataFrame.&lt;/li&gt;
&lt;li&gt;All verification objects include &lt;code&gt;passes&lt;/code&gt; and &lt;code&gt;failures&lt;/code&gt; attributes,
    which respectively indicate the number of constraints that passed
    and the number that failed. So the simplest complete verification
    is simply to check that &lt;code&gt;v.failures == 0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Verification objects also include a &lt;code&gt;__str__&lt;/code&gt; method.
    Its output currently includes a section for fields and a summary.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;.to_frame()&lt;/code&gt; method converts a &lt;code&gt;PandasVerification&lt;/code&gt;
    into a Pandas DataFrame.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If we run this, the result is as follows:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python simple_verify_pass.py
Passes: &lt;span class="m"&gt;12&lt;/span&gt;
Failures: &lt;span class="m"&gt;0&lt;/span&gt;



FIELDS:

a: &lt;span class="m"&gt;0&lt;/span&gt; failures  &lt;span class="m"&gt;6&lt;/span&gt; passes  &lt;span class="nb"&gt;type&lt;/span&gt; ✓  min ✓  max ✓  sign ✓  max_nulls ✓  no_duplicates ✓

b: &lt;span class="m"&gt;0&lt;/span&gt; failures  &lt;span class="m"&gt;6&lt;/span&gt; passes  &lt;span class="nb"&gt;type&lt;/span&gt; ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓  allowed_values ✓

SUMMARY:

Passes: &lt;span class="m"&gt;12&lt;/span&gt;
Failures: &lt;span class="m"&gt;0&lt;/span&gt;



  field  failures  passes  &lt;span class="nb"&gt;type&lt;/span&gt;   min min_length   max max_length  sign  &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="m"&gt;0&lt;/span&gt;     a         &lt;span class="m"&gt;0&lt;/span&gt;       &lt;span class="m"&gt;6&lt;/span&gt;  True  True        NaN  True        NaN  True
&lt;span class="m"&gt;1&lt;/span&gt;     b         &lt;span class="m"&gt;0&lt;/span&gt;       &lt;span class="m"&gt;6&lt;/span&gt;  True   NaN       True   NaN       True   NaN

  max_nulls no_duplicates allowed_values
&lt;span class="m"&gt;0&lt;/span&gt;      True          True            NaN
&lt;span class="m"&gt;1&lt;/span&gt;      True          True           True
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In the DataFrame produced from the Verification:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;True&lt;/code&gt; indicates a constraint that was satisfied in the dataset;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;False&lt;/code&gt; indicates a constraint that was not satisfied in the dataset;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NaN&lt;/code&gt; (null) indicates a constraint that was not present for that field.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As you would expect, the &lt;code&gt;field&lt;/code&gt;, &lt;code&gt;failures&lt;/code&gt; and &lt;code&gt;passes&lt;/code&gt; columns are,
respectively, the name of the field, the number of failures and the
number of passes for that field.&lt;/p&gt;
&lt;p&gt;If we now change the DataFrame definition to:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;df = pd.DataFrame({&amp;#39;a&amp;#39;: [0, 1, 2, 10, pd.np.NaN],
                   &amp;#39;b&amp;#39;: [&amp;#39;one&amp;#39;, &amp;#39;one&amp;#39;, &amp;#39;two&amp;#39;, &amp;#39;three&amp;#39;, pd.np.NaN]})
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(as is the case in &lt;code&gt;tdda/constraints/examples/simple_verify_fail.py&lt;/code&gt;), we now
expect some constraint failures. If we run this, we see:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python simple_verify_fail.py
Passes: &lt;span class="m"&gt;5&lt;/span&gt;
Failures: &lt;span class="m"&gt;7&lt;/span&gt;



FIELDS:

a: &lt;span class="m"&gt;4&lt;/span&gt; failures  &lt;span class="m"&gt;2&lt;/span&gt; passes  &lt;span class="nb"&gt;type&lt;/span&gt; ✓  min ✗  max ✗  sign ✗  max_nulls ✗  no_duplicates ✓

b: &lt;span class="m"&gt;3&lt;/span&gt; failures  &lt;span class="m"&gt;3&lt;/span&gt; passes  &lt;span class="nb"&gt;type&lt;/span&gt; ✓  min_length ✓  max_length ✗  max_nulls ✓  noh_duplicates ✗  allowed_values ✗

SUMMARY:

Passes: &lt;span class="m"&gt;5&lt;/span&gt;
Failures: &lt;span class="m"&gt;7&lt;/span&gt;



  field  failures  passes  &lt;span class="nb"&gt;type&lt;/span&gt;    min min_length    max max_length   sign  &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="m"&gt;0&lt;/span&gt;     a         &lt;span class="m"&gt;4&lt;/span&gt;       &lt;span class="m"&gt;2&lt;/span&gt;  True  False        NaN  False        NaN  False
&lt;span class="m"&gt;1&lt;/span&gt;     b         &lt;span class="m"&gt;3&lt;/span&gt;       &lt;span class="m"&gt;3&lt;/span&gt;  True    NaN       True    NaN      False    NaN

  max_nulls no_duplicates allowed_values
&lt;span class="m"&gt;0&lt;/span&gt;     False          True            NaN
&lt;span class="m"&gt;1&lt;/span&gt;      True         False          False
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
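
&lt;p&gt;As a cross-check on how the per-field results relate to the summary counts,
here is a minimal sketch in plain Python. It does not use the &lt;code&gt;tdda&lt;/code&gt;
library: the &lt;code&gt;results&lt;/code&gt; dictionary is transcribed by hand from the failing
run above, &lt;code&gt;None&lt;/code&gt; stands in for the &lt;code&gt;NaN&lt;/code&gt; entries, and the
&lt;code&gt;tally&lt;/code&gt; helper is a hypothetical name introduced only for this illustration.&lt;/p&gt;

```python
# Per-field constraint results, transcribed from the failing run above.
# True = satisfied, False = not satisfied, None = constraint not present
# for that field (shown as NaN in the DataFrame).
results = {
    'a': {'type': True, 'min': False, 'min_length': None, 'max': False,
          'max_length': None, 'sign': False, 'max_nulls': False,
          'no_duplicates': True, 'allowed_values': None},
    'b': {'type': True, 'min': None, 'min_length': True, 'max': None,
          'max_length': False, 'sign': None, 'max_nulls': True,
          'no_duplicates': False, 'allowed_values': False},
}


def tally(field_results):
    """Return (passes, failures) for one field, ignoring absent constraints."""
    passes = sum(1 for v in field_results.values() if v is True)
    failures = sum(1 for v in field_results.values() if v is False)
    return passes, failures


# (passes, failures) for each field, then the overall totals.
per_field = {name: tally(r) for name, r in results.items()}
total_passes = sum(p for p, f in per_field.values())
total_failures = sum(f for p, f in per_field.values())

print(per_field)
print('Passes: %d' % total_passes)
print('Failures: %d' % total_failures)
```

&lt;p&gt;The totals agree with the summary section of the output: absent constraints
count as neither a pass nor a failure.&lt;/p&gt;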

&lt;h1 id="final-notes"&gt;Final Notes&lt;/h1&gt;
&lt;p&gt;There are more options and there's more to say about the Pandas implementation,
but that's probably enough for one post. We'll have follow-ups on the
file format, more options, and the foibles of Pandas.&lt;/p&gt;
&lt;p&gt;If you want to hear more, follow us on Twitter at
&lt;a href="https://twitter.com/tdda0"&gt;@tdda0&lt;/a&gt;.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="constraints"></category><category term="pandas"></category></entry><entry><title>WritableTestCase: Example Use</title><link href="https://tdda.info/writabletestcase-example-use.html" rel="alternate"></link><published>2016-09-18T15:30:00+01:00</published><updated>2016-09-18T15:30:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2016-09-18:/writabletestcase-example-use.html</id><summary type="html">&lt;p&gt;In my PyCon UK talk yesterday I promised to update and document
the copy of &lt;code&gt;writabletestcase.WritableTestCase&lt;/code&gt; on GitHub.&lt;/p&gt;
&lt;p&gt;The version I've put up is not quite as powerful as the example I showed
in the talk—that will follow—but has the basic functionality.&lt;/p&gt;
&lt;p&gt;I've now added examples …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In my PyCon UK talk yesterday I promised to update and document
the copy of &lt;code&gt;writabletestcase.WritableTestCase&lt;/code&gt; on GitHub.&lt;/p&gt;
&lt;p&gt;The version I've put up is not quite as powerful as the example I showed
in the talk—that will follow—but has the basic functionality.&lt;/p&gt;
&lt;p&gt;I've now added examples to the repository and, below, show how these work.&lt;/p&gt;
&lt;p&gt;The library is available with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;git clone https://github.com/tdda/tdda.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;WritableTestCase&lt;/code&gt; extends &lt;code&gt;unittest.TestCase&lt;/code&gt;, from Python's standard
library, in three main ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;It provides methods for testing strings produced in memory or
    files written to disk against reference results in files.  When a
    test fails, rather than just showing a hard-to-read difference, it
    writes the actual result to file (if necessary) and then shows the
    &lt;code&gt;diff&lt;/code&gt; command needed to compare it—something like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Compare with &amp;quot;diff /path/to/actual-output /path/to/expected-output&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Obviously, the &lt;code&gt;diff&lt;/code&gt; command can be replaced with a graphical
diff tool, an &lt;code&gt;open&lt;/code&gt; command or whatever.&lt;/p&gt;
&lt;p&gt;Although this shouldn't be necessary (see below), you also have
the option, after verification, of replacing &lt;code&gt;diff&lt;/code&gt; with &lt;code&gt;cp&lt;/code&gt; to
copy the actual output as the new reference output.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Secondly, the code supports excluding lines of the output
    that contain nominated strings. This is often handy for excluding
    things like date stamps, version numbers, copyright notices
    etc. These often appear in output, and vary, without affecting
    the semantics.&lt;/p&gt;
&lt;p&gt;(The version of the library I showed at PyCon had more powerful
variants of this, which I'll add later.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Thirdly, if you verify that the new output is correct, the library
    supports re-running with the &lt;code&gt;-w&lt;/code&gt; flag to overwrite the expected
    ("reference") results with the ones generated by the code.&lt;/p&gt;
&lt;p&gt;Obviously, if this feature is abused, the value of the tests will
be lost, but provided you check the output carefully before re-writing,
this is a significant convenience.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The example code is in the &lt;code&gt;examples&lt;/code&gt; subdirectory, called
&lt;code&gt;test_using_writabletestcase.py&lt;/code&gt;. It has two test functions,
one of which generates HTML output as a string, and the other
of which produces some slightly different HTML output as a file.
In each case, the output produced by the function is not identical
to the expected "reference" output (in &lt;code&gt;examples/reference&lt;/code&gt;), but
differs only on lines containing "Copyright" and "Version".
Since these are passed into the test as exclusions, the tests should pass.&lt;/p&gt;
&lt;p&gt;Here is the example code:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# -*- coding: utf-8 -*-&lt;/span&gt;
&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;test_using_writabletestcase.py: A simple example of how to use&lt;/span&gt;
&lt;span class="sd"&gt;tdda.writabletestcase.WritableTestCase.&lt;/span&gt;

&lt;span class="sd"&gt;Source repository: https://github.com/tdda/tdda&lt;/span&gt;

&lt;span class="sd"&gt;License: MIT&lt;/span&gt;

&lt;span class="sd"&gt;Copyright (c) Stochastic Solutions Limited 2016&lt;/span&gt;
&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;division&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;print_function&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unicode_literals&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tempfile&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;writabletestcase&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tdda.examples.generators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_file&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestExample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writabletestcase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WritableTestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleStringGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;        This test uses generate_string() from tdda.examples.generators&lt;/span&gt;
&lt;span class="sd"&gt;        to generate some HTML as a string.&lt;/span&gt;

&lt;span class="sd"&gt;        It is similar to the reference HTML in&lt;/span&gt;
&lt;span class="sd"&gt;        tdda/examples/reference/string_result.html except that the&lt;/span&gt;
&lt;span class="sd"&gt;        Copyright and version lines are slightly different.&lt;/span&gt;

&lt;span class="sd"&gt;        As shipped, the test should pass, because the ignore_patterns&lt;/span&gt;
&lt;span class="sd"&gt;        tell it to ignore those lines.&lt;/span&gt;

&lt;span class="sd"&gt;        Make a change to the generation code in the generate_string&lt;/span&gt;
&lt;span class="sd"&gt;        function in generators.py to change the HTML output.&lt;/span&gt;

&lt;span class="sd"&gt;        The test should then fail and suggest a diff command to run&lt;/span&gt;
&lt;span class="sd"&gt;        to see the difference.&lt;/span&gt;

&lt;span class="sd"&gt;        Rerun with&lt;/span&gt;

&lt;span class="sd"&gt;            python test_using_writabletestcase.py -w&lt;/span&gt;

&lt;span class="sd"&gt;        and it should re-write the reference output to match your&lt;/span&gt;
&lt;span class="sd"&gt;        modified results.&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;this_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vm"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;expected_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;this_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;reference&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                     &lt;span class="s1"&gt;&amp;#39;string_result.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;check_string_against_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                       &lt;span class="n"&gt;ignore_patterns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Copyright&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                        &lt;span class="s1"&gt;&amp;#39;Version&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testExampleFileGeneration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;        This test uses generate_file() from tdda.examples.generators&lt;/span&gt;
&lt;span class="sd"&gt;        to generate some HTML as a file.&lt;/span&gt;

&lt;span class="sd"&gt;        It is similar to the reference HTML in&lt;/span&gt;
&lt;span class="sd"&gt;        tdda/examples/reference/file_result.html except that the&lt;/span&gt;
&lt;span class="sd"&gt;        Copyright and version lines are slightly different.&lt;/span&gt;

&lt;span class="sd"&gt;        As shipped, the test should pass, because the ignore_patterns&lt;/span&gt;
&lt;span class="sd"&gt;        tell it to ignore those lines.&lt;/span&gt;

&lt;span class="sd"&gt;        Make a change to the generation code in the generate_file function&lt;/span&gt;
&lt;span class="sd"&gt;        in generators.py to change the HTML output.&lt;/span&gt;

&lt;span class="sd"&gt;        The test should then fail and suggest a diff command to run&lt;/span&gt;
&lt;span class="sd"&gt;        to see the difference.&lt;/span&gt;

&lt;span class="sd"&gt;        Rerun with&lt;/span&gt;

&lt;span class="sd"&gt;            python test_using_writabletestcase.py -w&lt;/span&gt;

&lt;span class="sd"&gt;        and it should re-write the reference output to match your&lt;/span&gt;
&lt;span class="sd"&gt;        modified results.&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;outdir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tempfile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gettempdir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;outpath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outdir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;file_result.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;generate_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outpath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;this_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vm"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;expected_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;this_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;reference&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                     &lt;span class="s1"&gt;&amp;#39;file_result.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;check_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outpath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;ignore_patterns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Copyright&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Version&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;writabletestcase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;writabletestcase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_write_from_argv&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you download it and try running it, you should see output similar to the
following:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ python test_using_writabletestcase&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;py&lt;/span&gt;
&lt;span class="nt"&gt;..&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 2 tests in 0&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;004s&lt;/span&gt;

&lt;span class="c"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The reference output files it compares against are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;examples/reference/string_result.html&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;examples/reference/file_result.html&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To see what happens when there's a difference, try editing one or both
of the main functions that generate the HTML in &lt;code&gt;generators.py&lt;/code&gt;.
They're mostly just using explicit strings, so the simplest thing is just
to change a word or something in the output.&lt;/p&gt;
&lt;p&gt;If I change &lt;code&gt;It's&lt;/code&gt; to &lt;code&gt;It is&lt;/code&gt; in the &lt;code&gt;generate_string()&lt;/code&gt; function and rerun,
I get this output:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python test_using_writabletestcase.py
.
File check failed.
Compare with &lt;span class="s2"&gt;&amp;quot;diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html&amp;quot;&lt;/span&gt;.

Note exclusions:
Copyright
Version
&lt;span class="nv"&gt;F&lt;/span&gt;
&lt;span class="o"&gt;======================================================================&lt;/span&gt;
FAIL: testExampleStringGeneration &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestExample&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;test_using_writabletestcase.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;55&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; testExampleStringGeneration
    &lt;span class="s1"&gt;&amp;#39;Version&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;
  File &lt;span class="s2"&gt;&amp;quot;/Users/njr/python/tdda/writabletestcase.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;294&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; check_string_against_file
    self.assertEqual&lt;span class="o"&gt;(&lt;/span&gt;failures, &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
AssertionError: &lt;span class="m"&gt;1&lt;/span&gt; !&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;

----------------------------------------------------------------------
Ran &lt;span class="m"&gt;2&lt;/span&gt; tests &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.005s

FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt; godel:$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If I then run the diff command it suggests, the output is:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html
&lt;span class="m"&gt;5&lt;/span&gt;,6c5,6
&amp;lt;     Copyright &lt;span class="o"&gt;(&lt;/span&gt;c&lt;span class="o"&gt;)&lt;/span&gt; Stochastic Solutions, &lt;span class="m"&gt;2016&lt;/span&gt;
&amp;lt;     Version &lt;span class="m"&gt;1&lt;/span&gt;.0.0
---
&amp;gt;     Copyright &lt;span class="o"&gt;(&lt;/span&gt;c&lt;span class="o"&gt;)&lt;/span&gt; Stochastic Solutions Limited, &lt;span class="m"&gt;2016&lt;/span&gt;
&amp;gt;     Version &lt;span class="m"&gt;0&lt;/span&gt;.0.0
29c29
&amp;lt;     It is not terribly exciting.
---
&amp;gt;     It&lt;span class="err"&gt;&amp;#39;&lt;/span&gt;s not terribly exciting.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here you can see the differences that are excluded, and the change I
actually made.&lt;/p&gt;
&lt;p&gt;(The version I showed at PyCon has an option to see only the
non-excluded differences, but this version doesn't; that will come!)&lt;/p&gt;
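&lt;p&gt;To make the idea of non-excluded differences concrete, here is a minimal
sketch using only Python's standard library. It is not how
&lt;code&gt;WritableTestCase&lt;/code&gt; is implemented, and the &lt;code&gt;keep&lt;/code&gt; helper is
a hypothetical name introduced for this illustration: it simply drops every line
containing one of the exclusion strings before diffing. The sample lines are the
ones that differ in the &lt;code&gt;diff&lt;/code&gt; output above.&lt;/p&gt;

```python
import difflib

# Exclusion strings, as passed to the tests via ignore_patterns.
IGNORE = ['Copyright', 'Version']

# The differing lines from the diff output above (actual vs. reference).
actual = [
    'Copyright (c) Stochastic Solutions, 2016',
    'Version 1.0.0',
    'It is not terribly exciting.',
]
expected = [
    'Copyright (c) Stochastic Solutions Limited, 2016',
    'Version 0.0.0',
    "It's not terribly exciting.",
]


def keep(lines, ignore_patterns):
    """Return the lines that contain none of the exclusion strings."""
    return [line for line in lines
            if not any(pattern in line for pattern in ignore_patterns)]


# Diff only the lines that survive the exclusions, keeping just the
# removed ('- ') and added ('+ ') lines from difflib.ndiff's output.
diff = [d for d in difflib.ndiff(keep(actual, IGNORE), keep(expected, IGNORE))
        if d.startswith(('- ', '+ '))]

for line in diff:
    print(line)
```

&lt;p&gt;With the Copyright and Version lines filtered out, the only difference left
is the change from &lt;code&gt;It's&lt;/code&gt; to &lt;code&gt;It is&lt;/code&gt;.&lt;/p&gt;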
&lt;p&gt;If I now run again using &lt;code&gt;-w&lt;/code&gt;, to re-write the reference output,
it shows:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ python test_using_writabletestcase&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;py &lt;/span&gt;&lt;span class="nb"&gt;-&lt;/span&gt;&lt;span class="c"&gt;w&lt;/span&gt;
&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;Expected file /Users/njr/python/tdda/examples/reference/string_result&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;html written&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 2 tests in 0&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;003s&lt;/span&gt;

&lt;span class="c"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And, of course, if I run a third time, without &lt;code&gt;-w&lt;/code&gt;, the test now passes:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ python test_using_writabletestcase&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;py&lt;/span&gt;
&lt;span class="nt"&gt;..&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 2 tests in 0&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;003s&lt;/span&gt;

&lt;span class="c"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So that's a quick overview of how it works.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category></entry><entry><title>Slides and Rough Transcript of TDDA talk from PyCon UK 2016</title><link href="https://tdda.info/slides-and-rough-transcript-of-tdda-talk-from-pycon-uk-2016.html" rel="alternate"></link><published>2016-09-17T15:30:00+01:00</published><updated>2016-09-17T15:30:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2016-09-17:/slides-and-rough-transcript-of-tdda-talk-from-pycon-uk-2016.html</id><content type="html">&lt;p&gt;Python UK 2016, Cardiff.&lt;/p&gt;
&lt;p&gt;I gave a talk on test-driven data analysis at PyCon UK 2016, Cardiff,
today.&lt;/p&gt;
&lt;p&gt;The slides (which are kind-of useless without the words) are available
&lt;a href="https://www.tdda.info/pdf/tdda-pycon-cardiff-2016-slides.pdf"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;More usefully, a rough transcript, with thumbnail slides, is available
&lt;a href="https://www.tdda.info/pdf/tdda-pycon-cardiff-2016-rough-transcript.pdf"&gt;here&lt;/a&gt;.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category></entry><entry><title>Extracting More Apple Health Data</title><link href="https://tdda.info/extracting-more-apple-health-data.html" rel="alternate"></link><published>2016-04-20T15:30:00+01:00</published><updated>2016-04-20T15:30:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2016-04-20:/extracting-more-apple-health-data.html</id><summary type="html">&lt;p&gt;The &lt;a href="https://www.tdda.info/in-defence-of-xml-exporting-and-analysing-apple-health-data"&gt;first version&lt;/a&gt; of the Python code for extracting data from
the XML export from the Apple Health app on iOS neglected to extract
Activity Summaries and Workout data.
We will now fix that.&lt;/p&gt;
&lt;p&gt;As usual, I'll remind you how to get the code, if you want, then discuss
the changes …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The &lt;a href="https://www.tdda.info/in-defence-of-xml-exporting-and-analysing-apple-health-data"&gt;first version&lt;/a&gt; of the Python code for extracting data from
the XML export from the Apple Health app on iOS neglected to extract
Activity Summaries and Workout data.
We will now fix that.&lt;/p&gt;
&lt;p&gt;As usual, I'll remind you how to get the code, if you want, then discuss
the changes to the code, the reference test and the unit tests.
Then in the next post, we'll actually start looking at the data.&lt;/p&gt;
&lt;h3 id="the-updated-code"&gt;The Updated Code&lt;/h3&gt;
&lt;p&gt;As before, you can get the code from GitHub with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ git clone https://github.com/tdda/applehealthdata.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or if you have pulled it before, with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ git pull --tags
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This version of the code is tagged with &lt;code&gt;v1.3&lt;/code&gt;, so if it has been updated
by the time you read this, get that version with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ git checkout v1.3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I'm not going to list all the code here, but will pull out a few key changes
as we discuss them.&lt;/p&gt;
&lt;h3 id="changes"&gt;Changes&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Change 1: Change FIELDS to handle three different field structures.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The first version of the extraction code wrote only Records, which contain
the granular activity data (which is the vast bulk of it, by volume).&lt;/p&gt;
&lt;p&gt;Now I want to extend the code to handle the other two main kinds of data
in the export, the &lt;code&gt;ActivitySummary&lt;/code&gt; and &lt;code&gt;Workout&lt;/code&gt; elements.&lt;/p&gt;
&lt;p&gt;The three different element types contain different XML attributes, which
correspond to different fields in the CSV, and although they overlap,
I think the best approach is to have separate record structures for each,
and then to create a dictionary mapping the element kind to its field
information.&lt;/p&gt;
&lt;p&gt;Accordingly, the code that sets &lt;code&gt;FIELDS&lt;/code&gt; changes to become:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;RECORD_FIELDS = OrderedDict((
    (&amp;#39;sourceName&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;sourceVersion&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;device&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;type&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;unit&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;creationDate&amp;#39;, &amp;#39;d&amp;#39;),
    (&amp;#39;startDate&amp;#39;, &amp;#39;d&amp;#39;),
    (&amp;#39;endDate&amp;#39;, &amp;#39;d&amp;#39;),
    (&amp;#39;value&amp;#39;, &amp;#39;n&amp;#39;),
))

ACTIVITY_SUMMARY_FIELDS = OrderedDict((
    (&amp;#39;dateComponents&amp;#39;, &amp;#39;d&amp;#39;),
    (&amp;#39;activeEnergyBurned&amp;#39;, &amp;#39;n&amp;#39;),
    (&amp;#39;activeEnergyBurnedGoal&amp;#39;, &amp;#39;n&amp;#39;),
    (&amp;#39;activeEnergyBurnedUnit&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;appleExerciseTime&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;appleExerciseTimeGoal&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;appleStandHours&amp;#39;, &amp;#39;n&amp;#39;),
    (&amp;#39;appleStandHoursGoal&amp;#39;, &amp;#39;n&amp;#39;),
))

WORKOUT_FIELDS = OrderedDict((
    (&amp;#39;sourceName&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;sourceVersion&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;device&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;creationDate&amp;#39;, &amp;#39;d&amp;#39;),
    (&amp;#39;startDate&amp;#39;, &amp;#39;d&amp;#39;),
    (&amp;#39;endDate&amp;#39;, &amp;#39;d&amp;#39;),
    (&amp;#39;workoutActivityType&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;duration&amp;#39;, &amp;#39;n&amp;#39;),
    (&amp;#39;durationUnit&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;totalDistance&amp;#39;, &amp;#39;n&amp;#39;),
    (&amp;#39;totalDistanceUnit&amp;#39;, &amp;#39;s&amp;#39;),
    (&amp;#39;totalEnergyBurned&amp;#39;, &amp;#39;n&amp;#39;),
    (&amp;#39;totalEnergyBurnedUnit&amp;#39;, &amp;#39;s&amp;#39;),
))

FIELDS = {
    &amp;#39;Record&amp;#39;: RECORD_FIELDS,
    &amp;#39;ActivitySummary&amp;#39;: ACTIVITY_SUMMARY_FIELDS,
    &amp;#39;Workout&amp;#39;: WORKOUT_FIELDS,
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and we have to change references (in both the main code and the test code)
to refer to &lt;code&gt;RECORD_FIELDS&lt;/code&gt; where previously there were references to &lt;code&gt;FIELDS&lt;/code&gt;.&lt;/p&gt;
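&lt;p&gt;To illustrate why a dictionary keyed on element kind is convenient, here is a minimal sketch. The field tables are abbreviated for brevity, and &lt;code&gt;header_for&lt;/code&gt; is a hypothetical helper, not a function from the repository.&lt;/p&gt;

```python
from collections import OrderedDict

# Abbreviated field tables, for illustration only; the full ones are listed above.
WORKOUT_FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('duration', 'n'),
    ('durationUnit', 's'),
))
ACTIVITY_SUMMARY_FIELDS = OrderedDict((
    ('dateComponents', 'd'),
    ('appleStandHours', 'n'),
))
FIELDS = {
    'Workout': WORKOUT_FIELDS,
    'ActivitySummary': ACTIVITY_SUMMARY_FIELDS,
}


def header_for(kind):
    """Hypothetical helper: build the CSV header line for one element kind."""
    return ','.join(FIELDS[kind].keys())


print(header_for('Workout'))          # sourceName,duration,durationUnit
print(header_for('ActivitySummary'))  # dateComponents,appleStandHours
```

&lt;p&gt;Because each table is an &lt;code&gt;OrderedDict&lt;/code&gt;, the header and any row built by iterating over the same table come out in the same column order.&lt;/p&gt;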
&lt;p&gt;&lt;strong&gt;Change 2: Add a Workout to the test data&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There was a single workout in the data I exported from the phone (a token one
I performed primarily to generate a record of this type). I simply used
grep to extract that line from &lt;code&gt;export.xml&lt;/code&gt; and poked it into the test
data file &lt;code&gt;testdata/export6s3sample.xml&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Change 3: Update the tag and field counters&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The code for counting record types previously considered only nodes of type
&lt;code&gt;Record&lt;/code&gt;. Now we also want to handle &lt;code&gt;Workout&lt;/code&gt; and &lt;code&gt;ActivitySummary&lt;/code&gt; elements.
Workouts do come in different types (they have a &lt;code&gt;workoutActivityType&lt;/code&gt; field),
so it may be that we will want to write out different workout types
into different CSV files, but since I have only, so far, seen a single
workout, I don't really want to do this. So instead, we'll write all
&lt;code&gt;Workout&lt;/code&gt; elements to a corresponding &lt;code&gt;Workout.csv&lt;/code&gt; file, and all
&lt;code&gt;ActivitySummary&lt;/code&gt; elements to an &lt;code&gt;ActivitySummary.csv&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;Accordingly, the &lt;code&gt;count_record_types&lt;/code&gt; method now uses an extra
&lt;code&gt;Counter&lt;/code&gt; attribute, &lt;code&gt;other_types&lt;/code&gt;, to count the number of each of these
elements, keyed on their tag (i.e. &lt;code&gt;Workout&lt;/code&gt; or &lt;code&gt;ActivitySummary&lt;/code&gt;).&lt;/p&gt;
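&lt;p&gt;The counting logic can be sketched roughly as follows. This is a standalone approximation rather than the actual method from the repository: &lt;code&gt;count_types&lt;/code&gt; is a hypothetical function, and the elements are made-up data built in memory instead of being parsed from a real export.&lt;/p&gt;

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Build a tiny stand-in for export.xml in memory (made-up data).
root = ET.Element('HealthData')
ET.SubElement(root, 'Record', type='HKQuantityTypeIdentifierStepCount')
ET.SubElement(root, 'Record', type='HKQuantityTypeIdentifierStepCount')
ET.SubElement(root, 'Workout', workoutActivityType='HKWorkoutActivityTypeOther')
ET.SubElement(root, 'ActivitySummary', dateComponents='2016-04-15')
ET.SubElement(root, 'ActivitySummary', dateComponents='2016-04-16')


def count_types(nodes):
    """Count Records by their type attribute and other elements by their tag."""
    record_types = Counter()
    other_types = Counter()
    for node in nodes:
        if node.tag == 'Record':
            record_types[node.attrib['type']] += 1
        elif node.tag in ('Workout', 'ActivitySummary'):
            other_types[node.tag] += 1
    return record_types, other_types


record_types, other_types = count_types(list(root))
print(sorted(other_types.items()))  # [('ActivitySummary', 2), ('Workout', 1)]
```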
&lt;p&gt;&lt;strong&gt;Change 4: Update the test results to reflect the new behaviour&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Two of the unit tests introduced last time need to be updated to reflect
Change 3. First, the field counts change; second, we need
reference values for the &lt;code&gt;other_types&lt;/code&gt; counts. Hence the new section
in &lt;code&gt;test_extracted_reference_stats&lt;/code&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="n"&gt;expectedOtherCounts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ActivitySummary&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Workout&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;other_types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                     &lt;span class="n"&gt;expectedOtherCounts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Change 5: Open (and close) files for Workouts and ActivitySummaries&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We need to open new files for &lt;code&gt;Workout.csv&lt;/code&gt; and &lt;code&gt;ActivitySummary.csv&lt;/code&gt;
if we have any such records. This is handled in the &lt;code&gt;open_for_writing&lt;/code&gt;
method.&lt;/p&gt;
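&lt;p&gt;One way to open a file only for kinds that actually occur is sketched below. It is an assumption-laden illustration, not the repository's &lt;code&gt;open_for_writing&lt;/code&gt;: the function name &lt;code&gt;open_other_files&lt;/code&gt;, the abbreviated field tables and the choice to write the header at open time are all hypothetical.&lt;/p&gt;

```python
import os
import tempfile
from collections import Counter, OrderedDict

# Abbreviated field tables, for illustration only.
FIELDS = {
    'Workout': OrderedDict((('duration', 'n'), ('durationUnit', 's'))),
    'ActivitySummary': OrderedDict((('dateComponents', 'd'),)),
}


def open_other_files(directory, other_types):
    """Open one CSV per kind with a non-zero count and write its header line."""
    handles = {}
    for kind, count in sorted(other_types.items()):
        if count:  # skip kinds that never occurred in the export
            path = os.path.join(directory, '%s.csv' % kind)
            handles[kind] = open(path, 'w')
            handles[kind].write(','.join(FIELDS[kind].keys()) + '\n')
    return handles


with tempfile.TemporaryDirectory() as tmp:
    counts = Counter({'Workout': 1, 'ActivitySummary': 0})
    handles = open_other_files(tmp, counts)
    for handle in handles.values():
        handle.close()
    created = sorted(os.listdir(tmp))

print(created)  # ['Workout.csv']
```

&lt;p&gt;Keeping the handles in a dictionary keyed on kind means that the code writing records can pick the right file with a single lookup.&lt;/p&gt;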
&lt;p&gt;&lt;strong&gt;Change 6: Write records for Workouts and ActivitySummaries&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are minor changes to the &lt;code&gt;write_records&lt;/code&gt; method to allow it to
handle writing &lt;code&gt;Workout&lt;/code&gt; and &lt;code&gt;ActivitySummary&lt;/code&gt; records. The only
real difference is that the different CSV files have different fields,
so we need to look up the right values, in the order specified by the header
for each kind. The new code does that:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kinds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FIELDS&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;kinds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attrib&lt;/span&gt;
            &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Record&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;
            &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;format_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;datatype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                      &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datatype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;FIELDS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
            &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Change 7: Update the reference test&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Finally, the reference test itself now generates two more files,
so I've added reference copies of those to the &lt;code&gt;testdata&lt;/code&gt; subdirectory
and changed the test to loop over all four files:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_tiny_reference_extraction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;copy_test_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HealthDataExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VERBOSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;StepCount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;DistanceWalkingRunning&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;Workout&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ActivitySummary&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;check_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;.csv&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="mission-accomplished"&gt;Mission Accomplished&lt;/h3&gt;
&lt;p&gt;We've now extracted essentially all the data from the &lt;code&gt;export.xml&lt;/code&gt;
file from the Apple Health app, and created various tests for that
extraction process. We'll start to look at the data in future posts.
There is one more component in my extract—another XML file called
&lt;code&gt;export_cda.xml&lt;/code&gt;. This contains a &lt;code&gt;ClinicalDocument&lt;/code&gt;, apparently
conforming to a standard from (or possibly administered by) Health Level
Seven International. It contains heart-rate data from my Apple Watch.
I probably will extract it and publish the code for doing so, but later.&lt;/p&gt;</content><category term="TDDA"></category><category term="xml"></category><category term="apple"></category><category term="health"></category></entry><entry><title>Unit Tests</title><link href="https://tdda.info/unit-tests.html" rel="alternate"></link><published>2016-04-19T21:05:00+01:00</published><updated>2016-04-19T21:05:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2016-04-19:/unit-tests.html</id><summary type="html">&lt;p&gt;In the &lt;a href="https://www.tdda.info/first-test"&gt;last post&lt;/a&gt;,
we presented some code for implementing a
&lt;a href="glossary.html#reference-test"&gt;"reference" test&lt;/a&gt;
for the code for extracting CSV files from the XML dump that the
Apple Health app on iOS can produce.&lt;/p&gt;
&lt;p&gt;We will now expand that test with a few other, smaller and more conventional
&lt;em&gt;unit tests&lt;/em&gt;. Each …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In the &lt;a href="https://www.tdda.info/first-test"&gt;last post&lt;/a&gt;,
we presented some code for implementing a
&lt;a href="glossary.html#reference-test"&gt;"reference" test&lt;/a&gt;
for the code for extracting CSV files from the XML dump that the
Apple Health app on iOS can produce.&lt;/p&gt;
&lt;p&gt;We will now expand that test with a few other, smaller and more conventional
&lt;em&gt;unit tests&lt;/em&gt;. Each unit test focuses on a smaller block of functionality.&lt;/p&gt;
&lt;h3 id="the-test-code"&gt;The Test Code&lt;/h3&gt;
&lt;p&gt;As before, you can get the code from GitHub with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ git clone https://github.com/tdda/applehealthdata.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or if you have pulled it previously, with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ git pull
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This version of the code is tagged with &lt;code&gt;v1.2&lt;/code&gt;, so if it has been updated
by the time you read this, get that version with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ git checkout v1.2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here is the updated test code.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# -*- coding: utf-8 -*-&lt;/span&gt;
&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;testapplehealthdata.py: tests for the applehealthdata.py&lt;/span&gt;

&lt;span class="sd"&gt;Copyright (c) 2016 Nicholas J. Radcliffe&lt;/span&gt;
&lt;span class="sd"&gt;Licence: MIT&lt;/span&gt;
&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;absolute_import&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;division&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;print_function&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unicode_literals&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;shutil&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;unittest&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;


&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;applehealthdata&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HealthDataExtractor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;format_freqs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;abbreviate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;CLEAN_UP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;VERBOSE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_base_dir&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Return the directory containing this test file,&lt;/span&gt;
&lt;span class="sd"&gt;    which will (normally) be the applyhealthdata directory&lt;/span&gt;
&lt;span class="sd"&gt;    also containing the testdata dir.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vm"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;))[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_testdata_dir&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return the full path to the testdata directory&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_base_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;testdata&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return the full path to the tmp directory&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_base_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;tmp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_any_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Remove the temporary directory if it exists.&lt;/span&gt;
&lt;span class="sd"&gt;    Returns its location either way.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;tmp_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp_dir&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rmtree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tmp_dir&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Remove any existing tmp directory.&lt;/span&gt;
&lt;span class="sd"&gt;    Create empty tmp direcory.&lt;/span&gt;
&lt;span class="sd"&gt;    Return the location of the tmp dir.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;tmp_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;remove_any_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tmp_dir&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;copy_test_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Copy the test data export6s3sample.xml from testdata directory&lt;/span&gt;
&lt;span class="sd"&gt;    to tmp directory.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;tmp_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;export6s3sample.xml&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;in_xml_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_testdata_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out_xml_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copyfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_xml_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_xml_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out_xml_file&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestAppleHealthDataExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unittest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tearDownClass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Clean up by removing the tmp directory, if it exists.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;CLEAN_UP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;remove_any_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;expected_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_testdata_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;actual_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_tiny_reference_extraction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;copy_test_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HealthDataExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VERBOSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;check_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;StepCount.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;check_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DistanceWalkingRunning.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_format_freqs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format_freqs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;one&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format_freqs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;one: 1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;one&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format_freqs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;one: 2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;two&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;three&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format_freqs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                         &lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;one: 2&lt;/span&gt;
&lt;span class="sd"&gt;three: 1&lt;/span&gt;
&lt;span class="sd"&gt;two: 1&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_format_null_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;n&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;z&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Note: even an illegal type, z, produces correct output for&lt;/span&gt;
            &lt;span class="c1"&gt;# null values.&lt;/span&gt;
            &lt;span class="c1"&gt;# Questionable, but we&amp;#39;ll leave as a feature&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_format_numeric_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;-1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;-1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;2.5&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2.5&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;n&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_format_date_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;hearts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;any string not need escaping or quoting; even this: ♥♥&amp;#39;&lt;/span&gt;
        &lt;span class="n"&gt;cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;01/02/2000 12:34:56&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;01/02/2000 12:34:56&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;hearts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hearts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_format_string_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;quot;a&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;quot;&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;one &amp;quot;2&amp;quot; three&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;quot;one \&amp;quot;2\&amp;quot; three&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1\2\3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;quot;1&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;2&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;3&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_abbreviate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;changed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;HKQuantityTypeIdentifierHeight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Height&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;HKQuantityTypeIdentifierStepCount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;StepCount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;HK*TypeIdentifierStepCount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;StepCount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;HKCharacteristicTypeIdentifierDateOfBirth&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;DateOfBirth&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;HKCharacteristicTypeIdentifierBiologicalSex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;BiologicalSex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;HKCharacteristicTypeIdentifierBloodType&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;BloodType&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;HKCharacteristicTypeIdentifierFitzpatrickSkinType&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                                                    &lt;span class="s1"&gt;&amp;#39;FitzpatrickSkinType&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;unchanged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;aHKQuantityTypeIdentifierHeight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;&amp;#39;HKQuantityTypeIdentityHeight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;changed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abbreviate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abbreviate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unchanged&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abbreviate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# This test looks strange, but because of the import statments&lt;/span&gt;
        &lt;span class="c1"&gt;#     from __future__ import unicode_literals&lt;/span&gt;
        &lt;span class="c1"&gt;# in Python 2, type(&amp;#39;a&amp;#39;) is unicode, and the point of the encode&lt;/span&gt;
        &lt;span class="c1"&gt;# function is to ensure that it has been converted to a UTF-8 string&lt;/span&gt;
        &lt;span class="c1"&gt;# before writing to file.&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_extracted_reference_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;copy_test_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HealthDataExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VERBOSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;expectedRecordCounts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DistanceWalkingRunning&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;StepCount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;record_types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                         &lt;span class="n"&gt;expectedRecordCounts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;expectedTagCounts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ActivitySummary&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ExportDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Me&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Record&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                         &lt;span class="n"&gt;expectedTagCounts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;expectedFieldCounts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;HKCharacteristicTypeIdentifierBiologicalSex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;HKCharacteristicTypeIdentifierBloodType&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;HKCharacteristicTypeIdentifierDateOfBirth&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;HKCharacteristicTypeIdentifierFitzpatrickSkinType&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;activeEnergyBurned&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;activeEnergyBurnedGoal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;activeEnergyBurnedUnit&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;appleExerciseTime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;appleExerciseTimeGoal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;appleStandHours&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;appleStandHoursGoal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;creationDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;dateComponents&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;endDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sourceName&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;startDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;unit&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;value&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                         &lt;span class="n"&gt;expectedFieldCounts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;unittest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
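&lt;p&gt;The counter assertions in &lt;code&gt;test_extracted_reference_stats&lt;/code&gt; compare
sorted lists of (key, count) pairs. The following is a minimal sketch of that
comparison next to a dictionary-based one. The counts are hypothetical stand-ins
chosen to match the expected record counts above, and &lt;code&gt;record_types&lt;/code&gt; here
is a plain &lt;code&gt;Counter&lt;/code&gt; built by hand, not the attribute of
&lt;code&gt;HealthDataExtractor&lt;/code&gt;.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical counts, standing in for data.record_types in the test above.
record_types = Counter()
for name in ['StepCount'] * 10 + ['DistanceWalkingRunning'] * 5:
    record_types[name] += 1

expected_pairs = [
    ('DistanceWalkingRunning', 5),
    ('StepCount', 10),
]

# sorted() returns a new list, ordered here by key and then by count.
as_sorted_list = sorted(record_types.items())

# Dictionaries compare equal regardless of insertion order, so no sort is needed.
as_dict = dict(record_types)

print(as_sorted_list == expected_pairs)
print(as_dict == dict(expected_pairs))
```

&lt;p&gt;Either comparison should work for the tests here. The sorted list has the
small advantage that a failure message shows the pairs in a predictable order.&lt;/p&gt;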

&lt;h3 id="notes"&gt;Notes&lt;/h3&gt;
&lt;p&gt;We're not going to discuss every part of the code, but will point out
a few salient features.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I've added a &lt;a href="https://www.python.org/dev/peps/pep-0263/"&gt;&lt;code&gt;coding&lt;/code&gt;&lt;/a&gt;
    line at the top of both the test script and the main
    &lt;code&gt;applehealthdata.py&lt;/code&gt; script.  This tells Python (and my editor, Emacs)
    the encoding of the file on disk (UTF-8).  This is now relevant
    because one of the new tests (&lt;code&gt;test_format_date_values&lt;/code&gt;)
    features a non-ASCII character in a string literal.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The previous test method &lt;code&gt;test_tiny_fixed_extraction&lt;/code&gt; has been renamed
    &lt;code&gt;test_tiny_reference_extraction&lt;/code&gt;, but is otherwise unchanged.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Several of the tests loop over dictionaries or lists of
    input-output pairs, with an assertion of some kind in the main
    body. Some people don't like this, and prefer one assertion per
    test. I don't really agree with that, but do think it's important
    to be able to see easily &lt;em&gt;which&lt;/em&gt; assertion fails. An idiom I
    often use to help with this is to include the input on both sides of
    the comparison. For example, in &lt;code&gt;test_abbreviate&lt;/code&gt;, when checking the
    abbreviation of items that should change, the code reads:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;changed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abbreviate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;rather than&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;changed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abbreviate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This makes it easy to tell which input fails, if one does, even in
cases in which the main values being compared (&lt;code&gt;abbreviate(k)&lt;/code&gt; and
&lt;code&gt;v&lt;/code&gt;, in this case) are long, complex or repeated across different
inputs. It doesn't actually make much difference in these
examples, but in general I find it helpful.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The test &lt;code&gt;test_extracted_reference_stats&lt;/code&gt; checks that three
    counters used by the code work as expected.
    Some people would definitely advocate splitting this into three
    tests, but, even though the test is quick, it seems more natural to me
    to test these together. This also means we don't have to process the
    XML file three times. There are other ways of achieving the same
    end, and this approach has the potential disadvantage that the later
    checks won't be run if an earlier one fails.&lt;/p&gt;
&lt;p&gt;The other point to note here is that the &lt;code&gt;Counter&lt;/code&gt; objects are
unordered, so I've written the expected values sorted by key,
and then used Python's &lt;code&gt;sorted&lt;/code&gt;
function, which returns a new list containing the values of a list
(or other iterable) in sorted order. We could avoid the sort by
constructing sets or dictionaries from the &lt;code&gt;Counter&lt;/code&gt; objects and
checking those instead, but the sort here is not expensive, and this
approach is probably simpler.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I haven't bothered to write a separate test for the extraction
    phase (checking that it writes the right CSV files) because that
    seems to me to add almost nothing over the existing reference test
    (&lt;code&gt;test_tiny_reference_extraction&lt;/code&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
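&lt;p&gt;To illustrate the point about comparing &lt;code&gt;Counter&lt;/code&gt; objects, here is a
minimal sketch of the two approaches mentioned above. The counts and the
names &lt;code&gt;observed&lt;/code&gt;, &lt;code&gt;expected_pairs&lt;/code&gt; and &lt;code&gt;expected_dict&lt;/code&gt;
are made up for illustration and are not taken from the test file.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical counts, loosely modelled on the record types in this post.
observed = Counter(['StepCount', 'DistanceWalkingRunning', 'StepCount'])

# Approach 1: sort the (key, count) pairs and compare them with an
# expected list that has been written out in key order.
expected_pairs = [('DistanceWalkingRunning', 1), ('StepCount', 2)]
pairs_match = sorted(observed.items()) == expected_pairs

# Approach 2: compare as dictionaries, which ignores ordering entirely.
expected_dict = {'StepCount': 2, 'DistanceWalkingRunning': 1}
dicts_match = dict(observed) == expected_dict

print(pairs_match, dicts_match)
```

&lt;p&gt;Either comparison works; the sorted version tends to give a more readable
diff when a test fails, because the pairs appear in a predictable order.&lt;/p&gt;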
&lt;h3 id="closing"&gt;Closing&lt;/h3&gt;
&lt;p&gt;That's it for this post.
The unit tests are not terribly exciting, but they will prove useful
as we extend the extraction code, which we'll start to do in the next
post.&lt;/p&gt;
&lt;p&gt;In a few posts' time, we will start analysing the data extracted from
the app; it will be interesting to see whether, at that stage, we discover
any more serious problems with the extraction code. Experience teaches
that we probably will.&lt;/p&gt;</content><category term="TDDA"></category><category term="xml"></category><category term="apple"></category><category term="health"></category></entry><entry><title>First Test</title><link href="https://tdda.info/first-test.html" rel="alternate"></link><published>2016-04-18T16:20:00+01:00</published><updated>2016-04-18T16:20:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2016-04-18:/first-test.html</id><summary type="html">&lt;p&gt;In the &lt;a href="https://tdda.info/in-defence-of-xml-exporting-and-analysing-apple-health-data"&gt;last post&lt;/a&gt;,
I presented some code for extracting (some of) the data from the XML
file exported by the &lt;code&gt;Apple Health&lt;/code&gt; app on iOS, but—almost
comically, given this blog's theme—omitted to include any tests.
This post and the next couple (in quick succession) will aim to …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In the &lt;a href="https://tdda.info/in-defence-of-xml-exporting-and-analysing-apple-health-data"&gt;last post&lt;/a&gt;,
I presented some code for extracting (some of) the data from the XML
file exported by the &lt;code&gt;Apple Health&lt;/code&gt; app on iOS, but—almost
comically, given this blog's theme—omitted to include any tests.
This post and the next couple (in quick succession) will aim to fix that.&lt;/p&gt;
&lt;p&gt;This post makes a start by writing a single
&lt;a href="pages/glossary.html#reference-test"&gt;"reference" test&lt;/a&gt;.
To recap: a &lt;em&gt;reference test&lt;/em&gt; is a test that exercises a whole analytical process,
checking that known inputs produce the expected outputs.
So far, our analytical process is quite small, consisting only of
data extraction, but this will still prove very worthwhile.&lt;/p&gt;
&lt;h3 id="dogma"&gt;Dogma&lt;/h3&gt;
&lt;p&gt;While the mainstream TDD dogma states that tests should be written
&lt;em&gt;before&lt;/em&gt; the code, it is far from uncommon to write them afterwards,
and in the context of &lt;em&gt;test-driven data analysis&lt;/em&gt; I maintain that this
is usually preferable. Regardless, when you find yourself in a
situation in which you have written some code and possess any
reasonable level of belief that it might be right,&lt;sup id="fnref:eg"&gt;&lt;a class="footnote-ref" href="#fn:eg"&gt;1&lt;/a&gt;&lt;/sup&gt; an excellent
starting point is simply to capture the input(s) that you have already
used, together with the output that the code generates from them, and write a test
that checks that the input you provided produces the expected output.
That's exactly the procedure I advocated for TDDA, and that's how we
shall start here.&lt;/p&gt;
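&lt;p&gt;In its barest form, the pattern might look like the sketch below. This is
not the code from the repository: &lt;code&gt;summarise&lt;/code&gt; is a hypothetical stand-in
for an analytical process, and the file names and expected string are invented
for the example.&lt;/p&gt;

```python
import os
import shutil
import tempfile
import unittest


def summarise(in_path, out_path):
    """Hypothetical stand-in for an analytical process: counts lines."""
    with open(in_path) as f:
        n = sum(1 for _ in f)
    with open(out_path, 'w') as f:
        f.write('lines,%d\n' % n)


class TestSummariseReference(unittest.TestCase):
    def test_known_input_gives_captured_output(self):
        # Work in a throwaway directory that is removed afterwards.
        tmp_dir = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, tmp_dir)
        in_path = os.path.join(tmp_dir, 'input.txt')
        out_path = os.path.join(tmp_dir, 'output.csv')

        # The known input.
        with open(in_path, 'w') as f:
            f.write('a\nb\nc\n')

        summarise(in_path, out_path)
        with open(out_path) as f:
            actual = f.read()

        # The expected output, captured from an earlier run that was
        # believed to be correct.
        self.assertEqual(actual, 'lines,3\n')
```

&lt;p&gt;In practice the known input and the captured output would normally live in
files alongside the tests rather than inline.&lt;/p&gt;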
&lt;h3 id="test-data"&gt;Test Data&lt;/h3&gt;
&lt;p&gt;The only flies in the ointment in this case are&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;the input data I used initially was quite large
     (5.5MB compressed; 109MB uncompressed),
     leading to quite a slow test;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the data is somewhat personal.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For both these reasons, I have decided to reduce it so that it will be more
manageable, run more quickly, and be more suitable for public sharing.&lt;/p&gt;
&lt;p&gt;So I cut down the data to contain only the DTD header, the &lt;code&gt;Me&lt;/code&gt;
record, ten &lt;code&gt;StepCount&lt;/code&gt; records, and five &lt;code&gt;DistanceWalkingRunning&lt;/code&gt;
records.  That results in a small, valid XML file (under 7K)
containing exactly 100 lines.  It's in the &lt;code&gt;testdata&lt;/code&gt; subdirectory of
the repository, and if I run the extractor on it (which you probably don't want to do, at
least &lt;em&gt;in situ&lt;/em&gt;, as that will trample over the reference output), the
following output is produced:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python applehealthdata/applehealthdata.py testdata/export6s3sample.xml
Reading data from testdata/export6s3sample.xml . . . &lt;span class="k"&gt;done&lt;/span&gt;

Tags:
ActivitySummary: &lt;span class="m"&gt;2&lt;/span&gt;
ExportDate: &lt;span class="m"&gt;1&lt;/span&gt;
Me: &lt;span class="m"&gt;1&lt;/span&gt;
Record: &lt;span class="m"&gt;15&lt;/span&gt;

Fields:
HKCharacteristicTypeIdentifierBiologicalSex: &lt;span class="m"&gt;1&lt;/span&gt;
HKCharacteristicTypeIdentifierBloodType: &lt;span class="m"&gt;1&lt;/span&gt;
HKCharacteristicTypeIdentifierDateOfBirth: &lt;span class="m"&gt;1&lt;/span&gt;
HKCharacteristicTypeIdentifierFitzpatrickSkinType: &lt;span class="m"&gt;1&lt;/span&gt;
activeEnergyBurned: &lt;span class="m"&gt;2&lt;/span&gt;
activeEnergyBurnedGoal: &lt;span class="m"&gt;2&lt;/span&gt;
activeEnergyBurnedUnit: &lt;span class="m"&gt;2&lt;/span&gt;
appleExerciseTime: &lt;span class="m"&gt;2&lt;/span&gt;
appleExerciseTimeGoal: &lt;span class="m"&gt;2&lt;/span&gt;
appleStandHours: &lt;span class="m"&gt;2&lt;/span&gt;
appleStandHoursGoal: &lt;span class="m"&gt;2&lt;/span&gt;
creationDate: &lt;span class="m"&gt;15&lt;/span&gt;
dateComponents: &lt;span class="m"&gt;2&lt;/span&gt;
endDate: &lt;span class="m"&gt;15&lt;/span&gt;
sourceName: &lt;span class="m"&gt;15&lt;/span&gt;
startDate: &lt;span class="m"&gt;15&lt;/span&gt;
type: &lt;span class="m"&gt;15&lt;/span&gt;
unit: &lt;span class="m"&gt;15&lt;/span&gt;
value: &lt;span class="m"&gt;16&lt;/span&gt;

Record types:
DistanceWalkingRunning: &lt;span class="m"&gt;5&lt;/span&gt;
StepCount: &lt;span class="m"&gt;10&lt;/span&gt;

Opening /Users/njr/qs/testdata/StepCount.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Opening /Users/njr/qs/testdata/DistanceWalkingRunning.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Written StepCount data.
Written DistanceWalkingRunning data.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The two CSV files it writes, which are also in the &lt;code&gt;testdata&lt;/code&gt; subdirectory
in the repository, are as follows:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ cat testdata/StepCount.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;StepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:47 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:27:54 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:27:59 +0100,329
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;StepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:47 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:34:09 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:34:14 +0100,283
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;StepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:47 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:39:29 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:39:34 +0100,426
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;StepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:48 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:45:36 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:45:41 +0100,61
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;StepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:48 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:51:16 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:51:21 +0100,10
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;StepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:48 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:57:40 +0100,2014-09-13 &lt;span class="m"&gt;10&lt;/span&gt;:57:45 +0100,200
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;StepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:48 +0100,2014-09-13 &lt;span class="m"&gt;11&lt;/span&gt;:03:00 +0100,2014-09-13 &lt;span class="m"&gt;11&lt;/span&gt;:03:05 +0100,390
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;StepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:48 +0100,2014-09-13 &lt;span class="m"&gt;11&lt;/span&gt;:08:10 +0100,2014-09-13 &lt;span class="m"&gt;11&lt;/span&gt;:08:15 +0100,320
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;StepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:48 +0100,2014-09-13 &lt;span class="m"&gt;11&lt;/span&gt;:27:22 +0100,2014-09-13 &lt;span class="m"&gt;11&lt;/span&gt;:27:27 +0100,216
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;StepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:48 +0100,2014-09-13 &lt;span class="m"&gt;11&lt;/span&gt;:33:24 +0100,2014-09-13 &lt;span class="m"&gt;11&lt;/span&gt;:33:29 +0100,282
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ cat testdata/DistanceWalkingRunning.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;DistanceWalkingRunning&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;km&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:49 +0100,2014-09-20 &lt;span class="m"&gt;10&lt;/span&gt;:41:28 +0100,2014-09-20 &lt;span class="m"&gt;10&lt;/span&gt;:41:30 +0100,0.00288
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;DistanceWalkingRunning&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;km&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:49 +0100,2014-09-20 &lt;span class="m"&gt;10&lt;/span&gt;:41:30 +0100,2014-09-20 &lt;span class="m"&gt;10&lt;/span&gt;:41:33 +0100,0.00284
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;DistanceWalkingRunning&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;km&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:49 +0100,2014-09-20 &lt;span class="m"&gt;10&lt;/span&gt;:41:33 +0100,2014-09-20 &lt;span class="m"&gt;10&lt;/span&gt;:41:36 +0100,0.00142
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;DistanceWalkingRunning&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;km&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:49 +0100,2014-09-20 &lt;span class="m"&gt;10&lt;/span&gt;:43:54 +0100,2014-09-20 &lt;span class="m"&gt;10&lt;/span&gt;:43:56 +0100,0.00639
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;DistanceWalkingRunning&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;km&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;07&lt;/span&gt;:08:49 +0100,2014-09-20 &lt;span class="m"&gt;10&lt;/span&gt;:43:59 +0100,2014-09-20 &lt;span class="m"&gt;10&lt;/span&gt;:44:01 +0100,0.0059
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="reference-test"&gt;Reference Test&lt;/h3&gt;
&lt;p&gt;The code for a single reference test is below. It's slightly verbose,
because it tries to use sensible locations for everything, but not complex.&lt;/p&gt;
&lt;p&gt;As before, you can get the code from GitHub with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ git clone https://github.com/tdda/applehealthdata.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or if you have pulled it previously, you can update it with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ git pull
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This version of the code is tagged with &lt;code&gt;v1.1&lt;/code&gt;, so if it has been updated
by the time you read this, get that version with&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ git checkout v1.1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here is the code:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;testapplehealthdata.py: tests for the applehealthdata.py&lt;/span&gt;

&lt;span class="sd"&gt;Copyright (c) 2016 Nicholas J. Radcliffe&lt;/span&gt;
&lt;span class="sd"&gt;Licence: MIT&lt;/span&gt;
&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;absolute_import&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;division&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;print_function&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unicode_literals&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;shutil&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;unittest&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;applehealthdata&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HealthDataExtractor&lt;/span&gt;

&lt;span class="n"&gt;CLEAN_UP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;VERBOSE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_base_dir&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Return the directory containing this test file,&lt;/span&gt;
&lt;span class="sd"&gt;    which will (normally) be the applyhealthdata directory&lt;/span&gt;
&lt;span class="sd"&gt;    also containing the testdata dir.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vm"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;))[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_testdata_dir&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return the full path to the testdata directory&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_base_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;testdata&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Return the full path to the tmp directory&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_base_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;tmp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_any_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Remove the temporary directory if it exists.&lt;/span&gt;
&lt;span class="sd"&gt;    Returns its location either way.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;tmp_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp_dir&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rmtree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tmp_dir&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Remove any existing tmp directory.&lt;/span&gt;
&lt;span class="sd"&gt;    Create empty tmp direcory.&lt;/span&gt;
&lt;span class="sd"&gt;    Return the location of the tmp dir.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;tmp_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;remove_any_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tmp_dir&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;copy_test_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Copy the test data export6s3sample.xml from testdata directory&lt;/span&gt;
&lt;span class="sd"&gt;    to tmp directory.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;tmp_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;export6s3sample.xml&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;in_xml_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_testdata_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out_xml_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copyfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_xml_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_xml_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out_xml_file&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestAppleHealthDataExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unittest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tearDownClass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Clean up by removing the tmp directory, if it exists.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;CLEAN_UP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;remove_any_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;expected_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_testdata_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;actual_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_tmp_dir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_tiny_fixed_extraction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;copy_test_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HealthDataExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VERBOSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;check_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;StepCount.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;check_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DistanceWalkingRunning.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;unittest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3 id="running-the-test"&gt;Running the Test&lt;/h3&gt;
&lt;p&gt;This is what I get if I run it:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ python testapplehealthdata&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;py&lt;/span&gt;
&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 1 test in 0&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;007s&lt;/span&gt;

&lt;span class="c"&gt;OK&lt;/span&gt;
&lt;span class="c"&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That's encouraging, but not particularly informative. If we change
the value of &lt;code&gt;VERBOSE&lt;/code&gt; at the top of the test file to &lt;code&gt;True&lt;/code&gt;, we see
slightly more reassuring output:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python testapplehealthdata.py
Reading data from /Users/njr/qs/applehealthdata/tmp/export6s3sample.xml . . . &lt;span class="k"&gt;done&lt;/span&gt;
Opening /Users/njr/qs/applehealthdata/tmp/StepCount.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Opening /Users/njr/qs/applehealthdata/tmp/DistanceWalkingRunning.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Written StepCount data.
Written DistanceWalkingRunning data.
.
----------------------------------------------------------------------
Ran &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.006s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; The &lt;code&gt;tearDownClass&lt;/code&gt; method is a special Python class method
that the unit testing framework runs after executing all the tests in
the class, regardless of whether they pass, fail or produce errors.  I
use it to remove the &lt;code&gt;tmp&lt;/code&gt; directory containing any test output, which
is normally good practice.  In a later post, we'll either modify this
to leave the output around if any tests fail, or make some other
change to make it easier to diagnose what's gone wrong.  In the
meantime, if you change the value of &lt;code&gt;CLEAN_UP&lt;/code&gt;, towards the top of
the code, to &lt;code&gt;False&lt;/code&gt;, it will leave the &lt;code&gt;tmp&lt;/code&gt; directory around,
allowing you to examine the files it has produced.&lt;/p&gt;
&lt;h3 id="overview"&gt;Overview&lt;/h3&gt;
&lt;p&gt;The test itself is in the 5-line method &lt;code&gt;test_tiny_fixed_extraction&lt;/code&gt;.
Here's what the five lines do:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Copy the input XML file from the&lt;/em&gt; &lt;code&gt;testdata&lt;/code&gt;
     &lt;em&gt;directory to the&lt;/em&gt; &lt;code&gt;tmp&lt;/code&gt; &lt;em&gt;directory.&lt;/em&gt;
     The GitHub repository contains the 100-line input XML file together with
     the expected output in the &lt;code&gt;testdata&lt;/code&gt; subdirectory.
     Because the data extractor writes the CSV files next to the input
     data, the cleanest thing for us to do is to take a copy of the input
     data, write it into a new directory (&lt;code&gt;applehealthdata/tmp&lt;/code&gt;)
     and also to use that directory as the location for the output CSV files.
     The &lt;code&gt;copy_test_data&lt;/code&gt; function removes any existing &lt;code&gt;tmp&lt;/code&gt; directory
     it finds, creates a fresh one, copies the input test data into it
     and returns the path to the test data file.
     The only "magic" here is that the &lt;code&gt;get_base_dir&lt;/code&gt; function
     figures out where to locate everything by using &lt;code&gt;__file__&lt;/code&gt;, which
     is the location of the source file being executed by Python.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Create a&lt;/em&gt; &lt;code&gt;HealthDataExtractor&lt;/code&gt;
     &lt;em&gt;object, using the location of the copy of the input data returned by&lt;/em&gt;
     &lt;code&gt;copy_test_data()&lt;/code&gt;.
     Note that it sets &lt;code&gt;verbose&lt;/code&gt; to &lt;code&gt;False&lt;/code&gt;, making the
     test silent, and allowing the line of dots from a successful test run
     (in this case, a single dot) to be presented without interruption.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Extract the data&lt;/em&gt;.
     This writes two output files to the &lt;code&gt;applehealthdata/tmp&lt;/code&gt; directory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Check that the contents of&lt;/em&gt; &lt;code&gt;tmp/StepCount.csv&lt;/code&gt;
     &lt;em&gt;match the reference output in&lt;/em&gt; &lt;code&gt;testdata/StepCount.csv&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Check that the contents of&lt;/em&gt; &lt;code&gt;tmp/DistanceWalkingRunning.csv&lt;/code&gt;
     &lt;em&gt;match the reference output in&lt;/em&gt; &lt;code&gt;testdata/DistanceWalkingRunning.csv&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="write-test-break-run-repair-rerun"&gt;Write-Test-Break-Run-Repair-Rerun&lt;/h3&gt;
&lt;p&gt;In cases in which the tests are written after the code, it's important
to check that they really are running correctly.  My usual approach to
that is to write the test, and if it appears to pass first
time,&lt;sup id="fnref:notalways"&gt;&lt;a class="footnote-ref" href="#fn:notalways"&gt;2&lt;/a&gt;&lt;/sup&gt; to break it deliberately to verify that it fails
when it should, before repairing it. In this case, the simplest way to
break the test is to change the reference data temporarily. This
will also reveal a weakness in the current &lt;code&gt;check_file&lt;/code&gt; function.&lt;/p&gt;
&lt;p&gt;We'll try three variants of this:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Variant 1: Break the StepCount.csv reference data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;First, I add a &lt;code&gt;Z&lt;/code&gt; to the end of &lt;code&gt;testdata/StepCount.csv&lt;/code&gt; and re-run
the tests:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python testapplehealthdata.py
&lt;span class="nv"&gt;F&lt;/span&gt;
&lt;span class="o"&gt;======================================================================&lt;/span&gt;
FAIL: test_tiny_fixed_extraction &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestAppleHealthDataExtractor&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;testapplehealthdata.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;98&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; test_tiny_fixed_extraction
    self.check_file&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;StepCount.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  File &lt;span class="s2"&gt;&amp;quot;testapplehealthdata.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;92&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; check_file
    self.assertEqual&lt;span class="o"&gt;(&lt;/span&gt;expected, actual&lt;span class="o"&gt;)&lt;/span&gt;
AssertionError: &lt;span class="s1"&gt;&amp;#39;sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\nZ&amp;#39;&lt;/span&gt; !&lt;span class="o"&gt;=&lt;/span&gt; &lt;span 
class="s1"&gt;&amp;#39;sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n&amp;#39;&lt;/span&gt;

----------------------------------------------------------------------
Ran &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.005s

FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That causes the expected failure.
However, because we've compared the entire contents of the
two CSV files as single strings, it's hard to see what's actually gone wrong.
We'll address this by improving the &lt;code&gt;check_file&lt;/code&gt; method in a later post.&lt;/p&gt;
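&lt;p&gt;As an illustration of the kind of improvement that might help, here is a
minimal sketch that reports only the lines that differ, using Python's standard
&lt;code&gt;difflib&lt;/code&gt; module. The helper name &lt;code&gt;describe_difference&lt;/code&gt;
and the sample strings are hypothetical, and this is not necessarily the approach
taken in the later post.&lt;/p&gt;

```python
import difflib


def describe_difference(expected, actual, name="file"):
    """Return a unified diff of two strings, or '' if they are equal.

    Hypothetical helper for illustration; not part of the test file.
    """
    diff = difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile="expected/" + name,
        tofile="actual/" + name,
        n=0,  # no context lines: show only the lines that differ
    )
    return "".join(diff)


# Made-up miniature versions of the reference and actual CSV contents,
# mimicking Variant 1 (a stray Z at the end of the reference data).
expected = "a,b\n1,2\n3,4\nZ"
actual = "a,b\n1,2\n3,4\n"
report = describe_difference(expected, actual, "StepCount.csv")
print(report)
```

&lt;p&gt;A test could then fail with this report as its message, rather than with
the full contents of both files.&lt;/p&gt;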
&lt;p&gt;&lt;strong&gt;Variant 2: Break the DistanceWalkingRunning.csv reference data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;After restoring the &lt;code&gt;StepCount.csv&lt;/code&gt; data,
I modify the reference &lt;code&gt;testdata/DistanceWalkingRunning.csv&lt;/code&gt; data.
This time, I'll change &lt;code&gt;Health&lt;/code&gt; to &lt;code&gt;Wealth&lt;/code&gt; throughout.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python testapplehealthdata.py
&lt;span class="nv"&gt;F&lt;/span&gt;
&lt;span class="o"&gt;======================================================================&lt;/span&gt;
FAIL: test_tiny_fixed_extraction &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestAppleHealthDataExtractor&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;testapplehealthdata.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;99&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; test_tiny_fixed_extraction
    self.check_file&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DistanceWalkingRunning.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  File &lt;span class="s2"&gt;&amp;quot;testapplehealthdata.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;92&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; check_file
    self.assertEqual&lt;span class="o"&gt;(&lt;/span&gt;expected, actual&lt;span class="o"&gt;)&lt;/span&gt;
AssertionError: &lt;span class="s1"&gt;&amp;#39;sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n&amp;quot;Wealth&amp;quot;,,,&amp;quot;DistanceWalkingRunning&amp;quot;,&amp;quot;km&amp;quot;,2014-09-21 07:08:49 +0100,2014-09-20 10:41:28 +0100,2014-09-20 10:41:30 +0100,0.00288\n&amp;quot;Wealth&amp;quot;,,,&amp;quot;DistanceWalkingRunning&amp;quot;,&amp;quot;km&amp;quot;,2014-09-21 07:08:49 +0100,2014-09-20 10:41:30 +0100,2014-09-20 10:41:33 +0100,0.00284\n&amp;quot;Wealth&amp;quot;,,,&amp;quot;DistanceWalkingRunning&amp;quot;,&amp;quot;km&amp;quot;,2014-09-21 07:08:49 +0100,2014-09-20 10:41:33 +0100,2014-09-20 10:41:36 +0100,0.00142\n&amp;quot;Wealth&amp;quot;,,,&amp;quot;DistanceWalkingRunning&amp;quot;,&amp;quot;km&amp;quot;,2014-09-21 07:08:49 +0100,2014-09-20 10:43:54 +0100,2014-09-20 10:43:56 +0100,0.00639\n&amp;quot;Wealth&amp;quot;,,,&amp;quot;DistanceWalkingRunning&amp;quot;,&amp;quot;km&amp;quot;,2014-09-21 07:08:49 +0100,2014-09-20 10:43:59 +0100,2014-09-20 10:44:01 +0100,0.0059\n&amp;#39;&lt;/span&gt; !&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n&amp;quot;Health&amp;quot;,,,&amp;quot;DistanceWalkingRunning&amp;quot;,&amp;quot;km&amp;quot;,2014-09-21 07:08:49 +0100,2014-09-20 10:41:28 +0100,2014-09-20 10:41:30 +0100,0.00288\n&amp;quot;Health&amp;quot;,,,&amp;quot;DistanceWalkingRunning&amp;quot;,&amp;quot;km&amp;quot;,2014-09-21 07:08:49 +0100,2014-09-20 10:41:30 +0100,2014-09-20 10:41:33 +0100,0.00284\n&amp;quot;Health&amp;quot;,,,&amp;quot;DistanceWalkingRunning&amp;quot;,&amp;quot;km&amp;quot;,2014-09-21 07:08:49 +0100,2014-09-20 10:41:33 +0100,2014-09-20 10:41:36 +0100,0.00142\n&amp;quot;Health&amp;quot;,,,&amp;quot;DistanceWalkingRunning&amp;quot;,&amp;quot;km&amp;quot;,2014-09-21 07:08:49 +0100,2014-09-20 10:43:54 +0100,2014-09-20 10:43:56 
+0100,0.00639\n&amp;quot;Health&amp;quot;,,,&amp;quot;DistanceWalkingRunning&amp;quot;,&amp;quot;km&amp;quot;,2014-09-21 07:08:49 +0100,2014-09-20 10:43:59 +0100,2014-09-20 10:44:01 +0100,0.0059\n&amp;#39;&lt;/span&gt;

----------------------------------------------------------------------
Ran &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.005s

FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The story is very much the same: the test has failed, which is good,
but again the source of difference is hard to discern.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Variant 3: Break the input XML Data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;After restoring &lt;code&gt;DistanceWalkingRunning.csv&lt;/code&gt;,
I modify the input XML file.
In this case, I'll just change the first step count to be 330 instead of 329:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python testapplehealthdata.py
&lt;span class="nv"&gt;F&lt;/span&gt;
&lt;span class="o"&gt;======================================================================&lt;/span&gt;
FAIL: test_tiny_fixed_extraction &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestAppleHealthDataExtractor&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;testapplehealthdata.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;98&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; test_tiny_fixed_extraction
    self.check_file&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;StepCount.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  File &lt;span class="s2"&gt;&amp;quot;testapplehealthdata.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;92&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; check_file
    self.assertEqual&lt;span class="o"&gt;(&lt;/span&gt;expected, actual&lt;span class="o"&gt;)&lt;/span&gt;
AssertionError: &lt;span class="s1"&gt;&amp;#39;sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n&amp;#39;&lt;/span&gt; !&lt;span class="o"&gt;=&lt;/span&gt; &lt;span 
class="s1"&gt;&amp;#39;sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,330\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n&amp;quot;Health&amp;quot;,,,&amp;quot;StepCount&amp;quot;,&amp;quot;count&amp;quot;,2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n&amp;#39;&lt;/span&gt;

----------------------------------------------------------------------
Ran &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.005s

FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Again, we get the expected failure, and again it's hard to see what it is.
(We really will need to improve &lt;code&gt;check_file&lt;/code&gt;.)&lt;/p&gt;
&lt;h3 id="enough"&gt;Enough&lt;/h3&gt;
&lt;p&gt;That's enough for this post.
We've successfully added a single
&lt;a href="pages/glossary.html#reference-test"&gt;"reference" test&lt;/a&gt;
to the code, which should at least make sure that if we break it during
further enhancements, we will notice. It will also check that it is working
correctly on other platforms (e.g., yours).&lt;/p&gt;
&lt;p&gt;We haven't done anything to check that the CSV files produced are genuinely
right, beyond the initial eyeballing I did when first extracting the data.
But if we see problems when we start doing proper analysis, it will be easy
to correct the expected output to keep the test running. And in the meantime,
we'll notice if we make changes to the code that alter the output
when they weren't meant to do so.
This is one part of the pragmatic essence of basic TDDA.&lt;/p&gt;
&lt;p&gt;We also haven't written any unit tests at all for the extraction code;
we'll do that in a later post.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:eg"&gt;
&lt;p&gt;For example, you might have already blogged about it
   and pushed it to a public repository on GitHub&amp;#160;&lt;a class="footnote-backref" href="#fnref:eg" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:notalways"&gt;
&lt;p&gt;Which is not &lt;em&gt;always&lt;/em&gt; the case&amp;#160;&lt;a class="footnote-backref" href="#fnref:notalways" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="xml"></category><category term="apple"></category><category term="health"></category></entry><entry><title>In Defence of XML: Exporting and Analysing Apple Health Data</title><link href="https://tdda.info/in-defence-of-xml-exporting-and-analysing-apple-health-data.html" rel="alternate"></link><published>2016-04-15T15:35:00+01:00</published><updated>2016-04-15T15:35:00+01:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2016-04-15:/in-defence-of-xml-exporting-and-analysing-apple-health-data.html</id><summary type="html">&lt;p&gt;I'm going to present a series of posts based around the sort of health
and fitness data that can now be collected by some phones and dedicated
fitness trackers. Not all of these will be centrally on topic for
&lt;em&gt;test-driven data analysis,&lt;/em&gt; but I think they'll provide an interesting
set …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I'm going to present a series of posts based around the sort of health
and fitness data that can now be collected by some phones and dedicated
fitness trackers. Not all of these will be centrally on topic for
&lt;em&gt;test-driven data analysis,&lt;/em&gt; but I think they'll provide an interesting
set of data for discussing many issues of relevance, so I hope readers
will forgive me to the extent that these stray from the central theme.&lt;/p&gt;
&lt;p&gt;The particular focus for this series will be the data available from an
iPhone and the Apple Health app, over a couple of different phones, and
with a couple of different devices paired to them.&lt;/p&gt;
&lt;p&gt;In particular, the setup will be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Apple iPhone 6s (November 2015 to present)&lt;/li&gt;
&lt;li&gt;Apple iPhone 5s (with fitness data from Sept 2014 to Nov 2015)&lt;/li&gt;
&lt;li&gt;Several Misfit Shine activity trackers (until early March 2016)&lt;/li&gt;
&lt;li&gt;An Apple Watch (about a month of data, to date)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="getting-data-out-of-apple-health-the-exploratory-version"&gt;Getting data out of Apple Health (The Exploratory Version)&lt;/h2&gt;
&lt;p&gt;I hadn't initially spotted a way to get the data out of Apple's Health app,
but a quick web search&lt;sup id="fnref:ddg"&gt;&lt;a class="footnote-ref" href="#fn:ddg"&gt;1&lt;/a&gt;&lt;/sup&gt; turned up this very helpful article:
&lt;a href="https://www.idownloadblog.com/2015/06/10/how-to-export-import-health-data/"&gt;https://www.idownloadblog.com/2015/06/10/how-to-export-import-health-data&lt;/a&gt;.
It turns out there is a properly supported way to export granular data
from Apple Health, described in detail in the post. Essentially:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Open the Apple Health App.&lt;/li&gt;
&lt;li&gt;Navigate to the Health Data section (left icon at the bottom)&lt;/li&gt;
&lt;li&gt;Select &lt;code&gt;All&lt;/code&gt; from the list of categories&lt;/li&gt;
&lt;li&gt;There is a share icon at the top right (a vertical arrow sticking
    up from a square)&lt;/li&gt;
&lt;li&gt;Tap that to export all data&lt;/li&gt;
&lt;li&gt;It thinks for a while (quite a while, in fact)
    and then offers you various export options,
    which for me included Airdrop, email and handing the data to other
    apps. I used Airdrop to dump it onto a Mac.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is a compressed XML file called &lt;code&gt;export.zip&lt;/code&gt;.
For me, this was about 5.5MB, which expanded to 109MB when unzipped.
(Interestingly, I started this with an earlier export a couple of weeks ago,
when the zipped file was about 5MB and the expanded version was 90MB, so it
is growing fairly quickly, thanks to the Watch.)&lt;/p&gt;
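&lt;p&gt;For anyone who wants to check the compressed and expanded sizes without
unzipping by hand, a sketch along these lines should work with Python's
standard &lt;code&gt;zipfile&lt;/code&gt; module. The function name
&lt;code&gt;archive_sizes&lt;/code&gt; is hypothetical, and the archive built here is a
small in-memory stand-in with placeholder content, not a real export.&lt;/p&gt;

```python
import io
import zipfile


def archive_sizes(zip_source):
    """Return (name, compressed_bytes, expanded_bytes) for each member."""
    with zipfile.ZipFile(zip_source) as archive:
        return [(info.filename, info.compress_size, info.file_size)
                for info in archive.infolist()]


# Build a small stand-in archive in memory; with a real export,
# zip_source would instead be a path such as 'export.zip'.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("export.xml", "x" * 10000)  # placeholder content

sizes = archive_sizes(buffer)
for name, compressed, expanded in sizes:
    print(name, compressed, expanded)
```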
&lt;p&gt;As helpful as the iDownloadBlog article is, I have to comment
on its introduction to exporting data, which reads&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are actually two ways to export the data from your Health app.
The first way, is one provided by Apple, but it is virtually useless.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To be fair to iDownloadBlog, an XML file like this probably is useless
to the general reader, but it builds on a meme fashionable among
developers and data scientists to the effect of "XML is painful to
process, verbose and always worse than JSON", and I think this is
somewhat unfair.&lt;/p&gt;
&lt;p&gt;Let's explore &lt;code&gt;export.xml&lt;/code&gt; using Python and the &lt;code&gt;ElementTree&lt;/code&gt; library.
Although the decompressed file is quite large (109MB), it's certainly
not problematically large to read into memory on a modern machine, so
I'm not going to worry about reading it in bits: I'm just going to
find out as quickly as possible what's in it.&lt;/p&gt;
&lt;p&gt;The first thing to do, of course, is simply to look at the file, probably
using either the &lt;code&gt;more&lt;/code&gt; or &lt;code&gt;less&lt;/code&gt; command, assuming you are on some flavour
of Unix or Linux. Let's look at the top of my &lt;code&gt;export.xml&lt;/code&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ head -79 export6s3/export.xml
&lt;span class="cp"&gt;&amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!DOCTYPE HealthData [&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!-- HealthKit Export Version: 3 --&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout|ActivitySummary)*)&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ATTLIST HealthData&lt;/span&gt;
&lt;span class="cp"&gt;  locale CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ELEMENT ExportDate EMPTY&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ATTLIST ExportDate&lt;/span&gt;
&lt;span class="cp"&gt;  value CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ELEMENT Me EMPTY&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ATTLIST Me&lt;/span&gt;
&lt;span class="cp"&gt;  HKCharacteristicTypeIdentifierDateOfBirth         CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  HKCharacteristicTypeIdentifierBiologicalSex       CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  HKCharacteristicTypeIdentifierBloodType           CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  HKCharacteristicTypeIdentifierFitzpatrickSkinType CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ELEMENT Record (MetadataEntry*)&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ATTLIST Record&lt;/span&gt;
&lt;span class="cp"&gt;  type          CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  unit          CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  value         CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  sourceName    CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  sourceVersion CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  device        CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  creationDate  CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  startDate     CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  endDate       CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="cm"&gt;&amp;lt;!-- Note: Any Records that appear as children of a correlation also appear as top-level records in this document. --&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ELEMENT Correlation ((MetadataEntry|Record)*)&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ATTLIST Correlation&lt;/span&gt;
&lt;span class="cp"&gt;  type          CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  sourceName    CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  sourceVersion CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  device        CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  creationDate  CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  startDate     CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  endDate       CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ELEMENT Workout ((MetadataEntry|WorkoutEvent)*)&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ATTLIST Workout&lt;/span&gt;
&lt;span class="cp"&gt;  workoutActivityType   CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  duration              CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  durationUnit          CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  totalDistance         CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  totalDistanceUnit     CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  totalEnergyBurned     CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  totalEnergyBurnedUnit CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  sourceName            CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  sourceVersion         CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  device                CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  creationDate          CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  startDate             CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  endDate               CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ELEMENT WorkoutEvent EMPTY&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ATTLIST WorkoutEvent&lt;/span&gt;
&lt;span class="cp"&gt;  type CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  date CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ELEMENT ActivitySummary EMPTY&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ATTLIST ActivitySummary&lt;/span&gt;
&lt;span class="cp"&gt;  dateComponents           CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  activeEnergyBurned       CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  activeEnergyBurnedGoal   CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  activeEnergyBurnedUnit   CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  appleExerciseTime        CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  appleExerciseTimeGoal    CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  appleStandHours          CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;  appleStandHoursGoal      CDATA #IMPLIED&lt;/span&gt;
&lt;span class="cp"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ELEMENT MetadataEntry EMPTY&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!ATTLIST MetadataEntry&lt;/span&gt;
&lt;span class="cp"&gt;  key   CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;  value CDATA #REQUIRED&lt;/span&gt;
&lt;span class="cp"&gt;&amp;gt;&lt;/span&gt;
]&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is immediately encouraging: Apple has provided &lt;code&gt;DOCTYPE&lt;/code&gt; (DTD)
information, which, even though slightly old-fashioned, tells us what
we should expect to find in the file. DTDs &lt;em&gt;are&lt;/em&gt; awkward to use,
and when they come from untrusted sources, they can leave the user potentially
&lt;a href="https://docs.python.org/2/library/xml.html#xml-vulnerabilities"&gt;vulnerable to malicious attacks&lt;/a&gt;, but despite this, they are quite expressive and helpful,
even just as plain-text documentation.&lt;/p&gt;
&lt;p&gt;Roughly speaking, the lines:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;&amp;lt;!ELEMENT&lt;/span&gt; &lt;span class="nt"&gt;HealthData&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;ExportDate&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nt"&gt;Me&lt;/span&gt;&lt;span class="o"&gt;,(&lt;/span&gt;&lt;span class="nt"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;Correlation&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;Workout&lt;/span&gt;&lt;span class="o"&gt;)*)&lt;/span&gt;&lt;span class="k"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;&amp;lt;!ATTLIST&lt;/span&gt; &lt;span class="nt"&gt;HealthData&lt;/span&gt;
  &lt;span class="na"&gt;locale&lt;/span&gt; &lt;span class="kc"&gt;CDATA&lt;/span&gt; &lt;span class="kc"&gt;#REQUIRED&lt;/span&gt;
&lt;span class="k"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;say&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;that the top element will be a &lt;code&gt;HealthData&lt;/code&gt; element&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;that this &lt;code&gt;HealthData&lt;/code&gt; element will contain&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an &lt;code&gt;ExportDate&lt;/code&gt; element&lt;/li&gt;
&lt;li&gt;a &lt;code&gt;Me&lt;/code&gt; element&lt;/li&gt;
&lt;li&gt;zero or more elements of type &lt;code&gt;Record&lt;/code&gt;, &lt;code&gt;Correlation&lt;/code&gt; or &lt;code&gt;Workout&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;and that the &lt;code&gt;HealthData&lt;/code&gt; element will have an attribute &lt;code&gt;locale&lt;/code&gt;
    (which is mandatory).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
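&lt;p&gt;Python's standard &lt;code&gt;xml.etree.ElementTree&lt;/code&gt; parser does not validate a document against its DTD, so it can be useful to write the rules above down as an explicit check. The following is only a minimal sketch under stated assumptions: the function name &lt;code&gt;matches_health_data_model&lt;/code&gt; and the idea of matching a comma-joined list of child tags against a regular expression are illustrative choices, and the tiny tree is built by hand rather than read from a real export.&lt;/p&gt;

```python
import re
from xml.etree import ElementTree as ET

# Content model from the DTD lines quoted above: ExportDate, Me, then zero
# or more of Record, Correlation or Workout. Encoding it as a regular
# expression over comma-joined tags is an illustrative assumption.
CHILD_MODEL = re.compile(r"ExportDate,Me(,(Record|Correlation|Workout))*")


def matches_health_data_model(root):
    """Return True if root satisfies the three HealthData rules listed above."""
    child_tags = ",".join(child.tag for child in root)
    return (
        root.tag == "HealthData"
        and "locale" in root.attrib
        and CHILD_MODEL.fullmatch(child_tags) is not None
    )


# Hand-built stand-in for a real export file.
root = ET.Element("HealthData", {"locale": "en_GB"})
for tag in ("ExportDate", "Me", "Record", "Workout"):
    ET.SubElement(root, tag)

print(matches_health_data_model(root))
```

&lt;p&gt;A check like this reports any top-level element that the quoted content model does not mention, which makes it a cheap way to notice when a real file and its DTD disagree.&lt;/p&gt;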
&lt;p&gt;The rest of this DTD section describes each kind of record in more detail.&lt;/p&gt;
&lt;p&gt;The next 6 lines in my XML file are as follows (spread out for readability):&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;HealthData&lt;/span&gt; &lt;span class="na"&gt;locale=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;en_GB&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
 &lt;span class="nt"&gt;&amp;lt;ExportDate&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;2016-04-15 07:27:26 +0100&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
 &lt;span class="nt"&gt;&amp;lt;Me&lt;/span&gt; &lt;span class="na"&gt;HKCharacteristicTypeIdentifierDateOfBirth=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;1965-07-31&amp;quot;&lt;/span&gt;
     &lt;span class="na"&gt;HKCharacteristicTypeIdentifierBiologicalSex=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;HKBiologicalSexMale&amp;quot;&lt;/span&gt;
     &lt;span class="na"&gt;HKCharacteristicTypeIdentifierBloodType=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;HKBloodTypeNotSet&amp;quot;&lt;/span&gt;
     &lt;span class="na"&gt;HKCharacteristicTypeIdentifierFitzpatrickSkinType=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;HKFitzpatrickSkinTypeNotSet&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
 &lt;span class="nt"&gt;&amp;lt;Record&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;HKQuantityTypeIdentifierHeight&amp;quot;&lt;/span&gt;
         &lt;span class="na"&gt;sourceName=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;
         &lt;span class="na"&gt;sourceVersion=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;9.2&amp;quot;&lt;/span&gt;
         &lt;span class="na"&gt;unit=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;cm&amp;quot;&lt;/span&gt;
         &lt;span class="na"&gt;creationDate=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;2016-01-02 09:45:10 +0100&amp;quot;&lt;/span&gt;
         &lt;span class="na"&gt;startDate=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;2016-01-02 09:44:00 +0100&amp;quot;&lt;/span&gt;
         &lt;span class="na"&gt;endDate=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;2016-01-02 09:44:00 +0100&amp;quot;&lt;/span&gt;
         &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;194&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;MetadataEntry&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;HKWasUserEntered&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;1&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
 &lt;span class="nt"&gt;&amp;lt;/Record&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, the export format is verbose, but extremely comprehensible
and comprehensive. It's also very easy to read into Python and explore.&lt;/p&gt;
&lt;p&gt;Let's do that, here in an interactive Python session:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;xml.etree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ElementTree&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;export.xml&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;     &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt; 
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;xml&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;etree&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ElementTree&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ElementTree&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mh"&gt;0x107347a50&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;ElementTree&lt;/code&gt; module turns each XML element into an &lt;code&gt;Element&lt;/code&gt;
object, described by its tag, with a few standard attributes.&lt;/p&gt;
&lt;p&gt;Inspecting the &lt;code&gt;data&lt;/code&gt; object, we find:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="vm"&gt;__dict__&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_root&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Element&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;HealthData&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mh"&gt;0x1073c2050&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;i.e., we have a single entry in &lt;code&gt;data&lt;/code&gt;—a root element called &lt;code&gt;HealthData&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Like all &lt;code&gt;Element&lt;/code&gt; objects, it has the four standard attributes:&lt;sup id="fnref:et"&gt;&lt;a class="footnote-ref" href="#fn:et"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_root&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="vm"&gt;__dict__&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;attrib&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;_children&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;These are:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attrib&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;locale&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;en_GB&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt; &amp;#39;&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;HealthData&amp;#39;&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_children&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;446702&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So nothing much apart from a locale and a whole lot of child nodes.
Let's inspect the first few of them:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_children&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Element&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ExportDate&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mh"&gt;0x1073c2090&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ExportDate&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2016-04-15 07:27:26 +0100&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Element&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Me&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mh"&gt;0x1073c2190&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Me&lt;/span&gt; &lt;span class="n"&gt;HKCharacteristicTypeIdentifierBiologicalSex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;HKBiologicalSexMale&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;HKCharacteristicTypeIdentifierBloodType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;HKBloodTypeNotSet&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;HKCharacteristicTypeIdentifierDateOfBirth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1965-07-31&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;HKCharacteristicTypeIdentifierFitzpatrickSkinType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;HKFitzpatrickSkinTypeNotSet&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Element&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Record&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mh"&gt;0x1073c2410&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Record&lt;/span&gt; &lt;span class="n"&gt;creationDate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2016-01-02 09:45:10 +0100&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;endDate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2016-01-02 09:44:00 +0100&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;sourceName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;sourceVersion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;9.2&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;startDate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2016-01-02 09:44:00 +0100&amp;quot;&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;HKQuantityTypeIdentifierHeight&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cm&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;194&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;MetadataEntry&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;HKWasUserEntered&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;
 &lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Element&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Record&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mh"&gt;0x1073c2550&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Element&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Record&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mh"&gt;0x1073c2650&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So, exactly as the DTD indicated, we have an &lt;code&gt;ExportDate&lt;/code&gt; node,
a &lt;code&gt;Me&lt;/code&gt; node and then what looks like a great number of records.
Let's confirm that:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Record&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Workout&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ActivitySummary&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So in fact, there are three kinds of nodes after the &lt;code&gt;ExportDate&lt;/code&gt; and &lt;code&gt;Me&lt;/code&gt;
records. Let's count them:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Record&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;446670&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
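&lt;p&gt;The session above reaches into the private &lt;code&gt;_root&lt;/code&gt; and &lt;code&gt;_children&lt;/code&gt; attributes, which are implementation details and may not be available in every version of Python. A minimal sketch of the same tally using only the documented API (&lt;code&gt;getroot()&lt;/code&gt;, iteration over an element, and &lt;code&gt;collections.Counter&lt;/code&gt;) might look like this; the seven-node tree is a made-up stand-in for a real export, not data from the file discussed here.&lt;/p&gt;

```python
from collections import Counter
from xml.etree import ElementTree as ET

# Made-up miniature of the export: two fixed nodes, then a mix of others.
root = ET.Element("HealthData", {"locale": "en_GB"})
for tag in ("ExportDate", "Me", "Record", "Record", "Record",
            "ActivitySummary", "Workout"):
    ET.SubElement(root, tag)
tree = ET.ElementTree(root)

top = tree.getroot()   # documented counterpart of data._root
nodes = list(top)      # iterating an element yields its direct children

# Count every tag after the leading ExportDate and Me nodes in one pass.
tag_counts = Counter(node.tag for node in nodes[2:])

print(len(nodes))
print(sorted(tag_counts.items()))
```

&lt;p&gt;A single &lt;code&gt;Counter&lt;/code&gt; gives the count for every kind of node at once, so no separate list comprehension is needed per tag.&lt;/p&gt;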

&lt;p&gt;These records are ones like the &lt;code&gt;Height&lt;/code&gt; record we saw above, though
in fact most of them are not &lt;code&gt;Height&lt;/code&gt; but either &lt;code&gt;StepCount&lt;/code&gt;,
&lt;code&gt;CaloriesBurned&lt;/code&gt; or &lt;code&gt;DistanceWalkingRunning&lt;/code&gt;, e.g.:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Record&lt;/span&gt; &lt;span class="n"&gt;creationDate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2015-01-11 07:40:15 +0000&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;endDate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2015-01-10 13:39:35 +0000&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;sourceName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;njr iPhone 6s&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;startDate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2015-01-10 13:39:32 +0000&amp;quot;&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;HKQuantityTypeIdentifierStepCount&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;4&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There is also one activity summary per day (since I got the watch).&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;acts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ActivitySummary&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;29&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first one isn't very exciting:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ActivitySummary&lt;/span&gt; &lt;span class="n"&gt;activeEnergyBurned&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;activeEnergyBurnedGoal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;activeEnergyBurnedUnit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;kcal&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;appleExerciseTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;appleExerciseTimeGoal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;30&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;appleStandHours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;appleStandHoursGoal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;12&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;dateComponents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2016-03-18&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;but they get better:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ActivitySummary&lt;/span&gt; &lt;span class="n"&gt;activeEnergyBurned&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;652.014&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;activeEnergyBurnedGoal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;500&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;activeEnergyBurnedUnit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;kcal&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;appleExerciseTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;77&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;appleExerciseTimeGoal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;30&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;appleStandHours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;17&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;appleStandHoursGoal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;12&amp;quot;&lt;/span&gt;
                 &lt;span class="n"&gt;dateComponents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2016-03-20&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
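&lt;p&gt;Every attribute value arrives as a string, so comparing a measure with its goal needs a conversion first. The sketch below is an illustration only: the helper name &lt;code&gt;goal_ratios&lt;/code&gt; and the convention of pairing each key with a twin ending in &lt;code&gt;Goal&lt;/code&gt; are assumptions based on the attribute names shown above, and the dictionary simply copies the values from the second summary.&lt;/p&gt;

```python
# Attribute values copied from the ActivitySummary for 2016-03-20 above.
attrib = {
    "activeEnergyBurned": "652.014",
    "activeEnergyBurnedGoal": "500",
    "activeEnergyBurnedUnit": "kcal",
    "appleExerciseTime": "77",
    "appleExerciseTimeGoal": "30",
    "appleStandHours": "17",
    "appleStandHoursGoal": "12",
    "dateComponents": "2016-03-20",
}


def goal_ratios(attrib):
    """Map each measure to achieved / goal, where a matching ...Goal key exists."""
    ratios = {}
    for key, value in attrib.items():
        goal = attrib.get(key + "Goal")
        # Skip keys with no goal, and goals of zero (as in the first summary).
        if goal is not None and float(goal) != 0:
            ratios[key] = float(value) / float(goal)
    return ratios


ratios = goal_ratios(attrib)
for name in sorted(ratios):
    print(name, round(ratios[name], 2))
```

&lt;p&gt;With a real element, the same function could be called as &lt;code&gt;goal_ratios(acts[2].attrib)&lt;/code&gt;, since &lt;code&gt;attrib&lt;/code&gt; is an ordinary dictionary.&lt;/p&gt;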

&lt;p&gt;Finally, there is a solitary &lt;code&gt;Workout&lt;/code&gt; record.&lt;/p&gt;
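&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;workouts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Workout&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workouts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;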
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Workout&lt;/span&gt; &lt;span class="n"&gt;creationDate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2016-04-02 11:12:57 +0100&amp;quot;&lt;/span&gt;
         &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;31.73680251737436&amp;quot;&lt;/span&gt;
         &lt;span class="n"&gt;durationUnit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min&amp;quot;&lt;/span&gt;
         &lt;span class="n"&gt;endDate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2016-04-02 11:12:22 +0100&amp;quot;&lt;/span&gt;
         &lt;span class="n"&gt;sourceName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;NJR Apple&amp;amp;#160;Watch&amp;quot;&lt;/span&gt;
         &lt;span class="n"&gt;sourceVersion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2.2&amp;quot;&lt;/span&gt;
         &lt;span class="n"&gt;startDate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2016-04-02 10:40:38 +0100&amp;quot;&lt;/span&gt;
         &lt;span class="n"&gt;totalDistance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0&amp;quot;&lt;/span&gt;
         &lt;span class="n"&gt;totalDistanceUnit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;km&amp;quot;&lt;/span&gt;
         &lt;span class="n"&gt;totalEnergyBurned&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;139.3170000000021&amp;quot;&lt;/span&gt;
         &lt;span class="n"&gt;totalEnergyBurnedUnit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;kcal&amp;quot;&lt;/span&gt;
         &lt;span class="n"&gt;workoutActivityType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;HKWorkoutActivityTypeOther&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
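&lt;p&gt;The &lt;code&gt;Workout&lt;/code&gt; record carries both a &lt;code&gt;duration&lt;/code&gt; and a pair of timestamps, which invites a simple consistency check. The following is a minimal sketch assuming Python 3 (for &lt;code&gt;%z&lt;/code&gt; support in &lt;code&gt;strptime&lt;/code&gt;); the helper name &lt;code&gt;elapsed_minutes&lt;/code&gt; is hypothetical, and the values are copied from the record above.&lt;/p&gt;

```python
from datetime import datetime

# Timestamp layout used throughout the export, e.g. 2016-04-02 10:40:38 +0100
DATE_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def elapsed_minutes(start, end):
    """Minutes between two export timestamps, honouring their UTC offsets."""
    started = datetime.strptime(start, DATE_FORMAT)
    ended = datetime.strptime(end, DATE_FORMAT)
    return (ended - started).total_seconds() / 60


# startDate, endDate and duration as they appear in the Workout record.
minutes = elapsed_minutes("2016-04-02 10:40:38 +0100",
                          "2016-04-02 11:12:22 +0100")
reported = float("31.73680251737436")

# Gap between the two figures, in seconds.
print(abs(minutes - reported) * 60)
```

&lt;p&gt;For this record the two figures agree to within a second: the timestamps are stored to whole seconds, whereas the reported duration is not.&lt;/p&gt;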

&lt;p&gt;So there we have it.&lt;/p&gt;
&lt;h2 id="getting-data-out-of-apple-health-the-code"&gt;Getting data out of Apple Health (The Code)&lt;/h2&gt;
&lt;p&gt;Given this exploration, we can take a first shot at writing an exporter
for Apple Health Data.
I'm going to ignore the activity summaries and workout(s) for now, and
concentrate on the main records. (We'll get to the others in a later post.)&lt;/p&gt;
&lt;p&gt;Here is the code:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;applehealthdata.py: Extract data from Apple Health App&amp;#39;s export.xml.&lt;/span&gt;

&lt;span class="sd"&gt;Copyright (c) 2016 Nicholas J. Radcliffe&lt;/span&gt;
&lt;span class="sd"&gt;Licence: MIT&lt;/span&gt;
&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;absolute_import&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;division&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;print_function&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unicode_literals&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;xml.etree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ElementTree&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OrderedDict&lt;/span&gt;

&lt;span class="n"&gt;__version__&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;1.0&amp;#39;&lt;/span&gt;

&lt;span class="n"&gt;FIELDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OrderedDict&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sourceName&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sourceVersion&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;device&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;unit&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;creationDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;startDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;endDate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;value&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;n&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;PREFIX_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;^HK.*TypeIdentifier(.+)$&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ABBREVIATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;VERBOSE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_freqs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Format a counter object for display.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;: &lt;/span&gt;&lt;span class="si"&gt;%d&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                     &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datatype&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Format a value for a CSV file, escaping double quotes and backslashes.&lt;/span&gt;

&lt;span class="sd"&gt;    None maps to empty.&lt;/span&gt;

&lt;span class="sd"&gt;    datatype should be&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;#39;s&amp;#39; for string (escaped)&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;#39;n&amp;#39; for number&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;#39;d&amp;#39; for datetime&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;datatype&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# string&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;quot;&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;quot;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\\\\&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;datatype&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;n&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# number or date&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Unexpected format value: &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;datatype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;abbreviate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Abbreviate particularly verbose strings based on a regular expression&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PREFIX_RE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ABBREVIATE&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Encode string for writing to file.&lt;/span&gt;
&lt;span class="sd"&gt;    In Python 2, this encodes as UTF-8, whereas in Python 3,&lt;/span&gt;
&lt;span class="sd"&gt;    it does nothing&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;UTF-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version_info&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;major&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;



&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HealthDataExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;    Extract health data from Apple Health App&amp;#39;s XML export, export.xml.&lt;/span&gt;

&lt;span class="sd"&gt;    Inputs:&lt;/span&gt;
&lt;span class="sd"&gt;        path:      Relative or absolute path to export.xml&lt;/span&gt;
&lt;span class="sd"&gt;        verbose:   Set to False for less verbose output&lt;/span&gt;

&lt;span class="sd"&gt;    Outputs:&lt;/span&gt;
&lt;span class="sd"&gt;        Writes a CSV file for each record type found, in the same&lt;/span&gt;
&lt;span class="sd"&gt;        directory as the input export.xml. Reports each file written&lt;/span&gt;
&lt;span class="sd"&gt;        unless verbose has been set to False.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VERBOSE&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;in_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;directory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Reading data from &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt; . . . &amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ElementTree&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;done&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getroot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;abbreviate_types&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect_stats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_tags_and_fields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_record_types&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;record_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Record&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;record_types&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attrib&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collect_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count_record_types&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count_tags_and_fields&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;open_for_writing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;record_types&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;.csv&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;abbreviate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FIELDS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Opening &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt; for writing&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;abbreviate_types&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="sd"&gt;        Shorten types by removing common boilerplate text.&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Record&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;type&amp;#39;&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attrib&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attrib&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abbreviate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attrib&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Record&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attrib&lt;/span&gt;
                &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;format_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;datatype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datatype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;FIELDS&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
                &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handles&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Written &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt; data.&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;abbreviate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open_for_writing&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write_records&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close_files&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;report_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;Tags:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;format_freqs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Fields:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;format_freqs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Record types:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;format_freqs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;record_types&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;USAGE: python applehealthdata.py /path/to/export.xml&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HealthDataExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;report_stats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To run this code, clone the repo from &lt;code&gt;github.com/tdda/applehealthdata&lt;/code&gt;
with:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ git clone https://github.com/tdda/applehealthdata.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or save the text from this post as &lt;code&gt;healthdata.py&lt;/code&gt;.
At the time of posting, the code in the repository matches the listing above.
That commit is also tagged &lt;code&gt;v1.0&lt;/code&gt;, so if you clone the repository later
and want exactly this version, check it out with:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ git checkout v1.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If your data is in the same directory as the code, then simply run:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python healthdata.py export.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and, depending on size, wait a few minutes while it runs.
The code runs under both Python 2 and Python 3.&lt;/p&gt;
&lt;p&gt;When I do this, the output is as follows:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python applehealthdata/applehealthdata.py export6s3/export.xml
Reading data from export6s3/export.xml . . . &lt;span class="k"&gt;done&lt;/span&gt;

Tags:
ActivitySummary: &lt;span class="m"&gt;29&lt;/span&gt;
ExportDate: &lt;span class="m"&gt;1&lt;/span&gt;
Me: &lt;span class="m"&gt;1&lt;/span&gt;
Record: &lt;span class="m"&gt;446670&lt;/span&gt;
Workout: &lt;span class="m"&gt;1&lt;/span&gt;

Fields:
HKCharacteristicTypeIdentifierBiologicalSex: &lt;span class="m"&gt;1&lt;/span&gt;
HKCharacteristicTypeIdentifierBloodType: &lt;span class="m"&gt;1&lt;/span&gt;
HKCharacteristicTypeIdentifierDateOfBirth: &lt;span class="m"&gt;1&lt;/span&gt;
HKCharacteristicTypeIdentifierFitzpatrickSkinType: &lt;span class="m"&gt;1&lt;/span&gt;
activeEnergyBurned: &lt;span class="m"&gt;29&lt;/span&gt;
activeEnergyBurnedGoal: &lt;span class="m"&gt;29&lt;/span&gt;
activeEnergyBurnedUnit: &lt;span class="m"&gt;29&lt;/span&gt;
appleExerciseTime: &lt;span class="m"&gt;29&lt;/span&gt;
appleExerciseTimeGoal: &lt;span class="m"&gt;29&lt;/span&gt;
appleStandHours: &lt;span class="m"&gt;29&lt;/span&gt;
appleStandHoursGoal: &lt;span class="m"&gt;29&lt;/span&gt;
creationDate: &lt;span class="m"&gt;446671&lt;/span&gt;
dateComponents: &lt;span class="m"&gt;29&lt;/span&gt;
device: &lt;span class="m"&gt;84303&lt;/span&gt;
duration: &lt;span class="m"&gt;1&lt;/span&gt;
durationUnit: &lt;span class="m"&gt;1&lt;/span&gt;
endDate: &lt;span class="m"&gt;446671&lt;/span&gt;
sourceName: &lt;span class="m"&gt;446671&lt;/span&gt;
sourceVersion: &lt;span class="m"&gt;86786&lt;/span&gt;
startDate: &lt;span class="m"&gt;446671&lt;/span&gt;
totalDistance: &lt;span class="m"&gt;1&lt;/span&gt;
totalDistanceUnit: &lt;span class="m"&gt;1&lt;/span&gt;
totalEnergyBurned: &lt;span class="m"&gt;1&lt;/span&gt;
totalEnergyBurnedUnit: &lt;span class="m"&gt;1&lt;/span&gt;
type: &lt;span class="m"&gt;446670&lt;/span&gt;
unit: &lt;span class="m"&gt;446191&lt;/span&gt;
value: &lt;span class="m"&gt;446671&lt;/span&gt;
workoutActivityType: &lt;span class="m"&gt;1&lt;/span&gt;

Record types:
ActiveEnergyBurned: &lt;span class="m"&gt;19640&lt;/span&gt;
AppleExerciseTime: &lt;span class="m"&gt;2573&lt;/span&gt;
AppleStandHour: &lt;span class="m"&gt;479&lt;/span&gt;
BasalEnergyBurned: &lt;span class="m"&gt;26414&lt;/span&gt;
BodyMass: &lt;span class="m"&gt;155&lt;/span&gt;
DistanceWalkingRunning: &lt;span class="m"&gt;196262&lt;/span&gt;
FlightsClimbed: &lt;span class="m"&gt;2476&lt;/span&gt;
HeartRate: &lt;span class="m"&gt;3013&lt;/span&gt;
Height: &lt;span class="m"&gt;4&lt;/span&gt;
StepCount: &lt;span class="m"&gt;195654&lt;/span&gt;

Opening /Users/njr/qs/export6s3/BasalEnergyBurned.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Opening /Users/njr/qs/export6s3/HeartRate.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Opening /Users/njr/qs/export6s3/BodyMass.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Opening /Users/njr/qs/export6s3/DistanceWalkingRunning.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Opening /Users/njr/qs/export6s3/AppleStandHour.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Opening /Users/njr/qs/export6s3/StepCount.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Opening /Users/njr/qs/export6s3/Height.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Opening /Users/njr/qs/export6s3/AppleExerciseTime.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Opening /Users/njr/qs/export6s3/ActiveEnergyBurned.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Opening /Users/njr/qs/export6s3/FlightsClimbed.csv &lt;span class="k"&gt;for&lt;/span&gt; writing
Written BasalEnergyBurned data.
Written HeartRate data.
Written BodyMass data.
Written DistanceWalkingRunning data.
Written ActiveEnergyBurned data.
Written StepCount data.
Written Height data.
Written AppleExerciseTime data.
Written AppleStandHour data.
Written FlightsClimbed data.
$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As a quick preview of one of the files, here is the top of the second
biggest output file, &lt;code&gt;StepCount.csv&lt;/code&gt;:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ head -5 StepCount.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;HKQuantityTypeIdentifierStepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;06&lt;/span&gt;:08:47 +0000,2014-09-13 &lt;span class="m"&gt;09&lt;/span&gt;:27:54 +0000,2014-09-13 &lt;span class="m"&gt;09&lt;/span&gt;:27:59 +0000,329
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;HKQuantityTypeIdentifierStepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;06&lt;/span&gt;:08:47 +0000,2014-09-13 &lt;span class="m"&gt;09&lt;/span&gt;:34:09 +0000,2014-09-13 &lt;span class="m"&gt;09&lt;/span&gt;:34:14 +0000,283
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;HKQuantityTypeIdentifierStepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;06&lt;/span&gt;:08:47 +0000,2014-09-13 &lt;span class="m"&gt;09&lt;/span&gt;:39:29 +0000,2014-09-13 &lt;span class="m"&gt;09&lt;/span&gt;:39:34 +0000,426
&lt;span class="s2"&gt;&amp;quot;Health&amp;quot;&lt;/span&gt;,,,&lt;span class="s2"&gt;&amp;quot;HKQuantityTypeIdentifierStepCount&amp;quot;&lt;/span&gt;,&lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt;,2014-09-21 &lt;span class="m"&gt;06&lt;/span&gt;:08:48 +0000,2014-09-13 &lt;span class="m"&gt;09&lt;/span&gt;:45:36 +0000,2014-09-13 &lt;span class="m"&gt;09&lt;/span&gt;:45:41 +0000,61
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You may need to scroll right to see all of it, or expand your browser window.&lt;/p&gt;
&lt;p&gt;This blog post is long enough already, so I'll discuss (and plot) the contents
of the various output files in later posts.&lt;/p&gt;
&lt;h1 id="notes-on-the-output"&gt;Notes on the Output&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Format:&lt;/strong&gt; The code writes CSV files including a header record with field
names. Since the fields are XML attributes, which get read into a dictionary,
they are unordered so the code sorts them alphabetically, which isn't optimal,
but is at least consistent. Nulls are written as empty fields, strings are
quoted with double quotes, double quotes in strings are escaped
with backslash and backslash is itself escaped with backslash.
The output encoding is UTF-8.&lt;/p&gt;
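&lt;p&gt;As a rough illustration of these quoting rules, here is a minimal sketch
(not the code from the listing above; the helper names &lt;code&gt;format_value&lt;/code&gt; and
&lt;code&gt;format_row&lt;/code&gt; are hypothetical). It writes one row and reads it back with
Python's standard &lt;code&gt;csv&lt;/code&gt; module, which has to be told about the backslash
escaping:&lt;/p&gt;

```python
import csv


def format_value(value, quote=True):
    """Format one value: nulls become empty, strings are quoted and escaped."""
    if value is None:
        return ''
    if quote:
        # Escape backslashes first, then double quotes.
        escaped = value.replace('\\', '\\\\').replace('"', '\\"')
        return '"' + escaped + '"'
    return str(value)


def format_row(values, quoted):
    """Join formatted values with commas; `quoted` flags the string fields."""
    return ','.join(format_value(v, q) for (v, q) in zip(values, quoted))


# Illustrative values only: a string, a null, a string with quotes, a number.
row = ['Health', None, 'say "hi"', 329]
line = format_row(row, [True, True, True, False])
print(line)

# Reading back: backslash is the escape character and quotes are not doubled.
parsed = next(csv.reader([line], escapechar='\\', doublequote=False))
print(parsed)
```

&lt;p&gt;One consequence worth noting: when read back this way, a null and an empty
string both come out as an empty string, so a reader that cares about the
difference would need to inspect the raw line.&lt;/p&gt;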
&lt;p&gt;&lt;strong&gt;Filenames:&lt;/strong&gt; One file is written per record type, and each name is just
the record type with extension &lt;code&gt;.csv&lt;/code&gt;, except that any
&lt;code&gt;HK...TypeIdentifier&lt;/code&gt; prefix in the record type is excised.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Summary Stats:&lt;/strong&gt; Summary stats about the various CSV files are printed
before the main extraction occurs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overwriting:&lt;/strong&gt; Any existing CSV files are silently overwritten,
so if you have multiple health data export files in the same directory,
take care.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Sanitization:&lt;/strong&gt; The code is almost completely opinionless, and with
one exception simply flattens the data in the XML file into a collection
of CSV files. The exception concerns file names and the &lt;code&gt;type&lt;/code&gt; field.
Apple uses extraordinarily verbose and ugly names like
&lt;code&gt;HKQuantityTypeIdentifierStepCount&lt;/code&gt; and &lt;code&gt;HKQuantityTypeIdentifierHeight&lt;/code&gt;
to describe the contents of each record: the &lt;code&gt;abbreviate&lt;/code&gt; function in the
code uses a regular expression to strip off the nonsense, resulting in
nicer, shorter, more comprehensible file names and record types. However,
if you prefer to get your data verbatim, simply change the value
of &lt;code&gt;ABBREVIATE&lt;/code&gt; to &lt;code&gt;False&lt;/code&gt; near the top of the file and all your HealthKit
prefixes will be preserved, at the cost of a non-trivial expansion of the
output file sizes.&lt;/p&gt;
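&lt;p&gt;The abbreviation step might look something like the following. This is a
sketch under assumptions rather than a copy of the code in the repository: the
regular expression and the helper &lt;code&gt;csv_filename&lt;/code&gt; are illustrative, and only
the names &lt;code&gt;abbreviate&lt;/code&gt; and &lt;code&gt;ABBREVIATE&lt;/code&gt; come from the description above.&lt;/p&gt;

```python
import re

# Set to False to keep the verbose HealthKit names verbatim.
ABBREVIATE = True

# Assumed pattern: anything starting with HK and running up to TypeIdentifier.
PREFIX_RE = re.compile(r'^HK.*TypeIdentifier(.+)$')


def abbreviate(name, enabled=ABBREVIATE):
    """Strip a leading HK...TypeIdentifier prefix, if present and enabled."""
    m = PREFIX_RE.match(name)
    return m.group(1) if enabled and m else name


def csv_filename(record_type):
    """Hypothetical helper: output file name for a record type."""
    return abbreviate(record_type) + '.csv'


print(abbreviate('HKQuantityTypeIdentifierStepCount'))
print(csv_filename('HKQuantityTypeIdentifierHeight'))
print(abbreviate('HKQuantityTypeIdentifierStepCount', enabled=False))
```

&lt;p&gt;Names without the prefix pass through unchanged, so the same function can be
applied to every record type without special cases.&lt;/p&gt;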
&lt;h1 id="notes-on-the-code-wot-no-tests"&gt;Notes on the code: Wot, no tests?&lt;/h1&gt;
&lt;p&gt;The first thing to say about the code is that there are &lt;em&gt;no tests&lt;/em&gt;
provided with it, which is—cough—slightly ironic, given the theme
of this blog. This isn't because I've written them but am holding them
back for pedagogical reasons, or as an ironical meta-commentary on
the whole &lt;em&gt;test-driven&lt;/em&gt; movement, but merely because I &lt;em&gt;haven't written
any&lt;/em&gt; yet. Happily, writing tests is a good way of documenting and
explaining code, so another post will follow, in which I will present
some tests, possibly correct myriad bugs, and explain more about what
the code is doing.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:ddg"&gt;
&lt;p&gt;I almost said 'I googled "Apple Health export"', but the more accurate
statement would be that 'I DuckDuckGoed "Apple Health export"', but
there are so many problems with DuckDuckGo as a verb, even in the
present tense, let alone in the past as DuckDuckGod.
Maybe I should propose the neologism "to DDGoogle".
Or as Greg Wilson suggested, "to Duckle".
Or maybe not . . .&amp;#160;&lt;a class="footnote-backref" href="#fnref:ddg" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:et"&gt;
&lt;p&gt;The ElementTree structure in Python 3 is slightly
different in this respect: this exploration was carried out with
Python 2. However, the main code presented later in the post
works under Python 2 and 3.&amp;#160;&lt;a class="footnote-backref" href="#fnref:et" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="xml"></category><category term="apple"></category><category term="health"></category></entry><entry><title>Lessons Learned: Bad Data and other SNAFUs</title><link href="https://tdda.info/lessons-learned-bad-data-and-other-snafus.html" rel="alternate"></link><published>2016-02-15T13:30:00+00:00</published><updated>2016-02-15T13:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2016-02-15:/lessons-learned-bad-data-and-other-snafus.html</id><summary type="html">&lt;p&gt;My first paid programming job was working for my local education
authority during the summer. &lt;a href="https://www.computinghistory.org.uk/sec/21778/AUCBE-(Advisory-Unit-for-Computer-Based-Education)/"&gt;The Advisory Unit for Computer-Based
Education
(AUCBE)&lt;/a&gt;,
run by a fantastic visionary and literal "greybeard" called &lt;a href="https://www.edtechhistory.org.uk/blog/blog.html#jan13"&gt;Bill
Tagg&lt;/a&gt;, produced
software for schools in Hertfordshire and environs, and one of their
products was a simple database …&lt;/p&gt;</summary><content type="html">&lt;p&gt;My first paid programming job was working for my local education
authority during the summer. &lt;a href="https://www.computinghistory.org.uk/sec/21778/AUCBE-(Advisory-Unit-for-Computer-Based-Education)/"&gt;The Advisory Unit for Computer-Based
Education
(AUCBE)&lt;/a&gt;,
run by a fantastic visionary and literal "greybeard" called &lt;a href="https://www.edtechhistory.org.uk/blog/blog.html#jan13"&gt;Bill
Tagg&lt;/a&gt;, produced
software for schools in Hertfordshire and environs, and one of their
products was a simple database called
&lt;a href="https://www.computinghistory.org.uk/det/21785/Quest/"&gt;Quest&lt;/a&gt;. At this
time (the early 1980s), two computers dominated UK schools—the
&lt;a href="https://en.wikipedia.org/wiki/Research_Machines_380Z"&gt;Research Machines
380Z&lt;/a&gt;, a Zilog
Z-80-based machine running CP/M, and the fantastic, new &lt;a href="https://en.wikipedia.org/wiki/BBC_Micro"&gt;BBC
Micro&lt;/a&gt;, a 6502-based machine
produced by Acorn to a specification agreed with the British
Broadcasting Corporation. I was familiar with both, as my school had a
solitary 380Z, and I had harangued my parents into getting me a BBC
Model B,&lt;sup id="fnref:bbcmicro"&gt;&lt;a class="footnote-ref" href="#fn:bbcmicro"&gt;1&lt;/a&gt;&lt;/sup&gt; which was the joy of my life.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: BBC Micro" src="https://farm4.staticflickr.com/3623/3349126651_520486c684_z.jpg" title="BBC Micro"&gt;&lt;/p&gt;
&lt;p&gt;The Quest database existed in two data-compatible forms. Peter Andrews
had written a machine code implementation for the 380Z, and Bill Tagg
himself had written an implementation in &lt;a href="https://www.bbcbasic.org"&gt;BBC
Basic&lt;/a&gt; for the BBC Micro. They shared an
interface and a manual, and my job was to produce a 6502 version that
would also share that manual. Every deviation from the documented and
actual behaviour of the BBC Basic implementation had to be personally
signed off by Bill Tagg.&lt;/p&gt;
&lt;p&gt;Writing Quest was a fantastic project for me, and the most highly
constrained I have ever done: every aspect of it was pinned down by a
combination of manuals, existing data files, specified interfaces,
existing users and reference implementations. Peter Andrews was very
generous in writing out, in fountain pen, on four A4 pages, a
suggested implementation plan, which I followed scrupulously. That plan
probably made the difference between my successfully completing the
project and flailing endlessly, and the project was a success.&lt;/p&gt;
&lt;p&gt;I learned an enormous amount writing Quest, but the path to success
was not devoid of bumps in the road.&lt;/p&gt;
&lt;p&gt;Once I had implemented enough of Quest for it to be worth testing, I took
to delivering versions to Bill periodically. This was the early 1980s,
so he didn't get them by pulling from GitHub, nor even by FTP or email;
rather, I handed him floppy disks,&lt;sup id="fnref:floppy"&gt;&lt;a class="footnote-ref" href="#fn:floppy"&gt;2&lt;/a&gt;&lt;/sup&gt; in the early days, and later on,
&lt;a href="https://en.wikipedia.org/wiki/EPROM"&gt;EPROMs&lt;/a&gt;—Erasable, Programmable Read-Only Memory chips that he could
plug into the &lt;a href="https://chrisacorns.computinghistory.org.uk/New4Old/RetroClinic_Beebzif.html"&gt;&lt;em&gt;Zero-Insertion Force&lt;/em&gt;&lt;/a&gt; ("ZIF") socket&lt;sup id="fnref:zif"&gt;&lt;a class="footnote-ref" href="#fn:zif"&gt;3&lt;/a&gt;&lt;/sup&gt; on the side of his
machine. (Did I mention how cool the BBC Micro was?)&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: ZIF Socket" src="https://farm3.staticflickr.com/2372/2226425940_f4aeba3207_z.jpg" title="ZIF Socket"&gt;&lt;/p&gt;
&lt;p&gt;Towards the end of my development of the 6502 implementation of Quest,
I proudly handed over a version to Bill, and was slightly disappointed
when he complained that it didn't work with one of his database
files. In fact, his database file caused it to &lt;em&gt;hang&lt;/em&gt;. He gave me a
copy of his data and I set about finding the problem. It
goes without saying that a bug that caused the software to hang was
pretty bad, so it was clearly important to find it.&lt;/p&gt;
&lt;p&gt;It was hard to track down. As I recall, it took me the best part of two
solid days to find the problem. When I eventually did find it, it turned
out to be a "bad data" problem. If I remember correctly, Quest saved
data as flat files using the pipe character &lt;code&gt;"|"&lt;/code&gt; to separate fields.
The dataset Bill had given me had an extra pipe separator on one line,
and was therefore not compliant with the data format. My reaction to
this discovery was to curse Bill for sending me on a 2-day wild goose
chase, and the following day I marched into AUCBE and told him—with
the righteousness that only an arrogant teenager can muster—that it
was his data that was at fault, not my beautiful code.&lt;/p&gt;
&lt;p&gt;. . . to which Bill, of course, countered:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"And why didn't your beautiful code detect the bad data and report it, rather than hanging?"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Oops.&lt;/p&gt;
&lt;h1 id="introducing-snafu-of-the-week"&gt;Introducing SNAFU of the Week&lt;/h1&gt;
&lt;p&gt;Needless to say, Bill was right. Even if my software was perfect and
would never write invalid data (which &lt;em&gt;might&lt;/em&gt; not have been the case),
and even if data could never become corrupt through disk errors (which
was &lt;em&gt;demonstrably&lt;/em&gt; not the case), that didn't mean it would never
encounter bad data. So the software had to deal with invalid inputs
rather better than going into an infinite loop (which is exactly what
it did—nothing a hard reset wouldn't cure!)&lt;/p&gt;
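&lt;p&gt;The check Bill was asking for need not be elaborate. As a minimal sketch in
Python (obviously not the 6502 code, and with an invented three-field layout,
since the real Quest file format is only described from memory above), the
hypothetical function &lt;code&gt;find_bad_records&lt;/code&gt; reports lines with the wrong number
of pipe-separated fields instead of trusting them:&lt;/p&gt;

```python
def find_bad_records(lines, n_fields, sep='|'):
    """Return (line_number, message) pairs for lines with the wrong field count."""
    problems = []
    for (i, line) in enumerate(lines, start=1):
        n = len(line.rstrip('\n').split(sep))
        if n != n_fields:
            problems.append((i, 'expected %d fields, found %d' % (n_fields, n)))
    return problems


# Invented sample data: the second line has an extra separator.
records = ['oak|tree|20', 'fox|mammal||7', 'ant|insect|1']
for (line_number, message) in find_bad_records(records, 3):
    print('Line %d: %s' % (line_number, message))
```

&lt;p&gt;Reporting every offending line, rather than stopping at the first, tends to
be more useful when the person with the bad file has to go and fix it.&lt;/p&gt;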
&lt;p&gt;And so it is with data analysis.&lt;/p&gt;
&lt;p&gt;Obviously, there &lt;em&gt;is&lt;/em&gt; such a thing as good data—perfectly formatted,
every value present and correct; it's just that it is almost never
safe to &lt;em&gt;assume&lt;/em&gt; that data your software will receive will be good. Rather,
we almost always need to perform checks to validate it, and
to give various levels of warnings when things are not as they should
be. Hanging or crashing on bad data is obviously bad, but in some
ways, it is &lt;em&gt;less bad&lt;/em&gt; than reading it without generating a warning or
error. The hierarchy of evils for analytical software runs something
like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;(Worst) Producing plausible but materially incorrect results
     from good inputs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Producing implausible, materially incorrect results from
     good inputs (generally less bad, because these are much less
     likely to go unnoticed, though obviously they can be even more
     serious if they do).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;(Least serious) Hanging or crashing (embarrassing and inconvenient,
     but not actively misleading).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In this spirit, we are going to introduce "&lt;a href="https://www.urbandictionary.com/define.php?term=SNAFU"&gt;SNAFU&lt;/a&gt; of the Week", which
will be a (not necessarily weekly) series of examples of the kinds of
things that can go wrong with data (especially data feeds), analysis,
and analytical software, together with a discussion of whether and how each
was, or could have been, detected, and what lessons we might learn from
them.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:bbcmicro"&gt;
&lt;p&gt;BBC Micro Image: Dave Briggs, &lt;a href="https://www.flickr.com/photos/theclosedcircle/3349126651/"&gt;https://www.flickr.com/photos/theclosedcircle/3349126651/&lt;/a&gt; under &lt;a href="https://creativecommons.org/licenses/by/2.0/"&gt;CC-BY-2.0&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:bbcmicro" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:floppy"&gt;
&lt;p&gt;Floppy disks were like 3D-printed versions of the save icon still used in much software, and in some cases could store over half a megabyte of data. Of course, the 6502 was an 8-bit processor with a 16-bit address bus, so it could address a maximum of 64K of RAM. In the case of the BBC Micro, a single program could occupy at most 16K, so a massive floppy disk could store many versions of Quest together with enormous database files.&amp;#160;&lt;a class="footnote-backref" href="#fnref:floppy" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:zif"&gt;
&lt;p&gt;Zero-Insertion Force Socket: Windell Oskay, &lt;a href="https://www.flickr.com/photos/oskay/2226425940"&gt;https://www.flickr.com/photos/oskay/2226425940&lt;/a&gt;
under &lt;a href="https://creativecommons.org/licenses/by/2.0/"&gt;CC-BY-2.0&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:zif" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="bad data"></category></entry><entry><title>How far in advance are flights cheapest? An error of interpretation</title><link href="https://tdda.info/how-far-in-advance-are-flights-cheapest-an-error-of-interpretation.html" rel="alternate"></link><published>2016-01-06T15:00:00+00:00</published><updated>2016-01-06T15:00:00+00:00</updated><author><name>Patrick Surry</name></author><id>tag:tdda.info,2016-01-06:/how-far-in-advance-are-flights-cheapest-an-error-of-interpretation.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;Guest Post&lt;/strong&gt; by &lt;a href="https://www.hopper.com/research/patrick-surry/"&gt;Patrick Surry&lt;/a&gt;, Chief Data Scientist, &lt;a href="https://www.hopper.com"&gt;Hopper&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Every year, Expedia and ARC collaborate to publish some annual
statistics about domestic airfare, including their treatment of the
perennial question "How far in advance should you book your flight?"
Here's what they presented in
&lt;a href="https://viewfinder.expedia.com/img/STOR-23513_White_paper.pdf"&gt;their report&lt;/a&gt;
last year:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: Average Ticket Price vs. Advance Purchase Days for Domestic Flights (Source: Expedia/ARC)" src="https://www.tdda.info/images/pds-errors/expedia-arc-1.png" title="Average Ticket Price vs. Advance Purchase Days for Domestic Flights (Source: Expedia/ARC)"&gt;&lt;/p&gt;
&lt;p&gt;Although there …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Guest Post&lt;/strong&gt; by &lt;a href="https://www.hopper.com/research/patrick-surry/"&gt;Patrick Surry&lt;/a&gt;, Chief Data Scientist, &lt;a href="https://www.hopper.com"&gt;Hopper&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Every year, Expedia and ARC collaborate to publish some annual
statistics about domestic airfare, including their treatment of the
perennial question "How far in advance should you book your flight?"
Here's what they presented in
&lt;a href="https://viewfinder.expedia.com/img/STOR-23513_White_paper.pdf"&gt;their report&lt;/a&gt;
last year:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: Average Ticket Price vs. Advance Purchase Days for Domestic Flights (Source: Expedia/ARC)" src="https://www.tdda.info/images/pds-errors/expedia-arc-1.png" title="Average Ticket Price vs. Advance Purchase Days for Domestic Flights (Source: Expedia/ARC)"&gt;&lt;/p&gt;
&lt;p&gt;Although there are a lot of things wrong with this picture (including
the callout not being at the right spot on the x-axis, and the $496
average appearing above $500 . . .), the most egregious is a more subtle
error of interpretation. The accompanying commentary reads:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Still, the question remains: How early should travelers book?
. . . Data collected
by ARC indicates that the lowest average ticket price, about US$401,
can be found 57 days in advance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;While that statement is presumably mathematically correct, it's
completely misleading.  The chart is drawn by calculating the average
price of all domestic roundtrip tickets sold at each advance. That
answers the question "how far in advance is the average price of tickets sold
that day lowest?" but is mistakenly interpreted as answering "how far
in advance is a typical ticket cheapest?". That's a completely
different question, because the mix of tickets changes with
advance. Indeed, travelers tend to book more expensive trips earlier,
and cheaper trips later.  In fact, for most markets, prices are fairly
flat at long advances, and then rise more or less steeply at some
point before departure.  As a simplification, assume there are only
two domestic markets, a short, cheap trip, and a long, expensive
one. Both have prices that are flat at long advances, and which start
rising about 60 days before departure:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: Price as a function of booking window, for short-haul and long-haul flights (Simulated Data)" src="https://www.tdda.info/images/pds-errors/pds-errors-2.png" title="Price as a function of booking window, for short-haul and long-haul flights (simulated data)"&gt;&lt;/p&gt;
&lt;p&gt;Now let's assume that the share of tickets sold for the long, expensive
market is directly proportional to advance, i.e. 300 days ahead, all tickets
sold are for FarFarAway, and 0 days ahead, all tickets sold are for
StonesThrow, and let's calculate the price of the average ticket sold as a
function of advance:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: Average price as a function of booking window across long- and short-haul flights, with time-varying proportionate demand (simulated data)" src="https://www.tdda.info/images/pds-errors/pds-errors-3.png" title="Average price as a function of booking window across long- and short-haul flights, with time-varying proportionate demand (simulated data)"&gt;&lt;/p&gt;
&lt;p&gt;What do you know? The average price declines as demand switches from
more expensive to cheaper tickets, with a minimum coincidentally just
less than 60 days in advance.  To get a more meaningful answer to the
question "how far in advance is the typical ticket cheapest?", we
should instead simply calculate separate advance curves for each
market, and then combine them based on the total number (or value) of
tickets sold in each market. In our simple example, if we assume the
two markets have equal overall weight, we get a much more intuitive
result, with prices flat up to 60 days, and then rising towards departure:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: Weighted average advance-purchase price across long-haul and short-haul, with weighting by volume" src="https://www.tdda.info/images/pds-errors/pds-errors-4.png" title="Weighted average advance-purchase price across long-haul and short-haul, with weighting by volume"&gt;&lt;/p&gt;
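&lt;p&gt;The effect is easy to reproduce. The following is a minimal sketch, not the code used to draw the charts: the fares, the 5%-per-day rise and all the function names are assumptions chosen purely for illustration. It compares the average price of tickets sold on each day (with a long-haul share proportional to advance) against a fixed, equal weighting of the two per-market curves.&lt;/p&gt;

```python
# Hypothetical fares and rise rate, chosen only for illustration.
SHORT_BASE, LONG_BASE = 100.0, 500.0
MAX_ADVANCE = 300      # days
RISE_START = 60        # prices start rising this many days before departure
RISE_PER_DAY = 0.05    # fractional increase per day inside the rise window


def price(base, advance):
    """Flat price at long advances, rising linearly inside RISE_START days."""
    days_into_rise = max(0, RISE_START - advance)
    return base * (1 + RISE_PER_DAY * days_into_rise)


def mixed_average(advance):
    """Average price of tickets sold that day, with a long-haul share
    that is directly proportional to advance."""
    long_share = advance / MAX_ADVANCE
    return (long_share * price(LONG_BASE, advance)
            + (1 - long_share) * price(SHORT_BASE, advance))


def fixed_weight_average(advance, long_weight=0.5):
    """Average of the two per-market curves with a fixed market weight."""
    return (long_weight * price(LONG_BASE, advance)
            + (1 - long_weight) * price(SHORT_BASE, advance))


advances = range(MAX_ADVANCE + 1)
cheapest_mixed = min(advances, key=mixed_average)

print("mixed average is lowest at", cheapest_mixed, "days")
print("mixed average at 300 and 60 days:",
      mixed_average(300), mixed_average(60))
print("fixed-weight average at 300 and 60 days:",
      fixed_weight_average(300), fixed_weight_average(60))
```

&lt;p&gt;In this toy model neither market's price changes between 300 and 60 days, yet the mixed average falls over that range simply because the mix shifts towards the cheaper market. The fixed-weight average stays flat over the same range.&lt;/p&gt;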
&lt;p&gt;All this goes to show how important it is that we frame our analytical
questions (and answers!) carefully.  When the traveller asks: "How far
in advance should I book my flight?", it's our responsibility as
analysts to recognize that they mean&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How far in advance is any given ticket cheapest?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;rather than&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How far in advance is the average price of tickets sold that day lowest?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Even a correct answer to the latter is dangerously misleading because
the traveller is unlikely to recognize the distinction and take it as
the (wrong!) answer to their real question.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="errors"></category><category term="interpretation"></category></entry><entry><title>Tools and Tooling</title><link href="https://tdda.info/tools-and-tooling.html" rel="alternate"></link><published>2015-12-16T08:30:00+00:00</published><updated>2015-12-16T08:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2015-12-16:/tools-and-tooling.html</id><summary type="html">&lt;p&gt;Good tools for testing matter because the temptation to skimp
on testing is real even for true believers: anything that reduces
the friction and pain associated with actually adding tests therefore
has a disproportionate effect on adoption and implementation rates.&lt;/p&gt;
&lt;p&gt;I think there are several reasons the temptation to forego …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Good tools for testing matter because the temptation to skimp
on testing is real even for true believers: anything that reduces
the friction and pain associated with actually adding tests therefore
has a disproportionate effect on adoption and implementation rates.&lt;/p&gt;
&lt;p&gt;I think there are several reasons the temptation to forego writing
tests seems to be strong for most people.&lt;/p&gt;
&lt;p&gt;The first is that code is &lt;em&gt;sometimes&lt;/em&gt; implemented correctly without
writing any tests.  There's no real temptation to miss out mandatory
boiler-plate code at the top of a Java program because we know it can
&lt;em&gt;never&lt;/em&gt; work without it. The fact that it is &lt;em&gt;possible&lt;/em&gt; to produce
correct programs (and correct analytical processes) without writing
tests means that even if you are intellectually convinced, and
conditioned by experience, to believe that it is ultimately faster to
produce reliable results by following a test-driven methodology,
there's always the lingering memory of those rare occasions when
things did &lt;em&gt;just work&lt;/em&gt; first time without the "extra" work of writing
tests.&lt;/p&gt;
&lt;p&gt;The second reason it's tempting to skimp on writing tests is
that this is a &lt;em&gt;support&lt;/em&gt; activity rather than the main task at
hand, making the testing part seem less glamorous and more
of a chore.  (Everyone is familiar with Abraham Lincoln's "Give
me six hours to chop down a tree and I will spend the first four
sharpening the axe", but how many adopt his principle?)&lt;/p&gt;
&lt;p&gt;Perhaps the final reason for skimping on test code is that large
programs and processes typically start life as small programs that we
may not intend to use repeatedly. This is especially true in data
analysis.  While no task is so simple it cannot be botched, the
benefits of and need for systematic testing grow with project
size. The dynamic of "knocking up a script to calculate something",
then later finding yourself using and modifying that script regularly,
gradually extending it to handle ever more cases, while very typical
for some kinds of development and analysis, carries a high risk. The
frequent result is that by the time anyone realises that it really
needs a test suite, the code has grown complex, undocumented
and hard to back-fill with tests.&lt;/p&gt;
&lt;p&gt;The approaches we propose for test-driven data analysis are deliberately
compatible with retrofitting, because in practice so much code
develops in this organic way. If we ignore this reality, we will
automatically exclude a large proportion—perhaps a majority—of
analytical processes actually deployed. Given that we believe the
test-driven approach to data analysis has much to offer, we are keen
to bring its benefits as widely as possible, so we need to have regard
to the case of analytical processes developed in the real world
without TDDA as a primary focus.&lt;/p&gt;
&lt;p&gt;Understanding that there is a natural temptation not to develop tests
helps us to realise that anything we can do to make the process of
testing less painful to implement and easier to retrofit when it has
been neglected is likely to help adoption.&lt;/p&gt;
&lt;p&gt;With these thoughts in mind, over the coming weeks and months, we will
have further posts on specific aspects of tooling for TDDA. The broad
plan is to discuss ideas we have already implemented in our own
&lt;a href="https://stochasticsolutions.com/miro.html"&gt;Miró&lt;/a&gt;
data analysis suite, some as extensions to Python's built-in
&lt;code&gt;unittest&lt;/code&gt; framework, and to start to extract and publish core
functionality from there in forms that are applicable to the broader
Python (and perhaps, in some cases R) data analysis toolsets.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="tools"></category></entry><entry><title>Generalized Overfitting: Errors of Applicability</title><link href="https://tdda.info/generalized-overfitting-errors-of-applicability.html" rel="alternate"></link><published>2015-12-14T11:00:00+00:00</published><updated>2015-12-14T11:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2015-12-14:/generalized-overfitting-errors-of-applicability.html</id><summary type="html">&lt;p&gt;Everyone building predictive models or performing statistical fitting
knows about &lt;em&gt;overfitting.&lt;/em&gt; This arises when the function
represented by the model includes components or aspects that are
overly specific to the particularities of the sample data used for
training the model, and that are not general features of datasets
to which …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Everyone building predictive models or performing statistical fitting
knows about &lt;em&gt;overfitting.&lt;/em&gt; This arises when the function
represented by the model includes components or aspects that are
overly specific to the particularities of the sample data used for
training the model, and that are not general features of datasets
to which the model might reasonably be applied.
The failure mode associated with overfitting is that the performance
of the model on the data we used to train it is significantly
better than the performance when we apply the model to other data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: Overfitting" src="https://www.tdda.info/images/overfit1350x500.png" title="Overfitting. sin(x) with Gaussian noise. Left: Polynomial Fit degree 3 (cubic; good fit). Right: Polynomial Fit degree 10 (overfit)"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure:&lt;/strong&gt; &lt;em&gt;Overfitting. Points drawn from sin(x) + Gaussian noise. Left: Polynomial fit, degree 3 (cubic; good fit). Right: Polynomial fit, degree 10 (overfit).&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Statisticians use the term &lt;em&gt;cross-validation&lt;/em&gt; to describe the process
of splitting the training data into two (or more) parts, and using one
part to fit the model, and the other to assess whether or not it
exhibits overfitting.  In machine learning, this is more often
referred to as a "test-training" approach.&lt;/p&gt;
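&lt;p&gt;As a rough illustration of the split-and-assess idea, here is a sketch using only the Python standard library. It deliberately swaps the polynomial fits in the figure for a nearest-neighbour average (an assumption made to avoid any dependencies); the sample size, noise level and function names are likewise hypothetical. With &lt;code&gt;k=1&lt;/code&gt; the model simply memorizes the training points, which is overfitting in its purest form.&lt;/p&gt;

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Points drawn from sin(x) plus Gaussian noise, as in the figure above.
xs = [random.uniform(0, 2 * math.pi) for _ in range(200)]
data = [(x, math.sin(x) + random.gauss(0, 0.3)) for x in xs]

# Split into a part used for fitting and a part held back for assessment.
random.shuffle(data)
train, held_out = data[:100], data[100:]


def predict(x, k):
    """Average the y values of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(points, k):
    """Mean squared difference between predictions and observed values."""
    return sum((predict(x, k) - y) ** 2 for x, y in points) / len(points)


for k in (1, 10):
    print("k =", k,
          "training error:", mean_squared_error(train, k),
          "held-out error:", mean_squared_error(held_out, k))
```

&lt;p&gt;The memorizing model has a training error of exactly zero but a non-zero error on the held-out part; that gap is the signature of overfitting that the held-out data exists to reveal.&lt;/p&gt;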
&lt;p&gt;A special form of this approach is &lt;em&gt;longitudinal&lt;/em&gt; validation, in which
we build the model on data from one time period and then check its
performance against data from a later time period, either by
partitioning the data available at build time into older and newer
data, or by using outcomes collected after the model was built for
validation. With longitudinal validation, we seek to verify not only
that we did not overfit the characteristics of a particular data
sample, but also that the patterns we model are stable over time.&lt;/p&gt;
&lt;p&gt;Validating against data for which the outcomes were not known when the
model was developed has the additional benefit of eliminating a common
class of errors that arises when secondary information about
validation outcomes "leaks" during the model building process. Some
degree of such leakage—sometimes known as &lt;em&gt;contaminating&lt;/em&gt; the validation
data—is quite common.&lt;/p&gt;
&lt;h2 id="generalized-overfitting"&gt;Generalized Overfitting&lt;/h2&gt;
&lt;p&gt;As its name suggests, &lt;em&gt;overfitting&lt;/em&gt; as normally conceived is a failure
mode specific to model building, arising when we &lt;em&gt;fit&lt;/em&gt; the training
data "too well". Here, we are going to argue that overfitting is
an example of a more general failure mode that can be present in any
analytical process, especially if we use the process with data other
than that used to build it. Our suggested name for this broader class
of failures is &lt;em&gt;errors of applicability&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Here are some of the failure modes we are thinking about:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Changes in Distributions of Inputs (and Outputs)&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;New categories.&lt;/em&gt; When we develop the analytical process, we see
    only categories A, B and C in some (categorical) input or
    output. In operation, we also see category D. At this point our
    process may fail completely ("crash"), produce meaningless outputs
    or merely produce less good results.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Missing categories.&lt;/em&gt; The converse can be a problem too: what if a
    category disappears?  Most prosaically, this might lead to a
    divide-by-zero error if we've explicitly used each category frequency
    in a denominator. Subtler errors can also creep in.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Extended ranges.&lt;/em&gt; For numeric and other ordered data, the
    equivalent of new categories is values outside the range we saw in
    the development data.  Even if the analysis code runs without
    incident, the process is then being used in a way that may be
    quite outside what was considered and tested during development, so this
    can be dangerous.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Distributions&lt;/em&gt;. More generally, even if the &lt;em&gt;range&lt;/em&gt; of the input
    data doesn't change, its distribution may, either slowly or
    abruptly.  At the very least, this indicates the process is being
    used in unfamiliar territory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Nulls&lt;/em&gt;.  Did nulls appear in any fields where there were none
    when we developed the process? Does the process
    cater for this appropriately? And are such nulls valid?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Higher Dimensional Shifts&lt;/em&gt;.  Even if the data ranges and
    distribution for individual fields don't change, their higher
    dimensional distributions (correlations) can change
    significantly. The pair of 2-dimensional distributions below
    illustrates this point in an extreme way.  The distributions of both
    x and y values on the left and right are identical.
    But clearly, in 2 dimensions, we see that the space
    occupied by the two datasets is actually non-overlapping, and on
    the left x and y are negatively correlated, while on the right
    they are positively correlated.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: A shift in distribution (2D)" src="https://www.tdda.info/images/distribution-shift-2d.png" title="A shift in distribution (2D)"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure:&lt;/strong&gt; &lt;em&gt;The same x and y values are shared between these two
plots (i.e. the distribution of x and y is identical in each
case). However, the pairing of x and y coordinates is different. A
model or other analytical process built with negatively
correlated data like that on the left might not work well for
positively correlated data like that on the right. Even if it does
work well, you may want to detect and report a fundamental change
like this.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Time (always marching on)&lt;/em&gt;. Times and dates are notoriously
    problematical. There are many issues around date and time formats,
    many specifically around timezones (and the difference between
    local times and times in a fixed time zone, such as GMT or UTC).&lt;/p&gt;
&lt;p&gt;For now, let's assume that we have an input that is a well-defined
time, correctly read and analysed in a known timezone—say
UTC.&lt;sup id="fnref:UTC"&gt;&lt;a class="footnote-ref" href="#fn:UTC"&gt;1&lt;/a&gt;&lt;/sup&gt; Obviously, new data will tend to have later
times—sometimes non-overlapping later times. Most often, we need
to change these to intervals measured with respect to a moving
date (possibly today, or some variable event date, e.g. days since
contact). But in other cases, absolute times, or times in a cycle
matter. For example, season, time of month or time of day may
matter—the last two, probably in local time rather than UTC.&lt;/p&gt;
&lt;p&gt;In handling time, we have to be careful about binnings, about
absolute vs. relative measurement (2015-12-11T11:00:00 vs.
299 hours after the start of the current month), universal
vs. local time, and appropriate bin boundaries that move or
expand with the analytic time window being considered.&lt;/p&gt;
&lt;p&gt;Time is &lt;em&gt;not&lt;/em&gt; unique in the way that its range and maximum naturally
increase with each new data sample. Most obviously, other
counters (such as customer number) and sum-like aggregates may
have this same monotonically increasing character, meaning that it
should be expected that new, higher (but perhaps not new lower)
values will be present in newer data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
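&lt;p&gt;Several of the checks listed above are simple to automate. The sketch below is one possible shape for them, not an existing API: every function name and all the sample values are hypothetical. Each check compares values seen during development with values seen in operation.&lt;/p&gt;

```python
def check_categories(dev_values, new_values):
    """Report categories that have appeared or disappeared."""
    dev, new = set(dev_values), set(new_values)
    return {"new": sorted(new - dev), "missing": sorted(dev - new)}


def out_of_range(dev_values, new_values):
    """Return new values lying outside the range seen in development."""
    low, high = min(dev_values), max(dev_values)
    return [v for v in new_values if v > high or low > v]


def nulls_appeared(dev_values, new_values):
    """True if nulls occur now in a field that had none in development."""
    return None in new_values and None not in dev_values


def covariance(xs, ys):
    """Population covariance; its sign shows the direction of correlation."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y)
               for x, y in zip(xs, ys)) / len(xs)


# Hypothetical development and operational samples.
print(check_categories(["A", "B", "C", "A"], ["A", "B", "D"]))
print(out_of_range([120.0, 310.5, 95.0], [150.0, 4200.0, 80.0]))
print(nulls_appeared([1, 2, 3], [1, None]))

# Same x values and same y values, paired differently (cf. the 2D figure).
xs = [1, 2, 3, 4]
ys_left, ys_right = [4, 3, 2, 1], [1, 2, 3, 4]
print(sorted(ys_left) == sorted(ys_right),
      covariance(xs, ys_left), covariance(xs, ys_right))
```

&lt;p&gt;The last check illustrates why per-field comparisons are not enough: the two pairings have identical one-dimensional distributions, and only a joint statistic such as the covariance distinguishes them.&lt;/p&gt;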
&lt;p&gt;&lt;strong&gt;Concrete and Abstract Definitions&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There's a general issue with choosing values based on data used during
development. This concerns the difference between what we will term
&lt;em&gt;concrete&lt;/em&gt; and &lt;em&gt;abstract&lt;/em&gt; values, and what it means to perform "the same"
operation on different datasets.&lt;/p&gt;
&lt;p&gt;Suppose we decide to handle outliers differently from the rest of
the data in a dataset, at least for some part of the analysis.
For example, suppose we're looking at flight prices in Sterling
and we see the following distribution.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: Ticket Prices" src="https://www.tdda.info/images/prices1000x500.png" title="Ticket Prices"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Figure:&lt;/strong&gt; &lt;em&gt;Ticket prices, in £100 bins to £1,000, then doubling
widths to £256,000, with one final bin for prices above £256,000.
(On the graph, the £100-width bins are red; the rest are blue.)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;On the basis of this, we see that well over 99% of the data has prices
under £4,000, and also that while there are a few thousand ticket
prices in the £4,000–£32,000 range (most of which are probably real) the
final few thousand probably contain bad data, perhaps as a result of
currency conversion errors.&lt;/p&gt;
&lt;p&gt;We may well want to choose one or more threshold values from the
data—say £4,000 in this case—to specify some aspect of our
analytical process.  We might, for example, use this threshold
in the analysis for filtering, outlier reporting, setting a final bin boundary
or setting the range for the axes of a graph.&lt;/p&gt;
&lt;p&gt;The crucial question here is: &lt;em&gt;How do we specify and represent our
threshold value?&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Concrete Value:&lt;/strong&gt; Our concrete value is £4,000. In the current
    dataset there are 60,095 ticket prices (0.55%) above this value
    and 10,807,905 (99.45%) below. (There are no prices of exactly
    £4,000.)  Obviously, if we specify our threshold using this
    concrete value—£4,000—it will be the same for any dataset we
    use with the process.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Abstract Value:&lt;/strong&gt; Alternatively, we might specify the value
    indirectly, as a function of the input data. One such abstract
    specification is &lt;em&gt;the price P below which 99.45% of ticket
    prices in the dataset lie&lt;/em&gt;. If we specify a threshold using this
    abstract definition, it will vary according to the content of
    the dataset.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In passing, 99.45% is not precise: if we select the
    bottom 99.45% of this dataset by price we get 10,808,225
    records with a maximum price of £4,007.65. The more precise
    specification is that 99.447046% of the dataset has prices
    under £4,000.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Of course, being human, if we were specifying the value in this
    way, we would probably round the percentage to 99.5%,
    and if we did that we would find that we shifted the threshold
    so that the maximum price below it was £4,186.15, and the minimum
    price above was £4,186.22.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alternative Abstract Specifications:&lt;/strong&gt; Of course, if we want to
    specify this threshold abstractly, there are countless other ways
    we might do it, some fraught with danger.&lt;/p&gt;
&lt;p&gt;Two things we should definitely avoid when working with data like
this are means and variances across the whole column, because they
will be rendered largely meaningless by outliers.  If we blindly
calculate the mean, μ, and standard deviation, σ, in this dataset,
we get μ=£2,009.85 and σ=£983,956.28. That's because, as we noted
previously, there are a few highly questionable ticket prices in
the data, including a maximum of
£1,390,276,267.42.&lt;sup id="fnref:conversion-error"&gt;&lt;a class="footnote-ref" href="#fn:conversion-error"&gt;2&lt;/a&gt;&lt;/sup&gt; Within the main body of the
data—the ~99.45% with prices below £4,000.00—the
corresponding values are μ=£462.09 and σ=£504.82. This emphasizes
how dangerous it would be to base a definition on full-field
moments&lt;sup id="fnref:statisticalMoments"&gt;&lt;a class="footnote-ref" href="#fn:statisticalMoments"&gt;3&lt;/a&gt;&lt;/sup&gt; such as mean or variance.&lt;/p&gt;
&lt;p&gt;In contrast, the median is much less affected by outliers.  In the
full dataset, for example, the median ticket price is £303.77,
while the median of those under £4,000.00 is £301.23. So another
reasonably stable abstract definition of a threshold around
£4,000.00 would be something like 13 times the median.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
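&lt;p&gt;The difference between these specifications can be made concrete with a small sketch. The prices below are invented (they are not the dataset discussed above), and the function names are hypothetical; the point is only to show a fixed threshold, two data-derived thresholds, and how differently the mean and median react to a single wild value.&lt;/p&gt;

```python
import math
import statistics

# Invented ticket prices with one wild outlier (not the real dataset).
prices = [95.0, 120.0, 180.0, 250.0, 300.0,
          310.0, 420.0, 650.0, 900.0, 1_000_000.0]

CONCRETE_THRESHOLD = 4000.0  # the same for every dataset


def quantile_threshold(values, fraction):
    """Smallest value with at least `fraction` of the data at or below it."""
    ordered = sorted(values)
    index = max(0, math.ceil(fraction * len(ordered)) - 1)
    return ordered[index]


def median_multiple_threshold(values, multiple=13):
    """Abstract threshold defined as a multiple of the median."""
    return multiple * statistics.median(values)


print("mean:", statistics.mean(prices))      # dominated by the outlier
print("median:", statistics.median(prices))  # barely affected by it
print("90% quantile threshold:", quantile_threshold(prices, 0.9))
print("13 x median threshold:", median_multiple_threshold(prices))

# A later dataset in which every price has doubled: the concrete
# threshold stays put, while the abstract one moves with the data.
later = [2 * p for p in prices]
print(CONCRETE_THRESHOLD, median_multiple_threshold(later))
```

&lt;p&gt;Whether the threshold &lt;em&gt;should&lt;/em&gt; move when prices double is exactly the judgment call discussed next.&lt;/p&gt;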
&lt;p&gt;The reason for labouring this point around abstract vs. concrete
definitions is that it arises very commonly and it is not always
obvious which is preferable. Concrete definitions have the advantage
of (numeric) consistency between analyses, but may result in analyses
that are not well suited to a later dataset, because different choices
would have been made if that later data had been considered by the
developer of the process. Conversely, abstract definitions often make
it easier to ensure that analyses are suitable for a broader
range of input datasets, but can make comparability more difficult;
they also tend to make it harder to get "nice" human-centric
scales, bin boundaries and thresholds (because you end up, as we saw
above, with values like £4,186.22, rather than £4,000).&lt;/p&gt;
&lt;p&gt;Making a poor choice between abstract and concrete specifications
of any data-derived values can lead to large sections of the data
being omitted (if filtering is used), or made invisible (if used
for axis boundaries), or conversely can lead to non-comparability
between results or miscomputations if values are associated with
bins having different boundaries in different datasets.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; A common source of the leakage of information from
validation data into training data, as discussed above, is the use
of the full dataset to make decisions about thresholds such as those
discussed here. To get the full benefit of cross-validation, all
modelling decisions need to be made solely on the basis of the
training data; even feeding back performance information from
the validation data begins to contaminate that data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Data-derived thresholds and other values can occur almost anywhere
in an analytical process, but specific dangers include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Selections (Filters)&lt;/em&gt;.  In designing analytical processes, we may
    choose to filter values, perhaps to remove outliers or
    nonsensical values. Over time, the distribution may shift, and
    these filters may become less appropriate and remove
    ever-increasing proportions of the data.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A good example of this we have seen recently involves negative
charges.  In early versions of ticket price information, almost
all charges were positive, and those that were negative were
clearly erroneous, so we added a filter to remove all negative
charges from the dataset. Later, we started seeing
data in which there were many more, and less obviously erroneous
negative charges. It turned out that a new data source generated
valid negative charges, but we were misled in our initial analysis
and the process we built was unsuitable for the new context.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Binnings (Bandings, Buckets)&lt;/em&gt;. Binning data is very common, and
    it is important to think carefully about when you want bin
    boundaries to be concrete (common across datasets) and when they
    should be abstract (computed, and therefore different for different
    datasets). Quantile binnings (such as deciles), of course, are
    intrinsically adaptive, though if those are used you have to
    be aware that any given bin in one dataset may have different
    boundaries from the "same" bin in another dataset.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Statistics.&lt;/em&gt; As noted above, whenever a statistic derived
    from the dataset is used, some care has to be taken to determine whether
    it should be recorded algorithmically (as an abstract value) in the analysis
    or numerically (as a concrete value), and particular care should be
    taken with statistics that are sensitive to outliers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
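&lt;p&gt;The point about binnings can be seen in a few lines. This is a sketch under stated assumptions: the fixed edges, the two small datasets and the helper names are all made up, and quartiles are computed with the standard library's &lt;code&gt;statistics.quantiles&lt;/code&gt; using its default method.&lt;/p&gt;

```python
import statistics
from bisect import bisect_right

CONCRETE_EDGES = [100, 200, 400]  # hypothetical fixed bin boundaries


def bin_index(value, edges):
    """Index of the bin that value falls into, given sorted bin edges."""
    return bisect_right(edges, value)


def quartile_edges(values):
    """Abstract (data-derived) bin boundaries: the three quartile cuts."""
    return statistics.quantiles(values, n=4)


first = [100, 200, 300, 400, 500, 600, 700]
second = [2 * v for v in first]  # same shape, every value doubled

print(quartile_edges(first))
print(quartile_edges(second))

# The same value lands in the same concrete bin for both datasets,
# but in different quartile bins.
value = 350
print(bin_index(value, CONCRETE_EDGES))
print(bin_index(value, quartile_edges(first)),
      bin_index(value, quartile_edges(second)))
```

&lt;p&gt;Comparing "bin 1" across the two datasets would therefore compare different price ranges, which is the non-comparability risk described above.&lt;/p&gt;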
&lt;p&gt;&lt;strong&gt;Other Challenges to Applicability&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In addition to the common sources of &lt;em&gt;errors of applicability&lt;/em&gt; we have
outlined above, we will briefly mention a few more.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Non-uniqueness.&lt;/em&gt; Is a value that was different for each record
    in the input data now non-unique?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Crazy outliers.&lt;/em&gt; Are there (crazy) outliers in fields where there
    were none before?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Actually wrong.&lt;/em&gt; Are there detectable data errors in the operational
    data that were not seen during development?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;New data formats.&lt;/em&gt; Have formats changed, leading to misinterpretation
    of values?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;New outcomes.&lt;/em&gt; Even more problematical than new input categories or
    ranges are new outcome categories or a larger range of output values.
    When we see this, we should almost always re-evaluate our analytical
    processes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="four-kinds-of-analytical-errors"&gt;Four Kinds of Analytical Errors&lt;/h2&gt;
&lt;p&gt;In the overview of TDDA we published in Predictive Analytics Times
(available
&lt;a href="https://www.predictiveanalyticsworld.com/patimes/four-ways-data-science-goes-wrong-and-how-test-driven-data-analysis-can-help/"&gt;here&lt;/a&gt;),
we made an attempt to summarize how the four main classes of errors
arise with the following diagram:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: Four Kinds of Analytical Error" src="https://www.tdda.info/images/tdda-errors-4stage-1920x1020.png" title="Four Kinds of Analytical Error and their Sources"&gt;&lt;/p&gt;
&lt;p&gt;While this was always intended to be a simplification, a particular problem is
that it suggests there's no room for errors of interpretation in the
operationalization phase, which is far from the case.&lt;sup id="fnref:nevertoolate"&gt;&lt;a class="footnote-ref" href="#fn:nevertoolate"&gt;4&lt;/a&gt;&lt;/sup&gt;
Probably a better representation is as follows:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure: Four Kinds of Analytical Error (revisited)" src="https://www.tdda.info/images/tdda-errors-5stage-1920x1020.png" title="Four Kinds of Analytical Error and their Sources (revisited)"&gt;&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:UTC"&gt;
&lt;p&gt;UTC is the curious abbreviation (malacronym?) used for
&lt;em&gt;coordinated universal time&lt;/em&gt;, which is the standardized
version of Greenwich Mean Time now defined by the scientific community.
It is the time at 0° longitude, with no "daylight saving"
(British Summer Time) adjustment.&amp;#160;&lt;a class="footnote-backref" href="#fnref:UTC" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:conversion-error"&gt;
&lt;p&gt;This is probably the result of a currency conversion
error.&amp;#160;&lt;a class="footnote-backref" href="#fnref:conversion-error" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:statisticalMoments"&gt;
&lt;p&gt;Statistical
&lt;a href="https://en.wikipedia.org/wiki/Moment_(mathematics)"&gt;&lt;em&gt;moments&lt;/em&gt;&lt;/a&gt; are
the characterizations of distributions
starting with mean and variance, and continuing with skewness and
kurtosis.&amp;#160;&lt;a class="footnote-backref" href="#fnref:statisticalMoments" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:nevertoolate"&gt;
&lt;p&gt;It's never too late to misinterpret data or results.&amp;#160;&lt;a class="footnote-backref" href="#fnref:nevertoolate" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="errors"></category><category term="applicability"></category></entry><entry><title>Overview of TDDA in Predictive Analytics Times</title><link href="https://tdda.info/overview-of-tdda-in-predictive-analytics-times.html" rel="alternate"></link><published>2015-12-11T16:20:00+00:00</published><updated>2015-12-11T16:20:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2015-12-11:/overview-of-tdda-in-predictive-analytics-times.html</id><content type="html">&lt;p&gt;We have an overview piece in Predictive Analytics Times this week.&lt;/p&gt;
&lt;p&gt;You can find it &lt;a href="https://www.predictiveanalyticsworld.com/patimes/four-ways-data-science-goes-wrong-and-how-test-driven-data-analysis-can-help/"&gt;here&lt;/a&gt;.&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category></entry><entry><title>Anomaly Detection</title><link href="https://tdda.info/anomaly-detection.html" rel="alternate"></link><published>2015-12-01T08:30:00+00:00</published><updated>2015-12-01T08:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2015-12-01:/anomaly-detection.html</id><summary type="html">&lt;h3 id="the-broader-process-anomaly-detection-and-alerting"&gt;The Broader Process: Anomaly detection and Alerting.&lt;/h3&gt;
&lt;p&gt;The fourth major area we will focus on as we develop the ideas of
test-driven data analysis is the correctness of the broader
process. This relates partly to some of the ideas about consistency
checking discussed &lt;a href="#ConsistencyChecking"&gt;earlier&lt;/a&gt;, but goes
further.&lt;/p&gt;
&lt;p&gt;A common situation …&lt;/p&gt;</summary><content type="html">&lt;h3 id="the-broader-process-anomaly-detection-and-alerting"&gt;The Broader Process: Anomaly detection and Alerting.&lt;/h3&gt;
&lt;p&gt;The fourth major area we will focus on as we develop the ideas of
test-driven data analysis is the correctness of the broader
process. This relates partly to some of the ideas about consistency
checking discussed &lt;a href="#ConsistencyChecking"&gt;earlier&lt;/a&gt;, but goes
further.&lt;/p&gt;
&lt;p&gt;A common situation with analysis processes is that they are used
repeatedly on some kind of &lt;em&gt;feed&lt;/em&gt; of data. When this is the case, in
addition to checking the internal consistency of data, we have the
opportunity to compare the current dataset with the previous datasets
that have been seen with a view to detecting and reporting sudden and
potentially unexpected changes. Simple examples might include changes
in data volumes, shifts in distributions, increases in missing data
rates and new or disappearing categories. More complex examples
are multivariate, involving changes in the relationships between
variables over time.&lt;/p&gt;
&lt;p&gt;While this can be a complex topic, simply tracking a time series of
summary stats about various data (especially inputs) and setting
thresholds for deviations between the current data and what's come
before can catch a good many problems. A more sophisticated and
ambitious approach might involve trying to do general automatic
anomaly detection on the incoming data, again using information about
data previously seen as a reference point.&lt;/p&gt;
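&lt;p&gt;The simple version of this idea might look something like the sketch below. The function name, the three-standard-deviation default and the daily row counts are all assumptions for illustration; in practice a statistic less sensitive to outliers than the mean and standard deviation may be a better choice.&lt;/p&gt;

```python
import statistics


def is_anomalous(history, current, max_sigmas=3.0):
    """Flag current if it lies more than max_sigmas sample standard
    deviations from the mean of the previously seen values."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        return current != mean
    return abs(current - mean) > max_sigmas * spread


# Hypothetical record counts from the five previous daily feeds.
row_counts = [10_120, 9_980, 10_050, 10_210, 9_940]

print(is_anomalous(row_counts, 10_150))  # an ordinary day
print(is_anomalous(row_counts, 4_000))   # a sudden drop in volume
```

&lt;p&gt;The same pattern applies to any summary statistic tracked over time, such as null rates or the number of distinct categories in a field.&lt;/p&gt;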
&lt;p&gt;Depending on how automated the process is, it might be appropriate for
the result of such anomaly detection to be a simple section or note
in the output, or the creation of some kind of alert (such as a triggered
email).&lt;/p&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="components"></category></entry><entry><title>Unit Testing</title><link href="https://tdda.info/unit-testing.html" rel="alternate"></link><published>2015-11-28T08:30:00+00:00</published><updated>2015-11-28T08:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2015-11-28:/unit-testing.html</id><summary type="html">&lt;h3 id="systematic-unit-tests-system-tests-and-reference-tests"&gt;Systematic Unit Tests, System Tests and Reference Tests&lt;/h3&gt;
&lt;p&gt;The third major idea in test-driven data analysis is the one most
directly taken from test-driven development, namely systematically
developing both unit tests for small components of the analytical
process and carefully constructed, specific tests for the whole system
or larger components …&lt;/p&gt;</summary><content type="html">&lt;h3 id="systematic-unit-tests-system-tests-and-reference-tests"&gt;Systematic Unit Tests, System Tests and Reference Tests&lt;/h3&gt;
&lt;p&gt;The third major idea in test-driven data analysis is the one most
directly taken from test-driven development, namely systematically
developing both unit tests for small components of the analytical
process and carefully constructed, specific tests for the whole system
or larger components.&lt;/p&gt;
&lt;p&gt;With regression testing, we emphasized the idea of taking whole-system
analyses that have already been performed and turning those into
overall system tests.  In contrast, here we are talking about creating
(or selecting) specific patterns in the input data for particular
functions that exercise both core functionality (the main code paths)
and so-called &lt;em&gt;edge&lt;/em&gt; cases (the less-trodden paths).  This is
definitely significant extra work, and may (particularly if
retrofitted) require either restructuring of code or use of some
kind of mocking.&lt;/p&gt;
&lt;p&gt;When we talk about "edge cases", we are referring to patterns, code-paths
or functionality that is used only a minority of the time, or perhaps
only on rare occasions. Examples might include handling
missing values, extreme values, small data volumes and so forth.
Some questions that might help to illuminate common edge cases include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;What happens if we use a standard deviation and there is only one
     value (so that the standard deviation is not defined)?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What happens if there are no records?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What happens if some or all of the data is null?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What happens if a few extreme (and possibly erroneous) values occur:
     will this, for example, cause a mean to have a non-useful value
     or bin boundaries to be set to useless values?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What happens if two fields are perfectly correlated: will this cause
     instability or errors when performing, for example,
     a statistical regression?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What happens if there are invalid characters in
     string data, especially field names or file names?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What happens if input data formats (e.g. string encodings, date formats,
     separators) are not as expected?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Are the various styles of line-endings (e.g. Unix vs. PC)
     handled correctly?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What happens if an external system (such as a database) is running
     in some unexpected mode, such as with a different locale?&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
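&lt;p&gt;A few of these questions can be turned directly into small unit tests. The sketch below is one hedged illustration in Python: &lt;code&gt;safe_stdev&lt;/code&gt; is a hypothetical helper (not from any library mentioned here) that returns &lt;code&gt;None&lt;/code&gt; when the sample standard deviation is undefined, and the assertions exercise the single-value, no-records and all-null cases.&lt;/p&gt;

```python
import math
from statistics import StatisticsError, stdev


def safe_stdev(values):
    """Sample standard deviation, or None when it is undefined.

    Nulls (None or NaN) are dropped first. With fewer than two
    remaining values the sample standard deviation is not defined.
    """
    clean = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    try:
        return stdev(clean)
    except StatisticsError:
        return None


# Edge cases from the questions above, written as simple checks.
assert safe_stdev([]) is None                      # no records
assert safe_stdev([4.2]) is None                   # only one value
assert safe_stdev([None, float("nan")]) is None    # all values null
assert safe_stdev([1.0, None, 3.0]) == stdev([1.0, 3.0])  # some nulls
```

&lt;p&gt;Whether returning &lt;code&gt;None&lt;/code&gt; is the right behaviour depends on the analysis. The point is that the choice is made deliberately and pinned down by a test.&lt;/p&gt;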
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:notTDD"&gt;
&lt;p&gt;This idea of performing checks throughout the software does
not really have an analogue in mainstream TDD, but is certainly good
software engineering practice—see, e.g. the timeless &lt;em&gt;Little Bobby
Tables&lt;/em&gt; XKCD &lt;a href="https://xkcd.com/327/"&gt;https://xkcd.com/327/&lt;/a&gt;—and
relates fairly closely to the software engineering
practice of &lt;em&gt;defensive programming&lt;/em&gt;
 &lt;a href="https://en.wikipedia.org/wiki/Defensive_programming"&gt;https://en.wikipedia.org/wiki/Defensive_programming&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:notTDD" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:masterIdentifiers"&gt;
&lt;p&gt;Obviously, in many situations, it's fine for identifiers
or keys to be repeated, but it is also often the case that in a particular
table a field value must be unique, typically when the records act as
master records, defining the entities that exist in some category.
Such tables are often referred to as &lt;em&gt;master tables&lt;/em&gt; in database
contexts
&lt;a href="https://encyclopedia2.thefreedictionary.com/master+file"&gt;https://encyclopedia2.thefreedictionary.com/master+file&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:masterIdentifiers" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="components"></category></entry><entry><title>Constraints and Assertions</title><link href="https://tdda.info/constraints-and-assertions.html" rel="alternate"></link><published>2015-11-26T11:00:00+00:00</published><updated>2015-11-26T11:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2015-11-26:/constraints-and-assertions.html</id><summary type="html">&lt;h3 id="consistency-checking-of-inputs-outputs-and-intermediates"&gt;Consistency Checking of Inputs, Outputs and Intermediates&lt;/h3&gt;
&lt;p&gt;While the idea of &lt;a href="https://www.tdda.info/infinite-gain-the-first-test"&gt;regression
testing&lt;/a&gt; comes
straight from &lt;a href="https://www.tdda.info/test-driven-development-a-review"&gt;test-driven
development&lt;/a&gt;,
the next idea we want to discuss is associated more with general
&lt;a href="https://en.wikipedia.org/wiki/Defensive_programming"&gt;defensive
programming&lt;/a&gt; than
TDD.  The idea is &lt;em&gt;consistency checking&lt;/em&gt;, i.e. verifying that
what might otherwise be implicit assumptions are …&lt;/p&gt;</summary><content type="html">&lt;h3 id="consistency-checking-of-inputs-outputs-and-intermediates"&gt;Consistency Checking of Inputs, Outputs and Intermediates&lt;/h3&gt;
&lt;p&gt;While the idea of &lt;a href="https://www.tdda.info/infinite-gain-the-first-test"&gt;regression
testing&lt;/a&gt; comes
straight from &lt;a href="https://www.tdda.info/test-driven-development-a-review"&gt;test-driven
development&lt;/a&gt;,
the next idea we want to discuss is associated more with general
&lt;a href="https://en.wikipedia.org/wiki/Defensive_programming"&gt;defensive
programming&lt;/a&gt; than
TDD.  The idea is &lt;em&gt;consistency checking&lt;/em&gt;, i.e. verifying that
what might otherwise be implicit assumptions are in fact met by adding
checks at various points in the process.&lt;sup id="fnref:notTDD"&gt;&lt;a class="footnote-ref" href="#fn:notTDD"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Initially, we will assume that we are working with tabular data,
but the ideas can be extended to other kinds of data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Inputs.&lt;/strong&gt; It is useful to perform some basic checks on inputs.
Typical things to consider include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Are the names and types of fields in the input data as we expect?
   In most cases, we also expect field names to be distinct,
   and perhaps to conform to some rules.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is the distribution of values in the fields reasonable? For example,
   are the minimum and maximum values reasonable?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Are there nulls (missing values) in the data, and if so, are they
   permitted where they occur? If so, are there any restrictions
   (e.g. may all the values for a field or record be null?)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is the volume of data reasonable (exactly as expected, if there is
   a specific size expectation, or plausible, if the volume is variable)?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is any required metadata&lt;sup id="fnref:metadata"&gt;&lt;a class="footnote-ref" href="#fn:metadata"&gt;2&lt;/a&gt;&lt;/sup&gt; included?&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In addition to basic sense checks like these, we can also often
formulate self-consistency checks on the data.
For example:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Are any of the fields identifiers or keys for which every value should
   occur only once?&lt;sup id="fnref:masterIdentifiers"&gt;&lt;a class="footnote-ref" href="#fn:masterIdentifiers"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Are there row-level identities that should be true? For example, we
   might have a set of category counts and an overall total, and
   expect the category totals to sum to the overall total:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;nCategoryA + nCategoryB + nCategoryC = nOverall
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For categorical data, are all the values found in the data allowed,
   and are any required values missing?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the data has a time structure, are the times and dates
   self-consistent? For example, do any end dates precede start dates?
   Are there impossible future dates?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Are there any ordering constraints on the data, and if so are they
   respected?&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
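&lt;p&gt;Several of the self-consistency checks above can be sketched in a few lines of Python. The example below is illustrative only: the function &lt;code&gt;check_records&lt;/code&gt; and the field names &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;start&lt;/code&gt; and &lt;code&gt;end&lt;/code&gt; are assumptions made for the sketch, while the category-count fields follow the identity shown earlier.&lt;/p&gt;

```python
from collections import Counter
from datetime import date


def check_records(records):
    """Return a list of human-readable problems found in the records."""
    problems = []

    # Identifier values should occur only once.
    counts = Counter(r["id"] for r in records)
    for key, n in counts.items():
        if n > 1:
            problems.append(f"id {key} occurs {n} times")

    for r in records:
        # Row-level identity: category counts should sum to the total.
        parts = r["nCategoryA"] + r["nCategoryB"] + r["nCategoryC"]
        if parts != r["nOverall"]:
            problems.append(f"id {r['id']}: category counts do not sum to nOverall")
        # Date consistency: an end date should not precede its start date.
        if r["start"] > r["end"]:
            problems.append(f"id {r['id']}: end date precedes start date")
    return problems


records = [
    {"id": 1, "nCategoryA": 2, "nCategoryB": 3, "nCategoryC": 5, "nOverall": 10,
     "start": date(2015, 1, 1), "end": date(2015, 6, 30)},
    {"id": 2, "nCategoryA": 1, "nCategoryB": 1, "nCategoryC": 1, "nOverall": 4,
     "start": date(2015, 3, 1), "end": date(2015, 2, 1)},
    {"id": 2, "nCategoryA": 0, "nCategoryB": 0, "nCategoryC": 0, "nOverall": 0,
     "start": date(2015, 1, 1), "end": date(2015, 1, 1)},
]
for problem in check_records(records):
    print(problem)
```

&lt;p&gt;Collecting problems into a list, rather than stopping at the first failure, makes it easier to see the full extent of any inconsistency in a new input.&lt;/p&gt;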
&lt;p&gt;Our goal in formulating TDDA is pragmatic: we are not suggesting it is
necessary to check for every possible inconsistency in the input
data. Rather, we propose that even one or two simple, well-chosen
checks can catch a surprising number of problems.  As with regression
testing, an excellent time to add new checks is when you discover
problems. If you add a consistency check every time you discover bad
inputs that such a test would have caught, you might quickly build up
a powerful, well-targeted set of diagnostics and verification
procedures. As we will see below, there is also a definite
possibility of tool support in this area.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Intermediate Results and Outputs.&lt;/strong&gt; Checking intermediates and
outputs is very similar to checking inputs, and all the same kinds of
tests can be applied.  Some further questions to consider in these
contexts include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;If we look up reference data, do (and should) all the lookups succeed?
   And are failed lookups handled appropriately?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If we calculate a set of results that should exhibit some identity
   properties, do those hold? Just as physics has
   conservation laws for quantities such as energy and momentum,
   there are similar conservation principles in some analytical calculations.
   As a simple example, if we categorize spending into different,
   non-overlapping categories, the sum of the category totals should
   usually equal the sum of all the transactions, as long as we are
   careful about things like non-categorized values.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If we build predictive models, do they cross-validate correctly
   (if we split the data into a training subset and a validation subset)?
   And, ideally, do they also validate longitudinally (i.e., on later data,
   if this is available)?&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Transfer checks&lt;/strong&gt;.  With data analysis, our inputs are frequently generated
by some other system or systems.  Often those systems
already perform some checking or reporting of the data they
produce. If any information about checks or statistics from source
systems is available, it is useful to verify that equivalent
statistics calculated over the input data produce the same
results. If our input data is transactional, maybe the source system
reports (or can report) the number or value of transactions over some
time period. Perhaps it breaks things down by category.  Maybe we know
other summary statistics or there are checksums available that can
be verified.&lt;/p&gt;
&lt;p&gt;The value of checking that the data received is the same as the data
the source system was supposed to send is self-evident, and can help
us to detect a variety of problems including data loss, data
duplication, data corruption, encoding issues, truncation errors and
conversion errors, to name but a few.&lt;/p&gt;
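&lt;p&gt;A transfer check can be as simple as recomputing whatever the source system reports and comparing the two. The sketch below assumes, purely for illustration, that the source reports a transaction count, a total value in pence and a SHA-256 checksum over a canonical rendering of each row. The function names and the row format are hypothetical.&lt;/p&gt;

```python
import hashlib


def summarize(rows):
    """Compute summary statistics that a source system might also report."""
    digest = hashlib.sha256()
    total = 0
    for txn_id, amount_pence in rows:
        # Canonical text form of a row (an assumption of this sketch).
        digest.update(f"{txn_id},{amount_pence}\n".encode("utf-8"))
        total += amount_pence
    return {"count": len(rows), "total": total, "sha256": digest.hexdigest()}


def transfer_mismatches(reported, received_rows):
    """Return the names of reported statistics that we cannot reproduce."""
    ours = summarize(received_rows)
    return sorted(k for k in reported if reported[k] != ours.get(k))


sent = [("t1", 1250), ("t2", 399), ("t3", 10000)]
reported = summarize(sent)           # what the source system claims it sent
received = sent + [("t3", 10000)]    # one row duplicated in transit

print(transfer_mismatches(reported, sent))
print(transfer_mismatches(reported, received))
```

&lt;p&gt;The first comparison finds no mismatches. The second reports that the count, total and checksum all differ, which is the kind of signal that points to duplication or loss in transfer.&lt;/p&gt;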
&lt;h3 id="tool-support-automatic-constraint-suggestion"&gt;Tool Support: Automatic constraint suggestion&lt;/h3&gt;
&lt;p&gt;A perennial obstacle to better testing is the perception that it is a
"nice to have", rather than a &lt;em&gt;sine qua non&lt;/em&gt;, and that implementing it
will require much tedious work. Because of this, any automated support
that tools could provide would seem to be especially
valuable.&lt;/p&gt;
&lt;p&gt;Fortunately, there is low-hanging fruit in many areas, and one of
our goals with this blog is to explore various tool enhancements.
We will do this first in our own Miró software and then, as we find
things that work, will try to produce some examples, libraries and
tools for broader use, probably focused around the Python data
analysis stack.&lt;/p&gt;
&lt;p&gt;In the spirit of starting simply, we're first going to look at what
might be possible by way of automatic input checking.&lt;/p&gt;
&lt;p&gt;One characteristic of data analysis is that we often start by trying
to get a result from some particular dataset, rather than
setting out to implement an analytical process to be used
repeatedly with different inputs.
In fact, when we start, we may not even have a very specific
analytical goal in mind: we may simply have some data (perhaps
poorly documented) and perform exploratory analysis with some broad
analytical goal in mind. Perhaps we will stop when we have a result
that seems useful, and which we have convinced ourselves is plausible.
At some later point, we may get a similar dataset (possibly pertaining
to a later period, or a different entity) and need to perform a similar
analysis. It's at this point we may go back to see whether we can
re-use our previous, embryonic analytical process, in whatever form
it was recorded.&lt;/p&gt;
&lt;p&gt;Let's assume, for simplicity, that the process at least exists as some
kind of executable script, but that it's "hard-wired" to the previous
data. We then have three main choices.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Edit the script (in place) to make it work with the new input data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Take a copy of the script and make that work with the new input data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modify the script to allow it to work with either the new or the old
    data, by parameterizing and generalizing it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Presented like this, the last sounds like the only sensible approach,
and in general it is the better way forward. However, we've all taken the
other paths from time to time, often because under pressure just changing
a few old hard-wired values to new hard-wired values seems as if it
will get us to our immediate result faster.&lt;sup id="fnref:itwill"&gt;&lt;a class="footnote-ref" href="#fn:itwill"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The problem is that even if we were very diligent when preparing the first
script, in the context of the original analysis, it is easy for there
to be subtle differences in a later dataset that might compromise or
invalidate the analysis, and it's hard to force ourselves to be as
vigilant the second (third, fourth, ...) time around.&lt;/p&gt;
&lt;p&gt;A simple thing that can help is to generate statements about the original
dataset and record these as constraints. If a later dataset violates
these constraints, it doesn't necessarily mean that anything is wrong,
but being alerted to the difference at least offers us an opportunity
to consider whether this difference might be significant or problematical,
and indeed, whether it might indicate a problem with the data.&lt;/p&gt;
&lt;p&gt;Concretely: let's think about what we probably know about a dataset
whenever we work with it. We'll use the Periodic Table as an example
dataset, based on a snapshot of data I extracted from Wikipedia a few
years ago. This is how Miró summarizes the dataset if we ask for a
"long" listing of the fields with its &lt;code&gt;ls -l&lt;/code&gt; command:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;             Field      Type                        Min                         Max    Nulls
                 Z       int                          1                          92       0
              Name    string                   Actinium                   Zirconium       0
            Symbol    string                         Ac                          Zr       0
            Period       int                          1                           7       0
             Group       int                          1                          18      18
    ChemicalSeries    string                   Actinoid            Transition metal       0
      AtomicWeight      real                       1.01                      238.03       0
         Etymology    string                      Ceres                      zircon       0
RelativeAtomicMass      real                       1.01                      238.03       0
     MeltingPointC      real                    -258.98                    3,675.00       1
MeltingPointKelvin      real                      14.20                    3,948.00       1
     BoilingPointC      real                    -268.93                    5,596.00       0
     BoilingPointF      real                    -452.07                   10,105.00       0
           Density      real                       0.00                       22.61       0
       Description    string                        0.2            transition metal      40
            Colour    string    a soft silver-white ...    yellowish green or g ...      59
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For each field, we have a name, a type, minimum and maximum values and
a count of the number of missing values. [Scroll sideways if your
window is too narrow to see the Nulls column on the right.]  We also
have, implicitly, the field order.&lt;/p&gt;
&lt;p&gt;This immediately suggests a set of constraints we might want to construct.
We've added an experimental command to Miró for generating constraints based
on the field  metadata shown earlier and a few other statistics.
First, here's the "human-friendly" view that Miró produces if we use
its &lt;code&gt;autoconstraints -l&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure 2: Auto-constraints from 92-element Periodic Table" src="https://www.tdda.info/images/periodic-table-92-autoconstraints.png"&gt;&lt;/p&gt;
&lt;p&gt;In this table, the green cells represent constraints the system
suggests for fields, and the orange cells show areas in which
potential constraints were not constructed, though they would have
been had the data been different. Darker shades of orange indicate
constraints that were closer to being met within the data.&lt;/p&gt;
&lt;p&gt;In addition to this human-friendly view, Miró generates a set of
&lt;em&gt;declarations&lt;/em&gt;, which can be thought of as &lt;em&gt;candidate&lt;/em&gt; assertions.
Specifically, they are statements that are true in the current
dataset, and therefore constitute potential checks we might want to
carry out on any future input datasets we are using for the same
analytical process.&lt;/p&gt;
&lt;p&gt;Here they are:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;declare (&amp;gt;= (min Z) 1)
declare (&amp;lt;= (max Z) 92)
declare (= (countnull Z) 0)
declare (non-nulls-unique Z)

declare (&amp;gt;= (min (length Name)) 3)
declare (&amp;lt;= (min (length Name)) 12)
declare (= (countnull Name) 0)
declare (non-nulls-unique Name)

declare (&amp;gt;= (min (length Symbol)) 1)
declare (&amp;lt;= (min (length Symbol)) 2)
declare (= (countnull Symbol) 0)
declare (non-nulls-unique Symbol)

declare (&amp;gt;= (min Period) 1)
declare (&amp;lt;= (max Period) 7)
declare (= (countnull Period) 0)

declare (&amp;gt;= (min Group) 1)
declare (&amp;lt;= (max Group) 18)

declare (&amp;gt;= (min (length ChemicalSeries)) 7)
declare (&amp;lt;= (min (length ChemicalSeries)) 20)
declare (= (countnull ChemicalSeries) 0)
declare (= (countzero
            (or (isnull ChemicalSeries)
                (in ChemicalSeries (list &amp;quot;Actinoid&amp;quot; &amp;quot;Alkali metal&amp;quot;
                                         &amp;quot;Alkaline earth metal&amp;quot;
                                         &amp;quot;Halogen&amp;quot; &amp;quot;Lanthanoid&amp;quot;
                                         &amp;quot;Metalloid&amp;quot; &amp;quot;Noble gas&amp;quot;
                                         &amp;quot;Nonmetal&amp;quot; &amp;quot;Poor metal&amp;quot;
                                         &amp;quot;Transition metal&amp;quot;))))
           0)

declare (&amp;gt;= (min AtomicWeight) 1.007946)
declare (&amp;lt;= (max AtomicWeight) 238.028914)
declare (= (countnull AtomicWeight) 0)
declare (&amp;gt; (min AtomicWeight) 0)

declare (&amp;gt;= (min (length Etymology)) 4)
declare (&amp;lt;= (min (length Etymology)) 39)
declare (= (countnull Etymology) 0)

declare (&amp;gt;= (min RelativeAtomicMass) 1.007946)
declare (&amp;lt;= (max RelativeAtomicMass) 238.028914)
declare (= (countnull RelativeAtomicMass) 0)
declare (&amp;gt; (min RelativeAtomicMass) 0)

declare (&amp;gt;= (min MeltingPointC) -258.975000)
declare (&amp;lt;= (max MeltingPointC) 3675.0)

declare (&amp;gt;= (min MeltingPointKelvin) 14.200000)
declare (&amp;lt;= (max MeltingPointKelvin) 3948.0)
declare (&amp;gt; (min MeltingPointKelvin) 0)

declare (&amp;gt;= (min BoilingPointC) -268.930000)
declare (&amp;lt;= (max BoilingPointC) 5596.0)
declare (= (countnull BoilingPointC) 0)

declare (&amp;gt;= (min BoilingPointF) -452.070000)
declare (&amp;lt;= (max BoilingPointF) 10105.0)
declare (= (countnull BoilingPointF) 0)

declare (&amp;gt;= (min Density) 0.000089)
declare (&amp;lt;= (max Density) 22.610001)
declare (= (countnull Density) 0)
declare (&amp;gt; (min Density) 0)

declare (&amp;gt;= (min (length Description)) 1)
declare (&amp;lt;= (min (length Description)) 83)

declare (&amp;gt;= (min (length Colour)) 4)
declare (&amp;lt;= (min (length Colour)) 80)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Each green entry in the table maps to a declaration in this list.
Let's look at a few:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Min and  Max.&lt;/strong&gt; Z is the atomic number. Each element has an atomic
    number, which is the number of protons in the nucleus, and each is unique.
    Hydrogen has the smallest number of
    protons, 1, and in this dataset, Uranium has the largest number—92.
    So the first suggested constraints are that these values
    should be in the observed range. These show up as the first two
    declarations:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;declare (&amp;gt;= (min Z) 1)
declare (&amp;lt;= (max Z) 92)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We should say a word about how these constraints are expressed.
Miró includes an expression language called &lt;em&gt;(lisp-like)&lt;/em&gt;,
(because it's essentially a dialect of
&lt;a href="https://en.wikipedia.org/wiki/Lisp_(programming_language)"&gt;Lisp&lt;/a&gt;).
Lisp is slightly unusual in that instead of writing &lt;code&gt;f(x, y)&lt;/code&gt; you
write &lt;code&gt;(f x y)&lt;/code&gt;. So the first expression would be more commonly
expressed as&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;min(Z) &amp;gt;= 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;in regular ("&lt;a href="https://en.wikipedia.org/wiki/Infix_notation"&gt;infix&lt;/a&gt;")
languages.&lt;/p&gt;
&lt;p&gt;Lisp weirdness aside, are these sensible constraints? Well, the first
certainly is. Even if we find some elements beyond Uranium (which we will,
below), we certainly don't expect them to have zero or negative numbers
of protons, so the first constraint seems like a keeper.&lt;/p&gt;
&lt;p&gt;The second constraint is much less sensible. In fact, given that
we know the dataset includes every value of Z from 1 to 92, we
confidently expect that any future revisions of the periodic table
will include values higher than 92. So we would probably discard
that constraint.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The crucial point is that no one wants to sit down and write out
a bunch of constraints by hand (and anyway, "why have a dog and
bark yourself?"). People are generally much more willing to review
a list of suggested constraints and delete the ones that don't
make sense, or modify them so that they do.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Nulls.&lt;/strong&gt; The next observation about Z is that it contains no nulls.
    This turns into the (lisp-like) constraint:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;declare (= (countnull Z) 0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is also almost certainly a keeper: we'd probably be pretty unhappy
if we received a Periodic Table with missing Z values for any elements.&lt;/p&gt;
&lt;p&gt;(Here, &lt;code&gt;(countnull Z)&lt;/code&gt; just counts the number of nulls in field Z,
and &lt;code&gt;=&lt;/code&gt; tests for equality, so the expression reads &lt;em&gt;"the number of nulls
in Z is equal to zero"&lt;/em&gt;.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sign&lt;/strong&gt;. The sign column is more interesting.
    Here, we have recorded the fact that all the values in Z are positive.
    Clearly, this is logically implied by the fact that the minimum
    value for Z is 1, but we think it's useful to record two separate
    observations about the field—first, that its minimum value is 1,
    and secondly that it is always strictly positive. In cases where the
    minimum is 1, for an integer field, these statements are entirely
    equivalent, but if the minimum had been (say) 3, they would be different.
    The value of recording these observations separately arises if at some
    later stage the minimum changes, while remaining positive. In that case,
    we might want to discard the specific minimum constraint, but leave
    in place the constraint on the sign.&lt;/p&gt;
&lt;p&gt;Although we record the sign as a separate constraint in the table,
in this case it does not generate a separate declaration, as it would
be identical to the constraint on the minimum that we already have.&lt;/p&gt;
&lt;p&gt;In contrast, AtomicWeight has a minimum value around 1.008, so it
does get a separate sign constraint:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;declare (&amp;gt; (min AtomicWeight) 0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Uniqueness of Values&lt;/strong&gt;. The next thing our autoconstraints
     framework has noticed about &lt;code&gt;Z&lt;/code&gt; is that none of its values is
     repeated in the data—that all are &lt;em&gt;unique&lt;/em&gt; (a.k.a. distinct).
     The table reports this as &lt;em&gt;yes&lt;/em&gt; (the values are unique) and 92/92
     (100%), meaning that there are 92 distinct values and 92 non-null
     values in total, so that 100% of values are unique.  Other fields,
     such as Etymology, have potential constraints that are not quite
     met: Etymology has 89 different values in the field, so the ratio of
     distinct values to values is about 97%.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; in considering this, we ignore nulls if there are
 any. You can see this if you look at the Unique entry for the
 field Group: here there are 18 different (non-null) values for
 Group, and 74 records have non-null values for Group.&lt;/p&gt;
&lt;p&gt;There is a dedicated function in (lisp-like) for checking whether
the non-null values in a field are all distinct, so the expression
in the declaration is just:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;(non-nulls-unique Z)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;which evaluates to true&lt;sup id="fnref:True"&gt;&lt;a class="footnote-ref" href="#fn:True"&gt;5&lt;/a&gt;&lt;/sup&gt; or false.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Min and Max for String Fields.&lt;/strong&gt;
    For string fields, the actual minimum and maximum values are
    usually less interesting. (Indeed, there are lots of reasonable
    alternative sort orders for strings, given choices such as case
    sensitivity, whether embedded numbers should be sorted numerically
    or alphanumerically, how spaces and punctuation should be handled,
    what to do with accents etc.) In the initial implementation,
    instead of using any min and max string values as the basis of
    constraints, we suggest constraints based on string length.&lt;/p&gt;
&lt;p&gt;For the string fields here, none of the constraints is
particularly compelling, though a minimum length of 1 might be
interesting and you might even think that a maximum length of 2
for the symbol is useful. But in many cases such length constraints
will be valuable. One common case is fixed-length strings, such as the increasingly
ubiquitous UUIDs,&lt;sup id="fnref:uuid"&gt;&lt;a class="footnote-ref" href="#fn:uuid"&gt;6&lt;/a&gt;&lt;/sup&gt;
where the minimum and maximum lengths would both be 36 if they are
canonically formatted. (Of course, we can add much stronger
constraints if we know all the strings in a field are UUIDs.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Categorical Values.&lt;/strong&gt;
    The last kind of automatically generated constraint we will
    discuss today is a restriction of the values in a field to be
    chosen from some fixed set. In this case, Miró has noticed that
    there are only 10 different non-null values for ChemicalSeries, so
    has suggested a constraint to capture that reality. The slightly
    verbose way this currently gets expressed as a constraint is:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;declare (= (countzero
            (or (isnull ChemicalSeries)
                (in ChemicalSeries
                    (list &amp;quot;Actinoid&amp;quot; &amp;quot;Alkali metal&amp;quot;
                          &amp;quot;Alkaline earth metal&amp;quot;
                          &amp;quot;Halogen&amp;quot; &amp;quot;Lanthanoid&amp;quot; &amp;quot;Metalloid&amp;quot; &amp;quot;Noble gas&amp;quot;
                          &amp;quot;Nonmetal&amp;quot; &amp;quot;Poor metal&amp;quot; &amp;quot;Transition metal&amp;quot;))))
           0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;(The &lt;code&gt;or&lt;/code&gt; statement starting on the second line is true for field
values that are either in the list or null. The &lt;code&gt;countzero&lt;/code&gt;
function, when applied to booleans, counts false values, so this
is saying that none of the results of the &lt;code&gt;or&lt;/code&gt; statement should be false,
i.e. all values should be null or in the list. This
would be more elegantly expressed with an &lt;code&gt;(all ...)&lt;/code&gt; statement;
we will probably change it to that formulation soon, though the
current version is more useful for reporting failures.)&lt;/p&gt;
&lt;p&gt;The current implementation generates these constraints only when
the number of distinct values it sees is 20 or fewer, only for
string fields, and only when not all the values in the field are
distinct, but all of these aspects can probably be improved, and
the user can override the number of categories to allow.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In addition to these constraints, we should probably also generate
constraints on the field types and, as we will discuss in future
articles, dataset-level constraints.&lt;/p&gt;
&lt;h3 id="tool-support-using-the-declarations"&gt;Tool Support: Using the Declarations&lt;/h3&gt;
&lt;p&gt;Obviously, if we test the constraints against the same dataset
we used to generate them, all the constraints should be (and are!)
satisfied. Things are slightly more interesting if we run them
against a different dataset.
In this case, we excluded transuranic elements from the dataset
we used to generate the constraints. But we can add them in.
If we do so, and then execute a script (&lt;code&gt;e92.miros&lt;/code&gt;) containing
the autogenerated constraints, we get the following output:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;miro&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;This&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Miro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;90.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;Copyright&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;©&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Stochastic&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Solutions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2008&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;2015.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;Seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1463187505&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;Logs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;started&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2015&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;Logging&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;njr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;miro&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;log&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2015&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;session259&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;miro&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;118&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;118&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;selected&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;e92&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Autoconstraints for dataset elements92.miro.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Generated from session /Users/njr/miro/log/2015/11/25/session256.miros&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Miro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Declaration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;non&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;nulls&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;non&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;nulls&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Symbol&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Symbol&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Symbol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;non&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;nulls&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Symbol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Period&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Period&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Period&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Group&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Group&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ChemicalSeries&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ChemicalSeries&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ChemicalSeries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countzero&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                  &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ChemicalSeries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                      &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ChemicalSeries&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Actinoid&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Alkali metal&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                                               &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Alkaline earth metal&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                                               &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Halogen&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Lanthanoid&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                                               &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Metalloid&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Noble gas&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                                               &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Nonmetal&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Poor metal&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                                               &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Transition metal&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AtomicWeight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.007946&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AtomicWeight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;238.028914&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Miro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Declaration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AtomicWeight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;238.028914&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AtomicWeight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Miro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Declaration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AtomicWeight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AtomicWeight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Etymology&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Etymology&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Etymology&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Miro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Declaration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Etymology&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RelativeAtomicMass&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.007946&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RelativeAtomicMass&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;238.028914&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Miro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Declaration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RelativeAtomicMass&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;238.028914&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RelativeAtomicMass&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Miro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Declaration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RelativeAtomicMass&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RelativeAtomicMass&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MeltingPointC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;258.975000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MeltingPointC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3675.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MeltingPointKelvin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;14.200000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MeltingPointKelvin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3948.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;41&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MeltingPointKelvin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BoilingPointC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;268.930000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;43&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BoilingPointC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5596.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BoilingPointC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Miro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Declaration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BoilingPointC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BoilingPointF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;452.070000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;46&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BoilingPointF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10105.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;47&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BoilingPointF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Miro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Declaration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BoilingPointF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Density&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.000089&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;49&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Density&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;22.610001&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Miro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Declaration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Density&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;22.610001&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Density&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Miro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Declaration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countnull&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Density&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;51&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Density&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;52&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;53&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;83&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Colour&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;declare&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Colour&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;generated&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;Job&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10.2801&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;Logs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;closed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2015&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tdda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="n"&gt;Logs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;written&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;njr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;miro&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;log&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2015&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;session259&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;By default, Miró generates warnings when declared constraints are violated.
In this case, ten of the declared constraints were not met,
so there were ten warnings. We can also set the declarations to generate
errors rather than warnings, allowing us to stop execution of a script
if the data fails to meet our declared expectations.&lt;/p&gt;
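&lt;p&gt;The warn-or-stop behaviour described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not Miró's implementation: the
&lt;code&gt;declare&lt;/code&gt; function, its &lt;code&gt;strict&lt;/code&gt; flag, the &lt;code&gt;DeclarationError&lt;/code&gt;
class and the sample density values are all hypothetical names and data invented for the example.&lt;/p&gt;

```python
import math
import warnings


class DeclarationError(Exception):
    """Raised when a declared constraint fails in strict mode (hypothetical)."""


def is_null(value):
    """Treat None and NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_null(values):
    """Number of null entries in a field."""
    return sum(1 for v in values if is_null(v))


def declare(description, holds, strict=False):
    """Check one declared constraint.

    By default a failure only produces a warning; with strict=True it
    raises, so a script stops as soon as the data breaks an expectation.
    """
    if holds:
        return True
    message = "Declaration failed: " + description
    if strict:
        raise DeclarationError(message)
    warnings.warn(message)
    return False


# Made-up Density field: one value above the old maximum and one null.
density = [0.000089, 22.61, 40.7, None]
present = [v for v in density if not is_null(v)]

results = [
    declare("min Density greater than 0", min(present) > 0),
    declare("max Density at most 22.610001", 22.610001 >= max(present)),
    declare("countnull Density equals 0", count_null(density) == 0),
]
print(results.count(False), "warnings generated.")
```

&lt;p&gt;Calling &lt;code&gt;declare(..., strict=True)&lt;/code&gt; in this sketch would raise on the first
failed constraint instead of continuing, which corresponds to treating failed declarations
as errors rather than warnings.&lt;/p&gt;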
&lt;p&gt;In this case, the failed declarations are mostly unsurprising and
untroubling. The maximum values for Z, AtomicWeight,
RelativeAtomicMass, and Density all increase in this version of the
data, which is expected given that all the new elements are heavier
than those in the initial analysis set. Equally, while the fields
AtomicWeight, RelativeAtomicMass, Etymology, BoilingPointC,
BoilingPointF and Density were all populated in the original dataset,
each now contains nulls. Again, this is unsurprising in this case,
but in other contexts, detecting these sorts of changes in a feed of
data might be important.  Specifically, we should always be interested
in &lt;em&gt;unexpected&lt;/em&gt; differences between the datasets used to
develop an analytical process, and ones for which that process is used
at a later time: it is quite possible that such differences will not be handled
correctly if they were not seen or considered when the process was
developed.&lt;/p&gt;
&lt;p&gt;There are many further improvements we could make to the current state
of the autoconstraint generation, and there are other kinds of
constraints it can generate that we will discuss in later posts. But
as simple as it is, this level of checking has already identified a
number of problems in the work we have been carrying out with
Skyscanner and other clients.&lt;/p&gt;
&lt;p&gt;We will return to this topic, including discussing how we might
add tool support for revising constraint sets in the light of failures,
merging different sets of constraints and adding constraints that are
true only of subsets of the data.&lt;/p&gt;
&lt;h3 id="parting-thoughts"&gt;Parting thoughts&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Outputs and Intermediates.&lt;/strong&gt;
While developing the ideas about automatically generating constraints,
our focus was mostly on input datasets. But in fact, most of the ideas
are almost as applicable to intermediate results and outputs (which,
after all, often form the inputs to the next stage of an analysis pipeline).
We haven't &lt;em&gt;performed&lt;/em&gt; any analysis in this post, but if we had, there might
be similar value in generating constraints for the outputs as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Living Constraints and Type Systems.&lt;/strong&gt;
In this article, we've also focused on checking constraints at
particular points in the process—after loading data, or after
generating results.  But it's not too much of a stretch to think of
constraints as statements that should always be true of data, even as
we append records, redefine fields etc.  We might call these &lt;em&gt;living&lt;/em&gt;
or &lt;em&gt;perpetual constraints&lt;/em&gt;. If we do this, individual field
constraints become more like types. This idea, together with dimensional
analysis, will be discussed in future posts.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:notTDD"&gt;
&lt;p&gt;See e.g. the timeless &lt;em&gt;Little Bobby
Tables&lt;/em&gt; XKCD &lt;a href="https://xkcd.com/327/"&gt;https://xkcd.com/327/&lt;/a&gt;
and the Wikipedia entry on
&lt;a href="https://en.wikipedia.org/wiki/Defensive_programming"&gt;Defensive Programming&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:notTDD" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:metadata"&gt;
&lt;p&gt;Metadata is &lt;em&gt;data about data&lt;/em&gt;. In the context of
tabular data, the simplest kinds of metadata are the field
names and types. Any statistics we can compute are another form of
metadata, e.g. minimum and maximum values, averages, null counts,
values present etc. There is literally no limit to what metadata can be
associated with an underlying dataset.&amp;#160;&lt;a class="footnote-backref" href="#fnref:metadata" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:masterIdentifiers"&gt;
&lt;p&gt;Obviously, in many situations, it's fine for identifiers
or keys to be repeated, but it is also often the case that in a particular
table a field value must be unique, typically when the records act as
master records, defining the entities that exist in some category.
Such tables are often referred to as &lt;em&gt;master tables&lt;/em&gt; in database
contexts
&lt;a href="https://encyclopedia2.thefreedictionary.com/master+file"&gt;https://encyclopedia2.thefreedictionary.com/master+file&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:masterIdentifiers" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:itwill"&gt;
&lt;p&gt;We're not saying this conviction is wrong: it &lt;em&gt;is&lt;/em&gt; typically
quicker just to whack in the new values each time. Our contention is
that this is a more error-prone, less systematic approach.&amp;#160;&lt;a class="footnote-backref" href="#fnref:itwill" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:True"&gt;
&lt;p&gt;(lisp-like) actually follows an amalgam of Lisp conventions, using
&lt;code&gt;t&lt;/code&gt; to represent True, like Common Lisp, and &lt;code&gt;f&lt;/code&gt; for False, which is more like
Scheme or Clojure. But it doesn't really matter here.&amp;#160;&lt;a class="footnote-backref" href="#fnref:True" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:uuid"&gt;
&lt;p&gt;A so-called "universally unique identifier" (UUID) is a 128-bit
number, usually formatted as a string of 32 hex digits separated into
blocks of 8, 4, 4, 4, and 12 digits by hyphens—for example
&lt;code&gt;12345678-1234-1234-1234-123456789abc&lt;/code&gt;. They are also known as &lt;em&gt;globally&lt;/em&gt;
unique identifiers (GUIDs) and are usually generated randomly, sometimes
basing some bits on device and time to reduce the probability of
collisions. Although fundamentally numeric in nature, it is fairly common
for them to be stored and manipulated as strings.
&lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier"&gt;Wikipedia entry&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:uuid" title="Jump back to footnote 6 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="components"></category></entry><entry><title>Site News: Glossary; Table of Contents; Feeds</title><link href="https://tdda.info/site-news-glossary-table-of-contents-feeds.html" rel="alternate"></link><published>2015-11-23T11:40:00+00:00</published><updated>2015-11-23T11:40:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2015-11-23:/site-news-glossary-table-of-contents-feeds.html</id><summary type="html">&lt;p&gt;The site now has a &lt;a href="https://www.tdda.info/pages/glossary"&gt;glossary&lt;/a&gt;,
and also a &lt;a href="https://www.tdda.info/pages/table-of-contents"&gt;table of
contents&lt;/a&gt;, both linked
from the side panel (which is at the top on mobile). The plan,
obviously, is to keep these up-to-date as we discuss more topics. The
table of contents is similar to the
&lt;a href="https://www.tdda.info/archives"&gt;archives&lt;/a&gt; link at the …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The site now has a &lt;a href="https://www.tdda.info/pages/glossary"&gt;glossary&lt;/a&gt;,
and also a &lt;a href="https://www.tdda.info/pages/table-of-contents"&gt;table of
contents&lt;/a&gt;, both linked
from the side panel (which is at the top on mobile). The plan,
obviously, is to keep these up-to-date as we discuss more topics. The
table of contents is similar to the
&lt;a href="https://www.tdda.info/archives"&gt;archives&lt;/a&gt; link at the top, but is
chronological, rather than reverse-chronological, and has a short
description of each article.&lt;/p&gt;
&lt;p&gt;While writing the glossary, we decided that, in addition to the two
classes of errors we discussed in &lt;a href="https://www.tdda.info/why-test-driven-data-analysis"&gt;Why Test-Driven Data
Analysis&lt;/a&gt;—errors
of implementation and errors of interpretation—we should probably
break out a third category, namely errors of &lt;em&gt;process&lt;/em&gt;.  The first of
the "interpretation" questions we listed was "Is the input data
correct?". Presenting incorrect data to an analytical process
certainly seems more like an error of process than an error of
interpretation (though as we will discuss in one of the next posts,
arguably the process should detect at least some kinds of input
errors). We will certainly discuss other examples of process errors
in future posts. We'll probably update the
&lt;a href="https://www.tdda.info/why-test-driven-data-analysis"&gt;Why... post&lt;/a&gt;
with at least a footnote describing this new interpretation.&lt;/p&gt;
&lt;p&gt;We were also informed that some of the links to the RSS and Atom feeds
were broken, even though the feeds themselves were OK.
Apologies for this. As far as we can tell, they're all OK now.
Please let us know if you try them and find they're not OK,
or indeed if you find any other &lt;a href="https://xkcd.com/386/"&gt;problems or errors&lt;/a&gt;.&lt;/p&gt;</content><category term="blog"></category><category term="site news"></category><category term="glossary"></category></entry><entry><title>Infinite Gain: The First Test</title><link href="https://tdda.info/infinite-gain-the-first-test.html" rel="alternate"></link><published>2015-11-16T19:00:00+00:00</published><updated>2015-11-16T19:00:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2015-11-16:/infinite-gain-the-first-test.html</id><summary type="html">&lt;p&gt;The first idea we want to appropriate from test-driven development is
that of &lt;a href="test-driven-development-a-review.html"&gt;regression testing&lt;/a&gt;,
and our specific analytical variant of this, the idea of a &lt;strong&gt;reference test&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We propose a "zeroth level" of &lt;em&gt;test-driven data analysis&lt;/em&gt; as
recording one or more specific sets of inputs to an analytical
process …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The first idea we want to appropriate from test-driven development is
that of &lt;a href="test-driven-development-a-review.html"&gt;regression testing&lt;/a&gt;,
and our specific analytical variant of this, the idea of a &lt;strong&gt;reference test&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We propose a "zeroth level" of &lt;em&gt;test-driven data analysis&lt;/em&gt; as
recording one or more specific sets of inputs to an analytical
process, together with the corresponding outputs generated, and
ensuring that the process can be re-run using those recorded
inputs. The first test can then simply be checking that the results
remain the same if the analysis is re-run.&lt;/p&gt;
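&lt;p&gt;As a minimal sketch of such a zeroth-level test, assume a hypothetical
&lt;code&gt;run_analysis&lt;/code&gt; function standing in for the real analytical process, and a
JSON file holding the output recorded from an earlier, checked run. Both names,
and the choice of JSON, are illustrative assumptions rather than part of any
particular library:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```python
import json
import os
import tempfile


def run_analysis(values):
    """Hypothetical stand-in for the real analytical process."""
    return {"n": len(values), "total": sum(values), "largest": max(values)}


def check_against_reference(inputs, reference_path):
    """Re-run the analysis and report whether it matches the stored output.

    If no reference exists yet, record one (ideally after checking it by hand).
    """
    actual = run_analysis(inputs)
    if not os.path.exists(reference_path):
        with open(reference_path, "w") as f:
            json.dump(actual, f, sort_keys=True)
        return True
    with open(reference_path) as f:
        expected = json.load(f)
    return actual == expected


recorded_inputs = [3, 1, 4, 1, 5, 9, 2, 6]
reference_file = os.path.join(tempfile.mkdtemp(), "reference.json")

first_run = check_against_reference(recorded_inputs, reference_file)   # records
second_run = check_against_reference(recorded_inputs, reference_file)  # compares
print(first_run, second_run)
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first call records the reference output; every later call simply checks
that re-running the process on the recorded inputs still reproduces it.&lt;/p&gt;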
&lt;p&gt;In the language of test-driven development, this is a &lt;em&gt;regression&lt;/em&gt;
test, because it tests that no regressions have occurred, i.e. the
results are the same now as previously. It is also a &lt;em&gt;system&lt;/em&gt; test, in
the sense that it checks the functioning of the whole system (the
analytical process), rather than one or more specific subunits, as is
the case with &lt;em&gt;unit&lt;/em&gt; tests.&lt;/p&gt;
&lt;p&gt;In our work with Skyscanner, Stochastic Solutions maintains a number
of tests of this type for each of our major analytical processes. They
help to ensure that as we make changes to the analysis scripts, and any
of the software they depend on, we don't break anything without
noticing. We also run them whenever we install new versions on Skyscanner
servers, to check that we get the same results on their platforms as on our
own development systems. We call these whole-system regression tests &lt;em&gt;reference
tests&lt;/em&gt;, and run them as part of the special commit process we use each
time we update the version number of the software. In fact, our
process only allows the version number to be updated if the relevant
tests—including the relevant reference tests—pass.&lt;/p&gt;
&lt;h2 id="some-practical-considerations"&gt;Some practical considerations&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Stochastic (Randomized) Analyses&lt;/p&gt;
&lt;p&gt;We assume that our analytical process is deterministic. If it involves
  a random component, we can make it deterministic by fixing the seed
  (or seeds) used by the random number generators. Any seeds should be
  treated as input parameters; if the process seeds itself
  (e.g. from the clock), it is important it writes out the seeds to
  allow the analysis to be re-run.&lt;/p&gt;
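&lt;p&gt;A minimal sketch of treating the seed as an input parameter follows. The
  function &lt;code&gt;run_stochastic_analysis&lt;/code&gt; is hypothetical, and taking a default
  seed from the clock is just one possible choice:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```python
import random
import time


def run_stochastic_analysis(seed=None):
    """Hypothetical analysis with a random component.

    The seed is an input parameter; if none is supplied, one is taken
    from the clock and returned so that the run can be repeated.
    """
    if seed is None:
        seed = time.time_ns()
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(5)]
    return seed, sum(sample) / len(sample)


seed_used, result = run_stochastic_analysis()
print("seed:", seed_used)  # written out so the analysis can be re-run

# Re-running with the recorded seed reproduces the result exactly.
_, rerun = run_stochastic_analysis(seed_used)
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;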
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Correctness&lt;/p&gt;
&lt;p&gt;We also assume that the analyst has performed some level of
  checking of the results to convince herself that they are correct. In
  the worst case, this may consist of nothing more than verifying that
  the program runs to completion and produces output of the expected
  form that is not glaringly obviously incorrect.&lt;/p&gt;
&lt;p&gt;Needless to say, it is vastly preferable if more diligent
  checking than this has been carried out, but even if the level of
  initial checking of results is superficial, regression tests
  deliver value by allowing us to verify the impact of changes to
  the system.  Specifically, they allow us to detect situations in
  which a result is unexpectedly altered by some modification of
  the process—direct or indirect—that was thought to be
  innocuous (see below).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Size / Time&lt;/p&gt;
&lt;p&gt;Real analysis input datasets can be large, as can outputs, and
  complex analyses can take a long time.  If the data is "too
  large" or the run-time excessive, it is quite acceptable (and in
  various ways advantageous) to cut it down. This should
  obviously be done with a view to maintaining the richness and
  variability of the inputs. Indeed, the data can also be changed
  to include more "corner cases", or, for example, to anonymize it,
  if it is sensitive.&lt;/p&gt;
&lt;p&gt;The main reason we are not specifically advocating cutting
  down the data is that we want to make the overhead of implementing
  a reference test as low as possible.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Feeds&lt;/p&gt;
&lt;p&gt;If the analytical process directly connects to some dynamic data
  feed, it will be desirable (and possibly necessary) to replace
  that feed with a static input source, usually consisting of a
  snapshot of the input data. Obviously, in some circumstances,
  this might be onerous, though in our experience it is usually
  not very hard.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Time-dependent analysis.&lt;/p&gt;
&lt;p&gt;Another factor that can cause analysis of fixed input data, with
  a fixed analytical process, to produce different results is
  explicit or implicit time-dependence in the analysis.  For
  example, the analysis might convert an input that is a date
  stamp to something like "number of whole days before &lt;code&gt;today&lt;/code&gt;",
  or the start of the &lt;em&gt;current&lt;/em&gt; month.  Obviously, such
  transformations produce different results when run on different
  days. As with seeds, if there are such transformations in the
  analysis code, they need to be handled. To cope with this sort of
  situation, we typically look up any reference values such as
  &lt;code&gt;today&lt;/code&gt; early in the analytical process, and allow optional
  override parameters to be provided. Thus, in ordinary use we
  might run an analysis script by saying:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;  python analysis_AAA.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;but in testing replace this by something like&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;  AAA_TODAY=&amp;quot;2015/11/01&amp;quot; python analysis_AAA.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;to set the environment variable &lt;code&gt;AAA_TODAY&lt;/code&gt; to an override value,
  or with a command such as&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt; python analysis_AAA.py -d 2015/11/01
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;to pass in the date as a command-line option to our script.&lt;/p&gt;
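&lt;p&gt;Inside the script, the lookup might be sketched roughly as follows. The
  function names and the precedence order (command-line option first, then
  environment variable, then the real date) are assumptions made for
  illustration:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```python
import argparse
import datetime
import os


def reference_date(argv=None, environ=None):
    """Look up 'today' once, early in the process, allowing an override.

    Assumed precedence: the -d option, then the AAA_TODAY environment
    variable, then the real current date.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", dest="date", default=environ.get("AAA_TODAY"))
    args = parser.parse_args(argv)
    if args.date is None:
        return datetime.date.today()
    return datetime.datetime.strptime(args.date, "%Y/%m/%d").date()


def days_before(stamp, today):
    """Number of whole days from a date stamp to the reference date."""
    return (today - stamp).days


today = reference_date(["-d", "2015/11/01"])
print(days_before(datetime.date(2015, 10, 25), today))
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Every later transformation then uses the value returned by
  &lt;code&gt;reference_date&lt;/code&gt; rather than consulting the clock itself.&lt;/p&gt;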
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Numerical Precision.&lt;/p&gt;
&lt;p&gt;Computers are basically deterministic, and, regardless of what
  numerical accuracy they achieve, if they are asked to perform
  the same operations, on the same inputs, in the same order,
  they will normally produce identical results every time. Thus even
  if our outputs are floating-point values, there is no intrinsic
  problem with testing them for exact equality. The only thing
  we really need to be careful about is that we don't perform
  an equality test between a rounded output value and a
  floating-point value held internally without rounding (or,
  more accurately, held as an IEEE floating point value, rather
  than a decimal value of given precision). In practice, when
  comparing floating-point values, we either need to compare
  formatted string output, rounded in some fixed manner, or
  compare values to some fixed level of precision.
  In most cases, the level of precision will not matter very much,
  though in particular domains we may want to exercise more care
  in choosing this.&lt;/p&gt;
&lt;p&gt;To make this distinction clear, look at the following Python code:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;  &lt;span class="err"&gt;$&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;
  &lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="mf"&gt;2.7.10&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Jul&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt; &lt;span class="mi"&gt;2015&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;46&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;GCC&lt;/span&gt; &lt;span class="mf"&gt;4.2.1&lt;/span&gt; &lt;span class="n"&gt;Compatible&lt;/span&gt; &lt;span class="n"&gt;Apple&lt;/span&gt; &lt;span class="n"&gt;LLVM&lt;/span&gt; &lt;span class="mf"&gt;6.0&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clang&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;600.0.39&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;darwin&lt;/span&gt;
  &lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;help&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;copyright&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;credits&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;license&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;information&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;division&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
  &lt;span class="mf"&gt;0.333333333333&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;0.333333333333&lt;/span&gt;
  &lt;span class="kc"&gt;False&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
  &lt;span class="kc"&gt;True&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.333333333333&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="kc"&gt;True&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;0.333333333333&amp;#39;&lt;/span&gt;
  &lt;span class="kc"&gt;True&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;%.12f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;0.333333333333&amp;#39;&lt;/span&gt;
  &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this code fragment,&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The first line tells Python to return floating-point values
    from integer division (always a good idea).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The next two lines just assign &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; each to be a third.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The following line confirms that the result is, as we'd
    expect, &lt;code&gt;0.3333...&lt;/code&gt;.
    But, crucially, this value is not exact. If we print it to
    60 decimal places, we see:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; print &amp;quot;%.60f&amp;quot; % a
0.333333333333333314829616256247390992939472198486328125000000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Unsurprisingly, therefore, when in the next statement we ask
    Python whether &lt;code&gt;a&lt;/code&gt; is equal to &lt;code&gt;0.333333333333&lt;/code&gt;, the result is
    &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;After this, as expected, we confirm that &lt;code&gt;a == b&lt;/code&gt; is &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We then confirm that if we round &lt;code&gt;a&lt;/code&gt; to 12 decimal places,
    the result is exactly &lt;code&gt;round(0.333333333333, 12)&lt;/code&gt;.
    Do we need the round on the right-hand side? Probably not,
but be aware that 0.333333333333 is not a value that can be stored
exactly in binary, so:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; print &amp;#39;%.60f&amp;#39; % 0.333333333333
0.333333333333000025877623784253955818712711334228515625000000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It's probably clearer, therefore, either to round both sides
    or to use string comparisons.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Finally, we perform two string comparisons. The first relies on
    Python's default string formatting rules, and the second is
more explicit.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; When it comes to actually writing tests, Python's &lt;code&gt;unittest&lt;/code&gt;
  module includes an &lt;code&gt;assertAlmostEqual&lt;/code&gt; method, that takes a number
  of decimal places, so if a function &lt;code&gt;f(x)&lt;/code&gt; is expected to return
  the result 1/3 when &lt;code&gt;x = 1&lt;/code&gt;, the usual way to test this to 12dp is with
  the following code fragment:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;  def testOneThird(self):
      self.assertAlmostEqual(f(1), 0.333333333333, 12)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
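&lt;p&gt;The interactive session shown earlier was run under Python 2.7. As a hedged
  sketch of the same comparisons in Python 3, note that &lt;code&gt;str(a)&lt;/code&gt; there
  produces the shortest string that round-trips
  (&lt;code&gt;'0.3333333333333333'&lt;/code&gt;), so the comparison that relies on default
  formatting no longer matches, whereas explicit formatting, rounding and
  tolerance-based comparison (here with &lt;code&gt;math.isclose&lt;/code&gt; and an
  absolute tolerance chosen for illustration) behave as before:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```python
import math

a = 1 / 3  # true division is the default in Python 3

# Exact comparison against a rounded decimal literal fails.
exact_match = (a == 0.333333333333)

# Rounding both sides, or formatting explicitly, gives a stable comparison.
rounded_match = (round(a, 12) == round(0.333333333333, 12))
formatted_match = ("%.12f" % a == "0.333333333333")

# Default formatting differs from Python 2: str(a) has 16 significant digits.
default_str_match = (str(a) == "0.333333333333")

# Comparison to a fixed absolute tolerance.
tolerance_match = math.isclose(a, 0.333333333333, abs_tol=1e-12)

print(exact_match, rounded_match, formatted_match,
      default_str_match, tolerance_match)
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;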

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Parallel Processing.&lt;/p&gt;
&lt;p&gt;Another factor that can cause differences in results is parallel
  execution, which can often result in subtle changes in the detailed
  sequence of operations carried out. A simple example would be a
  task farm in which each of a number of workers calculates a
  result.  If those results are then summed by the controller
  process in the order they are returned, rather than in a
  predefined sequence, numerical rounding errors may result in
  different answers. Thus, more care has to be taken in these
  sorts of cases.&lt;/p&gt;
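&lt;p&gt;The effect can be reproduced without any real parallelism. In the sketch
  below, the two lists stand for the same three worker results arriving in
  different orders on two runs; the worker IDs and the helper names are
  hypothetical:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```python
def total_in_arrival_order(tagged_results):
    """Accumulate results in whatever order they were returned."""
    total = 0.0
    for _, value in tagged_results:
        total = total + value
    return total


def total_by_worker_id(tagged_results):
    """Accumulate results in a predefined sequence (sorted by worker ID)."""
    total = 0.0
    for _, value in sorted(tagged_results):
        total = total + value
    return total


# (worker_id, result) pairs as returned on two different runs.
run_1 = [(0, 0.1), (1, 0.2), (2, 0.3)]
run_2 = [(2, 0.3), (1, 0.2), (0, 0.1)]

print(total_in_arrival_order(run_1), total_in_arrival_order(run_2))
print(total_by_worker_id(run_1), total_by_worker_id(run_2))
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The arrival-order totals differ in the last binary digit, while the totals
  computed in a fixed sequence agree exactly.&lt;/p&gt;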
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Variable output.&lt;/p&gt;
&lt;p&gt;A final implementation detail is that we sometimes have to be
  careful about simply comparing output logs, graph files etc.
  It is very common for output to include things that may vary
  from run to run, such as timestamps, version information or
  sequence numbers (run 1, run 2...). In these cases, the
  comparison process needs to make suitable allowances.
  We will discuss some methods for handling this in a future article.&lt;/p&gt;
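&lt;p&gt;One simple possibility, sketched here under the assumption that the
  variable parts follow known patterns, is to replace them with fixed
  placeholders before comparing. The log format and the two patterns are
  made up for illustration:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```python
import re

# Patterns for the parts of the output expected to vary between runs.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
RUN_NUMBER = re.compile(r"\brun \d+\b")


def normalize(log_text):
    """Replace run-specific details with fixed placeholders."""
    text = TIMESTAMP.sub("[TIMESTAMP]", log_text)
    return RUN_NUMBER.sub("run [N]", text)


reference_log = "2015-11-16 19:00:00 run 1: 42 records processed"
new_log = "2015-11-17 08:30:12 run 2: 42 records processed"

logs_match = normalize(reference_log) == normalize(new_log)
print(logs_match)
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;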
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="reasons-a-regression-test-might-fail"&gt;Reasons a Regression Test Might Fail&lt;/h2&gt;
&lt;p&gt;Changes that are not intended to alter the result, but sometimes do so
anyway, can take many forms.
For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We might extend our analysis code to accommodate some variation in
    the input data handled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We might add an extra parameter or code path to allow some variation
    in the analysis performed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We might upgrade some software, e.g. the operating system,
    libraries, the analysis software or the environment in which the
    software runs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We might upgrade the hardware (e.g. adding memory, processing capacity
    or GPUs), potentially causing different code paths to be followed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We might run the analysis on a different machine.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We might change the way in which the input data is stored, retrieved
    or presented to the software.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Hardware and software can develop faults, and data corruption can
    and does occur.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-law-of-software-regressions"&gt;The Law of Software Regressions&lt;/h2&gt;
&lt;p&gt;Experience shows that regression tests are a very powerful tool for
identifying unexpected changes, and that such changes occur more often
than anyone expects.  In fact writing this reminds me of the
self-referential law&lt;sup id="fnref:HofstadterGEB"&gt;&lt;a class="footnote-ref" href="#fn:HofstadterGEB"&gt;1&lt;/a&gt;&lt;/sup&gt; proposed by Doug Hofstadter:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hofstadter's Law:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It always takes longer than you expect,
even when you take into account Hofstadter's Law.&lt;/p&gt;
&lt;p&gt;— &lt;em&gt;Gödel, Escher, Bach: An Eternal Golden Braid,&lt;/em&gt;
Douglas R. Hofstadter.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In a similar vein, we might coin a Law of Software Regressions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Law of Software Regressions:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Software regressions happen more often than expected,
even when you take into account the Law of Software Regressions.
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:HofstadterGEB"&gt;
&lt;p&gt;Douglas R. Hofstadter, &lt;em&gt;Gödel, Escher, Bach: An Eternal
Golden Braid,&lt;/em&gt; p. 152. Penguin Books (Harmondsworth) 1980.&amp;#160;&lt;a class="footnote-backref" href="#fnref:HofstadterGEB" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="regression tests"></category><category term="reference tests"></category></entry><entry><title>How is this Misleading Data Misleading Me?</title><link href="https://tdda.info/how-is-this-misleading-data-misleading-me.html" rel="alternate"></link><published>2015-11-13T17:15:00+00:00</published><updated>2015-11-13T17:15:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2015-11-13:/how-is-this-misleading-data-misleading-me.html</id><summary type="html">&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Why is this lying bastard lying to me?"&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Louis Heren,&lt;sup id="fnref:HerenBastard"&gt;&lt;a class="footnote-ref" href="#fn:HerenBastard"&gt;1&lt;/a&gt;&lt;/sup&gt; often attributed to Jeremy Paxman.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In &lt;a href="https://www.tdda.info/why-test-driven-data-analysis.html"&gt;a previous post&lt;/a&gt;, we made
a distinction between two kinds of errors—&lt;em&gt;implementation&lt;/em&gt; errors
and errors of &lt;em&gt;interpretation&lt;/em&gt;. I want to amplify that today,
focusing specifically on interpretation.&lt;/p&gt;
&lt;p&gt;The most important question to …&lt;/p&gt;</summary><content type="html">&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Why is this lying bastard lying to me?"&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Louis Heren,&lt;sup id="fnref:HerenBastard"&gt;&lt;a class="footnote-ref" href="#fn:HerenBastard"&gt;1&lt;/a&gt;&lt;/sup&gt; often attributed to Jeremy Paxman.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In &lt;a href="https://www.tdda.info/why-test-driven-data-analysis.html"&gt;a previous post&lt;/a&gt;, we made
a distinction between two kinds of errors—&lt;em&gt;implementation&lt;/em&gt; errors
and errors of &lt;em&gt;interpretation&lt;/em&gt;. I want to amplify that today,
focusing specifically on interpretation.&lt;/p&gt;
&lt;p&gt;The most important question to keep in mind at all times is not
whether the analysis is computing the thing we wanted it to compute,
but rather whether the result we have produced means what we think it
means. The distinction is crucial.&lt;/p&gt;
&lt;p&gt;As a simple example, let's suppose we specify the goal of our analysis
as calculating the mean of a set of numbers. We can test
that by adding them up and dividing by the number of items. But if we
think the goal is to characterize a &lt;em&gt;typical&lt;/em&gt; transaction size, we have
to ask whether the arithmetic mean is the right metric for
understanding that. As we move more towards a business or conceptual
goal, rather than a mathematical or algorithmic formulation of a calculation,
we have more complex and nuanced considerations, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Do we believe the inputs are correct?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is our chosen metric capable of addressing our underlying need
    (in this case, determining a typical transaction size)?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;How do we handle nulls (missing values)?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Will outliers (perhaps extremely large values) or invalid inputs (perhaps
    negative values) invalidate the calculation?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the values have dimensionality,&lt;sup id="fnref:dimensionality"&gt;&lt;a class="footnote-ref" href="#fn:dimensionality"&gt;2&lt;/a&gt;&lt;/sup&gt; do all of the
    values have the same dimensionality, and are they all in the same units
    (e.g. all money and all in pounds sterling, or all distances and
    all measured in miles)?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For that matter, are the inputs even &lt;em&gt;commensurate,&lt;/em&gt; i.e. do they
    quantify sufficiently similar things that calculating their mean
    is even meaningful?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
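&lt;p&gt;To illustrate the question about outliers, here is a small sketch using
made-up transaction sizes (the numbers are invented purely for illustration).
The mean is computed correctly in the implementation sense, yet a single
extreme value makes it a poor description of a &lt;em&gt;typical&lt;/em&gt; transaction,
whereas the median is barely affected:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```python
import statistics

# Made-up transaction sizes in pounds, with one extreme value.
transactions = [12.0, 15.0, 9.0, 14.0, 11.0, 5000.0]

mean_size = statistics.mean(transactions)
median_size = statistics.median(transactions)

print("mean:  ", mean_size)
print("median:", median_size)
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;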
&lt;p&gt;Paxman/Heren's constant question quoted above—&lt;em&gt;Why is this lying
bastard lying to me?&lt;/em&gt;—will serve as an excellent question to keep in
mind every time we view an analytical result, perhaps recast as
&lt;em&gt;how is this misleading data misleading me?&lt;/em&gt; There is a great
temptation to believe beautifully formatted, painstakingly calculated
results produced by the almost unfathomable power of modern computers.
In fact, there is much to be said for thinking of the combination of
data and processing as an adversary constantly trying to fool you into
drawing false conclusions.&lt;/p&gt;
&lt;p&gt;The questions of implementation are concerned with checking that the data
received as input to the analytical process has been faithfully
transmitted from the source systems, and that the calculations and
manipulations performed in the analysis correctly implement the
algorithms we intended to use.  In contrast, as we outlined
&lt;a href="https://www.tdda.info/why-test-driven-data-analysis"&gt;previously&lt;/a&gt;,
the questions of interpretation emphasize that we need to be ever vigilant,
asking ourselves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Is the input data correct?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is our interpretation of the input data correct?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Are the algorithms we are applying to the data meaningful and appropriate?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is our interpretation of the results we produce correct?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Are the results plausible?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What am I missing?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;How is this misleading data misleading me?&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:HerenBastard"&gt;
&lt;p&gt;This quote is usually attributed to Jeremy Paxman, as
noted in The Guardian article &lt;em&gt;Paxman answers the questions&lt;/em&gt;
&lt;a href="https://www.theguardian.com/media/2005/jan/31/mondaymediasection.politicsandthemedia"&gt;https://www.theguardian.com/media/2005/jan/31/mondaymediasection.politicsandthemedia&lt;/a&gt;
of 31st January 2005.  According to the article, however, the true
origin is a former deputy editor of the Times, Louis Heren, in his
memoirs, with the full quote being &lt;em&gt;"When a politician tells you
something in confidence, always ask yourself&lt;/em&gt;: 'Why is this lying
bastard lying to me?'"  Still other reports, however, say Heren
himself was merely quoting advice he was given.
Melvin J. Lasky writes in &lt;em&gt;Profanity, Obscenity and the Media,&lt;/em&gt;
Transaction Publishers (New Brunswick) 2005:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Find out why the lying bastards are lying!"
This is the famous phrase of an editor of the &lt;em&gt;Times&lt;/em&gt;,
Louis Heren, who received it as "advice given him early in his
career by ... a correspondent of the &lt;em&gt;Daily Worker&lt;/em&gt; [the Communist
daily in London]: 'Always ask yourself why these lying
bastards are lying to you.'"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a class="footnote-backref" href="#fnref:HerenBastard" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:dimensionality"&gt;
&lt;p&gt;Here, we use &lt;em&gt;dimensionality&lt;/em&gt; in the sense of
&lt;a href="https://en.wikipedia.org/wiki/Dimensional_analysis"&gt;Dimensional Analysis&lt;/a&gt;,
which allows us to make inferences about the results of calculations
based on classifying the inputs by category. For example, we would
distinguish lengths from times, quantities of money and so forth.
We would also treat separately &lt;em&gt;dimensionless&lt;/em&gt; quantities, such as counts
or ratios of quantities of the same dimension (e.g. a ratio of two
lengths).&amp;#160;&lt;a class="footnote-backref" href="#fnref:dimensionality" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdda"></category><category term="implementation"></category><category term="interpretation"></category><category term="correctness"></category></entry><entry><title>Test-Driven Development: A Review</title><link href="https://tdda.info/test-driven-development-a-review.html" rel="alternate"></link><published>2015-11-09T10:30:00+00:00</published><updated>2015-11-09T10:30:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2015-11-09:/test-driven-development-a-review.html</id><summary type="html">&lt;p&gt;Since a key motivation for developing test-driven data analysis (TDDA)
has been test-driven development (TDD), we need to conduct a lightning
tour of TDD before outlining how we see TDDA developing.
If you are already familiar with test-driven development, this may not
contain too much that is new for you …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Since a key motivation for developing test-driven data analysis (TDDA)
has been test-driven development (TDD), we need to conduct a lightning
tour of TDD before outlining how we see TDDA developing.
If you are already familiar with test-driven development, this may not
contain too much that is new for you, though we will present it with
half an eye to the repurposing of it that we plan as we move towards
 test-driven data analysis.&lt;/p&gt;
&lt;p&gt;Test-driven development (TDD) has gained notable popularity as an
approach to software engineering, both in its own right and as a key
component of the
&lt;a href="https://en.wikipedia.org/wiki/Agile_software_development"&gt;Agile&lt;/a&gt;
development methodology. Its benefits, as articulated by its
adherents, include higher software quality, greater development speed,
improved flexibility during development (i.e., more ability to adjust
course during development), earlier detection of bugs and
regressions&lt;sup id="fnref:software-regression"&gt;&lt;a class="footnote-ref" href="#fn:software-regression"&gt;1&lt;/a&gt;&lt;/sup&gt; and an increased ability to
restructure ("refactor") code.&lt;/p&gt;
&lt;h3 id="the-core-idea-of-test-driven-development"&gt;The Core Idea of Test-Driven Development&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;Automation + specification + verification + refactoring&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The central idea in test-driven development is that of using a
comprehensive suite of automated tests to specify the desired
behaviour of a program and to verify that it is working correctly.
The goal is to have enough, sufficiently detailed tests to ensure that
when they all pass we feel genuine confidence that the system is
functioning correctly.&lt;/p&gt;
&lt;p&gt;The canonical test-driven approach to software development consists
of the following stages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;First, a suite of tests is written specifying the correct behaviour
    of a software system. As a trivial example, if we are implementing
    a function, &lt;code&gt;f&lt;/code&gt;, to compute the sum of two inputs, &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;,
    we might specify a set of correct input-output pairs.
    In TDD, we structure our tests as a series of &lt;em&gt;assertions&lt;/em&gt;,
    each of which is a statement that must be satisfied in order
    for the test to pass.
    In this case, some possible assertions, expressed in pseudo-code, would be:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;assert f( 0,  0)  =  0
assert f( 1,  7)  =  8
assert f(-2, 17)  = 15
assert f(-3, +3)  =  0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Importantly, the tests should also, in general, check and specify
the generation of errors and the handling of so-called &lt;em&gt;edge&lt;/em&gt; cases.
Edge cases are atypical but valid cases, which might include extreme
input values, handling of null values and handling of empty datasets.
For example:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kr"&gt;assert&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;--&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;TypeError&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="kr"&gt;assert&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_FLOAT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MAX_FLOAT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Infinity&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt; This is &lt;em&gt;not&lt;/em&gt; a comprehensive set of tests for &lt;code&gt;f&lt;/code&gt;.
We'll talk more about what might be considered adequate for this
function in later posts. The purpose of this example is simply to show
the general structure of typical tests.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;An important aspect of testing frameworks is that they allow tests
    to take the form of executable code that can be run even before
    the functionality under test has been written.  At this stage,
    since we have not even defined &lt;code&gt;f&lt;/code&gt;, we expect the
    tests not to pass, but to produce &lt;em&gt;errors&lt;/em&gt; such as &lt;code&gt;"No such
    function: f"&lt;/code&gt;.  Once a minimal definition for &lt;code&gt;f&lt;/code&gt; has
    been provided, such as one that always returns 0, or that
    returns no result, the errors should turn into &lt;em&gt;failures&lt;/em&gt;,
    i.e. assertions that are not true.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When we have a suite of failing tests, software is written with
    the goal of making all the tests pass.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once all the tests pass, TDD methodology dictates that coding
    should stop because if the test suite is adequate (and free of
    errors) we have now demonstrated that the software is complete and
    correct.  Part of the TDD philosophy is that if more functionality
    is required, one or more further tests should be written to
    specify and demonstrate the need for more (or different) code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There is one more important stage in test-driven development,
    namely &lt;em&gt;refactoring&lt;/em&gt;. This is the process of restructuring,
    simplifying or otherwise improving code while maintaining its
    functionality (i.e., keeping the tests passing).  It is widely
    accepted that complexity is one of the biggest problems in
    software, and simplifying code as soon as the tests pass allows us
    to attempt to reduce complexity as early as possible. It is a
    recognition of the fact that the first successful implementation
    of some feature will typically not be the most direct and
    straightforward.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The philosophy of writing tests before the code they are designed to
validate leads some to suggest that the second "D" in TDD
&lt;em&gt;(development)&lt;/em&gt; should really stand for &lt;em&gt;design&lt;/em&gt; (e.g. Allen
Holub&lt;sup id="fnref:Houlob"&gt;&lt;a class="footnote-ref" href="#fn:Houlob"&gt;3&lt;/a&gt;&lt;/sup&gt;). This idea grows out of the observation that with
TDD, testing is moved from its traditional place towards the end of
the development cycle to a much earlier and more prominent position
where specification and design would traditionally occur.&lt;/p&gt;
&lt;p&gt;TDD advocates tend to argue for making tests very quick to run
(preferably mere seconds for the entire suite) so that there is no
impediment to running them frequently during development, not just
between each code commit,&lt;sup id="fnref:commit"&gt;&lt;a class="footnote-ref" href="#fn:commit"&gt;4&lt;/a&gt;&lt;/sup&gt;
but multiple times during the development of each function.&lt;/p&gt;
&lt;p&gt;Another important idea is that of &lt;em&gt;regression testing&lt;/em&gt;.  As noted
previously, a &lt;em&gt;regression&lt;/em&gt; is a defect that is introduced by a
modification to the software. A natural consequence of maintaining and
using a comprehensive suite of tests is that when such regressions
occur, they should be detected almost immediately. When a bug does
slip through without triggering a test failure, the TDD philosophy
dictates that before it is fixed, one or more failing tests should be
added to demonstrate the incorrect behaviour. By definition, when the
bug is fixed, these new tests will pass unless they themselves
contain errors.&lt;/p&gt;
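&lt;p&gt;As a minimal sketch of this practice, suppose a hypothetical function &lt;code&gt;mean&lt;/code&gt;
originally divided by the length of its input unconditionally, and a bug report
showed that it broke on an empty list. The function, the bug and the choice to
return &lt;code&gt;None&lt;/code&gt; are all assumptions made for illustration. The test
&lt;code&gt;testEmptyInput&lt;/code&gt; would be written first, to demonstrate the problem, and only
then would the code be fixed:&lt;/p&gt;

```python
import unittest


def mean(values):
    # Fixed version. The hypothetical original divided by len(values)
    # unconditionally, so it raised ZeroDivisionError for an empty list.
    if not values:
        return None
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def testTypical(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def testEmptyInput(self):
        # Added when the bug was reported, before the fix was written.
        # It did not pass against the old code; it now guards against
        # the same bug being reintroduced.
        self.assertIsNone(mean([]))
```

&lt;p&gt;Saved to a file, these tests can be run with &lt;code&gt;python -m unittest&lt;/code&gt;. The new
test stays in the suite permanently, so the same defect should be caught
immediately if it ever returns.&lt;/p&gt;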
&lt;h3 id="common-variations-flavours-and-implementations"&gt;Common Variations, Flavours and Implementations&lt;/h3&gt;
&lt;p&gt;A distinction is often made between &lt;em&gt;unit&lt;/em&gt; tests and &lt;em&gt;system tests&lt;/em&gt;
(also known as &lt;em&gt;integration tests&lt;/em&gt;). Unit tests are supposed to test
low-level software units (such as individual functions, methods or
classes).  There is often a particular focus on these low-level unit
tests, partly because these can often be made to run very quickly, and
partly (I think) because there is an implicit belief or assumption
that if each individual component is well tested, the whole system
built out of those components is likely to be reliable. (Personally, I
think this is a poor assumption.)&lt;/p&gt;
&lt;p&gt;In contrast, system tests and integration tests exercise many parts of
the system, often completing larger, more realistic tasks, and more
often interfacing with external systems. Such tests are often slower
and it can be hard to avoid their having side effects (such as
updating entries in databases).&lt;/p&gt;
&lt;p&gt;The distinction, however, between the different levels is somewhat
subjective, and some organizations give more equal or greater
weight to higher level tests. This will be an interesting issue
as we consider how to move towards test-driven data analysis.&lt;/p&gt;
&lt;p&gt;Another practice popular within some TDD schools is that of &lt;em&gt;mocking&lt;/em&gt;.
The general idea of mocking is to replace some functionality (such as
a database lookup, a URL fetch, a disk write, a trigger event or a
function call) with a simpler function call or a static value.  This
is done for two main reasons. First, if the mocked functionality is
expensive, or has side effects, test code can often be made much
faster and side-effect free if its execution is bypassed. Secondly,
mocking allows a test to focus on the correctness of a particular
aspect of functionality, without any dependence on the external part
of the system being mocked out.&lt;/p&gt;
&lt;p&gt;Other TDD practitioners are less keen on mocking, feeling that it
leads to less complete and less realistic testing, and raises the risk
of missing some kinds of defects. (Those who favour mocking also tend to
place a strong emphasis on &lt;em&gt;unit&lt;/em&gt; testing, and to argue that more
expensive, non-mocked tests should form part of &lt;em&gt;integration&lt;/em&gt; testing,
rather than part of the more frequently run core unit test suite.)&lt;/p&gt;
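&lt;p&gt;To make the idea of mocking concrete, here is a small sketch using Python's standard
&lt;code&gt;unittest.mock&lt;/code&gt; module. The &lt;code&gt;RateService&lt;/code&gt; class and the
&lt;code&gt;convert&lt;/code&gt; function are hypothetical names invented for this illustration;
&lt;code&gt;RateService.lookup&lt;/code&gt; stands in for an expensive external call such as a URL
fetch or database lookup:&lt;/p&gt;

```python
import unittest
from unittest import mock


class RateService:
    """Hypothetical wrapper around a slow external lookup."""

    def lookup(self, currency):
        # The real call is never made in the test below.
        raise RuntimeError("external lookup is not available in tests")


def convert(amount, currency, service):
    # Code under test: depends on the external service for the rate.
    return amount * service.lookup(currency)


class TestConvert(unittest.TestCase):
    def testConvertUsesRate(self):
        # Replace the service with a mock that returns a static value.
        service = mock.Mock(spec=RateService)
        service.lookup.return_value = 2.0

        self.assertEqual(convert(10, "EUR", service), 20.0)

        # The mock also records how it was called.
        service.lookup.assert_called_once_with("EUR")
```

&lt;p&gt;The test is fast and free of side effects, and checks only the arithmetic in
&lt;code&gt;convert&lt;/code&gt;. It says nothing about whether the real lookup works, which is
exactly the concern raised by those who are less keen on mocking.&lt;/p&gt;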
&lt;p&gt;While no special software is strictly required in order to follow a
broadly test-driven approach to development, good tools are extremely
helpful. There are standard libraries that support this for most
mainstream programming languages. The &lt;em&gt;xUnit&lt;/em&gt; family of test software
(e.g. &lt;code&gt;CUnit&lt;/code&gt; for C, &lt;code&gt;JUnit&lt;/code&gt; for Java, &lt;code&gt;unittest&lt;/code&gt; for Python) uses
a common architecture designed by Kent Beck.&lt;sup id="fnref:BeckTDD"&gt;&lt;a class="footnote-ref" href="#fn:BeckTDD"&gt;2&lt;/a&gt;&lt;/sup&gt; It is worth
noting that the &lt;code&gt;RUnit&lt;/code&gt; package is such a system for use with the
popular data analysis package R.&lt;/p&gt;
&lt;h3 id="example"&gt;Example&lt;/h3&gt;
&lt;p&gt;As an example, the following Python code tests a
function &lt;code&gt;f&lt;/code&gt;, as described above, using Python's
&lt;code&gt;unittest&lt;/code&gt; module.
Even if you are completely unfamiliar with Python, you will be able
to see the six crucial lines that implement exactly the six
tests described in pseudo-code above, in this case through four separate
test methods.&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;unittest&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestAddFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unittest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testNonNegatives&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testNegatives&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testStringInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertRaises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ne"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;a&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;testOverflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float_info&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float_info&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                         &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;inf&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;unittest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If this code is run, including the function definition for &lt;code&gt;f&lt;/code&gt;,
the output is as follows:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;$ python add_function&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;py&lt;/span&gt;
&lt;span class="nt"&gt;....&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="nb"&gt;----------------------------------------------------------------------&lt;/span&gt;&lt;span class="c"&gt;&lt;/span&gt;
&lt;span class="c"&gt;Ran 4 tests in 0&lt;/span&gt;&lt;span class="nt"&gt;.&lt;/span&gt;&lt;span class="c"&gt;000s&lt;/span&gt;

&lt;span class="c"&gt;OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here, each dot signifies a passing test.&lt;/p&gt;
&lt;p&gt;However, if this is run without defining &lt;code&gt;f&lt;/code&gt;, the result is the following
output:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ python add_function.py
&lt;span class="nv"&gt;EEEE&lt;/span&gt;
&lt;span class="o"&gt;======================================================================&lt;/span&gt;
ERROR: testNegatives &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestAddFunction&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;add_function.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;13&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; testNegatives
    self.assertEqual&lt;span class="o"&gt;(&lt;/span&gt;f&lt;span class="o"&gt;(&lt;/span&gt;-2, &lt;span class="m"&gt;17&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="m"&gt;15&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
NameError: global name &lt;span class="s1"&gt;&amp;#39;f&amp;#39;&lt;/span&gt; is not &lt;span class="nv"&gt;defined&lt;/span&gt;

&lt;span class="o"&gt;======================================================================&lt;/span&gt;
ERROR: testNonNegatives &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestAddFunction&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;add_function.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;9&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; testNonNegatives
    self.assertEqual&lt;span class="o"&gt;(&lt;/span&gt;f&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;, &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
NameError: global name &lt;span class="s1"&gt;&amp;#39;f&amp;#39;&lt;/span&gt; is not &lt;span class="nv"&gt;defined&lt;/span&gt;

&lt;span class="o"&gt;======================================================================&lt;/span&gt;
ERROR: testOverflow &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestAddFunction&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;add_function.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;20&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; testOverflow
    self.assertEqual&lt;span class="o"&gt;(&lt;/span&gt;f&lt;span class="o"&gt;(&lt;/span&gt;sys.float_info.max, sys.float_info.max&lt;span class="o"&gt;)&lt;/span&gt;,
NameError: global name &lt;span class="s1"&gt;&amp;#39;f&amp;#39;&lt;/span&gt; is not &lt;span class="nv"&gt;defined&lt;/span&gt;

&lt;span class="o"&gt;======================================================================&lt;/span&gt;
ERROR: testStringInput &lt;span class="o"&gt;(&lt;/span&gt;__main__.TestAddFunction&lt;span class="o"&gt;)&lt;/span&gt;
----------------------------------------------------------------------
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;&amp;quot;add_function.py&amp;quot;&lt;/span&gt;, line &lt;span class="m"&gt;17&lt;/span&gt;, &lt;span class="k"&gt;in&lt;/span&gt; testStringInput
    self.assertRaises&lt;span class="o"&gt;(&lt;/span&gt;TypeError, f, &lt;span class="s2"&gt;&amp;quot;a&amp;quot;&lt;/span&gt;, &lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
NameError: global name &lt;span class="s1"&gt;&amp;#39;f&amp;#39;&lt;/span&gt; is not defined

----------------------------------------------------------------------
Ran &lt;span class="m"&gt;4&lt;/span&gt; tests &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;.000s

FAILED &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here the four E's at the top of the output represent errors when running
the tests. If a dummy definition of &lt;code&gt;f&lt;/code&gt; is provided, such as:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;a&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;b&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;:&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;the tests will &lt;em&gt;fail&lt;/em&gt;, producing F's rather than raising the errors
that result in E's.&lt;/p&gt;
&lt;h3 id="benefits-of-test-driven-development"&gt;Benefits of Test-Driven Development&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Correctness.&lt;/strong&gt; The most obvious reason to adopt test-driven
development is the pursuit of higher software quality. TDD proponents
certainly feel that there is considerable benefit to maintaining a
broad and rich set of tests that can be run automatically.  There
is rather more debate about how important it is to write the tests
strictly &lt;em&gt;before&lt;/em&gt; the code they are designed to test. I would say that to
qualify as test-driven development, the tests should be
produced no later than immediately after each piece of functionality
is implemented, but purists would take a stricter view.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Regression detection.&lt;/strong&gt; The second benefit of TDD is in the detection
of regressions, i.e. failures of code in areas that previously ran
successfully. In practice, regression testing is even more powerful
than it sounds because not only can many different failure modes be
detected by a single test, but experience shows that there are often
areas of code that are susceptible to similar breakages from many
different causes and disturbances. (This can be seen as a rare case of
combinatorial explosion working to our advantage: there are many ways
to get code wrong, and far fewer to get it right, so a single test can
catch many different potential failures.)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Specification, Design and Documentation.&lt;/strong&gt; One of the stronger reasons
for writing tests before the functions they are designed to verify is
that the test code then forms a concrete specification. In order even
to write the test, a certain degree of clarity has to be brought to
the question of precisely what the function that is being written is
supposed to do. This is the key insight that leads towards the idea of
TDD as test-driven &lt;em&gt;design&lt;/em&gt; over test-driven &lt;em&gt;development&lt;/em&gt;. A useful
side effect of the test suite is that it also forms a precise and
practical form of documentation as to exactly how the code can be used
successfully, and one that, by definition, has to be kept up to date—a
perennial problem for documentation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Refactoring&lt;/strong&gt;. The benefits listed so far are relatively unsurprising.
The fourth is more profound. In many software projects, particularly
large and complex ones, once the software is deemed to be working
acceptably well, some areas of the code come to be regarded as &lt;em&gt;too
dangerous to modify,&lt;/em&gt; even when problems are discovered. Developers
(and managers) who know how much pain and effort was required to make
something work (or &lt;em&gt;more-or-less&lt;/em&gt; work) become fearful that the risks
associated with fixing or upgrading code are simply too high. In this
way, code becomes brittle and neglected and thus essentially
unmaintainable.&lt;/p&gt;
&lt;p&gt;In my view, the single biggest benefit of test-driven development is
that it goes a long way to eliminating this syndrome, allowing us to
re-write, simplify and extend code safely, confident in the
knowledge that if the tests continue to pass, it is unlikely that
anything very bad has happened to the code. The recommended practice
of refactoring code as soon as the tests pass is one aspect of this,
but the larger benefit of maintaining a comprehensive set of tests is
that such refactoring can be performed at any time.&lt;/p&gt;
&lt;p&gt;These are just the most important and widely recognized benefits of TDD.
Additional benefits include the ability to check that code is working
correctly on new machines or systems, or in any other new context, providing
a useful baseline of performance (if timed and recorded) and providing
an extremely powerful resource if code needs to be ported or reimplemented.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:software-regression"&gt;
&lt;p&gt;A software &lt;em&gt;regression&lt;/em&gt; is a bug in a later
version of software that was not present in a previous version of
the software.  It contrasts with bugs that may always have been
present but were not detected.&amp;#160;&lt;a class="footnote-backref" href="#fnref:software-regression" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:BeckTDD"&gt;
&lt;p&gt;Kent Beck, &lt;em&gt;Test-Driven Development,&lt;/em&gt; Addison Wesley (Vaseem) 2003.&amp;#160;&lt;a class="footnote-backref" href="#fnref:BeckTDD" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:Houlob"&gt;
&lt;p&gt;Allen Holub, &lt;em&gt;Test-Driven Design,&lt;/em&gt; Dr. Dobb's Journal, May 5th 2014.
&lt;a href="https://www.drdobbs.com/architecture-and-design/test-driven-design/240168102"&gt;https://www.drdobbs.com/architecture-and-design/test-driven-design/240168102&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:Houlob" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:commit"&gt;
&lt;p&gt;Most non-trivial software development uses a so-called
&lt;em&gt;revision control system&lt;/em&gt; to provide a comprehensive history of versions
of the code. Developers normally run code frequently, and typically
&lt;em&gt;commit&lt;/em&gt; changes to the revision-controlled &lt;em&gt;repository&lt;/em&gt; somewhat less
frequently (though still, perhaps, many times a day). With TDD, the
tests form an integral part of the code base, and it is common good
practice to require that code is only &lt;em&gt;committed&lt;/em&gt; when the tests pass.
Sometimes this requirement is merely a rule or convention, while in other
cases systems are set up in such a way as to enable code to be committed
only when all of its associated tests pass.&amp;#160;&lt;a class="footnote-backref" href="#fnref:commit" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="tdd"></category></entry><entry><title>Why Test-Driven Data Analysis?</title><link href="https://tdda.info/why-test-driven-data-analysis.html" rel="alternate"></link><published>2015-11-05T08:42:00+00:00</published><updated>2015-11-05T08:42:00+00:00</updated><author><name>Stochastic Solutions Limited</name></author><id>tag:tdda.info,2015-11-05:/why-test-driven-data-analysis.html</id><summary type="html">&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;OK, everything you need to know about TeX has been explained—unless
you happen to be fallible.   If you don't plan to make any errors,
don't bother to read this chapter.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;— The TeXbook, Chapter 27, &lt;em&gt;Recovery from Errors&lt;/em&gt;. Donald E. Knuth.&lt;sup id="fnref:KnuthTeXBook"&gt;&lt;a class="footnote-ref" href="#fn:KnuthTeXBook"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The concept of &lt;em&gt;test-driven data analysis&lt;/em&gt; seeks to …&lt;/p&gt;</summary><content type="html">&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;OK, everything you need to know about TeX has been explained—unless
you happen to be fallible.   If you don't plan to make any errors,
don't bother to read this chapter.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;— The TeXbook, Chapter 27, &lt;em&gt;Recovery from Errors&lt;/em&gt;. Donald E. Knuth.&lt;sup id="fnref:KnuthTeXBook"&gt;&lt;a class="footnote-ref" href="#fn:KnuthTeXBook"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The concept of &lt;em&gt;test-driven data analysis&lt;/em&gt; seeks to improve the
answers to two sets of questions, which are defined with
reference to an "analytical process".&lt;/p&gt;
&lt;p&gt;&lt;img alt="Figure 1: A typical analytical process" src="images/analytical-process-900x600.png"&gt;&lt;/p&gt;
&lt;p&gt;The questions assume that you have used the analytical process at
least once, with one or more specific collections of inputs, and that
you are ready to use, share, deliver or simply &lt;em&gt;believe&lt;/em&gt; the
results.&lt;/p&gt;
&lt;p&gt;The questions in the first group concern the &lt;em&gt;implementation&lt;/em&gt; of your
analytical process:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Implementation Questions&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;How confident are you that the outputs produced by the analytical
     process, with the input data you have used, are correct?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;How confident are you that the outputs would be the same if the
     analytical process were repeated using the same input data?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Does your answer change if you repeat the process using different
     hardware, or after upgrading the operating system or other
     software?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Would the analytical process generate any warning or error if
     its results were different from when you first ran it and satisfied
     yourself with the results?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the analytical process relies on any reference data, how
     confident are you that you would know if that reference data
     changed or became corrupted?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the analytical process were run with &lt;em&gt;different&lt;/em&gt; input data,
     how confident are you that the output would be correct on
     &lt;em&gt;that&lt;/em&gt; data?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If corrupt or invalid input data were used, how confident are you
     that the process would detect this and raise an appropriate
     warning, error or failure?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Would someone else be able reliably to produce the same results as
     you from the same inputs, given detailed instructions
     and access?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Corollary: do such detailed instructions exist? If you were
     knocked down by the proverbial bus, how easily could someone else
     use the analytical process?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If someone developed an equivalent analytical process, and their
     results were different, how confident are you that yours
     would prove to be correct?&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These questions are broadly similar to the questions addressed
by test-driven development, set in the specific context
of data analysis.&lt;/p&gt;
&lt;p&gt;The questions in our second group are concerned with the &lt;em&gt;meaning&lt;/em&gt; of the
analysis, and a larger, more important sense of correctness:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Interpretation Questions&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Is the input data&lt;sup id="fnref:SingularData"&gt;&lt;a class="footnote-ref" href="#fn:SingularData"&gt;2&lt;/a&gt;&lt;/sup&gt; correct?&lt;sup id="fnref:ProcessError"&gt;&lt;a class="footnote-ref" href="#fn:ProcessError"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;Is your interpretation of the input data correct?&lt;/li&gt;
&lt;li&gt;Are the algorithms you are applying to the data meaningful and appropriate?&lt;/li&gt;
&lt;li&gt;Are the results plausible?&lt;/li&gt;
&lt;li&gt;Is your interpretation of the results correct?&lt;/li&gt;
&lt;li&gt;More generally, what are you missing?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These questions are less clear cut than the implementation questions,
but are at least as important, and in some ways are more important.
If the implementation questions are about producing the right
&lt;em&gt;answers&lt;/em&gt;, the interpretation questions are about asking the
&lt;em&gt;right questions&lt;/em&gt;, and understanding the answers.&lt;/p&gt;
&lt;p&gt;Over the coming posts, we will seek to shape a coherent
methodology and set of tools to help us provide better answers
to both sets of questions—implementational and interpretational.
If we succeed, the result should be something worthy of the name
&lt;em&gt;test-driven data analysis.&lt;/em&gt;&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:KnuthTeXBook"&gt;
&lt;p&gt;Donald E. Knuth, The TeXbook, Chapter 27,
&lt;em&gt;Recovery from Errors.&lt;/em&gt; Addison Wesley (Reading Mass) 1984.&amp;#160;&lt;a class="footnote-backref" href="#fnref:KnuthTeXBook" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:SingularData"&gt;
&lt;p&gt;I am aware that, classically, &lt;em&gt;data&lt;/em&gt; is the plural of
&lt;em&gt;datum&lt;/em&gt;, and that purists would prefer my question to be phrased as
"Are the data correct?" If the use of 'data' in the singular offends
your sensibilities, I apologise.&amp;#160;&lt;a class="footnote-backref" href="#fnref:SingularData" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:ProcessError"&gt;
&lt;p&gt;When adding &lt;a href="https://www.tdda.info/pages/glossary#error-of-implementation"&gt;Error of Implementation&lt;/a&gt; and
&lt;a href="https://www.tdda.info/pages/glossary#error-of-interpretation"&gt;Error of Interpretation&lt;/a&gt; to the &lt;a href="https://www.tdda.info/pages/glossary"&gt;glossary&lt;/a&gt;, we decided that
this first question really pertained to a third category
of error, namely an &lt;a href="https://www.tdda.info/pages/glossary#error-of-process"&gt;Error of Process&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:ProcessError" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="questions"></category><category term="tdda"></category><category term="tdd"></category></entry><entry><title>Test-Driven Data Analysis</title><link href="https://tdda.info/test-driven-data-analysis.html" rel="alternate"></link><published>2015-11-05T08:30:00+00:00</published><updated>2015-11-05T08:30:00+00:00</updated><author><name>Nicholas J. Radcliffe</name></author><id>tag:tdda.info,2015-11-05:/test-driven-data-analysis.html</id><summary type="html">&lt;p&gt;A dozen or so years ago I stumbled across the idea of &lt;a href="https://en.wikipedia.org/wiki/Test-driven_development"&gt;&lt;em&gt;test-driven
development&lt;/em&gt;&lt;/a&gt;
from reading
&lt;a href="https://www.tbray.org/ongoing/When/200x/2003/05/08/FutureLanguage"&gt;various&lt;/a&gt;
&lt;a href="https://www.tbray.org/ongoing/When/200x/2004/02/16/WritingGenx"&gt;posts&lt;/a&gt;
by &lt;a href="https://www.tbray.org/ongoing/misc/Tim"&gt;Tim Bray&lt;/a&gt; on his
&lt;a href="https://www.tbray.org/ongoing"&gt;Ongoing&lt;/a&gt; blog.  It was obvious that
this was a significant idea, and I adopted it immediately. It has
since become an integral part of the software development …&lt;/p&gt;</summary><content type="html">&lt;p&gt;A dozen or so years ago I stumbled across the idea of &lt;a href="https://en.wikipedia.org/wiki/Test-driven_development"&gt;&lt;em&gt;test-driven
development&lt;/em&gt;&lt;/a&gt;
from reading
&lt;a href="https://www.tbray.org/ongoing/When/200x/2003/05/08/FutureLanguage"&gt;various&lt;/a&gt;
&lt;a href="https://www.tbray.org/ongoing/When/200x/2004/02/16/WritingGenx"&gt;posts&lt;/a&gt;
by &lt;a href="https://www.tbray.org/ongoing/misc/Tim"&gt;Tim Bray&lt;/a&gt; on his
&lt;a href="https://www.tbray.org/ongoing"&gt;Ongoing&lt;/a&gt; blog.  It was obvious that
this was a significant idea, and I adopted it immediately. It has
since become an integral part of the software development processes at
&lt;a href="https://stochasticsolutions.com"&gt;Stochastic Solutions&lt;/a&gt;, where we
develop our own analytical software (&lt;a href="https://stochasticsolutions.com/miro.html"&gt;Miró and the Artists
Suite&lt;/a&gt;) and custom solutions
for clients.  But software development is only part of what we do at
the company: the larger part of our work consists of actually &lt;em&gt;doing&lt;/em&gt;
data analysis for clients. This has a rather different dynamic.&lt;/p&gt;
&lt;p&gt;Fast forward to 2012, and a conversation with my long-term
collaborator and friend,
&lt;a href="https://www.hopper.com/research/patrick-surry/"&gt;Patrick Surry&lt;/a&gt;,
during which he said something to the effect of:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;So what about test-driven data analysis?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;— Patrick Surry, c. 2012&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The phrase resonated instantly, but neither of us entirely knew what it
meant.  It has lurked in my brain ever since, a kind of proto-meme,
attempting to inspire and attach itself to a concept worthy of the name.&lt;/p&gt;
&lt;p&gt;For the last fifteen months, my colleagues—Sam Rhynas and Simon
Brown—and I have been feeling our way towards an answer to the
question&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;What is test-driven data analysis?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We haven't yet pulled all the pieces together into a coherent
methodology, but we have assembled a set of useful practices, tools
and processes that feel as if they are part of the answer.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;A few weeks ago, my friend and ex-colleague &lt;a href="https://www.third-bit.com/about.html"&gt;Greg
Wilson&lt;/a&gt; was in town for
&lt;a href="https://www.epcc.ed.ac.uk"&gt;Edinburgh Parallel Computing Centre's&lt;/a&gt;
twenty-fifth birthday bash. Greg is a computer scientist and former
lecturer from the University of Toronto. He now spends most of his time
teaching scientists key ideas from software engineering through his
&lt;a href="https://software-carpentry.org"&gt;Software Carpentry&lt;/a&gt; organization. He
lamented that while he has no trouble persuading scientists of the
benefits of adopting ideas such as version control, he finds them almost
completely unreceptive when he champions software testing. I was
initially rather shocked by this, since I routinely say that
test-driven development is the most significant idea in software in
the last thirty or forty years. Thinking about it more, however, I
suspect the reasons for the resistance Greg encounters are similar to
the reasons we have found it harder than we expected to take
mainstream ideas from test-driven development and apply them in the
rather specialized area of data analysis. Testing scientific code is
more like testing analysis processes than it is like testing software
&lt;em&gt;per se.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;As I reflected further on what Greg had said, I experienced a moment
of clarity. The new insight is that while we have a lot of useful
components for &lt;em&gt;test-driven data analysis,&lt;/em&gt; including some useful
fragments of a methodology, we really don't have appropriate &lt;em&gt;tools&lt;/em&gt;:
the &lt;a href="https://en.wikipedia.org/wiki/XUnit"&gt;xUnit&lt;/a&gt; frameworks and their
ilk are excellent for test-driven &lt;em&gt;development&lt;/em&gt;, but don't provide
specific support for the patterns we tend to need in analysis, and
address only a subset of the issues we should want test-driven &lt;em&gt;data
analysis&lt;/em&gt; to cover.&lt;/p&gt;
&lt;p&gt;The purpose of this new blog is to think out loud as we—in
partnership with one of our key clients,
&lt;a href="https://skyscanner.net"&gt;Skyscanner&lt;/a&gt;—try to develop tools
and methodologies to form a coherent framework and support system for a
more systematic approach to data science—a &lt;em&gt;test-driven&lt;/em&gt;
approach to &lt;em&gt;data analysis&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;So watch this space.&lt;/p&gt;
&lt;p&gt;If you want to subscribe, this site has
&lt;a href="https://www.tdda.info/feeds/all.rss"&gt;RSS&lt;/a&gt; and
&lt;a href="https://www.tdda.info/feeds/all.atom.xml"&gt;ATOM&lt;/a&gt; feeds, and also offers
&lt;a href="https://eepurl.com/bEjuP5"&gt;email subscriptions&lt;/a&gt;.&lt;sup id="fnref:mailchimp"&gt;&lt;a class="footnote-ref" href="#fn:mailchimp"&gt;1&lt;/a&gt;&lt;/sup&gt; We'll be
tweeting on &lt;a href="https://twitter.com/tdda0"&gt;@tdda0&lt;/a&gt; whenever there are new
posts. Twitter is also probably the best way to send feedback, since we haven't
plumbed in comments at this time: we'd love to hear what you think.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:mailchimp"&gt;
&lt;p&gt;through &lt;a href="https://mailchimp.com"&gt;MailChimp&lt;/a&gt;; thanks, MailChimp!&amp;#160;&lt;a class="footnote-backref" href="#fnref:mailchimp" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="TDDA"></category><category term="motivation"></category></entry></feed>