2015-11-05: Test-Driven Data Analysis Motivation for test-driven data analysis
2015-11-05: Why Test-Driven Data Analysis? Two sets of questions that inspire test-driven data analysis. The first set focuses on correctness of implementation ("getting the right answers"). The second set focuses on correctness of interpretation ("asking the right questions").
2015-11-09: Test-Driven Development: A Review A brief overview of test-driven development and its benefits.
2015-11-13: How is this Misleading Data Misleading Me? More on errors of interpretation.
2015-11-16: Infinite Gain: The First Test Regression tests for TDDA: Introducing reference tests and a Law of Software Regressions.
2015-11-23: Site News: Glossary; Table of Contents; Feeds A few site changes and plans
2015-11-26: Constraints and Assertions Automatic generation and verification of constraints on datasets.
2015-12-11: Overview of TDDA in Predictive Analytics Times
2015-12-14: Generalized Overfitting: Errors of Applicability On the many ways analytical processes can overfit data even when they do not involve predictive modelling. Topics include concrete and abstract specification.
2016-01-06: How far in advance are flights cheapest? An error of interpretation On the important-but-subtle difference between the questions "How far in advance is any given ticket cheapest?" and "How far in advance is the average price of tickets sold that day lowest?"
2016-02-15: Lessons Learned: Bad Data and other SNAFUs
2016-04-15: In Defence of XML: Exporting and Analysing Apple Health Data Extracting CSV files from the export.xml file written by the Apple Health app on iOS.
2016-04-18: First Test On writing a first "reference" test
2016-04-19: Unit Tests On adding some unit tests
2016-04-20: Extracting More Apple Health Data On extending the iOS Apple Health App data extractor
2016-09-17: Slides and Rough Transcript of TDDA talk from PyCon UK 2016 Test-Driven Data Analysis talk (slides and transcript) from PyCon UK 2016.
2016-09-18: WritableTestCase: Example Use Example of how to use writabletestcase.WritableTestCase
2016-11-03: Constraint Discovery and Verification for Pandas DataFrames Introducing the TDDA constraints library with Pandas bindings.
The TDDA Constraints File Format
.tdda Constraints File Format
2016-11-11: Introducing Rexpy: Automatic Discovery of Regular Expressions Regular expressions are powerful pattern-matching rules for strings. They are fast and widely supported but hard to write and harder to read and debug. Rexpy is a library that aims to take the pain out of producing useful, correct regular expressions by finding them automatically from the collection of strings that are to be matched.
2017-01-26: The New ReferenceTest class for TDDA The Python tdda module has been extended with a new ReferenceTest class, which supersedes WritableTestCase and has many more features. The tdda library is also now available using pip from PyPI.
2017-01-31: Coverage information for Rexpy The tdda library's regular-expression discovery functionality has been extended to provide information about how many examples each resulting regular expression matches ("covers"). There are new methods for getting various information about such coverage.
2017-02-10: TDDA 1-pager A 1-page summary of TDDA is available.
2017-02-20: Errors of Interpretation: Bad Graphs with Dual Scales It is a primary responsibility of analysts to present findings and data clearly, in ways that minimize the likelihood of misinterpretation. Graphs should help with this but, all too often, badly drawn graphs (whether deliberately or through oversight) make misinterpretation highly likely.
2017-03-08: An Error of Process Yesterday, email subscribers to the blog, and some RSS/casual viewers, will have seen a half-finished (in fact, abandoned) post that began to try to characterize success and failure on the crowd-funding platform Kickstarter. This post explains what happened and tries to salvage a "teachable moment" out of this minor fiasco.
2017-03-09: Improving Rexpy Rexpy is an open-source Python library and online tool for finding regular expressions from examples. It focuses on regular expressions for structured data (such as those used for things like identifiers, postcodes, URLs and telephone numbers) rather than free text or toy examples. A new release significantly improves the algorithm used for finding regular expressions, often resulting in more precise regular expressions while degrading performance in very few places.
2017-05-04: Quick Reference for TDDA Library A quick-reference guide ("cheat sheet") for the TDDA library is now available.
2017-09-08: GDPR, Consent and Microformats: A Half-Baked Idea The General Data Protection Regulation (GDPR) is coming. This post outlines an idea for a way to make it more workable by using a "microformat" (or similar) to specify consent requests and responses in a simple digital form, on websites and in apps, that would be more precise, consistent and verifiable for all sides.
2017-09-14: Obtaining the Python tdda Library Reference information about how to obtain/install/use the TDDA library
2017-09-21: Constraint Generation in the Presence of Bad Data Relaxing the requirement that datasets used for algorithmic constraint generation contain only good data.
2017-10-06: Automatic Constraint Generation and Verification White Paper Correctness is a key problem at every stage of data science projects: completing an entire analysis without a serious error at some stage is surprisingly hard. Even errors that reverse or completely invalidate the analysis can be hard to detect. Test-Driven Data Analysis (TDDA) attempts to identify, reduce, and aid correction of such errors. A core tool that we use in TDDA is Automatic Constraint Discovery and Verification. The paper linked from this post describes the approach in detail.
2017-11-30: Data Provenance and Data Lineage: the View from the Podcasts Summary of a couple of podcast discussions of data provenance and data lineage
2017-12-12: Our Approach to Data Provenance The last post introduced the idea of data provenance (a.k.a. data lineage), as outlined on a couple of podcasts. This post explains how we approach this issue at Stochastic Solutions, both from a methodological and a software perspective.
2018-05-01: Saving Time Running Subsets of Tests with Tagging When your tests take any non-trivial amount of time to run, as is reasonably common with analytical tests, it is useful to have a convenient way to run a subset of them, or a single test. We have added this capability to unittest-based tests through the new tagging mechanism in the TDDA library.
2018-05-04: Detecting Bad Data and Anomalies with the TDDA Library (Part I) The data verification capabilities of the TDDA library have been extended to allow the identification of individual records failing constraints, optionally with detailed diagnosis of how they fail. There is a new API call and a new command-line primitive, detect, and this starts to allow the TDDA library to be used as a general-purpose anomaly detection system.
2018-05-22: Tagging PyTest Tests We recently introduced the @tag decorator to tag a subset of tests or test classes to be run, but this was available only for unittest. This has now been extended to work under pytest.