A recent post
described the new ability to run a subset of
from the tdda library
by tagging tests or test classes
Initially, this ability was only available for
From version 1.0 of the tdda library,
we have …
The test-driven data analysis library, tdda, has two main kinds of functionality
- support for testing complex analytical processes
- support for verifying data against constraints, and optionally for discovering such constraints from example data.
Until now, however, the verification process has only reported which constraints failed to …Continue reading
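To make the two halves of the constraints idea concrete, here is a minimal sketch in plain Python. It does not use the tdda library's API: the function names `discover` and `verify`, the rule names, and the example records are all hypothetical, chosen only to illustrate discovering simple constraints from example data and then reporting which of them another dataset fails.

```python
from numbers import Number


def discover(records):
    """Infer simple per-field constraints from example records (hypothetical)."""
    constraints = {}
    fields = sorted({field for record in records for field in record})
    for field in fields:
        values = [record.get(field) for record in records]
        present = [v for v in values if v is not None]
        # Nulls are allowed only if the examples themselves contain one.
        rules = {"allow_null": len(present) < len(values)}
        types = {type(v) for v in present}
        if len(types) == 1:
            rules["type"] = types.pop()
        # Only numeric fields get range constraints in this sketch.
        if present and all(isinstance(v, Number) for v in present):
            rules["min"] = min(present)
            rules["max"] = max(present)
        constraints[field] = rules
    return constraints


def verify(records, constraints):
    """Return a sorted list of (field, rule) pairs that the data violates."""
    failed = set()
    for field, rules in constraints.items():
        for record in records:
            value = record.get(field)
            if value is None:
                if not rules.get("allow_null", True):
                    failed.add((field, "allow_null"))
                continue
            if "type" in rules and not isinstance(value, rules["type"]):
                failed.add((field, "type"))
                continue
            if "min" in rules and value < rules["min"]:
                failed.add((field, "min"))
            if "max" in rules and value > rules["max"]:
                failed.add((field, "max"))
    return sorted(failed)


# Illustrative data only.
examples = [{"age": 31, "name": "Ann"}, {"age": 58, "name": "Bob"}]
incoming = [{"age": 17, "name": "Cy"}, {"age": 40, "name": None}]

constraints = discover(examples)
failures = verify(incoming, constraints)
print(failures)
```

Note that `verify` here reports only which constraints failed, not which records caused the failures.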
It is common, when working with tests for analytical processes, for test suites to take a non-trivial amount of time to run. It is often helpful to have a convenient way to execute a subset of tests, or even a single test.
We have added a simple mechanism for allowing this …Continue reading
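As a rough illustration of the general idea of running only a tagged subset of tests, here is a sketch built on the standard `unittest` module alone. It is not the tdda mechanism itself: the `tag` decorator, the `_tagged` attribute and the `tagged_suite` helper are hypothetical names invented for this example.

```python
import unittest


def tag(test):
    """Mark a test method or TestCase class as tagged (hypothetical helper)."""
    test._tagged = True
    return test


def tagged_suite(*test_classes):
    """Build a suite containing only the tagged tests from the given classes."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for cls in test_classes:
        # A tagged class contributes all of its tests.
        class_tagged = getattr(cls, "_tagged", False)
        for name in loader.getTestCaseNames(cls):
            method = getattr(cls, name)
            if class_tagged or getattr(method, "_tagged", False):
                suite.addTest(cls(name))
    return suite


class AnalysisTests(unittest.TestCase):
    def test_full_pipeline(self):
        # Stands in for a slow test that we want to skip for now.
        self.assertEqual(sum(range(1000)), 499500)

    @tag
    def test_single_step(self):
        # The one test we are currently working on.
        self.assertEqual(sorted([3, 1, 2]), [1, 2, 3])


if __name__ == "__main__":
    unittest.TextTestRunner(verbosity=2).run(tagged_suite(AnalysisTests))
```

Run as a script, this executes only `test_single_step`; removing the tag filter (or tagging the whole class) brings the slow test back in.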
Our previous post introduced the idea of data provenance (a.k.a. data lineage), which has been discussed on a couple of podcasts recently. This is an issue that is close to our hearts at Stochastic Solutions. Here, we'll talk about how we handle this issue, both methodologically and in …Continue reading
In Episode 49 of the Not So Standard Deviations podcast, the final segment (starting at 59:32) discusses data lineage, after Roger Peng listened to the September 3rd (2017) episode of another podcast, Linear Digressions, which discussed that subject.
This is a topic very close to our hearts, and I …Continue reading
We have a new White Paper available:
Automatic Constraint Generation and Verification
Correctness is a key problem at every stage of data science projects: completing an entire analysis without a serious error at some stage is surprisingly hard. Even errors that reverse or completely invalidate the analysis can be …Continue reading
Bad data is widespread and pervasive.1
Only datasets and analytical processes that have been subject to rigorous and sustained quality assurance processes are typically capable of achieving low or zero error rates. "Badness" can take many forms and have various aspects, including incorrect values, missing values, duplicated entries, misencoded …Continue reading
This post is a standing post that we plan to try to keep up to date, describing options for obtaining the open-source Python TDDA library that we maintain.
Using pip from PyPI
If you don't need source, and have Python installed, the easiest way to get the TDDA library is …Continue reading
Last night I went to The Protectors of Data Scotland Meetup on the subject of Marketing and GDPR. If you're not familiar with Europe's fast-approaching General Data Protection Regulation, and you keep or process any personal data about humans, you probably ought to learn about it. A good place …Continue reading
We will try to keep it up-to-date as the library evolves.
See you all at PyData London 2017 this weekend (5-6 May 2017), where we'll be running a …Continue reading
Today we are announcing some enhancements to Rexpy, the tdda tool for finding regular expressions from examples. In short, the new version often finds more precise regular expressions than was previously the case, with the only downside being a modest increase in run-time.
Background on Rexpy is available in two …Continue reading
Yesterday, email subscribers to the blog, and some RSS/casual viewers, will have seen a half-finished (in fact, abandoned) post that began to try to characterize success and failure on the crowd-funding platform Kickstarter.
The post was abandoned because I didn't believe its first conclusion, but unfortunately was published by …Continue reading
It is a primary responsibility of analysts to present findings and data clearly, in ways to minimize the likelihood of misinterpretation. Graphs should help this, but all too often, if drawn badly (whether deliberately or through oversight) they can make misinterpretation highly likely. I want to illustrate this danger with …Continue reading
We have written a 1-page summary of some of the core ideas in TDDA.
It is available as a PDF from stochasticsolutions.com/pdf/TDDA-One-Pager.pdf. Continue reading
rexpy to the Python
tdda module. Rexpy is used to
find regular expressions from example strings.
One of the most common requests from Rexpy users has been for information regarding how many examples each resulting regular expression matches.
We have now added a few methods …Continue reading
Since the last post, we have extended the reference test functionality
in the Python
Major changes (as of version 0.2.5, at the time of writing) include:
- Introduction of a new ReferenceTest class that has significantly more functionality than the previous (now deprecated)
- Support for
There's a Skyscanner data feed we have been working with for a year or so. It's produced some six million records so far, each of which has a transaction ID consisting of three parts—a four-digit alphanumeric transaction type, a numeric timestamp and a UUID, with the three parts …Continue reading
We recently extended the tdda library to include support for automatic discovery of constraints from datasets, and for verification of datasets against constraints. Yesterday's post—Constraint Discovery and Verification for Pandas DataFrames—describes these developments and the API.
The library we published is intended to be a base for …Continue reading
In a previous post, Constraints and Assertions, we introduced the idea of using constraints to verify input, output and intermediate datasets for an analytical process. We also demonstrated that candidate constraints can be automatically generated from example datasets. We prototyped this in our own software (Miró) expressing constraints as …Continue reading
In my PyCon UK talk yesterday I promised to update and document the copy of writabletestcase.WritableTestCase on GitHub.
The version I've put up is not quite as powerful as the example I showed in the talk—that will follow—but has the basic functionality.
I've now added examples …Continue reading
Python UK 2016, Cardiff.
I gave a talk on test-driven data analysis at PyCon UK 2016, Cardiff, today.
The slides (which are kind-of useless without the words) are available here.
More usefully, a rough transcript, with thumbnail slides, is available here. Continue reading
The first version of the Python code for extracting data from the XML export from the Apple Health app on iOS neglected to extract Activity Summaries and Workout data. We will now fix that.
As usual, I'll remind you how to get the code, if you want, then discuss the changes …Continue reading
We will now expand that test with a few other, smaller and more conventional unit tests. Each …Continue reading
In the last post,
I presented some code for extracting (some of) the data from the XML
file exported by the
Apple Health app on iOS, but—almost
comically, given this blog's theme—omitted to include any tests.
This post and the next couple (in quick succession) will aim to …
I'm going to present a series of posts based around the sort of health and fitness data that can now be collected by some phones and dedicated fitness trackers. Not all of these will be centrally on topic for test-driven data analysis, but I think they'll provide an interesting set …Continue reading
My first paid programming job was working for my local education authority during the summer. The Advisory Unit for Computer-Based Education (AUCBE), run by a fantastic visionary and literal "greybeard" called Bill Tagg, produced software for schools in Hertfordshire and environs, and one of their products was a simple database …Continue reading
Every year, Expedia and ARC collaborate to publish some annual statistics about domestic airfare, including their treatment of the perennial question "How far in advance should you book your flight?" Here's what they presented in their report last year:
Although there …Continue reading
Everyone building predictive models or performing statistical fitting knows about overfitting. This arises when the function represented by the model includes components or aspects that are overly specific to the particularities of the sample data used for training the model, and that are not general features of datasets to which …Continue reading
We have an overview piece in Predictive Analytics Times this week.
You can find it here. Continue reading
Consistency Checking of Inputs, Outputs and Intermediates
While the idea of regression testing comes straight from test-driven development, the next idea we want to discuss is associated more with general defensive programming than TDD. The idea is consistency checking, i.e. verifying that what might otherwise be implicit assumptions are …Continue reading
The first idea we want to appropriate from test-driven development is that of regression testing, and our specific analytical variant of this, the idea of a reference test.
We propose a "zeroth level" of test-driven data analysis as recording one or more specific sets of inputs to an analytical process …Continue reading
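A reference test in this regression-testing sense might look something like the sketch below. It assumes that the recorded inputs are stored alongside the outputs they are expected to produce, which the excerpt above does not spell out. The `summarise` function, the recorded values and the test class name are all hypothetical, and the sketch uses only the standard library rather than the tdda library.

```python
import json
import unittest


def summarise(values):
    """Hypothetical analytical process: basic summary statistics."""
    n = len(values)
    return {
        "n": n,
        "min": min(values),
        "max": max(values),
        "mean": round(sum(values) / n, 6),
    }


# Recorded once, from a run whose results were checked by hand (illustrative).
REFERENCE_INPUT = [2.0, 4.0, 4.0, 5.0, 10.0]
REFERENCE_OUTPUT = json.dumps(
    {"n": 5, "min": 2.0, "max": 10.0, "mean": 5.0}, sort_keys=True
)


class TestSummariseReference(unittest.TestCase):
    def test_reference(self):
        # Re-run the process on the recorded input and compare the
        # serialized result with the recorded output.
        actual = json.dumps(summarise(REFERENCE_INPUT), sort_keys=True)
        self.assertEqual(actual, REFERENCE_OUTPUT)


if __name__ == "__main__":
    unittest.main()
```

Comparing serialized output rather than in-memory objects mirrors the common case where the reference result lives in a file.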
"Why is this lying bastard lying to me?"
Louis Heren,1 often attributed to Jeremy Paxman.
In a previous post, we made a distinction between two kinds of errors—implementation errors and errors of interpretation. I want to amplify that today, focusing specifically on interpretation.
The most important question to …Continue reading
Since a key motivation for developing test-driven data analysis (TDDA) has been test-driven development (TDD), we need to conduct a lightning tour of TDD before outlining how we see TDDA developing. If you are already familiar with test-driven development, this may not contain too much that is new for you …Continue reading
OK, everything you need to know about TeX has been explained—unless you happen to be fallible. If you don't plan to make any errors, don't bother to read this chapter.
— The TeXbook, Chapter 27, Recovery from Errors. Donald E. Knuth.1
The concept of test-driven data analysis seeks to …Continue reading
A dozen or so years ago I stumbled across the idea of test-driven development from reading various posts by Tim Bray on his Ongoing blog. It was obvious that this was a significant idea, and I adopted it immediately. It has since become an integral part of the software development …Continue reading