TDDA 1-pager

Posted on Fri 10 February 2017 in TDDA • Tagged with tdda

We have written a 1-page summary of some of the core ideas in TDDA.

It is available as a PDF from stochasticsolutions.com/pdf/TDDA-One-Pager.pdf.

Coverage information for Rexpy

Posted on Tue 31 January 2017 in TDDA • Tagged with tdda, regular expressions

Rexpy Stats

We previously added rexpy to the Python tdda module. Rexpy is used to find regular expressions from example strings.

One of the most common requests from Rexpy users has been for information regarding how many examples each resulting regular expression matches.

We have now added a few methods to Rexpy to support this.

Currently this is only available in the Python library for Rexpy, available as part of the tdda module, with either

pip install tdda

git clone https://github.com/tdda/tdda.git

Needless to say, we also plan to use this functionality in the online version of Rexpy in the future.

Rexpy: Quick Recap

The following example shows simple use of Rexpy from Python:

$ python
>>> from tdda import rexpy
>>>
>>> corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
>>> for r in rexpy.extract(corpus):
>>>    print(r)
>>>
^\d{1,3}\-[A-Z]{2}\-\d{3}$

In this case, Rexpy found a single regular expression that matched all the strings, but in many cases it returns a list of regular expressions, each covering some subset of the examples.

The way the algorithm currently works, in most cases¹ each example will match only one regular expression, but in general, some examples might match more than one pattern. So we've designed the new functionality to work even when this is the case. We've provided three new methods on the Extractor class, which gives a more powerful API than the simple extract function.

Here's an example based on one of Rexpy's tests:

>>> urls2 = [
    'stochasticsolutions.com/',
    'apple.com/',
    'stochasticsolutions.com/',  # actual duplicate
    'https://www.stochasticsolutions.co.uk/',
    'https://www.google.co.uk/',
    'https://www.google.com',
    'https://www.google.com/',
    'https://www.guardian.co.uk/',
    'https://www.guardian.com',
    'https://www.guardian.com/',
    'https://www.stochasticsolutions.com',
    'web.stochasticsolutions.com',
    'https://www.stochasticsolutions.com',
    'tdda.info',
    'gov.uk',
    'https://web.web',
]
>>> x = rexpy.Extractor(urls2)
>>> for r in x.results.rex:
>>>    print(r)

^[a-z]{3,4}\.[a-z]{2,4}$
^[a-z]+\.com\/$
^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
^http\:\/\/www\.[a-z]{6,8}\.com\/$
^http\:\/\/www\.[a-z]+\.co\.uk\/$

As you can see, Rexpy has produced six different regular expressions, some of which should probably be collapsed together. The Extractor object we have created has three new methods available.

The New Coverage Methods

The simplest new method is coverage(dedup=False), which returns a list of the number of matches for each regular expression returned, in the same order as the regular expressions in x.results.rex. So:

>>> print(x.coverage())
[2, 3, 2, 4, 2, 3]

is the list of frequencies for the six regular expressions given, in order. So the pairings are illustrated by:

>>> for k, n in zip(x.results.rex, x.coverage()):
        print('%d examples are matched by %s' % (n, k))

2 examples are matched by ^[a-z]{3,4}\.[a-z]{2,4}$
3 examples are matched by ^[a-z]+\.com\/$
2 examples are matched by ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
4 examples are matched by ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
2 examples are matched by ^http\:\/\/www\.[a-z]{6,8}\.com\/$
3 examples are matched by ^http\:\/\/www\.[a-z]+\.co\.uk\/$

The optional dedup parameter, when set to True, requests deduplicated frequencies, i.e. ignoring any duplicate strings passed in (remembering that Rexpy strips whitespace from both ends of input strings). In this case, there is just one duplicate string (stochasticsolutions.com/). So:

>>> print(x.coverage(dedup=True))
[2, 2, 2, 4, 2, 3]

where the second number (the matches for ^[a-z]+\.com\/$) is now 2, because stochasticsolutions.com/ has been deduplicated.

We can also find the total number of examples, with or without duplicates, by calling the n_examples(dedup=False) method:

>>> print(x.n_examples())
16
>>> print(x.n_examples(dedup=True))
15

But what we will probably normally be most interested in doing is sorting the regular expressions from highest to lowest coverage, ignoring any examples matched by an earlier pattern in cases where they do overlap. That's exactly what the incremental_coverage(dedup=False) method does for us. It returns an ordered dictionary.

>>> for (k, n) in x.incremental_coverage().items():
        print('%d: %s' % (n, k))
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^[a-z]+\.com\/$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$

This is our sixteen input strings (including duplicates), and the number of examples matched by this expression, not matched by any previous expression. (As noted earlier, that caveat probably won't make any difference at the moment, but it will in future versions.) So, to be explicit, this is saying:

The regular expression that matches most examples is: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$ which matches 4 of the 16 strings.
Of the remaining 12 examples, 3 are matched by ^[a-z]+\.com\/$.
Of the remaining 9 examples, 3 more are matched by ^http\:\/\/www\.[a-z]+\.co\.uk\/$
and so on.

Note, that in the case of ties, Rexpy sorts regular expressions as strings to break ties.

We can get the deduplicated numbers if we prefer:

>>> for (k, n) in x.incremental_coverage(dedup=True).items():
        print('%d: %s' % (n, k))
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]+\.com\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$

That's all the new functionality for now. Let us know how you get on, and if you find any problems. And tweet your email address to @tdda0 if you want to join the TDDA Slack to discuss anything around the subject of test-driven data analysis.

[NOTE: This post was updated on 10.2.2017 after an update to the rexpy library changed function and attribute names from "sequential" (which was not very descriptive) to "incremental", which is better.]

In fact, probably in all cases, currently ↩

The New ReferenceTest class for TDDA

Posted on Thu 26 January 2017 in TDDA • Tagged with tdda, constraints, rexpy

Since the last post, we have extended the reference test functionality in the Python tdda library. Major changes (as of version 0.2.5, at the time of writing) include:

Introduction of a new ReferenceTest class that has significantly more functionality from the previous (now deprecated) WritableTestCase.
Support for pytest as well as unittest.
Available from PyPI with pip install tdda, as well as from Github.
Support for comparing CSV files.
Support for comparing pandas DataFrames.
Support for preprocessing results before comparison (beyond simply dropping lines) in reference tests.
Greater consistency between parameters and options for all comparison methods
Support for categorizing kinds of reference data and rewriting only nominated categories (with -w or --write)
More (meta) tests of the reference test functionality.

Background: reference tests and WritableTestCase

We previously introduced the idea of a reference test with an example in the post First Test, and then when describing the WritableTestCase library. A reference test is essentially the TDDA equivalent of a software system test (or integration test), and is characterized by:

normally testing a relatively large unit of analysis functionality (up to and including whole analytical processes)
normally generating one or more large or complex outputs that are complex to verify (e.g. datasets, tables, graphs, files etc.)
sometimes featuring unimportant run-to-run differences that mean testing equality of actual and expected output will often fail (e.g. files may contain date stamps, version numbers, or random identifiers)
often being impractical to generate by hand
often needing to be regenerated automatically (after verification!) when formats change, when bugs are fixed or when understanding changes.

The old WritableTestCase class that we made available in the tdda library provided support for writing, running and updating such reference tests in Python by extending the unittest.TestCase class and providing methods for writing reference tests and commands for running (and, where necessary, updating the reference results used by) reference tests.

Deprecating WritableTestCase in Favour of ReferenceTest

Through our use of reference testing in various contexts and projects at Stochastic Solutions, we have ended up producing three different implementations of reference-test libraries, each with different capabilities. We also become aware that an increasing number of Python developers have a marked preference for pytest over unittest, and wanted to support that more naturally. The new ReferenceTest class brings together all the capabilities we have developed, standardizes them and fills in missing combinations while providing idiomatic patterns for using it both in the context of unittest and pytest.

We have no immediate plans to remove WritableTestCase from the tdda library (and, indeed, continue use it extensively ourselves), but encourage people to adopt ReferenceTest instead as we believe it is superior in all respects.

Availability and Installation

You can now install the tdda module with pip:

pip install tdda

and the source remains available under an MIT licence from github with

git clone https://github.com/tdda/tdda.git

The tdda library works under Python 2.7 and Python 3 and includes reference test functionality mentionedabove, constraint discovery and verification, including a Pandas version and automatic discovery of regular expressions from examples (rexpy, also available online at here).

After installation, you can run TDDA's tests as follows:

$ python -m tdda.testtdda
............................................................................
----------------------------------------------------------------------
Ran 76 tests in 0.279s

Getting example code

However you have obtained the tdda module, you can get a copy of its reference test examples by running the command

$ python -m tdda.referencetest.examples

which will place them in a referencetest-examples subdirectory of your current directory. Alternatively, you can specify that you want them in a particular place with

$ python -m tdda.referencetest.examples /path/to/particular/place

There are variations for getting examples for the constraint generation and validation functionality, and for regular expression extraction (rexpy)

$ python -m tdda.constraints.examples [/path/to/particular/place]
$ python -m tdda.rexpy.examples [/path/to/particular/place]

Example use of ReferenceTest from unittest

Here is a cut-down example of how to use the ReferenceTest class with Python's unittest, based on the example in referencetest-examples/unittest. For those who prefer pytest, there is a similar pytest-ready example in referencetest-examples/pytest.

from __future__ import unicode_literals

import os
import sys
import tempfile

from tdda.referencetest import ReferenceTestCase

# ensure we can import the generators module in the directory above,
# wherever that happens to be
FILE_DIR = os.path.abspath(os.path.dirname(__file__))
PARENT_DIR = os.path.dirname(FILE_DIR)
sys.path.append(PARENT_DIR)

from generators import generate_string, generate_file

class TestExample(ReferenceTestCase):
    def testExampleStringGeneration(self):
        actual = generate_string()
        self.assertStringCorrect(actual, 'string_result.html',
                                 ignore_substrings=['Copyright', 'Version'])

    def testExampleFileGeneration(self):
        outdir = tempfile.gettempdir()
        outpath = os.path.join(outdir, 'file_result.html')
        generate_file(outpath)
        self.assertFileCorrect(outpath, 'file_result.html',
                               ignore_patterns=['Copy(lef|righ)t',
                                                'Version'])

TestExample.set_default_data_location(os.path.join(PARENT_DIR, 'reference'))


if __name__ == '__main__':
    ReferenceTestCase.main()

Notes

These tests illustrate comparing a string generated by some code to a reference string (stored in a file), and testing a file generated by code to a reference file.
We need to tell the ReferenceTest class where to find reference files used for comparison. The call to set_default_data_location, straight after defining the class, does this.
The first test generates the string actual and compares it to the contents of the file string_result.html in the data location specified (../reference). The ignore_substrings parameter specifies strings which, when encountered, cause these lines to be omitted from the comparison.
The second test instead writes a file to a temporary directory (but using the same name as the reference file). In this case, rather than ignore_strings we have used an ignore_patterns parameter to specify regular expressions which, when matched, cause lines to be disregarded in comparisons.
There are a number of other parameters that can be added to the various assert... methods to allow other kinds of discrepancies between actual and generated files to be disregarded.

Running the reference tests: Success, Failure and Rewriting

If you just run the code above, or the file in the examples, you should see output like this:

$ python test_using_referencetestcase.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

or, if you use pytest, like this:

$ pytest
============================= test session starts ==============================
platform darwin -- Python 2.7.11, pytest-3.0.2, py-1.4.31, pluggy-0.3.1
rootdir: /Users/njr/tmp/referencetest-examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 2 items

test_using_referencepytest.py ..

=========================== 2 passed in 0.01 seconds ===========================

If you then edit generators.py in the directory above and make some change to the HTML in the generate_string and generate_file functions preferably non-semantic changes, like adding an extra space before the > in </html>) and then rerun the tests, you should get failures. Changing just the generate_string function:

$ python unittest/test_using_referencetestcase.py
.1 line is different, starting at line 33
Expected file /Users/njr/tmp/referencetest-examples/reference/string_result.html
Note exclusions:
    Copyright
    Version
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
/Users/njr/tmp/referencetest-examples/reference/string_result.html".
F
======================================================================
FAIL: testExampleStringGeneration (__main__.TestExample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "unittest/test_using_referencetestcase.py", line 62, in testExampleStringGeneration
    ignore_substrings=['Copyright', 'Version'])
  File "/Users/njr/python/tdda/tdda/referencetest/referencetest.py", line 527, in assertStringCorrect
    self.check_failures(failures, msgs)
  File "/Users/njr/python/tdda/tdda/referencetest/referencetest.py", line 709, in check_failures
    self.assert_fn(failures == 0, '\n'.join(msgs))
AssertionError: 1 line is different, starting at line 33
Expected file /Users/njr/tmp/referencetest-examples/reference/string_result.html
Note exclusions:
    Copyright
    Version
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
/Users/njr/tmp/referencetest-examples/reference/string_result.html".


    ----------------------------------------------------------------------
    Ran 2 tests in 0.005s

As expected, the string test now fails, and the ReferenceTest library suggests a command you can use to diff the output: because the test failed, it wrote the actual output to a temporary file. (It reports the failure twice, once as it occurs and once at the end. This is deliberate as it's convenient to see it when it happens if the tests take any non-trivial amount of time to run, and convenient to collect together all the failures at the end too.)

Because these are HTML files, I would probably instead open them both (using the open command on Mac OS) and visually inspect them. In these case, the pages look identical, and diff will confirm that the changes are only those we expect:

$ diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
/Users/njr/tmp/referencetest-examples/reference/string_result.html
5,6c5,6
<     Copyright (c) Stochastic Solutions, 2016
<     Version 1.0.0
—
>     Copyright (c) Stochastic Solutions Limited, 2016
>     Version 0.0.0
33c33
< </html >
—
> </html>

In this case

We see that the copyright and version lines are different, but we used ignore_strings to avoid say that's OK
It shows us the extra space before the close tag.

If we are happy that the new output is OK and should replace the previous reference test, you can rerun with the -W (or --write-all).

$ python unittest/test_using_referencetestcase.py -W
Written /Users/njr/tmp/referencetest-examples/reference/file_result.html
.Written /Users/njr/tmp/referencetest-examples/reference/string_result.html
.
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

Now if you run them again without -W, the tests should all pass.

You can do the same with pytest, except that in this case you need to use

pytest --write-all

because pytest does not allow short flags.

Other kinds of tests

Since this post is already quite long, we won't go through all the other options, parameters and kinds of tests in detail, but will mention a few other points:

In addition to assertStringCorrect and assertFileCorrect there are various other methods available:
- assertFilesCorrect for checking that multiple files are as expected
- assertCSVFileCorrect for checking a single CSV file
- assertCSVFilesCorrect for checking multiple CSV files
- assertDataFramesEqual to check equality of Pandas DataFrames
- assertDataFrameCorrect to check a data frame matches data in a CSV file.
Where appropriate, assert methods accept the various optional parameters, including:
- lstrip — ignore whitespace at start of files/strings (default False)
- rstrip — ignore whitespace at end of files/strings (default False)
- kind — optional category for test; these can be used to allow selective rewriting of test output with the -w/--write flag
- preprocess — function to call on expected and actual data before comparing results

We'll add more detail in future posts.

If you'd like to join the slack group where we discuss TDDA and related topics, DM your email address to @tdda on Twitter and we'll send you an invitation.

Introducing Rexpy: Automatic Discovery of Regular Expressions

Posted on Fri 11 November 2016 in TDDA • Tagged with tdda, constraints, pandas, regular expressions

Motivation

There's a Skyscanner data feed we have been working with for a year or so. It's produced some six million records so far, each of which has a transaction ID consisting of three parts—a four-digit alphanumeric transaction type, a numeric timestamp and a UUID, with the three parts separated by hyphens. Things like this:

adyt-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044
ooqt-1466602219-012da468-a820-11e6-8ba1-b8f6b118f191
z65e-1448755954-2d677190-ecda-4279-acb2-31a31ec8e86e

The only thing that really matters is that the transaction IDs are unique, but if everything is working correctly, the three parts should have the right structure and match data that we have in other fields in the feed.

We're pretty familiar with this data; or so we thought . . .

We've added a command to our data analysis software—Miró—for characterizing the patterns in string fields. The command is rex and when we run it on the field (tid), by saying:

[1] load skyscanner/transactions.miro
transactions.miro: 6,020,946 records; 6,020,946 (100%) selected; 57 fields.

[2] rex tid

this is the output:

^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-$
^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$
^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
^q\_\_s\-\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$

The rex command has found four patterns that, between them, characterize all the values in the field, and it has expressed these in the form of regular expressions. (If you aren't familiar with regular expressions, you might want to read the linked Wikipedia article.)

The third pattern in the list is more-or-less the one we thought would characterize all the transaction IDs. It reads:

start¹ (^)
then 2--4 letters or numbers ([A-Za-z0-9]{2,4})
then a mixture one to three hyphens and underscores ([\-\_]{1,3})
then 10 digits (\d{10})
then a hyphen (\-)
then a UUID, which is a hyphen-separated collection of 28 hex digits, in groups of 8, 4, 4, 4 and 12 ([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})
and end² ($).

The only difference between this and what we expected is that it turns out that some of the "alphanumeric transaction types" end with two underscores rather than being stricly alphanumeric, and the rex command has expressed this as "2-4 alphanumeric characters followed by 1-3 hyphens or underscores", rather than "2-4 characters that are alphanumeric or underscores, followed by a hyphen", which would have been more like the way we think about it.

What are the other three expressions?

The first one is the same, but without the UUID. Occasionally the UUID is missing (null), and when this happens the UUID portion of the tid is blank. It is possible to write a regular expression that combines these two cases, but rex doesn't quite yet know how to do this. The way to do it would be to make the UUID optional by enclosing it in parentheses and following it by a question mark:

([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})?

which reads zero or one UUIDs, given that we know the pattern inside the parentheses corresponds to a UUID.

The last pattern is another one we could unify with the other two: it is the same except that it identifies a particular transaction type that again uses underscores, but now in the middle: q__s. So we could replace those three (and might, in an ideal world, want rex to find)

^[A-Za-z0-9\_]{4}[\-]\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$

This is exactly what we described, except with the inclusion of _ as an allowable character in the 4-character transaction type at the start.

But what about the second pattern?

^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$

It's actually a single, completely specific transaction ID, that matches the main pattern except for omitting the hyphens in the UUID. This shouldn't be possible—by which I mean, the UUID generator should never omit the hyphens. But clearly either it did for this one transaction, or something else stripped them out later. Either way, it shows that among the several million transactions, there a bad transaction ID.

A further check in Miró shows that this occurs just a single time (i.e. it is not duplicated):

[4]>  select tid = "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c"
transactions.miro: 6,020,946 records; 1 (0%) selected; 57 fields.
Selection:
(= tid "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c")

So we've learnt something interesting and useful about our data, and Miró's rex command has helped us produce regular expressions that characterize all our transaction IDs. It hasn't done a perfect job, but it was pretty useful, and it's easy for us to merge the three main patterns by hand. We plan to extend the functionality to cover these cases better over coming weeks.

Rexpy outside Miró

If you don't happen to have Miró, or want to find regular expressions from data outside the context of a Miró dataset, you can use equivalent functionality directly from the rexpy module of our open-source, MIT-licenced tdda package, available from Github:

git clone https://github.com/tdda/tdda.git

This provides both a Python API for finding regular expressions from example data, and a command line tool.

The Rexpy Command Line

If you just run rexpy.py from the command line, with no arguments, it expects you to type (or paste) a set of strings, and when you finish (with CTRL-D on unix-like systems, or CTRL-Z on Windows systems) it will spit out the regular expressions it thinks you need to match them: For example:

$ python rexpy.py
EH1 7JQ
WC1 4AA
G2 3PQ
^[A-Za-z0-9]{2,3}\ [A-Za-z0-9]{3}$

Or, of course, you can pipe input into it. If, for example, I do that in a folder full of photos from a Nikon camera, I get

$ ls *.* | python ~/python/tdda/rexpy/rexpy.py
^DSC\_\d{4}\.[A-Za-z]{3}$

(because these are all files like DSC_1234.NEF or DSC_9346.xmp).

You can also give it a filename as the first argument, in which case it will read strings (one per line) from a file.

So given the file ids.txt (which is in the rexpy/examples subdirectory in the tdda repository), containing:

123-AA-971
12-DQ-802
198-AA-045
1-BA-834

we can use rexpy on it by saying:

$ python rexpy.py examples/ids.txt
^\d{1,3}\-[A-Z]{2}\-\d{3}$

and if there is a header line you want to skip, you can add either -h or --header to tell Rexpy to skip that.

You can also give a filename as a second command line argument, in which case Rexpy will write the results (one per line) to that file.

Motivation for Rexpy

There's an old joke among programmers, generally attributed to Jamie Zawinski that bears repeating:

Some people, when confronted with a problem, think "I know, I'll use regular expressions."

Now they have two problems.

— Jamie Zawinski

As powerful as regular expressions are, even their best friends would probably concede that they can be hard to write, harder to read and harder still to debug.

Despite this, regular expressions are an attractive way to specify constraints on string fields for two main reasons:

First, regular expressions constitute a fast, powerful, and near-ubiquitous mechanism for describing a wide variety of possible structures in string data.
Secondly, regular expressions include the concept of capture groups. You may recall that in a grouped³ regular expression, one or more subcomponents is enclosed in parentheses and its matched value can be extracted. As we will see below, this is particulary interesting in the context of test-driven data analysis.

For concreteness, suppose we have a field containing strings such as:

123-AA-971
12-DQ-802
198-AA-045
1-BA-834

One obvious regular expression to describe these would be:

^\d+\-[A-Z]{2}\-\d{3}$

(start [^] with one or more digits [\d+], then a hyphen [\-], then two capital letters [A-Z]{2}, then another hyphen [\-], then three digits [\d{3}] then stop [$]).

But it is in the nature of regular expressions that there are both more and less specific formulations we could use to describe the same strings, ranging from the fairly specific:

^1[2389]{0,2}\-(AA|DQ|BA)\-\d{3}$

(start [^] with a 1 [1], followed by up to two digits chosen from {2, 3, 8, 9} [[2389]{0,2}], followed by a hyphen [\-] then AA, DA or BA [(AA|DQ|BA)], followed by another hypen [\-] then three digits [\d{3}] and finish [$])

to the fully general

^.*$

(start [^], then zero or more characters [.*], then stop [$]).

Going back to our first formulation

^\d+\-[A-Z]{2}\-\d{3}$

one possible grouped equivalent is

^(\d+)\-([A-Z]{2})\-(\d{3})$

The three parenthesized sections are known as groups, and regular expression implementations usually provide a way of looking up the value of these groups when a particular string is matched by it. For example, in Python we might say:

import re

pattern = re.compile(r'^(\d+)\-([A-Z]{2})\-(\d{3})$')
m = re.match(pattern, '123-ZQ-987')
if m:
    print('1: "%s"  2: "%s"  3: "%s"' % (m.group(1), m.group(2),
                                         m.group(3)))

m = re.match(pattern, '00-FT-020')
if m:
    print('1: "%s"  2: "%s"  3: "%s"' % (m.group(1), m.group(2),
                                         m.group(3)))

If we run this, the output is:

1: "123"  2: "ZQ"  3: "987"
1: "00"  2: "FT"  3: "020"

The Big Idea: Automatic Discovery of Constraints on String Structure

In the context of test-driven data analysis, the idea is probably obvious: for string fields with some kind of structure—telephone numbers, post codes, zip codes, UUIDs, more-or-less any kind of structured identifier, airline codes, airport codes, national insurance numbers, credit card numbers, bank sort codes, social security numbers—we would like to specify constraints on the values in the field using regular expressions. A natural extension to the TDDA constraints file format introduced in the last post would be something along the lines of:⁴

"regex": "^\\d+\\-[A-Z]{2}\\-\\d{3}$"

if there is a single regular expression that usefully matches all the allowed field values. If a field contains strings in multiple formats that are so different that using a single regular expression would be unhelpful, we might instead provide a list of regular expressions, such as:

"regex": ["^\\d+\\-[A-Z]{2}\\-\\d{3}$", "^[A-Z]{5}\\+\\d{5}$"]

which would mean each field values should match at least one of the regular expressions in the list.

Just as with the automatic discovery of other types of constraints, we want the TDDA constraint discovery library to be able to suggest suitable regular expression constraints on string fields, where appropriate. This is where Rexpy comes in:

Rexpy is a library for finding regular expressions that usefully characterize a given corpus of strings.

We are choosing to build Rexpy as a stand-alone module because it has clear utility outside the context of constraint generation.

The Other Idea: Automatically discovered Quasi-Fields

We can imagine going beyond simply using (and automatically discovering) regular expressions to describe constraints on string data. Once we have useful regular expressions that characterize some string data—and more particularly, in cases where we have a single regular expression that usefully describes the structure of the string—we can tag meaningful subcomponents. In the example we used above, we had three groups:

(\d+) — the digits at the start of the identifier
([A-Z]{2}) — the pair of letters in the middle
(\d{3}) — the three digits at the end

It's not totally trivial to work out which subcomponents are useful to tag, but I think we could probably find pretty good heuristics that would do a reasonable job, at least in simple cases. Once we know the groups, we can potentially start to treat them as quasi-fields in their own right. So in this case, if we had a field ID containing string identifiers like those shown, we might create from that three quasi fields as follows:

ID_qf1, of type int, values 123, 12, 198, and 1
ID_qf2, of type string, values AA, DQ, AA, and BA
ID_qf3, of type int, values 971, 802, 45 and 834

Once we have these quasi fields, we can potentially subject them to the usual TDDA constraint generation process, which might suggest extra, stronger constraints. For example, we might find the the numbers in ID_qf3 are unique, or form a contiguous sequence, and we might find that although our regular expression only specified that there were two letters in the middle, in fact the only combinations found in the (full) data were AA, DQ, BA and BB.

I don't want to suggest that all of this is easy: there are three or four non-trivial steps to get from where Rexpy is today to this full vision:

First, it has to get better at merging related regular expressions into a useful single regular expression with optional components and alternations than it is today.
Secondly, it would have to be able to identify good subcomponents for grouping.
Thirdly, it would have to do useful type inference on the groups it identifies.
Finally, it would have to be extended to create the quasi fields and apply the TDDA discovery process to them.

But none of this seems hopelessly difficult. So continue to watch this space.

The Rexpy API

Assuming you have cloned the TDDA library somewhere on your PYTHONPATH, you should then be able to use it through the API as follows:

from tdda import rexpy

corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
results = rexpy.extract(corpus)
print('Number of regular expressions found: %d' % len(results))
for rex in results:
    print('   ' +  rex)

which produces:

$ python ids.py
Number of regular expressions found: 1
   ^\d{1,3}\-[A-Z]{2}\-\d{3}$

In general, Rexpy returns a list of regular expressions, and at the moment it is not very good at merging them. There's quite a lot of unused code that, I hope, will soon allow it to do so. But even as it is, it can do a reasonable job of characterizing simple strings. Within reason, the more examples you give it, the better it can do, and it is reasonably performant with hundreds of thousands or even millions of strings.

Rexpy: the Pandas interface

There's also a Pandas binding. You can say:

from tdda import rexpy
results = rexpy.pdextract(df['A'])

to find regular expressions for the strings in column A of a dataframe df, or

from tdda import rexpy
results = rexpy.pdextract([df['A'], df['B'], df['C']])

to find a single set of regular expressions that match all the strings from columns A, B and C. In all cases, null (pandas.np.NaN) values are ignored. The results are returned as a (Python) list of regular expressions, as strings.

Final Words

Take it for a spin and let us know how you get on.

As always, follow us or tweet at us (@tdda0) if you want to hear more, and watch out for the TDDA Slack team, which will be opening up very soon.

More precisely, ^ matches start of line; by default, rex always starts regular expressions with ^ and finishes them with $ on the assumption that strings will be presented one-per-line ↩
Again, more precisely, $ matches end of line. ↩
Grouped regular expressions are also referred to variously as marked or tagged regular expressions, and the groups are also sometimes known as subexpressions. ↩
One thing to notice here is that in JSON we need extra backslashes in the regular expression. This is because regular expressions themselves make fairly liberal use of backslashes, and JSON uses backslash as an escape character. We could avoid this in Python by using raw strings, which are introduced with an r prefix (e.g. '^(\d+)\-([A-Z]{2})\-(\d{3})$'). In such strings, backslashes are not treated in any special way. Since JSON has no equivalent mechanism, we have to escape all our backslashes, leading to the ugliness above. ↩

The TDDA Constraints File Format

Posted on Fri 04 November 2016 in TDDA • Tagged with tdda, constraints, pandas

Background

We recently extended the tdda library to include support for automatic discovery of constraints from datasets, and for verification of datasets against constraints. Yesterday's post—Constraint Discovery and Verification for Pandas DataFrames—describes these developments and the API.

The library we published is intended to be a base for producing various implementations of the constraint discovery and verification process, and uses a JSON file format (extension .tdda) to save constraints in a form that should be interchangable between implementations. We currently have two compatible implementations—the open-source Pandas code in the library and the implementation in our own analytical software, Miró.

This post describes the .tdda JSON file format. The bulk of it is merely a snapshot of the documentation shipped with the library in the Github repository (visible on Github). We intend to keep that file up to date as we expand the format.

The TDDA JSON File Format

The TDDA constraints library (Repository https://github.com/tdda/tdda, module constraints) uses a JSON file to store constraints.

This document describes that file format.

Purpose

TDDA files describe constraints on a dataset, with a view to verifying the dataset to check whether any or all of the specified constraints are satisfied.

A dataset is assumed to consist of one or more fields (also known as columns), each of which has a (different) name¹ and a well-defined type.² Each field has a value for each of a number of records (also known as rows). In some cases, values may be null (or missing).³ Even a field consisting entirely of nulls can be considered to have a type.

Familiar examples of datasets include:

tables in relational databases
DataFrames in (Pandas and R)
flat ("CSV") files (subject to type inference or assigment)
sheets in spreadsheets, or areas within sheets, if the columns have names, are not merged, and have values with consistent meanings and types over an entire column
more generally, many forms of tabular data.

In principle, TDDA files are intended to be capable of supporting any kind of constraint regarding datasets. Today, we are primarily concerned with * field types * minimum and maximum values (or in the case of string fields, minumum and maximum string lengths) * whether nulls are allowed, * whether duplicate values are allowed within a field * the allowed values for a field.

The format also has support for describing relations between fields.

Future extensions we can already foresee include:

dataset-level constraints (e.g. numbers of records; required or disallowed fields)
sortedness of fields, of field values or both
regular expressions to which string fields should conform
constraints on subsets of the data (e.g. records dated after July 2016 should not have null values for the ID field)
constraints on substructure within fields (e.g. constraints on tagged subexpressions from regular expressions to which string fields are expected to conform)
potentially checksums (though this is more suitable for checking the integreity of transfer of a specific dataset, than for use across multiple related datasets)
constraints between datasets, most obviously key relations (e.g. every value field KA in dataset A should also occur in field KB in dataset B).

The motivation for generating, storing and verifying datasets against such sets of constraints is that they can provide a powerful way of detecting bad or unexpected inputs to, or outputs from, a data analysis process. They can also be valuable as checks on intermediate results. While manually generating constraints can be burdensome, automatic discovery of constraints from example datasets, potentially followed by manual removal of over-specific constraints, provides a good cost/benefit ratio in many situations.

Filename and encoding

The preferred extension for TDDA Constraints files is .tdda.
TDDA constraints files must be encoded as UTF-8.
TDDA files must be valid JSON.

Example

This is an extremely simple example TDDA file:

{
    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        },
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [
                "one",
                "two"
            ]
        }
    }
}

General Structure

A TDDA file is a JSON dictionary. There are currently two supported top-level keys:

fields: constraints for individual fields, keyed on the field name. (In TDDA, we generally refer to dataset columns as fields.)
field_groups: constraints specifying relations between multiple fields (two, for now). field_groups constraints are keyed on a comma-separated list of the names of the fields to which they relate, and order is significant.

Both top-level keys are optional.

In future, we expect to add further top-level keys (e.g. for possible constraints on the number of rows, required or disallowed fields etc.)

The order of constraints in the file is immaterial (of course; this is JSON), though writers may choose to present fields in a particular order, e.g. dataset order or sorted on fieldname.

Field Constraints

The value of a field constraints entry (in the fields section) is a dictionary keyed on constraint kind. For example, the constraints on field a in the example above are specified as:

"a": {
    "type": "int",
    "min": 1,
    "max": 9,
    "sign": "positive",
    "max_nulls": 0,
    "no_duplicates": true
}

The TDDA library currently recognizes the following kinds of constraints:

type
min
max
min_length
max_length
sign
sign
max_nulls
no_duplicates
allowed_values

Other constraint libraries are free to define their own, custom kinds of constraints. We will probably recommend that non-standard constraints have names beginning with colon-terminated prefix. For example, if we wanted to support more specific Pandas type constraints, we would probably use a key such as pandas:type for this.

The value of a constraint is often simply a scalar value, but can be a list or a dictionary; when it is a dictionary, it should include a key value, with the principle value associated with the constraint (true, if there is no specific value beyond the name of the constraint).

If the value of a constraint (the scalar value, or the value key if the value is a dictionary) is null, this is taken to indicate the absence of a constraint. A constraint with value null should be completely ignored, so that a constraints file including null-valued constraints should produce identical results to one omitting those constraints. (This obviously means that we are discouraging using null as a meaningful constraint value, though a string "null" is fine, and in fact we use this for sign constraints.)

The semantics and values of the standard field constraint types are as follows:

type: the allowed (standard, TDDA) type of the field. This can be a single value from bool (boolean), int (integer; whole-numbered); real (floating point values); string (unicode in Python3; byte string in Python2) or date (any kind of date or date time, with or without timezone information). It can also be a list of such allowed values (in which case, order is not significant).

It is up to the generation and verification libraries to map between the actual types in whatever dataset/dataframe/table/... object is used and these TDDA constraint types, though over time we may provide further guidance.

Examples:
- {"type": "int"}
- {"type": ["int", "real"]}
min: the minimum allowed value for a field. This is often a simple value, but in the case of real fields, it can be convenient to specify a level of precision. In particular, a minimum value can have precision (default: fuzzy):
- closed: all non-null values in the field must be greater than or equal to the value specified.
- open: all non-null values in the field must be strictly greater than the value specified.
- fuzzy: when the precision is specified as fuzzy, the verifier should allow a small degree of violation of the constraint without generating a failure. Verifiers take a parameter, epsilon, which specifies how the fuzzy constraints should be taken to be: epsilon is a fraction of the constraint value by which field values are allowed to exceed the constraint without being considered to fail the constraint. This defaults to 0.01 (i.e. 1%). Notice that this means that constraint values of zero are never fuzzy.
Examples are:
- {"min": 1},
- {"min": 1.2},
- {"min": {"value": 3.4}, {"precision": "fuzzy"}}.
JSON, does not—of course—have a date type. TDDA files specifying dates should use string representations in one of the following formats:
- YYYY-MM-DD for dates without times
- YYYY-MM-DD hh:mm:ss for date-times without timezone
- YYYY-MM-DD hh:mm:ss [+-]ZZZZ for date-times with timezone.
We recommend that writers use precisely these formats, but that readers offer some flexibility in reading, e.g. accepting / as well as - to separate date components, and T as well as space to separate the time component from the date.
max: the maximum allowed value for a field. Much like min, but for maximum values. Examples are:
- {"max": 1},
- {"max": 1.2},
- {"max": {"value": 3.4}, {"precision": "closed"}}.
Dates should be formatted as for min values.
min_length: the minimum allowed length of strings in a string field. How unicode strings are counted is up to the implementation. Example:
- {"min_length": 2}
max_length: the minimum allowed length of strings in a string field. How unicode strings are counted is up to the implementation.
- {"max_length": 22}
sign: For numeric fields, the allowed sign of (non-null) values. Although this overlaps with minimum and maximum values, it it often useful to have a separate sign constraint, which carries semantically different information. Allowed values are:
- positive: All values must be greater than zero
- non-negative: No value may be less than zero
- zero: All values must be zero
- non-positive: No value may be greater than zero
- negative: All values must be negative
- null: No signed values are allowed, i.e. the field must be entirely null.
Example:
- {"sign": "non-negative"}
max_nulls: The maximum number of nulls allowed in the field. This can be any non-negative value. We recommend only writing values of zero (no nulls values are allowed) or 1 (At most a single null is allowed) into this constraint, but checking against any value found.

Example:
- {"max_nulls": 0}
no_duplicates: When this constraint is set on a field (with value true), it means that each non-null value must occur only once in the field. The current implementation only uses this constraint for string fields.

Example:
- {"no_duplicates": true}
allowed_values: The value of this constraint is a list of allowed values for the field. The order is not significant.

Example:
- {"allowed_values": ["red", "green", "blue"]}

MultiField Constraints

Multifield constraints are not yet being generated by this implementation, though our (proprietary) Miró implementation does produce them. The currently planned constraint types for field relations cover field equality and inequality for pairs of fields, with options to specify null relations too.

A simple example would be:

"field_groups": {
    "StartDate,EndDate": {"lt": true}
}

This is a less-than constraint, to be interpreted as

StartDate < EndDate wherever StartDate and EndDate and both non-null.

The plan is to support the obvious five equality and inequality relations:

lt: first field value is strictly less than the second field value for each record
lte: first field value is less than or equal to the second field value for each record
eq: first field value is equal to the second field value for each record
gte: first field value is greater than or equal to the second field value for each record
gt: first field value is strictly greater than the second field value for each record.

In the case of equality (only), we will probably also support a precision parameter with values fuzzy or precise.

There should probably also be an option to specify relations between null values in pairs of columns, either as a separate constraint or as a quality on each of the above.

Pandas, of course, allows multiple columns to have the same name. This format makes no concessions to such madness, though there is nothing to stop a verifier or generator sharing constraints across all columns with the same name. The Pandas generators and verifiers in this library do not currently attempt to do this. ↩
Pandas also allows columns of mixed type. Again, this file format does not recognize such columns, and it would probably be sensible not to use type constraints for columns of mixed type. ↩
Pandas uses not-a-number (pandas.np.NaN) to represent null values for numeric, string and boolean fields; it uses a special not-a-time (pandas.NaT) value to represent null date (Timestamp) values. ↩

Older Posts Newer Posts