TDDA 1-pager
We have written a 1-page summary of some of the core ideas in TDDA.
It is available as a PDF from stochasticsolutions.com/pdf/TDDA-One-Pager.pdf.
We have written a 1-page summary of some of the core ideas in TDDA.
It is available as a PDF from stochasticsolutions.com/pdf/TDDA-One-Pager.pdf.
Posted on Tue 31 January 2017 in TDDA • Tagged with tdda, regular expressions
We
previously
added rexpy to the Python tdda module. Rexpy is used to
find regular expressions from example strings.
One of the most common requests from Rexpy users has been for information regarding how many examples each resulting regular expression matches.
We have now added a few methods to Rexpy to support this.
Currently this is only available in the Python library for Rexpy,
available as part of the tdda module, with either
pip install tdda
or
git clone https://github.com/tdda/tdda.git
Needless to say, we also plan to use this functionality in the online version of Rexpy in the future.
The following example shows simple use of Rexpy from Python:
$ python
>>> from tdda import rexpy
>>>
>>> corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
>>> for r in rexpy.extract(corpus):
>>> print(r)
>>>
^\d{1,3}\-[A-Z]{2}\-\d{3}$
In this case, Rexpy found a single regular expression that matched all the strings, but in many cases it returns a list of regular expressions, each covering some subset of the examples.
The way the algorithm currently works, in most cases1 each example
will match only one regular expression, but in general, some examples
might match more than one pattern. So we've designed the new functionality
to work even when this is the case. We've provided three new methods
on the Extractor class, which gives a more powerful API than the
simple extract function.
Here's an example based on one of Rexpy's tests:
>>> urls2 = [
'stochasticsolutions.com/',
'apple.com/',
'stochasticsolutions.com/', # actual duplicate
'https://www.stochasticsolutions.co.uk/',
'https://www.google.co.uk/',
'https://www.google.com',
'https://www.google.com/',
'https://www.guardian.co.uk/',
'https://www.guardian.com',
'https://www.guardian.com/',
'https://www.stochasticsolutions.com',
'web.stochasticsolutions.com',
'https://www.stochasticsolutions.com',
'tdda.info',
'gov.uk',
'https://web.web',
]
>>> x = rexpy.Extractor(urls2)
>>> for r in x.results.rex:
>>> print(r)
^[a-z]{3,4}\.[a-z]{2,4}$
^[a-z]+\.com\/$
^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
^http\:\/\/www\.[a-z]{6,8}\.com\/$
^http\:\/\/www\.[a-z]+\.co\.uk\/$
As you can see, Rexpy has produced six different regular expressions,
some of which should probably be collapsed together. The Extractor
object we have created has three new methods available.
The simplest new method is coverage(dedup=False), which returns a list of
the number of matches for each regular expression returned, in the same
order as the regular expressions in x.results.rex. So:
>>> print(x.coverage())
[2, 3, 2, 4, 2, 3]
is the list of frequencies for the six regular expressions given, in order. So the pairings are illustrated by:
>>> for k, n in zip(x.results.rex, x.coverage()):
print('%d examples are matched by %s' % (n, k))
2 examples are matched by ^[a-z]{3,4}\.[a-z]{2,4}$
3 examples are matched by ^[a-z]+\.com\/$
2 examples are matched by ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
4 examples are matched by ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
2 examples are matched by ^http\:\/\/www\.[a-z]{6,8}\.com\/$
3 examples are matched by ^http\:\/\/www\.[a-z]+\.co\.uk\/$
The optional dedup parameter, when set to True, requests
deduplicated frequencies, i.e. ignoring any duplicate strings passed
in (remembering that Rexpy strips whitespace from both ends of input
strings). In this case, there is just one duplicate string
(stochasticsolutions.com/). So:
>>> print(x.coverage(dedup=True))
[2, 2, 2, 4, 2, 3]
where the second number (the matches for ^[a-z]+\.com\/$) is now 2, because
stochasticsolutions.com/ has been deduplicated.
We can also find the total number of examples, with or without duplicates,
by calling the n_examples(dedup=False) method:
>>> print(x.n_examples())
16
>>> print(x.n_examples(dedup=True))
15
But what we will probably normally be most interested in doing is sorting
the regular expressions from highest to lowest coverage, ignoring any
examples matched by an earlier pattern in cases where they do overlap.
That's exactly what the incremental_coverage(dedup=False) method does for us.
It returns an ordered dictionary.
>>> for (k, n) in x.incremental_coverage().items():
print('%d: %s' % (n, k))
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^[a-z]+\.com\/$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$
This is our sixteen input strings (including duplicates), and the number of examples matched by this expression, not matched by any previous expression. (As noted earlier, that caveat probably won't make any difference at the moment, but it will in future versions.) So, to be explicit, this is saying:
The regular expression that matches most examples is:
^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
which matches 4 of the 16 strings.
Of the remaining 12 examples, 3 are matched by ^[a-z]+\.com\/$.
Of the remaining 9 examples, 3 more are matched by
^http\:\/\/www\.[a-z]+\.co\.uk\/$
and so on.
Note, that in the case of ties, Rexpy sorts regular expressions as strings to break ties.
We can get the deduplicated numbers if we prefer:
>>> for (k, n) in x.incremental_coverage(dedup=True).items():
print('%d: %s' % (n, k))
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]+\.com\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$
That's all the new functionality for now. Let us know how you get on, and if you find any problems. And tweet your email address to @tdda0 if you want to join the TDDA Slack to discuss anything around the subject of test-driven data analysis.
[NOTE: This post was updated on 10.2.2017 after an update to the rexpy library changed function and attribute names from "sequential" (which was not very descriptive) to "incremental", which is better.]
In fact, probably in all cases, currently ↩
Posted on Thu 26 January 2017 in TDDA • Tagged with tdda, constraints, rexpy
Since the last post, we have extended the reference test functionality
in the Python tdda library.
Major changes (as of version 0.2.5, at the time of writing) include:
ReferenceTest class that has significantly
more functionality from the previous (now deprecated)
WritableTestCase.pytest
as well as unittest.pip install tdda, as well as from Github.-w or --write)We previously introduced the idea of a reference test
with an example in the post First Test,
and then when describing the
WritableTestCase
library.
A reference test is essentially the TDDA equivalent of a software system test
(or integration test), and is characterized by:
The old WritableTestCase class that we made available in the tdda
library provided support for writing, running and updating such reference
tests in Python by extending the unittest.TestCase class and providing
methods for writing reference tests and commands for running (and,
where necessary, updating the reference results used by) reference tests.
Through our use of reference testing in various contexts and projects
at Stochastic Solutions, we have ended up producing three different
implementations of reference-test libraries, each with different
capabilities. We also become aware that an increasing number of Python
developers have a marked preference for pytest over unittest, and
wanted to support that more naturally. The new ReferenceTest class
brings together all the capabilities we have developed, standardizes
them and fills in missing combinations while providing idiomatic
patterns for using it both in the context of unittest and pytest.
We have no immediate plans to remove WritableTestCase from the tdda
library (and, indeed, continue use it extensively ourselves), but encourage
people to adopt ReferenceTest instead as we believe it is superior
in all respects.
You can now install the tdda module with pip:
pip install tdda
and the source remains available under an MIT licence from github with
git clone https://github.com/tdda/tdda.git
The tdda library works under Python 2.7 and Python 3 and includes
reference test functionality mentionedabove,
constraint discovery and verification,
including a Pandas version
and automatic discovery of regular expressions from examples
(rexpy, also available online at here).
After installation, you can run TDDA's tests as follows:
$ python -m tdda.testtdda
............................................................................
----------------------------------------------------------------------
Ran 76 tests in 0.279s
However you have obtained the tdda module, you can get a copy of its
reference test examples by running the command
$ python -m tdda.referencetest.examples
which will place them in a referencetest-examples subdirectory
of your current directory. Alternatively, you can specify that you want them
in a particular place with
$ python -m tdda.referencetest.examples /path/to/particular/place
There are variations for getting examples for the constraint generation and validation functionality, and for regular expression extraction (rexpy)
$ python -m tdda.constraints.examples [/path/to/particular/place]
$ python -m tdda.rexpy.examples [/path/to/particular/place]
Here is a cut-down example of how to use the ReferenceTest class
with Python's unittest, based on the example in
referencetest-examples/unittest. For those who prefer pytest,
there is a similar pytest-ready example in referencetest-examples/pytest.
from __future__ import unicode_literals
import os
import sys
import tempfile
from tdda.referencetest import ReferenceTestCase
# ensure we can import the generators module in the directory above,
# wherever that happens to be
FILE_DIR = os.path.abspath(os.path.dirname(__file__))
PARENT_DIR = os.path.dirname(FILE_DIR)
sys.path.append(PARENT_DIR)
from generators import generate_string, generate_file
class TestExample(ReferenceTestCase):
def testExampleStringGeneration(self):
actual = generate_string()
self.assertStringCorrect(actual, 'string_result.html',
ignore_substrings=['Copyright', 'Version'])
def testExampleFileGeneration(self):
outdir = tempfile.gettempdir()
outpath = os.path.join(outdir, 'file_result.html')
generate_file(outpath)
self.assertFileCorrect(outpath, 'file_result.html',
ignore_patterns=['Copy(lef|righ)t',
'Version'])
TestExample.set_default_data_location(os.path.join(PARENT_DIR, 'reference'))
if __name__ == '__main__':
ReferenceTestCase.main()
Notes
These tests illustrate comparing a string generated by some code to a reference string (stored in a file), and testing a file generated by code to a reference file.
We need to tell the ReferenceTest class where to find reference
files used for comparison. The call to set_default_data_location,
straight after defining the class, does this.
The first test generates the string actual and compares it to
the contents of the file string_result.html in the data location
specified (../reference). The ignore_substrings parameter
specifies strings which, when encountered, cause these lines to be
omitted from the comparison.
The second test instead writes a file to a temporary directory (but
using the same name as the reference file). In this case, rather
than ignore_strings we have used an ignore_patterns parameter
to specify regular expressions which, when matched, cause lines to
be disregarded in comparisons.
There are a number of other parameters that can be added to the
various assert... methods to allow other kinds of discrepancies
between actual and generated files to be disregarded.
If you just run the code above, or the file in the examples, you should see output like this:
$ python test_using_referencetestcase.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.003s
OK
or, if you use pytest, like this:
$ pytest
============================= test session starts ==============================
platform darwin -- Python 2.7.11, pytest-3.0.2, py-1.4.31, pluggy-0.3.1
rootdir: /Users/njr/tmp/referencetest-examples/pytest, inifile:
plugins: hypothesis-3.4.2
collected 2 items
test_using_referencepytest.py ..
=========================== 2 passed in 0.01 seconds ===========================
If you then edit generators.py in the directory above and make some change
to the HTML in the generate_string and generate_file functions
preferably non-semantic changes, like adding an extra space before
the > in </html>) and then rerun the tests, you should get failures.
Changing just the generate_string function:
$ python unittest/test_using_referencetestcase.py
.1 line is different, starting at line 33
Expected file /Users/njr/tmp/referencetest-examples/reference/string_result.html
Note exclusions:
Copyright
Version
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
/Users/njr/tmp/referencetest-examples/reference/string_result.html".
F
======================================================================
FAIL: testExampleStringGeneration (__main__.TestExample)
----------------------------------------------------------------------
Traceback (most recent call last):
File "unittest/test_using_referencetestcase.py", line 62, in testExampleStringGeneration
ignore_substrings=['Copyright', 'Version'])
File "/Users/njr/python/tdda/tdda/referencetest/referencetest.py", line 527, in assertStringCorrect
self.check_failures(failures, msgs)
File "/Users/njr/python/tdda/tdda/referencetest/referencetest.py", line 709, in check_failures
self.assert_fn(failures == 0, '\n'.join(msgs))
AssertionError: 1 line is different, starting at line 33
Expected file /Users/njr/tmp/referencetest-examples/reference/string_result.html
Note exclusions:
Copyright
Version
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
/Users/njr/tmp/referencetest-examples/reference/string_result.html".
----------------------------------------------------------------------
Ran 2 tests in 0.005s
As expected, the string test now fails, and the ReferenceTest
library suggests a command you can use to diff the output: because
the test failed, it wrote the actual output to a temporary file. (It
reports the failure twice, once as it occurs and once at the end.
This is deliberate as it's convenient to see it when it happens if the
tests take any non-trivial amount of time to run, and convenient to
collect together all the failures at the end too.)
Because these are HTML files, I would probably instead open them both
(using the open command on Mac OS) and visually inspect them. In
these case, the pages look identical, and diff will confirm that the
changes are only those we expect:
$ diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/actual-string_result.html
/Users/njr/tmp/referencetest-examples/reference/string_result.html
5,6c5,6
< Copyright (c) Stochastic Solutions, 2016
< Version 1.0.0
—
> Copyright (c) Stochastic Solutions Limited, 2016
> Version 0.0.0
33c33
< </html >
—
> </html>
In this case
ignore_strings to avoid say that's OKIf we are happy that the new output is OK and should replace the previous
reference test, you can rerun with the -W (or --write-all).
$ python unittest/test_using_referencetestcase.py -W
Written /Users/njr/tmp/referencetest-examples/reference/file_result.html
.Written /Users/njr/tmp/referencetest-examples/reference/string_result.html
.
----------------------------------------------------------------------
Ran 2 tests in 0.003s
OK
Now if you run them again without -W, the tests should all pass.
You can do the same with pytest, except that in this case you need to use
pytest --write-all
because pytest does not allow short flags.
Since this post is already quite long, we won't go through all the other options, parameters and kinds of tests in detail, but will mention a few other points:
In addition to assertStringCorrect and assertFileCorrect there
are various other methods available:
assertFilesCorrect for checking that multiple files are as expectedassertCSVFileCorrect for checking a single CSV fileassertCSVFilesCorrect for checking multiple CSV filesassertDataFramesEqual to check equality of Pandas DataFramesassertDataFrameCorrect to check a data frame matches data in
a CSV file.Where appropriate, assert methods accept the various optional
parameters, including:
lstrip — ignore whitespace at start of files/strings (default
False)rstrip — ignore whitespace at end of files/strings (default
False)kind — optional category for test; these can be used to allow
selective rewriting of test output with the -w/--write flagpreprocess — function to call on expected and actual data before
comparing resultsWe'll add more detail in future posts.
If you'd like to join the slack group where we discuss TDDA and related topics,
DM your email address to @tdda on Twitter and we'll send you an invitation.
Posted on Fri 11 November 2016 in TDDA • Tagged with tdda, constraints, pandas, regular expressions
There's a Skyscanner data feed we have been working with for a year or so. It's produced some six million records so far, each of which has a transaction ID consisting of three parts—a four-digit alphanumeric transaction type, a numeric timestamp and a UUID, with the three parts separated by hyphens. Things like this:
adyt-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044
ooqt-1466602219-012da468-a820-11e6-8ba1-b8f6b118f191
z65e-1448755954-2d677190-ecda-4279-acb2-31a31ec8e86e
The only thing that really matters is that the transaction IDs are unique, but if everything is working correctly, the three parts should have the right structure and match data that we have in other fields in the feed.
We're pretty familiar with this data; or so we thought . . .
We've added a command to our data analysis
software—Miró—for
characterizing the patterns in string fields. The command is rex
and when we run it on the field (tid), by saying:
[1] load skyscanner/transactions.miro
transactions.miro: 6,020,946 records; 6,020,946 (100%) selected; 57 fields.
[2] rex tid
this is the output:
^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-$
^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$
^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
^q\_\_s\-\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
The rex command has found four patterns that, between them,
characterize all the values in the field, and it has expressed these in
the form of regular expressions. (If you aren't familiar with
regular
expressions, you
might want to read the linked Wikipedia
article.)
The third pattern in the list is more-or-less the one we thought would characterize all the transaction IDs. It reads:
^)[A-Za-z0-9]{2,4})[\-\_]{1,3})\d{10})\-)[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})$).The only difference between this and what we expected is that it turns
out that some of the "alphanumeric transaction types" end with two
underscores rather than being stricly alphanumeric, and the rex
command has expressed this as "2-4 alphanumeric characters followed by
1-3 hyphens or underscores", rather than "2-4 characters that are
alphanumeric or underscores, followed by a hyphen", which would have
been more like the way we think about it.
What are the other three expressions?
The first one is the same, but without the UUID. Occasionally the UUID
is missing (null), and when this happens the UUID portion of the tid
is blank. It is possible to write a regular expression that combines
these two cases, but rex doesn't quite yet know how to do this. The
way to do it would be to make the UUID optional by enclosing it in
parentheses and following it by a question mark:
([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})?
which reads zero or one UUIDs, given that we know the pattern inside the parentheses corresponds to a UUID.
The last pattern is another one we could unify with the other two:
it is the same except that it identifies a particular transaction type
that again uses underscores, but now in the middle: q__s. So we could
replace those three (and might, in an ideal world, want rex to find)
^[A-Za-z0-9\_]{4}[\-]\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
This is exactly what we described, except with the inclusion of _ as an
allowable character in the 4-character transaction type at the start.
But what about the second pattern?
^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$
It's actually a single, completely specific transaction ID, that matches the main pattern except for omitting the hyphens in the UUID. This shouldn't be possible—by which I mean, the UUID generator should never omit the hyphens. But clearly either it did for this one transaction, or something else stripped them out later. Either way, it shows that among the several million transactions, there a bad transaction ID.
A further check in Miró shows that this occurs just a single time (i.e. it is not duplicated):
[4]> select tid = "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c"
transactions.miro: 6,020,946 records; 1 (0%) selected; 57 fields.
Selection:
(= tid "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c")
So we've learnt something interesting and useful about our data, and
Miró's rex command has helped us produce regular expressions that
characterize all our transaction IDs. It hasn't done a perfect job,
but it was pretty useful, and it's easy for us to merge the three
main patterns by hand. We plan to extend the functionality to cover these
cases better over coming weeks.
If
you don't happen to have Miró,
or want to find regular expressions from
data outside the context of a Miró dataset, you can use equivalent
functionality directly from the rexpy module of our open-source,
MIT-licenced tdda package, available from Github:
git clone https://github.com/tdda/tdda.git
This provides both a Python API for finding regular expressions from example data, and a command line tool.
If you just run rexpy.py from the command line, with no arguments,
it expects you to type (or paste) a set of strings, and when you finish
(with CTRL-D on unix-like systems, or CTRL-Z on Windows systems)
it will spit out the regular expressions it thinks you need to match them:
For example:
$ python rexpy.py
EH1 7JQ
WC1 4AA
G2 3PQ
^[A-Za-z0-9]{2,3}\ [A-Za-z0-9]{3}$
Or, of course, you can pipe input into it. If, for example, I do that in a folder full of photos from a Nikon camera, I get
$ ls *.* | python ~/python/tdda/rexpy/rexpy.py
^DSC\_\d{4}\.[A-Za-z]{3}$
(because these are all files like DSC_1234.NEF or DSC_9346.xmp).
You can also give it a filename as the first argument, in which case it will read strings (one per line) from a file.
So given the file ids.txt (which is in the rexpy/examples subdirectory
in the tdda repository), containing:
123-AA-971
12-DQ-802
198-AA-045
1-BA-834
we can use rexpy on it by saying:
$ python rexpy.py examples/ids.txt
^\d{1,3}\-[A-Z]{2}\-\d{3}$
and if there is a header line you want to skip, you can add either -h
or --header to tell Rexpy to skip that.
You can also give a filename as a second command line argument, in which case Rexpy will write the results (one per line) to that file.
There's an old joke among programmers, generally attributed to Jamie Zawinski that bears repeating:
Some people, when confronted with a problem, think "I know, I'll use regular expressions."
Now they have two problems.
— Jamie Zawinski
As powerful as regular expressions are, even their best friends would probably concede that they can be hard to write, harder to read and harder still to debug.
Despite this, regular expressions are an attractive way to specify constraints on string fields for two main reasons:
First, regular expressions constitute a fast, powerful, and near-ubiquitous mechanism for describing a wide variety of possible structures in string data.
Secondly, regular expressions include the concept of capture groups. You may recall that in a grouped3 regular expression, one or more subcomponents is enclosed in parentheses and its matched value can be extracted. As we will see below, this is particulary interesting in the context of test-driven data analysis.
For concreteness, suppose we have a field containing strings such as:
123-AA-971
12-DQ-802
198-AA-045
1-BA-834
One obvious regular expression to describe these would be:
^\d+\-[A-Z]{2}\-\d{3}$
(start [^] with
one or more digits [\d+],
then a hyphen [\-],
then two capital letters [A-Z]{2},
then another hyphen [\-],
then three digits [\d{3}]
then stop [$]).
But it is in the nature of regular expressions that there are both more and less specific formulations we could use to describe the same strings, ranging from the fairly specific:
^1[2389]{0,2}\-(AA|DQ|BA)\-\d{3}$
(start [^]
with a 1 [1],
followed by up to two digits
chosen from {2, 3, 8, 9} [[2389]{0,2}],
followed by a hyphen [\-]
then AA, DA or BA [(AA|DQ|BA)],
followed by another hypen [\-]
then three digits [\d{3}]
and finish [$])
to the fully general
^.*$
(start [^],
then zero or more characters [.*],
then stop [$]).
Going back to our first formulation
^\d+\-[A-Z]{2}\-\d{3}$
one possible grouped equivalent is
^(\d+)\-([A-Z]{2})\-(\d{3})$
The three parenthesized sections are known as groups, and regular expression implementations usually provide a way of looking up the value of these groups when a particular string is matched by it. For example, in Python we might say:
import re
pattern = re.compile(r'^(\d+)\-([A-Z]{2})\-(\d{3})$')
m = re.match(pattern, '123-ZQ-987')
if m:
print('1: "%s" 2: "%s" 3: "%s"' % (m.group(1), m.group(2),
m.group(3)))
m = re.match(pattern, '00-FT-020')
if m:
print('1: "%s" 2: "%s" 3: "%s"' % (m.group(1), m.group(2),
m.group(3)))
If we run this, the output is:
1: "123" 2: "ZQ" 3: "987"
1: "00" 2: "FT" 3: "020"
In the context of test-driven data analysis, the idea is probably obvious: for string fields with some kind of structure—telephone numbers, post codes, zip codes, UUIDs, more-or-less any kind of structured identifier, airline codes, airport codes, national insurance numbers, credit card numbers, bank sort codes, social security numbers—we would like to specify constraints on the values in the field using regular expressions. A natural extension to the TDDA constraints file format introduced in the last post would be something along the lines of:4
"regex": "^\\d+\\-[A-Z]{2}\\-\\d{3}$"
if there is a single regular expression that usefully matches all the allowed field values. If a field contains strings in multiple formats that are so different that using a single regular expression would be unhelpful, we might instead provide a list of regular expressions, such as:
"regex": ["^\\d+\\-[A-Z]{2}\\-\\d{3}$", "^[A-Z]{5}\\+\\d{5}$"]
which would mean each field values should match at least one of the regular expressions in the list.
Just as with the automatic discovery of other types of constraints, we want the TDDA constraint discovery library to be able to suggest suitable regular expression constraints on string fields, where appropriate. This is where Rexpy comes in:
Rexpy is a library for finding regular expressions that usefully characterize a given corpus of strings.
We are choosing to build Rexpy as a stand-alone module because it has clear utility outside the context of constraint generation.
We can imagine going beyond simply using (and automatically discovering) regular expressions to describe constraints on string data. Once we have useful regular expressions that characterize some string data—and more particularly, in cases where we have a single regular expression that usefully describes the structure of the string—we can tag meaningful subcomponents. In the example we used above, we had three groups:
(\d+) — the digits at the start of the identifier([A-Z]{2}) — the pair of letters in the middle(\d{3}) — the three digits at the endIt's not totally trivial to work out which subcomponents are useful to tag,
but I think we could probably find pretty good heuristics that would do
a reasonable job, at least in simple cases. Once we know the groups, we
can potentially start to treat them as quasi-fields in their own right.
So in this case, if we had a field ID containing string identifiers
like those shown, we might create from that three quasi fields as follows:
ID_qf1, of type int, values 123, 12, 198, and 1ID_qf2, of type string, values AA, DQ, AA, and BAID_qf3, of type int, values 971, 802, 45 and 834Once we have these quasi fields, we can potentially subject them to the
usual TDDA constraint generation process, which might suggest extra,
stronger constraints. For example, we might find the the numbers in
ID_qf3 are unique, or form a contiguous sequence, and we might find
that although our regular expression only specified that there were two
letters in the middle, in fact the only combinations found in the
(full) data were AA, DQ, BA and BB.
I don't want to suggest that all of this is easy: there are three or four non-trivial steps to get from where Rexpy is today to this full vision:
First, it has to get better at merging related regular expressions into a useful single regular expression with optional components and alternations than it is today.
Secondly, it would have to be able to identify good subcomponents for grouping.
Thirdly, it would have to do useful type inference on the groups it identifies.
Finally, it would have to be extended to create the quasi fields and apply the TDDA discovery process to them.
But none of this seems hopelessly difficult. So continue to watch this space.
Assuming you have cloned the TDDA library somewhere on your PYTHONPATH,
you should then be able to use it through the API as follows:
from tdda import rexpy
corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
results = rexpy.extract(corpus)
print('Number of regular expressions found: %d' % len(results))
for rex in results:
print(' ' + rex)
which produces:
$ python ids.py
Number of regular expressions found: 1
^\d{1,3}\-[A-Z]{2}\-\d{3}$
In general, Rexpy returns a list of regular expressions, and at the moment it is not very good at merging them. There's quite a lot of unused code that, I hope, will soon allow it to do so. But even as it is, it can do a reasonable job of characterizing simple strings. Within reason, the more examples you give it, the better it can do, and it is reasonably performant with hundreds of thousands or even millions of strings.
There's also a Pandas binding. You can say:
from tdda import rexpy
results = rexpy.pdextract(df['A'])
to find regular expressions for the strings in column A of a dataframe df,
or
from tdda import rexpy
results = rexpy.pdextract([df['A'], df['B'], df['C']])
to find a single set of regular expressions that match all the strings from
columns A, B and C. In all cases, null (pandas.np.NaN) values
are ignored. The results are returned as a (Python) list of regular
expressions, as strings.
Take it for a spin and let us know how you get on.
As always, follow us or tweet at us (@tdda0) if you want to hear more, and watch out for the TDDA Slack team, which will be opening up very soon.
More precisely, ^ matches start of line; by default,
rex always starts regular expressions with ^ and finishes them
with $ on the assumption that strings will be presented one-per-line ↩
Again, more precisely, $ matches end of line. ↩
Grouped regular expressions are also referred to variously as marked or tagged regular expressions, and the groups are also sometimes known as subexpressions. ↩
One thing to notice here is that in JSON we need extra backslashes
in the regular expression. This is because regular expressions themselves
make fairly liberal use of backslashes, and JSON uses backslash as
an escape character. We could avoid this in Python by using raw strings,
which are introduced with an r prefix
(e.g. '^(\d+)\-([A-Z]{2})\-(\d{3})$').
In such strings, backslashes are not treated in any special way.
Since JSON has no equivalent mechanism, we have to escape all
our backslashes, leading to the ugliness above. ↩
Posted on Fri 04 November 2016 in TDDA • Tagged with tdda, constraints, pandas
We recently extended the tdda library to include support for automatic discovery of constraints from datasets, and for verification of datasets against constraints. Yesterday's post—Constraint Discovery and Verification for Pandas DataFrames—describes these developments and the API.
The library we published is intended to be a base for producing various
implementations of the constraint discovery and verification process,
and uses a JSON file format (extension .tdda) to save constraints in a form
that should be interchangable between implementations.
We currently have two compatible implementations—the open-source
Pandas code in the library and the implementation in our own analytical
software, Miró.
This post describes the .tdda JSON file format. The bulk of it is
merely a snapshot of the documentation shipped with the library in the
Github repository (visible on
Github).
We intend to keep that file up to date as we expand the format.
The TDDA constraints library (Repository https://github.com/tdda/tdda, module constraints) uses a JSON file to store constraints.
This document describes that file format.
TDDA files describe constraints on a dataset, with a view to verifying the dataset to check whether any or all of the specified constraints are satisfied.
A dataset is assumed to consist of one or more fields (also known
as columns), each of which has a (different) name1
and a well-defined type.2
Each field has a value for each of a number of records (also
known as rows). In some cases, values may be null (or missing).3
Even a field consisting entirely of nulls can be considered to have
a type.
Familiar examples of datasets include:
In principle, TDDA files are intended to be capable of supporting any kind of constraint regarding datasets. Today, we are primarily concerned with * field types * minimum and maximum values (or in the case of string fields, minumum and maximum string lengths) * whether nulls are allowed, * whether duplicate values are allowed within a field * the allowed values for a field.
The format also has support for describing relations between fields.
Future extensions we can already foresee include:
The motivation for generating, storing and verifying datasets against such sets of constraints is that they can provide a powerful way of detecting bad or unexpected inputs to, or outputs from, a data analysis process. They can also be valuable as checks on intermediate results. While manually generating constraints can be burdensome, automatic discovery of constraints from example datasets, potentially followed by manual removal of over-specific constraints, provides a good cost/benefit ratio in many situations.
The preferred extension for TDDA Constraints files is .tdda.
TDDA constraints files must be encoded as UTF-8.
TDDA files must be valid JSON.
This is an extremely simple example TDDA file:
{
"fields": {
"a": {
"type": "int",
"min": 1,
"max": 9,
"sign": "positive",
"max_nulls": 0,
"no_duplicates": true
},
"b": {
"type": "string",
"min_length": 3,
"max_length": 3,
"max_nulls": 1,
"no_duplicates": true,
"allowed_values": [
"one",
"two"
]
}
}
}
A TDDA file is a JSON dictionary. There are currently two supported top-level keys:
fields: constraints for individual fields, keyed on the field name.
(In TDDA, we generally refer to dataset columns as fields.)
field_groups: constraints specifying relations between multiple
fields (two, for now). field_groups constraints are keyed
on a comma-separated list of the names of the fields to which they
relate, and order is significant.
Both top-level keys are optional.
In future, we expect to add further top-level keys (e.g. for possible constraints on the number of rows, required or disallowed fields etc.)
The order of constraints in the file is immaterial (of course; this is JSON), though writers may choose to present fields in a particular order, e.g. dataset order or sorted on fieldname.
The value of a field constraints entry (in the fields section)
is a dictionary keyed on constraint kind.
For example, the constraints on field a in the example above are
specified as:
"a": {
"type": "int",
"min": 1,
"max": 9,
"sign": "positive",
"max_nulls": 0,
"no_duplicates": true
}
The TDDA library currently recognizes the following kinds of constraints:
typeminmaxmin_lengthmax_lengthsignsignmax_nullsno_duplicatesallowed_valuesOther constraint libraries are free to define their own, custom kinds
of constraints. We will probably recommend that non-standard
constraints have names beginning with colon-terminated prefix. For
example, if we wanted to support more specific Pandas type
constraints, we would probably use a key such as pandas:type for
this.
The value of a constraint is often simply a scalar value, but can be a list
or a dictionary; when it is a dictionary, it should include a key value,
with the principle value associated with the constraint (true, if there
is no specific value beyond the name of the constraint).
If the value of a constraint (the scalar value, or the value key if the
value is a dictionary) is null, this is taken to indicate
the absence of a constraint. A constraint with value null should be
completely ignored, so that a constraints file including null-valued
constraints should produce identical results to one omitting those constraints.
(This obviously means that we are discouraging using null as a meaningful
constraint value, though a string "null" is fine, and in fact we use this
for sign constraints.)
The semantics and values of the standard field constraint types are as follows:
type: the allowed (standard, TDDA) type of the field.
This can be a single value from bool (boolean),
int (integer; whole-numbered); real (floating point values);
string (unicode in Python3; byte string in Python2) or date
(any kind of date or date time, with or without timezone information).
It can also be a list of such allowed values (in which case, order
is not significant).
It is up to the generation and verification libraries to map between the actual types in whatever dataset/dataframe/table/... object is used and these TDDA constraint types, though over time we may provide further guidance.
Examples:
{"type": "int"}{"type": ["int", "real"]}min: the minimum allowed value for a field. This is often
a simple value, but in the case of real fields, it can be
convenient to specify a level of precision. In particular,
a minimum value can have precision (default: fuzzy):
closed: all non-null values in the field must be
greater than or equal to the value specified.open: all non-null values in the field must be
strictly greater than the value specified.fuzzy: when the precision is specified as fuzzy,
the verifier should allow a small degree of violation
of the constraint without generating a failure.
Verifiers take a parameter,
epsilon, which specifies how the fuzzy constraints
should be taken to be: epsilon is a fraction of the
constraint value by which field values are allowed
to exceed the constraint without being considered
to fail the constraint. This defaults to 0.01 (i.e. 1%).
Notice that this means that constraint values of zero
are never fuzzy.Examples are:
{"min": 1},{"min": 1.2},{"min": {"value": 3.4}, {"precision": "fuzzy"}}.JSON, does not—of course—have a date type. TDDA files specifying dates should use string representations in one of the following formats:
YYYY-MM-DD for dates without timesYYYY-MM-DD hh:mm:ss for date-times without timezoneYYYY-MM-DD hh:mm:ss [+-]ZZZZ for date-times with timezone.We recommend that writers use precisely these formats, but that
readers offer some flexibility in reading, e.g. accepting /
as well as - to separate date components, and
T as well as space to separate the time component from the date.
max: the maximum allowed value for a field. Much like min,
but for maximum values.
Examples are:
{"max": 1},{"max": 1.2},{"max": {"value": 3.4}, {"precision": "closed"}}.Dates should be formatted as for min values.
min_length: the minimum allowed length of strings in a string field.
How unicode strings are counted is up to the implementation.
Example:
{"min_length": 2}max_length: the minimum allowed length of strings in a string field.
How unicode strings are counted is up to the implementation.
{"max_length": 22}sign: For numeric fields, the allowed sign of (non-null)
values. Although this overlaps with minimum and maximum
values, it it often useful to have a separate sign constraint,
which carries semantically different information. Allowed
values are:
positive: All values must be greater than zeronon-negative: No value may be less than zerozero: All values must be zeronon-positive: No value may be greater than zeronegative: All values must be negativenull: No signed values are allowed, i.e. the field must
be entirely null.Example:
{"sign": "non-negative"}max_nulls: The maximum number of nulls allowed in the field.
This can be any non-negative value. We recommend only writing
values of zero (no nulls values are allowed) or 1 (At most a
single null is allowed) into this constraint, but checking
against any value found.
Example:
{"max_nulls": 0}no_duplicates: When this constraint is set on a field
(with value true), it means that each non-null value must occur
only once in the field. The current implementation only uses
this constraint for string fields.
Example:
{"no_duplicates": true}allowed_values: The value of this constraint is a list of
allowed values for the field. The order is not significant.
Example:
{"allowed_values": ["red", "green", "blue"]}Multifield constraints are not yet being generated by this implementation, though our (proprietary) Miró implementation does produce them. The currently planned constraint types for field relations cover field equality and inequality for pairs of fields, with options to specify null relations too.
A simple example would be:
"field_groups": {
"StartDate,EndDate": {"lt": true}
}
This is a less-than constraint, to be interpreted as
StartDate < EndDate wherever StartDate and EndDate and both
non-null.The plan is to support the obvious five equality and inequality relations:
lt: first field value is strictly less than the second field value
for each recordlte: first field value is less than or equal to the second field value
for each recordeq: first field value is equal to the second field value
for each recordgte: first field value is greater than or equal to the second field
value for each recordgt: first field value is strictly greater than the second field
value for each record.In the case of equality (only), we will probably also support a precision
parameter with values fuzzy or precise.
There should probably also be an option to specify relations between null values in pairs of columns, either as a separate constraint or as a quality on each of the above.
Pandas, of course, allows multiple columns to have the same name. This format makes no concessions to such madness, though there is nothing to stop a verifier or generator sharing constraints across all columns with the same name. The Pandas generators and verifiers in this library do not currently attempt to do this. ↩
Pandas also allows columns of mixed type. Again, this file format does not recognize such columns, and it would probably be sensible not to use type constraints for columns of mixed type. ↩
Pandas uses not-a-number (pandas.np.NaN) to represent
null values for numeric, string and boolean fields; it uses a special
not-a-time (pandas.NaT) value to represent null date (Timestamp) values. ↩