Introducing Rexpy: Automatic Discovery of Regular Expressions

Posted on Fri 11 November 2016 in TDDA

Motivation

There's a Skyscanner data feed we have been working with for a year or so. It's produced some six million records so far, each of which has a transaction ID consisting of three parts—a four-digit alphanumeric transaction type, a numeric timestamp and a UUID, with the three parts separated by hyphens. Things like this:

adyt-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044
ooqt-1466602219-012da468-a820-11e6-8ba1-b8f6b118f191
z65e-1448755954-2d677190-ecda-4279-acb2-31a31ec8e86e

The only thing that really matters is that the transaction IDs are unique, but if everything is working correctly, the three parts should have the right structure and match data that we have in other fields in the feed.

We're pretty familiar with this data; or so we thought . . .

We've added a command to our data analysis software—Miró—for characterizing the patterns in string fields. The command is rex and when we run it on the field (tid), by saying:

[1] load skyscanner/transactions.miro
transactions.miro: 6,020,946 records; 6,020,946 (100%) selected; 57 fields.

[2] rex tid

this is the output:

^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-$
^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$
^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
^q\_\_s\-\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$

The rex command has found four patterns that, between them, characterize all the values in the field, and it has expressed these in the form of regular expressions. (If you aren't familiar with regular expressions, you might want to read the linked Wikipedia article.)

The third pattern in the list is more-or-less the one we thought would characterize all the transaction IDs. It reads:

start¹ (^)
then 2--4 letters or numbers ([A-Za-z0-9]{2,4})
then a mixture one to three hyphens and underscores ([\-\_]{1,3})
then 10 digits (\d{10})
then a hyphen (\-)
then a UUID, which is a hyphen-separated collection of 28 hex digits, in groups of 8, 4, 4, 4 and 12 ([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})
and end² ($).

The only difference between this and what we expected is that it turns out that some of the "alphanumeric transaction types" end with two underscores rather than being stricly alphanumeric, and the rex command has expressed this as "2-4 alphanumeric characters followed by 1-3 hyphens or underscores", rather than "2-4 characters that are alphanumeric or underscores, followed by a hyphen", which would have been more like the way we think about it.

What are the other three expressions?

The first one is the same, but without the UUID. Occasionally the UUID is missing (null), and when this happens the UUID portion of the tid is blank. It is possible to write a regular expression that combines these two cases, but rex doesn't quite yet know how to do this. The way to do it would be to make the UUID optional by enclosing it in parentheses and following it by a question mark:

([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})?

which reads zero or one UUIDs, given that we know the pattern inside the parentheses corresponds to a UUID.

The last pattern is another one we could unify with the other two: it is the same except that it identifies a particular transaction type that again uses underscores, but now in the middle: q__s. So we could replace those three (and might, in an ideal world, want rex to find)

^[A-Za-z0-9\_]{4}[\-]\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$

This is exactly what we described, except with the inclusion of _ as an allowable character in the 4-character transaction type at the start.

But what about the second pattern?

^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$

It's actually a single, completely specific transaction ID, that matches the main pattern except for omitting the hyphens in the UUID. This shouldn't be possible—by which I mean, the UUID generator should never omit the hyphens. But clearly either it did for this one transaction, or something else stripped them out later. Either way, it shows that among the several million transactions, there a bad transaction ID.

A further check in Miró shows that this occurs just a single time (i.e. it is not duplicated):

[4]>  select tid = "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c"
transactions.miro: 6,020,946 records; 1 (0%) selected; 57 fields.
Selection:
(= tid "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c")

So we've learnt something interesting and useful about our data, and Miró's rex command has helped us produce regular expressions that characterize all our transaction IDs. It hasn't done a perfect job, but it was pretty useful, and it's easy for us to merge the three main patterns by hand. We plan to extend the functionality to cover these cases better over coming weeks.

Rexpy outside Miró

If you don't happen to have Miró, or want to find regular expressions from data outside the context of a Miró dataset, you can use equivalent functionality directly from the rexpy module of our open-source, MIT-licenced tdda package, available from Github:

git clone https://github.com/tdda/tdda.git

This provides both a Python API for finding regular expressions from example data, and a command line tool.

The Rexpy Command Line

If you just run rexpy.py from the command line, with no arguments, it expects you to type (or paste) a set of strings, and when you finish (with CTRL-D on unix-like systems, or CTRL-Z on Windows systems) it will spit out the regular expressions it thinks you need to match them: For example:

$ python rexpy.py
EH1 7JQ
WC1 4AA
G2 3PQ
^[A-Za-z0-9]{2,3}\ [A-Za-z0-9]{3}$

Or, of course, you can pipe input into it. If, for example, I do that in a folder full of photos from a Nikon camera, I get

$ ls *.* | python ~/python/tdda/rexpy/rexpy.py
^DSC\_\d{4}\.[A-Za-z]{3}$

(because these are all files like DSC_1234.NEF or DSC_9346.xmp).

You can also give it a filename as the first argument, in which case it will read strings (one per line) from a file.

So given the file ids.txt (which is in the rexpy/examples subdirectory in the tdda repository), containing:

123-AA-971
12-DQ-802
198-AA-045
1-BA-834

we can use rexpy on it by saying:

$ python rexpy.py examples/ids.txt
^\d{1,3}\-[A-Z]{2}\-\d{3}$

and if there is a header line you want to skip, you can add either -h or --header to tell Rexpy to skip that.

You can also give a filename as a second command line argument, in which case Rexpy will write the results (one per line) to that file.

Motivation for Rexpy

There's an old joke among programmers, generally attributed to Jamie Zawinski that bears repeating:

Some people, when confronted with a problem, think "I know, I'll use regular expressions."

Now they have two problems.

— Jamie Zawinski

As powerful as regular expressions are, even their best friends would probably concede that they can be hard to write, harder to read and harder still to debug.

Despite this, regular expressions are an attractive way to specify constraints on string fields for two main reasons:

First, regular expressions constitute a fast, powerful, and near-ubiquitous mechanism for describing a wide variety of possible structures in string data.
Secondly, regular expressions include the concept of capture groups. You may recall that in a grouped³ regular expression, one or more subcomponents is enclosed in parentheses and its matched value can be extracted. As we will see below, this is particulary interesting in the context of test-driven data analysis.

For concreteness, suppose we have a field containing strings such as:

123-AA-971
12-DQ-802
198-AA-045
1-BA-834

One obvious regular expression to describe these would be:

^\d+\-[A-Z]{2}\-\d{3}$

(start [^] with one or more digits [\d+], then a hyphen [\-], then two capital letters [A-Z]{2}, then another hyphen [\-], then three digits [\d{3}] then stop [$]).

But it is in the nature of regular expressions that there are both more and less specific formulations we could use to describe the same strings, ranging from the fairly specific:

^1[2389]{0,2}\-(AA|DQ|BA)\-\d{3}$

(start [^] with a 1 [1], followed by up to two digits chosen from {2, 3, 8, 9} [[2389]{0,2}], followed by a hyphen [\-] then AA, DA or BA [(AA|DQ|BA)], followed by another hypen [\-] then three digits [\d{3}] and finish [$])

to the fully general

^.*$

(start [^], then zero or more characters [.*], then stop [$]).

Going back to our first formulation

^\d+\-[A-Z]{2}\-\d{3}$

one possible grouped equivalent is

^(\d+)\-([A-Z]{2})\-(\d{3})$

The three parenthesized sections are known as groups, and regular expression implementations usually provide a way of looking up the value of these groups when a particular string is matched by it. For example, in Python we might say:

import re

pattern = re.compile(r'^(\d+)\-([A-Z]{2})\-(\d{3})$')
m = re.match(pattern, '123-ZQ-987')
if m:
    print('1: "%s"  2: "%s"  3: "%s"' % (m.group(1), m.group(2),
                                         m.group(3)))

m = re.match(pattern, '00-FT-020')
if m:
    print('1: "%s"  2: "%s"  3: "%s"' % (m.group(1), m.group(2),
                                         m.group(3)))

If we run this, the output is:

1: "123"  2: "ZQ"  3: "987"
1: "00"  2: "FT"  3: "020"

The Big Idea: Automatic Discovery of Constraints on String Structure

In the context of test-driven data analysis, the idea is probably obvious: for string fields with some kind of structure—telephone numbers, post codes, zip codes, UUIDs, more-or-less any kind of structured identifier, airline codes, airport codes, national insurance numbers, credit card numbers, bank sort codes, social security numbers—we would like to specify constraints on the values in the field using regular expressions. A natural extension to the TDDA constraints file format introduced in the last post would be something along the lines of:⁴

"regex": "^\\d+\\-[A-Z]{2}\\-\\d{3}$"

if there is a single regular expression that usefully matches all the allowed field values. If a field contains strings in multiple formats that are so different that using a single regular expression would be unhelpful, we might instead provide a list of regular expressions, such as:

"regex": ["^\\d+\\-[A-Z]{2}\\-\\d{3}$", "^[A-Z]{5}\\+\\d{5}$"]

which would mean each field values should match at least one of the regular expressions in the list.

Just as with the automatic discovery of other types of constraints, we want the TDDA constraint discovery library to be able to suggest suitable regular expression constraints on string fields, where appropriate. This is where Rexpy comes in:

Rexpy is a library for finding regular expressions that usefully characterize a given corpus of strings.

We are choosing to build Rexpy as a stand-alone module because it has clear utility outside the context of constraint generation.

The Other Idea: Automatically discovered Quasi-Fields

We can imagine going beyond simply using (and automatically discovering) regular expressions to describe constraints on string data. Once we have useful regular expressions that characterize some string data—and more particularly, in cases where we have a single regular expression that usefully describes the structure of the string—we can tag meaningful subcomponents. In the example we used above, we had three groups:

(\d+) — the digits at the start of the identifier
([A-Z]{2}) — the pair of letters in the middle
(\d{3}) — the three digits at the end

It's not totally trivial to work out which subcomponents are useful to tag, but I think we could probably find pretty good heuristics that would do a reasonable job, at least in simple cases. Once we know the groups, we can potentially start to treat them as quasi-fields in their own right. So in this case, if we had a field ID containing string identifiers like those shown, we might create from that three quasi fields as follows:

ID_qf1, of type int, values 123, 12, 198, and 1
ID_qf2, of type string, values AA, DQ, AA, and BA
ID_qf3, of type int, values 971, 802, 45 and 834

Once we have these quasi fields, we can potentially subject them to the usual TDDA constraint generation process, which might suggest extra, stronger constraints. For example, we might find the the numbers in ID_qf3 are unique, or form a contiguous sequence, and we might find that although our regular expression only specified that there were two letters in the middle, in fact the only combinations found in the (full) data were AA, DQ, BA and BB.

I don't want to suggest that all of this is easy: there are three or four non-trivial steps to get from where Rexpy is today to this full vision:

First, it has to get better at merging related regular expressions into a useful single regular expression with optional components and alternations than it is today.
Secondly, it would have to be able to identify good subcomponents for grouping.
Thirdly, it would have to do useful type inference on the groups it identifies.
Finally, it would have to be extended to create the quasi fields and apply the TDDA discovery process to them.

But none of this seems hopelessly difficult. So continue to watch this space.

The Rexpy API

Assuming you have cloned the TDDA library somewhere on your PYTHONPATH, you should then be able to use it through the API as follows:

from tdda import rexpy

corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
results = rexpy.extract(corpus)
print('Number of regular expressions found: %d' % len(results))
for rex in results:
    print('   ' +  rex)

which produces:

$ python ids.py
Number of regular expressions found: 1
   ^\d{1,3}\-[A-Z]{2}\-\d{3}$

In general, Rexpy returns a list of regular expressions, and at the moment it is not very good at merging them. There's quite a lot of unused code that, I hope, will soon allow it to do so. But even as it is, it can do a reasonable job of characterizing simple strings. Within reason, the more examples you give it, the better it can do, and it is reasonably performant with hundreds of thousands or even millions of strings.

Rexpy: the Pandas interface

There's also a Pandas binding. You can say:

from tdda import rexpy
results = rexpy.pdextract(df['A'])

to find regular expressions for the strings in column A of a dataframe df, or

from tdda import rexpy
results = rexpy.pdextract([df['A'], df['B'], df['C']])

to find a single set of regular expressions that match all the strings from columns A, B and C. In all cases, null (pandas.np.NaN) values are ignored. The results are returned as a (Python) list of regular expressions, as strings.

Final Words

Take it for a spin and let us know how you get on.

As always, follow us or tweet at us (@tdda0) if you want to hear more, and watch out for the TDDA Slack team, which will be opening up very soon.

More precisely, ^ matches start of line; by default, rex always starts regular expressions with ^ and finishes them with $ on the assumption that strings will be presented one-per-line ↩
Again, more precisely, $ matches end of line. ↩
Grouped regular expressions are also referred to variously as marked or tagged regular expressions, and the groups are also sometimes known as subexpressions. ↩
One thing to notice here is that in JSON we need extra backslashes in the regular expression. This is because regular expressions themselves make fairly liberal use of backslashes, and JSON uses backslash as an escape character. We could avoid this in Python by using raw strings, which are introduced with an r prefix (e.g. '^(\d+)\-([A-Z]{2})\-(\d{3})$'). In such strings, backslashes are not treated in any special way. Since JSON has no equivalent mechanism, we have to escape all our backslashes, leading to the ugliness above. ↩