Coverage information for Rexpy

Posted on Tue 31 January 2017 in TDDA

Rexpy Stats

We previously added rexpy to the Python tdda module. Rexpy is used to find regular expressions from example strings.

One of the most common requests from Rexpy users has been for information regarding how many examples each resulting regular expression matches.

We have now added a few methods to Rexpy to support this.

Currently this is only available in the Python library for Rexpy, available as part of the tdda module, with either

pip install tdda

git clone https://github.com/tdda/tdda.git

Needless to say, we also plan to use this functionality in the online version of Rexpy in the future.

Rexpy: Quick Recap

The following example shows simple use of Rexpy from Python:

$ python
>>> from tdda import rexpy
>>>
>>> corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
>>> for r in rexpy.extract(corpus):
>>>    print(r)
>>>
^\d{1,3}\-[A-Z]{2}\-\d{3}$

In this case, Rexpy found a single regular expression that matched all the strings, but in many cases it returns a list of regular expressions, each covering some subset of the examples.

The way the algorithm currently works, in most cases¹ each example will match only one regular expression, but in general, some examples might match more than one pattern. So we've designed the new functionality to work even when this is the case. We've provided three new methods on the Extractor class, which gives a more powerful API than the simple extract function.

Here's an example based on one of Rexpy's tests:

>>> urls2 = [
    'stochasticsolutions.com/',
    'apple.com/',
    'stochasticsolutions.com/',  # actual duplicate
    'https://www.stochasticsolutions.co.uk/',
    'https://www.google.co.uk/',
    'https://www.google.com',
    'https://www.google.com/',
    'https://www.guardian.co.uk/',
    'https://www.guardian.com',
    'https://www.guardian.com/',
    'https://www.stochasticsolutions.com',
    'web.stochasticsolutions.com',
    'https://www.stochasticsolutions.com',
    'tdda.info',
    'gov.uk',
    'https://web.web',
]
>>> x = rexpy.Extractor(urls2)
>>> for r in x.results.rex:
>>>    print(r)

^[a-z]{3,4}\.[a-z]{2,4}$
^[a-z]+\.com\/$
^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
^http\:\/\/www\.[a-z]{6,8}\.com\/$
^http\:\/\/www\.[a-z]+\.co\.uk\/$

As you can see, Rexpy has produced six different regular expressions, some of which should probably be collapsed together. The Extractor object we have created has three new methods available.

The New Coverage Methods

The simplest new method is coverage(dedup=False), which returns a list of the number of matches for each regular expression returned, in the same order as the regular expressions in x.results.rex. So:

>>> print(x.coverage())
[2, 3, 2, 4, 2, 3]

is the list of frequencies for the six regular expressions given, in order. So the pairings are illustrated by:

>>> for k, n in zip(x.results.rex, x.coverage()):
        print('%d examples are matched by %s' % (n, k))

2 examples are matched by ^[a-z]{3,4}\.[a-z]{2,4}$
3 examples are matched by ^[a-z]+\.com\/$
2 examples are matched by ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
4 examples are matched by ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
2 examples are matched by ^http\:\/\/www\.[a-z]{6,8}\.com\/$
3 examples are matched by ^http\:\/\/www\.[a-z]+\.co\.uk\/$

The optional dedup parameter, when set to True, requests deduplicated frequencies, i.e. ignoring any duplicate strings passed in (remembering that Rexpy strips whitespace from both ends of input strings). In this case, there is just one duplicate string (stochasticsolutions.com/). So:

>>> print(x.coverage(dedup=True))
[2, 2, 2, 4, 2, 3]

where the second number (the matches for ^[a-z]+\.com\/$) is now 2, because stochasticsolutions.com/ has been deduplicated.

We can also find the total number of examples, with or without duplicates, by calling the n_examples(dedup=False) method:

>>> print(x.n_examples())
16
>>> print(x.n_examples(dedup=True))
15

But what we will probably normally be most interested in doing is sorting the regular expressions from highest to lowest coverage, ignoring any examples matched by an earlier pattern in cases where they do overlap. That's exactly what the incremental_coverage(dedup=False) method does for us. It returns an ordered dictionary.

>>> for (k, n) in x.incremental_coverage().items():
        print('%d: %s' % (n, k))
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^[a-z]+\.com\/$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$

This is our sixteen input strings (including duplicates), and the number of examples matched by this expression, not matched by any previous expression. (As noted earlier, that caveat probably won't make any difference at the moment, but it will in future versions.) So, to be explicit, this is saying:

The regular expression that matches most examples is: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$ which matches 4 of the 16 strings.
Of the remaining 12 examples, 3 are matched by ^[a-z]+\.com\/$.
Of the remaining 9 examples, 3 more are matched by ^http\:\/\/www\.[a-z]+\.co\.uk\/$
and so on.

Note, that in the case of ties, Rexpy sorts regular expressions as strings to break ties.

We can get the deduplicated numbers if we prefer:

>>> for (k, n) in x.incremental_coverage(dedup=True).items():
        print('%d: %s' % (n, k))
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]+\.com\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$

That's all the new functionality for now. Let us know how you get on, and if you find any problems. And tweet your email address to @tdda0 if you want to join the TDDA Slack to discuss anything around the subject of test-driven data analysis.

[NOTE: This post was updated on 10.2.2017 after an update to the rexpy library changed function and attribute names from "sequential" (which was not very descriptive) to "incremental", which is better.]

In fact, probably in all cases, currently ↩