The TDDA Constraints File Format

Posted on Fri 04 November 2016 in TDDA

Background

We recently extended the tdda library to include support for automatic discovery of constraints from datasets, and for verification of datasets against constraints. Yesterday's post—Constraint Discovery and Verification for Pandas DataFrames—describes these developments and the API.

The library we published is intended to be a base for producing various implementations of the constraint discovery and verification process, and uses a JSON file format (extension .tdda) to save constraints in a form that should be interchangable between implementations. We currently have two compatible implementations—the open-source Pandas code in the library and the implementation in our own analytical software, Miró.

This post describes the .tdda JSON file format. The bulk of it is merely a snapshot of the documentation shipped with the library in the Github repository (visible on Github). We intend to keep that file up to date as we expand the format.

The TDDA JSON File Format

The TDDA constraints library (Repository https://github.com/tdda/tdda, module constraints) uses a JSON file to store constraints.

This document describes that file format.

Purpose

TDDA files describe constraints on a dataset, with a view to verifying the dataset to check whether any or all of the specified constraints are satisfied.

A dataset is assumed to consist of one or more fields (also known as columns), each of which has a (different) name1 and a well-defined type.2 Each field has a value for each of a number of records (also known as rows). In some cases, values may be null (or missing).3 Even a field consisting entirely of nulls can be considered to have a type.

Familiar examples of datasets include:

  • tables in relational databases
  • DataFrames in (Pandas and R)
  • flat ("CSV") files (subject to type inference or assigment)
  • sheets in spreadsheets, or areas within sheets, if the columns have names, are not merged, and have values with consistent meanings and types over an entire column
  • more generally, many forms of tabular data.

In principle, TDDA files are intended to be capable of supporting any kind of constraint regarding datasets. Today, we are primarily concerned with * field types * minimum and maximum values (or in the case of string fields, minumum and maximum string lengths) * whether nulls are allowed, * whether duplicate values are allowed within a field * the allowed values for a field.

The format also has support for describing relations between fields.

Future extensions we can already foresee include:

  • dataset-level constraints (e.g. numbers of records; required or disallowed fields)
  • sortedness of fields, of field values or both
  • regular expressions to which string fields should conform
  • constraints on subsets of the data (e.g. records dated after July 2016 should not have null values for the ID field)
  • constraints on substructure within fields (e.g. constraints on tagged subexpressions from regular expressions to which string fields are expected to conform)
  • potentially checksums (though this is more suitable for checking the integreity of transfer of a specific dataset, than for use across multiple related datasets)
  • constraints between datasets, most obviously key relations (e.g. every value field KA in dataset A should also occur in field KB in dataset B).

The motivation for generating, storing and verifying datasets against such sets of constraints is that they can provide a powerful way of detecting bad or unexpected inputs to, or outputs from, a data analysis process. They can also be valuable as checks on intermediate results. While manually generating constraints can be burdensome, automatic discovery of constraints from example datasets, potentially followed by manual removal of over-specific constraints, provides a good cost/benefit ratio in many situations.

Filename and encoding

  • The preferred extension for TDDA Constraints files is .tdda.

  • TDDA constraints files must be encoded as UTF-8.

  • TDDA files must be valid JSON.

Example

This is an extremely simple example TDDA file:

{
    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        },
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [
                "one",
                "two"
            ]
        }
    }
}

General Structure

A TDDA file is a JSON dictionary. There are currently two supported top-level keys:

  • fields: constraints for individual fields, keyed on the field name. (In TDDA, we generally refer to dataset columns as fields.)

  • field_groups: constraints specifying relations between multiple fields (two, for now). field_groups constraints are keyed on a comma-separated list of the names of the fields to which they relate, and order is significant.

Both top-level keys are optional.

In future, we expect to add further top-level keys (e.g. for possible constraints on the number of rows, required or disallowed fields etc.)

The order of constraints in the file is immaterial (of course; this is JSON), though writers may choose to present fields in a particular order, e.g. dataset order or sorted on fieldname.

Field Constraints

The value of a field constraints entry (in the fields section) is a dictionary keyed on constraint kind. For example, the constraints on field a in the example above are specified as:

"a": {
    "type": "int",
    "min": 1,
    "max": 9,
    "sign": "positive",
    "max_nulls": 0,
    "no_duplicates": true
}

The TDDA library currently recognizes the following kinds of constraints:

  • type
  • min
  • max
  • min_length
  • max_length
  • sign
  • sign
  • max_nulls
  • no_duplicates
  • allowed_values

Other constraint libraries are free to define their own, custom kinds of constraints. We will probably recommend that non-standard constraints have names beginning with colon-terminated prefix. For example, if we wanted to support more specific Pandas type constraints, we would probably use a key such as pandas:type for this.

The value of a constraint is often simply a scalar value, but can be a list or a dictionary; when it is a dictionary, it should include a key value, with the principle value associated with the constraint (true, if there is no specific value beyond the name of the constraint).

If the value of a constraint (the scalar value, or the value key if the value is a dictionary) is null, this is taken to indicate the absence of a constraint. A constraint with value null should be completely ignored, so that a constraints file including null-valued constraints should produce identical results to one omitting those constraints. (This obviously means that we are discouraging using null as a meaningful constraint value, though a string "null" is fine, and in fact we use this for sign constraints.)

The semantics and values of the standard field constraint types are as follows:

  • type: the allowed (standard, TDDA) type of the field. This can be a single value from bool (boolean), int (integer; whole-numbered); real (floating point values); string (unicode in Python3; byte string in Python2) or date (any kind of date or date time, with or without timezone information). It can also be a list of such allowed values (in which case, order is not significant).

    It is up to the generation and verification libraries to map between the actual types in whatever dataset/dataframe/table/... object is used and these TDDA constraint types, though over time we may provide further guidance.

    Examples:

    • {"type": "int"}
    • {"type": ["int", "real"]}
  • min: the minimum allowed value for a field. This is often a simple value, but in the case of real fields, it can be convenient to specify a level of precision. In particular, a minimum value can have precision (default: fuzzy):

    • closed: all non-null values in the field must be greater than or equal to the value specified.
    • open: all non-null values in the field must be strictly greater than the value specified.
    • fuzzy: when the precision is specified as fuzzy, the verifier should allow a small degree of violation of the constraint without generating a failure. Verifiers take a parameter, epsilon, which specifies how the fuzzy constraints should be taken to be: epsilon is a fraction of the constraint value by which field values are allowed to exceed the constraint without being considered to fail the constraint. This defaults to 0.01 (i.e. 1%). Notice that this means that constraint values of zero are never fuzzy.

    Examples are:

    • {"min": 1},
    • {"min": 1.2},
    • {"min": {"value": 3.4}, {"precision": "fuzzy"}}.

    JSON, does not—of course—have a date type. TDDA files specifying dates should use string representations in one of the following formats:

    • YYYY-MM-DD for dates without times
    • YYYY-MM-DD hh:mm:ss for date-times without timezone
    • YYYY-MM-DD hh:mm:ss [+-]ZZZZ for date-times with timezone.

    We recommend that writers use precisely these formats, but that readers offer some flexibility in reading, e.g. accepting / as well as - to separate date components, and T as well as space to separate the time component from the date.

  • max: the maximum allowed value for a field. Much like min, but for maximum values. Examples are:

    • {"max": 1},
    • {"max": 1.2},
    • {"max": {"value": 3.4}, {"precision": "closed"}}.

    Dates should be formatted as for min values.

  • min_length: the minimum allowed length of strings in a string field. How unicode strings are counted is up to the implementation. Example:

    • {"min_length": 2}
  • max_length: the minimum allowed length of strings in a string field. How unicode strings are counted is up to the implementation.

    • {"max_length": 22}
  • sign: For numeric fields, the allowed sign of (non-null) values. Although this overlaps with minimum and maximum values, it it often useful to have a separate sign constraint, which carries semantically different information. Allowed values are:

    • positive: All values must be greater than zero
    • non-negative: No value may be less than zero
    • zero: All values must be zero
    • non-positive: No value may be greater than zero
    • negative: All values must be negative
    • null: No signed values are allowed, i.e. the field must be entirely null.

    Example:

    • {"sign": "non-negative"}
  • max_nulls: The maximum number of nulls allowed in the field. This can be any non-negative value. We recommend only writing values of zero (no nulls values are allowed) or 1 (At most a single null is allowed) into this constraint, but checking against any value found.

    Example:

    • {"max_nulls": 0}
  • no_duplicates: When this constraint is set on a field (with value true), it means that each non-null value must occur only once in the field. The current implementation only uses this constraint for string fields.

    Example:

    • {"no_duplicates": true}
  • allowed_values: The value of this constraint is a list of allowed values for the field. The order is not significant.

    Example:

    • {"allowed_values": ["red", "green", "blue"]}

MultiField Constraints

Multifield constraints are not yet being generated by this implementation, though our (proprietary) Miró implementation does produce them. The currently planned constraint types for field relations cover field equality and inequality for pairs of fields, with options to specify null relations too.

A simple example would be:

"field_groups": {
    "StartDate,EndDate": {"lt": true}
}

This is a less-than constraint, to be interpreted as

  • StartDate < EndDate wherever StartDate and EndDate and both non-null.

The plan is to support the obvious five equality and inequality relations:

  • lt: first field value is strictly less than the second field value for each record
  • lte: first field value is less than or equal to the second field value for each record
  • eq: first field value is equal to the second field value for each record
  • gte: first field value is greater than or equal to the second field value for each record
  • gt: first field value is strictly greater than the second field value for each record.

In the case of equality (only), we will probably also support a precision parameter with values fuzzy or precise.

There should probably also be an option to specify relations between null values in pairs of columns, either as a separate constraint or as a quality on each of the above.


  1. Pandas, of course, allows multiple columns to have the same name. This format makes no concessions to such madness, though there is nothing to stop a verifier or generator sharing constraints across all columns with the same name. The Pandas generators and verifiers in this library do not currently attempt to do this. 

  2. Pandas also allows columns of mixed type. Again, this file format does not recognize such columns, and it would probably be sensible not to use type constraints for columns of mixed type. 

  3. Pandas uses not-a-number (pandas.np.NaN) to represent null values for numeric, string and boolean fields; it uses a special not-a-time (pandas.NaT) value to represent null date (Timestamp) values.