Name Styles

Posted on Mon 04 March 2024 in TDDA • Tagged with TDDA, names

This is just a bit of fun, but I've always been interested in the different kinds of names allowed, encouraged, and used in different areas of computing and data.

A few years ago, I tweeted some well-known naming styles and a collection of lesser-known naming styles. I was playing about with the same idea while thinking about metadata standards today and came up with this. Just as I often think one of the boxes on the ubiquitous 2x2 "Boston-Box"-style matrices makes no sense, I think some of the boxes on the evil-good-lawful-chaotic breakdown (which I gather comes from Dungeons and Dragons) make little sense, so forgive me if some of this looks slightly forced. But I think it's fun.

[Figure: evil-good-lawful-chaotic 3x3 matrix classification of name styles. LAWFUL GOOD: CamelCase, dromedaryCase, snake_case. NEUTRAL GOOD: kebab-case, SCREAMING_SNAKE_CASE. CHAOTIC GOOD: Train-Case, SCREAMING-KEBAB-CASE. LAWFUL NEUTRAL: Pascal_Snake_Case, camel_Snake_Case, flatcase, UPPERFLATCASE. NEUTRAL: reservedcase_, _private_ish_case. CHAOTIC NEUTRAL: space case. LAWFUL EVIL: "double quoted case", 'single quoted case', __dunder_case__. NEUTRAL EVIL: path/case.extended, colon:case, path/case, endash-kebab-case, "quoted embedded newline case". CHAOTIC EVIL: teRRorIsT nOTe CAse, alternating_separator-case, “curly double quoted case”, ‘curly single quoted case’, unquoted embedded newline case.]


TOMLParams: TOML-based parameter files made better

Posted on Sun 16 July 2023 in TDDA • Tagged with TDDA, reproducibility

TOMLParams is a new open-source library that helps Python developers to externalize parameters in TOML files. This post will explain why storing parameters in non-code files is beneficial (including for reproducibility), why TOML was chosen, and some of the useful features of the library. These include structured sets of parameters using TOML tables, hierarchical inclusion with overriding, default values, parameter (key) checking, optional type checking, and features to help use across programs, including built-in support for setting parameters using environment variables.

The Benefits of Externalizing Parameters

Almost all software can do more than one thing, and has various parameters that are used to control exactly what it does. Some of these parameters are set once and never changed (typically configuration parameters, that each user or installation chooses), while others may be changed more often, perhaps from run to run. Command-line tools often accept some parameters on the command line itself, most obviously input and output files and core parameters such as search terms for search commands. On Unix and Linux systems, it's also common to use command line "switches" (also called flags or options) to refine behaviour. So for example, the Unix/Linux grep tool might be used in any of the following ways:

grep time            # find all lines including 'time' on stdin
grep time p*.txt     # ... on .txt files starting with 'p'
grep -i time p*.txt  # ... ignoring capitalization (-i switch)

All of time, p*.txt and -i are examples of command-line parameters.

Many tools also use configuration files to control the behaviour of the software. On Unix and Linux, these are typically stored in the user's home directory, often in 'dot' files, such as ~/.bashrc for the Bash shell, ~/.emacs for the Emacs editor and ~/.zshrc for the Z Shell. These files are sometimes in proprietary formats, but increasingly often are in "standard" formats such as .ini (MS-DOS initialization files), JSON (JavaScript Object Notation), YAML (YAML Ain't Markup Language), or TOML (Tom's Obvious Minimal Language).

When writing code, it's often tempting just to set parameters in code.

  • The quickest, dirtiest practice is just to have parameter values "hard-wired" as literals in code where they are used, possibly with the same values being repeated in different places. This is generally frowned upon for a few reasons:

    • Hard-wired parameters are hard to find, and therefore hard to update and inspect;
    • There may be repetition, violating the don't repeat yourself (DRY) principle, and leading to possible inconsistency (though see also the countervailing rule of three (ROT) and the write everything twice (WET) principle);
    • If you want to change parameters, you have to go to many places;
    • There may be no name associated with the parameter, making it hard to know what it means.

    So when implementing an energy calculation using Einstein's formula E = mc², we would probably prefer either:

def energy(mass):
    """Return energy in joules from mass in kilograms"""
    return mass * constants.speed_of_light * constants.speed_of_light

or

def energy(m):
    """Return energy in joules from mass m in kilograms"""
    c = constants.c  # the speed of light
    return m * c ** 2

(for a physicist) to

def energy(m):
    return m * 3e8 * 3e8

or (worse!)

def energy(m):
    return m * 9e16
  • For something like the speed of light (which is, after all, a universal constant), keeping the value in code is probably appropriate, so we might have a file called constants.py containing something like:
# constants.py: physical constants

speed_of_light = 299_792_458   # metres per second, in vacuo (c. 300 000 km/s)
h_cross = 1.054_571_817e-34    # reduced Planck constant in joule seconds (= h/2π)
  • For parameters that change in different situations, or for different users, it's usually much better practice to store the parameters in a separate file, ideally in a common format for which there is good library support. Some of the advantages of this include:
    • Changing parameters then does not require changing code. Code may be unavailable to the user—e.g. write-locked, compiled, or running on a remote server behind an API. Moreover, editing code often requires more expertise and confidence than editing a parameter file.
    • The same code can be run multiple times using different parameters at the same time; this can be more challenging if the parameters are "baked into" the code.
    • It becomes easy to maintain multiple sets of parameters for different situations, allowing a user simply to choose a named set on each occasion (usually through the parameter file's name). Of course, the code itself can have different named sets of parameters, but this is less common, and usually less transparent to the user, and harder to maintain.

Why TOML?

There is nothing particularly special about TOML, but it is well suited as a format for parameter files, with a sane and pragmatic set of features, good library support in most modern programming languages and rapidly increasing adoption.

Here's a simple example of a TOML file:

# A TOML file (This is a comment)

start_date = 2024-01-01   # This will produce a datetime.date in Python
run_days = 366            # an integer (int)
tolerance = 0.0001        # a floating-point value (float)
locale = "en_GB"          # a string value
title = "2024"            # also a string value
critical_event_time = 2024-07-31T03:22:22+03:00  # a datetime.datetime
fermi_approximate = true  # a boolean value (bool)


[logging]   # A TOML "table", which acts like a section, or group

format = ".csv"           # Yet another string
events = [                # An array (list)
    "financial",
    "telecoms",
]
title = "2024 events"     # The same key as before, but this one is
                          # in the logging table, so does not conflict
                          # with the previous title

Strengths of TOML as a parameter/configuration format

  • Ease of reading, writing and editing ("obviousness").
  • Lack of surprises: there are few if any weirdnesses in TOML (which is less true of the other main competing formats).
  • Good coverage of the main expected types: TOML supports (and differentiates between) booleans, integers, floating point values, strings, ISO8601-formatted dates and timestamps (with and without timezones, and mapping to datetime.date and datetime.datetime in Python), arrays (lists), key-value pairs (dictionaries), and hierarchical sections.
  • It supports comments (which begin with a hash mark, #).
  • Support in most modern languages, including C#, C++, Clojure, Dart, Elixir, Erlang, Go, Haskell, Java, JavaScript, Lua, Objective-C, Perl, PHP, Python, Ruby, Rust, Swift, and Scala.
  • Flexible quoting, with support for multiline strings and no use of bare (unquoted) strings (avoiding ambiguity).
  • Well-defined semantics without becoming fussy and awkward to edit (e.g., being unfussy about trailing commas in arrays).
  • TOML is specifically designed to be a configuration file format. Quoting the front page of the website:

TOML aims to be a minimal configuration file format that: is easy to read due to obvious semantics; maps unambiguously to a hash table; is easy to parse into data structures in a wide variety of languages.

  • In the context of Python, TOML is already being adopted widely, most notably in the increasingly ubiquitous pyproject.toml files. There are good libraries available for reading (tomli) and writing (tomli_w) TOML files, although these are not part of the standard library. (Python 3.11 adds tomllib, for reading TOML files, to the standard library.)
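
Here's a minimal sketch of reading the earlier example file with the standard-library tomllib (Python 3.11+; on earlier versions, tomli offers the same interface). The file name params.toml is hypothetical:

import tomllib  # Python 3.11+; use 'import tomli as tomllib' on earlier versions

with open('params.toml', 'rb') as f:   # tomllib requires binary mode
    params = tomllib.load(f)

print(params['start_date'])        # datetime.date(2024, 1, 1)
print(params['logging']['title'])  # '2024 events'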

Weaknesses of TOML as a parameter/configuration format

We don't think TOML has any major weaknesses for this purpose, but points that might count against it for some include:

  • TOML does not have a null (NULL/nil/None/undefined) value. (TOMLParams could address this, but has no current plans to do so.)
  • Hierarchical sections ('tables') in TOML are not nested. So if you want a section/table called behaviours and subsections/subtables called personal and business, in TOML this might be represented by something like the excerpt below (possibly with a [behaviours] table with its own parameters as well).

[behaviours.personal]

frequency = 3

[behaviours.business]

frequency = 3

Some people, however, really don't like TOML.

Why not JSON, YAML, .ini...

We chose TOML because we think, overall, it is better than each of the obvious competing formats, but the purpose of this post isn't to do down those formats, which all have their established places. But we will comment briefly on the most obvious alternatives.

JSON, while good as a transfer format and very popular in web services, is tricky for humans to write correctly, even with editor support, because it requires correct nesting and refuses to accept trailing commas in lists and dictionaries. The lack of support for dates and timestamps is also a frequent source of frustration, with quoted strings typically being used instead, with all the concomitant problems of that approach.

At first glance, YAML appears more suitable as a configuration/parameter-file format, but it is the opposite of obvious and often produces unexpected results in practice. Sources of frustration include no requirement to quote strings, "magic values" like yes (which maps to true) and no (which maps to false, much to the annoyance of Norwegians and anyone using the NO country code), inadvertent coercion of numbers with leading zeroes to octal (in YAML 1.1), whitespace sensitivity, and issues around multi-line strings.[1][2][3][4]
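
For example, here is a small demonstration of the NO problem, assuming PyYAML (pip install pyyaml), which implements YAML 1.1 semantics:

import yaml

print(yaml.safe_load('country: NO'))   # {'country': False}, not {'country': 'NO'}
print(yaml.safe_load('answer: yes'))   # {'answer': True}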

.ini files look a lot like TOML (I would guess they were a major inspiration for it) but are much simpler and less rich, have less well-defined syntax, have fewer types, and don't require quoting of strings, leading to ambiguity.

What are the Key Features of TOMLParams?

Simple externalization of parameters in one or more TOML files

You write your parameters in a TOML file and pass the name of the TOML file to an instance of the TOMLParams class, which reads the parameters from the file and makes them available as attributes on the object. It also makes them available using dictionary-style look-up: if your TOMLParams instance is p and you have a parameter startdate, you can access it as p.startdate or p['startdate'] (and, more pertinently, also as p[k] if k is set to 'startdate').

If you use tables, you can "dot" your way through to the parameters (p.behaviours.personal.frequency) or use repeated dictionary look-ups (p['behaviours']['personal']['frequency']).
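
Here's a minimal sketch of both access styles; the parameter names and files are hypothetical (borrowed from the examples in this post), and the constructor arguments match the example later on:

from tomlparams import TOMLParams

# Assumes defaults.toml (and optionally base.toml) in the current
# directory, defining startdate and a behaviours table (hypothetical).
p = TOMLParams(defaults='defaults', name='base', standard_params_dir='.')

print(p.startdate)       # attribute-style access
print(p['startdate'])    # dictionary-style access
k = 'startdate'
print(p[k])              # dictionary-style access with a variable key

# Tables can be traversed with dots or with repeated look-ups:
print(p.behaviours.personal.frequency)
print(p['behaviours']['personal']['frequency'])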

Loading, saving, default values, parameter name checking and optional parameter type checking

The set of parameters that exists is defined by the default values with TOMLParams. You can either store your defaults in a TOML file (e.g. defaults.toml) or pass them to the TOMLParams initializer as a dictionary.

If you choose a different TOML file, all the parameter values are first set to their default values, and then any parameters set in the file you specify override those. New parameters (i.e. any not listed in defaults) raise an error.

If you wish to turn on type checking, the library will check that the types of all parameter values provided match those of the defaults, and you can choose whether mismatches result in a warning (which doesn't raise an exception) or an error (which does).
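
As a hedged sketch of defaults passed as a dictionary (the parameter names here are hypothetical):

from tomlparams import TOMLParams

defaults = {
    'run_days': 366,
    'tolerance': 0.0001,
    'locale': 'en_GB',
}

# Values in ./base.toml override these defaults; any key in base.toml
# that is not present in the defaults raises an error.
params = TOMLParams(defaults=defaults, name='base', standard_params_dir='.')
print(params.run_days, params.locale)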

Hierarchical file inclusion with overriding

Perhaps the most powerful feature of TOMLParams is the ability for one parameters file to 'include' one or more others. If you use TOMLParams with the parameters file name base, and the first line of base.toml is

include = 'uk'

then parameters from uk.toml will be read and processed before those from base.toml. So all parameters will first be set to default values, then anything in uk.toml will override those values, and finally any values in base.toml will override those.

The include statement can also be a list:

include = ['uk', 'metric', '2023']

Inclusions are processed left to right (i.e. uk.toml is processed first, then metric.toml, then 2023.toml), followed by the parameters in the including file itself. So in this case, if defaults are in defaults.toml and the TOML file specified is base.toml, the full sequence is:

  • defaults.toml
  • uk.toml
  • metric.toml
  • 2023.toml
  • base.toml

Of course, the included files can themselves include others, but the library keeps track of this and each file is only processed once (the first time it is encountered), which prevents infinite recursion and repeated setting.

This makes it very easy to maintain different kinds and groups of parameters in different files, and to create a variation of a set of parameters simply by making the first line include whatever TOML file specifies the parameters you want to start from, and then override the specific parameter or parameters you want to be different in your new file.
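
As a runnable sketch of this, using three tiny hypothetical files (written out here so the example is self-contained):

from pathlib import Path
from tomlparams import TOMLParams

Path('defaults.toml').write_text('locale = "en_US"\nunits = "imperial"\n')
Path('uk.toml').write_text('locale = "en_GB"\n')
Path('base.toml').write_text('include = "uk"\nunits = "metric"\n')

# Resolution order: defaults.toml, then uk.toml (included), then
# base.toml itself.
params = TOMLParams(defaults='defaults', name='base', standard_params_dir='.')
print(params.locale)   # 'en_GB'  (set in uk.toml, overriding the default)
print(params.units)    # 'metric' (set in base.toml, overriding the default)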

Support for writing consolidated parameters as TOML after hierarchical inclusion and resolution

The library can also write the final parameters used out as a single, consolidated TOML file, which is useful when hierarchical inclusion and overriding are used, and keeps a record of the final values of all parameters. This helps with reproducibility and logging.
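
For illustration, a hypothetical sketch of this; the method name write_consolidated_toml is an assumption for illustration, not confirmed API, so consult the TOMLParams documentation for the actual call:

from tomlparams import TOMLParams

params = TOMLParams(defaults='defaults', name='base', standard_params_dir='.')

# NOTE: assumed method name, for illustration only.
params.write_consolidated_toml('resolved.toml')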

Support for using environment variables to select parameter set (as well as API)

You can choose how to specify the name of the parameters file to be used, and the name to which it should default. If you create a TOMLParams instance with:

params = TOMLParams(
    defaults='defaults',
    name='newparams',
    standard_params_dir='.'
)

default parameters will be read from ./defaults.toml and then ./newparams.toml will be processed, overriding default values.

If you want to specify the name of the parameters file to use on the command line, the usual pattern would be something like:

import sys
from tomlparams import TOMLParams

class Simulate:
    def __init__(self, params):  # further arguments elided
        self.params = params


if __name__ == '__main__':
    name = sys.argv[1] if len(sys.argv) > 1 else 'base'
    params = TOMLParams('defaults.toml', name)
    sim = Simulate(params)  # further arguments elided

Sometimes, however, it's convenient to use an environment variable to set the name of the parameters file, particularly if you want to use the same parameters in multiple programs, or run from a shell script or a Makefile. You can specify an environment variable to use for this and TOMLParams will inspect that environment variable if no name is passed. If you choose SIMPARAMS for this and say

params = TOMLParams('defaults.toml', env_var='SIMPARAMS')

it will look for a name in the SIMPARAMS environment variable, which you can set with

SIMPARAMS="foo" python pythonfile.py

or

export SIMPARAMS="foo"
python pythonfile.py

If it's not set, it will use base.toml as the file name, or something else you choose with the base_params_stem argument to TOMLParams.

Check it out

You can install TOMLParams from PyPI in the usual way, e.g.

python -m pip install -U tomlparams

The source is available from GitHub (under an MIT license) at github.com/smartdatafoundry/tomlparams

There's a README.md, and documentation is available on Read the Docs at tomlparams.readthedocs.io.

You can get help within Python on the TOMLParams class with

>>> import tomlparams
>>> help(tomlparams)

After installing tomlparams, you will find you have a tomlparams command, which you can use to copy example code from the README:

$ tomlparams examples
Examples copied to ./tomlparams_examples.

You can also get help with tomlparams help:

$ tomlparams help
TOMLParams

USAGE:
    tomlparams help      — show this message
    tomlparams version   — report version number
    tomlparams examples  — copy the examples to ./tomlparams_examples
    tomlparams test      — run the tomlparams tests

Documentation: https://tomlparams.readthedocs.io/
Source code:   https://github.com/smartdatafoundry/tomlparams
Website:       https://tomlparams.com

Installation:

    python -m pip install -U tomlparams

TDDA on the Code for Thought Podcast

Posted on Tue 11 July 2023 in TDDA • Tagged with podcast, TDDA

I had the pleasure of discussing TDDA with Peter Schmidt on his Code for Thought podcast.

I think it came out really well, so this might be a nice way for people to learn about the ideas and motivation behind TDDA and the library, which Simon Brown, Sam Rhynas and I developed over some years at Stochastic Solutions.

The podcast should be available by searching for 'Code for Thought' in any podcast player.

Direct links are:


Overcast Logged-in iCloud Users: Self-Selection Bias and Customer Stickiness

Posted on Sun 08 January 2023 in TDDA • Tagged with stats, bias, interpretation

On Episode 258 of Marco Arment and “Underscore” David Smith’s podcast Under the Radar, and then on Episode 516 of Marco & co’s Accidental Tech Podcast, Marco describes the fact that his data suggests that about 12% of his users don’t have logged-in iCloud accounts with iCloud Drive enabled, which was a significant obstacle to moving his sync system for Overcast to use Apple’s CloudKit, which requires both.

This was a surprise to Marco, who had expected that the figure would be closer to 1%, and led Casey Liss to worry aloud about producing an app that depended on CloudKit.

Marco, entirely correctly, suggested that his user base may be non-representative, and pointed out that

  1. Many people listen to podcasts at work and when commuting, so may be more than usually likely to use work-issued phones; such devices are often locked out of iCloud in general, and iCloud Drive specifically.
  2. His users are almost certainly biased towards geeks.

It may not be immediately obvious, but there is also a strong “statistical” explanation for why Marco is almost certainly right, and Casey’s fears are likely to be somewhere between exaggerated and unfounded.

TL;DR: Self-Selection Bias and Customer Stickiness

Overcast is unusual (possibly unique) in offering a non-CloudKit-based sync system. Users who need non-CloudKit-based podcast syncing have a limited choice of options, possibly Hobson’s Choice. So it is extremely likely that Overcast has a disproportionately high number of users who can’t/don’t use iCloud Drive. Interestingly these people might also be exceptionally loyal, because they (perhaps) have nowhere else to go.

The Long Version

Let’s do some Fermi Estimation.

  • Let’s suppose Apple has 1 billion iOS users.
  • Let’s suppose 10% of them use a podcast player. That’s 100 million people.
  • Let’s suppose (consistent with Marco’s assumption) that 1% of those don’t have a logged-in iCloud account with iCloud Drive (forthwith to be referred to as “iCloudless”). That’s 1 million people.
  • Let’s suppose Overcast has 1% market share (1 million active users).
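
For concreteness, here is the same arithmetic as a trivial sketch:

ios_users = 1_000_000_000
podcast_listeners = ios_users * 10 // 100       # 100 million
icloudless = podcast_listeners * 1 // 100       # 1 million
overcast_users = podcast_listeners * 1 // 100   # 1 million (1% market share)
print(f'{icloudless:,} iCloudless listeners, {overcast_users:,} Overcast users')
# 1,000,000 iCloudless listeners, 1,000,000 Overcast users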

Understandably, Marco doesn’t release user stats per se, but did say:

“So, to give you some idea of what I mean by ‘hardly anybody uses it’, I’m looking at a few hundred people who use the website, and that is not a large portion of the user base. And this is ... per day. ... It’s under 1,000 people. ... and that’s ... well under 1% ... [I]t’s a very small portion of the user base.”

— Marco Arment, Accidental Tech Podcast #516, One of My Fits of Outrage, 3rd January 2023, from 31:16 (listen here).

That sounds consistent with a million users, and certainly says the number significantly exceeds 100,000. Further, BuzzSprout’s Global Player stats for December 2022 put Overcast’s market share at 1%,[1] with that number pegged at 1,134,026[2] (users, presumably).

Based on these estimates, there are the same number of iCloudless podcast listeners—one million—as there are Overcast users, and their only choices are:

  • to use Overcast,
  • to find another podcast player that offers non-iCloud-Drive-based sync (if there is one), or
  • to forego sync.

This is the self-selection bias. Just as restaurants with good vegan food probably have a disproportionately high number of vegan customers (and much higher than ones that don't cater for vegans), and buildings with proper disabled access probably have disproportionately high numbers of disabled customers (and much higher than those that don't offer disabled access), a podcast player that supports iCloudless sync will almost inevitably have a disproportionately high number of iCloudless users. In principle, if the estimates are reasonable, Marco could see up to 100% of his users in this iCloudless category without running out of iCloudless users (though this would plainly be ridiculous).

If these estimates are the right order of magnitude, 12% seems like a very plausible figure for what Marco would see in his stats. It means only 12% of people who might benefit from the sync service are using Overcast, but that’s an order of magnitude more than its estimated market share, which is pretty good. It also means that 88% of podcast listeners who might benefit from iCloudless sync are not currently using Overcast.

As I said, I don’t know whether any other podcast players do have a non-iCloud sync service, but either way, it also suggests that now that Marco has put his plan to discontinue it on ice, it might not be a terrible idea for him to lean into it. Maybe his marketing should specifically try to target users who want to sync podcasts across devices (and the web!) but are “iCloudless”. After all, it’s very plausible that there are a million of them out there; and it would be really hard for Marco’s competitors to respond. As noted above, these also might be exceptionally loyal/sticky customers, because there may be nowhere else for them to go. It’s one of Marco’s stronger competitive moats.


  1. amazingly, given that I made the estimate before trying to look up any stats, and as you can see, my numbers really are Fermi Estimates. 

  2. even more amazingly, this number of users is also within 2% of my Fermi Estimate. It’s just a coincidence, but a very happy one! 


Gentest Talk at 2022 Toronto Workshop on Reproducibility

Posted on Fri 25 February 2022 in TDDA • Tagged with tests, reference tests, gentest

We released version 2.0 of the Python TDDA library this week. The radical new feature of the 2.0 release is Gentest, a command-line tool for automatically generating tests for more-or-less any code that you can run from a command line.

Gentest was introduced at the 2022 Toronto Workshop on Reproducibility yesterday (24th February), where demonstrations included using it to write tests for three increasingly complex R programs. This was to emphasize that Gentest is useful for much more than just testing Python code. Our (only slightly facetious) strapline is

Gentest writes tests, so you don't have to.

Slides from the talk are available here:

And here's the video:

We'll be posting more detail about Gentest here over the coming weeks.

Another major upgrade in the 2.0 TDDA release is the documentation. We've made much more effort to separate out the command-line uses of the TDDA library

  • constraint generation
  • data verification
  • data validation
  • inference of regular expressions, with Rexpy
  • (now) automatic test generation, with Gentest

from the API documentation, which is only really relevant to Python users.

The documentation is available at Read The Docs.