Posted on Fri 13 November 2015 in TDDA

"Why is this lying bastard lying to me?"

Louis Heren,1 often attributed to Jeremy Paxman.

In a previous post, we made a distinction between two kinds of errors—implementation errors and errors of interpretation. I want to amplify that today, focusing specifically on interpretation.

The most important question to keep in mind at all times is not whether the analysis is computing the thing we wanted it to compute, but rather whether the result we have produced means what we think it means. The distinction is crucial.

As a simple example, let's suppose we specify the goal of our analysis as calculating the mean of a set of numbers. We can test that by adding them up and dividing by the number of items. But if we think the goal is to characterize a typical transaction size, we have to ask whether the arithmetic mean is the right metric for understanding that. As we move more towards a business or conceptual goal, rather than a mathematical or algorithmic formulation of a calculation, we have more complex and nuanced considerations, such as:

• Do we believe the inputs are correct?

• Is our chosen metric capable of addressing our underlying need (in this case, determining a typical transaction size)?

• How do we handle nulls (missing values)?

• Will outliers (perhaps extremely large values) or invalid inputs (perhaps negative values) invalidate the calculation?

• If the values have dimensionality,2 do all of the values have the same dimensionality, and in the same units (e.g. all money and all in pounds sterling, or all distances and all measured in miles).

• For that matter, are the inputs even commensurate, i.e. do they quantify sufficiently similar things that calculating their mean is even meaningful?

Paxman/Heren's constant question quoted above—Why is this lying bastard lying to me?—will serve as an excellent question to keep in mind every time we view an analytical result, perhaps recast as how is this misleading data misleading me? There is a great temptation to believe beautifully formatted, painstakingly calculated results produced by the almost unfathomable power of modern computers. In fact, there is much to be said for thinking of the combination of data and processing as an adversary constantly trying to fool you into drawing false conclusions.

The questions of implementation are concerned with checking that the data received as input to the analytical process has been faithfully transmitted from the source systems, and that the calculations and manipulations performed in the analysis correctly implement the algorithms we intended to use. In contrast, as we outlined previously, the questions of interpretation emphasize that we need to be ever vigilent, asking ourselves:

• Is the input data correct?

• Is our interpretation of the input data correct?

• Are the algorithms we are applying to the data meaningful and appropriate?

• Is our interpretation of the results we produce correct?

• Are the results plausible?

• What am I missing?