How is this Misleading Data Misleading Me?
Posted on Fri 13 November 2015 in TDDA
"Why is this lying bastard lying to me?"
Louis Heren,^{1} often attributed to Jeremy Paxman.
In a previous post, we made a distinction between two kinds of errors—implementation errors and errors of interpretation. I want to amplify that today, focusing specifically on interpretation.
The most important question to keep in mind at all times is not whether the analysis is computing the thing we wanted it to compute, but rather whether the result we have produced means what we think it means. The distinction is crucial.
As a simple example, let's suppose we specify the goal of our analysis as calculating the mean of a set of numbers. We can test that by adding them up and dividing by the number of items. But if we think the goal is to characterize a typical transaction size, we have to ask whether the arithmetic mean is the right metric for understanding that. As we move more towards a business or conceptual goal, rather than a mathematical or algorithmic formulation of a calculation, we have more complex and nuanced considerations, such as:

Do we believe the inputs are correct?

Is our chosen metric capable of addressing our underlying need (in this case, determining a typical transaction size)?

How do we handle nulls (missing values)?

Will outliers (perhaps extremely large values) or invalid inputs (perhaps negative values) invalidate the calculation?

If the values have dimensionality,^{2} do all of the values have the same dimensionality, and are they all in the same units (e.g. all money and all in pounds sterling, or all distances and all measured in miles)?

For that matter, are the inputs even commensurate, i.e. do they quantify sufficiently similar things that calculating their mean is even meaningful?
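As a minimal sketch of how some of these questions show up in practice, the Python below computes both the mean and the median of a small set of transaction amounts, and counts the nulls and negative values along the way. The function name `summarize_transactions` and the amounts are invented for illustration, and the choice to exclude negative values is itself an assumption that would need checking against what the data actually represents.

```python
import statistics


def summarize_transactions(values):
    """Summarize transaction amounts, reporting what was excluded.

    `values` is a hypothetical list of amounts, all assumed to be in the
    same currency; None marks a missing value.
    """
    # Nulls cannot contribute to a mean, so set them aside and count them.
    present = [v for v in values if v is not None]
    # Assumption: negative amounts are invalid here. They might instead be
    # refunds, in which case excluding them would be the wrong choice.
    valid = [v for v in present if v >= 0]
    return {
        "n_total": len(values),
        "n_null": len(values) - len(present),
        "n_negative": len(present) - len(valid),
        "mean": statistics.mean(valid),
        "median": statistics.median(valid),
    }


# Made-up data: five everyday purchases, one missing value, one negative
# entry and one extremely large value.
amounts = [12.0, 15.0, 9.0, None, 11.0, -4.0, 13.0, 5000.0]
summary = summarize_transactions(amounts)

for name, value in summary.items():
    print(f"{name}: {value}")
```

With these invented numbers the single large value pulls the mean far above every ordinary purchase, while the median stays among them. Both figures are computed correctly; which of them, if either, describes a typical transaction is a question of interpretation.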
Paxman/Heren's constant question quoted above—Why is this lying bastard lying to me?—will serve as an excellent question to keep in mind every time we view an analytical result, perhaps recast as how is this misleading data misleading me? There is a great temptation to believe beautifully formatted, painstakingly calculated results produced by the almost unfathomable power of modern computers. In fact, there is much to be said for thinking of the combination of data and processing as an adversary constantly trying to fool you into drawing false conclusions.
The questions of implementation are concerned with checking that the data received as input to the analytical process has been faithfully transmitted from the source systems, and that the calculations and manipulations performed in the analysis correctly implement the algorithms we intended to use. In contrast, as we outlined previously, the questions of interpretation emphasize that we need to be ever vigilant, asking ourselves:

Is the input data correct?

Is our interpretation of the input data correct?

Are the algorithms we are applying to the data meaningful and appropriate?

Is our interpretation of the results we produce correct?

Are the results plausible?

What am I missing?

How is this misleading data misleading me?

This quote is usually attributed to Jeremy Paxman, as noted in the Guardian article "Paxman answers the questions" (http://www.theguardian.com/media/2005/jan/31/mondaymediasection.politicsandthemedia) of 31st January 2005. According to the article, however, the true origin is a former deputy editor of the Times, Louis Heren, in his memoirs, with the full quote being "When a politician tells you something in confidence, always ask yourself: 'Why is this lying bastard lying to me?'" Still other reports, however, say that Heren himself was merely quoting advice he was given. Melvin J. Lasky writes in Profanity, Obscenity and the Media, Transaction Publishers (New Brunswick) 2005:
"Find out why the lying bastards are lying!" This is the famous phrase of an editor of the Times, Louis Heren, who received it as "advice given him early in his career by ... a correspondent of the Daily Worker [the Communist daily in London]: 'Always ask yourself why these lying bastards are lying to you.'"

Here, we use dimensionality in the sense of Dimensional Analysis, which allows us to make inferences about the results of calculations based on classifying the inputs by category. For example, we would distinguish lengths from times from quantities of money, and so forth. We would also treat separately dimensionless quantities, such as counts or ratios of quantities of the same dimension (e.g. a ratio of two lengths).