Unit Tests

Posted on Tue 19 April 2016 in TDDA

In the last post, we presented some code for implementing a "reference" test for the code for extracting CSV files from the XML dump that the Apple Health app on iOS can produce.

We will now expand that test with a few other, smaller and more conventional unit tests. Each unit test focuses on a smaller block of functionality.

The Test Code

As before, you can get the code from Github with

$ git clone https://github.com/tdda/applehealthdata.git

or if you have pulled it previously, with

$ git pull

This version of the code is tagged with v1.2, so if it has been updated by the time you read this, get that version with

$ git checkout v1.2

Here is the updated test code.

# -*- coding: utf-8 -*-
"""
testapplehealthdata.py: tests for the applehealthdata.py

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import shutil
import sys
import unittest

from collections import Counter


from applehealthdata import (HealthDataExtractor,
                             format_freqs, format_value,
                             abbreviate, encode)

CLEAN_UP = True
VERBOSE = False


def get_base_dir():
    """
    Return the directory containing this test file,
    which will (normally) be the applyhealthdata directory
    also containing the testdata dir.
    """
    return os.path.split(os.path.abspath(__file__))[0]


def get_testdata_dir():
    """Return the full path to the testdata directory"""
    return os.path.join(get_base_dir(), 'testdata')


def get_tmp_dir():
    """Return the full path to the tmp directory"""
    return os.path.join(get_base_dir(), 'tmp')


def remove_any_tmp_dir():
    """
    Remove the temporary directory if it exists.
    Returns its location either way.
    """
    tmp_dir = get_tmp_dir()
    if os.path.exists(tmp_dir):
        shutil.rmtree(tmp_dir)
    return tmp_dir


def make_tmp_dir():
    """
    Remove any existing tmp directory.
    Create empty tmp direcory.
    Return the location of the tmp dir.
    """
    tmp_dir = remove_any_tmp_dir()
    os.mkdir(tmp_dir)
    return tmp_dir


def copy_test_data():
    """
    Copy the test data export6s3sample.xml from testdata directory
    to tmp directory.
    """
    tmp_dir = make_tmp_dir()
    name = 'export6s3sample.xml'
    in_xml_file = os.path.join(get_testdata_dir(), name)
    out_xml_file = os.path.join(get_tmp_dir(), name)
    shutil.copyfile(in_xml_file, out_xml_file)
    return out_xml_file


class TestAppleHealthDataExtractor(unittest.TestCase):
    @classmethod
    def tearDownClass(cls):
        """Clean up by removing the tmp directory, if it exists."""
        if CLEAN_UP:
            remove_any_tmp_dir()

    def check_file(self, filename):
        expected_output = os.path.join(get_testdata_dir(), filename)
        actual_output = os.path.join(get_tmp_dir(), filename)
        with open(expected_output) as f:
            expected = f.read()
        with open(actual_output) as f:
            actual = f.read()
        self.assertEqual(expected, actual)

    def test_tiny_reference_extraction(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)
        data.extract()
        self.check_file('StepCount.csv')
        self.check_file('DistanceWalkingRunning.csv')

    def test_format_freqs(self):
        counts = Counter()
        self.assertEqual(format_freqs(counts), '')
        counts['one'] += 1
        self.assertEqual(format_freqs(counts), 'one: 1')
        counts['one'] += 1
        self.assertEqual(format_freqs(counts), 'one: 2')
        counts['two'] += 1
        counts['three'] += 1
        self.assertEqual(format_freqs(counts),
                         '''one: 2
three: 1
two: 1''')

    def test_format_null_values(self):
        for dt in ('s', 'n', 'd', 'z'):
            # Note: even an illegal type, z, produces correct output for
            # null values.
            # Questionable, but we'll leave as a feature
            self.assertEqual(format_value(None, dt), '')

    def test_format_numeric_values(self):
        cases = {
            '0': '0',
            '3': '3',
            '-1': '-1',
            '2.5': '2.5',
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 'n')), (k, v))

    def test_format_date_values(self):
        hearts = 'any string not need escaping or quoting; even this: ♥♥'
        cases = {
            '01/02/2000 12:34:56': '01/02/2000 12:34:56',
            hearts: hearts,
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 'd')), (k, v))

    def test_format_string_values(self):
        cases = {
            'a': '"a"',
            '': '""',
            'one "2" three': r'"one \"2\" three"',
            r'1\2\3': r'"1\\2\\3"',
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 's')), (k, v))

    def test_abbreviate(self):
        changed = {
            'HKQuantityTypeIdentifierHeight': 'Height',
            'HKQuantityTypeIdentifierStepCount': 'StepCount',
            'HK*TypeIdentifierStepCount': 'StepCount',
            'HKCharacteristicTypeIdentifierDateOfBirth': 'DateOfBirth',
            'HKCharacteristicTypeIdentifierBiologicalSex': 'BiologicalSex',
            'HKCharacteristicTypeIdentifierBloodType': 'BloodType',
            'HKCharacteristicTypeIdentifierFitzpatrickSkinType':
                                                    'FitzpatrickSkinType',
        }
        unchanged = [
            '',
            'a',
            'aHKQuantityTypeIdentifierHeight',
            'HKQuantityTypeIdentityHeight',
        ]
        for (k, v) in changed.items():
            self.assertEqual((k, abbreviate(k)), (k, v))
            self.assertEqual((k, abbreviate(k, False)), (k, k))
        for k in unchanged:
            self.assertEqual((k, abbreviate(k)), (k, k))

    def test_encode(self):
        # This test looks strange, but because of the import statments
        #     from __future__ import unicode_literals
        # in Python 2, type('a') is unicode, and the point of the encode
        # function is to ensure that it has been converted to a UTF-8 string
        # before writing to file.
        self.assertEqual(type(encode('a')), str)

    def test_extracted_reference_stats(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)

        self.assertEqual(data.n_nodes, 19)
        expectedRecordCounts = [
           ('DistanceWalkingRunning', 5),
           ('StepCount', 10),
        ]
        self.assertEqual(sorted(data.record_types.items()),
                         expectedRecordCounts)

        expectedTagCounts = [
           ('ActivitySummary', 2),
           ('ExportDate', 1),
           ('Me', 1),
           ('Record', 15),
        ]
        self.assertEqual(sorted(data.tags.items()),
                         expectedTagCounts)
        expectedFieldCounts = [
            ('HKCharacteristicTypeIdentifierBiologicalSex', 1),
            ('HKCharacteristicTypeIdentifierBloodType', 1),
            ('HKCharacteristicTypeIdentifierDateOfBirth', 1),
            ('HKCharacteristicTypeIdentifierFitzpatrickSkinType', 1),
            ('activeEnergyBurned', 2),
            ('activeEnergyBurnedGoal', 2),
            ('activeEnergyBurnedUnit', 2),
            ('appleExerciseTime', 2),
            ('appleExerciseTimeGoal', 2),
            ('appleStandHours', 2),
            ('appleStandHoursGoal', 2),
            ('creationDate', 15),
            ('dateComponents', 2),
            ('endDate', 15),
            ('sourceName', 15),
            ('startDate', 15),
            ('type', 15),
            ('unit', 15),
            ('value', 16),
        ]
        self.assertEqual(sorted(data.fields.items()),
                         expectedFieldCounts)


if __name__ == '__main__':
    unittest.main()

Notes

We're not going to discuss every part of the code, but will point out a few salient features.

I've added a coding line at the top of both the test script and the main applehealthdata.py script. This tells Python (and my editor, Emacs) the encoding of the file on disk (UTF-8). This is now relevant because one of the new tests (test_format_date_values) features a non-ASCII character in a string literal.
The previous test method test_tiny_fixed_extraction has been renamed test_tiny_reference_extraction, but is otherwise unchanged.
Several of the tests loop over dictionaries or lists of input-output pairs, with an assertion of some kind in the main body. Some people don't like this, and prefer one assertion per test. I don't really agree with that, but do think it's important to be able to see easily which assertion fails. An idiom I often use to assist this is to include the input on both sides of the test. For example, in test_abbreviate, when checking the abbreviation of items that should change, the code reads:
```
for (k, v) in changed.items():
    self.assertEqual((k, abbreviate(k)), (k, v))
```
rather than
```
for (k, v) in changed.items():
    self.assertEqual(abbreviate(k), v)
```
This makes it easy to tell which input fails, if one does, even in cases in which the main values being compared (abbreviate(k) and v, in this case) are long, complex or repeated across different inputs. It doesn't actually make much difference in these examples, but in general I find it helpful.
The test test_extracted_reference_stats checks that three counters used by the code work as expected. Some people would definitely advocate splitting this into three tests, but, even though it's quick, it seems more natural to test these together to me. This also means we don't have to process the XML file three times. There are other ways of achieving the same end, and this approach has the potential disadvantage that the later cases won't be run if the first one fails.

The other point to note here is that the Counter objects are unordered, so I've sorted the expected results on their keys in the expected values, and then used Python's sorted function, which returns a generator to return the values of a list (or other iterable) in sorted order. We could avoid the sort by constructing sets or a dictionaries from the Counter objects and checking those instead, but the sort here is not expensive, and this approach is probably simpler.
I haven't bothered to write a separate test for the extraction phase (checking that it writes the right CSV files) because that seems to me to add almost nothing over the existing reference test (test_tiny_reference_extraction).

Closing

That's it for this post. The unit tests are not terribly exciting, but they will prove useful as we extend the extraction code, which we'll start to do in the next post.

In a few posts' time, we will start analysing the data extracted from the app; it will be interesting to see whether, at that stage, we discover any more serious problems with the extraction code. Experience teaches that we probably will.