In Defence of XML: Exporting and Analysing Apple Health Data

Posted on Fri 15 April 2016 in TDDA

I'm going to present a series of posts based around the sort of health and fitness data that can now be collected by some phones and dedicated fitness trackers. Not all of these will be centrally on topic for test-driven data analysis, but I think they'll provide an interesting set of data for discussing many issues of relevance, so I hope readers will forgive me to the extent that these stray from the central theme.

The particular focus for this series will be the data available from an iPhone and the Apple Health app, over a couple of different phones, and with a couple of different devices paired to them.

In particular, the setup will be:

Apple iPhone 6s (November 2015 to present)
Apple iPhone 5s (with fitness data from Sept 2014 to Nov 2015)
Several Misfit Shine activity trackers (until early March 2016)
An Apple Watch (about a month of data, to date)

Getting data out of Apple Health (The Exploratory Version)

I hadn't initially spotted a way to get the data out of Apple's Health app, but a quick web search¹ turned up this very helpful article: https://www.idownloadblog.com/2015/06/10/how-to-export-import-health-data. It turns out there is a properly supported way to export granular data from Apple Health, described in detail in the post. Essentially:

Open the Apple Health App.
Navigate to the Health Data section (left icon at the bottom)
Select All from the list of categories
There is a share icon at the top right (a vertical arrow sticking up from a square)
Tap that to export all data
It thinks for a while (quite a while, in fact) and then offers you various export options, which for me included Airdrop, email and handing the data to other apps. I used Airdrop to dump it onto a Mac.

The result is a compressed XML file called export.zip. For me, this was about 5.5MB, which expanded to 109MB when unzipped. (Interestingly, I started this with an earlier export a couple of weeks ago, when the zipped file was about 5MB and the expanded version was 90MB, so it is growing fairly quickly, thanks to the Watch.)

As helpful as the iDownloadBlog article is, I have to comment on its introduction to exporting data, which reads

There are actually two ways to export the data from your Health app. The first way, is one provided by Apple, but it is virtually useless.

To be fair to iDownloadBlog, an XML file like this probably is useless to the general reader, but it builds on a meme fashionable among developers and data scientists to the effect of "XML is painful to process, verbose and always worse than JSON", and I think this is somewhat unfair.

Let's explore export.xml using Python and the ElementTree library. Although the decompressed file is quite large (109MB), it's certainly not problematically large to read into memory on a modern machine, so I'm not going to worry about reading it in bits: I'm just going to find out as quickly as possible what's in it.

The first thing to do, of course, is simply to look at the file, probably using either the more or less command, assuming you are on some flavour of Unix or Linux. Let's look at the top of my export.xml:

$ head -79 export6s3/export.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HealthData [
<!-- HealthKit Export Version: 3 -->
<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout|ActivitySummary)*)>
<!ATTLIST HealthData
  locale CDATA #REQUIRED
>
<!ELEMENT ExportDate EMPTY>
<!ATTLIST ExportDate
  value CDATA #REQUIRED
>
<!ELEMENT Me EMPTY>
<!ATTLIST Me
  HKCharacteristicTypeIdentifierDateOfBirth         CDATA #REQUIRED
  HKCharacteristicTypeIdentifierBiologicalSex       CDATA #REQUIRED
  HKCharacteristicTypeIdentifierBloodType           CDATA #REQUIRED
  HKCharacteristicTypeIdentifierFitzpatrickSkinType CDATA #REQUIRED
>
<!ELEMENT Record (MetadataEntry*)>
<!ATTLIST Record
  type          CDATA #REQUIRED
  unit          CDATA #IMPLIED
  value         CDATA #IMPLIED
  sourceName    CDATA #REQUIRED
  sourceVersion CDATA #IMPLIED
  device        CDATA #IMPLIED
  creationDate  CDATA #IMPLIED
  startDate     CDATA #REQUIRED
  endDate       CDATA #REQUIRED
>
<!-- Note: Any Records that appear as children of a correlation also appear as top-level records in this document. -->
<!ELEMENT Correlation ((MetadataEntry|Record)*)>
<!ATTLIST Correlation
  type          CDATA #REQUIRED
  sourceName    CDATA #REQUIRED
  sourceVersion CDATA #IMPLIED
  device        CDATA #IMPLIED
  creationDate  CDATA #IMPLIED
  startDate     CDATA #REQUIRED
  endDate       CDATA #REQUIRED
>
<!ELEMENT Workout ((MetadataEntry|WorkoutEvent)*)>
<!ATTLIST Workout
  workoutActivityType   CDATA #REQUIRED
  duration              CDATA #IMPLIED
  durationUnit          CDATA #IMPLIED
  totalDistance         CDATA #IMPLIED
  totalDistanceUnit     CDATA #IMPLIED
  totalEnergyBurned     CDATA #IMPLIED
  totalEnergyBurnedUnit CDATA #IMPLIED
  sourceName            CDATA #REQUIRED
  sourceVersion         CDATA #IMPLIED
  device                CDATA #IMPLIED
  creationDate          CDATA #IMPLIED
  startDate             CDATA #REQUIRED
  endDate               CDATA #REQUIRED
>
<!ELEMENT WorkoutEvent EMPTY>
<!ATTLIST WorkoutEvent
  type CDATA #REQUIRED
  date CDATA #REQUIRED
>
<!ELEMENT ActivitySummary EMPTY>
<!ATTLIST ActivitySummary
  dateComponents           CDATA #IMPLIED
  activeEnergyBurned       CDATA #IMPLIED
  activeEnergyBurnedGoal   CDATA #IMPLIED
  activeEnergyBurnedUnit   CDATA #IMPLIED
  appleExerciseTime        CDATA #IMPLIED
  appleExerciseTimeGoal    CDATA #IMPLIED
  appleStandHours          CDATA #IMPLIED
  appleStandHoursGoal      CDATA #IMPLIED
>
<!ELEMENT MetadataEntry EMPTY>
<!ATTLIST MetadataEntry
  key   CDATA #REQUIRED
  value CDATA #REQUIRED
>
]>

This is immediately encouraging: Apple has provided DOCTYPE (DTD) information, which even though slightly old fashioned, tells us what we should expect to find in the file. DTD's are awkward to use, and when coming from untrusted sources, can leave the user potentially vulnerable to malicious attacks, but despite this, they are quite expressive and helpful, even just as plain-text documentation.

Roughly speaking, the lines:

<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout)*)>
<!ATTLIST HealthData
  locale CDATA #REQUIRED
>

say

that the top element will be a HealthData element
that this HealthData element will contain
- an ExportDate element
- a Me element
- zero or more elements of type Record, Correlation or Workout
and that the HealthData element will have an attribute locale (which is mandatory).

The rest of this DTD section describes each kind of record in more detail.

The next 6 lines in my XML file are as follows (spread out for readability):

<HealthData locale="en_GB">
 <ExportDate value="2016-04-15 07:27:26 +0100"/>
 <Me HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"
     HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
     HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet"
     HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
 <Record type="HKQuantityTypeIdentifierHeight"
         sourceName="Health"
         sourceVersion="9.2"
         unit="cm"
         creationDate="2016-01-02 09:45:10 +0100"
         startDate="2016-01-02 09:44:00 +0100"
         endDate="2016-01-02 09:44:00 +0100"
         value="194">
  <MetadataEntry key="HKWasUserEntered" value="1"/>
 </Record>

As you can see, the export format is verbose, but extremely comprehensible and comprehensive. It's also very easy to read into Python and explore.

Let's do that, here with an interactive python:

>>> from xml.etree import ElementTree as ET
>>> with open('export.xml') as f:
...     data = ET.parse(f)
... 
>>> data
<xml.etree.ElementTree.ElementTree object at 0x107347a50>

The ElementTree module turns each XML element into an Element object, described by its tag, with a few standard attributes.

Inspecting the data object, we find:

>>> data.__dict__
{'_root': <Element 'HealthData' at 0x1073c2050>}

i.e., we have a single entry in data—a root element called HealthData.

Like all Element objects, it has the four standard attributes:²

>>> root = data._root
>>> root.__dict__.keys()
['text', 'attrib', 'tag', '_children']

These are:

>>> root.attrib
{'locale': 'en_GB'}

>>> root.text
'\n '

>>> root.tag
'HealthData'

>>> len(root._children)
446702

So nothing much apart from an encoding and a whole lot of child nodes. Let's inspect the first few of them:

>>> nodes = root._children
>>> nodes[0]
<Element 'ExportDate' at 0x1073c2090>

>>> ET.dump(nodes[0])
<ExportDate value="2016-04-15 07:27:26 +0100" />

>>> nodes[1]
<Element 'Me' at 0x1073c2190>
>>> ET.dump(nodes[1])
<Me HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
    HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet"
    HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"
    HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet" />

>>> nodes[2]
<Element 'Record' at 0x1073c2410>
>>> ET.dump(nodes[2])
<Record creationDate="2016-01-02 09:45:10 +0100"
        endDate="2016-01-02 09:44:00 +0100"
        sourceName="Health"
        sourceVersion="9.2"
        startDate="2016-01-02 09:44:00 +0100"
        type="HKQuantityTypeIdentifierHeight"
        unit="cm"
        value="194">
  <MetadataEntry key="HKWasUserEntered" value="1" />
 </Record>

>>> nodes[3]
<Element 'Record' at 0x1073c2550>
>>> nodes[4]
<Element 'Record' at 0x1073c2650>

So, exactly as the DTD indicated, we have an ExportDate node, a Me node and then what looks like a great number of records. Let's confirm that:

>>> set(node.tag for node in nodes[2:])
set(['Record', 'Workout', 'ActivitySummary'])

So in fact, there are three kinds of nodes after the ExportDate and Me records. Let's count them:

>>> records = [node for node in nodes if node.tag == 'Record']
>>> len(records)
446670

These records are ones like the Height record we saw above, though in fact most of them are not Height but either StepCount, CaloriesBurned or DistanceWalkingRunning, e.g.:

>>> ET.dump(nodes[100000])
<Record creationDate="2015-01-11 07:40:15 +0000"
        endDate="2015-01-10 13:39:35 +0000"
        sourceName="njr iPhone 6s"
        startDate="2015-01-10 13:39:32 +0000"
        type="HKQuantityTypeIdentifierStepCount"
        unit="count"
        value="4" />

There is also one activity summary per day (since I got the watch).

>>> acts = [node for node in nodes if node.tag == 'ActivitySummary']
>>> len(acts)
29

The first one isn't very exciting:

>>> ET.dump(acts[0])
<ActivitySummary activeEnergyBurned="0"
                 activeEnergyBurnedGoal="0"
                 activeEnergyBurnedUnit="kcal"
                 appleExerciseTime="0"
                 appleExerciseTimeGoal="30"
                 appleStandHours="0"
                 appleStandHoursGoal="12"
                 dateComponents="2016-03-18" />

but they get better:

>>> ET.dump(acts[2])
<ActivitySummary activeEnergyBurned="652.014"
                 activeEnergyBurnedGoal="500"
                 activeEnergyBurnedUnit="kcal"
                 appleExerciseTime="77"
                 appleExerciseTimeGoal="30"
                 appleStandHours="17"
                 appleStandHoursGoal="12"
                 dateComponents="2016-03-20" />

Finally, there is a solitary Workout record.

>>> ET.dump(workouts[0])
<Workout creationDate="2016-04-02 11:12:57 +0100"
         duration="31.73680251737436"
         durationUnit="min"
         endDate="2016-04-02 11:12:22 +0100"
         sourceName="NJR Apple&#160;Watch"
         sourceVersion="2.2"
         startDate="2016-04-02 10:40:38 +0100"
         totalDistance="0"
         totalDistanceUnit="km"
         totalEnergyBurned="139.3170000000021"
         totalEnergyBurnedUnit="kcal"
         workoutActivityType="HKWorkoutActivityTypeOther" />

So there we have it.

Getting data out of Apple Health (The Code)

Given this exploration, we can take a first shot at writing an exporter for Apple Health Data. I'm going to ignore the activity summaries and workout(s) for now, and concentrate on the main records. (We'll get to the others in a later post.)

Here is the code:

"""
applehealthdata.py: Extract data from Apple Health App's export.xml.

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import sys

from xml.etree import ElementTree
from collections import Counter, OrderedDict

__version__ = '1.0'

FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('type', 's'),
    ('unit', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('value', 'n'),
))

PREFIX_RE = re.compile('^HK.*TypeIdentifier(.+)$')
ABBREVIATE = True
VERBOSE = True

def format_freqs(counter):
    """
    Format a counter object for display.
    """
    return '\n'.join('%s: %d' % (tag, counter[tag])
                     for tag in sorted(counter.keys()))


def format_value(value, datatype):
    """
    Format a value for a CSV file, escaping double quotes and backslashes.

    None maps to empty.

    datatype should be
        's' for string (escaped)
        'n' for number
        'd' for datetime
    """
    if value is None:
        return ''
    elif datatype == 's':  # string
        return '"%s"' % value.replace('\\', '\\\\').replace('"', '\\"')
    elif datatype in ('n', 'd'):  # number or date
        return value
    else:
        raise KeyError('Unexpected format value: %s' % datatype)


def abbreviate(s):
    """
    Abbreviate particularly verbose strings based on a regular expression
    """
    m = re.match(PREFIX_RE, s)
    return m.group(1) if ABBREVIATE and m else s


def encode(s):
    """
    Encode string for writing to file.
    In Python 2, this encodes as UTF-8, whereas in Python 3,
    it does nothing
    """
    return s.encode('UTF-8') if sys.version_info.major < 3 else s



class HealthDataExtractor(object):
    """
    Extract health data from Apple Health App's XML export, export.xml.

    Inputs:
        path:      Relative or absolute path to export.xml
        verbose:   Set to False for less verbose output

    Outputs:
        Writes a CSV file for each record type found, in the same
        directory as the input export.xml. Reports each file written
        unless verbose has been set to False.
    """
    def __init__(self, path, verbose=VERBOSE):
        self.in_path = path
        self.verbose = verbose
        self.directory = os.path.abspath(os.path.split(path)[0])
        with open(path) as f:
            self.report('Reading data from %s . . . ' % path, end='')
            self.data = ElementTree.parse(f)
            self.report('done')
        self.root = self.data._root
        self.nodes = self.root.getchildren()
        self.n_nodes = len(self.nodes)
        self.abbreviate_types()
        self.collect_stats()

    def report(self, msg, end='\n'):
        if self.verbose:
            print(msg, end=end)
            sys.stdout.flush()

    def count_tags_and_fields(self):
        self.tags = Counter()
        self.fields = Counter()
        for record in self.nodes:
            self.tags[record.tag] += 1
            for k in record.keys():
                self.fields[k] += 1

    def count_record_types(self):
        self.record_types = Counter()
        for record in self.nodes:
            if record.tag == 'Record':
                self.record_types[record.attrib['type']] += 1

    def collect_stats(self):
        self.count_record_types()
        self.count_tags_and_fields()

    def open_for_writing(self):
        self.handles = {}
        self.paths = []
        for kind in self.record_types:
            path = os.path.join(self.directory, '%s.csv' % abbreviate(kind))
            f = open(path, 'w')
            f.write(','.join(FIELDS) + '\n')
            self.handles[kind] = f
            self.report('Opening %s for writing' % path)

    def abbreviate_types(self):
        """
        Shorten types by removing common boilerplate text.
        """
        for node in self.nodes:
            if node.tag == 'Record':
                if 'type' in node.attrib:
                    node.attrib['type'] = abbreviate(node.attrib['type'])


    def write_records(self):
        for node in self.nodes:
            if node.tag == 'Record':
                attributes = node.attrib
                kind = attributes['type']
                values = [format_value(attributes.get(field), datatype)
                          for (field, datatype) in FIELDS.items()]
                line = encode(','.join(values) + '\n')
                self.handles[kind].write(line)

    def close_files(self):
        for (kind, f) in self.handles.items():
            f.close()
            self.report('Written %s data.' % abbreviate(kind))

    def extract(self):
        self.open_for_writing()
        self.write_records()
        self.close_files()

    def report_stats(self):
        print('\nTags:\n%s\n' % format_freqs(self.tags))
        print('Fields:\n%s\n' % format_freqs(self.fields))
        print('Record types:\n%s\n' % format_freqs(self.record_types))


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('USAGE: python applehealthdata.py /path/to/export.xml',
              file=sys.stderr)
        sys.exit(1)
    data = HealthDataExtractor(sys.argv[1])
    data.report_stats()
    data.extract()

To run this code, clone the repo from github.com/tdda/applehealthdata with:

$ git clone https://github.com/tdda/applehealthdata.git

or save the text from this post as healthdata.py. At the time of posting, the code is consistent with this, but this commit is also tagged with the version number, v1.0, so if you check it out later and want to use this version, check out that version by saying:

$ git checkout v1.0

If your data is in the same directory as the code, then simply run:

$ python healthdata.py export.xml

and, depending on size, wait a few minutes while it runs. The code runs under both Python 2 and Python 3.

When I do this, the output is as follows:

$ python applehealthdata/applehealthdata.py export6s3/export.xml
Reading data from export6s3/export.xml . . . done

Tags:
ActivitySummary: 29
ExportDate: 1
Me: 1
Record: 446670
Workout: 1

Fields:
HKCharacteristicTypeIdentifierBiologicalSex: 1
HKCharacteristicTypeIdentifierBloodType: 1
HKCharacteristicTypeIdentifierDateOfBirth: 1
HKCharacteristicTypeIdentifierFitzpatrickSkinType: 1
activeEnergyBurned: 29
activeEnergyBurnedGoal: 29
activeEnergyBurnedUnit: 29
appleExerciseTime: 29
appleExerciseTimeGoal: 29
appleStandHours: 29
appleStandHoursGoal: 29
creationDate: 446671
dateComponents: 29
device: 84303
duration: 1
durationUnit: 1
endDate: 446671
sourceName: 446671
sourceVersion: 86786
startDate: 446671
totalDistance: 1
totalDistanceUnit: 1
totalEnergyBurned: 1
totalEnergyBurnedUnit: 1
type: 446670
unit: 446191
value: 446671
workoutActivityType: 1

Record types:
ActiveEnergyBurned: 19640
AppleExerciseTime: 2573
AppleStandHour: 479
BasalEnergyBurned: 26414
BodyMass: 155
DistanceWalkingRunning: 196262
FlightsClimbed: 2476
HeartRate: 3013
Height: 4
StepCount: 195654

Opening /Users/njr/qs/export6s3/BasalEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/HeartRate.csv for writing
Opening /Users/njr/qs/export6s3/BodyMass.csv for writing
Opening /Users/njr/qs/export6s3/DistanceWalkingRunning.csv for writing
Opening /Users/njr/qs/export6s3/AppleStandHour.csv for writing
Opening /Users/njr/qs/export6s3/StepCount.csv for writing
Opening /Users/njr/qs/export6s3/Height.csv for writing
Opening /Users/njr/qs/export6s3/AppleExerciseTime.csv for writing
Opening /Users/njr/qs/export6s3/ActiveEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/FlightsClimbed.csv for writing
Written BasalEnergyBurned data.
Written HeartRate data.
Written BodyMass data.
Written DistanceWalkingRunning data.
Written ActiveEnergyBurned data.
Written StepCount data.
Written Height data.
Written AppleExerciseTime data.
Written AppleStandHour data.
Written FlightsClimbed data.
$

As a quick preview of one of the files, here is the top of the second biggest output fiele, StepCount.csv:

$ head -5 StepCount.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:27:54 +0000,2014-09-13 09:27:59 +0000,329
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:34:09 +0000,2014-09-13 09:34:14 +0000,283
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:39:29 +0000,2014-09-13 09:39:34 +0000,426
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:48 +0000,2014-09-13 09:45:36 +0000,2014-09-13 09:45:41 +0000,61

You may need to scroll right to see all of it, or expand your browser window.

This blog post is long enough already, so I'll discuss (and plot) the contents of the various output files in later posts.

Notes on the Output

Format: The code writes CSV files including a header record with field names. Since the fields are XML attributes, which get read into a dictionary, they are unordered so the code sorts them alphabetically, which isn't optimal, but is at least consistent. Nulls are written as empty spaces, strings are quoted with double quotes, double quotes in strings are escaped with backslash and backslash is itself escaped with backslash. The output encoding is UTF-8.

Filenames: One file is written per record type, and the names is just the record type with extension .csv, except for record types including HK...TypeIdentifier, which is excised.

Summary Stats: Summary stats about the various CSV files are printed before the main extraction occurs.

Overwriting: Any existings CSV files are silently overwritten, so if you have multiple health data export files in the same directory, take care.

Data Sanitization: The code is almost completely opinionless, and with one exception simply flattens the data in the XML file into a collection of CSV files. The exception concerns file names and the type field file. Apple uses extraordinarily verbose and ugly names like HKQuantityTypeIdentifierStepCount and HKQuantityTypeIdentifierHeight to describe the contents of each record: the abbreviate function in the code uses a regular expression to strip off the nonsense, resulting in nicer, shorter, more comprehensible file names and record types. However, if you prefer to get your data verbatim, simply change the value of ABBREVIATE to False near the top of the file and all your HealthKit prefixes will be preserved, at the cost of a non-trivial expansion of the output file sizes.

Notes on the code: Wot, no tests?

The first thing to say about the code is that there are no tests provided with it, which is—cough—slightly ironic, given the theme of this blog. This isn't because I've written them but am holding them back for pedagogical reasons, or as an ironical meta-commentary on the whole test-driven movement, but merely because I haven't written any yet. Happily, writing tests is a good way of documenting and explaining code, so another post will follow, in which I will present some tests, possibly correct myriad bugs, and explain more about what the code is doing.

I almost said 'I googled "Apple Health export"', but the more accurate statement would be that 'I DuckDuckGoed "Apple Health export"', but there are so many problems with DuckDuckGo as a verb, even in the present tense, let alone in the past as DuckDuckGod. Maybe I should propose the neologism "to DDGoogle". Or as Greg Wilson suggested, "to Duckle". Or maybe not . . . ↩
The ElementTree structure in Python 3 is slightly different in this respect: this exploration was carried out with Python 2. However, the main code presented later in the post works under Python 2 and 3. ↩