Python and HDF5 (2013)

Chapter 6. Storing Metadata with Attributes

Groups and datasets are great for keeping data organized in a file. But the feature that really turns HDF5 into a scientific database, instead of just a file format, is attributes.

Attributes are pieces of metadata you can attach to objects in the file. They can hold equipment settings, timestamps, computation results, version numbers, virtually anything you want. They're a key mechanism for making self-describing files. Unlike simple binary formats that just hold arrays of numbers, an HDF5 file with judicious metadata is scientifically useful all on its own.

Attribute Basics

You can attach attributes to any kind of object that is linked into the HDF5 tree structure: groups, datasets, and even named datatypes. To demonstrate, let’s create a new file containing a single dataset:

>>> f = h5py.File('attrsdemo.hdf5','w')

>>> dset = f.create_dataset('dataset',(100,))

Looking at the properties attached to the dset object, there’s one called .attrs:

>>> dset.attrs

<Attributes of HDF5 object at 73767504>

This is a little proxy object (an instance of h5py.AttributeManager) that lets you interact with attributes in a Pythonic way. As was the case with groups, the main thing to keep in mind here is that the attrs object works mostly like a Python dictionary.

For example, you can create a new attribute simply by assigning a name to a value:

>>> dset.attrs['title'] = "Dataset from third round of experiments"

>>> dset.attrs['sample_rate'] = 100e6    # 100 MHz digitizer setting

>>> dset.attrs['run_id'] = 144

When retrieving elements, we get back the actual value, not an intermediate object like a Dataset:

>>> dset.attrs['title']

'Dataset from third round of experiments'

>>> dset.attrs['sample_rate']

100000000.0

>>> dset.attrs['run_id']

144

Like groups (and Python dictionaries), iterating over the .attrs object provides the attribute names:

>>> [x for x in dset.attrs]

[u'title', u'sample_rate', u'run_id']

You’ll notice that like object names, the names of attributes are always returned as “text” strings; this means unicode on Python 2, which explains the u prefix.

Attributes don’t have the same strict rules as groups for item deletion. You can freely overwrite attributes just by reusing the name:

>>> dset.attrs['another_id'] = 42

>>> dset.attrs['another_id'] = 100

Trying to access missing attributes raises KeyError, although as with Group you don’t get the name of the missing attribute:

>>> del dset.attrs['another_id']

>>> dset.attrs['another_id']

KeyError: "can't open attribute (Attribute: Can't open object)"

There are also the usual methods like iterkeys, iteritems, values, and so on. They all do what you expect:

>>> [(name, val) for name, val in dset.attrs.iteritems()]

[(u'title', 'Dataset from third round of experiments'),

 (u'sample_rate', 100000000.0),

 (u'run_id', 144)]

There generally aren’t that many attributes attached to an object, so worrying about items versus iteritems, etc., is less important from a performance perspective.

There is also a get method that (unlike the Group version) is a dictionary-style get:

>>> dset.attrs.get('run_id')

144

>>> print dset.attrs.get('missing')

None

Type Guessing

When you create a dataset, you generally specify the data type you want by providing a NumPy dtype object. There are exceptions; for example, you can get a single-precision float by omitting the dtype when calling create_dataset. But every dataset has an explicit dtype, and you can always discover what it is via the .dtype property:

>>> dset.dtype

dtype('float32')

In contrast, with attributes h5py generally hides the type from you. It’s important to remember that there is a definite type in the HDF5 file. The dictionary-style interface to attributes just means that it’s usually inferred from what you provide.

Let’s flush our file to disk with:

>>> f.flush()

and look at it with h5ls:

$ h5ls -vlr attrsdemo.hdf5

Opened "attrsdemo.hdf5" with sec2 driver.

/                        Group

    Location:  1:96

    Links:     1

/dataset                 Dataset {100/100}

    Attribute: run_id    scalar

        Type:      native int

        Data:  144

    Attribute: sample_rate scalar

        Type:      native double

        Data:  1e+08

    Attribute: title     scalar

        Type:      variable-length null-terminated ASCII string

        Data:  "Dataset from third round of experiments"

    Location:  1:800

    Links:     1

    Storage:   400 logical bytes, 0 allocated bytes

    Type:      native float

In most cases, the type is determined by simply passing the value to np.array and then storing the resulting object. For integers on 32-bit systems you would get a 32-bit (“native”) integer:

>>> np.array(144).dtype

dtype('int32')

This explains the “native int” type for run_id.
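You can check this inference rule directly by passing sample values through np.array. Note that on most 64-bit platforms today the integer comes out as int64 rather than int32; the stored type simply follows whatever NumPy infers:

```python
import numpy as np

# h5py guesses attribute types by passing the value through np.array
# and storing the resulting dtype
for value in [144, 100e6, "title text"]:
    print(repr(value), '->', np.array(value).dtype)

# 144 maps to a native int (int32 or int64, platform dependent),
# 100e6 to float64, and a Python string to a NumPy string type
```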

You’re not limited to scalar values, by the way. There’s no problem storing whole NumPy arrays in the file:

>>> dset.attrs['ones'] = np.ones((100,))

>>> dset.attrs['ones']

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,

        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,

        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,

        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,

        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,

        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,

        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,

        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

There are limits, though. In HDF5, with the default settings (“compact” storage, as opposed to “dense” storage), attributes are limited to a size of 64k. For example, if we try to store a (100, 100) array, it complains:

>>> dset.attrs['ones'] = np.ones((100, 100))

ValueError: unable to create attribute (Attribute: Unable to initialize object)

Most regrettably, we discover that in this case the previous attribute was wiped out:

>>> dset.attrs['ones']

KeyError: "can't open attribute (Attribute: Can't open object)"

CAUTION

This is one of the (very few) cases where h5py’s interaction with the file is not atomic. Exercise caution with larger array attributes.

One way around this limitation is simply to store the data in a dataset, and link to it with an object reference (see Chapter 8):

>>> ones_dset = f.create_dataset('ones_data', data=np.ones((100,100)))

>>> dset.attrs['ones'] = ones_dset.ref

>>> dset.attrs['ones']

<HDF5 object reference>

To access the data, use the reference to retrieve the dataset and read it out:

>>> ones_dset = f[dset.attrs['ones']]

>>> ones_dset[...]

array([[ 1.,  1.,  1., ...,  1.,  1.,  1.],

       [ 1.,  1.,  1., ...,  1.,  1.,  1.],

       [ 1.,  1.,  1., ...,  1.,  1.,  1.],

       ...,

       [ 1.,  1.,  1., ...,  1.,  1.,  1.],

       [ 1.,  1.,  1., ...,  1.,  1.,  1.],

       [ 1.,  1.,  1., ...,  1.,  1.,  1.]])
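If this comes up often, the pattern can be wrapped in a pair of small helpers. These are hypothetical conveniences (set_large_attr and get_large_attr are not h5py APIs), and the dataset naming scheme is just one possible choice:

```python
import numpy as np
import h5py

def set_large_attr(obj, name, data):
    # Store the array as a real dataset next to obj, then link it with an
    # object reference, sidestepping the 64k compact-attribute limit.
    # Naming scheme ("<object>_<attr>_data") is an arbitrary convention.
    dsname = obj.name.rstrip('/') + '_' + name + '_data'
    ds = obj.file.create_dataset(dsname, data=data)
    obj.attrs[name] = ds.ref

def get_large_attr(obj, name):
    # Follow the reference back to the dataset and read it out
    return obj.file[obj.attrs[name]][...]
```

For a dataset at /dataset, calling set_large_attr(dset, 'big', np.ones((100, 100))) creates /dataset_big_data and stores a reference to it under the 'big' attribute.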

Strings and File Compatibility

There are a couple of types that have special handling. First, there is a subtle difference in HDF5 regarding the type of a string. In the previous example, we assigned a Python string as an attribute. That created a variable-length ASCII string (Variable-Length Strings).

In contrast, an instance of np.string_ would get stored as a fixed-length string in the file:

>>> dset.attrs['title_fixed'] = np.string_("Another title")

This generally isn’t an issue, but some older FORTRAN-based programs can’t deal with variable-length strings. If this is a problem for your application, use np.string_, or equivalently, arrays of NumPy type S.

By the way, you can also store Unicode strings in the file. They’re written out with the HDF5-approved UTF-8 encoding:

>>> dset.attrs['Yet another title'] = u'String with accent (\u00E9)'

>>> f.flush()

Here’s what the file looks like now, with our fixed-length and Unicode strings inside:

$ h5ls -vlr attrsdemo.hdf5/dataset

Opened "attrsdemo.hdf5" with sec2 driver.

dataset                  Dataset {100/100}

    Attribute: Yet\ another\ title scalar

        Type:      variable-length null-terminated UTF-8 string

        Data:  "String with accent (\37777777703\37777777651)"

    Attribute: ones      scalar

        Type:      object reference

        Data:  DATASET-1:70568

    Attribute: run_id    scalar

        Type:      native int

        Data:  144

    Attribute: sample_rate scalar

        Type:      native double

        Data:  1e+08

    Attribute: title     scalar

        Type:      variable-length null-terminated ASCII string

        Data:  "Dataset from third round of experiments"

    Attribute: title_fixed scalar

        Type:      13-byte null-padded ASCII string

        Data:  "Another title"

    Location:  1:800

    Links:     1

    Storage:   400 logical bytes, 0 allocated bytes

    Type:      native float

There is one more thing to mention about strings, and it has to do with the strict separation in Python 3 between byte strings and text strings.

When you read an attribute from a file, you generally get an object with the same type as in HDF5. So if we were to store a NumPy int32, we would get an int32 back.

In Python 3, this means that most of the HDF5 strings “in the wild” would be read as byte strings, which are very awkward to deal with. So in Python 3, scalar strings are always converted to text strings (type str) when they are read.
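If your code has to run under both Python versions, or reads array attributes (which are not converted), a small normalizing function avoids surprises. This is just a sketch, not an h5py feature:

```python
def to_text(value):
    # HDF5 strings are ASCII or UTF-8, so decoding bytes as UTF-8 covers
    # both cases; anything already textual passes through untouched.
    if isinstance(value, bytes):
        return value.decode('utf-8')
    return value

print(to_text(b'Dataset from third round'))
print(to_text(u'd\u00e9j\u00e0 vu'))
```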

Python Objects

The question of storing generic Python objects in HDF5 comes up now and then. You’ll notice that you can’t store an arbitrary object as an attribute (or indeed, as a dataset) in HDF5:

>>> dset.attrs['object'] = {}

TypeError: Object dtype dtype('object') has no native HDF5 equivalent

This is intentional. As the error message suggests, HDF5 has no “native,” built-in type to represent a Python object, and serialized objects in a portability-oriented format like HDF5 are generally recognized as bad news. Storing data in “blob” form defeats the wonderful type system and interoperability of HDF5.

However, I can’t tell you how to write your application. If you really want to store Python objects, the best way to do so is by “pickling” (serializing) them to a string:

>>> import pickle

>>> pickled_object = pickle.dumps({'key': 42}, protocol=0)

>>> pickled_object

"(dp0\nS'key'\np1\nI42\ns."

>>> dset.attrs['object'] = pickled_object

>>> obj = pickle.loads(dset.attrs['object'])

>>> obj

{'key': 42}

You will have to manually keep track of which strings are pickled objects. Since these attribute strings technically support only ASCII characters, it's best to stick with pickle protocol "0," which produces ASCII-only output.
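One simple bookkeeping scheme, sketched here as a hypothetical naming convention rather than anything built into h5py, is to mark pickled attributes with a name prefix:

```python
import pickle

PREFIX = 'pickled_'   # hypothetical convention for flagging serialized attributes

def store_object(attrs, name, obj):
    # Protocol 0 keeps the output ASCII-safe for storage as a string attribute
    attrs[PREFIX + name] = pickle.dumps(obj, protocol=0)

def load_object(attrs, name):
    return pickle.loads(attrs[PREFIX + name])
```

These helpers work with dset.attrs or, for testing, any plain mapping; iterating over the attribute names then tells you which entries need unpickling.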

Explicit Typing

Sometimes, for external compatibility, you may need to create attributes with very precise data types and the default type guessing won’t do. Or, you may have received a file from a colleague and don’t want to change the types of attributes by overwriting them.

There are a couple of mechanisms to deal with this. The .attrs proxy object has a method create, which takes a name, value, and a dtype:

>>> f = h5py.File('attrs_create.hdf5','w')

>>> dset = f.create_dataset('dataset', (100,))

>>> dset.attrs.create('two_byte_int', 190, dtype='i2')

>>> dset.attrs['two_byte_int']

190

>>> f.flush()

Looking at the file in h5ls:

$ h5ls -vlr attrs_create.hdf5

Opened "attrs_create.hdf5" with sec2 driver.

/                        Group

    Location:  1:96

    Links:     1

/dataset                 Dataset {100/100}

    Attribute: two_byte_int scalar

        Type:      native short

        Data:  190

    Location:  1:800

    Links:     1

    Storage:   400 logical bytes, 0 allocated bytes

    Type:      native float

This is a great way to make sure you get the right flavor of string. Unlike scalar strings, an array-like object of strings gets sent through NumPy by default and ends up as fixed-length strings in the file:

>>> dset.attrs['strings'] = ["Hello", "Another string"]

>>> dset.attrs['strings']

array(['Hello', 'Another string'],

      dtype='|S14')

In contrast, if you specify the “variable-length string” special dtype (see Chapter 7):

>>> dt = h5py.special_dtype(vlen=str)

>>> dset.attrs.create('more_strings', ["Hello", "Another string"], dtype=dt)

>>> dset.attrs['more_strings']

array([Hello, Another string], dtype=object)

Looking at the file, the two attributes have subtly different storage techniques. The original attribute is stored as a pair of 14-byte fixed-length strings, while the other is stored as a pair of variable-length strings:

$ h5ls -vlr attrs_create.hdf5

Opened "attrs_create.hdf5" with sec2 driver.

/                        Group

    Location:  1:96

    Links:     1

/dataset                 Dataset {100/100}

    Attribute: more_strings {2}

        Type:      variable-length null-terminated ASCII string

        Data:  "Hello", "Another string"

    Attribute: strings   {2}

        Type:      14-byte null-padded ASCII string

        Data:  "Hello" '\000' repeats 8 times, "Another string"

    Attribute: two_byte_int scalar

        Type:      native short

        Data:  190

    Location:  1:800

    Links:     1

    Storage:   400 logical bytes, 0 allocated bytes

    Type:      native float

It may seem like a small distinction, but when talking to third-party code this can be the difference between a working program and an error message.

Finally, there’s another convenience method called modify, which as the name suggests preserves the type of the attribute:

>>> dset.attrs.modify('two_byte_int', 33)

>>> dset.attrs['two_byte_int']

33

Keep in mind this may have unexpected consequences when the type of the attribute can’t hold the value you provide. In this case, the value will clip:

>>> dset.attrs.modify('two_byte_int', 40000)

>>> dset.attrs['two_byte_int']

32767
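The clipped result is just the upper bound of a two-byte signed integer, which NumPy can confirm. Note that plain NumPy casting would wrap around rather than clip, so the np.clip call below only mimics what the HDF5 conversion does:

```python
import numpy as np

# Range of a two-byte ("i2") signed integer
info = np.iinfo('i2')
print(info.min, info.max)   # -32768 32767

# HDF5 clips out-of-range values during conversion, so 40000 becomes the max
print(int(np.clip(40000, info.min, info.max)))   # 32767
```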

Real-World Example: Accelerator Particle Database

Here’s an example of how the groups, datasets, and attributes in HDF5 can be combined to solve a real-world data management problem. Recently, the University of Colorado installed a new electrostatic dust accelerator facility under a grant from NASA. This device fires charged micrometer-sized dust grains into a target chamber at speeds ranging from 1–100 km/s, to simulate the impact of dust grains on surfaces and hardware in space.

Application Format on Top of HDF5

The machine generates huge quantities of data. Every particle, and there can be up to 10 per second for hours on end, generates three digitized waveforms 100,000 points long. A computer system analyzes these waveforms to figure out what the mass of the particle is and how fast it’s going. The resulting waveforms and speed/mass estimates are recorded in an HDF5 file for use by the project scientists.

So the basic unit is a particle “event,” which has three floating-point waveforms, each of which has some other properties like sampling rate, digitizer range, etc. Then for each event we have metadata estimating particle mass and speed, as well as some top-level metadata like a file timestamp.

Let’s use h5ls to peek inside one of these files:

Opened "November_Run3.hdf5" with sec2 driver.

/                        Group

    Attribute: timestamp scalar

        Type:      native long long

        Data:  1352843341201

    Attribute: version_number scalar

        Type:      native int

        Data:  1

    Location:  1:96

    Links:     1

/0                       Group

    Attribute: experiment_name scalar

        Type:      5-byte null-terminated ASCII string

        Data:  "Run3"

    Attribute: id_dust_event scalar

        Type:      native long long

        Data:  210790

    Attribute: mass      scalar

        Type:      native float

        Data:  3.81768e-17

    Attribute: velocity  scalar

        Type:      native float

        Data:  9646.3

    Location:  1:11637136

    Links:     1

/0/first_detector        Dataset {100000/100000}

    Attribute: dt        scalar

        Type:      native float

        Data:  2e-08

    Location:  1:16048056

    Links:     1

    Storage:   400000 logical bytes, 400000 allocated bytes, 100.00% utilization

    Type:      native float

/0/second_detector       Dataset {100000/100000}

    Attribute: dt        scalar

        Type:      native float

        Data:  2e-08

    Location:  1:16449216

    Links:     1

    Storage:   400000 logical bytes, 400000 allocated bytes, 100.00% utilization

    Type:      native float

/0/third_detector        Dataset {100000/100000}

    Attribute: dt        scalar

        Type:      native float

        Data:  2e-08

    Location:  1:16449616

    Links:     1

    Storage:   400000 logical bytes, 400000 allocated bytes, 100.00% utilization

    Type:      native float

/1                       Group

...

There’s a lot going on here, but it’s pretty straightforward. The root group has attributes for a timestamp (when the file was written), along with a version number for the “format” used to structure the file using groups, datasets, and attributes.

Then each particle that goes down the beamline has its own group. The group attributes record the analyzed mass and velocity, along with an integer that uniquely identifies the event. Finally, the three waveforms with our original data are recorded in the particle group. They also have an attribute, in this case giving the sampling interval of the time series.
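A writer for this layout might look like the following sketch. The function name, argument names, and default sampling interval are illustrative; the actual acquisition software isn't shown here:

```python
import numpy as np
import h5py

def write_event(f, index, mass, velocity, waveforms, dt=2e-8):
    # One numbered group per particle, with analysis results as attributes
    grp = f.create_group(str(index))
    grp.attrs['id_dust_event'] = np.int64(index)
    grp.attrs['mass'] = np.float32(mass)
    grp.attrs['velocity'] = np.float32(velocity)
    # Each waveform dataset carries its own sampling interval
    for detector, data in waveforms.items():
        ds = grp.create_dataset(detector, data=np.asarray(data, dtype='f4'))
        ds.attrs['dt'] = np.float32(dt)
```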

Analyzing the Data

The crucial thing here is that the metadata required to make sense of the raw waveforms is stored right next to the data. For example, time series like our waveforms are useless unless you also know the time spacing of the samples. In the preceding file, that time interval (dt) is stored as an attribute on the waveform dataset. If we wanted to plot a waveform with the correct time scaling, all we have to do is:

from matplotlib import pyplot as p

import numpy as np

import h5py

f = h5py.File("November_Run3.hdf5", 'r')

# Retrieve HDF5 dataset

first_detector = f['/0/first_detector']

# Make a properly scaled time axis

x_axis = np.arange(len(first_detector))*first_detector.attrs['dt']

# Plot the result

p.plot(x_axis, first_detector[...])

There’s another great way HDF5 can simplify your analysis. With other formats, it’s common to have an input file or files, a code that processes them, and a “results” file with the output of your computation. With HDF5, you can have one file containing both the input data and the results of your analysis.

For example, suppose we wrote a piece of code that determined the electrical charge on the particle from the waveform data. We can store this right in the file next to the estimates for mass and velocity:

from some_science_package import charge_estimator

def update_particle_group(grp):

    # Retrieve waveform data

    first_det = grp['first_detector'][...]

    second_det = grp['second_detector'][...]

    # Retrieve time scaling data

    dt = grp['first_detector'].attrs['dt']

    # Perform charge estimation

    charge = charge_estimator(first_det, second_det, interval=dt)

    # Write charge to file

    grp.attrs['charge'] = charge

    print "For group %s, got charge %.2g" % (grp.name, charge)

>>> for grp in f.itervalues():

...     update_particle_group(grp)

The same goes for analysis that creates output datasets instead of just scalars. The key point here is to transition from thinking of the HDF5 container as a file to treating it as a database.

NOTE

Don’t get carried away. Keep backups of your data in case you accidentally do something wrong.

We’ve covered the four main objects in the HDF5 universe: files, groups, datasets, and attributes. Now it’s time to take a break and talk about the HDF5 type system, and what it can do for you.