Python and HDF5 (2013)

Chapter 7. More About Types

One of the best features of HDF5 is the huge variety of datatypes it supports. In some cases, the HDF5 feature set goes beyond NumPy. To maintain performance and create interoperable files, it’s important to understand exactly what’s going on when you use each type.

The HDF5 Type System

As with NumPy, all data in HDF5 has an associated type. The HDF5 type system is quite flexible and includes the usual suspects like integers and floats of various precisions, as well as strings and vector types.

Table 7-1 shows the native HDF5 datatypes and how they map to NumPy. Keep in mind that most of the types (integers and floats, for example) support a number of different precisions. For example, on most NumPy installations integers come in 1-, 2-, 4-, and 8-byte widths.

Table 7-1. HDF5 types

Native HDF5 type            NumPy equivalent

Integer                     dtype("i")
Float                       dtype("f")
Strings (fixed width)       dtype("S10")
Strings (variable width)    h5py.special_dtype(vlen=bytes)
Compound                    dtype([ ("field1", "i"), ("field2", "f") ])
Enum                        h5py.special_dtype(enum=("i",{"RED":0, "GREEN":1, "BLUE":2}))
Array                       dtype("(2,2)f")
Opaque                      dtype("V10")
Reference                   h5py.special_dtype(ref=h5py.Reference)

The h5py package (like PyTables) implements a few additional types on top of this system. Table 7-2 lists additions made by h5py that are described in this chapter.

Table 7-2. Additional Python-side types

Python type    NumPy expression       Stored as

Boolean        np.dtype("bool")       HDF5 enum with FALSE=0, TRUE=1
Complex        np.dtype("complex")    HDF5 compound with fields r and i

Integers and Floats

HDF5 supports all the NumPy integer sizes (1 byte to 8 bytes), signed and unsigned, little-endian and big-endian. Keep in mind that the default behavior for HDF5 when storing a too-large value in a too-small dataset is to clip, not to “roll over” like some versions of NumPy:

>>> f = h5py.File("typesdemo.hdf5")

>>> dset = f.create_dataset('smallint', (10,), dtype=np.int8)

>>> dset[0] = 300

>>> dset[0]

127

>>> a = np.zeros((10,), dtype=np.int8)

>>> a[0] = 300

>>> a[0]

44
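The two behaviors can be reproduced in NumPy alone. The following is a minimal sketch, not h5py's own conversion code: np.clip stands in for the clipping that HDF5 performs on write, a plain astype cast shows the wraparound, and the sample values are made up for illustration.

```python
import numpy as np

# Hypothetical sample values; the first two do not fit in an int8.
values = np.array([300, -200, 5], dtype=np.int64)

# Clip to the representable range first, mimicking HDF5's behavior on write.
limits = np.iinfo(np.int8)
clipped = np.clip(values, limits.min, limits.max).astype(np.int8)

# A plain cast keeps only the low 8 bits, so out-of-range values wrap around.
wrapped = values.astype(np.int8)

print(clipped.tolist())  # [127, -128, 5]
print(wrapped.tolist())  # [44, 56, 5]
```

If the data may exceed the range of the dataset's type, clipping explicitly before writing makes the outcome the same no matter which layer does the conversion.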

For floating-point numbers, HDF5 supports both single- and double-precision floats (4 and 8 bytes respectively) out of the box.

The HDF5 type representation system is very powerful, and among other things it can represent unusual floating-point precisions. “Half-precision” floats are an interesting case. These tiny 2-byte floats, available in NumPy as float16, are used for storage in applications like image and video processing, since they consume only half the space of the equivalent single-precision float. They’re great where precision isn’t that important and more dynamic range is needed than a 16-bit integer can provide.

>>> dset = f.create_dataset('half_float', (100,100,100), dtype=np.float16)

Keep in mind this is a storage format only; trying to do math on half-precision floats in NumPy will require casting and therefore be slow. Use Dataset.read_direct, the Dataset.astype context manager, or simply convert them after reading:

>>> a = dset[...]

>>> a = a.astype(np.float32)

But if you have values roughly between 10^-8 and 60,000, and aren’t too bothered about precision, they’re a great way to save disk space.
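The trade-off can be checked with NumPy alone. This is a small sketch under the assumption that the data fits comfortably in the float16 range; the array shape and the 0.1 increment are arbitrary choices for illustration.

```python
import numpy as np

half = np.finfo(np.float16)
print(float(half.max))  # 65504.0
print(half.bits)        # 16

# Same number of elements, half the bytes of single precision.
full = np.zeros((100, 100), dtype=np.float32)
small = full.astype(np.float16)
print(full.nbytes, small.nbytes)  # 40000 20000

# Upcast before doing arithmetic, then downcast only for storage.
work = small.astype(np.float32)
work += 0.1
stored = work.astype(np.float16)

# The stored value is close to 0.1, but not exactly 0.1.
error = abs(float(stored[0, 0]) - 0.1)
```

The rounding error here is small but nonzero, which is the price paid for the halved storage.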

Fixed-Length Strings

Strings in HDF5 are a bit of a pain; you got a taste of that in Chapter 6.

As we’ll see in the next section, most real-world strings don’t fit neatly into a constant amount of storage. But fixed-width strings have been around since the FORTRAN days and fit nicely into the NumPy world.

In NumPy, these are generally created using the “S” dtype. This is a flexible dtype that lets you set the string length when you create the type. HDF5 supports fixed-length strings natively:

>>> dt = np.dtype("S10")  # 10-character byte string

>>> dset = f.create_dataset('fixed_string', (100,), dtype=dt)

>>> dset[0] = "Hello"

>>> dset[0]

'Hello'

Like NumPy fixed-width strings, HDF5 will truncate strings that are too big:

>>> dset[0] = "thisstringhasmorethan10characters"

>>> dset[0]

'thisstring'

Technically, these are fixed-length byte strings, which means they use one byte per character. In HDF5, they are assumed to store ASCII text only. NumPy also supports fixed-width Unicode strings, which use multiple bytes to store each character and can represent things outside the ASCII range. The NumPy dtype for this is kind “U,” as in dtype("U10").
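The size difference between the two NumPy string kinds is easy to inspect. A short sketch (the sample word is arbitrary):

```python
import numpy as np

byte_dt = np.dtype("S10")  # fixed-width byte string: 1 byte per character
text_dt = np.dtype("U10")  # fixed-width Unicode string: 4 bytes per character

print(byte_dt.itemsize)  # 10
print(text_dt.itemsize)  # 40

# A "U" array can hold characters outside the ASCII range.
a = np.array([u"caf\u00e9"], dtype=text_dt)
print(a.nbytes)  # 40
```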

Unfortunately, HDF5 does not support such “wide-character” Unicode strings, so there’s no way to directly store “U” strings in a file. However, you aren’t out of luck on the Unicode front. First, we’ll have to take a detour and discuss one of the best features in HDF5: variable-length strings.

Variable-Length Strings

If you’ve used NumPy for a while, you’re used to one subtle but important aspect of its design: all elements in an array have the same size. There are a lot of advantages to this design; for example, to locate the 115th element of a dataset containing 4-byte floats, you know to look 460 bytes from the beginning of the array. And most types you use in everyday computation are of a fixed size—once you’ve chosen to work with double-precision floats, for example, they’re all 8 bytes wide.

This begins to break down when you come to string types. As we saw earlier, NumPy natively includes two string types: one 8-bit “ASCII” string type and one 32-bit “Unicode” string type. You have to explicitly set the size of the string when you create the type. For example, let’s create a length-3 ASCII string array and initialize it:

>>> dt = np.dtype('S3')

>>> a = np.array( [ "a", "ab", "abc", "abcd" ], dtype=dt)

>>> a

array(['a', 'ab', 'abc', 'abc'],

      dtype='|S3')

The limitation is obvious: elements with more than three characters are simply truncated, and the information is lost. It’s tempting to simply increase the length of the string type, say to 100, or 256. But we end up wasting a lot of memory, and there’s still no guarantee our guess will be large enough:

# Read first 5 lines from file

# Ed M. 4/3/12: Increased max line size from 100 to 256 per issue #344

# Ed M. 5/1/12: Increased to 1000 per issue #345

# Ed M. 6/2/12: Fixed.

# TODO: mysterious crashes with MemoryError when many threads running (#346)

a = np.empty((5,), dtype='S100000')

for idx in xrange(5):

    a[idx] = textfile.readline()

This isn’t a problem in every application, of course. But there’s no getting around the fact that strings in real-world data can have virtually any length.
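When a fixed width is unavoidable, one alternative to guessing is to measure the data first. The following is a minimal sketch, assuming all the lines are already in memory; the lines list is made-up sample data. It also shows how much of the fixed-width storage is padding.

```python
import numpy as np

# Hypothetical sample data standing in for lines read from a file.
lines = [b"short", b"a somewhat longer line", b"x"]

# Size the dtype from the longest element instead of guessing.
width = max(len(line) for line in lines)
a = np.array(lines, dtype="S%d" % width)

# Every element occupies `width` bytes, however short it is.
wasted = a.nbytes - sum(len(line) for line in lines)

print(width)     # 22
print(a.nbytes)  # 66
print(wasted)    # 38
```

Even in this tiny example more than half of the storage is padding, and the width has to be recomputed whenever a longer line shows up.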

Fortunately, HDF5 has a mechanism to handle this: variable-length strings. Like native Python strings (and strings in C), these can be any width that fits in memory. Here’s how to take advantage of them.

The vlen String Data Type

First, since NumPy doesn’t support variable-length strings at all, we need to use a special dtype provided by h5py:

>>> dt = h5py.special_dtype(vlen=str)

>>> dt

dtype(('|O4', [(({'type': <type 'str'>}, 'vlen'), '|O4')]))

That looks like a mess. But it’s actually a standard NumPy dtype with some metadata attached. In this case, the underlying type is the NumPy object dtype:

>>> dt.kind

'O'

NumPy arrays of kind "O" hold ordinary Python objects. So the dtype effectively says, “This is an object array, which is intended to hold Python strings.”
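The behavior of kind "O" arrays can be seen without h5py. A brief sketch, written for Python 3 (where str is the text type), comparing an object array with a fixed-width one built from the same made-up strings:

```python
import numpy as np

strings = ["a", "ab", "abcd" * 1000]

# An object array stores references to the Python strings themselves.
obj = np.array(strings, dtype=object)
print(obj.dtype.kind)  # O
print(len(obj[2]))     # 4000

# A fixed-width array of the same data truncates to the declared width.
fixed = np.array(strings, dtype="U3")
print(fixed.tolist())  # ['a', 'ab', 'abc']
```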

NOTE

Depending on your version of h5py, you may see a different result when you print the dtype; the details of how the “special” data is attached vary. Don’t depend on any specific implementation. Always use the special_dtype function and don’t try to piece one together yourself.

Working with vlen String Datasets

You can use a “special” dtype to create an array in the normal fashion. Here we create a 100-element variable-length string dataset:

>>> dset = f.create_dataset('vlen_dataset', (100,), dtype=dt)

You can write strings into it from anything that looks “string-shaped,” including ordinary Python strings and fixed-length NumPy strings:

>>> dset[0] = "Hello"

>>> dset[1] = np.string_("Hello2")

>>> dset[3] = "X"*10000

Retrieving a single element, you get a Python string:

>>> out = dset[0]

>>> type(out)

str

Retrieving more than one, you get an object array full of Python strings:

>>> dset[0:2]

array(['Hello', 'Hello2'], dtype=object)

There’s one caveat here: for technical reasons, the array returned has a plain-vanilla “object” dtype, not the fancy dtype we created from h5py.special_dtype:

>>> out = dset[0:1]

>>> out.dtype

dtype('object')

This is one of very few cases where dset[...].dtype != dset.dtype.
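The steps in this section can be combined into one self-contained sketch. It is written for Python 3 and uses byte strings throughout so that it does not depend on how a given h5py version decodes text. The file name is hypothetical, and the "core" driver with backing_store=False is an assumption made here so that nothing is written to disk.

```python
import h5py
import numpy as np

dt = h5py.special_dtype(vlen=bytes)

# In-memory file: the "core" driver with backing_store=False never touches disk.
with h5py.File("vlen_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("vlen_dataset", (100,), dtype=dt)

    # Elements of very different lengths live in the same dataset.
    dset[0] = b"Hello"
    dset[3] = b"X" * 10000

    first = dset[0]
    long_len = len(dset[3])

    # Slices come back as plain object arrays.
    sliced_kind = dset[0:2].dtype.kind

    # The dataset's own dtype still carries the "vlen" information.
    stored_base = h5py.check_dtype(vlen=dset.dtype)

print(first)        # b'Hello'
print(long_len)     # 10000
print(sliced_kind)  # O
```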

Byte Versus Unicode Strings

The preceding examples, like the rest of this book, are written assuming you are using Python 2. However, in both Python 2 and 3 there exist two “flavors” of string you should be aware of. They are stored in the file slightly differently, and this has implications for both internationalized applications and data portability.

A complete discussion of the bytes/Unicode mess in Python is beyond the scope of this book. However, it’s important to discuss how the two types interact with HDF5.

The Python 2 str type, used earlier, is more properly called a byte string in the Python world. As the name implies, these are sequences of single-byte elements. They’re available on both Python 2 and 3 under the name bytes (it’s a simple alias for str on Python 2, and a separate type on Python 3). They’re intended to hold strictly binary strings, although in the Python 2 world they play a dual role, generally representing ASCII or Latin-1 encoded text.

In the HDF5 world, these represent “ASCII” strings. Although no checking is done, they are expected to contain values in the range 0-127 and represent plain-ASCII text. When you create a dataset on Python 2 using:

>>> h5py.special_dtype(vlen=str)

or the equivalent-but-more-readable:

>>> h5py.special_dtype(vlen=bytes)

the underlying dataset is created with an ASCII character set. Since there are many third-party applications for HDF5 that understand only ASCII strings, this is by far the most compatible configuration.

Using Unicode Strings

The Python 2 unicode type properly represents “text” strings, in contrast to the str/bytes “byte” strings just discussed. On Python 3, “byte” strings are called bytes and the equivalent “text” strings are called—wait for it—str. Wonderful.

These strings hold sequences of more abstract Unicode characters. You’re not supposed to worry about how they’re actually represented. Before you can store them somewhere, you need to explicitly encode them, which means translating them into byte sequences. The rules that translate these “text” strings into byte strings are called encodings. HDF5 uses the UTF-8 encoding, which is very space-efficient for strings that contain mainly Western characters.
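Encoding and decoding can be tried with nothing but the standard library. A small sketch using the same accented string that appears later in this section:

```python
# A text string with one non-ASCII character.
text = u"Accent: \u00e9"

# Encoding turns text into bytes; UTF-8 uses two bytes for the accented letter.
encoded = text.encode("utf-8")
print(len(text))     # 9
print(len(encoded))  # 10

# Plain ASCII text costs exactly one byte per character in UTF-8.
ascii_encoded = u"Hello".encode("utf-8")
print(len(ascii_encoded))  # 5

# Decoding with the same encoding recovers the original text.
decoded = encoded.decode("utf-8")
```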

You can actually store these “Unicode” or “text” strings directly in HDF5, by using a similar “special” dtype:

>>> dt = h5py.special_dtype(vlen=unicode)

>>> dt

dtype(('|O4', [(({'type': <type 'unicode'>}, 'vlen'), '|O4')]))

Like before, you can create datasets and interact with them. But now you can use non-ASCII characters:

>>> dset = f.create_dataset('vlen_unicode', (100,), dtype=dt)

>>> dset[0] = "Hello"

>>> dset[1] = u"Accent: \u00E9"

>>> dset[0]

u'Hello'

>>> dset[1]

u'Accent: \xe9'

>>> print dset[1]

Accent: é

When you create this kind of a dataset, the underlying HDF5 character set is set to “UTF-8.” The only disadvantage is that some older third-party applications, like IDL, may not be able to read your strings. If compatibility with legacy code like this is essential for your application, make sure you test!

CAUTION

Remember the default string on Python 3, str, is actually a Unicode string. So on Python 3, h5py.special_dtype(vlen=str) will give you a UTF-8 dataset, not the compatible-with-everything ASCII dataset. Use vlen=bytes instead to get an ASCII dataset.

Don’t Store Binary Data in Strings!

Finally, note that HDF5 will allow you to store raw binary data using the “ASCII” dataset dtype created with special_dtype(vlen=bytes). This may work, but is generally considered evil. And because of how the strings are handled internally, if your binary string has NULLs in it ("\x00"), it will be silently truncated!

The best way to store raw binary data is with the “opaque” type (see Opaque Types).

Future-Proofing Your Python 2 Application

Finally, here are some simple rules you can follow to keep the bytes/Unicode mess from driving you mad. They will also help you when porting to Python 3, using the context-free translation tool 2to3 that ships with Python.

1.    Keep the text-versus-bytes distinction clear in your mind, and cleanly separate the two in code.

2.    Always use the alias bytes instead of str when you’re sure you want a byte string. For literals, you can even use the “b” prefix, for example, b"Hello". In particular, when calling special_dtype to create a byte string, always use bytes.

3.    For text strings use str, or better yet, unicode. Unicode literals are entered with a leading “u”: u"Hello".

Compound Types

For some kinds of data, it makes sense to bundle closely related values together into a single element. The classic example is a C struct: multiple pieces of data that are handled together but can individually be accessed. Another example would be tables in a SQL-style database or a CSV file with multiple column names; each element of data (a row) consists of several related pieces of data (the column values).

NumPy supports this feature through structured arrays, which are similar to (but not the same as) the recarray class. The dtype for these arrays contains a series of fields, each of which has a name and its own sub-dtype. Here’s an example: suppose we wanted to store 100 data elements from a weather-monitoring experiment, which periodically gives us values for temperature, pressure, and wind speed:

>>> dt = np.dtype([("temp", np.float64), ("pressure", np.float64), ("wind", np.float64)])

>>> a = np.zeros((100,), dtype=dt)

In NumPy, you can use a single field name (e.g., "temp") as an index, which in this example would return a shape-(100,) array of floats:

>>> out = a["temp"]

>>> out.shape

(100,)

>>> out.dtype

dtype('float64')

When you access a single element, you get back an object that supports dictionary-style access on the field names:

>>> out = a[0]

>>> out

(0.0, 0.0, 0.0)

>>> out["temp"]

0.0

With HDF5, you have a little more flexibility. Let’s use the same dtype to create a dataset in our file:

>>> dset = f.create_dataset("compound", (100,), dtype=dt)

You’re not limited to a single field when slicing into the dataset. We can access both the "temp" and "pressure" fields:

>>> out = dset["temp","pressure"]

>>> out.shape

(100,)

>>> out.dtype

dtype([('temp', '<f8'), ('pressure', '<f8')])

We can even mix field names and slices, for example to retrieve only the last 10 temperature points:

>>> out = dset["temp", 90:100]

>>> out.shape

(10,)

>>> out.dtype

dtype('float64')

This process is very efficient; HDF5 only reads the fields you request from disk. Likewise, you can choose to “update” only those fields you wish. If we were to set all the temperatures we just read to a new value and write back out:

>>> out[...] = 98.6

>>> dset["temp", 90:100] = out

HDF5 updates only the temp field in each record. So if, for example, you want to modify only the temperature or pressure part of the dataset, you can cut your memory use by a factor of three.
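The factor of three follows directly from the record layout, which can be checked in NumPy. This sketch only does the memory arithmetic; it does not touch a file.

```python
import numpy as np

dt = np.dtype([("temp", np.float64),
               ("pressure", np.float64),
               ("wind", np.float64)])

# Holding all 100 full records in memory costs 24 bytes per record.
records = np.zeros((100,), dtype=dt)
print(dt.itemsize)     # 24
print(records.nbytes)  # 2400

# A standalone copy of a single field needs only 8 bytes per record.
temps = np.ascontiguousarray(records["temp"])
print(temps.nbytes)    # 800
```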

Complex Numbers

Both NumPy and Python itself support complex numbers. These objects consist of two floating-point numbers pasted together, one representing the real part, and one the imaginary part of the number. In NumPy, you can have single precision (8 bytes total), double precision (16 bytes total), or extended precision (typically 24 or 32 bytes total, depending on the platform):

>>> dset = f.create_dataset('single_complex', (100,), dtype='c8')

While HDF5 has no out-of-the-box representation for complex numbers, a standard of sorts has arisen, to which h5py adheres. Complex numbers are stored as a two-element compound, the real part labelled r, and the imaginary part labelled i. Keep this in mind if you want to access the data in other programs like IDL or MATLAB. Here’s what the dataset we created looks like with h5ls:

Opened "test.hdf5" with sec2 driver.

/                        Group

    Location:  1:96

    Links:     1

/single_complex          Dataset {100/100}

    Location:  1:800

    Links:     1

    Storage:   800 logical bytes, 0 allocated bytes

    Type:      struct {

                   "r"                +0    native float

                   "i"                +4    native float

               } 8 bytes
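The same two-field layout can be reproduced in NumPy by viewing a complex array through a compound dtype. This is a sketch of the memory layout only, assuming the field names r and i described above; it is not how h5py performs the conversion internally.

```python
import numpy as np

z = np.array([1 + 2j, 3 - 4j], dtype="c8")

# Two 4-byte floats, matching the offsets +0 and +4 reported by h5ls.
pair = np.dtype([("r", "f4"), ("i", "f4")])
as_compound = z.view(pair)

print(as_compound["r"].tolist())  # [1.0, 3.0]
print(as_compound["i"].tolist())  # [2.0, -4.0]

# Viewing the compound data as complex again recovers the original values.
back = as_compound.view("c8")
```

A program that reads the file without complex-number support sees exactly this kind of record and can rebuild the complex values from the two fields.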

Enumerated Types

Those of you who have used C will recognize this next datatype. In the HDF5 world, enumerated types or enums are integer datatypes for which certain values are associated with text tags. For example, for a dataset of type np.uint8 you might define 0 to mean RED, 1 to mean GREEN, and 2 to mean BLUE.

The point of all this is to store the “semantic” meaning of these values as close as possible to the data itself, rather than, for example, in Appendix G of a manual that nobody reads.

There’s no native concept for this in the NumPy world, so we fall back again to our friend h5py.special_dtype. In this case, we use a different keyword, enum, and supply both a base type and dictionary mapping names to values:

>>> mapping = {"RED": 0, "GREEN": 1, "BLUE": 2}

>>> dt = h5py.special_dtype( enum=(np.int8, mapping) )

Datasets you create with this type work just like regular integer datasets:

>>> dset = f.create_dataset('enum', (100,), dtype=dt)

>>> dset[0]

0

Like variable-length strings, data you read from the dataset will have the extra “special dtype” information stripped off:

>>> dset[0].dtype

dtype('int8')

Keep in mind that in both HDF5 and NumPy, no checking is performed to make sure you keep to values specified in the enum. For example, if you were to assign one element to a different value, HDF5 will happily store it:

>>> dset[0] = 100

>>> dset[0]

100

It’s strictly on the honor system.
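Since nothing enforces the mapping, any checking has to happen in your own code. A minimal sketch, in which the data array stands in for values read from an enum dataset and the "<unknown>" label is an arbitrary choice:

```python
import numpy as np

mapping = {"RED": 0, "GREEN": 1, "BLUE": 2}

# Reverse lookup: integer value -> name.
names = {value: name for name, value in mapping.items()}

# Hypothetical values as they might come back from an enum dataset;
# 100 is not part of the enum.
data = np.array([0, 2, 1, 100], dtype=np.int8)

# Flag the elements that are legal enum values.
valid = np.isin(data, list(mapping.values()))

# Translate to names, marking anything outside the mapping.
labels = [names.get(int(v), "<unknown>") for v in data]

print(valid.tolist())  # [True, True, True, False]
print(labels)          # ['RED', 'BLUE', 'GREEN', '<unknown>']
```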

NOTE

HDF5 itself doesn’t like to convert between integers and enums. So if you create an enum dataset, keep in mind that people who read your data will have to explicitly read it as an enum. Generally this works fine, but as always, if you’re interacting with third-party code it’s a good idea to test.

Booleans

When storing Boolean (True/False) flags, people often resort to simply using integers. In Chapter 3, we saw that NumPy natively supports arrays of Booleans. They have their own data type, np.bool. NumPy hides the storage type from you, but behind the scenes, arrays of type bool are stored as single-byte integers.

There’s no native HDF5 Boolean type, but like complex numbers, h5py automatically provides one for you (in this case using an enum). The base type is np.int8 and the mapping is {"FALSE": 0, "TRUE": 1}. Let’s create a Boolean dataset:

>>> with h5py.File('bool.hdf5','w') as f2:

...     f2.create_dataset('bool', (100,), dtype=np.bool)

And now let’s see how it looks in the file, again using h5ls:

Opened "bool.hdf5" with sec2 driver.

/                        Group

    Location:  1:96

    Links:     1

/bool                    Dataset {100/100}

    Location:  1:800

    Links:     1

    Storage:   100 logical bytes, 0 allocated bytes

    Type:      enum native signed char {

                   FALSE            = 0

                   TRUE             = 1

               }
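The single-byte storage that h5ls reports matches what NumPy does in memory. A quick sketch, using np.bool_ (NumPy's Boolean scalar type) and made-up flag values:

```python
import numpy as np

flags = np.array([True, False, True], dtype=np.bool_)

# One byte per element, the same size as the int8 base type of the enum.
print(flags.itemsize)  # 1

# Reinterpreting the same bytes as int8 shows the 0/1 values that get stored.
as_int = flags.view(np.int8)
print(as_int.tolist())  # [1, 0, 1]

# Going the other way, any nonzero integer converts to True.
from_int = np.array([0, 1, 5], dtype=np.int8).astype(np.bool_)
print(from_int.tolist())  # [False, True, True]
```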

The array Type

Not often encountered in NumPy code, the array type is a good choice when you want to store multiple values of the same type in a single element. Unlike compound types, there are no separate “fields”; rather, each element is itself a multidimensional array.

There are a couple of pitfalls associated with this type and with some “helpful” behavior from NumPy, which can be confusing. Let’s start with an example, in which our elements are 2×2 arrays of floats:

>>> dt = np.dtype('(2,2)f')

>>> dt

dtype(('float32',(2, 2)))

Now let’s create an HDF5 dataset with this dtype that has 100 data points:

>>> dset = f.create_dataset('array', (100,), dtype=dt)

>>> dset.dtype

dtype(('float32',(2, 2)))

>>> dset.shape

(100,)

Retrieving a single element gives us a 2×2 NumPy array:

>>> out = dset[0]

>>> out

array([[ 0.,  0.],

       [ 0.,  0.]], dtype=float32)

You might have expected a NumPy scalar with our original dtype, but it doesn’t work that way. NumPy automatically “promotes” the array-type scalar into a full-fledged array of the base type. This is convenient, but it’s another case where dset[...].dtype != dset.dtype.

Likewise, if we were to create a native NumPy array with our type it would get “eaten” and the extra axes tacked on to the main array’s shape:

>>> a = np.zeros((100,), dtype=dt)

>>> a.dtype

dtype('float32')

>>> a.shape

(100, 2, 2)

So what’s the array type good for? Generally it’s best used as an element of a compound type. For example, if we had an experiment that reported an integer timestamp along with the output from a 2×2 light sensor, one choice for a data type would be:

>>> dt_timestamp = np.dtype('uint64')

>>> dt_sensor = np.dtype('(2,2)f')

>>> dt = np.dtype([ ('time', dt_timestamp), ('sensor', dt_sensor) ])

Creating a dataset with this compound type, it’s easy to store and retrieve individual outputs from the experiment:

>>> import time

>>> dset = f.create_dataset('mydata', (100,), dtype=dt)

>>> dset["time", 0] = time.time()

>>> dset["sensor", 0] = ((1,2), (3,4))

>>> out = dset[0]

>>> out

(1368217143, [[1.0, 2.0], [3.0, 4.0]])

>>> out["sensor"]

array([[ 1.,  2.],

       [ 3.,  4.]], dtype=float32)

When your data contains “packets” of values like this, it’s generally better to use the array type than, say, add extra dimensions to the dataset. Not only does it make access easier, but it’s semantically more meaningful.
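The same compound-with-array layout works for in-memory NumPy arrays. In this sketch the sensor field is declared with the (name, type, shape) form of a field specification, which is intended to describe the same 2×2 float layout as '(2,2)f'; the timestamp is the sample value from the output above.

```python
import numpy as np

dt = np.dtype([("time", np.uint64), ("sensor", np.float32, (2, 2))])
records = np.zeros((100,), dtype=dt)

# Fill in the first "packet".
records["time"][0] = 1368217143
records["sensor"][0] = ((1, 2), (3, 4))

# The array itself stays one-dimensional: one element per packet.
print(records.shape)  # (100,)

# Selecting the array field tacks its axes on to the result.
print(records["sensor"].shape)  # (100, 2, 2)

# 8 bytes of timestamp plus 4 floats of 4 bytes each.
print(dt.itemsize)  # 24
```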

Opaque Types

It’s rare, but some data simply can’t be represented in any of the NumPy forms (for example, disk images or other binary data that isn’t numeric). There’s a mechanism for dealing with this in HDF5, which you should consider a last resort for data that needs to be stored, bit for bit, in the file.

The NumPy void "V" type is used to store such “opaque” data. Like the string type "S" this is a fixed-width flexible type. For example, to store opaque fields 200 bytes long in NumPy:

>>> dt = np.dtype('V200')

>>> a = np.zeros((10,), dtype=dt)  # 10 elements each 200 bytes long

When you provide such a dtype to create_dataset, the underlying dataset is created with the HDF5 opaque datatype:

>>> dset = f.create_dataset('opaque', (10,), dtype=dt)

>>> dset.dtype

dtype('|V200')

>>> dset.shape

(10,)

You should seriously consider using opaque types for storing raw binary data. It may be tempting simply to store the data in a string, but remember that strings in HDF5 are reserved either for ASCII or Unicode text.

Here’s an example of how to “round-trip” a Python byte string through the HDF5 opaque type, in this case to store binary data in an attribute:

>>> binary_blob = b"A\x00B\x00"     # Try storing this directly! It won't work.

>>> obj.attrs["name"] = np.void(binary_blob)  # "Void" type maps to HDF5 opaque

>>> out = obj.attrs["name"]

>>> binary_blob = out.tostring()
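The difference between string and void handling of embedded NULLs shows up in NumPy even before any file is involved. A short sketch, written for Python 3; it calls tobytes, the current NumPy name for the method the example above spells tostring:

```python
import numpy as np

binary_blob = b"A\x00B\x00"

# Stored as a fixed-width string, the trailing NULL is treated as padding.
as_string = bytes(np.array([binary_blob], dtype="S4")[0])
print(as_string)  # b'A\x00B'

# Stored as a void scalar, all four bytes are kept verbatim.
as_opaque = np.void(binary_blob)
round_tripped = as_opaque.tobytes()
print(round_tripped == binary_blob)  # True
```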

Dates and Times

One frequently asked question is how to express time information in HDF5. At one point there was a datetime type in HDF5, although to my knowledge nobody in the Python world ever used it. Typically dates and times are expressed in HDF5 on an ad-hoc basis.

One way to represent time is by a count of seconds (including fractional seconds) since some time in the past, called the “epoch.” For example, “Unix time” or “POSIX time” counts the number of seconds since midnight Jan. 1, 1970 UTC.

If you need only seconds of resolution, an integer works well:

>>> timestamp = np.dtype('u8')

You can also use a double-precision float to represent fractional time, as provided by the built-in time.time():

>>> import time, datetime

>>> time.time()

1377548506.627

datetime objects can be used to provide a string in “ISO” format, which yields a nicer-looking result:

>>> datetime.datetime.now().isoformat()

'2013-08-26T14:30:02.633000'

Such timestamps are also called “naive” timestamps, because they don’t include information on the time zone or leap seconds. If your application is purely working in one time zone, or only dealing in time differences (and can ignore leap seconds for this purpose), this is likely OK. Otherwise, you will have to store appropriate data on the time zone somewhere close by (like in another member of a compound type).
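One way to avoid the ambiguity is to make the timestamp time zone aware before converting it to a string or a number. The following is a sketch using only the standard library, written for Python 3 (datetime.timezone does not exist on Python 2); the date is an arbitrary example.

```python
import datetime

utc = datetime.timezone.utc

# An "aware" timestamp carries its UTC offset with it.
stamp = datetime.datetime(2013, 8, 26, 14, 30, 2, tzinfo=utc)

# The ISO string now includes the offset, so it is unambiguous on its own.
iso = stamp.isoformat()
print(iso)  # 2013-08-26T14:30:02+00:00

# The same instant as seconds since the epoch, suitable for a numeric dataset.
seconds = stamp.timestamp()

# Converting back with an explicit time zone recovers the original value.
restored = datetime.datetime.fromtimestamp(seconds, tz=utc)
```

Either form can be stored: the ISO string in a string dataset or attribute, or the epoch value in an integer or float dataset along with a note that it is in UTC.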

There’s one last type to discuss, and it’s important enough to warrant its own chapter in this book. We have come to references: the HDF5 pointer type.