Cython (2015)

Chapter 10. Cython, NumPy, and Typed Memoryviews

All problems in computer science can be solved by another level of indirection, except, of course, for the problem of too many indirections.

— D. Wheeler

Two great qualities of Cython are its breadth and maturity: it compiles nearly all Python code (and whatever it cannot handle is typically straightforward to address); it brings the power and control of C’s type system to Python; and it integrates with external C and C++ code with ease. The task for this chapter is to round out Cython’s capabilities and cover Cython’s array features—which include support for NumPy arrays—in depth.

We have seen how well Cython supports built-in containers like list, tuple, dict, and set. These container objects are very easy to use, can contain any type of Python object, and are highly optimized for object lookup, assignment, and retrieval. The way the list type implements storage and retrieval is very different from dict, but from an implementation perspective, containers all have one thing in common: they all store references to Python objects. If we have a Python list of one million ints, every element in that list, at the C level, is a pointer to a boxed-up PyObject. Converting such a list to a C array of C ints is expensive, requiring us to iterate through the list and convert each PyObject to a C int, all the while doing proper error checking.

For homogeneous containers (e.g., a list containing nothing but floats), we can do much better in terms of storage overhead and performance. Large arrays of homogeneous numeric types are common, and not just in numerical programming contexts. Furthermore, CPUs and modern memory hierarchies are optimized to work with such arrays. C has fixed-size and heap-allocated arrays. C++ has the std::vector workhorse STL templated type. What we want is a way to represent and work with a homogeneous contiguous array, or buffer, of unboxed data types in Python.
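The standard library's array.array type (revisited later in this chapter) gives a concrete feel for unboxed storage: one thousand C doubles occupy exactly one thousand contiguous eight-byte slots, with no per-element PyObject. A quick check:

```python
from array import array

# One thousand C doubles packed contiguously -- no boxing, no pointers.
a = array('d', range(1000))
print(a.itemsize)                  # bytes per element: 8 for a C double
address, length = a.buffer_info()  # address of, and item count in, the buffer
print(length * a.itemsize)         # total payload: 8000 bytes
```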

Enter Python buffers and the new Python buffer protocol. Buffers allow us to represent contiguous or simply strided unboxed data of a single data type. NumPy arrays—the most widely used array type in Python—support the buffer protocol. It is useful to think of buffers as simplified NumPy arrays.

Using buffers effectively is often the key to obtaining C-level performance from Cython code. Fortunately, Cython makes it particularly easy to work with buffers. It has first-class support for the new buffer protocol and, with it, NumPy arrays.

The Power of the New Buffer Protocol

The new buffer protocol is a C-level protocol.[19] Python objects can implement the protocol, but it does not affect their interface at the Python level. The protocol is supported in all Python 3 versions and has been backported to Python 2.6 and later. It defines a C-level struct that has a data buffer and metadata to describe the buffer’s layout, data type, and read and write permissions. It also defines the API that an object supporting the protocol must implement.

NOTE

The new buffer protocol’s most important feature is its ability to represent the same underlying data in different ways. It allows NumPy arrays, several Python built-in types, and Cython-level array-like objects to share the same data without copying. With Cython, we can also easily extend the buffer protocol to work with data coming from an external library.

We do not cover the protocol’s details here; it is thoroughly documented in Python’s C API reference manual. Thankfully, Cython allows us to work with buffers without having to know the details of the protocol. It is sufficient to know that, when working with buffers, we can efficiently access their underlying data without copying, reducing overhead.

What types implement the protocol?

NumPy ndarray

The well-known and widely used NumPy package has an ndarray object that supports the buffer protocol, making it a valid Python buffer.

Built-in str (Py 2)

The built-in string type in Python 2.6 and 2.7 implements the protocol. The Unicode type in Python 2 and the string type in Python 3, however, do not.

Built-in bytes and bytearray types

The bytes and bytearray types in all Python versions implement the protocol.

Standard library array.array

The array.array Python standard library type implements a list-like array type that supports the protocol.

Standard library ctypes arrays

Arrays in the ctypes package also implement the protocol.

Various third-party types

For instance, the Python Imaging Library (PIL) implements the protocol for various image types.
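A quick way to see which objects implement the protocol is to try wrapping them in a memoryview (discussed next), which raises TypeError for non-supporting objects. A small sketch:

```python
from array import array
import ctypes

def supports_buffer(obj):
    """True if obj implements the buffer protocol."""
    try:
        memoryview(obj)
        return True
    except TypeError:
        return False

print(supports_buffer(b"bytes"))              # True
print(supports_buffer(bytearray(b"ba")))      # True
print(supports_buffer(array('d', [1.0])))     # True
print(supports_buffer((ctypes.c_int * 3)()))  # True
print(supports_buffer([1, 2, 3]))             # False: lists box their elements
```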

The memoryview Type

There is another built-in Python type, memoryview, whose sole purpose is to represent a C-level buffer at the Python level. We create a memoryview object by passing the memoryview callable an object that implements the protocol, like a bytes object:

$ ipython --no-banner

In [1]: bb = b"These are the times that try men's souls."

In [2]: memv = memoryview(bb)

In [3]: memv

Out[3]: <memory at 0x101955348>

Here, memv is an object that shares data with the bytes string.

Playing with a memoryview object gives us a feel for what buffers are doing at the C level.

For instance, we can access data from the underlying buffer by indexing:

In [4]: memv[0]

Out[4]: 'T'

In [5]: memv[-1]

Out[5]: '.'

Slicing returns another memoryview, which also shares the underlying bytes data:

In [6]: memv[:10]

Out[6]: <memory at 0x102a223e0>

In [7]: memv[:10][0]

Out[7]: 'T'

NOTE

We can slice a memoryview with arbitrary start, stop, and step values, allowing us to efficiently select only the data elements of interest. In this way, memoryview objects provide functionality beyond having multiple variables referring to the same object.
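A quick pure-Python illustration of stepped slicing (Python 3 shown):

```python
bb = b"These are the times that try men's souls."
memv = memoryview(bb)

# A stepped slice selects every other byte -- still without copying.
every_other = memv[::2]
print(len(every_other))          # half the original length, rounded up
print(every_other.obj is bb)     # True: both views wrap the same bytes
```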

Because a bytes object is immutable, a memoryview of a bytes object is readonly:

In [8]: memv.readonly

Out[8]: True

In [9]: memv[0] = 'F'

...

TypeError: cannot modify read-only memory

If, instead, we take a memoryview of a mutable buffer like bytearray, we can modify its data. First, let’s make two memoryviews that share an underlying buffer:

In [10]: ba = bytearray(b"If the facts don't fit the theory, change the facts.")

In [11]: mutable1 = memoryview(ba)

In [12]: mutable2 = mutable1[:10]

Modifying the mutable1 memoryview modifies it in the original bytearray and in mutable2 as well:

In [13]: mutable2[0]

Out[13]: 'I'

In [14]: mutable1[0] = "A"

In [15]: mutable2[0]

Out[15]: 'A'

In [16]: ba[:1]

Out[16]: bytearray(b'A')

A memoryview has several attributes that query the underlying buffer’s metadata. We have already seen the readonly attribute. For something a bit more interesting, let’s take a memoryview of a multidimensional NumPy array:

In [17]: import numpy as np

In [18]: np_mv = memoryview(np.ones((10, 20, 30)))

We can ask for the number of dimensions using ndim:

In [19]: np_mv.ndim

Out[19]: 3L

And we can see the extent of the memoryview in each dimension with the shape attribute:

In [20]: np_mv.shape

Out[20]: (10L, 20L, 30L)

memoryviews also have a strides attribute, which specifies the number of bytes separating elements in the buffer in that dimension:

In [21]: np_mv.strides

Out[21]: (4800L, 240L, 8L)

Looking at strides, we can tell that the buffer is C contiguous in memory, as the skip in the last dimension is smallest and matches np_mv.itemsize.

NOTE

The strides of an array indicates the number of bytes separating elements in the array in that dimension. A NumPy array also has a strides attribute, and more details about strides and how it is used can be found in NumPy’s strides documentation.
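We can verify the arithmetic behind these numbers directly: for C order, each stride is the product of the later dimensions' extents times the eight-byte itemsize, and Fortran order reverses the pattern:

```python
import numpy as np

# For a C-contiguous array, strides[k] = prod(shape[k+1:]) * itemsize.
a = np.ones((10, 20, 30))            # float64: itemsize is 8 bytes
print(a.strides)                     # (4800, 240, 8)
print((20 * 30 * 8, 30 * 8, 8))      # the same numbers, by hand

# Fortran order reverses the pattern: the first axis varies fastest.
f = np.ones((10, 20, 30), order='F')
print(f.strides)                     # (8, 80, 1600)
```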

The underlying data type comes from the format attribute, which gives back a format string:

In [22]: np_mv.format

Out[22]: 'd'

Structured data types are supported as well. First, let’s create a NumPy structured dtype with fields a and b with data types int8 and complex128, respectively:

In [23]: dt = np.dtype([('a', np.int8), ('b', np.complex128)])

In [24]: dt

Out[24]: dtype([('a', 'i1'), ('b', '<c16')])

We can now make a memoryview from an empty NumPy array with our new dtype:

In [25]: structured_mv = memoryview(np.empty((10,), dtype=dt))

The memoryview’s format string comes from the struct standard library module’s specification, and for structured types is rather cryptic:

In [26]: structured_mv.format

Out[26]: 'T{b:a:=Zd:b:}'

We leave the details of memoryview format strings to the official documentation; thankfully, we do not have to work with them directly. We can rest assured that buffers and memoryview objects work with simple scalar types as well as user-defined structured types.
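For simple scalar buffers, by contrast, the format code is just the familiar single character from the struct module:

```python
from array import array

# Scalar buffers report single-character struct-module format codes.
print(memoryview(b"bytes").format)            # 'B': unsigned byte
print(memoryview(array('d', [1.0])).format)   # 'd': C double
print(memoryview(bytearray(3)).format)        # 'B' again
```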

How do memoryviews and buffer objects translate to Cython? Given that Cython lives between Python and C, it is ideally suited to work with memoryview objects and the buffer protocol at the C level.

Typed Memoryviews

Cython has a C-level type, the typed memoryview, that conceptually overlaps with the Python memoryview type and expands on it. As suggested by the name, a typed memoryview is used to view (i.e., share) data from a buffer-producing object. Because a typed memoryview operates at the C level, it has minimal Python overhead and is very efficient. A typed memoryview has a memoryview-like interface, so it is easier to use than working with C-level buffers directly. And because a typed memoryview is designed to work with the buffer protocol, it supports any buffer-producing object efficiently, allowing sharing of data buffers without copying.

Let’s see an example.

Typed Memoryview Example

Suppose we want to work with a buffer of one-dimensional data efficiently in Cython. We do not care how the data is created at the Python level; we just want to access it in an efficient way.

Let’s create a def function in Cython that has a typed memoryview argument:[20]

def summer(double[:] mv):

    """Sums its argument's contents."""

    # ...

The double[:] mv syntax declares mv to be a typed memoryview. The double specifies the memoryview’s underlying data type, and the single colon in brackets indicates a one-dimensional memoryview object.

When we call summer from Python, we pass in a Python object that is implicitly assigned to mv as part of the usual function calling process. When an object is assigned to a typed memoryview, the memoryview attempts to access the object’s underlying data buffer. If the passed object cannot provide a buffer—that is, it does not support the protocol—a ValueError is raised. If it does support the protocol, then it provides a C-level buffer for the memoryview to use.

Iterating through mv like a regular Python object is supported:

def summer(double[:] mv):

    """Sums its argument's contents."""

    cdef double d, ss = 0.0

    for d in mv:

        ss += d

    return ss

To play with this code (memviews.pyx) from IPython, we use pyximport to quickly compile this function at import time:

$ ipython --no-banner

In [1]: import pyximport; pyximport.install()

Out[1]: (None, <pyximport.pyximport.PyxImporter at 0x101c6c450>)

In [2]: import memviews

Let’s create a million-element NumPy array to test:

In [3]: import numpy as np

In [4]: arr = np.ones((10**6,), dtype=np.double)

Now we can pass arr to memviews.summer:

In [5]: memviews.summer(arr)

Out[5]: 1000000.0

It also works with array.array objects. First, let’s create a million-element array:

In [6]: from array import array

In [7]: a = array('d', [1]*10**6)

In [8]: len(a)

Out[8]: 1000000

We can pass a to memviews.summer and it works automatically in Python 3. In Python 2, we have to make sure we cimport cpython.array in our Cython source, which allows Cython to work with array.array objects:

In [9]: memviews.summer(a)

Out[9]: 1000000.0

This implementation of summer is not particularly efficient, however:

In [10]: %timeit memviews.summer(arr)

1 loops, best of 3: 262 ms per loop

When iterating through a typed memoryview, Cython essentially treats it as a general Python iterator, calling into the Python/C API for every access. We can do better.

C-Level Access to Typed Memoryview Data

Typed memoryviews are designed for C-style access with no Python overhead. A better way to add mv’s elements is:

def summer(double[:] mv):

    """Sums its argument's contents."""

    cdef:

        double ss = 0.0

        int i, N

    N = mv.shape[0]

    for i in range(N):

        ss += mv[i]

    return ss

This version has much better performance: about 1 millisecond for our million-element array. When indexing into a typed memoryview with a typed integer, Cython generates code that bypasses Python/C API calls and indexes into the underlying buffer directly. This is the source of our large speedup. But we can do better still.

Trading Safety for Performance

Every time we access our memoryview, Cython checks that the index is in bounds. If it is out of bounds, Cython raises an IndexError. Also, Cython allows us to index into memoryviews with negative indices (i.e., index wraparound) just like Python lists.

In our summer function, we iterate through the memoryview once, and do not do anything fancy. We know ahead of time that we never index with an out-of-bounds or negative index, so we can instruct Cython to turn off these checks for better performance. To do so, we use the special cython module with the boundscheck and wraparound compiler directives (see Compiler Directives):

from cython cimport boundscheck, wraparound

def summer(double[:] mv):

    # ...

    with boundscheck(False), wraparound(False):

        for i in range(N):

            ss += mv[i]

    # ...

We modified our original summer definition by placing our loop inside a context manager (i.e., a with block) that turns off bounds and wraparound checking when accessing our memoryview. These modifications are in effect for the duration of the context manager. The result is a small performance improvement and more efficient code generation. It is up to us to ensure that we do not index out of bounds or with a negative index; doing so could lead to a segmentation fault.

To turn off bounds and wraparound checking for the entire function, we use the decorator form of the directives and remove the context manager form:

from cython cimport boundscheck, wraparound

@boundscheck(False)

@wraparound(False)

def summer(double[:] mv):

    # ...

    for i in range(N):

        ss += mv[i]

    # etc.

To turn off bounds and wraparound checking everywhere for an entire extension module, we use a compiler directive in a special Cython comment at the top of our file:

# cython: boundscheck=False

# cython: wraparound=False

def summer(double[:] mv):

    # ...

    for i in range(N):

        ss += mv[i]

    # etc.

We can also globally enable these directives when compiling by means of the --directive flag; see Chapter 2.

NOTE

The different scope levels for these directives—context manager, decorator, and module global—provide precise control over where the directives are in effect. They can be easily disabled for development and debugging, and easily enabled for production runs.

With these performance optimizations in place, the performance of our summer function is the same as that of the equivalent NumPy sum method:

In [1]: import numpy as np

In [2]: arr = np.ones((10**6,), dtype=np.double)

In [3]: %timeit arr.sum()

1000 loops, best of 3: 1.01 ms per loop

A C version of summer has the same performance as our typed memoryview version, when accounting for Python call overhead.

So, what have we learned? We saw how to declare a simple typed memoryview, we saw how indexing a typed memoryview with an integral argument efficiently accesses the memoryview’s underlying buffer, and we saw how to use the boundscheck and wraparound directives to generate even more efficient code, understanding when it is safe to do so.

There are many more details to cover, starting with the syntax and semantics of typed memoryview declaration.

Declaring Typed Memoryviews

When declaring typed memoryviews, we can control many attributes:

Element type

The element type of a typed memoryview may be a numeric scalar type like int, float, or double complex; it may be a ctypedef alias; or it may be a structured type declared with cdef struct, for example. There is initial (and still developing) support for generic fused types as well—see the sidebar Typed Memoryviews and Fused Types.

Dimensionality

Typed memoryviews (currently) may have up to seven dimensions. To declare a three-dimensional typed memoryview, we use three comma-separated colons in the bracketed dimension spec after the element type—for example, double[:, :, :].

Contiguous or strided data packing

A strided dimension—declared with a single colon—in a typed memoryview is compatible with a strided (i.e., noncontiguous and regularly spaced) buffer dimension. This can result when the typed memoryview accesses the underlying data from a NumPy array that is a strided view of another array, for example. A contiguous dimension is more restrictive: the dimension must be contiguous in memory, and this is enforced when the typed memoryview accesses the underlying data at runtime. Because strided access is more general, it is the default.

C or Fortran contiguity

C- or Fortran-contiguous typed memoryviews are important cases with specific data packing constraints. C-contiguous—or row-major—layout means that the buffer as a whole is contiguous in memory, and, if multidimensional, that the memoryview’s last dimension is also contiguous. Fortran-contiguous—or column-major—layout means that the entire buffer is contiguous in memory, and, if multidimensional, that the first dimension is also contiguous. When possible, it is advantageous from a performance standpoint to declare arrays as C or Fortran contiguous, as this enables Cython to generate faster code that does not have to take strided access into account.

Direct or indirect access

Direct access is the default and covers nearly all use cases—it specifies that this dimension can use straightforward indexing arithmetic to directly access the underlying data. If indirect access is specified for a dimension, the underlying buffer stores a pointer to the rest of the array that must be dereferenced on access (hence indirect). In part because NumPy does not currently support indirect access, this access specification is rarely used, and for that reason direct access is the default.

If we declare a typed memoryview with a single colon in each dimension’s slot, the typed memoryview can acquire a buffer from an object of the same dimensionality and with either strided or contiguous packing.

For example, consider the default typed memoryview declaration for a three-dimensional object:

cdef int[:, :, :] mv

This is the most general and most flexible typed memoryview declaration. We can assign to mv, and thereby acquire a buffer from, any three-dimensional NumPy array with the int data type:

mv = np.empty((10, 20, 30), dtype=np.int32)

The mv typed memoryview can also acquire a buffer from a Fortran-ordered array, since each dimension has strided packing:

mv = np.ones((10, 20, 30), dtype=np.int32, order='F')

Lastly, it can acquire a buffer from a fully strided ndarray:

arr = np.ones((13, 17, 19), dtype=np.int32)

mv = arr[4:10:2, ::3, 5::-2]

When indexing into mv, Cython generates indexing code that takes the array’s strides into account. If we are willing to trade some flexibility for speed, C- or Fortran-contiguous typed memoryviews can be indexed more efficiently.

Declaring a C-contiguous typed memoryview requires a simple modification to the strided version: all dimensions except the last are specified with a single colon, and the last dimension is specified with two colons followed by a literal 1. The mnemonic is that the last dimension has a unitary stride (i.e., is contiguous in memory), hence C contiguous.

For example, to declare a two-dimensional C-contiguous typed memoryview, we would say:

cdef float[:, ::1] c_contig_mv

We can assign a C-contiguous NumPy array to it. C contiguous is the default layout for all NumPy array-creation functions:

c_contig_mv = np.ones((3, 4), dtype=np.float32)

But assigning a Fortran-ordered or a strided array to c_contig_mv raises a runtime ValueError:

c_contig_mv = np.ones((3, 4), dtype=np.float32, order='F')

#=> ValueError: ndarray is not C-contiguous

arr = np.ones((3, 4), dtype=np.float32)

c_contig_mv = arr[:, ::2]

#=> ValueError: ndarray is not C-contiguous
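We can predict which arrays a C-contiguous typed memoryview will accept by inspecting NumPy's flags attribute from pure Python:

```python
import numpy as np

c_arr = np.ones((3, 4), dtype=np.float32)             # default C order
f_arr = np.ones((3, 4), dtype=np.float32, order='F')  # Fortran order
strided = c_arr[:, ::2]                               # every other column

# Only the first would be accepted by a float[:, ::1] typed memoryview.
print(c_arr.flags['C_CONTIGUOUS'])     # True
print(f_arr.flags['C_CONTIGUOUS'])     # False
print(strided.flags['C_CONTIGUOUS'])   # False
```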

In contrast to the C-contiguous version, a Fortran-contiguous typed memoryview has the unitary stride in the first dimension:

cdef double[::1, :] f_contig_mv = np.ones((3, 4), dtype=np.float64, order='F')

The f_contig_mv cannot acquire a buffer from a C-contiguous or strided buffer-supporting object.

One-dimensional contiguous typed memoryviews are simultaneously C and Fortran contiguous:

cdef float complex[::1] both_ways = np.zeros((100,), dtype=np.complex64)

# ...

both_ways = np.empty((73,), dtype=np.complex64, order='F')

These three typed memoryview declarations—fully strided, C contiguous, and Fortran contiguous—cover the vast majority of use cases. For the common case where all arrays are C contiguous, it is recommended to use C-contiguous memoryviews: it is the most common memory layout, it is required when we are working with external C or C++ libraries, and the performance improvements it allows are worth the extra syntax and small loss in flexibility. In many situations the ValueError that results when assigning a non-C-contiguous buffer to a C-contiguous typed memoryview is a feature: it noisily tells us when an incompatible (strided or Fortran-contiguous) array has sneaked through.

If the application is Fortran-centric, then Fortran-contiguous memoryviews are preferable.

NumPy provides the ascontiguousarray and asfortranarray conversion functions, which take an array-like object as an argument and return a guaranteed C- or Fortran-contiguous NumPy array, respectively. Each returns the argument unmodified when it is already C or Fortran contiguous, so they are as efficient as can be expected.
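A quick NumPy-level check of the no-copy behavior:

```python
import numpy as np

f_arr = np.ones((3, 4), order='F')

# A copy is made only when the layout actually has to change.
c_arr = np.ascontiguousarray(f_arr)
print(c_arr.flags['C_CONTIGUOUS'])            # True

# Already-C-contiguous input comes back unmodified -- no copy at all.
print(np.ascontiguousarray(c_arr) is c_arr)   # True
```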

Fully strided typed memoryviews are valuable when we are iterating through an array once and the input array’s layout is ambiguous. In these situations, the overhead of manually creating a contiguous copy for use by contiguous memoryviews may outweigh the performance gain from contiguous access.

TYPED MEMORYVIEWS AND FUSED TYPES

We can use Cython’s nascent fused types for a typed memoryview’s element type to provide more generalization and flexibility. This comes with the usual restrictions for fused types (see the sidebar “Fused Types and Generic Programming” in Chapter 3). The fused type used with the typed memoryview must be used to declare at least one argument type so that Cython can determine which fused type specialization to dispatch to at compile time or runtime.

For instance, suppose we want to declare a cdef, cpdef, or def function that generalizes the preceding summer function to accept either a float or double strided and one-dimensional typed memoryview. We can do so using the cython.floating built-in fused type:

cimport cython

cpdef cython.floating generic_summer(cython.floating[:] mv):

    cdef cython.floating f, ss = 0.0

    for f in mv:

        ss += f

    return ss

Because the cython.floating fused type is used for the mv argument, it can also be used for the internal f and ss variable types.

With this definition, generic_summer can accept either a float or a double array, unlike the original summer function, which is restricted to buffers of double elements only:

import numpy as np

double_array = np.arange(10., dtype=np.double)

float_array = np.asarray(double_array, dtype=np.float32)

print generic_summer(double_array)

#=> 45.0

print generic_summer(float_array)

#=> 45.0

Because generic_summer is a cpdef function, it can also be called from Cython with a typed memoryview argument:

import numpy as np

cdef double[:] double_array = np.arange(10., dtype=np.double)

cdef float[:] float_array = np.asarray(double_array, dtype=np.float32)

print generic_summer(double_array)

#=> 45.0

print generic_summer(float_array)

#=> 45.0

The combination of fused types and typed memoryviews allows typed memoryviews to generalize not only the manner in which data is accessed, but also the underlying data type.

Using Typed Memoryviews

Once we have declared a typed memoryview, we must assign a buffer-supporting object to it. Doing so causes the typed memoryview to acquire (or view) a buffer from the righthand-side object. The assigned-to typed memoryview shares access with the object’s underlying buffer.

If we forget to acquire a buffer with a typed memoryview, we cannot perform any operations with it that require a buffer. Doing so will result in runtime exceptions.

What operations do typed memoryviews support?

We can access and modify individual elements by indexing into the typed memoryview in a NumPy-like fashion:

cdef int[:, :] mv = obj

print(mv[10, -20]) # access

mv[0, -1] = 3 # modify

As we saw previously, typed memoryviews can be indexed efficiently, especially when we turn off bounds checking and wraparound checking:

from cython cimport boundscheck, wraparound

def mv_sum(int[:, ::1] mv):

    cdef int N, M, i, j

    cdef long s=0

    N = mv.shape[0]; M = mv.shape[1]

    with boundscheck(False), wraparound(False):

        for i in range(N):

            for j in range(M):

                s += mv[i,j]

    return s

To modify a memoryview in its entirety, thereby modifying the contents of the buffer it views, we can use slice assignment with an ellipsis (...); to modify a sliceable section, we can use regular slice assignment. Doing either copies data from the righthand side. The righthand side can be a scalar:

cdef double[:, :] mv = np.empty((10, 20))

mv[...] = math.pi

or it can be another memoryview with the same element type and of the right shape:

cdef double[:, :] mv1 = np.zeros((10, 20))

cdef double[:, ::1] mv2 = np.ones((20, 40))

mv1[::2, ::2] = mv2[1:11:2, 10:40:3]

If the shapes of the lefthand and righthand sides do not match, a runtime ValueError will be raised.

NOTE

When we intend to copy data into a typed memoryview, slice assignment is necessary. If instead of slice assignment we had used regular assignment, then no copy would be made. Regular assignment with typed memoryviews results in another typed memoryview sharing the righthand side’s underlying buffer. This behavior is conceptually—if not precisely—analogous to that of Python lists, where slice assignment copies data, and regular assignment simply creates another variable by which to access the same data.

We can also use the copy or copy_fortran method to generate a C- or Fortran-contiguous copy of a memoryview’s buffer, respectively.

Once a buffer has been acquired, we can slice it like a NumPy ndarray to get another typed memoryview that shares the buffer:

cdef float[:, :, ::1] mv = obj

cdef float[:, :] two_dee_mv = mv[:, 0, :]

The usual start, stop, and step arguments are allowed with slicing:

two_dee_mv[...] = mv[4:10:2, ::3, -1]

Like NumPy arrays, typed memoryviews support partial indexing, which results in a typed memoryview slice:

cdef int[:, :, :] mv = obj

assert mv[10].shape == mv[10, ...].shape == mv[10, :, :].shape

Also as with NumPy arrays, we can insert new dimensions into typed memoryviews with None:

cdef double[:] mv = np.ones((50,))

assert mv[None, :].shape == (1, 50)

assert mv[:, None].shape == (50, 1)

Unlike NumPy arrays, however, typed memoryviews do not support universal functions, so no broadcasting operations are possible other than simple scalar assignment. But we can efficiently (i.e., without copying) make a NumPy array from a typed memoryview, since typed memoryviews themselves support the buffer protocol:

cdef float[:] rows = np.arange(100, dtype=np.float32)

cdef float[:] cols = rows

# broadcasting sum

plane = np.asarray(rows[:,None]) + np.asarray(cols[None,:])
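The same broadcasting sum can be checked in plain NumPy, without the Cython declarations:

```python
import numpy as np

rows = np.arange(100, dtype=np.float32)
cols = rows

# A (100, 1) array plus a (1, 100) array broadcasts to a (100, 100) plane.
plane = rows[:, None] + cols[None, :]
print(plane.shape)   # (100, 100)
print(plane[2, 3])   # rows[2] + cols[3] == 5.0
```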

And lastly, to transpose a typed memoryview we use the T attribute, as with a NumPy ndarray. Transposing a C-contiguous typed memoryview results in a Fortran-contiguous one:

cdef int[:, ::1] c_contig = obj

cdef int[::1, :] f_contig = c_contig.T
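The NumPy analogue shows why: transposition is a view that swaps shape and strides, so C order becomes Fortran order with no copying:

```python
import numpy as np

c_arr = np.ones((3, 5), dtype=np.int32)   # C contiguous
t = c_arr.T                               # a view: no data is copied

# Shape and strides are swapped, so C order becomes Fortran order.
print(t.shape)                    # (5, 3)
print(t.flags['F_CONTIGUOUS'])    # True
print(t.flags['C_CONTIGUOUS'])    # False
```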

TIP

It is helpful to think of typed memoryviews as very flexible Cython-space objects that allow efficient sharing, indexing, and modification of homogeneous data. They have many of the core features of NumPy arrays, and what features they do not have are easily addressed by their efficient interoperability with NumPy.

But typed memoryviews go beyond the buffer protocol—they can be used to view C-level arrays as well.

ORIGINAL BUFFER SYNTAX

Before typed memoryviews, Cython had different syntax for working efficiently with NumPy arrays and other buffer-supporting objects. This original buffer syntax is still in use, but it has been superseded by typed memoryviews, which provide more features and cleaner syntax.

An example of the original buffer syntax, adapted from Cython’s online documentation, is:

cimport numpy as np

def convolve(np.ndarray[double, ndim=2] f,

             np.ndarray[double, ndim=2] g):

    cdef:

        np.ndarray[double, ndim=2] h

        # ...other static declarations elided...

    h = np.zeros((xmax, ymax), dtype=np.double)

The convolve function uses three NumPy buffers—f, g, and h—each of which is declared with Cython’s original NumPy buffer syntax. This syntax uses np.ndarray to declare the type of the object exposing the buffer interface, and places the C data type for the array’s elements inside square brackets after np.ndarray. Because these buffers are all two-dimensional, the ndim=2 attribute is included inside the square brackets.

The body of convolve loops over f and g to compute the two-dimensional convolution and store the result in h. The original buffer syntax also allows Cython to generate efficient indexing code.

We can translate convolve to use typed memoryviews instead. The body of convolve remains unchanged; only the array declarations need be modified:

def convolve(double[:, ::1] f, double[:, ::1] g):

    cdef:

        double[:, ::1] h

        # ...

    # ...

Here we use the syntax for C-contiguous typed memoryviews, which is appropriate for when we know the input arrays are standard unstrided arrays.

Besides a cleaner syntax, what benefits do typed memoryviews bring over the original syntax?

§  Typed memoryviews can work with a wider range of buffer-supporting objects: NumPy arrays, Python memoryview objects, array.array objects, and any other type that supports the new buffer protocol. They can also work with C arrays. They are therefore more general than the NumPy array buffer syntax, which is restricted to work with NumPy arrays only.

§  Typed memoryviews can be used in any scope. This includes module scope; arguments for def, cpdef, or cdef functions or methods; function or method local scope; and cdef class attribute scope. The NumPy buffer syntax can be used only in function-local scope and for def function arguments.

§  Typed memoryviews have many more options that provide precise control: contiguous or strided data packing, C or Fortran contiguity, and direct or indirect data access. Some of these options can be controlled on a dimension-by-dimension basis. The NumPy array buffer syntax does not provide this level of control.

§  In all circumstances, typed memoryviews match or exceed the original buffer syntax’s performance.

Updating the original buffer syntax to use typed memoryviews is straightforward, as we saw in the previous example. Besides the small time and testing investment required to update, there are very few (if any) reasons to prefer the original buffer syntax to typed memoryviews.

Beyond Buffers

So far, we have assigned various types of Python objects to typed memoryviews: NumPy ndarray objects, array.array objects, bytes objects, and bytearray objects. NumPy arrays are the most common in practice, given NumPy’s ubiquity, flexibility, and expressiveness. Beyond Python-space objects, however, typed memoryviews can also work with C-level arrays: either dynamic heap-allocated arrays or fixed-size stack-allocated arrays.

To view a C array with a memoryview, we simply assign the array to the memoryview. If the array is fixed size (or complete), the righthand side of the assignment can be the array’s name only. Cython has enough information to keep track of the array’s size:

cdef int a[3][5][7]
cdef int[:, :, ::1] mv = a
mv[...] = 0

In this example we declare mv as a C-contiguous memoryview, as fixed-size arrays are always C contiguous. The last line initializes the array a to all zeros, using slice assignment and broadcasting.
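The same slice-assignment-with-broadcasting idiom works on NumPy arrays from Python, which is handy for checking typed-memoryview code against a pure-Python version:

```python
import numpy as np

# Same shape as the cdef int a[3][5][7] declaration above.
a = np.empty((3, 5, 7), dtype=np.intc)

# Arrays allocated this way are C contiguous, like fixed-size C arrays.
assert a.flags['C_CONTIGUOUS']

# Slice assignment with broadcasting: the Python-level twin of mv[...] = 0.
a[...] = 0
```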

If we have a dynamically allocated C array rather than a fixed-size array, Cython does not know its extent, but we can still use it with typed memoryviews.

First, the dynamic array allocation:

from libc.stdlib cimport malloc

def dynamic(size_t N, size_t M):
    cdef long *arr = <long*>malloc(N * M * sizeof(long))

We can certainly use arr inside our function directly, but it would require that we manually do index calculations. For higher-dimensional arrays, this is inconvenient. Let’s interact with our dynamic array via the typed-memoryview interface.

Suppose we try to assign our dynamic array to a typed memoryview, as in the fixed-size array example:

def dynamic(size_t N, size_t M):
    cdef long *arr = <long*>malloc(N * M * sizeof(long))
    cdef long[:, ::1] mv = arr

This does not compile, resulting in the error: "Cannot convert long * to memoryviewslice". Part of the reason is that Cython knows only that arr is a long pointer. We have to give Cython more information to indicate that arr is convertible to a typed memoryview. That hint comes in the form of a typed memoryview cast:

def dynamic(size_t N, size_t M):
    cdef long *arr = <long*>malloc(N * M * sizeof(long))
    cdef long[:, ::1] mv = <long[:N, :M]>arr

We use the memoryview casting syntax, <long[:N, :M]>, to provide Cython with the information it needs to assign arr to our memoryview. Notice that the type in the cast uses slice notation with stop values for each dimension. The stop values are necessary to communicate to Cython the shape we intend the typed memoryview to have.
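In pure Python, np.frombuffer plays a similar role: it turns a raw, shapeless block of memory into an N-dimensional view, and like the memoryview cast, it trusts the shape we supply. A sketch using a ctypes buffer as a stand-in for the malloc'd array:

```python
import ctypes
import numpy as np

N, M = 3, 4

# Stand-in for malloc(N * M * sizeof(double)): a raw, zero-initialized block.
raw = (ctypes.c_double * (N * M))()

# Like the <long[:N, :M]> cast, we -- not the runtime -- assert the shape;
# frombuffer trusts us, so a wrong N or M means viewing the wrong memory.
mv = np.frombuffer(raw, dtype=np.float64).reshape(N, M)
mv[...] = 1.0
```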

WARNING

At the C level, there is no way to programmatically determine the length of a dynamically allocated C array via its head pointer. It is the responsibility of the programmer to know the right extent of the C array when casting a C array to a typed memoryview. If this is incorrect, buffer overruns, segmentation faults, or data corruption may result.

This rounds out the features of typed memoryviews and shows how they can be used with either buffer-supporting Python objects or C-level arrays, whether fixed size or dynamic. If a Cython function has a typed memoryview argument, it can be called with either Python objects or C arrays as arguments.

When returning a typed memoryview in a def function, Cython converts it to a regular Python memoryview without copying the buffer. In the preceding dynamic function, returning mv will work: the underlying arr C array is heap allocated, so it is not tied to the function’s scope. If arr were fixed size (and therefore stack allocated), then it would be tied to the call stack, and returning a memoryview that viewed the array would be erroneous.

But there is still an issue with memoryviews that view heap-allocated C arrays: who is responsible for freeing the array when the memoryview is no longer needed? A related question: when a C or C++ library returns a dynamically allocated array, how can we return it as a NumPy array, and how can we properly manage its finalization?

Wrapping C and C++ Arrays

Suppose a C function make_matrix_c returns a dynamically allocated C array. Its declaration in Cython would be something like:

cdef extern from "matrix.h":
    float *make_matrix_c(int nrows, int ncols)

Suppose also that we want to return a NumPy array that views this array, allowing interaction with the underlying data from Python. Using what we know of typed memoryviews—and setting aside proper cleanup for the moment—we can use memoryviews to easily do what we want:

import numpy as np

def make_matrix(int nrows, int ncols):
    cdef float[:, ::1] mv = <float[:nrows, :ncols]>make_matrix_c(nrows, ncols)
    return np.asarray(mv)

This compiles and allows NumPy access to the C array, but it leaks memory. How do we properly clean up after ourselves?
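The np.asarray(mv) call shares the buffer rather than copying it, which is why cleanup must be handled deliberately. The no-copy behavior is easy to verify from Python with a built-in memoryview:

```python
import numpy as np

buf = bytearray(8)
arr = np.asarray(memoryview(buf))  # shares buf's memory; no copy is made

arr[0] = 255
print(buf[0])  # -> 255: writes through the array are visible in the bytearray
```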

Correct (and Automatic) Memory Management with Cython and C Arrays

First, we know (by construction) that we are responsible for this memory. If there is a possibility that we are sharing this array with other C code, then properly handling the shared array can become tricky. The difficult part is communicating to all interested parties who is responsible for cleanup. Because C has no automatic memory management features (like C++ shared pointers, for example), ensuring proper cleanup can be challenging. Often the cleanest solution in these situations is to make a copy of the data to clarify ownership semantics.

Knowing that we own this C array and are responsible for freeing it, how do we do so properly from Python? The C array is viewed by a NumPy array. What we need is a way to automatically call the right destructor when the last viewing NumPy array is finalized by the Python runtime.

The NumPy/C API defines a base attribute on the PyArrayObject, which is designed for just this purpose. According to NumPy’s documentation, “If you are constructing an array using the C API, and specifying your own memory, you should use the function PyArray_SetBaseObject to set the base to an object which owns the memory.” We will use a Cython-provided function rather than PyArray_SetBaseObject to accomplish the same end.
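The effect of a properly set base can be seen from pure Python with np.frombuffer, which records the buffer-owning object in the new array's base attribute; that reference is what keeps the memory alive for as long as the array exists:

```python
import numpy as np

owner = bytearray(4 * 4)                      # the object that owns the memory
arr = np.frombuffer(owner, dtype=np.float32)  # arr.base refers back to owner

view = arr[1:3]                               # views chain back through .base
```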

First, we need access to NumPy’s C API. We can cimport numpy (mind the c) to access NumPy’s C interface. Let’s give it an alias to keep it distinct from the Python-level numpy package we already imported:

import numpy as np
cimport numpy as cnp

We know from Chapter 6 that the cimport numpy as cnp statement is a compile-time operation that gives us access to C-level constructs. Cython includes a numpy package alongside the libc and libcpp packages that are used by cimport.

We need to set the base to “an object which owns the memory.” We can create a minimal extension type that does just that. It needs just one attribute to hold a reference to the array, and just one method, __dealloc__. This is the object that owns the memory, and its sole purpose is to call free on the array at finalization. Let’s call it _finalizer:

from libc.stdlib cimport free

cdef class _finalizer:
    cdef void *_data
    def __dealloc__(self):
        print "_finalizer.__dealloc__"
        if self._data is not NULL:
            free(self._data)

With our _finalizer class, we have everything we need to properly manage memory. The print statement is there just to ensure the array is deallocated appropriately. We can now create a convenience cdef function that creates a _finalizer and uses the set_array_base function from Cython’s numpy C interface:

cdef void set_base(cnp.ndarray arr, void *carr):
    cdef _finalizer f = _finalizer()
    f._data = <void*>carr
    cnp.set_array_base(arr, f)

This function first creates an empty _finalizer object, then initializes its _data attribute, and lastly calls set_array_base.

Returning to our make_matrix function, we can use set_base to tie everything together:

def make_matrix(int nrows, int ncols):
    cdef float *mat = make_matrix_c(nrows, ncols)
    cdef float[:, ::1] mv = <float[:nrows, :ncols]>mat
    cdef cnp.ndarray arr = np.asarray(mv)
    set_base(arr, mat)
    return arr

The first line of our function calls make_matrix_c and stores the result in a float pointer. The next line creates a C-contiguous typed memoryview from the mat array.

The next line creates a NumPy array from our typed memoryview; this uses the buffer protocol behind the scenes to share the underlying C array. Then we use our set_base helper function to set the base attribute of our NumPy array to a _finalizer object. This ties everything together properly, and we can return our NumPy array as a result.

If we name our extension module numpy_cleanup.pyx, we can compile it using a distutils script:

from distutils.core import setup, Extension
from Cython.Build import cythonize
from numpy import get_include

ext = Extension("numpy_cleanup", ["numpy_cleanup.pyx"],
                include_dirs=['.', get_include()])

setup(name="numpy_cleanup",
      ext_modules=cythonize(ext))

Because we use the NumPy/C API (via the cimport numpy as cnp statement), we need to include some NumPy headers when compiling. That is the reason for the include_dirs option to the Extension call. NumPy provides a get_include function that returns the full path to its include directory.

After compiling:

$ python setup.py build_ext -i
running build_ext
building 'numpy_cleanup' extension
gcc -fno-strict-aliasing -fno-common -dynamic -g -O2
    -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I.
    -I/Users/ksmith/PY/lib/python2.7/site-packages/numpy/core/include
    -I/Users/ksmith/Devel/PY64/Python.framework/Versions/2.7/include/python2.7
    -c numpy_cleanup.c -o build/temp.macosx-10.4-x86_64-2.7/numpy_cleanup.o
gcc -bundle -undefined dynamic_lookup
    build/temp.macosx-10.4-x86_64-2.7/numpy_cleanup.o
    -o /Users/ksmith/examples/memviews/numpy_cleanup.so

We can try out our make_matrix from IPython:

$ ipython --no-banner

In [1]: import numpy_cleanup

In [2]: arr = numpy_cleanup.make_matrix(100, 100)

Let’s check the base attribute:

In [3]: arr.base
Out[3]: <numpy_cleanup._finalizer at 0x100284eb8>

What we’re interested in is that the finalizer’s __dealloc__ method is called at cleanup time. We can force IPython to wipe out any references to the arr NumPy array with %reset:

In [4]: %reset
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
_finalizer.__dealloc__

We have the satisfaction of seeing the "_finalizer.__dealloc__" string output, indicating the array was, indeed, freed. It is left as an exercise for the reader to confirm that the finalizer’s __dealloc__ is called even when there are multiple views of the array.
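The multiple-views exercise can be sketched in pure Python with np.frombuffer, whose base reference plays the role of our _finalizer: the owning object is finalized only when the last view of it disappears. (The Owner subclass and weakref bookkeeping here are illustrative scaffolding, not part of the chapter's Cython code.)

```python
import gc
import weakref

import numpy as np

class Owner(bytearray):
    """bytearray subclass so we can attach a weak reference to it."""

owner = Owner(16)
alive = weakref.ref(owner)

a = np.frombuffer(owner, dtype=np.uint8)  # a's base chain keeps owner alive
b = a[::2]                                # b.base chains to a, then to owner

del owner
gc.collect()
print(alive() is not None)  # True: the views still keep the owner alive

del a, b
gc.collect()
print(alive() is None)      # True: last view gone, owner finalized
```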

There is a lot going on here. Interlanguage programming can require more effort to properly manage memory and resources, but Cython has the features and functionality to make it straightforward. The fact that we can do these low-level operations at the Cython level and do not have to resort to pure-C code saves us a tremendous amount of work. This is another instance of Cython making difficult things possible.

It is worth emphasizing that the most common use case is to use NumPy arrays to manage data, and to use the basic features of typed memoryviews to efficiently access and modify these NumPy arrays from Cython.

Summary

In this chapter we learned all about Cython’s features for working with NumPy arrays, array.array objects, and objects that support the new buffer protocol. The central figure was Cython’s typed memoryview, which provides a consistent abstraction that works with all of these Python types and gives us efficient C-level access to buffer elements. Typed memoryviews both use and support the buffer protocol, so they do not copy memory unnecessarily. They are highly efficient: we saw a simple example where using typed memoryviews provided a speedup of multiple orders of magnitude over pure Python.

We also learned how typed memoryviews can easily work with C and C++ arrays, either fixed size or dynamic. To pull everything together, we saw an example that uses a typed memoryview and a NumPy array to view a dynamically allocated C array. This required that we dip into the NumPy/C API to ensure that the dynamic memory is properly finalized at the appropriate time.


[19] The new buffer protocol is also referred to as PEP-3118, referring to the Python Enhancement Proposal that is the protocol’s authoritative source of documentation.

[20] To follow along with the examples in this chapter, please see https://github.com/cythonbook/examples.