High Performance Python (2014)
Chapter 6. Matrix and Vector Computation
QUESTIONS YOU’LL BE ABLE TO ANSWER AFTER THIS CHAPTER
§ What are the bottlenecks in vector calculations?
§ What tools can I use to see how efficiently the CPU is doing my calculations?
§ Why is numpy better at numerical calculations than pure Python?
§ What are cachemisses and pagefaults?
§ How can I track the memory allocations in my code?
Regardless of what problem you are trying to solve on a computer, you will encounter vector computation at some point. Vector calculations are integral to how a computer works and how it tries to speed up runtimes of programs down at the silicon level—the only thing the computer knows how to do is operate on numbers, and knowing how to do several of those calculations at once will speed up your program.
In this chapter, we try to unwrap some of the complexities of this problem by focusing on a relatively simple mathematical problem, solving the diffusion equation, and understanding what is happening at the CPU level. By understanding how different Python code affects the CPU and how to effectively probe these things, we can learn how to understand other problems as well.
We will start by introducing the problem and coming up with a quick solution using pure Python. After identifying some memory issues and trying to fix them using pure Python, we will introduce numpy and identify how and why it speeds up our code. Then we will start doing some algorithmic changes and specialize our code to solve the problem at hand. By removing some of the generality of the libraries we are using, we will yet again be able to gain more speed. Finally, we introduce some extra modules that will help facilitate this sort of process out in the field, and also explore a cautionary tale about optimizing before profiling.
Introduction to the Problem
NOTE
This section is meant to give a deeper understanding of the equations we will be solving throughout the chapter. It is not strictly necessary to understand this section in order to approach the rest of the chapter. If you wish to skip this section, make sure to look at the algorithm in Examples 61 and 62 to understand the code we will be optimizing.
On the other hand, if you read this section and want even more explanation, read Chapter 17 of Numerical Recipes, Third Edition, by William Press et al. (Cambridge University Press).
In order to explore matrix and vector computation in this chapter, we will repeatedly use the example of diffusion in fluids. Diffusion is one of the mechanisms that moves fluids and tries to make them uniformly mixed.
In this section we will explore the mathematics behind the diffusion equation. This may seem complicated, but don’t worry! We will quickly simplify this to make it more understandable. Also, it is important to note that while having a basic understanding of the final equation we are solving will be useful while reading this chapter, it is not completely necessary; the subsequent chapters will focus mainly on various formulations of the code, not the equation. However, having an understanding of the equations will help you gain intuition about ways of optimizing your code. This is true in general—understanding the motivation behind your code and the intricacies of the algorithm will give you deeper insight about possible methods of optimization.
One simple example of diffusion is dye in water: if you put several drops of dye into water at room temperature it will slowly move out until it fully mixes with the water. Since we are not stirring the water, nor is it warm enough to create convection currents, diffusion will be the main process mixing the two liquids. When solving these equations numerically, we pick what we want the initial condition to look like and are able to evolve the initial condition forward in time to see what it will look like at a later time (see Figure 62).
All this being said, the most important thing to know about diffusion for our purposes is its formulation. Stated as a partial differential equation in one dimension (1D), the diffusion equation is written as:
In this formulation, u is the vector representing the quantities we are diffusing. For example, we could have a vector with values of 0 where there is only water and 1 where there is only dye (and values in between where there is mixing). In general, this will be a 2D or 3D matrix representing an actual area or volume of fluid. In this way, we could have u be a 3D matrix representing the fluid in a glass, and instead of simply doing the second derivative along the x direction, we’d have to take it over all axes. In addition, D is a physical value that represents properties of the fluid we are simulating. A large value of D represents a fluid that can diffuse very easily. For simplicity, we will set D = 1 for our code, but still include it in the calculations.
NOTE
The diffusion equation is also called the heat equation. In this case, u represents the temperature of a region and D describes how well the material conducts heat. Solving the equation tells us how the heat is being transferred. So, instead of solving for how a couple of drops of dye diffuse through water, we might be solving for how the heat generated by a CPU diffuses into a heat sink.
What we will do is take the diffusion equation, which is continuous in space and time, and approximate it using discrete volumes and discrete times. We will do so using Euler’s method. Euler’s method simply takes the derivative and writes it as a difference, such that:
where dt is now a fixed number. This fixed number represents the time step, or the resolution in time for which we wish to solve this equation. It can be thought of as the frame rate of the movie we are trying to make. As the frame rate goes up (or dt goes down), we get a clearer picture of what happens. In fact, as dt approaches zero, Euler’s approximation becomes exact (note, however, that this exactness can only be achieved theoretically, since there is only finite precision on a computer and numerical errors will quickly dominate any results). We can thus rewrite this equation to figure out what u(x, t+dt) is given u(x,t). What this means for us is that we can start with some initial state (u(x,0), representing the glass of water just as we add a drop of dye into it) and churn through the mechanisms we’ve outlined to “evolve” that initial state and see what it will look like at future times (u(x,dt)). This type of problem is called an “initial value problem” or “Cauchy problem.”
Doing a similar trick for the derivative in x using the finite differences approximation, we arrive at the final equation:
Here, similar to how dt represents the frame rate, dx represents the resolution of the images—the smaller dx is the smaller a region every cell in our matrix represents. For simplicity, we will simply set D = 1 and dx = 1. These two values become very important when doing proper physical simulations; however, since we are simply solving the diffusion equation for illustrative purposes they are not important to us.
Using this equation, we can solve almost any diffusion problem. However, there are some considerations regarding this equation. Firstly, we said before that the spatial index in u (i.e., the x parameter) will be represented as the indices into a matrix. What happens when we try to find the value at xdx when x is at the beginning of the matrix? This problem is called the boundary condition. You can have fixed boundary conditions that say, “any value out of the bounds of my matrix will be set to 0” (or any other value). Alternatively, you can have periodic boundary conditions that say that the values will wrap around (i.e., if one of the dimensions of your matrix has length N, the value in that dimension at index 1 is the same as at N1, and the value at N is the same as at index 0. In other words, if you are trying to access the value at index i, you will get the value at index (i%N)).
Another consideration is how we are going to store the multiple time components of u. We could have one matrix for every time value we do our calculation for. At minimum, it seems that we will need two matrices: one for the current state of the fluid and one for the next state of the fluid. As we’ll see, there are very drastic performance considerations for this particular question.
So, what does it look like to solve this problem in practice? Example 61 contains some pseudocode to illustrate the way we can use our equation to solve the problem.
Example 61. Pseudocode for 1D diffusion
# Create the initial conditions
u = vector of length N
for i inrange(N):
u = 0 if there iswater, 1 if there isdye
# Evolve the initial conditions
D = 1
t = 0
dt = 0.0001
while True:
print "Current time is: %f" % t
unew = vector of size N
# Update step for every cell
for i inrange(N):
unew[i] = u[i] + D * dt * (u[(i+1)%N] + u[(i1)%N]  2 * u[i])
# Move the updated solution into u
u = unew
visualize(u)
This code will take some initial condition of the dye in water and tell us what the system looks like at every 0.0001second interval in the future. The results of this can be seen in Figure 61, where we evolve our very concentrated drop of dye (represented by the tophat function) into the future. We can see how, far into the future, the dye becomes well mixed, to the point where everywhere has a similar concentration of the dye.
Figure 61. Example of 1D diffusion
For the purposes of this chapter, we will be solving the 2D version of the preceding equation. All this means is that instead of operating over a vector (or in other words, a matrix with one index), we will be operating over a 2D matrix. The only change to the equation (and thus the subsequent code) is that we must now also take the second derivative in the y direction. This simply means that the original equation we were working with becomes:
This numerical diffusion equation in 2D translates to the pseudocode in Example 62, using the same methods we used before.
Example 62. Algorithm for calculating 2D diffusion
for i inrange(N):
for j inrange(M):
unew[i][j] = u[i][j] + dt * ( \
(u[(i+1)%N][j] + u[(i1)%N][j]  2 * u[i][j]) + \ # d^2 u / dx^2
(u[i][(j+1)%M] + u[j][(j1)%M]  2 * u[i][j]) \ # d^2 u / dy^2
)
We can now put all of this together and write the full Python 2D diffusion code that we will use as the basis for our benchmarks for the rest of this chapter. While the code looks more complicated, the results are similar to that of the 1D diffusion (as can be seen in Figure 62).
Figure 62. Example of diffusion for two sets of initial conditions
If you’d like to do some additional reading on the topics in this section, check out the Wikipedia page on the diffusion equation and Chapter 7 of “Numerical methods for complex systems,” by S. V. Gurevich.
Aren’t Python Lists Good Enough?
Let’s take our pseudocode from Example 61 and formalize it so we can better analyze its runtime performance. The first step is to write out the evolution function that takes in the matrix and returns its evolved state. This is shown in Example 63.
Example 63. Pure Python 2D diffusion
grid_shape = (1024, 1024)
def evolve(grid, dt, D=1.0):
xmax, ymax = grid_shape
new_grid = [[0.0,] * ymax for x inxrange(xmax)]
for i inxrange(xmax):
for j inxrange(ymax):
grid_xx = grid[(i+1)%xmax][j] + grid[(i1)%xmax][j]  2.0 * grid[i][j]
grid_yy = grid[i][(j+1)%ymax] + grid[i][(j1)%ymax]  2.0 * grid[i][j]
new_grid[i][j] = grid[i][j] + D * (grid_xx + grid_yy) * dt
return new_grid
NOTE
Instead of preallocating the new_grid list, we could have built it up in the for loop by using appends. While this would have been noticeably faster than what we have written, the conclusions we draw are still applicable. We chose this method because it is more illustrative.
The global variable grid_shape designates how big a region we will simulate; and, as explained in Introduction to the Problem, we are using periodic boundary conditions (which is why we use modulo for the indices). In order to actually use this code, we must initialize a grid and callevolve on it. The code in Example 64 is a very generic initialization procedure that will be reused throughout the chapter (its performance characteristics will not be analyzed since it must only run once, as opposed to the evolve function, which is called repeatedly).
Example 64. Pure Python 2D diffusion initialization
def run_experiment(num_iterations):
# Setting up initial conditions
xmax, ymax = grid_shape
grid = [[0.0,] * ymax for x inxrange(xmax)]
block_low = int(grid_shape[0] * .4)
block_high = int(grid_shape[0] * .5)
for i inxrange(block_low, block_high):
for j inxrange(block_low, block_high):
grid[i][j] = 0.005
# Evolve the initial conditions
start = time.time()
for i inrange(num_iterations):
grid = evolve(grid, 0.1)
return time.time()  start
The initial conditions used here are the same as in the square example in Figure 62.
The values for dt and grid elements have been chosen to be sufficiently small that the algorithm is stable. See William Press et al.’s Numerical Recipes, Third Edition, for a more indepth treatment of this algorithm’s convergence characteristics.
Problems with Allocating Too Much
By using line_profiler on the pure Python evolution function, we can start to unravel what is contributing to a possibly slow runtime. Looking at the profiler output in Example 65, we see that most of the time spent in the function is spent doing the derivative calculation and updating the grid.^{[}9^{]} This is what we want, since this is a purely CPUbound problem—any time spent not solving the CPUbound problem is an obvious place for optimization.
Example 65. Pure Python 2D diffusion profiling
$ kernprof.py lv diffusion_python.py
Wrote profile results to diffusion_python.py.lprof
Timer unit: 1e06 s
File: diffusion_python.py
Function: evolve at line 8
Total time: 16.1398 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def evolve(grid, dt, D=1.0):
10 10 39 3.9 0.0 xmax, ymax = grid_shape #
11 2626570 2159628 0.8 13.4 new_grid = ...
12 5130 4167 0.8 0.0 for i in xrange(xmax): #
13 2626560 2126592 0.8 13.2 for j in xrange(ymax):
14 2621440 4259164 1.6 26.4 grid_xx = ...
15 2621440 4196964 1.6 26.0 grid_yy = ...
16 2621440 3393273 1.3 21.0 new_grid[i][j] = ...
17 10 10 1.0 0.0 return grid #
This statement takes such a long time per hit because grid_shape must be retrieved from the local namespace (see Dictionaries and Namespaces for more information).
This line has 5,130 hits associated with it, which means the grid we operated over had xmax = 512. This comes from 512 evaluations for each value in xrange plus one evaluation of the termination condition of the loop, and all this happened 10 times.
This line has 10 hits associated with it, which informs us that the function was profiled over 10 runs.
However, the output also shows that we are spending 20% of our time allocating the new_grid list. This is a waste, because the properties of new_grid do not change—no matter what values we send to evolve, the new_grid list will always be the same shape and size and contain the same values. A simple optimization would be to allocate this list once and simply reuse it. This sort of optimization is similar to moving repetitive code outside of a fast loop:
from math import sin
def loop_slow(num_iterations):
"""
>>> %timeit loop_slow(int(1e4))
100 loops, best of 3: 2.67 ms per loop
"""
result = 0
for i inxrange(num_iterations):
result += i * sin(num_iterations) #
return result
def loop_fast(num_iterations):
"""
>>> %timeit loop_fast(int(1e4))
1000 loops, best of 3: 1.38 ms per loop
"""
result = 0
factor = sin(num_iterations)
for i inxrange(num_iterations):
result += i
return result * factor
The value of sin(num_iterations) doesn’t change throughout the loop, so there is no use recalculating it every time.
We can do a similar transformation to our diffusion code, as illustrated in Example 66. In this case, we would want to instantiate new_grid in Example 64 and send it in to our evolve function. That function will do the same as it did before: read the grid list and write to the new_gridlist. Then, we can simply swap new_grid with grid and continue again.
Example 66. Pure Python 2D diffusion after reducing memory allocations
def evolve(grid, dt, out, D=1.0):
xmax, ymax = grid_shape
for i inxrange(xmax):
for j inxrange(ymax):
grid_xx = grid[(i+1)%xmax][j] + grid[(i1)%xmax][j]  2.0 * grid[i][j]
grid_yy = grid[i][(j+1)%ymax] + grid[i][(j1)%ymax]  2.0 * grid[i][j]
out[i][j] = grid[i][j] + D * (grid_xx + grid_yy) * dt
def run_experiment(num_iterations):
# Setting up initial conditions
xmax,ymax = grid_shape
next_grid = [[0.0,] * ymax for x inxrange(xmax)]
grid = [[0.0,] * ymax for x inxrange(xmax)]
block_low = int(grid_shape[0] * .4)
block_high = int(grid_shape[0] * .5)
for i inxrange(block_low, block_high):
for j inxrange(block_low, block_high):
grid[i][j] = 0.005
start = time.time()
for i inrange(num_iterations):
evolve(grid, 0.1, next_grid)
grid, next_grid = next_grid, grid
return time.time()  start
We can see from the line profile of the modified version of the code in Example 67^{[}10^{]} that this small change has given us a 21% speedup. This leads us to a conclusion similar to the conclusion made during our discussion of append operations on lists (see Lists as Dynamic Arrays): memory allocations are not cheap. Every time we request memory in order to store a variable or a list, Python must take its time to talk to the operating system in order to allocate the new space, and then we must iterate over the newly allocated space to initialize it to some value. Whenever possible, reusing space that has already been allocated will give performance speedups. However, be careful when implementing these changes. While the speedups can be substantial, as always, you should profile to make sure you are achieving the results you want and are not simply polluting your code base.
Example 67. Line profiling Python diffusion after reducing allocations
$ kernprof.py lv diffusion_python_memory.py
Wrote profile results to diffusion_python_memory.py.lprof
Timer unit: 1e06 s
File: diffusion_python_memory.py
Function: evolve at line 8
Total time: 13.3209 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def evolve(grid, dt, out, D=1.0):
10 10 15 1.5 0.0 xmax, ymax = grid_shape
11 5130 3853 0.8 0.0 for i in xrange(xmax):
12 2626560 1942976 0.7 14.6 for j in xrange(ymax):
13 2621440 4059998 1.5 30.5 grid_xx = ...
14 2621440 4038560 1.5 30.3 grid_yy = ...
15 2621440 3275454 1.2 24.6 out[i][j] = ...
Memory Fragmentation
The Python code we wrote in Example 66 still has a problem that is at the heart of using Python for these sorts of vectorized operations: Python doesn’t natively support vectorization. There are two reasons for this: Python lists store pointers to the actual data, and Python bytecode is not optimized for vectorization, so for loops cannot predict when using vectorization would be beneficial.
The fact that Python lists store pointers means that instead of actually holding the data we care about, lists store locations where that data can be found. For most uses, this is good because it allows us to store whatever type of data we like inside of a list. However, when it comes to vector and matrix operations, this is a source of a lot of performance degradation.
The degradation occurs because every time we want to fetch an element from the grid matrix, we must do multiple lookups. For example, doing grid[5][2] requires us to first do a list lookup for index 5 on the list grid. This will return a pointer to where the data at that location is stored. Then we need to do another list lookup on this returned object, for the element at index 2. Once we have this reference, we have the location where the actual data is stored.
The overhead for one such lookup is not big and can be, in most cases, disregarded. However, if the data we wanted was located in one contiguous block in memory, we could move all of the data in one operation instead of needing two operations for each element. This is one of the major points with data fragmentation: when your data is fragmented, you must move each piece over individually instead of moving the entire block over. This means you are invoking more memory transfer overhead, and you are forcing the CPU to wait while data is being transferred. We will see with perf just how important this is when looking at the cachemisses.
This problem of getting the right data to the CPU when it is needed is related to the socalled “Von Neumann bottleneck.” This refers to the fact that there is limited bandwidth between the memory and the CPU as a result of the tiered memory architecture that modern computers use. If we could move data infinitely fast, we would not need any cache, since the CPU could retrieve any data it needed instantly. This would be a state where the bottleneck is nonexistent.
Since we can’t move data infinitely fast, we must prefetch data from RAM and store it in smaller but faster CPU caches so that hopefully, when the CPU needs a piece of data, it will be located in a location that can be read from quickly. While this is a severely idealized way of looking at the architecture, we can still see some of the problems with it—how do we know what data will be needed in the future? The CPU does a good job with mechanisms called branch prediction and pipelining, which try to predict the next instruction and load the relevant portions of memory into the cache while still working on the current instruction. However, the best way to minimize the effects of the bottleneck is to be smart about how we allocate our memory and how we do our calculations over our data.
Probing how well memory is being moved to the CPU can be quite hard; however, in Linux the perf tool can be used to get amazing amounts of insight into how the CPU is dealing with the program being run. For example, we can run perf on the pure Python code from Example 66 and see just how efficiently the CPU is running our code. The results are shown in Example 68. Note that the output in this and the following perf examples has been truncated to fit within the margins of the page. The removed data included variances for each measurement, indicating how much the values changed throughout multiple benchmarks. This is useful for seeing how much a measured value is dependent on the actual performance characteristics of the program versus other system properties, such as other running programs using system resources.
Example 68. Performance counters for pure Python 2D diffusion with reduced memory allocations
$ perf stat e cycles,stalledcyclesfrontend,stalledcyclesbackend,instructions,\
cachereferences,cachemisses,branches,branchmisses,taskclock,faults,\
minorfaults,cs,migrations r 3 python diffusion_python_memory.py
Performance counter stats for 'python diffusion_python_memory.py' (3 runs):
329,155,359,015 cycles # 3.477 GHz
76,800,457,550 stalledcyclesfrontend # 23.33% frontend cycles idle
46,556,100,820 stalledcyclesbackend # 14.14% backend cycles idle
598,135,111,009 instructions # 1.82 insns per cycle
# 0.13 stalled cycles per insn
35,497,196 cachereferences # 0.375 M/sec
10,716,972 cachemisses # 30.191 % of all cache refs
133,881,241,254 branches # 1414.067 M/sec
2,891,093,384 branchmisses # 2.16% of all branches
94678.127621 taskclock # 0.999 CPUs utilized
5,439 pagefaults # 0.057 K/sec
5,439 minorfaults # 0.057 K/sec
125 contextswitches # 0.001 K/sec
6 CPUmigrations # 0.000 K/sec
94.749389121 seconds time elapsed
Understanding perf
Let’s take a second to understand the various performance metrics that perf is giving us and their connection to our code. The taskclock metric tells us how many clock cycles our task took. This is different from the total runtime, because if our program took 1 second to run but used two CPUs, then the taskclock would be 1000 (taskclock is generally in milliseconds). Conveniently, perf does the calculation for us and tells us, next to this metric, how many CPUs were utilized. This number isn’t exactly 1 because there were periods where the process relied on other subsystems to do instructions for it (for example, when memory was allocated).
contextswitches and CPUmigrations tell us about how the program is halted in order to wait for a kernel operation to finish (such as I/O), let other applications run, or to move execution to another CPU core. When a contextswitch happens, the program’s execution is halted and another program is allowed to run instead. This is a very timeintensive task and is something we would like to minimize as much as possible, but we don’t have too much control over when this happens. The kernel delegates when programs are allowed to be switched out; however, we can do things to disincentivize the kernel from moving our program. In general, the kernel suspends a program when it is doing I/O (such as reading from memory, disk, or the network). As we’ll see in later chapters, we can use asynchronous routines to make sure that our program still uses the CPU even when waiting for I/O, which will let us keep running without being contextswitched. In addition, we could set the nice value of our program in order to give our program priority and stop the kernel from context switching it. Similarly, CPUmigrations happen when the program is halted and resumed on a different CPU than the one it was on before, in order to have all CPUs have the same level of utilization. This can be seen as an especially bad context switch, since not only is our program being temporarily halted, but we lose whatever data we had in the L1 cache (recall that each CPU has its own L1 cache).
A pagefault is part of the modern Unix memory allocation scheme. When memory is allocated, the kernel doesn’t do much except give the program a reference to memory. However, later, when the memory is first used, the operating system throws a minor page fault interrupt, which pauses the program that is being run and properly allocates the memory. This is called a lazy allocation system. While this method is quite an optimization over previous memory allocation systems, minor page faults are quite an expensive operation since most of the operations are done outside the scope of the program you are running. There is also a major page fault, which happens when the program requests data from a device (disk, network, etc.) that hasn’t been read yet. These are even more expensive operations, since not only do they interrupt your program, but they also involve reading from whichever device the data lives on. This sort of page fault does not generally affect CPUbound work; however, it will be a source of pain for any program that does disk or network reads/writes.
Once we have data in memory and we reference it, the data makes its way through the various tiers of memory (see Communications Layers for a discussion of this). Whenever we reference data that is in our cache, the cachereferences metric increases. If we did not already have this data in the cache and need to fetch it from RAM, this counts as a cachemiss. We won’t get a cache miss if we are reading data we have read recently (that data will still be in the cache), or data that is located near data we have recently (data is sent from RAM into the cache in chunks). Cache misses can be a source of slowdowns when it comes to CPUbound work, since not only do we need to wait to fetch the data from RAM, but we also interrupt the flow of our execution pipeline (more on this in a second). We will be discussing how to reduce this effect by optimizing the layout of data in memory later in this chapter.
instructions tells us how many instructions our code is issuing to the CPU. Because of pipelining, these instructions can be run several at a time, which is what the insns per cycle annotation tells us. To get a better handle on this pipelining, stalledcyclesfrontend andstalledcyclesbackend tell us how many cycles our program was waiting for the frontend or backend of the pipeline to be filled. This can happen because of a cache miss, a mispredicted branch, or a resource conflict. The frontend of the pipeline is responsible for fetching the next instruction from memory and decoding it into a valid operation, while the backend is responsible for actually running the operation. With pipelining, the CPU is able to run the current operation while fetching and preparing the next one.
A branch is a time in the code where the execution flow changes. Think of an if..then statement—depending on the result of the conditional, we will either be executing one section of code or another. This is essentially a branch in the execution of the code—the next instruction in the program could be one of two things. In order to optimize this, especially with regard to the pipeline, the CPU tries to guess which direction the branch will take and preload the relevant instructions. When this prediction is incorrect, we will get some stalledcycles and a branchmiss. Branch misses can be quite confusing and result in many strange effects (for example, some loops will run substantially faster on sorted lists than unsorted lists simply because there will be fewer branch misses).
NOTE
If you would like a more thorough explanation of what is going on at the CPU level with the various performance metrics, check out Gurpur M. Prabhu’s fantastic “Computer Architecture Tutorial.” It deals with the problems at a very low level, which gives you a good understanding of what is going on under the hood when you run your code.
Making Decisions with perf’s Output
With all this in mind, the performance metrics in Example 68 are telling us that while running our code, the CPU had to reference the L1/L2 cache 35,497,196 times. Of those references, 10,716,972 (or 30.191%) were requests for data that wasn’t in memory at the time and had to be retrieved. In addition, we can see that in each CPU cycle we are able to perform an average of 1.82 instructions, which tells us the total speed boost from pipelining, outoforder execution, and hyperthreading (or any other CPU feature that lets you run more than one instruction per clock cycle).
Fragmentation increases the number of memory transfers to the CPU. Additionally, since you don’t have multiple pieces of data ready in the CPU cache when a calculation is requested, it means you cannot vectorize the calculations. As explained in Communications Layers, vectorization of computations (or having the CPU do multiple computations at a time) can only occur if we can fill the CPU cache with all the relevant data. Since the bus can only move contiguous chunks of memory, this is only possible if the grid data is stored sequentially in RAM. Since a list stores pointers to data instead of the actual data, the actual values in the grid are scattered throughout memory and cannot be copied all at once.
We can alleviate this problem by using the array module instead of lists. These objects store data sequentially in memory, so that a slice of the array actually represents a continuous range in memory. However, this doesn’t completely fix the problem—now we have data that is stored sequentially in memory, but Python still does not know how to vectorize our loops. What we would like is for any loop that does arithmetic on our array one element at a time to work on chunks of data, but as mentioned previously, there is no such bytecode optimization in Python (partly due to the extremely dynamic nature of the language).
NOTE
Why doesn’t having the data we want stored sequentially in memory automatically give us vectorization? If we looked at the raw machine code that the CPU is running, vectorized operations (such as multiplying two arrays) use a different part of the CPU, and different instructions, than nonvectorized operations. In order for Python to use these special instructions, we must have a module that was created to use them. We will soon see how numpy gives us access to these specialized instructions.
Furthermore, because of implementation details, using the array type when creating lists of data that must be iterated on is actually slower than simply creating a list. This is because the array object stores a very lowlevel representation of the numbers it stores, and this must be converted into a Pythoncompatible version before being returned to the user. This extra overhead happens every time you index an array type. That implementation decision has made the array object less suitable for math and more suitable for storing fixedtype data more efficiently in memory.
Enter numpy
In order to deal with the fragmentation we found using perf, we must find a package that can efficiently vectorize operations. Luckily, numpy has all of the features we need—it stores data in contiguous chunks of memory and supports vectorized operations on its data. As a result, any arithmetic we do on numpy arrays happens in chunks without us having to explicitly loop over each element. Not only does this make it much easier to do matrix arithmetic this way, but it is also faster. Let’s look at an example:
from array import array
import numpy
def norm_square_list(vector):
"""
>>> vector = range(1000000)
>>> %timeit norm_square_list(vector_list)
1000 loops, best of 3: 1.16 ms per loop
"""
norm = 0
for v invector:
norm += v*v
return norm
def norm_square_list_comprehension(vector):
"""
>>> vector = range(1000000)
>>> %timeit norm_square_list_comprehension(vector_list)
1000 loops, best of 3: 913 µs per loop
"""
return sum([v*v for v invector])
def norm_squared_generator_comprehension(vector):
"""
>>> vector = range(1000000)
>>> %timeit norm_square_generator_comprehension(vector_list)
1000 loops, best of 3: 747 μs per loop
"""
return sum(v*v for v invector)
def norm_square_array(vector):
"""
>>> vector = array('l', range(1000000))
>>> %timeit norm_square_array(vector_array)
1000 loops, best of 3: 1.44 ms per loop
"""
norm = 0
for v invector:
norm += v*v
return norm
def norm_square_numpy(vector):
"""
>>> vector = numpy.arange(1000000)
>>> %timeit norm_square_numpy(vector_numpy)
10000 loops, best of 3: 30.9 µs per loop
"""
return numpy.sum(vector * vector) #
def norm_square_numpy_dot(vector):
"""
>>> vector = numpy.arange(1000000)
>>> %timeit norm_square_numpy_dot(vector_numpy)
10000 loops, best of 3: 21.8 µs per loop
"""
return numpy.dot(vector, vector) #
This creates two implied loops over vector, one to do the multiplication and one to do the sum. These loops are similar to the loops from norm_square_list_comprehension, but they are executed using numpy’s optimized numerical code.
This is the preferred way of doing vector norms in numpy by using the vectorized numpy.dot operation. The less efficient norm_square_numpy code is provided for illustration.
The simpler numpy code runs 37.54x faster than norm_square_list and 29.5x faster than the “optimized” Python list comprehension. The difference in speed between the pure Python looping method and the list comprehension method shows the benefit of doing more calculation behind the scenes rather than explicitly doing it in your Python code. By performing calculations using Python’s alreadybuilt machinery, we get the speed of the native C code that Python is built on. This is also partly the same reasoning behind why we have such a drastic speedup in the numpycode: instead of using the very generalized list structure, we have a very finely tuned and specially built object for dealing with arrays of numbers.
In addition to more lightweight and specialized machinery, the numpy object also gives us memory locality and vectorized operations, which are incredibly important when dealing with numerical computations. The CPU is exceedingly fast, and most of the time simply getting it the data it needs faster is the best way to optimize code quickly. Running each function using the perf tool we looked at earlier shows that the array and pure Python functions takes ~1x10^{11} instructions, while the numpy version takes ~3x10^{9} instructions. In addition, the array and pure Python versions have ~80% cache misses, while numpy has ~55%.
In our norm_square_numpy code, when doing vector * vector, there is an implied loop that numpy will take care of. The implied loop is the same loop we have explicitly written out in the other examples: loop over all items in vector, multiplying each item by itself. However, since we tell numpy to do this instead of explicitly writing it out in Python code, it can take advantage of all the optimizations it wants. In the background, numpy has very optimized C code that has been specifically made to take advantage of any vectorization the CPU has enabled. In addition,numpy arrays are represented sequentially in memory as lowlevel numerical types, which gives them the same space requirements as array objects (from the array module).
As an extra bonus, we can reformulate the problem as a dot product, which numpy supports. This gives us a single operation to calculate the value we want, as opposed to first taking the product of the two vectors and then summing them. As you can see in Figure 63, this operation,norm_numpy_dot, outperforms all the others by quite a substantial margin—this is thanks to the specialization of the function, and because we don’t need to store the intermediate value of vector * vector as we did in norm_numpy.
Figure 63. Runtimes for the various norm squared routines with vectors of different length
Applying numpy to the Diffusion Problem
Using what we’ve learned about numpy, we can easily adapt our pure Python code to be vectorized. The only new functionality we must introduce is numpy’s roll function. This function does the same thing as our moduloindex trick, but it does so for an entire numpy array. In essence, it vectorizes this reindexing:
>>> import numpy as np
>>> np.roll([1,2,3,4], 1)
array([4, 1, 2, 3])
>>> np.roll([[1,2,3],[4,5,6]], 1, axis=1)
array([[3, 1, 2],
[6, 4, 5]])
The roll function creates a new numpy array, which can be thought of as both good and bad. The downside is that we are taking time to allocate new space, which then needs to be filled with the appropriate data. On the other hand, once we have created this new rolled array we will be able to vectorize operations on it quite quickly and not suffer from cache misses from the CPU cache. This can substantially affect the speed of the actual calculation we must do on the grid. Later in this chapter, we will rewrite this so that we get the same benefit without having to constantly allocate more memory.
With this additional function we can rewrite the Python diffusion code from Example 66 using simpler, and vectorized, numpy arrays. In addition, we break up the calculation of the derivatives, grid_xx and grid_yy, into a separate function. Example 69 shows our initial numpy diffusion code.
Example 69. Initial numpy diffusion
import numpy as np
grid_shape = (1024, 1024)
def laplacian(grid):
return np.roll(grid, +1, 0) + np.roll(grid, 1, 0) + \
np.roll(grid, +1, 1) + np.roll(grid, 1, 1)  4 * grid
def evolve(grid, dt, D=1):
return grid + dt * D * laplacian(grid)
def run_experiment(num_iterations):
grid = np.zeros(grid_shape)
block_low = int(grid_shape[0] * .4)
block_high = int(grid_shape[0] * .5)
grid[block_low:block_high, block_low:block_high] = 0.005
start = time.time()
for i inrange(num_iterations):
grid = evolve(grid, 0.1)
return time.time()  start
Immediately we see that this code is much shorter. This is generally a good indication of performance: we are doing a lot of the heavy lifting outside of the Python interpreter, and hopefully inside a module specially built for performance and for solving a particular problem (however, this should always be tested!). One of the assumptions here is that numpy is using better memory management in order to give the CPU the data it needs faster. However, since whether or not this happens relies on the actual implementation of numpy, let’s profile our code to see whether our hypothesis is correct. Example 610 shows the results.
Example 610. Performance counters for numpy 2D diffusion
$ perf stat e cycles,stalledcyclesfrontend,stalledcyclesbackend,instructions,\
cachereferences,cachemisses,branches,branchmisses,taskclock,faults,\
minorfaults,cs,migrations r 3 python diffusion_numpy.py
Performance counter stats for 'python diffusion_numpy.py' (3 runs):
10,194,811,718 cycles # 3.332 GHz
4,435,850,419 stalledcyclesfrontend # 43.51% frontend cycles idle
2,055,861,567 stalledcyclesbackend # 20.17% backend cycles idle
15,165,151,844 instructions # 1.49 insns per cycle
# 0.29 stalled cycles per insn
346,798,311 cachereferences # 113.362 M/sec
519,793 cachemisses # 0.150 % of all cache refs
3,506,887,927 branches # 1146.334 M/sec
3,681,441 branchmisses # 0.10% of all branches
3059.219862 taskclock # 0.999 CPUs utilized
751,707 pagefaults # 0.246 M/sec
751,707 minorfaults # 0.246 M/sec
8 contextswitches # 0.003 K/sec
1 CPUmigrations # 0.000 K/sec
3.061883218 seconds time elapsed
This shows that the simple change to numpy has given us a 40x speedup over the pure Python implementation with reduced memory allocations (Example 68). How was this achieved?
First of all, we can thank the vectorization that numpy gives. Although the numpy version seems to be running fewer instructions per cycle, each of the instructions does much more work. That is to say, one vectorized instruction can multiply four (or more) numbers in an array together instead of requiring four independent multiplication instructions. Overall this results in fewer total instructions necessary to solve the same problem.
There are also several other factors that contribute to the numpy version requiring a lower absolute number of instructions to solve the diffusion problem. One of them has to do with the full Python API being available when running the pure Python version, but not necessarily for the numpyversion—for example, the pure Python grids can be appended to in pure Python, but not in numpy. Even though we aren’t explicitly using this (or other) functionality, there is overhead in providing a system where it could be available. Since numpy can make the assumption that the data being stored is always going to be numbers, everything regarding the arrays can be optimized for operations over numbers. We will continue on the track of removing necessary functionality in favor of performance when we talk about Cython (see Cython), where it is even possible to remove list bounds checking to speed up list lookups.
Normally, the number of instructions doesn’t necessarily correlate to performance—the program with fewer instructions may not issue them efficiently, or they may be slow instructions. However, we see that in addition to reducing the number of instructions, the numpy version has also reduced a large inefficiency: cache misses (0.15% cache misses instead of 30.2%). As explained in Memory Fragmentation, cache misses slow down computations because the CPU must wait for data to be retrieved from slower memory instead of having it immediately available in its cache. In fact, memory fragmentation is such a dominant factor in performance that if we disable vectorization^{[}11^{]} in numpy but keep everything else the same, we still see a sizable speed increase compared to the pure Python version (Example 611).
Example 611. Performance counters for numpy 2D diffusion without vectorization
$ perf stat e cycles,stalledcyclesfrontend,stalledcyclesbackend,instructions,\
cachereferences,cachemisses,branches,branchmisses,taskclock,faults,\
minorfaults,cs,migrations r 3 python diffusion_numpy.py
Performance counter stats for 'python diffusion_numpy.py' (3 runs):
48,923,515,604 cycles # 3.413 GHz
24,901,979,501 stalledcyclesfrontend # 50.90% frontend cycles idle
6,585,982,510 stalledcyclesbackend # 13.46% backend cycles idle
53,208,756,117 instructions # 1.09 insns per cycle
# 0.47 stalled cycles per insn
83,436,665 cachereferences # 5.821 M/sec
1,211,229 cachemisses # 1.452 % of all cache refs
4,428,225,111 branches # 308.926 M/sec
3,716,789 branchmisses # 0.08% of all branches
14334.244888 taskclock # 0.999 CPUs utilized
751,185 pagefaults # 0.052 M/sec
751,185 minorfaults # 0.052 M/sec
24 contextswitches # 0.002 K/sec
5 CPUmigrations # 0.000 K/sec
14.345794896 seconds time elapsed
This shows us that the dominant factor in our 40x speedup when introducing numpy is not the vectorized instruction set, but rather the memory locality and reduced memory fragmentation. In fact, we can see from the preceding experiment that vectorization accounts for only about 15% of that 40x speedup.^{[}12^{]}
This realization that memory issues are the dominant factor in slowing down our code doesn’t come as too much of a shock. Computers are very well designed to do exactly the calculations we are requesting them to do with this problem—multiplying and adding numbers together. The bottleneck is in getting those numbers to the CPU fast enough to see it do the calculations as fast as it can.
Memory Allocations and InPlace Operations
In order to optimize the memorydominated effects, let us try using the same method we used in Example 66 in order to reduce the number of allocations we make in our numpy code. Allocations are quite a bit worse than the cache misses we discussed previously. Instead of simply having to find the right data in RAM when it is not found in the cache, an allocation also must make a request to the operating system for an available chunk of data and then reserve it. The request to the operating system generates quite a lot more overhead than simply filling a cache—while filling a cache miss is a hardware routine that is optimized on the motherboard, allocating memory requires talking to another process, the kernel, in order to complete.
In order to remove the allocations in Example 69, we will preallocate some scratch space at the beginning of the code and then only use inplace operations. Inplace operations, such as +=, *=, etc., reuse one of the inputs as their output. This means that we don’t need to allocate space to store the result of the calculation.
To show this explicitly, we will look at how the id of a numpy array changes as we perform operations on it (Example 612). The id is a good way of tracking this for numpy arrays, since the id has to do with what section of memory is being referenced. If two numpy arrays have the sameid, then they are referencing the same section of memory.^{[}13^{]}
Example 612. Inplace operations reducing memory allocations
>>> import numpy as np
>>> array1 = np.random.random((10,10))
>>> array2 = np.random.random((10,10))
>>> id(array1)
140199765947424 #
>>> array1 += array2
>>> id(array1)
140199765947424 #
>>> array1 = array1 + array2
>>> id(array1)
140199765969792 #
These two ids are the same, since we are doing an inplace operation. This means that the memory address of array1 does not change; we are simply changing the data contained within it.
Here, the memory address has changed. When doing array1 + array2, a new memory address is allocated and filled with the result of the computation. This does have benefits, however, for when the original data needs to be preserved (i.e., array3 = array1 + array2 allows you to keep using array1 and array2, while inplace operations destroy some of the original data).
Furthermore, we see an expected slowdown from the noninplace operation. For small numpy arrays, this overhead can be as much as 50% of the total computation time. For larger computations, the speedup is more in the singlepercent region, but this still represents a lot of time if this is happening millions of times. In Example 613, we see that using inplace operations gives us a 20% speedup for small arrays. This margin will become larger as the arrays grow, since the memory allocations become more strenuous.
Example 613. Inplace operations reducing memory allocations
>>> %%timeit array1, array2 = np.random.random((10,10)), np.random.random((10,10))
#
... array1 = array1 + array2
...
100000 loops, best of 3: 3.03 us per loop
>>> %%timeit array1, array2 = np.random.random((10,10)), np.random.random((10,10))
... array1 += array2
...
100000 loops, best of 3: 2.42 us per loop
Note that we use %%timeit instead of %timeit, which allows us to specify code to set up the experiment that doesn’t get timed.
The downside is that while rewriting our code from Example 69 to use inplace operations is not very complicated, it does make the resulting code a bit harder to read. In Example 614 we can see the results of this refactoring. We instantiate grid and next_grid vectors, and we constantly swap them with each other. grid is the current information we know about the system and, after running evolve, next_grid contains the updated information.
Example 614. Making most numpy operations inplace
def laplacian(grid, out):
np.copyto(out, grid)
out *= 4
out += np.roll(grid, +1, 0)
out += np.roll(grid, 1, 0)
out += np.roll(grid, +1, 1)
out += np.roll(grid, 1, 1)
def evolve(grid, dt, out, D=1):
laplacian(grid, out)
out *= D * dt
out += grid
def run_experiment(num_iterations):
next_grid = np.zeros(grid_shape)
grid = np.zeros(grid_shape)
block_low = int(grid_shape[0] * .4)
block_high = int(grid_shape[0] * .5)
grid[block_low:block_high, block_low:block_high] = 0.005
start = time.time()
for i inrange(num_iterations):
evolve(grid, 0.1, next_grid)
grid, next_grid = next_grid, grid #
return time.time()  start
Since the output of evolve gets stored in the output vector next_grid, we must swap these two variables so that, for the next iteration of the loop, grid has the most uptodate information. This swap operation is quite cheap because only the references to the data are changed, not the data itself.
It is important to remember that since we want each operation to be inplace, whenever we do a vector operation we must have it be on its own line. This can make something as simple as A = A * B + C become quite convoluted. Since Python has a heavy emphasis on readability, we should make sure that the changes we have made give us sufficient speedups to be justified.
Comparing the performance metrics from Examples 615 and 610, we see that removing the spurious allocations sped up our code by 29%. This comes partly from a reduction in the number of cache misses, but mostly from a reduction in page faults.
Example 615. Performance metrics for numpy with inplace memory operations
$ perf stat e cycles,stalledcyclesfrontend,stalledcyclesbackend,instructions,\
cachereferences,cachemisses,branches,branchmisses,taskclock,faults,\
minorfaults,cs,migrations r 3 python diffusion_numpy_memory.py
Performance counter stats for 'python diffusion_numpy_memory.py' (3 runs):
7,864,072,570 cycles # 3.330 GHz
3,055,151,931 stalledcyclesfrontend # 38.85% frontend cycles idle
1,368,235,506 stalledcyclesbackend # 17.40% backend cycles idle
13,257,488,848 instructions # 1.69 insns per cycle
# 0.23 stalled cycles per insn
239,195,407 cachereferences # 101.291 M/sec
2,886,525 cachemisses # 1.207 % of all cache refs
3,166,506,861 branches # 1340.903 M/sec
3,204,960 branchmisses # 0.10% of all branches
2361.473922 taskclock # 0.999 CPUs utilized
6,527 pagefaults # 0.003 M/sec
6,527 minorfaults # 0.003 M/sec
6 contextswitches # 0.003 K/sec
2 CPUmigrations # 0.001 K/sec
2.363727876 seconds time elapsed
Selective Optimizations: Finding What Needs to Be Fixed
Looking at the code from Example 614, it seems like we have addressed most of the issues at hand: we have reduced the CPU burden by using numpy, and we have reduced the number of allocations necessary to solve the problem. However, there is always more investigation to be done. If we do a line profile on that code (Example 616), we see that the majority of our work is done within the laplacian function. In fact, 93% of the time evolve takes to run is spent in laplacian.
Example 616. Line profiling shows that laplacian is taking too much time
Wrote profile results to diffusion_numpy_memory.py.lprof
Timer unit: 1e06 s
File: diffusion_numpy_memory.py
Function: laplacian at line 8
Total time: 3.67347 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def laplacian(grid, out):
10 500 162009 324.0 4.4 np.copyto(out, grid)
11 500 111044 222.1 3.0 out *= 4
12 500 464810 929.6 12.7 out += np.roll(grid, +1, 0)
13 500 432518 865.0 11.8 out += np.roll(grid, 1, 0)
14 500 1261692 2523.4 34.3 out += np.roll(grid, +1, 1)
15 500 1241398 2482.8 33.8 out += np.roll(grid, 1, 1)
File: diffusion_numpy_memory.py
Function: evolve at line 17
Total time: 3.97768 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
17 @profile
18 def evolve(grid, dt, out, D=1):
19 500 3691674 7383.3 92.8 laplacian(grid, out)
20 500 111687 223.4 2.8 out *= D * dt
21 500 174320 348.6 4.4 out += grid
There could be many reasons why laplacian is so slow. However, there are two main highlevel issues to consider. First, it looks like the calls to np.roll are allocating new vectors (we can verify this by looking at the documentation for the function). This means that even though we removed seven memory allocations in our previous refactoring, there are still four outstanding allocations. Furthermore, np.roll is a very generalized function that has a lot of code to deal with special cases. Since we know exactly what we want to do (move just the first column of data to be the last in every dimension), we can rewrite this function to eliminate most of the spurious code. We could even merge the np.roll logic with the add operation that happens with the rolled data to make a very specialized roll_add function that does exactly what we want with the fewest number of allocations and the least extra logic.
Example 617 shows what this refactoring would look like. All we need to do is create our new roll_add function and have laplacian use it. Since numpy supports fancy indexing, implementing such a function is just a matter of not jumbling up the indices. However, as stated earlier, while this code may be more performant, it is much less readable.
WARNING
Notice the extra work that has gone into having an informative docstring for the function, in addition to full tests. When you are taking a route similar to this one it is important to maintain the readability of the code, and these steps go a long way to making sure that your code is always doing what it was intended to do and that future programmers can modify your code and know what things do and when things are not working.
Example 617. Creating our own roll function
import numpy as np
def roll_add(rollee, shift, axis, out):
"""
Given a matrix, rollee, and an output matrix, out, this function will
perform the calculation:
>>> out += np.roll(rollee, shift, axis=axis)
This is done with the following assumptions:
* rollee is 2D
* shift will only ever be +1 or 1
* axis will only ever be 0 or 1 (also implied by the first assumption)
Using these assumptions, we are able to speed up this function by avoiding
extra machinery that numpy uses to generalize the `roll` function and also
by making this operation intrinsically inplace.
"""
if shift == 1 andaxis == 0:
out[1:, :] += rollee[:1, :]
out[0 , :] += rollee[1, :]
elif shift == 1 andaxis == 0:
out[:1, :] += rollee[1:, :]
out[1 , :] += rollee[0, :]
elif shift == 1 andaxis == 1:
out[:, 1:] += rollee[:, :1]
out[:, 0 ] += rollee[:, 1]
elif shift == 1 andaxis == 1:
out[:, :1] += rollee[:, 1:]
out[:, 1] += rollee[:, 0]
def test_roll_add():
rollee = np.asarray([[1,2],[3,4]])
for shift in (1, +1):
for axis in (0, 1):
out = np.asarray([[6,3],[9,2]])
expected_result = np.roll(rollee, shift, axis=axis) + out
roll_add(rollee, shift, axis, out)
assert np.all(expected_result == out)
def laplacian(grid, out):
np.copyto(out, grid)
out *= 4
roll_add(grid, +1, 0, out)
roll_add(grid, 1, 0, out)
roll_add(grid, +1, 1, out)
roll_add(grid, 1, 1, out)
If we look at the performance counters in Example 618 for this rewrite, we see that while it is considerably faster than Example 614 (70% faster, in fact), most of the counters are about the same. The number of pagefaults went down, but not by 70%. Similarly, cachemisses andcachereferences went down, but not enough to account for the entire speedup. One of the most important metrics here is the instructions metric. The instructions metric counts how many CPU instructions had to be issued to run the program—in other words, how many things the CPU had to do. Somehow, the change to the customized roll_add function reduced the total number of instructions necessary by about 2.86x. This is because instead of having to rely on the entire machinery of numpy to roll our matrix, we are able to create shorter and simpler machinery that can take advantage of assumptions about our data (namely, that our data is twodimensional and that we will only ever be rolling by 1). This theme of trimming out unnecessary machinery in both numpy and Python in general will continue in Cython.
Example 618. Performance metrics for numpy with inplace memory operations and custom laplacian function
$ perf stat e cycles,stalledcyclesfrontend,stalledcyclesbackend,instructions,\
cachereferences,cachemisses,branches,branchmisses,taskclock,faults,\
minorfaults,cs,migrations r 3 python diffusion_numpy_memory2.py
Performance counter stats for 'python diffusion_numpy_memory2.py' (3 runs):
4,303,799,244 cycles # 3.108 GHz
2,814,678,053 stalledcyclesfrontend # 65.40% frontend cycles idle
1,635,172,736 stalledcyclesbackend # 37.99% backend cycles idle
4,631,882,411 instructions # 1.08 insns per cycle
# 0.61 stalled cycles per insn
272,151,957 cachereferences # 196.563 M/sec
2,835,948 cachemisses # 1.042 % of all cache refs
621,565,054 branches # 448.928 M/sec
2,905,879 branchmisses # 0.47% of all branches
1384.555494 taskclock # 0.999 CPUs utilized
5,559 pagefaults # 0.004 M/sec
5,559 minorfaults # 0.004 M/sec
6 contextswitches # 0.004 K/sec
3 CPUmigrations # 0.002 K/sec
1.386148918 seconds time elapsed
numexpr: Making InPlace Operations Faster and Easier
One downfall of numpy’s optimization of vector operations is that it only occurs on one operation at a time. That is to say, when we are doing the operation A * B + C with numpy vectors, first the entire A * B operation completes, and the data is stored in a temporary vector; then this new vector is added with C. The inplace version of the diffusion code in Example 614 shows this quite explicitly.
However, there are many modules that can help with this. numexpr is a module that can take an entire vector expression and compile it into very efficient code that is optimized to minimize cache misses and temporary space used. In addition, the expressions can utilize multiple CPU cores (see Chapter 9 for more information) and specialized instructions for Intel chips to maximize the speedup.
It is very easy to change code to use numexpr: all that’s required is to rewrite the expressions as strings with references to local variables. The expressions are compiled behind the scenes (and cached so that calls to the same expression don’t incur the same cost of compilation) and run using optimized code. Example 619 shows the simplicity of changing the evolve function to use numexpr. In this case, we chose to use the out parameter of the evaluate function so that numexpr doesn’t allocate a new vector to which to return the result of the calculation.
Example 619. Using numexpr to further optimize large matrix operations
from numexpr import evaluate
def evolve(grid, dt, next_grid, D=1):
laplacian(grid, next_grid)
evaluate("next_grid*D*dt+grid", out=next_grid)
An important feature of numexpr is its consideration of CPU caches. It specifically moves data around so that the various CPU caches have the correct data in order to minimize cache misses. When we run perf on the updated code (Example 620), we see a speedup. However, if we look at the performance on a smaller, 512 × 512 grid (see Figure 64, at the end of the chapter) we see a ~15% decrease in speed. Why is this?
Example 620. Performance metrics for numpy with inplace memory operations, custom laplacian function, and numexpr
$ perf stat e cycles,stalledcyclesfrontend,stalledcyclesbackend,instructions,\
cachereferences,cachemisses,branches,branchmisses,taskclock,faults,\
minorfaults,cs,migrations r 3 python diffusion_numpy_memory2_numexpr.py
Performance counter stats for 'python diffusion_numpy_memory2_numexpr.py' (3 runs):
5,940,414,581 cycles # 1.447 GHz
3,706,635,857 stalledcyclesfrontend # 62.40% frontend cycles idle
2,321,606,960 stalledcyclesbackend # 39.08% backend cycles idle
6,909,546,082 instructions # 1.16 insns per cycle
# 0.54 stalled cycles per insn
261,136,786 cachereferences # 63.628 M/sec
11,623,783 cachemisses # 4.451 % of all cache refs
627,319,686 branches # 152.851 M/sec
8,443,876 branchmisses # 1.35% of all branches
4104.127507 taskclock # 1.364 CPUs utilized
9,786 pagefaults # 0.002 M/sec
9,786 minorfaults # 0.002 M/sec
8,701 contextswitches # 0.002 M/sec
60 CPUmigrations # 0.015 K/sec
3.009811418 seconds time elapsed
Much of the extra machinery we are bringing into our program with numexpr deals with cache considerations. When our grid size is small and all the data we need for our calculations fits in the cache, this extra machinery simply adds more instructions that don’t help performance. In addition, compiling the vector operation that we encoded as a string adds a large overhead. When the total runtime of the program is small, this overhead can be quite noticeable. However, as we increase the grid size, we should expect to see numexpr utilize our cache better than nativenumpy does. In addition, numexpr utilizes multiple cores to do its calculation and tries to saturate each of the cores’ caches. When the size of the grid is small, the extra overhead of managing the multiple cores overwhelms any possible increase in speed.
The particular computer we are running the code on has a 20,480 KB cache (Intel® Xeon® E52680). Since we are operating on two arrays, one for input and one for output, we can easily do the calculation for the size of the grid that will fill up our cache. The number of grid elements we can store in total is 20,480 KB / 64 bit = 2,560,000. Since we have two grids, this number is split between two objects (so each one can be at most 2,560,000 / 2 = 1,280,000 elements). Finally, taking the square root of this number gives us the size of the grid that uses that many grid elements. All in all, this means that approximately two 2D arrays of size 1,131 × 1,131 would fill up the cache (). In practice, however, we do not get to fill up the cache ourselves (other programs will fill up parts of the cache), so realistically we can probably fit two 800 × 800 arrays. Looking at Tables 61 and 62, we see that when the grid size jumps from 512 × 512 to 1,024 × 1,024, the numexpr code starts to outperform pure numpy.
A Cautionary Tale: Verify “Optimizations” (scipy)
An important thing to take away from this chapter is the approach we took to every optimization: profile the code to get a sense of what is going on, come up with a possible solution to fix slow parts, then profile to make sure the fix actually worked. Although this sounds straightforward, things can get complicated quickly, as we saw with how the performance of numexpr depended greatly on the size of the grid we were considering.
Of course, our proposed solutions don’t always work as expected. While writing the code for this chapter, this author saw that the laplacian function was the slowest routine and hypothesized that the scipy routine would be considerably faster. This thinking came from the fact that Laplacians are a common operation in image analysis and probably have a very optimized library to speed up the calls. scipy has an image submodule, so we must be in luck!
The implementation was quite simple (Example 621) and required little thought about the intricacies of implementing the periodic boundary conditions (or “wrap” condition, as scipy calls it).
Example 621. Using scipy’s laplace filter
from scipy.ndimage.filters import laplace
def laplacian(grid, out):
laplace(grid, out, mode='wrap')
Ease of implementation is quite important, and definitely won this method some points before we considered performance. However, once we benchmarked the scipy code (Example 622), we got the revelation: this method offers no substantial speedup compared to the code it was based on (Example 614). In fact, as we increase the grid size, this method starts performing worse (see Figure 64 at the end of the chapter).
Example 622. Performance metrics for diffusion with scipy’s laplace function
$ perf stat e cycles,stalledcyclesfrontend,stalledcyclesbackend,instructions,\
cachereferences,cachemisses,branches,branchmisses,taskclock,faults,\
minorfaults,cs,migrations r 3 python diffusion_scipy.py
Performance counter stats for 'python diffusion_scipy.py' (3 runs):
6,573,168,470 cycles # 2.929 GHz
3,574,258,872 stalledcyclesfrontend # 54.38% frontend cycles idle
2,357,614,687 stalledcyclesbackend # 35.87% backend cycles idle
9,850,025,585 instructions # 1.50 insns per cycle
# 0.36 stalled cycles per insn
415,930,123 cachereferences # 185.361 M/sec
3,188,390 cachemisses # 0.767 % of all cache refs
1,608,887,891 branches # 717.006 M/sec
4,017,205 branchmisses # 0.25% of all branches
2243.897843 taskclock # 0.994 CPUs utilized
7,319 pagefaults # 0.003 M/sec
7,319 minorfaults # 0.003 M/sec
12 contextswitches # 0.005 K/sec
1 CPUmigrations # 0.000 K/sec
2.258396667 seconds time elapsed
Comparing the performance metrics of the scipy version of the code with those of our custom laplacian function (Example 618), we can start to get some indication as to why we aren’t getting the speedup we were expecting from this rewrite.
The metrics that stand out the most are pagefaults and instructions. Both of these values are substantially larger for the scipy version. The increase in pagefaults shows us that while the scipy laplacian function has support for inplace operations, it is still allocating a lot of memory. In fact, the number of pagefaults in the scipy version is larger than in our first rewrite of the numpy code (Example 615).
Most importantly, however, is the instructions metric. This shows us that the scipy code is requesting that the CPU do over double the amount of work as our custom laplacian code. Even though these instructions are more optimized (as we can see with the higher insns per cycle count, which says how many instructions the CPU can do in one clock cycle), the extra optimization doesn’t win out over the sheer number of added instructions. This could be in part due to the fact that the scipy code is written very generally, so that it can process all sorts of inputs with different boundary conditions (which requires extra code and thus more instructions). We can see this, in fact, by the high number of branches that the scipy code requires.
WrapUp
Looking back on our optimizations, we seem to have taken two main routes: reducing the time taken to get data to the CPU and reducing the amount of work that the CPU had to do. Tables 61 and 62 show a comparison of the results achieved by our various optimization efforts, for various dataset sizes, in relation to the original pure Python implementation.
Figure 64 shows graphically how all these methods compared to each other. We can see three bands of performance that correspond to these two methods: the band along the bottom shows the small improvement made in relation to our pure Python implementation by our first effort at reducing memory allocations, the middle band shows what happened when we used numpy and further reduced allocations, and the upper band illustrates the results achieved by reducing the work done by our process.
Table 61. Total runtime of all schemes for various grid sizes and 500 iterations of the evolve function
Method 
256 × 256 
512 × 512 
1,024 × 1,024 
2,048 × 2,048 
4,096 × 4,096 

Python 
2.32s 
9.49s 
39.00s 
155.02s 
617.35s 

Python + memory 
2.56s 
10.26s 
40.87s 
162.88s 
650.26s 

numpy 
0.07s 
0.28s 
1.61s 
11.28s 
45.47s 

numpy + memory 
0.05s 
0.22s 
1.05s 
6.95s 
28.14s 

numpy + memory + laplacian 
0.03s 
0.12s 
0.53s 
2.68s 
10.57s 

numpy + memory + laplacian + numexpr 
0.04s 
0.13s 
0.50s 
2.42s 
9.54s 

numpy + memory + scipy 
0.05s 
0.19s 
1.22s 
6.06s 
30.31s 
Table 62. Speedup compared to naive Python (Example 63) for all schemes and various grid sizes over 500 iterations of the evolve function
Method 
256 × 256 
512 × 512 
1,024 × 1,024 
2,048 × 2,048 
4,096 × 4,096 

Python 
0.00x 
0.00x 
0.00x 
0.00x 
0.00x 

Python + memory 
0.90x 
0.93x 
0.95x 
0.95x 
0.95x 

numpy 
32.33x 
33.56x 
24.25x 
13.74x 
13.58x 

numpy + memory 
42.63x 
42.75x 
37.13x 
22.30x 
21.94x 

numpy + memory + laplacian 
77.98x 
78.91x 
73.90x 
57.90x 
58.43x 

numpy + memory + laplacian + numexpr 
65.01x 
74.27x 
78.27x 
64.18x 
64.75x 

numpy + memory + scipy 
42.43x 
51.28x 
32.09x 
25.58x 
20.37x 
One important lesson to take away from this is that you should always take care of any administrative things the code must do during initialization. This may include allocating memory, or reading configuration from a file, or even precomputing some values that will be needed throughout the lifetime of the program. This is important for two reasons. First, you are reducing the total number of times these tasks must be done by doing them once up front, and you know that you will be able to use those resources without too much penalty in the future. Secondly, you are not disrupting the flow of the program; this allows it to pipeline more efficiently and keep the caches filled with more pertinent data.
We also learned more about the importance of data locality and how important simply getting data to the CPU is. CPU caches can be quite complicated, and often it is best to allow the various mechanisms designed to optimize them take care of the issue. However, understanding what is happening and doing all that is possible to optimize how memory is handled can make all the difference. For example, by understanding how caches work we are able to understand that the decrease in performance that leads to a saturated speedup no matter the grid size in Figure 64 can probably be attributed to the L3 cache being filled up by our grid. When this happens, we stop benefiting from the tiered memory approach to solving the Von Neumann bottleneck.
Figure 64. Summary of speedups from the methods attempted in this chapter
Another important lesson concerns the use of external libraries. Python is fantastic for its ease of use and readability, which allow you to write and debug code fast. However, tuning performance down to the external libraries is essential. These external libraries can be extremely fast, because they can be written in lowerlevel languages—but since they interface with Python, you can also still write code that used them quickly.
Finally, we learned the importance of benchmarking everything and forming hypotheses about performance before running the experiment. By forming a hypothesis before running the benchmark, we are able to form conditions to tell us whether our optimization actually worked. Was this change able to speed up runtime? Did it reduce the number of allocations? Is the number of cache misses lower? Optimization can be an art at times because of the vast complexity of computer systems, and having a quantitative probe into what is actually happening can help enormously.
One last point about optimization is that a lot of care must be taken to make sure that the optimizations you make generalize to different computers (the assumptions you make and the results of benchmarks you do may be dependent on the architecture of the computer you are running on, how the modules you are using were compiled, etc.). In addition, when making these optimizations it is incredibly important to consider other developers and how the changes will affect the readability of your code. For example, we realized that the solution we implemented in Example 617was potentially vague, so care was taken to make sure that the code was fully documented and tested to help not only us, but also help other people in the team.
In the next chapter we will talk about how to create your own external modules that can be finely tuned to solve specific problems with much greater efficiencies. This allows us to follow the rapid prototyping method of making our programs—first solve the problem with slow code, then identify the elements that are slow, and finally find ways to make those elements faster. By profiling often and only trying to optimize sections of code we know are slow, we can save ourselves time while still making our programs run as fast as possible.
^{[}9^{] }This is the code from Example 63, truncated to fit within the page margins. Recall that kernprof.py requires functions to be decorated with @profile in order to be profiled (see Using line_profiler for LinebyLine Measurements).
^{[}10^{] }The code profiled here is the code from Example 66; it has been truncated to fit within the page margins.
^{[}11^{] }We do this by compiling numpy with the O0 flag. For this experiment we built numpy 1.8.0 with the command: $ OPT='O0' FOPT='O0' BLAS=None LAPACK=None ATLAS=None python setup.py build.
^{[}12^{] }This is very contingent on what CPU is being used.
^{[}13^{] }This is not strictly true, since two numpy arrays can reference the same section of memory but use different striding information in order to represent the same data in different ways. These two numpy arrays will have different ids. There are many subtleties to the id structure of numpy arrays that are outside the scope of this discussion.