Functional Python Programming (2015)
Chapter 9. More Itertools Techniques
Functional programming emphasizes stateless programming. In Python, this leads us to work with generator expressions, generator functions, and iterables. In this chapter, we'll continue our study of the itertools library, with numerous functions to help us work with iterable collections.
In the previous chapter, we looked at three broad groupings of iterator functions. They are as follows:
· Functions that work with infinite iterators can be applied to any iterable or an iterator over any collection; they will consume the entire source
· Functions that work with finite iterators can either accumulate a source multiple times, or they produce a reduction of the source
· The tee() iterator function clones an iterator into several copies that can each be used independently
In this chapter, we'll look at the itertools functions that work with permutations and combinations. These include several functions and a few recipes built on these functions. The functions are as follows:
· product(): This function forms a Cartesian product equivalent to the nested for loops
· permutations(): This function emits tuples of length r from a universe p in all possible orderings; there are no repeated elements
· combinations(): This function emits tuples of length r from a universe p in sorted order; there are no repeated elements
· combinations_with_replacement(): This function emits tuples of length r from p in a sorted order, with repeated elements
These functions embody algorithms that iterate over potentially large result sets from small collections of input data. Some kinds of problems have exact solutions based on exhaustively enumerating a potentially gigantic universe of permutations. The functions make it simple to emit a large number of permutations; in some cases, the simplicity isn't actually optimal.
Enumerating the Cartesian product
The term Cartesian product refers to the idea of enumerating all the possible combinations of elements drawn from a number of sets.
Mathematically, we might say that the product of two sets, , has 52 pairs as follows:
{(1, C), (1, D), (1, H), (1, S), (2, C), (2, D), (2, H), (2, S), ..., (13, C), (13, D), (13, H), (13, S)}
We can produce the preceding results by executing the following commands:
>>> list(product(range(1, 14), '♣♦♥♠'))
[(1, '♣'), (1, '♦'), (1, '♥'), (1, '♠'),(2, '♣'), (2, '♦'), (2, '♥'), (2, '♠'),… (13, '♣'), (13, '♦'), (13, '♥'), (13, '♠')]
The calculation of a product can be extended to any number of iterable collections. Using a large number of collections can lead to a very large result set.
Reducing a product
In relational database theory, a join between tables can be thought of as a filtered product. A SQL SELECT statement that joins tables without a WHERE clause will produce a Cartesian product of rows in the tables. This can be thought of as the worst-case algorithm: a product without any filtering to pick the proper results.
We can use the join() function to join two tables, as shown in the following commands:
def join(t1, t2, where):):
return filter(where, product(t1, t2)))))
All combinations of the two iterables, t1 and t2, are computed. The filter() function will apply the given where function to pass or reject items that didn't fit the given condition to match appropriate rows from each iterable. This will work when the where function returns a simple Boolean.
In some cases, we don't have a simple Boolean matching function. Instead, we're forced to search for a minimum or maximum of some distance between items.
Assume that we have a table of Color objects as follows:
[Color(rgb=(239, 222, 205), name='Almond'), Color(rgb=(255, 255, 153), name='Canary'), Color(rgb=(28, 172, 120), name='Green'),...Color(rgb=(255, 174, 66), name='Yellow Orange')]
For more information, see Chapter 6, Recursions and Reductions, where we showed you how to parse a file of colors to create namedtuple objects. In this case, we've left the RGB as a triple, instead of decomposing each individual field.
An image will have a collection of pixels:
pixels= [(([(r, g, b), (r, g, b), (r, g, b), ...)
As a practical matter, the Python Imaging Library (PIL) package presents the pixels in a number of forms. One of these is the mapping from (x, y) coordinate to RGB triple. For more information, visit https://pypi.python.org/pypi/Pillow for the Pillow project documentation.
Given a PIL.Image object, we can iterate over the collection of pixels with something like the following commands:
def pixel_iter(image):
w, h = img.size
return ((c, img.getpixel(c)) for c in product(range(w), range(h)))
We've determined the range of each coordinate based on the image size. The calculation of the product(range(w), range(h)) method creates all the possible combinations of coordinates. It is, effectively, two nested for loops.
This has the advantage of providing each pixel with its coordinates. We can then process the pixels in no particular order and still reconstruct an image. This is particularly handy when using multiprocessing or multithreading to spread the workload among several cores or processors. The concurrent.futures module provides an easy way to distribute work among cores or processors.
Computing distances
A number of decision-making problems require that we find a close-enough match. We might not be able to use a simple equality test. Instead, we have to use a distance metric and locate items with the shortest distance to our target. For text, we might use the Levenshtein distance; this shows how many changes are required to get from a given block of text to our target.
We'll use a slightly simpler example. This will involve very simple math. However, even though it's simple, it doesn't work out well if we approach it naively.
When doing color matching, we won't have a simple equality test. We're rarely able to check for the exact equality of pixel colors. We're often forced to define a minimal distance function to determine whether two colors are close enough, without being the same three values of R, G, and B. There are several common approaches, including the Euclidean distance, Manhattan distance, and yet other complex weightings based on visual preferences.
Here are the Euclidean and Manhattan distance functions:
def euclidean(pixel, color):
return math.sqrt(sum(map(lambda x, y: (x-y)**2, pixel, color.rgb)))))))
def manhattan(pixel, color):
return sum(map(lambda x, y: abs(x-y), pixel, color.rgb)))))
The Euclidean distance measures the hypotenuse of a right-angled triangle among the three points in an RGB space. The Manhattan distance sums the edges of each leg of the right-angled triangle among the three points. The Euclidean distance offers precision where the Manhattan distance offers calculation speed.
Looking forward, we're aiming for a structure that looks like this. For each individual pixel, we can compute the distance from that pixel's color to the available colors in a limited color set. The results of this calculation for a single pixel might look like this:
(((0, 0), (92, 139, 195), Color(rgb=(239, 222, 205), name='Almond'), 169.10943202553784), ((0, 0), (92, 139, 195), Color(rgb=(255, 255, 153), name='Canary'), 204.42357985320578), ((0, 0), (92, 139, 195), Color(rgb=(28, 172, 120), name='Green'), 103.97114984456024), ((0, 0), (92, 139, 195), Color(rgb=(48, 186, 143), name='Mountain Meadow'), 82.75868534480233), ((0, 0), (92, 139, 195), Color(rgb=(255, 73, 108), name='Radical Red'), 196.19887869200477), ((0, 0), (92, 139, 195), Color(rgb=(253, 94, 83), name='Sunset Orange'), 201.2212712413874), ((0, 0), (92, 139, 195), Color(rgb=(255, 174, 66), name='Yellow Orange'), 210.7961100210343))
We've shown an overall tuple that consists of a number of four tuples. Each of the four tuples contains the following contents:
· The pixel's coordinates, for example, (0,0)
· The pixel's original color, for example, (92, 139, 195)
· A Color object from our set of seven colors, for example, Color(rgb=(239, 222, 205),name='Almond')
· The Euclidean distance between the original color and the given Color object
We can see that the smallest Euclidean distance is the closest match color. This kind of reduction is done easily with the min() function. If the overall tuple is assigned to a variable name, choices, the pixel-level reduction would look like this:
min(choices, key=lambda xypcd: xypcd[3]))])
We've called each four tuple an xypcd, that is, an xy coordinate, pixel, color, and distance. The minimum distance calculation will then pick a single four tuple as the optimal match between pixel and color.
Getting all pixels and all colors
How do we get to the structure that contains all pixels and all colors? The answer is simple but, as we'll see, less than optimal.
One way to map pixels to colors is to enumerate all pixels and all colors using the product() function:
xy = lambda xyp_c: xyp_c[0][0]
p = lambda xyp_c: xyp_c[0][1]
c = lambda xyp_c: xyp_c[1]
distances= (( = ((xy(item), p(item), c(item), euclidean(p(item), c(item)))
for item in product(pixel_iter(img), colors)))))
The core of this is the product(pixel_iter(img), colors) method that creates all pixels combined with all colors. We will do a bit of restructuring of the data to flatten it out. We will apply the euclidean() function to compute distances between pixel colors and Colorobjects.
The final selection of colors uses the groupby() function and the min(choices,...) expression, as shown in the following command snippet:
for _, choices in groupby(distances, key=lambda xy_p_c_d:
xy_p_c_d[0]):
print(min(choices, key=lambda xypcd: xypcd[3])))]))
The overall product of pixels and colors is a long, flat iterable. We grouped the iterable into smaller collections where the coordinates match. This will break the big iterable into smaller iterables of just colors associated with a single pixel. We can then pick the minimal color distance for each color.
In a picture that's 3,648×2,736 with 133 Crayola colors, we have an iterable with 1,327,463,424 items to be evaluated. Yes. That's a billion combinations created by this distances expression. The number is not necessarily impractical. It's well within the limits of what Python can do. However, it reveals an important flaw in the naïve use of the product() function.
We can't trivially do this kind of large-scale processing without some analysis to see how large it is. Here are some timeit numbers for these that do each of these calculations only 1,000,000 times:
· Euclidean 2.8
· Manhattan 1.8
Scaling up from 1 million to 1 billion means 1,800 seconds, that is, about half an hour for the Manhattan distance and 46 minutes to calculate the Euclidean distance. It appears that the core arithmetic operations of Python are too slow for this kind of naïve bulk processing.
More importantly, we're doing it wrong. This kind of width×height×color processing is simply a bad design. In many cases, we can do much better.
Performance analysis
A key feature of any big data algorithm is locating a way to execute some kind of a divide-and-conquer strategy. This is true of functional programming design as well as imperative design.
We have three options to speed up this processing; they are as follows:
· We can try to use parallelism to do more of the calculations concurrently. On a four-core processor, the time can be cut to approximately ¼. This cuts the time to 8 minutes for Manhattan distances.
· We can see if caching intermediate results will reduce the amount of redundant calculation. The question arises of how many colors are the same and how many colors are unique.
· We can look for a radical change in the algorithm.
We'll combine the last two points by computing all the possible comparisons between source colors and target colors. In this case, as in many other contexts, we can easily enumerate the entire mapping and avoid redundant calculation when done on a pixel-by-pixel basis. We'll also change the algorithm from a series of comparisons to a series of simple lookups in a mapping object.
When looking at this idea of precomputing all transformations for source color to target color, we need some overall statistics for an arbitrary image. The code associated with this book includes IMG_2705.jpg. Here is a basic algorithm to collect some data from the specified image:
from collections import defaultdict, Counter
palette = defaultdict(list)
for xy_p in pixel_iter(img):
xy, p = xy_p
palette[p].append(xy)
w, h = img.size
print(""("Total pixels", w*h)
print(""("Total colors", len(palette)))))
We collected all pixels of a given color into a list organized by color. From this, we'll learn the first of the following facts:
· The total number of pixels is 9,980,928. This is not surprising for a 10 megapixel image.
· The total number of colors is 210,303. If we try to compute the Euclidean distance between actual colors and the 133 colors, we would merely do 27,970,299 calculations, which might take about 76 seconds.
· Using a 3-bit mask, 0b11100000, there are 214 colors used out of a possible 512.
· Using a 4-bit mask, 0b11110000, there are 1,150 colors used out of 4,096.
· Using a 5-bit mask, 0b11111000, there are 5,845 colors used out of 32,768.
· Using a 6-bit mask, 0b11111100, there are 27,726 colors out of 262,144.
This gives us some insight into how we can rearrange the data structure, calculate the matching colors quickly, and then rebuild the image without doing a billion comparisons.
We can apply mask values to the RGB bytes with the following piece of command:
masked_color= tuple(map(lambda x: x&0b11100000, c))
This will pick out the most significant 3 bits of red, green, and blue values. If we use this instead of the original color to create a Counter object, we'll see that we have 214 distinct values.
Rearranging the problem
The naïve use of the product() function to compare all pixels and all colors was a bad idea. There are 10 million pixels, but only 200,000 unique colors. When mapping the source colors to target colors, we only have to save 200,000 values in a simple map.
We'll approach it as follows:
· Compute the source to target color mapping. In this case, let's use 3-bit color values as output. Each R, G, and B value comes from the eight values in the range(0, 256, 32) method. We can use this expression to enumerate all the output colors:
· product(range(0,256,32), range(0,256,32), range(0,256,32))
· We can then compute the Euclidean distance to the nearest color in our source palette, doing just 68,096 calculations. This takes about 0.14 seconds. It's done one time only and computes the 200,000 mappings.
· In one pass through the image, build a new image using the revised color table. In some cases, we can exploit the truncation of integer values. We can use an expression such as (0b11100000&r, 0b11100000&g, 0b11100000&b) to remove the least significant bits of an image color. We'll look at this additional reduction in computation later.
This will replace a billion Euclidean distance calculations with 10 million dictionary lookups. This will replace 30 minutes of calculation with about 30 seconds of calculation.
Instead of doing color mapping for all pixels, we'll create a static mapping from input to output values. We can build the image building using simple lookup mapping from original color to new color.
Once we have the palette of all 200,000 colors, we can apply the fast Manhattan distance to locate the nearest color in an output, such as the Crayola colors. This will use the algorithm for color matching shown earlier to compute the mapping instead of a result image. The difference will center on using the palette.keys() function instead of the pixel_iter() function.
We'll fold in yet another optimization: truncation. This will give us an even faster algorithm.
Combining two transformations
When combining multiple transformations, we can build a more complex mapping from source through intermediate targets to the result. To illustrate this, we'll truncate the colors as well as apply a mapping.
In some problem contexts, truncation can be difficult. In other cases, it's often quite simple. For example, truncating US postal ZIP codes from 9 to 5 characters is common. Postal codes can be further truncated to three characters to determine a regional facility that represents a larger geography.
For colors, we can use the bit-masking shown previously to truncate colors form three 8-bit values (24 bits, 16 million colors) to three 3-bit values (9 bits, 512 colors).
Here is a way to build a color map that combines both distances to a given set of colors and truncation of the source colors:
bit3 = range(0, 256, 0b100000)
best = (min(((((euclidean(rgb, c), rgb, c) for c in colors)
for rgb in product(bit3, bit3, bit3)))))
color_map = dict(((((b[1], b[2].rgb) for b in best)
We created a range object, bit3, that will iterate through all eight of the 3-bit color values.
Note
The range objects aren't like ordinary iterators; they can be used multiple times. As a result of this, the product(bit3, bit3, bit3) expression will produce all 512 color combinations that we'll use as the output colors.
For each truncated RGB color, we created a three tuple that has (0) the distance from all crayon colors, (1) the RGB color, and (2) the crayon Color object. When we ask for the minimum value of this collection, we'll get the closest crayon Color object to the truncated RGB color.
We built a dictionary that maps from the truncated RGB color to the closest crayon. In order to use this mapping, we'll truncate a source color before looking up the nearest crayon in the mapping. This use of truncation coupled with the precomputed mapping shows how we might need to combine mapping techniques.
The following are the commands for the image replacement:
clone = img.copy()
for xy, p in pixel_iter(img):
r, g, b = p
repl = color_map[(([(0b11100000&r, 0b11100000&g, 0b11100000&b)]])]
clone.putpixel(xy, repl)
clone.show()
This simply uses a number of PIL features to replace all of the pixels in a picture with other pixels.
What we've seen is that the naïve use of some functional programming tools can lead to algorithms that are expressive and succinct, but also inefficient. The essential tools to compute the complexity of a calculation—sometimes called Big-O analysis—is just as important for functional programming as it is for imperative programming.
The problem is not that the product() function is inefficient. The problem is that we can use the product() function in an inefficient algorithm.
Permuting a collection of values
When we permute a collection of values, we'll elaborate all the possible orders for the items. There are ways to permute items. We can use permutations as a kind of brute-force solution to a variety of optimization problems.
By visiting http://en.wikipedia.org/wiki/Combinatorial_optimization, we can see that the exhaustive enumeration of all permutations isn't appropriate for larger problems. The use of the itertools.permutations() function is a handy way to explore very small problems.
One popular example of these combinatorial optimization problems is the assignment problem. We have n agents and n tasks, but the cost of each agent performing a given task is not equal. Imagine that some agents have trouble with some details, while other agents excel at these details. If we can properly assign tasks to agents, we can minimize the costs.
We can create a simple grid that shows how well a given agent is able to perform a given task. For a small problem of a half-dozen agents and tasks, there will be a grid of 36 costs. Each cell in the grid shows agents 0 to 5 performing tasks A to F.
We can easily enumerate all the possible permutations. However, this approach doesn't scale well. 10! is 3,628,800. We can see this sequence of 3 million items with the list(permutations(range(10))) method.
We would expect to solve a problem of this size in a few seconds. If we double the size of the problem to 20!, we would have a bit of a scalability problem: there would be 2,432,902,008,176,640,000 permutations. If it takes about 0.56 seconds to generate 10! permutations, then to generate 20! permutations, it would take about 12,000 years.
Assume that we have a cost matrix with 36 values that show the costs of six agents and six tasks. We can formulate the problem as follows:
perms = permutations(range(6)))))
alt= [(([(sum(cost[x][y] for y, x in enumerate(perm)), perm) for perm in perms]
m = min(alt)[0]
print([[([ans for s, ans in alt if s == m]))])
We've created all permutations of tasks for our six agents. We've computed the sums of all the costs in our cost matrix for each task assigned to each agent. The minimum cost is the optimal solution. In many cases, there might be multiple optimal solutions; we'll locate all of them.
For small text-book examples, this is very fast. For larger examples, an approximation algorithm is more appropriate.
Generating all combinations
The itertools module also supports computing all combinations of a set of values. When looking at combinations, the order doesn't matter, so there are far fewer combinations than permutations. The number of combinations is often stated as . This is the number of ways that we can take combinations of r things at a time from a universe of p items overall.
For example, there are 2,598,960 5-card poker hands. We can actually enumerate all 2 million hands by executing the following command:
hands = list(combinations(tuple(product(range(13), '♠♥♦♣')), 5))
More practically, we have a dataset with a number of variables. A common exploratory technique is to determine the correlation among all pairs of variables in a set of data. If there are v variables, then we will enumerate all variables that must be compared by executing the following command:
combinations(range(v), 2)
Let's get some sample data from http://www.tylervigen.com to show how this will work. We'll pick three datasets with the same time range: numbers 7, 43, and 3890. We'll simply laminate the data into a grid, repeating the year column.
This is how the first and the remaining rows of the yearly data will look:
[('year', 'Per capita consumption of cheese (US)Pounds (USDA)', 'Number of people who died by becoming tangled in their bedsheetsDeaths (US) (CDC)', 'year', 'Per capita consumption of mozzarella cheese (US)Pounds (USDA)', 'Civil engineering doctorates awarded (US)Degrees awarded (National Science Foundation)', 'year', 'US crude oil imports from VenezuelaMillions of barrels (Dept. of Energy)', 'Per capita consumption of high fructose corn syrup (US)Pounds (USDA)'),
(2000, 29.8, 327, 2000, 9.3, 480, 2000, 446, 62.6),(2001, 30.1, 456, 2001, 9.7, 501, 2001, 471, 62.5),(2002, 30.5, 509, 2002, 9.7, 540, 2002, 438, 62.8),(2003, 30.6, 497, 2003, 9.7, 552, 2003, 436, 60.9),(2004, 31.3, 596, 2004, 9.9, 547, 2004, 473, 59.8),(2005, 31.7, 573, 2005, 10.2, 622, 2005, 449, 59.1),(2006, 32.6, 661, 2006, 10.5, 655, 2006, 416, 58.2),(2007, 33.1, 741, 2007, 11, 701, 2007, 420, 56.1),(2008, 32.7, 809, 2008, 10.6, 712, 2008, 381, 53),(2009, 32.8, 717, 2009, 10.6, 708, 2009, 352, 50.1)]
This is how we can use the combinations() function to emit all the combinations of the nine variables in this dataset, taken two at a time:
combinations(range(9), 2)
There are 36 possible combinations. We'll have to reject the combinations that involve year and year. These will trivially correlate with a value of 1.00.
Here is a function that picks a column of data out of our dataset:
def column(source, x):
for row in source:
yield row[x]
This allows us to use the corr() function from Chapter 4, Working with Collections, to compare two columns of data.
This is how we can compute all combinations of correlations:
from itertools import *
from Chapter_4.ch04_ex4 import corr
for p, q in combinations(range(9), 2):
header_p, *data_p = list(column(source, p))
header_q, *data_q = list(column(source, q))
if header_p == header_q: continue
r_pq = corr(data_p, data_q)
print("{"{("{2: 4.2f}: {0} vs {1}".format(header_p, header_q, r_pq)))))
For each combination of columns, we've extracted the two columns of data from our data set and used multiple assignments to separate the header from the remaining rows of data. If the headers match, we're comparing a variable to itself. This will be True for the three combinations of year and year that stem from the redundant year columns.
Given a combination of columns, we will compute the correlation function and then print the two headings along with the correlation of the columns. We've intentionally chosen some datasets that show spurious correlations with a dataset that doesn't follow the same pattern. In spite of this, the correlations are remarkably high.
The results look like this:
0.96: year vs Per capita consumption of cheese (US)Pounds (USDA)
0.95: year vs Number of people who died by becoming tangled in their bedsheetsDeaths (US) (CDC)
0.92: year vs Per capita consumption of mozzarella cheese (US)Pounds (USDA)
0.98: year vs Civil engineering doctorates awarded (US)Degrees awarded (National Science Foundation)
-0.80: year vs US crude oil imports from VenezuelaMillions of barrels (Dept. of Energy)
-0.95: year vs Per capita consumption of high fructose corn syrup (US)Pounds (USDA)
0.95: Per capita consumption of cheese (US)Pounds (USDA) vs Number of people who died by becoming tangled in their bedsheetsDeaths (US) (CDC)
0.96: Per capita consumption of cheese (US)Pounds (USDA) vs year
0.98: Per capita consumption of cheese (US)Pounds (USDA) vs Per capita consumption of mozzarella cheese (US)Pounds (USDA)
...
0.88: US crude oil imports from VenezuelaMillions of barrels (Dept. of Energy) vs Per capita consumption of high fructose corn syrup (US)Pounds (USDA)
It's not at all clear what this pattern means. We used a simple expression, combinations(range(9), 2), to enumerate all the possible combinations of data. This kind of succinct, expressive technique makes it easier to focus on the data analysis issues instead of the Combinatoric algorithm considerations.
Recipes
The itertools chapter of the Python library documentation is outstanding. The basic definitions are followed by a series of recipes that are extremely clear and helpful. Since there's no reason to reproduce these, we'll reference them here. They are the required reading materials on functional programming in Python.
Section 10.1.2, Itertools Recipes, of Python Standard Library is a wonderful resource. Visit https://docs.python.org/3/library/itertools.html#itertools-recipes more details.
These function definitions aren't importable functions in the itertools modules. These are ideas that need to be read and understood and then, perhaps, copied or modified before inclusion in an application.
The following table summarizes some recipes that show functional programming algorithms built from the itertools basics:
Function Name |
Arguments |
Results |
powerset |
(iterable) |
This generates all the subsets of the iterable. Each subset is actually a tuple object, not a set instance. |
random_product |
(*args, repeat=1) |
This randomly selects from itertools.product(*args, **kwds). |
random_permutation |
(iterable, r=None) |
This randomly selects from itertools.permutations(iterable, r). |
random_combination |
(iterable, r) |
This randomly selects from itertools.combinations(iterable, r). |
Summary
In this chapter, we looked at a number of functions in the itertools module. This library module provides a number of functions that help us work with iterators in sophisticated ways.
We looked at the product() function that will compute all the possible combinations of the elements chosen from two or more collections. The permutations() function gives us different ways to reorder a given set of values. The combinations() function returns all the possible subsets of the original set.
We also looked at ways in which the product() and permutations() functions can be used naïvely to create extremely large result sets. This is an important cautionary note. A succinct and expressive algorithm can also involve a vast amount of computation. We must perform basic complexity analysis to be sure that the code will finish in a reasonable amount of time.
In the next chapter, we'll look at the functools module. This module includes some tools to work with functions as first-class objects. This builds on some material shown in Chapter 2, Introducing Some Functional Features, and Chapter 5, Higher-order Functions.