Learning Python (2013)

Part IV. Functions and Generators

Chapter 20. Comprehensions and Generations

This chapter continues the advanced function topics theme, with a reprisal of the comprehension and iteration concepts previewed in Chapter 4 and introduced in Chapter 14. Because comprehensions are as much related to the prior chapter’s functional tools (e.g., map and filter) as they are to for loops, we’ll revisit them in this context here. We’ll also take a second look at iterables in order to study generator functions and their generator expression relatives—user-defined ways to produce results on demand.

Iteration in Python also encompasses user-defined classes, but we’ll defer that final part of this story until Part VI, when we study operator overloading. As this is the last pass we’ll make over built-in iteration tools, though, we will summarize the various tools we’ve met thus far. The next chapter continues this thread by timing the relative performance of these tools as a larger case study. Before that, though, let’s continue the comprehensions and iterations story, and extend it to include value generators.

List Comprehensions and Functional Tools

As mentioned early in this book, Python supports the procedural, object-oriented, and functional programming paradigms. In fact, Python has a host of tools that most would consider functional in nature, which we enumerated in the preceding chapter—closures, generators, lambdas, comprehensions, maps, decorators, function objects, and more. These tools allow us to apply and combine functions in powerful ways, and often offer state retention and coding solutions that are alternatives to classes and OOP.

For instance, the prior chapter explored tools such as map and filter—key members of Python’s early functional programming toolset inspired by the Lisp language—that map operations over iterables and collect results. Because this is such a common task in Python coding, Python eventually sprouted a new expression—the list comprehension—that is even more flexible than the tools we just studied.

Per Python history, list comprehensions were originally inspired by a similar tool in the functional programming language Haskell, around the time of Python 2.0. In short, list comprehensions apply an arbitrary expression to items in an iterable, rather than applying a function. Accordingly, they can be more general tools. In later releases, the comprehension was extended to other roles—sets, dictionaries, and even the value generator expressions we’ll explore in this chapter. It’s not just for lists anymore.

We first met list comprehensions in Chapter 4’s preview, and studied them further in Chapter 14, in conjunction with looping statements. Because they’re also related to functional programming tools like the map and filter calls, though, we’ll resurrect the topic here for one last look. Technically, this feature is not tied to functions—as we’ll see, list comprehensions can be a more general tool than map and filter—but it is sometimes best understood by analogy to function-based alternatives.

List Comprehensions Versus map

Let’s work through an example that demonstrates the basics. As we saw in Chapter 7, Python’s built-in ord function returns the integer code point of a single character (the chr built-in is the converse—it returns the character for an integer code point). These happen to be ASCII codes if your characters fall into the ASCII character set’s 7-bit code point range:

>>> ord('s')

115

Now, suppose we wish to collect the ASCII codes of all characters in an entire string. Perhaps the most straightforward approach is to use a simple for loop and append the results to a list:

>>> res = []

>>> for x in 'spam':

        res.append(ord(x))                # Manual results collection

>>> res

[115, 112, 97, 109]

Now that we know about map, though, we can achieve similar results with a single function call without having to manage list construction in the code:

>>> res = list(map(ord, 'spam'))          # Apply function to sequence (or other)

>>> res

[115, 112, 97, 109]

However, we can get the same results from a list comprehension expression—while map maps a function over an iterable, list comprehensions map an expression over a sequence or other iterable:

>>> res = [ord(x) for x in 'spam']        # Apply expression to sequence (or other)

>>> res

[115, 112, 97, 109]

List comprehensions collect the results of applying an arbitrary expression to an iterable of values and return them in a new list. Syntactically, list comprehensions are enclosed in square brackets—to remind you that they construct lists. In their simple form, within the brackets you code an expression that names a variable followed by what looks like a for loop header that names the same variable. Python then collects the expression’s results for each iteration of the implied loop.

The effect of the preceding example is similar to that of the manual for loop and the map call. List comprehensions become more convenient, though, when we wish to apply an arbitrary expression to an iterable instead of a function:

>>> [x ** 2 for x in range(10)]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Here, we’ve collected the squares of the numbers 0 through 9 (we’re just letting the interactive prompt print the resulting list object; assign it to a variable if you need to retain it). To do similar work with a map call, we would probably need to invent a little function to implement the square operation. Because we won’t need this function elsewhere, we’d typically (but not necessarily) code it inline, with a lambda, instead of using a def statement elsewhere:

>>> list(map((lambda x: x ** 2), range(10)))

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

This does the same job, and it’s only a few keystrokes longer than the equivalent list comprehension. It’s also only marginally more complex (at least, once you understand the lambda). For more advanced kinds of expressions, though, list comprehensions will often require considerably less typing. The next section shows why.

Adding Tests and Nested Loops: filter

List comprehensions are even more general than shown so far. For instance, as we learned in Chapter 14, you can code an if clause after the for to add selection logic. List comprehensions with if clauses can be thought of as analogous to the filter built-in discussed in the preceding chapter—they skip an iterable’s items for which the if clause is not true.

To demonstrate, following are both schemes picking up even numbers from 0 to 4; like the map list comprehension alternative of the prior section, the filter version here must invent a little lambda function for the test expression. For comparison, the equivalent for loop is shown here as well:

>>> [x for x in range(5) if x % 2 == 0]

[0, 2, 4]

>>> list(filter((lambda x: x % 2 == 0), range(5)))

[0, 2, 4]

>>> res = []

>>> for x in range(5):

        if x % 2 == 0:

            res.append(x)

>>> res

[0, 2, 4]

All of these use the modulus (remainder of division) operator, %, to detect even numbers: if there is no remainder after dividing a number by 2, it must be even. The filter call here is not much longer than the list comprehension either. However, we can combine an if clause and an arbitrary expression in our list comprehension, to give it the effect of a filter and a map, in a single expression:

>>> [x ** 2 for x in range(10) if x % 2 == 0]

[0, 4, 16, 36, 64]

This time, we collect the squares of the even numbers from 0 through 9: the for loop skips numbers for which the attached if clause on the right is false, and the expression on the left computes the squares. The equivalent map call would require a lot more work on our part—we would have to combine filter selections with map iteration, making for a noticeably more complex expression:

>>> list( map((lambda x: x**2), filter((lambda x: x % 2 == 0), range(10))) )

[0, 4, 16, 36, 64]

Formal comprehension syntax

In fact, list comprehensions are more general still. In their simplest form, you must always code an accumulation expression and a single for clause:

[ expression for target in iterable ]

Though all other parts are optional, they allow richer iterations to be expressed—you can code any number of nested for loops in a list comprehension, and each may have an optional associated if test to act as a filter. The general structure of list comprehensions looks like this:

[ expression for target1 in iterable1 if condition1

             for target2 in iterable2 if condition2 ...

             for targetN in iterableN if conditionN ]

This same syntax is inherited by set and dictionary comprehensions as well as the generator expressions coming up, though these use different enclosing characters (curly braces or often-optional parentheses), and the dictionary comprehension begins with two expressions separated by a colon (for key and value).
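For reference, here is a quick sketch of those sibling forms in action; all four share the same for-clause syntax and differ only in their enclosing characters and result type (the set, dictionary, and generator variants are covered in depth later in this book):

```python
# Same comprehension syntax, different enclosing characters and result types
squares_list = [x ** 2 for x in range(5)]        # List: square brackets
squares_set = {x ** 2 for x in range(5)}         # Set: curly braces, unordered, unique
squares_dict = {x: x ** 2 for x in range(5)}     # Dictionary: key: value expressions
squares_gen = (x ** 2 for x in range(5))         # Generator: parentheses, lazy

print(squares_list)          # [0, 1, 4, 9, 16]
print(sorted(squares_set))   # [0, 1, 4, 9, 16]
print(squares_dict[3])       # 9
print(list(squares_gen))     # [0, 1, 4, 9, 16]
```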

We experimented with the if filter clause in the previous section. When for clauses are nested within a list comprehension, they work like equivalent nested for loop statements. For example:

>>> res = [x + y for x in [0, 1, 2] for y in [100, 200, 300]]

>>> res

[100, 200, 300, 101, 201, 301, 102, 202, 302]

This has the same effect as this substantially more verbose equivalent:

>>> res = []

>>> for x in [0, 1, 2]:

        for y in [100, 200, 300]:

            res.append(x + y)

>>> res

[100, 200, 300, 101, 201, 301, 102, 202, 302]

Although list comprehensions construct list results, remember that they can iterate over any sequence or other iterable type. Here’s a similar bit of code that traverses strings instead of lists of numbers, and so collects concatenation results:

>>> [x + y for x in 'spam' for y in 'SPAM']

['sS', 'sP', 'sA', 'sM', 'pS', 'pP', 'pA', 'pM',

'aS', 'aP', 'aA', 'aM', 'mS', 'mP', 'mA', 'mM']

Each for clause can have an associated if filter, no matter how deeply the loops are nested—though use cases for the following sort of code, apart from perhaps multidimensional arrays, start to become more and more difficult to imagine at this level:

>>> [x + y for x in 'spam' if x in 'sm' for y in 'SPAM' if y in ('P', 'A')]

['sP', 'sA', 'mP', 'mA']

>>> [x + y + z for x in 'spam' if x in 'sm'

               for y in 'SPAM' if y in ('P', 'A')

               for z in '123'  if z > '1']

['sP2', 'sP3', 'sA2', 'sA3', 'mP2', 'mP3', 'mA2', 'mA3']

Finally, here is a similar list comprehension that illustrates the effect of attached if selections on nested for clauses applied to numeric objects rather than strings:

>>> [(x, y) for x in range(5) if x % 2 == 0 for y in range(5) if y % 2 == 1]

[(0, 1), (0, 3), (2, 1), (2, 3), (4, 1), (4, 3)]

This expression combines even numbers from 0 through 4 with odd numbers from 0 through 4. The if clauses filter out items in each iteration. Here is the equivalent statement-based code:

>>> res = []

>>> for x in range(5):

        if x % 2 == 0:

            for y in range(5):

                if y % 2 == 1:

                    res.append((x, y))

>>> res

[(0, 1), (0, 3), (2, 1), (2, 3), (4, 1), (4, 3)]

Recall that if you’re confused about what a complex list comprehension does, you can always nest the list comprehension’s for and if clauses inside each other like this—indenting each clause successively further to the right—to derive the equivalent statements. The result is longer, but perhaps clearer in intent to some human readers on first glance, especially those more familiar with basic statements.

The map and filter equivalent of this last example would be wildly complex and deeply nested, so I won’t even try showing it here. I’ll leave its coding as an exercise for Zen masters, ex–Lisp programmers, and the criminally insane!

Example: List Comprehensions and Matrixes

Not all list comprehensions are so artificial, of course. Let’s look at one more application to stretch a few synapses. As we saw in Chapter 4 and Chapter 8, one basic way to code matrixes (a.k.a. multidimensional arrays) in Python is with nested list structures. The following, for example, defines two 3 × 3 matrixes as lists of nested lists:

>>> M = [[1, 2, 3],

         [4, 5, 6],

         [7, 8, 9]]

>>> N = [[2, 2, 2],

         [3, 3, 3],

         [4, 4, 4]]

Given this structure, we can always index rows, and columns within rows, using normal index operations:

>>> M[1]              # Row 2

[4, 5, 6]

>>> M[1][2]           # Row 2, item 3

6

List comprehensions are powerful tools for processing such structures, though, because they automatically scan rows and columns for us. For instance, although this structure stores the matrix by rows, to collect the second column we can simply iterate across the rows and pull out the desired column, or iterate through positions in the rows and index as we go:

>>> [row[1] for row in M]                          # Column 2

[2, 5, 8]

>>> [M[row][1] for row in (0, 1, 2)]               # Using offsets

[2, 5, 8]

Given positions, we can also easily perform tasks such as pulling out a diagonal. The first of the following expressions uses range to generate the list of offsets and then indexes with the row and column the same, picking out M[0][0], then M[1][1], and so on. The second scales the column index to fetch M[0][2], M[1][1], etc. (we assume the matrix has the same number of rows and columns):

>>> [M[i][i] for i in range(len(M))]               # Diagonals

[1, 5, 9]

>>> [M[i][len(M)-1-i] for i in range(len(M))]

[3, 5, 7]

Changing such a matrix in place requires assignment to offsets (use range twice if shapes differ):

>>> L = [[1, 2, 3], [4, 5, 6]]

>>> for i in range(len(L)):

        for j in range(len(L[i])):                 # Update in place

            L[i][j] += 10

>>> L

[[11, 12, 13], [14, 15, 16]]

We can’t really do the same with list comprehensions, as they make new lists, but we could always assign their results to the original name for a similar effect. For example, we can apply an operation to every item in a matrix, producing results in either a simple vector or a matrix of the same shape:

>>> [col + 10 for row in M for col in row]         # Assign to M to retain new value

[11, 12, 13, 14, 15, 16, 17, 18, 19]

>>> [[col + 10 for col in row] for row in M]

[[11, 12, 13], [14, 15, 16], [17, 18, 19]]

To understand these, translate to their simple statement form equivalents that follow—indent parts that are further to the right in the expression (as in the first loop in the following), and make a new list when comprehensions are nested on the left (like the second loop in the following). As its statement equivalent makes clearer, the second expression in the preceding works because the row iteration is an outer loop: for each row, it runs the nested column iteration to build up one row of the result matrix:

>>> res = []

>>> for row in M:                                  # Statement equivalents

        for col in row:                            # Indent parts further right

            res.append(col + 10)

>>> res

[11, 12, 13, 14, 15, 16, 17, 18, 19]

>>> res = []

>>> for row in M:

        tmp = []                                   # Left-nesting starts new list

        for col in row:

            tmp.append(col + 10)

        res.append(tmp)

>>> res

[[11, 12, 13], [14, 15, 16], [17, 18, 19]]

Finally, with a bit of creativity, we can also use list comprehensions to combine values of multiple matrixes. The following first builds a flat list that contains the result of multiplying the matrixes pairwise, and then builds a nested list structure having the same values by nesting list comprehensions again:

>>> M

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

>>> N

[[2, 2, 2], [3, 3, 3], [4, 4, 4]]

>>> [M[row][col] * N[row][col] for row in range(3) for col in range(3)]

[2, 4, 6, 12, 15, 18, 28, 32, 36]

>>> [[M[row][col] * N[row][col] for col in range(3)] for row in range(3)]

[[2, 4, 6], [12, 15, 18], [28, 32, 36]]

This last expression works because the row iteration is an outer loop again; it’s equivalent to this statement-based code:

res = []

for row in range(3):

    tmp = []

    for col in range(3):

        tmp.append(M[row][col] * N[row][col])

    res.append(tmp)

And for more fun, we can use zip to pair items to be multiplied—the following comprehension and loop statement forms both produce the same list-of-lists pairwise multiplication result as the last preceding example (and because zip is a generator of values in 3.X, this isn’t as inefficient as it may seem):

[[col1 * col2 for (col1, col2) in zip(row1, row2)] for (row1, row2) in zip(M, N)]

res = []

for (row1, row2) in zip(M, N):

    tmp = []

    for (col1, col2) in zip(row1, row2):

        tmp.append(col1 * col2)

    res.append(tmp)

Compared to their statement equivalents, the list comprehension versions here require only one line of code, might run substantially faster for large matrixes, and just might make your head explode! Which brings us to the next section.

Don’t Abuse List Comprehensions: KISS

With such generality, list comprehensions can quickly become, well, incomprehensible, especially when nested. Some programming tasks are inherently complex, and we can’t sugarcoat them to make them any simpler than they are (see the upcoming permutations for a prime example). Tools like comprehensions are powerful solutions when used wisely, and there’s nothing inherently wrong with using them in your scripts.

At the same time, code like that of the prior section may push the complexity envelope more than it should—and, frankly, tends to disproportionately pique the interest of those holding the darker and misguided assumption that code obfuscation somehow implies talent. Because such tools tend to appeal to some people more than they probably should, I need to be clear about their scope here.

This book demonstrates advanced comprehensions to teach, but in the real world, using complicated and tricky code where not warranted is both bad engineering and bad software citizenship. To repurpose a line from the first chapter: programming is not about being clever and obscure—it’s about how clearly your program communicates its purpose.

Or, to quote from Python’s import this motto:

Simple is better than complex.

Writing complicated comprehension code may be a fun academic recreation, but it doesn’t have a place in programs that others will someday need to understand.

Consequently, my advice is to use simple for loops when getting started with Python, and comprehensions or map in isolated cases where they are easy to apply. The “keep it simple” rule applies here as always: code conciseness is a much less important goal than code readability. If you have to translate code to statements to understand it, it should probably be statements in the first place. In other words, the age-old acronym KISS still applies: Keep It Simple—followed either by a word that is today too sexist (Sir), or another that is too colorful for a family-oriented book like this...

On the other hand: performance, conciseness, expressiveness

However, in this case, there is currently a substantial performance advantage to the extra complexity: based on tests run under Python today, map calls can be twice as fast as equivalent for loops, and list comprehensions are often faster than map calls. This speed difference can vary per usage pattern and Python, but is generally due to the fact that map and list comprehensions run at C language speed inside the interpreter, which is often much faster than stepping through Python for loop bytecode within the PVM.

In addition, list comprehensions offer a code conciseness that’s compelling and even warranted when that reduction in size doesn’t also imply a reduction in meaning for the next programmer. Moreover, many find the expressiveness of comprehensions to be a powerful ally. Because map and list comprehensions are both expressions, they also can show up syntactically in places that for loop statements cannot, such as in the bodies of lambda functions, within list and dictionary literals, and more.

Because of this, list comprehensions and map calls are worth knowing and using for simpler kinds of iterations, especially if your application’s speed is an important consideration. Still, because for loops make logic more explicit, they are generally recommended on the grounds of simplicity, and often make for more straightforward code. When used, you should try to keep your map calls and list comprehensions simple; for more complex tasks, use full statements instead.
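If you want a rough measurement of these trade-offs on your own machine before the next chapter's fuller treatment, the standard library's timeit module supports a quick sketch like the following. The statement strings here are illustrative choices, and absolute results will vary per Python version and platform:

```python
import timeit

# Three coding alternatives for collecting absolute values of 1,000 numbers
stmts = [
    ('for loop', 'res = []\nfor x in range(1000): res.append(abs(x))'),
    ('map', 'res = list(map(abs, range(1000)))'),
    ('listcomp', 'res = [abs(x) for x in range(1000)]'),
]

for label, stmt in stmts:
    best = min(timeit.repeat(stmt, number=1000, repeat=3))  # Best of 3 runs
    print('%-8s %.4f seconds' % (label, best))
```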

NOTE

As I’ve stated before, performance generalizations like those just given here can depend on call patterns, as well as changes and optimizations in Python itself. Recent Python releases have sped up the simple for loop statement, for example. On some code, though, list comprehensions are still substantially faster than for loops and even faster than map, though map can still win when the alternatives must apply a function call, built-in functions or otherwise. At least until this story changes arbitrarily—to time these alternatives yourself, see tools in the standard library’s time module or in the newer timeit module added in Release 2.4, or stay tuned for the extended coverage of both of these in the next chapter, where we’ll prove the prior paragraph’s claims.

WHY YOU WILL CARE: LIST COMPREHENSIONS AND MAP

Here are some more realistic examples of list comprehensions and map in action. We solved the first with list comprehensions in Chapter 14, but we’ll revive it here to add map alternatives. Recall that the file readlines method returns lines with \n end-of-line characters at the ends (the following assumes a 3-line text file in the current directory):

>>> open('myfile').readlines()

['aaa\n', 'bbb\n', 'ccc\n']

If you don’t want the end-of-line characters, you can slice them off all the lines in a single step with a list comprehension or a map call (map results are iterables in Python 3.X, so we must run them through list to display all their results at once):

>>> [line.rstrip() for line in open('myfile').readlines()]

['aaa', 'bbb', 'ccc']

>>> [line.rstrip() for line in open('myfile')]

['aaa', 'bbb', 'ccc']

>>> list(map((lambda line: line.rstrip()), open('myfile')))

['aaa', 'bbb', 'ccc']

The last two of these make use of file iterators; as we saw in Chapter 14, this means that you don’t need a method call to read lines in iteration contexts such as these. The map call is slightly longer than the list comprehension, but neither has to manage result list construction explicitly.

A list comprehension can also be used as a sort of column projection operation. Python’s standard SQL database API returns query results as a sequence of sequences like the following—the list is the table, tuples are rows, and items in tuples are column values:

>>> listoftuple = [('bob', 35, 'mgr'), ('sue', 40, 'dev')]

A for loop could pick up all the values from a selected column manually, but map and list comprehensions can do it in a single step, and faster:

>>> [age for (name, age, job) in listoftuple]

[35, 40]

>>> list(map((lambda row: row[1]), listoftuple))

[35, 40]

The first of these makes use of tuple assignment to unpack row tuples in the list, and the second uses indexing. In Python 2.X (but not in 3.X—see the note on 2.X argument unpacking in Chapter 18), map can use tuple unpacking on its argument, too:

# 2.X only

>>> list(map((lambda (name, age, job): age), listoftuple))

[35, 40]

See other books and resources for more on Python’s database API.

Besides the distinction between running functions versus expressions, the biggest difference between map and list comprehensions in Python 3.X is that map is an iterable, generating results on demand. To achieve the same memory economy and execution time division, list comprehensions must be coded as generator expressions—a major topic of this chapter.
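This on-demand behavior of map in 3.X is easy to see directly: a map object produces items only as they are requested, and supports just one pass through its results:

```python
m = map(ord, 'spam')    # A map object in 3.X: no results computed yet
print(next(m))          # 115 -- first result, produced on demand
print(next(m))          # 112 -- second result
print(list(m))          # [97, 109] -- collect the rest, exhausting the iterator
print(list(m))          # [] -- one-shot iteration: already consumed
```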

Generator Functions and Expressions

Python today supports procrastination much more than it did in the past—it provides tools that produce results only when needed, instead of all at once. We’ve seen this at work in built-in tools: files that read lines on request, and functions like map and zip that produce items on demand in 3.X. Such laziness isn’t confined to Python itself, though. In particular, two language constructs delay result creation whenever possible in user-defined operations:

§  Generator functions (available since 2.3) are coded as normal def statements, but use yield statements to return results one at a time, suspending and resuming their state between each.

§  Generator expressions (available since 2.4) are similar to the list comprehensions of the prior section, but they return an object that produces results on demand instead of building a result list.

Because neither constructs a result list all at once, they save memory space and allow computation time to be split across result requests. As we’ll see, both of these ultimately perform their delayed-results magic by implementing the iteration protocol we studied in Chapter 14.

These features are not new (generator functions were available as an option as early as Python 2.2), and are fairly common in Python code today. Python’s notion of generators owes much to other programming languages, especially Icon. Though they may initially seem unusual if you’re accustomed to simpler programming models, you’ll probably find generators to be a powerful tool where applicable. Moreover, because they are a natural extension to the function, comprehension, and iteration ideas we’ve already explored, you already know more about coding generators than you might expect.

Generator Functions: yield Versus return

In this part of the book, we’ve learned about coding normal functions that receive input parameters and send back a single result immediately. It is also possible, however, to write functions that may send back a value and later be resumed, picking up where they left off. Such functions, available in both Python 2.X and 3.X, are known as generator functions because they generate a sequence of values over time.

Generator functions are like normal functions in most respects, and in fact are coded with normal def statements. However, when created, they are compiled specially into an object that supports the iteration protocol. And when called, they don’t return a result: they return a result generator that can appear in any iteration context. We studied iterables in Chapter 14, and Figure 14-1 gave a formal and graphic summary of their operation. Here, we’ll revisit them to see how they relate to generators.

State suspension

Unlike normal functions that return a value and exit, generator functions automatically suspend and resume their execution and state around the point of value generation. Because of that, they are often a useful alternative to both computing an entire series of values up front and manually saving and restoring state in classes. The state that generator functions retain when they are suspended includes both their code location, and their entire local scope. Hence, their local variables retain information between results, and make it available when the functions are resumed.

The chief code difference between generator and normal functions is that a generator yields a value, rather than returning one—the yield statement suspends the function and sends a value back to the caller, but retains enough state to enable the function to resume from where it left off. When resumed, the function continues execution immediately after the last yield run. From the function’s perspective, this allows its code to produce a series of values over time, rather than computing them all at once and sending them back in something like a list.

Iteration protocol integration

To truly understand generator functions, you need to know that they are closely bound up with the notion of the iteration protocol in Python. As we’ve seen, iterator objects define a __next__ method (next in 2.X), which either returns the next item in the iteration, or raises the special StopIteration exception to end the iteration. An iterable object’s iterator is fetched initially with the iter built-in function, though this step is a no-op for objects that are their own iterator.

Python for loops, and all other iteration contexts, use this iteration protocol to step through a sequence or value generator, if the protocol is supported (if not, iteration falls back on repeatedly indexing sequences instead). Any object that supports this interface works in all iteration tools.
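The protocol is easy to exercise by hand, which helps demystify what for loops and other iteration tools do automatically:

```python
L = [1, 2, 3]
I = iter(L)        # Fetch the list's iterator object
print(next(I))     # 1 -- next(I) invokes I.__next__() in 3.X
print(next(I))     # 2
print(next(I))     # 3
try:
    next(I)        # The StopIteration exception ends the scan
except StopIteration:
    print('iteration over')
```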

To support this protocol, functions containing a yield statement are compiled specially as generators—they are not normal functions, but rather are built to return an object with the expected iteration protocol methods. When later called, they return a generator object that supports the iteration interface with an automatically created method named __next__ to start or resume execution.

Generator functions may also have a return statement that, along with falling off the end of the def block, simply terminates the generation of values—technically, by raising a StopIteration exception after any normal function exit actions. From the caller’s perspective, the generator’s __next__ method resumes the function and runs until either the next yield result is returned or a StopIteration is raised.

The net effect is that generator functions, coded as def statements containing yield statements, are automatically made to support the iteration object protocol and thus may be used in any iteration context to produce results over time and on demand.

NOTE

As noted in Chapter 14, in Python 2.X, iterator objects define a method named next instead of __next__. This includes the generator objects we are using here. In 3.X this method is renamed to __next__. The next built-in function is provided as a convenience and portability tool: next(I) is the same as I.__next__() in 3.X and I.next() in 2.6 and 2.7. Prior to 2.6, programs simply call I.next() instead to iterate manually.

Generator functions in action

To illustrate generator basics, let’s turn to some code. The following code defines a generator function that can be used to generate the squares of a series of numbers over time:

>>> def gensquares(N):

        for i in range(N):

            yield i ** 2        # Resume here later

This function yields a value, and so returns to its caller, each time through the loop; when it is resumed, its prior state is restored, including the last values of its variables i and N, and control picks up again immediately after the yield statement. For example, when it’s used in the body of a for loop, the first iteration starts the function and gets its first result; thereafter, control returns to the function after its yield statement each time through the loop:

>>> for i in gensquares(5):     # Resume the function

        print(i, end=' : ')     # Print last yielded value

0 : 1 : 4 : 9 : 16 :

>>> 

To end the generation of values, functions either use a return statement with no value or simply allow control to fall off the end of the function body.
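To illustrate that early exit, here is a small variation on the example (with a hypothetical name coined for this sketch) that uses a bare return to cut the series short once the squares pass a limit:

```python
def gensquares_upto(n, limit):
    """Yield squares of 0..n-1, but stop once they exceed limit."""
    for i in range(n):
        square = i ** 2
        if square > limit:
            return            # Bare return ends generation: StopIteration in caller
        yield square

print(list(gensquares_upto(10, 20)))    # [0, 1, 4, 9, 16]
```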

To most people, this process seems a bit implicit (if not magical) on first encounter. It’s actually quite tangible, though. If you really want to see what is going on inside the for, call the generator function directly:

>>> x = gensquares(4)

>>> x

<generator object gensquares at 0x000000000292CA68>

You get back a generator object that supports the iteration protocol we met in Chapter 14—the generator function was compiled to return this automatically. The returned generator object in turn has a __next__ method that starts the function or resumes it from where it last yielded a value, and raises a StopIteration exception when the end of the series of values is reached and the function returns. For convenience, the next(X) built-in calls an object’s X.__next__() method for us in 3.X (and X.next() in 2.X):

>>> next(x)                     # Same as x.__next__() in 3.X

0

>>> next(x)                     # Use x.next() or next() in 2.X

1

>>> next(x)

4

>>> next(x)

9

>>> next(x)

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

StopIteration

As we learned in Chapter 14, for loops (and other iteration contexts) work with generators in the same way—by calling the __next__ method repeatedly, until an exception is caught. For a generator, the result is to produce yielded values over time. If the object to be iterated over does not support this protocol, for loops instead use the indexing protocol to iterate.

Notice that the top-level iter call of the iteration protocol isn’t required here because generators are their own iterators, supporting just one active iteration scan. To put that another way, generators return themselves for iter because they support next directly. This also holds true in the generator expressions we’ll meet later in this chapter (more on this ahead):

>>> y = gensquares(5)           # Returns a generator which is its own iterator

>>> iter(y) is y                # iter() is not required: a no-op here

True

>>> next(y)                     # Can run next() immediately

0

Why generator functions?

Given the simple examples we’re using to illustrate fundamentals, you might be wondering just why you’d ever care to code a generator at all. In this section’s example, for instance, we could also simply build the list of yielded values all at once:

>>> def buildsquares(n):

        res = []

        for i in range(n): res.append(i ** 2)

        return res

>>> for x in buildsquares(5): print(x, end=' : ')

0 : 1 : 4 : 9 : 16 :

For that matter, we could use any of the for loop, map, or list comprehension techniques:

>>> for x in [n ** 2 for n in range(5)]:

        print(x, end=' : ')

0 : 1 : 4 : 9 : 16 :

>>> for x in map((lambda n: n ** 2), range(5)):

        print(x, end=' : ')

0 : 1 : 4 : 9 : 16 :

However, generators can be better in terms of both memory use and performance in larger programs. They allow functions to avoid doing all the work up front, which is especially useful when the result lists are large or when it takes a lot of computation to produce each value. Generators distribute the time required to produce the series of values among loop iterations.

Moreover, for more advanced uses, generators can provide a simpler alternative to manually saving the state between iterations in class objects—with generators, variables accessible in the function’s scopes are saved and restored automatically.[41] We’ll discuss class-based iterables in more detail in Part VI.
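To make the state-retention contrast concrete, the following sketch codes a running-total scan both ways (the names here are invented for illustration). The generator’s total and loop position are saved and restored between yields automatically, while the equivalent class must manage the same state explicitly in instance attributes:

```python
def running_sum(seq):
    """Generator: total and loop position persist between yields for free."""
    total = 0
    for x in seq:
        total += x
        yield total

class RunningSum:
    """The same behavior as a class: state must be kept manually in self."""
    def __init__(self, seq):
        self.it = iter(seq)
        self.total = 0
    def __iter__(self):
        return self
    def __next__(self):
        self.total += next(self.it)   # StopIteration propagates at the end
        return self.total

print(list(running_sum([1, 2, 3, 4])))    # [1, 3, 6, 10]
print(list(RunningSum([1, 2, 3, 4])))     # [1, 3, 6, 10]
```

Both produce the same stream; the generator simply lets Python do the bookkeeping.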

Generator functions are also much more broadly focused than implied so far. They can operate on and return any type of object, and as iterables may appear in any of Chapter 14’s iteration contexts, including tuple calls, enumerations, and dictionary comprehensions:

>>> def ups(line):

        for sub in line.split(','):               # Substring generator

            yield sub.upper()

>>> tuple(ups('aaa,bbb,ccc'))                     # All iteration contexts

('AAA', 'BBB', 'CCC')

>>> {i: s for (i, s) in enumerate(ups('aaa,bbb,ccc'))}

{0: 'AAA', 1: 'BBB', 2: 'CCC'}

In a moment we’ll see the same assets for generator expressions—a tool that trades function flexibility for comprehension conciseness. Later in this chapter we’ll also see that generators can sometimes make the impossible possible, by producing components of result sets that would be far too large to create all at once. First, though, let’s explore some advanced generator function features.

Extended generator function protocol: send versus next

In Python 2.5, a send method was added to the generator function protocol. The send method advances to the next item in the series of results, just like __next__, but also provides a way for the caller to communicate with the generator, to affect its operation.

Technically, yield is now an expression form that returns the item passed to send, not a statement (though it can be called either way—as yield X, or A = (yield X)). The expression must be enclosed in parentheses unless it’s the only item on the right side of the assignment statement. For example, X = yield Y is OK, as is X = (yield Y) + 42.

When this extra protocol is used, values are sent into a generator G by calling G.send(value). The generator’s code is then resumed, and the yield expression in the generator returns the value passed to send. If the regular G.__next__() method (or its next(G) equivalent) is called to advance, the yield simply returns None. For example:

>>> def gen():

       for i in range(10):

           X = yield i

           print(X)

>>> G = gen()

>>> next(G)              # Must call next() first, to start generator

0

>>> G.send(77)           # Advance, and send value to yield expression

77

1

>>> G.send(88)

88

2

>>> next(G)              # next() and X.__next__() send None

None

3

The send method can be used, for example, to code a generator that its caller can terminate by sending a termination code, or redirect by passing a new position in data being processed inside the generator.
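As one sketch of the termination-code idea (the averaging logic and the None sentinel here are invented for illustration, not a standard protocol), a caller can drive a generator with send and stop it by sending an agreed-upon value:

```python
def averager():
    """Average the values sent in, until the caller sends None."""
    total, count = 0, 0
    value = yield               # Wait for the first send
    while value is not None:
        total += value
        count += 1
        value = yield total / count
    # Falling off the end raises StopIteration in the caller

avg = averager()
next(avg)                  # Prime the generator: run to the first yield
print(avg.send(10))        # 10.0
print(avg.send(20))        # 15.0
```

Note the priming next call: a generator must be advanced to its first yield before send can deliver a value to it.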

In addition, generators in 2.5 and later also support a throw(type) method to raise an exception inside the generator at the latest yield, and a close method that raises a special GeneratorExit exception inside the generator to terminate the iteration entirely. These are advanced features that we won’t delve into in more detail here; see reference texts and Python’s standard manuals for more information, and watch for more on exceptions in Part VII.
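For a brief taste of these two methods (the counter and its reset-on-ValueError behavior are invented for this sketch), throw resumes the generator with an exception raised at its paused yield, and close triggers any cleanup code on the way out:

```python
def counter():
    """Count upward; reset on an injected ValueError, clean up on close."""
    n = 0
    try:
        while True:
            try:
                yield n
                n += 1
            except ValueError:      # Raised inside by G.throw(ValueError)
                print('resetting')
                n = 0
    finally:                        # Runs for GeneratorExit on G.close()
        print('closed')

G = counter()
print(next(G))                 # 0
print(next(G))                 # 1
print(G.throw(ValueError))     # Prints 'resetting', resumes, yields 0
G.close()                      # Prints 'closed'
```

Because the generator catches ValueError and yields again, throw returns the next yielded value; GeneratorExit, by contrast, is not caught here, so close ends the iteration after the finally clause runs.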

Note that while Python 3.X provides a next(X) convenience built-in that calls the X.__next__() method of an object, other generator methods, like send, must be called as methods of generator objects directly (e.g., G.send(X)). This makes sense if you realize that these extra methods are implemented on built-in generator objects only, whereas the __next__ method applies to all iterable objects—both built-in types and user-defined classes.

Also note that Python 3.3 introduces an extension to yield—a from clause—that allows generators to delegate to nested generators. Since this is an extension to what is already a fairly advanced topic, we’ll delegate this topic itself to a sidebar, and move on here to a tool that’s close enough to be called a twin.

Generator Expressions: Iterables Meet Comprehensions

Because the delayed evaluation of generator functions was so useful, it eventually spread to other tools. In both Python 2.X and 3.X, the notions of iterables and list comprehensions are combined in a new tool: generator expressions. Syntactically, generator expressions are just like normal list comprehensions, and support all their syntax—including if filters and loop nesting—but they are enclosed in parentheses instead of square brackets (like tuples, their enclosing parentheses are often optional):

>>> [x ** 2 for x in range(4)]          # List comprehension: build a list

[0, 1, 4, 9]

>>> (x ** 2 for x in range(4))          # Generator expression: make an iterable

<generator object <genexpr> at 0x00000000029A8288>

In fact, at least on a functionality basis, coding a list comprehension is essentially the same as wrapping a generator expression in a list built-in call to force it to produce all its results in a list at once:

>>> list(x ** 2 for x in range(4))      # List comprehension equivalence

[0, 1, 4, 9]

Operationally, however, generator expressions are very different: instead of building the result list in memory, they return a generator object—an automatically created iterable. This iterable object in turn supports the iteration protocol to yield one piece of the result list at a time in any iteration context. The iterable object also retains generator state while active—the variable x in the preceding expressions, along with the generator’s code location.

The net effect is much like that of generator functions, but in the context of a comprehension expression: we get back an object that remembers where it left off after each part of its result is returned. Also like generator functions, looking under the hood at the protocol that these objects automatically support can help demystify them; the iter call is again not required at the top here, for reasons we’ll expand on ahead:

>>> G = (x ** 2 for x in range(4))

>>> iter(G) is G                           # iter(G) optional: __iter__ returns self

True

>>> next(G)                                # Generator objects: automatic methods

0

>>> next(G)

1

>>> next(G)

4

>>> next(G)

9

>>> next(G)

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

StopIteration

>>> G

<generator object <genexpr> at 0x00000000029A8318>

Again, we don’t typically see the next iterator machinery under the hood of a generator expression like this because for loops trigger it for us automatically:

>>> for num in (x ** 2 for x in range(4)):          # Calls next() automatically

        print('%s, %s' % (num, num / 2.0))

0, 0.0

1, 0.5

4, 2.0

9, 4.5

As we’ve already learned, every iteration context does this—including for loops; the sum, map, and sorted built-in functions; list comprehensions; and other iteration contexts we learned about in Chapter 14, such as the any, all, and list built-in functions. As iterables, generator expressions can appear in any of these iteration contexts, just like the result of a generator function call.

For example, the following deploys generator expressions in the string join method call and tuple assignment, iteration contexts both. In the first test here, join runs the generator and joins the substrings it produces with nothing between—to simply concatenate:

>>> ''.join(x.upper() for x in 'aaa,bbb,ccc'.split(','))

'AAABBBCCC'

>>> a, b, c = (x + '\n' for x in 'aaa,bbb,ccc'.split(','))

>>> a, c

('aaa\n', 'ccc\n')

Notice how the join call in the preceding doesn’t require extra parentheses around the generator. Syntactically, parentheses are not required around a generator expression that is the sole item already enclosed in parentheses used for other purposes—like those of a function call. Parentheses are required in all other cases, however, even if they seem extra, as in the second call to sorted that follows:

>>> sum(x ** 2 for x in range(4))                           # Parens optional

14

>>> sorted(x ** 2 for x in range(4))                        # Parens optional

[0, 1, 4, 9]

>>> sorted((x ** 2 for x in range(4)), reverse=True)        # Parens required

[9, 4, 1, 0]

Like the often-optional parentheses in tuples, there is no widely accepted rule on this, though a generator expression does not have as clear a role as a fixed collection of other objects as a tuple, making extra parentheses seem perhaps more spurious here.

Why generator expressions?

Just like generator functions, generator expressions are a memory-space optimization—they do not require the entire result list to be constructed all at once, as the square-bracketed list comprehension does. Also like generator functions, they divide the work of results production into smaller time slices—they yield results in piecemeal fashion, instead of making the caller wait for the full set to be created in a single call.

On the other hand, generator expressions may also run slightly slower than list comprehensions in practice, so they are probably best used only for very large result sets, or applications that cannot wait for full results generation. A more authoritative statement about performance, though, will have to await the timing scripts we’ll code in the next chapter.

Though more subjective, generator expressions offer coding advantages too—as the next sections show.

Generator expressions versus map

One way to see the coding benefits of generator expressions is to compare them to other functional tools, as we did for list comprehensions. For example, generator expressions often are equivalent to 3.X map calls, because both generate result items on request. Like list comprehensions, though, generator expressions may be simpler to code when the operation applied is not a function call. In 2.X, map makes temporary lists and generator expressions do not, but the same coding comparisons apply:

>>> list(map(abs, (-1, -2, 3, 4)))                          # Map function on tuple

[1, 2, 3, 4]

>>> list(abs(x) for x in (-1, -2, 3, 4))                    # Generator expression

[1, 2, 3, 4]

>>> list(map(lambda x: x * 2, (1, 2, 3, 4)))                # Nonfunction case

[2, 4, 6, 8]

>>> list(x * 2 for x in (1, 2, 3, 4))                       # Simpler as generator?

[2, 4, 6, 8]

The same holds true for text-processing use cases like the join call we saw earlier—a list comprehension makes an extra temporary list of results, which is completely pointless in this context because the list is not retained, and map loses simplicity points compared to generator expression syntax when the operation being applied is not a call:

>>> line = 'aaa,bbb,ccc'

>>> ''.join([x.upper() for x in line.split(',')])           # Makes a pointless list

'AAABBBCCC'

>>> ''.join(x.upper() for x in line.split(','))             # Generates results

'AAABBBCCC'

>>> ''.join(map(str.upper, line.split(',')))                # Generates results

'AAABBBCCC'

>>> ''.join(x * 2 for x in line.split(','))                 # Simpler as generator?

'aaaaaabbbbbbcccccc'

>>> ''.join(map(lambda x: x * 2, line.split(',')))

'aaaaaabbbbbbcccccc'

Both map and generator expressions can also be arbitrarily nested, which supports general use in programs, and requires a list call or other iteration context to start the process of producing results. For example, the list comprehension in the following produces the same result as the 3.X map and generator equivalents that follow it, but makes two physical lists; the others generate just one integer at a time with nested generators, and the generator expression form may more clearly reflect its intent:

>>> [x * 2 for x in [abs(x) for x in (-1, -2, 3, 4)]]       # Nested comprehensions

[2, 4, 6, 8]

>>> list(map(lambda x: x * 2, map(abs, (-1, -2, 3, 4))))    # Nested maps

[2, 4, 6, 8]

>>> list(x * 2 for x in (abs(x) for x in (-1, -2, 3, 4)))   # Nested generators

[2, 4, 6, 8]

Although the effect of all three of these is to combine operations, the generators do so without making multiple temporary lists. In 3.X, the next example both nests and combines generators—the nested generator expression is activated by map, which in turn is only activated by list:

>>> import math

>>> list(map(math.sqrt, (x ** 2 for x in range(4))))        # Nested combinations

[0.0, 1.0, 2.0, 3.0]

Technically speaking, the range on the right in the preceding is a value generator in 3.X too, activated by the generator expression itself—three levels of value generation, which produce individual values from inner to outer only on request, and which “just works” because of Python’s iteration tools and protocol. In fact, generator nestings can be arbitrarily mixed and deep, though some may be more valid than others:

>>> list(map(abs, map(abs, map(abs, (-1, 0, 1)))))          # Nesting gone bad?

[1, 0, 1]

>>> list(abs(x) for x in (abs(x) for x in (abs(x) for x in (-1, 0, 1))))

[1, 0, 1]

These last examples illustrate how general generators can be, but are also coded in an intentionally complex form to underscore that generator expressions have the same potential for abuse as the list comprehensions discussed earlier—as usual, you should keep them simple unless they must be complex, a theme we’ll revisit later in this chapter.

When used well, though, generator expressions combine the expressiveness of list comprehensions with the space and time benefits of other iterables. Here, for example, nonnested approaches provide simpler solutions but still leverage generators’ strengths—per a Python motto, flat is generally better than nested:

>>> list(abs(x) * 2 for x in (-1, -2, 3, 4))                # Unnested equivalents

[2, 4, 6, 8]

>>> list(math.sqrt(x ** 2) for x in range(4))               # Flat is often better

[0.0, 1.0, 2.0, 3.0]

>>> list(abs(x) for x in (-1, 0, 1))

[1, 0, 1]

Generator expressions versus filter

Generator expressions also support all the usual list comprehension syntax—including if clauses, which work like the filter call we met earlier. Because filter is an iterable in 3.X that generates its results on request, a generator expression with an if clause is operationally equivalent (in 2.X, filter produces a temporary list that the generator does not, but the code comparisons again apply). Again, the join in the following suffices to force all forms to produce their results:

>>> line = 'aa bbb c'

>>> ''.join(x for x in line.split() if len(x) > 1)          # Generator with 'if'

'aabbb'

>>> ''.join(filter(lambda x: len(x) > 1, line.split()))     # Similar to filter

'aabbb'

The generator seems marginally simpler than the filter here. As for list comprehensions, though, adding processing steps to filter results requires a map too, which makes filter noticeably more complex than a generator expression:

>>> ''.join(x.upper() for x in line.split() if len(x) > 1)

'AABBB'

>>> ''.join(map(str.upper, filter(lambda x: len(x) > 1, line.split())))

'AABBB'

In effect, generator expressions do for 3.X iterables like map and filter what list comprehensions do for the 2.X list-builder flavors of these calls—they provide more general coding structures that do not rely on functions, but still delay results production. Also like list comprehensions, there is always a statement-based equivalent to a generator expression, though it sometimes renders substantially more code:

>>> ''.join(x.upper() for x in line.split() if len(x) > 1)

'AABBB'

>>> res = ''

>>> for x in line.split():                                   # Statement equivalent?

        if len(x) > 1:                                       # This is also a join

            res += x.upper()

>>> res

'AABBB'

In this case, though, the statement form isn’t quite the same—it cannot produce items one at a time, and it’s also emulating the effect of the join that forces results to be produced all at once. The true equivalent to a generator expression would be a generator function with a yield, as the next section shows.

Generator Functions Versus Generator Expressions

Let’s recap what we’ve covered so far in this section:

Generator functions

A function def statement that contains a yield statement is turned into a generator function. When called, it returns a new generator object with automatic retention of local scope and code position; an automatically created __iter__ method that simply returns itself; and an automatically created __next__ method (next in 2.X) that starts the function or resumes it where it last left off, and raises StopIteration when finished producing results.

Generator expressions

A comprehension expression enclosed in parentheses is known as a generator expression. When run, it returns a new generator object with the same automatically created method interface and state retention as a generator function call’s results—with an __iter__ method that simply returns itself; and a __next__ method (next in 2.X) that starts the implied loop or resumes it where it last left off, and raises StopIteration when finished producing results.

The net effect is to produce results on demand in iteration contexts that employ these interfaces automatically.

As implied by some of the preceding sections, the same iteration can often be coded with either a generator function or a generator expression. The following generator expression, for example, repeats each character in a string four times:

>>> G = (c * 4 for c in 'SPAM')           # Generator expression

>>> list(G)                               # Force generator to produce all results

['SSSS', 'PPPP', 'AAAA', 'MMMM']

The equivalent generator function requires slightly more code, but as a multiple-statement function it will be able to code more logic and use more state information if needed. In fact, this is essentially the same as the prior chapter’s tradeoff between lambda and def—expression conciseness versus statement power:

>>> def timesfour(S):                     # Generator function

        for c in S:

            yield c * 4

>>> G = timesfour('spam')

>>> list(G)                               # Iterate automatically

['ssss', 'pppp', 'aaaa', 'mmmm']

To clients, the two are more similar than different. Both expressions and functions support both automatic and manual iteration—the prior list call iterates automatically, and the following iterate manually:

>>> G = (c * 4 for c in 'SPAM')

>>> I = iter(G)                           # Iterate manually (expression)

>>> next(I)

'SSSS'

>>> next(I)

'PPPP'

>>> G = timesfour('spam')

>>> I = iter(G)                           # Iterate manually (function)

>>> next(I)

'ssss'

>>> next(I)

'pppp'

In either case, Python automatically creates a generator object, which has both the methods required by the iteration protocol, and state retention for variables in the generator’s code and its current code location. Notice how we make new generators here to iterate again—as explained in the next section, generators are one-shot iterators.

First, though, here’s the true statement-based equivalent of the expression at the end of the prior section: a function that yields values—though the difference is irrelevant if the code using it produces all results with a tool like join:

>>> line = 'aa bbb c'

>>> ''.join(x.upper() for x in line.split() if len(x) > 1)     # Expression

'AABBB'

>>> def gensub(line):                                          # Function

        for x in line.split():

            if len(x) > 1:

                yield x.upper()

>>> ''.join(gensub(line))                                      # But why generate?

'AABBB'

Though generators have valid roles, in cases like this the use of generators over the simple statement equivalent shown earlier may be difficult to justify, except on stylistic grounds. On the other hand, trading four lines for one may strike many as fairly compelling stylistic grounds!

Generators Are Single-Iteration Objects

A subtle but important point: both generator functions and generator expressions are their own iterators and thus support just one active iteration—unlike some built-in types, you can’t have multiple iterators of either positioned at different locations in the set of results. Because of this, a generator’s iterator is the generator itself; in fact, as suggested earlier, calling iter on a generator expression or function is an optional no-op:

>>> G = (c * 4 for c in 'SPAM')

>>> iter(G) is G                          # My iterator is myself: G has __next__

True

If you iterate over the results stream manually with multiple iterators, they will all point to the same position:

>>> G = (c * 4 for c in 'SPAM')           # Make a new generator

>>> I1 = iter(G)                          # Iterate manually

>>> next(I1)

'SSSS'

>>> next(I1)

'PPPP'

>>> I2 = iter(G)                          # Second iterator at same position!

>>> next(I2)

'AAAA'

Moreover, once any iteration runs to completion, all are exhausted—we have to make a new generator to start again:

>>> list(I1)                              # Collect the rest of I1's items

['MMMM']

>>> next(I2)                              # Other iterators exhausted too

StopIteration

>>> I3 = iter(G)                          # Ditto for new iterators

>>> next(I3)

StopIteration

>>> I3 = iter(c * 4 for c in 'SPAM')      # New generator to start over

>>> next(I3)

'SSSS'

The same holds true for generator functions—the following def statement-based equivalent supports just one active iterator and is exhausted after one pass:

>>> def timesfour(S):

        for c in S:

            yield c * 4

>>> G = timesfour('spam')                 # Generator functions work the same way

>>> iter(G) is G

True

>>> I1, I2 = iter(G), iter(G)

>>> next(I1)

'ssss'

>>> next(I1)

'pppp'

>>> next(I2)                              # I2 at same position as I1

'aaaa'

This is different from the behavior of some built-in types, which support multiple iterators and passes and reflect their in-place changes in active iterators:

>>> L = [1, 2, 3, 4]

>>> I1, I2 = iter(L), iter(L)

>>> next(I1)

1

>>> next(I1)

2

>>> next(I2)                              # Lists support multiple iterators

1

>>> del L[2:]                             # Changes reflected in iterators

>>> next(I1)

StopIteration

Though not readily apparent in these simple examples, this can matter in your code: if you wish to scan a generator’s values multiple times, you must either create a new generator for each scan or build a rescannable list out of its values—a single generator’s values will be consumed and exhausted after a single pass. See this chapter’s sidebar Why You Will Care: One-Shot Iterations for a prime example of the sort of code that must accommodate this generator property.
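The two usual workarounds can be sketched side by side as follows (the squares generator here is a stand-in for any generator you need to scan twice):

```python
def squares(n):
    for i in range(n):
        yield i ** 2

# Option 1: buffer the values in a list to allow any number of scans
buffered = list(squares(4))
print(buffered[1:], buffered[:-1])        # Both slices see the full data

# Option 2: make a fresh generator for each scan
print(sum(squares(4)), sum(squares(4)))   # 14 14 -- two independent passes

# By contrast, a single generator is consumed by its first full pass
G = squares(4)
print(sum(G), sum(G))                     # 14 0 -- second sum finds nothing
```

Buffering trades memory for rescanability; remaking the generator trades recomputation instead, a choice that depends on how expensive the values are to produce.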

When we begin coding class-based iterables in Part VI, we’ll also see that it’s up to us to decide how many iterations we wish to support for our objects, if any. In general, objects that wish to support multiple scans will return supplemental class objects instead of themselves. The next section previews more of this model.

THE PYTHON 3.3 YIELD FROM EXTENSION

Python 3.3 introduces extended syntax for the yield statement that allows delegation to a subgenerator with a from generator clause. In simple cases, it’s equivalent to a yielding for loop—the list here in the following forces the generator to produce all its values, and the comprehension in parentheses is a generator expression, covered in this chapter:

>>> def both(N):

        for i in range(N): yield i

        for i in (x ** 2 for x in range(N)): yield i

>>> list(both(5))

[0, 1, 2, 3, 4, 0, 1, 4, 9, 16]

The new 3.3 syntax makes this arguably more concise and explicit, and supports all the usual generator usage contexts:

>>> def both(N):

        yield from range(N)

        yield from (x ** 2 for x in range(N))

>>> list(both(5))

[0, 1, 2, 3, 4, 0, 1, 4, 9, 16]

>>> ' : '.join(str(i) for i in both(5))

'0 : 1 : 2 : 3 : 4 : 0 : 1 : 4 : 9 : 16'

In more advanced roles, however, this extension allows subgenerators to receive sent and thrown values directly from the calling scope, and return a final value to the outer generator. The net effect is to allow such generators to be split into multiple subgenerators much as a single function can be split into multiple subfunctions.
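A small sketch of that more advanced role follows (Python 3.3 and later only; the subtotal/report names are invented for illustration). A subgenerator’s return statement value becomes the value of the yield from expression in the delegating generator:

```python
def subtotal(values):
    """Subgenerator: yield each value, then return the group's total."""
    total = 0
    for v in values:
        yield v
        total += v
    return total                      # Becomes the value of 'yield from'

def report(groups):
    """Outer generator: delegate per group, then yield the grand total."""
    grand = 0
    for group in groups:
        grand += yield from subtotal(group)   # Captures subtotal's return
    yield grand

print(list(report([[1, 2], [3, 4]])))   # [1, 2, 3, 4, 10]
```

The individual values pass straight through to the caller, while the totals flow back up the delegation chain, much like return values between subfunctions.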

Since this is only available in 3.3 and later, and is beyond this chapter’s generator coverage in general, we’ll defer to Python 3.3’s manuals for additional details. For an additional yield from example, also see the solution to this part’s Exercise 11 described at the end of Chapter 21.

Generation in Built-in Types, Tools, and Classes

Finally, although we’ve focused on coding value generators ourselves in this section, don’t forget that many built-in types behave in similar ways—as we saw in Chapter 14, for example, dictionaries are iterables with iterators that produce keys on each iteration:

>>> D = {'a':1, 'b':2, 'c':3}

>>> x = iter(D)

>>> next(x)

'c'

>>> next(x)

'b'

Like the values produced by handcoded generators, dictionary keys may be iterated over both manually and with automatic iteration tools including for loops, map calls, list comprehensions, and the many other contexts we met in Chapter 14:

>>> for key in D:

        print(key, D[key])

c 3

b 2

a 1

As we’ve also seen, for file iterators, Python simply loads lines from the file on demand:

>>> for line in open('temp.txt'):

        print(line, end='')

Tis but

a flesh wound.

While built-in type iterables are bound to a specific type of value generation, the concept is similar to the multipurpose generators we code with expressions and functions. Iteration contexts like for loops accept any iterable that has the expected methods, whether user-defined or built-in.

Generators and library tools: Directory walkers

Though beyond this book’s scope, many Python standard library tools generate values today too, including email parsers, and the standard directory walker—which at each level of a tree yields a tuple of the current directory, its subdirectories, and its files:

>>> import os

>>> for (root, subs, files) in os.walk('.'):         # Directory walk generator

        for name in files:                           # A Python 'find' operation

            if name.startswith('call'):

                print(root, name)

. callables.py

.\dualpkg callables.py

In fact, os.walk is coded as a recursive function in Python in its os.py standard library file, in C:\Python33\Lib on Windows. Because it uses yield (and in 3.3 yield from instead of a for loop) to return results, it’s a normal generator function, and hence an iterable object:

>>> G = os.walk(r'C:\code\pkg')

>>> iter(G) is G                     # Single-scan iterator: iter(G) optional

True

>>> I = iter(G)

>>> next(I)

('C:\\code\\pkg', ['__pycache__'], ['eggs.py', 'eggs.pyc', 'main.py', ...etc...])

>>> next(I)

('C:\\code\\pkg\\__pycache__', [], ['eggs.cpython-33.pyc', ...etc...])

>>> next(I)

StopIteration

By yielding results as it goes, the walker does not require its clients to wait for an entire tree to be scanned. See Python’s manuals and follow-up books such as Programming Python for more on this tool. Also see Chapter 14 and others for os.popen—a related iterable used to run a shell command and read its output.

Generators and function application

In Chapter 18, we noted that starred arguments can unpack an iterable into individual arguments. Now that we’ve seen generators, we can also see what this means in code. In both 3.X and 2.X (though 2.X’s range is a list):

>>> def f(a, b, c): print('%s, %s, and %s' % (a, b, c))

>>> f(0, 1, 2)                       # Normal positionals

0, 1, and 2

>>> f(*range(3))                     # Unpack range values: iterable in 3.X

0, 1, and 2

>>> f(*(i for i in range(3)))        # Unpack generator expression values

0, 1, and 2

This applies to dictionaries and views too (though dict.values is also a list in 2.X, and order is arbitrary when passing values by position):

>>> D = dict(a='Bob', b='dev', c=40.5); D

{'b': 'dev', 'c': 40.5, 'a': 'Bob'}

>>> f(a='Bob', b='dev', c=40.5)      # Normal keywords

Bob, dev, and 40.5

>>> f(**D)                           # Unpack dict: key=value

Bob, dev, and 40.5

>>> f(*D)                            # Unpack keys iterator

b, c, and a

>>> f(*D.values())                   # Unpack view iterator: iterable in 3.X

dev, 40.5, and Bob

Because the built-in print function in 3.X prints all its variable number of arguments, this also makes the following three forms equivalent—the latter using a * to unpack the results forced from a generator expression (though the second also creates a list of return values, and the first may leave your cursor at the end of the output line in some shells, but not in the IDLE GUI):

>>> for x in 'spam': print(x.upper(), end=' ')

S P A M

>>> list(print(x.upper(), end=' ') for x in 'spam')

S P A M [None, None, None, None]

>>> print(*(x.upper() for x in 'spam'))

S P A M

See Chapter 14 for an additional example that unpacks a file’s lines by iterator into arguments.

Preview: User-defined iterables in classes

Although beyond the scope of this chapter, it is also possible to implement arbitrary user-defined generator objects with classes that conform to the iteration protocol. Such classes define a special __iter__ method run by the iter built-in function, which in turn returns an object having a__next__ method (next in 2.X) run by the next built-in function:

class SomeIterable:

    def __init__(...): ...     # On iter(): return self or supplemental object

    def __next__(...): ...     # On next(): coded here, or in another class

As the prior section suggested, these classes usually return their objects directly for single-iteration behavior, or a supplemental object with scan-specific state for multiple-scan support.
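For a concrete preview (an illustrative sketch only; Chapter 30 develops this pattern in full), here is a single-scan iterable class that returns itself from __iter__ — the Squares name and behavior are assumptions made for this example:

```python
# A hypothetical single-scan iterable: iter() returns the instance
# itself, and next() computes one result per call.
class Squares:
    def __init__(self, start, stop):
        self.value = start - 1
        self.stop = stop

    def __iter__(self):                 # Run by iter(): return the iterator
        return self

    def __next__(self):                 # Run by next(): one result at a time
        if self.value == self.stop:
            raise StopIteration
        self.value += 1
        return self.value ** 2

print(list(Squares(1, 5)))              # [1, 4, 9, 16, 25]
```

Because __iter__ returns self, a second list call on the same instance yields an empty list: the scan state is exhausted, which is the single-iteration behavior described above.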

Alternatively, a user-defined iterable class’s method functions can sometimes use yield to transform themselves into generators, with an automatically created __next__ method—a common application of yield we’ll meet in Chapter 30 that is both wildly implicit and potentially useful! A __getitem__ indexing method is also available as a fallback option for iteration, though this is often not as flexible as the __iter__ and __next__ scheme (but has advantages for coding sequences).

The instance objects created from such a class are considered iterable and may be used in for loops and all other iteration contexts. With classes, though, we have access to richer logic and data structuring options, such as inheritance, that other generator constructs cannot offer by themselves. By coding methods, classes also can make iteration behavior much more explicit than the “magic” generator objects associated with built-in types and generator functions and expressions (though classes wield some magic of their own).

Hence, the iterator and generator story won’t really be complete until we’ve seen how it maps to classes, too. For now, we’ll have to settle for postponing its conclusion—and its final sequel—until we study class-based iterables in Chapter 30.

Example: Generating Scrambled Sequences

To demonstrate the power of iteration tools in action, let’s turn to some more complete use case examples. In Chapter 18, we wrote a testing function that scrambled the order of arguments used to test generalized intersection and union functions. There, I noted that this might be better coded as a generator of values. Now that we’ve learned how to write generators, this serves to illustrate a practical application.

One note up front: because they slice and concatenate objects, all the examples in this section (including the permutations at the end) work only on sequences like strings and lists, not on arbitrary iterables like files, maps, and other generators. That is, some of these examples will be generators themselves, producing values on request, but they cannot process generators as their inputs. Generalization for broader categories is left as an open issue, though the code here will suffice unchanged if you wrap nonsequence generators in list calls before passing them in.

Scrambling sequences

As coded in Chapter 18, we can reorder a sequence with slicing and concatenation, moving the front item to the end on each loop; slicing instead of indexing the item allows + to work for arbitrary sequence types:

>>> L, S = [1, 2, 3], 'spam'

>>> for i in range(len(S)):            # For repeat counts 0..3

        S = S[1:] + S[:1]              # Move front item to the end

        print(S, end=' ')

pams amsp mspa spam

>>> for i in range(len(L)):

        L = L[1:] + L[:1]              # Slice so any sequence type works

        print(L, end=' ')

[2, 3, 1] [3, 1, 2] [1, 2, 3]

Alternatively, as we saw in Chapter 13, we get the same results by moving an entire front section to the end, though the order of the results varies slightly:

>>> for i in range(len(S)):            # For positions 0..3

        X = S[i:] + S[:i]              # Rear part + front part (same effect)

        print(X, end=' ')

spam pams amsp mspa

Simple functions

As is, this code works on specific named variables only. To generalize, we can turn it into a simple function to work on any object passed to its argument and return a result; since the first of these exhibits the classic list comprehension pattern, we can save some work by coding it as such in the second:

>>> def scramble(seq):

        res = []

        for i in range(len(seq)):

            res.append(seq[i:] + seq[:i])

        return res

>>> scramble('spam')

['spam', 'pams', 'amsp', 'mspa']

>>> def scramble(seq):

        return [seq[i:] + seq[:i] for i in range(len(seq))]

>>> scramble('spam')

['spam', 'pams', 'amsp', 'mspa']

>>> for x in scramble((1, 2, 3)):

        print(x, end=' ')

(1, 2, 3) (2, 3, 1) (3, 1, 2)

We could use recursion here as well, but it’s probably overkill in this context.
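For the curious, a recursive version might look like the following sketch — it records the current ordering and recurs on the rotated sequence, once per item (the count argument is an addition for this illustration, used to stop the recursion):

```python
def scramble(seq, count=None):
    # One rotation per item: prepend the current ordering, then recur
    # on the sequence with its front item moved to the end.
    count = len(seq) if count is None else count
    if count == 0:
        return []
    return [seq] + scramble(seq[1:] + seq[:1], count - 1)

print(scramble('spam'))        # ['spam', 'pams', 'amsp', 'mspa']
```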

Generator functions

The preceding section’s simple approach works, but must build an entire result list in memory all at once (not great on memory usage if it’s massive), and requires the caller to wait until the entire list is complete (less than ideal if this takes a substantial amount of time). We can do better on both fronts by translating this to a generator function that yields one result at a time, using either coding scheme:

>>> def scramble(seq):

        for i in range(len(seq)):

            seq = seq[1:] + seq[:1]              # Generator function

            yield seq                            # Assignments work here

>>> def scramble(seq):

        for i in range(len(seq)):                # Generator function

            yield seq[i:] + seq[:i]              # Yield one item per iteration

>>> list(scramble('spam'))                       # list() generates all results

['spam', 'pams', 'amsp', 'mspa']

>>> list(scramble((1, 2, 3)))                    # Any sequence type works

[(1, 2, 3), (2, 3, 1), (3, 1, 2)]

>>> 

>>> for x in scramble((1, 2, 3)):                # for loops generate results

        print(x, end=' ')

(1, 2, 3) (2, 3, 1) (3, 1, 2)

Generator functions retain their local scope state while active, minimize memory space requirements, and divide the work into shorter time slices. As full functions, they are also very general. Importantly, for loops and other iteration tools work the same whether stepping through a real list or a generator of values—the function can select between the two schemes freely, and even change strategies in the future.
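To see that interchangeability in action, the following sketch runs the same client loop over both strategies — the two definitions simply repeat the generator and list-builder versions shown above:

```python
def scramble_gen(seq):                      # Generator version from above
    for i in range(len(seq)):
        yield seq[i:] + seq[:i]

def scramble_list(seq):                     # List-builder version from above
    return [seq[i:] + seq[:i] for i in range(len(seq))]

# The client code is identical either way: iteration tools don't care
# whether they step through a real list or a generator of values.
for maker in (scramble_gen, scramble_list):
    print([x for x in maker('spam')])       # Same results from both
```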

Generator expressions

As we’ve seen, generator expressions—comprehensions in parentheses instead of square brackets—also generate values on request and retain their local state. They’re not as flexible as full functions, but because they yield their values automatically, expressions can often be more concise in specific use cases like this:

>>> S

'spam'

>>> G = (S[i:] + S[:i] for i in range(len(S)))   # Generator expression equivalent

>>> list(G)

['spam', 'pams', 'amsp', 'mspa']

Notice that we can’t use the assignment statement of the first generator function version here, because generator expressions cannot contain statements. This makes them a bit narrower in scope; in many cases, though, expressions can do similar work, as shown here. To generalize a generator expression for an arbitrary subject, wrap it in a simple function that takes an argument and returns a generator that uses it:

>>> F = lambda seq: (seq[i:] + seq[:i] for i in range(len(seq)))

>>> F(S)

<generator object <genexpr> at 0x00000000029883F0>

>>> 

>>> list(F(S))

['spam', 'pams', 'amsp', 'mspa']

>>> list(F([1, 2, 3]))

[[1, 2, 3], [2, 3, 1], [3, 1, 2]]

>>> for x in F((1, 2, 3)):

        print(x, end=' ')

(1, 2, 3) (2, 3, 1) (3, 1, 2)

Tester client

Finally, we can use either the generator function or its expression equivalent in Chapter 18’s tester to produce scrambled arguments—the sequence scrambling function becomes a tool we can use in other contexts:

# file scramble.py

def scramble(seq):

    for i in range(len(seq)):                # Generator function

        yield seq[i:] + seq[:i]              # Yield one item per iteration

scramble2 = lambda seq: (seq[i:] + seq[:i] for i in range(len(seq)))

And by moving the values generation out to an external tool, the tester becomes simpler:

>>> from scramble import scramble

>>> from inter2 import intersect, union

>>> 

>>> def tester(func, items, trace=True):

        for args in scramble(items):         # Use generator (or: scramble2(items))

            if trace: print(args)

            print(sorted(func(*args)))

>>> tester(intersect, ('aab', 'abcde', 'ababab'))

('aab', 'abcde', 'ababab')

['a', 'b']

('abcde', 'ababab', 'aab')

['a', 'b']

('ababab', 'aab', 'abcde')

['a', 'b']

>>> tester(intersect, ([1, 2], [2, 3, 4], [1, 6, 2, 7, 3]), False)

[2]

[2]

[2]

Permutations: All possible combinations

These techniques have many other real-world applications—consider generating attachments in an email message or points to be plotted in a GUI. Moreover, other types of sequence scrambles serve central roles in other applications, from searches to mathematics. As is, our sequence scrambler is a simple reordering, but some programs warrant the more exhaustive set of all possible orderings we get from permutations—produced using recursive functions in both list-builder and generator forms by the following module file:

# File permute.py

def permute1(seq):

    if not seq:                               # Shuffle any sequence: list

        return [seq]                          # Empty sequence

    else:

        res = []

        for i in range(len(seq)):

            rest = seq[:i] + seq[i+1:]        # Delete current node

            for x in permute1(rest):          # Permute the others

                res.append(seq[i:i+1] + x)    # Add node at front

        return res

def permute2(seq):

    if not seq:                               # Shuffle any sequence: generator

        yield seq                             # Empty sequence

    else:

        for i in range(len(seq)):

            rest = seq[:i] + seq[i+1:]        # Delete current node

            for x in permute2(rest):          # Permute the others

                yield seq[i:i+1] + x          # Add node at front

Both of these functions produce the same results, though the second defers much of its work until it is asked for a result. This code is a bit advanced, especially the second of these functions (and to some Python newcomers might even be categorized as cruel and inhumane punishment!). Still, as I’ll explain in a moment, there are cases where the generator approach can be highly useful.

Study and test this code for more insight, and add prints to trace if it helps. If it’s still a mystery, try to make sense of the first version first; remember that generator functions simply return objects with methods that handle next operations run by for loops at each level, and don’t produce any results until iterated; and trace through some of the following examples to see how they’re handled by this code.

Permutations produce more orderings than the original shuffler—for N items, we get N! (factorial) results instead of just N (24 for 4: 4 * 3 * 2 * 1). In fact, that’s why we need recursion here: the number of nested loops is arbitrary, and depends on the length of the sequence permuted:

>>> from scramble import scramble

>>> from permute import permute1, permute2

>>> list(scramble('abc'))                            # Simple scrambles: N

['abc', 'bca', 'cab']

>>> permute1('abc')                                  # Permutations larger: N!

['abc', 'acb', 'bac', 'bca', 'cab', 'cba']

>>> list(permute2('abc'))                            # Generate all combinations

['abc', 'acb', 'bac', 'bca', 'cab', 'cba']

>>> G = permute2('abc')                              # Iterate (iter() not needed)

>>> next(G)

'abc'

>>> next(G)

'acb'

>>> for x in permute2('abc'): print(x)               # Automatic iteration

...prints six lines...

The list and generator versions’ results are the same, though the generator minimizes both space usage and delays for results. For larger items, the set of all permutations is much larger than the simpler scrambler’s:

>>> permute1('spam') == list(permute2('spam'))

True

>>> len(list(permute2('spam'))), len(list(scramble('spam')))

(24, 4)

>>> list(scramble('spam'))

['spam', 'pams', 'amsp', 'mspa']

>>> list(permute2('spam'))

['spam', 'spma', 'sapm', 'samp', 'smpa', 'smap', 'psam', 'psma', 'pasm', 'pams',

 'pmsa', 'pmas', 'aspm', 'asmp', 'apsm', 'apms', 'amsp', 'amps', 'mspa', 'msap',

 'mpsa', 'mpas', 'masp', 'maps']

Per Chapter 19, there are nonrecursive alternatives here too, using explicit stacks or queues, and other sequence orderings are common (e.g., fixed-size subsets and combinations that filter out duplicates of differing order), but these require coding extensions we’ll forgo here. See the book Programming Python for more on this theme, or experiment further on your own.
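One such extension is already prebuilt: the standard library’s itertools.permutations produces the same orderings as our permute2, though it yields tuples of items rather than the original sequence type:

```python
from itertools import permutations

# itertools yields tuples; join them back into strings for comparison
print([''.join(p) for p in permutations('abc')])
# ['abc', 'acb', 'bac', 'bca', 'cab', 'cba'] - same as permute2('abc')
```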

Don’t Abuse Generators: EIBTI

Generators are a somewhat advanced tool, and might be better treated as an optional topic, but for the fact that they permeate the Python language, especially in 3.X. In fact, they seem less optional to this book’s audience than Unicode (which was exiled to Part VIII). As we’ve seen, fundamental built-in tools such as range, map, dictionary keys, and even files are now generators, so you must be familiar with the concept even if you don’t write new generators of your own. Moreover, user-defined generators are increasingly common in Python code that you might come across today—in the Python standard library, for instance.

In general, the same cautions I gave for list comprehensions apply here as well: don’t complicate your code with user-defined generators if they are not warranted. Especially for smaller programs and data sets, there may be no good reason to use these tools. In such cases, simple lists of results will suffice, will be easier to understand, will be garbage-collected automatically, and may be produced quicker (and they are today: see the next chapter). Advanced tools like generators that rely on implicit “magic” can be fun to experiment with, but they have no place in real code that must be used by others except when clearly justified.

Or, to quote from Python’s import this motto again:

Explicit is better than implicit.

The acronym for this, EIBTI, is one of Python’s core guidelines, and for good reason: the more explicit your code is about its behavior, the more likely it is that the next programmer will be able to understand it. This applies directly to generators, whose implicit behavior may very well be more difficult for some to grasp than less obscure alternatives. Always: keep it simple unless it must be complicated!

On the other hand: Space and time, conciseness, expressiveness

That being said, there are specific use cases that generators can address well. They can reduce memory footprint in some programs, reduce delays in others, and can occasionally make the impossible possible. Consider, for example, a program that must produce all possible permutations of a nontrivial sequence. Since the number of combinations is a factorial that explodes exponentially, the preceding permute1 recursive list-builder function will either introduce a noticeable and perhaps interminable pause or fail completely due to memory requirements, whereas the permute2 recursive generator will not—it returns each individual result quickly, and can handle very large result sets:

>>> import math

>>> math.factorial(10)               # 10 * 9 * 8 * 7 * 6 * 5 * 4 * 3 * 2 * 1

3628800

>>> from permute import permute1, permute2

>>> seq = list(range(10))

>>> p1 = permute1(seq)               # 37 seconds on a 2GHz quad-core machine

                                     # Creates a list of 3.6M numbers

>>> len(p1), p1[0], p1[1]

(3628800, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 9, 8])

In this case, the list builder pauses for 37 seconds on my computer to build a 3.6-million-item list, but the generator can begin returning results immediately:

>>> p2 = permute2(seq)               # Returns generator immediately

>>> next(p2)                         # And produces each result quickly on request

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> next(p2)

[0, 1, 2, 3, 4, 5, 6, 7, 9, 8]

>>> p2 = list(permute2(seq))         # About 28 seconds, though still impractical

>>> p1 == p2                         # Same set of results generated

True

Naturally, we might be able to optimize the list builder’s code to run quicker (e.g., an explicit stack instead of recursion might change its performance), but for larger sequences, it’s not an option at all—at just 50 items, the number of permutations precludes building a results list, and would take far too long for mere mortals like us (and larger values will overflow the preset recursion stack depth limit: see the preceding chapter). The generator, however, is still viable—it is able to produce individual results immediately:

>>> math.factorial(50)

30414093201713378043612608166064768844377641568960512000000000000

>>> p3 = permute2(list(range(50)))

>>> next(p3)                         # permute1 is not an option here!

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,

23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,

44, 45, 46, 47, 48, 49]

For more fun—and to yield results that are more variable and less obviously deterministic—we could also use Python’s random module of Chapter 5 to randomly shuffle the sequence to be permuted before the permuter begins its work. (In fact, we might be able to use the random shuffler as a permutation generator in general, as long as we either can assume that it won’t repeat shuffles during the time we consume them, or test its results against prior shuffles to avoid repeats—and hope that we do not live in the strange universe where a random sequence repeats the same result an infinite number of times!). In the following, each permute2 and next call returns immediately as before, but a permute1 hangs:

>>> import random

>>> math.factorial(20)               # permute1 is not an option here

2432902008176640000

>>> seq = list(range(20))

>>> random.shuffle(seq)              # Shuffle sequence randomly first

>>> p = permute2(seq)

>>> next(p)

[10, 17, 4, 14, 11, 3, 16, 19, 12, 8, 6, 5, 2, 15, 18, 7, 1, 0, 13, 9]

>>> next(p)

[10, 17, 4, 14, 11, 3, 16, 19, 12, 8, 6, 5, 2, 15, 18, 7, 1, 0, 9, 13]

>>> random.shuffle(seq)

>>> p = permute2(seq)

>>> next(p)

[16, 1, 5, 14, 15, 12, 0, 2, 6, 19, 10, 17, 11, 18, 13, 7, 4, 9, 8, 3]

>>> next(p)

[16, 1, 5, 14, 15, 12, 0, 2, 6, 19, 10, 17, 11, 18, 13, 7, 4, 9, 3, 8]

The main point here is that generators can sometimes produce results from large solution sets when list builders cannot. Then again, it’s not clear how common such use cases may be in the real world, and this doesn’t necessarily justify the implicit flavor of value generation that we get with generator functions and expressions. As we’ll see in Part VI, value generation can also be coded as iterable objects with classes. Class-based iterables can produce items on request too, and are far more explicit than the magic objects and methods produced for generator functions and expressions.

Part of programming is finding a balance among tradeoffs like these, and there are no absolute rules here. While the benefits of generators may sometimes justify their use, maintainability should always be a top priority too. Like comprehensions, generators also offer an expressiveness and code economy that’s hard to resist if you understand how they work—but you’ll want to weigh this against the frustration of coworkers who might not.

Example: Emulating zip and map with Iteration Tools

To help you evaluate their roles further, let’s take a quick look at one more example of generators in action that illustrates just how expressive they can be. Once you know about comprehensions, generators, and other iteration tools, it turns out that emulating many of Python’s functional built-ins is both straightforward and instructive. For example, we’ve already seen how the built-in zip and map functions combine iterables and project functions across them, respectively. With multiple iterable arguments, map projects the function across items taken from each iterable in much the same way that zip pairs them up:

>>> S1 = 'abc'

>>> S2 = 'xyz123'

>>> list(zip(S1, S2))                          # zip pairs items from iterables

[('a', 'x'), ('b', 'y'), ('c', 'z')]

# zip pairs items, truncates at shortest

>>> list(zip([-2, -1, 0, 1, 2]))               # Single sequence: 1-ary tuples

[(-2,), (-1,), (0,), (1,), (2,)]

>>> list(zip([1, 2, 3], [2, 3, 4, 5]))         # N sequences: N-ary tuples

[(1, 2), (2, 3), (3, 4)]

# map passes paired items to function, truncates

>>> list(map(abs, [-2, -1, 0, 1, 2]))          # Single sequence: 1-ary function

[2, 1, 0, 1, 2]

>>> list(map(pow, [1, 2, 3], [2, 3, 4, 5]))    # N sequences: N-ary function

[1, 8, 81]

# map and zip accept arbitrary iterables

>>> list(map(lambda x, y: x + y, open('script2.py'), open('script2.py')))

['import sys\nimport sys\n', 'print(sys.path)\nprint(sys.path)\n', ...etc...]

>>> [x + y for (x, y) in zip(open('script2.py'), open('script2.py'))]

['import sys\nimport sys\n', 'print(sys.path)\nprint(sys.path)\n', ...etc...]

Though they’re being used for different purposes, if you study these examples long enough, you might notice a relationship between zip results and mapped function arguments that our next example can exploit.
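That relationship can be stated directly: mapping a function over N sequences is equivalent to zipping the sequences and applying the function to each resulting tuple:

```python
nums1, nums2 = [1, 2, 3], [2, 3, 4, 5]

# map projects pow across paired items, truncating at the shortest...
print(list(map(pow, nums1, nums2)))                       # [1, 8, 81]

# ...which is the same as unpacking zip's result tuples into pow
print([pow(x, y) for (x, y) in zip(nums1, nums2)])        # [1, 8, 81]
```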

Coding your own map(func, ...)

Although the map and zip built-ins are fast and convenient, it’s always possible to emulate them in code of our own. In the preceding chapter, for example, we saw a function that emulated the map built-in for a single sequence (or other iterable) argument. It doesn’t take much more work to allow for multiple sequences, as the built-in does:

# map(func, seqs...) workalike with zip

def mymap(func, *seqs):

    res = []

    for args in zip(*seqs):

        res.append(func(*args))

    return res

print(mymap(abs, [-2, -1, 0, 1, 2]))

print(mymap(pow, [1, 2, 3], [2, 3, 4, 5]))

This version relies heavily upon the special *args argument-passing syntax—it collects multiple sequence (really, iterable) arguments, unpacks them as zip arguments to combine, and then unpacks the paired zip results as arguments to the passed-in function. That is, we’re using the fact that the zipping is essentially a nested operation in mapping. The test code at the bottom applies this to both one and two sequences to produce this output—the same we would get with the built-in map (this code is in file mymap.py in the book’s examples if you want to run it live):

[2, 1, 0, 1, 2]

[1, 8, 81]

Really, though, the prior version exhibits the classic list comprehension pattern, building a list of operation results within a for loop. We can code our map more concisely as an equivalent one-line list comprehension:

# Using a list comprehension

def mymap(func, *seqs):

    return [func(*args) for args in zip(*seqs)]

print(mymap(abs, [-2, -1, 0, 1, 2]))

print(mymap(pow, [1, 2, 3], [2, 3, 4, 5]))

When this is run the result is the same as before, but the code is more concise and might run faster (more on performance in the section Timing Iteration Alternatives). Both of the preceding mymap versions build result lists all at once, though, and this can waste memory for larger lists. Now that we know about generator functions and expressions, it’s simple to recode both these alternatives to produce results on demand instead:

# Using generators: yield and (...)

def mymap(func, *seqs):

    res = []

    for args in zip(*seqs):

        yield func(*args)

def mymap(func, *seqs):

    return (func(*args) for args in zip(*seqs))

These versions produce the same results but return generators designed to support the iteration protocol—the first yields one result at a time, and the second returns a generator expression’s result to do the same. They produce the same results if we wrap them in list calls to force them to produce their values all at once:

print(list(mymap(abs, [-2, -1, 0, 1, 2])))

print(list(mymap(pow, [1, 2, 3], [2, 3, 4, 5])))

No work is really done here until the list calls force the generators to run, by activating the iteration protocol. The generators returned by these functions themselves, as well as that returned by the Python 3.X flavor of the zip built-in they use, produce results only on demand.
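A quick trace makes the laziness visible — requesting results one at a time from the generator-expression version shown above:

```python
def mymap(func, *seqs):
    return (func(*args) for args in zip(*seqs))   # Expression version from above

G = mymap(abs, [-2, -1, 0])      # No work done yet: just a generator object
print(next(G))                   # 2 - first result computed only now
print(next(G))                   # 1
print(list(G))                   # [0] - the remainder, on demand
```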

Coding your own zip(...) and map(None, ...)

Of course, much of the magic in the examples shown so far lies in their use of the zip built-in to pair arguments from multiple sequences or iterables. Our map workalikes are also really emulating the behavior of the Python 3.X map—they truncate at the length of the shortest argument, and they do not support the notion of padding results when lengths differ, as map does in Python 2.X with a None argument:

C:code> c:\python27\python

>>> map(None, [1, 2, 3], [2, 3, 4, 5])

[(1, 2), (2, 3), (3, 4), (None, 5)]

>>> map(None, 'abc', 'xyz123')

[('a', 'x'), ('b', 'y'), ('c', 'z'), (None, '1'), (None, '2'), (None, '3')]

Using iteration tools, we can code workalikes that emulate both truncating zip and 2.X’s padding map—these turn out to be nearly the same in code:

# zip(seqs...) and 2.X map(None, seqs...) workalikes

def myzip(*seqs):

    seqs = [list(S) for S in seqs]

    res  = []

    while all(seqs):

        res.append(tuple(S.pop(0) for S in seqs))

    return res

def mymapPad(*seqs, pad=None):

    seqs = [list(S) for S in seqs]

    res  = []

    while any(seqs):

        res.append(tuple((S.pop(0) if S else pad) for S in seqs))

    return res

S1, S2 = 'abc', 'xyz123'

print(myzip(S1, S2))

print(mymapPad(S1, S2))

print(mymapPad(S1, S2, pad=99))

Both of the functions coded here work on any type of iterable object, because they run their arguments through the list built-in to force result generation (e.g., files would work as arguments, in addition to sequences like strings). Notice the use of the all and any built-ins here—these return True if all and any items in an iterable are True (or equivalently, nonempty), respectively. These built-ins are used to stop looping when any or all of the listified arguments become empty after deletions.
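The role of all and any here can be seen in isolation — a list turned empty by pop is false, so all stops the loop at the shortest argument while any keeps it running to the longest:

```python
print(all([[1], [2, 3]]))     # True: no argument exhausted yet (myzip continues)
print(all([[1], []]))         # False: one argument exhausted (myzip stops)
print(any([[1], []]))         # True: something left to pad (mymapPad continues)
print(any([[], []]))          # False: all arguments exhausted (mymapPad stops)
```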

Also note the use of the Python 3.X keyword-only argument, pad; unlike the 2.X map, our version will allow any pad object to be specified (if you’re using 2.X, use a **kargs form to support this option instead; see Chapter 18 for details). When these functions are run, the following results are printed—a zip, and two padding maps:

[('a', 'x'), ('b', 'y'), ('c', 'z')]

[('a', 'x'), ('b', 'y'), ('c', 'z'), (None, '1'), (None, '2'), (None, '3')]

[('a', 'x'), ('b', 'y'), ('c', 'z'), (99, '1'), (99, '2'), (99, '3')]

These functions aren’t amenable to list comprehension translation because their loops are too specific. As before, though, while our zip and map workalikes currently build and return result lists, it’s just as easy to turn them into generators with yield so that they each return one piece of their result set at a time. The results are the same as before, but we need to use list again to force the generators to yield their values for display:

# Using generators: yield

def myzip(*seqs):

    seqs = [list(S) for S in seqs]

    while all(seqs):

        yield tuple(S.pop(0) for S in seqs)

def mymapPad(*seqs, pad=None):

    seqs = [list(S) for S in seqs]

    while any(seqs):

        yield tuple((S.pop(0) if S else pad) for S in seqs)

S1, S2 = 'abc', 'xyz123'

print(list(myzip(S1, S2)))

print(list(mymapPad(S1, S2)))

print(list(mymapPad(S1, S2, pad=99)))

Finally, here’s an alternative implementation of our zip and map emulators—rather than deleting arguments from lists with the pop method, the following versions do their job by calculating the minimum and maximum argument lengths. Armed with these lengths, it’s easy to code nested list comprehensions to step through argument index ranges:

# Alternate implementation with lengths

def myzip(*seqs):

    minlen = min(len(S) for S in seqs)

    return [tuple(S[i] for S in seqs) for i in range(minlen)]

def mymapPad(*seqs, pad=None):

    maxlen = max(len(S) for S in seqs)

    index  = range(maxlen)

    return [tuple((S[i] if len(S) > i else pad) for S in seqs) for i in index]

S1, S2 = 'abc', 'xyz123'

print(myzip(S1, S2))

print(mymapPad(S1, S2))

print(mymapPad(S1, S2, pad=99))

Because these use len and indexing, they assume that arguments are sequences or similar, not arbitrary iterables, much like our earlier sequence scramblers and permuters. The outer comprehensions here step through argument index ranges, and the inner comprehensions (passed to tuple) step through the passed-in sequences to pull out arguments in parallel. When they’re run, the results are as before.

Most strikingly, generators and iterators seem to run rampant in this example. The arguments passed to min and max are generator expressions, which run to completion before the nested comprehensions begin iterating. Moreover, the nested list comprehensions employ two levels of delayed evaluation—the Python 3.X range built-in is an iterable, as is the generator expression argument to tuple.

In fact, no results are produced here until the square brackets of the list comprehensions request values to place in the result list—they force the comprehensions and generators to run. To turn these functions themselves into generators instead of list builders, use parentheses instead of square brackets again. Here’s the case for our zip:

# Using generators: (...)

def myzip(*seqs):

    minlen = min(len(S) for S in seqs)

    return (tuple(S[i] for S in seqs) for i in range(minlen))

S1, S2 = 'abc', 'xyz123'

print(list(myzip(S1, S2)))         # Go!... [('a', 'x'), ('b', 'y'), ('c', 'z')]

In this case, it takes a list call to activate the generators and other iterables to produce their results. Experiment with these on your own for more details. Developing further coding alternatives is left as a suggested exercise (see also the sidebar Why You Will Care: One-Shot Iterations for investigation of one such option).

NOTE

Watch for more yield examples in Chapter 30, where we’ll use it in conjunction with the __iter__ operator overloading method to implement user-defined iterable objects in an automated fashion. The state retention of local variables in this role serves as an alternative to class attributes in the same spirit as the closure functions of Chapter 17; as we’ll see, though, this technique combines classes and functional tools instead of posing a paradigm alternative.

WHY YOU WILL CARE: ONE-SHOT ITERATIONS

In Chapter 14, we saw how some built-ins (like map) support only a single traversal and are empty after it occurs, and I promised to show you an example of how that can become subtle but important in practice. Now that we’ve studied a few more iteration topics, I can make good on this promise. Consider the following clever alternative coding for this chapter’s zip emulation examples, adapted from one in Python’s manuals at the time I wrote these words:

def myzip(*args):

    iters = map(iter, args)

    while iters:

        res = [next(i) for i in iters]

        yield tuple(res)

Because this code uses iter and next, it works on any type of iterable. Note that there is no reason to catch the StopIteration raised by the next(i) inside the comprehension here when any one of the arguments’ iterators is exhausted—allowing it to pass ends this generator function and has the same effect that a return statement would. The while iters: suffices to loop if at least one argument is passed, and avoids an infinite loop otherwise (the list comprehension would always return an empty list).

This code works fine in Python 2.X as is:

>>> list(myzip('abc', 'lmnop'))

[('a', 'l'), ('b', 'm'), ('c', 'n')]

But it falls into an infinite loop and fails in Python 3.X, because the 3.X map returns a one-shot iterable object instead of a list as in 2.X. In 3.X, as soon as we’ve run the list comprehension inside the loop once, iters will be exhausted but still True (and res will be []) forever. To make this work in 3.X, we need to use the list built-in function to create an object that can support multiple iterations:

def myzip(*args):

    iters = list(map(iter, args))       # Allow multiple scans

    ...rest as is...

Run this on your own to trace its operation. The lesson here: wrapping map calls in list calls in 3.X is not just for display!
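The root cause can be seen in isolation: a 3.X map result is a one-shot iterable, exhausted after a single traversal, while the list built around it supports any number of scans. The following is an illustrative sketch, not code from Python’s manuals:

```python
# The cause in isolation: a 3.X map object is a one-shot iterable,
# empty after a single traversal.
m = map(abs, [-1, -2, -3])
print(list(m))          # [1, 2, 3]: the first scan consumes the iterator
print(list(m))          # []: a second scan finds it already exhausted

# Wrapping the call in list() builds a real list that supports
# multiple scans -- the fix applied to myzip above.
m = list(map(abs, [-1, -2, -3]))
print(list(m))          # [1, 2, 3]
print(list(m))          # [1, 2, 3]: rescannable as many times as needed
```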


[41] Interestingly, generator functions are also something of a “poor man’s” multithreading device—they interleave a function’s work with that of its caller, by dividing its operation into steps run between yields. Generators are not threads, though: the program is explicitly directed to and from the function within a single thread of control. In one sense, threading is more general (producers can run truly independently and post results to a queue), but generators may be simpler to code. See the footnote in Chapter 17 for a brief introduction to Python multithreading tools. Note that because control is routed explicitly at yield and next calls, generators are also not backtracking, but are more strongly related to coroutines—formal concepts that are both beyond this chapter’s scope.

Comprehension Syntax Summary

We’ve been focusing on list comprehensions and generators in this chapter, but keep in mind that there are two other comprehension expression forms available in both 3.X and 2.7: set and dictionary comprehensions. We met these briefly in Chapter 5 and Chapter 8, but with our new knowledge of comprehensions and generators, you should now be able to grasp these extensions in full:

§  For sets, the new literal form {1, 3, 2} is equivalent to set([1, 3, 2]), and the new set comprehension syntax {f(x) for x in S if P(x)} is like the generator expression set(f(x) for x in S if P(x)), where f(x) is an arbitrary expression.

§  For dictionaries, the new dictionary comprehension syntax {key: val for (key, val) in zip(keys, vals)} works like the form dict(zip(keys, vals)), and {x: f(x) for x in items} is like the generator expression dict((x, f(x)) for x in items).

Here’s a summary of all the comprehension alternatives in 3.X and 2.7. The last two are new and are not available in 2.6 and earlier:

>>> [x * x for x in range(10)]            # List comprehension: builds list

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]      # Like list(generator expr)

>>> (x * x for x in range(10))            # Generator expression: produces items

<generator object at 0x009E7328>          # Parens are often optional

>>> {x * x for x in range(10)}            # Set comprehension, 3.X and 2.7

{0, 1, 4, 81, 64, 9, 16, 49, 25, 36}      # {x, y} is a set in these versions too

>>> {x: x * x for x in range(10)}         # Dictionary comprehension, 3.X and 2.7

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

Scopes and Comprehension Variables

Now that we’ve seen all comprehension forms, be sure to also review Chapter 17’s overview of the localization of loop variables in these expressions. Python 3.X localizes loop variables in all four forms—temporary loop variable names in generator, set, dictionary, and list comprehensions are local to the expression. They don’t clash with names outside, but are also not available there, and work differently than the for loop iteration statement:

c:\code> py −3

>>> (X for X in range(5))

<generator object <genexpr> at 0x00000000028E4798>

>>> X

NameError: name 'X' is not defined

>>> X = 99

>>> [X for X in range(5)]         # 3.X: generator, set, dict, and list localize

[0, 1, 2, 3, 4]

>>> X

99

>>> Y = 99

>>> for Y in range(5): pass       # But loop statements do not localize names

>>> Y

4

As mentioned in Chapter 17, 3.X variables assigned in a comprehension are really a further nested special-case scope; other names referenced within these expressions follow the usual LEGB rules. In the following generator, for example, Z is localized in the comprehension, but Y and X are found in the enclosing local and global scopes as usual:

>>> X = 'aaa'

>>> def func():

        Y = 'bbb'

        print(''.join(Z for Z in X + Y))       # Z comprehension, Y local, X global

>>> func()

aaabbb

Python 2.X is the same in this regard, except that list comprehension variables are not localized—they work just like for loops and keep their last iteration values, but are also open to unexpected clashes with outside names. Generator, set, and dictionary forms localize names as in 3.X:

c:\code> py −2

>>> (X for X in range(5))

<generator object <genexpr> at 0x0000000002147EE8>

>>> X

NameError: name 'X' is not defined

>>> X = 99

>>> [X for X in range(5)]         # 2.X: List does not localize its names, like for

[0, 1, 2, 3, 4]

>>> X

4

>>> Y = 99

>>> for Y in range(5): pass       # for loops do not localize names in 2.X or 3.X

>>> Y

4

If you care about version portability, and symmetry with the for loop statement, use unique names for variables in comprehension expressions as a rule of thumb. The 2.X behavior makes sense given that a generator object is discarded after it finishes producing results, but a list comprehension is equivalent to a for loop—though this analogy doesn’t hold for the set and dictionary forms that localize their names in both Pythons, and are, somewhat coincidentally, the topic of the next section.

Comprehending Set and Dictionary Comprehensions

In a sense, set and dictionary comprehensions are just syntactic sugar for passing generator expressions to the type names. Because both accept any iterable, a generator works well here:

>>> {x * x for x in range(10)}                # Comprehension

{0, 1, 4, 81, 64, 9, 16, 49, 25, 36}

>>> set(x * x for x in range(10))             # Generator and type name

{0, 1, 4, 81, 64, 9, 16, 49, 25, 36}

>>> {x: x * x for x in range(10)}

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

>>> dict((x, x * x) for x in range(10))

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

>>> x                                         # Loop variable localized in 2.X + 3.X

NameError: name 'x' is not defined

As for list comprehensions, though, we can always build the result objects with manual code, too. Here are statement-based equivalents of the last two comprehensions (though they differ in name localization):

>>> res = set()

>>> for x in range(10):                        # Set comprehension equivalent

        res.add(x * x)

>>> res

{0, 1, 4, 81, 64, 9, 16, 49, 25, 36}

>>> res = {}

>>> for x in range(10):                        # Dict comprehension equivalent

        res[x] = x * x

>>> res

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

>>> x   # Localized in comprehension expressions, but not in loop statements

9

Notice that although both set and dictionary comprehensions accept and scan iterables, they have no notion of generating results on demand—both forms build complete objects all at once. If you mean to produce keys and values upon request, a generator expression is more appropriate:

>>> G = ((x, x * x) for x in range(10))

>>> next(G)

(0, 0)

>>> next(G)

(1, 1)

Extended Comprehension Syntax for Sets and Dictionaries

Like list comprehensions and generator expressions, both set and dictionary comprehensions support associated if clauses to filter items out of the result—the following collect the squares of even items (i.e., items with no remainder when divided by 2) in a range:

>>> [x * x for x in range(10) if x % 2 == 0]           # Lists are ordered

[0, 4, 16, 36, 64]

>>> {x * x for x in range(10) if x % 2 == 0}           # But sets are not

{0, 16, 4, 64, 36}

>>> {x: x * x for x in range(10) if x % 2 == 0}        # Neither are dict keys

{0: 0, 8: 64, 2: 4, 4: 16, 6: 36}

Nested for loops work as well, though the unordered and no-duplicates nature of both types of objects can make the results a bit less straightforward to decipher:

>>> [x + y for x in [1, 2, 3] for y in [4, 5, 6]]      # Lists keep duplicates

[5, 6, 7, 6, 7, 8, 7, 8, 9]

>>> {x + y for x in [1, 2, 3] for y in [4, 5, 6]}      # But sets do not

{8, 9, 5, 6, 7}

>>> {x: y for x in [1, 2, 3] for y in [4, 5, 6]}       # Neither do dict keys

{1: 6, 2: 6, 3: 6}

Like list comprehensions, the set and dictionary varieties can also iterate over any type of iterable—lists, strings, files, ranges, and anything else that supports the iteration protocol:

>>> {x + y for x in 'ab' for y in 'cd'}

{'ac', 'bd', 'bc', 'ad'}

>>> {x + y: (ord(x), ord(y)) for x in 'ab' for y in 'cd'}

{'ac': (97, 99), 'bd': (98, 100), 'bc': (98, 99), 'ad': (97, 100)}

>>> {k * 2 for k in ['spam', 'ham', 'sausage'] if k[0] == 's'}

{'sausagesausage', 'spamspam'}

>>> {k.upper(): k * 2 for k in ['spam', 'ham', 'sausage'] if k[0] == 's'}

{'SAUSAGE': 'sausagesausage', 'SPAM': 'spamspam'}

For more details, experiment with these tools on your own. They may or may not have a performance advantage over the generator or for loop alternatives, but we would have to time their performance explicitly to be sure—which seems a natural segue to the next chapter.
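As a small preview of that timing work, the standard library’s timeit module can compare a comprehension to its statement equivalent. This is only an illustrative sketch—absolute numbers vary by machine and Python version, so no winner is assumed here:

```python
import timeit

# Time a set comprehension against its manual for loop equivalent.
comp_stmt = "{x * x for x in range(1000)}"

loop_stmt = """
res = set()
for x in range(1000):
    res.add(x * x)
"""

comp_time = timeit.timeit(comp_stmt, number=1000)
loop_time = timeit.timeit(loop_stmt, number=1000)
print('comprehension: %.4f seconds' % comp_time)
print('for loop:      %.4f seconds' % loop_time)
```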

Chapter Summary

This chapter wrapped up our coverage of built-in comprehension and iteration tools. It explored list comprehensions in the context of functional tools, and presented generator functions and expressions as additional iteration protocol tools. As a finale, we also summarized the four forms of comprehension in Python today—list, generator, set, and dictionary. Though we’ve now seen all the built-in iteration tools, the subject will resurface when we study user-defined iterable class objects in Chapter 30.

The next chapter is something of a continuation of the theme of this one—it rounds out this part of the book with a case study that times the performance of the tools we’ve studied here, and serves as a more realistic example at the midpoint in this book. Before we move ahead to benchmarking comprehensions and generators, though, this chapter’s quizzes give you a chance to review what you’ve learned about them here.

Test Your Knowledge: Quiz

1.    What is the difference between enclosing a list comprehension in square brackets and parentheses?

2.    How are generators and iterators related?

3.    How can you tell if a function is a generator function?

4.    What does a yield statement do?

5.    How are map calls and list comprehensions related? Compare and contrast the two.

Test Your Knowledge: Answers

1.    List comprehensions in square brackets produce the result list all at once in memory. When they are enclosed in parentheses instead, they are actually generator expressions—they have a similar meaning but do not produce the result list all at once. Instead, generator expressions return a generator object, which yields one item in the result at a time when used in an iteration context.

2.    Generators are iterable objects that support the iteration protocol automatically—they have an iterator with a __next__ method (next in 2.X) that repeatedly advances to the next item in a series of results and raises an exception at the end of the series. In Python, we can code generator functions with def and yield, generator expressions with parenthesized comprehensions, and generator objects with classes that define a special method named __iter__ (discussed later in the book).
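The protocol described in this answer can be verified interactively. The following sketch (an illustration, not from the chapter) shows that a generator object is its own iterator, advances with next, and raises StopIteration when its results are exhausted:

```python
def gen():                  # A generator function: contains yield
    yield 1
    yield 2

g = gen()
print(iter(g) is g)         # True: a generator object is its own iterator
print(next(g))              # 1: next(g) calls g.__next__() in 3.X
print(g.__next__())         # 2: the method behind the next built-in

try:
    next(g)                 # Exhausted: the protocol signals the end
except StopIteration:
    print('no more results')
```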

3.    A generator function has a yield statement somewhere in its code. Generator functions are otherwise identical to normal functions syntactically, but they are compiled specially by Python so as to return an iterable generator object when called. That object retains state and code location between values.

4.    When present, this statement makes Python compile the function specially as a generator; when called, the function returns a generator object that supports the iteration protocol. When the yield statement is run, it sends a result back to the caller and suspends the function’s state; the function can then be resumed after the last yield statement, in response to a next built-in or __next__ method call issued by the caller. In more advanced roles, the generator send method similarly resumes the generator, but can also pass a value that shows up as the yield expression’s value. Generator functions may also have a return statement, which terminates the generator.
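The send protocol in this answer can be sketched with a small running-total generator (a hypothetical example for illustration): the value passed to send becomes the result of the yield expression where the generator is paused.

```python
def accumulate():
    total = 0
    while True:
        value = yield total     # The value passed to send shows up here
        total += value

gen = accumulate()
print(next(gen))        # 0: advance to the first yield before sending
print(gen.send(10))     # 10: resume, pass in 10, get the new total back
print(gen.send(5))      # 15: local state is retained between resumptions
```

Note that the generator must first be advanced to a yield with next before send can pass a value in.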

5.    The map call is similar to a list comprehension—both produce a series of values, by collecting the results of applying an operation to each item in a sequence or other iterable, one item at a time. The primary difference is that map applies a function call to each item, while list comprehensions apply arbitrary expressions. Because of this, list comprehensions are more general; they can apply a function call expression like map, but map requires a function and cannot apply other kinds of expressions directly. List comprehensions also support extended syntax such as nested for loops and if clauses that subsume the filter built-in. In Python 3.X, map also differs in that it produces a generator of values; the list comprehension materializes the result list in memory all at once. In 2.X, both tools create result lists.
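The contrasts in this answer can be seen side by side in a brief sketch (illustrative only):

```python
nums = [-2, -1, 0, 1, 2]

print(list(map(abs, nums)))            # map: applies a function call to each item
print([abs(x) for x in nums])          # Comprehension: same result via an expression
print([x * 2 + 1 for x in nums])       # Arbitrary expression: no function required
print([x for x in nums if x > 0])      # if clause subsumes the filter built-in
print(list(filter(lambda x: x > 0, nums)))
```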