CHAPTER 21: Performance Tips
We have already mentioned quite a few times that interpreted Python is not the fastest of programming languages. But then again, what kind of software would it be if it could not be tuned for performance? In this chapter, we will discuss the different tips you can use to optimize Python.
Ever since about 1996, people have been writing about how to make Python faster. That means there is a large wealth of information that may or may not work for you. This chapter will give you a plethora of tips that will allow you to mix and match whatever suits your style and your system. Then again, nothing’s stopping you from implementing all of these.
The Lowdown on Optimizing
First off: the Python implementation. We covered several of these at the start of the book. When choosing one, make sure you are choosing based on what you really like to use and on what your job demands, not simply because this book or anyone else told you that a given implementation is faster than the rest.
Once you have begun using the software and played around with it for a bit, you will be able to find what is making the program slow. First get the program to yield correct results, then run it to check whether the correct program is slow. If it really is slow, profiling techniques will show you which parts of the program consume the most time. A comprehensive but snappy test suite is worth having at this point; it helps make sure that future optimizations do not change the program's correct behavior.
In a nutshell, you have to:
1. Get the program right.
2. Test that it’s really right.
3. Profile if it is slow.
4. Optimize the parts the profile shows to be slow.
5. Repeat from step 2.
There are certain ways to optimize which amount to a better programming style—such as the idioms we had discussed earlier. Make sure that you learn these as you go along with Python programming.
Profiling the Code
The first tangible step in trying to speed up your program is analyzing exactly where the bottlenecks lie. It hardly makes sense to optimize a piece of code that will never be executed, or one that already runs at blazing speed. There are two modules that can help locate your code's hotspots: trace and profile. You can also use the timeit module if you are running Python 2.3 or later.
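As a quick taste of timeit, the following sketch times a small snippet by running it many times; the snippet itself is just an arbitrary example:

```python
import timeit

# time a small snippet; timeit runs it repeatedly and reports the
# total elapsed time in seconds
elapsed = timeit.timeit("'-'.join(str(n) for n in range(100))",
                        number=1000)
print(elapsed)
```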
There are actually a number of modules included in Python that allow for optimization and profiling. One of these can be used to profile the execution of a set of functions. Let's say that your primary function is called main, and that it takes no arguments. You will then want to execute it under the profile module's control. In its simplest form, that is done like this:
Once main() returns, the profile module prints a table of the function calls along with their execution times. This output can be tweaked using the Stats class included with the module. From Python 2.4 onwards, profile can also measure the time consumed by Python builtins and extension-module functions.
Hotshot and cProfile Modules
The hotshot package has been available since Python 2.2 as a replacement for profile. The cProfile package, however, has come to be the preferred one. Both are written in C, so using either results in a smaller hit to overall performance, which gives you a better idea of how the application is actually performing. The hotshotmain.py program can also be found in the Tools/scripts directory of the distribution; it makes it easy for the programmer to run code under hotshot control from the command line.
Being open source, even Python modules have their spin-offs. One spin-off of the profile module we mentioned earlier is the trace module, which was originally written to perform test coverage. It has since been modified by the Python community; it could be found in the Tools/scripts directory of the distribution as far back as Python 2.0, and was added to the standard library (the Lib directory) in Python 2.3. You can copy it to your local bin directory and set the execute permission, after which it is easy to run from the command line to trace the execution of whole scripts:
% trace.py -t eggs.py spam
From the 2.4 release it is even easier to run: simply type python -m trace. You can also use pydoc trace to view the documentation of the trace module.
Visualizing the Profiling Results
You can use the following to visualize the results of your profiling attempts:
RunSnakeRun. This is a GUI tool that will visualize the profile dumps from the cProfile module using square maps. The function and method calls can be sorted by different criteria, and the source code can also be displayed beside the visualization and the call statistics.
Gprof2Dot. This is a tool based on Python which allows you to transform the profiling results into a graph. This graph can then be converted into SVG or PNG.
PyCallGraph. This is a module which helps you create call graphs for your Python programs. It can generate a file in PNG, which shows the function calls of the modules as well as their links to other function calls. It can also show the amount of times the function is called, as well as the time the program spent within that function.
PyProf2CallTree. This is a script that helps you visualize the profiling data collected through the cProfile module. It uses the KCacheGrind calltree analyzer.
ProfileEye. This is a front end to Gprof2Dot. It is browser-based, and uses d3.js in order to declutter the visual information.
Sorting
Sorting lists of basic Python objects is generally a pretty efficient process. The sort method for lists takes an optional comparison function as an argument, which can be used to change the sorting behavior. This is convenient, though it can slow the sort down considerably, since the comparison function has to be called many times as the sort runs. In Python 2.4 and later, you should use the built-in sort's key argument instead; this should be the fastest way to sort.
If you are using a version of Python older than 2.4, then the following advice should be applicable. Take heed, as it comes from Guido van Rossum himself.
To speed up the sorting, build a list of tuples whose first element is a sort key that sorts properly using the default comparison, and whose second element is the original list element. This is known as DSU (Decorate-Sort-Undecorate), and to Perl programmers as the Schwartzian Transform.
Suppose, for example, that you have a list of tuples which you want to sort by the nth field of each tuple. The following function will do it:
def sortby(alist, n):
    xlist = [(x[n], x) for x in alist]
    xlist.sort()
    return [val for (key, val) in xlist]
Another thing you can easily achieve is matching the built-in sort method's behavior (that is, sorting in place):
def sortby_inplace(alist, n):
    alist[:] = [(x[n], x) for x in alist]
    alist.sort()
    alist[:] = [val for (key, val) in alist]
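For Python 2.4 and later, the key argument replaces the whole recipe above; the decorate, sort, and undecorate steps all happen in C. A sketch, using a made-up list of (name, score) pairs:

```python
from operator import itemgetter

# hypothetical data: (name, score) pairs to sort by field 1
records = [("ann", 3), ("bob", 1), ("cyd", 2)]

# the key function plays the role of the decorate step
records.sort(key=itemgetter(1))
print(records)
```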
String Concatenation
How effective the advice in this section is depends on which Python implementation you are running. Concatenation tends to be fairly fast in later versions, but then again you might be using the one that came with your computer a few years back.
Remember that strings are immutable in Python. This is important, since many Python novices overlook it, leading to fatal programming mistakes. The immutability of strings has both advantages and disadvantages. Among the former is the fact that strings can be used as dictionary keys, and that individual copies can be shared among multiple variable bindings; in fact, Python automatically shares one- and two-character strings. On the down side, a programmer cannot say something like "change all the x's to y's" in a given string. Instead, a new string with the desired properties has to be created. This continual copying can be a source of inefficiency in some Python programs.
As an illustration, here is something that you would want to avoid:
a = ""
for substring in thislist:
    a += substring
Instead, use a = "".join(thislist). The former example is a very common mistake novices make when building large strings.
In the same vein, if you generate bits of a string sequentially, avoid the following:
a = ""
for b in anotherlist:
    a += a_function(b)

Instead, use:

alist = [a_function(elt) for elt in anotherlist]
a = "".join(alist)
Also, avoid the following:

out = "<html>" + head + prologue + query + tail + "</html>"

Instead of that, use:

out = "<html>%s%s%s%s</html>" % (head, prologue, query, tail)
For readability, it is even better to use dictionary substitution (this, however, has nothing to do with efficiency other than your own as a programmer):
out = "<html>%(head)s%(prologue)s%(query)s%(tail)s</html>" % locals()
The last couple of examples are both much faster than the first, even more so when piled up over many CGI script executions, and they are easier to modify as well. On top of this, the addition of rich comparisons in Python 2.x made the slow way even slower: the virtual machine now takes even more time to figure out how to concatenate two strings. Also, do not forget that Python performs all method lookup at runtime.
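To see the difference on your own interpreter, here is a sketch comparing the two approaches with the timeit module; the substrings are an arbitrary example:

```python
import timeit

parts = ["spam"] * 1000

def concat_plus():
    s = ""
    for sub in parts:
        s += sub  # the growing string may be copied on each pass
    return s

def concat_join():
    return "".join(parts)  # one pass, done in C

# both build the same string
assert concat_plus() == concat_join()

print("+=  :", timeit.timeit(concat_plus, number=100))
print("join:", timeit.timeit(concat_join, number=100))
```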
Loops
Python supports a couple of looping constructs, of which the for statement is the most commonly used. It loops over the elements of a sequence, assigning each to the loop variable. If the body of your loop is simple, the interpreter overhead of the for loop itself can be a substantial chunk of the total overhead. This is where the map function comes in handy.
You can think of map as a for loop moved into C. Its sole restriction is that the loop body must be a function call. Here is an example that loops over a list of words and converts them to upper case:
newlist = []
for word in thatlist:
    newlist.append(word.upper())
You can instead use the map function to push the loop from the interpreter into compiled C code:

newlist = map(str.upper, thatlist)
It doesn't even matter if you have plain old Python 2.0, as list comprehensions were added in that version. They provide a syntactically more compact and more efficient way to write the for loop:

newlist = [s.upper() for s in thatlist]
Generator expressions were added in Python 2.4. They function in much the same way as map or a list comprehension, but they avoid the overhead of generating the entire list at once. Instead, they return a generator object which can be iterated incrementally:

iterator = (s.upper() for s in thatlist)
Which method is appropriate will depend on the version of Python you are using and on the characteristics of the data you are manipulating.
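Putting the four spellings side by side (under Python 3, where map returns an iterator, hence the list() wrapper; the word list is example data):

```python
thatlist = ["spam", "eggs", "ham"]

# classic for loop
newlist1 = []
for word in thatlist:
    newlist1.append(word.upper())

# map: the loop body pushed into C
newlist2 = list(map(str.upper, thatlist))

# list comprehension
newlist3 = [s.upper() for s in thatlist]

# generator expression, consumed incrementally
newlist4 = list(s.upper() for s in thatlist)

assert newlist1 == newlist2 == newlist3 == newlist4
```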
Avoiding the Dots
But suppose you cannot use map or a list comprehension? You may be stuck with the for loop, and that introduces another inefficiency: the attribute lookups in the loop body (word.upper and the list's append method) are function references that get re-evaluated on every pass through the loop. Instead, you can replace the original loop with the following code:
upper = str.upper
newlist = []
append = newlist.append
for word in thatlist:
    append(upper(word))
However, use this technique with caution. It becomes more difficult to maintain as the loop grows larger, and a reader has to be well acquainted with the use of upper and append to follow what is going on.
Beyond the map version, the last speedup available for this for loop is to use local variables wherever possible. If the loop above is cast as a function, upper and append become local variables, and Python accesses local variables much more efficiently than global ones. Take a look at the following example:
def func(thatlist):
    upper = str.upper
    newlist = []
    append = newlist.append
    for word in thatlist:
        append(upper(word))
    return newlist
Initializing Dictionary Elements
Let's say that you are building a dictionary of word frequencies, and that you have already broken the text up into a list of words. You might then execute something similar to the following:
wdict = {}
for word in words:
    if word not in wdict:
        wdict[word] = 0
    wdict[word] = wdict[word] + 1
Except for the first time each word is seen, the if statement's test fails. If you are counting a large number of words, many of them will probably occur multiple times. In a situation where a value's initialization occurs only once but its augmentation occurs many times, it is cheaper to use a try statement. Check out the following code:
wdict = {}
for word in words:
    try:
        wdict[word] += 1
    except KeyError:
        wdict[word] = 1
It is important to catch the specific KeyError exception you are expecting, rather than using a generic except clause, to avoid attempting to recover from exceptions which the statements in the try clause cannot really handle.
Another alternative is the get() method, available since Python 2.0 (which means that if you downloaded a fresh Python implementation when you started this book, this is the best thing to use). It returns a default value when the desired key is not found in the dictionary, which simplifies the loop:
wdict = {}
get = wdict.get
for word in words:
    wdict[word] = get(word, 0) + 1
In addition, if the value the dictionary stores is a mutable object such as a list, you can use dict.setdefault:
wdict.setdefault(key, []).append(new_element)
You might think this avoids having to look the key up twice. It does not, not even in Python 3.x, but at least the double lookup is performed in C.
Another available option is using the defaultdict:
from collections import defaultdict

wdict = defaultdict(int)
for word in words:
    wdict[word] += 1
Overhead for Import Statements
The import statement can be executed almost anywhere. It is often useful to place imports inside functions to restrict their visibility and to help reduce start-up time. Although the interpreter is optimized to avoid importing the same module multiple times, repeatedly executing an import statement can, in certain circumstances, seriously affect a program's performance.
Check out the next two examples of code:
def dothis1():
    import string  # import statement inside function
    string.lower('Python')

for num in range(100000):
    dothis1()
import string  # import statement outside function

def dothis2():
    string.lower('Python')

for num in range(100000):
    dothis2()
In the above examples, dothis2 runs significantly faster than dothis1, even though the reference to the string module is global in dothis2. Then again, string methods (introduced in Python 2.0) can be used instead, avoiding the import entirely and making the program run even faster:
def dothis3():
    'python'.lower()

for num in range(100000):
    dothis3()
Notice that putting an import inside a function can speed up the module's initial loading, especially when the imported module might not even be required. This is a case of "lazy" optimization: avoiding work (here, a potentially expensive import) until it is confirmed that the work is required. It is only a significant saving if the module would otherwise never have been imported at all, from any module. If the module is already loaded, as many standard modules such as re or string will be, avoiding the import saves nothing. To see which modules have already been loaded, check sys.modules.
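A quick way to peek at what is already loaded:

```python
import sys

# sys.modules maps module names to already-loaded module objects;
# re-importing a module found here is cheap
print("re already loaded?", "re" in sys.modules)
print(len(sys.modules), "modules loaded so far")
```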
You can also use the following method to do a good lazy import:
email = None

def parse_email():
    global email
    if email is None:
        import email
    # ... use the email module here ...
In this example, the email module is imported only once, on the first call to parse_email().
Data Aggregation
In Python, the overhead of a function call is quite high, especially when compared with the execution speed of a builtin function. This suggests that, where appropriate, functions should handle data aggregates rather than individual elements. Here is a contrived example:
x = 0
def doit1(i):
    global x
    x = x + i

mylist = range(100000)
for i in mylist:
    doit1(i)
Compare the above example with the next one:
x = 0
def doit2(values):
    global x
    for i in values:
        x = x + i

mylist = range(100000)
doit2(mylist)
Even written in Python, the second example runs around four times faster. Had it been written in C, the difference in speed would have been even more profound, since that would replace the Python for loop with a C one while also removing most of the function calls.
Doing Stuff Less Often
Well, we already talked about being lazy a bit in the last section, so let's take a break from the purely technical and focus on that. Periodically, the Python interpreter performs some checks. Specifically, it decides whether to let another thread run, and whether to run a pending call (such calls are typically established by signal handlers).
Most of the time, these checks find nothing for Python to do, so each pass they make around the interpreter loop just slows things down. The sys module's setcheckinterval function can be called to tell the interpreter how often you want these periodic checks performed. Before Python 2.3, this defaulted to 10; since then, it has been raised to 100. If you are not running with threads and do not expect to catch many signals, setting this to a larger value can nudge the interpreter's performance upward.
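A hedged sketch of the call; note that modern interpreters (Python 3.2 and later) replaced setcheckinterval with setswitchinterval, which takes seconds rather than a bytecode count, and Python 3.9 removed setcheckinterval entirely:

```python
import sys

if hasattr(sys, "setcheckinterval"):
    # older interpreters: check only every 1000 virtual instructions
    sys.setcheckinterval(1000)
else:
    # modern equivalent: switch threads at most every 5 ms
    sys.setswitchinterval(0.005)
```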
Instead of range, use xrange
This does not apply if you are using the Python 3.x branch, where the range function provides a lazy object covering arbitrarily sized ranges, and where the xrange function no longer exists.
However, if you are using other Python versions, this may be useful. Python provides two ways of getting a range of numbers: xrange and range. Most programmers already know range, since the name makes it quite obvious. However, xrange is far less known. You can think of it as a generator, roughly equivalent to the following Python 2.3 code:
def xrange(start, stop=None, step=1):
    if stop is None:
        stop = start
        start = 0
    else:
        stop = int(stop)
    start = int(start)
    step = int(step)
    while start < stop:
        yield start
        start += step
All of this, except that the real implementation is purely in C.
The xrange function does have its limits. Specifically, it only works with ints: you cannot use it with floats or longs, as they will be converted to ints, as demonstrated above.
It can, however, save you a significant amount of memory, and unless the yielded objects are stored somewhere, only a single yielded object exists at a time. The difference is this: when you call range, it creates a list containing that many number objects (ints, longs, or floats). All of those objects are created at once, and all of them exist at the same time. This becomes a significant issue when the amount of numbers is very large.
On the other hand, xrange creates exactly zero numbers immediately: just the range object itself. Its number objects are created only when you pull on the generator, for instance by looping through it. Here is an example:

xrange(sys.maxint)
No numbers are instantiated here: there is no loop and no call to .next, so the line runs instantaneously. If you used range instead, Python would lock up, too caught up allocating the number objects for sys.maxint (a figure that can reach about 2.1 billion on a typical PC) to do anything else, until it eventually exits after running out of memory.
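Under Python 3, the surviving range behaves like xrange; a rough sketch of the memory difference, with list() standing in for the old eager range:

```python
import sys

n = 100000
lazy = range(n)          # constant-size object; no ints created yet
eager = list(range(n))   # all n int objects exist at once

# the lazy object stays tiny no matter how large n is
print(sys.getsizeof(lazy), "bytes vs", sys.getsizeof(eager), "bytes")
```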
In Python versions before 2.2, xrange objects also supported optimizations such as fast membership testing (a in xrange(b)). These were eventually removed because of lack of use.
Remapping Functions at Runtime
Let’s say that you have the following:
class Test:
    def check(self, a, b, c):
        if a == 0:
            self.str = b * 100
        else:
            self.str = c * 100

a = Test()
def example():
    for i in xrange(0, 100000):
        a.check(i, "b", "c")
Suppose, after this, that check gets called from somewhere else many more times. The if statement will slow you down every time except the first, so you can do this instead:
class Test2:
    def check(self, a, b, c):
        self.str = b * 100
        self.check = self.check_post
    def check_post(self, a, b, c):
        self.str = c * 100

a = Test2()
def example2():
    for i in xrange(0, 100000):
        a.check(i, "b", "c")
You might find this example somewhat contrived, but if the if statement involves lots of dots or some other complicated expression, you save yourself from evaluating it over and over when you know it can only be true the first time.