Learning Python (2013)

Part II. Types and Operations

Chapter 7. String Fundamentals

So far, we’ve studied numbers and explored Python’s dynamic typing model. The next major type on our in-depth core object tour is the Python string—an ordered collection of characters used to store and represent text- and bytes-based information. We looked briefly at strings in Chapter 4. Here, we will revisit them in more depth, filling in some of the details we skipped earlier.

This Chapter’s Scope

Before we get started, I also want to clarify what we won’t be covering here. Chapter 4 briefly previewed Unicode strings and files—tools for dealing with non-ASCII text. Unicode is a key tool for some programmers, especially those who work in the Internet domain. It can pop up, for example, in web pages, email content and headers, FTP transfers, GUI APIs, directory tools, and HTML, XML and JSON text.

At the same time, Unicode can be a heavy topic for programmers just starting out, and many (or most) of the Python programmers I meet today still do their jobs in blissful ignorance of the entire topic. In light of that, this book relegates most of the Unicode story to Chapter 37 of its Advanced Topics part as optional reading, and focuses on string basics here.

That is, this chapter tells only part of the string story in Python—the part that most scripts use and most programmers need to know. It explores the fundamental str string type, which handles ASCII text, and works the same regardless of which version of Python you use. Despite this intentionally limited scope, because str also handles Unicode in Python 3.X, and the separate unicode type works almost identically to str in 2.X, everything we learn here will apply directly to Unicode processing too.

Unicode: The Short Story

For readers who do care about Unicode, I’d like to also provide a quick summary of its impacts and pointers for further study. From a formal perspective, ASCII is a simple form of Unicode text, but just one of many possible encodings and alphabets. Text from non-English-speaking sources may use very different letters, and may be encoded very differently when stored in files.

As we saw in Chapter 4, Python addresses this by distinguishing between text and binary data, with distinct string object types and file interfaces for each. This support varies per Python line:

§  In Python 3.X there are three string types: str is used for Unicode text (including ASCII), bytes is used for binary data (including encoded text), and bytearray is a mutable variant of bytes. Files work in two modes: text, which represents content as str and implements Unicode encodings, and binary, which deals in raw bytes and does no data translation.

§  In Python 2.X, unicode strings represent Unicode text, str strings handle both 8-bit text and binary data, and bytearray is available in 2.6 and later as a back-port from 3.X. Normal files’ content is simply bytes represented as str, but a codecs module opens Unicode text files, handles encodings, and represents content as unicode objects.

Despite such version differences, if and when you do need to care about Unicode you’ll find that it is a relatively minor extension—once text is in memory, it’s a Python string of characters that supports all the basics we’ll study in this chapter. In fact, the primary distinction of Unicode often lies in the translation (a.k.a. encoding) step required to move it to and from files. Beyond that, it’s largely just string processing.

Again, though, because most programmers don’t need to come to grips with Unicode details up front, I’ve moved most of them to Chapter 37. When you’re ready to learn about these more advanced string concepts, I encourage you to see both their preview in Chapter 4 and the full Unicode and bytes disclosure in Chapter 37 after reading the string fundamentals material here.

For this chapter, we’ll focus on the basic string type and its operations. As you’ll find, the techniques we’ll study here also apply directly to the more advanced string types in Python’s toolset.

String Basics

From a functional perspective, strings can be used to represent just about anything that can be encoded as text or bytes. In the text department, this includes symbols and words (e.g., your name), contents of text files loaded into memory, Internet addresses, Python source code, and so on. Strings can also be used to hold the raw bytes used for media files and network transfers, and both the encoded and decoded forms of non-ASCII Unicode text used in internationalized programs.

You may have used strings in other languages, too. Python’s strings serve the same role as character arrays in languages such as C, but they are a somewhat higher-level tool than arrays. Unlike in C, in Python, strings come with a powerful set of processing tools. Also unlike languages such as C, Python has no distinct type for individual characters; instead, you just use one-character strings.

Strictly speaking, Python strings are categorized as immutable sequences, meaning that the characters they contain have a left-to-right positional order and that they cannot be changed in place. In fact, strings are the first representative of the larger class of objects called sequences that we will study here. Pay special attention to the sequence operations introduced in this chapter, because they will work the same on other sequence types we’ll explore later, such as lists and tuples.

Table 7-1 previews common string literals and operations we will discuss in this chapter. Empty strings are written as a pair of quotation marks (single or double) with nothing in between, and there are a variety of ways to code strings. For processing, strings support expression operations such as concatenation (combining strings), slicing (extracting sections), indexing (fetching by offset), and so on. Besides expressions, Python also provides a set of string methods that implement common string-specific tasks, as well as modules for more advanced text-processing tasks such as pattern matching. We’ll explore all of these later in the chapter.

Table 7-1. Common string literals and operations

Operation

Interpretation

S = ''

Empty string

S = "spam's"

Double quotes, same as single

S = 's\np\ta\x00m'

Escape sequences

S = """...multiline..."""

Triple-quoted block strings

S = r'\temp\spam'

Raw strings (no escapes)

B = b'sp\xc4m'

Byte strings in 2.6, 2.7, and 3.X (Chapter 4Chapter 37)

U = u'sp\u00c4m'

Unicode strings in 2.X and 3.3+ (Chapter 4Chapter 37)

S1 + S2

S * 3

Concatenate, repeat

S[i]

S[i:j]

len(S)

Index, slice, length

"a %s parrot" % kind

String formatting expression

"a {0} parrot".format(kind)

String formatting method in 2.6, 2.7, and 3.X

S.find('pa')

S.rstrip()

S.replace('pa', 'xx')

S.split(',')

S.isdigit()

S.lower()

S.endswith('spam')

'spam'.join(strlist)

S.encode('latin-1')

B.decode('utf8')

String methods (see ahead for all 43): search,

remove whitespace,

replacement,

split on delimiter,

content test,

case conversion,

end test,

delimiter join,

Unicode encoding,

Unicode decoding, etc. (see Table 7-3)

for x in S: print(x)

'spam' in S

[c * 2 for c in S]

map(ord, S)

Iteration, membership

re.match('sp(.*)am', line)

Pattern matching: library module

Beyond the core set of string tools in Table 7-1, Python also supports more advanced pattern-based string processing with the standard library’s re (for “regular expression”) module, introduced in Chapter 4 and Chapter 36, and even higher-level text processing tools such as XML parsers (discussed briefly in Chapter 37). This book’s scope, though, is focused on the fundamentals represented by Table 7-1.

To cover the basics, this chapter begins with an overview of string literal forms and string expressions, then moves on to look at more advanced tools such as string methods and formatting. Python comes with many string tools, and we won’t look at them all here; the complete story is chronicled in the Python library manual and reference books. Our goal here is to explore enough commonly used tools to give you a representative sample; methods we won’t see in action here, for example, are largely analogous to those we will.

String Literals

By and large, strings are fairly easy to use in Python. Perhaps the most complicated thing about them is that there are so many ways to write them in your code:

§  Single quotes: 'spa"m'

§  Double quotes: "spa'm"

§  Triple quotes: '''... spam ...''', """... spam ..."""

§  Escape sequences: "s\tp\na\0m"

§  Raw strings: r"C:\new\test.spm"

§  Bytes literals in 3.X and 2.6+ (see Chapter 4Chapter 37): b'sp\x01am'

§  Unicode literals in 2.X and 3.3+ (see Chapter 4Chapter 37): u'eggs\u0020spam'

The single- and double-quoted forms are by far the most common; the others serve specialized roles, and we’re postponing further discussion of the last two advanced forms until Chapter 37. Let’s take a quick look at all the other options in turn.

Single- and Double-Quoted Strings Are the Same

Around Python strings, single- and double-quote characters are interchangeable. That is, string literals can be written enclosed in either two single or two double quotes—the two forms work the same and return the same type of object. For example, the following two strings are identical, once coded:

>>> 'shrubbery', "shrubbery"

('shrubbery', 'shrubbery')

The reason for supporting both is that it allows you to embed a quote character of the other variety inside a string without escaping it with a backslash. You may embed a single-quote character in a string enclosed in double-quote characters, and vice versa:

>>> 'knight"s', "knight's"

('knight"s', "knight's")

This book generally prefers to use single quotes around strings just because they are marginally easier to read, except in cases where a single quote is embedded in the string. This is a purely subjective style choice, but Python displays strings this way too and most Python programmers do the same today, so you probably should too.

Note that the comma is important here. Without it, Python automatically concatenates adjacent string literals in any expression, although it is almost as simple to add a + operator between them to invoke concatenation explicitly (as we’ll see in Chapter 12, wrapping this form in parentheses also allows it to span multiple lines):

>>> title = "Meaning " 'of' " Life"        # Implicit concatenation

>>> title

'Meaning of Life'

Adding commas between these strings would result in a tuple, not a string. Also notice in all of these outputs that Python prints strings in single quotes unless they embed one. If needed, you can also embed quote characters by escaping them with backslashes:

>>> 'knight\'s', "knight\"s"

("knight's", 'knight"s')

To understand why, you need to know how escapes work in general.

Escape Sequences Represent Special Characters

The last example embedded a quote inside a string by preceding it with a backslash. This is representative of a general pattern in strings: backslashes are used to introduce special character codings known as escape sequences.

Escape sequences let us embed characters in strings that cannot easily be typed on a keyboard. The character \, and one or more characters following it in the string literal, are replaced with a single character in the resulting string object, which has the binary value specified by the escape sequence. For example, here is a five-character string that embeds a newline and a tab:

>>> s = 'a\nb\tc'

The two characters \n stand for a single character—the binary value of the newline character in your character set (in ASCII, character code 10). Similarly, the sequence \t is replaced with the tab character. The way this string looks when printed depends on how you print it. The interactive echo shows the special characters as escapes, but print interprets them instead:

>>> s

'a\nb\tc'

>>> print(s)

a

b       c

To be completely sure how many actual characters are in this string, use the built-in len function—it returns the actual number of characters in a string, regardless of how it is coded or displayed:

>>> len(s)

5

This string is five characters long: it contains an ASCII a, a newline character, an ASCII b, and so on.

NOTE

If you’re accustomed to all-ASCII text, it’s tempting to think of this result as meaning 5 bytes too, but you probably shouldn’t. Really, “bytes” has no meaning in the Unicode world. For one thing, the string object is probably larger in memory in Python.

More critically, string content and length both reflect code points (identifying numbers) in Unicode-speak, where a single character does not necessarily map directly to a single byte, either when encoded in files or when stored in memory. This mapping might hold true for simple 7-bit ASCII text, but even this depends on both the external encoding type and the internal storage scheme used. Under UTF-16, for example, ASCII characters are multiple bytes in files, and they may be 1, 2, or 4 bytes in memory depending on how Python allocates their space. For other, non-ASCII text, whose characters’ values might be too large to fit in an 8-bit byte, the character-to-byte mapping doesn’t apply at all.

In fact, 3.X defines str strings formally as sequences of Unicode code points, not bytes, to make this clear. There’s more on how strings are stored internally in Chapter 37 if you care to know. For now, to be safest, think characters instead of bytes in strings. Trust me on this; as an ex-C programmer, I had to break the habit too!

Note that the original backslash characters in the preceding result are not really stored with the string in memory; they are used only to describe special character values to be stored in the string. For coding such special characters, Python recognizes a full set of escape code sequences, listed in Table 7-2.

Table 7-2. String backslash characters

Escape

Meaning

\newline

Ignored (continuation line)

\\

Backslash (stores one \)

\'

Single quote (stores ')

\"

Double quote (stores ")

\a

Bell

\b

Backspace

\f

Formfeed

\n

Newline (linefeed)

\r

Carriage return

\t

Horizontal tab

\v

Vertical tab

\xhh

Character with hex value hh (exactly 2 digits)

\ooo

Character with octal value ooo (up to 3 digits)

\0

Null: binary 0 character (doesn’t end string)

\N{ id }

Unicode database ID

\uhhhh

Unicode character with 16-bit hex value

\Uhhhhhhhh

Unicode character with 32-bit hex value[a]

\other

Not an escape (keeps both \ and other)

[a] The \Uhhhh... escape sequence takes exactly eight hexadecimal digits (h); both \u and \U are recognized only in Unicode string literals in 2.X, but can be used in normal strings (which are Unicode) in 3.X. In a 3.X bytes literal, hexadecimal and octal escapes denote the byte with the given value; in a string literal, these escapes denote a Unicode character with the given code-point value. There is more on Unicode escapes in Chapter 37.

Some escape sequences allow you to embed absolute binary values into the characters of a string. For instance, here’s a five-character string that embeds two characters with binary zero values (coded as octal escapes of one digit):

>>> s = 'a\0b\0c'

>>> s

'a\x00b\x00c'

>>> len(s)

5

In Python, a zero (null) character like this does not terminate a string the way a “null byte” typically does in C. Instead, Python keeps both the string’s length and text in memory. In fact, no character terminates a string in Python. Here’s a string that is all absolute binary escape codes—a binary 1 and 2 (coded in octal), followed by a binary 3 (coded in hexadecimal):

>>> s = '\001\002\x03'

>>> s

'\x01\x02\x03'

>>> len(s)

3

Notice that Python displays nonprintable characters in hex, regardless of how they were specified. You can freely combine absolute value escapes and the more symbolic escape types in Table 7-2. The following string contains the characters “spam”, a tab and newline, and an absolute zero value character coded in hex:

>>> S = "s\tp\na\x00m"

>>> S

's\tp\na\x00m'

>>> len(S)

7

>>> print(S)

s       p

a m

This becomes more important to know when you process binary data files in Python. Because their contents are represented as strings in your scripts, it’s OK to process binary files that contain any sorts of binary byte values—when opened in binary modes, files return strings of raw bytes from the external file (there’s much more on files in Chapter 4Chapter 9, and Chapter 37).

Finally, as the last entry in Table 7-2 implies, if Python does not recognize the character after a \ as being a valid escape code, it simply keeps the backslash in the resulting string:

>>> x = "C:\py\code"           # Keeps \ literally (and displays it as \\)

>>> x

'C:\\py\\code'

>>> len(x)

10

However, unless you’re able to commit all of Table 7-2 to memory (and there are arguably better uses for your neurons!), you probably shouldn’t rely on this behavior. To code literal backslashes explicitly such that they are retained in your strings, double them up (\\ is an escape for one \) or use raw strings; the next section shows how.

Raw Strings Suppress Escapes

As we’ve seen, escape sequences are handy for embedding special character codes within strings. Sometimes, though, the special treatment of backslashes for introducing escapes can lead to trouble. It’s surprisingly common, for instance, to see Python newcomers in classes trying to open a file with a filename argument that looks something like this:

myfile = open('C:\new\text.dat', 'w')

thinking that they will open a file called text.dat in the directory C:\new. The problem here is that \n is taken to stand for a newline character, and \t is replaced with a tab. In effect, the call tries to open a file named C:(newline)ew(tab)ext.dat, with usually less-than-stellar results.

This is just the sort of thing that raw strings are useful for. If the letter r (uppercase or lowercase) appears just before the opening quote of a string, it turns off the escape mechanism. The result is that Python retains your backslashes literally, exactly as you type them. Therefore, to fix the filename problem, just remember to add the letter r on Windows:

myfile = open(r'C:\new\text.dat', 'w')

Alternatively, because two backslashes are really an escape sequence for one backslash, you can keep your backslashes by simply doubling them up:

myfile = open('C:\\new\\text.dat', 'w')

In fact, Python itself sometimes uses this doubling scheme when it prints strings with embedded backslashes:

>>> path = r'C:\new\text.dat'

>>> path                          # Show as Python code

'C:\\new\\text.dat'

>>> print(path)                   # User-friendly format

C:\new\text.dat

>>> len(path)                     # String length

15

As with numeric representation, the default format at the interactive prompt prints results as if they were code, and therefore escapes backslashes in the output. The print statement provides a more user-friendly format that shows that there is actually only one backslash in each spot. To verify this is the case, you can check the result of the built-in len function, which returns the number of characters in the string, independent of display formats. If you count the characters in the print(path) output, you’ll see that there really is just 1 character per backslash, for a total of 15.

Besides directory paths on Windows, raw strings are also commonly used for regular expressions (text pattern matching, supported with the re module introduced in Chapter 4 and Chapter 37). Also note that Python scripts can usually use forward slashes in directory paths on Windows and Unix because Python tries to interpret paths portably (i.e., 'C:/new/text.dat' works when opening files, too). Raw strings are useful if you code paths using native Windows backslashes, though.

NOTE

Despite its role, even a raw string cannot end in a single backslash, because the backslash escapes the following quote character—you still must escape the surrounding quote character to embed it in the string. That is, r"...\" is not a valid string literal—a raw string cannot end in an odd number of backslashes. If you need to end a raw string with a single backslash, you can use two and slice off the second (r'1\nb\tc\\'[:-1]), tack one on manually (r'1\nb\tc' + '\\'), or skip the raw string syntax and just double up the backslashes in a normal string ('1\\nb\\tc\\'). All three of these forms create the same eight-character string containing three backslashes.

Triple Quotes Code Multiline Block Strings

So far, you’ve seen single quotes, double quotes, escapes, and raw strings in action. Python also has a triple-quoted string literal format, sometimes called a block string, that is a syntactic convenience for coding multiline text data. This form begins with three quotes (of either the single or double variety), is followed by any number of lines of text, and is closed with the same triple-quote sequence that opened it. Single and double quotes embedded in the string’s text may be, but do not have to be, escaped—the string does not end until Python sees three unescaped quotes of the same kind used to start the literal. For example (the “...” here is Python’s prompt for continuation lines outside IDLE: don’t type it yourself):

>>> mantra = """Always look

...   on the bright

... side of life."""

>>> 

>>> mantra

'Always look\n  on the bright\nside of life.'

This string spans three lines. As we learned in Chapter 3, in some interfaces, the interactive prompt changes to ... on continuation lines like this, but IDLE simply drops down one line; this book shows listings in both forms, so extrapolate as needed. Either way, Python collects all the triple-quoted text into a single multiline string, with embedded newline characters (\n) at the places where your code has line breaks. Notice that, as in the literal, the second line in the result has leading spaces, but the third does not—what you type is truly what you get. To see the string with the newlines interpreted, print it instead of echoing:

>>> print(mantra)

Always look

  on the bright

side of life.

In fact, triple-quoted strings will retain all the enclosed text, including any to the right of your code that you might intend as comments. So don’t do this—put your comments above or below the quoted text, or use the automatic concatenation of adjacent strings mentioned earlier, with explicit newlines if desired, and surrounding parentheses to allow line spans (again, more on this latter form when we study syntax rules in Chapter 10 and Chapter 12):

>>> menu = """spam     # comments here added to string!

... eggs               # ditto

... """

>>> menu

'spam     # comments here added to string!\neggs               # ditto\n'

>>> menu = (

... "spam\n"           # comments here ignored

... "eggs\n"           # but newlines not automatic

... )

>>> menu

'spam\neggs\n'

Triple-quoted strings are useful anytime you need multiline text in your program; for example, to embed multiline error messages or HTML, XML, or JSON code in your Python source code files. You can embed such blocks directly in your scripts by triple-quoting without resorting to external text files or explicit concatenation and newline characters.

Triple-quoted strings are also commonly used for documentation strings, which are string literals that are taken as comments when they appear at specific points in your file (more on these later in the book). These don’t have to be triple-quoted blocks, but they usually are to allow for multiline comments.

Finally, triple-quoted strings are also sometimes used as a “horribly hackish” way to temporarily disable lines of code during development (OK, it’s not really too horrible, and it’s actually a fairly common practice today, but it wasn’t the intent). If you wish to turn off a few lines of code and run your script again, simply put three quotes above and below them, like this:

X = 1

"""

import os                            # Disable this code temporarily

print(os.getcwd())

"""

Y = 2

I said this was hackish because Python really might make a string out of the lines of code disabled this way, but this is probably not significant in terms of performance. For large sections of code, it’s also easier than manually adding hash marks before each line and later removing them. This is especially true if you are using a text editor that does not have support for editing Python code specifically. In Python, practicality often beats aesthetics.

Strings in Action

Once you’ve created a string with the literal expressions we just met, you will almost certainly want to do things with it. This section and the next two demonstrate string expressions, methods, and formatting—the first line of text-processing tools in the Python language.

Basic Operations

Let’s begin by interacting with the Python interpreter to illustrate the basic string operations listed earlier in Table 7-1. You can concatenate strings using the + operator and repeat them using the * operator:

% python

>>> len('abc')            # Length: number of items

3

>>> 'abc' + 'def'         # Concatenation: a new string

'abcdef'

>>> 'Ni!' * 4             # Repetition: like "Ni!" + "Ni!" + ...

'Ni!Ni!Ni!Ni!'

The len built-in function here returns the length of a string (or any other object with a length). Formally, adding two string objects with + creates a new string object, with the contents of its operands joined, and repetition with * is like adding a string to itself a number of times. In both cases, Python lets you create arbitrarily sized strings; there’s no need to predeclare anything in Python, including the sizes of data structures—you simply create string objects as needed and let Python manage the underlying memory space automatically (see Chapter 6 for more on Python’s memory management “garbage collector”).

Repetition may seem a bit obscure at first, but it comes in handy in a surprising number of contexts. For example, to print a line of 80 dashes, you can count up to 80, or let Python count for you:

>>> print('------- ...more... ---')      # 80 dashes, the hard way

>>> print('-' * 80)                      # 80 dashes, the easy way

Notice that operator overloading is at work here already: we’re using the same + and * operators that perform addition and multiplication when using numbers. Python does the correct operation because it knows the types of the objects being added and multiplied. But be careful: the rules aren’t quite as liberal as you might expect. For instance, Python doesn’t allow you to mix numbers and strings in + expressions: 'abc'+9 raises an error instead of automatically converting 9 to a string.

As shown in the last row in Table 7-1, you can also iterate over strings in loops using for statements, which repeat actions, and test membership for both characters and substrings with the in expression operator, which is essentially a search. For substrings, in is much like the str.find()method covered later in this chapter, but it returns a Boolean result instead of the substring’s position (the following uses a 3.X print call and may leave your cursor a bit indented; in 2.X say print c, instead):

>>> myjob = "hacker"

>>> for c in myjob: print(c, end=' ')   # Step through items, print each (3.X form)

...

h a c k e r

>>> "k" in myjob                        # Found

True

>>> "z" in myjob                        # Not found

False

>>> 'spam' in 'abcspamdef'              # Substring search, no position returned

True

The for loop assigns a variable to successive items in a sequence (here, a string) and executes one or more statements for each item. In effect, the variable c becomes a cursor stepping across the string’s characters here. We will discuss iteration tools like these and others listed in Table 7-1 in more detail later in this book (especially in Chapter 14 and Chapter 20).

Indexing and Slicing

Because strings are defined as ordered collections of characters, we can access their components by position. In Python, characters in a string are fetched by indexing—providing the numeric offset of the desired component in square brackets after the string. You get back the one-character string at the specified position.

As in the C language, Python offsets start at 0 and end at one less than the length of the string. Unlike C, however, Python also lets you fetch items from sequences such as strings using negative offsets. Technically, a negative offset is added to the length of a string to derive a positive offset. You can also think of negative offsets as counting backward from the end. The following interaction demonstrates:

>>> S = 'spam'

>>> S[0], S[−2]                         # Indexing from front or end

('s', 'a')

>>> S[1:3], S[1:], S[:−1]               # Slicing: extract a section

('pa', 'pam', 'spa')

The first line defines a four-character string and assigns it the name S. The next line indexes it in two ways: S[0] fetches the item at offset 0 from the left—the one-character string 's'; S[−2] gets the item at offset 2 back from the end—or equivalently, at offset (4 + (−2)) from the front. In more graphic terms, offsets and slices map to cells as shown in Figure 7-1.[16]

Offsets and slices: positive offsets start from the left end (offset 0 is the first item), and negatives count back from the right end (offset −1 is the last item). Either kind of offset can be used to give positions in indexing and slicing operations.

Figure 7-1. Offsets and slices: positive offsets start from the left end (offset 0 is the first item), and negatives count back from the right end (offset −1 is the last item). Either kind of offset can be used to give positions in indexing and slicing operations.

The last line in the preceding example demonstrates slicing, a generalized form of indexing that returns an entire section, not a single item. Probably the best way to think of slicing is that it is a type of parsing (analyzing structure), especially when applied to strings—it allows us to extract an entire section (substring) in a single step. Slices can be used to extract columns of data, chop off leading and trailing text, and more. In fact, we’ll explore slicing in the context of text parsing later in this chapter.

The basics of slicing are straightforward. When you index a sequence object such as a string on a pair of offsets separated by a colon, Python returns a new object containing the contiguous section identified by the offset pair. The left offset is taken to be the lower bound (inclusive), and the right is the upper bound (noninclusive). That is, Python fetches all items from the lower bound up to but not including the upper bound, and returns a new object containing the fetched items. If omitted, the left and right bounds default to 0 and the length of the object you are slicing, respectively.

For instance, in the example we just saw, S[1:3] extracts the items at offsets 1 and 2: it grabs the second and third items, and stops before the fourth item at offset 3. Next, S[1:] gets all items beyond the first—the upper bound, which is not specified, defaults to the length of the string. Finally, S[:−1] fetches all but the last item—the lower bound defaults to 0, and −1 refers to the last item, noninclusive.

This may seem confusing at first glance, but indexing and slicing are simple and powerful tools to use, once you get the knack. Remember, if you’re unsure about the effects of a slice, try it out interactively. In the next chapter, you’ll see that it’s even possible to change an entire section of another object in one step by assigning to a slice (though not for immutables like strings). Here’s a summary of the details for reference:

Indexing (S[i]) fetches components at offsets:

§ The first item is at offset 0.

§ Negative indexes mean to count backward from the end or right.

§ S[0] fetches the first item.

§ S[−2] fetches the second item from the end (like S[len(S)−2]).

Slicing (S[i:j]) extracts contiguous sections of sequences:

§ The upper bound is noninclusive.

§ Slice boundaries default to 0 and the sequence length, if omitted.

§ S[1:3] fetches items at offsets 1 up to but not including 3.

§ S[1:] fetches items at offset 1 through the end (the sequence length).

§ S[:3] fetches items at offset 0 up to but not including 3.

§ S[:−1] fetches items at offset 0 up to but not including the last item.

§ S[:] fetches items at offsets 0 through the end—making a top-level copy of S.

Extended slicing (S[i:j:k]) accepts a step (or stride) k, which defaults to +1:

§ Allows for skipping items and reversing order—see the next section.

The second-to-last bullet item listed here turns out to be a very common technique: it makes a full top-level copy of a sequence object—an object with the same value, but a distinct piece of memory (you’ll find more on copies in Chapter 9). This isn’t very useful for immutable objects like strings, but it comes in handy for objects that may be changed in place, such as lists.

In the next chapter, you’ll see that the syntax used to index by offset (square brackets) is used to index dictionaries by key as well; the operations look the same but have different interpretations.

Extended slicing: The third limit and slice objects

In Python 2.3 and later, slice expressions have support for an optional third index, used as a step (sometimes called a stride). The step is added to the index of each item extracted. The full-blown form of a slice is now X[I:J:K], which means “extract all the items in X, from offset I throughJ−1, by K.” The third limit, K, defaults to +1, which is why normally all items in a slice are extracted from left to right. If you specify an explicit value, however, you can use the third limit to skip items or to reverse their order.

For instance, X[1:10:2] will fetch every other item in X from offsets 1–9; that is, it will collect the items at offsets 1, 3, 5, 7, and 9. As usual, the first and second limits default to 0 and the length of the sequence, respectively, so X[::2] gets every other item from the beginning to the end of the sequence:

>>> S = 'abcdefghijklmnop'

>>> S[1:10:2]                          # Skipping items

'bdfhj'

>>> S[::2]

'acegikmo'

You can also use a negative stride to collect items in the opposite order. For example, the slicing expression "hello"[::−1] returns the new string "olleh"—the first two bounds default to 0 and the length of the sequence, as before, and a stride of −1 indicates that the slice should go from right to left instead of the usual left to right. The effect, therefore, is to reverse the sequence:

>>> S = 'hello'

>>> S[::−1]                            # Reversing items

'olleh'

With a negative stride, the meanings of the first two bounds are essentially reversed. That is, the slice S[5:1:−1] fetches the items from 2 to 5, in reverse order (the result contains items from offsets 5, 4, 3, and 2):

>>> S = 'abcedfg'

>>> S[5:1:−1]                          # Bounds roles differ

'fdec'

Skipping and reversing like this are the most common use cases for three-limit slices, but see Python’s standard library manual for more details (or run a few experiments interactively). We’ll revisit three-limit slices again later in this book, in conjunction with the for loop statement.

Later in the book, we’ll also learn that slicing is equivalent to indexing with a slice object, a finding of importance to class writers seeking to support both operations:

>>> 'spam'[1:3]                        # Slicing syntax

'pa'

>>> 'spam'[slice(1, 3)]                # Slice objects with index syntax + object

'pa'

>>> 'spam'[::-1]

'maps'

>>> 'spam'[slice(None, None, −1)]

'maps'

WHY YOU WILL CARE: SLICES

Throughout this book, I will include common use-case sidebars (such as this one) to give you a peek at how some of the language features being introduced are typically used in real programs. Because you won’t be able to make much sense of realistic use cases until you’ve seen more of the Python picture, these sidebars necessarily contain many references to topics not introduced yet; at most, you should consider them previews of ways that you may find these abstract language concepts useful for common programming tasks.

For instance, you’ll see later that the argument words listed on a system command line used to launch a Python program are made available in the argv attribute of the built-in sys module:

# File echo.py

import sys

print(sys.argv)

% python echo.py −a −b −c

['echo.py', '−a', '−b', '−c']

Usually, you’re only interested in inspecting the arguments that follow the program name. This leads to a typical application of slices: a single slice expression can be used to return all but the first item of a list. Here, sys.argv[1:] returns the desired list, ['−a', '−b', '−c']. You can then process this list without having to accommodate the program name at the front.

Slices are also often used to clean up lines read from input files. If you know that a line will have an end-of-line character at the end (a \n newline marker), you can get rid of it with a single expression such as line[:−1], which extracts all but the last character in the line (the lower limit defaults to 0). In both cases, slices do the job of logic that must be explicit in a lower-level language.

Having said that, calling the line.rstrip method is often preferred for stripping newline characters because this call leaves the line intact if it has no newline character at the end—a common case for files created with some text-editing tools. Slicing works if you’re sure the line is properly terminated.

String Conversion Tools

One of Python’s design mottos is that it refuses the temptation to guess. As a prime example, you cannot add a number and a string together in Python, even if the string looks like a number (i.e., is all digits):

# Python 3.X

>>> "42" + 1

TypeError: Can't convert 'int' object to str implicitly

# Python 2.X

>>> "42" + 1

TypeError: cannot concatenate 'str' and 'int' objects

This is by design: because + can mean both addition and concatenation, the choice of conversion would be ambiguous. Instead, Python treats this as an error. In Python, magic is generally omitted if it will make your life more complex.

What to do, then, if your script obtains a number as a text string from a file or user interface? The trick is that you need to employ conversion tools before you can treat a string like a number, or vice versa. For instance:

>>> int("42"), str(42)          # Convert from/to string

(42, '42')

>>> repr(42)                    # Convert to as-code string

'42'

The int function converts a string to a number, and the str function converts a number to its string representation (essentially, what it looks like when printed). The repr function (and the older backquotes expression, removed in Python 3.X) also converts an object to its string representation, but returns the object as a string of code that can be rerun to recreate the object. For strings, the result has quotes around it if displayed with a print statement, which differs in form between Python lines:

>>> print(str('spam'), repr('spam'))       # 2.X: print str('spam'), repr('spam')

spam 'spam'

>>> str('spam'), repr('spam')              # Raw interactive echo displays

('spam', "'spam'")

See the sidebar in Chapter 5’s str and repr Display Formats for more on these topics. Of these, int and str are the generally prescribed to-number and to-string conversion techniques.

Now, although you can’t mix strings and number types around operators such as +, you can manually convert operands before that operation if needed:

>>> S = "42"

>>> I = 1

>>> S + I

TypeError: Can't convert 'int' object to str implicitly

>>> int(S) + I            # Force addition

43

>>> S + str(I)            # Force concatenation

'421'

Similar built-in functions handle floating-point-number conversions to and from strings:

>>> str(3.1415), float("1.5")

('3.1415', 1.5)

>>> text = "1.234E-10"

>>> float(text)           # Shows more digits before 2.7 and 3.1

1.234e-10

Later, we’ll further study the built-in eval function; it runs a string containing Python expression code and so can convert a string to any kind of object. The functions int and float convert only to numbers, but this restriction means they are usually faster (and more secure, because they do not accept arbitrary expression code). As we saw briefly in Chapter 5, the string formatting expression also provides a way to convert numbers to strings. We’ll discuss formatting further later in this chapter.

Character code conversions

On the subject of conversions, it is also possible to convert a single character to its underlying integer code (e.g., its ASCII byte value) by passing it to the built-in ord function—this returns the actual binary value used to represent the corresponding character in memory. The chr functionperforms the inverse operation, taking an integer code and converting it to the corresponding character:

>>> ord('s')

115

>>> chr(115)

's'

Technically, both of these convert characters to and from their Unicode ordinals or “code points,” which are just their identifying number in the underlying character set. For ASCII text, this is the familiar 7-bit integer that fits in a single byte in memory, but the range of code points for other kinds of Unicode text may be wider (more on character sets and Unicode in Chapter 37). You can use a loop to apply these functions to all characters in a string if required. These tools can also be used to perform a sort of string-based math. To advance to the next character, for example, convert and do the math in integer:

>>> S = '5'

>>> S = chr(ord(S) + 1)

>>> S

'6'

>>> S = chr(ord(S) + 1)

>>> S

'7'

At least for single-character strings, this provides an alternative to using the built-in int function to convert from string to integer (though this only makes sense in character sets that order items as your code expects!):

>>> int('5')

5

>>> ord('5') - ord('0')

5

Such conversions can be used in conjunction with looping statements, introduced in Chapter 4 and covered in depth in the next part of this book, to convert a string of binary digits to their corresponding integer values. Each time through the loop, multiply the current value by 2 and add the next digit’s integer value:

>>> B = '1101'                 # Convert binary digits to integer with ord

>>> I = 0

>>> while B != '':

...     I = I * 2 + (ord(B[0]) - ord('0'))

...     B = B[1:]

...

>>> I

13

A left-shift operation (I << 1) would have the same effect as multiplying by 2 here. We’ll leave this change as a suggested exercise, though, both because we haven’t studied loops in detail yet and because the int and bin built-ins we met in Chapter 5 handle binary conversion tasks for us as of Python 2.6 and 3.0:

>>> int('1101', 2)             # Convert binary to integer: built-in

13

>>> bin(13)                    # Convert integer to binary: built-in

'0b1101'

Given enough time, Python tends to automate most common tasks!

Changing Strings I

Remember the term “immutable sequence”? As we’ve seen, the immutable part means that you cannot change a string in place—for instance, by assigning to an index:

>>> S = 'spam'

>>> S[0] = 'x'                 # Raises an error!

TypeError: 'str' object does not support item assignment

How to modify text information in Python, then? To change a string, you generally need to build and assign a new string using tools such as concatenation and slicing, and then, if desired, assign the result back to the string’s original name:

>>> S = S + 'SPAM!'            # To change a string, make a new one

>>> S

'spamSPAM!'

>>> S = S[:4] + 'Burger' + S[−1]

>>> S

'spamBurger!'

The first example adds a substring at the end of S, by concatenation. Really, it makes a new string and assigns it back to S, but you can think of this as “changing” the original string. The second example replaces four characters with six by slicing, indexing, and concatenating. As you’ll see in the next section, you can achieve similar effects with string method calls like replace:

>>> S = 'splot'

>>> S = S.replace('pl', 'pamal')

>>> S

'spamalot'

Like every operation that yields a new string value, string methods generate new string objects. If you want to retain those objects, you can assign them to variable names. Generating a new string object for each string change is not as inefficient as it may sound—remember, as discussed in the preceding chapter, Python automatically garbage-collects (reclaims the space of) old unused string objects as you go, so newer objects reuse the space held by prior values. Python is usually more efficient than you might expect.

Finally, it’s also possible to build up new text values with string formatting expressions. Both of the following substitute objects into a string, in a sense converting the objects to strings and changing the original string according to a format specification:

>>> 'That is %d %s bird!' % (1, 'dead')           # Format expression: all Pythons

That is 1 dead bird!

>>> 'That is {0} {1} bird!'.format(1, 'dead')     # Format method in 2.6, 2.7, 3.X

'That is 1 dead bird!'

Despite the substitution metaphor, though, the result of formatting is a new string object, not a modified one. We’ll study formatting later in this chapter; as we’ll find, formatting turns out to be more general and useful than this example implies. Because the second of the preceding calls is provided as a method, though, let’s get a handle on string method calls before we explore formatting further.

NOTE

As previewed in Chapter 4 and to be covered in Chapter 37, Python 3.0 and 2.6 introduced a new string type known as bytearray, which is mutable and so may be changed in place. bytearray objects aren’t really text strings; they’re sequences of small, 8-bit integers. However, they support most of the same operations as normal strings and print as ASCII characters when displayed. Accordingly, they provide another option for large amounts of simple 8-bit text that must be changed frequently (richer types of Unicode text imply different techniques). In Chapter 37 we’ll also see that ord and chr handle Unicode characters, too, which might not be stored in single bytes.


[16] More mathematically minded readers (and students in my classes) sometimes detect a small asymmetry here: the leftmost item is at offset 0, but the rightmost is at offset −1. Alas, there is no such thing as a distinct −0 value in Python.

String Methods

In addition to expression operators, strings provide a set of methods that implement more sophisticated text-processing tasks. In Python, expressions and built-in functions may work across a range of types, but methods are generally specific to object types—string methods, for example, work only on string objects. The method sets of some types intersect in Python 3.X (e.g., many types have count and copy methods), but they are still more type-specific than other tools.

Method Call Syntax

As introduced in Chapter 4, methods are simply functions that are associated with and act upon particular objects. Technically, they are attributes attached to objects that happen to reference callable functions which always have an implied subject. In finer-grained detail, functions are packages of code, and method calls combine two operations at once—an attribute fetch and a call:

Attribute fetches

An expression of the form object.attribute means “fetch the value of attribute in object.”

Call expressions

An expression of the form function(arguments) means “invoke the code of function, passing zero or more comma-separated argument objects to it, and return function’s result value.”

Putting these two together allows us to call a method of an object. The method call expression:

object.method(arguments)

is evaluated from left to right—Python will first fetch the method of the object and then call it, passing in both object and the arguments. Or, in plain words, the method call expression means this:

Call method to process object with arguments.

If the method computes a result, it will also come back as the result of the entire method-call expression. As a more tangible example:

>>> S = 'spam'

>>> result = S.find('pa')     # Call the find method to look for 'pa' in string S

This mapping holds true for methods of both built-in types, as well as user-defined classes we’ll study later. As you’ll see throughout this part of the book, most objects have callable methods, and all are accessed using this same method-call syntax. To call an object method, as you’ll see in the following sections, you have to go through an existing object; methods cannot be run (and make little sense) without a subject.

Methods of Strings

Table 7-3 summarizes the methods and call patterns for built-in string objects in Python 3.3; these change frequently, so be sure to check Python’s standard library manual for the most up-to-date list, or run a dir or help call on any string (or the str type name) interactively. Python 2.X’s string methods vary slightly; it includes a decode, for example, because of its different handling of Unicode data (something we’ll discuss in Chapter 37). In this table, S is a string object, and optional arguments are enclosed in square brackets. String methods in this table implement higher-level operations such as splitting and joining, case conversions, content tests, and substring searches and replacements.

Table 7-3. String method calls in Python 3.3

S.capitalize()

S.ljust(width [, fill])

S.casefold()

S.lower()

S.center(width [, fill])

S.lstrip([chars])

S.count(sub [, start [, end]])

S.maketrans(x[, y[, z]])

S.encode([encoding [,errors]])

S.partition(sep)

S.endswith(suffix [, start [, end]])

S.replace(old, new [, count])

S.expandtabs([tabsize])

S.rfind(sub [,start [,end]])

S.find(sub [, start [, end]])

S.rindex(sub [, start [, end]])

S.format(fmtstr, *args, **kwargs)

S.rjust(width [, fill])

S.index(sub [, start [, end]])

S.rpartition(sep)

S.isalnum()

S.rsplit([sep[, maxsplit]])

S.isalpha()

S.rstrip([chars])

S.isdecimal()

S.split([sep [,maxsplit]])

S.isdigit()

S.splitlines([keepends])

S.isidentifier()

S.startswith(prefix [, start [, end]])

S.islower()

S.strip([chars])

S.isnumeric()

S.swapcase()

S.isprintable()

S.title()

S.isspace()

S.translate(map)

S.istitle()

S.upper()

S.isupper()

S.zfill(width)

S.join(iterable)

 

As you can see, there are quite a few string methods, and we don’t have space to cover them all; see Python’s library manual or reference texts for all the fine points. To help you get started, though, let’s work through some code that demonstrates some of the most commonly used methods in action, and illustrates Python text-processing basics along the way.

String Method Examples: Changing Strings II

As we’ve seen, because strings are immutable, they cannot be changed in place directly. The bytearray supports in-place text changes in 2.6, 3.0, and later, but only for simple 8-bit types. We explored changes to text strings earlier, but let’s take a quick second look here in the context of string methods.

In general, to make a new text value from an existing string, you construct a new string with operations such as slicing and concatenation. For example, to replace two characters in the middle of a string, you can use code like this:

>>> S = 'spammy'

>>> S = S[:3] + 'xx' + S[5:]          # Slice sections from S

>>> S

'spaxxy'

But, if you’re really just out to replace a substring, you can use the string replace method instead:

>>> S = 'spammy'

>>> S = S.replace('mm', 'xx')         # Replace all mm with xx in S

>>> S

'spaxxy'

The replace method is more general than this code implies. It takes as arguments the original substring (of any length) and the string (of any length) to replace it with, and performs a global search and replace:

>>> 'aa$bb$cc$dd'.replace('$', 'SPAM')

'aaSPAMbbSPAMccSPAMdd'

In such a role, replace can be used as a tool to implement template replacements (e.g., in form letters). Notice that this time we simply printed the result, instead of assigning it to a name—you need to assign results to names only if you want to retain them for later use.

If you need to replace one fixed-size string that can occur at any offset, you can do a replacement again, or search for the substring with the string find method and then slice:

>>> S = 'xxxxSPAMxxxxSPAMxxxx'

>>> where = S.find('SPAM')            # Search for position

>>> where                             # Occurs at offset 4

4

>>> S = S[:where] + 'EGGS' + S[(where+4):]

>>> S

'xxxxEGGSxxxxSPAMxxxx'

The find method returns the offset where the substring appears (by default, searching from the front), or −1 if it is not found. As we saw earlier, it’s a substring search operation just like the in expression, but find returns the position of a located substring.

Another option is to use replace with a third argument to limit it to a single substitution:

>>> S = 'xxxxSPAMxxxxSPAMxxxx'

>>> S.replace('SPAM', 'EGGS')         # Replace all

'xxxxEGGSxxxxEGGSxxxx'

>>> S.replace('SPAM', 'EGGS', 1)      # Replace one

'xxxxEGGSxxxxSPAMxxxx'

Notice that replace returns a new string object each time. Because strings are immutable, methods never really change the subject strings in place, even if they are called “replace”!

The fact that concatenation operations and the replace method generate new string objects each time they are run is actually a potential downside of using them to change strings. If you have to apply many changes to a very large string, you might be able to improve your script’s performance by converting the string to an object that does support in-place changes:

>>> S = 'spammy'

>>> L = list(S)

>>> L

['s', 'p', 'a', 'm', 'm', 'y']

The built-in list function (an object construction call) builds a new list out of the items in any sequence—in this case, “exploding” the characters of a string into a list. Once the string is in this form, you can make multiple changes to it without generating a new copy for each change:

>>> L[3] = 'x'                        # Works for lists, not strings

>>> L[4] = 'x'

>>> L

['s', 'p', 'a', 'x', 'x', 'y']

If, after your changes, you need to convert back to a string (e.g., to write to a file), use the string join method to “implode” the list back into a string:

>>> S = ''.join(L)

>>> S

'spaxxy'

The join method may look a bit backward at first sight. Because it is a method of strings (not of lists), it is called through the desired delimiter. join puts the strings in a list (or other iterable) together, with the delimiter between list items; in this case, it uses an empty string delimiter to convert from a list back to a string. More generally, any string delimiter and iterable of strings will do:

>>> 'SPAM'.join(['eggs', 'sausage', 'ham', 'toast'])

'eggsSPAMsausageSPAMhamSPAMtoast'

In fact, joining substrings all at once might often run faster than concatenating them individually. Be sure to also see the earlier note about the mutable bytearray string available as of Python 3.0 and 2.6, described fully in Chapter 37; because it may be changed in place, it offers an alternative to this list/join combination for some kinds of 8-bit text that must be changed often.

String Method Examples: Parsing Text

Another common role for string methods is as a simple form of text parsing—that is, analyzing structure and extracting substrings. To extract substrings at fixed offsets, we can employ slicing techniques:

>>> line = 'aaa bbb ccc'

>>> col1 = line[0:3]

>>> col3 = line[8:]

>>> col1

'aaa'

>>> col3

'ccc'

Here, the columns of data appear at fixed offsets and so may be sliced out of the original string. This technique passes for parsing, as long as the components of your data have fixed positions. If instead some sort of delimiter separates the data, you can pull out its components by splitting. This will work even if the data may show up at arbitrary positions within the string:

>>> line = 'aaa bbb  ccc'

>>> cols = line.split()

>>> cols

['aaa', 'bbb', 'ccc']

The string split method chops up a string into a list of substrings, around a delimiter string. We didn’t pass a delimiter in the prior example, so it defaults to whitespace—the string is split at groups of one or more spaces, tabs, and newlines, and we get back a list of the resulting substrings. In other applications, more tangible delimiters may separate the data. This example splits (and hence parses) the string at commas, a separator common in data returned by some database tools:

>>> line = 'bob,hacker,40'

>>> line.split(',')

['bob', 'hacker', '40']

Delimiters can be longer than a single character, too:

>>> line = "i'mSPAMaSPAMlumberjack"

>>> line.split("SPAM")

["i'm", 'a', 'lumberjack']

Although there are limits to the parsing potential of slicing and splitting, both run very fast and can handle basic text-extraction chores. Comma-separated text data is part of the CSV file format; for more advanced tools on this front, see also the csv module in Python’s standard library.

Other Common String Methods in Action

Other string methods have more focused roles—for example, to strip off whitespace at the end of a line of text, perform case conversions, test content, and test for a substring at the end or front:

>>> line = "The knights who say Ni!\n"

>>> line.rstrip()

'The knights who say Ni!'

>>> line.upper()

'THE KNIGHTS WHO SAY NI!\n'

>>> line.isalpha()

False

>>> line.endswith('Ni!\n')

True

>>> line.startswith('The')

True

Alternative techniques can also sometimes be used to achieve the same results as string methods—the in membership operator can be used to test for the presence of a substring, for instance, and length and slicing operations can be used to mimic endswith:

>>> line

'The knights who say Ni!\n'

>>> line.find('Ni') != −1       # Search via method call or expression

True

>>> 'Ni' in line

True

>>> sub = 'Ni!\n'

>>> line.endswith(sub)          # End test via method call or slice

True

>>> line[-len(sub):] == sub

True

See also the format string formatting method described later in this chapter; it provides more advanced substitution tools that combine many operations in a single step.

Again, because there are so many methods available for strings, we won’t look at every one here. You’ll see some additional string examples later in this book, but for more details you can also turn to the Python library manual and other documentation sources, or simply experiment interactively on your own. You can also check the help(S.method) results for a method of any string object S for more hints; as we saw in Chapter 4, running help on str.method likely gives the same details.

Note that none of the string methods accepts patterns—for pattern-based text processing, you must use the Python re standard library module, an advanced tool that was introduced in Chapter 4 but is mostly outside the scope of this text (one further brief example appears at the end ofChapter 37). Because of this limitation, though, string methods may sometimes run more quickly than the re module’s tools.

The Original string Module’s Functions (Gone in 3.X)

The history of Python’s string methods is somewhat convoluted. For roughly the first decade of its existence, Python provided a standard library module called string that contained functions that largely mirrored the current set of string object methods. By popular demand, in Python 2.0 these functions were made available as methods of string objects. Because so many people had written so much code that relied on the original string module, however, it was retained for backward compatibility.

Today, you should use only string methods, not the original string module. In fact, the original module call forms of today’s string methods have been removed completely from Python 3.X, and you should not use them in new code in either 2.X or 3.X. However, because you may still see the module in use in older Python 2.X code, and this text covers both Pythons 2.X and 3.X, a brief look is in order here.

The upshot of this legacy is that in Python 2.X, there technically are still two ways to invoke advanced string operations: by calling object methods, or by calling string module functions and passing in the objects as arguments. For instance, given a variable X assigned to a string object, calling an object method:

X.method(arguments)

is usually equivalent to calling the same operation through the string module (provided that you have already imported the module):

string.method(X, arguments)

Here’s an example of the method scheme in action:

>>> S = 'a+b+c+'

>>> x = S.replace('+', 'spam')

>>> x

'aspambspamcspam'

To access the same operation through the string module in Python 2.X, you need to import the module (at least once in your process) and pass in the object:

>>> import string

>>> y = string.replace(S, '+', 'spam')

>>> y

'aspambspamcspam'

Because the module approach was the standard for so long, and because strings are such a central component of most programs, you might see both call patterns in Python 2.X code you come across.

Again, though, today you should always use method calls instead of the older module calls. There are good reasons for this, besides the fact that the module calls have gone away in 3.X. For one thing, the module call scheme requires you to import the string module (methods do not require imports). For another, the module makes calls a few characters longer to type (when you load the module with import, that is, not using from). And, finally, the module runs more slowly than methods (the module maps most calls back to the methods and so incurs an extra call along the way).

The original string module itself, without its string method equivalents, is retained in Python 3.X because it contains additional tools, including predefined string constants (e.g., string.digits) and a Template object system—a relatively obscure formatting tool that predates the stringformat method and is largely omitted here (for details, see the brief note comparing it to other formatting tools ahead, as well as Python’s library manual). Unless you really want to have to change your 2.X code to use 3.X, though, you should consider any basic string operation calls in it to be just ghosts of Python past.

String Formatting Expressions

Although you can get a lot done with the string methods and sequence operations we’ve already met, Python also provides a more advanced way to combine string processing tasks—string formatting allows us to perform multiple type-specific substitutions on a string in a single step. It’s never strictly required, but it can be convenient, especially when formatting text to be displayed to a program’s users. Due to the wealth of new ideas in the Python world, string formatting is available in two flavors in Python today (not counting the less-used string module Templatesystem mentioned in the prior section):

String formatting expressions: '...%s...' % (values)

The original technique available since Python’s inception, this form is based upon the C language’s “printf” model, and sees widespread use in much existing code.

String formatting method calls: '...{}...'.format(values)

A newer technique added in Python 2.6 and 3.0, this form is derived in part from a same-named tool in C#/.NET, and overlaps with string formatting expression functionality.

Since the method call flavor is newer, there is some chance that one or the other of these may become deprecated and removed over time. When 3.0 was released in 2008, the expression seemed more likely to be deprecated in later Python releases. Indeed, 3.0’s documentation threatened deprecation in 3.1 and removal thereafter. This hasn’t happened as of 2013 and 3.3, and now looks unlikely given the expression’s wide use—in fact, it still appears even in Python’s own standard library thousands of times today!

Naturally, this story’s development depends on the future practice of Python’s users. On the other hand, because both the expression and method are valid to use today and either may appear in code you’ll come across, this book covers both techniques in full here. As you’ll see, the two are largely variations on a theme, though the method has some extra features (such as thousands separators), and the expression is often more concise and seems second nature to most Python programmers.

This book itself uses both techniques in later examples for illustrative purposes. If its author has a preference, he will keep it largely classified, except to quote from Python’s import this motto:

There should be one—and preferably only one—obvious way to do it.

Unless the newer string formatting method is compellingly better than the original and widely used expression, its doubling of Python programmers’ knowledge base requirements in this domain seems unwarranted—and even un-Pythonic, per the original and longstanding meaning of that term. Programmers should not have to learn two complicated tools if those tools largely overlap. You’ll have to judge for yourself whether formatting merits the added language heft, of course, so let’s give both a fair hearing.

Formatting Expression Basics

Since string formatting expressions are the original in this department, we’ll start with them. Python defines the % binary operator to work on strings (you may recall that this is also the remainder of division, or modulus, operator for numbers). When applied to strings, the % operator provides a simple way to format values as strings according to a format definition. In short, the % operator provides a compact way to code multiple string substitutions all at once, instead of building and concatenating parts individually.

To format strings:

1.    On the left of the % operator, provide a format string containing one or more embedded conversion targets, each of which starts with a % (e.g., %d).

2.    On the right of the % operator, provide the object (or objects, embedded in a tuple) that you want Python to insert into the format string on the left in place of the conversion target (or targets).

For instance, in the formatting example we saw earlier in this chapter, the integer 1 replaces the %d in the format string on the left, and the string 'dead' replaces the %s. The result is a new string that reflects these two substitutions, which may be printed or saved for use in other roles:

>>> 'That is %d %s bird!' % (1, 'dead')             # Format expression

That is 1 dead bird!

Technically speaking, string formatting expressions are usually optional—you can generally do similar work with multiple concatenations and conversions. However, formatting allows us to combine many steps into a single operation. It’s powerful enough to warrant a few more examples:

>>> exclamation = 'Ni'

>>> 'The knights who say %s!' % exclamation         # String substitution

'The knights who say Ni!'

>>> '%d %s %g you' % (1, 'spam', 4.0)               # Type-specific substitutions

'1 spam 4 you'

>>> '%s -- %s -- %s' % (42, 3.14159, [1, 2, 3])     # All types match a %s target

'42 -- 3.14159 -- [1, 2, 3]'

The first example here plugs the string 'Ni' into the target on the left, replacing the %s marker. In the second example, three values are inserted into the target string. Note that when you’re inserting more than one value, you need to group the values on the right in parentheses (i.e., put them in a tuple). The % formatting expression operator expects either a single item or a tuple of one or more items on its right side.

The third example again inserts three values—an integer, a floating-point object, and a list object—but notice that all of the targets on the left are %s, which stands for conversion to string. As every type of object can be converted to a string (the one used when printing), every object type works with the %s conversion code. Because of this, unless you will be doing some special formatting, %s is often the only code you need to remember for the formatting expression.

Again, keep in mind that formatting always makes a new string, rather than changing the string on the left; because strings are immutable, it must work this way. As before, assign the result to a variable name if you need to retain it.

Advanced Formatting Expression Syntax

For more advanced type-specific formatting, you can use any of the conversion type codes listed in Table 7-4 in formatting expressions; they appear after the % character in substitution targets. C programmers will recognize most of these because Python string formatting supports all the usual C printf format codes (but returns the result, instead of displaying it, like printf). Some of the format codes in the table provide alternative ways to format the same type; for instance, %e, %f, and %g provide alternative ways to format floating-point numbers.

Table 7-4. String formatting type codes

Code

Meaning

s

String (or any object’s str(X) string)

r

Same as s, but uses repr, not str

c

Character (int or str)

d

Decimal (base-10 integer)

i

Integer

u

Same as d (obsolete: no longer unsigned)

o

Octal integer (base 8)

x

Hex integer (base 16)

X

Same as x, but with uppercase letters

e

Floating point with exponent, lowercase

E

Same as e, but uses uppercase letters

f

Floating-point decimal

F

Same as f, but uses uppercase letters

g

Floating-point e or f

G

Floating-point E or F

%

Literal % (coded as %%)

In fact, conversion targets in the format string on the expression’s left side support a variety of conversion operations with a fairly sophisticated syntax all their own. The general structure of conversion targets looks like this:

%[(keyname)][flags][width][.precision]typecode

The type code characters in the first column of Table 7-4 show up at the end of this target string’s format. Between the % and the type code character, you can do any of the following:

§  Provide a key name for indexing the dictionary used on the right side of the expression

§  List flags that specify things like left justification (−), numeric sign (+), a blank before positive numbers and a – for negatives (a space), and zero fills (0)

§  Give a total minimum field width for the substituted text

§  Set the number of digits (precision) to display after a decimal point for floating-point numbers

Both the width and precision parts can also be coded as a * to specify that they should take their values from the next item in the input values on the expression’s right side (useful when this isn’t known until runtime). And if you don’t need any of these extra tools, a simple %s in the format string will be replaced by the corresponding value’s default print string, regardless of its type.

Advanced Formatting Expression Examples

Formatting target syntax is documented in full in the Python standard manuals and reference texts, but to demonstrate common usage, let’s look at a few examples. This one formats integers by default, and then in a six-character field with left justification and zero padding:

>>> x = 1234

>>> res = 'integers: ...%d...%−6d...%06d' % (x, x, x)

>>> res

'integers: ...1234...1234  ...001234'

The %e, %f, and %g formats display floating-point numbers in different ways, as the following interaction demonstrates—%E is the same as %e but the exponent is uppercase, and g chooses formats by number content (it’s formally defined to use exponential format e if the exponent is less than −4 or not less than precision, and decimal format f otherwise, with a default total digits precision of 6):

>>> x = 1.23456789

>>> x                                     # Shows more digits before 2.7 and 3.1

1.23456789

>>> '%e | %f | %g' % (x, x, x)

'1.234568e+00 | 1.234568 | 1.23457'

>>> '%E' % x

'1.234568E+00'

For floating-point numbers, you can achieve a variety of additional formatting effects by specifying left justification, zero padding, numeric signs, total field width, and digits after the decimal point. For simpler tasks, you might get by with simply converting to strings with a %s format expression or the str built-in function shown earlier:

>>> '%−6.2f | %05.2f | %+06.1f' % (x, x, x)

'1.23   | 01.23 | +001.2'

>>> '%s' % x, str(x)

('1.23456789', '1.23456789')

When sizes are not known until runtime, you can use a computed width and precision by specifying them with a * in the format string to force their values to be taken from the next item in the inputs to the right of the % operator—the 4 in the tuple here gives precision:

>>> '%f, %.2f, %.*f' % (1/3.0, 1/3.0, 4, 1/3.0)

'0.333333, 0.33, 0.3333'

If you’re interested in this feature, experiment with some of these examples and operations on your own for more insight.

Dictionary-Based Formatting Expressions

As a more advanced extension, string formatting also allows conversion targets on the left to refer to the keys in a dictionary coded on the right and fetch the corresponding values. This opens the door to using formatting as a sort of template tool. We’ve only met dictionaries briefly thus far in Chapter 4, but here’s an example that demonstrates the basics:

>>> '%(qty)d more %(food)s' % {'qty': 1, 'food': 'spam'}

'1 more spam'

Here, the (qty) and (food) in the format string on the left refer to keys in the dictionary literal on the right and fetch their associated values. Programs that generate text such as HTML or XML often use this technique—you can build up a dictionary of values and substitute them all at once with a single formatting expression that uses key-based references (notice the first comment is above the triple quote so it’s not added to the string, and I’m typing this in IDLE without a “...” prompt for continuation lines):

>>>                                           # Template with substitution targets

>>> reply = """

Greetings...

Hello %(name)s!

Your age is %(age)s

"""

>>> values = {'name': 'Bob', 'age': 40}       # Build up values to substitute

>>> print(reply % values)                     # Perform substitutions

Greetings...

Hello Bob!

Your age is 40

This trick is also used in conjunction with the vars built-in function, which returns a dictionary containing all the variables that exist in the place it is called:

>>> food = 'spam'

>>> qty = 10

>>> vars()

{'food': 'spam', 'qty': 10, ...plus built-in names set by Python... }

When used on the right side of a format operation, this allows the format string to refer to variables by name—as dictionary keys:

>>> '%(qty)d more %(food)s' % vars()          # Variables are keys in vars()

'10 more spam'

We’ll study dictionaries in more depth in Chapter 8. See also Chapter 5 for examples that convert to hexadecimal and octal number strings with the %x and %o formatting expression target codes, which we won’t repeat here. Additional formatting expression examples also appear ahead as comparisons to the formatting method—this chapter’s next and final string topic.

String Formatting Method Calls

As mentioned earlier, Python 2.6 and 3.0 introduced a new way to format strings that is seen by some as a bit more Python-specific. Unlike formatting expressions, formatting method calls are not closely based upon the C language’s “printf” model, and are sometimes more explicit in intent. On the other hand, the new technique still relies on core “printf” concepts, such as type codes and formatting specifications. Moreover, it largely overlaps with—and sometimes requires a bit more code than—formatting expressions, and in practice can be just as complex in many roles. Because of this, there is no best-use recommendation between expressions and method calls today, and most programmers would be well served by a cursory understanding of both schemes. Luckily, the two are similar enough that many core concepts overlap.

Formatting Method Basics

The string object’s format method, available in Python 2.6, 2.7, and 3.X, is based on normal function call syntax, instead of an expression. Specifically, it uses the subject string as a template, and takes any number of arguments that represent values to be substituted according to the template.

Its use requires knowledge of functions and calls, but is mostly straightforward. Within the subject string, curly braces designate substitution targets and arguments to be inserted either by position (e.g., {1}), or keyword (e.g., {food}), or relative position in 2.7, 3.1, and later ({}). As we’ll learn when we study argument passing in depth in Chapter 18, arguments to functions and methods may be passed by position or keyword name, and Python’s ability to collect arbitrarily many positional and keyword arguments allows for such general method call patterns. For example:

>>> template = '{0}, {1} and {2}'                             # By position

>>> template.format('spam', 'ham', 'eggs')

'spam, ham and eggs'

>>> template = '{motto}, {pork} and {food}'                   # By keyword

>>> template.format(motto='spam', pork='ham', food='eggs')

'spam, ham and eggs'

>>> template = '{motto}, {0} and {food}'                      # By both

>>> template.format('ham', motto='spam', food='eggs')

'spam, ham and eggs'

>>> template = '{}, {} and {}'                                # By relative position

>>> template.format('spam', 'ham', 'eggs')                    # New in 3.1 and 2.7

'spam, ham and eggs'

By comparison, the last section’s formatting expression can be a bit more concise, but uses dictionaries instead of keyword arguments, and doesn’t allow quite as much flexibility for value sources (which may be an asset or liability, depending on your perspective); more on how the two techniques compare ahead:

>>> template = '%s, %s and %s'                                # Same via expression

>>> template % ('spam', 'ham', 'eggs')

'spam, ham and eggs'

>>> template = '%(motto)s, %(pork)s and %(food)s'

>>> template % dict(motto='spam', pork='ham', food='eggs')

'spam, ham and eggs'

Note the use of dict() to make a dictionary from keyword arguments here, introduced in Chapter 4 and covered in full in Chapter 8; it’s an often less-cluttered alternative to the {...} literal. Naturally, the subject string in the format method call can also be a literal that creates a temporary string, and arbitrary object types can be substituted at targets much like the expression’s %s code:

>>> '{motto}, {0} and {food}'.format(42, motto=3.14, food=[1, 2])

'3.14, 42 and [1, 2]'

Just as with the % expression and other string methods, format creates and returns a new string object, which can be printed immediately or saved for further work (recall that strings are immutable, so format really must make a new object). String formatting is not just for display:

>>> X = '{motto}, {0} and {food}'.format(42, motto=3.14, food=[1, 2])

>>> X

'3.14, 42 and [1, 2]'

>>> X.split(' and ')

['3.14, 42', '[1, 2]']

>>> Y = X.replace('and', 'but under no circumstances')

>>> Y

'3.14, 42 but under no circumstances [1, 2]'

Adding Keys, Attributes, and Offsets

Like % formatting expressions, format calls can become more complex to support more advanced usage. For instance, format strings can name object attributes and dictionary keys—as in normal Python syntax, square brackets name dictionary keys and dots denote object attributes of an item referenced by position or keyword. The first of the following examples indexes a dictionary on the key “spam” and then fetches the attribute “platform” from the already imported sys module object. The second does the same, but names the objects by keyword instead of position:

>>> import sys

>>> 'My {1[kind]} runs {0.platform}'.format(sys, {'kind': 'laptop'})

'My laptop runs win32'

>>> 'My {map[kind]} runs {sys.platform}'.format(sys=sys, map={'kind': 'laptop'})

'My laptop runs win32'

Square brackets in format strings can name list (and other sequence) offsets to perform indexing, too, but only single positive offsets work syntactically within format strings, so this feature is not as general as you might think. As with % expressions, to name negative offsets or slices, or to use arbitrary expression results in general, you must run expressions outside the format string itself (note the use of *parts here to unpack a tuple’s items into individual function arguments, as we did in Chapter 5 when studying fractions; more on this form in Chapter 18):

>>> somelist = list('SPAM')

>>> somelist

['S', 'P', 'A', 'M']

>>> 'first={0[0]}, third={0[2]}'.format(somelist)

'first=S, third=A'

>>> 'first={0}, last={1}'.format(somelist[0], somelist[-1])   # [-1] fails in fmt

'first=S, last=M'

>>> parts = somelist[0], somelist[-1], somelist[1:3]          # [1:3] fails in fmt

>>> 'first={0}, last={1}, middle={2}'.format(*parts)          # Or '{}' in 2.7/3.1+

"first=S, last=M, middle=['P', 'A']"

Advanced Formatting Method Syntax

Another similarity with % expressions is that you can achieve more specific layouts by adding extra syntax in the format string. For the formatting method, we use a colon after the possibly empty substitution target’s identification, followed by a format specifier that can name the field size, justification, and a specific type code. Here’s the formal structure of what can appear as a substitution target in a format string—its four parts are all optional, and must appear without intervening spaces:

{fieldname component !conversionflag :formatspec}

In this substitution target syntax:

§  fieldname is an optional number or keyword identifying an argument, which may be omitted to use relative argument numbering in 2.7, 3.1, and later.

§  component is a string of zero or more “.name” or “[index]” references used to fetch attributes and indexed values of the argument, which may be omitted to use the whole argument value.

§  conversionflag starts with a ! if present, which is followed by r, s, or a to call repr, str, or ascii built-in functions on the value, respectively.

§  formatspec starts with a : if present, which is followed by text that specifies how the value should be presented, including details such as field width, alignment, padding, decimal precision, and so on, and ends with an optional data type code.

The formatspec component after the colon character has a rich format all its own, and is formally described as follows (brackets denote optional components and are not coded literally):

[[fill]align][sign][#][0][width][,][.precision][typecode]

In this, fill can be any fill character other than { or }; align may be <, >, =, or ^, for left alignment, right alignment, padding after a sign character, or centered alignment, respectively; sign may be +, −, or space; and the , (comma) option requests a comma for a thousands separator as of Python 2.7 and 3.1. width and precision are much as in the % expression, and the formatspec may also contain nested {} format strings with field names only, to take values from the arguments list dynamically (much like the * in formatting expressions).

The method’s typecode options almost completely overlap with those used in % expressions and listed previously in Table 7-4, but the format method also allows a b type code used to display integers in binary format (it’s equivalent to using the bin built-in call), allows a % type code to display percentages, and uses only d for base-10 integers (i or u are not used here). Note that unlike the expression’s %s, the s type code here requires a string object argument; omit the type code to accept any type generically.

See Python’s library manual for more on substitution syntax that we’ll omit here. In addition to the string’s format method, a single object may also be formatted with the format(objectformatspec) built-in function (which the method uses internally), and may be customized in user-defined classes with the __format__ operator-overloading method (see Part VI).

Advanced Formatting Method Examples

As you can tell, the syntax can be complex in formatting methods. Because your best ally in such cases is often the interactive prompt here, let’s turn to some examples. In the following, {0:10} means the first positional argument in a field 10 characters wide, {1:<10} means the second positional argument left-justified in a 10-character-wide field, and {0.platform:>10} means the platform attribute of the first argument right-justified in a 10-character-wide field (note again the use of dict() to make a dictionary from keyword arguments, covered in Chapter 4 andChapter 8):

>>> '{0:10} = {1:10}'.format('spam', 123.4567)         # In Python 3.3

'spam       =   123.4567'

>>> '{0:>10} = {1:<10}'.format('spam', 123.4567)

'      spam = 123.4567  '

>>> '{0.platform:>10} = {1[kind]:<10}'.format(sys, dict(kind='laptop'))

'     win32 = laptop    '

In all cases, you can omit the argument number as of Python 2.7 and 3.1 if you’re selecting them from left to right with relative autonumbering—though this makes your code less explicit, thereby negating one of the reported advantages of the formatting method over the formatting expression (see the related note ahead):

>>> '{:10} = {:10}'.format('spam', 123.4567)

'spam       =   123.4567'

>>> '{:>10} = {:<10}'.format('spam', 123.4567)

'      spam = 123.4567  '

>>> '{.platform:>10} = {[kind]:<10}'.format(sys, dict(kind='laptop'))

'     win32 = laptop    '

Floating-point numbers support the same type codes and formatting specificity in formatting method calls as in % expressions. For instance, in the following {2:g} means the third argument formatted by default according to the “g” floating-point representation, {1:.2f} designates the “f” floating-point format with just two decimal digits, and {2:06.2f} adds a field with a width of six characters and zero padding on the left:

>>> '{0:e}, {1:.3e}, {2:g}'.format(3.14159, 3.14159, 3.14159)

'3.141590e+00, 3.142e+00, 3.14159'

>>> '{0:f}, {1:.2f}, {2:06.2f}'.format(3.14159, 3.14159, 3.14159)

'3.141590, 3.14, 003.14'

Hex, octal, and binary formats are supported by the format method as well. In fact, string formatting is an alternative to some of the built-in functions that format integers to a given base:

>>> '{0:X}, {1:o}, {2:b}'.format(255, 255, 255)      # Hex, octal, binary

'FF, 377, 11111111'

>>> bin(255), int('11111111', 2), 0b11111111         # Other to/from binary

('0b11111111', 255, 255)

>>> hex(255), int('FF', 16), 0xFF                    # Other to/from hex

('0xff', 255, 255)

>>> oct(255), int('377', 8), 0o377                   # Other to/from octal, in 3.X

('0o377', 255, 255)                                  # 2.X prints and accepts 0377

Formatting parameters can either be hardcoded in format strings or taken from the arguments list dynamically by nested format syntax, much like the * syntax in formatting expressions’ width and precision:

>>> '{0:.2f}'.format(1 / 3.0)                        # Parameters hardcoded

'0.33'

>>> '%.2f' % (1 / 3.0)                               # Ditto for expression

'0.33'

>>> '{0:.{1}f}'.format(1 / 3.0, 4)                   # Take value from arguments

'0.3333'

>>> '%.*f' % (4, 1 / 3.0)                            # Ditto for expression

'0.3333'

Finally, Python 2.6 and 3.0 also introduced a new built-in format function, which can be used to format a single item. It’s a more concise alternative to the string format method, and is roughly similar to formatting a single item with the % formatting expression:

>>> '{0:.2f}'.format(1.2345)                         # String method

'1.23'

>>> format(1.2345, '.2f')                            # Built-in function

'1.23'

>>> '%.2f' % 1.2345                                  # Expression

'1.23'

Technically, the format built-in runs the subject object’s __format__ method, which the str.format method does internally for each formatted item. It’s still more verbose than the original % expression’s equivalent here, though—which leads us to the next section.

Comparison to the % Formatting Expression

If you study the prior sections closely, you’ll probably notice that at least for positional references and dictionary keys, the string format method looks very much like the % formatting expression, especially in advanced use with type codes and extra formatting syntax. In fact, in common use cases formatting expressions may be easier to code than formatting method calls, especially when you’re using the generic %s print-string substitution target, and even with autonumbering of fields added in 2.7 and 3.1:

print('%s=%s' % ('spam', 42))            # Format expression: in all 2.X/3.X

print('{0}={1}'.format('spam', 42))      # Format method: in 3.0+ and 2.6+

print('{}={}'.format('spam', 42))        # With autonumbering: in 3.1+ and 2.7

As we’ll see in a moment, more complex formatting tends to be a draw in terms of complexity (difficult tasks are generally difficult, regardless of approach), and some see the formatting method as redundant given the pervasiveness of the expression.

On the other hand, the formatting method also offers a few potential advantages. For example, the original % expression can’t handle keywords, attribute references, and binary type codes, although dictionary key references in % format strings can often achieve similar goals. To see how the two techniques overlap, compare the following % expressions to the equivalent format method calls shown earlier:

>>> '%s, %s and %s' % (3.14, 42, [1, 2])                      # Arbitrary types

'3.14, 42 and [1, 2]'

>>> 'My %(kind)s runs %(platform)s' % {'kind': 'laptop', 'platform': sys.platform}

'My laptop runs win32'

>>> 'My %(kind)s runs %(platform)s' % dict(kind='laptop', platform=sys.platform)

'My laptop runs win32'

>>> somelist = list('SPAM')

>>> parts = somelist[0], somelist[-1], somelist[1:3]

>>> 'first=%s, last=%s, middle=%s' % parts

"first=S, last=M, middle=['P', 'A']"

When more complex formatting is applied the two techniques approach parity in terms of complexity, although if you compare the following with the format method call equivalents listed earlier you’ll again find that the % expressions tend to be a bit simpler and more concise; in Python 3.3:

# Adding specific formatting

>>> '%-10s = %10s' % ('spam', 123.4567)

'spam       =   123.4567'

>>> '%10s = %-10s' % ('spam', 123.4567)

'      spam = 123.4567  '

>>> '%(plat)10s = %(kind)-10s' % dict(plat=sys.platform, kind='laptop')

'     win32 = laptop    '

# Floating-point numbers

>>> '%e, %.3e, %g' % (3.14159, 3.14159, 3.14159)

'3.141590e+00, 3.142e+00, 3.14159'

>>> '%f, %.2f, %06.2f' % (3.14159, 3.14159, 3.14159)

'3.141590, 3.14, 003.14'

# Hex and octal, but not binary (see ahead)

>>> '%x, %o' % (255, 255)

'ff, 377'

The format method has a handful of advanced features that the % expression does not, but even more involved formatting still seems to be essentially a draw in terms of complexity. For instance, the following shows the same result generated with both techniques, with field sizes and justifications and various argument reference methods:

# Hardcoded references in both

>>> import sys

>>> 'My {1[kind]:<8} runs {0.platform:>8}'.format(sys, {'kind': 'laptop'})

'My laptop   runs    win32'

>>> 'My %(kind)-8s runs %(plat)8s' % dict(kind='laptop', plat=sys.platform)

'My laptop   runs    win32'

In practice, programs are less likely to hardcode references like this than to execute code that builds up a set of substitution data ahead of time (for instance, to collect input form or database data to substitute into an HTML template all at once). When we account for common practice in examples like this, the comparison between the format method and the % expression is even more direct:

# Building data ahead of time in both

>>> data = dict(platform=sys.platform, kind='laptop')

>>> 'My {kind:<8} runs {platform:>8}'.format(**data)

'My laptop   runs    win32'

>>> 'My %(kind)-8s runs %(platform)8s' % data

'My laptop   runs    win32'

As we’ll see in Chapter 18, the **data in the method call here is special syntax that unpacks a dictionary of keys and values into individual “name=value” keyword arguments so they can be referenced by name in the format string—another unavoidable far conceptual forward reference to function call tools, which may be another downside of the format method in general, especially for newcomers.

As usual, though, the Python community will have to decide whether % expressions, format method calls, or a toolset with both techniques proves better over time. Experiment with these techniques on your own to get a feel for what they offer, and be sure to see the library reference manuals for Python 2.6, 3.0, and later for more details.

NOTE

String format method enhancements in Python 3.1 and 2.7: Python 3.1 and 2.7 added a thousand-separator syntax for numbers, which inserts commas between three-digit groups. To make this work, add a comma before the type code, and between the width and precision if present, as follows:

>>> '{0:d}'.format(999999999999)

'999999999999'

>>> '{0:,d}'.format(999999999999)

'999,999,999,999'

These Pythons also assign relative numbers to substitution targets automatically if they are not included explicitly, though using this extension doesn’t apply in all use cases, and may negate one of the main benefits of the formatting method—its more explicit code:

>>> '{:,d}'.format(999999999999)

'999,999,999,999'

>>> '{:,d} {:,d}'.format(9999999, 8888888)

'9,999,999 8,888,888'

>>> '{:,.2f}'.format(296999.2567)

'296,999.26'

See the 3.1 release notes for more details. See also the formats.py comma-insertion and money-formatting function examples in Chapter 25 for a simple manual solution that can be imported and used prior to Python 3.1 and 2.7. As typical in programming, it’s straightforward to implement new functionality in a callable, reusable, and customizable function of your own, rather than relying on a fixed set of built-in tools:

>>> from formats import commas, money

>>> '%s' % commas(999999999999)

'999,999,999,999'

>>> '%s %s' % (commas(9999999), commas(8888888))

'9,999,999 8,888,888'

>>> '%s' % money(296999.2567)

'$296,999.26'

And as usual, a simple function like this can be applied in more advanced contexts too, such as the iteration tools we met in Chapter 4 and will study fully in later chapters:

>>> [commas(x) for x in (9999999, 8888888)]

['9,999,999', '8,888,888']

>>> '%s %s' % tuple(commas(x) for x in (9999999, 8888888))

'9,999,999 8,888,888'

>>> ''.join(commas(x) for x in (9999999, 8888888))

'9,999,9998,888,888'

For better or worse, Python developers often seem to prefer adding special-case built-in tools over general development techniques—a tradeoff explored in the next section.

Why the Format Method?

Now that I’ve gone to such lengths to compare and contrast the two formatting techniques, I wish to also explain why you still might want to consider using the format method variant at times. In short, although the formatting method can sometimes require more code, it also:

§  Has a handful of extra features not found in the % expression itself (though % can use alternatives)

§  Has more flexible value reference syntax (though it may be overkill, and % often has equivalents)

§  Can make substitution value references more explicit (though this is now optional)

§  Trades an operator for a more mnemonic method name (though this is also more verbose)

§  Does not allow different syntax for single and multiple values (though practice suggests this is trivial)

§  As a function can be used in places an expression cannot (though a one-line function renders this moot)

Although both techniques are available today and the formatting expression is still widely used, the format method might eventually grow in popularity and may receive more attention from Python developers in the future. Further, with both the expression and method in the language,either may appear in code you will encounter so it behooves you to understand both. But because the choice is currently still yours to make in new code, let’s briefly expand on the tradeoffs before closing the book on this topic.

Extra features: Special-case “batteries” versus general techniques

The method call supports a few extras that the expression does not, such as binary type codes and (as of Python 2.7 and 3.1) thousands groupings. As we’ve seen, though, the formatting expression can usually achieve the same effects in other ways. Here’s the case for binary formatting:

>>> '{0:b}'.format((2 ** 16) − 1)        # Expression (only) binary format code

'1111111111111111'

>>> '%b' % ((2 ** 16) − 1)

ValueError: unsupported format character 'b'...

>>> bin((2 ** 16) − 1)                   # But other more general options work too

'0b1111111111111111'

>>> '%s' % bin((2 ** 16) - 1)            # Usable with both method and % expression

'0b1111111111111111'

>>> '{}'.format(bin((2 ** 16) - 1))      # With 2.7/3.1+ relative numbering

'0b1111111111111111'

>>> '%s' % bin((2 ** 16) - 1)[2:]        # Slice off 0b to get exact equivalent

'1111111111111111'

The preceding note showed that general functions could similarly stand in for the format method’s thousands groupings option, and more fully support customization. In this case, a simple 8-line reusable function buys us the same utility without extra special-case syntax:

>>> '{:,d}'.format(999999999999)         # New str.format method feature in 3.1/2.7

'999,999,999,999'

>>> '%s' % commas(999999999999)          # But % is same with simple 8-line function

'999,999,999,999'

See the prior note for more comma comparisons. This is essentially the same as the preceding bin case for binary formatting, but the commas function here is user-defined, not built in. As such, this technique is far more general purpose than precoded tools or special syntax added for a single purpose.

This case also seems indicative, perhaps, of a trend in Python (and scripting language in general) toward relying more on special-case “batteries included” tools than on general development techniques—a mindset that makes code dependent on those batteries, and seems difficult to justify unless one views software development as an end-user enterprise. To some, programmers might be better served learning how to code an algorithm to insert commas than be provided a tool that does.

We’ll leave that philosophical debate aside here, but in practical terms the net effect of the trend in this case is extra syntax for you to have to both learn and remember. Given their alternatives, it’s not clear that these extra features of the methods by themselves are compelling enough to be decisive.

Flexible reference syntax: Extra complexity and functional overlap

The method call also supports key and attribute references directly, which some may see as more flexible. But as we saw in earlier examples comparing dictionary-based formatting in the % expression to key and attribute references in the format method, the two are usually too similar to warrant a preference on these grounds. For instance, both can reference the same value multiple times:

>>> '{name} {job} {name}'.format(name='Bob', job='dev')

'Bob dev Bob'

>>> '%(name)s %(job)s %(name)s' % dict(name='Bob', job='dev')

'Bob dev Bob'

Especially in common practice, though, the expression seems just as simple, or simpler:

>>> D = dict(name='Bob', job='dev')

>>> '{0[name]} {0[job]} {0[name]}'.format(D)      # Method, key references

'Bob dev Bob'

>>> '{name} {job} {name}'.format(**D)             # Method, dict-to-args

'Bob dev Bob'

>>> '%(name)s %(job)s %(name)s' % D               # Expression, key references

'Bob dev Bob'

To be fair, the method has even more specialized substitution syntax, and other comparisons might favor either scheme in small ways. But given the overlap and extra complexity, one could argue that the format method’s utility seems either a wash, or features in search of use cases. At the least, the added conceptual burden on Python programmers who may now need to know both tools doesn’t seem clearly justified.

Explicit value references: Now optional and unlikely to be used

One use case where the format method is at least debatably clearer is when there are many values to be substituted into the format string. The lister.py classes example we’ll meet in Chapter 31, for example, substitutes six items into a single string, and in this case the method’s {i} position labels seem marginally easier to read than the expression’s %s:

'\n%s<Class %s, address %s:\n%s%s%s>\n' % (...)               # Expression

'\n{0}<Class {1}, address {2}:\n{3}{4}{5}>\n'.format(...)     # Method

On the other hand, using dictionary keys in % expressions can mitigate much of this difference. This is also something of a worst-case scenario for formatting complexity, and not very common in practice; more typical use cases seem more of a tossup. Further, as of Python 3.1 and 2.7, numbering substitution targets becomes optional when relative to position, potentially subverting this purported benefit altogether:

>>> 'The {0} side {1} {2}'.format('bright', 'of', 'life')     # Python 3.X, 2.6+

'The bright side of life'

>>> 'The {} side {} {}'.format('bright', 'of', 'life')        # Python 3.1+, 2.7+

'The bright side of life'

>>> 'The %s side %s %s' % ('bright', 'of', 'life')            # All Pythons

'The bright side of life'

Given its conciseness, the second of these is likely to be preferred to the first, but seems to negate part of the method’s advantage. Compare the effect on floating-point formatting, for example—the formatting expression is still more concise, and still seems less cluttered:

>>> '{0:f}, {1:.2f}, {2:05.2f}'.format(3.14159, 3.14159, 3.14159)

'3.141590, 3.14, 03.14'

>>> '{:f}, {:.2f}, {:06.2f}'.format(3.14159, 3.14159, 3.14159)

'3.141590, 3.14, 003.14'

>>> '%f, %.2f, %06.2f' % (3.14159, 3.14159, 3.14159)

'3.141590, 3.14, 003.14'

Named method and context-neutral arguments: Aesthetics versus practice

The formatting method also claims an advantage in replacing the % operator with a more mnemonic format method name, and not distinguishing between single and multiple substitution values. The former may make the method appear simpler to beginners at first glance (“format” may be easier to parse than multiple “%” characters), though this probably varies per reader and seems minor.

Some may see the latter difference as more significant—with the format expression, a single value can be given by itself, but multiple values must be enclosed in a tuple:

>>> '%.2f' % 1.2345                       # Single value

'1.23'

>>> '%.2f %s' % (1.2345, 99)              # Multiple values tuple

'1.23 99'

Technically, the formatting expression accepts either a single substitution value, or a tuple of one or more items. As a consequence, because a single item can be given either by itself or within a tuple, a tuple to be formatted must be provided as a nested tuple—a perhaps rare but plausible case:

>>> '%s' % 1.23                           # Single value, by itself

'1.23'

>>> '%s' % (1.23,)                        # Single value, in a tuple

'1.23'

>>> '%s' % ((1.23,),)                     # Single value that is a tuple

'(1.23,)'

The formatting method, on the other hand, tightens this up by accepting only general function arguments in both cases, instead of requiring a tuple both for multiple values or a single value that is a tuple:

>>> '{0:.2f}'.format(1.2345)              # Single value

'1.23'

>>> '{0:.2f} {1}'.format(1.2345, 99)      # Multiple values

'1.23 99'

>>> '{0}'.format(1.23)                    # Single value, by itself

'1.23'

>>> '{0}'.format((1.23,))                 # Single value that is a tuple

'(1.23,)'

Consequently, the method might be less confusing to beginners and cause fewer programming mistakes. This seems a fairly minor issue, though—if you always enclose values in a tuple and ignore the nontupled option, the expression is essentially the same as the method call here. Moreover, the method incurs a price in inflated code size to achieve its constrained usage mode. Given the expression’s wide use over Python’s history, this issue may be more theoretical than practical, and may not justify porting existing code to a new tool that is so similar to that it seeks to subsume.

Functions versus expressions: A minor convenience

The final rationale for the format method—it’s a function that can appear where an expression cannot—requires more information about functions than we yet have at this point in the book, so we won’t dwell on it here. Suffice it to say that both the str.format method and the formatbuilt-in function can be passed to other functions, stored in other objects, and so on. An expression like % cannot directly, but this may be narrow-sighted—it’s trivial to wrap any expression in a one-line def or partial-line lambda once to turn it into a function with the same properties (though finding a reason to do so may be more challenging):

def myformat(fmt, args): return fmt % args       # See Part IV

myformat('%s %s', (88, 99))                      # Call your function object

str.format('{} {}', 88, 99)                      # Versus calling the built-in

otherfunction(myformat)                          # Your function is an object too

In the end, this may not be an either/or choice. While the expression still seems more pervasive in Python code, both formatting expressions and methods are available for use in Python today, and most programmers will benefit from being familiar with both techniques for years to come. That may double the work of newcomers to the language in this department, but in this bazaar of ideas we call the open source software world, there always seems to be room for more.[17]

NOTE

Plus one more: Technically speaking, there are 3 (not 2) formatting tools built into Python, if we include the obscure string module’s Template tool mentioned earlier. Now that we’ve seen the other two, I can show you how it compares. The expression and method can be used as templating tools too, referring to substitution values by name via dictionary keys or keyword arguments:

>>> '%(num)i = %(title)s' % dict(num=7, title='Strings')

'7 = Strings'

>>> '{num:d} = {title:s}'.format(num=7, title='Strings')

'7 = Strings'

>>> '{num} = {title}'.format(**dict(num=7, title='Strings'))

'7 = Strings'

The module’s templating system allows values to be referenced by name too, prefixed by a $, as either dictionary keys or keywords, but does not support all the utilities of the other two methods—a limitation that yields simplicity, the prime motivation for this tool:

>>> import string

>>> t = string.Template('$num = $title')

>>> t.substitute({'num': 7, 'title': 'Strings'})

'7 = Strings'

>>> t.substitute(num=7, title='Strings')

'7 = Strings'

>>> t.substitute(dict(num=7, title='Strings'))

'7 = Strings'

See Python’s manuals for more details. It’s possible that you may see this alternative (as well as additional tools in the third-party domain) in Python code too; thankfully this technique is simple, and is used rarely enough to warrant its limited coverage here. The best bet for most newcomers today is to learn and use %, str.format, or both.


[17] See also the Chapter 31 note about a str.format bug (or regression) in Pythons 3.2 and 3.3 concerning generic empty substitution targets for object attributes that define no __format__ handler. This impacted a working example from this book’s prior edition. While it may be a temporary regression, it does at the least underscore that this method is still a bit of a moving target—yet another reason to question the feature redundancy it implies.

General Type Categories

Now that we’ve explored the first of Python’s collection objects, the string, let’s close this chapter by defining a few general type concepts that will apply to most of the types we look at from here on. With regard to built-in types, it turns out that operations work the same for all the types in the same category, so we’ll only need to define most of these ideas once. We’ve only examined numbers and strings so far, but because they are representative of two of the three major type categories in Python, you already know more about several other types than you might think.

Types Share Operation Sets by Categories

As you’ve learned, strings are immutable sequences: they cannot be changed in place (the immutable part), and they are positionally ordered collections that are accessed by offset (the sequence part). It so happens that all the sequences we’ll study in this part of the book respond to the same sequence operations shown in this chapter at work on strings—concatenation, indexing, iteration, and so on. More formally, there are three major type (and operation) categories in Python that have this generic nature:

Numbers (integer, floating-point, decimal, fraction, others)

Support addition, multiplication, etc.

Sequences (strings, lists, tuples)

Support indexing, slicing, concatenation, etc.

Mappings (dictionaries)

Support indexing by key, etc.

I’m including the Python 3.X byte strings and 2.X Unicode strings I mentioned at the start of this chapter under the general “strings” label here (see Chapter 37). Sets are something of a category unto themselves (they don’t map keys to values and are not positionally ordered sequences), and we haven’t yet explored mappings on our in-depth tour (we will in the next chapter). However, many of the other types we will encounter will be similar to numbers and strings. For example, for any sequence objects X and Y:

§  X + Y makes a new sequence object with the contents of both operands.

§  X * N makes a new sequence object with N copies of the sequence operand X.

In other words, these operations work the same way on any kind of sequence, including strings, lists, tuples, and some user-defined object types. The only difference is that the new result object you get back is of the same type as the operands X and Y—if you concatenate lists, you get back a new list, not a string. Indexing, slicing, and other sequence operations work the same on all sequences, too; the type of the objects being processed tells Python which flavor of the task to perform.

Mutable Types Can Be Changed in Place

The immutable classification is an important constraint to be aware of, yet it tends to trip up new users. If an object type is immutable, you cannot change its value in place; Python raises an error if you try. Instead, you must run code to make a new object containing the new value. The major core types in Python break down as follows:

Immutables (numbers, strings, tuples, frozensets)

None of the object types in the immutable category support in-place changes, though we can always run expressions to make new objects and assign their results to variables as needed.

Mutables (lists, dictionaries, sets, bytearray)

Conversely, the mutable types can always be changed in place with operations that do not create new objects. Although such objects can be copied, in-place changes support direct modification.

Generally, immutable types give some degree of integrity by guaranteeing that an object won’t be changed by another part of a program. For a refresher on why this matters, see the discussion of shared object references in Chapter 6. To see how lists, dictionaries, and tuples participate in type categories, we need to move ahead to the next chapter.

Chapter Summary

In this chapter, we took an in-depth tour of the string object type. We learned about coding string literals, and we explored string operations, including sequence expressions, string method calls, and string formatting with both expressions and method calls. Along the way, we studied a variety of concepts in depth, such as slicing, method call syntax, and triple-quoted block strings. We also defined some core ideas common to a variety of types: sequences, for example, share an entire set of operations.

In the next chapter, we’ll continue our types tour with a look at the most general object collections in Python—lists and dictionaries. As you’ll find, much of what you’ve learned here will apply to those types as well. And as mentioned earlier, in the final part of this book we’ll return to Python’s string model to flesh out the details of Unicode text and binary data, which are of interest to some, but not all, Python programmers. Before moving on, though, here’s another chapter quiz to review the material covered here.

Test Your Knowledge: Quiz

1.    Can the string find method be used to search a list?

2.    Can a string slice expression be used on a list?

3.    How would you convert a character to its ASCII integer code? How would you convert the other way, from an integer to a character?

4.    How might you go about changing a string in Python?

5.    Given a string S with the value "s,pa,m", name two ways to extract the two characters in the middle.

6.    How many characters are there in the string "a\nb\x1f\000d"?

7.    Why might you use the string module instead of string method calls?

Test Your Knowledge: Answers

1.    No, because methods are always type-specific; that is, they only work on a single data type. Expressions like X+Y and built-in functions like len(X) are generic, though, and may work on a variety of types. In this case, for instance, the in membership expression has a similar effect as the string find, but it can be used to search both strings and lists. In Python 3.X, there is some attempt to group methods by categories (for example, the mutable sequence types list and bytearray have similar method sets), but methods are still more type-specific than other operation sets.

2.    Yes. Unlike methods, expressions are generic and apply to many types. In this case, the slice expression is really a sequence operation—it works on any type of sequence object, including strings, lists, and tuples. The only difference is that when you slice a list, you get back a new list.

3.    The built-in ord(S) function converts from a one-character string to an integer character code; chr(I) converts from the integer code back to a string. Keep in mind, though, that these integers are only ASCII codes for text whose characters are drawn only from ASCII character set. In the Unicode model, text strings are really sequences of Unicode code point identifying integers, which may fall outside the 7-bit range of numbers reserved by ASCII (more on Unicode in Chapter 4 and Chapter 37).

4.    Strings cannot be changed; they are immutable. However, you can achieve a similar effect by creating a new string—by concatenating, slicing, running formatting expressions, or using a method call like replace—and then assigning the result back to the original variable name.

5.    You can slice the string using S[2:4], or split on the comma and index the string using S.split(',')[1]. Try these interactively to see for yourself.

6.    Six. The string "a\nb\x1f\000d" contains the characters a, newline (\n), b, binary 31 (a hex escape \x1f), binary 0 (an octal escape \000), and d. Pass the string to the built-in len function to verify this, and print each of its character’s ord results to see the actual code point (identifying number) values. See Table 7-2 for more details on escapes.

7.    You should never use the string module instead of string object method calls today—it’s deprecated, and its calls are removed completely in Python 3.X. The only valid reason for using the string module at all today is for its other tools, such as predefined constants. You might also see it appear in what is now very old and dusty Python code (and books of the misty past—like the 1990s).