MySQL Stored Procedure Programming (2009)
Part IV. Optimizing Stored Programs
Chapter 20. Basic SQL Tuning
In this chapter, we will tune simple SQL statements that may be included in MySQL stored programs. In particular, we'll optimize two of the most often executed SQL operations: retrieving data from a single table and joining two or more tables. Topics include:
§ How to determine when the use of an index is required to optimize a query
§ How to construct the best indexes to support specific queries
§ How MySQL chooses between available indexes, and how to direct MySQL to use a specific index if necessary
§ How to avoid "suppressing" an index
§ What to do when no index will suffice to optimize a query
§ How MySQL processes joins between multiple tables
§ How to create indexes that optimize table joins
§ How to determine the optimal join order and how to force MySQL to use a particular join order
Chapter 21 builds on these fundamentals, optimizing more complex SQL operations.
Examples in this chapter are based on tables created using the InnoDB storage engine. Although the same MySQL optimizer is used for all storage engines, you may observe different behaviors in other storage engines because of differences in optimizer statistics and indexing approaches.
Tuning Table Access
When retrieving data from a table, MySQL can follow one of two basic paths to locate the relevant rows:
§ Read every row in the table concerned (a full table scan), and return only those rows that match the WHERE clause criteria.
§ Use an index to find a subset of rows, and return the rows that match the WHERE clause criteria.
Unless we need to retrieve a substantial proportion of the rows from a table, we probably want to use an index. It should not come as a big surprise, therefore, that much of this section will address creating the best indexes for our queries.
Index Lookup Versus Full Table Scan
A common mistake made by those new to SQL tuning is to assume that it is always better to use an index to retrieve data. Typically, an index lookup requires three or four logical reads for each row returned. If we only have to traverse the index tree a few times, then that will be quicker than reading every row in that table. However, traversing the index tree for a large number of rows in the table could easily turn out to be more expensive than simply reading every row directly from the table.
For this reason, we generally want to use an index only when retrieving a small proportion of the rows in the table. The exact break-even point will depend on your data, your indexes, and perhaps even your server configuration, but we have found that a reasonable rule of thumb is to use an index when retrieving no more than 5–10% of the rows in a table.
To illustrate this point, consider a scenario in which we are trying to generate sales totals over a particular period of time. To get sales totals for the previous week, for example, we might execute a statement such as the following:
SELECT SUM(s.sale_value), COUNT(*)
  FROM sales s
 WHERE sale_date > date_sub(curdate(), INTERVAL 1 WEEK);
Since we have sales data for many years, we would guess that an index on sale_date would be effective in optimizing this query—and we would be right.
On the other hand, suppose that we want to get the sales totals for the preceding year. The query would look like this:
SELECT SUM(s.sale_value), COUNT(*)
  FROM sales s
 WHERE sale_date > date_sub(curdate(), INTERVAL 1 YEAR);
It is not immediately obvious that an index-driven retrieval would result in the best query performance; it depends on the number of years of data in the table and the relative volume of data for the preceding year. Luckily, MySQL will, in most situations, make a good determination in such cases, provided that you have given MySQL a good set of indexes with which to work.
The MySQL optimizer predicts when to use an index based on the percentage of data from the table it expects to retrieve given our WHERE clause. The optimizer chooses to use the index for small intervals, while relying on a full table scan for large intervals. This basic algorithm works well when the volume of data is evenly distributed for the different indexed values. However, if the data is not evenly distributed, or if the statistics on table sizing are inaccurate, then the MySQL optimizer may make a less than perfect decision.
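If you suspect that the optimizer's statistics have drifted from reality, you can refresh and inspect them yourself. A minimal sketch, using the sales table from our examples:

```sql
-- Rebuild the key distribution statistics the optimizer relies on
ANALYZE TABLE sales;

-- Inspect the per-index cardinality estimates; a cardinality close to the
-- table's row count indicates a highly selective index
SHOW INDEX FROM sales;
```

Running ANALYZE TABLE after a large load or purge gives the optimizer a more accurate picture of the data and can correct some poor index choices.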
Figure 20-1 shows the elapsed time for retrieving various proportions of rows when forcing an index scan or a full table scan, or when allowing the MySQL optimizer to make that decision. In this example, MySQL switched from an index scan to a full table scan when the rows returned represented approximately 7% of the total. However, in this case, the index outperformed the table scan until about 17% of the rows were retrieved. So although MySQL made the correct decision in most cases, there were a few cases where forcing an index lookup would have improved performance.
Figure 20-1. Full table scan versus indexed lookup
As a very rough rule of thumb, you should not expect an index to improve performance unless you are retrieving less than 5–15% of the table data.
There are a number of circumstances in which MySQL might not pick the best possible index. One of these circumstances is when the data is "skewed." In the preceding example, sales were fairly evenly distributed over a five-year period. However, in the real world this is unlikely to be true—sales will be greater during certain periods (Christmas, perhaps) and we might hope that sales would increase over time. This "skewed" table data can make it harder for the MySQL optimizer to make the best decision.
If you think that your data may be skewed and that MySQL may choose a table scan or index inappropriately, you can use the USE INDEX, FORCE INDEX, or IGNORE INDEX optimizer hints , as appropriate, to force or suppress the index. Take care to only use these hints when absolutely necessary, as they can also prevent the MySQL optimizer from selecting the best plan if used inappropriately. These hints are explained in more detail later, in the section "Manually Choosing an Index."
It's also worth noting that it is sometimes possible to resolve a query using an index alone—provided that the index contains all of the columns from the table that are referenced in both the SELECT and WHERE clauses. In this case, the index can be used in place of the table, and can perform very efficiently, even when retrieving a very large proportion (or all) of the rows in the table. See the section "Covering indexes " later in this chapter for more details.
How MySQL Chooses Between Indexes
In the above examples, MySQL switched between an index and a full table scan as the number of rows to be retrieved increased. This is a pretty neat trick—just how did MySQL work this out?
When you send a SQL statement to the MySQL server, MySQL has to parse the statement, which involves verifying that the SQL syntax is correct, ensuring that the user has the necessary authority to run the statement, and determining the exact nature of the data to be retrieved. As part of this process, MySQL determines if any of the indexes defined on the table would help optimize the query.
The MySQL optimizer has a general sense of the "selectivity" of an index—how many rows an average index lookup will return—and of the size of the table. The optimizer examines the index to work out how many rows will have to be used given the values in the WHERE clause and the range of values in the index. MySQL then calculates the relative overhead of using the index and compares this value to the overhead of scanning the full contents of the table.
For most queries, this simple but effective strategy allows MySQL to choose between a full table scan and an indexed lookup, or to choose between multiple candidate indexes.
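We can watch this decision being made by prefixing a query with EXPLAIN. A sketch, using the earlier sales query:

```sql
-- "type: range" together with an index name in the "key" column means an
-- index lookup was chosen; "type: ALL" means a full table scan
EXPLAIN
SELECT SUM(s.sale_value), COUNT(*)
  FROM sales s
 WHERE sale_date > date_sub(curdate(), INTERVAL 1 WEEK);
```

Try widening the date range and re-running the EXPLAIN: at some interval the plan should flip from an index range scan to a full table scan.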
Manually Choosing an Index
You can add hints to your SQL statement to influence how the optimizer chooses between the various indexing options. You should only do this if you have determined that MySQL is not making the optimal decision on index utilization. These hints appear after the table name within the FROM clause. The three hints are:
USE INDEX( list_of_indexes )
Tells MySQL to consider only the indexes listed (i.e., to ignore all other indexes)
IGNORE INDEX( list_of_indexes )
Tells MySQL to ignore any of the listed indexes when determining the execution plan
FORCE INDEX( list_of_indexes )
Tells MySQL to use one of the listed indexes even if it has determined that a full table scan would be more efficient
For instance, to force the use of an index named sales_i_date, we could write a query as follows:
SELECT SUM(s.sale_value), COUNT(*)
  FROM sales s FORCE INDEX(sales_i_date)
 WHERE sale_date > date_sub(curdate(), INTERVAL 1 WEEK);
Prefixed ("Partial") Indexes
MySQL allows you to create an index based on the first few characters of a column. For instance, the following statement creates an index based on the first four bytes of the customer's address:
CREATE INDEX i_cust_name_l4 on customers(address1(4));
Partial indexes generally use less storage than "full" indexes, and in some cases may actually improve performance, since a smaller index is more likely to fit into the MySQL memory cache. However, we encourage you to create partial indexes with great care. A very short partial index may actually be worse than no index at all. For very long columns, the partial index might be as good as the full index—it all depends on how many bytes you need to read to get an exact match on the column concerned.
For instance, consider searching for a customer by address, as follows:
WHERE address1 = '1000 EXCEPTIONABLE STREET';
There might be plenty of customers that have an address starting with '1000'. Many fewer will have an address starting with '1000 E', and by the time we extend the search to '1000 EX', we might be matching only a single customer. As we extend the length of the partial index, it becomes more "selective" and more likely to match the performance of a full index.
Figure 20-2 shows the results of doing the above search for various prefix lengths. For this data, prefix lengths of 1 or 2 are worse than no index at all; a length of 3 is slightly better than no index; while lengths greater than 3 are quite effective. Once the length hits 6, further increases in the length of the prefix do not increase the effectiveness of the index. Remember that the optimum length for your prefixed index depends entirely on the data item you are searching for—in this case, short prefixes did not work well because most addresses started with street numbers that were not very selective. For more selective data—surname, for instance—prefixed indexes could be much more effective.
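Before creating a prefixed index on your own data, you can estimate how selective each candidate prefix length will be. A sketch (the prefix lengths shown are arbitrary):

```sql
-- Compare the number of distinct values captured by each candidate prefix
-- length against the number of distinct full values; the shortest prefix
-- whose count approaches the full count is usually long enough
SELECT COUNT(DISTINCT LEFT(address1, 3)) AS len_3,
       COUNT(DISTINCT LEFT(address1, 6)) AS len_6,
       COUNT(DISTINCT LEFT(address1, 9)) AS len_9,
       COUNT(DISTINCT address1)          AS full_value
  FROM customers;
```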
Figure 20-2. Performance of "partial" indexes of various lengths
Concatenated Indexes
A concatenated index—often called a composite index—is an index that is created on multiple columns. For instance, if we frequently retrieve customers by name and date of birth, we might create an index as follows:
CREATE INDEX i_customers_first_surname_dob ON
    customers (contact_firstname, contact_surname, date_of_birth);
There is very little chance that two customers would have the same first name, surname, and date of birth, so use of this index would almost always take us to a single, correct customer. If you find that you frequently need to query against the same set of multiple columns' values on a table, then a concatenated index based on those columns should help you optimize your queries.
If a query references multiple columns from a single table in the WHERE clause, consider creating a concatenated (composite or multicolumn) index on those columns.
For instance, to optimize the following query, we should probably create a concatenated index on customer_id, product_id, and sales_rep_id:
SELECT COUNT(*), SUM(quantity)
  FROM sales
 WHERE customer_id = ? AND product_id = ? AND sales_rep_id = ?;
This index would be defined as follows:
CREATE INDEX i_sales_cust_prod_rep ON
    sales (customer_id, product_id, sales_rep_id);
We can use a concatenated index to resolve queries where only some of the columns in the index are specified, provided that at least one of the "leading" columns in the index is included.
For instance, if we create an index on (surname,firstname,date_of_birth), we can use that index to search on surname or on surname and firstname, but we cannot use it to search on date_of_birth. Given this flexibility, organize the columns in the index in an order that will support the widest range of queries. Remember that you can rarely afford to support all possible indexes because of the overhead indexes add to DML operations—so make sure you pick the most effective set of indexes.
A concatenated index can support queries that provide a subset of the columns in the index, provided that none of the leading columns is omitted. Pick the order of your columns in the concatenated index carefully to support the widest possible range of queries.
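To illustrate the leading-column rule, here is a sketch using the (surname, firstname, date_of_birth) example; the index name and the literal values are ours:

```sql
CREATE INDEX i_cust_name_dob
    ON customers (surname, firstname, date_of_birth);

-- Can use the index: the leading column (surname) is present
SELECT * FROM customers WHERE surname = 'SMITH';
SELECT * FROM customers WHERE surname = 'SMITH' AND firstname = 'JOHN';

-- Cannot use the index: the leading column is omitted
SELECT * FROM customers WHERE date_of_birth = '1960-01-01';
```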
Merging multiple indexes
While a concatenated index on all the columns in the WHERE clause will almost always provide the best performance, sometimes the sheer number of column combinations will prevent us from creating all of the desirable concatenated indexes.
For instance, consider the sales table in our sample database. We may want to support queries based on any combination of customer_id, product_id, and sales_rep_id—that would only require four indexes. Add another column and we would need at least six indexes. All of these indexes take up space in the database and—perhaps worse—slow down inserts, updates, and deletes. Whenever we insert or delete a row, we have to insert or delete the index entry as well. If we update an indexed column, we have to update the index as well.
If you can't create all of the necessary indexes, do not despair. MySQL 5.0 can merge multiple indexes quite effectively. So instead of creating a concatenated index on the three columns, we could create indexes on each of the columns concerned. MySQL will merge rows retrieved from each index to find only those rows matching all conditions.
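For the sales example, the three single-column indexes might be created as follows (the index names match those that appear in Example 20-1):

```sql
-- Three single-column indexes that MySQL 5.0 can merge at run time
-- instead of a single three-column concatenated index
CREATE INDEX i_sales_customer ON sales (customer_id);
CREATE INDEX i_sales_product  ON sales (product_id);
CREATE INDEX i_sales_rep      ON sales (sales_rep_id);
```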
Index merges can be identified by the index_merge access type in the EXPLAIN statement output. All the indexes being merged will be listed in the keys column, and the Extra column will include a Using intersect clause with the indexes being merged listed. Example 20-1 shows the EXPLAIN output for a query that performs an index merge.
Example 20-1. Example of an index merge
SELECT COUNT(*), SUM(quantity)
  FROM sales
 WHERE customer_id = ? AND product_id = ? AND sales_rep_id = ?;
ID=1 Table=sales Select type=SIMPLE Access type=index_merge Rows=1
Extra=Using intersect(i_sales_rep, i_sales_customer, i_sales_product)
Not all index merges are equal; just as indexes on different columns have different performance characteristics (due to their selectivity), different combinations of merged indexes will perform differently. Figure 20-3 shows the performance of the three possible single-column indexes created to support our example query, along with the performance of each possible merge of two indexes. As you can see, the best result was obtained by merging the two most selective indexes.
Figure 20-3. Comparison of various single-column indexes and index merge performance
Covering indexes
Creating a covering index is a very powerful technique for squeezing the last drop of performance from your indexes. If there are only a few columns in the SELECT clause that are not also in the WHERE clause, you can consider adding these columns to the index. MySQL will then be able to resolve the query using the index alone, avoiding the I/Os involved in retrieving the rows from the table. Such an index is sometimes called a covering index.
For our previous example, if we add the quantity column to the index, our query can be resolved from the index alone. In the EXPLAIN output, the Extra column will include the tag Using index to indicate that the step was resolved using only the index, as in Example 20-2.
Example 20-2. Using a covering index
SELECT COUNT(*), SUM(quantity)
  FROM sales
 WHERE customer_id = ? AND product_id = ? AND sales_rep_id = ?;
ID=1 Table=sales Select type=SIMPLE Access type=ref Rows=1
Extra=Using index
For queries that retrieve only a single row, the savings gained by covering indexes are probably going to be hard to notice. However, when scanning multiple rows from a table, the cost savings add up rapidly. In fact, it is often quicker to use a covering index to return all the rows from a table than to perform a full table scan. Remember that for normal indexed retrieval, the (very rough) rule of thumb is that the index probably isn't worth using unless you are accessing maybe 10% of the rows in the table. However, a covering index might be appropriate even if all of the rows are being read.
Covering indexes—which allow a query to be resolved from the index alone—can be efficient even if all or most of a table is being accessed.
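For the sales example, such a covering index might be defined as follows; the index name and the column order are our assumptions (the WHERE columns lead, the SELECT-only column trails):

```sql
-- quantity is included only so the query can be resolved from the index
-- alone; it is never used to locate rows
CREATE INDEX i_sales_cust_prod_rep_qty
    ON sales (customer_id, product_id, sales_rep_id, quantity);
```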
Comparing the Different Indexing Approaches
Figure 20-4 summarizes the performance of the various options for resolving our sample query (retrieving sales totals for a specific sales rep, customer, and product). Even for this simple query, there is a wide range of indexing options; in fact, we did not try every possible indexing option. For example, we didn't try a concatenated index on product_id + sales_rep_id.
There are several key lessons to be learned from these examples:
Not all index plans are equal
Novice SQL programmers are often satisfied once they see that the EXPLAIN output shows that an index is being used. However, there is a huge difference between the performance provided by the "best" and the "worst" index (in this example, the worst index was more than 10,000 times more expensive than the best index!).
Concatenated indexes rule
The best possible index for any particular table access with more than one column in the WHERE clause will almost always be a concatenated index.
Think about over-indexing
If the SELECT list contains only a few columns beyond those in the WHERE clause, it is probably worth adding these to the index.
Remember that indexes come at a cost
Indexes are often essential to achieve decent query performance, but they will slow down every INSERT and DELETE and many UPDATE operations. You need to make sure that every index is "paying its way" by significantly improving query performance.
Rely on index merges to avoid huge numbers of concatenated indexes
If you have to support a wide range of column combinations in the WHERE clause, create concatenated indexes to support the most common queries, and single-column indexes that can be merged to support less common combinations.
Figure 20-4. Comparison of different indexing techniques when retrieving sales total for specific product, customer, and sales rep
Avoiding Accidental Table Scans
There are a few circumstances in which MySQL might perform a full table scan even if a suitable index exists and perhaps even after you instruct MySQL to use an index with the FORCE INDEX hint. The three main reasons for such "accidental" table scans are:
§ You modify an indexed column in the WHERE clause with a function or an operator.
§ You are searching for a substring within an indexed column.
§ You are using only some of the columns within a concatenated index, and the order of columns in the index does not support searching on the columns you have specified.
Let's look at each situation in the following sections.
Accidentally suppressing an index using a function
One of the most common causes for what might appear to be an inexplicable refusal by MySQL to use an index is some kind of manipulation of the query column.
For instance, let's suppose that we are trying to find all customers that are older than 55 (we might want to target them for a specific sales campaign). We have an index on date_of_birth and the index is certainly selective, but MySQL does not use the index, as shown in Example 20-3.
Example 20-3. Index suppressed by function on query column
SELECT * FROM customers
 WHERE (datediff(curdate(), date_of_birth)/365.25) > 55
1 SIMPLE select(ALL) on customers using no key
The problem here is that by enclosing the date_of_birth column within the DATEDIFF function, we prevent MySQL from looking up values in the index. If we rewrite the query so that the functions are applied to the search value rather than the search column, we see that the index can be used, as shown in Example 20-4.
Example 20-4. Applying a function to the search value does not suppress the index
SELECT * FROM customers
 WHERE date_of_birth < date_sub(curdate(), INTERVAL 55 YEAR)
1 SIMPLE select(range) on customers using i_customer_dob
Avoid modifying search columns in the WHERE clause with functions or operators, as this could suppress an index lookup. Where possible, modify the search value instead.
Accidentally suppressing an index using a substring
Another way to suppress an index on a column is to search on a nonleading substring of the column. For instance, indexes can be used to find the leading segments of a column, as shown in Example 20-5.
Example 20-5. Indexes can be used to search for a leading portion of a string
SELECT * FROM customers
 WHERE customer_name LIKE 'HEALTHCARE%'
1 SIMPLE select(range) on customers using i_customer_name
But we can't use the index to find text strings in the middle of the column, as demonstrated in Example 20-6.
Example 20-6. Indexes can't be used to find nonleading substrings
SELECT * FROM customers WHERE customer_name LIKE '%BANK%'
1 SIMPLE select(ALL) on customers using no key
If you have text strings and need to search for words within those strings, you could consider using the MyISAM full-text search capability. Otherwise, be aware that you can only use indexes to find leading substrings within character columns.
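As a sketch, a full-text search on customer_name might look like this (assuming the customers table uses, or is converted to, the MyISAM engine, since only MyISAM supports FULLTEXT indexes in this MySQL version):

```sql
-- A FULLTEXT index supports word searches anywhere in the column
CREATE FULLTEXT INDEX i_customer_name_ft ON customers (customer_name);

-- MATCH ... AGAINST finds the word regardless of its position, something
-- a B-tree index and LIKE '%BANK%' cannot do efficiently
SELECT * FROM customers
 WHERE MATCH (customer_name) AGAINST ('BANK');
```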
Creating concatenated indexes with a poor column order
Another time we might experience an accidental table scan is when we expect a concatenated index to support the query, but we are not specifying one of the leading columns of the index. For instance, suppose that we created an index on customers as follows:
CREATE INDEX i_customer_contact
    ON customers (contact_firstname, contact_surname);
It might seem natural to create this index with firstname before surname, but that is usually a poor choice, since concatenated indexes can only be used if the leading columns appear in the query, and it is more common to search on surname alone than on firstname alone.
For instance, the index can support a query to find a customer by contact_firstname:
SELECT * FROM customers WHERE contact_firstname = 'JOHN';
1 SIMPLE select(ref) on customers using i_customer_contact
But MySQL cannot use the index if only contact_surname is specified:
SELECT * FROM customers WHERE contact_surname = 'SMITH';
1 SIMPLE select(ALL) on customers using no key
We probably should have created the index as (contact_surname,contact_firstname) if we need to support searching by surname only. If we want to support searching whenever either the surname or the firstname appears alone, then we will need an additional index.
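A sketch of that additional index (the name is ours):

```sql
-- Supports searches on contact_surname alone, as well as on
-- contact_surname combined with contact_firstname
CREATE INDEX i_customer_contact_rev
    ON customers (contact_surname, contact_firstname);
```

With both indexes in place, either column can appear alone in the WHERE clause and still be resolved by an index lookup.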
A concatenated index cannot be used to resolve a query unless the leading (first) column in the index appears in the WHERE clause.
Optimizing Necessary Table Scans
We don't necessarily want to avoid a full table scan at all costs. For instance, we might choose not to create an index to support a query that runs only once a month if that index would degrade UPDATE and INSERT statements that are being executed many times a second.
Furthermore, sometimes the nature of our queries leaves no alternative to performing a full table scan. For instance, consider an online book store that maintains a database of books in stock. One of the key tables might contain a row for each individual book, as shown in Figure 20-5.
Figure 20-5. Single-table book catalog
Every day, an inventory report is run that summarizes inventory and outstanding orders. The core of the report is the SQL shown in Example 20-7.
Example 20-7. SQL for inventory report example
GROUP BY publisher
1 SIMPLE select(ALL) on book_catalog using no key
Using temporary; Using filesort
There is no WHERE clause to optimize with an index, so (we might think) there is no alternative to a full table scan. Nevertheless, the person who determines whether or not we get a raise this year strongly encourages us to improve the performance of the query. So what are we going to do?
If we must read every row in the table, then the path to improved performance is to decrease the size of that table. There are at least two ways of doing this:
§ Move any large columns not referenced in the query to another table (provided that this doesn't degrade other critical queries).
§ Create an index based on all of the columns referenced in the query. MySQL can then use the index alone to satisfy the query.
Let's consider splitting the table as a first option. We can see in Figure 20-5 that the book_catalog table contains both a BLOB column containing a picture of the book's cover and a TEXT column containing the publisher's description of the book. Both of these columns are large and do not appear in our query. Furthermore, it turns out that these columns are never accessed by a full table scan—the only time the description and cover picture are accessed is when a customer pulls up the details for a single book on the company's web site.
It therefore may make sense to move the BLOB and TEXT columns to a separate table. They can be quickly retrieved via index lookup when required, while their removal will make the main table smaller and quicker to scan. The new two-table schema is shown in Figure 20-6.
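A sketch of the split-off table, with assumed table and column names (the real schema shown in Figure 20-6 may differ):

```sql
-- Holds only the large, rarely scanned columns; the main book_catalog
-- table stays narrow and fast to scan
CREATE TABLE book_details (
    book_id     INT NOT NULL PRIMARY KEY,
    cover_image BLOB,          -- picture of the book's cover
    description TEXT,          -- publisher's description
    CONSTRAINT fk_book_details_book
        FOREIGN KEY (book_id) REFERENCES book_catalog (book_id)
) ENGINE=InnoDB;
```

When a customer views a single book on the web site, a primary-key lookup on book_details retrieves the picture and description cheaply.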
Removing the BLOB and TEXT columns reduced the size of the table by about 60% and more than halved the time required to perform a full table scan (see Figure 20-7).
Figure 20-6. Two-table book schema
Figure 20-7. Optimizing a full table scan by removing long columns or using a full index scan
Another option to consider when faced with a seemingly unavoidable full table scan is to create an index on the columns concerned and resolve the query with an index scan rather than a table scan. The index is likely to be smaller than the table. For our example report, we could create an index as follows:
CREATE INDEX i_book_inventory ON book_catalog
The EXPLAIN output (which follows) shows that now only the index is used to resolve the query (as shown by the Using index note in the Extra column), and, as we can see in Figure 20-7, this results in even better performance than removing the large columns from the original table.
GROUP BY publisher
1 SIMPLE select(index) on book_catalog using i_book_inventory
One of the reasons that the index performs so well in this case is that MySQL uses the index to optimize the GROUP BY clause. Previous examples all created and sorted temporary tables (shown by Using temporary; Using filesort in the EXPLAIN output). Because the leading column of the index was publisher, and because this column is also the column to be sorted to support the GROUP BY clause, no sort was required. We'll discuss optimizing GROUP BY and ORDER BY using indexes in detail in the next chapter.
Using Merge or Partitioned Tables
Sometimes we are faced with queries that retrieve a proportion of the table that is too high to be optimized by an index, but that is still only a fraction of that table's total. For instance, we might want to optimize a query that retrieves sales data for a particular year. An index to support such a query might return too high a percentage of rows in the table and actually take longer than a full table scan.
One possible way to optimize this scenario is to create a separate table for each year's sales, so that we are able to retrieve data for a particular year from the particular table, thus avoiding the overhead of scanning all of our sales data.
Separate tables for each year would make application code fairly awkward; the programmer would need to know which table to use for a given query, and we would have to provide some way to retrieve data for all years when necessary. To avoid this problem, MyISAM offers merge tables. A MyISAM merge table is a logical table that comprises multiple real tables that are UNIONed together. You can insert into a merge table (provided that the INSERT_METHOD is not set to NO), and you can query from it as you would a normal table.
For instance, we could create separate sales tables for each year, as shown in Example 20-8.
Example 20-8. Creating MyISAM merge tables
CREATE TABLE SALES2000 ENGINE=MyISAM AS
  SELECT * FROM sales
   WHERE sale_date BETWEEN '2000-01-01' AND '2000-12-31';
CREATE TABLE SALES2001 ENGINE=MyISAM AS
  SELECT * FROM sales
   WHERE sale_date BETWEEN '2001-01-01' AND '2001-12-31';
. . . Create other "year" tables . . .
CREATE TABLE all_sales
   (sales_id INT(8) NOT NULL PRIMARY KEY,
    . . . Other column definitions . . .
   ) ENGINE=MERGE UNION=(SALES2000, SALES2001, . . .) INSERT_METHOD=LAST;
If we need to obtain sales data for a particular year, we can do so fairly quickly by accessing one of the merge table's constituents directly. For queries that span year boundaries, we can access the merge table itself. We also have the advantage of being able to purge old rows very quickly by rebuilding the merge table without the unwanted years and then dropping the old table.
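For instance (assuming the tables created in Example 20-8):

```sql
-- Sales for a single year: go straight to the constituent table and
-- avoid scanning the other years entirely
SELECT SUM(sale_value) FROM SALES2001;

-- Sales spanning a year boundary: query the merge table itself
SELECT SUM(sale_value)
  FROM all_sales
 WHERE sale_date BETWEEN '2000-07-01' AND '2001-06-30';
```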
However, you should bear in mind that when you access the merge table directly, you will experience an additional overhead as MySQL merges the individual tables into a logical whole. This means that scanning the merge table will take substantially longer than scanning a single table containing all of the necessary data.
In MySQL 5.1 (which is alpha as we finalize this chapter), we can create a partitioned table to provide a similar solution to merge tables , as well as to provide other management and performance advantages. Example 20-9 shows the syntax for creating a MySQL 5.1 partitioned table that is similar to the MyISAM merge table created in the previous example.
Example 20-9. Creating MySQL 5.1 partitioned tables
CREATE TABLE sales_partitioned (
sales_id INTEGER NOT NULL,
customer_id INTEGER NOT NULL,
product_id INTEGER NOT NULL,
sale_date DATE NOT NULL,
quantity INTEGER NOT NULL,
    sale_value DECIMAL(8,0) NOT NULL
)
PARTITION BY RANGE (YEAR(sale_date)) (
PARTITION p_sales_pre2000 VALUES LESS THAN (2000),
PARTITION p_sales_2000 VALUES LESS THAN (2001),
PARTITION p_sales_2001 VALUES LESS THAN (2002),
PARTITION p_sales_2002 VALUES LESS THAN (2003),
PARTITION p_sales_2003 VALUES LESS THAN (2004),
PARTITION p_sales_2004 VALUES LESS THAN (2005),
PARTITION p_sales_2005 VALUES LESS THAN (2006),
    PARTITION p_sales_2006 VALUES LESS THAN (2007)
);
If we issue a query that requires data from only one of the partitions, MySQL will be able to eliminate unnecessary partitions from the scan, allowing us to rapidly retrieve information for an individual year. Partitioned tables offer a host of other performance advantages, such as rapid purging of stale data, parallel processing of large result sets, and easier distribution of I/O across multiple disk devices. Partitioning is one of the major new features of MySQL 5.1.
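In MySQL 5.1 you can confirm this "partition pruning" with EXPLAIN PARTITIONS; only the relevant partitions should be listed in the partitions column of the output:

```sql
-- For this query, only p_sales_2005 should need to be read
EXPLAIN PARTITIONS
SELECT SUM(sale_value)
  FROM sales_partitioned
 WHERE sale_date BETWEEN '2005-01-01' AND '2005-12-31';
```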
So far we have looked at tuning SQL queries against a single table only. Let's move on to tuning SQL queries that join rows from two or more tables.
How MySQL Joins Tables
MySQL currently joins tables using a fairly simple technique with a complicated-sounding name. The MySQL manual refers to the join algorithm as single-sweep multi-join. In essence, when MySQL joins two tables, it will read the rows from the first table and—for each row—search the second table for matching rows. Further details can be found in the MySQL Internals Manual; see http://dev.mysql.com/doc/internals/en/index-merge-overview.html.
Joins Without Indexes
The basic join algorithm is not very well suited to joining multiple tables unless there are indexes to support the join.[*] Performance might be adequate for very small tables, but as table sizes increase, the join overhead increases rapidly; in fact, it grows almost exponentially.
Figure 20-8 shows how response time increases for nonindexed joins as the size of each table increases. This semi-exponential degradation is extremely undesirable: if we extrapolate the response time curve for larger tables, we predict that it would take 20 minutes to join two tables of 100,000 rows, 20 hours to join two tables with 1 million rows each, and 81 days to join two tables of 10 million rows each! This is definitely not the way you want your applications to perform as your database grows in size.
Figure 20-8. Table size versus elapsed time for nonindexed joins
Joins with Indexes
To get predictable and acceptable performance for our join, we need to create indexes to support the join. Generally, we will want to create concatenated indexes based on any columns in a table that might be used to join that table to another table. However, we don't need an index on the first (or "driving") table's columns; that is, if we are joining customers to sales, in that order, then our index needs to be on sales—we don't need an index on both tables.
Creating an index on the join column not only reduces execution time, but also prevents an exponential increase in response time as the tables grow in size. Figure 20-9 shows how the response time increases as the number of rows increases when there is an index to support the join. Not only is performance much better (about 0.1 second compared to more than 25 seconds for two tables of 20,000 rows), but the increase in response time is far more predictable. Extrapolating the response time for the indexed join, we can predict that joining two tables of 10 million rows each could be achieved in only 40 seconds—compared to 81 days for the nonindexed join.[*]
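For instance (the order_items table and its columns are hypothetical), if an order_items table joins to sales on two columns, a concatenated index on the driven table covers the whole join key:

```sql
-- Hypothetical tables: order_items joins to sales on (sales_id, item_seq).
-- The concatenated index lets MySQL look up matching order_items rows
-- directly for each sales row, instead of scanning order_items each time.
CREATE INDEX i_order_items_join ON order_items (sales_id, item_seq);
```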
Unless you are sure that the tables involved will always be very small, always create an index (concatenated, if appropriate) to support a join of one table to another.
Figure 20-9. Response time versus table size for an indexed join
By far, the most important factor in the optimization of MySQL joins is to ensure that each successive join is supported by an index. Beyond that, we should:
§ Ensure that rows to be eliminated by WHERE clause conditions are eliminated as early as possible in the join.
§ Pick an optimal join order. A good rule of thumb is to join tables from smallest to largest.
Generally, the MySQL optimizer can be relied upon to pick a good join order. However, if we need to change the join order, we can use the STRAIGHT_JOIN hint to ensure that the tables are joined in the order in which they appear in the FROM clause. For instance, the following use of STRAIGHT_JOIN ensures that the join order is from the smallest table (ta_1000) to the largest (ta_5000):
SELECT STRAIGHT_JOIN count(*)
FROM ta_1000 JOIN ta_2000 USING (sales_id)
JOIN ta_3000 USING (sales_id)
JOIN ta_4000 USING (sales_id)
JOIN ta_5000 USING (sales_id);
Figure 20-10 shows the difference in elapsed time when joining tables in either ascending or descending order of table size. Joining from smallest to largest is about twice as fast as joining from largest to smallest.
When determining a join order, tables with WHERE clauses that eliminate rows should be introduced to the join as early as possible. After that, try to join tables from smallest to largest.
Figure 20-10. Table size and join order
A Simple Join Example
Based on our discussions so far, here is a summary of the most important rules for optimizing MySQL joins:
§ Ensure that every join is supported by an index.
§ Eliminate rows as early as possible in the join sequence.
§ Join tables from smallest to largest.
Let's apply these rules to a simple example.
Consider the case in which we are listing all sales for a particular customer. The query looks like this:
SELECT sales.*
  FROM sales JOIN customers USING (customer_id)
 WHERE customer_name = 'LARSCOM INC';
With just the primary key indexes in place, the EXPLAIN output looks like this:
1 SIMPLE select(ALL) on sales using no key
1 SIMPLE select(eq_ref) on customers using PRIMARY
This execution plan satisfies our first rule: an index (the primary key customer_id of customers) is used to join sales to customers.
However, our second rule—eliminating rows as early as possible in the join sequence—is violated: all of the sales rows are read first, even though only some of those sales (those for a particular customer) are needed. Furthermore, we are joining the larger table sales (2.5 million rows) to the smaller table customers (100,000 rows).
So, what we need to achieve is an efficient join from customers to sales. This means indexing the sales.customer_id column so that we can find sales for a particular customer. The following index should do the trick:
CREATE INDEX i_sales_customer ON sales(customer_id);
The execution plan now looks like this:
1 SIMPLE select(ALL) on customers using no key
1 SIMPLE select(ref) on sales using i_sales_customer
This is better, but we could improve matters further if we did not have to do the full scan on customers. Adding the following index will let us obtain the desired customer more efficiently:
CREATE INDEX i_customer_name ON customers(customer_name);
Once this is done, the execution plan looks like this:
1 SIMPLE select(ref) on customers using i_customer_name
Using where; Using index
1 SIMPLE select(ref) on sales using i_sales_customer
This is the optimal execution plan for this query. The desired customer is found quickly by the index, and then matching sales for that customer are found using the i_sales_customer index. Figure 20-11 shows the performance improvements gained by our optimizations.
Figure 20-11. Optimization of a simple join
[*] We are hoping to see a join algorithm that can perform adequately in the absence of indexes—the hash join algorithm—in MySQL 5.2.
[*] Joining two very large tables may involve other types of overhead, such as passing the data back to the client and fitting the tables in memory, but the overhead of actually performing the join with the index will be massively less than that of the unindexed join.
Conclusion
In this chapter we examined some of the basic principles for tuning simple SQL statements. Tuning SQL inside of MySQL stored programs is probably the single most important thing we can do to avoid poorly performing stored programs.
For SQL statements that retrieve a small proportion of the rows from a table (say, 5 to 15%), you will probably want to create indexes to obtain good performance. Here are some best practice guidelines for creating indexes:
§ Create concatenated indexes that include all of the columns referenced in the WHERE clause.
§ Consider adding additional columns that appear in the SELECT list to allow for an "index only" access path.
§ Create concatenated indexes to support the widest possible range of queries—concatenated indexes can be used for queries that reference only a subset of the columns in the index, provided that the "leading" columns are in the WHERE clause. This means that you should put the most commonly used columns first in the index.
§ If the number of concatenated indexes needed to support all possible queries is too large (say five or more), create single-column indexes on selective columns that MySQL can merge.
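As a sketch of the last two guidelines (the contact_surname column is an assumption for illustration):

```sql
-- A concatenated index with the most commonly queried column leading:
-- usable for WHERE customer_name = ...,
-- and for WHERE customer_name = ... AND contact_surname = ...,
-- but not for a query on contact_surname alone.
CREATE INDEX i_cust_name_surname
    ON customers (customer_name, contact_surname);

-- Alternatively, single-column indexes on selective columns,
-- which MySQL can combine with its index merge optimization.
CREATE INDEX i_cust_name    ON customers (customer_name);
CREATE INDEX i_cust_surname ON customers (contact_surname);
```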
MySQL can join large tables effectively only if an index exists on the join columns for at least one of the tables being joined. To optimize basic joins:
§ Create a concatenated index on all of the columns used to join the two tables.
§ Make sure that WHERE clause conditions are applied as early as possible; that is, the "driving" table should be the one with the most selective WHERE clause condition. This will create the most efficient join.
§ Provided that joins are supported by indexes and that WHERE clause conditions are processed in the first few tables to be joined, the best join order will generally be from smallest table to largest table.