Learning MySQL and MariaDB (2015)

Part V. Administration and Beyond

Chapter 15. Bulk Importing Data

You might be asked one day to create a MySQL or MariaDB database that will replace an existing database that uses a different database system — or some other format used for storing data. Or you might be asked to take the data from an application that was not designed for databases, like a spreadsheet. So that you don’t have to manually enter the data, there are ways you can import it. This chapter explains how to bulk import data into a database.

When using other applications, export the data from the source application to a format that MySQL can read, such as a text file containing data separated by particular characters. If you’re given a large amount of data to import, hopefully it will already be well organized and in a data text file. Then you can use the LOAD DATA INFILE statement to import the data.

This isn’t an overly difficult task, but the processing of large amounts of data can be intimidating the first time. It can be a barrier to migrating data to MySQL and MariaDB. There are many nuances to consider for a clean import, which is especially important if you want to automate the process. There may also be restraints to consider when importing data onto a server provided by a web hosting company. We’ll cover all of these in this chapter.

Preparing to Import

To import data into MySQL or MariaDB, the data needs to be in a compatible format. Both database systems will accept a simple text file in which the values are delimited in some way. The easiest way to deal with incompatible data in any format is to load it in its original software and to export it to a delimited text file. Most applications can do this. They will usually separate field values by commas and separate records by hard returns, putting each row on a separate line. Some applications will allow you to set the delimiters to your choice. If that’s an option, you might use the bar (i.e., |) to separate fields because it won’t typically be used within the data, and separate records with a new-line.

For some examples related to the rookery database, let’s get a large data text file to use. Cornell University is famous for ornithology. They also publish books on birds through Cornell University Press. One of their publications is The Clements Checklist of World Birds by James F. Clements. The list of birds from this publication is on its website in a spreadsheet and in a comma-separated values (CSV) format. Every August, an updated list is posted on the site for people and organizations to use freely on their own sites and in their databases to promote the study and appreciation of birds.

Suppose we want to compare the latest list to our birds table to see whether there are any new species. This may seem intimidating, but it can be done without much trouble. To participate, download the CSV file from Cornell’s site or MySQL Resources. For the examples that follow, I downloaded the Clements-Checklist-6.9-final.csv file.

WARNING

If you use FTP to upload a text file to the server, be sure to upload it in ASCII mode and not binary mode. If the text file was created with a program that uses binary characters or binary hard returns, these will cause problems when loading the data.

After you download the Cornell data text file, open it with a text editor to see how the content looks. You will need to know how the lines and fields are organized and delineated. Some excerpts follow from the Cornell data file that I downloaded:

sort 6.9,Clements 6.9 change,2014 Text for website,

Category,Scientific name,English name,Range,

Order,Family,Extinct,Extinction Year,sort 6.8,sort 6.7,page 6.0,,,,,

...

4073,new species,"Walters (1991) and Cibois et al. (2012) proposed

recognition of Prosobonia ellisi Sharpe 1906, with English name

Moorea Sandpiper and range ""extinct;

formerly Moorea (Society Islands)"".",

species,Prosobonia ellisi,Moorea Sandpiper,extinct;

formerly Moorea (Society Islands),

Charadriiformes,Scolopacidae (Sandpipers and Allies)

  ,1,xxxx,,,addition (2014),,,,,

...

6707,new species,"Robb et al. (2013) describe a new species of owl, Omani Owl

(Strix omanensis), from the Arabian Peninsula, with range

""central Al Hajar mountains, northern Oman"".

Position Omani Owl immediately following Hume's Owl (Strix butleri).",

species, inStrix omanensis,Omani Owl,"central Al Hajar mountains, northern Oman",

Strigiformes,Strigidae (Owls),,,,,addition (2014),,,,,

...

The CSV file contains about 32,000 lines, but I’ve listed here just a few lines of interest as a sample. I put hard returns within each record to make them easier to discuss. Each record in the original file, though, is on one long line without breaks.

The first record gives the field names. Some of the names are a bit confusing, as they refer to earlier versions of the Clements list for continuity with earlier lists. The first field, sort 6.9, is an identification number for each row. The sort 6.8 and sort 6.7 fields you see further down are the identification numbers from those earlier lists. There are several more fields, but for the examples in this chapter we care only about the Clements 6.9 change, Scientific name, English name, Order, and Family fields.

The Clements 6.9 change field indicates the type of change for the bird since the last Clements list. For the purpose of the scenario we’re concerned about now, we want the new species changes.

The two records containing data are the ones that we want to import. Record 4073 is related to a new species that was added to the Clements list, the Prosobonia ellisi or Moorea Sandpiper. Unfortunately, this bird is extinct. Ornithologists collect information on all known birds, even extinct ones. For good form, we’ll add it to the birds table, even though none of our birdwatchers will see one. Record 6707 shows another new species, the Strix omanensis or Omani Owl. Fortunately, this owl from the Arabian Peninsula isn’t extinct.

Before begining an import, you will need to put the CSV file on the server and in a directory accessible by MySQL. It’s a good security habit to put data files in non public directories. But to keep it simple, for our purposes, we’ll use the /tmp directory to hold temporarily the data text files for importing.

The next task in preparing to import the Clements-Checklist-6.9-final.csv file is to create a table into which to import it. It contains more rows and more columns than we need, but importing 32,000 lines from a CSV file will take only seconds. So the size is not a problem.

We could import the data directly into an existing table, but it’s best to create a new table that we’ll use only for the import. We can execute an INSERT INTO...SELECT statement later to copy the data from the import table we create into an existing table. Execute the following on your server to create the import table:

CREATE TABLE rookery.clements_list_import

(id INT, change_type VARCHAR(255),

col2 CHAR(0), col3 CHAR(0),

scientific_name VARCHAR(255),

english_name VARCHAR(255),

col6 CHAR(0), `order` VARCHAR(255),

family VARCHAR(255),

col9 CHAR(0), col10 CHAR(0),

col11 CHAR(0), col12 CHAR(0),

col13 CHAR(0), col14 CHAR(0),

col15 CHAR(0), col16 CHAR(0), col17 CHAR(0));

This CREATE TABLE statement creates a table with one column for each field of a line in the data text file. The columns are in the same order as the fields in the data text file. For the fields that we won’t need, we’ve assigned generic names for the related columns with a data type of CHAR(0) — a fixed character field with a width of 0 characters — so that the data for those fields won’t be stored. There’s a better way to do this. We could just import the columns we want. But we’ll cover that later in this chapter. For this example, we’ll use this simple method and focus on the other fields.

For the fields we want, I’ve assigned names for the columns close to the field names from the data text file and a data type of VARCHAR(255). Notice that we had to put the order field within backticks. That’s because the word order is a reserved word (e.g., the ORDER BY clause). We can use it for a column name, as long as we always refer to it in this way. Otherwise it will confuse MySQL and cause an error.

At this point, we have a good data text file to import and we have placed the file in an accessible directory on the server. We have determined how the data is organized in the file. And we have created a table to receive the data. We’re now ready to load the data.

Loading Data Basics

To load data into MySQL or MariaDB, you need an administrative user account that has FILE privileges. Let’s use the user account, admin_import that we created in Chapter 13.

The LOAD DATA INFILE statement loads data from a text file. It’s a versatile SQL statement with several options and clauses. We’ll look at them throughout this chapter. The following command is the minimum we would enter from the mysql client to load the data from the Clements-Checklist-6.9-final.csv file data file from Cornell into the clements_list_import table:

LOAD DATA INFILE '/tmp/Clements-Checklist-6.9-final.csv'

INTO TABLE rookery.clements_list_import

FIELDS TERMINATED BY ',';

Notice in the SQL statement here that the file path and name are enclosed in quotes. You can use single or double quotes. Notice also the FIELDS clause. In this clause, we define the parameters of the fields, how they are identified. For the CSV file we’re importing, fields are deliminated from each other with a comma. For this, we add to the FIELDS clause the TERMINATED BY subclause and a comma within quotes.

There are other subclauses and other clauses, but this is the least required for the LOAD DATA INFILE statement. However, this SQL statement as we’ve constructed it will cause problems and generate warning messages.

Watching for Warnings

If you ran the LOAD DATA INFILE statement in the previous section, you may have noticed many warnings. The following output shows the message generated by running that SQL statement, and the first few warnings:

Query OK, 32187 rows affected, 65535 warnings (0.67 sec)

Records: 32187  Deleted: 0  Skipped: 0  Warnings: 209249

SHOW WARNINGS;

+---------+------+--------------------------------------------------------------+

| Level   | Code | Message                                                      |

+---------+------+--------------------------------------------------------------+

| Warning | 1366 | Incorrect integer value: 'sort 6.9' for column 'id' at row 1 |

| Warning | 1265 | Data truncated for column 'col2' at row 1                    |

| Warning | 1265 | Data truncated for column 'col3' at row 1                    |

...

You can execute the SHOW WARNINGS statement to get a list of the warnings. Because there were 209,249 warnings, I’ve listed only a few of them here, just the ones for the first row. The warnings for all of the other rows are about the same. Most of these warnings appeared because we have columns using the CHAR data type with a width of 0. This means that any data in the fields that corresponds to those columns will contain more data than it can hold. In such cases, the data is truncated upon being loaded into the table and the server generates a warning for each such column. Let’s look at a sample of the data in the table to see more clearly what the warnings are trying to tell us and how well the statement did:

SELECT * FROM rookery.clements_list_import LIMIT 2 \G;

*************************** 1. row ***************************

             id: 0

    change_type: Clements 6.9 change

           col2:

           col3:

scientific_name: Scientific name

   english_name: English name

           col6:

          order: Order

         family: Family

           col9:

          col10:

          col12:

          col13:

          col14:

          col15:

          col16:

          col17:

*************************** 2. row ***************************

             id: 1

    change_type:

           col2:

           col3:

scientific_name: Struthio camelus

   english_name: Ostrich

           col6:

          order: Struthioniformes

         family: Struthionidae (Ostrich)

           col9:

          col10:

          col12:

          col13:

          col14:

          col15:

          col16:

          col17:

The LOAD DATA INFILE statement seems to be working well. It has inserted the fields correctly into the columns of the table. The first row, though, contains the field names. We don’t need that row, but it won’t affect anything for our scenario. Looking at the second row, you can see that the data we want from the text file went into the right columns: we have the scientific and common name of the birds, as well as the name of the order and family to which they belong. For the fields that we don’t want, the columns with generic names have no value. That’s fine. As I said before, there is a more professional, cleaner way in which we could have loaded the data. We’ll cover that method later. Let’s proceed with the next step to add new species to the birds table.

Checking the Accuracy of the Import

Before inserting data into the birds table, let’s check a little more closely the accuracy of the data loaded into the clements_list_import table. We’ll use the SELECT statement to see how the data looks for the rows we want, the new species. Enter the following SQL statement on your server and review the results:

SELECT id, change_type,

scientific_name, english_name,

`order`, family

FROM rookery.clements_list_import

WHERE change_type = 'new species' LIMIT 2 \G

*************************** 1. row ***************************

             id: 4073

    change_type: new species

scientific_name: species

   english_name: Prosobonia ellisi

          order: extinct; formerly Moorea (Society Islands)

         family: Charadriiformes

*************************** 2. row ***************************

             id: 6707

    change_type: new species

scientific_name:  from the Arabian Peninsula

   english_name:  with range ""central Al Hajar mountains

          order: species

         family: Strix omanensis

The results here are limited to two rows, but you can remove the LIMIT clause to see all of the rows. There should be 11 in all. These two rows relate to the two records in the excerpt from the Clements-Checklist-6.9-final.csv file shown earlier in this chapter. Notice that data isn’t getting into the correct columns. To determine where things are going awry, let’s look closely at the record for the second row:

6707,

new species,

"Robb et al. (2013) describe a new species of owl,

 Omani Owl (Strix omanensis),

 from the Arabian Peninsula,

 with range ""central Al Hajar mountains,

 northern Oman"". Position Omani Owl immediately following

 Hume's Owl (Strix butleri).",

species,

Strix omanensis,

Omani Owl,

"central Al Hajar mountains,

northern Oman",

Strigiformes,

Strigidae (Owls),

,,,,addition (2014),,,,,

The text that was inserted in the columns is shown in boldface here. It seems that MySQL was confused by the commas contained within some of the fields. This is because the LOAD DATA INFILE we executed included a FIELDS clause that stipulated that they are terminated by a comma. The result is that text from fields containing commas is being cut into pieces and inserted into the subsequent columns. We can fix this problem by adding more parameters to the FIELDS clause.

Let’s delete the data in the clements_list_import table. This is one of the advantages of using a temporary table as we have done: we can delete everything and start anew. Then we’ll reload the data. Enter the following two SQL statements on your server:

DELETE FROM rookery.clements_list_import;

LOAD DATA INFILE '/tmp/Clements-Checklist-6.9-final.csv'

INTO TABLE rookery.clements_list_import

FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'

IGNORE 1 LINES;

The first SQL statement deletes all of the data in clements_list_import so that we may start with an empty table. The second SQL statement is the same as the previous LOAD DATA INFILE, except that we’ve added the ENCLOSED BY subclause to the FIELDS clause to specify that fields are enclosed with double quotes. In addition, we’ve included the OPTIONALLY option to that subclause to indicate that some fields may not be enclosed within double quotes. That tells MySQL that if it encounters a double quote, to look for a second one and to treat everything inside of the pair of double quotes as data. So if it finds a comma within double quotes, it will not consider it a marker indicating the termination of a field.

It may seem strange that this works, considering there is text outside of the double quotes and more than one pair of double quotes in some fields, but it does work.

TIP

When loading data into a table, it’s generally locked and other users are prevented from accessing the table. However, you can include the LOW_PRIORITY option to let other clients read from the table while you are loading it: LOAD DATA LOW_PRIORITY INFILE. The execution of the SQL statement will be delayed until no other clients are reading the table. It works only with tables that use storage engines with table-level locking (e.g., MyISAM), not with row-level locking tables (e.g., InnoDB).

There was another addition to the LOAD DATA INFILE statement we used here. We added the IGNORE clause to the end. This tells MySQL to ignore the number of lines specified, starting from the beginning of the data text file. By specifying that the statement ignore one line, we skip over the first line, which is the line containing the field names that we don’t need. If the data text file has more than one line for the header, you can tell it to ignore more than one.

Execute the earlier SELECT statement again:

SELECT id, change_type,

scientific_name, english_name,

`order`, family

FROM rookery.clements_list_import

WHERE change_type = 'new species' LIMIT 2 \G

*************************** 1. row ***************************

             id: 4073

    change_type: new species

scientific_name: Prosobonia ellisi

   english_name: Moorea Sandpiper

          order: Charadriiformes

         family: Scolopacidae (Sandpipers and Allies)

*************************** 2. row ***************************

             id: 6707

    change_type: new species

scientific_name: Strix omanensis

   english_name: Omani Owl

          order: Strigiformes

         family: Strigidae (Owls)

It’s now loading well. The scientific and common names are in the correct columns, along with the other columns that we want. We’re ready to move to the next step.

Selecting Imported Data

Now that we have properly loaded the data from the Cornell data text file into the clements_list_import table, we can use the INSERT INTO...SELECT statement to copy the data we want to the birds table. We’re learning and experimenting, so let’s create a table identical to the birds table to insert the data from the clements_list_import table. Execute the following on your server:

CREATE TABLE rookery.birds_new

LIKE rookery.birds;

Now let’s select the rows we want from clements_list_import and insert them into birds_new. Execute this on your server:

INSERT INTO birds_new

   (scientific_name, common_name, family_id)

SELECT clements.scientific_name, english_name, bird_families.family_id

  FROM clements_list_import AS clements

  JOIN bird_families

     ON bird_families.scientific_name =

     SUBSTRING(family, 1, LOCATE(' (', family) )

  WHERE change_type = 'new species';

In this SQL statement, we’re inserting only two columns from the clements_list_import table (i.e., scientific_name and english_name). We’re joining the clements_list_import table to the bird_families table to get the family_id. To determine the family_id, we have to join on the name of the family. This is included in the family column of the clements_list_import table, but it has extra text in parentheses — common names for some of the birds in the family. So we’re using the SUBSTRING() and the LOCATE() functions to get all of the text from the start of the string until it finds a space followed by an open parenthesis, as in Strigidae (Owls). In the WHERE clause here, we’re selecting any change_type that has a value of new species.

Let’s see how effective the INSERT INTO...SELECT statement was. Execute the following on your server:

SELECT birds_new.scientific_name,

common_name, family_id,

bird_families.scientific_name AS family

FROM birds_new

JOIN bird_families USING(family_id);

+------------------------+---------------------+-----------+---------------+

| scientific_name        | common_name         | family_id | family        |

+------------------------+---------------------+-----------+---------------+

| Prosobonia ellisi      | Moorea Sandpiper    |       164 | Scolopacidae  |

| Strix omanensis        | Omani Owl           |       178 | Strigidae     |

| Batrachostomus chaseni | Palawan Frogmouth   |       180 | Podargidae    |

| Erythropitta yairocho  | Sulu Pitta          |       217 | Pittidae      |

| Cichlocolaptes maza... | Cryptic Treehunter  |       223 | Furnariidae   |

| Pomarea nukuhivae      | Nuku Hiva Monarch   |       262 | Monarchidae   |

| Pomarea mira           | Ua Pou Monarch      |       262 | Monarchidae   |

| Pnoepyga mutica        | Chinese Cupwing     |       285 | Pnoepygidae   |

| Robsonius thompsoni    | Sierra Madre Gro... |       290 | Locustellidae |

| Zoothera atrigena      | Bougainville Thrush |       303 | Turdidae      |

| Sporophila beltoni     | Tropeiro Seedeater  |       322 | Thraupidae    |

+------------------------+---------------------+-----------+---------------+

This looks good. It’s shows all 11 new species and we’re able to match them to the appropriate bird families. Now we need only run an INSERT INTO...SELECT to copy all of this data into the birds table.

Although there are plenty of records in the CSV data text file, it wasn’t too difficult to load the data. There are smoother ways we could have loaded the data. And there are some other clauses, subclauses, and options available for other situations. We’ll look at all of these in the next few sections.

Better Loading

Although we have done well at loading a rather large data text file, we could do better. This section covers a few ways we can improve the method of loading data with the LOAD DATA INFILE statement.

Mapping Fields

When we loaded the data from the Cornell CSV data text file, we included many fields containing data in which we had no interest. We dealt with this by creating pointless character columns with no width to store data. That generated many warnings, which we ignored.

There’s a better way to address unwanted fields. At the end of the LOAD DATA INFILE statement, you can add a comma-separated list of columns in the table that map to fields in the original input. This list can also include user variables in place of columns. There must be a column or a variable for each field and the columns must match the order of the fields, but the order of columns in the LOAD DATA INFILE can be different from the order the are in the table. So you can import fields into a table in any order you want. Additionally, you can import fields you don’t want into a temporary variable multiple times and their data will be discarded; the variable itself disappears when the client session is terminated.

Let’s drop the clements_list_import table and re-create it without the generic columns that we don’t need. Let’s also put the columns in a different order. Enter the following two SQL statements on your server:

DROP TABLE rookery.clements_list_import;

CREATE TABLE rookery.clements_list_import

(id INT, scientific_name VARCHAR(255),

english_name VARCHAR(255), family VARCHAR(255),

bird_order VARCHAR(255), change_type VARCHAR(255));

Now we have only the columns we want in this import table. We have the family before the bird_order, and we put the change_type last.

Now let’s load the data again. This time we’ll provide a list of columns and variables to map the fields where we want. We’ll direct data from unwanted fields to a temporary variable, @niente. Any name is fine. Niente means nothing in Italian. Execute this SQL statement on your server:

LOAD DATA INFILE '/tmp/Clements-Checklist-6.9-final.csv'

INTO TABLE rookery.clements_list_import

FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'

IGNORE 1 LINES

(id, change_type, @niente, @niente,

scientific_name, english_name,

@niente, bird_order, family, @niente,

@niente, @niente, @niente, @niente,

@niente, @niente, @niente, @niente);

Query OK, 32180 rows affected (0.66 sec)

Records: 32180  Deleted: 0  Skipped: 0  Warnings: 0

The list of columns and variables are in the order of the fields in the CSV data text file. The fields we want to store in the table have the names of the columns with which MySQL is to associate them. They’re in a different order from the table, but MySQL will handle them the way we want. The contents of the fields we want are stored in the @niente variable, replacing its value each time. This works fine and without any warnings. Let’s select the last two new species from the table to see how the data looks now:

SELECT * FROM rookery.clements_list_import

WHERE change_type='new species'

ORDER BY id DESC LIMIT 2 \G

*************************** 1. row ***************************

             id: 30193

scientific_name: Sporophila beltoni

   english_name: Tropeiro Seedeater

         family: Thraupidae (Tanagers and Allies)

     bird_order: Passeriformes

    change_type: new species

*************************** 2. row ***************************

             id: 26879

scientific_name: Zoothera atrigena

   english_name: Bougainville Thrush

         family: Turdidae (Thrushes and Allies)

     bird_order: Passeriformes

    change_type: new species

Your results might be different, depending on which file you downloaded from Cornell’s site. We can see here, though, that the data is in the correct columns. We can now simply run the INSERT INTO...SELECT statement to copy the new bird species into the birds_new and then to the birdstable — or directly to the birds table if we’re feeling confident about our abilities to import data. This is much better than our first pass at loading the data, but we can do better. Let’s try loading the data again, but this time let’s get rid of those common names in the family column.

Setting Columns

If you want to process the values found in a field before loading them into a column in a table, you can use the SET clause of the LOAD DATA INFILE statement to do that. In the previous examples, we used SUBSTRING() in the INSERT INTO...SELECT statement to eliminate common names contained within parentheses from the family column in the clements_list_import table. Let’s try loading the data again, but this time let’s get rid of those common names when they’re loaded into the family column. Delete and load the data again by running these two SQL statements:

DELETE FROM rookery.clements_list_import;

LOAD DATA INFILE '/tmp/Clements-Checklist-6.9-final.csv'

INTO TABLE rookery.clements_list_import

FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'

IGNORE 1 LINES

(id, change_type, @niente, @niente,

scientific_name, english_name,

@niente, bird_order, @family, @niente,

@niente, @niente, @niente, @niente,

@niente, @niente, @niente, @niente, @niente)

SET family = SUBSTRING(@family, 1, LOCATE(' (', @family) );

This is the same as the previous LOAD DATA INFILE statement, except that we are storing the family name in a variable called @family and we added the SET clause. This clause sets the value of columns in the table into which data is loaded. Here we are setting the value of the family column to the value returned by SUBSTRING(), which is extracting a substring from the @family variable. Let’s see how well that did by selecting just one of the new species, the Treehunter bird:

SELECT * FROM rookery.clements_list_import

WHERE change_type='new species'

AND english_name LIKE '%Treehunter%' \G

*************************** 1. row ***************************

             id: 13864

scientific_name: Cichlocolaptes mazarbarnetti

   english_name: Cryptic Treehunter

         family: Furnariidae

     bird_order: Passeriformes

    change_type: new species

We can see here that the data is in the correct columns. In addition, the parenthetical text listing common names of birds in the family has been removed. If we want, we can run the INSERT INTO...SELECT statement again to copy the data for new species to the birds table.

More Field and Line Definitions

Not all data text files will be constructed like the Cornell CSV data text file we used in the examples so far in this chapter. Some files format the fields and lines differently. Let’s load a different data text file to learn about other ways to define lines and fields with the LOAD DATA INFILEstatement.

For the examples in this section, let’s refer back to an earlier example (Extracting Text) in which our marketing agency gave us a table in a dump file containing prospects for our site. This time, let’s assume the marketing agency gave us a data text file. The text file is named birdwatcher-prospects.csv and contains a list of names and email addresses of people who might want to be members of the Rookery site. You can download a copy of this file from the MySQL Resources site. Here are the first few lines of that text file:

["prospect name"|"prospect email"|"prospect country"]

["Mr. Bogdan Kecman"|"bodgan\@kecman-birds.com"|"Serbia"]

["Ms. Sveta Smirnova"|"bettasveta\@gmail.com"|"Russia"]

["Mr. Collin Charles"|"callincollin\@gmail.com"|"Malaysia"]

["Ms. Sveta A. Smirnova"|"bettasveta\@gmail.com"|"Russia"]

The first line lists the name of the fields. Lines start with an opening bracket and end with a closing bracket. Fields are enclosed within double quotes and separated by a vertical bar. The ampersand is preceded with a backslash as an escape character, to indicate that the character that follows it is a literal character. To import the data, we’ll have to allow for all of these details so that MySQL knows when a record starts and ends, when a field starts and ends, and how characters are escaped.

Starting, Terminating, and Escaping

Before loading the birdwatcher-prospects.csv file, let’s create a table in which to import its contents. In addition to columns for each of the three fields in the data text file, we’ll add an incremental column as the primary key. Because email addresses are generally taken by individuals, we’ll make the column for the prospect’s email address a UNIQUE key column. Execute the following SQL statement to create this table:

CREATE TABLE birdwatchers.birdwatcher_prospects_import

(prospect_id INT AUTO_INCREMENT KEY,

prospect_name VARCHAR(255),

prospect_email VARCHAR(255) UNIQUE,

prospect_country VARCHAR(255));

That creates the import table. Let’s load the data from the birdwatcher-prospects.csv file into it. Execute the following SQL statement:

LOAD DATA INFILE '/tmp/birdwatcher-prospects.csv'

INTO TABLE birdwatchers.birdwatcher_prospects_import

FIELDS TERMINATED BY '|' ENCLOSED BY '"' ESCAPED BY '\\'

LINES STARTING BY '[' TERMINATED BY ']\r\n'

IGNORE 1 LINES

(prospect_name, prospect_email, prospect_country);

Although this SQL statement is correct, if you loaded the birdwatcher-prospects.csv file, it generated an error and no data was inserted into the table. We’ll address that error in the next section. Let’s focus now on the subclauses of the FIELDS and LINES clause included in the LOAD DATA INFILEstatement here.

First, let’s look at the FIELDS clause:

§  The TERMINATED BY subclause says that fields end with a vertical bar. The last field doesn’t have one, but because we’ll let the statement know it’s the end of the line, MySQL will then assume the last field has ended.

§  The ENCLOSED BY subclause says that each field is positioned between double quotes.

§  The ESCAPED BY clause specified the character that’s used to escape special characters. The default is a backslash. So there’s no need to include this subclause for this data text file, but I wanted you to be aware that it exists.

Let’s look now at the LINES clause:

§  The STARTING BY subclause specifies an opening bracket.

§  The TERMINATED BY subclause specifies a closing bracket followed by a carriage return and a newline. Normally, a newline is sufficient. But this data text file was created on a MS Windows computer with an application that ends lines this way.

Replacing Data Versus Ignoring Errors

Let’s address the error generated by executing the LOAD DATA INFILE statement in the previous section. The following error message appeared when that SQL statement was run:

ERROR 1062: Duplicate entry 'bettasveta@gmail.com' for key 'prospect_email'

This error was caused because there are two identical email addresses for Sveta Smirnova in the data text file and we stipulated that the prospect_email be unique. Because of the error, the entire import was rolled back and no data was inserted.

We have a few choices of how to handle such an error. We could modify the table so that the prospect_email column allows for duplicate email addresses. Another choice would be to tell MySQL to ignore any errors like this. To do this, we would add the IGNORE option to the LOAD DATA INFILE statement. Try entering this:

LOAD DATA INFILE '/tmp/birdwatcher-prospects.csv'

IGNORE INTO TABLE birdwatchers.birdwatcher_prospects_import

FIELDS TERMINATED BY '|' ENCLOSED BY '"' ESCAPED BY '\\'

LINES STARTING BY '[' TERMINATED BY ']\r\n'

IGNORE 1 LINES

(prospect_name, prospect_email, prospect_country);

Query OK, 4 rows affected, 1 warning (0.02 sec)

Records: 5  Deleted: 0  Skipped: 1  Warnings: 1

SHOW WARNINGS \G

*************************** 1. row ***************************

  Level: Warning

   Code: 1062

Message: Duplicate entry 'bettasveta@gmail.com' for key 'prospect_email'

This worked. Notice the results message. It says one row was skipped and there’s a warning. The warning in turn says there’s a duplicate entry. That’s the row it skipped, the second entry for Sveta. Let’s execute a SELECT statement to see how the row for Sveta looks now:

SELECT * FROM birdwatchers.birdwatcher_prospects_import

WHERE prospect_name LIKE '%Sveta%' \G

*************************** 1. row ***************************

     prospect_id: 16

   prospect_name: Ms. Sveta Smirnova

  prospect_email: bettasveta@gmail.com

prospect_country: Russia

This shows that the first record for Sveta was inserted into the table, but the second one was not. We know this because the second record included a middle initial for her name. If we prefer that duplicate records replace previous ones, we can replace the IGNORE option with the REPLACE option. The statement would then be entered as follows:

LOAD DATA INFILE '/tmp/birdwatcher-prospects.csv'

REPLACE INTO TABLE birdwatchers.birdwatcher_prospects_import

FIELDS TERMINATED BY '|' ENCLOSED BY '"' ESCAPED BY '\\'

LINES STARTING BY '[' TERMINATED BY ']\n'

IGNORE 1 LINES

(prospect_name, prospect_email, prospect_country);

Query OK, 6 rows affected (0.02 sec)

Records: 5  Deleted: 1  Skipped: 0  Warnings: 0

SELECT * FROM birdwatchers.birdwatcher_prospects_import

WHERE prospect_name LIKE '%Sveta%' \G

*************************** 1. row ***************************

     prospect_id: 26

   prospect_name: Ms. Sveta A. Smirnova

  prospect_email: bettasveta@gmail.com

prospect_country: Russia

Notice how the results message reads this time. It says that no rows were skipped, but one was deleted. That’s the replacement of the first entry for Sveta. You can see in the results of the SELECT statement that the record containing her middle initial replaced the one without it.

Importing from Outside MySQL

So far in this chapter, we have covered ways to load from within MySQL. It is possible to import data while not logged into MySQL, per se. At a minimum, you can execute the LOAD DATA INFILE statement through the mysql client with the --execute option. However, there is another client made specifically for importing data, the mysqlimport utility. We’ll cover it in this section. This utility, as well as the LOAD DATA INFILE statement, require FILE privileges. But if you don’t have this privilege, there is a way around it. First, let’s cover how to load a data text file located locally without uploading it to the server.

Importing Local Files

If you are not allowed to upload a data text file to the server, you can use LOAD DATA INFILE to load it locally through the mysql client. For this operation, add the LOCAL option. You don’t log onto the server first and start the mysql client on the server with the host as localhost. Instead, you log locally onto the server by entering something like this on your local computer:

mysql --user=admin_import --password \

      --host=mysqlresources.com --database=rookery

Once you have established the connection through the local client, you can execute the SQL statement like so:

LOAD DATA LOCAL INFILE '/tmp/birdwatcher-prospects.csv'

REPLACE INTO TABLE birdwatchers.birdwatcher_prospects_import

FIELDS TERMINATED BY '|' ENCLOSED BY '"' ESCAPED BY '\\'

LINES STARTING BY '[' TERMINATED BY ']\n'

IGNORE 1 LINES

(prospect_name, prospect_email, prospect_country);

Basically, the data text file is read by the client and the contents sent to the server to store in the operating system’s temporary directory (e.g., /tmp).

This works only if the server and client have been configured to allow the LOCAL option. This requires someone to add local-infile=1 to the MySQL configuration file on both systems. Additionally, the user account must have FILE privileges on the server from the remote location. Normally, this isn’t given to a user. But if it’s your server, you can allow it. See Chapter 13 for more information on how to do this.

Using mysqlimport

If you regularly receive a data text file in the same format, you might find it useful to create a simple shell script to load the data into MySQL. For such a task, you can use the mysqlimport utility. It will execute the LOAD DATA INFILE statement with any options you include.

For an example of how this utility may be used, let’s use one of the recent examples of the LOAD DATA INFILE statement, where we loaded data from the birdwatcher-prospects.csv file. For this utility, though, we will have to rename the file to the same as the import table, so the file’s name will be birdwatcher_prospects.csv. I’ll explain this in a moment. For now, try executing the following from the command line on your server:

mysqlimport –user='marie_dyer' --password='sevenangels' \

  --replace --low-priority  --ignore-lines='1' \

  --fields-enclosed-by='"' --fields-terminated-by='|' --fields-escaped-by='\\' \

  --lines-terminated-by=']\r\n' \

  --columns='prospect_name, prospect_email, prospect_country' \

    birdwatchers '/tmp/birdwatcher_prospects_import.csv'

As you can see, all of the options are the sames as their counterparts, but in lowercase letters and preceded by two hyphens. The order of options doesn’t matter, except that the database and filename are last. After the filename, you can list multiple text files separated by spaces, and they will be processed in order by mysqlimport.

The prefix of the filename must be the same as the table — the dot and the extension are ignored. This rule lets the command determine the table into which to load data. Because table names cannot include a hyphen, which could be mistaken for a minus sign, we had to use an underscore.

The mysqlimport utility works the same as LOAD DATA INFILE; in fact, internally it calls that SQL statement. As mentioned, you can include it in a shell script or an entry in crontab to automate the loading of data from a data text file that is periodically replaced on the server.

WARNING

You may have noticed that the --lines-starting-by option was not included in the previous example. That’s because there is no such option for mysqlimport. Paul Dubois, a famous writer specializing in MySQL software, reported this oversight in 2006. So far, nothing has been done to add that option, which tells us that this is not a well-supported utility. In fact, in testing it on my server, I had difficulty getting it to work. If it works on your server, though, that’s fine. If you’re constructing a script to load data, you may want instead to use the LOAD DATA INFILE statement as part of an API script (see Chapter 16). Most scripting languages include modules that can be used to convert data text files.

Importing Without FILE Privileges

Some web hosting companies do not allow the use of LOAD DATA INFILE due to security vulnerabilities it would present for them. They block its use by not giving you FILE privileges. If your database is located on a server on which you don’t have this privilege, it’s possible to get around it, but that requires some extra steps.

First, you will need access to another MySQL server on which you do have FILE privileges. It could be on your own personal computer. We’ll call whatever computer you use your staging server and the other the live server. On the staging server, you will need to create a table identical to the one on the live server into which you want to load the data. You should also create an import table on the live server as we did in earlier examples in this chapter, rather than import directly into the ultimate table.

After you’ve created tables on the staging and live server, execute the LOAD DATA INFILE statement on the staging server to load the data from the text file.

Next, export the data from the table on the staging server using the mysqldump utility (this utility was covered extensively in Chapter 14). Be sure to use the --tables option so that you dump only the import table (see Backing Up Specific Tables), and use the --no-create-info option so that the utility doesn’t include CREATE DATABASE and CREATE TABLE statements in the dump file.

After you’ve created the dump file of the table, upload it to the live server. On the live server, use the mysql client to process the dump file to insert the rows of data into the import table on that server (this method was covered in Restoring Backups). From there, you can use the INSERT INTO...SELECT statement to copy the rows to the appropriate table.

This method is the same as the other methods for loading data, but with the extra steps of loading the data on a staging server and then using mysqldump to dump the data and mysql to insert the data into the appropriate table on the live server. It’s not particularly difficult, just more time consuming.

Bulk Exporting Data

Thus far in this chapter we have looked at how to bulk import data into MySQL and MariaDB from data text files. However, you may be asked to do the opposite, and bulk export data to provide someone with a text file containing data from your MySQL databases. This can be done more easily than importing, so long as you get to decide the layout of the data text file.

The easiest way to bulk export data to a text file is to use the SELECT statement with the INTO OUTFILE clause. This works similarly to the LOAD DATA INFILE statement, with the same subclauses — except that it exports instead of imports data. Let’s look at an example.

Suppose we want to give someone a list of birds from the rookery database. We want specifically to give them a test file containing a list of birds in the Charadriiformes — an order of birds that includes Sea Gulls and Plovers. We want to export the scientific and common name of each bird, and the family name.

We’ll do this in stages. First, let’s construct a SELECT statement to make sure we’re exporting the correct data. Execute this from your server:

SELECT birds.scientific_name,

IFNULL(common_name, ''),

bird_families.scientific_name

FROM rookery.birds

JOIN rookery.bird_families USING(family_id)

JOIN rookery.bird_orders USING(order_id)

WHERE bird_orders.scientific_name = 'Charadriiformes'

ORDER BY common_name;

This SELECT statement includes a JOIN (covered extensively in Chapter 9). We’re joining together the main three tables in the rookery database to get the bird names and the family names for the family order that we want. We’re ordering the list based on common_name. The SELECT...INTO OUTFILE statement will generally convert NULL values to the letter N. So we’re using IFNULL() to change any null values for the common_name to a blank space. That SELECT statement works fine. If you tried it on your server, it should have returned about 718 rows.

To keep anyone receiving the data text from being confused as to what each field represents, let’s include a first row containing field names. The easiest way to do this is to just execute SELECT with a set of strings like this:

SELECT 'scientific name','common name','family name';

These field names don’t have to be the same as the columns in the tables for which they will be associated, and don’t have to conform to any convention for our purposes. We’ll join the results of this SQL statement with the previous one with the UNION, but with the field names first. This was also covered in Chapter 9.

Having tested the SELECT statements, we’re now ready to put them together to export data to a text file. Execute the following on your server:

( SELECT 'scientific name','common name','family name' )

UNION

( SELECT birds.scientific_name,

  IFNULL(common_name, ''),

  bird_families.scientific_name

  FROM rookery.birds

  JOIN rookery.bird_families USING(family_id)

  JOIN rookery.bird_orders USING(order_id)

  WHERE bird_orders.scientific_name = 'Charadriiformes'

  ORDER BY common_name

  INTO OUTFILE '/tmp/birds-list.csv'

  FIELDS ENCLOSED BY '"' TERMINATED BY '|' ESCAPED BY '\\'

  LINES TERMINATED BY '\n');

That should have executed without any problems. Because we’ve already discussed the SELECT statements in general, let’s focus on the INTO OUTFILE clause in the second SELECT statement. First notice that the path for the export file is /tmp. MySQL will generally only write to an accessible directory like this one, one in which everyone on the server has full read and write privileges. Next notice that the subclauses are listed after the file path and name — the opposite of LOAD DATA INFILE. The subclauses, though, are the same.

Here we’re enclosing fields with double quotes and separating them with a vertical bar. We’re using the backslash as the escape character. For the SELECT...INTO OUTFILE statement, you have to include the ESCAPED BY subclause, because there is no default escape character for this statement. There are two backslashes here because the first escapes the second; a backslash by itself is an escape character in this command. Finally, we’re terminating each line with a newline character.

Here are the first few lines of the file generated by the previous SELECT...INTO OUTFILE statement:

"scientific name"|"common name"|"family name"

"Charadrius vociferus"|"Killdeer"|"Charadriidae"

"Charadrius montanus"|"Mountain Plover"|"Charadriidae"

"Charadrius alexandrinus"|"Snowy Plover"|"Charadriidae"

"Pluvialis squatarola"|"Black-bellied Plover"|"Charadriidae"

"Pluvialis fulva"|"Pacific Golden Plover"|"Charadriidae"

"Burhinus vermiculatus"|"Water Thick-knee"|"Burhinidae"

"Burhinus oedicnemus"|"Eurasian Thick-knee"|"Burhinidae"

...

This works nicely. The first line provides a list of field names. The lines that follow organize the text from the columns we selected. This is an easy and simple way to export data to a text file to give to someone using a different database system.

Summary

Although you may rarely use the LOAD DATA INFILE statement, when you need it and use it, you’ll find that it saves you plenty of time. It makes bulk importing of data and migrating to MySQL and MariaDB much easier. Because the layout of data text files can vary so much, you might need a few attempts to load data properly. But as long as you create a temporary import table, you can keep deleting the data and trying again without disturbing anyone else or risking the loss of other data.

The SELECT...INTO OUTFILE statement is an excellent method for sharing data with others. It may become a regularly used tool for you if you work in an organization that shares data with other organizations. So at least be very familiar with it in case your situation calls for it.

Exercises

For the exercises in this chapter, you will need to download the employees.csv and birder-list.csv files from the MySQL Resources site. You should copy it to the /tmp directory, or another directory on your server that is accessible by the mysql system user.

I generated the employees.csv file by using the SELECT...INTO OUTFILE statement to export data from the employee database. This is a large sample database created originally by the staff at MySQL, and is free for download.

1.    Open the employees-list.csv file with a text editor to see how it’s formatted. Then create an import table to match it. When you’re finished, use the LOAD DATA INFILE statement to load the list of employees into the import table you created.

2.    Open the birder-list.csv in a text editor to determine how it’s formatted. It contains a list of people who live in Italy and are prospects for our site. Create in the birdwatchers database an import table with columns with these names and in this order: id, formal_title, name_first,name_last, country, and email. Make the id column an automatically incremented key column.
Construct a LOAD DATA INFILE statement to load the data from the birder-list.csv file into the import table you create. Be sure to provide a list of column names with this SQL statement. Use the SET clause to set the value of formal_title when loading the data. Female Italian names generally end with the letter a. Male Italian names end generally with the letter o, but sometimes with i, or e. Use these assumptions to have MySQL make a reasonable guess as to the person’s title of either Ms. or Mr. when loading the data. When ready, run the LOAD DATA INFILE you constructed to load the data into the import table.
When finished, execute a SELECT statement to make sure the data loaded properly. If it didn’t, delete the data in the import table and try again until you get it right. Once you’ve successfully loaded the data, run a INSERT INTO...SELECT statement to add the names to the humans table.

3.    Using the SELECT...INTO OUTFILE statement, export a list of birds with the word Least in their common name to a text file named little-birds.csv. Export the common and scientific name of the birds, as well as the scientific names of the family and order to which they belong. Enclose the fields in double quotes, and separate them with commas. End the lines with a semicolon, but without a line ending (i.e., no \n or \r). This should cause the CSV file to write all of the text to one long line. After you’ve exported the data, open the file with a text editor to verify that the data is contained on one line.

4.    Create a table named birds_least in the rookery database. It should have four columns: scientific_name, common_name, family_name, and order_name. Load this table with data from the little-birds.csv you generated in the previous exercise, using the LOAD DATA INFILE statement. This may be a little tricky. If you don’t do it right, delete the data in the import table and load it again until you get the values in the right columns.