R Recipes: A Problem-Solution Approach (2014)

Chapter 3. Data Structures

As a refresher, the basic data structures in R are vectors, matrices, lists, and data frames. Remember R does not recognize a scalar quantity, instead treating that quantity as a vector of length 1. In Chapter 3, you will learn what you need to know about working with the various data structures in R.

Recipe 3-1. How to Work with Vectors

Problem

Vectors were introduced in Chapter 1, and were described as the fundamental data type in R. In Recipe 3-1, you will learn more about working with vectors, adding and deleting elements, and subsetting vectors. You will also learn more about how vectors relate to other data types in R and how to perform vector operations.

Solution

As you recall, a vector can be any of the six atomic types, but a vector must contain elements of only one data type. As you learned in Chapter 1, you can create a vector from the R Console or the R Editor by entering values with either the c() or the scan() function.

Remember that if you work with vectors of different lengths, R will recycle the elements of the shorter vector to match the length of the longer vector. This is often exactly what you want to do, but sometimes it is not. When you use vectors that are mismatched, that is, in which the longer vector’s length is not a multiple of the shorter vector’s length, R will give you a warning to that effect:

> x <- 1:10
> y <- 1:5
> x/y
 [1] 1.000000 1.000000 1.000000 1.000000 1.000000 6.000000 3.500000 2.666667
 [9] 2.250000 2.000000
> z <- 1:3
> x/z
 [1]  1.0  1.0  1.0  4.0  2.5  2.0  7.0  4.0  3.0 10.0
Warning message:
In x/z : longer object length is not a multiple of shorter object length

Because the length of x is a multiple of the length of y, the division produced no warning. In the second example, the numbers 1, 2, and 3 were recycled so that on the 10th division, 1 was the element of z divided into 10. Next, examine vector arithmetic in R.
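The recycling rule can be made explicit with rep_len(). The following is a minimal sketch, not part of the original session; the names z_recycled, implicit, and explicit are illustrative only.

```r
# Recycling made explicit: rep_len() stretches the shorter vector
# to the length of the longer one, which is what R does implicitly.
x <- 1:10
z <- 1:3
z_recycled <- rep_len(z, length(x))    # 1 2 3 1 2 3 1 2 3 1

# R warns here because 10 is not a multiple of 3; we silence the warning
implicit <- suppressWarnings(x / z)
explicit <- x / z_recycled             # same values, no warning

print(z_recycled)
print(identical(implicit, explicit))   # TRUE
```

Writing the recycling out by hand is rarely necessary, but it is a useful check when a length mismatch is unintentional.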

As long as the vectors have the same length, all is well. Arithmetic operations work on vectors elementwise. That is, the operation is performed for the first element of each vector, then for the second, and so on until the last pair of elements is reached.

> xvec
 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13
> yvec
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14
> zvec <- xvec + yvec
> zvec
 [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27
> xvec - yvec
 [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
> xvec * yvec
 [1]   0   2   6  12  20  30  42  56  72  90 110 132 156 182
> xvec / yvec
 [1] 0.0000000 0.5000000 0.6666667 0.7500000 0.8000000 0.8333333 0.8571429
 [8] 0.8750000 0.8888889 0.9000000 0.9090909 0.9166667 0.9230769 0.9285714

Vectors are combined by the use of the c() function:

> newVec <- c(xvec, yvec)
> newVec
 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13  1  2  3  4  5  6  7  8  9 10 11
[26] 12 13 14

Remember there is no scalar quantity in R. When you retrieve an element of a vector in R, the result is not really the element itself, but a “vector slice.” We select the slice we need by putting its index or indexes in square brackets ([]). Negative indexes remove an element or elements from a vector. Assigning to an index adds an element or changes the value of an existing one. If we ask for an element by using an out-of-range index, R will report NA. Let’s examine all these operations.

> newVec <- newVec[-1]
> newVec
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13  1  2  3  4  5  6  7  8  9 10 11 12
[26] 13 14
> vecSlice <- newVec[1:13]
> vecSlice
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13
> vecSlice[14] <- 14
> vecSlice
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14
> vecSlice[15]
[1] NA

It is also possible to use a logical index vector to slice a new vector from a given vector. The logical vector should normally be the same length as the original vector; if it is shorter, R silently recycles it. The following code illustrates this:

> xVec <- 1:10
> logicVec <- c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)
> xVec[logicVec]
[1] 1 3 5 7 9
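In practice, a logical index is usually produced by a comparison rather than typed by hand. The sketch below uses a small made-up vector; the names x, is_large, and positions are illustrative only.

```r
# Build a logical index from a comparison instead of typing TRUE/FALSE
x <- c(12, 7, 3, 18, 5)
is_large <- x > 6            # TRUE TRUE FALSE TRUE FALSE

large_values <- x[is_large]  # 12 7 18
positions <- which(is_large) # 1 2 4 (positions, not values)

print(large_values)
print(positions)
```

The which() function converts the logical vector into the integer positions of its TRUE elements, which is handy when you need to know where the matches are.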

We can assign names to the elements of vectors. Let us switch to a character vector for this illustration. The names can be used to retrieve and reorder the elements of the vector:

> charVec <- c("Phyllis","Argo")
> charVec
[1] "Phyllis" "Argo"
> names(charVec) <- c("FirstName","LastName")
> charVec
FirstName  LastName
"Phyllis"    "Argo"
> charVec[c("LastName","FirstName")]
 LastName FirstName
   "Argo" "Phyllis"

The rep() function (short for replicate) creates a vector by repeating the same entry any number of times:

> X <- rep(1,10)
> X
 [1] 1 1 1 1 1 1 1 1 1 1
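The rep() function also accepts the times and each arguments, which control how the repetition is done. A brief sketch with made-up values:

```r
# times repeats the whole vector; each repeats every element in turn
whole   <- rep(c("a", "b"), times = 3)        # "a" "b" "a" "b" "a" "b"
by_elem <- rep(c("a", "b"), each = 2)         # "a" "a" "b" "b"
uneven  <- rep(c("a", "b"), times = c(3, 1))  # "a" "a" "a" "b"

print(whole)
print(by_elem)
print(uneven)
```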

The sequence function seq() takes a starting value, an ending value, and an increment value (the from, to, and by arguments). For example:

> z <- seq(-4, 4, 0.1)
> z
 [1] -4.0 -3.9 -3.8 -3.7 -3.6 -3.5 -3.4 -3.3 -3.2 -3.1 -3.0 -2.9 -2.8 -2.7 -2.6
[16] -2.5 -2.4 -2.3 -2.2 -2.1 -2.0 -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.3 -1.2 -1.1
[31] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3  0.4
[46]  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9
[61]  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4
[76]  3.5  3.6  3.7  3.8  3.9  4.0
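Instead of an increment, seq() can be given the desired number of values through length.out. The helpers seq_len() and seq_along() are safer than 1:n when n might be zero. The following sketch uses illustrative names only.

```r
# Same sequence specified two ways: by increment and by length
by_step   <- seq(0, 1, by = 0.25)
by_length <- seq(0, 1, length.out = 5)
print(all.equal(by_step, by_length))   # TRUE

# The sequence from the text has 81 values
z <- seq(-4, 4, 0.1)
print(length(z))                       # 81

# seq_len(0) is empty, whereas 1:0 counts down to give 1 0
empty   <- seq_len(0)
downcnt <- 1:0
idx     <- seq_along(c("x", "y", "z")) # 1 2 3
```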

Recall that missing values are represented in R by NA, and that many R functions, such as mean() and sum(), return NA when there are missing data unless you set the argument na.rm to TRUE.
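A short sketch of the na.rm argument, using a made-up vector of scores:

```r
# One missing value is enough to make mean() return NA
scores <- c(10, NA, 14)

raw_mean   <- mean(scores)                # NA
clean_mean <- mean(scores, na.rm = TRUE)  # 12
n_missing  <- sum(is.na(scores))          # 1

print(raw_mean)
print(clean_mean)
print(n_missing)
```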

Recipe 3-2. How to Work with Matrices

Problem

As you learned in Chapter 1, R provides many operations for dealing with matrices. We will use real data taken from the General Social Survey (GSS) for this recipe so that you can see the power of R for working with matrices.

Solution

Matrices and vectors are related, as we have discussed before. A matrix is a vector with dimensions. The elements of the matrix must be of the same basic data type. Use the matrix() function to create a matrix. As you will recall, when you create a matrix from data elements, R will fill the matrix columnwise by default. Matrix transposition is the interchanging of rows and columns. We accomplish matrix transposition by the function t(). Matrix inversion is done by the solve() function. An illustration of these operations follows.
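Before turning to the survey data, here is a small sketch of these operations on made-up matrices. The names m, m_byrow, A, and A_inv are illustrative only.

```r
# matrix() fills columnwise by default; byrow = TRUE fills rowwise
m       <- matrix(1:6, nrow = 2)                # rows: 1 3 5 / 2 4 6
m_byrow <- matrix(1:6, nrow = 2, byrow = TRUE)  # rows: 1 2 3 / 4 5 6

# t() swaps rows and columns: a 2 x 3 matrix becomes 3 x 2
m_t <- t(m)
print(dim(m_t))

# solve() inverts a square, nonsingular matrix
A     <- matrix(c(2, 1, 1, 3), nrow = 2)
A_inv <- solve(A)

# A times its inverse is the identity matrix, up to rounding error
print(all.equal(A %*% A_inv, diag(2)))
```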

First, let us extract some variables from the GSS dataset discussed earlier. We will use the cbind() function to create a matrix. Let us use the job satisfaction variable as the dependent variable Y. We will create an Xij matrix by combining a vector of 1s with the age, income, and job security variables. Then we will transpose the Xij matrix and solve for the regression coefficients using matrix operations. You may recall from a statistics class along the way that the column of 1s allows us to estimate the intercept along with the other unstandardized regression coefficients. The data frame is a list, as you learned earlier, so we can extract each variable by name. We start with the following data.

> head(jobSat)
  age sex race jobsecok income06 satjob7
1  22   1    1        2       25       2
2  36   2    2        2       19       4
3  36   1    1        3       19       3
4  47   2    2        2       18       3
5  54   1    1        3       22       5
6  45   2    3        1       24       2

So far, so good. Now for a little matrix wizardry. We create the vector Y, the matrix Xij, and then solve for the regression coefficients using matrix algebra. To explain, the vector Y is simply the column of job satisfaction scores. The matrix Xij is created from a vector of 1s and the age, income, and job security variables. See the following code:

> Y <- jobSat$satjob7
> ones <- rep(1, 695)
> Xij <- cbind(index = ones, age = jobSat$age, income06 = jobSat$income06, jobsecok = jobSat$jobsecok)

With the column of 1s added, our Xij matrix looks like this:

> head(Xij)
     index age income06 jobsecok
[1,]     1  22       25        2
[2,]     1  36       19        2
[3,]     1  36       19        3
[4,]     1  47       18        2
[5,]     1  54       22        3
[6,]     1  45       24        1

Now, use the traditional matrix formula $B = (X'X)^{-1}X'Y$ to solve for the regression coefficients.

> transpose <- t(Xij)
> product <- transpose %*% Xij
> product
         index     age income06 jobsecok
index      695   29451    12569     1094
age      29451 1371371   538411    46474
income06 12569  538411   245013    19675
jobsecok  1094   46474    19675     2118
> inverse <- solve(product)
> inverse
                 index           age      income06
index     0.0376712386 -2.949646e-04 -9.507791e-04
age      -0.0002949646  8.236163e-06 -2.714459e-06
income06 -0.0009507791 -2.714459e-06  5.747651e-05
jobsecok -0.0041537160 -3.148802e-06  1.673929e-05
              jobsecok
index    -4.153716e-03
age      -3.148802e-06
income06  1.673929e-05
jobsecok  2.531236e-03
> B <- inverse %*% (transpose %*% Y)
> B
            [,1]
index     2.5233
age      -0.0107
income06 -0.0152
jobsecok  0.5213

Just for comparison purposes, do this analysis using R’s linear model lm() function.

> Model <- lm(Y ~ age + income06 + jobsecok, data = jobSat)
> summary(Model)

Call:
lm(formula = Y ~ age + income06 + jobsecok, data = jobSat)

Residuals:
   Min     1Q Median     3Q    Max
-3.039 -0.811 -0.142  0.611  4.453

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.52326    0.22008   11.47   <2e-16 ***
age         -0.01071    0.00325   -3.29   0.0011 **
income06    -0.01525    0.00860   -1.77   0.0766 .
jobsecok     0.52134    0.05705    9.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.13 on 691 degrees of freedom
Multiple R-squared:  0.126,     Adjusted R-squared:  0.122
F-statistic: 33.2 on 3 and 691 DF,  p-value: <2e-16

As you see, the coefficients are the same as the ones we calculated using matrix algebra.
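The same comparison can be tried on data that ships with R. The sketch below is an assumption-laden stand-in for the GSS analysis: it uses the built-in women dataset, and it calls solve() with two arguments to solve the normal equations directly rather than forming the inverse first. This is a common numerical preference, not the method used in the text.

```r
# Design matrix: a column of 1s for the intercept plus one predictor
X <- cbind(intercept = 1, height = women$height)
y <- women$weight

# crossprod(X) is t(X) %*% X, and crossprod(X, y) is t(X) %*% y.
# solve(a, b) solves a %*% B = b without computing the inverse explicitly.
B <- solve(crossprod(X), crossprod(X, y))
print(B)

# Compare with the coefficients from lm()
fit <- lm(weight ~ height, data = women)
print(coef(fit))
print(all.equal(as.vector(B), unname(coef(fit)), tolerance = 1e-6))
```

The two sets of coefficients agree up to rounding error. Internally, lm() uses a QR decomposition, which is more stable than the normal equations when predictors are highly correlated.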

Recipe 3-3. How to Work with Lists

Problem

Lists are another very important data structure in R. The advantage of a list is that it can combine multiple data types. Recall that indexing is done differently for lists than for vectors and matrices. Lists form the basis for objects such as data frames, and are useful for combining mismatched vectors, as you will soon learn.

Solution

You have learned that there are six atomic vector types in R. A list is a vector, too, but unlike the atomic vectors, which cannot be broken down any further, lists are a special kind of recursive vector. A common application for a list is to store a set of named values of different types; if you are familiar with Python, you will immediately think of a dictionary. In Recipe 3-3, you will learn how to work with lists, including how to create a list, how to access list components and values, and how to apply functions to lists.

Creating a List

Use the list() function to create a list. You might think of a list as a generic vector that can contain other objects. For illustrative purposes, I will create an inventory of a few of the books lying around my desk and in the nearby bookshelves. I include the title, the year of publication, and the author. This is the kind of bibliographic information one might be interested in when creating a reference list. First, I entered the information for one of my favorite books.

> book <- list(title="Exploratory Data Analysis", year=1977,author="John W. Tukey")
> book
$title
[1] "Exploratory Data Analysis"

$year
[1] 1977

$author
[1] "John W. Tukey"

Note that it is not necessary to add component names (also known as tags), but they are helpful. We can use the names to retrieve list components. Recall that we index the elements (or components) of a list by using bracket notation, but we can do so in two different ways. We can use either single square brackets ([]) or double square brackets ([[]]), and the results will be different. Using single brackets results in a list, whereas using double brackets results in the component itself, with the type of that component. To illustrate, see that our list has three components. Even though book1 and book2 look the same, they are different types of data. Lists can also contain other lists, and they are combined in the same way vectors are.

> book <- list(title="Exploratory Data Analysis", year=1977,author="John W. Tukey")
> book
$title
[1] "Exploratory Data Analysis"

$year
[1] 1977

$author
[1] "John W. Tukey"

> book$title
[1] "Exploratory Data Analysis"
> book$year
[1] 1977
> book$author
[1] "John W. Tukey"
> book1 <- book[1]
> book2 <- book[[1]]
> book1
$title
[1] "Exploratory Data Analysis"

> book2
[1] "Exploratory Data Analysis"
> typeof(book1)
[1] "list"
> typeof(book2)
[1] "character"
> book2 <- list(title="Statistics for the Social Sciences",year=1973,author="William L. Hays")
> books <- c(book1, book2)
> books
$title
[1] "Exploratory Data Analysis"

$title
[1] "Statistics for the Social Sciences"

$year
[1] 1973

$author
[1] "William L. Hays"
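Note that in the session above, book1 holds only the title component, so the combined list has four components. The following sketch contrasts combining two complete lists with c(), which flattens them, against nesting them with list(), which keeps each book as a unit. The names book_a, book_b, flat, and shelf are illustrative only.

```r
book_a <- list(title = "Exploratory Data Analysis",
               year = 1977, author = "John W. Tukey")
book_b <- list(title = "Statistics for the Social Sciences",
               year = 1973, author = "William L. Hays")

flat  <- c(book_a, book_b)     # one flat list of 6 components
shelf <- list(book_a, book_b)  # a list containing 2 lists

print(length(flat))            # 6
print(length(shelf))           # 2

# Double brackets pick a book; $ or [[ ]] then picks a component
print(shelf[[2]]$title)
print(shelf[[2]][["year"]])
```

Nesting is usually the better choice when each record should stay intact, because the flat version ends up with duplicate component names.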

Adding and Deleting List Components

To add a component to an existing list, simply assign it using a new name and value, or add a list element by using vector indexing:

> newList <- list(a = 1, b = 2, c = 3)
> newList
$a
[1] 1

$b
[1] 2

$c
[1] 3

> newList$d <- 4
> newList
$a
[1] 1

$b
[1] 2

$c
[1] 3

$d
[1] 4

> newList$e <- 5
> newList[6] <- 6
> newList
$a
[1] 1

$b
[1] 2

$c
[1] 3

$d
[1] 4

$e
[1] 5

[[6]]
[1] 6

Recall that for vectors, we simply use a negative index, as in [-3], to remove an element. With lists, the way to delete a list element is to assign the special value NULL to the component. Here is an example. Assume I accidentally entered a line feed at the end of entry 9 in my list. I want to delete that entry but leave the others as they are. When I assign NULL to the 9th entry, it is removed from my list, and the length of the list is reduced accordingly.

> Grades <- list("A","B","A","B+","C","F","A-","D","B-
+ ","C+")
> Grades
[[1]]
[1] "A"

[[2]]
[1] "B"

[[3]]
[1] "A"

[[4]]
[1] "B+"

[[5]]
[1] "C"

[[6]]
[1] "F"

[[7]]
[1] "A-"

[[8]]
[1] "D"

[[9]]
[1] "B-\n"

[[10]]
[1] "C+"
> Grades[[9]] <- NULL
> Grades
[[1]]
[1] "A"

[[2]]
[1] "B"

[[3]]
[1] "A"

[[4]]
[1] "B+"

[[5]]
[1] "C"

[[6]]
[1] "F"

[[7]]
[1] "A-"

[[8]]
[1] "D"

[[9]]
[1] "C+"
> length(Grades)
[1] 9
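Assigning NULL is not the only option. A negative index inside single brackets also works on a list, returning a new list without the element. The sketch below, with made-up data, shows both approaches, plus the idiom for storing NULL as a value rather than deleting the component.

```r
grades <- list("A", "B", "C", "D")

grades[[2]] <- NULL      # removes "B" in place; length is now 3
trimmed <- grades[-1]    # new list without the first element; length 2

print(length(grades))
print(length(trimmed))

# To keep a component whose value is NULL, assign list(NULL)
# with single brackets
settings <- list(a = 1, b = 2)
settings["b"] <- list(NULL)
print(length(settings))  # still 2
```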

Applying Functions to Lists

The lapply() and sapply() functions can be used to apply R functions to lists. In the following code, I compare the quiz scores for two sections of the same statistics class I am currently teaching. The lapply() function applies the mean() function and returns a list, whereas the sapply() function applies the mean() function and returns a vector.

> quizzes <- list(sect1 = c(10,18,16,16,16,18,14,18,6,20),
+ sect2 = c(18,16,12,16,16,14,18,18,10,14,20,6,16,16,10,14))
> lapply(quizzes,mean)
$sect1
[1] 15.2

$sect2
[1] 14.625

> sapply(quizzes,mean)
  sect1  sect2
15.200 14.625
> meansList <- lapply(quizzes,mean)
> typeof(meansList)
[1] "list"
> meansVec <- sapply(quizzes,mean)
> typeof(meansVec)
[1] "double"
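Because sapply() chooses its return type based on the results, it can surprise you on unusual input. The vapply() function is a stricter alternative in which you declare the expected type and length of each result. The following is a sketch using the quiz data above; the name section_means is illustrative.

```r
quizzes <- list(sect1 = c(10, 18, 16, 16, 16, 18, 14, 18, 6, 20),
                sect2 = c(18, 16, 12, 16, 16, 14, 18, 18,
                          10, 14, 20, 6, 16, 16, 10, 14))

# FUN.VALUE declares that each result must be a single number
section_means <- vapply(quizzes, mean, FUN.VALUE = numeric(1))
print(section_means)

# On an empty list, sapply() returns an empty list,
# while vapply() returns an empty numeric vector
print(class(sapply(list(), mean)))
print(class(vapply(list(), mean, FUN.VALUE = numeric(1))))
```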

Recipe 3-4. Working with Data Frames

Problem

You have already learned that the data frame is the most frequently used data structure for statistical analysis, and that a data frame is a kind of list. Like matrices, data frames must be rectangular in that every row and column intersection (cell, if you will) contains a value. Data frames, like all lists, can contain any combination of data types, including integer, numeric, character, and logical. Some character variables are used as factors in statistical analyses.

Let us work with some data I collected concerning graduate students’ writing assignments in a management class I taught. Data included the course section, the student’s sex, the overall course grade, the grades on the Week 2 writing assignment and the Week 6 writing assignment, whether the student used excessive quotation in each assignment, whether the student was documented to have plagiarized the assignment, whether the student voluntarily submitted the assignment to Turnitin.com, the Turnitin similarity indexes for the Week 2 and Week 6 writing assignments, and the percentage of quoted material in each assignment.

Solution

The data described were archival in nature. As a matter of course, I submitted each student’s Week 2 and Week 6 (the final week) written assignments to Turnitin.com. The grades were retrieved from the course gradebook. The data represented three sections of the same online course, with 55 students in all. There were some missing data, as one might expect. Students probably committed more plagiarism than the data indicate, because I did not count suspected plagiarism, but only the specific incidents I could document.

Creating a Data Frame and Accessing Data Frame Elements

Data frames can contain any type of data, including other data frames, but in this book, we will limit ourselves to data frames containing numbers and character strings. Applying the length() function returns the number of variables in the dataset, while applying the same function to one of the variables returns the number of records:

> plagiarism <- read.csv("plagiarism.csv")
> length(plagiarism)
[1] 17
> length(plagiarism$Course)
[1] 55
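The functions nrow(), ncol(), and dim() report the same information more directly. Since the plagiarism file is not distributed with R, this sketch uses the built-in women dataset as a stand-in.

```r
# length() of a data frame counts its columns, because a data frame is a list
n_vars <- length(women)         # 2
n_rows <- nrow(women)           # 15
n_cols <- ncol(women)           # 2
shape  <- dim(women)            # 15 2

# length() of a single column counts the records
n_obs <- length(women$height)   # 15

print(shape)
```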

You can access data frame elements using matrix-like indexing. You can also use variable names to access individual variables (columns, if you will):

> plagiarism[1,1]
[1] MFE1135A
Levels: MFE1123A MFE1129A MFE1135A
> plagiarism$Course
 [1] MFE1135A MFE1135A MFE1135A MFE1135A MFE1135A MFE1135A MFE1135A MFE1135A
 [9] MFE1135A MFE1135A MFE1135A MFE1129A MFE1129A MFE1129A MFE1129A MFE1129A
[17] MFE1129A MFE1129A MFE1129A MFE1129A MFE1129A MFE1129A MFE1129A MFE1129A
[25] MFE1129A MFE1129A MFE1129A MFE1129A MFE1129A MFE1129A MFE1129A MFE1123A
[33] MFE1123A MFE1123A MFE1123A MFE1123A MFE1123A MFE1123A MFE1123A MFE1123A
[41] MFE1123A MFE1123A MFE1123A MFE1123A MFE1123A MFE1123A MFE1123A MFE1123A
[49] MFE1123A MFE1123A MFE1123A MFE1123A MFE1123A MFE1123A MFE1123A
Levels: MFE1123A MFE1129A MFE1135A

You can filter data and apply functions to data frame elements. For example, let’s find the lowest overall course grade and then plot the distribution of the grades.

> lowestGrade <- min(plagiarism$CourseGr)
> lowestGrade
[1] 51.01
> hist(plagiarism$CourseGr)

Is my reputation as a notoriously easy grader still intact (see Figure 3-1)?

Figure 3-1. Histogram of final course grades

Yes, students have nothing to worry about.
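The min() call above returns the lowest value itself. To find which record holds that value, one option is which.min(), which returns the position of the first minimum. This sketch uses the built-in women dataset as a stand-in for the course data.

```r
# Position of the smallest weight, then the full record at that position
lightest_row <- which.min(women$weight)
print(lightest_row)             # 1
print(women[lightest_row, ])

# A logical filter returns every record tied for the minimum
all_lightest <- women[women$weight == min(women$weight), ]
print(nrow(all_lightest))       # 1
```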

Dealing with Missing Data, Take 2

You will recall that in the preparation of the data, missing data must be coded as NA. R will attempt to deal with missing data, but you often must specify the na.rm = TRUE option. If you simply want to remove all records with missing data, use the complete.cases() function.

> summary(plagiarism)
      Course   Sex       CourseGr         Wk2Gr            Wk6Gr
 MFE1123A:24   F:38   Min.   :51.01   Min.   :  0.00   Min.   :  0.00
 MFE1129A:20   M:17   1st Qu.:89.69   1st Qu.: 85.71   1st Qu.: 87.50
 MFE1135A:11          Median :93.66   Median : 90.00   Median : 92.50
                      Mean   :90.99   Mean   : 88.00   Mean   : 87.68
                      3rd Qu.:96.72   3rd Qu.: 95.00   3rd Qu.: 97.50
                      Max.   :99.68   Max.   :100.00   Max.   :100.00
 ExcQuote1 Plagiarized1 Plagiarized2 VoluntaryTII          Source1
 No :33    No :48       No :54       No :43       Internet     :10
 Yes:22    Yes: 7       Yes: 1       Yes:12       Publication  : 2
                                                  StudentPapers: 9
                                                  None         :34

        Wk2Sim          Wk2Inc         PctQuot1          Wk6Sim
 Min.   : 0.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.00
 1st Qu.: 4.00   1st Qu.: 8.00   1st Qu.: 0.000   1st Qu.: 3.00
 Median : 8.00   Median :17.00   Median : 4.000   Median : 5.00
 Mean   :12.73   Mean   :19.98   Mean   : 7.255   Mean   :10.16
 3rd Qu.:14.50   3rd Qu.:28.00   3rd Qu.:13.500   3rd Qu.:14.00
 Max.   :79.00   Max.   :79.00   Max.   :29.000   Max.   :42.00
                                                  NA's   :10
     Wk6Inc         PctQuot2      ExcQuote2
 Min.   : 0.00   Min.   : 0.000   No  :32
 1st Qu.: 6.00   1st Qu.: 1.000   Yes :13
 Median :10.00   Median : 2.000   NA's:10
 Mean   :13.33   Mean   : 3.178
 3rd Qu.:17.00   3rd Qu.: 4.000
 Max.   :46.00   Max.   :15.000
 NA's   :10      NA's   :10
> plagiarism2 <- plagiarism[complete.cases(plagiarism),]
> summary(plagiarism2)
      Course    Sex        CourseGr            Wk2Gr            Wk6Gr
 MFE1123A:14   F:31   Min.   :69.99   Min.   :  0.00   Min.   : 70.00
 MFE1129A:20   M:14   1st Qu.:89.81   1st Qu.: 85.71   1st Qu.: 87.50
 MFE1135A:11          Median :94.52   Median : 90.00   Median : 92.50
                      Mean   :92.16   Mean   : 87.71   Mean   : 91.17
                      3rd Qu.:96.85   3rd Qu.: 94.29   3rd Qu.: 97.50
                      Max.   :99.68   Max.   :100.00   Max.   :100.00
 ExcQuote1 Plagiarized1 Plagiarized2 VoluntaryTII          Source1
 No :30    No :40       No :44       No :36       Internet     : 7
 Yes:15    Yes: 5       Yes: 1       Yes: 9       Publication  : 2
                                                  StudentPapers: 5
                                                  None         :31

        Wk2Sim          Wk2Inc        PctQuot1          Wk6Sim
 Min.   : 0.00   Min.   : 2.0   Min.   : 0.000   Min.   : 0.00
 1st Qu.: 4.00   1st Qu.: 8.0   1st Qu.: 0.000   1st Qu.: 3.00
 Median : 8.00   Median :15.0   Median : 4.000   Median : 5.00
 Mean   :12.91   Mean   :19.8   Mean   : 6.889   Mean   :10.16
 3rd Qu.:13.00   3rd Qu.:24.0   3rd Qu.: 8.000   3rd Qu.:14.00
 Max.   :79.00   Max.   :79.0   Max.   :29.000   Max.   :42.00
        Wk6Inc         PctQuot2   ExcQuote2
 Min.   : 0.00   Min.   : 0.000      No :32
 1st Qu.: 6.00   1st Qu.: 1.000      Yes:13
 Median :10.00   Median : 2.000
 Mean   :13.33   Mean   : 3.178
 3rd Qu.:17.00   3rd Qu.: 4.000
 Max.   :46.00   Max.   :15.000
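To see exactly what complete.cases() does, it helps to try it on a tiny example. The data frame below is made up for illustration; the related na.omit() function gives the same rows in one step.

```r
df <- data.frame(id    = 1:4,
                 score = c(90, NA, 75, 88),
                 grade = c("A", "B", NA, "B"))

# TRUE only for rows with no missing value in any column
keep <- complete.cases(df)   # TRUE FALSE FALSE TRUE
df_complete <- df[keep, ]
df_omit     <- na.omit(df)   # same rows

# Restrict the check to selected columns to keep more data
partial <- df[complete.cases(df[, c("id", "score")]), ]

print(nrow(df_complete))     # 2
print(nrow(partial))         # 3
```

Dropping every incomplete row can discard a lot of data, so checking only the columns an analysis actually uses is often preferable.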

Subsetting Data

You can subset data in many ways. For example, using the dataset women from the R distribution, you can select only women who are above the median in weight. The parentheses around the assignment in (women2 <- women[women$weight > 135, ]) permit you to save a step by showing the selected data immediately.

> women
   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164

> summary(women$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  115.0   124.5   135.0   136.7   148.0   164.0
> (women2 <- women[women$weight > 135, ])
   height weight
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164

As you have seen, you can use matrix-type indexing with a data frame. For example, to select only the vector of women’s heights, do the following:

> (height <- women[,1])
 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
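Selecting a column by name is usually more robust than selecting it by position, because the code keeps working if the column order changes. The sketch below shows several equivalent forms; the variable names are illustrative.

```r
h_pos    <- women[, 1]          # by position
h_name   <- women[, "height"]   # by name, matrix style
h_dollar <- women$height        # by name, list style
h_double <- women[["height"]]   # by name, double brackets

print(identical(h_pos, h_name))     # TRUE
print(identical(h_pos, h_dollar))   # TRUE
print(identical(h_pos, h_double))   # TRUE

# Single brackets with one index return a one-column data frame
h_df <- women["height"]
print(class(h_df))
```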

The subset() function can also be used to select rows using logical tests. To illustrate, let’s return briefly to the GSS data first mentioned in Chapter 2, Recipe 2-1. We will select only subjects who are married (marital status = 1).

> married <- subset(jobSat, marital == 1)

We can also combine logical tests. Select people age 21 and older who worked at least 40 hours last week. Note the use of & in the following.

> fullTime40plusHrs <- subset(married, age >= 21 & hrs1 >= 40)
> head(fullTime40plusHrs)
    wrkstat hrs1 marital age sex race happy weekswrk jobsec jobsecok happy7
24        1   40       1  45   2    3     1       52      2        1      2
35        1   40       1  42   2    1     2       52      5        2      3
75        1   50       1  46   1    1     2       52      5        2      2
83        1   55       1  44   1    1     1       52      5        2      2
118       1   40       1  40   2    1     2       52      4        2      3
127       1   45       1  53   1    1     2       40      2        2      1
    satjob7 satfam7 realrinc  conrinc
24        2       2  49000.0  76600.0
35        3       3  22050.0  34470.0
75        4       2  49000.0  76600.0
83        3       2  33075.0  51705.0
118       2       2  40425.0  63195.0
127       2       1 341672.4 324512.3
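The subset() function also has a select argument for choosing columns at the same time as rows. Because the GSS extract is not bundled with R, this sketch applies the same idea to the built-in women dataset.

```r
# Rows: height at least 68 and weight above 150; columns: weight only
tall_heavy <- subset(women, height >= 68 & weight > 150, select = weight)
print(tall_heavy)

# The equivalent bracket form needs the data frame name on every variable
tall_heavy2 <- women[women$height >= 68 & women$weight > 150,
                     "weight", drop = FALSE]
print(nrow(tall_heavy2))   # 3
```

One difference worth knowing is that subset() drops rows where the condition evaluates to NA, whereas bracket indexing returns a row of NAs for them.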

Saving Datasets

You have already learned that you can write data to various file types, such as CSV and tab-delimited text files. You can also save your datasets in the native R data format, which is the *.rda format. Instead of reading *.rda files with a function such as read.csv(), you restore them with load(), which recreates the saved objects under their original names. This makes it very convenient for other R users who want to work with your data. Let’s see how this is done using the data from the plagiarism study. Once loaded, the dataset stays in your workspace, and if you save the workspace when you end the session, you will not have to reload the data during the next session.

> plagiarism <- read.csv("plagiarism.csv")
> save(plagiarism, file = "plagiarism.rda")
> file.exists("plagiarism.rda")
[1] TRUE
> load("plagiarism.rda")
> head(plagiarism)
    Course Sex CourseGr Wk2Gr Wk6Gr ExcQuote1 Plagiarized1 Plagiarized2
1 MFE1135A   F    84.55 92.86  97.5        No           No           No
2 MFE1135A   F    91.29 87.14  90.0        No           No           No
3 MFE1135A   M    89.83 84.29  80.0       Yes           No           No
4 MFE1135A   M    91.43 87.14  77.5       Yes          Yes           No
5 MFE1135A   F    71.20  0.00  70.0       Yes          Yes          Yes
6 MFE1135A   F    96.60 88.57  97.5       Yes          Yes           No
  VoluntaryTII       Source1 Wk2Sim Wk2Inc PctQuot1 Wk6Sim Wk6Inc PctQuot2
1           No          <NA>     10     10        0     11     14        3
2           No          <NA>      4     18       14     21     32       11
3           No      Internet      5     31       26      4      9        5
4           No      Internet     11     11        0     18     19        1
5           No   Publication     62     63        1     13     28       15
6           No StudentPapers     55     55        0      2      6        4
  ExcQuote2
1        No
2       Yes
3       Yes
4        No
5       Yes
6        No
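A related pair of functions, saveRDS() and readRDS(), stores a single object and lets you choose its name when you read it back. The following sketch shows both approaches side by side. It writes to temporary files and uses the built-in women dataset, so the file paths and object names are illustrative only.

```r
rda_path <- tempfile(fileext = ".rda")
rds_path <- tempfile(fileext = ".rds")

# save()/load(): the object comes back under its original name
heights <- women$height
save(heights, file = rda_path)
rm(heights)
restored_names <- load(rda_path)   # returns the names of the restored objects
print(restored_names)              # "heights"

# saveRDS()/readRDS(): one object, assigned to any name you like
saveRDS(women, rds_path)
women_copy <- readRDS(rds_path)
print(isTRUE(all.equal(women, women_copy)))

# Clean up the temporary files
unlink(c(rda_path, rds_path))
```

The save() and load() pair is convenient for restoring several objects at once, while saveRDS() and readRDS() avoid silently overwriting an existing object of the same name.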