R Recipes: A Problem-Solution Approach (2014)

Introduction

R is an open source implementation of the programming language S, created at Bell Laboratories by John Chambers, Rick Becker, and Allan Wilks. In addition to R, S is the basis of the commercially available S-PLUS system. Chambers is widely recognized as the chief architect of S, and in 1998 he won the prestigious Software System Award from the Association for Computing Machinery, which said that his design of the S system “forever altered how people analyze, visualize, and manipulate data.”

Think of R as an integrated system or environment that allows users multiple ways to access its many functions and features. You can use R as an interactive command-line interpreted language, much like a calculator. Type a command, press Enter, and R provides the answer in the R console. R is simultaneously a functional language and an object-oriented language. In addition to thousands of contributed packages, R has programming features, just as all computer programming languages do, allowing conditionals and looping, and giving the user the facility to create custom functions and specify various input and output options.
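The features just described can be sketched in a few lines. This is only an illustrative example, not code from a later chapter; the function name describe_sign and the variables values and labels are made up for the illustration.

```r
# Calculator-style use: type an expression and R prints the result
2 + 3 * 4

# A custom function that uses a conditional
# (describe_sign is a hypothetical name chosen for this sketch)
describe_sign <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

# A loop that applies the function to each element of a vector
values <- c(-2, 0, 5)
labels <- character(length(values))
for (i in seq_along(values)) {
  labels[i] <- describe_sign(values[i])
}
labels

# The same result in a more functional style, without an explicit loop
sapply(values, describe_sign)
```

The loop and the sapply() call produce the same character vector; which style to prefer is largely a matter of readability.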

R is widely used as a software environment for statistical computing, but the R Core Team prefers to think of R as an environment “within which many classical and modern statistical techniques have been implemented.” In addition to its statistical prowess, R provides impressive and flexible graphics capabilities, and many users are attracted to R primarily because of them. R has both basic and advanced plotting functions, with many options for customization.

Chambers and others at Bell Labs were developing S while I was in college and grad school, and of course I was completely oblivious to that fact, even though my major professor and I were consulting with another AT&T division at the time. I began my own statistical software journey writing programs in Fortran. I might find that a given program did not have a particular analysis I needed, such as a routine for calculating an intraclass correlation, so I would write my own program. BMDP and SAS were available in batch versions for mainframe computers when I was in graduate school—one had to learn Job Control Language (JCL) in order to tell the computer which tapes to load. I typed punch cards and used a card reader to read in JCL and data.

On a much larger and very much more sophisticated scale, this is essentially why the computer scientists at Bell Labs created S (for statistics). Fortran was and still is a general-purpose language, but it did not have many statistical capabilities. The design of S began with an informal meeting in 1976 at Bell Labs to discuss the design of a high-level language with an “algorithm,” which meant a Fortran-callable subroutine. Like its predecessor S, R can easily and transparently access compiled code from various other languages, including Fortran and C++ among others. R can also be interfaced with a variety of other programs, such as Python and SPSS.

R works in batch mode, but its most popular use is as an interactive data analysis, calculation, and graphics system running in a windowing system. R runs on Linux, Windows, and Mac systems. Be forewarned that R is not a point-and-click graphical user interface (GUI) program such as SPSS or Minitab. Unlike these programs, R provides terse output, but can be queried for more information should you need it. In this book, you will see screen captures of R running in the Windows operating system.
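As a small illustration of terse output that can be queried for more detail, the sketch below fits a simple linear regression. It assumes the cars data frame from the datasets package that ships with R; the object name fit is chosen only for this example.

```r
# Fit a simple linear regression of stopping distance on speed,
# using the built-in cars data frame
fit <- lm(dist ~ speed, data = cars)

# Printing the fitted object shows only the call and the coefficients
fit

# Asking for a summary returns much more: standard errors, t tests,
# residual standard error, and R-squared
summary(fit)

# Individual pieces can also be extracted on request
coef(fit)
```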

According to my friend and colleague, computer scientist and bioinformatics expert Dr. Nathan Goodman, statistical analysis essentially boils down to four empirical problems: problems involving description, problems involving differences, problems involving relationships, and problems involving classification. I agree wholeheartedly with Nat. All the problems and solutions presented in this book fall into one or more of those general categories. The problems are manifold, but the solutions are mostly limited to these four situations.

What this Book Covers

This book is for anyone—business professional, programmer, statistician, teacher, or student—who needs to find a way to use R to solve practical problems. Readers who have solved or attempted problems similar to the ones in this book using other tools will readily concur that each tool in one’s toolbox works better for some problems than for others. R novices will find best practices for using R’s features effectively. Intermediate-to-advanced R users and programmers will find shortcuts and applications that they may not have considered, as well as different ways to do things they might want to do.

The Structure of this Book

The standardized format will make this a useful book for future reference. Unlike with most other books, you do not have to start at the beginning and work through this one sequentially. Each chapter is a stand-alone lesson that starts with a typical problem (most of which come from real-life problems that I have faced, or ones that others have described and have given me permission to share). The datasets used with this book to illustrate the solutions should be similar to the datasets readers have worked with, or would like to work with.

Apart from a few contrived examples in the early chapters, most of the datasets and exercises come from real-world problems and data. Following a bit of background, the problem and the data are presented, and then readers learn one efficient way to solve the problem using R. Similar problems will quickly come to mind, and readers will be able to adapt what they learn here to those problems.

Conventions Used in this Book

In this book, code and script segments will be shown this way:

 
> x <- c(1, 3, 5)
> px <- c(0.5, 0.25, 0.25)
> dist <- sample(x, size = 1000, replace = TRUE, prob = px)
>

Code and R functions written inline will also be formatted in the code style.

When you are instructed to perform a command within the R Console or R Editor by using the (limited) point-and-click interface, the instructions will appear as follows: File ➤ Workspace.

Looking Forward

In Chapter 1, you will learn how to get R, how R works, and some of the basic things you can do with R. You will learn how to work with the R interface and the various windows you will find in R. Finally, you will learn how R deals with missing data, vectors, and matrices.
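As a brief preview of those last topics, here is a minimal sketch of how R treats a missing value in a vector, and how a matrix is built from a vector. The names scores and m are invented for this illustration and are not taken from Chapter 1.

```r
# A numeric vector with one missing value, coded as NA
scores <- c(10, NA, 30)

# Identify which elements are missing
is.na(scores)

# By default, a missing value propagates through the calculation
mean(scores)

# Ask R to drop missing values before computing the mean
mean(scores, na.rm = TRUE)

# A matrix is a vector with dimensions; R fills it column by column
m <- matrix(1:6, nrow = 2)
dim(m)

# Transpose the matrix
t(m)
```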