Field Guide to Hadoop (2015)

Preface

What is Hadoop and why should you care? This book will help you understand what Hadoop is, but for now, let’s tackle the second part of that question. Hadoop is the most common single platform for storing and analyzing big data. If you and your organization are entering the exciting world of big data, you’ll have to decide whether Hadoop is the right platform and which of the many components are best suited to the task. The goal of this book is to introduce you to the topic and get you started on your journey.

There are many books, websites, and classes about Hadoop and related technologies. This one is different. It does not provide a lengthy tutorial introduction to a particular aspect of Hadoop or to any of the many components of the Hadoop ecosystem. It certainly is not a rich, detailed discussion of any of these topics. Instead, it is organized like a field guide to birds or trees. Each chapter focuses on portions of the Hadoop ecosystem that have a common theme. Within each chapter, the relevant technologies and topics are briefly introduced: we explain their relation to Hadoop and discuss why they may be useful (and in some cases less than useful) for particular needs. To that end, this book includes various short sections on the many projects and subprojects of Apache Hadoop and some related technologies, with pointers to tutorials and links to related technologies and processes.

In each section, we have included a table that looks like this:

License               <License here>
Activity              None, Low, Medium, High
Purpose               <Purpose here>
Official Page         <URL>
Hadoop Integration    Fully Integrated, API Compatible, No Integration, Not Applicable

Let’s take a deeper look at what each of these categories entails:

License

While all of the technologies covered in the first version of this field guide are open source, they are released under several different licenses, which are mostly alike but differ in some respects. If you plan to include this software in a product, you should familiarize yourself with the conditions of the license.

Activity

We have done our best to measure how much active development work is being done on the technology. We may have misjudged in some cases, and the activity level may have changed since we first wrote on the topic.

Purpose

What does the technology do? We have tried to group topics with a common purpose together, and sometimes we found that a topic could fit into different chapters. Life is about making choices; these are the choices we made.

Official Page

If those responsible for the technology have a site on the Internet, this is the home page of the project.

Hadoop Integration

When we started writing, we weren’t sure exactly what topics we would include in the first version. Some on the initial list were tightly integrated or bound into Apache Hadoop. Others were alternative technologies or technologies that worked with Hadoop but were not part of the Apache Hadoop family. In those cases, we did our best to understand what the level of integration was at the time of our writing. This will no doubt change over time.

This book is not meant to be read from cover to cover. If you’re completely new to Hadoop, you should start by reading the introductory chapter, Chapter 1. Then you should look for topics of interest, read the section on that component, read the chapter header, and possibly scan other sections in the same chapter. This should help you get a feel for the subject. We have often included links to other sections in the book that may be relevant. You may also want to look at links to tutorials on the subject or to the “official” page for the topic.

We’ve arranged the topics into sections that follow the pattern in the diagram shown in Figure P-1. Many of the topics fit into Hadoop Common (formerly Hadoop Core), the basic tools and techniques that support all the other Apache Hadoop modules. However, the set of tools that play an important role in the big data ecosystem isn’t limited to technologies in the Hadoop core. In this book we also discuss a number of related technologies that play a critical role in the big data landscape.


Figure P-1. Overview of the topics covered in this book

In this first edition, we have not included information on any proprietary Hadoop distributions. We realize that these projects are important and relevant, but the commercial landscape is shifting so quickly that we have chosen to focus on open source technology only. Open source has a strong hold on the Hadoop and big data markets at the moment, and many commercial solutions are heavily based on the open source technology we describe in this book. Readers who are interested in adopting the open source technologies we discuss are encouraged to look for commercial distributions of those technologies if they are so inclined.

This work is not meant to be a static document that is updated only every year or two. Our goal is to keep it as up to date as possible, adding new content as the Hadoop environment grows and as older technologies either disappear or go into maintenance mode, supplanted by others that meet newer technology needs or gain favor for other reasons.

Because this subject matter changes very rapidly, readers are invited to submit suggestions and comments to Kevin (ksitto@gmail.com) and Marshall (bigmaish@gmail.com). Thank you for any suggestions you wish to make.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.