Hadoop in Practice, Second Edition (2015)

Appendix. Installing Hadoop and friends

This appendix contains instructions on how to install Hadoop and other tools that are used in the book.

 

Getting started quickly with Hadoop

The quickest way to get up and running with Hadoop is to download a preinstalled virtual machine from one of the Hadoop vendors. Following is a list of the popular VMs:

·        Cloudera Quickstart VM—http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html

·        Hortonworks Sandbox—http://hortonworks.com/products/hortonworkssandbox/

·        MapR Sandbox for Hadoop—http://doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop

 

A.1. Code for the book

Before we get to the instructions for installing Hadoop, let’s get you set up with the code that accompanies this book. The code is hosted on GitHub at https://github.com/alexholmes/hiped2. To get you up and running quickly, there are prepackaged tarballs that don’t require you to build the code—just install and go.

Downloading

First you’ll need to download the most recent release of the code from https://github.com/alexholmes/hiped2/releases.

Installing

The second step is to unpackage the tarball into a directory of your choosing. For example, the following untars the code into /usr/local, the same directory where you’ll install Hadoop:

$ cd /usr/local

$ sudo tar -xzvf <download directory>/hip-<version>-package.tar.gz

Adding the home directory to your path

All the examples in the book assume that the home directory for the code is in your path. The methods for doing this differ by operating system and shell. If you’re on Linux using Bash, then the following should work (use of the single quotes for the second command is required to avoid variable substitution):

$ echo "export HIP_HOME=/usr/local/hip-<version>" >> ~/.bash_profile

$ echo 'export PATH=${PATH}:${HIP_HOME}/bin' >> ~/.bash_profile

Running an example job

You can run the following commands to test your installation. This assumes that you have a running Hadoop setup (if you don’t, please jump to section A.3):

# create two input files in HDFS

$ hadoop fs -mkdir -p hip/input

$ echo "cat sat mat" | hadoop fs -put - hip/input/1.txt

$ echo "dog lay mat" | hadoop fs -put - hip/input/2.txt

# run the inverted index example

$ hip hip.ch1.InvertedIndexJob --input hip/input --output hip/output

# examine the results in HDFS

$ hadoop fs -cat hip/output/part*

Downloading the sources and building

There are some techniques (such as Avro code generation) that require access to the full sources. First, check out the sources using git:

$ git clone git@github.com:alexholmes/hiped2.git

Set up your environment so that some techniques know where the source is installed:

$ echo "export HIP_SRC=<installation dir>/hiped2" >> ~/.bash_profile

You can build the project using Maven:

$ cd hiped2

$ mvn clean validate package

This generates a target/hip-<version>-package.tar.gz file, which is the same file that’s uploaded to GitHub when releases are made.

A.2. Recommended Java versions

The Hadoop project keeps a list of recommended Java versions that have been proven to work well with Hadoop in production. For details, take a look at “Hadoop Java Versions” on the Hadoop Wiki at http://wiki.apache.org/hadoop/HadoopJavaVersions.

A.3. Hadoop

This section covers installing, configuring, and running the Apache distribution of Hadoop. Please refer to distribution-specific instructions if you’re working with a different distribution of Hadoop.

Apache tarball installation

The following instructions are for users who want to install the tarball version of the vanilla Apache Hadoop distribution. This is a a pseudo-distributed setup and not for a multi-node cluster.[1]

1 Pseudo-distributed mode is when you have all the Hadoop components running on a single host.

First you’ll need to download the tarball from the Apache downloads page at http://hadoop.apache.org/common/releases.html#Download and extract the tarball under /usr/local:

$ cd /usr/local

$ sudo tar -xzf <path-to-apache-tarball>

$ sudo ln -s hadoop-<version> hadoop

$ sudo chown -R <user>:<group> /usr/local/hadoop*

$ mkdir /usr/local/hadoop/tmp

 

Installation directory for users that don’t have root privileges

If you don’t have root permissions on your host, you can install Hadoop under a different directory and substitute instances of /usr/local in the following instructions with your directory name.

 

Configuration for pseudo-distributed mode for Hadoop 1 and earlier

The following instructions work for Hadoop version 1 and earlier. Skip to the next section if you’re working with Hadoop 2.

Edit the file /usr/local/hadoop/conf/core-site.xml and make sure it looks like the following:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>

    <name>hadoop.tmp.dir</name>

    <value>/usr/local/hadoop/tmp</value>

  </property>

  <property>

    <name>fs.default.name</name>

    <value>hdfs://localhost:8020</value>

  </property>

</configuration>

Then edit the file /usr/local/hadoop/conf/hdfs-site.xml and make sure it looks like the following:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>

    <name>dfs.replication</name>

    <value>1</value>

  </property>

  <property>

     <!-- specify this so that running 'hadoop namenode -format'

          formats the right dir -->

     <name>dfs.name.dir</name>

     <value>/usr/local/hadoop/cache/hadoop/dfs/name</value>

  </property>

</configuration>

Finally, edit the file /usr/local/hadoop/conf/mapred-site.xml and make sure it looks like the following (you may first need to copy mapred-site.xml.template to mapred-site.xml):

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>

    <name>mapred.job.tracker</name>

    <value>localhost:8021</value>

  </property>

</configuration>

Configuration for pseudo-distributed mode for Hadoop 2

The following instructions work for Hadoop 2. See the previous section if you’re working with Hadoop version 1 and earlier.

Edit the file /usr/local/hadoop/etc/hadoop/core-site.xml and make sure it looks like the following:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>

    <name>hadoop.tmp.dir</name>

    <value>/usr/local/hadoop/tmp</value>

  </property>

  <property>

    <name>fs.default.name</name>

    <value>hdfs://localhost:8020</value>

  </property>

</configuration>

Then edit the file /usr/local/hadoop/etc/hadoop/hdfs-site.xml and make sure it looks like the following:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>

    <name>dfs.replication</name>

    <value>1</value>

  </property>

</configuration>

Next, edit the file /usr/local/hadoop/etc/hadoop/mapred-site.xml and make sure it looks like the following:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>

    <name>mapreduce.framework.name</name>

    <value>yarn</value>

  </property>

</configuration>

Finally, edit the file /usr/local/hadoop/etc/hadoop/yarn-site.xml and make sure it looks like the following:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>

    <name>yarn.nodemanager.aux-services</name>

    <value>mapreduce_shuffle</value>

    <description>Shuffle service that needs to be set for

                       Map Reduce to run.</description>

  </property>

  <property>

    <name>yarn.log-aggregation-enable</name>

    <value>true</value>

  </property>

  <property>

    <name>yarn.log-aggregation.retain-seconds</name>

    <value>2592000</value>

  </property>

  <property>

    <name>yarn.log.server.url</name>

    <value>http://0.0.0.0:19888/jobhistory/logs/</value>

  </property>

  <property>

    <name>yarn.nodemanager.delete.debug-delay-sec</name>

    <value>-1</value>

    <description>Amount of time in seconds to wait before

                 deleting container resources.</description>

  </property>

</configuration>

Set up SSH

Hadoop uses Secure Shell (SSH) to remotely launch processes such as the Data-Node and TaskTracker, even when everything is running on a single node in pseudo-distributed mode. If you don’t already have an SSH key pair, create one with the following command:

$ ssh-keygen -b 2048 -t rsa

You’ll need to copy the .ssh/id_rsa file to the authorized_keys file:

$ cp ~/.ssh/id_rsa.pub  ~/.ssh/authorized_keys

You’ll also need an SSH agent running so that you aren’t prompted to enter your password a bazillion times when starting and stopping Hadoop. Different operating systems have different ways of running an SSH agent, and there are details online for CentOS and other Red Hat derivatives[2]and for OS X.[3] Google is your friend if you’re running on a different system.

2 See the Red Hat Deployment Guide section on “Configuring ssh-agent” at www.centos.org/docs/5/html/5.2/Deployment_Guide/s3-openssh-config-ssh-agent.html.

3 See “Using SSH Agent With Mac OS X Leopard” at www-uxsup.csx.cam.ac.uk/~aia21/osx/leopard-ssh.html.

To verify that the agent is running and has your keys loaded, try opening an SSH connection to the local system:

$ ssh 127.0.0.1

If you’re prompted for a password, the agent’s not running or doesn’t have your keys loaded.

Java

You need a current version of Java (1.6 or newer) installed on your system. You’ll need to ensure that the system path includes the binary directory of your Java installation. Alternatively, you can edit /usr/local/hadoop/conf/hadoop-env.sh, uncomment the JAVA_HOME line, and update the value with the location of your Java installation.

Environment settings

For convenience, it’s recommended that you add the Hadoop binary directory to your path. The following code shows what you can add to the bottom of your Bash shell profile file in ~/.bash_profile (assuming you’re running Bash):

HADOOP_HOME=/usr/local/hadoop

PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

export PATH

Format HDFS

Next you need to format HDFS. The rest of the commands in this section assume that the Hadoop binary directory exists in your path, as per the preceding instructions. On Hadoop 1 and earlier, type

$ hadoop namenode -format

On Hadoop versions 2 and newer, type

$ hdfs namenode -format

After HDFS has been formatted, you’re ready to start Hadoop.

Starting Hadoop 1 and earlier

A single command can be used to start Hadoop on versions 1 and earlier:

$ start-all.sh

After running the start script, use the jps Java utility to check that all the processes are running. You should see the following output (with the exception of the process IDs, which will be different):

$ jps

23836 JobTracker

23475 NameNode

23982 TaskTracker

23619 DataNode

24024 Jps

23756 SecondaryNameNode

If any of these processes aren’t running, check the logs directory (/usr/local/hadoop/logs) to see why the processes didn’t start correctly. Each of the preceding processes has two output files that can be identified by name and should be checked for errors.

The most common error is that the HDFS formatting step, which I showed earlier, was skipped.

Starting Hadoop 2

The following commands are required to start Hadoop version 2:

$ yarn-daemon.sh start resourcemanager

$ yarn-daemon.sh start nodemanager

$ hadoop-daemon.sh start namenode

$ hadoop-daemon.sh start datanode

$ mr-jobhistory-daemon.sh start historyserver

After running the start script, use the jps Java utility to check that all the processes are running. You should see the output that follows, although the ordering and process IDs will differ:

$ jps

32542 NameNode

1085  Jps

32131 ResourceManager

32613 DataNode

32358 NodeManager

1030  JobHistoryServer

If any of these processes aren’t running, check the logs directory (/usr/local/hadoop/logs) to see why the processes didn’t start correctly. Each of the preceding processes has two output files that can be identified by name and should be checked for errors. The most common error is that the HDFS formatting step, which I showed earlier, was skipped.

Creating a home directory for your user on HDFS

Once Hadoop is up and running, the first thing you’ll want to do is create a home directory for your user. If you’re running on Hadoop 1, the command is

$ hadoop fs -mkdir /user/<your-linux-username>

On Hadoop 2, you’ll run

$ hdfs dfs -mkdir -p /user/<your-linux-username>

Verifying the installation

The following commands can be used to test your Hadoop installation. The first two commands create a directory in HDFS and create a file in HDFS:

$ hadoop fs -mkdir /tmp

$ echo "the cat sat on the mat" | hadoop fs -put - /tmp/input.txt

Next you want to run a word-count MapReduce job. On Hadoop 1 and earlier, run the following:

$ hadoop jar /usr/local/hadoop/*-examples*.jar wordcount \

  /tmp/input.txt /tmp/output

On Hadoop 2, run the following:

$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/*-examples*.jar \

  wordcount /tmp/input.txt /tmp/output

Examine and verify the MapReduce job outputs on HDFS (the outputs will differ based on the contents of the config files that you used for the job inputs):

$ hadoop fs -cat /tmp/output/part*

at    1

mat    1

on    1

sat    1

the    2

Stopping Hadoop 1

To stop Hadoop 1, use the following command:

$ stop-all.sh

Stopping Hadoop 2

To stop Hadoop 2, use the following commands:

$ mr-jobhistory-daemon.sh stop historyserver

$ hadoop-daemon.sh stop datanode

$ hadoop-daemon.sh stop namenode

$ yarn-daemon.sh stop nodemanager

$ yarn-daemon.sh stop resourcemanager

Just as with starting, the jps command can be used to verify that all the Hadoop processes have stopped.

Hadoop 1.x UI ports

There are a number of web applications in Hadoop. Table A.1 lists them, along with the ports they run on and their URLs (assuming they’re running on the local host, as is the case if you have a pseudo-distributed installation running).

Table A.1. Hadoop 1.x web applications and ports

Component

Default port

Config parameter

Local URL

MapReduce JobTracker

50030

mapred.job.tracker.http.address

http://127.0.0.1:50030/

MapReduce TaskTracker

50060

mapred.task.tracker.http.address

http://127.0.0.1:50060/

HDFS NameNode

50070

dfs.http.address

http://127.0.0.1:50070/

HDFS DataNode

50075

dfs.datanode.http.address

http://127.0.0.1:50075/

HDFS Secondary-NameNode

50090

dfs.secondary.http.address

http://127.0.0.1:50090/

HDFS Backup and Checkpoint Node

50105

dfs.backup.http.address

http://127.0.0.1:50105/

Each of these URLs supports the following common paths:

·        /logs —This shows a listing of all the files under hadoop.log.dir. By default, this is under $HADOOP_HOME/logs on each Hadoop node.

·        /logLevel —This can be used to view and set the logging levels for Java packages.

·        /metrics —This shows JVM and component-level statistics. It’s available in Hadoop 0.21 and newer (not in 1.0, 0.20.x, or earlier).

·        /stacks —This shows a stack dump of all the current Java threads in the daemon.

Hadoop 2.x UI ports

There are a number of web applications in Hadoop. Table A.2 lists them, including the ports that they run on and their URLs (assuming they’re running on the local host, as is the case if you have a pseudo-distributed installation running).

Table A.2. Hadoop 2.x web applications and ports

Component

Default port

Config parameter

Local URL

YARN ResourceManager

8088

yarn.resourcemanager.webapp.address

http://localhost:8088/cluster

YARN NodeManager

8042

yarn.nodemanager.webapp.address

http://localhost:8042/node

MapReduce Job History

19888

mapreduce.jobhistory.webapp.address

http://localhost:19888/jobhistory

HDFS Name-Node

50070

dfs.http.address

http://127.0.0.1:50070/

HDFS DataNode

50075

dfs.datanode.http.address

http://127.0.0.1:50075/

A.4. Flume

Flume is a log collection and distribution system that can transport data across a large number of hosts into HDFS. It’s an Apache project originally developed by Cloudera.

Chapter 5 contains a section on Flume and how it can be used.

Getting more information

Table A.3 lists some useful resources to help you become more familiar with Flume.

Table A.3. Useful resources

Resource

URL

Flume main page

http://flume.apache.org/

Flume user guide

http://flume.apache.org/FlumeUserGuide.html

Flume Getting Started guide

https://cwiki.apache.org/confluence/display/FLUME/Getting+Started

Installation on Apache Hadoop 1.x systems

Follow the Getting Started guide referenced in the resources.

Installation on Apache Hadoop 2.x systems

If you’re trying to get Flume 1.4 to work with Hadoop 2, follow the Getting Started guide to install Flume. Next, you’ll need to remove the protobuf and guava JARs from Flume’s lib directory because they conflict with the versions bundled with Hadoop 2:

$ mv ${flume_bin}/lib/{protobuf-java-2.4.1.jar,guava-10.0.1.jar} ~/

A.5. Oozie

Oozie is an Apache project that started life inside Yahoo. It’s a Hadoop workflow engine that manages data processing activities.

Getting more information

Table A.4 lists some useful resources to help you become more familiar with Oozie.

Table A.4. Useful resources

Resource

URL

Oozie project page

https://oozie.apache.org/

Oozie Quick Start

https://oozie.apache.org/docs/4.0.0/DG_QuickStart.html

Additional Oozie resources

https://oozie.apache.org/docs/4.0.0/index.html

Installation on Hadoop 1.x systems

Follow the Quick Start guide to install Oozie. The Oozie documentation has installation instructions.

If you’re using Oozie 4.4.0 and targeting Hadoop 2.2.0, you’ll need to run the following commands to patch your Maven files and perform the build:

cd oozie-4.0.0/

find . -name pom.xml | xargs sed -ri 's/(2.2.0\-SNAPSHOT)/2.2.0/'

mvn -DskipTests=true -P hadoop-2 clean package assembly:single

Installation on Hadoop 2.x systems

Unfortunately Oozie 4.0.0 doesn’t play nicely with Hadoop 2. To get Oozie working with Hadoop, you’ll first need to download the 4.0.0 tarball from the project page and then unpackage it. Next, run the following command to change the Hadoop version being targeted:

$ cd oozie-4.0.0/

$ find . -name pom.xml | xargs sed -ri 's/(2.2.0\-SNAPSHOT)/2.2.0/'

Now all you need to do is target the hadoop-2 profile in Maven:

$ mvn -DskipTests=true -P hadoop-2 clean package assembly:single

A.6. Sqoop

Sqoop is a tool for importing data from relational databases into Hadoop and vice versa. It can support any JDBC-compliant database, and it also has native connectors for efficient data transport to and from MySQL and PostgreSQL.

Chapter 5 contains details on how imports and exports can be performed with Sqoop.

Getting more information

Table A.5 lists some useful resources to help you become more familiar with Sqoop.

Table A.5. Useful resources

Resource

URL

Sqoop project page

http://sqoop.apache.org/

Sqoop User Guide

http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html

Installation

Download the Sqoop tarball from the project page. Pick the version that matches with your Hadoop installation and explode the tarball. The following instructions assume that you’re installing under /usr/local:

$ sudo tar -xzf \

    sqoop-<version>.bin.hadoop-<hadoop-version>.tar.gz \

    -C /usr/local/

$ ln -s /usr/local/sqoop-<version> /usr/local/sqoop

 

Sqoop 2

This book currently covers Sqoop version 1. When selecting which tarball to download, please note that version 1.99.x and newer are the Sqoop 2 versions, so be sure to pick an older version.

 

If you’re planning on using Sqoop with MySQL, you’ll need to download the MySQL JDBC driver tarball from http://dev.mysql.com/downloads/connector/j/, explode it into a directory, and then copy the JAR file into the Sqoop lib directory:

$ tar -xzf mysql-connector-java-<version>.tar.gz

$ cd mysql-connector-java-<version>

$ sudo cp mysql-connector-java-<version>-bin.jar \

  /usr/local/sqoop/lib

To run Sqoop, there are a few environment variables that you may need to set. They’re listed in table A.6.

Table A.6. Sqoop environment variables

Environment variable

Description

JAVA_HOME

The directory where Java is installed. If you have the Sun JDK installed on Red Hat, this would be /usr/java/latest.

HADOOP_HOME

The directory of your Hadoop installation.

HIVE_HOME

Only required if you’re planning on using Hive with Sqoop. Refers to the directory where Hive was installed.

HBASE_HOME

Only required if you’re planning on using HBase with Sqoop. Refers to the directory where HBase was installed.

The /usr/local/sqoop/bin directory contains the binaries for Sqoop. Chapter 5 contains a number of techniques that show how the binaries are used for imports and exports.

A.7. HBase

HBase is a real-time, key/value, distributed, column-based database modeled after Google’s BigTable.

Getting more information

Table A.7 lists some useful resources to help you become more familiar with HBase.

Table A.7. Useful resources

Resource

URL

Apache HBase project page

http://hbase.apache.org/

Apache HBase Quick Start

http://hbase.apache.org/book/quickstart.html

Apache HBase Reference Guide

http://hbase.apache.org/book/book.html

Cloudera blog post on HBase Dos and Don’ts

http://blog.cloudera.com/blog/2011/04/hbase-dos-and-donts/

Installation

Follow the installation instructions in the Quick Start guide at https://hbase.apache.org/book/quickstart.html.

A.8. Kafka

Kafka is a publish/subscribe messaging system built by LinkedIn.

Getting more information

Table A.8 lists some useful resources to help you become more familiar with Kafka.

Table A.8. Useful resources

Resource

URL

Kafka project page

http://kafka.apache.org/

Kafka documentation

http://kafka.apache.org/documentation.html

Kafka Quick Start

http://kafka.apache.org/08/quickstart.html

Installation

Follow the installation instructions in the Quick Start guide.

A.9. Camus

Camus is a tool for importing data in Kafka into Hadoop.

Getting more information

Table A.9 lists some useful resources to help you become more familiar with Camus.

Table A.9. Useful resources

Resource

URL

Camus project page

https://github.com/linkedin/camus

Camus Overview

https://github.com/linkedin/camus/wiki/Camus-Overview

Installation on Hadoop 1

Download the code from the 0.8 branch in GitHub, and run the following command to build it:

$ mvn clean package

Installation on Hadoop 2

At the time of writing, the 0.8 version of Camus doesn’t support Hadoop 2. You have a couple of options to get it working—if you’re just experimenting with Camus, you can download a patched version of the code from my GitHub project. Alternatively, you can patch the Maven build files.

Using my patched GitHub project

Download my cloned and patched version of Camus from GitHub and build it just as you would the Hadoop 1 version:

$ wget https://github.com/alexholmes/camus/archive/camus-kafka-0.8.zip

$ unzip camus-kafka-0.8.zip

$ cd camus-camus-kafka-0.8

$ mvn clean package

Patching the Maven build files

If you want to patch the original Camus files, you can do that by taking a look at the patch I applied to my own clone: https://mng.bz/Q8GV.

A.10. Avro

Avro is a data serialization system that provides features such as compression, schema evolution, and code generation. It can be viewed as a more sophisticated version of a SequenceFile, with additional features such as schema evolution.

Chapter 3 contains details on how Avro can be used in MapReduce as well as with basic input/output streams.

Getting more information

Table A.10 lists some useful resources to help you become more familiar with Avro.

Table A.10. Useful resources

Resource

URL

Avro project page

http://avro.apache.org/

Avro issue tracking page

https://issues.apache.org/jira/browse/AVRO

Cloudera blog about Avro use

http://blog.cloudera.com/blog/2011/12/apache-avro-at-richrelevance/

CDH usage page for Avro

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/5.0/CDH5-Installation-Guide/cdh5ig_avro_usage.html

Installation

Avro is a full-fledged Apache project, so you can download the binaries from the downloads link on the Apache project page.

A.11. Apache Thrift

Apache Thrift is essentially Facebook’s version of Protocol Buffers. It offers very similar data-serialization and RPC capabilities. In this book, I use it with Elephant Bird to support Thrift in MapReduce. Elephant Bird currently works with Thrift version 0.7.

Getting more information

Thrift documentation is lacking, something which the project page attests to. Table A.11 lists some useful resources to help you become more familiar with Thrift.

Table A.11. Useful resources

Resource

URL

Thrift project page

http://thrift.apache.org/

Blog post with a Thrift tutorial

http://bit.ly/vXpZ0z

Building Thrift 0.7

To build Thrift, download the 0.7 tarball and extract the contents. You may need to install some Thrift dependencies:

$ sudo yum  install automake libtool flex bison pkgconfig gcc-c++ \

  boost-devel libevent-devel zlib-devel python-devel \

  ruby-devel php53.x86_64 php53-devel.x86_64 openssl-devel

Build and install the native and Java/Python libraries and binaries:

$ ./configure

$ make

$ make check

$ sudo make install

Build the Java library. This step requires Ant to be installed, instructions for which are available in the Apache Ant Manual at http://ant.apache.org/manual/index.html:

$ cd lib/java

$ ant

Copy the Java JAR into Hadoop’s lib directory. The following instructions are for CDH:

# replace the following path with your actual

# Hadoop installation directory

#

# the following is the CDH Hadoop home dir

#

export HADOOP_HOME=/usr/lib/hadoop

$ cp lib/java/libthrift.jar $HADOOP_HOME/lib/

A.12. Protocol Buffers

Protocol Buffers is Google’s data serialization and Remote Procedure Call (RPC) library, which is used extensively at Google. In this book, we’ll use it in conjunction with Elephant Bird and Rhipe. Elephant Bird requires version 2.3.0 of Protocol Buffers (and won’t work with any other version), and Rhipe only works with Protocol Buffers version 2.4.0 and newer.

Getting more information

Table A.12 lists some useful resources to help you become more familiar with Protocol Buffers.

Table A.12. Useful resources

Resource

URL

Protocol Buffers project page

http://code.google.com/p/protobuf/

Protocol Buffers Developer Guide

https://developers.google.com/protocol-buffers/docs/overview?csw=1

Protocol Buffers downloads page, containing a link for version 2.3.0 (required for use with Elephant Bird)

http://code.google.com/p/protobuf/downloads/list

Building Protocol Buffers

To build Protocol Buffers, download the 2.3 or 2.4 (2.3 for Elephant Bird and 2.4 for Rhipe) source tarball from http://code.google.com/p/protobuf/downloads and extract the contents.

You’ll need a C++ compiler, which can be installed on 64-bit RHEL systems with the following command:

sudo yum install gcc-c++.x86_64

Build and install the native libraries and binaries:

$ cd protobuf-<version>/

$ ./configure

$ make

$ make check

$ sudo make install

Build the Java library:

$ cd java

$ mvn package install

Copy the Java JAR into Hadoop’s lib directory. The following instructions are for CDH:

# replace the following path with your actual

# Hadoop installation directory

#

# the following is the CDH Hadoop home dir

#

export HADOOP_HOME=/usr/lib/hadoop

$ cp target/protobuf-java-2.3.0.jar $HADOOP_HOME/lib/

A.13. Snappy

Snappy is a native compression codec developed by Google that offers fast compression and decompression times. It can’t be split (as opposed to LZOP compression). In the book’s code examples, which don’t require splittable compression, we’ll use Snappy because of its time efficiency.

Snappy is integrated into the Apache distribution of Hadoop since versions 1.0.2 and 2.

Getting more information

Table A.13 lists some useful resources to help you become more familiar with Snappy.

Table A.13. Useful resources

Resource

URL

Google’s Snappy project page

http://code.google.com/p/snappy/

Snappy integration with Hadoop

http://code.google.com/p/hadoop-snappy/

A.14. LZOP

LZOP is a compression codec that can be used to support splittable compression in MapReduce. Chapter 4 has a section dedicated to working with LZOP. In this section we’ll cover how to build and set up your cluster to work with LZOP.

Getting more information

Table A.14 shows a useful resource to help you become more familiar with LZOP.

Table A.14. Useful resource

Resource

URL

Hadoop LZO project maintained by Twitter

https://github.com/twitter/hadoop-lzo

Building LZOP

The following steps walk you through the process of configuring LZOP compression. Before you do this, there are a few things to consider:

·        It’s highly recommended that you build the libraries on the same hardware that you have deployed in production.

·        All of the installation and configuration steps will need to be performed on any client hosts that will be using LZOP, as well as all the DataNodes in your cluster.

·        These steps are for Apache Hadoop distributions. Please refer to distribution-specific instructions if you’re using a different distribution.

Twitter’s LZO project page has instructions on how to download dependencies and build the project. Follow the Building and Configuring section on the project home page.

Configuring Hadoop

You need to configure Hadoop core to be aware of your new compression codecs. Add the following lines to your core-site.xml. Make sure you remove the newlines and spaces so that there are no whitespace characters between the commas:

<property>

  <name>mapred.compress.map.output</name>

  <value>true</value>

</property>

<property>

  <name>mapred.map.output.compression.codec</name>

  <value>com.hadoop.compression.lzo.LzoCodec</value>

</property>

<property>

  <name>io.compression.codecs</name>

  <value>org.apache.hadoop.io.compress.GzipCodec,

  org.apache.hadoop.io.compress.DefaultCodec,

  org.apache.hadoop.io.compress.BZip2Codec,

  com.hadoop.compression.lzo.LzoCodec,

  com.hadoop.compression.lzo.LzopCodec,

  org.apache.hadoop.io.compress.SnappyCodec</value>

</property>

<property>

  <name>io.compression.codec.lzo.class</name>

  <value>com.hadoop.compression.lzo.LzoCodec</value>

</property>

The value for io.compression.codecs assumes that you have the Snappy compression codec already installed. If you don’t, remove org.apache.hadoop.io.compress.SnappyCodec from the value.

A.15. Elephant Bird

Elephant Bird is a project that provides utilities for working with LZOP-compressed data. It also provides a container format that supports working with Protocol Buffers and Thrift in MapReduce.

Getting more information

Table A.15 shows a useful resource to help you become more familiar with Elephant Bird.

Table A.15. Useful resource

Resource

URL

Elephant Bird project page

https://github.com/kevinweil/elephant-bird

At the time of writing, the current version of Elephant Bird (4.4) doesn’t work with Hadoop 2 due the use of an incompatible version of Protocol Buffers. To get Elephant Bird to work in this book, I had to build a version of the project from the trunk that works with Hadoop 2 (as will 4.5 when it is released).

A.16. Hive

Hive is a SQL interface on top of Hadoop.

Getting more information

Table A.16 lists some useful resources to help you become more familiar with Hive.

Table A.16. Useful resources

Resource

URL

Hive project page

http://hive.apache.org/

Getting Started

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

Installation

Follow the installation instructions in Hive’s Getting Started guide.

A.17. R

R is an open source tool for statistical programming and graphics.

Getting more information

Table A.17 lists some useful resources to help you become more familiar with R.

Table A.17. Useful resources

Resource

URL

R project page

http://www.r-project.org/

R function search engine

http://rseek.org/

Installation on Red Hat–based systems

Installing R from Yum makes things easy: it will figure out RPM dependencies and install them for you.

Go to http://www.r-project.org/, click on CRAN, select a download region that’s close to you, select Red Hat, and pick the version and architecture appropriate for your system. Replace the URL in baseurl in the following code and execute the command to add the R mirror repo to your Yum configuration:

$ sudo -s

$ cat << EOF > /etc/yum.repos.d/r.repo

# R-Statistical Computing

[R]

name=R-Statistics

baseurl=http://cran.mirrors.hoobly.com/bin/linux/redhat/el5/x86_64/

enabled=1

gpgcheck=0

EOF

A simple Yum command can be used to install R on 64-bit systems:

$ sudo yum install R.x86_64

 

Perl-File-Copy-Recursive RPM

On CentOS, the Yum install may fail, complaining about a missing dependency. In this case, you may need to manually install the perl-File-Copy-Recursive RPM (for CentOS you can get it from http://mng.bz/n4C2).

 

Installation on non–Red Hat systems

Go to http://www.r-project.org/, click on CRAN, select a download region that’s close to you, and select the appropriate binaries for your system.

A.18. RHadoop

RHadoop is an open source tool developed by Revolution Analytics for integrating R with MapReduce.

Getting more information

Table A.18 lists some useful resources to help you become more familiar with RHadoop.

Table A.18. Useful resources

Resource

URL

RHadoop project page

https://github.com/RevolutionAnalytics/RHadoop/wiki

RHadoop downloads and prerequisites

https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads

rmr/rhdfs installation

Each node in your Hadoop cluster will require the following components:

·        R (installation instructions are in section A.17).

·        A number of RHadoop and dependency packages

RHadoop requires that you set environment variables to point to the Hadoop binary and the streaming JAR. It’s best to stash this in your .bash_profile (or equivalent).

$ export HADOOP_CMD=/usr/local/hadoop/bin/hadoop

$ export HADOOP_STREAMING=${HADOOP_HOME}/share/hadoop/tools/lib/

hadoop-streaming-<version>.jar

We’ll focus on the rmr and rhdfs RHadoop packages, which provide MapReduce and HDFS integration with R. Click on the rmr and rhdfs download links on https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads. Then execute the following commands:

$ sudo -s

yum install -y libcurl-devel java-1.7.0-openjdk-devel

$ export HADOOP_CMD=/usr/bin/hadoop

$ R CMD javareconf

$ R

> install.packages( c('rJava'),

    repos='http://cran.revolutionanalytics.com')

> install.packages( c('RJSONIO', 'itertools', 'digest', 'Rcpp','httr',

    'functional','devtools', 'reshape2', 'plyr', 'caTools'),

    repos='http://cran.revolutionanalytics.com')

$ R CMD INSTALL  /media/psf/Home/Downloads/rhdfs_1.0.8.tar.gz

$ R CMD INSTALL  /media/psf/Home/Downloads/rmr2_3.1.1.tar.gz

$ R CMD INSTALL rmr_<version>.tar.gz

$ R CMD INSTALL rhdfs_<version>.tar.gz

If you get an error installing rJava, you may need to set JAVA_HOME and reconfigure R prior to running the rJava installation:

$ sudo -s

$ export JAVA_HOME=/usr/java/latest

$ R CMD javareconf

$ R

> install.packages("rJava")

Test that the rmr package was installed correctly by running the following command—if no error messages are generated, this means you have successfully installed the RHadoop packages.

$ R

> library(rmr2)

A.19. Mahout

Mahout is a predictive analytics project that offers both in-JVM and MapReduce implementations for some of its algorithms.

Getting more information

Table A.19 lists some useful resources to help you become more familiar with Mahout.

Table A.19. Useful resources

Resource

URL

Mahout project page

http://mahout.apache.org/

Mahout downloads

https://cwiki.apache.org/confluence/display/MAHOUT/Downloads

Installation

Mahout should be installed on a node that has access to your Hadoop cluster. Mahout is a client-side library and doesn’t need to be installed on your Hadoop cluster.

Building a Mahout distribution

To get Mahout working with Hadoop 2, I had to check out the code, modify the build file, and then build a distribution. The first step is to check out the code:

$ git clone https://github.com/apache/mahout.git

$ cd mahout

Next you need to modify pom.xml and remove the following section from the file:

<plugin>

<inherited>true</inherited>

<groupId>org.apache.maven.plugins</groupId>

<artifactId>maven-gpg-plugin</artifactId>

<version>1.4</version>

<executions>

  <execution>

    <goals>

      <goal>sign</goal>

    </goals>

  </execution>

</executions>

</plugin>

Finally, build a distribution:

$ mvn -Dhadoop2.version=2.2.0 -DskipTests -Prelease

This will generate a tarball located at distribution/target/mahout-distribution-1.0-SNAPSHOT.tar.gz, which you can install using the instructions in the next section.

Installing Mahout

Mahout is packaged as a tarball. The following instructions will work on most Linux operating systems.

If you’re installing an official Mahout release, click on the “official release” links on the Mahout download page and select the current release. If Mahout 1 hasn’t yet been released and you want to use Mahout with Hadoop 2, follow the instructions in the previous section to generate the tarball.

Install Mahout using the following instructions:

$ cd /usr/local

$ sudo tar -xzf <path-to-mahout-tarball>

$ sudo ln -s mahout-distribution-<version> mahout

$ sudo chown -R <user>:<group> /usr/local/mahout*

For convenience, it’s worthwhile updating your ~/.bash_profile to export a MAHOUT_HOME environment variable to your installation directory. The following command shows how this can be performed on the command line (the same command can be copied into your bash profile file):

$ export MAHOUT_HOME=/usr/local/mahout