Field Guide to Hadoop (2015)

Chapter 8. Cloud Computing and Virtualization

Most Hadoop clusters today run on “real iron”—that is, on small, Intel-based computers running some variant of the Linux operating system with directly attached storage. However, you might want to try this in a cloud or virtual environment. While virtualization usually comes with some degree of performance degradation, you may find it minimal for your task set or that it’s a worthwhile trade-off for the benefits of cloud computing; these benefits include low up-front costs and the ability to scale up (and down sometimes) as your dataset and analytic needs change.

By cloud computing, we’ll follow guidelines established by the National Institute of Standards and Technology (NIST), whose definition of cloud computing you’ll find here. A Hadoop cluster in the cloud will have:

§  On-demand self-service

§  Network access

§  Resource sharing

§  Rapid elasticity

§  Measured resource service

While these resource need not exist virtually, in practice, they usually do.

Virtualization means creating virtual, as opposed to real, computing entities. Frequently, the virtualized object is an operating system on which software or applications are overlaid, but storage and networks can also be virtualized. Lest you think that virtualization is a relatively new computing technology, in 1972 IBM released VM/370, in which the 370 mainframe could be divided into many small, single-user virtual machines. Currently, Amazon Web Services is likely the most well-known cloud-computing facility. For a brief explanation of virtualization, look here on Wikipedia.

The official Hadoop perspective on cloud computing and virtualization is explained on this Wikipedia page. One guiding principle of Hadoop is that data analytics should be run on nodes in the cluster close to the data. Why? Transporting blocks of data in a cluster diminishes performance. Because blocks of HDFS files are normally stored three times, it’s likely that MapReduce can chose nodes to run your jobs on datanodes on which the data is stored. In a naive virtual environment, the physical location of the data is not known, and in fact, the real physical storage may be someplace that is not on any node in the cluster at all.

While it’s admittedly from a VMware perspective, good background reading on virtualizing Hadoop can be found here.

In this chapter, you’ll read about some of the open source software that facilitates cloud computing and virtualization. There are also proprietary solutions, but they’re not covered in this edition of the Field Guide to Hadoop.


fgth 08in01


Apache License, Version 2.0




Hadoop Virtualization

Official Page

Hadoop Integration

No Integration

If your organization uses VMware’s vSphere as the basis of the virtualization strategy, then Serengeti provides you with a method of quickly building Hadoop clusters in your environment. Admittedly, vSphere is a proprietary environment, but the code to run Hadoop in this environment is open source. Though Serengeti is not affiliated with the Apache Software Foundation (which operates many of the other Hadoop-related projects), many people have successfully used it in deployments.

Why virtualize Hadoop at all? Historically, Hadoop clusters have run on commodity servers (i.e., Intel x86 machines with their own set of disks running the Linux OS). When scheduling jobs, Hadoop made use of the location of data in the HDFS (described here) to run the code as close to the data as possible, preferably in the same node, to minimize the amount of data transferred across the network. In many virtualized environments, the directly attached storage is replaced by a common storage device like a storage area network (SAN) or a network attached storage (NAS). In these environments, there is no notion of locality of storage.

There are good reasons for virtualizing Hadoop, and there seem to be many Hadoop clusters running on public clouds today:

§  Speed of quickly spinning up a cluster. You don’t need to order and configure hardware.

§  Ability to quickly increase and reduce the size of the cluster to meet demand for services.

§  Resistance and recovery from failures managed by the virtualization technology.

And there are some disadvantages:

§  MapReduce and YARN assume complete control of machine resources. This is not true in a virtualized environment.

§  Data layout is critical, so excessive disk head movement may occur and the normal triple mirroring is critical for data protection. A good virtualization strategy must do the same. Some do, some don’t.

You’ll need to weigh the advantages and disadvantages to decide if Virtual Hadoop is appropriate for your projects.

Tutorial Links

Background reading on virtualizing Hadoop can be found at:

§  “Deploying Hadoop with Serengeti”

§  The Virtual Hadoop wiki

§  “Hadoop Virtualization Extensions on VMware vSphere 5”

§  “Virtualizing Apache Hadoop”


fgth 08in02


Apache License, Version 2.0




Container to run apps, including Hadoop nodes

Official Page

Hadoop Integration

No Integration

You may have heard the buzz about Docker and containerized applications. A little history may help here. Virtual machines were a large step forward in both cloud computing and infrastructure as a service (IaaS). Once a Linux virtual machine was created, it took less than a minute to spin up a new one, whereas building a Linux hardware machine could take hours. But there are some drawbacks. If you built a cluster of 100 VMs, and needed to change an OS parameter or update something in your Hadoop environment, you would need to do it on each of the 100 VMs.

To understand Docker’s advantages, you’ll find it useful to understand its lineage. First came chroot jails, in which Unix subsystems could be built that were restricted to a smaller namespace than the entire Unix OS. Then came Solaris containers in which users and programs could be restricted to zones, each protected from the others with many virtualized resources. Linux containers are roughly the same as Solaris containers, but for the Linux OS rather than Solaris. Docker arose as a technology to do lightweight virtualization for applications. The Docker container sits on top of Linux OS resources and just houses the application code and whatever it depends upon over and above OS resources. Thus Dockers enjoys the resource isolation and resource allocation features of a virtual machine, but is much more portable and lightweight. A full description of Docker is beyoond the scope of this book, but recently attempts have been made to run Hadoop nodes in a Docker environment.

Docker is new. It’s not clear that this is ready for a large Hadoop production environment.

Tutorial Links

The Docker folks have made it easy to get started with an interactive tutorial. Readers who want to know more about the container technology behind Docker will find this developerWorks article particularly interesting.

Example Code

The tutorials do a very good job giving examples of running Docker. This Pivotal blog post illustrates an example of deploying Hadoop on Docker.


fgth 08in03


Apache License, Version 2.0




Cluster Deployment

Official Page

Hadoop Integration

API Compatible

Building a big data cluster is an expensive, time-consuming, and complicated endeavor that often requires coordination between many teams. Sometimes you just need to spin up a cluster to test a capability or prototype a new idea. Cloud services like Amazon EC2 or Rackspace Cloud Servers provide a way to get that done. Unfortunately, different providers have very different interfaces for working with their services, so once you’ve developed some automation around the process of building and tearing down these test clusters, you’ve effectively locked yourself in with a single service provider. Apache Whirr provides a standard mechanism for working with a handful of different service providers. This allows you to easily change cloud providers or to share configurations with other teams that do not use the same cloud provider.

The most basic building block of Whirr is the instance template. Instance templates define a purpose; for example, there are templates for the Hadoop jobtracker, ZooKeeper, and HBase region nodes. Recipes are one step up the stack from templates and define a cluster. For example, a recipefor a simple data-processing cluster might call for deploying a Hadoop NameNode, a Hadoop jobtracker, a couple ZooKeeper servers, an HBase master, and a handful of HBase region servers.

Tutorial Links

The official Apache Whirr website provides a couple of excellent tutorials. The Whirr in 5 minutes tutorial provides the exact commands necessary to spin up and shut down your first cluster. The quick-start guide is a little more involved, walking through what happens during each stage of the process.

Example Code

In this case, we’re going to deploy the simple data cluster we described earlier to an Amazon EC2 account we’ve already established.

The first step is to build our recipe file (we’ll call this file


# The name we'll give this cluster,

# this gets communicated with the cloud service provider


# Because we're just testing

# we'll put all the masters on one single machine

# and build only three worker nodes

whirr.instance-templates= \

1 zookeeper+hadoop-namenode \

+hadoop-jobtracker \


3 hadoop-datanode \

+hadoop-tasktracker \


# We're going to deploy the cluster to Amazon EC2


# The identity and credential mean different things

# depending on the provider you choose.

# Because we're using EC2, we need to enter our

# Amazon access key ID and secret access key;

# these are easily available from your provider.

whirr.identity=<your identity here>

whirr.credential=<your key here>

# The credentials we'll use to access the cluster.

# In this case, Whirr will create a user named field_guide_user

# on each of the machines it spins up and

# we'll use our ssh public key to connect as that user.




We're now ready to deploy our cluster.

# In order to do so we simply run whirr with the

# "launch cluster" argument and pass it our recipe:

$ whirr launch-cluster --config

Once we're done with the cluster and we want to tear it down

# we run a similar command,

# this time passing the aptly named "destroy-cluster" argument:

$ whirr destroy-cluster --config

About the Authors

Kevin Sitto is a field solutions engineer with Pivotal Software, providing consulting services to help folks understand and address their big data needs.

He lives in Maryland with his wife and two kids, and enjoys making homebrew beer when he’s not writing books about big data.

Marshall Presser is a field chief technology officer for Pivotal Software and is based in McLean, Virginia. In addition to helping customers solve complex analytic problems with the Greenplum Database, he leads the Hadoop Virtual Field Team, working on issues of integrating Hadoop with relational databases.

Prior to coming to Pivotal (formerly Greenplum), he spent 12 years at Oracle, specializing in high availability, business continuity, clustering, parallel database technology, disaster recovery, and large-scale database systems. Marshall has also worked for a number of hardware vendors implementing clusters and other parallel architectures. His background includes parallel computation and operating system/compiler development, as well as private consulting for organizations in healthcare, financial services, and federal and state governments.

Marshall holds a B.A. in mathematics and an M.A. in economics and statistics from the University of Pennsylvania, and a M.Sc. in computing from Imperial College, London.


The animals on the cover of Field Guide to Hadoop are the O’Reilly animals most associated with the technologies covered in this book: the skua seabird, lowland paca, hydra porpita pacifica, trigger fish, African elephant, Pere David’s deer, European wildcat, ruffed grouse, and chimpanzee.

Many of the animals on O’Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to

The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.