Field Guide to Hadoop (2015)

Chapter 7. Security, Access Control, and Auditing

When Hadoop was getting started, its basic security model might have been described as “build a fence around an elephant; once inside the fence, security is a bit lax.” While HDFS has access control mechanisms, security has been a bit of an afterthought in the Hadoop world. Recently, as Hadoop has become much more mainstream, security issues are being addressed through the development of new tools, such as Sentry and Knox, as well as established mechanisms like Kerberos.

Large, well-established computing systems have methods for access control, authorization, encryption, and audit logging, as required by regulations such as HIPAA, FISMA, and PCI.

Authentication answers the question, “Who are you?” Traditional strong authentication methods include Kerberos, Lightweight Directory Access Protocol (LDAP), and Active Directory (AD). Authentication is generally handled outside of Hadoop, usually on the client side, or within the web server if appropriate.

Authorization answers the question, “What can you do?” Here Hadoop’s mechanisms are spread all over the place. For example, the MapReduce job queue system stores its authorization information differently than HDFS, which uses familiar read/write/execute permissions for user/group/other. HBase has table- and column family-level authorization, and Accumulo has cell-level authorization.
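The HDFS model mentioned above mirrors POSIX file permissions. The following toy sketch (plain Python, not Hadoop code) shows how a user/group/other rwx check against an octal mode like 750 plays out; the users and groups are invented for illustration:

```python
def hdfs_allows(mode: int, owner: str, group: str,
                user: str, user_groups: set, action: str) -> bool:
    """Check an rwx action against an octal mode like 0o750,
    the way HDFS-style user/group/other permissions work."""
    bit = {"read": 4, "write": 2, "execute": 1}[action]
    if user == owner:
        shift = 6          # owner bits (leftmost octal digit)
    elif group in user_groups:
        shift = 3          # group bits (middle octal digit)
    else:
        shift = 0          # other bits (rightmost octal digit)
    return bool((mode >> shift) & bit)

# A directory owned by etl:finance with mode 750 (rwxr-x---):
print(hdfs_allows(0o750, "etl", "finance", "alice", {"finance"}, "read"))   # True
print(hdfs_allows(0o750, "etl", "finance", "bob", {"marketing"}, "read"))   # False
```

Group members can read but not write, and everyone else is shut out entirely, exactly as the rwxr-x--- string suggests.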

Data protection generally refers to encryption, both at rest and in transit. HTTP, RPC, JDBC, and ODBC all provide encryption in transit, or over the wire. HDFS currently has no native encryption, but there is a proposal in process to include this in a future release.

Governance and auditing are currently done component by component in Hadoop. There are some basic mechanisms in HDFS and MapReduce, the Hive metastore provides logging services, and Oozie provides logging for its job-management service.

This guide is a good place to start reading about a more secure Hadoop.


Sentry


License

Apache License, Version 2.0

Activity

High

Purpose

Provide a base level of authorization in Hadoop

Official Page

https://incubator.apache.org/projects/sentry.html

Hadoop Integration

API Compatible (Incubator project; work in progress)

If you need authorization services in Hadoop, one possibility is Sentry, an Apache Incubator project to provide authorization services to components in the Hadoop ecosystem. The system currently defines a set of policy rules in a file: groups, mappings of groups to roles, and rules that define the privileges roles hold over resources. You can think of this as role-based access control (RBAC). Your application then calls a Sentry API with the name of the user, the resource the user wishes to access, and the manner of access. The Sentry policy engine then checks whether the user belongs to a group with a role that permits use of the resource in the manner requested. It returns a binary yes/no answer to the application, which can then respond appropriately.
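The decision flow described above can be sketched in a few lines. This is not Sentry’s actual API; the users, groups, roles, and privileges below are invented to show the user → group → role → privilege lookup chain and the binary answer it produces:

```python
# Invented policy data, mirroring the structure the paragraph describes.
GROUPS = {"alice": {"analysts"}, "bob": {"interns"}}      # user  -> groups
ROLES = {"analysts": {"analyst_role"}}                    # group -> roles
PRIVILEGES = {                                            # role  -> allowed (resource, action)
    "analyst_role": {("sales_db", "select")},
}

def is_authorized(user: str, resource: str, action: str) -> bool:
    """Binary yes/no: does any role reachable from the user's groups
    grant this (resource, action) pair?"""
    for group in GROUPS.get(user, ()):
        for role in ROLES.get(group, ()):
            if (resource, action) in PRIVILEGES.get(role, ()):
                return True
    return False

print(is_authorized("alice", "sales_db", "select"))  # True
print(is_authorized("bob", "sales_db", "select"))    # False
```

The application never learns why access was denied; like Sentry, the engine simply answers yes or no and leaves the response to the caller.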

At the moment, this is filesystem-based and works with Hive and Impala out of the box. Other components can utilize the API. One shortcoming of this system is that enforcement happens at the access layer: one could write a rogue MapReduce program that reads data directly, bypassing the restrictions Sentry applies through the Hive interface.

Incubator projects are not part of the official Hadoop distribution and should not be used in production systems.

Tutorial Links

There are a pair of excellent posts on the official Apache blog. The first post provides an overview of the technology, while the second post is a getting-started guide.

Example Code

Configuration of Sentry is fairly complex and beyond the scope of this book. The Apache blog posts referenced here are an excellent resource for readers looking to get started with the technology.

There is very succinct example code in this Apache blog tutorial.

Kerberos


License

MIT license

Activity

High

Purpose

Secure Authentication

Official Page

http://web.mit.edu/kerberos

Hadoop Integration

API Compatible

One common way to authenticate in a Hadoop cluster is with a security tool called Kerberos. Kerberos is a network-based tool distributed by the Massachusetts Institute of Technology that provides strong authentication based upon the exchange of encrypted tickets between clients requesting access and the servers providing it.

The model is fairly simple. Clients register with the Kerberos key distribution center (KDC) and share their password. When a client wants access to a resource like a file server, it sends a request to the KDC with some portion encrypted with this password. The KDC attempts to decrypt this material; if successful, it sends back a ticket-granting ticket (TGT) to the client, containing material encrypted with the client’s passcode. When the client receives the TGT, it sends a request back to the KDC asking for access to the file server. The KDC sends back a ticket with portions encrypted with the file server’s passcode. From then on, the client and the file server use this ticket to authenticate.

The notion is that the file server, which might be very busy with many client requests, is not bogged down with the mechanics of keeping many user passcodes. It just shares its passcode with the KDC and uses the ticket the client has received from the KDC to authenticate.
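The exchange above can be made concrete with a toy simulation. Real Kerberos uses proper ciphers, timestamps, and nonces; here “encryption” is a simple keystream stand-in (built from SHA-256) so that the ticket flow, and the fact that the file server never sees the client’s password, stays visible:

```python
import hashlib
import json
import os

def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudorandom bytes from a key (toy cipher, not for real use)."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def enc(key: bytes, data: dict) -> bytes:
    raw = json.dumps(data).encode()
    return bytes(a ^ b for a, b in zip(raw, keystream(key, len(raw))))

def dec(key: bytes, blob: bytes) -> dict:
    return json.loads(bytes(a ^ b for a, b in zip(blob, keystream(key, len(blob)))))

# Long-term secrets shared with the KDC at registration time.
client_key = b"alice-password"
fileserver_key = b"fileserver-secret"

# 1. The KDC mints a session key and wraps it twice: once for the
#    client, and once for the file server (that wrapped blob is the ticket).
session_key = os.urandom(16).hex()
for_client = enc(client_key, {"session": session_key})
ticket = enc(fileserver_key, {"session": session_key, "user": "alice"})

# 2. The client recovers the session key with its own password;
#    the file server unwraps the ticket with its own passcode.
client_session = dec(client_key, for_client)["session"]
server_view = dec(fileserver_key, ticket)

# 3. Both sides now share a session key, yet the file server never
#    handled alice's long-term password.
print(client_session == server_view["session"])  # True
```

Note that the file server only ever needed its own passcode: everything it learns about the client arrives inside the ticket, which is exactly the offloading the previous paragraph describes.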

Kerberos is widely considered tedious to set up and maintain. For this reason, there is some active work in the Hadoop community to provide a simpler and more effective authentication mechanism.

Tutorial Links

This lecture provides a fairly concise and easy-to-follow description of the technology.

Example Code

An effective Kerberos installation can be a daunting task and is well beyond the scope of this book. Many operating system vendors provide a guide for configuring Kerberos. For more information, refer to the guide for your particular OS.

Knox


License

Apache License, Version 2.0

Activity

Medium

Purpose

Secure Gateway

Official Page

https://knox.apache.org

Hadoop Integration

Fully Integrated

Securing a Hadoop cluster is often a complicated, time-consuming endeavor fraught with trade-offs and compromise. The largest contributing factor to this challenge is that Hadoop is made of a variety of different technologies, each of which has its own idea of security.

One common approach to securing a cluster is to simply wrap the environment with a firewall (“fence the elephant”). This may have been acceptable in the early days when Hadoop was largely a standalone tool for data scientists and information analysts, but the Hadoop of today is part of a much larger big data ecosystem and interfaces with many tools in a variety of ways. Unfortunately, each tool seems to have its own public interface, and if a security model happens to be present, it’s often different from that of any other tool. The end result of all this is that users who want to maintain a secure environment find themselves fighting a losing battle of poking holes in firewalls and attempting to manage a large variety of separate user lists and tool configurations.

Knox is designed to help combat this complexity. It is a single gateway that sits between systems external to your Hadoop cluster and those internal to it. It also provides a single security interface with authorization, authentication, and auditing (AAA) capabilities that integrates with many standard systems, such as Active Directory and LDAP.
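The practical effect is that clients address one well-known host instead of each service’s own endpoint. The sketch below builds a Knox-proxied WebHDFS URL and a basic-auth header; the gateway hostname and the topology name (“sandbox”) are placeholders you would replace with your own deployment’s values:

```python
import base64

def knox_webhdfs_url(gateway: str, topology: str, path: str, op: str) -> str:
    """Build a WebHDFS URL routed through the Knox gateway rather than
    addressed to a NameNode directly."""
    return f"https://{gateway}:8443/gateway/{topology}/webhdfs/v1{path}?op={op}"

def basic_auth_header(user: str, password: str) -> str:
    """Knox can authenticate callers itself (e.g., against LDAP) using
    ordinary HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

url = knox_webhdfs_url("knox.example.com", "sandbox", "/tmp", "LISTSTATUS")
print(url)  # https://knox.example.com:8443/gateway/sandbox/webhdfs/v1/tmp?op=LISTSTATUS
```

Because every request funnels through the gateway, the firewall only needs one hole, and the internal cluster topology stays hidden from external clients.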

Tutorial Links

The folks at Hortonworks have put together a very concise guide for getting a minimal Knox gateway going. If you’re interested in digging a little deeper, the official quick-start guide, which can be found on the Knox home page, provides a considerable amount of detail.

Example Code

Even a simple configuration of Knox is beyond the scope of this book. Interested readers are encouraged to check out the tutorials and quickstarts.