HDFS for the Batch Layer
By Nathan Marz and James Warren, authors of Big Data: Principles and best practices of scalable realtime data systems
The high-level requirements for storing data in the Lambda Architecture batch layer are straightforward. In fact, you can map these requirements to a required checklist for a storage solution. In this article, Big Data authors Nathan Marz and James Warren walk you through a batch layer storage solution based on Hadoop Distributed File System (HDFS) and explain how it meets their requirements checklist.
After you learn how to design a data model for the master dataset, the next step is learning how to physically store the master dataset in the batch layer so that it can be processed easily and efficiently. The master dataset is typically too large to exist on a single server, so you must choose how to distribute your data across multiple machines. The way you store your master dataset will impact how you consume it, so it's vital to devise your storage strategy with your usage patterns in mind.
In this article, you'll learn how to store your master dataset using Hadoop Distributed File System (HDFS). We'll begin with an introductory look at HDFS, and we'll conclude with a checklist that summarizes how HDFS meets the storage requirements of the batch layer.
HDFS and Hadoop MapReduce are the two prongs of the Hadoop project: a Java framework for distributed storage and distributed processing of large amounts of data. Hadoop is deployed across multiple servers, typically called a cluster, and HDFS is a distributed and scalable filesystem that manages how data is stored across the cluster. Hadoop is a project of significant size and depth, and we'll provide only a high-level description in this article.
A Hadoop cluster has two types of HDFS nodes:
- Multiple datanodes
- A single namenode
When you upload a file to HDFS, it undergoes the process shown in figure 1.
Figure 1 Files are chunked into blocks, which are dispersed to datanodes in the cluster.
The file is first chunked (#1) into blocks of a fixed size, typically between 64MB and 256 MB. Each block is then replicated across multiple (typically three) datanodes that are chosen at random (#2). The namenode (#3) keeps track of the file-to-block mapping and where each block is located.
Distributing a file across many nodes allows it to be easily processed in parallel as shown in figure 2.
Figure 2 The namenode provides the client application with the block locations of a distributed file stored in HDFS.
When a program needs to access a file stored in HDFS, it contacts the namenode (#1) to determine which datanodes host the file contents. The program then accesses the file (#2) for processing. By replicating each block across multiple nodes, your data remains available even when individual nodes are offline.
Getting started with Hadoop
Setting up Hadoop can be an arduous task. Hadoop has numerous configuration parameters that should be tuned for your hardware to perform optimally. To avoid getting bogged down in details, we recommend downloading a preconfigured virtual machine for your first encounter with Hadoop. A virtual machine accelerates your learning of HDFS and MapReduce, and you'll have a better understanding when setting up your own cluster. At the time of this writing, Hadoop vendors Cloudera, Hortonworks, and MapR all have made images publicly available.
Implementing a distributed filesystem is a difficult task; this article covers the basics from a user perspective. Let's now explore how to store a master dataset using HDFS.
Storing a master dataset with HDFS
As a filesystem, HDFS offers support for files and directories, which makes storing a master dataset on HDFS straightforward. You store data units sequentially in files, with each file containing megabytes or gigabytes of data. All of the files of a dataset are then stored together in a common folder in HDFS.
To add new data to the dataset, you create and upload another file containing the new information. For example, suppose you want to store all logins on a server. The following are example logins:
$ cat logins-2012-10-25.txt alex 192.168.12.125 Thu Oct 25 22:33 - 22:46 (00:12) bob 192.168.8.251 Thu Oct 25 21:04 - 21:28 (00:24) charlie 192.168.12.82 Thu Oct 25 21:02 - 23:14 (02:12) doug 192.168.8.13 Thu Oct 25 20:30 - 21:03 (00:33) ...
To store this data on HDFS, you create a directory for the dataset and upload the logins file (logins-2012-10-25.txt):
$ hadoop fs -mkdir /logins $ hadoop fs -put logins-2012-10-25.txt /logins
The hadoop fs commands are Hadoop shell commands that interact directly with HDFS. A full list of commands is available at http://hadoop.apache.org/. In the second line, uploading a file automatically chunks and distributes the blocks across the datanodes. You can list the directory contents:
$ hadoop fs -ls -R /logins -rw-r--r-- 3 hdfs hadoop 175802352 2012-10-26 01:38 /logins/logins-2012-10-25.txt
The ls command is based on the UNIX command of the same name.
And you can verify the contents of the file:
$ hadoop fs -cat /logins/logins-2012-10-25.txt alex 192.168.12.125 Thu Oct 25 22:33 - 22:46 (00:12) bob 192.168.8.251 Thu Oct 25 21:04 - 21:28 (00:24) ...
As shown in figure 1, the file was automatically chunked into blocks and distributed among the datanodes when it was uploaded. To identify the blocks and their locations, use the following command, which runs a HDFS filesystem checking utility:
$ hadoop fsck /logins/logins-2012-10-25.txt -files -blocks -locations /logins/logins-2012-10-25.txt 175802352 bytes, 2 block(s): #A OK 0. blk_-1821909382043065392_1523 len=134217728 #B repl=3 [10.100.0.249:50010, 10.100.1.4:50010, 10.100.0.252:50010] 1. blk_2733341693279525583_1524 len=41584624 repl=3 [10.100.0.255:50010, 10.100.1.2:50010, 10.100.1.5:50010]
#A File stored in two blocks #B IP addresses and port numbers of the datanodes hosting each block
Nested folders provide an easy implementation of vertical partitioning. For this logins file example, you may want to partition your data by login date. The layout shown in figure 3 stores each day's information in a separate subfolder so that a function can pass over data not relevant to its computation.
Figure 3 A vertical partitioning scheme for login data. By separating information for each date in a separate folder, a function can select only the folders containing data relevant to its computation.
If you're new to HDFS, it's a platform worth investigating for several reasons: it's an open-source project with an active developer community; it's tightly coupled with Hadoop MapReduce, a distributed computing framework; and it's widely adopted and deployed in production systems by hundreds of companies.
Meeting the batch layer storage requirements with HDFS
Table 1 summarizes how HDFS meets the storage requirements of the batch layer. We covered most of the checklist discussion points in this article, so they should look familiar.
Table 1 HDFS checklist of storage requirements
|writes||Efficient appends of new data||Appending new data is as simple as adding a new file to the folder containing the master dataset.|
|Scalable storage||HDFS evenly distributes the storage across a cluster of machines. You increase storage space and I/O throughput by adding more machines.|
|reads||Support for parallel processing||HDFS integrates with Hadoop MapReduce, a parallel computing framework that can compute nearly arbitrary functions on the data stored in HDFS.|
|Ability to vertically partition data||Vertical partitioning is done grouping data into subfolders. A function can read only the select set of subfolders needed for its computation.|
|both||Tunable storage/processing costs||You have full control over how you store your data units within the HDFS files. You choose the file format for your data as well as the level of compression.|
Beyond storing your master dataset, HDFS can also help you maintain it, primarily through two operations:
- Appending new files to the master dataset folder
- Consolidating data to remove small files
These operations must be bulletproof to preserve the integrity and performance of the batch layer. See chapter 3 of Big Data for a discussion of these tasks as well as potential pitfalls.
HDFS is a powerful tool for storing your data. This article explored how to store a master dataset using HDFS and explained how it meets the storage requirements of the batch layer.
Among the many candidates available for storing the master dataset, HDFS is most commonly chosen because of its integration with Hadoop MapReduce.
Hadoop in Action
Pig in Action