A big data implementation is only as good as its filesystem. From an architectural standpoint, managing the sheer volume and throughput of data is a challenge. Big data solutions typically use large, distributed arrays of servers and specialized software. From a risk management standpoint, huge amounts of data moving across distributed servers also require exceptional built-in fault tolerance.
In this article, we present four leading contenders for big data filesystems.
MapReduce: A Key Function
MapReduce is not a filesystem but a programming model for processing data in parallel across a cluster. A map step categorizes input records into key-value pairs, the framework sorts and groups those pairs by key, and a reduce step summarizes each group. The framework also handles fault tolerance by rerunning tasks that fail. MapReduce is robust and thorough, and at one point it was the processing model of choice for big data filesystems.
Of our four candidates, two use MapReduce and two do not.
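As a rough illustration of the categorize, sort, and summarize flow described above, here is a minimal single-machine sketch in Python. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example and are not part of any Hadoop API; a real framework runs these phases in parallel on many servers.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Summarize all values collected for one key."""
    return key, sum(values)

def word_count(documents):
    # Map every document, then group by word, then reduce each group.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big cluster", "data moves fast"])
print(counts)
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, the work can be split across machines without the pieces needing to coordinate.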
1. HDFS: Hadoop Distributed File System
HDFS is the storage layer of Hadoop, and the two names are often used interchangeably. It is extremely popular and has gained a great deal of prominence in the big data world. Hadoop uses MapReduce as a key function of its data management. It is an open-source system, written in Java and designed to run on low-cost hardware. With that in mind, Hadoop was built with exceptional fault tolerance. Hadoop terminology describes how data is moved and stored in terms of “nodes.” NameNodes are “controller” nodes of a sort, managing how files are split and distributed as well as housing metadata. DataNodes run on multiple servers across the cluster and are controlled by NameNodes. DataNodes store data in blocks and serve I/O requests.
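The division of labor between NameNodes and DataNodes can be sketched with a toy block map. This is only an illustration under stated assumptions: the 128 MB block size and three replicas are assumed values for the example, the round-robin placement is a simplification (real HDFS placement is rack-aware), and `plan_blocks` and `readable_after_failure` are hypothetical helpers, not HDFS functions.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size for this sketch (128 MB)
REPLICATION = 3                 # assumed number of replicas per block

def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return NameNode-style metadata: block index -> DataNodes holding a replica."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    num_blocks = max(1, -(-file_size // block_size))  # ceiling division
    block_map = {}
    for block in range(num_blocks):
        # Simplified round-robin placement; each replica lands on a different node.
        block_map[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return block_map

def readable_after_failure(block_map, failed):
    """A file stays readable if every block has at least one surviving replica."""
    return all(
        any(node not in failed for node in nodes) for nodes in block_map.values()
    )

nodes = ["dn1", "dn2", "dn3", "dn4"]
block_map = plan_blocks(300 * 1024 * 1024, nodes)
print(block_map)
print(readable_after_failure(block_map, {"dn1", "dn2"}))
```

The block map is the kind of metadata a NameNode keeps; the blocks themselves live only on the DataNodes, which is why losing a couple of low-cost servers does not lose the file.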
Hadoop is a product of the Apache Software Foundation. Apache also contributes our second entry, which might be considered a next-generation big data file solution.
2. Apache Spark
Apache Spark moves beyond MapReduce for its data manipulation, building instead on an abstraction called the resilient distributed dataset (RDD). Spark is fast and flexible with sophisticated analytics, including the ability to run interactive analytical applications against streaming data in real time. It won the 2014 GraySort Benchmark contest by sorting 100 TB of data three times faster than Hadoop on one-tenth of the machines. Spark has APIs available in Java, Python, Scala, SQL, and other languages.
Spark is compatible with Hadoop and HDFS, so legacy Hadoop systems can upgrade easily. Like Hadoop, Spark distributes its work over a large cluster of servers. Scalability is a key benefit: Spark can run on thousands of nodes. It can be installed to run on its own standalone cluster manager, on other cluster managers such as Mesos or Hadoop YARN, or in the Amazon EC2 cloud. It can access data from HDFS, Cassandra, HBase, Hive, Tachyon, and other Hadoop sources.
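To give a feel for what "resilient" means in an RDD, here is a toy sketch in plain Python. It is not Spark's API: `ToyRDD` is a hypothetical class written for this article. The idea it illustrates is that transformations are only recorded, and any partition can be rebuilt from the source data by replaying that record.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: source partitions plus a recorded lineage."""

    def __init__(self, partitions, lineage=()):
        self._source = partitions  # original input partitions, never mutated
        self._lineage = lineage    # tuple of (kind, function) steps

    def map(self, fn):
        # Transformations are lazy: record the step, compute nothing yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def compute_partition(self, index):
        """Rebuild one partition from the source by replaying the lineage."""
        data = self._source[index]
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def collect(self):
        # An action triggers the actual computation across all partitions.
        return [
            x for i in range(len(self._source)) for x in self.compute_partition(i)
        ]

numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = even_squares.collect()
print(result)
```

If the server holding the second partition failed, calling `compute_partition(1)` again on another machine would reproduce the same data, without keeping a second copy of the intermediate results.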
3. QFS: Quantcast File System
The Quantcast File System (QFS) is an alternative to HDFS that remains compatible with MapReduce processing. It was designed to run more efficiently than Hadoop, due in large part to a different approach to redundancy and fault tolerance. Using Reed-Solomon error correction, QFS provides essentially the same fault tolerance with half the disk space. Half the disk space means half the data that needs to be written and moved around the cluster. On a large distributed server array moving massive amounts of data, that leads to dramatically improved performance.
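The "half the disk space" claim can be checked with simple arithmetic. The sketch below assumes three full copies for replication and a Reed-Solomon layout of six data stripes plus three parity stripes; neither figure is stated in the text above, so treat both as assumptions chosen for the example.

```python
def overhead_replication(copies):
    """Raw bytes stored per byte of user data when keeping full copies."""
    return float(copies)

def overhead_erasure(data_stripes, parity_stripes):
    """Raw bytes stored per byte of user data with Reed-Solomon style striping."""
    return (data_stripes + parity_stripes) / data_stripes

# Assumed parameters for this sketch: 3 copies versus 6 data + 3 parity stripes.
replicated = overhead_replication(3)
erasure_coded = overhead_erasure(6, 3)
ratio = erasure_coded / replicated

print(f"replication stores {replicated:.1f}x the data")
print(f"erasure coding stores {erasure_coded:.1f}x the data")
print(f"erasure coding needs {ratio:.0%} of the replicated disk space")
```

Under these assumptions, three copies survive the loss of any two, while the 6+3 layout survives the loss of any three stripes, so the space saving does not come at the cost of weaker fault tolerance.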
QFS was developed to meet Quantcast’s own data needs. Quantcast specializes in measuring audience engagement on Internet sites, processing over 800,000 transactions per second across more than 100 million websites.
4. GlusterFS
GlusterFS is a filesystem from Red Hat, Inc., the company best known for its enterprise Linux operating system. It is open source and uses its own protocol for data management rather than MapReduce. Like the other candidates, it manages massive data by using scalable, distributed storage. GlusterFS gets excellent file look-up speed by using an elastic hash algorithm rather than a centralized metadata server. GlusterFS has been most notably used for cloud computing, streaming media, and content delivery. Like Hadoop, it can run on low-cost commodity computer hardware. GlusterFS is known for scalability to massive sizes with excellent performance.
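The idea behind hash-based look-up can be sketched in a few lines of Python. This is a simplified illustration, not GlusterFS's actual algorithm or hash function: `locate_brick` is a hypothetical helper, and SHA-256 is used only because it ships with Python. The point is that any client can work out where a file lives from its path alone, without asking a metadata server.

```python
import hashlib

def locate_brick(path, bricks):
    """Pick a storage brick from the file path alone, with no metadata server."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:4], "big")  # 32-bit hash value
    # Each brick owns one contiguous slice of the 32-bit hash space.
    range_size = 2**32 // len(bricks)
    index = min(hash_value // range_size, len(bricks) - 1)
    return bricks[index]

bricks = ["server1:/brick", "server2:/brick", "server3:/brick"]
placements = {f"/media/video{i}.mp4": None for i in range(1000)}
for path in placements:
    placements[path] = locate_brick(path, bricks)

print(locate_brick("/media/video0.mp4", bricks))
```

Every client that runs the same function on the same path gets the same answer, so look-ups need no central coordination and files spread across the bricks roughly evenly.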
The two Apache products, Hadoop and Spark, are the most widely accepted for a variety of enterprise applications. Hadoop is a solid classic; Spark is a next-generation innovator that offers improved performance. Either QFS or GlusterFS could be a fine choice; however, they are more likely to be outliers in the big data filesystem community.