Maybe we’re talking more about algaecide and not chlorine, but microbiology aside, a data lake often gets rather cloudy and disorganized shortly after being opened for use. Hadoop’s promise of schema on read lures many in but often ends up forcing a soul-searching reevaluation of one’s principles related to data management — not to mention a new strategy (and cost) for cleaning up a swampy data lake.
Fortunately, data management solutions for Hadoop are just now starting to come into their own. Much like Linux as an operating system, the packages required for Hadoop as a “data operating system” are starting to be refined by Hadoop vendors. However, all of these point products can create technology overload for consumers. Initial hurdles to creating production-grade data lakes with point products typically include managing ingestion across various technologies in the Hadoop stack, basic Hadoop file system security, and metadata and data lifecycle management.
The key here — your chlorine — is automation. Automating key processes, including data ingestion, metadata management, security and data lifecycle management, will help keep your data lake clean. Let's investigate each of these elements of data lake automation in more depth.
Automated Data Ingestion
Using Hadoop forces the use of a mix of technologies. It's not uncommon to traverse four or five Apache projects, each with its own programming requirements, to enable a single use case in Hadoop. Integrating these technologies means working across many interfaces, from command lines to web services specific to each technology. Ingesting data programmatically while keeping it private and secure, yet still permitting deep technical functionality, is a fine balancing act. To manage it, ingestion must be automated — and automated to the point that it requires little manual intervention.
Managing ingestion is not just about organizing the process of data movement but also about data placement: where does the data land once ingested into Hadoop? Organizing your file system into zones is probably the easiest way to separate the wheat from the chaff. Some data is needed only transiently; some data must always be retained for regulatory purposes. Dedicating areas of the file system to a variety of uses streamlines both data ingest and data access after ingest. Keeping temporary files in a raw zone, for example, logically isolates single-copy (replication factor one) data from production data. A sandbox zone enables what may be the most important function of a data lake: free-form investigation of data sets by data scientists exploring and creating tomorrow's analytics based on their total knowledge of the business need. The combined use of zones and automated data ingestion is the first step in hydrating a data lake, but security of the lake is often the most debated.
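To make the zone idea concrete, here is a minimal sketch of how automated ingestion might route each source into a predictable, date-partitioned landing path. The zone names, paths and replication factors are assumptions for illustration, not a standard layout:

```python
from datetime import date

# Hypothetical zone layout; names, paths and retention choices are
# assumptions -- adjust to your organization's policies.
ZONES = {
    "raw":     {"path": "/data/raw",     "replication": 1},  # transient landing area
    "trusted": {"path": "/data/trusted", "replication": 3},  # cleansed production data
    "sandbox": {"path": "/data/sandbox", "replication": 2},  # data-science exploration
}

def landing_path(source: str, zone: str, ingest_date: date) -> str:
    """Build a date-partitioned path so every automated ingest
    lands in a predictable place inside its zone."""
    z = ZONES[zone]
    return f"{z['path']}/{source}/{ingest_date:%Y/%m/%d}"

# Example: route a CRM extract into the raw zone.
print(landing_path("crm_extract", "raw", date(2016, 3, 1)))
# -> /data/raw/crm_extract/2016/03/01
```

Because every source follows the same routing function, downstream jobs can locate data by convention rather than by per-source configuration.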
Metadata Management
It’s very important to know what's in your data lake. As your data grows, it becomes hard to track how much data you have, how many sources feed it, or even whether each ingest succeeded in a process that could span hundreds or thousands of sources.
However, tagging data as it flows in is just one-half of the equation. You also must be able to search and find data sets, and have the ability to assess the quality of each process.
Packages like Apache Atlas, built on HBase and Solr and now included in the Hortonworks distribution, are starting to provide metadata services for Hadoop. For larger organizations, closed source alternatives exist that usually provide some form of RESTful access. Regardless of where you keep metadata, it absolutely must be kept organized and searchable. Failing to do so is the single most likely cause of the “algae bloom” in your data lake.
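The tag-and-search half of the equation can be sketched with a toy in-memory catalog. This is purely illustrative — Apache Atlas and its commercial counterparts provide this at scale — but it shows the shape of the problem: tag data as it flows in, record ingest quality, and make both searchable:

```python
# Toy metadata catalog: one entry per ingest, carrying tags and a
# simple quality score. Field names here are illustrative assumptions.
catalog = []

def register(dataset: str, tags: set, records_ingested: int, records_failed: int):
    """Tag a data set as it flows in and record how clean the ingest was."""
    catalog.append({
        "dataset": dataset,
        "tags": tags,
        "quality": 1 - records_failed / max(records_ingested, 1),
    })

def search(tag: str):
    """Find every data set carrying a given tag."""
    return [entry["dataset"] for entry in catalog if tag in entry["tags"]]

register("crm_extract", {"pii", "raw"}, 10_000, 50)
register("weblogs", {"raw"}, 500_000, 0)
print(search("pii"))  # -> ['crm_extract']
```

A real metadata service adds lineage, schemas and access policies on top, but the core contract — register on ingest, search later — is the same.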
Hadoop Security and Data Privacy
In its infancy, Hadoop was not very secure. It didn’t need to be, because it was not originally intended for enterprise use. As it matured, security features common to modern enterprise IT architecture began to make their way into Hadoop architectures. The file system matured to include POSIX-compliant ACLs in HDFS (ACLs and ACEs, in MapR parlance), providing a more mature expression of file permissions as the foundation of Hadoop security. That maturity is what enables the basic process of creating data lake zones: organized sets of folders with appropriate permissions and policies, backed by additional distribution-specific technology.
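As a sketch of how zone permissions can be automated rather than applied by hand, the snippet below generates HDFS ACL commands from a declarative zone map. The group names and paths are assumptions; the `hdfs dfs -setfacl -m` syntax itself is standard HDFS:

```python
# Declarative map from zone path to ACL entries. Groups and paths are
# hypothetical; tune to your own zones and least-privilege model.
ZONE_ACLS = {
    "/data/raw":     ["group:ingest:rwx", "group:analysts:---"],
    "/data/trusted": ["group:ingest:rwx", "group:analysts:r-x"],
    "/data/sandbox": ["group:analysts:rwx"],
}

def acl_commands():
    """Emit one `hdfs dfs -setfacl` command per zone ACL entry."""
    cmds = []
    for path, entries in ZONE_ACLS.items():
        for entry in entries:
            cmds.append(f"hdfs dfs -setfacl -m {entry} {path}")
    return cmds

for cmd in acl_commands():
    print(cmd)
```

Keeping the zone-to-ACL mapping in one place means a new zone, or a policy change, is a one-line edit rather than a manual pass over the file system.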
Technologies like Apache Ranger and Sentry are key to implementing least privilege in Hadoop. More importantly, the ideas of encryption at rest and other data privacy measures such as tokenization and masking are critical to providing the balance of appropriate access without undue burden on users. Some solutions integrate measures like tokenization into a data ingestion workflow as a configuration option, while in others it remains a special exception.
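To show what a tokenization or masking step in an ingestion workflow might look like, here is a minimal sketch using a keyed hash. Everything here — the key, the field names, the token length — is a placeholder; production systems typically call a vaulted tokenization service rather than a bare HMAC:

```python
import hmac
import hashlib

# Placeholder key for illustration only; real deployments use a
# managed key or an external tokenization service.
SECRET_KEY = b"replace-with-a-managed-key"

def tokenize(value: str) -> str:
    """Replace a sensitive value with a stable, non-reversible token,
    so joins across data sets still work without exposing the value."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask(value: str, visible: int = 4) -> str:
    """Masking alternative: keep only the trailing characters visible."""
    return "*" * (len(value) - visible) + value[-visible:]

record = {"ssn": "123-45-6789", "name": "Jane Doe"}
record["ssn_token"] = tokenize(record.pop("ssn"))  # sensitive field removed
print(mask("123-45-6789"))  # -> *******6789
```

The point of integrating this as a configuration option in the ingest workflow, rather than a special exception, is that no sensitive value ever lands in the lake in the clear.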
Note that there are some Hadoop tools like Apache Falcon or Cloudera Navigator that help manage data governance, but these rely on third-party products for certain functionality like encryption and tokenization.
Data Lifecycle Management
An awesome new feature in HDFS, introduced by HDFS-2832, is called Archival Storage: a single DataNode can now manage multiple storage types. This feature can be used to implement temperature zones for data management, placing data in specific areas of your cluster based on its age or storage type. The automated movement of data, driven by policies specific to your organization, is the core of data lifecycle management. Data is often accessed most frequently when newly imported, and in some cases it’s useful to place that data in memory for the most rapid access.
As data ages, it might make more sense to move it to a node with a fast CPU and solid state storage. As it ages further, it can move to slower nodes with classic spinning, but ultimately cheaper, disks and eventually to archival nodes. Again, using policies to control data storage helps optimize data movement in your lake. Much like a pond fountain, the automated movement of data keeps important data fresh.
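An age-based tiering policy like the one described above can be sketched as a small lookup that maps data age onto the storage policies HDFS-2832 provides. The age thresholds and example path are assumptions; the policy names (LAZY_PERSIST, ALL_SSD, HOT, COLD) and the `hdfs storagepolicies` command are standard HDFS:

```python
# Age thresholds (in days) are illustrative; set them to match your
# organization's access patterns.
TIERS = [
    (7,   "LAZY_PERSIST"),  # under a week old: keep in memory
    (30,  "ALL_SSD"),       # under a month: solid state storage
    (365, "HOT"),           # under a year: ordinary spinning disk
]

def storage_policy(age_days: int) -> str:
    """Map the age of a data set onto an HDFS storage policy."""
    for max_age, policy in TIERS:
        if age_days < max_age:
            return policy
    return "COLD"           # older than a year: archival nodes

def apply_policy_cmd(path: str, age_days: int) -> str:
    """Emit the HDFS command that pins a path to its tier."""
    return (f"hdfs storagepolicies -setStoragePolicy "
            f"-path {path} -policy {storage_policy(age_days)}")

print(apply_policy_cmd("/data/trusted/crm_extract/2015", 400))
# -> hdfs storagepolicies -setStoragePolicy -path /data/trusted/crm_extract/2015 -policy COLD
```

Run nightly over the date-partitioned zones, a loop like this is the "pond fountain": data drifts to cheaper storage on its own, with no administrator in the loop.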
Don't Wait, Automate
Automation is the key to working at the scale of big data. All of the considerations above are essential to “hydrating” and creating an effective, ultimately “algae-free” data lake. Keep in mind that the point of a data lake is to work with Hadoop — that is, big data — at scale; leveraging automation keeps administrators and users focused on results and analytics, not administration. Managed data ingestion as part of your automation process brings consistency across the data lake. Data security is probably worthy of an entire series of posts, but it's safe to say that keeping data private and secure from external threats, while enforcing least privilege within the organization, is paramount to every topic discussed above. While every organization has unique implementation requirements, there remains commonality in architecture.