Data Fracking: Going Deep Into the Data Lake Using Drill
How to augment a data lake with Apache Drill, the open source SQL query engine, and open it up to BI tools through SQL.
Your data lake is finally live. After months and months of planning, designing, tinkering, configuring and reconfiguring, your company is ready to see the fruits of your labor. There’s just one issue: the quarter close is coming up, and data analysts are asking for their functionality yesterday, not next week. That means there’s no time to go through the motions of setting up workflows, rewriting queries to function on Hive or HBase, and working through the kinks of a new architecture. The data lake may be the best, most flexible, and most scalable architecture available, but there is one thing it is not: quick to deploy. How can all of your hard-won socialization and hype for the data lake be saved? Enter Apache Drill.
Drill is a relatively new but quickly emerging technology in the Apache family of projects that allows low-latency, schema-free, ANSI SQL queries over structured and semi-structured data. That might sound like a lot of functionality. It is. Drill can run on everything from a personal laptop to a dedicated cluster, and can be pointed at just about any data store, as long as a Drill plugin (basically an optimized SerDe) exists for that data type. Given the rapid iteration and expansion of the open source world, plugins already exist for most common formats. Although it seems absurd to think that anything Hadoop-related could be deployed in less than 10 minutes, a single Drill node can be stood up and running queries with most of those 10 minutes to spare.
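As a rough illustration of how little is needed once a single node is up, the sketch below submits a smoke-test query through Drill's REST endpoint. It assumes a Drill instance on localhost with the web port left at its default of 8047, and it queries the `employee.json` sample that ships on Drill's classpath. The function names `build_query_request` and `run_query` are made up for this example.

```python
import json
import urllib.request

# Assumption: Drill runs locally and its web port is the default 8047
DRILL_URL = "http://localhost:8047/query.json"

# employee.json is sample data bundled with Drill and exposed by the cp plugin
SMOKE_TEST_SQL = "SELECT full_name, position_title FROM cp.`employee.json` LIMIT 3"


def build_query_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query to a running Drill node and return the rows as dicts."""
    with urllib.request.urlopen(build_query_request(sql, url)) as response:
        return json.load(response)["rows"]
```

With a node running, `run_query(SMOKE_TEST_SQL)` should return a short list of rows. The same statement can be pasted directly into the Drill shell instead.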
So where does Drill fit into the data lake? The architecture diagram below shows the standard data lake architecture, augmented with Apache Drill.
There are two main entry points for Drill in the data lake:
- Ingestion of ad-hoc, schema-free or transient datastores, within or outside the data lake
- On-demand reporting consumption of data from anywhere within the data lake
The theme common to both is the ad-hoc nature of the data, enabled by Drill’s swift, in-memory processing, which is especially well-suited to real-time queries. This, coupled with ANSI SQL compliance and a relatively low-effort setup, makes Drill a good fit for providing quick, efficient and user-friendly time-to-value in the data lake.
Let’s look a little more in depth at the use cases outlined in the architecture diagram.
Drill was initially developed to be a schema-agnostic, fully ANSI-compliant query tool for fast, lightweight, yet powerful data exploration. This flexibility has spurred some interesting use cases outside of this scope, especially as the number of data types compatible with the tool has grown. Drill is especially intriguing as an on-demand ingestion tool for the data lake, because it has no requirements for qualifying data beforehand, besides knowing the data format.
Here’s how it could work. Say I’m bringing a new service online, which creates an arbitrary JSON output. The typical ingestion cycle for this data might be bringing it in via a tool like Flume or Kafka, then registering it in HCatalog and defining some transformations in HBase to get it into a usable form for users. This entire process might take multiple discovery sessions with the data owner, a couple of iterations of configuration testing to make sure that the ingestion engine, metastore, databases and other Hadoop components play nicely, and a few hundred lines of code. With Drill, here’s the code:
-- Write CTAS output as Parquet for this session
alter session set `store.format`='parquet';
-- Create a table directly from the raw JSON file
create table myTable as select * from dfs.`/Path/to/file/filename.json`;
And that’s it. There will be instances when more complex JSON files require some knowledge of data structure, and more complex SQL statements might be necessary, but in these scenarios, any system will necessitate some iteration. For rapid discovery and iteration, Drill makes it easy.
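To make the "more complex JSON" case concrete, here is a minimal sketch that generates the statements for a file whose records contain an array. It uses Drill's `FLATTEN` function to turn each array element into its own row. The helper name `build_flatten_ctas`, the file path, the field names `order_id` and `items`, and the target table in the `dfs.tmp` workspace are all hypothetical placeholders, not details from a real dataset.

```python
def build_flatten_ctas(table, source_path, id_field, array_field):
    """Return the Drill SQL statements that unnest one array field into Parquet.

    All four arguments are placeholders chosen for illustration.
    """
    return [
        # Same session option as the one-liner above: write CTAS output as Parquet
        "ALTER SESSION SET `store.format` = 'parquet'",
        # FLATTEN emits one output row per element of the repeated field
        f"CREATE TABLE {table} AS "
        f"SELECT t.`{id_field}`, FLATTEN(t.`{array_field}`) AS `{array_field}_item` "
        f"FROM dfs.`{source_path}` t",
    ]


# Hypothetical input: one JSON object per order, each with an "items" array
statements = build_flatten_ctas(
    "dfs.tmp.`orders_flat`", "/data/orders.json", "order_id", "items"
)
for statement in statements:
    print(statement + ";")
```

The printed statements are meant to be run in one Drill session, since `ALTER SESSION` only affects the session that issues it.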
Consumption and Exploration
Visualization, exploration, and end-user tools have advanced by leaps and bounds as tools such as Spark have become more widely adopted. Drill still stands in a class of its own as an on-demand exploration tool, and is especially well-suited to the data lake architecture as a precursor to standardization into more formal processes. Drill has recently added drivers for industry-standard visualization tools such as Tableau and Excel, and is starting to be integrated into Zeppelin (especially within the MapR community) to provide instant connectivity to the entire data ecosystem. Because Drill is also JDBC-compatible, this is a potent combination that is only now starting to find its legs. The capabilities of a fully Hadoop-integrated, schema-free, in-memory data notebook tool like Zeppelin are boundless, especially in the context of a data lake.
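For readers wiring up a BI or notebook tool over JDBC, the connection string is usually the only Drill-specific piece. The sketch below assembles the common URL forms: direct to one Drillbit, through ZooKeeper for a cluster, or to a local embedded instance. The helper name `drill_jdbc_url` and the host names are invented for this example, and the `drill` root and `drillbits1` cluster ID are Drill's defaults, which your installation may override.

```python
def drill_jdbc_url(zk_hosts=None, drillbit=None, zk_root="drill", cluster_id="drillbits1"):
    """Build a Drill JDBC connection URL.

    zk_hosts: list of "host:port" ZooKeeper nodes (clustered Drill)
    drillbit: "host:port" of a single Drillbit to connect to directly
    With neither argument, the URL targets a local embedded Drill.
    """
    if drillbit:
        return f"jdbc:drill:drillbit={drillbit}"
    if zk_hosts:
        return f"jdbc:drill:zk={','.join(zk_hosts)}/{zk_root}/{cluster_id}"
    return "jdbc:drill:zk=local"


# Hypothetical hosts, for illustration only
print(drill_jdbc_url())
print(drill_jdbc_url(drillbit="drill01.example.com:31010"))
print(drill_jdbc_url(zk_hosts=["zk1.example.com:2181", "zk2.example.com:2181"]))
```

The resulting string goes into the tool's JDBC connection settings, alongside the Drill driver class `org.apache.drill.jdbc.Driver`.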
When to Drill, and When to Dig
Drill is clearly a powerful tool. It allows near-instant access across the entire data lake, and allows users and administrators alike to provide quick value to the business. Especially at the beginning of the data lake lifecycle, these two features are extremely attractive -- they can help lift initiatives off the ground, get users excited, and show stakeholders the power of data lakes. Just as with any Hadoop technology, though, Drill is not a silver bullet. Some points to keep in mind are:
- Drill is in-memory. While this allows it to be super-responsive, it also means that as data scales up, Drill becomes more expensive, requiring more RAM. Cheap, distributed storage is a key selling point of Hadoop and the data lake, so for large datasets, Drill should be used with caution. Also of note: in-memory processes sacrifice redundancy for speed.
- Mature data lakes have a defined process for a reason. Drill makes getting your data lake up and running quick and flexible, and it may seem tempting to ditch or neglect the formal data lake architecture altogether. As your data lake grows in users, size and complexity, however, the governance, repeatability, and stability of a formal data lake become vital. Process is especially important in industries with sensitive data and regulation.
- Ingestion with Drill is not well documented or measured. While ingesting with Drill is great for one-off, exploratory or non-standard datasets, it is far from the most efficient use of Drill, or the most efficient tool for the purpose. Drill works best when “Drillbits,” the Drill daemon processes that execute queries on each node, are co-located with datastores, which can’t be guaranteed in most exploratory cases. A separate ingestion process may also be necessary, depending on where the Drill output is placed. Audit and traceability may also be impacted.
Another Drop in the Lake
Drill is a powerful emerging technology in the Hadoop ecosystem and holds plenty of promise for application in the data lake. At the end of the day, though, it is best viewed as another piece of the architecture, rather than a standalone bypass around the data lake. Implemented correctly, Apache Drill can bring a new level of flexibility and agility to your data lake as a whole, providing functionality that any user, administrator or business owner will gladly welcome.
Published at DZone with permission of Greg Wood, DZone MVB. See the original article here.