An Alternative Way of Loading or Importing Data Into Hive Tables Running on Top of HDFS-Based Data Lake
Before beginning this article, I want to thank all the teams — from the cleaning and sanitation staff to the nurses, doctors, and others — who are consistently battling the COVID-19 pandemic across the globe.
The fundamental goal of this article is to show how we can load or import data into Hive tables without explicitly executing the "load" command. With this approach, data scientists can query, or even visualize in various data visualization tools, the raw data for quick investigation in a scenario where it is continuously ingested into an HDFS-based data lake from external sources on a regular schedule. Otherwise, the "load" command would have to be executed as an additional step to stack the processed data into Hive's tables.
Here we consider an existing environment with the following components, set up either in the cloud or on-premises:
- A multi-node cluster with HDFS installed and configured, and Hive running on top of HDFS with a MySQL database as the metastore.
- Raw data is dumped from multiple sources into the HDFS data lake landing zone by leveraging Kafka, Flume, a custom data ingestion tool, etc.
- From the landing zone, raw data moves to the refining zone, where junk is cleaned out, and subsequently into the processing zone, where the clean data gets processed. Here we assume the processed data is stored as text files in CSV format.
Hive's input is directory-based, as with many Hadoop tools: the input for an operation is taken as the files in a given directory. Using the HDFS command line, let's create a directory in HDFS with
$ hdfs dfs -mkdir <<name of the folder>>. The same can be done through the Hadoop administrative UI, depending on the user's HDFS ACL settings.
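As a concrete sketch of the step above (the HDFS path and the local source path are hypothetical, and the commands assume a running Hadoop cluster):

```shell
# Create the target directory in HDFS (path is hypothetical)
hdfs dfs -mkdir -p /data/processed/orderdata

# Copy the processed CSV files from the local processing zone into it
hdfs dfs -put /tmp/processing-zone/orders_*.csv /data/processed/orderdata/

# Verify that the files landed
hdfs dfs -ls /data/processed/orderdata
```

The `-p` flag creates any missing parent directories, which is convenient when the zone layout is nested.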
Now, move the data files from the processing zone into the newly created HDFS folder. As an example, we consider simple order data that was ingested into the data lake and eventually transformed into consolidated text files in CSV format after cleaning and filtering. A few rows look as follows:
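The original sample file is shown as an image in the article; purely as an illustration, a consolidated order CSV might look like this (column layout and values are hypothetical):

```
10001,2020-04-01,C-101,Laptop,1,55000.00
10002,2020-04-01,C-102,Mouse,2,1200.00
10003,2020-04-02,C-103,Keyboard,1,2500.00
```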
The next step is to create an external table in Hive, where the location is the path of the HDFS directory created in the previous step. Here is the kind of command we could run in the Hive CLI. The LOCATION clause tells Hive where to find the input files.
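A sketch of such a statement, assuming a hypothetical column layout for the order CSV and a hypothetical HDFS path:

```sql
-- Hypothetical schema and location; adjust to the actual CSV layout
CREATE EXTERNAL TABLE IF NOT EXISTS OrderData (
  order_id     INT,
  order_date   STRING,
  customer_id  STRING,
  product      STRING,
  quantity     INT,
  amount       DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/processed/orderdata';
```

Because the table is EXTERNAL, dropping it later removes only the metadata; the files in the HDFS directory are left intact.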
If the command worked, an OK is printed, and upon executing a Hive query, the Hive engine fetches the data internally from these input text files by leveraging a processing engine — MapReduce, or another such as Spark or Tez. Ideally, Spark or Tez can be configured as the processing engine in hive-site.xml in order to improve the data processing speed for a huge volume of input files.
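For example, the execution engine can be switched in hive-site.xml (a configuration sketch; Tez must already be installed on the cluster for this value to work):

```xml
<property>
  <name>hive.execution.engine</name>
  <!-- default is "mr" (MapReduce); "tez" or "spark" can be set instead -->
  <value>tez</value>
</property>
```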
Once the table creation succeeds, we can cross-check it in the "metastore" schema in the MySQL database. To do that, log in to the MySQL CLI, which might be running on a different node in the cluster, connect to the "metastore" database, and pull records from the "TBLS" table. This displays the created Hive table's information.
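A minimal sketch of that check from the MySQL CLI (the database name "metastore" is as configured during Hive setup):

```sql
-- On the node hosting the Hive metastore database
USE metastore;
SELECT TBL_ID, TBL_NAME, TBL_TYPE FROM TBLS;
-- The new table should be listed with TBL_TYPE = 'EXTERNAL_TABLE'
```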
The import can be verified through the Hive’s CLI by listing the first few rows in the table.
hive> SELECT * FROM OrderData;
Additionally, the "ANALYZE TABLE ... COMPUTE STATISTICS" command can be executed in the Hive CLI to gather statistics on that table, which the optimizer uses when planning jobs that run against it.
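For instance (statements sketched against the OrderData table created above):

```sql
-- Gather table-level statistics (number of rows, files, total size, etc.)
ANALYZE TABLE OrderData COMPUTE STATISTICS;

-- Inspect the collected statistics and other table details
DESCRIBE FORMATTED OrderData;
```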
The primary advantage of this approach is that data can be queried and analyzed within a minimal span of time, without additionally performing an explicit data-loading operation. It also helps data scientists check the quality of data before running their machine learning jobs on the data lake or cluster. You can read here how to install and configure Apache Hive on a multi-node Hadoop cluster with MySQL as the metastore.
Published at DZone with permission of Gautam Goswami. See the original article here.