DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Join us today at 1 PM EST: "3-Step Approach to Comprehensive Runtime Application Security"
Save your seat
  1. DZone
  2. Data Engineering
  3. Databases
  4. Partitioned Hive Table Across Storage Systems Using Alluxio

Partitioned Hive Table Across Storage Systems Using Alluxio

Combining a Hive table with Alluxio allows you to increase the efficiency and performance of a single table across multiple storage resources.

Madan Kumar user avatar by
Madan Kumar
·
May. 22, 19 · Tutorial
Like (3)
Save
Tweet
Share
2.76K Views

Join the DZone community and get the full member experience.

Join For Free

Today when we create a Hive table, it is a common technique to partition the table across different values and ranges to improve query performance and reduce maintenance cost. However, Hive cannot access a single table directly using a single query with the data of this Hive table across different mediums of storage and different clusters. This becomes a need when the data volume grows too large to fit a single medium of storage or cluster, and also when the users need to take into account the following considerations:

  • Storage cost, where some partitions are less important than others and can be stored on cheaper storage tiers.
  • Regional compliance, where data cannot leave a region and cannot be persisted outside the region. 

This is where Alluxio comes in and interfaces with applications like Hive as a distributed virtual file system to create tables with multiple partitionings in a different storage system. For example, to reduce the storage cost with the same high performance in a multi-cluster environment, you can use Hive with Alluxio to create tables partitioned by date ranges to allow the most frequently-used datasets to reside on higher tiers of storage (MEM) and have older/less frequently accessed data in other storage services like HDFS with lower cost storage (SSD, HDD) or even remote cloud storage (like S3). In this regard, data will always reside in the under-storage system as the source of truth and can be residing temporarily in the Alluxio file system. 

This article aims to guide end users with leveraging Alluxio to create external tables in Hive backed by different file locations, each location however representing a subset of the overall data partitioned by the value of choice.

Example

Setup Alluxio

First, deploy an Alluxio cluster across four nodes (one Master, three Workers) on EC2 instances. Please refer to this documentation to install Alluxio cluster or use our Sandbox to deploy Alluxio on AWS EC2 with one click for free.

Alluxio has 2 mounts –  HDP as the root mount point, S3 bucket as the nested mount.

HDP is also deployed on 4 nodes. For this example, it is co-located on the Alluxio nodes, but could be a separate cluster.

[centos@ip-172-31-8-93 alluxio]$ ./bin/alluxio fs ls /
drwxr-xr-x  centos         centos                      1       PERSISTED 02-20-2019 00:31:53:234  DIR /hdfs
drwxr-xr-x  centos         centos                       1   NOT_PERSISTED 02-19-2019 22:21:46:005  DIR /local
drwxrwxrwx  madan          madan                        1       PERSISTED 02-19-2019 22:14:29:532  DIR /s3

Create Hive Tables

Run the following Hive DDL Statements to create an external Hive table call_center_s3:

create external table call_center_s3(
      cc_call_center_sk         bigint               
,     cc_call_center_id         string              
,     cc_rec_end_date          string                         
,     cc_closed_date_sk         bigint                       
,     cc_open_date_sk           bigint                       
,     cc_name                   string                   
,     cc_class                  string                   
,     cc_employees              int                       
,     cc_sq_ft                  int                       
,     cc_hours                  string                      
,     cc_manager                string                   
,     cc_mkt_id                 int                       
,     cc_mkt_class              string                      
,     cc_mkt_desc               string                  
,     cc_market_manager         string                   
,     cc_division               int                       
,     cc_division_name          string                   
,     cc_company                int                       
,     cc_company_name           string                      
,     cc_street_number          string                      
,     cc_street_name            string                   
,     cc_street_type            string                      
,     cc_suite_number           string                      
,     cc_city                   string                   
,     cc_county                 string                   
,     cc_state                  string                       
,     cc_zip                    string                      
,     cc_country                string                   
,     cc_gmt_offset             double                  
,     cc_tax_percentage         double
)
partitioned by (cc_rec_start_date string)
row format delimited fields terminated by '|' 
location 'alluxio://172.31.9.93:19998/s3/';


We start off by creating the original call_center table against the S3 bucket; notice here that when creating this external table, we do so with the cc_rec_start_date column partitioned in the table. This table is then altered by date and the location is modified accordingly. We alter the table with three values as follows:

ALTER TABLE call_center_s3
ADD PARTITION (cc_rec_start_date='1998-01-01')
location 'alluxio://172.31.9.93:19998/s3/';


We have altered the table and added the partition for the year 1998. We then allow this to remain in the Alluxio S3 file location as it represents colder data in a remote storage device.

ALTER TABLE call_center_s3
ADD PARTITION (cc_rec_start_date='2001-01-01')
location 'alluxio://172.31.9.93:19998/s3/';


We altered the table with the date range of 2001. This, too, we will allow to remain on remote S3 storage as it is less frequently accessed and more recent data we can store in an HDFS cluster on disk(SSD).

ALTER TABLE call_center_s3
ADD PARTITION (cc_rec_start_date='2002-01-01')
location 'alluxio://172.31.9.93:19998/hdfs/';


Finally, for data that is most current, we can add a partition and have it stored against the Alluxio HDFS file location, which is on SSDs. This location is on disk across the Alluxio nodes.

[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /hdfs
drwxr-xr-x  centos         centos                      9356       PERSISTED 04-19-2019 18:30:59:822  0% /hdfs/data-m-0099
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /s3
drwxr-xr-x  centos         centos                       9356   PERSISTED 04-19-2019 19:04:45:107  0% /s3/data-m-00099


In the above screenshot, we see that for the HDFS and S3 mount points it shows the data is still only being PERSISTED in the under file system and is not yet in Alluxio memory. This is evident by the fact that the percentage for the file, in this case, is 0%, which shows that the file is not yet in Alluxio MEM.

Now after we query against the Hive table we can verify that it is indeed reaching out to the respective understorage file locations by checking once again to see if that data is now in Alluxio MEM (cache_promote is set as default read type policy) and the file shows 100%. Also note, that the data that is in the S3 partition does not get pulled into Alluxio as that partition was eliminated based on the predicate by Hive runtime.

hive>  select * from call_center_s3 where cc_rec_start_date='2002-01-01';


Note this is the date range for the table residing against HDFS.

[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /hdfs
drwxr-xr-x  centos         centos                      9356       PERSISTED 04-19-2019 18:30:59:822  100% /hdfs/data-m-0099
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /s3
drwxr-xr-x  centos         centos                       9356   PERSISTED 04-19-2019 19:04:45:107  0% /s3/data-m-00099


Verifying through the Alluxio namespace that the query did bring data up from HDFS and not from another remote directory and persisted it in Alluxio MEM.

hive>  select * from call_center_s3 where cc_rec_start_date='1998-01-01';


Similarly, for S3 we query against the table with the date value set to 1998.

[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /hdfs
drwxr-xr-x  centos         centos                      9356       PERSISTED 04-19-2019 18:30:59:822  100% /hdfs/data-m-0099
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /s3
drwxr-xr-x  centos         centos                       9356   PERSISTED 04-19-2019 19:04:45:107  100% /s3/data-m-00099


Verify again through the Alluxio command line and see that the data was brought up from S3 and into the Alluxio namespace.

Conclusion

Many users leverage partitioning to speed up Hive performance and improve maintenance. This blog talks about an interesting way to combine Hive with Alluxio to store a single table but leveraging multiple different storage resources for higher efficiency.

Database Alluxio File system

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • The Role of Data Governance in Data Strategy: Part II
  • Memory Debugging: A Deep Level of Insight
  • A Real-Time Supply Chain Control Tower Powered by Kafka
  • 2023 Software Testing Trends: A Look Ahead at the Industry's Future

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: