Write CSV Data into Hive and Python

by Chase Seibert · Jun. 05, 13 · Interview

Apache Hive is a high-level, SQL-like interface to Hadoop. It lets you execute mostly unadulterated SQL, like this:

CREATE TABLE test_table(key string, stats map<string, int>);

The map column type is the only thing here that doesn’t look like vanilla SQL. Hive can use different storage backends for a given table, and map is used to interface with column-oriented backends like HBase. Because we won’t know ahead of time all the column names that could be in the HBase table, Hive just returns them all as a key/value dictionary. There are helpers to access individual columns by key, or even to pivot the map into one key per logical row.
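
As a minimal sketch of those two access patterns: the bracket lookup and LATERAL VIEW explode() below are standard HiveQL, held here as plain Python strings. The aliases s, stat_name and stat_value are made up for illustration, and explode_rows is only a hypothetical pure-Python stand-in that mimics the shape of the pivoted result; it does not talk to Hive.

```python
# HiveQL: read a single map entry by key (NULL when the key is absent).
SELECT_ONE_KEY = "SELECT key, stats['likes'] FROM test_table"

# HiveQL: pivot the map so that each map entry becomes its own row.
PIVOT_MAP = (
    "SELECT key, stat_name, stat_value "
    "FROM test_table "
    "LATERAL VIEW explode(stats) s AS stat_name, stat_value"
)


def explode_rows(rows):
    """Imitate the pivot in plain Python: one output row per map entry.

    Entries are sorted by name only to make this sketch deterministic.
    """
    return [
        (key, stat_name, stat_value)
        for key, stats in rows
        for stat_name, stat_value in sorted(stats.items())
    ]


print(explode_rows([("k1", {"comments": 0, "likes": 3})]))
```

A row whose map is empty produces no output rows at all in this sketch, which is worth keeping in mind when counting results in a test.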

As part of the Hadoop family, Hive is focused on bulk loading and processing. So it’s not a surprise that Hive versions before 0.14 do not support inserting raw values like the following SQL:

INSERT INTO suppliers (supplier_id, supplier_name) VALUES (24553, 'IBM');

However, for unit testing Hive scripts, it would be nice to be able to insert a few records manually. Then you could run your MapReduce HQL and validate the output. Luckily, Hive can load CSV files, so it’s relatively easy to insert a handful of records that way.

CREATE TABLE foobar(key string, stats map<string, bigint>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':' ;

LOAD DATA LOCAL INPATH '/tmp/foobar.csv' INTO TABLE foobar;

This will load a CSV file with the following data, where c4ca4-0000001-79879483-000000000124 is the key, and comments and likes are columns in a map.

c4ca4-0000001-79879483-000000000124,comments:0|likes:0
c4ca4-0000001-79879483-000000000124,comments:0|likes:0
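
To show how the three delimiters in the table definition line up with one line of the file, here is a rough Python sketch that splits a line the same way. The function name parse_line is hypothetical, it assumes integer values, and Hive's own handling of malformed input may well differ.

```python
def parse_line(line):
    """Split one line of the file into (key, stats) using the table's delimiters."""
    # FIELDS TERMINATED BY ',': the key comes first, the map text second.
    key, _, map_repr = line.rstrip("\n").partition(",")
    stats = {}
    if map_repr:
        # COLLECTION ITEMS TERMINATED BY '|': one item per map entry.
        for item in map_repr.split("|"):
            # MAP KEYS TERMINATED BY ':': name on the left, value on the right.
            name, _, value = item.partition(":")
            stats[name] = int(value)
    return key, stats


print(parse_line("c4ca4-0000001-79879483-000000000124,comments:0|likes:0"))
```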

Because I’ve been doing this quite a bit in my unit tests, I wrote a quick Python helper to dump a list of key/map tuples to a temporary CSV file, and then load it into Hive. This uses hiver to talk to Hive over Thrift.

import hiver
from django.conf import settings
from django.core.files.temp import NamedTemporaryFile


def _hql(hql):
    ''' run a single HQL statement on a fresh connection '''
    client = hiver.connect(settings.HIVE_HOST, settings.HIVE_PORT)
    try:
        client.execute(hql)
    finally:
        client.shutdown()


def insert(table_name, rows):
    ''' cannot insert single rows via hive, need to save to a temp file and bulk load that '''
    # text mode, so the str lines built below can be written as-is
    csv_file = NamedTemporaryFile(mode='w', delete=True)
    for row in rows:
        # row is a (key, dict) tuple; the dict becomes "name:value|name:value"
        map_repr = '|'.join('%s:%s' % (key, value) for key, value in row[1].items())
        csv_file.write(row[0] + "," + map_repr + "\n")
    # make sure everything is on disk before Hive reads the file
    csv_file.flush()
    try:
        # note: this drops and recreates the table on every call
        _hql('DROP TABLE IF EXISTS %s' % table_name)
        _hql("""
            CREATE TABLE %s (
                    key string,
                    stats map<string, int>
                )
            ROW FORMAT DELIMITED
            FIELDS TERMINATED BY ','
            COLLECTION ITEMS TERMINATED BY '|'
            MAP KEYS TERMINATED BY ':'
        """ % table_name)
        _hql("""
            LOAD DATA LOCAL INPATH '%s' INTO TABLE %s
        """ % (csv_file.name, table_name))
    finally:
        # closing the file also deletes it
        csv_file.close()

You can call it like this:

    insert('test_table', [
        ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}),
        ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}),
        ('c4ca4-0000001-79879496-000000000124', {'comments': 1, 'likes': 2}),
        ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}),
        ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}),
    ])
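
One thing to watch with this approach: a key, map name or value that itself contains ',', '|' or ':' would silently corrupt the delimited file. The sketch below is one possible guard, not part of the original helper. The names format_row and DELIMITERS are hypothetical, and it assumes keys and map names are strings.

```python
# Characters that have a structural meaning in the delimited file.
DELIMITERS = (",", "|", ":", "\n")


def format_row(key, stats):
    """Serialize one (key, stats) pair, refusing text that contains a delimiter."""
    # Check the key plus every map name and value, all compared as strings.
    fields = [key] + [str(part) for item in stats.items() for part in item]
    for field in fields:
        if any(d in field for d in DELIMITERS):
            raise ValueError("delimiter character in %r" % field)
    map_repr = "|".join("%s:%s" % (name, value) for name, value in stats.items())
    return key + "," + map_repr + "\n"


print(repr(format_row("c4ca4-0000001-79879483-000000000124", {"comments": 1, "likes": 2})))
```

Keeping the serialization in a small pure function like this also means it can be unit tested without a running Hive server.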

Published at DZone with permission of Chase Seibert, DZone MVB. See the original article here.
