DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Data Privacy and Security: A Developer's Guide to Handling Sensitive Data With DuckDB
  • DuckDB for Python Developers
  • What Is SQL Injection and How Can It Be Avoided?
  • Snowflake Data Time Travel

Trending

  • Bridging Gaps in SOC Maturity Using Detection Engineering and Automation
  • Stop Using Python for Your GenAI Apps, Use Go and Genkit Instead
  • You Learned AI. So Why Are You Still Not Getting Hired?
  • AWS Managed Database Observability: Monitoring DynamoDB, ElastiCache, and Redshift Beyond CloudWatch
  1. DZone
  2. Data Engineering
  3. Databases
  4. Write CSV Data into Hive and Python

Write CSV Data into Hive and Python

By 
Chase Seibert user avatar
Chase Seibert
·
Jun. 05, 13 · Interview
Likes (0)
Comment
Save
Tweet
Share
14.7K Views

Join the DZone community and get the full member experience.

Join For Free

Apache Hive is a high level SQL-like interface to Hadoop. It lets you execute mostly unadulterated SQL, like this:

CREATE TABLE test_table(key string, stats map<string, int>);

The map column type is the only thing that doesn’t look like vanilla SQL here. Hive can actually use different backends for a given table. Map is used to interface with column oriented backends like HBase. Essentially, because we won’t know ahead of time all the column names that could be in the HBase table, Hive will just return them all as a key/value dictionary. There are then helpers to access individual columns by key, or even pivot the map into one key per logical row.

As part of the Hadoop family, Hive is focused on bulk loading and processing. So it’s not a surprise that Hive does not support inserting raw values like the following SQL:

INSERT INTO suppliers (supplier_id, supplier_name) VALUES (24553, 'IBM');

However, for unit testing Hive scripts, it would be nice to be able to insert a few records manually. Then you could run your map reduce HQL, and validate the output. Luckily, Hive can load CSV files, so it’s relatively easy to insert a handful or records that way.

CREATE TABLE foobar(key string, stats map<string, bigint>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':' ;

LOAD DATA LOCAL INPATH '/tmp/foobar.csv' INTO TABLE foobar;

This will load a CSV file with the following data, where c4ca4-0000001-79879483-000000000124 is the key, and comments and likesare columns in a map.

c4ca4-0000001-79879483-000000000124,comments:0|likes:0
c4ca4-0000001-79879483-000000000124,comments:0|likes:0

Because I’ve been doing this quite a bit in my unit tests, I wrote a quick Python helper to dump a list of key/map tuples to a temporary CSV file, and then load it into Hive. This uses hiver to talk to Hive over thrift.

import hiver
from django.core.files.temp import NamedTemporaryFile
def _hql(self, hql):
    client = hiver.connect(settings.HIVE_HOST, settings.HIVE_PORT)
    try:
        client.execute(hql)
    finally:
        client.shutdown()
def insert(self, table_name, rows):
    ''' cannot insert single rows via hive, need to save to a temp file and bulk load that '''
    csv_file = NamedTemporaryFile(delete=True)
    for row in rows:
        map_repr = '|'.join('%s:%s' % (key, value) for key, value in row[1].items())
        csv_file.write(row[0] + "," + map_repr + "\n")
    csv_file.flush()
    try:
        _hql('DROP TABLE IF EXISTS %s' % table_name)
        _hql("""
            CREATE TABLE
%s (
                    key string,
                    map<string, int>
                )
            ROW FORMAT DELIMITED
            FIELDS TERMINATED BY ','
            COLLECTION ITEMS TERMINATED BY '|'
            MAP KEYS TERMINATED BY ':'
        """ % (table_name))
        _hql("""
            LOAD DATA LOCAL INPATH '%s' INTO TABLE %s
        """ % (csv_file.name, table_name)
    finally:
        csv_file.close()

You can call it like this:

    insert('test_table', [
        ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}),
        ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}),
        ('c4ca4-0000001-79879496-000000000124', {'comments': 1, 'likes': 2}),
        ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}),
        ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}),
    ])
Database CSV sql Python (language) Data (computing)

Published at DZone with permission of Chase Seibert. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Data Privacy and Security: A Developer's Guide to Handling Sensitive Data With DuckDB
  • DuckDB for Python Developers
  • What Is SQL Injection and How Can It Be Avoided?
  • Snowflake Data Time Travel

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook