Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Lipwig for Hive Is the Greatest!

DZone's Guide to

Lipwig for Hive Is the Greatest!

An example on how to use HIVE for queries, and a deep dive into the explanation of why you should use it.

· Big Data Zone
Free Resource

See how the beta release of Kubernetes on DC/OS 1.10 delivers the most robust platform for building & operating data-intensive, containerized apps. Register now for tech preview.

Ok, this is the coolest thing this Hive user has seen all day.

As you probably know, if you prepend the word EXPLAIN to your SQL query and then run it, Hive prints out a text description of the query plan. This lets you explore the effects such variations as code changes, the use of analyze, turning on/off the cost-based optimizer (CBO), and so on. It’s an essential tool for optimizing Hive.

The output of EXPLAIN is far from pretty, but fortunately, a simple pipeline of Linux commands can give you a slick graphical rendition like the one below.

By the way, I’m doing this on the Hortonworks Sandbox for HDP 2.3, but you can do exactly the same thing for any Linux installation of Hive/Hadoop. The Sandbox is a free VM image of a single-node installation the Hortonworks Hadoop distribution. Mine is running on VirtualBox, but they have the same thing for VMware.

 explain

  1. First, you’ll use standard Hive to generate EXPLAIN output in JSON format.
  2. Then you’ll use a Python script called lipwig.py that is most courteously supplied to us by the t3rmin4t0r. His program reads the JSON and emits something called a dot file.
  3. The dot file is a generic, humanly readable and portable description of a graphical object. Such a file is the input to a program called dot, which knows how to output graphics for the described image in a form that any computer can understand. Read more about dot here.
  4. In this case, dot will output a file in SVG format. SVG stands for Scalable Vector Graphics, and most machines will know what to do with it.

I describe below the sequence of steps used to generate the graphical representation of EXPLAIN’s output shown above. Before you try it out on your own code, you’ll need to get lipwig.py from github.com.  I assume you know what git is, but if you don’t, check this out first.

Use cd to go to a convenient workspace directory and pull down thelipwig project using the command below. This will “clone,” i.e., get a complete copy of the source code tree for a program and put it in its own directory within the directory you are working in. In this case, it’s just one file plus a readme.txt file.

git clone https://github.com/t3rmin4t0r/lipwig.git

You’ll also need the dot program, which is part of the graphvizpackage from AT&T.  You can install this on Linux by running the following:

sudo yum install graphviz

Now you should be ready to rock and roll.

First tack the keywords “EXPLAIN FORMATTED” in front of the query you want to analyze. My source is in a file called myprog.sqland it looks like this:


EXPLAIN FORMATTED SELECT
     br.browser, ct.connection_type, co.country, date_time, evar3, evar28, evar29, first_hit_time_gmt, geo_dma,
     javascript, last_hit_time_gmt, post_purchaseid, last_purchase_time_gmt, new_visit, post_browser_height,
     post_browser_width, post_cookies, post_currency, post_cust_hit_time_gmt, post_evar3, post_evar28,
     post_evar29, post_event_list, post_persistent_cookie, post_product_list, post_visid_high, post_visid_low,
     post_zip, prev_page, visit_num, visit_page_num, visit_start_time_gmt
FROM
     hit_data hd join browser br join connection_type ct join country co
ON
     hd.browser=br.id and hd.connection_type=ct.id and hd.country=co.id
WHERE
     hd.post_product_list like "%hash::448163969%"
AND 
     hd.post_product_list like "%hash::1256898673%"
AND 
      hd.post_product_list not like "%hash::125680000%" 

Run it as follows to put the output of EXPLAIN FORMATTED into a file.  The ‘>’ sign puts the output into a file instead of dumping it on the screen.

hive -f myprog.sql > explain.json

The contents of explain.json is the raw JSON-formatted EXPLAINoutput. The JSON is human-readable, but only for specially gifted humans, unless you have set up your browser to display it.  If you’re using Chrome, try this page for a plugin. Something similar is available for almost any browser.

Use lipwig.py to create a dot file from the JSON.  If you’re running Hive, then you already have Python, so all you need to do is tell it where lipwig.py is. For me, the installation directory for lipwig.py is one directory up, hence the double dots, but yours is wherever you installed it. The lipwig.py program dumps the graphical specificiation into a file called explain.dot.

python ../lipwig/lipwig.py explain.json > explain.dot

Now, all you need to do is create the .svg file. My dot file was calledexplain.dot, so I tell dot to create a type .svg file called explain.svg, using explain.dot as input.

dot -Tsvg -o explain.svg  explain.dot

That’s it.  If you double click on explain.svg you’ll see your graphical output.  Otherwise, you could use something likeGhostScript (which a freeware PostScript utility) to display the output directly out of dot as follows:

dot -Tsvg explain.dot | gs –

SVG files work fine for most purposes, but as it happens, the WordPress editor doesn’t support that format. If you also have some other reason to want an alternate format, -Tjpg and  -Tpng, etc. will give you alternate formats.

New Mesosphere DC/OS 1.10: Production-proven reliability, security & scalability for fast-data, modern apps. Register now for a live demo.

Topics:
big data ,hadoop ,hive

Published at DZone with permission of Peter Coates, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}