
Hive and Presto Clusters With Jupyter on AWS, Azure, and Oracle

See how Jupyter users can leverage PyHive to run queries from Jupyter Notebooks against Qubole Hive and Presto clusters in a secure way.

By Mikhail Stolpner · Oct. 26, 17 · Big Data Zone · Tutorial

Jupyter™ notebooks are one of the most popular IDEs among Python users. Traditionally, Jupyter users work with small or sampled datasets that do not require distributed computing. However, as data volumes grow and enterprises move toward a unified data lake, powering business analytics through parallel computing frameworks such as Spark, Hive, and Presto becomes essential.

We covered connecting Jupyter with a Qubole Spark cluster in the previous article. In this post, we will show how Jupyter users can leverage PyHive to run queries from Jupyter notebooks against Qubole Hive and Presto clusters in a secure way.

The following diagram depicts a high-level architectural view of the solution.

A Jupyter notebook running on your local computer will use the Qubole API to connect to a Qubole Spark cluster. This allows the notebook to execute SQL code on a Presto or Hive cluster using PyHive. Please follow the step-by-step guide below to enable this solution.

Step-by-Step Guide

  1. Follow the steps in this article to connect Jupyter with a Qubole Spark cluster.

  2. Navigate to the Clusters page on Qubole and click the ellipsis on the same Spark cluster you used in the previous step. Click Edit Node Bootstrap.

  3. Add the following command to the node bootstrap, outside of any conditional code, to make sure it runs on both master and slave nodes.

    pip install pyhive
  4. Start or restart the Spark cluster to activate PyHive.

  5. Set an Elastic IP for the master node in the cluster configuration for both the Hive and Presto clusters. This step is optional, but it makes it easier to reconnect to the Hive and Presto clusters after they restart.

  6. On the Hive cluster, enable HiveServer2.

  7. Make sure that port 10003 on the master node of the Hive cluster and port 8081 on the Presto cluster are open for access from the Spark cluster. You may need to create security groups and apply them as persistent security groups in the cluster configuration (an example of creating such rules on AWS follows this guide).

  8. Start or restart the Hive and Presto clusters and take note of the master DNS on the Clusters page. If you configured Elastic IPs in step 5, use them instead.

  9. Start Jupyter Notebook and open an existing PySpark notebook or create a new one. Please refer to this article for details on starting Jupyter Notebook.

  10. To connect to Hive, use the sample code below. Replace <master-node-dns> with the value from step 8.

    from pyhive import hive

    # Connect to HiveServer2 on the Hive cluster's master node
    hive_conn = hive.Connection(host="<master-node-dns>", port=10003)
    hive_cursor = hive_conn.cursor()
    hive_cursor.execute('SELECT * FROM your_table LIMIT 10')
    print(hive_cursor.fetchone())
  11. To connect to Presto, use the sample code below. Replace <master-node-dns> with the value from step 8 (an example of loading the results into a pandas DataFrame follows this guide).

    from pyhive import presto

    # Connect to the Presto coordinator on the cluster's master node
    presto_conn = presto.Connection(host="<master-node-dns>", port=8081)
    presto_cursor = presto_conn.cursor()
    presto_cursor.execute('SELECT * FROM your_table LIMIT 10')
    print(presto_cursor.fetchone())
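
If you manage the security groups from step 7 yourself on AWS rather than through the Qubole UI, the ingress rules can also be created programmatically. The following is a minimal sketch, assuming boto3 and placeholder security group IDs; the real group IDs and region depend on your account, and Qubole's persistent security groups can achieve the same result.

    import boto3

    # Hypothetical group IDs; replace with the real security groups of your clusters.
    SPARK_SG = "sg-spark-placeholder"    # Spark cluster (where the notebook code runs)
    HIVE_SG = "sg-hive-placeholder"      # Hive cluster master node
    PRESTO_SG = "sg-presto-placeholder"  # Presto cluster

    ec2 = boto3.client("ec2")  # assumes AWS credentials and region are already configured

    # Allow the Spark cluster to reach HiveServer2 (10003) and Presto (8081).
    for group_id, port in [(HIVE_SG, 10003), (PRESTO_SG, 8081)]:
        ec2.authorize_security_group_ingress(
            GroupId=group_id,
            IpPermissions=[{
                "IpProtocol": "tcp",
                "FromPort": port,
                "ToPort": port,
                "UserIdGroupPairs": [{"GroupId": SPARK_SG}],
            }],
        )

Once the connections from steps 10 and 11 work, the same DB-API connections can be handed to pandas to pull query results into a DataFrame. This is a minimal sketch rather than part of the original walkthrough; the table name is a placeholder.

    import pandas as pd
    from pyhive import presto

    # Reuse the Presto connection from step 11 to load results into a DataFrame.
    presto_conn = presto.Connection(host="<master-node-dns>", port=8081)
    df = pd.read_sql("SELECT * FROM your_table LIMIT 100", presto_conn)
    print(df.head())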

References

  • Connecting Jupyter to Qubole Spark

  • PyHive


Published at DZone with permission of Mikhail Stolpner, DZone MVB. See the original article here.

