
Hive and Presto Clusters With Jupyter on AWS, Azure, and Oracle


See how Jupyter users can leverage PyHive to run queries from Jupyter Notebooks against Qubole Hive and Presto clusters in a secure way.


Jupyter Notebooks are among the most popular IDE choices for Python users. Traditionally, Jupyter users work with small or sampled datasets that do not require distributed computing. However, as data volumes grow and enterprises move toward a unified data lake, powering business analytics through parallel computing frameworks such as Spark, Hive, and Presto becomes essential.

We covered connecting Jupyter with a Qubole Spark cluster in the previous article. In this post, we will show how Jupyter users can leverage PyHive to run queries from Jupyter Notebooks against Qubole Hive and Presto clusters in a secure way.

At a high level, the solution works as follows: a Jupyter Notebook running on your local computer connects through the Qubole API to a Qubole Spark cluster, and the notebook's kernel on that cluster then uses PyHive to execute SQL against a Hive or Presto cluster. Follow the step-by-step guide below to enable this solution.

Step-by-Step Guide

  1. Follow the steps in this article to connect Jupyter with a Qubole Spark cluster.

  2. Navigate to the Clusters page on Qubole and click the ellipsis on the same Spark cluster you used in the previous step. Click Edit Node Bootstrap.

  3. Add the following command to the node bootstrap, outside of any conditional code, so that it runs on both master and slave nodes. Installing the hive and presto extras pulls in the Thrift/SASL and HTTP client dependencies that PyHive needs for each connection type:

    pip install 'pyhive[hive,presto]'
  4. Start or restart the Spark cluster to activate PyHive.

  5. Set an Elastic IP for the master node in the cluster configuration of both the Hive and Presto clusters. This step is optional, but it makes reconnecting easier after a cluster restart, since the master's public DNS can change while an Elastic IP stays fixed.

  6. On the Hive cluster, enable HiveServer2.

  7. Make sure that port 10003 on the master node of the Hive cluster and port 8081 on the Presto cluster are open for access from the Spark cluster. You may need to create security groups and apply them as Persistent Security Groups in the cluster configuration. A quick connectivity check is sketched after this list.

  8. Start or restart the Hive and Presto clusters and take note of the Master DNS on the Clusters page. If you configured Elastic IPs in Step 5, use them instead.

  9. Start Jupyter Notebook and open an existing PySpark notebook or create a new one. Please refer to this article for details on starting Jupyter Notebook.

  10. To connect to Hive, use the sample code below, replacing <Master-Node-DNS> with the value from Step 8. (A follow-up example after this list shows loading the results into a pandas DataFrame.)

    from pyhive import hive

    # Connect to HiveServer2 on the Hive cluster's master node
    hive_conn = hive.Connection(host="<Master-Node-DNS>", port=10003)
    hive_cursor = hive_conn.cursor()
    hive_cursor.execute('SELECT * FROM your_table LIMIT 10')
    print(hive_cursor.fetchone())
  11. To connect to Presto, use the sample code below, again replacing <Master-Node-DNS> with the value from Step 8. (A sketch after this list shows sharing the same query code across both engines.)

    from pyhive import presto

    # Connect to the Presto coordinator on the cluster's master node
    presto_conn = presto.Connection(host="<Master-Node-DNS>", port=8081)
    presto_cursor = presto_conn.cursor()
    presto_cursor.execute('SELECT * FROM your_table LIMIT 10')
    print(presto_cursor.fetchone())
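
Before running the notebook code, it can help to confirm that the ports from Step 7 are actually reachable from the Spark cluster. Below is a minimal sketch, assuming you run it in a notebook cell; the hostnames are placeholders for the Master DNS values from Step 8:

    import socket

    # Placeholders -- substitute the Master DNS values from Step 8
    endpoints = {
        "hive": ("<Hive-Master-Node-DNS>", 10003),
        "presto": ("<Presto-Master-Node-DNS>", 8081),
    }

    for name, (host, port) in endpoints.items():
        try:
            # Attempt a plain TCP connection with a short timeout
            with socket.create_connection((host, port), timeout=5):
                print("{}: port {} is reachable".format(name, port))
        except OSError as err:
            print("{}: cannot reach port {} ({})".format(name, port, err))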
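
As a follow-up to Step 10, query results can also be pulled into a pandas DataFrame for further analysis. This is a sketch, assuming pandas is installed in the notebook environment; note that pandas.read_sql is documented primarily for SQLAlchemy connections, but it generally works with DB-API connections such as PyHive's:

    import pandas as pd
    from pyhive import hive

    hive_conn = hive.Connection(host="<Master-Node-DNS>", port=10003)

    # read_sql accepts a DB-API connection and returns a DataFrame
    df = pd.read_sql('SELECT * FROM your_table LIMIT 10', hive_conn)
    print(df.head())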
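
Finally, because both PyHive connections implement the Python DB-API, the query logic from Steps 10 and 11 can be shared across engines. A small sketch under that assumption, with placeholder hostnames:

    from pyhive import hive, presto

    def run_query(conn, sql):
        """Run a query over any DB-API connection and return all rows."""
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()

    hive_conn = hive.Connection(host="<Hive-Master-Node-DNS>", port=10003)
    presto_conn = presto.Connection(host="<Presto-Master-Node-DNS>", port=8081)

    # The same helper works against both engines
    for conn in (hive_conn, presto_conn):
        print(run_query(conn, 'SELECT * FROM your_table LIMIT 10'))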


Published at DZone with permission of Mikhail Stolpner, DZone MVB. See the original article here.
