Integrating Presto With CarbonData
Learn how to integrate these two open source technologies to get a high-powered way of querying large volumes of data.
Presto is a well-known open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. It was developed by Facebook to analyze petabytes of data and was later open sourced. Presto does not provide any storage of its own but can be used with a variety of data sources such as Hive, Cassandra, relational databases, and even some proprietary databases.
In this blog post, we are going to discuss how we can use Presto to query data from another up-and-coming open source solution, CarbonData. CarbonData is a fully indexed, columnar, Hadoop-native data store for processing heavy analytical workloads and detailed queries on big data. CarbonData allows faster interactive querying by using advanced columnar storage, indexing, compression, and encoding techniques to improve computing efficiency. Presto with CarbonData can speed up queries by an order of magnitude over petabytes of data.
To install Presto, you can download the tarball for the latest version from here and then untar it in the directory of your choice. The tarball contains a single top-level directory, in this case presto-server-0.187, which we will call the installation directory. All the configuration files for Presto lie in the etc folder inside the installation directory. Configure the Presto server as defined here according to your server settings. After installing and configuring the Presto server, you can start it from the installation directory. The following command runs Presto as a daemon:
bin/launcher start
Alternatively, if you want to run it in the foreground, you can use the following command. Personally, I prefer this one, as I can see all the log messages and errors on the screen.
bin/launcher run
The above steps get Presto running, but we still need to integrate it with CarbonData. To do that, first clone the CarbonData repository using the following command:
git clone https://github.com/apache/carbondata.git
Then you can do a complete build by running the following command inside the carbondata folder:
mvn -Pspark-2.1 -Phadoop-2.7.2 -DskipTests clean package
When the build is complete, you will be able to see the following folder created inside the CarbonData directory:
integration/presto/target/carbondata-presto-1.2.0-SNAPSHOT
Now we need to make changes on the Presto end so that the Presto engine can connect to CarbonData.
Step 1: Create a carbondata.properties file inside the etc/catalog/ folder in the Presto installation directory. Presto takes the catalog name from the file name, so this file defines the carbondata catalog that is used later with the CLI. The properties file has only two properties:
connector.name names the connector that Presto loads for this catalog; here it is carbondata.
carbondata-store specifies the CarbonData store location.
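As a rough sketch of Step 1, the following shell snippet writes such a file. The file is named carbondata.properties here because Presto derives the catalog name from the file name, which makes it match the --catalog carbondata argument passed to the CLI. PRESTO_HOME and CARBON_STORE are hypothetical variables introduced for this example, and the store path is only a placeholder that you must replace with your actual CarbonData store location.

```shell
# Hypothetical locations (assumptions); adjust both to your environment.
PRESTO_HOME="${PRESTO_HOME:-$PWD/presto-server-0.187}"
CARBON_STORE="${CARBON_STORE:-/path/to/carbondata/store}"

# Catalog definitions live in etc/catalog under the installation directory.
mkdir -p "$PRESTO_HOME/etc/catalog"

# Write the two properties described in Step 1.
cat > "$PRESTO_HOME/etc/catalog/carbondata.properties" <<EOF
connector.name=carbondata
carbondata-store=$CARBON_STORE
EOF
```

Restart the Presto server after adding or changing a catalog file so that the new catalog is picked up.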
Step 2: Go to the plugin folder inside the Presto installation directory and create a folder named after the value of the connector.name property. In this case, it is carbondata, as described in Step 1.
cd plugin
mkdir carbondata
Step 3: Copy all the JARs from integration/presto/target/carbondata-presto-1.2.0-SNAPSHOT to the carbondata folder created in Step 2.
cp <carbon-data-installation-directory>/integration/presto/target/carbondata-presto-1.2.0-SNAPSHOT/* <presto-installation-directory>/plugin/carbondata
Now you are all set to execute queries on CarbonData using Presto. To execute the queries, you can use the Presto CLI, which provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable. You can download the Presto CLI from here. After downloading it, rename the JAR to presto and make it executable with chmod +x so that it can be run as ./presto.
The following is the command to run the Presto CLI.
./presto --server localhost:8080 --catalog carbondata --schema default
Once the Presto CLI has started, you can run your queries against CarbonData through Presto.
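For non-interactive use, the CLI also accepts a query through its --execute option. The helper below is a minimal sketch, assuming the CLI was saved as ./presto and the server listens on localhost:8080 as in the command above; run_query and PRESTO_CLI are names made up for this example.

```shell
# Path to the Presto CLI executable (assumption: saved as ./presto).
PRESTO_CLI="${PRESTO_CLI:-./presto}"

# Run a single SQL statement against the carbondata catalog and exit.
run_query() {
  "$PRESTO_CLI" --server localhost:8080 --catalog carbondata --schema default --execute "$1"
}
```

For example, run_query "SHOW TABLES" lists the tables that Presto can see in the default schema of the carbondata catalog.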
This article was originally posted on the Knoldus blog.
Published at DZone with permission of Bhavya Aggarwal, DZone MVB. See the original article here.