Powering Big Data Processing in Postgres With Apache Spark
Powering Big Data Processing in Postgres With Apache Spark
Spark does not provide a storage layer, and instead, it relies on third-party storage providers. It also integrates seamlessly with Hadoop.
Join the DZone community and get the full member experience.Join For Free
The Architect’s Guide to Big Data Application Performance. Get the Guide.
Postgres plays a central role in today’s integrated data center. A powerful feature called a Foreign Data Wrapper (FDW) in Postgres supports data integration by combining data from multiple database solutions as if it were a single Postgres database. FDWs essentially act as pipelines connecting Postgres with external database solutions, including NoSQL solutions such as MongoDB, Cassandra, or Hadoop, and displaying the data in Postgres where it can be queried with SQL. The result is a seamless view of the data, and database administrators have much more control.
For more insight, read the blog by Stephen Horn, Why It’s Cool to Be an OLTP Database Again.
EnterpriseDB®(EDB™) invests significantly in research and development into FDWs and releases key advances to the open source community to further this capability for all Postgres users. As advances in Postgres progress, EnterpriseDB releases new versions of FDWs to take advantage of emerging capabilities (i.e., Hadoop, MongoDB, and MySQL). EnterpriseDB also offers EDB Postgres Data Adapters for Hadoop, MongoDB, and MySQL as packaged solutions for the EDB Postgres™ Platform. The FDWs developed by EnterpriseDB can be found on EDB’s GitHub page, or via StackBuilder Plus or yum.enterprisedb.com if you are an EnterpriseDB subscription holder.
The most recent advance from EnterpriseDB is a new version of the EDB Postgres Data Adapter for Hadoop with compatibility for the Apache Spark cluster computing framework. The new version gives organizations the ability to combine analytic workloads based on the Hadoop Distributed File System (HDFS) with operational data in Postgres, using an Apache Spark interface. (The new version was announced on February 8, 2017. Read the press release here.)
Below is a demonstration of the Hadoop FDW with Apache Spark. Apache Spark is a general purpose, distributed computing framework which supports a wide variety of uses cases. It provides real-time stream as well as batch processing with speed, ease of use, and sophisticated analytics. Spark does not provide a storage layer, and instead, it relies on third-party storage providers like Hadoop, HBASE, Cassandra, S3, and others. Spark integrates seamlessly with Hadoop and can process existing data. Spark SQL is 100 percent compatible with HiveQL and can be used as a replacement of hiveserver2, using Spark Thrift Server. (For background on the HDFS_FDW and how it works with Hive, please refer to the blog post Hadoop to Postgres - Bridging the Gap.)
Advantages of Apache Spark
Apache Spark is fast. For a comparison against Hive, see the blog post Hive vs. SparkSQL.
Apache Spark is general purpose, providing:
- Batch processing (MapReduce).
- Stream Processing (Storm).
- Interactive Processing (Impala).
- Graph Processing (Neo4J).
- Spark SQL (Hive).
Apache Spark supports many third-party storage providers and formats, such as:
- Amazon S3.
Using hdfs_fdw With Spark on Top of a Hadoop Cluster
The following components must be installed in order to use the hdfs_fdw:
- EDB Postgres™ Advanced Server 9.5 or PostgreSQL 9.5.
- Hadoop 2.6.4.
- Apache Spark 2.1.0.
- Spark Thrift Server.
- The hdfs_fdw extension.
- OS CentOS Linux release 7.2.1511 (Core).
The setup is as follows.
Steps to use the hdfs_fdw with Apache Spark:
Install EDB Postgres Advanced Server 9.5 and hdfs_fdw using the installer.
At the edb-psql prompt, issue the following commands:
CREATE EXTENSION hdfs_fdw; CREATE SERVER hdfs_svr FOREIGN DATA WRAPPER hdfs_fdw OPTIONS (host '127.0.0.1',port '10000',client_type 'hiveserver2'); CREATE USER MAPPING FOR enterprisedb server hdfs_svr; CREATE FOREIGN TABLE f_names_tab( a int, name varchar(255)) SERVER hdfs_svr OPTIONS (dbname 'testdb', table_name 'my_names_tab');
Please note that we are using the same port and client_type while creating the foreign server because the Spark Thrift Server is compatible with the Hive Thrift Server. Applications using Hiveserver2 would work with Spark without any code changes.
Download and install Apache Spark in local mode.
Test the Apache Spark installation using Spark shell:
./spark-shell Spark session available as 'spark'. Welcome to Spark Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_111) Type in expressions to have them evaluated. Type :help for more information. scala> val no = Array(1, 2, 3, 4, 5,6,7,8,9,10) no: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) scala> val noData = sc.parallelize(no) scala> noData.sum res0: Double = 55.0
In the folder
$SPARK_HOME/conf, create a file called
spark-defaults.xml containing the following line:
By default, Apache Spark uses derby for both metadata and the data itself (called warehouse in Apache Spark).In order to have Apache Spark use Hadoop as the warehouse, we have to add this property.
Start Spark Thrift Server:
Make sure Spark Thrift Server is running by checking the log file.
Run the following commands in the beeline command line tool:
./beeline Beeline version 1.0.1 by Apache Hive beeline> !connect jdbc:hive2://localhost:10000 abbasbutt '' org.apache.hive.jdbc.HiveDriver Connecting to jdbc:hive2://localhost:10000 Connected to: Spark SQL (version 2.1.0) Driver: Hive JDBC (version 1.0.1) Transaction isolation: TRANSACTION_REPEATABLE_READ 0: jdbc:hive2://localhost:10000> create database my_test_db; +---------+--+ | Result | +---------+--+ +---------+--+ No rows selected (0.379 seconds) 0: jdbc:hive2://localhost:10000> use my_test_db; +---------+--+ | Result | +---------+--+ +---------+--+ No rows selected (0.03 seconds) 0: jdbc:hive2://localhost:10000> create table my_names_tab(a int, name string) row format delimited fields terminated by ' '; +---------+--+ | Result | +---------+--+ +---------+--+ No rows selected (0.11 seconds) 0: jdbc:hive2://localhost:10000> 0: jdbc:hive2://localhost:10000> load data local inpath '/path/to/file/names.txt' into table my_names_tab; +---------+--+ | Result | +---------+--+ +---------+--+ No rows selected (0.33 seconds) 0: jdbc:hive2://localhost:10000> select * from my_names_tab; +-------+---------+--+ | a | name | +-------+---------+--+ | 1 | abcd | | 2 | pqrs | | 3 | wxyz | | 4 | a_b_c | | 5 | p_q_r | | NULL | NULL | +-------+---------+--+
Stop Apache Thrift Server:
Start Apache Thrift Server with no authentication:
./start-thriftserver.sh --hiveconf hive.server2.authentication=NOSASL.
Run the following command in edb-psql:
select * from f_names_tab; a | name ---+-------- 1 | abcd 2 | pqrs 3 | wxyz 4 | a_b_c 5 | p_q_r 0 | (6 rows)
Here are the corresponding files in Hadoop:
$ hadoop fs -ls /user/hive/warehouse/ Found 1 items drwxrwxr-x - user supergroup 0 2017-01-19 10:47 /user/hive/warehouse/my_test_db.db $ hadoop fs -ls /user/hive/warehouse/my_test_db.db/ Found 1 items drwxrwxr-x - user supergroup 0 2017-01-19 10:50 /user/hive/warehouse/my_test_db.db/my_names_tab
Download EDB Postgres Advanced Server 9.5 and the Data Adaptor for Hadoop from the Advanced Downloads page on the EnterpriseDB website.
A 17-minute demo video that shows the setup steps above is available here.
Published at DZone with permission of Abbas Butt . See the original article here.
Opinions expressed by DZone contributors are their own.