DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Last call! Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • The Magic of Apache Spark in Java
  • RION - A Fast, Compact, Versatile Data Format
  • How Trustworthy Is Big Data?
  • Fixing Common Oracle Database Problems

Trending

  • Unlocking AI Coding Assistants Part 2: Generating Code
  • Breaking Bottlenecks: Applying the Theory of Constraints to Software Development
  • Doris: Unifying SQL Dialects for a Seamless Data Query Ecosystem
  • Simplify Authorization in Ruby on Rails With the Power of Pundit Gem
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Reading Data From Oracle Database With Apache Spark

Reading Data From Oracle Database With Apache Spark

In this quick tutorial, learn how to use Apache Spark to read and use the RDBMS directly without having to go into the HDFS and store it there.

By 
Emrah Mete user avatar
Emrah Mete
·
Updated Feb. 01, 18 · Tutorial
Likes (12)
Comment
Save
Tweet
Share
145.6K Views

Join the DZone community and get the full member experience.

Join For Free

In this article, I will connect Apache Spark to Oracle DB, read the data directly, and write it in a DataFrame.

Following the rapid increase in the amount of data we produce in daily life, big data technology has entered our lives very quickly. Instead of traditional solutions, we are now using tools with the capacity to solve our business quickly and efficiently. The use of Apache Spark is a common technology that can fulfill our needs.

Apache Spark is based on a framework that can process data very quickly and distributedly. In this article, I will not describe Apache Spark technology in detail, so those who are interested in the details should check out the Apache Spark documentation.

The preferred method to process the data we store in our RDBMS databases with Apache Spark is to migrate the data to Hadoop first (HDFS), distributively read the data we have stored in Hadoop (HDFS), and process it with Apache Spark. As those with Hadoop ecosystem experience know, we are exchanging data between the Hadoop ecosystem and other systems (RDBMS-NoSQL) with tools that integrate into the Hadoop ecosystem with Sqoop. Sqoop is a data transfer tool that is easy to use, common, and efficient.

There is some cost involved in moving the data to be processed to the Hadoop environment before the RDBMS, and then importing the data to be processed with Apache Spark. The fact that we do not use the data that we have moved to HDFS will cause us to lose a certain amount of space in HDFS, and it will also increase the processing time. Instead of this method, there is a way with Apache Spark that reads and uses the RDBMS directly without having to go to the HDFS and store it there — especially afterward.

Let's see how to do this.

The technologies and versions I used are as follows:

  • Hadoop: Hadoop 2.7.1

  • Apache Spark: Apache Spark 2.1.0

  • Oracle database: Oracle 11g R2, Enterprise Edition

  • Linux: SUSE Linux

To do this, we need to have the ojdbc6.jar file in our system. You can use this link to download it.

We will create tables in the Oracle database that we will read from Oracle and insert sample data in them.

CREATE TABLE EMP
(
   EMPNO      NUMBER,
   ENAME      VARCHAR (10),
   JOB        VARCHAR (9),
   MGR        NUMBER,
   SAL        NUMBER,
   COMM       NUMBER,
   DEPTNO     NUMBER
);

INSERT INTO EMP VALUES
(7369, 'SMITH', 'CLERK', 7902, 800, 50, 20);

INSERT INTO EMP VALUES
(7499, 'ALLEN', 'SALESMAN', 7698, 1600, 300, 30);

INSERT INTO EMP VALUES
(7521, 'WARD', 'SALESMAN', 7698, 1250, 500, 30);

INSERT INTO EMP VALUES
(7566, 'JONES', 'MANAGER', 7839, 2975, NULL, 20);

INSERT INTO EMP VALUES
(7654, 'MARTIN', 'SALESMAN', 7698, 1250, 1400, 30);

INSERT INTO EMP VALUES
(7698, 'BLAKE', 'MANAGER', 7839, 2850, NULL, 30);

INSERT INTO EMP VALUES
(7782, 'CLARK', 'MANAGER', 7839, 2450, NULL, 10);

INSERT INTO EMP VALUES
(7788, 'SCOTT', 'ANALYST', 7566, 3000, NULL, 20);

INSERT INTO EMP VALUES
(7839, 'KING', 'PRESIDENT', NULL, 5000, NULL, 10);

INSERT INTO EMP VALUES
(7844, 'TURNER', 'SALESMAN', 7698, 1500, 0, 30);

INSERT INTO EMP VALUES
(7876, 'ADAMS', 'CLERK', 7788, 1100, NULL, 20);

INSERT INTO EMP VALUES
(7900, 'JAMES', 'CLERK', 7698, 950, NULL, 30);

INSERT INTO EMP VALUES
(7902, 'FORD', 'ANALYST', 7566, 3000, NULL, 20);

INSERT INTO EMP VALUES
(7934, 'MILLER', 'CLERK', 7782, 1300, NULL, 10);

CREATE TABLE DEPT
(
   DEPTNO   NUMBER,
   DNAME    VARCHAR (14),
   LOC      VARCHAR (13)
);

INSERT INTO DEPT VALUES (10, 'ACCOUNTING', 'NEW YORK');
INSERT INTO DEPT VALUES (20, 'RESEARCH', 'DALLAS');
INSERT INTO DEPT VALUES (30, 'SALES', 'CHICAGO');
INSERT INTO DEPT VALUES (40, 'OPERATIONS', 'BOSTON');

COMMIT;

Now we are starting Apache Spark from the linux terminal with Pyspark interface (Python Interface).

/spark-2.1.0-bin-hadoop2.7/bin/pyspark 
--jars "/home/jars/ojdbc6.jar" 
--master yarn-client 
--num-executors 10 
--driver-memory 16g 
--executor-memory 8g

We started Apache Spark. Now let's write the Python code to read the data from the database and run it.

empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:username/password@//hostname:portnumber/SID") \
    .option("dbtable", "hr.emp") \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

Let's take a look at the contents of this dataframe as we write to the empDF dataframe.

empDF.printSchema()

empDF.show()


Yes, I connected directly to the Oracle database with Apache Spark. Likewise, it is possible to get a query result in the same way.

query = "(select empno,ename,dname from emp, dept where emp.deptno = dept.deptno) emp"

empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:username/password@//hostname:portnumber/SID") \
    .option("dbtable", query) \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

empDF.printSchema()

empDF.show()

It is very easy and practical to use, as you can see from the examples made above.

With this method, it is possible to load large tables directly and in parallel, but I will do the performance evaluation in another article.

Apache Spark Database Big data Oracle Database

Opinions expressed by DZone contributors are their own.

Related

  • The Magic of Apache Spark in Java
  • RION - A Fast, Compact, Versatile Data Format
  • How Trustworthy Is Big Data?
  • Fixing Common Oracle Database Problems

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!