Your First Java RDD With Apache Spark

Build a simple Spark RDD with the Java API.

By Terrence Munyunguma · Jun. 16, 20 · Tutorial

Apache Spark is an amazingly powerful parallel execution interface for processing big data, including mining, crunching, analyzing, and representing it. If you want to know more about the features and ecosystem of this framework, check out the official docs; for now, I'm going to assume you're already sold and want to get started immediately.


Maven Project

Start a new Maven project with your favorite IDE or from the command line. You must have Java 8 installed on your operating system; Spark does not currently support newer versions of Java, but it's awesome anyway, trust me. When that is done, here is how your pom.xml file should look:

XML

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>spark_first</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.6</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.6</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>3.2.1</version>
        </dependency>

    </dependencies>

</project>

We have three important dependencies. Spark Core, built against Scala 2.11, is the main Spark engine that you use to build your RDDs; you can also use the Scala 2.12 build, which makes no difference for this guide. (At the time of writing, the latest version of Scala is 2.13, but Spark does not support it yet.) Spark SQL provides an interface for performing complex SQL operations on your dataset with ease. Hadoop HDFS provides the distributed file system implementation that Spark builds on by design.
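
Spark SQL is not used in the walkthrough below, but as a quick taste of what that dependency enables, here is a minimal sketch (the people.json file and its name/age fields are hypothetical):

Java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlTaste {

    public static void main(String[] args) {
        // SparkSession is the entry point for the Spark SQL API
        SparkSession spark = SparkSession.builder()
                .appName("spark_sql_taste")
                .master("local[*]")
                .getOrCreate();

        // Load a (hypothetical) JSON file into a DataFrame
        Dataset<Row> people = spark.read().json("people.json");

        // Register it as a temporary view and query it with plain SQL
        people.createOrReplaceTempView("people");
        spark.sql("SELECT name FROM people WHERE age > 30").show();

        spark.stop();
    }
}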

Note the project properties: you need to set your compiler and your runtime JVM to Java 8. Now let's look at the code.


Code

In your src/ folder, create a new Java file with a main method, like so:

Java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;

public class MainRunner {

    public static void main(String[] args) {
        List<Double> doubleList = new ArrayList<>();
        doubleList.add(23.44);
        doubleList.add(26.43);
        doubleList.add(75.35);
        doubleList.add(245.767);
        doubleList.add(398.445);
        doubleList.add(94.72);

        // Run locally, using as many worker threads as there are cores
        SparkConf sparkConf = new SparkConf().setAppName("spark_first").setMaster("local[*]");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

        // Distribute the list as an RDD
        JavaRDD<Double> javaRDD = sparkContext.parallelize(doubleList);

        // map: round each Double to the nearest Integer
        JavaRDD<Integer> mappedRDD = javaRDD.map(val -> (int) Math.round(val));

        mappedRDD.collect().forEach(System.out::println);

        // reduce: sum the Integers into a single result
        int reducedResult = mappedRDD.reduce(Integer::sum);

        System.out.println(reducedResult);

        sparkContext.close();
    }
}

Our RDD is a simple ArrayList of Doubles, but you can load a dataset from your local file system or from a distributed file system like HDFS or Amazon S3; Spark's Java API has functions to help you do that, as sketched below. Check it out and play around with the code.
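
For instance, here is a minimal sketch of reading the values from a text file instead, assuming a hypothetical numbers.txt with one value per line (an hdfs:// or s3a:// URI would work the same way, given the right configuration):

Java

// Read a (hypothetical) local text file; each element of the RDD is one line
JavaRDD<String> lines = sparkContext.textFile("numbers.txt");

// Parse each line into a Double
JavaRDD<Double> values = lines.map(Double::parseDouble);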

We perform a simple map-reduce: map the Doubles RDD to a new RDD of Integers, then reduce it by calling the Integer class's sum function to return the summed value of your RDD.
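
As an aside, Spark's Java API also offers a specialized JavaDoubleRDD for numeric work. A minimal sketch, reusing the javaRDD from above (the variable name numericRDD is mine), could sum the original Doubles without the rounding step:

Java

import org.apache.spark.api.java.JavaDoubleRDD;

// Inside main(): convert to a JavaDoubleRDD, which adds numeric helpers
JavaDoubleRDD numericRDD = javaRDD.mapToDouble(val -> val);

System.out.println(numericRDD.sum());   // exact sum of the original Doubles
System.out.println(numericRDD.stats()); // count, mean, stdev, min, max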

If you run your project, your results should look something like this:


PowerShell

1   Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
2   20/06/14 20:32:45 INFO SparkContext: Running Spark version 2.4.6
3   20/06/14 20:32:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
4   20/06/14 20:32:53 INFO SparkContext: Submitted application: spark_first
5   20/06/14 20:32:55 INFO SecurityManager: Changing view acls to: Terrence
6   20/06/14 20:32:55 INFO SecurityManager: Changing modify acls to: Terrence
7   ...
8   20/06/14 20:33:23 INFO DAGScheduler: Job 0 finished: collect at MainRunner.java:29, took 6.834479 s
9   23
10  26
11  75
12  246
13  398
14  95
15  20/06/14 20:33:23 INFO SparkContext: Starting job: reduce at MainRunner.java:32
16  ...
17  20/06/14 20:33:24 INFO DAGScheduler: ResultStage 1 (reduce at MainRunner.java:32) finished in 0.640 s
18  20/06/14 20:33:24 INFO DAGScheduler: Job 1 finished: reduce at MainRunner.java:32, took 0.754748 s
19  863
20  20/06/14 20:33:24 INFO SparkUI: Stopped Spark web UI at http://Terrence.mshome.net:4040
21
22  Process finished with exit code 0

Looking at lines 9-14, our new Integer RDD was printed out, and on line 19 is our reduced sum. There will be a lot of verbose logging; you might want to turn that off by grabbing a Log4j Logger and explicitly setting its level to WARN or ERROR, as sketched below.
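
A minimal sketch, assuming the Log4j 1.x API that Spark 2.4 bundles; the call goes at the top of main(), before the SparkContext is created:

Java

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

// Silence Spark's and Hadoop's chatty INFO logs; WARN and ERROR still show
Logger.getLogger("org.apache").setLevel(Level.WARN);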

Congratulations! You just ran your first Apache Spark project with Java. Time to get your hands dirty.

Tags: Apache Spark · Java (programming language) · File system

Opinions expressed by DZone contributors are their own.
