Introduction to Apache Spark's Core API (Part I)

We take a quick look at how to work with the functions and methods contained in Spark's core API using Python.

By Anil AGRAWAL · Dec. 24, 2018 · Tutorial

Hello coders, I hope you are all doing well.

Over the past few months, I've been learning the Spark framework along with other big data topics. Spark is, in essence, a framework for distributed computing on a cluster, though no single phrase can capture everything it does.

In this post, I am going to discuss the core APIs of Apache Spark using Python as the programming language. I am assuming that you have a basic knowledge of the Spark framework (tuples, RDDs, pair RDDs, and DataFrames) and its initialization steps.

When we launch a Spark shell in either Scala or Python (i.e., spark-shell or pyspark), it initializes a SparkContext as sc and a SQLContext as sqlContext.
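
If you run PySpark as a standalone script rather than through the interactive shell, these two objects are not created automatically. Here is a minimal sketch of building them by hand (assuming a Spark 1.x/2.x-era setup where SQLContext is still the entry point; the app name and master URL are purely illustrative):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Create the same objects the shells expose as `sc` and `sqlContext`.
    # 'local[*]' runs Spark locally with one thread per core; both the app
    # name and the master URL here are placeholder choices for illustration.
    conf = SparkConf().setAppName('core-api-intro').setMaster('local[*]')
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)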

  • Core APIs
    • sc.textFile(path)
      • This method reads a text file from HDFS (or any other Hadoop-supported file system) and returns it as an RDD of strings.
    • ordersRDD = sc.textFile('orders')

    • rdd.first()
      • This method returns the first element in the RDD.
    • ordersRDD.first()
      # u'1,2013-07-25 00:00:00.0,11599,CLOSED' - first element of the ordersRDD

    • rdd.collect()
      • This method returns a list that contains all of the elements in the RDD.
    • ordersRDD.collect()
      # [u'68882,2014-07-22 00:00:00.0,10000,ON_HOLD', u'68883,2014-07-23 00:00:00.0,5533,COMPLETE']

    • rdd.filter(f)
      • This method returns a new RDD containing only the elements that satisfy a predicate, i.e. the condition passed in as the argument.
    • filteredRDD = ordersRDD.filter(lambda line: line.split(',')[3] in ['COMPLETE'])
      filteredRDD.first()
      # u'3,2013-07-25 00:00:00.0,12111,COMPLETE'

    • rdd.map(f)
      • This method returns a new RDD by applying a function to each element of this RDD, i.e. it transforms every element with the given function.
    • mappedRDD = ordersRDD.map(lambda line: tuple(line.split(',')))
      mappedRDD.first()
      # (u'1', u'2013-07-25 00:00:00.0', u'11599', u'CLOSED')

    • rdd.flatMap(f)
      • This method returns a new RDD by first applying a function to each element of this RDD (same as map method) and then flattening the results.
    • flatMapedRDD = ordersRDD.flatMap(lambda line: line.split(','))
      flatMapedRDD.take(4)
      # [u'1', u'2013-07-25 00:00:00.0', u'11599', u'CLOSED']

    • sc.parallelize(c)
      • This method distributes a local Python collection to form an RDD.
    • lRDD = sc.parallelize(range(1,10))
      lRDD.first()
      # 1
      lRDD.take(10)
      # [1, 2, 3, 4, 5, 6, 7, 8, 9]

    • rdd.reduce(f)
      • This method reduces the elements of this RDD using the specified commutative and associative binary operator.
    • lRDD.reduce(lambda x,y: x+y)
      # 45 - this is the sum of 1 to 9

    • rdd.count()
      • This method returns the number of elements in this RDD.
    • lRDD.count()
      # 9 - as there are 9 elements in the lRDD

    • rdd.sortBy(keyFunc)
      • This method sorts this RDD by the given keyfunc.
    • lRDD.collect()
      # [1, 2, 3, 4, 5, 6, 7, 8, 9]
      
      lRDD.sortBy(lambda x: -x).collect()
      # [9, 8, 7, 6, 5, 4, 3, 2, 1] - the key function lets us sort the RDD in either ascending or descending order

    • rdd.top(num)
      • This method gets the top N elements from an RDD. It returns the list sorted in descending order.
    • lRDD.top(3)
      # [9, 8, 7]

    • rdd.take(num)
      • This method takes the first num elements of the RDD.
    • lRDD.take(7)
      # [1, 2, 3, 4, 5, 6, 7]

    • rdd.union(otherRDD)
      • Return the union of this RDD and another one.
    • l1 = sc.parallelize(range(1,5))
      l1.collect()
      # [1, 2, 3, 4]
      
      l2 = sc.parallelize(range(3,8))
      l2.collect()
      # [3, 4, 5, 6, 7]
      
      lUnion = l1.union(l2)
      lUnion.collect()
      # [1, 2, 3, 4, 3, 4, 5, 6, 7]

    • rdd.distinct()
      • Return a new RDD containing the distinct elements in this RDD.
    • lDistinct = lUnion.distinct()
      lDistinct.collect()
      # [2, 4, 6, 1, 3, 5, 7]

    • rdd.intersection(otherRDD)
      • Return the intersection of this RDD and another one, i.e. the output will not contain any duplicate elements, even if the input RDDs did.
    • lIntersection = l1.intersection(l2)
      lIntersection.collect()
      # [4, 3]

    • rdd.subtract(otherRDD)
      • Return each value in this RDD that is not contained in the other one.
    • lSubtract = l1.subtract(l2)
      lSubtract.collect()
      # [2, 1]

    • rdd.saveAsTextFile(path, compressionCodec)
      • Save this RDD as a text file, optionally compressing it with the given codec.
    • lRDD.saveAsTextFile('lRDD_only')
      # this saves lRDD under the lRDD_only directory in the HDFS home directory
      
      lUnion.saveAsTextFile('lRDD_union','org.apache.hadoop.io.compress.GzipCodec')
      # this saves lUnion, compressed with the Gzip codec, under the lRDD_union directory in the HDFS home directory
      
      lSubtract.saveAsTextFile('lRDD_subtract','org.apache.hadoop.io.compress.SnappyCodec')
      # this saves lSubtract, compressed with the Snappy codec, under the lRDD_subtract directory in the HDFS home directory

    • rdd.keyBy(f)
      • Creates a pair RDD from this RDD by applying the function to each element to compute its key; the element itself becomes the value.
    • ordersPairRDD = ordersRDD.keyBy(lambda line: int(line.split(',')[0]))
      ordersPairRDD.first()
      # (1, u'1,2013-07-25 00:00:00.0,11599,CLOSED')
      # This way we can create the pair RDD.
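
Putting a few of the methods above together, here is a small end-to-end sketch. It assumes the same 'orders' file and lRDD used earlier, with the record layout shown in the sample outputs (the order status in the fourth comma-separated field):

    # Count the COMPLETE orders by chaining textFile, filter, and count.
    ordersRDD = sc.textFile('orders')
    completeCount = ordersRDD.filter(lambda line: line.split(',')[3] == 'COMPLETE').count()

    # Double each number from 1 to 9 with map, then sum them with reduce.
    lRDD = sc.parallelize(range(1, 10))
    total = lRDD.map(lambda x: x * 2).reduce(lambda x, y: x + y)
    # total is 90, i.e. twice the sum of 1 to 9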

For now, these are the main functions and methods for a plain RDD, i.e. one without a key. In my next post, I will cover the functions and methods for pair RDDs, with multiple example snippets.

Thanks for reading and happy coding!

Opinions expressed by DZone contributors are their own.
