Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Introduction to Apache Spark's Core API (Part I)

DZone's Guide to

Introduction to Apache Spark's Core API (Part I)

We take a quick look at how to work with the functions and methods contained in Spark's core API using Python.

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Hello coders, I hope you are all doing well.

Over the past few months, I've been learning the Spark framework along with other big data topics. Spark is essentially a cluster programming framework. I know that one word can't define the entire framework. So please refer to this Introduction to Apache Spark article for more details.

In this post, I am going to discuss the core APIs of Apache Spark with respect to Python as a programming language. I am assuming that you have a basic knowledge of the Spark framework (tuples, RDD, pair RDD, and data frames) and its initialization steps.

When we launch a Spark shel,l either in Scala or Python (i.e. Spark Shell or PySpark), it will initialize sparkContext as sc and SQLContext as sqlContext.

  • Core APIs
    • sc.textFile(path)
      • This method reads a text file from HDFS and returns it as an RDD of strings.
      ordersRDD = sc.textFile('orders')

    • rdd.first()
      • This method returns the first element in the RDD.
      ordersRDD.first()
      # u'1,2013-07-25 00:00:00.0,11599,CLOSED' - first element of the ordersRDD

    • rdd.collect()
      • This method returns a list that contains all of the elements in the RDD.
      ordersRDD.collect()
      # [u'68882,2014-07-22 00:00:00.0,10000,ON_HOLD', u'68883,2014-07-23 00:00:00.0,5533,COMPLETE']

    • rdd.filter(f)
      • This method returns a new RDD containing only the elements that satisfy a predicate, i.e. it will create a new RDD containing those elements which satisfy the condition given in the argument.
      filterdRDD = ordersRDD.filter(lambda line: line.split(',')[3] in ['COMPLETE'])
      filterdRDD.first()
      # u'3,2013-07-25 00:00:00.0,12111,COMPLETE'

    • rdd.map(f)
      • This method returns a new RDD by applying a function to each element of this RDD. i.e. it will transform the RDD to a new one by applying a function.
      mapedRDD = ordersRDD.map(lambda line: tuple(line.split(',')))
      mapedRDD.first()
      # (u'1', u'2013-07-25 00:00:00.0', u'11599', u'CLOSED')

    • rdd.flatMap(f)
      • This method returns a new RDD by first applying a function to each element of this RDD (same as map method) and then flattening the results.
      flatMapedRDD = ordersRDD.flatMap(lambda line: line.split(','))
      flatMapedRDD.take(4)
      # [u'1', u'2013-07-25 00:00:00.0', u'11599', u'CLOSED']

    • sc.parallelize(c)
      • This method distributes a local Python collection to form an RDD.
      lRDD = sc.parallelize(range(1,10))
      lRDD.first()
      # 1
      lRDD.take(10)
      # [1, 2, 3, 4, 5, 6, 7, 8, 9]

    • rdd.reduce(f)
      • This method reduces the elements of this RDD using the specified commutative and associative binary operator.
      lRDD.reduce(lambda x,y: x+y)
      # 45 - this is the sum of 1 to 9

    • rdd.count()
      • This method returns the number of elements in this RDD.
      lRDD.count()
      # 9 - as there are 9 elements in the lRDD

    • rdd.sortBy(keyFunc)
      • This method sorts this RDD by a given keyfunc.
      lRDD.collect()
      # [1, 2, 3, 4, 5, 6, 7, 8, 9]
      
      lRDD.sortBy(lambda x: -x).collect()
      # [9, 8, 7, 6, 5, 4, 3, 2, 1] - can sort the rdd in any manner i.e. ASC or DESC

    • rdd.top(num)
      • This method gets the top N elements from an RDD. It returns the list sorted in descending order.
      lRDD.top(3)
      # [9, 8, 7]

    • rdd.take(num)
      • This method takes the first num elements of the RDD.
      lRDD.take(7)
      # [1, 2, 3, 4, 5, 6, 7]

    • rdd.union(otherRDD)
      • Return the union of this RDD and another one.
      l1 = sc.parallelize(range(1,5))
      l1.collect()
      # [1, 2, 3, 4]
      
      l2 = sc.parallelize(range(3,8))
      l2.collect()
      # [3, 4, 5, 6, 7]
      
      lUnion = l1.union(l2)
      lUnion.collect()
      # [1, 2, 3, 4, 3, 4, 5, 6, 7]

    • rdd.distinct()
      • Return a new RDD containing the distinct elements in this RDD.
      lDistinct = lUnion.distinct()
      lDistinct.collect()
      # [2, 4, 6, 1, 3, 5, 7]

    • rdd.intersection(otherRDD)
      • Return the intersection of this RDD and another one, i.e. the output will not contain any duplicate elements, even if the input RDDs did.
      lIntersection = l1.intersection(l2)
      lIntersection.collect()
      # [4, 3]

    • rdd.subtract(otherRDD)
      • Return each value in RDD that is not contained in another one.
      lSubtract = l1.subtract(l2)
      lSubtract.collect()
      # [2, 1]

    • rdd.saveAsTextFile(path, compressionCodec)
      • Save this RDD as a text file.
      lRDD.saveAsTextFile('lRDD_only')
      # this method will save the lRDD under lRDD_only folder under home directory in the HDFS
      
      lUnion.saveAsTextFile('lRDD_union','org.apache.hadoop.io.compress.GzipCodec')
      # this method will save the lUion compressed with Gzip codec under lRDD_union folder under home directory in the HDFS
      
      lSubtract.saveAsTextFile('lRDD_union','org.apache.hadoop.io.compress.SnappyCodec')
      # this method will save the lUion compressed with Snappy codec under lRDD_union folder under home directory in the HDFS

    • rdd.keyBy(f)
      • Creates tuples (pair RDD) of the elements in this RDD by applying the function.
      ordersPairRDD = ordersRDD.keyBy(lambda line: int(line.split(',')[0]))
      ordersPairRDD.first()
      # (1, u'1,2013-07-25 00:00:00.0,11599,CLOSED')
      # This way we can create the pair RDD.

For now, these are all the functions or methods for plain RDD, i.e. without the key. In my next post, I will explain the functions or methods with respect to pair RDD with multiple example snippets.

Thanks for reading and happy coding!

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Topics:
spark api ,big data ,apache spark ,apache spark tutorial ,python tutorial

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}