
PySpark DataFrame Tutorial: Introduction to DataFrames

In this post, we explore the idea of DataFrames and how, when paired with PySpark, they help data analysts make sense of large datasets.

By Kislay Keshari · Jul. 14, 18 · Tutorial

DataFrames are a buzzword in the industry nowadays. People tend to use them with popular data-analysis languages such as Python, Scala, and R. So why is everyone using them so much? Let's find out in this PySpark DataFrame tutorial. In this post, I'll be covering the following topics:

  • What are DataFrames?
  • Why do we need DataFrames?
  • Features of DataFrames
  • PySpark DataFrame Sources
  • DataFrame Creation
  • PySpark DataFrames with the FIFA World Cup and Superheroes Datasets

PySpark DataFrame Tutorial: What Are DataFrames?

A DataFrame is, generally speaking, a tabular data structure. It holds rows, each of which consists of a number of observations. A row can contain a variety of data types (heterogeneous), whereas a column holds data of a single type (homogeneous). DataFrames usually contain some metadata in addition to the data itself; for example, column and row names.

We can say that DataFrames are nothing but two-dimensional data structures, similar to a SQL table or a spreadsheet. Now let's move ahead with this PySpark DataFrame tutorial and understand exactly why we need PySpark DataFrames.

Why Do We Need DataFrames?

1. Processing Structured and Semi-Structured Data

DataFrames are designed to process large collections of structured as well as semi-structured data. Observations in a Spark DataFrame are organized under named columns, which helps Apache Spark understand the schema of the DataFrame and optimize the execution plan for queries against it. DataFrames can also handle petabytes of data.

2. Slicing and Dicing

DataFrame APIs usually support elaborate methods for slicing and dicing the data, including operations such as selecting rows, columns, and cells by name or by number, and filtering out rows. Statistical data is usually very messy and contains lots of missing values, incorrect values, and range violations, so a critically important feature of DataFrames is the explicit management of missing data.

3. Data Sources

DataFrames support a wide range of data formats and sources; we'll look into these later in this PySpark DataFrame tutorial. They can take in data from a variety of places.

4. Support for Multiple Languages

DataFrames have API support for different languages such as Python, R, Scala, and Java, which makes them easier to use for people with different programming backgrounds.

Features of DataFrames

  • DataFrames are distributed in nature, which makes them fault-tolerant and highly available data structures.
  • Lazy evaluation is an evaluation strategy that holds the evaluation of an expression until its value is needed, and it avoids repeated evaluation. In Spark, lazy evaluation means that execution does not start until an action is triggered: transformations only build up a plan, and the work happens once an action runs (see the short sketch after this list).
  • DataFrames are immutable in nature. By immutable, I mean an object whose state cannot be modified after it is created. We can, however, derive new values by applying transformations, just as with RDDs.
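To make the lazy-evaluation point concrete, here is a minimal sketch; the file path and the value column are placeholders, and an active SparkSession named spark is assumed. Defining the transformation does no work on its own; only the action at the end triggers computation.

df = spark.read.csv("path-of-file/some_data.csv", header=True, inferSchema=True)
filtered = df.filter(df.value > 100)   # transformation: only builds the plan
filtered.count()                       # action: triggers the actual computation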

PySpark DataFrame Sources

DataFrames in PySpark can be created in multiple ways:

Data can be loaded from a CSV, JSON, XML, or Parquet file. A DataFrame can also be created from an existing RDD or from an external database such as Hive or Cassandra, and it can take in data from HDFS or the local file system.
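As a rough sketch of what those entry points look like (the file paths, Hive table name, and RDD below are placeholders, and a SparkSession named spark is assumed):

csv_df = spark.read.csv("path-of-file/data.csv", header=True, inferSchema=True)
json_df = spark.read.json("path-of-file/data.json")
parquet_df = spark.read.parquet("path-of-file/data.parquet")
hive_df = spark.sql("select * from some_hive_table")            # from a Hive table
rdd_df = spark.createDataFrame(some_rdd, ["col1", "col2"])      # from an existing RDD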

Let's move forward with this PySpark DataFrame tutorial and understand how to create DataFrames.

We'll create Employee and Department instances.

from pyspark.sql import *

# Define a Row "template" with the field names of an employee record
Employee = Row("firstName", "lastName", "email", "salary")

# Create individual employee records from the template
employee1 = Employee('Basher', 'armbrust', 'bash@edureka.co', 100000)
employee2 = Employee('Daniel', 'meng', 'daniel@stanford.edu', 120000)
employee3 = Employee('Muriel', None, 'muriel@waterloo.edu', 140000)
employee4 = Employee('Rachel', 'wendell', 'rach_3@edureka.co', 160000)
employee5 = Employee('Zach', 'galifianakis', 'zach_g@edureka.co', 160000)

print(Employee[0])   # prints the first field name: firstName
print(employee3)     # prints the full Row for employee3

# Department records with explicit field names
department1 = Row(id='123456', name='HR')
department2 = Row(id='789012', name='OPS')
department3 = Row(id='345678', name='FN')
department4 = Row(id='901234', name='DEV')

Next, we'll create DepartmentWithEmployees instances from the Employees and Departments.

departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2, employee5])
departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4])
departmentWithEmployees3 = Row(department=department3, employees=[employee1, employee4, employee3])
departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])

Let's create our DataFrame from the list of rows:

# Build the DataFrame from a list of Rows (assumes an active SparkSession named spark)
departmentsWithEmployees_Seq = [departmentWithEmployees1, departmentWithEmployees2]
dframe = spark.createDataFrame(departmentsWithEmployees_Seq)
display(dframe)   # display() is available in Databricks notebooks
dframe.show()

PySpark DataFrames Example 1: FIFA World Cup Dataset

Here we have taken the FIFA World Cup Players dataset. We are going to load this data, which is in CSV format, into a DataFrame, and then we'll learn about the different transformations and actions that can be performed on it.

Reading Data From CSV File

Let's load the data from a CSV file. Here we use the spark.read.csv method to load the data into a DataFrame, fifa_df. This is shorthand for the more general reader API, spark.read.format("csv") (or "json", and so on) followed by load().

fifa_df = spark.read.csv("path-of-file/fifa_players.csv", inferSchema = True, header = True)

fifa_df.show()
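For reference, a minimal sketch of the equivalent call through the generic reader API, using the same placeholder path as above:

fifa_df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("path-of-file/fifa_players.csv"))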

Schema of DataFrame

To have a look at the schema, i.e. the structure of the DataFrame, we use the printSchema method. This gives us the different columns in our DataFrame, along with the data type and the nullable flag for each column.

fifa_df.printSchema()

Column Names and Counts (Rows and Columns)

When we want to look at the column names and count the number of rows and columns of a DataFrame, we use the following methods.

fifa_df.columns         # column names

fifa_df.count()         # row count
# 37784

len(fifa_df.columns)    # column count
# 8

Describing a Particular Column

If we want to have a look at the summary of a particular column of a DataFrame, we use the describe method. This method gives us the statistical summary of the given column; if no column is specified, it provides the statistical summary of the whole DataFrame.

fifa_df.describe('Coach Name').show()
fifa_df.describe('Position').show()
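As noted above, calling describe with no column name summarizes the whole DataFrame:

fifa_df.describe().show()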

Selecting Multiple Columns

If we want to select particular columns from the DataFrame, we use the select method.

fifa_df.select('Player Name','Coach Name').show()

Selecting Distinct Multiple Columns

fifa_df.select('Player Name','Coach Name').distinct().show()

Filtering Data

In order to filter data according to a specified condition, we use the filter command. Here we filter our DataFrame on the condition that Match ID must be equal to 1096, and then we calculate how many records/rows are in the filtered output.

fifa_df.filter(fifa_df.MatchID == '1096').show()

fifa_df.filter(fifa_df.MatchID == '1096').count()   # to get the count

Filtering Data (Multiple Parameters)

We can filter our data based on multiple conditions, combined with AND or OR.

fifa_df.filter((fifa_df.Position == 'C') & (fifa_df.Event == "G40'")).show()
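An OR condition works the same way with the | operator; a small sketch, where the second position value is just illustrative:

fifa_df.filter((fifa_df.Position == 'C') | (fifa_df.Position == 'GK')).show()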

Sorting Data (OrderBy)

To sort the data, we use the orderBy method. By default, it sorts in ascending order, but we can change that to descending order as well (see the sketch after the default example below).

fifa_df.orderBy(fifa_df.MatchID).show()
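To sort in descending order instead, a minimal sketch using the column's desc() method:

fifa_df.orderBy(fifa_df.MatchID.desc()).show()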

PySpark DataFrames Example 2: Superheroes Dataset

Loading the Data

Here we will load the data in the same way as we did earlier.

Superhero_df = spark.read.csv("path-of-file/superheros.csv", inferSchema = True, header = True)

Superhero_df.show(10)

Filtering the Data

Superhero_df.filter(Superhero_df.Gender == 'Male').count()     # male heroes count

Superhero_df.filter(Superhero_df.Gender == 'Female').count()   # female heroes count

Grouping the Data

GroupBy is used to group the DataFrame by a specified column. Here, we group the DataFrame by the Race column and then, with the count function, find the count for each race.

Race_df = Superhero_df.groupby("Race").count()
Race_df.show()
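A common follow-up is to order the grouped counts; a minimal sketch, assuming the column produced by count() keeps its default name "count":

Superhero_df.groupby("Race").count().orderBy("count", ascending=False).show()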

Performing SQL Queries

We can also run SQL queries directly against a DataFrame. For that, we need to register the DataFrame as a table using the registerTempTable method and then use sqlContext.sql() to run the queries.

Superhero_df.registerTempTable('superhero_table')

sqlContext.sql('select * from superhero_table').show()

sqlContext.sql('select distinct(Eye_color) from superhero_table').show()

sqlContext.sql('select distinct(Eye_color) from superhero_table').count()
sqlContext.sql('select max(Weight) from superhero_table').show()
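Note that registerTempTable and sqlContext date from older Spark releases. On Spark 2.x and later, the same queries can go through the SparkSession instead; a minimal sketch, assuming an active SparkSession named spark:

Superhero_df.createOrReplaceTempView('superhero_table')
spark.sql('select * from superhero_table').show()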

And with this, we come to the end of this PySpark DataFrame tutorial.

I hope this PySpark DataFrame tutorial gave you an idea of what PySpark DataFrames are, why they are used in the industry, and what their key features are. Congratulations, you are no longer a newbie to DataFrames.


Published at DZone with permission of Kislay Keshari, DZone MVB.

Opinions expressed by DZone contributors are their own.
