How To Use Pandas and Matplotlib To Perform EDA In Python
In this article, we will explore how to use two popular Python libraries, Pandas and Matplotlib, to perform EDA.
Exploratory Data Analysis (EDA) is an essential step in any data science project, as it allows us to understand the data, detect patterns, and identify potential issues. In this article, we will explore how to use two popular Python libraries, Pandas and Matplotlib, to perform EDA. Pandas is a powerful library for data manipulation and analysis, while Matplotlib is a versatile library for data visualization. We will cover the basics of loading data into a pandas DataFrame, exploring the data using pandas functions, cleaning the data, and finally, visualizing the data using Matplotlib. By the end of this article, you will have a solid understanding of how to use Pandas and Matplotlib to perform EDA in Python.
Importing Libraries and Data
To use the pandas and Matplotlib libraries in your Python code, you first need to import them. You can do this using the import statement followed by the name of the library.
```python
import pandas as pd
import matplotlib.pyplot as plt
```
In this example, we're importing pandas and aliasing it as 'pd', which is a common convention in the data science community. We're also importing matplotlib.pyplot and aliasing it as 'plt'. By importing these libraries, we can use their functions and methods to work with data and create visualizations.
Once you've imported the necessary libraries, you can load the data into a pandas DataFrame. Pandas provides several methods to load data from various file formats, including CSV, Excel, JSON, and more. The most common method is read_csv(), which reads data from a CSV file and returns a DataFrame.
```python
# Load data into a pandas DataFrame
data = pd.read_csv('path/to/data.csv')
```
In this example, we're loading data from a CSV file located at 'path/to/data.csv' and storing it in a variable called 'data'. You can replace 'path/to/data.csv' with the actual path to your data file.
By loading data into a pandas DataFrame, we can easily manipulate and analyze the data using pandas' functions and methods. The DataFrame is a 2-dimensional table-like data structure that allows us to work with data in a structured and organized way. It provides functions for selecting, filtering, grouping, aggregating, and visualizing data.
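To try read_csv() without a file on disk, the following minimal sketch reads CSV text from an in-memory buffer. The column names and values are made up for illustration; with a real dataset you would pass a file path instead.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of a CSV file.
csv_text = """name,age,city
Alice,34,Paris
Bob,45,Berlin
Carol,29,Madrid
"""

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)   # (rows, columns)
print(data.dtypes)  # inferred type of each column
```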
The head() and tail() functions are used to view the first few and last few rows of the data, respectively. By default, these functions display the first/last five rows of the data, but you can specify a different number of rows as an argument.
```python
# View the first 5 rows of the data
print(data.head())

# View the last 10 rows of the data
print(data.tail(10))
```
The info() function provides information about the DataFrame, including the number of rows and columns, the data types of each column, and the number of non-null values. This function is useful for identifying missing values and determining the appropriate data types for each column.
```python
# Get information about the data
# info() prints its report directly and returns None, so no print() is needed
data.info()
```
The describe() function provides summary statistics for numerical columns in the DataFrame, including the count, mean, standard deviation, minimum, maximum, and quartiles. This function is useful for getting a quick overview of the distribution of the data.
```python
# Get summary statistics for the data
print(data.describe())
```
The value_counts() function is used to count the number of occurrences of each unique value in a column. This function is useful for identifying the frequency of specific values in the data.
```python
# Count how many times each unique value occurs in a column
print(data['column_name'].value_counts())
```
These are just a few examples of pandas functions you can use to explore data. There are many other functions you can use depending on your specific data exploration needs, such as isnull() to check for missing values, groupby() to group data by a specific column, corr() to calculate correlation coefficients between columns, and more.
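As a rough illustration of groupby() and corr(), the sketch below builds a tiny DataFrame by hand. The column names and numbers are placeholders chosen for the example, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical data: the column names are placeholders.
data = pd.DataFrame({
    "department": ["sales", "sales", "support", "support"],
    "years": [1, 3, 2, 4],
    "salary": [40, 60, 45, 65],
})

# Mean salary per department.
mean_salary = data.groupby("department")["salary"].mean()
print(mean_salary)

# Pairwise Pearson correlation between the numeric columns.
correlations = data[["years", "salary"]].corr()
print(correlations)
```

Selecting the numeric columns before calling corr() avoids errors on text columns such as department.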
The isnull() function is used to check for missing or null values in the DataFrame. It returns a DataFrame of the same shape as the original, with True values where the data is missing and False values where the data is present. You can chain the sum() function to count the number of missing values in each column.
```python
# Check for missing values
print(data.isnull().sum())
```
The dropna() function is used to remove rows or columns with missing or null values. By default, this function removes any row that contains at least one missing value. You can use the subset argument to specify which columns to check for missing values and the how argument to specify whether to drop rows with any missing values ('any', the default) or only rows where all values are missing ('all').
```python
# Drop rows with missing values
data = data.dropna()
```
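The subset and how arguments can be sketched as follows. The DataFrame is invented for the example, and its column names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; column names are placeholders.
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [9.5, np.nan, 7.0, np.nan],
    "comment": ["ok", "late", np.nan, np.nan],
})

# Drop rows only when 'score' is missing; gaps in other columns are ignored.
has_score = data.dropna(subset=["score"])

# Drop rows only when every checked column is missing.
not_empty = data.dropna(subset=["score", "comment"], how="all")

print(len(data), len(has_score), len(not_empty))
```

Restricting the check to a subset is often gentler than a bare dropna(), which would discard every row with a gap anywhere.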
The drop_duplicates() function is used to remove duplicate rows from the DataFrame. By default, it treats rows as duplicates only when they match in every column, and it keeps the first occurrence of each. You can use the subset argument to specify which columns to check for duplicates.
```python
# Drop duplicate rows
data = data.drop_duplicates()
```
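Here is a small sketch of deduplicating on a single column, using invented sign-up records. The keep argument, which is not covered above, controls which of the duplicate rows survives.

```python
import pandas as pd

# Hypothetical sign-up records; the column names are placeholders.
data = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com"],
    "signup": ["2023-01-01", "2023-01-05", "2023-02-10"],
})

# No row is an exact copy of another, so the default call removes nothing.
print(len(data.drop_duplicates()))

# Treat rows with the same email as duplicates and keep the last one seen.
latest = data.drop_duplicates(subset=["email"], keep="last")
print(latest)
```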
The replace() function is used to replace values in a column with new values. You specify the old value and the new value to replace it with. This function is useful for handling data quality issues such as misspellings or inconsistent formatting.
```python
# Replace values in a column
data['column_name'] = data['column_name'].replace('old_value', 'new_value')
```
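When several variants should map to one canonical value, replace() also accepts a dictionary. The city names below are illustrative only.

```python
import pandas as pd

# Hypothetical column with inconsistent spellings of the same city.
cities = pd.Series(["NYC", "New York", "N.Y.", "Boston"])

# Map each variant to one canonical value; unlisted values are left unchanged.
cleaned = cities.replace({"NYC": "New York", "N.Y.": "New York"})
print(cleaned.value_counts())
```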
These are just a few examples of pandas functions you can use to clean data. There are many other functions you can use depending on your specific data-cleaning needs, such as fillna() to fill missing values with a specific value or method, astype() to convert the data types of columns, clip() to cap outliers at chosen bounds, and more.
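The sketch below shows fillna(), astype(), and clip() on a small invented DataFrame. The choice of the median as a fill value and of 0 to 100 as bounds are assumptions made for the example, not general recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age, one implausible age, counts stored as text.
data = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 120.0],
    "visits": ["3", "7", "2", "5"],
})

# Fill the missing age with the median of the observed ages.
data["age"] = data["age"].fillna(data["age"].median())

# Convert the text column to integers.
data["visits"] = data["visits"].astype(int)

# Cap ages to the range 0-100; values outside are set to the nearest bound.
data["age"] = data["age"].clip(lower=0, upper=100)

print(data)
```

Note that clip() keeps every row and only changes the out-of-range values, so the row count is unaffected.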
Data cleaning plays a crucial role in preparing data for analysis, and automating the process can save time and ensure data quality. In addition to the pandas functions mentioned earlier, automation techniques can be applied to streamline data-cleaning workflows. For instance, you can create reusable functions or pipelines to handle missing values, drop duplicates, and replace values across multiple datasets. Moreover, you can leverage advanced techniques like imputation to fill in missing values intelligently or regular expressions to identify and correct inconsistent formatting. By combining the power of pandas functions with automation strategies, you can efficiently clean and standardize data, improving the reliability and accuracy of your exploratory data analysis (EDA).
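One way to package such steps is a small reusable function applied with DataFrame.pipe(). The function name clean, its text_columns parameter, and the specific steps are hypothetical choices for this sketch; a real pipeline would be tailored to the dataset.

```python
import pandas as pd


def clean(df: pd.DataFrame, text_columns: list[str]) -> pd.DataFrame:
    """Return a cleaned copy of df; the steps are illustrative, not prescriptive."""
    out = df.copy()
    for col in text_columns:
        # Trim whitespace, collapse runs of spaces with a regular expression,
        # and lowercase the text so variants compare equal.
        out[col] = (
            out[col]
            .str.strip()
            .str.replace(r"\s+", " ", regex=True)
            .str.lower()
        )
    # Remove exact duplicate rows, then rows that are entirely empty.
    return out.drop_duplicates().dropna(how="all")


# Hypothetical raw data with inconsistent formatting.
raw = pd.DataFrame({
    "city": [" New  York", "new york", "Boston "],
    "sales": [10, 10, 7],
})

tidy = raw.pipe(clean, text_columns=["city"])
print(tidy)
```

Because the function works on a copy, the same call can be reused across datasets without modifying the inputs.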
Data visualization is a critical component of data science, as it allows us to gain insights from data quickly and easily. Matplotlib is a popular Python library for creating a wide range of data visualizations, including scatter plots, line plots, bar charts, histograms, box plots, and more.
Here are a few examples of how to create these types of visualizations using Matplotlib:
A scatter plot is used to visualize the relationship between two continuous variables. You can create a scatter plot in Matplotlib using the scatter() function.
```python
# Create a scatter plot
plt.scatter(data['column1'], data['column2'])
plt.xlabel('Column 1')
plt.ylabel('Column 2')
plt.show()
```
In this example, we're creating a scatter plot with column1 on the x-axis and column2 on the y-axis. We're also adding labels to the x-axis and y-axis using the xlabel() and ylabel() functions.
A histogram is used to visualize the distribution of a single continuous variable. You can create a histogram in Matplotlib using the hist() function.
```python
# Create a histogram
plt.hist(data['column'], bins=10)
plt.xlabel('Column')
plt.ylabel('Frequency')
plt.show()
```
In this example, we're creating a histogram of the column variable with 10 bins. We're also adding labels to the x-axis and y-axis using the xlabel() and ylabel() functions.
A box plot is used to visualize the distribution of a single continuous variable and to identify outliers. You can create a box plot in Matplotlib using the boxplot() function.
```python
# Create a box plot
plt.boxplot(data['column'])
plt.ylabel('Column')
plt.show()
```
In this example, we're creating a box plot of the column variable. We're also adding a label to the y-axis using the ylabel() function.
These are just a few examples of what you can do with Matplotlib for data visualization. There are many other functions and techniques you can use, depending on the specific requirements of your project.
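As one such technique, the three plot types can share a single figure through plt.subplots(). The sketch below is self-contained and uses synthetic data. The height and weight columns, the Agg backend, and the output file name are all assumptions made so that the script runs without a display.

```python
import matplotlib

# Assumption: a non-interactive backend, so the script runs without a display.
matplotlib.use("Agg")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(seed=0)
data = pd.DataFrame({"height": rng.normal(170, 10, 200)})
data["weight"] = 0.5 * data["height"] + rng.normal(0, 5, 200)

# One figure with three panels: scatter plot, histogram, and box plot.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].scatter(data["height"], data["weight"])
axes[0].set_xlabel("Height")
axes[0].set_ylabel("Weight")

axes[1].hist(data["height"], bins=10)
axes[1].set_xlabel("Height")
axes[1].set_ylabel("Frequency")

axes[2].boxplot(data["height"])
axes[2].set_ylabel("Height")

fig.tight_layout()
fig.savefig("eda_overview.png")  # hypothetical output file name
```

In an interactive session, you could drop the matplotlib.use("Agg") line and call plt.show() instead of saving the figure.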
Exploratory data analysis (EDA) is a crucial step in any data science project, and Python provides powerful tools to perform EDA effectively. In this article, we have learned how to use two popular Python libraries, Pandas and Matplotlib, to load, explore, clean, and visualize data. Pandas provides a flexible and efficient way to manipulate and analyze data, while Matplotlib provides a wide range of options to create visualizations. By leveraging these two libraries, we can gain insights from data quickly and easily. With the skills and techniques learned in this article, you can start performing EDA on your own datasets and uncover valuable insights that can drive data-driven decision-making.