
Getting Started With Pandas – Lesson 2

In this article, we summarize the functions that Pandas provides for indexing, selection, and filtering.

David Suarez · Dec. 04, 21 · Big Data Zone · Analysis

Introduction

This is the second post in our Pandas training series. It covers the functions used in Pandas for indexing, selection, and filtering.

Indexing, Selecting, and Filtering

Before we start, let's take a look at the dataset that we will use throughout the examples. It is a well-known dataset that contains wine information.

Dataset containing wine information

As an introduction, we are going to explain some functions that are very useful for getting a broad view of the state of our dataset.

Getting Information

  • Info

We start with the info() function, which provides the number and names of the columns, the number of non-null elements, and the data type of each column.

df.info()

Wines Dataset: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


  • Dtypes

We continue with the dtypes attribute, which shows only the data type of each column.

df.dtypes

Wines Dataset: 

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object


  • Describe

The describe() function computes summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) that help us understand the distribution of each numeric column.

df.describe()


Wines Dataset: 

       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide
count    1599.000000       1599.000000  1599.000000     1599.000000  1599.000000          1599.000000
mean        8.319637          0.527821     0.270976        2.538806     0.087467            15.874922
std         1.741096          0.179060     0.194801        1.409928     0.047065            10.460157
min         4.600000          0.120000     0.000000        0.900000     0.012000             1.000000
25%         7.100000          0.390000     0.090000        1.900000     0.070000             7.000000
50%         7.900000          0.520000     0.260000        2.200000     0.079000            14.000000
75%         9.200000          0.640000     0.420000        2.600000     0.090000            21.000000
max        15.900000          1.580000     1.000000       15.500000     0.611000            72.000000

       total sulfur dioxide      density           pH    sulphates      alcohol      quality
count           1599.000000  1599.000000  1599.000000  1599.000000  1599.000000  1599.000000
mean              46.467792     0.996747     3.311113     0.658149    10.422983     5.636023
std               32.895324     0.001887     0.154386     0.169507     1.065668     0.807569
min                6.000000     0.990070     2.740000     0.330000     8.400000     3.000000
25%               22.000000     0.995600     3.210000     0.550000     9.500000     5.000000
50%               38.000000     0.996750     3.310000     0.620000    10.200000     6.000000
75%               62.000000     0.997835     3.400000     0.730000    11.100000     6.000000
max              289.000000     1.003690     4.010000     2.000000    14.900000     8.000000
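
To try these three calls without downloading the wine data, here is a minimal sketch on a tiny stand-in DataFrame. The name `sample` and its four rows are made up for illustration and are not part of the original dataset.

```python
import pandas as pd

# Hypothetical four-row sample; the values are invented for illustration only.
sample = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 10.5, 11.2],
        "pH": [3.51, 3.20, 3.26, 3.16],
        "quality": [5, 5, 6, 7],
    }
)

sample.info()                # prints column names, non-null counts, and dtypes
col_types = sample.dtypes    # Series mapping each column name to its dtype
summary = sample.describe()  # DataFrame with count, mean, std, min, quartiles, max

print(col_types)
print(summary.loc[["count", "mean", "max"]])
```

Note that info() prints its report directly, while dtypes and describe() return objects (a Series and a DataFrame) that can themselves be indexed like any other pandas object.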


Indexing and Selection

Here we take a deep dive into the two main pandas indexers for indexing and selection: `iloc` and `loc`.

+ .loc is primarily label-based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:

    – A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index).

    – A list or array of labels, e.g. ['a', 'b', 'c'].

    – A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index).

    – A boolean array (any NA values will be treated as False).

    – A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).

+ .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:

  – An integer e.g. 5.

  – A list or array of integers [4, 3, 0].

  – A slice object with ints 1:7.

  – A boolean array (any NA values will be treated as False).

  – A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
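
The error behaviour and the less common input types listed above can be checked on a small frame. This is only a sketch: `df_demo` and its values are hypothetical, chosen so that the index labels ('a', 'b', 'c') differ from the integer positions.

```python
import pandas as pd

# Hypothetical frame with string labels, so labels and positions differ.
df_demo = pd.DataFrame(
    {"quality": [5, 6, 7], "alcohol": [9.4, 10.5, 11.2]},
    index=["a", "b", "c"],
)

# .loc raises KeyError for a label that is not in the index.
try:
    df_demo.loc["z"]
    loc_error = None
except KeyError:
    loc_error = "KeyError"

# .iloc raises IndexError for an out-of-bounds integer position...
try:
    df_demo.iloc[10]
    iloc_error = None
except IndexError:
    iloc_error = "IndexError"

# ...but an out-of-bounds slice is simply truncated, as with Python lists.
tail = df_demo.iloc[1:10]                        # rows 'b' and 'c'

# A boolean list selects the rows whose entry is True.
by_mask = df_demo.iloc[[True, False, True]]      # rows 'a' and 'c'

# A callable receives the DataFrame and must return a valid indexer.
good = df_demo.loc[lambda d: d["quality"] >= 6]  # rows 'b' and 'c'
```

The callable form is mostly useful in method chains, where the intermediate DataFrame has no name of its own to refer to.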

There is no better way to understand how something works than through examples, so here is a range of useful examples showing the different ways to use them.

`iloc` Examples

+ Get the first row

df.iloc[0]


+ Get the first column

df.iloc[:, 0]


+ Get the first column of the first row (returned as a one-element Series; df.iloc[0, 0] returns the bare value)

df.iloc[0:1, 0]


+ Get the rows at positions 3 and 4 (the stop position, 5, is excluded)

df.iloc[3:5]


+ Get rows 3, 7, 10

df.iloc[[3, 7, 10]]


+ Get last five rows

df.iloc[-5:]
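
To see what each of these calls returns, the sketch below repeats them on a small stand-in frame. `df_pos` is a hypothetical 12-row DataFrame with invented values; only the shapes and positions matter here.

```python
import pandas as pd

# Hypothetical 12-row frame standing in for the wines data.
df_pos = pd.DataFrame(
    {
        "alcohol": [9.0 + 0.1 * i for i in range(12)],
        "quality": [3 + i % 5 for i in range(12)],
    }
)

first_row = df_pos.iloc[0]         # Series: one row, indexed by column name
first_col = df_pos.iloc[:, 0]      # Series: the whole "alcohol" column
cell_series = df_pos.iloc[0:1, 0]  # Series of length 1 (the row slice keeps the axis)
cell_value = df_pos.iloc[0, 0]     # scalar value
two_rows = df_pos.iloc[3:5]        # positions 3 and 4; the stop is excluded
picked = df_pos.iloc[[3, 7, 10]]   # DataFrame with exactly those positions
last_five = df_pos.iloc[-5:]       # positions 7 through 11
```

An integer selects along an axis and drops it from the result, whereas a slice or a list keeps the axis, which is why `iloc[0:1, 0]` gives a Series and `iloc[0, 0]` gives a single value.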


`loc` Examples

+ Get the first row of column ‘quality’

df.loc[0, 'quality']


+ Get all rows from columns ‘quality', ‘sulphates’, ‘alcohol’

df.loc[:, ['quality', 'sulphates', 'alcohol']]


+ Get the rows from the label 'litres' onward, for the columns from 'quality' through 'alcohol' (here df1 stands for a DataFrame whose index contains the label 'litres')

df1.loc['litres':, 'quality':'alcohol']


+ Get the rows labeled 3 through 5 (unlike iloc, the stop label is included)

df.loc[3:5]
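
The difference between label-based and position-based slicing is easiest to see side by side. The frames `df_lbl` and `df_str` below are hypothetical stand-ins with invented values; `df_str` only imitates the kind of string-labeled index that the 'litres' example assumes.

```python
import pandas as pd

# Hypothetical frame with the default RangeIndex, so labels equal positions at first.
df_lbl = pd.DataFrame({"quality": [5, 5, 6, 7, 5, 6, 4]})

by_label = df_lbl.loc[3:5]      # labels 3, 4, 5 (stop included)
by_position = df_lbl.iloc[3:5]  # positions 3, 4 (stop excluded)

# After filtering, labels and positions no longer coincide.
subset = df_lbl[df_lbl["quality"] >= 6]  # keeps the rows labeled 2, 3, 5
first_by_position = subset.iloc[0]       # the row labeled 2
# subset.loc[0] would raise KeyError, because label 0 was filtered out.

# Hypothetical frame with string labels on the index.
df_str = pd.DataFrame(
    {"sulphates": [0.5, 0.6, 0.7], "alcohol": [9.4, 9.8, 10.5], "quality": [5, 6, 7]},
    index=["grams", "litres", "percent"],
)
from_litres = df_str.loc["litres":, "alcohol":"quality"]
```

A label slice follows the order of the index or columns as stored in the frame, so the start label has to appear before the stop label; otherwise the selection comes back empty.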


Filtering

One of the most useful things when working with data is being able to filter it according to certain conditions. For this, the `loc` indexer accepts boolean conditions in the following way:

+ Get all wines whose quality is greater than 6

wines.loc[wines.quality > 6]


+ Get all wines whose quality is greater than 5 and less than 8

wines.loc[(wines.quality > 5) & (wines.quality < 8)]


+ Get all wines whose quality is equal to 5 or equal to 7

wines.loc[(wines.quality == 5) | (wines.quality == 7)]
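
The three filters above can be reproduced on a small frame to see exactly which rows survive. This is a sketch under assumptions: the five rows are invented, and the `between` and `isin` variants are alternative spellings added here for comparison, not something the examples above use.

```python
import pandas as pd

# Hypothetical sample; `wines` mirrors the variable name used above.
wines = pd.DataFrame(
    {"alcohol": [9.4, 9.8, 10.5, 11.2, 12.0], "quality": [5, 6, 7, 8, 5]}
)

mask = wines["quality"] > 6  # boolean Series aligned with the rows
high = wines.loc[mask]       # rows where the mask is True

# Each comparison needs parentheses because & and | bind tighter than > and <.
mid = wines.loc[(wines["quality"] > 5) & (wines["quality"] < 8)]
five_or_seven = wines.loc[(wines["quality"] == 5) | (wines["quality"] == 7)]

# Alternative spellings for integer-valued quality scores.
mid_alt = wines.loc[wines["quality"].between(6, 7)]          # both ends included
five_or_seven_alt = wines.loc[wines["quality"].isin([5, 7])]

# A second argument to .loc selects columns in the same step.
high_alcohol = wines.loc[mask, "alcohol"]
```

Python's `and` and `or` cannot be used here, because they try to reduce a whole Series to a single truth value; the element-wise operators `&` and `|` are required.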


That's all for today. In the next lesson, we will take a deep dive into the functions we use to iterate, map, group, and sort.


Published at DZone with permission of David Suarez. See the original article here.
