Python Pandas Tutorial: DataFrame Basics
The most commonly used data structures in pandas are DataFrames, so it's important to know at least the basics of working with them.
Join the DZone community and get the full member experience.
Join For FreeThe DataFrame is the most commonly used data structures in pandas. As such, it is very important to learn various specifics about working with the DataFrame. After of creating a DataFrame, let's now delve into some methods for working with it.
Getting Started
Import these libraries: pandas
, mattplotlib
for plotting, numpy
.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random
If you are working with a Jupyter (or iPython) notebook and want to show graphs inline, use this definition.
%matplotlib inline
Let's now load some CSV data into our DataFrame for working with it. The data we have loaded is the World Happiness Report 2016.
x = pd.read_csv('2016.csv')
DataFrame Details
Index
The attribute index
shows the row index labels.
x = pd.read_csv('2016.csv')
print x.index
# prints
RangeIndex(start=0, stop=157, step=1)
The index is a RangeIndex
if the labels are contiguous integers.
Columns
Get the columns using the attribute columns
.
print x.columns
# prints
Index([u'Country', u'Region', u'Happiness Rank', u'Happiness Score',
u'Lower Confidence Interval', u'Upper Confidence Interval',
u'Economy (GDP per Capita)', u'Family', u'Health (Life Expectancy)',
u'Freedom', u'Trust (Government Corruption)', u'Generosity',
u'Dystopia Residual'],
dtype='object')
Values
The raw values array can be extracted using values
.
print x.values
# prints
[['Denmark' 'Western Europe' 1 ..., 0.44453000000000004 0.36171 2.73939]
['Switzerland' 'Western Europe' 2 ..., 0.41203 0.28083 2.69463]
['Iceland' 'Western Europe' 3 ..., 0.14975 0.47678000000000004 2.83137]
...,
['Togo' 'Sub-Saharan Africa' 155 ..., 0.11587 0.17517 2.1354]
['Syria' 'Middle East and Northern Africa' 156 ..., 0.17232999999999998
0.48396999999999996 0.81789]
['Burundi' 'Sub-Saharan Africa' 157 ..., 0.09419 0.2029 2.1040400000000004]]
Shape
Get a tuple of the number of rows and columns of the DataFrame using the shape
attribute.
x.shape
# prints
(157, 13)
Size
Use the count()
method to retrieve a count of (non-NaN) elements in each column. This method ignores any NaN elements in the column.
print x.count()
# prints
Country 157
Region 157
Happiness Rank 157
Happiness Score 157
Lower Confidence Interval 157
Upper Confidence Interval 157
Economy (GDP per Capita) 157
Family 157
Health (Life Expectancy) 157
Freedom 157
Trust (Government Corruption) 157
Generosity 157
Dystopia Residual 157
dtype: int64
And the size
attribute returns the total number of elements (including NaNs) in the DataFrame. This means the value (nrows * ncols).
print x.size
# prints
2041
Statistics
Get detailed statistics of the DataFrame using the method describe()
. Returns various details such as mean, min, max, etc. for each column.
print x.describe()
# prints
Happiness Rank Happiness Score Lower Confidence Interval \
count 157.000000 157.000000 157.000000
mean 78.980892 5.382185 5.282395
std 45.466030 1.141674 1.148043
min 1.000000 2.905000 2.732000
25% 40.000000 4.404000 4.327000
50% 79.000000 5.314000 5.237000
75% 118.000000 6.269000 6.154000
max 157.000000 7.526000 7.460000
Upper Confidence Interval Economy (GDP per Capita) Family \
count 157.000000 157.000000 157.000000
mean 5.481975 0.953880 0.793621
std 1.136493 0.412595 0.266706
...
Head and Tail
The head()
method retrieves the first five rows from the DataFrame.
x = pd.read_csv('big-data/Salaries.csv')
print x.head()
# prints
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
And the tail method retrieves the last five rows.
print x.tail()
# prints
yearID teamID lgID playerID salary
26423 2016 WSN NL strasst01 10400000
26424 2016 WSN NL taylomi02 524000
26425 2016 WSN NL treinbl01 524900
26426 2016 WSN NL werthja01 21733615
26427 2016 WSN NL zimmery01 14000000
The cumulative methods return a DataFrame with the appropriate cumulative function applied to the rows. Some of the operations are not valid for non-numeric columns.
Cumulative Sum
cumsum()
(cumulative sum): Value of each row is replaced by the sum of all prior rows including this row. String value rows use concatenation as shown below.
y = pd.DataFrame({'one': pd.Series(range(5)),
'two': pd.Series(range(5, 10)),
'three': pd.Series(list('abcde'))})
print 'head =>\n', y.head(), '\n'
print 'cumsum =>\n', y.cumsum(), '\n'
# prints
head =>
one three two
0 0 a 5
1 1 b 6
2 2 c 7
3 3 d 8
4 4 e 9
cumsum =>
one three two
0 0 a 5
1 1 ab 11
2 3 abc 18
3 6 abcd 26
4 10 abcde 35
Cumulative Product
cumprod()
(cumulative product): Row value is replaced by product of all prior rows. This method is not applicable to non-numeric rows. If there are non-numeric rows in the DataFrame, you will need to extract a subset of the DataFrame as shown.
print 'cumprod =>\n', y[['one', 'two']].cumprod(), '\n'
# prints
cumprod =>
one two
0 0 5
1 0 30
2 0 210
3 0 1680
4 0 15120
Cumulative Maximum
cummax()
(cumulative max): Value of the row is replaced by the maximum value of all prior rows till now. In the example below, for demonstrating this method, we use this method on reversed rows of the original DataFrame.
print 'rev =>\n', y.iloc[::-1], '\n',
print 'cummax =>\n', y.iloc[::-1].cummax(), '\n'
# prints
rev =>
one three two
4 4 e 9
3 3 d 8
2 2 c 7
1 1 b 6
0 0 a 5
cummax =>
one three two
4 4 e 9
3 4 e 9
2 4 e 9
1 4 e 9
0 4 e 9
Cumulative Minimum
cummin()
: Similar to cummax, except computes the minimum of values till this row.
print 'cummin =>\n', y.cummin(), '\n'
# prints
cummin =>
one three two
0 0 a 5
1 0 a 5
2 0 a 5
3 0 a 5
4 0 a 5
Index of Min and Max Values
Use the methods idxmin()
and idxmax()
to obtain the index label of the rows containing minimum and maximum values. Applicable only to numeric columns, so non-numeric columns need to be filtered out.
y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
'three': pd.Series(list('ABCDE'), index=list('abcde'))})
print y, '\n'
print 'idxmax =>\n', y[['one', 'two']].idxmax(), '\n'
print 'idxmin =>\n', y[['one', 'two']].idxmin(), '\n'
# prints
one three two
a 48 A 25
b 38 B 13
c 62 C 91
d 79 D 32
e 2 E 42
idxmax =>
one d
two c
dtype: object
idxmin =>
one e
two b
dtype: object
Value Counts
The method value_counts()
returns the number of times each value is repeated in the column. Note: this is not a DataFrame method; rather it is applicable on a column (which is a Series object).
x = pd.read_csv('big-data/Salaries.csv')
print 'top 10 =>\n', x.head(10), '\n'
print 'value_counts =>\n', x['yearID'].value_counts().head(10)
# prints
top 10 =>
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
5 1985 ATL NL chambch01 800000
6 1985 ATL NL dedmoje01 150000
7 1985 ATL NL forstte01 483333
8 1985 ATL NL garbege01 772000
9 1985 ATL NL harpete01 250000
value_counts =>
1999 1006
1998 998
1995 986
1996 931
1997 925
1993 923
1994 884
1990 867
2001 860
2008 856
Name: yearID, dtype: int64
Summary
We covered a few aspects of the DataFrame in this article. Ways of learning various details of the DataFrame including size, shape, statistics, etc. were presented.
Published at DZone with permission of Jay Sridhar, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments