# Pandas DataFrame Functions

### Learn the basics of the Pandas DataFrame.

Pandas is a Python library that lets users parse, clean, and visually represent data quickly and efficiently. Here, I will share some useful DataFrame functions that will help you analyze a data set.

First, you have to import the library. By convention, we use the alias "pd" to refer to Pandas.

`import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)`

The data used in the example code comes from Kaggle House Prices. Specifically, I used the train.csv file. Save the file in the same folder as your code; otherwise, you have to give the full path when reading the file.

```
#load data
df = pd.read_csv("train.csv")
```

Now that the data is loaded, we can view the first and last n rows of the DataFrame using the head() and tail() methods, respectively.

```
# Pass the number of rows you want as an integer.
# If you omit it, the first or last 5 rows are returned.
df.head(10) # First 10 rows
df.tail(10) # Last 10 rows
```

*First ten rows represented using the head() function*
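To see how the row-count argument behaves without loading the Kaggle file, here is a minimal sketch on a small made-up DataFrame. The `toy` frame and its values are assumptions for illustration only. It also shows that a negative argument is accepted: `head(-2)` returns every row except the last two.

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "Id": range(1, 9),
    "SalePrice": [100, 120, 90, 150, 200, 175, 130, 95],
})

first_default = toy.head()        # no argument: first 5 rows
last_three = toy.tail(3)          # last 3 rows
all_but_last_two = toy.head(-2)   # negative n: all rows except the last 2

print(first_default)
print(last_three)
print(all_but_last_two)
```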

We can then use the describe() method to get some basic statistical information (non-null count, mean, standard deviation, minimum, quartiles, and maximum) about each numeric column in our DataFrame.

`df.describe()`

The output should look something like this:

*Describe() function output*

We can also use the transpose() method or .T in order to get a transposed version of our dataframe.

```
df.describe().T
df.describe().transpose()
```

The output will look something like this (for the first ten rows).

*Transposed version of Dataframe statistics*

You can also describe the columns separately according to their data types, as below:

```
print(df.select_dtypes(include=['int64','float64']).describe())
print(df.select_dtypes(include=['object']).describe())
```

*Output for columns with float64s as their data type*
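As a rough illustration of describing columns by data type, the sketch below uses a made-up three-column frame (the `toy` name and values are assumptions, not the Kaggle data). It selects columns with the generic `"number"` selector rather than listing `int64` and `float64` explicitly, which is one possible variation on the code above. For non-numeric columns, describe() reports count, unique, top, and freq instead of the mean and quartiles.

```python
import pandas as pd

# Hypothetical toy data; column names borrowed from the House Prices set
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Statistics for numeric columns only
numeric_stats = toy.select_dtypes(include="number").describe()
# Statistics for everything that is not numeric (here, the text column)
text_stats = toy.select_dtypes(exclude="number").describe()

print(numeric_stats)
print(text_stats)
```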

If you want to see each column's name, non-null count, and data type, along with the total number of rows, use the info() method. If you only want the data types, use the dtypes attribute.

```
df.info() # Get column names, non-null counts, and data types
df.dtypes # get only data types
```

You can use this table later to separate the numeric columns from the non-numeric ones before manipulating your data. It is especially useful for finding missing values, because a column whose non-null count is lower than the row count contains nulls.
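If you want the missing-value counts as numbers rather than reading them off the info() output, one possible approach is to combine isna() with sum(). The sketch below uses a small made-up frame; the `toy` name and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [None, "Grvl", None, None],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# isna() marks each missing cell as True; sum() counts the Trues per column
missing_per_column = toy.isna().sum()

# Keep only the names of the columns that have at least one missing value
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()

print(missing_per_column)
print(columns_with_gaps)
```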

You can use the size attribute of a DataFrame to get its total number of elements, that is, the number of rows multiplied by the number of columns.

```
# Returns size of dataframe/series which is equivalent to
#total number of elements. That is rows x columns.
df.size
```

Similarly, you can use the shape attribute in order to get a tuple of the row count and the column count. You can then index the tuple in order to isolate either of the values returned.

```
df.shape # Get a tuple of the row and column count
df.shape[0] # Get just the row count
df.shape[1] # Get just the column count
```
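The relationship between size, shape, and the row count can be checked on a tiny example. This is only a sketch; the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = toy.shape      # unpack the (rows, columns) tuple

print(toy.size)             # total elements in the DataFrame
print(rows * cols)          # same value, computed from shape
print(len(toy))             # len() gives the row count only
print(toy["a"].size)        # for a Series, size is just its length
```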

If you are working with a Pandas object and can't tell whether it is a Series or a DataFrame, you can use the ndim attribute. It returns the number of dimensions of the object (one for a Series, two for a DataFrame).

```
df.ndim # Returns dimension of dataframe/series.
# 1 for one dimension (series), 2 for two dimension (dataframe)
```
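A common place where this question comes up is column selection: single brackets give back a Series, while double brackets give back a one-column DataFrame. The following sketch shows the difference on a made-up frame (`toy` is an assumption for illustration).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

one_column_series = toy["a"]     # single brackets: Series
one_column_frame = toy[["a"]]    # double brackets: DataFrame with one column

print(one_column_series.ndim)
print(one_column_frame.ndim)
```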

Every row has an index label. You can inspect the index itself, its values as an array, or its values as a list:

```
df.index #index of rows -> Returns "RangeIndex(start=0, stop=1460, step=1)"
df.index.values #index labels as a NumPy array
df.index.tolist() #index labels as a Python list
```
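Index labels stay attached to their rows, so they are no longer consecutive after you filter a DataFrame. The sketch below, on a made-up frame (`toy` and its values are assumptions), shows the labels after filtering and one way to renumber them with reset_index().

```python
import pandas as pd

# Hypothetical toy data with the default 0..3 index
toy = pd.DataFrame({"SalePrice": [100, 200, 300, 400]})

# Filtering keeps the original labels of the surviving rows
expensive = toy[toy["SalePrice"] > 150]
print(expensive.index.tolist())

# reset_index(drop=True) discards the old labels and renumbers from 0
renumbered = expensive.reset_index(drop=True)
print(renumbered.index.tolist())
```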

To get the distinct values of a column, you can use the NumPy library. Just as we alias Pandas as "pd", we follow the convention of aliasing NumPy as "np".

```
import numpy as np
print("Distinct values for OverallQual")
overall_qual = np.unique(df['OverallQual'])
print(overall_qual)
```
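Pandas also has its own methods for distinct values, which can serve as an alternative to np.unique. The sketch below compares them on a made-up Series (the `ratings` values are assumptions, not the Kaggle data). Note that np.unique returns sorted values, while Series.unique() keeps the order of first appearance.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings standing in for df['OverallQual']
ratings = pd.Series([7, 6, 7, 8, 5, 7], name="OverallQual")

sorted_distinct = np.unique(ratings)   # sorted distinct values
ordered_distinct = ratings.unique()    # distinct values in order of appearance
distinct_count = ratings.nunique()     # how many distinct values there are
counts = ratings.value_counts()        # how often each value occurs

print(sorted_distinct)
print(ordered_distinct)
print(distinct_count)
print(counts)
```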

You may want to get all the column names as a list and run some calculation on each column in a for loop. This can be done with the following code:

```
# Get all the column names as a list
all_columns_list = df.columns.tolist()
for col in all_columns_list:
    print(col)  # just print the names, but you can do other work here
```
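As one example of the "other work" such a loop can do, the sketch below collects the number of distinct values in every column. The `toy` frame and the `distinct_counts` dictionary are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names borrowed from the House Prices set
toy = pd.DataFrame({
    "OverallQual": [7, 6, 7, 8],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

distinct_counts = {}
for col in toy.columns.tolist():
    # Store the number of distinct values found in this column
    distinct_counts[col] = toy[col].nunique()
    print(col, toy[col].dtype, distinct_counts[col])
```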
