Getting Started With Pandas

An introduction to using Pandas, a third-party library for numerical computing that excels in handling data with Series and DataFrame objects.

by David Suarez · Dec. 10, 21 · Big Data Zone · Tutorial
This article is one of the first chapters of an internal training series on the fundamental tools of data science: Pandas, NumPy, and Matplotlib. Pandas is a third-party library for numerical computing built on NumPy. It excels at handling labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.

NumPy is a third-party library for numerical computing, optimized for working with single- and multi-dimensional arrays. Its primary type is the array type called ndarray. This library contains many routines for statistical analysis.
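As a small illustration of the ndarray type and the statistical routines mentioned above, the following sketch builds a 2D array and computes a few summary statistics. The sample values are made up for illustration and do not come from the article.

```python
import numpy as np

# Hypothetical sample data: 2 rows x 3 columns
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

print(type(data))   # the array type is numpy.ndarray
print(data.shape)   # (rows, columns)

overall_mean = data.mean()        # mean of all six values
column_means = data.mean(axis=0)  # one mean per column
row_std = data.std(axis=1)        # population standard deviation per row
```

The axis argument selects the dimension that is collapsed: axis=0 aggregates down the rows and yields one value per column, while axis=1 yields one value per row.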

Matplotlib is a third-party library for data visualization. It works well in combination with NumPy, SciPy, and Pandas.

Creating, Reading, and Writing Data

To work with data, we need to either create coherent data structures to hold it or read it from an external source. We also need to save the data after any modifications we make.

The two fundamental data structures are Series and DataFrame. To simplify, a Series is similar to a Python dictionary (key-value pairs), and a DataFrame is a two-dimensional table with rows and columns. We use a DataFrame when we have more than one value for each key.

Creating Series from Scratch

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic methods to create a Series are:

From List

import pandas as pd
# Named "values" rather than "list" to avoid shadowing the Python built-in
values = [7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!']
serie = pd.Series(values)

The index works like an address: it is how any data point in a DataFrame or Series is accessed. In a DataFrame, both rows and columns are labeled. The row labels are called the index, and the column labels are the column names. We can specify the index of a Series this way:

values = [7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!']
index = ['A', 'Z', 'C', 'Y', 'E']

s = pd.Series(values, index=index)
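To make the "address" idea concrete, here is a minimal sketch of reading values back out of such a Series. It uses the same data as above; the variable names by_label, by_position, and subset are only illustrative.

```python
import pandas as pd

values = [7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!']
labels = ['A', 'Z', 'C', 'Y', 'E']
s = pd.Series(values, index=labels)

by_label = s.loc['Z']        # label-based lookup
by_position = s.iloc[0]      # position-based lookup (first element)
subset = s.loc[['A', 'E']]   # several labels at once, returns a Series
```

Using .loc for labels and .iloc for positions keeps the two kinds of lookup explicit, which helps when the labels themselves happen to be integers.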

From Dictionary

d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
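The dictionary keys become the index labels, and the None for Boston is treated as a missing value. The following sketch, which reuses the same dictionary, shows one way to detect and skip it; the names missing_count, known, and above_1000 are illustrative.

```python
import pandas as pd

d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)

missing_count = cities.isna().sum()   # how many entries have no value
known = cities.dropna()               # drop the missing entry
above_1000 = known[known > 1000]      # boolean filtering on the remaining values
```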


Creating a DataFrame From Scratch

We can create a DataFrame in different ways. Some of the most common are:

From Dictionary

employees = pd.DataFrame([{"name":"David",
                   "surname":"Suarez",
                   "age":32},
                   {"name":"Gema",
                   "surname":"Parreño",
                   "age":31}], columns=["name","surname","age"])
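The example above passes a list of dictionaries, one per row. An equivalent way to build the same table, sketched here as an alternative rather than as the author's method, is a single dictionary that maps each column name to its list of values.

```python
import pandas as pd

# One dictionary: column name -> list of values for that column
employees = pd.DataFrame({
    "name": ["David", "Gema"],
    "surname": ["Suarez", "Parreño"],
    "age": [32, 31],
})

shape = employees.shape            # (number of rows, number of columns)
columns = list(employees.columns)  # column names in order
mean_age = employees["age"].mean() # a column is itself a Series
```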


To label each row with a non-numeric index, we can pass the labels through the index parameter.

employees_by_dni = pd.DataFrame([{"name":"David",
                   "surname":"Suarez",
                   "age":32},
                   {"name":"Gema",
                   "surname":"Parreño",
                   "age":31}], columns=["name","surname","age"], index=["76789764A", "78765493G"])
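With row labels in place, a whole row or a single cell can be looked up by label. A minimal sketch using the same data follows; the names row and age are illustrative.

```python
import pandas as pd

employees_by_dni = pd.DataFrame(
    [{"name": "David", "surname": "Suarez", "age": 32},
     {"name": "Gema", "surname": "Parreño", "age": 31}],
    columns=["name", "surname", "age"],
    index=["76789764A", "78765493G"],
)

row = employees_by_dni.loc["78765493G"]         # one row, returned as a Series
age = employees_by_dni.loc["76789764A", "age"]  # one cell: row label, column name
```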

From CSV

A CSV (comma-separated values) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

To read an external file in CSV format, we call the read_csv function in pandas:

import pandas as pd
from_csv = pd.read_csv('./path/to/file.csv', index_col=0)


The index_col parameter specifies which CSV column becomes the row labels, that is, the address through which we then access all the information in a row. Often the first column of a CSV file serves as the index, which is why the example above uses index_col=0. The labels can be numbers or strings, so a row could later be retrieved with from_csv.loc[0] or from_csv.loc['david'], depending on the data. Any other column can be chosen by its position:

from_csv = pd.read_csv('./path/to/file.csv', index_col=3)
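The effect of index_col can be tried without a file on disk by reading from an in-memory string. The CSV content below is hypothetical, chosen to match the employee examples; it also shows that index_col accepts a column name as well as a position.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """dni,name,surname,age
76789764A,David,Suarez,32
78765493G,Gema,Parreño,31
"""

# Use the first column (position 0) as the row labels
by_first = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Use a column chosen by name as the row labels
by_name = pd.read_csv(io.StringIO(csv_text), index_col="name")

gema_age = by_name.loc["Gema", "age"]
```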


From Parquet

Like CSV, Parquet is a file format for tabular data. The difference is that Parquet is a column-oriented storage format (organized into blocks, row groups, column chunks…) designed for efficient storage and complex data processing, whereas CSV is row-based.

To read an external file in Parquet format, we call the read_parquet function in pandas, which needs a Parquet engine such as pyarrow or fastparquet to be installed:

import pandas as pd
from_parquet = pd.read_parquet('./path/to/file.parquet')


From JSON

If the external file is in JSON format:

from_json = pd.read_json('./path/to/file.json')
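As a sketch of what read_json produces, the snippet below parses a small in-memory JSON document instead of a file. The JSON content is hypothetical, and the explicit orient="records" states the assumption that the document is a list of row objects.

```python
import io
import pandas as pd

# Hypothetical JSON content: a list of records, one object per row
json_text = '[{"name": "David", "age": 32}, {"name": "Gema", "age": 31}]'

from_json = pd.read_json(io.StringIO(json_text), orient="records")

ages = from_json["age"].tolist()
```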


Writing a DataFrame

Once the DataFrame is created, we have several ways to save it to an external file, such as CSV or JSON. We use the to_csv and to_json methods and save the file with the corresponding extension:

df_to_write.to_csv("/path/to/file.csv")
df_to_write.to_json("/path/to/file.json")
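A quick way to check what was written is a round trip: save the DataFrame and read it back. The sketch below does this in a temporary directory; the data and the file name employees.csv are hypothetical.

```python
import os
import tempfile
import pandas as pd

df_to_write = pd.DataFrame(
    {"name": ["David", "Gema"], "age": [32, 31]},
    index=["76789764A", "78765493G"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "employees.csv")
    # to_csv writes the index as the first column by default
    df_to_write.to_csv(csv_path)
    # so index_col=0 restores it as the row labels when reading back
    restored = pd.read_csv(csv_path, index_col=0)

restored_ages = restored["age"].tolist()
```

If the index carries no information, passing index=False to to_csv leaves it out of the file.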


Finally, pandas can also read from and write to SQL databases, for example with read_sql and to_sql.
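As a rough sketch of the SQL route, the example below writes a DataFrame to an in-memory SQLite database and queries it back. The table name employees and the data are assumptions made for illustration; other databases would typically need a SQLAlchemy connection instead of the sqlite3 one used here.

```python
import sqlite3
import pandas as pd

employees = pd.DataFrame({"name": ["David", "Gema"], "age": [32, 31]})

# In-memory SQLite database, discarded when the connection closes
conn = sqlite3.connect(":memory:")
try:
    # Write the DataFrame as a table, leaving out the default integer index
    employees.to_sql("employees", conn, index=False)
    # Read the result of a query back into a new DataFrame
    older = pd.read_sql("SELECT name, age FROM employees WHERE age > 31", conn)
finally:
    conn.close()

older_names = older["name"].tolist()
```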

Published at DZone with permission of David Suarez. See the original article here.
