The Idea, Part 1: SQL Queries in Pandas Scripting

We take a look at how to use Python and the Pandas library for querying data, doing some rudimentary analysis, and how it compares to SQL for data ingestion.

By Zehra Can · Aug. 21, 18 · Tutorial

It is sometimes hard to leave behind the habits you are used to. As a SQL-addicted data analyst, your mind may resist the challenging path of learning new data analysis environments and languages like Python. I have taken many online data science and Python courses, yet I still find myself translating SQL queries into Python scripts in my head. It is like learning a new language on top of your mother tongue. Consequently, I decided to prepare a SQL-to-Pandas guide for myself and for others who are in the same boat.

Environment Preparation

In this work, I used the Oracle HR sample schema. You can find information about this sample data in the Oracle docs.

If you don't have an Oracle environment, you can simply run the create scripts for the tables in the database environment of your choice. The basic create scripts (hr-cre-sql) can be found here.

There are seven tables in the Oracle HR sample schema:

  • COUNTRIES
  • DEPARTMENTS
  • EMPLOYEES
  • JOB_HISTORY
  • LOCATIONS
  • REGIONS
  • JOBS

The model's entity-relationship diagram can be viewed below. I created the model by reverse engineering the entities from the database with the Power Designer modeling tool.

[Entity-relationship diagram of the HR schema]

To create and load the data into the Oracle database, the script found here can be used.

So far, I have explained how to get the data into our database. For querying the data, I used the Toad SQL editor.

I will not explain the details of setting up Python. I am using Python 3 in the Anaconda environment with Jupyter Notebook. The Python version is shown in the screenshot below.

[Screenshot of the Python version in Jupyter Notebook]
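If the screenshot is unavailable, the versions can also be checked directly in the notebook. This is a minimal sketch, independent of the original article's environment:

```python
import sys

import pandas as pd

# Report the interpreter and pandas versions so that results are reproducible.
python_version = sys.version.split()[0]   # e.g. "3.6.5"
pandas_version = pd.__version__
print("Python:", python_version)
print("pandas:", pandas_version)
```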

Let's Turn to Writing Some Simple Queries

Loading Data in Pandas

import pandas as pd

# Load the Oracle sample HR data. If you are used to a database mindset, you can
# think of these CSV exports as our tiny HR database.
# Forward slashes avoid backslash-escape issues in Windows paths.
employees   = pd.read_csv("./sqlVSpandas/EMPLOYEES.CSV")
departments = pd.read_csv("./sqlVSpandas/DEPARTMENTS.CSV")
job_history = pd.read_csv("./sqlVSpandas/JOB_HISTORY.CSV")
jobs        = pd.read_csv("./sqlVSpandas/JOBS.CSV")
countries   = pd.read_csv("./sqlVSpandas/COUNTRIES.CSV")
regions     = pd.read_csv("./sqlVSpandas/REGIONS.CSV")
locations   = pd.read_csv("./sqlVSpandas/LOCATIONS.CSV")

# Create a dictionary so each data set can be reached easily by name.
HR_Model_Entities = {
    'EMPLOYEES'   : employees, 
    'DEPARTMENTS' : departments,
    'JOB_HISTORY' : job_history,  
    'JOBS'        : jobs,  
    'COUNTRIES'   : countries, 
    'REGIONS'     : regions ,
    'LOCATIONS'   : locations}
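After loading, it is worth checking what read_csv actually produced. The sketch below uses an in-memory CSV via io.StringIO rather than the real HR files, so the two rows and the column subset shown are illustrative, not the full schema:

```python
import io

import pandas as pd

# A two-row stand-in for EMPLOYEES.CSV; the real export has 107 rows.
csv_text = "EMPLOYEE_ID,LAST_NAME,SALARY\n100,King,24000\n101,Kochhar,17000\n"
employees_preview = pd.read_csv(io.StringIO(csv_text))

print(employees_preview.head())    # first rows, like a SELECT * with a row limit
print(employees_preview.dtypes)    # the column types pandas inferred
```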

Counting Data

SQL Version

select 'employees' hr_data, count(*) record_count from employees
    union 
    select 'departments', count(*) record_count from departments
    union
    select 'job_history', count(*) record_count from job_history
    union  
    select 'jobs' , count(*) record_count from jobs
    union
    select 'countries', count(*) record_count from countries
    union
    select 'regions', count(*) record_count from regions
    union 
    select 'locations', count(*) record_count from locations
    order by 1

The result is:

HR_DATA      RECORD_COUNT
countries              25
departments            27
employees             107
job_history            10
jobs                   19
locations              23
regions                 4

Pandas Version

from prettytable import PrettyTable

data_count = PrettyTable()
data_count.field_names = ["HR Data", "Record Count"]
data_count.align["HR Data"] = "l"
data_count.align["Record Count"] = "r"

for key in HR_Model_Entities:
    df = HR_Model_Entities[key]
    data_count.add_row([key, df.iloc[:,1].count()])

print(data_count)

The result is:

+-------------+--------------+
| HR Data     | Record Count |
+-------------+--------------+
| JOB_HISTORY |           10 |
| DEPARTMENTS |           27 |
| COUNTRIES   |           25 |
| LOCATIONS   |           23 |
| REGIONS     |            4 |
| EMPLOYEES   |          107 |
| JOBS        |           19 |
+-------------+--------------+
# Another option for counting rows; it gives the same result.
# df.shape[0] is the direct analogue of SQL count(*).
from prettytable import PrettyTable

data_count = PrettyTable()
data_count.field_names = ["HR Data", "Record Count"]
data_count.align["HR Data"] = "l"
data_count.align["Record Count"] = "r"

for key in HR_Model_Entities:
    df = HR_Model_Entities[key]
    data_count.add_row([key, str(df.shape[0])])

print(data_count) 
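The two loops above count rows with df.iloc[:, 1].count() and df.shape[0]. These agree only when the counted column has no missing values, because Series.count() skips NaN while shape[0] counts every row; shape[0] is therefore the safer analogue of SQL's count(*). A minimal sketch with synthetic data (not the HR tables):

```python
import numpy as np
import pandas as pd

# MANAGER_ID is NaN for the top manager, as in the HR EMPLOYEES table.
df = pd.DataFrame({"EMPLOYEE_ID": [100, 101, 102],
                   "MANAGER_ID": [np.nan, 100, 100]})

rows_total = df.shape[0]              # like SQL count(*): every row
non_null = df["MANAGER_ID"].count()   # like SQL count(MANAGER_ID): non-null only
print(rows_total, non_null)           # prints: 3 2
```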


Some Aggregation Functions on a Single Column

SQL Version

select 'SUM_SALARY' CALCULATION, sum(SALARY) SALARY_VALUE from employees
UNION
select 'MINIMUM_SALARY', min(SALARY) from employees
UNION
select 'MAXIMUM_SALARY', max(SALARY) from employees
UNION
select 'AVG_SALARY', avg(SALARY) from employees
UNION
select 'MEDIAN_SALARY', median(SALARY) from employees
UNION
select 'STDDEV_SALARY', stddev(SALARY) from employees

The result is:

CALCULATION     SALARY_VALUE
AVG_SALARY       6461.682243
MAXIMUM_SALARY         24000
MEDIAN_SALARY           6200
MINIMUM_SALARY          2100
STDDEV_SALARY    3909.365746
SUM_SALARY            691400

Pandas Version

sum_salary = employees['SALARY'].sum()
print ("Sum Salary :" , sum_salary)

min_salary = employees['SALARY'].min()
print ("Minimum Salary :" , min_salary)

max_salary = employees['SALARY'].max()
print ("Maximum Salary :" , max_salary)

avg_salary = employees['SALARY'].mean()
print ("Average Salary :" , avg_salary)

median_salary = employees['SALARY'].median()
print ("Median Salary :" , median_salary)

std_salary = employees['SALARY'].std()
print ("Standard Deviation :" , std_salary)

#The result is:

Sum Salary : 691400
Minimum Salary : 2100
Maximum Salary : 24000
Average Salary : 6461.682242990654
Median Salary : 6200.0
Standard Deviation : 3909.365746459056

#You can also use describe() to see summary statistics for the SALARY column.
#Note that emp_salary here is a Series, not a new data frame.

emp_salary = employees['SALARY']
emp_salary.describe()

#The result looks like this:
count      107.000000
mean      6461.682243
std       3909.365746
min       2100.000000
25%       3100.000000
50%       6200.000000
75%       8900.000000
max      24000.000000
Name: SALARY, dtype: float64
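All six aggregates from the UNION query can also be produced in a single call with Series.agg, which is the closest one-statement analogue. The sketch below runs on a three-value synthetic salary list, so the numbers differ from the HR results above:

```python
import pandas as pd

# Synthetic salaries for illustration; the real SALARY column has 107 values.
salaries = pd.Series([2100, 6200, 24000], name="SALARY")

# One call instead of six UNION branches; the result is a labeled Series.
stats = salaries.agg(["sum", "min", "max", "mean", "median", "std"])
print(stats)
```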

This was a quick and easy start to comparing SQL and Pandas scripts. I will move on to more complex queries in my next article.

There is always something new to learn!

