The Idea, Part 1: SQL Queries in Pandas Scripting
We take a look at how to use Python and the Pandas library for querying data, doing some rudimentary analysis, and how it compares to SQL for data ingestion.
Join the DZone community and get the full member experience.
Join For FreeIt is sometimes hard to leave behind the habits that you are used to. As a SQL addicted data analyst, your mind may oppose your new challenging learning path for using new data analysis environments and languages like Python. I have taken many online data science and Python courses. However, I still need to match the query in SQL with Python scripts in my mind. It is like learning a new language over your mother tongue. Consequently, I have decided to prepare a SQL guide for myself and others who are in the same boat as me.
Environment Preparation
In this work, I used Oracle HR example schema data. You can find information about this example data in the Oracle docs.
If you don't have an Oracle environment you can simply use the create scripts of the tables in your selected database environment. The basic create scripts can be found (hr-cre-sql) here.
There are seven tables in the Oracle HR Sample :
- COUNTRIES
- DEPARTMENTS
- EMPLOYEES
- JOB_HISTORY
- LOCATIONS
- REGIONS
- JOBS
The model entity diagram can be viewed below. I have created the model by reverse engineering the entities from the database with Power Designer modeling tool.
To create and load data on the Oracle database the script found here can be used.
Until now, I explained how to get data into our database. For querying the data I have used the Toad SQL editor.
I will not explain the details for setting up Pyhton. I am using Pyhton 3 on the Anaconda environment with the Jupyter Notebook. The version of Python is given in the screenshot.
Let's Turn to Writing Some Simple Queries
Loading Data in Pandas
import pandas as pd
#Loading ORACLE SAMPLE HR data, you can think that this is your simple database level if you are used to database mind.
#Hence here we have created our tiny HR database.
employees = pd.read_csv(".\sqlVSpandas\EMPLOYEES.CSV")
departments = pd.read_csv(".\sqlVSpandas\DEPARTMENTS.CSV")
job_history = pd.read_csv(".\sqlVSpandas\JOB_HISTORY.CSV")
jobs = pd.read_csv(".\sqlVSpandas\JOBS.CSV")
countries = pd.read_csv(".\sqlVSpandas\COUNTRIES.CSV")
regions = pd.read_csv(".\sqlVSpandas\REGIONS.CSV")
locations = pd.read_csv(".\sqlVSpandas\LOCATIONS.CSV")
#Create a dictionary to keep the names of the data set more compact in case of reaching easily
HR_Model_Entities = {
'EMPLOYEES' : employees,
'DEPARTMENTS' : departments,
'JOB_HISTORY' : job_history,
'JOBS' : jobs,
'COUNTRIES' : countries,
'REGIONS' : regions ,
'LOCATIONS' : locations}
Counting Data
SQL Version
select 'employees' hr_data, count(*) record_count from employees
union
select 'departments', count(*) record_count from departments
union
select 'job_history', count(*) record_count from job_history
union
select 'jobs' , count(*) record_count from jobs
union
select 'countries', count(*) record_count from countries
union
select 'regions', count(*) record_count from regions
union
select 'locations', count(*) record_count from locations
order by 1
The result is:
HR_DATA |
RECORD_COUNT |
countries |
25 |
departments |
27 |
employees |
107 |
job_history |
10 |
jobs |
19 |
locations |
23 |
regions |
4 |
Pandas Version
from prettytable import PrettyTable
data_count = PrettyTable()
data_count.field_names = ["HR Data", "Record Count"]
data_count.align["HR Data"] = "l"
data_count.align["Record Count"] = "r"
for key in HR_Model_Entities:
df = HR_Model_Entities[key]
data_count.add_row([key, df.iloc[:,1].count()])
print(data_count)
The result is:
+-------------+--------------+
| HR Data | Record Count |
+-------------+--------------+
| JOB_HISTORY | 10 |
| DEPARTMENTS | 27 |
| COUNTRIES | 25 |
| LOCATIONS | 23 |
| REGIONS | 4 |
| EMPLOYEES | 107 |
| JOBS | 19 |
+-------------+--------------+
#another option for counting rows, It gives the same result.
from prettytable import PrettyTable
data_count = PrettyTable()
data_count.field_names = ["HR Data", "Record Count"]
data_count.align["HR Data"] = "l"
data_count.align["Record Count"] = "r"
for key in HR_Model_Entities:
df = HR_Model_Entities[key]
data_count.add_row([key, str(df.shape[0])])
print(data_count)
Some Aggregation Functions on Single Column
SQL Version
select 'SUM_SALARY' CALCULATION, sum(SALARY) SUM_SALARY from employees
UNION
select 'MIMINUM_SALARY', min(SALARY) MEDIAN_SALARY from employees
UNION
select 'MAXIMUM SALARY', max(SALARY) STD_SALARY from employees
UNION
select 'AVG_SALARY', avg(SALARY) AVG_SALARY from employees
UNION
select 'MEDIAN_SALARY', median(SALARY) MEDIAN_SALARY from employees
UNION
select 'STDDEV_SALARY', stddev(SALARY) STD_SALARY from employees
The result is:
CALCULATION |
SUM_SALARY |
AVG_SALARY |
6461.682243 |
MAXIMUM SALARY |
24000 |
MEDIAN_SALARY |
6200 |
MIMINUM_SALARY |
2100 |
STDDEV_SALARY |
3909.365746 |
SUM_SALARY |
691400 |
Pandas Version
sum_salary = employees['SALARY'].sum()
print ("Sum Salary :" , sum_salary)
min_salary = employees['SALARY'].min()
print ("Minimum Salary :" , min_salary)
max_salary = employees['SALARY'].max()
print ("Maximum Salary :" , max_salary)
avg_salary = employees['SALARY'].mean()
print ("Average Salary :" , avg_salary)
median_salary = employees['SALARY'].median()
print ("Median Salary :" , median_salary)
std_salary = employees['SALARY'].std()
print ("Standart Deviation :" , std_salary)
#The Result is;
Sum Salary : 691400
Minimum Salary : 2100
Maximum Salary : 24000
Average Salary : 6461.682242990654
Median Salary : 6200.0
Standart Deviation : 3909.365746459056
#You can also use the following code to see the simple statistics on the SALARY columns, in this method emp_salary is a new data frame.
emp_salary = (employees['SALARY'])
emp_salary.describe()
#The result seems like these
count 107.000000
mean 6461.682243
std 3909.365746
min 2100.000000
25% 3100.000000
50% 6200.000000
75% 8900.000000
This was an easy and quick start to comparing both SQL and Pandas scripts. I will continue to create more complex queries in my next article.
There is always something new to learn!
Opinions expressed by DZone contributors are their own.
Comments