Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Part 2: SQL Queries in Pandas Scripting (Filtering and Joining Data)

DZone's Guide to

Part 2: SQL Queries in Pandas Scripting (Filtering and Joining Data)

When filtering and joining big data sets, which tech is better? In this post, we compare the results garnered using both SQL queries and the Pandas library.

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

In Part 1, I went over information on the preparation of a data environment, which is a sample of HR data, and then did some simple query examples over the data by comparing the Pandas library and SQL. For the examples in this article, we first meed to setup the environment as described in Part 1.

Here, I will continue giving more complex SQL queries by rewriting the queries with Pandas. The SQL queries have 6 parts, which are;

  1. SELECT: Used to select the all columns or selected columns.
  2. FROM: Used to include the desired data set.
  3. WHERE: Used to define joins between data sets or filter the data (in some syntaxes, joins are defined in the FROM clause with join keywords like inner join, outer join, etc.).
  4. GROUP BY: Used to aggregate the attribute columns to calculate aggregated metrics.
  5. HAVING BY: Used to filter aggregated metrics.
  6. ORDER BY: Used to define ordering on desired columns.
SELECT <column or columns> /*[1]*/
FROM <table or tables> /*[2]*/
WHERE <join> or <filters> /*[3]*/
GROUP BY <aggregate column or columns> /*[4]*/
HAVING BY <aggregate column filter> /*[5]*/
ORDER BY <sort column or columns> /*[6]*/

In this article, the proceeding examples will be about filtering the data.

Filtering by Values

SQL Version

SELECT * FROM EMPLOYEES
WHERE MANAGER_ID = 100
AND DEPARTMENT_ID = 90

The output is:

FilterByValueSQL

Pandas Version

It can be written in four different syntaxes, as shown below, which all give the same output data;

employees_filtered_v1 = employees[(employees.MANAGER_ID == 100) & (employees.DEPARTMENT_ID == 90)]
employees_filtered_v1
#Or
employees_filtered_v2 = employees.query('MANAGER_ID == 100 & DEPARTMENT_ID == 90')
employees_filtered_v2
#Or
employees_filtered_v3 = employees[(employees['MANAGER_ID'] == 100) & (employees['DEPARTMENT_ID'] == 90)]
employees_filtered_v3
#Or
#Using Boolean values
manager_id = employees['MANAGER_ID'] == 100
department_id = employees['DEPARTMENT_ID'] == 90
employees_filtered_v4 = employees[manager_id & department_id]
employees_filtered_v4

The output, which is shown below, is the same as what we got with our SQL code snippets.

FilterByValuePandasOutput

In the above example, the "and" operator is used to filter data which matches the both criteria. However if you want to use the "or" operator both in your SQL query and Pandas script, use "or" and "|"(pipe), respectively.

Filtering Null and Not Null Values

SQL Version

SELECT * FROM EMPLOYEES
WHERE COMMISSION_PCT IS NULL

The result for the EMPLOYEES table is 72 records.

Image title

Pandas Version

employees[employees.COMMISSION_PCT.isnull()].count()[1]

Image title

SQL Version

SELECT * FROM EMPLOYEES
WHERE COMMISSION_PCT IS NOT NULL

The result for the EMPLOYEES table is 35 records.

Image title

Pandas Version

employees[employees.COMMISSION_PCT.notnull()].count()[1]

The result for this code is:

Image title

Joining Two Data Sets

Sometimes it is necessary to join two data sets. You can join data sets with different join types. There are four different type of joins in SQL, and it can be applied to Pandas data frames with the same logic. These are:

  1. inner join

  2. left/right join

  3. full outer join

The following example is an inner join example. In our Pandas version, you can change the join type by setting the parameter for the merge function.

1. For left outer join:

how='left'

2. For right outer join

how='right'

3. For full outer join 

how='outer'

SQL Version

SELECT 
    e.employee_id, e.department_id,
    e.first_name, e.last_name, 
    d.DEPARTMENT_NAME 
FROM 
    employees e,
    departments d
WHERE e.department_id = d.department_id
order by department_name, first_name , last_name

The output is:

Image title

Pandas Version

employees_departments = pd.merge(
    employees[['EMPLOYEE_ID', 'FIRST_NAME', 'LAST_NAME', 'DEPARTMENT_ID']],
    departments[['DEPARTMENT_ID', 'DEPARTMENT_NAME']],
    on = 'DEPARTMENT_ID')
employees_departments

The output is:

Image title

Joins and filtering are covered in this article.   

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Topics:
pandas ,sql ,tutorial ,python for data science ,big data

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}