Part 2: SQL Queries in Pandas Scripting (Filtering and Joining Data)
When filtering and joining big data sets, which tech is better? In this post, we compare the results garnered using both SQL queries and the Pandas library.
Join the DZone community and get the full member experience.Join For Free
In Part 1, I went over information on the preparation of a data environment, which is a sample of HR data, and then did some simple query examples over the data by comparing the Pandas library and SQL. For the examples in this article, we first meed to setup the environment as described in Part 1.
Here, I will continue giving more complex SQL queries by rewriting the queries with Pandas. The SQL queries have 6 parts, which are;
- SELECT: Used to select the all columns or selected columns.
- FROM: Used to include the desired data set.
- WHERE: Used to define joins between data sets or filter the data (in some syntaxes, joins are defined in the FROM clause with join keywords like inner join, outer join, etc.).
- GROUP BY: Used to aggregate the attribute columns to calculate aggregated metrics.
- HAVING BY: Used to filter aggregated metrics.
- ORDER BY: Used to define ordering on desired columns.
SELECT <column or columns> /**/ FROM <table or tables> /**/ WHERE <join> or <filters> /**/ GROUP BY <aggregate column or columns> /**/ HAVING BY <aggregate column filter> /**/ ORDER BY <sort column or columns> /**/
In this article, the proceeding examples will be about filtering the data.
Filtering by Values
SELECT * FROM EMPLOYEES WHERE MANAGER_ID = 100 AND DEPARTMENT_ID = 90
The output is:
It can be written in four different syntaxes, as shown below, which all give the same output data;
employees_filtered_v1 = employees[(employees.MANAGER_ID == 100) & (employees.DEPARTMENT_ID == 90)] employees_filtered_v1 #Or employees_filtered_v2 = employees.query('MANAGER_ID == 100 & DEPARTMENT_ID == 90') employees_filtered_v2 #Or employees_filtered_v3 = employees[(employees['MANAGER_ID'] == 100) & (employees['DEPARTMENT_ID'] == 90)] employees_filtered_v3 #Or #Using Boolean values manager_id = employees['MANAGER_ID'] == 100 department_id = employees['DEPARTMENT_ID'] == 90 employees_filtered_v4 = employees[manager_id & department_id] employees_filtered_v4
The output, which is shown below, is the same as what we got with our SQL code snippets.
In the above example, the "and" operator is used to filter data which matches the both criteria. However if you want to use the "or" operator both in your SQL query and Pandas script, use "or" and "
Filtering Null and Not Null Values
SELECT * FROM EMPLOYEES WHERE COMMISSION_PCT IS NULL
The result for the EMPLOYEES table is 72 records.
SELECT * FROM EMPLOYEES WHERE COMMISSION_PCT IS NOT NULL
The result for the EMPLOYEES table is 35 records.
The result for this code is:
Joining Two Data Sets
Sometimes it is necessary to join two data sets. You can join data sets with different join types. There are four different type of joins in SQL, and it can be applied to Pandas data frames with the same logic. These are:
full outer join
The following example is an inner join example. In our Pandas version, you can change the join type by setting the parameter for the merge function.
1. For left outer join:
2. For right outer join
3. For full outer join
SELECT e.employee_id, e.department_id, e.first_name, e.last_name, d.DEPARTMENT_NAME FROM employees e, departments d WHERE e.department_id = d.department_id order by department_name, first_name , last_name
The output is:
employees_departments = pd.merge( employees[['EMPLOYEE_ID', 'FIRST_NAME', 'LAST_NAME', 'DEPARTMENT_ID']], departments[['DEPARTMENT_ID', 'DEPARTMENT_NAME']], on = 'DEPARTMENT_ID') employees_departments
The output is:
Joins and filtering are covered in this article.
Opinions expressed by DZone contributors are their own.