Deciphering Data to Uncover Hidden Insights: Data Wrangling and Ingestion
Deciphering Data to Uncover Hidden Insights: Data Wrangling and Ingestion
Learn how to be a cowboy, but with data instead of cows. Somehow databoy doesn't sound as good...
Join the DZone community and get the full member experience.Join For Free
The Architect’s Guide to Big Data Application Performance. Get the Guide.
This multi-part article talks about how to collect data, wrangle data, ingest data, model data, and visualize data from three viewpoints (conceptual, practical, and best practice).
In the first article of this series, we have seen how to understand data conceptually through an example from the Banking, Financial Services and Insurance (BFSI) domain.
In this article, we will learn how to wrangle the data (i.e. clean the data) according to your business scenario with Alibaba Cloud Quick BI. We may need Quick BI in the upcoming process of deciphering data so please ensure that you have registered for an Alibaba Cloud account. If you haven't, sign up for a free account through this link.
What Is Data Wrangling?
Data wrangling sometimes referred to as data munging, is the process of transforming data from one format to another with the intent of making it more appropriate and valuable for analytics. With the rapid rise of big data and IoT applications, the number of data types and formats are increasing each day. This makes data wrangling an indispensable element for big data processing, especially for larger applications.
What Is Quick BI?
Alibaba Cloud Quick BI is a flexible and lightweight business analytics platform built on the cloud. The basic components of Quick BI are as follows:
- Data source
- Worksheet (Quick BI Basic)
- Workbook (Quick BI Pro and Professional edition)
- Portal (Quick BI Pro and Professional edition)
Wrangle the Data (Conceptual)
Data wrangling includes:
- Data Cleaning
- Data Editing
Data cleansing or data cleaning is the process of detecting, analyzing, removing the inaccurate records from a data set, and replacing the inaccurate parts of the data with appropriate data. Data cleansing could be done with help of data wrangling tools or through scripts.
For example, let's assume the dataset has some erroneous value in the date column like "s29-05-2018" instead of "29-05-2018." To a person, this may seem like a trivial error, but to a system this entry is unreadable. The system may not read this entry as a date but instead, it will take it as a string.
Note: We can use Excel itself to clean the data but may we miss something. It's better to do it via scripts or tools. If the data source is a database, then we can do it with a BI tool.
Data Editing and Preparing
Data editing and preparing is nothing but a manual process for changing the data, data types for the reporting purpose.
For example, let's assume the dataset has date values in "MM-DD-YYYY" format but we need in "DD-MM-YYYY" format. This means we need to change it manually.
Note: Data editing and preparing are usually carried out in Excel itself or, if the data source is a database, then we can do it with a BI tool.
Wrangle the Data (Practical)
As we discussed earlier, we are going to use Excel to wrangle the data. In addition to that, I would also like to share the Python script I wrote to wrangle the data automatically.
Use Case 1: ATM Analytics
Here, we will look at our first use case, an ATM Dataset. In this use case example, we will see how to wrangle the data in Excel.
Our dataset is clean, but, for demonstration, we will make some part of this data erroneous or inaccurate first.
Now our data contains some erroneous values like:
- Transaction Date has some erroneous data, so it recognizes the columns as a String instead of a Date.
- No. of Withdraws must contain the whole number but it contains the decimal number.
- Working Day contains one small h which makes it has new categorical value.
- Working Day and Holiday Sequence also contains erroneous data.
We need to handle this because, if it's not handled properly, it will lead to inaccurate results which, in turn, leads to false insights.
Let me show how to handle this in Excel. Please have a look at the following table:
Using Type Function, we can easily find erroneous data. Please follow the following steps:
1. Insert a new column for validating the data.
2. Write the formula "=TYPE(Range)"
3. Apply a filter to the column to validate.
From this, we can easily find whether the date column has erroneous values or not. Because it contains text, we can differentiate text vs. numbers, but we aren't able to find out whether it has a float instead of an integer. We may also use other functions in this process, like ISNUMBER(), ISTEXT, etc.
We may not able to analyze each cell to find the incorrect data. So, it is essential to use any wrangling tool or scripts.
Use Case 2: Customer 360
Here we go our next use case Customer360.
We are going to use a Python script to wrangle the data automatically. I will write a separate article in detail about this programmatic approach to deciphering the data to uncover the hidden insights out of it.
# ============================================================================= # #Importing the necessary modules # ============================================================================= import pandas as pd import numpy as np # ============================================================================= # #Data Wrangling or Data munging # ============================================================================= def datawrangler(data): data=data.drop_duplicates() data.rename(columns=lambda x: x.replace(' ', ''), inplace=True) # Getting Column Names columns=data.columns.values columns=pd.DataFrame(columns,columns=['ColumnFields']) # Getting Column Types columndatatype=pd.DataFrame(data.dtypes) columndatatype.index=range(len(columns)) #Adding Column types with their respective columns columns['ColumnTypes']=columndatatype for x in range(len(columns)): if (columns['ColumnTypes'].iloc[x]) is np.dtype(np.int64) or (columns['ColumnTypes'].iloc[x]) is np.dtype(np.float64) : data[columns['ColumnFields'].iloc[x]]=data[columns['ColumnFields'].iloc[x]].fillna(0) if (columns['ColumnTypes'].iloc[x]) is not np.dtype(np.int64): data[columns['ColumnFields'].iloc[x]]=data[columns['ColumnFields'].iloc[x]].fillna(method='ffill') if 'date' in (columns['ColumnFields'].apply(np.str.lower).iloc[x]): data[columns['ColumnFields'].iloc[x]]=data[columns['ColumnFields'].iloc[x]].apply(pd.to_datetime) if (columns['ColumnTypes'].iloc[x]) is np.dtype(np.float64): for y in range(len(data)): if (np.modf(data[columns['ColumnFields'].iloc[x]].iloc[y])) in ['0.0',0.0] : data[columns['ColumnFields'].iloc[x]]= data[columns['ColumnFields'].iloc[x]].astype(np.dtype(np.int64)) break if (columns['ColumnTypes'].iloc[x]) is not np.dtype(np.int64) and (columns['ColumnTypes'].iloc[x]) is not np.dtype(np.float64) and ('date' not in (columns['ColumnFields'].apply(np.str.lower).iloc[x])): for y in range(len(data)): if ((data[columns['ColumnFields'].iloc[x]].iloc[y]).isnumeric()) : data[columns['ColumnFields'].iloc[x]]=data[columns['ColumnFields'].iloc[x]].astype(np.dtype(np.int64)) break return data # ============================================================================= # #Getting the Data # ============================================================================= location=input("\n\nEnter the File Location\n\n\t") data=pd.read_csv(location,index_col=0) savelocation=input("\n\nEnter the File Save Location\n\n\t") # ============================================================================= # #Passing the Data to the Module # ============================================================================= data=datawrangler(data) data.to_csv(savelocation, index=False, header=None)
Wrangle the Data (Best Practices)
- It's always a best practice to check whether the data is clear or needs to be wrangled.
- Check whether the data has anomalies and outliers.
- Convert the categorical values to lower or upper case and then convert it to sentence case to avoid repetition of the same values.
- Find null values and NaN values which lead to false results.
- Use data wrangler or preparation tools to wrangle the data to avoid human error.
Ingest the Data (Conceptual)
Data ingestion is a critical success factor for analytics and business intelligence. We need to know about OLTP and OLAP.
OLTP is an Online Transaction Processing system. The focus of the OLTP system is to record whether an Update, Insertion and Deletion took place during the transaction. The OLTP queries are simpler and short and hence require less time in processing and require less space. A common example of an OLTP system is an ATM database, in which, by using short transactions, we modify the status of our account.
OLAP, on the other hand, is an Online Analytical Processing system. OLAP databases store historical data that has been inputted by OLTP. Using OLAP, you can extract information from a large database and analyze it for decision making. A classic example of an OLAP system is a data warehouse, which accumulates data from multiple OLTP data sources for querying and analyzing the data.
For both OLTP and OLAP, we may have files or database as data sources. Typical file formats include .CSV and .XLS. We also need to consider different types of databases for our application. Connectors will be available to easily connect the databases.
Ingest the Data (Practical)
Let's now see how to ingest data into Quick BI.
First, you'll need to have set up Quick BI.You'll need to have an Alibaba Cloud account.
- You'll need to have an Alibaba Cloud account. Log in to your Alibaba Cloud account.
- Enter your Console and navigate to the Quick BI product console.
- Select the Region and Select Quick BI Pro.
Your Quick BI instance has been created.
How to Ingest Files Into Quick BI Pro
- Select Personal Workspace.
- Click the Data Source, Click Upload, Upload the file, Name the file, and Click Ok.
- Data is successfully ingested into Quick BI.
Note: Files are only supported in the personal workspace.
How to Ingest Data from Database to Quick BI Pro
- Select Default Workspace.Click the Data Source, Click Create Data Source, Select the Data Source.
Note: You can select either Cloud Databases or External Databases.
- Fill the Connection Details, Click Test Connection, and Save.
- Data source connection is established.
Note: Please choose workspace other than the personal workspace so that you can able to use workbooks.
Ingest the Data (Best Practices)
- Always Ingest the Wrangled data if you are using files.
- Don't connect OLTP Database as source as doing so may degrade the performance of source systems.
- Connect the OLAP Databases or Use RDS Services offered by Alibaba Cloud Platform.
- Always use the custom query to ingest the data instead of ingesting all tables.
If you followed the steps correctly, you should have successfully ingested data into Quick BI, which is a great milestone in the process of deciphering your data to uncover the hidden insights.
Please ensure that you have registered for an Alibaba Cloud account to enjoy a free trial of Quick BI Pro. In the next article of this multi-part tutorial, we will see how to model the data with Quick BI. See you all in the next part of this article series. Stay tuned.
By Ranjith Udayakumar, Alibaba Cloud Tech Share Author.
Opinions expressed by DZone contributors are their own.