
Kylo: Automatic Data Profiling and Search-Based Data Discovery


Data profiling is a key task for any data scientist. Learn how Kylo can be used to automatically generate myriad profile statistics from data.


Data profiling is the process of assessing data values and deriving statistics or business information about the data. It allows data scientists to validate data quality and business analysts to determine whether the existing data can be used for different purposes. Kylo automatically generates profile statistics such as minimum, maximum, mean, standard deviation, variance, aggregates (count and sum), and the occurrence of null, missing, unique, duplicate, top, valid, and invalid values.
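As an illustration of how such per-field statistics can be derived (this is not Kylo's internal code), here is a minimal PySpark sketch; the file path and the amount column are hypothetical stand-ins for the ingested dataset:

```python
# Minimal profiling sketch in PySpark; "transactions.csv" and the "amount"
# column are hypothetical stand-ins for the ingested dataset.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("profile-sketch").getOrCreate()
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

total = df.count()
stats = df.select(
    F.min("amount").alias("min"),
    F.max("amount").alias("max"),
    F.mean("amount").alias("mean"),
    F.stddev("amount").alias("stddev"),
    F.variance("amount").alias("variance"),
    F.sum("amount").alias("sum"),
    F.count("amount").alias("non_null"),          # count aggregate
    F.countDistinct("amount").alias("unique"),    # occurrence of uniqueness
).first()

null_count = total - stats["non_null"]            # occurrence of null values
top_values = (df.groupBy("amount").count()        # occurrence of top values
                .orderBy(F.desc("count")).limit(10).collect())
```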

Once the data has been ingested, cleansed, and persisted in the data lake, business analysts search it to find out whether it can deliver business impact. Kylo lets users build queries to access the data, making it simple to discover data and build data products that support analysis. In this blog, let's discuss automatic data profiling and search-based data discovery in Kylo.

Prerequisites

Kylo deployment requires knowledge of several components and technologies; to learn about it, refer to our previous article on Kylo setup for data lake management.

To learn more about Kylo self-service data ingest, refer to this previous article.

Data Profiling

Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection. Kylo’s data profiling routine generates statistics for each field in an incoming dataset. Profiling is used to validate data quality. The profiling statistics can be found on the Feed Details page.

Feed Details

The feed ingestion using Kafka is shown in the below diagram:

[Figure: feed ingestion using Kafka]

Informative summaries about each field of the ingested data can be viewed under the View option on the Profile page. Profiling details for a String field (the User field in the sample dataset) and a numeric field (the Amount field in the sample dataset) are shown in the below diagrams:

[Figure: profiling details for the String field (User)]
[Figure: profiling details for the numeric field (Amount)]

Profiling Statistics

Kylo profiling jobs automatically calculate basic numeric field statistics such as minimum, maximum, mean, standard deviation, variance, and sum, along with basic statistics for String fields. The numeric field statistics for the Amount field are shown in the below diagram:

[Figure: numeric field statistics for the Amount field]

The basic statistics for the String field (User field) are shown in the below diagram:

[Figure: basic statistics for the String field (User)]

Standardization Rules

Predefined standardization rules are used to manipulate data into conventional or canonical formats (e.g., normalizing dates, stripping special characters) or to protect data (e.g., masking credit cards and other PII). A few standardization rules applied to the ingested data are shown below:

[Figure: standardization rules applied to the ingested data]

Kylo provides an extensible Java API to develop custom validation, custom cleansing, and standardization routines as per business needs. The standardization rules applied to the User, Business, and Address fields as per the configuration are shown in the below diagram:

[Figure: standardized User, Business, and Address fields]
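For illustration only, the sketch below shows in Python the kind of logic such standardization rules apply (masking, stripping special characters, canonical date formats). Kylo's actual extension API is Java, and these function names are hypothetical:

```python
# Illustrative standardization logic; not Kylo's API, which is Java.
import re
from datetime import datetime

def mask_credit_card(value: str) -> str:
    """Keep only the last four digits: 'XXXX-XXXX-XXXX-1234'."""
    digits = re.sub(r"\D", "", value)
    return "XXXX-XXXX-XXXX-" + digits[-4:]

def strip_special_characters(value: str) -> str:
    """Remove everything except letters, digits, and spaces."""
    return re.sub(r"[^A-Za-z0-9 ]", "", value)

def to_canonical_date(value: str) -> str:
    """Normalize 'MM/DD/YYYY' input to ISO 'YYYY-MM-DD'."""
    return datetime.strptime(value, "%m/%d/%Y").strftime("%Y-%m-%d")
```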

Profiling Window

Kylo’s profiling window provides additional tabs, Valid and Invalid, to view both valid and invalid data after ingestion. If a validation rule fails, the data is marked as invalid and shown under the Invalid tab along with the reason for failure, such as a range validator violation or a value that could not be parsed as a timestamp.

[Figure: Valid and Invalid tabs in the profiling window]
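As a rough sketch of how a range validator might classify a record and record a failure reason (the field name and bounds here are hypothetical, not Kylo's implementation):

```python
# Hypothetical range validator mirroring the Valid/Invalid classification.
def validate_amount(row: dict, low: float = 0.0, high: float = 1_000_000.0):
    """Return (is_valid, reason); reason is None for valid rows."""
    try:
        amount = float(row["amount"])
    except (KeyError, ValueError):
        return False, "amount is missing or not numeric"
    if not (low <= amount <= high):
        return False, f"range validator violated: {amount} not in [{low}, {high}]"
    return True, None
```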

The data is ingested from Kafka. During feed creation, the Kafka batch size is set to 10,000, which is the number of messages grouped into a single batch for ingestion. To learn more about batch size, refer to our previous article. Profiling is applied to each batch, and an informative summary is available on the Profile page. The 68K records consumed from Kafka are shown in the below diagram:

[Figure: 68K records consumed from Kafka]
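For context, here is a minimal sketch of batch consumption from Kafka using the kafka-python client; the topic name and settings are illustrative, not Kylo's internal configuration:

```python
# Illustrative batch consumption from Kafka; "transactions" is hypothetical.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    max_poll_records=10000,            # upper bound on messages per poll
    auto_offset_reset="earliest",
)
batch = consumer.poll(timeout_ms=1000)  # {TopicPartition: [records]}
for tp, records in batch.items():
    print(tp.topic, len(records))
```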

Search-Based Data Discovery

Kylo uses Elasticsearch to index data and metadata for free-form search. While creating a feed, business analysts decide which fields should be searchable and enable the index option for those fields. The indexed User and Business fields, searchable from Global Search, are shown in the below diagram:

[Figure: indexed User and Business fields in Global Search]
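As a hedged sketch of what indexing a search-enabled field into Elasticsearch looks like with the elasticsearch-py client (the index name and document are hypothetical; Kylo's index feed handles this internally):

```python
# Illustrative indexing of search-enabled fields; index name is hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
doc = {"customer_name": "Bradley Martinez", "business": "JP Morgan Chase & Co"}
es.index(index="kylo-data", id="1", document=doc)
```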

Index Feed

The predefined Index Feed queries the index-enabled field data from the persisted Hive table and indexes the feed data into Elasticsearch. The Index Feed is automatically triggered as part of the Data Ingest template. The index feed job status is highlighted in the below diagram:

[Figure: index feed job status]

If the index feed fails, a search cannot be performed on the ingested data. As “user” is a reserved word in Hive, the search functionality for User and Business fields failed due to the field name “user,” as shown in the below diagram:

[Figure: failed search caused by the reserved field name “user”]

To resolve this, the “user” field is renamed to “customer_name” during feed creation.
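For example, if the same rename had to be applied manually in Spark before persisting to Hive, it would be a one-liner (assuming a DataFrame from the ingestion step):

```python
# Rename the reserved Hive word "user" before persisting; df is assumed
# to be the ingested DataFrame from the earlier profiling sketch.
df = df.withColumnRenamed("user", "customer_name")
```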

Search Queries

The search query to return matching documents from Elasticsearch is:

customer_name: “Bradley Martinez”

[Figure: search results for the customer_name query]

The Lucene search query to search both data and metadata is:

business: “JP Morgan Chase & Co”

[Figure: search results for the business query]
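For reference, a sketch of how such Lucene-style queries could be run against Elasticsearch with the elasticsearch-py client; the index name is hypothetical:

```python
# Illustrative Lucene-style search via query_string; index is hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(index="kylo-data", query={
    "query_string": {"query": 'customer_name:"Bradley Martinez"'}
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```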

Feed Lineage

Lineage is automatically maintained at the feed level by the Kylo framework, using the sinks identified by the template designer and any sources identified when registering the template.

[Figure: feed lineage graph]

Conclusion

In this article, we discussed automatic data profiling and search-based data discovery in Kylo, along with a few issues faced with the Index Feed and their solutions. Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection, and it provides an extensible API for building custom validators and standardizers. Once properly set up with these supporting technologies, Kylo performs data profiling and discovery automatically in the background.



