Data profiling is the process of assessing data values and deriving statistics or business information about the data. It allows data scientists to validate data quality and business analysts to determine how existing data can be used for different purposes. Kylo automatically generates profile statistics such as minimum, maximum, mean, standard deviation, variance, and aggregates (count and sum), as well as counts of null, unique, missing, duplicate, top, valid, and invalid values.
Once the data has been ingested, cleansed, and persisted in the data lake, the business analyst searches it to find out whether it can deliver business impact. Kylo makes data discovery simple by letting users build queries to access the data and create data products that support analysis. In this blog, let's discuss automatic data profiling and search-based data discovery in Kylo.
Deploying Kylo requires knowledge of several components and technologies. To learn more, refer to our previous article on Kylo setup for data lake management.
To learn more about Kylo self-service data ingest, refer to this previous article.
Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection. Kylo’s data profiling routine generates statistics for each field in an incoming dataset. Profiling is used to validate data quality. The profiling statistics can be found on the Feed Details page.
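To make the per-field statistics concrete, here is a minimal sketch in plain Python. It is not Kylo's Spark-based profiler; the function names, the sample values, and the choice of population (rather than sample) variance are all assumptions made for illustration.

```python
import statistics
from collections import Counter


def profile_numeric(values):
    """Basic statistics for a numeric field; None marks a null value.

    Assumes the field has at least one non-null value.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "sum": sum(present),
        "mean": statistics.mean(present),
        # Population variance and standard deviation (an assumption;
        # a profiler could equally report the sample versions).
        "variance": statistics.pvariance(present),
        "stddev": statistics.pstdev(present),
    }


def profile_string(values, top_n=1):
    """Basic statistics for a string field; None marks a null value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": len(values),
        "null_count": len(values) - len(present),
        "unique_count": len(counts),
        # Non-null values that repeat an earlier value.
        "duplicate_count": len(present) - len(counts),
        "max_length": max(len(v) for v in present),
        "top_values": counts.most_common(top_n),
    }


# Made-up sample rows standing in for the Amount and User fields.
amounts = [120.0, 80.0, None, 100.0]
users = ["Bradley Martinez", "Ann Lee", "Bradley Martinez", None]

numeric_profile = profile_numeric(amounts)
string_profile = profile_string(users)
print(numeric_profile)
print(string_profile)
```

In Kylo these figures are produced by a Spark job over the whole feed, so the sketch only mirrors the kind of output, not how it is computed.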
Feed ingestion using Kafka is shown in the diagram below:
Informative summaries of each field in the ingested data can be viewed under the View option on the Profile page. Profiling details for a string field (User in the sample dataset) and a numeric field (Amount in the sample dataset) are shown in the diagrams below:
Kylo profiling jobs automatically calculate basic numeric field statistics such as minimum, maximum, mean, standard deviation, variance, and sum. Kylo also provides basic statistics for string fields. The numeric field statistics for the Amount field are shown in the diagram below:
The basic statistics for the string field (User field) are shown in the diagram below:
Predefined standardization rules are used to convert data into conventional or canonical formats (e.g., normalizing dates or stripping special characters) or to protect data (e.g., masking credit card numbers and other PII). A few standardization rules applied on the ingested data are as follows:
Kylo provides an extensible Java API for developing custom validation, cleansing, and standardization routines as per business needs. The standardization rules applied to the User, Business, and Address fields as per the configuration are shown in the diagram below:
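As a rough illustration of what standardization rules of this kind do, the following Python sketch applies a few rules per field. It does not use Kylo's Java API; the rule functions, the rule-to-field mapping, the `card` field, and the sample record are hypothetical, since the article does not state which rule was configured for which field.

```python
import re


def strip_special_characters(value):
    """Keep only letters, digits, and spaces."""
    return re.sub(r"[^A-Za-z0-9 ]", "", value)


def collapse_whitespace(value):
    """Trim the value and collapse runs of whitespace to one space."""
    return " ".join(value.split())


def uppercase(value):
    return value.upper()


def mask_credit_card(value):
    """Mask every digit except the last four."""
    digits = re.sub(r"\D", "", value)
    return "*" * max(len(digits) - 4, 0) + digits[-4:]


# Hypothetical mapping of field name to the rules applied, in order.
RULES = {
    "user": [strip_special_characters],
    "business": [uppercase],
    "address": [collapse_whitespace],
    "card": [mask_credit_card],
}


def standardize(record, rules=RULES):
    """Return a copy of the record with each field's rules applied."""
    result = dict(record)
    for field, functions in rules.items():
        if result.get(field) is None:
            continue  # leave missing or null fields untouched
        for function in functions:
            result[field] = function(result[field])
    return result


# Made-up sample record.
raw = {
    "user": "Bradley Martinez!",
    "business": "JP Morgan Chase & Co",
    "address": "  12   Main St ",
    "card": "4111-1111-1111-1111",
}
standardized = standardize(raw)
print(standardized)
```

Keeping each rule as a small function and composing them per field mirrors the idea of configurable rules; in Kylo itself the rules are selected per field during feed creation.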
Kylo’s profiling window provides additional tabs, Valid and Invalid, for viewing valid and invalid data after ingestion. If a validation rule fails, the record is marked as invalid and shown under the Invalid tab with the reason for failure, such as a range validator rule violation or a value that is not recognized as a timestamp.
The data is ingested from Kafka. During feed creation, the Kafka batch size is set to “10000” (the number of messages that the Kafka producer will attempt to batch before sending them to the consumer). To learn more about batch size, refer to our previous article. Profiling is applied to each batch of data, and an informative summary is available on the Profile page. The 68K records consumed from Kafka are shown in the diagram below:
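The split into valid and invalid records, each invalid record carrying its failure reason, can be sketched as follows. This is plain Python under stated assumptions, not Kylo's validator API: the `txn_time` field, the timestamp format, the 0 to 10000 range, and the `reject_reason` key are all hypothetical.

```python
from datetime import datetime


def validate_range(value, low, high):
    """Return a failure reason, or None if the value is within range."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return "not a number"
    if not low <= number <= high:
        return f"range validator rule violation: {number} not in [{low}, {high}]"
    return None


def validate_timestamp(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Return a failure reason, or None if the value parses as a timestamp."""
    try:
        datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        return "not a valid timestamp"
    return None


def split_valid_invalid(records):
    """Partition records; invalid ones get a 'reject_reason' entry."""
    valid, invalid = [], []
    for record in records:
        checks = (
            validate_range(record.get("amount"), 0, 10000),
            validate_timestamp(record.get("txn_time")),
        )
        reasons = [reason for reason in checks if reason is not None]
        if reasons:
            invalid.append({**record, "reject_reason": "; ".join(reasons)})
        else:
            valid.append(record)
    return valid, invalid


# Made-up sample records.
records = [
    {"amount": "120.50", "txn_time": "2018-01-15 10:30:00"},
    {"amount": "25000", "txn_time": "2018-01-15 10:31:00"},
    {"amount": "75", "txn_time": "15/01/2018"},
]
valid, invalid = split_valid_invalid(records)
print(len(valid), "valid;", len(invalid), "invalid")
for record in invalid:
    print(record["reject_reason"])
```

Recording the reason alongside the rejected record is what makes an Invalid view useful, because the analyst can see which rule a value broke without re-running the check.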
Search-Based Data Discovery
Kylo uses Elasticsearch as the index behind its search features, such as free-form search over data and metadata. While creating a feed, business analysts decide which fields should be searchable and enable the index option for those fields. The indexed User and Business fields, searchable from Global Search, are shown in the diagram below:
The predefined Index Feed queries the index-enabled field data from the persisted Hive table and indexes the feed data into Elasticsearch. The Index Feed is automatically triggered as part of the Data Ingest template. The index feed job status is highlighted in the diagram below:
If the index feed fails, a search cannot be performed on the ingested data. Because “user” is a reserved word in Hive, the search functionality for the User and Business fields failed, as shown in the diagram below:
To resolve this, the “user” field was renamed to “customer_name” during feed creation.
The search query to return the matched documents from Elasticsearch is:
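A check like the following could catch such names before a feed is created. It is a minimal Python sketch, not part of Kylo: the helper names are hypothetical, and the keyword set is a small illustrative subset of Hive's reserved words, not the complete list.

```python
# Illustrative subset of Hive reserved keywords (not the full list).
HIVE_RESERVED_SUBSET = frozenset(
    {"user", "date", "timestamp", "order", "group", "table"}
)


def find_reserved_fields(field_names):
    """Return the field names that collide with a reserved keyword."""
    return [name for name in field_names if name.lower() in HIVE_RESERVED_SUBSET]


def apply_renames(field_names, renames):
    """Rename fields and fail if any reserved name is still present."""
    renamed = [renames.get(name, name) for name in field_names]
    remaining = find_reserved_fields(renamed)
    if remaining:
        raise ValueError(f"fields still use reserved words: {remaining}")
    return renamed


fields = ["user", "business", "address", "amount"]
conflicts = find_reserved_fields(fields)
safe_fields = apply_renames(fields, {"user": "customer_name"})
print(conflicts)
print(safe_fields)
```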
customer_name: "Bradley Martinez"
The search query (Lucene search query) to search data and metadata is:
business: "JP Morgan Chase & Co"
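When such queries are built programmatically, the phrase should be wrapped in plain double quotes, with any embedded quote or backslash escaped. The helper below is a small Python sketch of that idea; `phrase_query` is a hypothetical name, and the output omits the space after the colon, which is the form the Lucene query syntax documentation uses for field queries.

```python
def phrase_query(field, phrase):
    """Build a Lucene-style phrase query of the form field:"phrase".

    Inside a quoted phrase only the backslash and the double quote
    need escaping, so those are the only characters handled here.
    """
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


customer_query = phrase_query("customer_name", "Bradley Martinez")
business_query = phrase_query("business", "JP Morgan Chase & Co")
print(customer_query)
print(business_query)
```

Typographic quotes pasted from a word processor are worth replacing with plain double quotes first, since only the plain quote character delimits a phrase in the query syntax.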
In this article, we discussed automatic data profiling and search-based data discovery in Kylo. We also discussed an issue faced with the Index Feed and its solution. Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection. It provides an extensible API for building custom validators and standardizers. Once the different technologies are set up properly, Kylo performs data profiling and discovery automatically in the background.