Data Modeling Machine Learning Datasets

Learn about accessing and turning data warehouse fact and dimension tables into a traditional machine learning dataset. Let the classifications and regressions begin!

Valuable corporate data is often stored in traditional data warehouses, and it can be an important source for predictive analytics. As a data scientist, getting your hands on this data can be problematic: without a SQL skillset, extracting data warehouse data can be close to impossible. This post provides insights into querying MariaDB ColumnStore for machine learning datasets.

This post discusses a solution (recipe) for accessing and turning data warehouse fact and dimension tables into a traditional machine learning dataset. The dataset is necessary for machine learning (ML) algorithms, processing, and data exploration. We will leverage all the goodness and benefits of an analytic data engine such as MariaDB ColumnStore and then integrate that data into some machine learning algorithms for advanced analytical processing.

A typical data warehouse schema has:


  • Dimension tables containing categorization of people, products, place, and time — generally modeled as one table per object.

  • Fact table containing measurements, metrics, and facts of a business process.

An ERP data warehouse will typically have order and order_line fact tables recording each order and its associated order line items, related to the product and store dimension tables via foreign keys.

A data analyst can then calculate various metrics such as the sum of orders per store, the sum of orders per category, or the sum of orders per day/week/month.  
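These per-store and per-period metrics boil down to joins plus aggregates. A rough sketch, using an in-memory SQLite database in place of MariaDB ColumnStore, with invented table and column names:

```python
import sqlite3

# In-memory toy star schema; table and column names are illustrative,
# not the article's actual warehouse schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table stores (id integer primary key, name text);
    create table orders (id integer primary key, store_id integer, amount real);
    insert into stores values (1, 'North'), (2, 'South');
    insert into orders values (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# Sum of orders per store: the kind of metric a data analyst computes
# by joining the fact table (orders) to a dimension table (stores).
rows = con.execute("""
    select stores.name, sum(orders.amount)
    from orders
    join stores on orders.store_id = stores.id
    group by stores.name
    order by stores.name
""").fetchall()

print(rows)  # [('North', 25.0), ('South', 7.5)]
```

The same group-by pattern extends to per-category and per-day/week/month metrics once the relevant dimension table is joined in.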

While a normalized database model like the above makes sense for a data analyst, it makes ad hoc queries significantly harder for data scientists. Data scientists balk at complex joins and window functions.

Let's investigate an approach to improve the usability of data warehouses for machine learning.

Approaches for Data Scientists to Engage With Data

End-of-month static reports are fine for a data analyst working with a data warehouse. But to truly unlock the value of data warehouses for machine learning, the warehouse needs to be available for more than simple end-of-month reporting and querying: it must be open to data engineers and scientists for advanced analytics and insights.

Here are three approaches, with varying levels of SQL query skills required:

  1. Denormalization: Create a view (materialized for performance) that consolidates the data warehouse, which then produces an analytical dataset. Data scientists can now explore the data without complex joins and window functions.
  2. Dataset modeling: Use a SQL query to represent a semantic model of the relationships between the warehouse tables. 
  3. Query filtering: Use a SQL stored procedure to implement pluggable WHERE clauses to enable query filtering of warehouse data.
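The query-filtering idea (approach 3) can also be sketched outside of a stored procedure: a small client-side helper that plugs whitelisted WHERE clauses into the query. A minimal illustration, assuming SQLite and invented table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table orders (id integer, store_id integer, amount real);
    insert into orders values (1, 1, 10.0), (2, 2, 7.5), (3, 1, 4.0);
""")

# A whitelist of named filters keeps the SQL safe while letting
# non-SQL users pick a slice of the warehouse. Filter names are invented.
FILTERS = {
    "all": "1 = 1",
    "store": "store_id = :store_id",
    "big_orders": "amount >= :min_amount",
}

def fetch_orders(filter_name, **params):
    where = FILTERS[filter_name]  # only known clauses are allowed
    sql = f"select id, amount from orders where {where} order by id"
    return con.execute(sql, params).fetchall()

print(fetch_orders("store", store_id=1))         # [(1, 10.0), (3, 4.0)]
print(fetch_orders("big_orders", min_amount=5))  # [(1, 10.0), (2, 7.5)]
```

A SQL stored procedure achieves the same effect server-side, with the filter selection passed in as procedure parameters.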

Resulting ML Dataset


The biggest win is usability: the data warehouse has been merged into a single table (dataset) ready for analytic processing, and that data can now be fed directly into a machine learning algorithm.

Let the classifications and regressions begin!
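As a toy illustration of that last step, here is a least-squares line fit in plain Python over two columns one might pull from the flattened dataset (the column names and values are invented):

```python
# Once the warehouse is flattened into one table, each row can feed a
# model directly. A minimal sketch: ordinary least squares on a single
# feature, e.g. units ordered vs. order amount.
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

units = [1, 2, 3, 4]            # feature column from the flattened dataset
amount = [2.0, 4.0, 6.0, 8.0]   # target column

slope, intercept = fit_line(units, amount)
print(slope, intercept)  # 2.0 0.0
```

In practice you would hand the flattened table to a library such as scikit-learn, but the data-access pattern is the same: one row per observation, one column per feature.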

Dataset Creation (SQL Query)

To create a dataset (table) like the one above, join the fact table with all of its dimension tables into one single table. Refresh this table periodically, depending on analytic requirements; nightly works for most cases. Any changes to the dimension tables, like a store name, will be captured on the next refresh.

Dataset query:

    with
        products as (select * from products_table),
        stores as (select * from stores_table),
        orders as (select * from orders_table),
        order_lines as (select * from order_lines_table),
    joined as (
        select *
        from order_lines
        left outer join orders on order_lines.order_id = orders.id
        left outer join products on orders.product_id = products.id
        left outer join stores on orders.store_id = stores.id
    )
    select * from joined;

Dataset as a SQL Table or View

A database view creates a pseudo-table, and from the perspective of a select statement, it appears exactly as a regular table. In MariaDB, views are created with the create view statement:

create view <view_name> as
    select <column1>, <column2>, .... 
    from <table_name>;

The view is now available to be queried with a select statement. Because a view is not a real table, the underlying query is run every time you query the view, and depending on its definition the view may not accept inserts, updates, or deletes. If you want to store the result of the underlying query, some databases (PostgreSQL, for example) support the materialized keyword:

create materialized view <view_name> as
    select <column1>, <column2>, .... 
    from <table_name>;

You then control the refresh schedule of the view and can rebuild its contents at your convenience:

refresh materialized view <view_name>;

Note that MariaDB does not support materialized views natively; the usual workaround is a regular table built from the same query (create table <table_name> as select ...) and refreshed on a schedule.
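Not every engine ships materialized views; MariaDB, for instance, does not support them natively. A sketch of the usual workaround, snapshotting the query into a real table and rebuilding it on a schedule (SQLite is used here for illustration, with invented table names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table orders (id integer, amount real);
    insert into orders values (1, 10.0), (2, 7.5);
""")

# Emulate a materialized view: snapshot the query into a real table,
# and "refresh" by rebuilding it on whatever schedule analytics need.
def refresh_dataset():
    con.executescript("""
        drop table if exists dataset;
        create table dataset as select id, amount from orders;
    """)

refresh_dataset()
con.execute("insert into orders values (3, 4.0)")
refresh_dataset()  # new rows appear only after a refresh
print(con.execute("select count(*) from dataset").fetchone()[0])  # 3
```

In MariaDB, the refresh step is typically scheduled with an event or an external cron job.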

Creating a dataset from a data warehouse is just one use case of the MariaDB ColumnStore advanced analytic datastore. Following are a few other use cases for preparing data for machine learning modeling and predictive analytics datasets. 

Prepare Your Data for Modeling

MariaDB ColumnStore

Data profiling

Quickly summarize the shape of your dataset to avoid bias or missing information before you start building your model. Missing data, zero values, text values, and the distribution of the data are visualized automatically upon ingestion.
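A minimal profiling pass of this kind can be sketched in a few lines of plain Python (the column values are invented):

```python
# A quick profiling pass over one column of the dataset: counts of
# missing and zero values as a first sanity check before modeling.
amounts = [10.0, 0.0, None, 7.5, 0.0, 3.2, None]

profile = {
    "rows": len(amounts),
    "missing": sum(v is None for v in amounts),
    "zeros": sum(v == 0 for v in amounts),
}
print(profile)  # {'rows': 7, 'missing': 2, 'zeros': 2}
```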

Summary statistics

Visualize your data with summary statistics to get the mean, standard deviation, min, max, cardinality, quantile, and a preview of the dataset.
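For example, with Python's standard statistics module (the values are invented):

```python
import statistics

# Summary statistics for one numeric column of the dataset.
values = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]

summary = {
    "mean": statistics.mean(values),
    "stdev": statistics.stdev(values),  # sample standard deviation
    "min": min(values),
    "max": max(values),
    "median": statistics.median(values),
}
print(summary)
```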

Aggregate, filter, bin, and derive columns

Build unique views with window functions, filtering, binning, and derived columns.

Slice, log transform, and anonymize

Normalize, anonymize, and partition to get your data into the right shape for modeling.
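Two of those steps, min-max normalization and hash-based anonymization, can be sketched as follows (the field values are invented):

```python
import hashlib

# Min-max normalization and a one-way hash to anonymize identifiers,
# two common reshaping steps before modeling.
def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def anonymize(customer_id):
    # Truncated SHA-256: stable, non-reversible stand-in for the real ID.
    return hashlib.sha256(customer_id.encode()).hexdigest()[:12]

print(normalize([10.0, 20.0, 30.0]))  # [0.0, 0.5, 1.0]
print(anonymize("customer-42"))       # stable 12-character token
```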

Variable creation

Highly customizable variable value creation to home in on the key data characteristics to model.

Training and validation sampling datasets

Design a random or stratified sampling plan to generate datasets for model training and scoring.
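A stratified split can be sketched in plain Python: sample the same fraction from each class so rare labels stay represented (labels and rows are invented):

```python
import random

def stratified_split(rows, label_of, train_frac=0.75, seed=42):
    """Split rows into (train, valid), sampling train_frac per class."""
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    rng = random.Random(seed)
    train, valid = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid

rows = [("a", 1)] * 8 + [("b", 0)] * 4
train, valid = stratified_split(rows, label_of=lambda r: r[0])
print(len(train), len(valid))  # 9 3
```

Each class contributes the same fraction: 6 of the 8 "a" rows and 3 of the 4 "b" rows land in the training set.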



Published at DZone with permission of
