Why NoSQL Needs Schema-Free ETL Tools
Learn about an open-source tool that can effectively help you deal with the difficulty of processing data into and out of NoSQL.
Join the DZone community and get the full member experience.Join For Free
Many NoSQL databases lack tooling and analytics. Here, I discuss why a schema-agnostic modern ETL approach can help NoSQL vendors and customers.
For any operation involving data, or indeed computing generally, you need three things: input, processing, and output.
The Difficulty of Processing Data Into/Out of NoSQL: The Uncomfortable Truth
NoSQL databases are a fantastic way to store disparate data (data whose structure changes rapidly) — for example, where you don't have control of the source formats.
What they tend to lack, though, is advanced tooling to onboard data in the first place (the input piece), tooling to perform data engineering, advanced analytics and data science on that data (processing), and finally, tooling to display the results or visualize what is contained in your NoSQL database (the output piece).
For a long time, this hasn't been a barrier to the adoption of NoSQL. NoSQL adoption was traditionally bolstered by developers using these databases' developer-friendly APIs in order to wrap them inside a bespoke application. This has worked very well for growing the NoSQL market in the early years.
For this market to continue to take off, though, and challenge the traditional database vendors, a wider range of people need to adopt NoSQL rather than just API savvy developers.
Even developers don't like writing boring "plumbing code" — code that just links data from one place to another. It's dull and repetitive. Customers don't like it, either — as anywhere that code is needed inevitably means a maintenance headache, not to mention a long time to write and test it in the first place. This means increased costs to initially deploy a new technology like NoSQL.
Equally, on the output side, if you can't rapidly visualize the insights you can glean from your data, then you cannot fully realize the benefits of your investment in NoSQL database technology. Trying to code around the problem leads to longer project times, and the aforementioned increase in costs associated with custom coding.
Many NoSQL companies have tried to shoe-horn SQL support into their products in an effort to bridge the gap between traditional BI vendors and their products. This has only been partially successful. BI is very much fixed-schema at the final ten yards of creating a visualization. These SQL layers impose these constraints and remove some of the great flexibility and built-in features these NoSQL databases provide. Thus, customers who try this don't fully realize the benefits the NoSQL databases can offer, decreasing return on investment.
For these many reasons, custom coding for the input, processing, and output of data held in NoSQL databases greatly increases the barrier to entry of customers in using NoSQL and limits the NoSQL market's growth.
What is needed is better tooling around these NoSQL databases.
What Tooling Is Generally Available Right Now?
Tools with user interfaces that enable non-developer users to interact with data held in a variety of systems and to visually create data processing reduce the barrier to entry to use new technology. In the traditional relational databases (RDBMS) space, ETL (extract, transform, load) tools performed this function.
Of course, the historic problem is that your ETL process is fixed-schema upon creation. You effectively hardcode the set of fields in your ETL process when you design it. If the underlying structure changes then best case scenario, the new data is ignored. Worst case scenario: your ETL job breaks.
In the NoSQL world, where your data structures are many and are varied and change often, fixed-schema ETL constrains you too much in what you can do.
But NoSQL could still benefit from similar tooling — tooling that enables non-developers to read data from a variety of systems, cleanse it, discover information about the data, merge it with other sources of data, perform statistical analysis and machine learning, and other advanced operations on it, and then store the enriched data and new insights in to a target database — normally either a NoSQL database or in-memory store for rapid reporting.
Such tools would be very useful to customers who are looking at adopting NoSQL.
A Flexible-Schema ETL Tool
I like cake. I want to have my cake and eat it. I love NoSQL, and I love putting easy-to-use tools in front of people so that they can get rapid business benefit from their technology investments.
I want schema-free ETL that works with NoSQL.
"Adam, you crazy!" I hear you say, "ETL ain't ever gonna be that flexible, it'll never help us in NoSQL!"
Well, guess what...
Turns out there is a way to perform schema-free ETL with support for hundreds of data sources and sinks, machine learning, and feeding the data to a visual business analytics/BI dashboarding layer. Oh, and it's got an open-source core.
This particular trick is held within two features of the Pentaho Platform.
"Ah! Hang on, you work for them! You shameless corporate sell out!" Yeah, well, I do work for Hitachi Vantara, owners and purveyors of the Enterprise Edition of the Pentaho Platform. That's true.
I don't work anywhere, though, where I don't believe in the product. I wouldn't be mentioning Pentaho Data Integration — the open-source ETL tool — if I didn't know for sure it could help with the NoSQL flexible schema tooling problem. So hear me out.
Pentaho Data Integration, on the face of it, looks like every other fixed-schema ETL tool you've ever seen. If you drag on an import step and point it at a data source, the fields you see in the data stream are those seen in the data source - and are fixed for the rest of that 'transformation' (or flow).
Pentaho Data Integration (PDI) Metadata Injection
Pentaho Data Integration though has a unique feature called Metadata Injection. This enables a parent transformation to dynamically set the configuration of steps in a child transformation. It's used in places where you have many slightly different transformations.
A good use case of Metadata Injection is where you're reading one data source — say, a relational database — and sending that structure of data into a target system (say, a NoSQL database). I may develop one transformation to read my sales table and load that into sales JSON documents, another for customer details, another for in-flight shopping baskets, and so on.
Creating 500 of these for 500 source tables would suck though. Man, oh man, the time taken! This is the problem with most other ETL tools.
All these transformations would look suspiciously the same. They'd have maybe ten steps that load data, set a few temporary variables (like JSON collection name, maybe some constant or calculated fields in the target JSON structure), and then load the data into a particular collection.
500 transformations multiplied by 10 steps = 5,000 steps to manually configure... Ouch! Carpel tunnel here I come...
Where Metadata Injection comes in is that you can create a single transformation to do this load, but parameterize it from a parent transformation. You may even configure this parent in a single job and loop over a list of input data sources to perform the work.
Thus you now create just two transformations: one with ten steps and a parent one with another ten-ish steps that loop over the table set and call the child transformation using metadata injection.
Two transformations totaling just 20 steps. My mouse-using hands can handle that easily!
So, using an ETL tool that is flexible enough to load disparate structures into NoSQL now becomes possible, and even low-cost, thanks to Pentaho Data Integration's Metadata Injection support.
PDI to Aid Data Discovery and Semantic Relationship Discovery
But what about loading a variable data lake in Hadoop or NoSQL which contains greatly varying structures?
Well, Pentaho Data Integration can load that data too. You can load the JSON data (for example, XML-supported, too!) and parse it in Pentaho. The JSON input step supports Metadata Injection, too. Thus you can sample data (even just a single record), and then invoke a Metadata Injection transformation to process data with varying schema.
You can even do some crazy stuff...
I developed a custom step with colleagues in the data science team a month or so ago that does some great magic. It'll analyze all data in a stream in a transformation and output summary statistics about it.
Big whoop. Who cares? Well...
What the step does is determine the type of each piece of data (irrespective of the data type in the source system) and determine whether that field is categorical or continuous. It counts the number of unique, null (or not present), values, and for continuous fields, calculates min, max, median, and mean, as well as skewness and dispersion.
In short, we determine the makeup of every field and every piece of data in the source system. We then store that metadata to later drive our ETL processes with metadata injection.
In the NoSQL world where this becomes relevant is loading massive amounts of data from a variety of sources and determining through data science, rather than manually configuration, how data entities link to each other across systems.
Using this approach combined with metadata injection will allow Pentaho transformations to load multiple sources and suggest to an integration developer the entities that exist within an organization's data and the relationships between those entities.
Basically, you can use Pentaho to discover the semantic linkages between an entire organization's data.
You can then use this information to dynamically configure your target system and metadata injection to load that data and blend it, and set up the relationships, semantic relationship models, and other metadata in the target, perhaps NoSQL, database.
This is particularly useful if you have thousands of source record types and don't want to have to manually configure these metamodels in your NoSQL database (be it a document store, or hybrid document-graph/triple store).
We ran this over existing demonstration sales data information and were amazed at how useful the semantic graph was after discovery. All the main entities just popped out on the screen on a semantic graph, with discovered relationships and data types shown, and the strength of the correlation.
Basically, using Pentaho Data Integration with NoSQL just saved your customers months in data discovery, modeling, and data loading development.
What About Data Processing?
Pentaho Data Integration also comes with a myriad of data science plugins, statistical functions, and third-party plugins on the Pentaho Marketplace. This means any crazy data manipulation, data engineering, feature creation, statistical modeling, or machine learning you need to do, you can orchestrate with Pentaho.
Pentaho becomes a central place to do this regardless of the underlying data store — so customers don't have to rely on database vendors to embed these facilities, and NoSQL database companies don't have to invest the millions of engineering dollars to build them.
Even integrating machine learning in Spark, Python, or R becomes a simple case of dragging and dropping single step on to a transformation.
Again, you're welcome.
Visualizing NoSQL-Held Data
"What about the visualization piece?!" I hear you cry, "You promised help with that, too!"
Well, fear ye not...
Another awesome feature of the Enterprise Pentaho platform — that is Pentaho Data Integration combined with Pentaho Business Analytics — is that of exposing Data Services.
A data service is configured in a Pentaho Data Integration (PDI) transformation. You right-click any step and say, "Hey, the data stream I have at this point, I want to expose as a JDBC compatible data source."
It could be anything — a CSV file, a NoSQL set of records, anything. When it is exposed, the data set is given a name and you can connect to it from any JDBC compatible BI tool.
This Data Service can have several options. It can be cached and refreshed every X minutes in order to reduce load on the source system. It can also, crucially, "push down" the
WHERE clause passed through JDBC to the 'Input' step configured in the source system.
What the hell does that mean? Well...
Rather than load all customer sales from your NoSQL database and cache them in memory, only for PDI to filter them by customer number, we can instead "push down" the customer number into the query we pass to the NoSQL database in the first place.
So, the Data Service just became the equivalent of a simple function call with parameters, and only loads the data you absolutely need to answer the query passed to the data service. This executes much faster than a traditional SQL translation layer.
Pentaho can do this for any data source that supports querying, search, or filtering. I've developed data services to do this for customers using MongoDB and for MarkLogic Server, for example. We have a native MongoDB step, and I used MarkLogic's REST API to push down the query to that NoSQL database. It was pretty easy.
I then exposed that to our Pentaho Business Analytics dashboard for live querying and viewing of a few thousand records on my laptop. Worked like a charm and executed within seconds!
And the savings/benefits?
Once I figured how to do it, it took five minutes to develop the transformation to load customer data into NoSQL using PDI, another five for the transformation with a data service, and another five to configure the dashboard.
So, 15 minutes from data to insight. Easy!
Of course, developing many of these transformations using Metadata Injection and variable schema will take a little longer than this simple example, but it's a darn sight quicker than writing data loading code, not to mention cheaper versus development with maintenance over time.
Our ETL models are basically XML files built and documented visually. Someone can open the transformation in two years time and make instant sense of it. The same can't be said of most integration code bases!
In Pentaho Data Integration (PDI), the NoSQL community can access the ability to create schema-free and variable schema data loading, as well as data science and integration transformations, while avoiding creating a myriad of transformations — thus, greatly reducing implementation costs associated with NoSQL systems. They can also be called over REST if needs be for dynamic invocation.
The NoSQL community can also access dashboarding in their BI tool of choice through PDI Data Services over NoSQL data sources.
And this platform is available today and has an open-source core. I'd advise you to download it and try it out.
Published at DZone with permission of Adam Fowler, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.