Structure the Unstructured
"Structure the Unstructured" is the name of a talk I will be giving in the coming weeks. It is about data modeling for unstructured or semi-structured data.
If you are interested, I will be giving it at the following conferences:
- Java Days Kiev - 6–7 November 2015, Kyiv - http://javaday.org.ua/kyiv/
- Devoxx Morocco - 16–18 November 2015, Casablanca - http://devoxx.ma/en/
- TopConf Tallinn - 17–19 November 2015, Tallinn - http://topconf.com/tallinn–2015/
If you have some time to spare, here is a write-up of the talk's introduction:
NoSQL databases come with promises of agility and schema-less data, and those promises hold: you can store unstructured data easily. But you will need to apply structure when you actually come to use that data. There are various ways to apply that structure, and that is what we are going to talk about today.
But first, let's get into the structured vs. unstructured topic. Structured data, as we traditionally know it in databases, refers to columns. A column has a name and a type, and columns are grouped into tables. When you insert a new row into a table, every element of that row has to correspond to the name and type of a column. You can't add a new column on the fly or change a type dynamically.
Well, you could probably store everything as strings, but that rather defeats the purpose of using database columns (it is also worth mentioning that most column-based stores are adding a JSON type). So it is like having a schema: if what you are trying to store does not fit that schema, you get an error.
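To make the "you get an error" part concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (users, name, age, nickname) are made up for illustration. SQLite is used only because it ships with Python; it is lenient about column types, so the sketch only shows the column-name side of the schema.

```python
import sqlite3

# In-memory database with a fixed set of named, typed columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT NOT NULL, age INTEGER NOT NULL)")

# A row that matches the declared columns is accepted.
conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("Ada", 36))

# A row carrying a column the table does not declare is rejected.
try:
    conn.execute(
        "INSERT INTO users (name, age, nickname) VALUES (?, ?, ?)",
        ("Bob", 41, "bobby"),
    )
    rejected = False
except sqlite3.OperationalError as exc:
    rejected = True
    error_message = str(exc)

# Only the first, schema-conforming row was stored.
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

The only way to store the extra field here is to change the schema first, which is exactly the rigidity described above.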
What is Schemaless Anyway?
Being schemaless, on the other hand, lets you store anything of any type. If you think about programming and data structures like maps or dictionaries, they are schemaless structures. If you think about a key/value store, the value can be anything: a binary-encoded image, a serialized object, JSON, XML, a number, etc.
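As a rough illustration, a plain Python dictionary can stand in for a key/value store. The keys and values below are invented for the example, and a real store would add persistence, distribution and so on; the point is only that nothing constrains what a value looks like.

```python
import json
import pickle

# Hypothetical in-memory stand-in for a key/value store.
store = {}

store["user::1"] = json.dumps({"name": "Ada"})          # JSON text
store["counter"] = 42                                    # a number
store["avatar::1"] = bytes([0x89, 0x50, 0x4E, 0x47])     # raw binary data
store["session::1"] = pickle.dumps({"cart": ["book"]})   # a serialized object

# The store accepted all of them; it knows nothing about their structure.
value_types = {key: type(value).__name__ for key, value in store.items()}
```

Note that the store cannot tell the image bytes from the serialized object: any structure has to be applied by whoever reads the value back.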
While there might not be any structure embedded in a binary-encoded file, there is one in JSON and XML. We usually refer to these as semi-structured data. They both give you named, typed fields and the flexibility to add any field at any given time. It is like an implicit schema. Both also have actual schema specifications, such as XSD for XML and JSON Schema for JSON.
So most of the time, when you hear about schemaless databases, what is meant is actually closer to semi-structured data.
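The idea of an implicit schema can be sketched in a few lines of Python. The documents and the implicit_schema helper are hypothetical; the helper simply lists each top-level field with the type its value was parsed into.

```python
import json

# Two JSON documents of the same kind; the second one adds a field.
doc_a = json.loads('{"name": "Ada", "age": 36}')
doc_b = json.loads('{"name": "Bob", "age": 41, "tags": ["admin"]}')


def implicit_schema(document):
    """Return a mapping of field name to the Python type name of its value."""
    return {field: type(value).__name__ for field, value in document.items()}


schema_a = implicit_schema(doc_a)
schema_b = implicit_schema(doc_b)

# Fields present in the second document but not in the first.
new_fields = sorted(set(schema_b) - set(schema_a))
```

Both documents are valid and carry their own field names and types, yet nothing had to be declared before the tags field appeared.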
Structured, Unstructured or Semi-Structured, Got It, What Now?
Of course, since you are storing data, you will want to use it at some point. That means roughly three things:
- You need to write that data
- You need to read that data
- You need to map the data store's response to a structure or an object usable by your application
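The three points above can be sketched as follows, again with a dictionary standing in for the data store. The User class, the key format and the write_user/read_user helpers are assumptions made for the example, not part of any particular client library.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class User:
    name: str
    age: int


# Hypothetical in-memory stand-in for the data store.
store = {}


def write_user(key, user):
    """Write: serialize the application object to JSON and store it."""
    store[key] = json.dumps(asdict(user))


def read_user(key):
    """Read the raw value, then map it back to an application object."""
    raw = store[key]
    fields = json.loads(raw)
    # The structure is applied here, on the application side.
    return User(name=fields["name"], age=int(fields["age"]))


write_user("user::ada", User("Ada", 36))
loaded = read_user("user::ada")
```

The store itself only ever sees a string; the mapping step in read_user is where the application imposes its structure on the data.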
These three points, coupled with the architecture of the store you are using, will raise more questions:
- What can you use to retrieve data (a simple key get? materialized views? a query language?)
- Is your data distributed, replicated, consistent?
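To give a feel for how the retrieval method shapes the model, here is a small sketch contrasting a simple key get with a precomputed view. It is plain Python over a dictionary, not the API of any real database, and the documents, keys and build_view helper are invented for illustration.

```python
from collections import defaultdict

# Hypothetical documents, addressed by key.
documents = {
    "user::ada": {"type": "user", "name": "Ada", "city": "London"},
    "user::bob": {"type": "user", "name": "Bob", "city": "Paris"},
    "user::eve": {"type": "user", "name": "Eve", "city": "London"},
}

# 1. Simple key get: only works if you already know the key.
ada = documents["user::ada"]


# 2. A precomputed view: index the document keys by one of their fields.
def build_view(docs, field):
    """Map each value of `field` to the keys of the documents holding it."""
    view = defaultdict(list)
    for key, doc in docs.items():
        if field in doc:
            view[doc[field]].append(key)
    return dict(view)


users_by_city = build_view(documents, "city")
londoners = sorted(users_by_city["London"])
```

With key access only, the key itself has to encode whatever you will look up by; with a view, the lookup relies on a field being present in the documents, which is again a form of structure.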
These points will all have an impact on how you structure your data. On top of that, you can add all the specifics of your domain design.
I will try to answer these questions as fully as possible, taking Couchbase as an example. Couchbase is a distributed key/value store and a document database, which makes it a good candidate for that.
So join me to learn more about Data Modeling!
Published at DZone with permission of Laurent Doguin, DZone MVB.