Over a million developers have joined DZone.

Data Aggregation or Analytics? Think Columns

DZone's Guide to

Data Aggregation or Analytics? Think Columns

Column based data store is more suited for data analytics and offers highest performance as there are no strict constraints of data normalization rules.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

In RDBMS, data is represented in the form of tuples or rows, while a NoSQL form of database often stores data in the form of columns. Column-centric databases like Cassandra or HBase is the talk of the town today.
With the traditional relational model, the data records or entries are represented as rows. It is best suited for typical online transactional systems or application that needs rich transaction ACID properties for its data. In column form of database, the unstructured or semi-structured data or information is aggregated into number of columns or family of columns for further analysis. Column based data store is more suited for data analytics and offers highest performance as there are no strict constraints of data normalization rules.

The concept of analytics is about applying intelligence to the dormant read-only data and therefore collecting such legacy data and persisting it into organized set of fewer columns makes it easier to query and process. These columns are stored as rows with each column having a certain key-value pair of attributes often nested in nature. This type of data organization helps in processing analytical queries faster. A good small example can be of e-commerce domain when dealing with certain category of merchandise, say footwear. Footwear category data can be aggregated and columnized with attributes like brand, type, size, color, style, weather-prone,etc. It becomes simple and fast to analyse category of information in a columnar way. The volume of data can itself reach in thousands or even millions.

Column based database also provides the concept of TTL or time to live for its column data. Products in e-commerce tend to get obsolete or outdated very fast. One can set the TTL on such columns and it will be automatically discarded after the specified time has elapsed. Another feature column based database provides is its ability to compress data that can increase the capacity of its storage. Compression can be configured to apply on closely related segments of data so that the reads and writes are fast. Modern NoSQL databases uses advanced compression techniques and algorithms that can greatly reduce disk IO.

One more important aspect of columns based database is that certain attributes of single family of columns (represented as rows) may vary or be omitted. For example, a footwear column type ‘formals’ may have more attributes than say type ‘flip-flops’. In short the schema of single family of columns may not be consistent. This kind of flexible schema or data model allows for complex data structures to be designed with ease. The design takes an application perspective than a schema perspective. And therefore, you do not have to break your head on applying different normalization rules.

A typical column based DB has the following characteristics:

1. It works on large volume of unstructured or semi-structured data.
2. The design model has a more flexible and design is application driven than schema driven.
3. Column values are typically stored as key-value attributes that can be nested to create a hierarchy.
4. Data model can be or is usually multi-dimensional
5. Column is a unit of storage
6. It is best suited for aggregation and analytics use cases.
7. Columns may have a time to live constraint
8. Columns relationships are represented using aggregation

So when you want to perform data aggregation or analytics, think a column based NoSQL database approach!

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

nosql ,data analytics ,database ,big data ,data storage

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}