Treasure Data's Plazma: Columnar Cloud Storage

Treasure Data was built by Hadoop experts. We get Hadoop, and in many ways it is part of our core. As we built out the platform, we found that the storage layer needed to be multi-tenant, elastic, and easy to manage without giving up scalability or efficiency. This led us to create Plazma, our own distributed columnar storage system, which we use in place of HDFS. We wanted to leverage the “store everything now, analyze later” model of our schema-less architecture while improving both storage efficiency and query performance.

Separating Hadoop's MapReduce processing engine from the storage layer let us optimize the elasticity, efficiency, and reliability of the system. Making the system more modular also allowed us to store data in a columnar format, so queries read only the columns they need instead of scanning the whole dataset. With Plazma, we process queries faster, manage databases more simply, and make better use of our schema-less database architecture.
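To illustrate why a columnar layout lets a query skip irrelevant data, here is a minimal Python sketch. It is not Plazma's implementation or file format. The function names (`rows_to_columns`, `sum_column`) and the sample records are hypothetical, and everything is held in memory for simplicity.

```python
import json
from collections import defaultdict


def rows_to_columns(json_lines):
    """Pivot schema-less JSON records into one list per field.

    Records that lack a field get None in that position, so every
    column has the same length as the number of rows.
    """
    rows = [json.loads(line) for line in json_lines]
    columns = defaultdict(lambda: [None] * len(rows))
    for i, row in enumerate(rows):
        for key, value in row.items():
            columns[key][i] = value
    return dict(columns)


def sum_column(columns, name):
    """Aggregate one column; no other column is touched."""
    return sum(v for v in columns.get(name, []) if v is not None)


# Hypothetical sample data: fields vary from record to record.
records = [
    '{"user": "a", "bytes": 120}',
    '{"user": "b", "bytes": 80, "path": "/home"}',
    '{"user": "c"}',
]

columns = rows_to_columns(records)
total_bytes = sum_column(columns, "bytes")
print(total_bytes)
```

In a real columnar store each column would live in its own region of a file or object, so an aggregate over `bytes` would never read the `user` or `path` data from disk.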

We achieved our technical goals by architecting Plazma in the following ways:

  • JSON processing: automatically converts row-based JSON objects into a columnar format
  • Columnar storage: uses a columnar file storage format which significantly reduces disk IO for analytical queries
  • IO optimizations: implements various IO optimizations such as parallel pre-fetch and background decompression
  • Scalability and ease of management: Plazma is built on top of object-based storage, which is easier to scale and maintain
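The IO optimizations in the list above (parallel pre-fetch and background decompression) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Plazma does it: `OBJECT_STORE` is a hypothetical in-memory stand-in for an object store, zlib stands in for whatever compression Plazma uses, and the function names are made up for this example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an object store: chunk key -> compressed bytes.
OBJECT_STORE = {
    f"chunk-{i}": zlib.compress(f"payload {i}".encode()) for i in range(4)
}


def fetch_and_decompress(key):
    """Fetch one chunk and decompress it on a worker thread."""
    compressed = OBJECT_STORE[key]  # in a real system, a network read
    return zlib.decompress(compressed)


def read_chunks(keys, workers=4):
    """Yield decompressed chunks in order.

    All fetches are submitted up front, so later chunks are fetched
    and decompressed in the background while the caller is still
    processing earlier ones.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_and_decompress, key) for key in keys]
        for future in futures:
            yield future.result()


chunks = list(read_chunks(sorted(OBJECT_STORE)))
print(chunks)
```

The point of the design is that the query engine consumes chunk N while chunks N+1 onward are already in flight, which hides storage latency and decompression cost behind useful work.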

These are some of the key innovations in Plazma that optimize query processing and storage and give us a more efficient distributed storage system. Some companies argue that building on HDFS lets their business take advantage of open source innovation, which they consider preferable to on-premise solutions. For our purposes, however, Plazma is much more efficient at query processing, and it lets us separate the processing and storage layers to improve both query performance and manageability.

While this technology is currently proprietary to Treasure Data, we have discussed open sourcing it to provide developers with the tools they need for efficient distributed storage systems meant for big data analytics processing.

What do you think? Would you find this kind of technology useful and would you be interested in using it? Leave your thoughts in the comments.

Published at DZone with permission of Sadayuki Furuhashi, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
