Debunking Four Metadata Myths
Here are four misconceptions about metadata that can cause a data engine nightmare if overlooked. To better understand the challenges, it's time to debunk these myths!
Data is accumulating at an alarming rate, driven largely by the growth of unstructured data. Within that data is metadata, which allows us to quickly find data files by identifying certain properties. Until recently, developers did not have to pay close attention to it; that changed once its volume began to overwhelm our data engines. Metadata used to be something that could be stored in memory behind the scenes without any worry, but its expansion raises the growing concern of where to store it and how to manage it.
Every database system, whether SQL or NoSQL, uses a data engine (aka storage engine), embedded or not, to manage how data is stored. Our ever-growing volume of data has outgrown our data engines’ infrastructure. As a result, they are stretched to their limits and forced to make painful trade-offs to support the scale and performance requirements of modern datasets.
For our data engines to run smoothly, we must understand how metadata affects them so that we can implement proper practices in the future. This article addresses some of the most common myths around metadata.
Myth 1: Metadata Doesn’t Take Up Much Storage
Believe it or not, metadata takes up more storage space than one might think. It has grown significantly in the past ten years due to the volume of unstructured data, such as documents, multimedia files, and IoT and sensor data. The fastest-growing data segment in recent years is unstructured data in the form of objects, and it is estimated that by 2021, 80% of all global data will be unstructured.
The ratio between an object’s data and its metadata used to be about 1000:1. An object that's 20K in size, for example, would have metadata of around 20 bytes. However, along with the drastic increase in data over the past ten years, this ratio has shifted significantly towards metadata. The amount of storage that metadata eats up is no longer insignificant and should be at the forefront of developers’ discussions when working with the data engine.
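To make the arithmetic concrete, here is a minimal sketch that computes the data-to-metadata ratio and the total metadata footprint for a collection of objects. The function names are hypothetical, "20K" is assumed to mean 20,000 bytes, and the one-billion object count is an illustrative assumption rather than a figure from this article.

```python
def data_to_metadata_ratio(object_bytes: int, metadata_bytes: int) -> float:
    """Return how many bytes of object data exist per byte of metadata."""
    if metadata_bytes <= 0:
        raise ValueError("metadata_bytes must be positive")
    return object_bytes / metadata_bytes


def total_metadata_bytes(object_count: int, metadata_bytes_per_object: int) -> int:
    """Return the total metadata footprint across all objects."""
    return object_count * metadata_bytes_per_object


if __name__ == "__main__":
    # The example from the text: a 20K object with about 20 bytes of metadata.
    # Assumption: "20K" is read as 20,000 bytes.
    print(data_to_metadata_ratio(20_000, 20))

    # Assumption: one billion objects, chosen only for illustration.
    print(total_metadata_bytes(1_000_000_000, 20))
```

Even at 20 bytes per object, a billion objects carry about 20 GB of metadata, and that total grows in proportion to any increase in per-object metadata size.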
Myth 2: Metadata Growth Won’t Affect My Application Performance
One of the most common ways to handle the increased amount of metadata is to implement a data engine as a software layer within the application to execute different on-the-fly activities on live metadata. Deployed as an embedded key-value store (KVS), the storage engine typically uses a log-structured merge-tree (or LSM tree) structure. This offers more flexibility and speed compared with traditional relational databases. However, LSM-based KVS have limited capacity, and their high write amplification leads to high CPU utilization and memory consumption. This causes a data bottleneck, which in turn impacts the application’s performance.
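Write amplification is the ratio of data physically written to storage to data the application asked to write. The sketch below is a toy model, not the behavior of any specific engine: it assumes a size-tiered scheme in which a full memtable is flushed as a run, and a level holding `fanout` runs is merged into one run on the next level. The function name and parameters are hypothetical, and the model ignores overwrites, deletes, and write-ahead logging.

```python
def simulate_write_amplification(num_writes: int,
                                 memtable_entries: int = 4,
                                 fanout: int = 4) -> float:
    """Toy size-tiered LSM model.

    Returns entries written to disk divided by entries flushed from memory.
    Every key is treated as unique, so merges never shrink the data.
    """
    levels = []               # levels[i] holds the sizes of the runs on level i
    memtable = 0              # entries currently buffered in memory
    disk_entries_written = 0  # every entry written by a flush or a merge

    for _ in range(num_writes):
        memtable += 1
        if memtable < memtable_entries:
            continue

        # Flush the full memtable as a new run, then cascade merges downward.
        run = memtable
        memtable = 0
        level = 0
        while True:
            if level == len(levels):
                levels.append([])
            levels[level].append(run)
            disk_entries_written += run
            if len(levels[level]) < fanout:
                break
            # Level is full: merge all of its runs into one run for the next level.
            run = sum(levels[level])
            levels[level] = []
            level += 1

    flushed = num_writes - memtable
    return disk_entries_written / flushed if flushed else 0.0


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        print(n, simulate_write_amplification(n))
```

In this model, every entry is rewritten once for each level it descends through, so the amplification grows as the dataset grows. That extra I/O and merge work is the kind of overhead the paragraph above attributes to LSM-based stores.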
This is where developers come in. When existing data engine solutions hit this bottleneck, developers are forced to implement workarounds like sharding. Sharding partially loosens the bottleneck and improves performance, but at the expense of creating other limitations, such as limits on scalability.
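One way to see how sharding can limit scalability is a naive hash-modulo scheme, sketched below. This is only an illustration under stated assumptions: the function names and key format are hypothetical, and production systems often use other strategies (such as consistent hashing) to soften this effect.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash and a modulo."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Return the fraction of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    if not keys:
        return 0.0
    moved = sum(
        1 for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


if __name__ == "__main__":
    # Hypothetical metadata keys, used only for illustration.
    keys = [f"object-{i}/metadata" for i in range(10_000)]
    print(moved_fraction(keys, 4, 5))
```

With modulo placement, growing from four shards to five relocates most keys (roughly four in five, in expectation), so each scale-out step means a large data migration that developers have to plan and run.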
Myth 3: Metadata Is Easy to Manage
Unfortunately, when metadata increases, existing data engines are unable to manage it on their own. To keep things running, application developers find themselves spending more and more time on sharding, database tuning, and other time-consuming operational tasks rather than delivering real business value. These are not one-off tasks; they require continuous monitoring and become part of the developer’s routine just to keep the system performing to a high standard. Oftentimes, when an organization doesn’t have enough skilled developers to manage the workload, it falls back on the default settings, which won’t hold off the problems for long.
Myth 4: All Data Engines Can Store Unlimited Amounts of Metadata Without Making Sacrifices
Whatever database system you choose to use, a data engine is necessary to manage how the data is stored. With the ever-growing size of metadata, existing data engines can store the data in different ways, each forcing a trade-off in capacity, scalability, cost, or performance. If sacrificing performance isn’t an option for your business, existing solutions will have you paying significantly more money, especially for a hyper-scale data operation. Alternatively, if a cost-effective solution like RocksDB is selected, then you’ll have to learn to manage I/O hangs and performance hits.
Metadata is growing at a significant rate and there is no slowing it down. It is highlighting the limitations within our current technology, especially our data engines. Don’t let these myths cloud your judgment as it is essential to recognize the problems you may be faced with and tackle the issues head-on, starting with selecting the right data engine. Thinking about this now will ensure your operations aren’t overburdened later down the line.
Opinions expressed by DZone contributors are their own.