When Small Files Crush Big Data — How to Manage Small Files in Your Data Lake
In this article, see how to manage small files in your data lake.
Big Data faces an ironic small file problem that hampers productivity and wastes valuable resources.
If not managed well, small files slow down your data systems and leave you with stale analytics, which defeats the purpose of collecting the data in the first place. HDFS handles small files poorly: they waste NameNode memory and RPC calls, degrade block-scanning throughput, and reduce application-layer performance. If you are a big data administrator on any modern data lake, you will invariably come face to face with the problem of small files. Distributed file systems are great, but let's face it: the more pieces you split your storage into, the greater the overhead of reading those files back. So the idea is to tune the file size to best serve your use case, while also actively maintaining your data lake.
Small Files and the Business Impact
- Slowing down reads: Reading many small files requires a separate seek for each file, which is an inefficient way of accessing data.
- Slowing down processing: Small files can slow down Spark, MapReduce, and Hive jobs. For example, MapReduce map tasks process one block at a time, and each file needs at least one map task. With a large number of small files, each map task processes very little input, and the more files there are, the more tasks are needed.
- Wasted storage: Hundreds of thousands of files of 5 KB or even 1 KB each may be created daily by running jobs, which adds up quickly. Not knowing where they are located adds complexity.
- Stale data: All of this results in stale data, which weighs down the entire reporting and analytics process of extracting value. If jobs don’t run fast or if responses are slow, decision making becomes slower and the data stops being as valuable. You lose the edge that the data is meant to bring in the first place.
- Spending more time on operational issues than on strategic improvements: Resources end up being used to actively monitor jobs. If that dependency could be removed, those resources could be used to optimize the job itself, so that a job that earlier took 4 hours takes only 1 hour. This has a cascading effect.
- Impacting the ability to scale: Operational costs do not grow linearly. If you grow 10x, operating costs rise by more than 10x, which increases your cost to scale.
While small files are a massive problem, they aren’t completely avoidable either. Following best practices and applying them effectively to your organization gives you control instead of leaving you firefighting. In any production system the focus is on keeping it up and running, and as issues crop up resources are deployed to tackle them.
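To put rough numbers on the processing point in the list above, here is a minimal Python sketch. It assumes the simplest case, where every file gets at least one map task and larger files get one task per block. The function name `estimate_map_tasks` and the 100,000-file workload are illustrative only, and input formats that combine several files into one split would give different results.

```python
KB = 1024
MB = 1024 * KB


def estimate_map_tasks(file_sizes_bytes, block_size_bytes=128 * MB):
    """Rough map task count: one task per block, at least one per file.

    Simplifying assumption: no input format that packs several
    files into a single split is in use.
    """
    tasks = 0
    for size in file_sizes_bytes:
        blocks = (size + block_size_bytes - 1) // block_size_bytes  # integer ceiling
        tasks += max(1, blocks)
    return tasks


# Hypothetical workload: 100,000 files of 5 KB each.
small_files = [5 * KB] * 100_000
# The same bytes written as one consolidated file.
merged_file = [sum(small_files)]

print(estimate_map_tasks(small_files))  # 100000
print(estimate_map_tasks(merged_file))  # 4
```

Under these assumptions the same data needs 100,000 tasks as small files and 4 tasks as one file, so most of the job's time goes to task startup rather than to reading data.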
The Small File Problem
Let's take the case of HDFS, a distributed file system that is part of the Hadoop infrastructure and is designed to handle large data sets. In HDFS, data is distributed over several machines and replicated to optimize parallel processing. Data and metadata are stored separately: the data lives in blocks on the data nodes, while the name node keeps the metadata for every file and block in memory. Small files are files smaller than one HDFS block, typically 128 MB. A small file does not fill a whole block on disk, but its metadata costs the same whatever its size. So even a 1 KB file puts load on the name node (which is involved in translating file system operations into block operations on the data nodes) and consumes as much metadata storage space as a file of 128 MB. Smaller file sizes also mean smaller clusters in terms of data, as there are practical limits on the number of files (irrespective of size) that can be managed by a name node.
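The metadata cost described above can be estimated with a few lines of Python. This is only a back-of-the-envelope sketch: the figure of about 150 bytes of name node memory per file or block object is a commonly quoted rule of thumb, assumed here rather than measured, and the function names and workload are hypothetical.

```python
MB = 1024 * 1024
BYTES_PER_OBJECT = 150  # assumption: rule-of-thumb cost per file or block object


def namenode_objects(file_sizes_bytes, block_size_bytes=128 * MB):
    """Count the objects the name node tracks: one per file plus one per block."""
    blocks = sum(
        (size + block_size_bytes - 1) // block_size_bytes  # integer ceiling; 0 for empty files
        for size in file_sizes_bytes
    )
    return len(file_sizes_bytes) + blocks


def estimate_metadata_bytes(file_sizes_bytes, block_size_bytes=128 * MB):
    """Approximate name node memory used by the given files."""
    return namenode_objects(file_sizes_bytes, block_size_bytes) * BYTES_PER_OBJECT


# Hypothetical workload: ten million 1 KB files versus the same bytes in one file.
small_files = [1024] * 10_000_000
merged_file = [sum(small_files)]

print(estimate_metadata_bytes(small_files))  # 3000000000 (about 3 GB)
print(estimate_metadata_bytes(merged_file))  # 11700
```

The absolute numbers depend entirely on the assumed per-object cost, but the ratio shows why the file count, not the data volume, is what limits the name node.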
What Can You Do to Identify and Eliminate Small Files?
As the admin of an HDFS system, you may already understand why and how small files are created. It could be every clickstream event being dumped into the data lake, a large volume of individual images (which can’t be logically merged into a larger file), a practice of storing every configuration file, or simply placeholder files (basically empty files) that are used to track the output status of every job and are never deleted. Hadoop admins perform a number of tasks to (1) identify the number of small files, (2) identify who is creating the small files, and (3) perform general cleanup of the small files, including compaction and deletion.
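Tasks (1) and (2) can be approximated with a short script over a recursive directory listing. The Python sketch below assumes the listing has the usual eight-column layout printed by `hdfs dfs -ls -R` (permissions, replication, owner, group, size in bytes, date, time, path) and that sizes are plain byte counts. The sample listing, the paths, and the function name `summarize_small_files` are made up for illustration.

```python
import posixpath
from collections import Counter

SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # one default HDFS block


def summarize_small_files(listing, threshold=SMALL_FILE_THRESHOLD):
    """Count small files per owner and per parent directory."""
    by_owner, by_dir = Counter(), Counter()
    for line in listing.splitlines():
        parts = line.strip().split(None, 7)  # the path is the 8th field and may contain spaces
        if len(parts) < 8 or parts[0].startswith("d"):
            continue  # skip blank lines, summary lines and directories
        owner, size, path = parts[2], int(parts[4]), parts[7]
        if size < threshold:
            by_owner[owner] += 1
            by_dir[posixpath.dirname(path)] += 1
    return by_owner, by_dir


# Hypothetical listing, in the column layout assumed above.
SAMPLE_LISTING = """\
drwxr-xr-x   - etl   hadoop          0 2021-03-01 10:00 /lake/clicks
-rw-r--r--   3 etl   hadoop       4096 2021-03-01 10:01 /lake/clicks/part-00000
-rw-r--r--   3 etl   hadoop       1024 2021-03-01 10:02 /lake/clicks/part-00001
-rw-r--r--   3 alice hadoop          0 2021-03-01 10:03 /lake/clicks/_SUCCESS
-rw-r--r--   3 alice hadoop  268435456 2021-03-01 10:04 /lake/warehouse/big.parquet
"""

owners, directories = summarize_small_files(SAMPLE_LISTING)
print(owners.most_common())       # [('etl', 2), ('alice', 1)]
print(directories.most_common())  # [('/lake/clicks', 3)]
```

Sorting the two counters points at the users and folders to look at first. On a real cluster the listing itself can be very large, so a script like this would usually be run per directory rather than on the whole namespace.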
Why Is It So Difficult to Identify and Eliminate Small Files?
There is no easy tooling on top of HDFS to see how many files are present, what the size of each file or directory is, or how and where users are creating files. Now think of this in terms of a system that spans multiple clusters and regions, with petabytes of data taking up storage and slowing down performance. Not only are you left with wasted storage, but any jobs that you run, such as Hive and MapReduce, are slowed down as well.
Requirements have also changed. With the data lake concept, users want response times low enough to serve even a mobile application from their data lake platform (i.e., sub-second response times). The response times that people expect from this infrastructure keep shrinking, so your optimal block size may even be 64 MB, all depending on the use case. An extreme but common case is IoT or sensor data with file sizes of around 1 KB, where you might be receiving data every 200 milliseconds and want to create a file for every minute. These will still be very small files and won’t cross 10 KB. So, while small files can’t be avoided, they can be managed so that they remain a smaller percentage of your files than the bigger ones. It’s a continuous process and requires a maintenance cycle.
Existing Tools for Managing Small Files
An ideal solution would be a system that is software agnostic and gives a cross-sectional view across a firm’s entire Big Data infrastructure. The problem companies currently face with existing tools is three-fold:
- Single-threaded APM tools: Traditional APM tools are single-threaded and focus on web applications; none of them provides deep file system analytics.
- Engine-specific optimization tools: Query engines like Spark and Hive have their own monitoring and optimization products, which focus only on their own platform and optimize only data that is ingested through their software.
- In-house scripts and manual intervention: Identifying small files across a company’s Big Data ecosystem still remains largely manual, relying on support staff to go through individual folders. As companies grow fast and systems get more complex, the existing monitoring tools just aren't enough.
The Small File Solution
A company will typically have multiple single-threaded application management tools that, even taken together, do not address the problems that scaling a Big Data system can bring. What would be ideal is a software-agnostic tool that monitors across Spark, Kafka, Hive, Hadoop, Presto, etc., and offers a cross-sectional, multi-dimensional, real-time view with the following characteristics:
- This application should give you an online view of the small files in your data lake in near-real-time.
- This should be presented as actionable charts that give you a mile-high view of your data ecosystem so that you can quickly identify where the issue originates.
- Once the problem is identified, it should offer the ability to troubleshoot by drilling down to where the problem originated.
- Once you arrive at the root of the issue, it should then offer the ability to take actions to rectify it.
To answer the question we started with, yes, small files can crush big data but there are steps that you can take to manage them, including timely identification, compaction, compression and deletion. Some are manual, time-consuming and use makeshift software, while others require you to make an investment in data observability and your future.
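As an illustration of the compaction step, the Python sketch below groups small files into batches of roughly one block each, so that every batch could then be rewritten as a single larger file. It only plans the batches and does not touch any file system. The greedy largest-first grouping, the 128 MB target, the function name `plan_compaction`, and the sample paths are all assumptions made for this example, not a prescribed method.

```python
MB = 1024 * 1024


def plan_compaction(files, target_bytes=128 * MB):
    """Group (path, size) pairs into batches of at most target_bytes each.

    Files that already reach the target size are left alone.
    """
    small = sorted(
        (f for f in files if f[1] < target_bytes),
        key=lambda f: f[1],
        reverse=True,  # place the largest small files first
    )
    batches, current, current_size = [], [], 0
    for path, size in small:
        if current and current_size + size > target_bytes:
            batches.append(current)  # current batch is full, start a new one
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    # A batch holding a single file gains nothing from being rewritten.
    return [batch for batch in batches if len(batch) > 1]


# Hypothetical files: four small ones and one that is already large.
files = [
    ("/lake/a", 60 * MB),
    ("/lake/b", 50 * MB),
    ("/lake/c", 40 * MB),
    ("/lake/d", 30 * MB),
    ("/lake/big", 200 * MB),
]
print(plan_compaction(files))
# [['/lake/a', '/lake/b'], ['/lake/c', '/lake/d']]
```

In practice, files would normally be grouped only within the same directory or partition and the same format, which this sketch leaves out for brevity.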
Much shorter maintenance cycles: from 12 hours to 15 minutes
We recently worked with a customer that had over 40 PB of data with Spark and MapReduce workloads. At this volume, reading through the catalog took many seconds, which is 100 times longer than it should ideally take. They also had a resource-intensive maintenance cycle that involved going through each folder, figuring out which files were there and where files might need to be compressed. Just figuring out which files to compress took a lot of time. Data observability made identifying small files dead simple, cutting the maintenance cycle from 12 hours to under 15 minutes.
Reduced maintenance cost: by at least 50%
Maintenance comes with a cost, which is very high when it is done every seven days, and in some cases it is carried out every day. With Pulse, even once you factor in licensing and compute, the cost for this particular feature comes down to a very small fraction of that.
Easier and faster RCA: fewer tickets, quicker resolution
Another client wanted to send regular reports about their maintenance cycles. Previously, log files were collected, collated and sent back to the modeling team for analysis. With a data observability application, you get all of this context in a single place, with multi-dimensional visibility. An Ops resource can see a problem as it comes in, not hours or even months later.
Managing small files in your data lake offers significant benefits, from reduced cost to faster problem resolution, and these cascade down to the rest of your business. To realize these benefits, explore how data observability can help you optimize your data lake.
Published at DZone with permission of Rohit Chaudhary. See the original article here.