Eliminating Fragmentation With Big Data Queries
Few can afford to use so many resources constantly duplicating their data. Are there any new tools available to solve some of these problems?
Some of us still remember the days when data was stored on 3 1/2 inch floppy disks. You could only store 1.44 MB of data, but at least you could easily access it when you needed it. Many organizations face a very different dilemma in the age of big data.
We can store almost all the data that we could ever use. Many cloud storage systems allow brands like Facebook to store 500 terabytes of data. Unfortunately, data is often so fragmented that we can't process it.
Challenges With Big Data Fragmentation
According to a 2012 report from Cisco, most data is highly fragmented. It's nearly impossible to manage in its current state. Fragmentation has been a problem for many ecommerce applications because merchants need to pull inventory data that is so fragmented it is virtually unreadable.
Many point-of-sale applications have found ways to address it for SMEs, but data fragmentation remains a significant challenge for healthcare providers, financial services entities, and many other big data consumers.
Data fragmentation is especially problematic for hosting providers. The International Journal of Engineering, Science and Technology has written that fragmentation concerns have risen over the last few years, as companies scale their data usage. According to recent web hosting literature, a growing number of brands are utilizing Hadoop and other services to handle data effectively.
Hadoop is even playing a role in the evolution of drone technology. Experts have relied on WinPeg and other fragmentation software for UAV applications. However, Hadoop-based solutions will likely be far more effective.
The consumer review industry may be affected by data fragmentation more than any other sector. Consumer review publishers like BestAdvisor must store data on tens of thousands of brands, and that data can be compromised if it is badly fragmented. Hadoop will help address the problem as consumer information databases grow.
If you want to use big data to its fullest potential, you will need to minimize the risk of fragmentation.
A number of factors contribute to data fragmentation:
- Data is stored on a disk that is already highly fragmented
- Companies are trying to comply with new GDPR regulations
- The disk is filled to capacity
- Data is stored across incompatible platforms
The last point listed above has been one of the biggest issues over the last couple of years, since many brands assimilate data from various sources. Andrew Brust addressed it at the Hadoop Summit last year and emphasized the need for compatibility between data sources.
Data fragmentation has always been a challenge. However, it's become far more noticeable with big data applications.
Here are some of the solutions developers have used in the past.
Allocate Fixed Memory Blocks for Data Sets
Developers set aside a specific block of memory for each file they attempted to save. This solution was fairly effective when they were consistently working with similarly sized files. Unfortunately, it isn't very feasible for most real-world applications.
Sometimes files are larger than your allocated memory space. In these instances, the file would fail to save to the disk.
This problem could be addressed by allocating memory more liberally. Unfortunately, this creates a different challenge. If you routinely reserve larger chunks of memory for your data sets than necessary, you will waste a lot of storage space. Big data storage space isn't infinite.
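The trade-off described above can be illustrated with a minimal sketch. The `FixedBlockStore` class, its method names, and the file names and sizes are all hypothetical, chosen only for illustration. It is a toy in-memory model and not how any particular storage system works. It shows the two failure modes of fixed allocation: a file that is larger than its block is rejected, and files that are smaller than their block leave reserved space unused.

```python
from dataclasses import dataclass, field


@dataclass
class FixedBlockStore:
    """Toy model (hypothetical): every file gets one block of the same size."""
    block_size: int                           # bytes reserved for each file
    files: dict = field(default_factory=dict)

    def save(self, name: str, data: bytes) -> None:
        # A file larger than its reserved block cannot be saved at all.
        if len(data) > self.block_size:
            raise ValueError(
                f"{name!r} needs {len(data)} bytes, block holds {self.block_size}"
            )
        self.files[name] = data

    def reserved_bytes(self) -> int:
        # Total space set aside, whether or not it is used.
        return self.block_size * len(self.files)

    def wasted_bytes(self) -> int:
        # Reserved space that no file is actually using.
        return self.reserved_bytes() - sum(len(d) for d in self.files.values())


# Illustrative sizes only: two files in 1024-byte blocks.
store = FixedBlockStore(block_size=1024)
store.save("small.csv", b"x" * 100)
store.save("medium.csv", b"x" * 900)

print(store.reserved_bytes())  # 2048
print(store.wasted_bytes())    # 1048

try:
    store.save("big.csv", b"x" * 2000)
except ValueError as err:
    print("save failed:", err)
```

Raising `block_size` would let the larger file through, but every small file would then waste proportionally more of its block, which is the storage cost the section describes.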
Big data can also be continuously written to a disk. Since new copies of the data are always being replicated, there is little risk of fragmentation. Of course, this introduces other issues, such as increased bandwidth requirements.
More Elegant Solutions to Big Data Fragmentation
These solutions are too primitive to work in a big data environment. Few organizations can afford to use so many resources constantly duplicating their data.
Fortunately, new tools are available to solve some of these problems. RTD Stream is one of them.
RTD Stream is used specifically for ERP landscapes, which have become too fragmented in recent years.
Mike Ormerod, Vice President of Product Management at Magnitude Software, told Datalytics Technologies that the new tool will eliminate the data fragmentation problems for brands relying on ERP.
“The ERP landscape is more fragmented than ever, and that trend will only continue,” Ormerod states. “With the release of RTD Stream we are giving real-time controls and insights to CFOs and finance departments so they can better optimize their businesses with a connectivity solution that leverages our deep understanding of the internal Oracle ERP and SAP HANA data structures.”
The Cisco® Internet Business Solutions Group says that data infomediaries will also play an important role. They will seamlessly connect the sources of data with the end users, which will eliminate many of the hangups big data users currently face.
However, a number of hurdles will need to be addressed. Organizations will need to find an effective way to transport their data infrastructure.