The No Fluff Introduction to Big Data
Join the DZone community and get the full member experience.Join For Free
big data traditionally has referred to a collection of data too massive to be handled efficiently by traditional database tools and methods. this original definition has expanded over the years to identify tools (big data tools) that tackle extremely large datasets (nosql databases, mapreduce, hadoop, newsql, etc.), and to describe the industry challenge posed by having data harvesting abilities that far outstrip the ability to process, interpret, and act on that data. technologists knew that those huge batches of user data and other data types were full of insights that could be extracted by analyzing the data in large aggregates. they just didn’t have any cheap, simple technology for organizing and querying these large batches of raw, unstructured data.
the term quickly became a buzzword for every sort of data processing product’s marketing team. big data became a catchall term for anything that handled non-trivial sizes of data. sean owen, a data scientist at cloudera, has suggested that big data is a stage where individual data points are irrelevant and only aggregate analysis matters . but this is true for a 400 person survey as well, and most people wouldn’t consider that very big. the key part missing from that definition is the transformation of unstructured data batches into structured datasets. it doesn’t matter if the database is relational or non-relational. big data is not defined by a number of terabytes, it’s rooted in the push to discoverhidden insights in data that companies used to disregard or throw away.
due to the obstacles presented by large scale data management, the goal for developers and data scientists is two-fold: first, systems must be created to handle large scale data, and two, business intelligence and insights should be acquired from analysis of the data. acquiring the tools and methods to meet these goals is a major focus in the data science industry, but it’s a landscape where needs and goals are still shifting.
what are the characteristics of big data?
tech companies are constantly amassing data from a variety of digital sources that is almost without end—everything from email addresses to digital images, mp3s, social media communication, server traffic logs, purchase history, and demographics. and it’s not just the data itself, but data about the data (metadata). it is a barrage of information on every level. what is it that makes this mountain of data big data?
one of the most helpful models for understanding the nature of big data is “the three vs:” volume, velocity, and variety.
volumeis the sheer size of the data being collected. there was a point in not-so-distant history where managing gigabytes of data was considered a serious task—now we have web giants like google and facebook handling petabytes of information about users’ digital activities. the size of the data is often seen as the first challenge of characterizing big data storage, but even beyond that is the capability of programs to provide architecture that can not only store but query these massive datasets. one of the most popular models for big data architecture comes from google’s mapreduce concept, which was the basis for apache hadoop, a popular data management solution.
velocityis a problem that flows naturally from the volume characteristics of big data. data velocity is the speed at which data is flowing into a business’ infrastructure and the ability of software solutions to receive and ingest that data quickly. certain types of high-velocity data, such as streaming data, needs to be moved into storage and processed on the fly. this is often referred to as complex event processing (cep). the ability to intercept and analyze data that has a lifespan of milliseconds is a widely sought after. this kind of quick-fire data processing has long been the cornerstone of digital financial transactions, but it is also being used to track live consumer behavior or to bring instant updates to social media feeds.
variety refers to the source and type of data that is being collected. this data could be anything from raw image data to sensor readings, audio recordings, social media communication, and metadata. the challenge of data variety is being able to take raw, unstructured data and organize it so that an application can use it. this kind of structure can be achieved through architectural models that traditionally favor relational databases—but there is often a need to tidy up this data before it will even be useful to store in a raw form. sometimes a better option is to use a schema-less, non-relational database.
how do you manage big data?
the three vs is a great model for getting an initial understanding of what makes big data a challenge for businesses. however, big data is not just about the data itself, but the way that it is handled. a popular way of thinking about these challenges is to look at how a business stores, processes, and accesses their data.
· store: can you store the vast amounts of data being collected?
· process: can you organize, clean, and analyze the data collected?
· access: can you search and query this data in an organized manner?
the store, process, and access model is useful for two reasons: it reminds businesses that big data is largely about managing data, and it demonstrates the problem of scale within big data management. “big” is relative. the data batches that challenge some companies could be moved through a single google datacenter in under a minute. the only question a company needs to ask itself is how it will store and access increasingly massive amounts of data for its particular use case. there are several high level approaches that companies have turned to in the last few years.
the traditional approach
the traditional method for handling most data is to use relational databases. data warehouses are then used to integrate and analyze data from many sources. these databases are structured according to the concept of “early structure binding”—essentially, the database has predetermined “questions” that can be asked based on a schema. relational databases are highly functional, and the goal with this type of data processing is for the database to be fully transactional. although relational databases are the most common persistence type by a large margin (see key findings pg. 4-5), a growing number of use cases are not well-suited for relational schema. relational architectures tend to have difficulty when dealing with the velocity and variety of big data, since their structure is very rigid. when you perform functions such as join on many large data sets, the volume can be a problem as well. instead, businesses are looking to non-relational databases, or a mixture of both types, to meet data demand.
the newer approach - mapreduce, hadoop, and nosql databases
in the early 2000s, web giant google released two helpful web technologies: google file system (gfs) and mapreduce. both were new and unique approaches to the growing problem of big data, but mapreduce was chief among them, especially when it comes to its role as a major influencer of later solution models. mapreduce is a programming paradigm that allows for low cost data analysis and clustered scale-out processing.
mapreduce became the primary architectural influence for the next big thing in big data: the creation of the big data management infrastructure known as hadoop. hadoop’s open source ecosystem and ease of use for handling large-scale data processing operations have secured a large part of the big data marketplace.
besides hadoop, there was a host of non-relational (nosql) databases that emerged around 2009 to meet a different set of demands for processing big data. whereas hadoop is used for its massive scalability and parallel processing, nosql databases are especially useful for handling data stored within large multi-structured datasets. this kind of discrete data handling is not traditionally seen as a strong point of relational databases, but it’s also not the same kind of data operations that hadoop is running. the solution for many businesses ends up being a combination of these approaches to data management.
finding hidden data insights
once you get beyond storage and management, you still have the enormous task of creating actionable business intelligence (bi) from the datasets you’ve collected. this problem of processing and analyzing data is maybe one of the trickiest in the data management lifecycle. the best options for data analytics will favor an approach that is predictive and adaptable to changing data streams. the thing is, there’s so many types of analytic models and different ways of providing infrastructure for this process. your analytics solution should scale, but to what degree? scalability can be an enormous pain in your analytical neck, due to the problem of decreasing performance returns when scaling out an algorithm.
ultimately, analytics tools rely on a great deal of reasoning and analysis to extract data patterns and data insights, but this capacity means nothing for a business if they can’t then create actionable intelligence. part of this problem is that many businesses have the infrastructure to accommodate big data, but they aren’t asking questions about what problems they’re going to solve with the data. implementing a big data-ready infrastructure before knowing what questions you want to ask is like putting the cart before the horse.
but even if we do know the questions we want to ask, data analysis can always reveal many correlations with no clear causes. as organizations get better at processing and analyzing big data, the next major hurdle will be pinpointing the causes behind the trends by asking the right questions and embracing the complexity of our answers.
Opinions expressed by DZone contributors are their own.