Evolving Technologies for Driving Big Data

An introduction to the different kinds of technology and tools that are useful for taking charge of big data.

It is estimated that 2.5 quintillion bytes of data are created every single day. Data at this scale has to be stored and must remain easy to access for later analysis. The quantities involved are measured in lesser-known units such as petabytes, exabytes, and zettabytes. As companies collect ever-increasing amounts of data and expect it to be quickly and easily accessible, the demands on technology and infrastructure have grown accordingly. In the early 2000s, industry analyst Doug Laney coined a definition of Big Data that has since become widely accepted. The “three Vs” definition uses three markers to delineate what qualifies as Big Data – and it is about more than just the sheer amount of data.

  • Volume – This aspect represents the sheer amount of data generated in an era where companies commonly store vast quantities of transaction information, social media information, and machine-to-machine and sensor data. Without technologies designed for such volumes, storing the data efficiently would be a problem.
  • Velocity – New technologies such as RFID tags and sensors allow data to stream in at extremely high speeds. These flows of data must be stored and structured in real time, which is another challenge in dealing with Big Data.
  • Variety – This represents the fact that data comes in a wide variety of formats – from traditional databases to unstructured video, email, audio and transaction data. All of these kinds of data must be integrated and structured.

These three hallmarks of Big Data present a challenge for companies that need to store such data in a structured, accessible, and affordable way. These challenges often prevent businesses from properly analyzing and utilizing Big Data. This is a real loss, since by its nature Big Data often contains highly exploitable information about customer behavior. With so much raw information available, patterns can be identified and used to predict future customer behavior. This is just one of many ways in which Big Data holds potential for companies. Luckily, new technologies and new ways of dealing with data have arisen to meet the growing needs of companies trying to store and utilize Big Data. The following technologies are particularly useful for Big Data storage and utilization.

  • Column-oriented databases – Traditional databases are organized around rows rather than columns. While they are quite efficient in terms of online transaction and update speeds, they fall short as data volume grows and the data becomes more amorphous, and query times can become extremely long. Column-oriented databases offer extremely fast query times and allow high levels of data compression. The downside is that they normally only allow batch updates, which leads to long update times.
  • NoSQL and schema-less databases – This category includes database types such as key-value stores and document stores, which focus on access to large volumes of data that may be structured, unstructured, or semi-structured. These databases relax many of the restrictions of traditional databases, such as read-write consistency, and gain scalability and distributed processing in exchange.
  • MapReduce – MapReduce allows job execution to scale across large numbers of servers. A MapReduce implementation consists of two primary tasks – the Map task, in which an input dataset is converted into a new set of key/value pairs, and the Reduce task, in which the outputs of the Map task are combined into a reduced set of key/value pairs.
  • Hadoop – This is a highly popular implementation of MapReduce and a wholly open source platform for dealing with Big Data. It distributes processing across clusters of servers. Hadoop is able to work with multiple data sources, either by aggregating data for large-scale processing or by reading a database to run processor-intensive machine learning jobs. Hadoop is especially useful for dealing with high volumes of constantly changing data, including location-based weather and traffic sensor data, social media data, and machine-to-machine transaction data. Unlike other approaches to Big Data, which rely on high-end hardware, Hadoop gets its resiliency from its ability to detect and deal with failures at the application layer.
  • PLATFORA – Because Hadoop is a low-level implementation of MapReduce, it requires extensive developer knowledge to operate. PLATFORA turns users’ queries into Hadoop jobs automatically, creating an abstraction layer that can be used to organize datasets stored in Hadoop.
  • Massively Parallel Processing (MPP) – Also known as a “loosely coupled” or “shared nothing” system, MPP is the coordinated processing of a program by many processors, sometimes 200 or more, each using its own operating system and memory and working on a different part of the program. The processors communicate through a messaging interface.
  • Hive – Hive enables conventional business intelligence applications to run queries against a Hadoop cluster. Originally developed by Facebook, it has been open source for some time now. Hive allows anyone to query data stored in a Hadoop cluster just as they would query a conventional data store. This makes Hadoop more familiar to users of business intelligence applications.
  • Stream analytics – Stream analytics technologies filter and analyze large volumes of data arriving from disparate live sources in a variety of formats. They look for insight in the data by running real-time analytic computations on it as it streams in. Cost-effective stream analytics can give a business a competitive advantage. Stream analytics is used in areas such as stock trading analysis, financial services, and data protection services.
  • Distributed file systems – A distributed file system allows client nodes to access files over a network, so that multiple users can share files and storage resources. Client nodes do not have direct access to the underlying block storage; instead, they interact with it through a network protocol. This makes it possible to limit access to the file system for both servers and clients.
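
The Map and Reduce tasks described in the list above can be illustrated with a minimal, single-process sketch in Python. This is not Hadoop code: the word-count job and the function names (`map_task`, `shuffle`, `reduce_task`, `word_count`) are assumptions chosen purely for illustration, and a real framework would run the Map and Reduce tasks in parallel across many servers rather than in one loop.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_task(document: str) -> List[Tuple[str, int]]:
    """Map: convert one input record into a list of (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as a framework does between Map and Reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_task(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single (key, value) pair."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run the illustrative Map -> shuffle -> Reduce pipeline in one process."""
    mapped = [pair for doc in documents for pair in map_task(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data big clusters", "data streams"]
    print(word_count(docs))
```

Because each call to `map_task` depends only on its own input record, and each call to `reduce_task` depends only on the values for one key, both steps could in principle be spread over separate machines, which is the property that lets MapReduce scale.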

Most of these technologies utilize Cloud Computing in one way or another. Cloud Computing is the key that allows companies of all sizes to exploit data potential that was traditionally wasted because of the difficulties of dealing with Big Data. Cloud Computing can increase speed and reduce costs in a way that allows even smaller companies to store, analyze, and utilize this data.

Because of the ever-increasing capabilities of computers and the Internet, growing volumes of data need to be stored in ways that allow them to be easily accessed and analyzed. Traditional storage methods cannot accomplish this in an efficient and affordable way. New ideas, methods, and technologies are driving business users’ ability to store and deal with Big Data. Many businesses do not use unstructured data to their advantage because of the difficulties involved in analyzing and utilizing it. As the technologies for handling Big Data expand and become more efficient, it will become much easier for businesses to utilize this data.
