Drilling Into Big Data: Data Interpretation
We look into some of the data technologies out there that help data engineers and analysts better interpret their findings .
Join the DZone community and get the full member experience.Join For Free
In this series, we will walk you through the entire cycle of Big Data analytics. Now that we are familiar with the basics and with cluster creation, it is time to understand the data which is acquired from various sources and the most suitable data format to ingest into a big data environment. Let’s divide the source of the data into four quadrants
- Quadrant1: Data we analyze today.
- Quadrant2: Data we collect, but do not analyze.
- Quadrant3: Data we could collect, but we don’t.
- Quadrant4: Data from partners.
The second quadrant is usually unstructured, which we process to perform analytics. Big data mostly concentrates here. As mentioned in our previous articles, the storage part of Hadoop is HDFS, a file system which does not impose any data format or schema. So you can store the file in any format you wish and, hence, we come to the question, “Which file format should I use?”
Secondly, assume new columns are added to your table in a database. This gets reflected in your CSV dumps and, as a result, the schema changes. Thus, schema changes take place frequently which will be another drawback. Hadoop supports various file formats among which some are easy for humans to handle but some might provide better performance optimizations.
"Data is only as useful as the context in which it is gathered and presented" – Josh Pigford
Data obtained from various sources can be of different formats. Let’s quickly take a look at the file formats supported by the Hadoop for better understanding and quick processing of data. A few major file formats are listed below:
- Text files (.txt and .csv)
- Sequence Files (binary)
- JSON files
- Row Columnar (RC) and Optimized Row Columnar (ORC) files
- Avro files
- Parquet files
Which one will I choose from these file formats? A better file format must satisfy two major conditions:
- Block Compression
- Schema Evolution
It is a compression technique for reducing texture size up to 3/4th of the original and hence applications can see a drastic performance increase with the use of block compression.
As mentioned, once the initial schema is defined, applications might evolve it over time and schema changes might take place frequently. To handle this better, your file format should support schema evolution.
Comparison of File Formats
Take a look at the comparison table to arrive at a better solution
The most common file format which strikes our mind first is the Text file, as it is generally used by humans in daily life and hence it's easy to use and troubleshoot. But, remember the mantra of big data - “Write Once, Read many times” - which means performance is a main factor to consider. Hence, there should be no overhead in data querying and, unfortunately, this was the main drawback of text files.
From the comparison table, it is clear that Avro and Parquet file formats are dominant in Hadoop as they support Block Compression and Schema evolution. On the other hand, the other file formats are least used in the Hadoop ecosystem. This is based, however, on the storage type which is preferred by your organization, with better file formats being preferred for optimized performance.
Why Should I Choose a File Format?
- For better read and write performance.
- For better utilization of storage.
- Better compression support.
- Splittable files.
Data Compression in Hadoop
The main challenge in Big Data is dealing with large data sets. Storing these massive data volumes can cause I/O and network related issues. The solution to eliminate this problem is Data Compression. Some compression techniques are splittable, which can enhance performance when reading and processing large compressed files.
The four widely used Compression techniques in Hadoop are:
- Gzip - Compressed data is not splittable which makes it unsplittable for MR jobs.
- Bzip2 - Compressed data is splittable, but not suited for MR jobs due to high compression/decompression time.
- Snappy - Not splittable if used with normal file like .txt but good with Container files.
- LZO - Splittable and best suited for MR jobs.
- Tools like Impala do not support ORC files, where we have to go for Parquet.
- MapReduce produces an intermediate output where sequence files are preferred.
- GZIP and BZIP2 are good choices for data which is infrequently accessed.
- SNAPPY is used to compress Container file formats like Avro and Sequence File.
Accessing the Instance
We already have an EMR cluster which is all set for use. You can submit jobs and execution plans in the UI, otherwise, we can use the log on to the master node of the cluster. So what will you need to access the cluster?
- Tools like Putty, Cygwin.
- Public IP of the master node.
- Username and Password of the cluster.
SSH on Windows
Secure Shell (SSH) provides strong authentication and encrypted data communication between two systems.
In this article, we will use Cygwin and the Cygwin installation steps are mentioned below.
- Cygwin provides various functionalities similar to Linux distributions on Windows.
- Install Cygwin from here. Choose next on the first screen.
- Select "Install from Internet" and click next.
- Choose the location in your system where you want the Cygwin to be installed
- In the next window, select “Direct Connection” to connect to the internet directly and click next.
- Just choose one among the download sites and click next. You can also add your own sites by clicking on “Add”
- This leads to the progress page which shows the download progress. Select the packages which are needed. In our case, we will initially enable “ssh” to login to the cluster.
- Search for “ssh” which shows you a list of options. Select “autossh” as shown below.
- Once the installation of the packages is over, click on ‘Finish.’
- The Cygwin terminal is all set now. Let's open the terminal and check to see if SSH has been enabled. Type “ssh” and you should get the usage directions as shown below.
Obtaining the IP of the Master Node
The next prerequisite to access the cluster is the IP. Navigate to the cluster overview page where the public network IP address of the master exists.
You can also obtain the IP from the Hosts page. Navigate to “Hosts,” where the list of running hosts (master and core nodes) will be displayed. Extract the public IP of the master node.
Let’s login to the cluster by giving
ssh email@example.com in the Cygwin terminal where
- The root is the username and 220.127.116.11 is the IP of the cluster.
Enter ‘yes’ to continue connecting.
Give the password which was provided during the cluster creation
You are all set to explore and play with various settings and states in the cluster.
Types of Processing
Big data is acquired in various ways, either as batch data or as real-time streaming data. Hence, let’s divide this into two data processing pipelines:
- Batch Processing
- Stream Processing
Batch data processing is an efficient way of processing huge volumes of data where a group of transactions is collected over a period of time. Hadoop focuses on batch data processing.
Real time data processing or Stream processing is where we obtain continual input and hence processing is done with streaming data. Data is processed in near real-time. Most organizations use batch data processing, but sometimes there is a need for real-time processing too. Real-time analytics can help an organization to take immediate actions whenever necessary.
At this initial stage, we will look at Batch Processing as an example for better understanding.
Let me take Tourism as a use case for us to easily relate with.
If you are planning a trip, you would probably acquire more sources by browsing through the internet, from buying tickets to reserving accommodations and identifying the best spots. Members of the tourism industry are slowly turning to big data to improve decision-making and overall performance in a different aspect.
Download the open source tourism dataset from here.
Here you have various files in CSV and JSON format. From the data sources, the tripadvisor_merged Excel file will be the concatenated sheet of all other data. It is a US tourism dataset comprising various museums and its reviews were collected from Trip Advisor. It has columns like Museum name, Location, Review Count, Rating by the Tourists, Rank of the Museum, Fee (whether there is entry fee for the Museum), Length of Hours — how much time a tourist has spent in the museum (this can indirectly correlate to how interesting the museumis ) — Family and Friends Count (count of people visiting the museum), and more.
Let’s try to recollect a few terms from our first article:
- Source - This sheet is based on various events that are registered every time and can be processed in batches.
- Sort of data - Data is extracted in Excel sheets or captured in databases; and the data is structured.
- Tool to Ingest - Since it comes under batch processing, the tool used is Sqoop.
- Storage - With the help of Sqoop, we move the data into HDFS.
- Tool to Process - Spark.
- Querying - Hive.
- Analysis - Zeppelin/QuickBI.
Now, let’s preserve the data we collected by ingesting it into HDFS and process it to find out the top museums by visitor count and visitor spending time, the museum rankings by state, etc. These insights can help the tourism industry to make better decisions and identify tourists favorite spots. The source of this data comes exactly from the comments, ratings, and feedbacks which were left by the tourists and this huge data is scattered. An interconnection of this scattered information can be made possible through big data. With this understanding of data, we will focus on data ingestion and processing in the next article.
In the next article, we will take a closer look into the concepts and usage of HDFS and Sqoop for data ingestion.
Opinions expressed by DZone contributors are their own.