An Overview of the Hadoop Ecosystem for Beginners
Interested in learning Hadoop? Read on to get a nice overview of the entire Hadoop ecosystem, and how all the parts work together.
Introduction to Big Data
Big data refers to the very large volumes of data generated through various platforms across the world.
Categories of big data:
Examples of Big Data:
1) The New York Stock Exchange generates about 1 TB of new trade data per day.
2) Social Media: Statistics show that 500+ terabytes of data are ingested into the database of the social media site Facebook every day.
Data is mainly generated in terms of:
Photos and video uploads
3) Jet Engine/Travel Portals:
A single jet engine generates 10+ terabytes (TB) of data in 30 minutes of flight. Across the flights made each day, the data generated reaches many petabytes (PB).
What Is Hadoop?
Hadoop is an open source framework managed by the Apache Software Foundation. Open source implies that it is freely available and that its source code can be changed to suit the user's requirements. Apache Hadoop is designed to store and process big data efficiently. It is used for data storage, processing, analysis, access, governance, operations, and security.
Large organizations with huge amounts of data use Hadoop, which processes that data with the help of a large cluster of commodity hardware. A cluster is a group of systems connected via a LAN; the multiple nodes in the cluster work together to run Hadoop jobs. Hadoop has gained popularity worldwide for managing big data and, at present, it has a nearly 90% market share.
Features of Hadoop
Cost Effective: A Hadoop system is very cost effective because it does not require any specialized hardware and thus needs little investment. Simple hardware, known as commodity hardware, is sufficient for the system.
Supports Large Cluster of Nodes: A Hadoop cluster can be made up of thousands of nodes. A large cluster expands the storage capacity and offers more computing power.
Parallel Processing of Data: Hadoop supports parallel processing of the data across all nodes in the cluster, which reduces storage and processing time.
Distribution of Data (Distributed Processing): Hadoop efficiently distributes the data across all the nodes in a cluster. Moreover, it replicates the data across the cluster so that the data can be retrieved from other nodes if a particular node is busy or fails.
Automatic Failover Management (Fault Tolerance): An important feature of Hadoop is that it automatically handles the failure of a node in the cluster. The framework itself reassigns the failed node's work to another node and restores the lost replicas of its data from the copies held elsewhere in the cluster.
Supports Heterogeneous Clusters: A heterogeneous cluster is one made up of nodes or machines that come from different vendors and run different operating systems and versions. For instance, if a Hadoop cluster has three systems, a Lenovo machine running RHEL Linux, an Intel machine running Ubuntu Linux, and an AMD machine running Fedora Linux, all of these different systems can run simultaneously in a single cluster.
Scalability: A Hadoop system can add or remove nodes and hardware components from a cluster without affecting the operations of the cluster. This is known as scalability, one of the important features of the Hadoop system.
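The replication and failover behaviour described in the features above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the node names, block names, and the round-robin placement rule are hypothetical, and real HDFS uses a rack-aware placement policy rather than this simple scheme. The replication factor of 3 matches the usual HDFS default.

```python
REPLICATION_FACTOR = 3  # common HDFS default; used here purely for illustration


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes (hypothetical round-robin rule)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def read_block(block, placement, live_nodes):
    """Return the first live node that holds a replica of `block`."""
    for node in placement[block]:
        if node in live_nodes:
            return node
    raise RuntimeError(f"all replicas of {block} are unavailable")


nodes = ["node1", "node2", "node3", "node4"]      # hypothetical cluster
placement = place_blocks(["blk_0", "blk_1", "blk_2"], nodes)
live = set(nodes) - {"node1"}                     # simulate node1 failing
served_by = read_block("blk_0", placement, live)  # falls back to another replica
```

Because every block lives on three different nodes, losing any single node still leaves two readable copies, which is what lets the cluster keep serving data while the failure is handled.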
Overview of Hadoop Ecosystem
The Hadoop ecosystem consists of:
HDFS (Hadoop Distributed File System)
MapReduce
Apache Pig
Apache HBase
Apache Hive
Apache Sqoop
Apache Flume
Apache Zookeeper
HDFS (Hadoop Distributed File System): HDFS has the most important job to perform in the Hadoop framework. It splits the data and stores the pieces across the nodes in a cluster, writing to them simultaneously. This reduces the total time needed to store data on disk.
MapReduce (Read/Write Large Datasets into/from Hadoop using MR): Hadoop MapReduce is another important part of the system; it processes the huge volumes of data stored in a cluster. It allows parallel processing of all the data stored by HDFS. Moreover, it addresses the high cost of processing by scaling out massively across the cluster.
Apache Pig (Pig is a kind of ETL for the Hadoop ecosystem): Pig is a high-level scripting platform for writing data analysis programs for huge data sets in a Hadoop cluster. It enables developers to generate query execution routines for the analysis of large data sets. The scripting language, known as Pig Latin, is one key part of Pig; the second key part is a compiler.
Apache HBase (OLTP/NoSQL): HBase is a column-oriented database that runs on top of HDFS and provides real-time access to the data. It can handle very large tables, i.e. tables containing millions of rows and columns. An important aspect of HBase is its efficient use of master nodes for managing region servers.
Apache Hive (Hive is a SQL engine on Hadoop): With a SQL-like interface, Hive allows for the querying of data from HDFS. Hive's version of the SQL language is called HiveQL.
Apache Sqoop (Data Import/Export from RDBMS [SQL sources] into Hadoop): Sqoop is an application that helps with the import and export of data between Hadoop and relational database management systems. It can transfer data in bulk. Sqoop is based on a connector architecture that supports plugins for establishing connectivity to new external systems.
Apache Flume (Data Import from Unstructured [Social Media Sites]/Structured Sources into Hadoop): Flume is an application that allows streaming data to be stored in a Hadoop cluster. Data being written to log files is a good example of streaming data.
Apache Zookeeper (coordination tool used in a clustered environment): Its role is to manage the coordination between the above-mentioned applications so that they function efficiently in the Hadoop ecosystem.
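To make the MapReduce component described above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using the classic word count task. It does not use the Hadoop API; the function names and the sample input are hypothetical, and a real job would run the map and reduce steps in parallel on many nodes over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on its own key, which is why the framework can spread both steps across the cluster.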
Functioning of Hadoop – HDFS Daemons
The Hadoop system works on the principle of master-slave architecture.
Name Node: It is the master node, and there is only one of it in the cluster. It is responsible for storing the HDFS metadata, which keeps track of all the files stored in HDFS. The metadata includes information such as the file name, the file's permissions, the authorized user of the file, and the location where the file is stored. This information, generally called file system metadata, is held in RAM.
Data Nodes: These are the slave nodes, and there are many of them. Data nodes are responsible for storing and retrieving the data as instructed by the name node. They periodically report to the name node with their present status and all the files stored on them. Multiple copies of each file are kept across the data nodes.
Secondary Name Node: The secondary name node is present to support the primary name node in storing the metadata. If the name node fails because of corrupt metadata or for any other reason, the secondary name node helps prevent the malfunctioning of the complete cluster. The secondary name node instructs the name node to create and send its fsimage and editlog files, from which the secondary name node creates a compacted fsimage file. This compacted file is then transferred back to the name node and renamed. This process repeats either every hour or when the size of the editlog file exceeds 64 MB.
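The checkpoint process carried out by the secondary name node can be sketched as follows. This is an illustration only: the real fsimage and editlog are binary files, while here the fsimage is modelled as a dictionary and the editlog as a list of hypothetical ("create" or "delete") records. The two trigger values are the ones quoted in the text above.

```python
CHECKPOINT_PERIOD_SECONDS = 60 * 60      # "every hour", as stated in the text
EDITLOG_LIMIT_BYTES = 64 * 1024 * 1024   # 64 MB, as stated in the text


def should_checkpoint(seconds_since_last, editlog_bytes):
    """True when either of the two triggers described in the text has fired."""
    return (seconds_since_last >= CHECKPOINT_PERIOD_SECONDS
            or editlog_bytes > EDITLOG_LIMIT_BYTES)


def checkpoint(fsimage, editlog):
    """Replay the logged edits onto a copy of the snapshot and return the compacted image."""
    merged = dict(fsimage)
    for op, path, meta in editlog:       # hypothetical record layout
        if op == "create":
            merged[path] = meta
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


fsimage = {"/logs/a.txt": {"owner": "alice"}}
editlog = [
    ("create", "/logs/b.txt", {"owner": "bob"}),
    ("delete", "/logs/a.txt", None),
]
if should_checkpoint(seconds_since_last=3600, editlog_bytes=1024):
    # The compacted image replaces the old one and the edit log starts empty again.
    fsimage, editlog = checkpoint(fsimage, editlog), []
```

Folding the edits into the snapshot keeps the editlog short, so a restarting name node has far fewer logged operations to replay.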
Published at DZone with permission of gyan setu. See the original article here.