
An Overview of the Hadoop Ecosystem for Beginners

Interested in learning Hadoop? Read on to get a nice overview of the entire Hadoop ecosystem, and how all the parts work together.

By gyan setu · Jan. 07, 19 · Analysis

Introduction to Big Data

Big data refers to the massive volumes of data generated through various platforms across the world.

Categories of big data:

  1. Structured

  2. Unstructured

  3. Semi-structured

Examples of Big Data:

1) The New York Stock Exchange generates about 1 TB of new trade data per day.

2) Social Media: Statistics show that 500+ terabytes of data are ingested into the databases of the social media site Facebook every day.

Data is mainly generated in terms of:

  • Photos and video uploads

  • Message exchanges

  • Comments

3) Jet Engine /Travel Portals:

A single jet engine generates 10+ terabytes (TB) of data in 30 minutes of flight time. With many thousands of flights per day, the generation of data reaches up to many petabytes (PB).

What Is Hadoop?

Hadoop is an open source framework managed by the Apache Software Foundation. Open source implies that it is freely available and its source code can be changed as per the user's requirements. Apache Hadoop is designed to store and process big data efficiently. Hadoop is used for data storing, processing, analyzing, accessing, governance, operations, and security. 

Large organizations with huge amounts of data use Hadoop, with processing done on a large cluster of commodity hardware. A cluster is a group of systems connected via a LAN, and the multiple nodes in the cluster work together to perform Hadoop jobs. Hadoop has gained popularity worldwide for managing big data and, at present, holds a nearly 90% share of that market.

Features of Hadoop

  • Cost Effective: The Hadoop system is very cost-effective, as it does not require any specialized hardware and thus requires low investment. Simple hardware, known as commodity hardware, is sufficient for the system.

  • Supports Large Clusters of Nodes: A Hadoop cluster can be made up of thousands of nodes. A large cluster expands the storage system and offers more computing power.

  • Parallel Processing of Data: Hadoop system supports parallel processing of the data across all nodes in the cluster, and thus it reduces the storage & processing time.

  • Distribution of Data (Distributed Processing): Hadoop efficiently distributes the data across all the nodes in a cluster. Moreover, it replicates the data over the entire cluster so that the data can be retrieved from other nodes if a particular node is busy or fails to operate.

  • Automatic Failover Management (Fault Tolerance): An important feature of Hadoop is that it automatically resolves the problem in case a node in the cluster fails. The framework itself replaces the failed system with another system along with configuring the replicated settings and data on the new machine.

  • Supports Heterogeneous Clusters: A heterogeneous cluster is one whose nodes come from different vendors and run different operating systems and versions. For instance, a Hadoop cluster might contain a Lenovo machine running RHEL Linux, an Intel machine running Ubuntu Linux, and an AMD machine running Fedora Linux; all of these different systems are capable of running simultaneously in a single cluster.

  • Scalability: A Hadoop system has the ability to add or remove node/nodes and hardware components from a cluster, without affecting the operations of the cluster. This refers to scalability, which is one of the important features of the Hadoop system. 
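
The replication behavior behind fault tolerance can be sketched in a few lines of Python. This is a toy model, not the Hadoop API: the node names, the round-robin placement, and the default replication factor of 3 are illustrative assumptions.

```python
# Toy model of HDFS-style replication: each block is copied to
# `replication` distinct nodes, so losing one node loses no data.
from itertools import cycle

def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    ring = cycle(range(len(nodes)))
    placement = {}
    for block in blocks:
        start = next(ring)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(replication)]
    return placement

placement = place_blocks(["blk_1", "blk_2", "blk_3"],
                         ["node-a", "node-b", "node-c", "node-d"])

# Simulate the failure of one node: every block still has live replicas.
failed = "node-b"
for block, replicas in placement.items():
    survivors = [n for n in replicas if n != failed]
    assert survivors, f"{block} lost!"  # at least one replica survives
```

Because every block lives on three of the four nodes, any single failure leaves at least two live copies of each block.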

Overview of Hadoop Ecosystem

The Hadoop ecosystem consists of:

  1. HDFS (Hadoop Distributed File System)

  2. Apache MapReduce

  3. Apache Pig

  4. Apache HBase

  5. Apache Hive

  6. Apache Sqoop

  7. Apache Flume

  8. Apache Zookeeper

  9. Apache Kafka

  10. Apache Oozie 

HDFS (Hadoop Distributed File System): HDFS performs the most important job in the Hadoop framework. It splits the data into blocks and stores them across the nodes of a cluster simultaneously, which reduces the total time needed to store the data on disk.
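
The block-splitting idea can be illustrated in plain Python. Real HDFS uses a default block size of 128 MB; the 10-byte block size here is an assumption chosen only to keep the example readable.

```python
# Toy illustration of HDFS splitting a file into fixed-size blocks.
# Real HDFS defaults to 128 MB blocks; 10 bytes keeps this readable.
def split_into_blocks(data: bytes, block_size: int = 10):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"a" * 25)
print([len(b) for b in blocks])  # [10, 10, 5]
```

Note that the last block is smaller than the block size, just as in HDFS, where a file's final block only occupies the space it needs.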

MapReduce (Read/Write Large Datasets into/from Hadoop using MR): Hadoop MapReduce is another important part of the system that processes the huge volumes of data stored in a cluster. It allows parallel processing of all the data stored by HDFS. Moreover, it resolves the issue of high cost of processing through the massive scalability in a cluster.
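
The MapReduce model itself is simple enough to sketch in-process: map emits key/value pairs, the shuffle groups them by key, and reduce aggregates each group. This is a single-machine sketch of the programming model, not Hadoop's distributed implementation.

```python
# Minimal in-process sketch of the MapReduce model (word count):
# map emits (key, value) pairs, shuffle groups by key, reduce aggregates.
from collections import defaultdict

def map_phase(line):
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "data data"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 3, 'cluster': 1}
```

In a real cluster, the map and reduce functions run in parallel on the nodes that hold the data, and the shuffle moves data across the network between them.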

Apache Pig (Pig is a kind of ETL for the Hadoop ecosystem): It is a high-level scripting language for writing data analysis programs over huge data sets in a Hadoop cluster. Pig enables developers to generate query execution routines for the analysis of large data sets. The scripting language, known as Pig Latin, is one key part of Pig; the second key part is its compiler.
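
To give a feel for the dataflow style Pig Latin expresses, the comments below sketch a hypothetical Pig script (the relation and field names are invented for illustration), and the Python beneath performs the same filter-and-group dataflow on toy data.

```python
# A Pig Latin script along these lines (hypothetical example):
#   users  = LOAD 'users' AS (name, age, city);
#   adults = FILTER users BY age >= 18;
#   byCity = GROUP adults BY city;   -- then COUNT per group
# The same dataflow in plain Python, on toy data:
from collections import Counter

users = [("ann", 22, "delhi"), ("bob", 17, "pune"), ("cal", 30, "delhi")]
adults = [u for u in users if u[1] >= 18]           # FILTER
counts = Counter(city for _, _, city in adults)     # GROUP + COUNT
print(dict(counts))  # {'delhi': 2}
```

Pig's compiler turns such scripts into MapReduce jobs, so the developer writes the dataflow and the framework handles the parallel execution.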

Apache HBase (OLTP/NoSQL source): It is a column-oriented database that provides real-time read/write access on top of HDFS. It can process very large database tables, i.e. tables containing millions of rows and columns. An important aspect of HBase is its efficient use of master nodes for managing region servers.

Apache Hive (Hive is a SQL engine on Hadoop): With a SQL-like interface, Hive allows for the querying of data stored in HDFS. The Hive version of the SQL language is called HiveQL.
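
HiveQL reads much like standard SQL. The aggregation below is the kind of query Hive would compile into MapReduce jobs over HDFS data; Python's sqlite3 module is used here only so the example runs anywhere, and the table and data are invented for illustration — this is not Hive itself.

```python
# A HiveQL-style aggregation, run against sqlite3 purely to illustrate
# the SQL dialect's flavor (Hive would execute this over HDFS data).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, volume INTEGER)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [("AAPL", 100), ("MSFT", 50), ("AAPL", 200)])
rows = conn.execute(
    "SELECT symbol, SUM(volume) FROM trades GROUP BY symbol ORDER BY symbol"
).fetchall()
print(rows)  # [('AAPL', 300), ('MSFT', 50)]
```

The appeal of Hive is exactly this: analysts who know SQL can query petabyte-scale data without writing MapReduce code by hand.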

Apache Sqoop (Data Import/Export Between RDBMS [SQL Sources] and Hadoop): It is an application that helps with the import and export of data between Hadoop and relational database management systems, and it can transfer data in bulk. Sqoop is based on a connector architecture that supports plugins for establishing connectivity to new external systems.

Apache Flume (Data Import from Unstructured [e.g., Social Media Sites] and Structured Sources into Hadoop): It is an application that allows the storage of streaming data in a Hadoop cluster; data being written to log files is a good example of streaming data.

Apache Zookeeper (coordination tool used in a clustered environment): Its role is to manage the coordination between the above-mentioned applications for their efficient functioning in the Hadoop ecosystem.

Functioning of Hadoop – HDFS Daemons

The Hadoop system works on the principle of master-slave architecture.

Name Node: It is the master node, and there is a single one per cluster. It is responsible for storing the HDFS metadata that keeps track of all the files stored in HDFS. The metadata includes information such as the file name, the file's permissions, the authorized users of the file, and the locations where the file's blocks are stored. This information is held in RAM and is generally called file system metadata.
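
The shape of that in-memory metadata can be sketched as a nested dictionary. The field names, paths, and node names below are illustrative assumptions, not HDFS's actual internal schema.

```python
# Sketch of the kind of per-file metadata the name node keeps in RAM.
# Field names and values are illustrative, not HDFS internals.
metadata = {
    "/user/logs/2019-01-07.log": {
        "permissions": "rw-r--r--",
        "owner": "hadoop",
        "replication": 3,
        "blocks": {"blk_1001": ["node-a", "node-b", "node-c"]},
    }
}

# A client asking "where does this file live?" is answered from RAM:
info = metadata["/user/logs/2019-01-07.log"]
print(info["blocks"]["blk_1001"])  # ['node-a', 'node-b', 'node-c']
```

Because every lookup is served from memory, the name node answers location queries fast, but it also means the cluster's file count is bounded by the name node's RAM.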

Data Nodes: These are the slave nodes, and they are present in large numbers. Data nodes are responsible for storing and retrieving the data as instructed by the name node. Data nodes periodically report to the name node with their present status and a list of the blocks they store. The data nodes keep multiple copies of each file stored on them.

Secondary Name Node: The secondary name node supports the primary name node in storing the metadata. If the name node fails due to corrupt metadata, or for any other reason, the secondary name node prevents the malfunctioning of the complete cluster. The secondary name node instructs the name node to create and send its fsimage and editlog files, from which the secondary name node builds a compacted fsimage file. This compacted file is then transferred back to the name node and renamed. This process repeats either every hour or whenever the size of the editlog file exceeds 64 MB.
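
The checkpointing step amounts to replaying the edit log onto the fsimage snapshot and starting a fresh, empty log. Here is a toy model of that merge; the operation names and path strings are invented for illustration, though fsimage and editlog mirror HDFS terminology.

```python
# Toy checkpoint: replay the edit log onto the fsimage snapshot,
# yielding a compact fsimage and an empty edit log.
def checkpoint(fsimage: dict, editlog: list):
    merged = dict(fsimage)
    for op, path in editlog:
        if op == "create":
            merged[path] = {}
        elif op == "delete":
            merged.pop(path, None)
    return merged, []  # new fsimage, emptied edit log

fsimage = {"/a": {}, "/b": {}}
editlog = [("create", "/c"), ("delete", "/a")]
fsimage, editlog = checkpoint(fsimage, editlog)
print(sorted(fsimage))  # ['/b', '/c']
```

Keeping the edit log short this way is what makes a name node restart fast: on recovery it loads the compact fsimage and replays only the few edits logged since the last checkpoint.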


Published at DZone with permission of gyan setu. See the original article here.

Opinions expressed by DZone contributors are their own.
