Ecosystem of Hadoop Animal Zoo

By Umashankar Ankuri · Jun. 03, 15

Hadoop is best known for MapReduce and its distributed file system (HDFS). In recent years, a number of productivity tools built on top of these have grown into a complete Hadoop ecosystem. Most of the projects are hosted by the Apache Software Foundation. The major Hadoop ecosystem projects are described below.

Hadoop Common

A set of common components and interfaces for distributed file systems and general I/O (serialization, Java RPC, persistent data structures). http://hadoop.apache.org/

(Figure: the Hadoop ecosystem)

HDFS

A distributed file system that runs on large clusters of commodity hardware. The Hadoop Distributed File System (HDFS) was renamed from NDFS. It is a scalable data store that holds structured, semi-structured, and unstructured data.

http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
http://wiki.apache.org/hadoop/HDFS
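
Day-to-day interaction goes through the hdfs dfs shell; for example (the paths here are illustrative):

    hdfs dfs -mkdir /data
    hdfs dfs -put access.log /data/
    hdfs dfs -ls /data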


MapReduce

MapReduce is the distributed, parallel programming model for Hadoop, inspired by Google's MapReduce research paper, and Hadoop includes an implementation of it. There are, not surprisingly, two phases, map and reduce; to be precise, between them sits a third phase called sort and shuffle. The JobTracker, running on the name node machine, manages the other cluster nodes. MapReduce programs can be written in Java; if you prefer SQL or other non-Java languages, you are still in luck, because you can use a utility called Hadoop Streaming. http://wiki.apache.org/hadoop/HadoopMapReduce
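
As a concrete illustration, here is the canonical word count as a minimal sketch against the Hadoop Java MapReduce API (input and output directories come from the command line):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // map phase: emit (word, 1) for every word in the input split
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // reduce phase: sort and shuffle has already grouped the pairs,
      // so each call sees one word together with all of its counts
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }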


Hadoop Streaming

A utility that lets you write MapReduce code in many languages, such as C, C++, Perl, Python, and Bash; a classic example pairs a Python mapper with an AWK reducer. http://hadoop.apache.org/docs/r1.2.1/streaming.html
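
An invocation looks roughly like this; the streaming jar's location varies by Hadoop version, and mapper.py/reducer.py stand in for your own scripts:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input  /data/input \
        -output /data/output \
        -mapper  mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py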

Avro

A serialization system for efficient, cross-language RPC and persistent data storage. Avro is a framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because in Avro data is always stored with its schema. http://avro.apache.org/
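
Schemas are written in JSON; a small, hypothetical record schema:

    {
      "type": "record",
      "name": "User",
      "namespace": "example.avro",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "age",  "type": "int"}
      ]
    }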


Apache Thrift

Apache Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code that can be used to easily build RPC clients and servers which communicate seamlessly across programming languages. Instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business. http://thrift.apache.org/
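
For example, a hypothetical definition file, user.thrift:

    struct User {
      1: i32    id,
      2: string name
    }

    service UserService {
      User getUser(1: i32 id)
    }

Running thrift --gen java user.thrift (or --gen py, --gen cpp, and so on) then emits the client and server stubs.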

Hive and Hue

If you like SQL, you will be delighted to hear that you can write SQL and have Hive convert it into a MapReduce job; you don't get a full ANSI SQL environment, though. Hue gives you a browser-based graphical interface for your Hive work. Hue features a file browser for HDFS; a job browser for MapReduce/YARN; an HBase browser; query editors for Hive, Pig, Cloudera Impala, and Sqoop2; an Oozie application for creating and monitoring workflows; a ZooKeeper browser; and an SDK.
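
A quick taste of HiveQL; the table and columns here are hypothetical, and the SELECT is the kind of statement Hive turns into a MapReduce job:

    CREATE TABLE page_views (user_id STRING, url STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- hive compiles this into one or more map reduce jobs
    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url;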


Pig

A high-level data flow language and execution environment for MapReduce coding; the Pig language is called Pig Latin. You may find the naming conventions somewhat unconventional, but you get incredible price-performance and high availability. https://pig.apache.org/
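
The same word count from the MapReduce section shrinks to a few lines of Pig Latin (the paths are placeholders):

    lines   = LOAD '/data/input' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group, COUNT(words);
    DUMP counts;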


Jaql

Jaql is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured, and unstructured data. As its name implies, a primary use of Jaql is to handle data stored as JSON documents, but Jaql can work on various types of data; for example, it can support XML, comma-separated values (CSV), and flat files. A "SQL within Jaql" capability lets programmers work with structured SQL data while employing a JSON data model that is less restrictive than its Structured Query Language counterparts. See: 1. Jaql on Google Code; 2. "What is Jaql?" by IBM.
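
A flavor of the pipeline syntax, sketched from IBM's published examples (treat the exact operators as illustrative rather than authoritative):

    read(hdfs("books.json"))
        -> filter $.price < 10
        -> transform { title: $.title, price: $.price };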

Sqoop

Sqoop provides bi-directional data transfer between Hadoop (HDFS) and your favorite relational database. For example, you might be storing your application data in a relational store such as Oracle; when you want to scale that application with Hadoop, you can migrate the Oracle data into HDFS using Sqoop. http://sqoop.apache.org/
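
A hypothetical Oracle import (host, SID, credentials, and table are placeholders); sqoop export moves data in the opposite direction:

    sqoop import \
        --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
        --username scott -P \
        --table EMPLOYEES \
        --target-dir /data/employees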


Oozie

Manages Hadoop workflows. Oozie doesn't replace your scheduler or BPM tooling, but it does provide if-then-else branching and control flow for Hadoop jobs. https://oozie.apache.org/
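
A workflow is an XML document in which actions are wired together by ok/error transitions (and, for branching, decision nodes); a skeletal sketch with hypothetical names and the job configuration elided:

    <workflow-app name="etl-wf" xmlns="uri:oozie:workflow:0.4">
        <start to="import"/>
        <action name="import">
            <map-reduce>
                <!-- job configuration elided -->
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Import failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>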


ZooKeeper

A distributed, highly available coordination service. ZooKeeper provides primitives, such as distributed locks, that can be used for building highly scalable applications, and it is commonly used to manage synchronization within a cluster. http://zookeeper.apache.org/
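
A minimal sketch of the lock idea using the ZooKeeper Java client: whichever process manages to create a given ephemeral znode holds the lock, and the znode disappears automatically if that process dies. The connection string and paths are hypothetical, and /locks is assumed to exist:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class CrudeLock {
      public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> { });
        try {
          // ephemeral: deleted automatically when this session ends
          zk.create("/locks/job-1", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          System.out.println("lock acquired, doing critical work");
        } catch (KeeperException.NodeExistsException e) {
          System.out.println("another node holds the lock");
        } finally {
          zk.close();
        }
      }
    }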


HBase

Based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS. It is a super-scalable key-value store that works very much like a persistent hash map (for Python developers, think of a dictionary). It is not a conventional relational database; it is a distributed, column-oriented database that uses HDFS for its underlying storage and supports both batch-style computation using MapReduce and point queries for random reads. https://hbase.apache.org/
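
The persistent-hash-map feel, sketched against the classic (pre-1.0) Java client API; the users table with an info column family is hypothetical and assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseHashMap {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");

        // put: like map.put(key, value), but persisted and versioned
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
        table.put(put);

        // get: a point query (random read)
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(name));

        table.close();
      }
    }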


Cassandra

A column-oriented NoSQL data store that offers scalability and high availability without compromising performance, making it a perfect fit for commodity hardware and cloud infrastructure. Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching. http://cassandra.apache.org/
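
In CQL terms, the column-index convenience mentioned above looks like an ordinary table plus a secondary index (a hypothetical sketch):

    CREATE TABLE users (
        user_id uuid PRIMARY KEY,
        name    text,
        email   text
    );

    CREATE INDEX users_email_idx ON users (email);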


Flume

A real-time loader for streaming data into Hadoop, storing it in HDFS or HBase. Flume "channels" data between "sources" and "sinks", and its data harvesting can be either scheduled or event-driven. Possible sources include Avro, files, and system logs; possible sinks include HDFS and HBase. http://flume.apache.org/
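
An agent is wired up in a properties file; a minimal sketch that tails a log file into HDFS (the agent name, paths, and the choice of an exec source are all illustrative):

    # one source -> one memory channel -> one hdfs sink
    agent.sources  = tail1
    agent.channels = mem1
    agent.sinks    = hdfs1

    agent.sources.tail1.type = exec
    agent.sources.tail1.command = tail -F /var/log/app.log
    agent.sources.tail1.channels = mem1

    agent.channels.mem1.type = memory

    agent.sinks.hdfs1.type = hdfs
    agent.sinks.hdfs1.hdfs.path = /flume/events
    agent.sinks.hdfs1.channel = mem1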


Mahout

Machine learning for Hadoop, used for predictive analytics and other advanced analysis. There are currently four main groups of algorithms in Mahout:

  • Recommendations, a.k.a. collaborative filtering
  • Classification, a.k.a. categorization
  • Clustering
  • Frequent itemset mining, a.k.a. parallel frequent pattern mining

Mahout is not simply a collection of pre-existing algorithms. Many machine learning algorithms are intrinsically non-scalable; that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion. http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms https://www.coursera.org/course/machlearning https://mahout.apache.org/
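
A taste of the recommendations group via Mahout's Taste API; a minimal sketch in which ratings.csv is a hypothetical userID,itemID,rating file:

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class TopThree {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
        NearestNUserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // top three item recommendations for user 1
        for (RecommendedItem item : recommender.recommend(1, 3)) {
          System.out.println(item.getItemID() + " " + item.getValue());
        }
      }
    }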


Fuse

Makes HDFS look like a regular file system, so that you can use ls, rm, cd, etc. directly on HDFS data.
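
In distributions that ship the hadoop-fuse-dfs helper, that looks roughly like this (host, port, and mount point are hypothetical):

    hadoop-fuse-dfs dfs://namenode:8020 /mnt/hdfs
    ls /mnt/hdfs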

Whirr

Apache Whirr is a set of libraries for running cloud services. Whirr provides a cloud-neutral way to run services, so you don't have to worry about the idiosyncrasies of each provider; a common service API, with the details of provisioning particular to each service; and smart defaults, so you can get a properly configured system running quickly while still being able to override settings as needed. You can also use Whirr as a command-line tool for deploying clusters. https://whirr.apache.org/
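
Driving it from the command line centers on a properties recipe; a sketch patterned on the Whirr quick start (the values are illustrative):

    # hadoop.properties
    whirr.cluster-name=testcluster
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2

whirr launch-cluster --config hadoop.properties then provisions the cluster, and whirr destroy-cluster tears it down.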


Giraph

An open source graph processing API in the style of Google's Pregel. https://giraph.apache.org/


Chukwa

Chukwa, an Apache incubator project, is a data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing, and storage in Hadoop. It is included in the Apache Hadoop distribution as an independent module. https://chukwa.apache.org/


Drill

Apache Drill, an Apache incubator project, is an open-source software framework that supports data-intensive distributed applications for the interactive analysis of large-scale datasets. Drill is the open-source counterpart of Google's Dremel system, which is available as an IaaS offering called Google BigQuery. One explicitly stated design goal is that Drill should be able to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds. http://incubator.apache.org/drill/
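
The pitch in one query: point SQL at a raw file in place, with no load or schema step (the path and fields are hypothetical):

    SELECT t.name, t.age
    FROM dfs.`/data/users.json` AS t
    WHERE t.age > 30;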


Impala (Cloudera)

Released by Cloudera, Impala is an open-source project that, like Apache Drill, was inspired by Google's paper on Dremel; the purpose of both is to facilitate real-time querying of data in HDFS or HBase. Impala uses an SQL-like language that, though similar to HiveQL, is currently more limited. Because Impala relies on the Hive metastore, Hive must be installed on the cluster for Impala to work. The secret behind Impala's speed is that it "circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs" (source: Cloudera). http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html http://training.cloudera.com/elearning/impala/
