
Starfish: A Hadoop Performance Tuning Tool

By Swathi Venkatachala · Feb. 03, 13

What is Starfish?
Starfish is a self-tuning system for big data analytics. It is an open-source project hosted on GitHub: https://github.com/jwlent55/Starfish

What is the need for Starfish?
Performance!

What does it do, and what are its components?
Starfish enables Hadoop users and applications to get good performance automatically.
It has three main components.
1. Profiler
2. What-if Engine
3. Optimizer

1. Job Profile / Profiler:

  1. A profile is a concise statistical summary of a MapReduce (MR) job's execution.
  2. Profiling is based on dataflow and cost estimation for the MR job.
  3. Dataflow estimation concerns the number of bytes and <K,V> pairs processed during the job's execution.
  4. Cost estimation concerns resource usage and execution time at the level of tasks and the phases within each task.
  5. The performance models combine these two with the configuration parameters associated with the MR job.
  6. Space of configuration choices (several of these can be overridden per job, as sketched after this section):
    • Number of map tasks
    • Number of reduce tasks
    • Partitioning of map outputs to reduce tasks
    • Memory allocation to task-level buffers
    • Multiphase external sorting in the tasks
    • Whether output data from tasks should be compressed
    • Whether a combine function should be used, and so on
A job can be written as j = <program p, data d, resources r, configuration c>.
Thus, performance is a function of the job:
perf = F(p, d, r, c)
A job profile is generated either by the Profiler through measurement or by the What-if Engine through estimation.
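Several of the configuration parameters listed above can be overridden per job at submission time, provided the job parses Hadoop's generic options (for example, via ToolRunner). A minimal sketch follows; the jar name, driver class, and paths are placeholders, not details from this article:

hadoop jar my-app.jar MyJobDriver \
  -D mapred.reduce.tasks=8 \
  -D io.sort.mb=200 \
  -D mapred.compress.map.output=true \
  input_dir output_dir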

2. What-if Engine:
The What-if Engine uses a mix of simulation and model-based estimation at the phase level of MapReduce job execution in order to predict the performance of a MapReduce job before it is executed on a Hadoop cluster.
It estimates the performance using the properties of p, d, r, and c.
i.e., given the profile for job j = <p, d1, r1, c1>,
estimate the profile for job j' = <p, d2, r2, c2>.
It uses white-box models consisting of a detailed set of equations for Hadoop.
Example:
Input data properties
Dataflow statistics
Configuration parameters
⇒ Calculate the dataflow in each phase of a map task
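To make the white-box idea concrete, here is a toy back-of-the-envelope estimate, not Starfish's actual equations, of the map-output size and rough spill count for a single map task; all of the numbers are assumed:

# Toy dataflow estimate for one map task (illustrative only)
INPUT_SPLIT_MB=64        # MB read by one map task (assumed)
MAP_SELECTIVITY=1.5      # output bytes per input byte, as recorded in the job profile (assumed)
IO_SORT_MB=100           # in-memory sort buffer size (io.sort.mb)
MAP_OUTPUT_MB=$(echo "$INPUT_SPLIT_MB * $MAP_SELECTIVITY" | bc)
SPILLS=$(echo "$MAP_OUTPUT_MB / $IO_SORT_MB + 1" | bc)
echo "Estimated map output: ${MAP_OUTPUT_MB} MB, roughly ${SPILLS} spill(s)"

The real What-if Engine models many more factors (spill thresholds, combiners, compression, and so on), but the flow is the same: profiled statistics plus configuration parameters in, per-phase dataflow and cost estimates out.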

3. Optimizer:
It finds the optimal configuration settings to use for executing a MapReduce job. It can simply recommend those settings, or it can run the job with them.

Benchmark:
Cluster: 1 master node, 3 slave nodes
Program: WordCount
Data size: 4.45 GB

Normal execution: 8m 5s to complete the job
Starfish profiling and optimized execution: 4m 59s to complete the job
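For reference, a baseline WordCount run like the one above can be launched with the standard Hadoop examples jar; the jar location and HDFS paths here are assumptions, not details from the benchmark:

hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount /data/wordcount/input /data/wordcount/output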


What’s achieved?
  • Perform in-depth job analysis with profiles
  • Predict the behavior of hypothetical job executions
  • Optimize arbitrary MapReduce programs


Installation
It’s pretty easy to install.
  • Prerequisites:
    • A Hadoop cluster running version 0.20.2 or 0.20.203.0 should be up and running. Starfish has been tested on Cloudera distributions.
    • The Java JDK should be installed.
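Two quick sanity checks, assuming the hadoop and java commands are already on your PATH:

hadoop version     # should report 0.20.2 or 0.20.203.0
java -version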

  • Download the source code
    • Clone the repository: git clone https://github.com/hherodotou/Starfish
    • Or download the tarball from http://www.cs.duke.edu/starfish/files/starfish-0.3.0.tar.gz

  • Compile the source code
    • Compile the entire source code and create the jar files:
    ant

    • Execute all available JUnit tests and verify the code was compiled successfully:
    ant test

    • Generate the javadoc documentation in docs/api:
    ant javadoc
Ensure that the JAVA_HOME and HADOOP_HOME environment variables are set in ~/.bashrc.
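For example, the entries might look like this (the paths are assumptions; adjust them to your installation):

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/usr/lib/hadoop-0.20
export PATH=$PATH:$HADOOP_HOME/bin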

  • BTrace Installation on the Slave Nodes
After compilation, the btrace directory will contain all the required classes and jars. These must be shipped to the slave nodes.

  • Create a file "slaves_list.txt" on the master node
This file should contain the slave nodes' IP addresses or hostnames. If you use hostnames, make sure they are mapped in the master node's /etc/hosts (IP address and the respective slave hostname); an example mapping is shown after the listing below.
Example :
$vi slaves_list.txt
slave1
slave2
slave3
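The corresponding /etc/hosts entries on the master node might look like this (the IP addresses are made up for illustration):

192.168.1.101   slave1
192.168.1.102   slave2
192.168.1.103   slave3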
  • Set the global profile parameters in bin/config.sh (example values are sketched after this list)

  • SLAVES_BTRACE_DIR: BTrace installation directory at the slave nodes. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.
  • CLUSTER_NAME: A descriptive name for the cluster. Do not include spaces or special characters in the name.
  • PROFILER_OUTPUT_DIR: The local directory to place the collected logs and profile files. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.
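Assuming bin/config.sh uses plain shell variable assignments, the three settings might look like this (the values are placeholders):

SLAVES_BTRACE_DIR=/usr/local/starfish/btrace
CLUSTER_NAME=test_cluster
PROFILER_OUTPUT_DIR=/home/hadoop/starfish_results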

  • Run the script
bin/install_btrace.sh <absolute_path_slaves_list.txt>

  • This will copy the BTrace jars into the SLAVES_BTRACE_DIR on each slave node.
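An optional quick check that the jars arrived, assuming passwordless SSH to the slaves and the SLAVES_BTRACE_DIR value sketched above:

ssh slave1 "ls /usr/local/starfish/btrace"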

That is all for the installation.

Execution then consists of:
  • Profiling
  • Job analysis
  • What-if analysis
  • Optimization
Follow the execution steps starting at http://www.cs.duke.edu/starfish/tutorial/profile.html.

The link http://www.cs.duke.edu/starfish/tutorial/ is a great resource for getting started with both installation and execution. The documentation is equally great!
Happy Learning! :)


Published at DZone with permission of Swathi Venkatachala, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
