DZone
Big Data Zone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
  • Refcardz
  • Trend Reports
  • Webinars
  • Zones
  • |
    • Agile
    • AI
    • Big Data
    • Cloud
    • Database
    • DevOps
    • Integration
    • IoT
    • Java
    • Microservices
    • Open Source
    • Performance
    • Security
    • Web Dev
DZone > Big Data Zone > Keeping it In-Memory: A Look at Tachyon/Alluxio

Keeping it In-Memory: A Look at Tachyon/Alluxio

Even with frameworks such as Spark, which take advantage of in-memory computation for individual jobs, sharing of the output is usually limited by network bandwidth or disk throughput.

Siddharth Agarwal user avatar by
Siddharth Agarwal
·
Aug. 18, 16 · Big Data Zone · Analysis
Like (2)
Save
Tweet
3.63K Views

Join the DZone community and get the full member experience.

Join For Free

A core tenet of the Hadoop cluster architecture is the replication of data blocks across nodes. The reasoning behind this is that it protects against the failure of an individual node by having backup data elsewhere. An issue with this is that slow write speeds to the HDFS can hurt the performance of a job with multiple steps, especially when the size of the output of a step is close to the size of the input. Even with frameworks such as Spark, which take advantage of in-memory computation for individual jobs, sharing of the output is usually limited by network bandwidth or disk throughput.

Improving Lineage, Checkpoints, and Storage

Tachyon/Alluxio is an open source 'memory-centric distributed storage system', introduced by a group of researchers from UC Berkeley and MIT in a seminal paper in 2014. Tachyon/Alluxio creates a transparent file system that sits on top of and mirrors the native file system (HDFS, S3, etc) that is available to jobs to write to/read from. A couple concepts are at work here:

  • Lineage: Tachyon/Alluxio provides an API to record the inputs/outputs of jobs to provide fault-tolerance. Using a directed acyclic graph (DAG) of all jobs, the optimal datasets are checkpointed to guarantee a bounded recomputation time.
  • Asynchronous checkpointing: ‘hot’ files (files accessed a lot) and leaves of the DAG are checkpointed to the storage layer.
  • Tiered storage system: Storage is filled in the order of memory -> SSD -> HDD.

An Example Workflow

Suppose I have a workflow with a couple steps: I run a query on some data in a Hive table using Spark SQL, take the output of that and run a Spark job on it, then use a Spark ML library to run some analysis. For parts 1 and 2, I may have outputs that are close to the size of my inputs. In a traditional HDFS-centric system, the performance of the overall workflow would be bounded by the write throughput. However, with Tachyon/Alluxio, we leave the output in-memory, able to be shared between jobs. Our workflow now looks like this:

  1. Run a query (S1) on data in a data warehouse: Output O1 is stored in Tachyon.
  2. Run a Spark transformation (S2) on O1: Output O2 is stored in Tachyon.
    1. At this point the lineage information for S1 is stored by Tachyon, so O1 might be removed from memory based on resource allocation.
  3. Run a Spark ML algorithm on O2: output is a model.

memory.png

The above image shows a view of a Tachyon/Alluxio file system. ‘hello’ and ‘hello2’ are two files that are currently cached in-memory.

Evolving Compatibility and Deployment

Existing Spark and MapReduce programs are compatible with Tachyon/Alluxio without any code change. Tachyon also has multiple deployment options, including integration with Mesos and YARN. There are also a case study at Baidu that is available online for further reading.

File system Database

Published at DZone with permission of Siddharth Agarwal. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • API Security Weekly: Issue 171
  • An Overview of DTrace and strace
  • Roles in Data Science Teams
  • Difference Between Data Mining and Data Warehousing

Comments

Big Data Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • MVB Program
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends:

DZone.com is powered by 

AnswerHub logo