Over a million developers have joined DZone.

Keeping it In-Memory: A Look at Tachyon/Alluxio

DZone's Guide to

Keeping it In-Memory: A Look at Tachyon/Alluxio

Even with frameworks such as Spark, which take advantage of in-memory computation for individual jobs, sharing of the output is usually limited by network bandwidth or disk throughput.

· Big Data Zone
Free Resource

Need to build an application around your data? Learn more about dataflow programming for rapid development and greater creativity. 

A core tenet of the Hadoop cluster architecture is the replication of data blocks across nodes. The reasoning behind this is that it protects against the failure of an individual node by having backup data elsewhere. An issue with this is that slow write speeds to the HDFS can hurt the performance of a job with multiple steps, especially when the size of the output of a step is close to the size of the input. Even with frameworks such as Spark, which take advantage of in-memory computation for individual jobs, sharing of the output is usually limited by network bandwidth or disk throughput.

Improving Lineage, Checkpoints, and Storage

Tachyon/Alluxio is an open source 'memory-centric distributed storage system', introduced by a group of researchers from UC Berkeley and MIT in a seminal paper in 2014. Tachyon/Alluxio creates a transparent file system that sits on top of and mirrors the native file system (HDFS, S3, etc) that is available to jobs to write to/read from. A couple concepts are at work here:

  • Lineage: Tachyon/Alluxio provides an API to record the inputs/outputs of jobs to provide fault-tolerance. Using a directed acyclic graph (DAG) of all jobs, the optimal datasets are checkpointed to guarantee a bounded recomputation time.
  • Asynchronous checkpointing: ‘hot’ files (files accessed a lot) and leaves of the DAG are checkpointed to the storage layer.
  • Tiered storage system: Storage is filled in the order of memory -> SSD -> HDD.

An Example Workflow

Suppose I have a workflow with a couple steps: I run a query on some data in a Hive table using Spark SQL, take the output of that and run a Spark job on it, then use a Spark ML library to run some analysis. For parts 1 and 2, I may have outputs that are close to the size of my inputs. In a traditional HDFS-centric system, the performance of the overall workflow would be bounded by the write throughput. However, with Tachyon/Alluxio, we leave the output in-memory, able to be shared between jobs. Our workflow now looks like this:

  1. Run a query (S1) on data in a data warehouse: Output O1 is stored in Tachyon.
  2. Run a Spark transformation (S2) on O1: Output O2 is stored in Tachyon.
    1. At this point the lineage information for S1 is stored by Tachyon, so O1 might be removed from memory based on resource allocation.
  3. Run a Spark ML algorithm on O2: output is a model.


The above image shows a view of a Tachyon/Alluxio file system. ‘hello’ and ‘hello2’ are two files that are currently cached in-memory.

Evolving Compatibility and Deployment

Existing Spark and MapReduce programs are compatible with Tachyon/Alluxio without any code change. Tachyon also has multiple deployment options, including integration with Mesos and YARN. There are also a case study at Baidu that is available online for further reading.

Check out the Exaptive data application Studio. Technology agnostic. No glue code. Use what you know and rely on the community for what you don't. Try the community version.

in-memory ,hadoop ,spark sql ,spark ,mapreduce

Published at DZone with permission of Siddharth Agarwal, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.


Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.


{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}