Four Ways to Write to Alluxio

Find the Write Type that works for your project.

By Zac Blanco · Aug. 22, 19 · Tutorial

Alluxio is an open-source data orchestration system for analytics and AI workloads. Distributed applications like Apache Spark or Apache Hive can access Alluxio through its HDFS-compatible interface without any code changes. We refer to external storage, such as HDFS or S3, as under storage. Alluxio is a new layer on top of under storage systems that not only improves raw I/O performance but also gives applications flexible options for reading, writing, and managing files. This article describes the different ways to write files to Alluxio and examines the tradeoffs in performance, consistency, and fault tolerance compared to HDFS.
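Because Alluxio exposes an HDFS-compatible interface, moving a job from direct HDFS access to Alluxio is typically just a matter of swapping the URI scheme. The plain-Python helper below is only an illustration of that idea; the hostnames are placeholders for your own deployment (19998 is Alluxio's default master RPC port):

```python
from urllib.parse import urlparse


def to_alluxio_uri(hdfs_uri: str, alluxio_authority: str = "alluxio-master:19998") -> str:
    """Rewrite an hdfs:// URI to the equivalent alluxio:// URI.

    The point: only the scheme and authority change -- the application's
    read/write code stays exactly the same.
    """
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    return f"alluxio://{alluxio_authority}{parsed.path}"


print(to_alluxio_uri("hdfs://namenode:8020/warehouse/events"))
# alluxio://alluxio-master:19998/warehouse/events
```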

Consider an application, such as a Spark job, that saves its output to an external storage service. Writing the job output to the memory layer of a colocated Alluxio worker achieves the best write performance. However, because memory is volatile, any data stored in a node's memory is lost when that node goes down or restarts. To prevent data loss, Alluxio can write data to the persistent under storage either synchronously or asynchronously, controlled by client-side Write Types. Each Write Type has benefits and drawbacks, so applications that write to Alluxio storage should weigh the options and perform a cost-benefit analysis to determine the Write Type best suited to their requirements.

A summary of the available Write Types is listed below:

| Write Type | Description | Write Speed | Fault Tolerance |
| --- | --- | --- | --- |
| MUST_CACHE | Writes directly to Alluxio memory. | Very fast. | Data loss if a worker crashes. |
| THROUGH | Writes directly to under storage. | Limited to under storage throughput. | Dependent upon under storage. |
| CACHE_THROUGH | Writes to Alluxio and under storage synchronously. | Data in memory and persisted to under storage synchronously. | Dependent upon under storage. |
| ASYNC_THROUGH | Writes to Alluxio first, then asynchronously writes to the under storage. | Nearly as fast as MUST_CACHE; data persisted to under storage without user interaction. | Possible to lose data if only one replica is written. |
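Besides per-job overrides, a default write type can be set for all clients through Alluxio's site configuration. This is a minimal sketch: the property name matches the one used in the spark-submit example in this article, and CACHE_THROUGH is just a sample value.

```properties
# conf/alluxio-site.properties on the client machine.
# Value is one of MUST_CACHE, THROUGH, CACHE_THROUGH, ASYNC_THROUGH.
alluxio.user.file.writetype.default=CACHE_THROUGH
```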


Write Types are a client-side property, which means they can be modified when submitting an application without restarting any Alluxio processes. For example, to set the Alluxio write type to CACHE_THROUGH when submitting a Spark job, add the following options to the spark-submit command:

$ spark-submit \
--conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \
--conf 'spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \
...


Here is some general advice for choosing the right Write Type for your application:

  • For temporary data that doesn’t need to be saved, or data that is very cheap to regenerate, use MUST_CACHE to write directly to Alluxio memory. Alluxio may replicate the data over time; this is the least safe option but the most performant.
  • For data that will not be used in the near term, use THROUGH to write it directly from the client application to the under storage, persisting it immediately without caching another copy. This leaves more room in Alluxio storage for data that needs to be read quickly and frequently.
  • For data that must be persisted by the time the writer application returns, and that will be read by other Alluxio applications soon afterward, use CACHE_THROUGH to write the data to both Alluxio and the under storage. Note that Alluxio may create additional replicas over time based on the data access pattern.
  • For data that needs to be persisted but will not be used immediately, use ASYNC_THROUGH, which writes directly to Alluxio and then asynchronously persists the data to the UFS.
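The guidelines above boil down to a few yes/no questions. The function below is purely an illustrative sketch of that decision logic, not an Alluxio API:

```python
def choose_write_type(must_persist: bool, persist_synchronously: bool, read_soon: bool) -> str:
    """Pick an Alluxio Write Type from the guidelines above.

    must_persist:           must the data survive an Alluxio worker failure?
    persist_synchronously:  must it be durable by the time the writer returns?
    read_soon:              will other jobs read it in the near term?
    """
    if not must_persist:
        # Temporary or cheaply regenerated data: fastest, least safe.
        return "MUST_CACHE"
    if not persist_synchronously:
        # Durable eventually, but no need to wait on the under storage.
        return "ASYNC_THROUGH"
    if read_soon:
        # Durable now *and* hot: pay for the synchronous write to both layers.
        return "CACHE_THROUGH"
    # Durable now, but cold: skip caching to leave room in Alluxio storage.
    return "THROUGH"


print(choose_write_type(must_persist=True, persist_synchronously=True, read_soon=True))
# CACHE_THROUGH
```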

Opinions expressed by DZone contributors are their own.

Related

  • Improving Spark Memory Resource With Off-Heap In-Memory Storage
  • Tutorial: Presto + Alluxio + Hive Metastore on Your laptop in 10 Minutes
  • Getting Started With EMR Hive on Alluxio in 10 Minutes
  • Running Alluxio-Presto Sandbox in Docker
