Resilient Distributed Datasets (RDDs)

by Furkan Kamaci · Jun. 24, 15 · Interview

In this post I’ll discuss the RDD paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. If you haven’t read my post about Spark, I strongly advise you to read it first: Spark: Cluster Computing with Working Sets.

RDD Abstraction

RDDs are a distributed memory abstraction that improves application performance for iterative algorithms and interactive data mining tools while remaining fault tolerant. Other cluster computing frameworks, such as MapReduce and Dryad, lack abstractions for leveraging distributed memory, which makes them inefficient for operations that reuse intermediate results. Data reuse is common in many iterative machine learning and graph algorithms, e.g., K-means clustering, logistic regression, and PageRank.
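To make the cost of missing data reuse concrete, here is a minimal sketch in plain Python (not Spark). The names `load_points` and `fit_slope` and the tiny dataset are hypothetical, chosen only for illustration. The sketch counts how often an "expensive" load step runs when an iterative algorithm reloads its input on every pass, compared with loading it once and keeping it in memory.

```python
# Toy illustration of data reuse across iterations (plain Python, not Spark).

load_count = 0  # how many times the "expensive" load step has run


def load_points():
    """Hypothetical stand-in for reading and parsing input from stable storage."""
    global load_count
    load_count += 1
    return [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # (x, y) pairs


def fit_slope(get_points, iterations=50, learning_rate=0.01):
    """Gradient descent for y ~ w * x; get_points() is called on every iteration."""
    w = 0.0
    for _ in range(iterations):
        points = get_points()
        gradient = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= learning_rate * gradient
    return w


# Without reuse: every iteration reloads the input.
w_reload = fit_slope(load_points)
reloads_without_cache = load_count

# With reuse: load once, keep the result in memory, and share it across iterations.
load_count = 0
cached = load_points()
w_cached = fit_slope(lambda: cached)
reloads_with_cache = load_count
```

Both runs produce the same model, but the first one performs the load step once per iteration while the second performs it once in total. In a real cluster the load step would involve disk or network I/O, which is where the difference matters.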

An RDD is a read-only, partitioned collection of records. RDDs provide an interface based on coarse-grained transformations (e.g., map, filter, and join), which makes fault tolerance efficient, and they are implemented in Spark. To use Spark, developers write a driver program that connects to a cluster of workers. The driver defines one or more RDDs and invokes actions on them, and the Spark code on the driver tracks the lineage of the RDDs.

[Figure: Spark Runtime]
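The idea of a read-only, partitioned dataset that remembers how it was derived can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark's API: the class name `ToyRDD` and its methods are hypothetical, everything runs in a single process, and only `map` and `filter` are modeled.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)  # frozen: instances cannot be modified after creation
class ToyRDD:
    """Hypothetical read-only, partitioned dataset that records its lineage."""
    name: str
    parent: Optional["ToyRDD"] = None
    # Function applied to one whole parent partition (coarse-grained).
    transform: Optional[Callable[[tuple], tuple]] = None
    # Only source datasets hold data directly, as a tuple of partitions.
    source: Optional[Tuple[tuple, ...]] = None

    def map(self, f):
        # A transformation returns a new dataset; nothing is computed yet.
        return ToyRDD(f"map({self.name})", self,
                      lambda part: tuple(f(x) for x in part))

    def filter(self, pred):
        return ToyRDD(f"filter({self.name})", self,
                      lambda part: tuple(x for x in part if pred(x)))

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, index):
        """Compute one partition by walking the lineage back to the source."""
        if self.parent is None:
            return self.source[index]
        return self.transform(self.parent.compute(index))

    def collect(self):
        """Action: materialize every partition and return all records."""
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]

    def lineage(self):
        """Names of this dataset and its ancestors, newest first."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


numbers = ToyRDD("numbers", source=((1, 2, 3), (4, 5, 6)))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = evens_squared.collect()  # [4, 16, 36]
```

Each transformation applies one function to every record of a partition, so the lineage stays small: it stores the chain of operations rather than a log of individual updates.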

Advantages of RDD Model

RDDs can be compared to DSM (Distributed Shared Memory) systems because both are distributed memory abstractions. An RDD has enough information about how it was derived from other datasets, so a program cannot reference an RDD that it cannot reconstruct after a failure. Unlike DSM systems, RDDs do not need to incur the overhead of checkpointing, since they can be recovered using lineage: only the lost partitions of an RDD need to be recomputed after a failure, and they can be recomputed in parallel on different nodes.
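Lineage-based recovery can be illustrated with a small plain-Python sketch. It is a toy model, not Spark's implementation: the dictionaries standing in for stable storage and for the in-memory cache, and the helper names `compute_partition` and `get_partition`, are assumptions made for this example. It runs in a single process, so the parallel recomputation on different nodes is not modeled.

```python
# Toy model of lineage-based recovery (plain Python, not Spark's implementation).

# Hypothetical stable storage: partition id -> records.
source_partitions = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}

# Lineage: the ordered coarse-grained steps that derive the dataset from the source.
lineage = [
    lambda part: [x * 10 for x in part],       # map
    lambda part: [x for x in part if x > 20],  # filter
]

recomputed = []  # partition ids rebuilt from lineage, kept for inspection


def compute_partition(pid):
    """Rebuild one partition by replaying the lineage on its source partition."""
    part = source_partitions[pid]
    for step in lineage:
        part = step(part)
    recomputed.append(pid)
    return part


def get_partition(cache, pid):
    """Return the cached partition, recomputing it only if it is missing."""
    if pid not in cache:
        cache[pid] = compute_partition(pid)
    return cache[pid]


# Materialize all partitions in memory once.
cache = {pid: compute_partition(pid) for pid in source_partitions}
before_failure = {pid: list(part) for pid, part in cache.items()}

recomputed.clear()
del cache[1]  # simulate losing the worker that held partition 1

after_failure = {pid: get_partition(cache, pid) for pid in source_partitions}
```

After the simulated failure, `recomputed` contains only partition 1, and the dataset is identical to its state before the failure. The surviving partitions are served from memory without any extra work.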

Because operations on RDDs are bulk operations, the runtime can schedule tasks based on data locality to improve performance. Also, when there is not enough RAM, partitions that do not fit can be stored on disk, which provides performance similar to current data-parallel systems.
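As a rough illustration of locality-aware scheduling, the sketch below assigns each task to a worker that already holds the task's partition when that worker has a free slot, and falls back to any free worker otherwise. This greedy heuristic, the function name `schedule`, and the worker names are assumptions made for the example; it is not a description of Spark's actual scheduler.

```python
def schedule(tasks, locations, slots):
    """Greedily assign each task (a partition id) to a worker.

    tasks: list of partition ids, one task per partition
    locations: partition id -> set of workers that hold that partition
    slots: worker -> number of free task slots
    """
    free = dict(slots)  # copy so the caller's slot counts are not modified
    assignment = {}
    for pid in tasks:
        # Prefer workers that already hold the data and still have a free slot.
        local = [w for w in sorted(locations.get(pid, ())) if free.get(w, 0) > 0]
        # Otherwise fall back to any worker with a free slot.
        candidates = local or [w for w in sorted(free) if free[w] > 0]
        if not candidates:
            raise RuntimeError("no free slots left")
        worker = candidates[0]
        free[worker] -= 1
        assignment[pid] = worker
    return assignment


locations = {0: {"worker-a"}, 1: {"worker-b"}, 2: {"worker-a"}, 3: {"worker-c"}}
slots = {"worker-a": 1, "worker-b": 2, "worker-c": 1}
plan = schedule([0, 1, 2, 3], locations, slots)
```

Here partitions 0, 1, and 3 run where their data lives. Partition 2 cannot, because `worker-a` has only one slot, so it is placed on `worker-b` and its data would have to be fetched remotely.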

Applications Not Suitable for RDDs

RDDs are best suited for batch applications that apply the same operation to all elements of a dataset. They are less suitable for applications that make asynchronous, fine-grained updates to shared state, such as an incremental web crawler or a storage system for a web application.

Evaluation

Performance comparison results for iterative machine learning applications:

[Figure: Hadoop Spark Comparison]

Response time comparison for interactive queries:

[Figure: Interactive Query Performance]

Conclusion

RDDs are an efficient, general-purpose, and fault-tolerant data sharing abstraction for cluster computing, and they are well suited to iterative machine learning algorithms. RDDs offer an API based on coarse-grained transformations, which lets them recover data using lineage. RDDs are implemented in Spark, which outperforms Hadoop by up to 20x in iterative applications and can be used interactively to query large volumes of data.

Published at DZone with permission of Furkan Kamaci, DZone MVB. See the original article here.
