DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
Building Scalable Real-Time Apps with AstraDB and Vaadin
Register Now

Trending

  • Microservices Decoded: Unraveling the Benefits, Challenges, and Best Practices for APIs
  • How To Integrate Microsoft Team With Cypress Cloud
  • Why You Should Consider Using React Router V6: An Overview of Changes
  • Top 10 Engineering KPIs Technical Leaders Should Know

Trending

  • Microservices Decoded: Unraveling the Benefits, Challenges, and Best Practices for APIs
  • How To Integrate Microsoft Team With Cypress Cloud
  • Why You Should Consider Using React Router V6: An Overview of Changes
  • Top 10 Engineering KPIs Technical Leaders Should Know

RavenFS & Rsync

Oren Eini user avatar by
Oren Eini
·
May. 02, 11 · News
Like (0)
Save
Tweet
Share
4.16K Views

Join the DZone community and get the full member experience.

Join For Free

One of the things that I considered when starting to design RavenFS is “Do we really need this? Can’t we just use rsync?”

There were several reasons why I decided to go ahead with RavenFS. The first was that rsync isn’t really used on the Windows platform, for various reasons that I won’t get into.

The second is that rsync is meant to be a general purpose file synchronization tool. RavenFS is meant to be more than that.

More specifically, RavenFS is aimed specifically at distribution and synchronization of large files (tens of MB are considered small, hundreds of MB are common and multiple GBs are frequently used). It turns out that very large files mean a whole different kettle of fish. I am going to reference the thesis project of Andrew Tridgell quite frequently for the rest of this post (and you can find the link at the bottom of this post). Andrew is the author of rsync, so he probably knows a thing or two about synchronization problems.

In particular, he had thought about and discarded my approach for synchronizing files in his thesis:

This algorithm is very simple and meets some of the aims of our remote update algorithm, but it is useless in practice
The problem with it is that A can only find matches that are on block boundaries. If the file on A is the same as B except that one byte has been inserted at the start of the file then no block matches will be found and the algorithm will transfer the whole file.

This is absolutely correct. My approach, by design, suffer from the “inserting a single byte at beginning of file”. Why do I take such a stupid approach?

Put simply, because anything else is too expensive. The rsync approach is to compute a hash at byte boundaries, it gets away with that with having multiple hash functions, one of which is very cheap to compute (with higher number of collisions possible) and another that is more expensive to compute but has far lower probability of collisions.

Now, go up and look at the description of RavenFS. It is meant for very large files. Are you really going to perform an operation on every single byte on the file when the file is 2 GB in size?

This is where a very interesting property of file systems come into place. Let us assume that you have the following file, shown as a sequence of bytes:

[1,2,3,4]

And let us say that we want to add the byte 0 at the beginning of the file. For now, we will assume buffer size of 1, we will have to issue the following commands to the file system to do so:

  • Write 0 to position 1
  • Write 1 to position 2
  • Write 2 to position 3
  • Write 3 to position 4
  • Write 4 to position 5 <—increase the file size

In other words, inserting a value (vs modifying a value) is an O( File.Length – File.Position). In other words, the closer to the beginning of the file you are, the more expensive it is to insert a new value.

Ponder that while considering the cost of doing something like that on files that are big. Let us assume a more reasonable file size of 4,096 (the .NET default), and we realize that inserting a value into the beginning of a 500 MB file would require 128 thousand file system operations.

In practice, this isn’t a real issue because most of the time, when we are talking about such large files, we are talking about either files that are using fixed sized records ( no inserting whatsoever ) or appending data to the end of the file. That is mostly because there really isn’t any other choice for them, in order to preserve reasonable performance.

RavenFS is meant to take advantage on this aspect to make the “useless in practice” option a viable option.

On a separate issue, this quote made me laugh:

The lack of such a routine was quite a surprise and prompted me to look into the whole issue of parallel sorting, totally ignoring the fact that I only wanted a parallel sorting routine in order to solve a problem that didn’t really require sorting at all.

~ Andrew Tridgell http://samba.org/~tridge/phd_thesis.pdf

File system Rsync

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Trending

  • Microservices Decoded: Unraveling the Benefits, Challenges, and Best Practices for APIs
  • How To Integrate Microsoft Team With Cypress Cloud
  • Why You Should Consider Using React Router V6: An Overview of Changes
  • Top 10 Engineering KPIs Technical Leaders Should Know

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com

Let's be friends: