Scalability Cheatsheet: The Road to Paxos
Get some tips for journaling, using Paxos, and ensuring that your data is scalable.
We like journaling. It helps us avoid data corruption, because an in-place update can still fail halfway if there's a power outage or something like that. A journal is append-only, so nothing can really be corrupted except for the record you are appending. And if that record is corrupted, you simply don't consider it to have been appended.
Reading is harder, though. To rebuild the current state, you need to replay your whole journal, so from time to time you create a snapshot of your state and then augment it with only the journal entries appended after the snapshot.
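To make the journal-plus-snapshot idea concrete, here is a minimal in-memory sketch. The `Journal` class, its method names, and the CRC32 checksum per record are all assumptions made for illustration; a real journal would write to disk and flush.

```python
import json
import zlib


class Journal:
    """Append-only journal with snapshots (hypothetical, in-memory sketch)."""

    def __init__(self):
        self.records = []        # checksummed records, only ever appended
        self.snapshot = {}       # state as of snapshot_index
        self.snapshot_index = 0  # number of records folded into the snapshot

    def append(self, key, value):
        payload = json.dumps({"key": key, "value": value})
        checksum = zlib.crc32(payload.encode("utf-8"))
        self.records.append(f"{checksum}:{payload}")

    @staticmethod
    def _decode(record):
        """Return the decoded entry, or None if the record is corrupted."""
        checksum, _, payload = record.partition(":")
        try:
            if int(checksum) != zlib.crc32(payload.encode("utf-8")):
                return None
            return json.loads(payload)
        except ValueError:
            return None

    def read_state(self):
        """Start from the snapshot and replay only the records after it."""
        state = dict(self.snapshot)
        for record in self.records[self.snapshot_index:]:
            entry = self._decode(record)
            if entry is None:
                break  # a corrupted record counts as never appended
            state[entry["key"]] = entry["value"]
        return state

    def take_snapshot(self):
        self.snapshot = self.read_state()
        self.snapshot_index = len(self.records)
```

Reads never modify `records`, so they can run without locking writers out, and `take_snapshot` only shortens how much of the journal a later read has to replay.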
When you read, you just read. You don’t lock. You don’t care about the world. Read without disturbing writes. It's immutable. All is cool, dude, we're just reading.
What happens if two distributed machines try to write at the very same time, with the same timestamp and different content, on the same key? We have a problem even in append-only journaling. What can we do? Calling Turing might be a good idea, but he hasn't answered any of my calls for the last couple of years, so we'll have to figure something out!
Less is more. We came upon a secret weapon: eventual consistency. It basically means accepting that we can't have the best of both worlds, but we don't need to. For our case, we are going to be satisfied with less. Each participant has its own local, strongly consistent store, and the participants send each other updates. With eventual consistency, every node keeps a map of all the other nodes and notifies them when something happens, and the route those updates take can change. No node has precedence over the others holding the same data.
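The collision described above can be illustrated with a small sketch. The `(timestamp, key, value)` tuple layout and the `find_conflicts` helper are assumptions for illustration only; the function just detects the problem, it doesn't solve it.

```python
def find_conflicts(journal_a, journal_b):
    """Return the (timestamp, key) pairs that both journals wrote
    with different values. Entries are (timestamp, key, value) tuples."""
    seen = {(ts, key): value for ts, key, value in journal_a}
    return sorted(
        (ts, key)
        for ts, key, value in journal_b
        if (ts, key) in seen and seen[(ts, key)] != value
    )


machine_a = [(1, "x", "a"), (2, "y", "b")]
machine_b = [(1, "x", "z"), (2, "y", "b"), (3, "q", "r")]
conflicts = find_conflicts(machine_a, machine_b)  # [(1, "x")]
```

Both machines appended cleanly to their own journals, yet neither entry for `x` at timestamp 1 can claim to be the winner just by looking at the journals.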
The Non-Real Last State
Let me stress it again: It is impossible for anyone to know the current, real state. It's impossible to do a proper read-modify-write with the most recently updated, correct information without collisions, and on top of that, the whole scheme is vulnerable to network failures. But we already agreed, less is more. Much more!
The Part-Time Parliament
Paxos gives you strong consistency. The whole goal of Paxos is to reach consensus among members. With a quorum, Paxos reaches agreement among a chosen set of members; for example, you need more than half the nodes to agree. Paxos uses the part-time parliament example: In order to make a change, you have to get the agreement of the majority of the paxons. Any two majorities share at least one member, so you can't have two sets of paxons, each larger than half the group, holding different decisions.
Paxos is also for reading. When you need to read something, you want to know that you get the latest and greatest revision, so you need to get the majority of the paxons to agree on the latest revision.
In a Paxos read, the client asks all nodes for a value. An answer is valid when the majority of all nodes agree on the value, at least in the naive algorithm. There's no canonical place to store the answer. This is naive Paxos; there are better variants.
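The reason a majority works is that any two majorities must share a member. Here is a small brute-force check of that claim; the function names are made up for this sketch.

```python
from itertools import combinations


def majority(n):
    """Smallest group size that is more than half of n nodes."""
    return n // 2 + 1


def quorums_overlap(nodes, size):
    """True if every pair of `size`-member groups shares at least one node."""
    groups = [set(c) for c in combinations(nodes, size)]
    return all(a & b for a, b in combinations(groups, 2))


paxons = ["a", "b", "c", "d"]
with_majority = quorums_overlap(paxons, majority(len(paxons)))  # groups of 3
with_half = quorums_overlap(paxons, len(paxons) // 2)           # groups of 2
```

With four paxons, groups of three always overlap, while groups of exactly two (such as `{a, b}` and `{c, d}`) can be disjoint and could each pass a different decree. That is why "at least half" is not enough.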
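A sketch of the naive read described above, assuming the client has already collected the replies into a list. `naive_read` and its parameters are hypothetical names, and nodes that timed out are simply missing from `replies`.

```python
from collections import Counter


def naive_read(replies, total_nodes):
    """Return the value reported by more than half of ALL nodes, else None."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count > total_nodes // 2 else None


ok = naive_read(["v2", "v2", "v1", "v2"], total_nodes=5)    # "v2"
failed = naive_read(["v1", "v1", "v2", "v2"], total_nodes=4)  # None
```

Note that the majority is counted against the total number of nodes, not against the number of replies, so a read fails when too few nodes answer or when they disagree.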
A Paxos write looks like ask > promise > accept. Here, the client contacts a random node and asks it to write a value. The node takes the value, attaches a sequence number, and sends a prepare message (value, seq) to the other nodes. Each receiving node checks that seqnum is the highest it has seen before it promises to accept the proposal. If two clients instead sent their proposals straight to the nodes without this preparation step, each could end up with half the system agreeing.
Paxos has a way to generate sequence numbers that only grow over time. If a node has promised a proposal and has seen no higher sequence number since, it is guaranteed to accept that proposal. If we collect promises from more than half of the nodes before the timeout, we ask all the nodes that promised to accept. If only some nodes managed to accept the value, reads won't get a majority and will fail. It sucks, but at least we are not in an inconsistent state.
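The ask > promise > accept flow can be sketched for a single value as follows. This is a simplified, in-process illustration: the `Acceptor` class and `propose` function are assumed names, with no networking, timeouts, or retries. It also includes one rule the text above doesn't mention but that standard Paxos requires: a proposer that learns of an already accepted value must propose that value instead of its own.

```python
class Acceptor:
    """One node's promise/accept state for a single value (sketch)."""

    def __init__(self):
        self.promised_seq = -1
        self.accepted_seq = -1
        self.accepted_value = None

    def prepare(self, seq):
        """Promise only if seq is the highest sequence number seen so far."""
        if seq > self.promised_seq:
            self.promised_seq = seq
            return True, self.accepted_seq, self.accepted_value
        return False, None, None

    def accept(self, seq, value):
        """Accept unless a higher sequence number has been promised."""
        if seq >= self.promised_seq:
            self.promised_seq = seq
            self.accepted_seq = seq
            self.accepted_value = value
            return True
        return False


def propose(acceptors, seq, value):
    """Run one write round; return the chosen value, or None on failure."""
    needed = len(acceptors) // 2 + 1
    promises = []
    for acceptor in acceptors:
        ok, prev_seq, prev_value = acceptor.prepare(seq)
        if ok:
            promises.append((acceptor, prev_seq, prev_value))
    if len(promises) < needed:
        return None  # no majority of promises: the write fails
    # Standard Paxos rule: adopt the most recently accepted value, if any.
    prev_seq, prev_value = max(
        ((s, v) for _, s, v in promises), key=lambda pair: pair[0]
    )
    if prev_seq >= 0:
        value = prev_value
    accepted = sum(1 for acceptor, _, _ in promises if acceptor.accept(seq, value))
    return value if accepted >= needed else None


nodes = [Acceptor() for _ in range(5)]
first = propose(nodes, seq=1, value="x")  # "x" is chosen
stale = propose(nodes, seq=0, value="y")  # None: sequence number too low
later = propose(nodes, seq=2, value="y")  # "x": the chosen value sticks
```

The last call shows why the round costs more than a local journal write: even a proposer with a fresh sequence number has to go through the majority and ends up confirming the value that was already agreed on.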
The cost of a write is that it requires consensus among the majority of members. You can no longer just write to your local journal.
Master election is yet another example of agreeing on something. In this case, you agree on the master. You use an expensive, strongly consistent store to decide who is in charge. Once you all agree on a master, it simplifies your processes: whenever you need to read or write, you contact the master, which eliminates further agreement rounds on every read and write.
[Illustration of this process; image not included.]
Published at DZone with permission of Tomer Ben David , DZone MVB. See the original article here.