Kyle Kingsbury had a blog post last month that showed some interesting insight into the Apache Kafka, a distributed messaging system. The developers of Kafka are planning to introduce replication to the 0.8 release, which should improve Kafka's durability and availability by duplicating each shard’s data across multiple nodes.
The post does an excellent job of explaining Kafka's in plain English. Anyone who's still not clear about what Kafka is or how it works should spend a minute reading this post. Kingsbury closes with some potential issues he sees in the current implementation of the replication feature and some recommendations to solve these issues:
Kafka’s replication claimed to be CA, but in the presence of a partition, threw away an arbitrarily large volume of committed writes. It claimed tolerance to F-1 failures, but a single node could cause catastrophe.
I made two recommendations to the Kafka team:
- Ensure that the ISR never goes below N/2 nodes. This reduces the probability of a single node failure causing the loss of commited writes.
- In the event that the ISR becomes empty, block and sound an alarm instead of silently dropping data. It’s OK to make this configurable, but as an administrator, you probably want to be aware when a datastore is about to violate one of its constraints–and make the decision yourself. It might be better to wait until an old leader can be recovered. Or perhaps the administrator would like a dump of the to-be-dropped writes which could be merged back into the new state of the cluster.