Lease Coordination Under Serializable Isolation in CockroachDB
Designing scalable lease coordination in CockroachDB, focusing on key distribution, concurrency, and reducing transaction conflicts in multi-region systems.
Join the DZone community and get the full member experience.
Join For FreeMulti-region systems that rely on entity-scoped write coordination often reach a scale where correctness is no longer the primary challenge; predictability under sustained concurrency is. CockroachDB’s serializable isolation model makes lease-based coordination attractive because it eliminates external lock services while preserving strict ordering guarantees. Early architecture reviews typically focus on invariants: single-writer enforcement, fencing epochs, failover safety, and replication durability. Those properties are easy to reason about in isolation. The long-term constraint is not logical soundness. It is how that coordination behaves once leased traffic becomes a first-class component of system throughput.
In distributed platforms operating under constant cross-region load, coordination traffic grows proportionally with entity activity. Lease renewals, takeovers, and epoch increments can exceed the mutation volume of the business data they protect. At that point, coordination is not metadata. It is part of the write path. Under serializable isolation, every write participates in intent resolution and conflict detection at the range level. When coordination keys are not deliberately distributed, restart pressure accumulates in predictable physical locations, even if aggregate cluster metrics remain healthy.
The refactor pressure does not come from failure. It comes from concurrency envelope expansion.
Where the Original Model Holds and Where It Doesn’t
The canonical lease model is straightforward. Each entity maps to a single row. Writers acquire ownership transactionally and increment a fencing epoch. Expiration timestamps guard against stale holders. The schema often resembles:
CREATE TABLE entity_leases (
entity_id STRING PRIMARY KEY,
holder_region STRING NOT NULL,
fencing_epoch INT8 NOT NULL,
lease_expires_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Acquisition logic operates under serializable isolation:
func AcquireLease(ctx context.Context, db *sql.DB, entityID string, region string, ttl time.Duration) error {
tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
if err != nil {
return err
}
var holder string
var epoch int64
var expires time.Time
err = tx.QueryRowContext(ctx,
`SELECT holder_region, fencing_epoch, lease_expires_at
FROM entity_leases
WHERE entity_id = $1
FOR UPDATE`,
entityID).Scan(&holder, &epoch, &expires)
now := time.Now().UTC()
if err == sql.ErrNoRows || expires.Before(now) {
_, err = tx.ExecContext(ctx,
`UPSERT INTO entity_leases
(entity_id, holder_region, fencing_epoch, lease_expires_at, updated_at)
VALUES ($1, $2, $3, $4, now())`,
entityID, region, epoch+1, now.Add(ttl))
if err != nil {
_ = tx.Rollback()
return err
}
return tx.Commit()
}
_ = tx.Rollback()
return fmt.Errorf("lease held")
}
At moderate concurrency, this model performs predictably. Conflicts are rare because entity-level activity is low relative to range capacity. CockroachDB distributes primary keys lexicographically across ranges, and entity IDs that appear random in design documentation are assumed to distribute evenly in practice.
The assumption that fails at scale is that entity identifier distribution correlates with concurrency distribution. In real systems, correlated workflows, batch operations, or localized bursts of entity activity concentrate writes across subsets of identifiers. Under serializable isolation, overlapping write intents on the same range result in transaction restarts. Restart cost grows nonlinearly as conflict density increases.
The cluster does not fail. CPU does not saturate. Replicas remain healthy. What shifts is the proportion of time spent replaying transactions due to conflict resolution. Tail latency begins to track restart density rather than raw write volume.
Physical Key Distribution as the True Constraint
CockroachDB’s range splitting is deterministic and transparent. It partitions data by primary key ordering and balances ranges across nodes. When entity_id is the sole primary key component, the physical distribution of lease rows depends entirely on that identifier’s ordering and activity pattern. If correlated entities cluster in the keyspace, the coordination workload becomes range-bound.
Adding secondary indexes compounds the issue. An index introduced for introspection or debugging, such as indexing holder_region, creates additional write amplification:
CREATE INDEX idx_holder_region ON entity_leases (holder_region);
Every renewal now updates multiple key ranges. Under sustained churn, coordination writes span more physical space, increasing conflict probability rather than reducing it.
The refactor discussion, therefore, shifts from transaction semantics to key topology. The invariants remain valid; the physical layout does not.
Refactoring the Coordination Topology
The structural adjustment is to introduce a deterministic shard component into the primary key so that coordination traffic distributes independently of entity skew. The revised schema introduces a bucket prefix:
CREATE TABLE entity_leases (
bucket_id INT8 NOT NULL,
entity_id STRING NOT NULL,
holder_region STRING NOT NULL,
fencing_epoch INT8 NOT NULL,
lease_expires_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (bucket_id, entity_id)
);
The bucket_id derives from a stable hash of the entity identifier:
const NumBuckets uint64 = 2048
func BucketFor(entityID string) int64 {
h := fnv.New64a()
_, _ = h.Write([]byte(entityID))
return int64(h.Sum64() % NumBuckets)
}
Lease acquisition logic remains unchanged in its logical structure, but now scopes by (bucket_id, entity_id):
func AcquireLease(ctx context.Context, db *sql.DB, entityID string, region string, ttl time.Duration) error {
bucket := BucketFor(entityID)
tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
if err != nil {
return err
}
var holder string
var epoch int64
var expires time.Time
err = tx.QueryRowContext(ctx,
`SELECT holder_region, fencing_epoch, lease_expires_at
FROM entity_leases
WHERE bucket_id = $1 AND entity_id = $2
FOR UPDATE`,
bucket, entityID).Scan(&holder, &epoch, &expires)
now := time.Now().UTC()
if err == sql.ErrNoRows || expires.Before(now) {
_, err = tx.ExecContext(ctx,
`UPSERT INTO entity_leases
(bucket_id, entity_id, holder_region, fencing_epoch, lease_expires_at, updated_at)
VALUES ($1, $2, $3, $4, $5, now())`,
bucket, entityID, region, epoch+1, now.Add(ttl))
if err != nil {
_ = tx.Rollback()
return err
}
return tx.Commit()
}
_ = tx.Rollback()
return fmt.Errorf("lease held")
}
This change does not alter the lease invariant or fencing semantics. It alters how CockroachDB distributes coordination rows across ranges. Correlated entity activity no longer maps to correlated physical ranges. Restart amplification becomes statistically bounded by shard count rather than dictated by key ordering.
Temporal Effects and Renewal Discipline
Even after redistributing keys, renewal cadence remains relevant. Uniform TTL values cause synchronized renewal bursts, which increase conflict probability within short time windows. Introducing bounded jitter into renewal duration reduces correlated write pressure without changing expiration semantics:
func LeaseTTL(base time.Duration) time.Duration {
jitter := time.Duration(rand.Int63n(int64(base / 4)))
return base + jitter
}
Restart-aware exponential backoff further reduces retry amplification under transient contention:
func Backoff(attempt int) time.Duration {
base := 15 * time.Millisecond
return base * time.Duration(1<<attempt)
}
These refinements are secondary to the structural key redistribution but are necessary once coordinated traffic is a material portion of the overall write throughput.
Concurrency Envelope as a Design Boundary
The architectural takeaway is not that lease models are flawed, nor that serializable isolation is costly. The takeaway is that physical key topology defines the concurrency envelope of coordination paths. Logical clarity in schema design does not guarantee range-level stability under sustained multi-region concurrency.
In long-lived distributed platforms, coordination must be treated as infrastructure. Its key distribution, index footprint, and renewal cadence determine whether restart density remains incidental or becomes systemic. CockroachDB enforces correctness deterministically. The responsibility to shape coordinated traffic so that correctness does not consume throughput belongs to the architecture.
Opinions expressed by DZone contributors are their own.
Comments