Lease Coordination Under Serializable Isolation in CockroachDB

Designing scalable lease coordination in CockroachDB, focusing on key distribution, concurrency, and reducing transaction conflicts in multi-region systems.

Nishant Jain

May. 14, 26 · Analysis

Likes (0)

Comment

Save

1.6K Views

Multi-region systems that rely on entity-scoped write coordination often reach a scale where correctness is no longer the primary challenge; predictability under sustained concurrency is. CockroachDB’s serializable isolation model makes lease-based coordination attractive because it eliminates external lock services while preserving strict ordering guarantees. Early architecture reviews typically focus on invariants: single-writer enforcement, fencing epochs, failover safety, and replication durability. Those properties are easy to reason about in isolation. The long-term constraint is not logical soundness. It is how that coordination behaves once leased traffic becomes a first-class component of system throughput.

In distributed platforms operating under constant cross-region load, coordination traffic grows proportionally with entity activity. Lease renewals, takeovers, and epoch increments can exceed the mutation volume of the business data they protect. At that point, coordination is not metadata. It is part of the write path. Under serializable isolation, every write participates in intent resolution and conflict detection at the range level. When coordination keys are not deliberately distributed, restart pressure accumulates in predictable physical locations, even if aggregate cluster metrics remain healthy.

The refactor pressure does not come from failure. It comes from concurrency envelope expansion.

Where the Original Model Holds and Where It Doesn’t

The canonical lease model is straightforward. Each entity maps to a single row. Writers acquire ownership transactionally and increment a fencing epoch. Expiration timestamps guard against stale holders. The schema often resembles:

    SQL
   
 

   CREATE TABLE entity_leases (
 entity_id        STRING PRIMARY KEY,
 holder_region    STRING NOT NULL,
 fencing_epoch    INT8   NOT NULL,
 lease_expires_at TIMESTAMPTZ NOT NULL,
 updated_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);
  

Acquisition logic operates under serializable isolation:

    Go
   
 

   func AcquireLease(ctx context.Context, db *sql.DB, entityID string, region string, ttl time.Duration) error {
   tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
   if err != nil {
       return err
   }

   var holder string
   var epoch int64
   var expires time.Time

   err = tx.QueryRowContext(ctx,
       `SELECT holder_region, fencing_epoch, lease_expires_at
        FROM entity_leases
        WHERE entity_id = $1
        FOR UPDATE`,
       entityID).Scan(&holder, &epoch, &expires)

   now := time.Now().UTC()

   if err == sql.ErrNoRows || expires.Before(now) {
       _, err = tx.ExecContext(ctx,
           `UPSERT INTO entity_leases
            (entity_id, holder_region, fencing_epoch, lease_expires_at, updated_at)
            VALUES ($1, $2, $3, $4, now())`,
           entityID, region, epoch+1, now.Add(ttl))
       if err != nil {
           _ = tx.Rollback()
           return err
       }
       return tx.Commit()
   }
   _ = tx.Rollback()
   return fmt.Errorf("lease held")
}
  

At moderate concurrency, this model performs predictably. Conflicts are rare because entity-level activity is low relative to range capacity. CockroachDB distributes primary keys lexicographically across ranges, and entity IDs that appear random in design documentation are assumed to distribute evenly in practice.

The assumption that fails at scale is that entity identifier distribution correlates with concurrency distribution. In real systems, correlated workflows, batch operations, or localized bursts of entity activity concentrate writes across subsets of identifiers. Under serializable isolation, overlapping write intents on the same range result in transaction restarts. Restart cost grows nonlinearly as conflict density increases.

The cluster does not fail. CPU does not saturate. Replicas remain healthy. What shifts is the proportion of time spent replaying transactions due to conflict resolution. Tail latency begins to track restart density rather than raw write volume.

Physical Key Distribution as the True Constraint

CockroachDB’s range splitting is deterministic and transparent. It partitions data by primary key ordering and balances ranges across nodes. When entity_id is the sole primary key component, the physical distribution of lease rows depends entirely on that identifier’s ordering and activity pattern. If correlated entities cluster in the keyspace, the coordination workload becomes range-bound.

Adding secondary indexes compounds the issue. An index introduced for introspection or debugging, such as indexing holder_region, creates additional write amplification:

    SQL
   
   CREATE INDEX idx_holder_region ON entity_leases (holder_region);

Every renewal now updates multiple key ranges. Under sustained churn, coordination writes span more physical space, increasing conflict probability rather than reducing it.

The refactor discussion, therefore, shifts from transaction semantics to key topology. The invariants remain valid; the physical layout does not.

Refactoring the Coordination Topology

The structural adjustment is to introduce a deterministic shard component into the primary key so that coordination traffic distributes independently of entity skew. The revised schema introduces a bucket prefix:

    SQL
   
 

   CREATE TABLE entity_leases (
 bucket_id        INT8   NOT NULL,
 entity_id        STRING NOT NULL,
 holder_region    STRING NOT NULL,
 fencing_epoch    INT8   NOT NULL,
 lease_expires_at TIMESTAMPTZ NOT NULL,
 updated_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
 PRIMARY KEY (bucket_id, entity_id)
);
  

The bucket_id derives from a stable hash of the entity identifier:

    Go
   
 

   const NumBuckets uint64 = 2048
func BucketFor(entityID string) int64 {
   h := fnv.New64a()
   _, _ = h.Write([]byte(entityID))
   return int64(h.Sum64() % NumBuckets)
}
  

Lease acquisition logic remains unchanged in its logical structure, but now scopes by (bucket_id, entity_id):

    Go
   
 

   func AcquireLease(ctx context.Context, db *sql.DB, entityID string, region string, ttl time.Duration) error {
   bucket := BucketFor(entityID)

   tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
   if err != nil {
       return err
   }

   var holder string
   var epoch int64
   var expires time.Time
   err = tx.QueryRowContext(ctx,
       `SELECT holder_region, fencing_epoch, lease_expires_at
        FROM entity_leases
        WHERE bucket_id = $1 AND entity_id = $2
        FOR UPDATE`,
       bucket, entityID).Scan(&holder, &epoch, &expires)

   now := time.Now().UTC()

   if err == sql.ErrNoRows || expires.Before(now) {
       _, err = tx.ExecContext(ctx,
           `UPSERT INTO entity_leases
            (bucket_id, entity_id, holder_region, fencing_epoch, lease_expires_at, updated_at)
            VALUES ($1, $2, $3, $4, $5, now())`,
           bucket, entityID, region, epoch+1, now.Add(ttl))
       if err != nil {
           _ = tx.Rollback()
           return err
       }
       return tx.Commit()
   }
   _ = tx.Rollback()
   return fmt.Errorf("lease held")
}
  

This change does not alter the lease invariant or fencing semantics. It alters how CockroachDB distributes coordination rows across ranges. Correlated entity activity no longer maps to correlated physical ranges. Restart amplification becomes statistically bounded by shard count rather than dictated by key ordering.

Temporal Effects and Renewal Discipline

Even after redistributing keys, renewal cadence remains relevant. Uniform TTL values cause synchronized renewal bursts, which increase conflict probability within short time windows. Introducing bounded jitter into renewal duration reduces correlated write pressure without changing expiration semantics:

    Go
   
   func LeaseTTL(base time.Duration) time.Duration {
   jitter := time.Duration(rand.Int63n(int64(base / 4)))
   return base + jitter
}

Restart-aware exponential backoff further reduces retry amplification under transient contention:

    Go
   
   func Backoff(attempt int) time.Duration {
   base := 15 * time.Millisecond
   return base * time.Duration(1<<attempt)
}

These refinements are secondary to the structural key redistribution but are necessary once coordinated traffic is a material portion of the overall write throughput.

Concurrency Envelope as a Design Boundary

The architectural takeaway is not that lease models are flawed, nor that serializable isolation is costly. The takeaway is that physical key topology defines the concurrency envelope of coordination paths. Logical clarity in schema design does not guarantee range-level stability under sustained multi-region concurrency.

In long-lived distributed platforms, coordination must be treated as infrastructure. Its key distribution, index footprint, and renewal cadence determine whether restart density remains incidental or becomes systemic. CockroachDB enforces correctness deterministically. The responsibility to shape coordinated traffic so that correctness does not consume throughput belongs to the architecture.

CockroachDB Isolation (database systems) Lease (computer science)

Opinions expressed by DZone contributors are their own.

Related

Trending