Dev Dump

👑 Leader Election Patterns

  • Leader handles coordination, sequencing, or writes; followers replicate or execute delegated work.
  • System must detect failures, choose a new leader, and resume service quickly.
  • Eliminates ambiguity but introduces a logical single point of control.
  • Write serialization: databases or logs that require ordered commits.
  • Task orchestration: scheduler coordinating workers (e.g., MapReduce master).
  • Cluster membership: services needing one spokesperson for external clients.
| Approach | How It Works | Strengths | Trade-offs |
| --- | --- | --- | --- |
| Bully | Highest-ID node wins; others concede | Simple, no extra services | O(n²) messaging; sensitive to churn |
| Paxos | Consensus on proposals via majority voting | Proven safety, tolerant to failures | Hard to implement; latency overhead |
| Raft | Log replication with randomized elections | Easier mental model; widely adopted | Requires persistent logs; leader bottleneck |
| ZooKeeper/etcd (ZAB/Raft) | External quorum service grants leadership via ephemeral nodes | Battle-tested, provides watches | Needs dedicated cluster; adds dependency |
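The Bully row above boils down to one rule: among the nodes that answer, the highest ID wins. A minimal sketch of that winner rule (function and parameter names are illustrative, not from any library):

```python
def bully_elect(node_ids, alive):
    """Bully winner rule: the highest-ID node that is currently alive wins.

    node_ids -- all known node IDs in the cluster
    alive    -- predicate returning True if a node responds to probes
    """
    candidates = [n for n in node_ids if alive(n)]
    if not candidates:
        raise RuntimeError("no live nodes to elect")
    return max(candidates)

# Example: node 5 is down, so node 4 concedes to no one and wins.
leader = bully_elect([1, 2, 3, 4, 5], alive=lambda n: n != 5)
# leader == 4
```

The O(n²) messaging cost in the table comes from the real protocol, where every campaigning node probes all higher-ID nodes; this sketch only shows the deterministic outcome of those probes.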
  1. Detect: followers miss heartbeats or a lease expires.
  2. Nominate: eligible nodes campaign using algorithm rules.
  3. Vote/Agree: majority consensus or deterministic winner.
  4. Promote: new leader replays logs, announces leadership.
  5. Recover: old leader steps down when it regains connectivity.
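The five steps above can be sketched as state transitions on a single node, Raft-style (class and method names are illustrative; a real implementation would persist `term` and run timers):

```python
import random

class Node:
    """Toy election state machine: follower -> candidate -> leader."""

    def __init__(self, node_id, cluster_size):
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.term = 0                # must be durable across restarts
        self.state = "follower"
        # Randomized timeout (ms) staggers campaigns and reduces split votes.
        self.election_timeout = random.uniform(150, 300)

    def on_heartbeat_timeout(self):
        # Steps 1-2: detect a missed heartbeat, bump the term, campaign.
        self.term += 1
        self.state = "candidate"

    def on_votes(self, votes_received):
        # Steps 3-4: promote only on a strict majority of the cluster.
        if self.state == "candidate" and votes_received > self.cluster_size // 2:
            self.state = "leader"

    def on_higher_term(self, term):
        # Step 5: a stale leader that sees a newer term steps down.
        if term > self.term:
            self.term = term
            self.state = "follower"
```

For example, in a 5-node cluster a candidate needs 3 votes to be promoted, and a recovered old leader demotes itself the moment it hears a higher term.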
  • Tune election timeouts to balance prompt failover against false positives.
  • Maintain durable state (term/epoch, log index) across restarts.
  • Emit metrics on election frequency, log lag, and leadership duration.
  • Run chaos drills (kill leader, isolate network) to validate recovery.
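The timeout-tuning bullet is the knob most teams get wrong: too short and healthy leaders get deposed by a network blip, too long and failover stalls. A minimal lease sketch of that trade-off, using a monotonic clock (the class and the 1.5 s default are illustrative assumptions, not a standard):

```python
import time

class LeaderLease:
    """Heartbeat-driven lease: followers campaign only after expiry.

    A longer timeout_s tolerates jitter (fewer false positives);
    a shorter one gives faster failover when the leader truly dies.
    """

    def __init__(self, timeout_s=1.5):
        self.timeout_s = timeout_s
        # time.monotonic() is immune to wall-clock adjustments.
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def expired(self):
        return time.monotonic() - self.last_heartbeat > self.timeout_s
```

Emitting a metric every time `expired()` flips to true makes the "election frequency" signal from the bullets above easy to chart.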
  • Primary/replica databases: Raft or Paxos to elect a single writer.
  • Distributed locks: ZooKeeper ephemeral znodes for leadership leases.
  • Partitioned systems: one leader per shard to scale out horizontally.
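The ephemeral-znode lease pattern from the bullets above can be modeled in a few lines: the first session to create the leadership path holds it, and deleting the node (what ZooKeeper does automatically when a session dies) releases it. This is a toy in-memory stand-in, not the real ZooKeeper client API:

```python
class EphemeralRegistry:
    """In-memory model of ephemeral-node leadership leases.

    try_acquire is first-writer-wins and idempotent for the holder;
    release only succeeds for the session that holds the path.
    """

    def __init__(self):
        self._nodes = {}  # path -> owning session_id

    def try_acquire(self, path, session_id):
        if path in self._nodes:
            return self._nodes[path] == session_id
        self._nodes[path] = session_id
        return True

    def release(self, path, session_id):
        # In real ZooKeeper this happens implicitly on session loss.
        if self._nodes.get(path) == session_id:
            del self._nodes[path]
```

For one-leader-per-shard layouts, each shard simply uses its own path (e.g. `/election/shard-0`, `/election/shard-1`), so elections stay independent.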
  • ✅ Simplifies coordination, enforces ordering, supports strong consistency.
  • ❌ Leader can become throughput bottleneck; election overhead adds latency; misconfigured failover risks downtime.