Database Partitioning & Sharding

1. Partitioning vs. Sharding

While often used interchangeably, they represent different architectural scopes.

The Library Analogy:

Partitioning: Organizing one large bookshelf into sections (History, Science, Fiction). It’s easier to find books, but it’s still one bookshelf.

Sharding: Splitting the books across three different library buildings. Now you have three times the space and three teams of librarians working simultaneously.

Partitioning vs Sharding Figure 1: Partitioning stays within one machine (logical split), while sharding spreads data across multiple nodes (physical scale-out).

Feature	Partitioning	Sharding
Physical Nodes	Single server node	Multiple distributed nodes
Goal	Manageability & Local Perf	Massive scale & Throughput
Complexity	Low to Moderate	High (requires routing & coordination)
Fault Tolerance	Limited by single node	High (one shard failing doesn’t kill all)

2. Partitioning Strategies

There are two fundamental ways to slice your data: Horizontal and Vertical.

Horizontal vs Vertical Partitioning Figure 2: Horizontal partitioning splits by rows; Vertical partitioning splits by columns.

Horizontal Partitioning (Data Sharding)

Rows of a table are split into multiple partitions based on a Partition Key.

Pros: Scales linearly with the number of records.
Example: Putting users with ID 1–1000 in Partition A and 1001–2000 in Partition B.

Vertical Partitioning

The table is split by columns. Some columns go to one table, others to another, both sharing the same primary key.

Pros: Reduces I/O for tables with many columns; increases security (PII in separate tables).
Example: Splitting a User table into UserBasic (Name, Email) and UserBlobs (Profile Picture, Bio).

3. Common Horizontal Techniques

How do we decide which row goes where?

Range Partitioning: Maps data based on continuous ranges (e.g., Dates: Jan, Feb, Mar).
- Risk: Can lead to “Hotspots” if most activity happens in the latest month.
List Partitioning: Maps data to explicit values (e.g., Country: ‘US’, ‘IN’, ‘UK’).
- Use Case: Great for data residency compliance.
Hash Partitioning: Applies a hash function to the key (e.g., hash(user_id) % 4) to determine the partition.
- Pros: Excellent uniform distribution of data.
Composite Partitioning: Combines techniques (e.g., Range by Year, then Hash by UserID within each year).

4. Sharding Architecture

When moving to a sharded environment, you need a way to find the data.

Sharding Architecture Overview Figure 3: A typical sharding setup includes an application layer, a routing layer (proxy), and the individual shards.

The Shard Key

The most critical decision in sharding. It is the attribute used to determine where data is stored.

Good Shard Key: High cardinality (many unique values) and uniform distribution.
Bad Shard Key: Causes all traffic to hit one shard (e.g., a Status field with only ‘Active/Inactive’).

Routing Layer

Applications typically connect to a Router or Proxy (like Vitess for MySQL or Citus for Postgres). The router evaluates the Shard Key in the query and forwards the request to the correct physical node.

5. Consistent Hashing

Traditional hash sharding (hash(key) % N) has a massive flaw: if you add a new node (N + 1), almost all keys will map to different locations, requiring a total data migration.

Consistent Hashing solves this by mapping both nodes and data onto a circular “Hash Ring.”

Consistent Hashing Ring Figure 4: In consistent hashing, data is mapped to the first node encountered clockwise on the ring. Adding a node only requires moving a small fraction of data.

6. The Challenges of Sharding

Sharding provides “infinite” scale, but it comes at a steep price:

Cross-Shard Joins: Joining tables across different physical servers is extremely slow and often unsupported. Developers must perform joins in the application code.
Distributed Transactions: Ensuring ACID properties across shards requires complex protocols like Two-Phase Commit (2PC), which can hurt performance.
Resharding: As data grows, you eventually need to split an existing shard. This process is complex and can cause downtime if not handled by a sophisticated control plane.
Referential Integrity: Most sharded databases cannot enforce Foreign Key constraints across shards. This logic must move to the application layer.

TL;DR

Partitioning is logical; Sharding is physical.
Horizontal splits rows (scale volume); Vertical splits columns (scale access pattern).
Shard Keys must be chosen with extreme care to avoid hotspots.
Use Consistent Hashing to minimize data movement when scaling the cluster.
Sharding is a last resort—optimize your queries, indexes, and caching first!