Database Partitioning & Sharding
1. Partitioning vs. Sharding
Section titled “1. Partitioning vs. Sharding”While often used interchangeably, they represent different architectural scopes.
The Library Analogy:
- Partitioning: Organizing one large bookshelf into sections (History, Science, Fiction). It’s easier to find books, but it’s still one bookshelf.
- Sharding: Splitting the books across three different library buildings. Now you have three times the space and three teams of librarians working simultaneously.
Figure 1: Partitioning stays within one machine (logical split), while sharding spreads data across multiple nodes (physical scale-out).
| Feature | Partitioning | Sharding |
|---|---|---|
| Physical Nodes | Single server node | Multiple distributed nodes |
| Goal | Manageability & Local Perf | Massive scale & Throughput |
| Complexity | Low to Moderate | High (requires routing & coordination) |
| Fault Tolerance | Limited by single node | High (one shard failing doesn’t kill all) |
2. Partitioning Strategies
Section titled “2. Partitioning Strategies”There are two fundamental ways to slice your data: Horizontal and Vertical.
Figure 2: Horizontal partitioning splits by rows; Vertical partitioning splits by columns.
Horizontal Partitioning (Data Sharding)
Section titled “Horizontal Partitioning (Data Sharding)”Rows of a table are split into multiple partitions based on a Partition Key.
- Pros: Scales linearly with the number of records.
- Example: Putting users with
ID1–1000 in Partition A and 1001–2000 in Partition B.
Vertical Partitioning
Section titled “Vertical Partitioning”The table is split by columns. Some columns go to one table, others to another, both sharing the same primary key.
- Pros: Reduces I/O for tables with many columns; increases security (PII in separate tables).
- Example: Splitting a
Usertable intoUserBasic(Name, Email) andUserBlobs(Profile Picture, Bio).
3. Common Horizontal Techniques
Section titled “3. Common Horizontal Techniques”How do we decide which row goes where?
- Range Partitioning: Maps data based on continuous ranges (e.g., Dates: Jan, Feb, Mar).
- Risk: Can lead to “Hotspots” if most activity happens in the latest month.
- List Partitioning: Maps data to explicit values (e.g., Country: ‘US’, ‘IN’, ‘UK’).
- Use Case: Great for data residency compliance.
- Hash Partitioning: Applies a hash function to the key (e.g.,
hash(user_id) % 4) to determine the partition.- Pros: Excellent uniform distribution of data.
- Composite Partitioning: Combines techniques (e.g., Range by Year, then Hash by UserID within each year).
4. Sharding Architecture
Section titled “4. Sharding Architecture”When moving to a sharded environment, you need a way to find the data.
Figure 3: A typical sharding setup includes an application layer, a routing layer (proxy), and the individual shards.
The Shard Key
Section titled “The Shard Key”The most critical decision in sharding. It is the attribute used to determine where data is stored.
- Good Shard Key: High cardinality (many unique values) and uniform distribution.
- Bad Shard Key: Causes all traffic to hit one shard (e.g., a
Statusfield with only ‘Active/Inactive’).
Routing Layer
Section titled “Routing Layer”Applications typically connect to a Router or Proxy (like Vitess for MySQL or Citus for Postgres). The router evaluates the Shard Key in the query and forwards the request to the correct physical node.
5. Consistent Hashing
Section titled “5. Consistent Hashing”Traditional hash sharding (hash(key) % N) has a massive flaw: if you add a new node (N + 1), almost all keys will map to different locations, requiring a total data migration.
Consistent Hashing solves this by mapping both nodes and data onto a circular “Hash Ring.”
Figure 4: In consistent hashing, data is mapped to the first node encountered clockwise on the ring. Adding a node only requires moving a small fraction of data.
6. The Challenges of Sharding
Section titled “6. The Challenges of Sharding”Sharding provides “infinite” scale, but it comes at a steep price:
- Cross-Shard Joins: Joining tables across different physical servers is extremely slow and often unsupported. Developers must perform joins in the application code.
- Distributed Transactions: Ensuring ACID properties across shards requires complex protocols like Two-Phase Commit (2PC), which can hurt performance.
- Resharding: As data grows, you eventually need to split an existing shard. This process is complex and can cause downtime if not handled by a sophisticated control plane.
- Referential Integrity: Most sharded databases cannot enforce Foreign Key constraints across shards. This logic must move to the application layer.
- Partitioning is logical; Sharding is physical.
- Horizontal splits rows (scale volume); Vertical splits columns (scale access pattern).
- Shard Keys must be chosen with extreme care to avoid hotspots.
- Use Consistent Hashing to minimize data movement when scaling the cluster.
- Sharding is a last resort—optimize your queries, indexes, and caching first!