System Requirements & Foundations
Why This Page Matters
Before designing databases, caches, and queues, define the core system requirements:
- What scale should the system handle?
- How reliable and available should it be?
- What latency is acceptable?
- How should traffic be routed globally?
These choices drive architecture decisions later.
1. Scalability
- Vertical scaling (scale up): add CPU/RAM to one machine.
- Horizontal scaling (scale out): add more machines behind a load balancer.
| Factor | Vertical | Horizontal |
|---|---|---|
| Setup complexity | Low | Medium/High |
| Max capacity | Hardware-limited | High |
| Single-point failure risk | Higher | Lower (with redundancy) |
| Cost at small scale | Usually lower | Usually higher |
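As a rough illustration of "more machines behind a load balancer", the sketch below rotates requests across a pool of backends. The class name and the backend addresses are hypothetical, and real load balancers also handle health checks, weights, and connection draining, which are left out here.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in rotation; a toy stand-in for a real load balancer."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._rotation = cycle(self._backends)

    def next_backend(self):
        """Return the backend that should receive the next request."""
        return next(self._rotation)


# Hypothetical backend addresses, used only for illustration.
balancer = RoundRobinBalancer(["app-1:8080", "app-2:8080", "app-3:8080"])
assignments = [balancer.next_backend() for _ in range(6)]
print(assignments)
```

Scaling out in this model means appending another address to the pool; scaling up would leave the pool at one entry and make that machine bigger.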
2. Availability, Reliability, and Fault Tolerance
- Availability: percent of time system is up.
- Reliability: system performs correctly over time.
- Fault tolerance: system continues operating when components fail.
Useful terms:
- SLO: internal target (for example, 99.9% monthly availability, p95 latency under 200ms).
- SLA: external promise/contract with customers, usually set less strict than the SLO so the internal target leaves a safety margin.
- MTTR: mean time to recover after failure (lower is better).
Common availability targets:
| Availability target | Max downtime/year (approx.) |
|---|---|
| 99% | 3.65 days |
| 99.9% | 8.76 hours |
| 99.99% | 52.6 minutes |
| 99.999% | 5.26 minutes |
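The downtime figures in the table follow from simple arithmetic: allowed downtime is the period length times the unavailable fraction. A minimal sketch, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def max_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime allowed in the period for a given availability target."""
    return period_hours * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    hours = max_downtime_hours(target)
    print(f"{target}% -> {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

Passing `period_hours=30 * 24` gives the monthly budget instead, which is useful when the target is stated per month.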
Reliability patterns:
- Redundant instances across zones/regions.
- Health checks + auto failover.
- Data replication and backups.
- Graceful degradation (core features first).
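To see why redundant instances help, the sketch below estimates the availability of several replicas where any one can serve traffic. It rests on two loud assumptions that real systems only approximate: replica failures are independent, and failover is instant and always works.

```python
def combined_availability(instance_availability, replicas):
    """Availability of N redundant replicas, assuming independent failures.

    The group is down only when every replica is down at the same time.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0 <= instance_availability <= 1:
        raise ValueError("instance_availability must be between 0 and 1")
    return 1 - (1 - instance_availability) ** replicas


for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, two 99% replicas reach roughly 99.99%. Correlated failures (shared zone, shared deploy, shared dependency) make the real figure lower, which is why the replicas are spread across zones or regions.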
Practical reliability checklist:
- Remove single points of failure at app, DB, and network layers.
- Test failure paths regularly (zone loss, DB primary failover, cache outage).
- Prefer fast recovery over perfect prevention.
3. Latency vs Throughput
- Latency: time for one request/response.
- Throughput: number of requests processed per second.
Both matter:
- Low latency improves user experience.
- High throughput supports traffic growth.
Measure latency with percentiles:
- p50: median; the typical user experience.
- p95: slower tail users (common SLO metric).
- p99: the slowest 1% of requests; exposes tail behavior under load.
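A small sketch of how these percentiles can be computed from raw samples, using the nearest-rank method (one of several common definitions; monitoring tools may interpolate instead). The latency samples are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up data: 90 fast requests, 8 slow ones, 2 very slow ones (milliseconds).
latencies_ms = [20] * 90 + [150] * 8 + [900] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean_ms)
for p in (50, 95, 99):
    print(f"p{p}:", percentile(latencies_ms, p))
```

With this data the mean is 48 ms and p50 is 20 ms, while p95 is 150 ms and p99 is 900 ms, so an average alone would hide the slow tail.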
How to improve both:
- Caching (app cache + CDN).
- Fewer network hops and optimized queries.
- Async processing and batching.
- Horizontal scaling.
Capacity relationship (simplified):
- Higher latency reduces throughput for a fixed worker count, because each worker completes fewer requests per second.
- If traffic grows and utilization stays too high, requests queue behind one another and tail latency rises sharply.
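Both bullets can be made concrete with two textbook formulas. The first is Little's Law rearranged, assuming each worker handles one request at a time. The second uses the M/M/1 queue as a deliberately simplified model (random arrivals, a single server); it gives mean latency rather than tail latency, but the tail follows the same blow-up as utilization approaches 100%. Function names are illustrative.

```python
def max_throughput_rps(workers, latency_seconds):
    """Little's Law rearranged: throughput = concurrency / latency."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return workers / latency_seconds


def mm1_mean_latency(service_seconds, utilization):
    """Mean time in an M/M/1 system: service_time / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_seconds / (1 - utilization)


# Fixed pool of 100 workers: doubling latency halves the throughput ceiling.
print(max_throughput_rps(100, 0.1))
print(max_throughput_rps(100, 0.2))

# 50 ms of service time at rising utilization.
for rho in (0.5, 0.9, 0.99):
    print(rho, round(mm1_mean_latency(0.05, rho), 3))
```

In this model the same 50 ms of work takes about 0.1 s at 50% utilization, 0.5 s at 90%, and 5 s at 99%, which is the reason to keep headroom rather than run near saturation.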
Performance budgeting helps:
- Set per-hop budgets (API gateway, service, DB).
- Track p95/p99 in production and alert on regressions.
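One way to act on per-hop budgets is to compare measured p95 values against them and flag the hops that overrun. The budget numbers and measurements below are hypothetical; they are chosen to sum to the 200 ms example target, and summing per-hop percentiles is only an approximation of end-to-end p95.

```python
# Hypothetical per-hop p95 budgets in milliseconds; they sum to a 200 ms target.
BUDGET_MS = {"api_gateway": 20, "service": 80, "db": 100}


def budget_overruns(measured_p95_ms, budget_ms=BUDGET_MS):
    """Return {hop: ms over budget} for every hop whose measured p95 exceeds its budget."""
    return {
        hop: measured_p95_ms[hop] - limit
        for hop, limit in budget_ms.items()
        if measured_p95_ms.get(hop, 0) > limit
    }


# Made-up production measurements.
measured = {"api_gateway": 12, "service": 95, "db": 70}
print(budget_overruns(measured))
```

Here only the service hop is flagged, even though the total (177 ms) is still under 200 ms; alerting per hop catches a regression before it eats the slack of the others.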
4. DNS, CDN, and Proxies
DNS maps domain names to IP addresses.
Useful commands:
- dig
- nslookup
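The same name-to-address lookup can be done from code. This sketch uses Python's standard `socket.getaddrinfo`, which asks the system resolver (hosts file first, then DNS on most systems), so it is not a pure DNS query the way `dig` is. The helper name `resolve` is illustrative.

```python
import socket


def resolve(hostname):
    """Return the unique IP addresses the system resolver reports for hostname."""
    # Restricting to TCP avoids one duplicate entry per socket type.
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


print(resolve("localhost"))
```

A hostname served from several locations may return multiple addresses, and the answer can differ by where the query is made, which is one of the mechanisms behind global traffic routing.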
CDN = globally distributed cache for static content:
- lower latency,
- reduced origin load,
- better spike handling.
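The "reduced origin load" effect can be sketched with a toy time-to-live cache sitting in front of an origin fetch. Everything here (the `EdgeCache` class, the `origin` function, the 60-second TTL) is hypothetical and far simpler than a real CDN, which also deals with cache keys, invalidation, and many edge locations.

```python
import time


class EdgeCache:
    """Toy TTL cache in front of an origin fetch function."""

    def __init__(self, fetch_from_origin, ttl_seconds=60, clock=time.monotonic):
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # path -> (expires_at, body)
        self.origin_hits = 0

    def get(self, path):
        """Serve from cache while the entry is fresh; otherwise go to the origin."""
        now = self._clock()
        entry = self._store.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]
        body = self._fetch(path)
        self.origin_hits += 1
        self._store[path] = (now + self._ttl, body)
        return body


def origin(path):
    """Hypothetical origin server; returns a fake body for the path."""
    return f"contents of {path}"


cache = EdgeCache(origin, ttl_seconds=60)
for _ in range(1000):
    cache.get("/static/app.js")
print("origin requests:", cache.origin_hits)
```

A burst of 1000 requests for the same asset reaches the origin once; the rest are answered from the cache until the entry expires.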
Proxy types
| Type | Represents | Typical use |
|---|---|---|
| Forward proxy | Client | Filtering, privacy, egress control |
| Reverse proxy | Server | Load balancing, TLS termination, caching |