
System Requirements & Foundations

Before designing databases, caches, and queues, define the core system requirements:

  • What scale should the system handle?
  • How reliable and available should it be?
  • What latency is acceptable?
  • How should traffic be routed globally?

These choices drive architecture decisions later.

1. Scaling

  • Vertical scaling (scale up): add CPU/RAM to one machine.
  • Horizontal scaling (scale out): add more machines behind a load balancer.

Scaling approaches

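| Factor                    | Vertical         | Horizontal              |
|---------------------------|------------------|-------------------------|
| Setup complexity          | Low              | Medium/High             |
| Max capacity              | Hardware-limited | High                    |
| Single-point failure risk | Higher           | Lower (with redundancy) |
| Cost at small scale       | Usually lower    | Usually higher          |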
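
To make "more machines behind a load balancer" concrete, here is a minimal sketch of round-robin distribution, one common balancing policy. The class name `RoundRobinBalancer` and the backend names `app-1` to `app-3` are hypothetical, chosen only for illustration; real load balancers also track health and connection counts.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed rotation (hypothetical helper)."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(list(backends))

    def next_backend(self):
        # Each call returns the next backend, wrapping around at the end.
        return next(self._rotation)


# Hypothetical backend names; scaling out means adding entries to this list.
balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
picks = [balancer.next_backend() for _ in range(6)]
print(picks)
```

Each backend receives an equal share of requests, so adding a fourth machine lowers the per-machine load without changing any single machine's hardware.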

2. Availability, Reliability, and Fault Tolerance

  • Availability: percentage of time the system is up and serving requests.
  • Reliability: the system performs correctly over time.
  • Fault tolerance: the system continues operating when components fail.

Useful terms:

  • SLO: internal target (for example, 99.9% monthly availability, p95 latency under 200 ms).
  • SLA: external promise or contract with customers, usually set less strict than the SLO so there is a buffer before it is breached.
  • MTTR: mean time to recover after a failure (lower is better).

Common availability targets:

| Availability | Max downtime/year (approx.) |
|--------------|-----------------------------|
| 99%          | 3.65 days                   |
| 99.9%        | 8.76 hours                  |
| 99.99%       | 52.6 minutes                |
| 99.999%      | 5.26 minutes                |
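
The downtime figures above follow from simple arithmetic: allowed downtime is the period length times the unavailable fraction. A small sketch, assuming a 365-day year; the function name `max_downtime_hours` is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def max_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime budget, in hours, for a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


for target in (99, 99.9, 99.99, 99.999):
    hours = max_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Passing a different `period_hours` (for example `30 * 24`) gives the budget for a monthly target instead of a yearly one.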

Fault tolerance with failover

Reliability patterns:

  • Redundant instances across zones/regions.
  • Health checks + auto failover.
  • Data replication and backups.
  • Graceful degradation (core features first).
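
As a rough illustration of "health checks + auto failover", the sketch below walks a priority-ordered list of instances and routes to the first one that passes its health check. The function `pick_healthy`, the instance names, and the in-memory `health` table are all hypothetical; a real system would probe instances over the network and handle replication lag before promoting a replica.

```python
def pick_healthy(instances, is_healthy):
    """Return the first instance, in priority order, that passes its health check."""
    for instance in instances:
        if is_healthy(instance):
            return instance
    raise RuntimeError("no healthy instance available")


# Hypothetical instances spread across zones, listed in priority order.
instances = ["db-primary.zone-a", "db-replica.zone-b", "db-replica.zone-c"]

# Stand-in for real health probes: the primary is currently down.
health = {
    "db-primary.zone-a": False,
    "db-replica.zone-b": True,
    "db-replica.zone-c": True,
}

active = pick_healthy(instances, lambda name: health[name])
print(active)
```

With the primary marked unhealthy, traffic moves to the replica in the next zone, which is why redundancy across zones has to exist before failover can help.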

Practical reliability checklist:

  • Remove single points of failure at app, DB, and network layers.
  • Test failure paths regularly (zone loss, DB primary failover, cache outage).
  • Prefer fast recovery over perfect prevention.

3. Latency and Throughput

  • Latency: time for one request/response.
  • Throughput: number of requests processed per second.

Latency and throughput under load

Both matter:

  • Low latency improves user experience.
  • High throughput supports traffic growth.

Measure latency with percentiles:

  • p50: typical user experience.
  • p95: slower tail users (common SLO metric).
  • p99: worst tail behavior under load.
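
A short sketch of how these percentiles can be computed from raw samples, using the nearest-rank method (one of several percentile definitions; monitoring tools may interpolate instead). The sample latencies are invented for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Invented request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 14, 13, 18, 16, 250, 17, 11, 19,
                20, 22, 21, 900, 23, 24, 25, 26, 27, 28]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

In this sample the median stays low while p95 and p99 expose the two slow requests, which is why tail percentiles are tracked separately from the typical case.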

How to improve both:

  • Caching (app cache + CDN).
  • Fewer network hops and optimized queries.
  • Async processing and batching.
  • Horizontal scaling.

Capacity relationship (simplified):

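  • With a fixed number of workers, higher per-request latency lowers the maximum throughput (roughly workers divided by average latency).
  • If traffic grows and utilization stays close to 100%, requests queue up and tail latency rises sharply.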
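
Both points can be sketched numerically. The first function applies Little's Law to a fixed pool of workers that each handle one request at a time. The second uses the textbook M/M/1 queueing formula as a stand-in for how response time grows with utilization; that model is an assumption made here for illustration and real services will deviate from it. Function names are hypothetical.

```python
def max_throughput_rps(workers, latency_s):
    """Little's Law for a saturated pool: requests/second = workers / latency."""
    return workers / latency_s


def mm1_response_time_s(service_time_s, utilization):
    """Mean response time in an M/M/1 queue (simplifying assumption)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1 - utilization)


# Doubling latency halves the throughput ceiling for the same 100 workers.
print(max_throughput_rps(workers=100, latency_s=0.2))
print(max_throughput_rps(workers=100, latency_s=0.4))

# A 50 ms service time stretches quickly as utilization approaches 100%.
for utilization in (0.5, 0.8, 0.9, 0.99):
    seconds = mm1_response_time_s(0.05, utilization)
    print(f"utilization {utilization:.0%}: {seconds * 1000:.0f} ms")
```

The non-linear growth near full utilization is the reason capacity plans usually leave headroom rather than running workers at their limit.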

Performance budgeting helps:

  • Set per-hop budgets (API gateway, service, DB).
  • Track p95/p99 in production and alert on regressions.

4. DNS, CDN, and Proxies

DNS maps domain names to IP addresses.

Useful commands:

  • dig
  • nslookup
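
The same name-to-address lookup can be done from code. This is a minimal sketch using Python's standard `socket.getaddrinfo`, which asks the operating system's resolver; the helper name `resolve` is hypothetical, and `localhost` is used so the example runs without network access. A public domain name works the same way.

```python
import socket


def resolve(hostname, port=443):
    """Return the sorted, de-duplicated IP addresses the OS resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


addresses = resolve("localhost")
print(addresses)
```

Unlike dig, this goes through the local resolver configuration (hosts file, caches), so results can differ from a direct query to an authoritative server.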

A CDN is a globally distributed cache for static content. Benefits:

  • Lower latency.
  • Reduced origin load.
  • Better spike handling.
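
To show why caching reduces origin load, here is a toy edge cache with a time-to-live (TTL). It is a sketch under simplifying assumptions: a single in-memory node, no eviction, and no cache-control headers. `EdgeCache`, `origin`, and the path `/static/logo.png` are hypothetical names.

```python
import time


class EdgeCache:
    """Toy single-node cache that serves stored responses until their TTL expires."""

    def __init__(self, fetch_from_origin, ttl_s=60.0, clock=time.monotonic):
        self._fetch = fetch_from_origin
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # path -> (expires_at, body)

    def get(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the origin is not contacted
        body = self._fetch(path)  # cache miss or expired entry
        self._entries[path] = (now + self._ttl_s, body)
        return body


origin_calls = []


def origin(path):
    """Stand-in for the origin server; records every request it receives."""
    origin_calls.append(path)
    return f"contents of {path}"


cache = EdgeCache(origin, ttl_s=60.0)
for _ in range(1000):
    cache.get("/static/logo.png")
print(len(origin_calls))
```

A burst of requests for the same asset reaches the origin only when the entry is missing or expired, which is the mechanism behind both the reduced origin load and the better spike handling.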
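
| Type          | Represents | Typical use                              |
|---------------|------------|------------------------------------------|
| Forward proxy | Client     | Filtering, privacy, egress control       |
| Reverse proxy | Server     | Load balancing, TLS termination, caching |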

Request path with DNS, CDN, and reverse proxy