Foundations are the concepts you keep reusing no matter which stack is fashionable this year. A new engineer needs them to understand why production systems behave strangely; an experienced engineer needs them to avoid cargo-culting tools whose trade-offs they cannot explain.

This roadmap is not a résumé checklist. It is the base layer: the smallest set of mental models that make distributed systems, system design interviews, database choices, and AI infrastructure easier to reason about.

Read in order

1. How services talk

  1. TCP, UDP, QUIC — The transport layer underneath every API decision. Head-of-line blocking, congestion control, connection setup, and why HTTP/3 had to leave TCP behind.
  2. HTTP/1.1, HTTP/2, and HTTP/3 — How the same HTTP semantics behave differently once connection reuse, multiplexing, and QUIC enter the request path.
  3. TLS 1.3 and mTLS — What encrypted connections prove, where certificates fail operationally, and when mutual TLS is worth the overhead.
  4. API Design — Naming, versioning, idempotency, pagination, and the contract your future clients will depend on.
  5. API Protocols Compared — REST, gRPC, GraphQL, WebSockets, SSE, and long-polling through the lens of workload shape and operations.
  6. gRPC vs REST — When JSON over HTTP is good enough, and when binary protocols, streaming, and HTTP/2 become worth the operational cost.
  7. Load Balancers — L4 vs L7 routing, health checks, balancing algorithms, and the failure modes hiding in the request path.
  8. DNS, GeoDNS, and Anycast — The cached, weakly-consistent global control plane every request depends on before it reaches your service.

2. How data survives

  1. Database Concepts — ACID, indexes, write-ahead logs, LSM trees, B-trees, and the vocabulary every storage decision assumes.
  2. Isolation Levels — The anomalies your database allows, even when the dashboard says “transactional.”
  3. MVCC — How modern databases let readers and writers coexist without turning every request into a lock queue.

3. How systems scale

  1. Caching Techniques — Cache-aside, write-through, write-back, invalidation, and the consistency debt behind every fast read.
  2. Consistent Hashing — The partitioning primitive behind caches, key-value stores, and systems that resize without reshuffling the world.
  3. CAP and PACELC — The trade-off everyone quotes, plus the latency-vs-consistency trade-off that matters when there is no partition.
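To make "resize without reshuffling the world" concrete, here is a minimal sketch of a hash ring with virtual nodes. The class name `HashRing`, the `vnodes` parameter, and the node and key names are illustrative assumptions, not taken from any particular library. Production implementations usually add replication and weighting, which are left out here.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted hash points on the ring
        self._owner = {}   # hash point -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node is placed at many points so load spreads more evenly.
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def lookup(self, key):
        # The owner is the first point clockwise from the key's hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.lookup(k) for k in keys}

ring.add("cache-d")
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Every key that moves lands on the new node, and the rest stay where they were. With a naive `hash(key) % node_count` scheme, going from three nodes to four would instead remap most keys.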

4. How distributed systems fail

  1. Distributed Systems Primitives — Partial failure, partitions, retries, idempotency, and why “exactly once” is mostly a product phrase.
  2. Logical Clocks — Causality when wall-clock time lies.
  3. Consensus Algorithms — What it costs to make machines agree.
  4. Distributed Locks — Why a lock without fencing is usually just hope.
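As a rough illustration of why fencing matters, the sketch below simulates a client that holds a lock, stalls past its lease, and then tries to write. `LockService`, `FencedStore`, and `StaleTokenError` are hypothetical names invented for this example. A real system would get the token from its coordination service and enforce the check inside the storage layer.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class LockService:
    """Hypothetical lock service: every grant carries a larger token."""

    def __init__(self):
        self._token = 0

    def acquire(self) -> int:
        self._token += 1
        return self._token


class FencedStore:
    """Hypothetical resource that rejects writes from stale lock holders."""

    def __init__(self):
        self._highest_token = 0
        self.value = None

    def write(self, token: int, value: str) -> None:
        # The resource, not the client, decides whether the lock is still valid.
        if token < self._highest_token:
            raise StaleTokenError(f"token {token} < {self._highest_token}")
        self._highest_token = token
        self.value = value


locks = LockService()
store = FencedStore()

token_a = locks.acquire()  # client A takes the lock, then stalls (e.g. a long pause)
token_b = locks.acquire()  # A's lease expires; client B is granted the lock
store.write(token_b, "written by B")

try:
    store.write(token_a, "written by A")  # A wakes up, still believing it holds the lock
    rejected = False
except StaleTokenError:
    rejected = True
```

Without the token check, client A's late write would silently overwrite B's. The lock alone cannot prevent that, because A has no way to know its lease expired while it was paused.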

5. How production systems are operated

  1. Kafka — Logs, partitions, consumer groups, and why messaging systems are storage systems with subscriptions.
  2. Observability — Logs, metrics, traces, and the difference between monitoring known failures and investigating unknown ones.
  3. Testing Strategies — The shape of tests that survive distributed systems, not just the pyramid drawn on a slide.

Where to go next

  • Interviews — move to the System Design Interviews Roadmap when you want to turn these primitives into design answers.
  • AI systems — move to the AI Systems Roadmap when you want the same production mindset applied to RAG, serving, and evaluation.

If you are newer: read this top-to-bottom. If you are experienced: skim the headings and deep-dive where your mental model feels hand-wavy.
