Foundations are the concepts you keep reusing no matter which stack is fashionable this year. A new engineer needs them to understand why production systems behave strangely; an experienced engineer needs them to avoid cargo-culting tools whose trade-offs they cannot explain.
This roadmap is not a résumé checklist. It is the base layer: the smallest set of mental models that make distributed systems, system design interviews, database choices, and AI infrastructure easier to reason about.
Read in order
1. How services talk
- TCP, UDP, QUIC — The transport layer underneath every API decision. Head-of-line blocking, congestion control, connection setup, and why HTTP/3 had to leave TCP behind.
- HTTP/1.1, HTTP/2, and HTTP/3 — How the same HTTP semantics behave differently once connection reuse, multiplexing, and QUIC enter the request path.
- TLS 1.3 and mTLS — What encrypted connections prove, where certificates fail operationally, and when mutual TLS is worth the overhead.
- API Design — Naming, versioning, idempotency, pagination, and the contract your future clients will depend on.
- API Protocols Compared — REST, gRPC, GraphQL, WebSockets, SSE, and long-polling through the lens of workload shape and operations.
- gRPC vs REST — When JSON over HTTP is good enough, and when binary protocols, streaming, and HTTP/2 become worth the operational cost.
- Load Balancers — L4 vs L7 routing, health checks, balancing algorithms, and the failure modes hiding in the request path.
- DNS, GeoDNS, and Anycast — The cached, weakly-consistent global control plane every request depends on before it reaches your service.
2. How data survives
- Database Concepts — ACID, indexes, write-ahead logs, LSM trees, B-trees, and the vocabulary every storage decision assumes.
- Isolation Levels — The anomalies your database allows, even when the dashboard says “transactional.”
- MVCC — How modern databases let readers and writers coexist without turning every request into a lock queue.
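As a rough illustration of the MVCC idea named above, here is a minimal sketch, assuming a toy in-memory store. The `VersionedStore` class and its methods are hypothetical names invented for this example, not any database's API. Writers append new versions instead of overwriting, and a reader sees only versions committed at or before its snapshot, so neither side waits on a lock.

```python
import itertools


class VersionedStore:
    """Toy multi-version store (hypothetical, for illustration only)."""

    def __init__(self):
        self._clock = itertools.count(1)  # source of commit timestamps
        self._versions = {}               # key -> [(commit_ts, value)], ascending
        self._last_commit = 0

    def write(self, key, value):
        # A write never overwrites: it appends a new version with a new timestamp.
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        self._last_commit = ts
        return ts

    def snapshot(self):
        # A reader remembers the latest commit timestamp when it starts.
        return self._last_commit

    def read(self, key, snapshot_ts):
        # Return the newest version that was committed at or before the snapshot.
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


store = VersionedStore()
store.write("balance", 100)
reader_snapshot = store.snapshot()   # a long-running reader starts here
store.write("balance", 40)           # a writer commits while the reader is active

old_view = store.read("balance", reader_snapshot)
new_view = store.read("balance", store.snapshot())
```

The long-running reader keeps seeing `100` while a fresh snapshot sees `40`. Real engines add transaction IDs, visibility rules for in-flight writes, and garbage collection of old versions, all of which this sketch leaves out.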
3. How systems scale
- Caching Techniques — Cache-aside, write-through, write-back, invalidation, and the consistency debt behind every fast read.
- Consistent Hashing — The partitioning primitive behind caches, key-value stores, and systems that resize without reshuffling the world.
- CAP and PACELC — The trade-off everyone quotes, plus the latency-vs-consistency trade-off that matters when there is no partition.
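To make "resize without reshuffling the world" concrete, the following is a minimal sketch of a consistent-hash ring with virtual nodes. The `HashRing` class, the node names, and the choice of 100 virtual nodes per server are assumptions made for illustration, not a reference implementation.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a position on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring (hypothetical, for illustration only)."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node is placed at many positions to even out its share of keys.
        for i in range(self._vnodes):
            point = ring_hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def lookup(self, key):
        # Walk clockwise to the first ring position after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.lookup(k) for k in keys}

ring.add("cache-d")  # grow the cluster from three nodes to four
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys changed owner")
```

Only the keys that now belong to `cache-d` change owner, roughly a quarter of them in expectation. With naive `hash(key) % node_count` placement, going from three nodes to four would remap most keys instead.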
4. How distributed systems fail
- Distributed Systems Primitives — Partial failure, partitions, retries, idempotency, and why “exactly once” is mostly a product phrase.
- Logical Clocks — Causality when wall-clock time lies.
- Consensus Algorithms — What it costs to make machines agree.
- Distributed Locks — Why a lock without fencing is usually just hope.
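The fencing point can be sketched in a few lines. This is a minimal illustration under stated assumptions: `LockService` and `FencedStore` are hypothetical classes, and lease expiry is simulated simply by granting the lock a second time. The idea is that every lock grant carries a strictly increasing token, and the storage layer rejects any write whose token is older than one it has already seen.

```python
class LockService:
    """Hypothetical lock service: each grant returns a larger token than the last."""

    def __init__(self):
        self._last_token = 0

    def acquire(self) -> int:
        self._last_token += 1
        return self._last_token


class FencedStore:
    """Hypothetical storage service that checks fencing tokens on every write."""

    def __init__(self):
        self._highest_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        # Reject writes from a holder whose lock has since been granted to someone else.
        if token < self._highest_token:
            return False
        self._highest_token = token
        self.value = value
        return True


locks = LockService()
store = FencedStore()

token_a = locks.acquire()  # client A takes the lock, then stalls (GC pause, network delay)
token_b = locks.acquire()  # A's lease expires (simulated); client B takes the lock

b_accepted = store.write(token_b, "written by B")
a_accepted = store.write(token_a, "written by A")  # A wakes up still believing it holds the lock
```

Here `b_accepted` is `True` and `a_accepted` is `False`, so the stale write is rejected. Without the token check, client A's late write would silently overwrite B's, even though the lock service itself behaved correctly.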
5. How production systems are operated
- Kafka — Logs, partitions, consumer groups, and why messaging systems are storage systems with subscriptions.
- Observability — Logs, metrics, traces, and the difference between monitoring known failures and investigating unknown ones.
- Testing Strategies — The shape of tests that survive distributed systems, not just the pyramid drawn on a slide.
Where to go next
- Interviews — move to the System Design Interviews Roadmap when you want to turn these primitives into design answers.
- AI systems — move to the AI Systems Roadmap when you want the same production mindset applied to RAG, serving, and evaluation.
If you are newer to this material, read the roadmap top to bottom. If you are experienced, skim the headings and go deep wherever your mental model feels hand-wavy.