---
title: "Why more GPUs is not enough for LLM inference"
description: "What I learned deploying and tuning large-model inference: KV cache, routing, and cache hierarchy matter as much as raw GPU count."
date: 2026-05-28
canonical: https://starptech.com/blog/why-more-gpus-is-not-enough-for-llm-inference/
---

# Why more GPUs is not enough for LLM inference

> What I learned deploying and tuning large-model inference: KV cache, routing, and cache hierarchy matter as much as raw GPU count.

I went down the road of deploying [DeepSeek V4 Flash](https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4) on dedicated inference infrastructure to understand what it really takes to serve, tune, and scale a model deployment with great performance.

## The biggest learning

KV cache is not an implementation detail. It becomes stateful infrastructure when you want to do LLM serving seriously.

At a small scale, it is easy to think mostly in terms of GPUs, VRAM, batching, and tokens per second. But once you start optimizing for long context, high throughput, and low TTFT (Time to First Token), you start thinking in cache domains: who owns the KV, how do requests route back to it, and what happens when the owner is overloaded?

## From H200 to B200

I first served the model on H200 GPUs with `DP=1`. [DP, or data parallelism](https://docs.sglang.io/docs/advanced_features/dp_dpa_smg_guide), is how serving systems split requests across multiple parallel replicas or workers. With `DP=1`, there was only one DP cache domain, so prefix-cache behavior was easier to reason about.

LPM scheduling helped here. LPM stands for longest prefix match: instead of serving everything strictly first-come-first-served, the scheduler groups waiting requests with shared prefixes closer together.

This is especially useful for agent workloads, where steps often repeat the same system prompt, tool schemas, policies, memory, workspace context, repository context, and previous trajectory. Each step may only add a small amount of new information while most of the prefix stays the same.

LPM helped me reach an acceptable cache-hit ratio even while the traffic still lacked proper replica-level cache locality. But LPM is a scheduler, not a routing layer. It can improve reuse inside the cache domain it sees, but it cannot make separate DP-replica caches behave like one shared cache.

For long-context inference, VRAM becomes the first bottleneck before compute. So I leaned on [SGLang HiCache](https://docs.sglang.io/docs/advanced_features/hicache_design#hicache-system-design-and-optimization) as an L2 layer, using host memory to extend KV capacity beyond GPU memory. On that node, this meant leveraging hundreds of gigabytes of host RAM instead of treating HBM as the only cache tier.

Then I moved to modern B200s with `DP=4`. Throughput improved, but cache behavior changed: each DP replica owned a separate KV-cache domain. The same prefix could hit on one replica and miss on another, even with LPM enabled, because LPM was local to each replica's cache.

The important part is not H200 versus B200 as hardware. A B200 setup with `DP=1` would not behave this way, and an H200 setup with `DP=4` would. The topology changed: `DP=4` creates four cache owners, not one bigger shared cache.

Each DP replica owns a separate KV cache domain.
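
To make the four-owners point concrete, here is a toy simulation, not a model of SGLang itself. It assumes requests are dispatched round-robin, that each replica remembers every prefix it has seen (no eviction), and that a request counts as a hit only if its own replica already holds the prefix. The function `hit_ratio` and the session labels are made up for illustration.

```python
from itertools import cycle


def hit_ratio(num_replicas: int, requests: list[str]) -> float:
    """Round-robin requests over replicas that each keep a private prefix set."""
    caches = [set() for _ in range(num_replicas)]
    replica_ids = cycle(range(num_replicas))
    hits = 0
    for prefix in requests:
        cache = caches[next(replica_ids)]
        if prefix in cache:
            hits += 1
        else:
            cache.add(prefix)  # a miss warms only this one replica
    return hits / len(requests)


# Five sessions (A-E), eight turns each, interleaved in arrival order.
requests = [session for _ in range(8) for session in "ABCDE"]

print("DP=1 hit ratio:", hit_ratio(1, requests))
print("DP=4 hit ratio:", hit_ratio(4, requests))
```

With one cache domain, every session misses once and then always hits. With four private domains and no stickiness, each session has to warm every replica separately before it hits anywhere, so the same traffic produces far more misses.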

## The routing problem

At first, I approached it like a normal load-balancing problem: add a router, improve cache locality. But the router only saw one SGLang worker. The real cache domains were inside that worker, across DP replicas.

Worker-level routing was not enough. What I needed was replica-level cache stickiness.

The fix was simple but important: hash a stable routing key, pick the DP replica that owns the cache, then send `routed_dp_rank` with the request. For chat, the key can be tenant plus conversation ID. For agents, tenant plus session ID. For coding workloads, tenant plus repository ID or workspace ID.

In an OpenAI-compatible request, the extra routing field can be passed alongside the normal completion payload. For example, to force rank 0:

```sh
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "system", "content": "You are a concise debugging assistant."},
      {"role": "user", "content": "We are investigating high TTFT in the inference service."},
      {"role": "assistant", "content": "The first thing I would check is prefix-cache hit rate by worker."},
      {"role": "user", "content": "Continue from there."}
    ],
    "routed_dp_rank": 0
  }'
```

The goal is simple: same prefix, same cache owner. If a conversation warms DP replica 2, future turns should go back to DP replica 2.

Replica-level stickiness makes the router choose the cache owner, not just the worker.
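
The hashing step can be sketched in a few lines of Python. This is a minimal illustration rather than the exact code I ran: the helper names `dp_rank_for` and `build_payload`, the `tenant:conversation` key format, and the choice of SHA-256 are assumptions. Only the `routed_dp_rank` field comes from the request shown above.

```python
import hashlib
import json


def dp_rank_for(routing_key: str, dp_size: int) -> int:
    """Map a stable routing key to a DP rank; the same key always gives the same rank."""
    # hashlib is used instead of built-in hash(), which is salted per process for strings.
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % dp_size


def build_payload(tenant: str, conversation_id: str, messages: list[dict], dp_size: int = 4) -> dict:
    """Attach the sticky rank to an OpenAI-compatible chat payload."""
    return {
        "model": "deepseek-v4-flash",
        "messages": messages,
        "routed_dp_rank": dp_rank_for(f"{tenant}:{conversation_id}", dp_size),
    }


payload = build_payload(
    "acme",
    "conv-42",
    [{"role": "user", "content": "Continue from there."}],
)
print(json.dumps(payload, indent=2))
```

One design caveat: plain modulo hashing remaps most keys when `dp_size` changes, so a resize costs one round of cold caches.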

After combining LPM, a better cache hierarchy, and replica-level stickiness, I raised the cache hit ratio from roughly 0-60% to a stable 90%.

## The caveat

Replica stickiness improves cache locality, but it does not solve everything. It does not account for GPU utilization, overloaded replicas, unhealthy workers, failover, or tail latency.

A production routing layer still needs to balance cache ownership against real-time load and availability: route to the cache owner when possible, but fall back when that owner is too busy or unavailable.

### The better production shape

I did not deploy SMG for this experiment, but it is the cleaner production shape: run [SGLang Model Gateway (SMG)](https://docs.sglang.io/docs/advanced_features/dp_dpa_smg_guide) in cache-aware mode. Instead of manually hashing to a replica, SMG can route by prefix locality while still accounting for load, health, retries, and observability.

SMG is the routing layer: it scores workers, then sends the request to the best cache owner.
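
The rule from the caveat, prefer the cache owner but fall back when it is busy or down, can be written as a small selection function. This is a hypothetical sketch and not SMG's actual scoring logic: the `Replica` fields, the `max_in_flight` threshold, and the least-loaded fallback are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    rank: int
    healthy: bool = True
    in_flight: int = 0  # requests currently being served


def choose_replica(owner_rank: int, replicas: list[Replica], max_in_flight: int = 32) -> Replica:
    """Prefer the cache owner; otherwise take the least-loaded healthy replica."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available")
    owner = next((r for r in candidates if r.rank == owner_rank), None)
    if owner is not None and owner.in_flight < max_in_flight:
        return owner  # warm cache is likely here
    # Cold cache on the fallback, but the request is not stuck behind a hot spot.
    return min(candidates, key=lambda r: r.in_flight)


replicas = [
    Replica(0, in_flight=3),
    Replica(1, in_flight=40),   # overloaded
    Replica(2, healthy=False),  # unavailable
    Replica(3, in_flight=1),
]
for owner_rank in (0, 1, 2):
    print(owner_rank, "->", choose_replica(owner_rank, replicas).rank)
```

A fallback pays a cache miss on the new replica, so the threshold trades cache locality against queueing delay and is workload-specific.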

## The cache hierarchy

The broader pattern I learned is that serious LLM serving needs a KV hierarchy:

1. **L1:** GPU KV cache
2. **L2:** host-memory KV cache via HiCache
3. **L3:** shared or distributed KV cache with something like [Mooncake](https://kvcache-ai.github.io/Mooncake/), as used by Kimi

Cache hits should stay as high in the hierarchy as possible.
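
As a mental model, the three tiers behave like a lookup chain with demotion on eviction and promotion on hit. The toy class below is only that mental model. It is not how HiCache or Mooncake are implemented: real systems manage paged KV blocks, asynchronous transfers, and configurable write policies. The class name, capacities, and LRU policy are assumptions.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy L1 -> L2 -> L3 lookup with LRU demotion; strings stand in for KV blocks."""

    def __init__(self, l1_capacity: int, l2_capacity: int):
        self.l1 = OrderedDict()  # GPU memory: fastest, smallest
        self.l2 = OrderedDict()  # host memory
        self.l3 = {}             # shared store, treated as unbounded here
        self.l1_capacity = l1_capacity
        self.l2_capacity = l2_capacity

    def put(self, prefix: str, kv: str) -> None:
        # Keep each prefix in exactly one tier.
        self.l2.pop(prefix, None)
        self.l3.pop(prefix, None)
        self.l1[prefix] = kv
        self.l1.move_to_end(prefix)
        if len(self.l1) > self.l1_capacity:
            old_prefix, old_kv = self.l1.popitem(last=False)  # demote LRU to L2
            self.l2[old_prefix] = old_kv
            if len(self.l2) > self.l2_capacity:
                older_prefix, older_kv = self.l2.popitem(last=False)  # demote to L3
                self.l3[older_prefix] = older_kv

    def get(self, prefix: str):
        """Return (tier, kv), or ("miss", None) when prefill must recompute."""
        if prefix in self.l1:
            self.l1.move_to_end(prefix)
            return "L1", self.l1[prefix]
        for tier, store in (("L2", self.l2), ("L3", self.l3)):
            if prefix in store:
                kv = store.pop(prefix)
                self.put(prefix, kv)  # promote back to L1 on hit
                return tier, kv
        return "miss", None


cache = TieredKVCache(l1_capacity=1, l2_capacity=1)
for name in ("sys-A", "sys-B", "sys-C"):
    cache.put(name, f"kv({name})")
for name in ("sys-C", "sys-B", "sys-A", "sys-D"):
    print(name, "->", cache.get(name)[0])
```

Only the final case, a prefix found in no tier, forces a full prefill. Every other lookup trades recomputation for a progressively slower copy.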

L1 is fastest, but limited by GPU memory. L2 extends local KV capacity into host memory. L3 becomes important when KV cache needs to be shared across workers or nodes to maximize cache hits.

I also tried different speculative decoding strategies to improve latency. The idea is to draft several likely next tokens cheaply, then let the main model verify them. When acceptance is good, you skip multiple expensive decode steps, which improves both latency and throughput. In practice, it becomes another [tuning surface](https://github.com/supa-thibaud/sglang-dry/blob/main/docs/en/hyperparameter_tuning.md): draft length, number of speculative steps, acceptance rate, draft model quality, and workload shape all matter.

## The next scaling lever

Once caching is working well across the serving layer, the next scaling lever is splitting prefill from decode.

I left this out of the initial experiment because it is a bigger infrastructure change than replica stickiness or cache tuning. The serving system has to coordinate separate pools and move KV state efficiently from the prefill side to the decode side.

Prefill is compute-heavy, especially for large contexts. Decode is latency-sensitive and KV-cache-heavy. Keeping them in the same pool means long prefills can interfere with active streams. Separating them lets you scale each side independently: more prefill capacity when prompts get long, more decode capacity when concurrency rises.

Prefill and decode want different scaling knobs.

The practical win is isolation: one user with a huge context window is much less likely to slow down everyone else's quick requests.

## The result

At the end of this setup, the experiment served 600M production tokens with a P50 TTFT under 2 seconds and a generation rate of around 100 tokens per second. I call that a success.

The current pricing wars make it even harder to justify running models yourself purely from an economic perspective. But understanding what it takes to run and [benchmark these models](https://github.com/sgl-project/sglang/blob/main/docs/developer_guide/bench_serving.md) at scale, and what good performance actually means, taught me a lot.

A big shoutout to the SGLang team as well. The software and documentation made it possible to explore this deeply: DP attention, LPM, HiCache, speculative decoding, and the many small serving knobs that matter when running current OSS models in production. Open-source infrastructure like this is what makes serious model serving experimentation possible.

The real lesson: scaling LLM inference is not just adding more GPUs. It is model parallelism, data parallelism, routing, cache hierarchy, speculative decoding, prefill/decode disaggregation, and observability.

If you want high throughput and low TTFT with growing demand, the KV cache has to become a backbone part of your serving architecture.

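## References

- [SGLang DeepSeek V4 cookbook](https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4)
- [SGLang DP/DPA guide](https://docs.sglang.io/docs/advanced_features/dp_dpa_smg_guide)
- [SGLang serving benchmark guide](https://github.com/sgl-project/sglang/blob/main/docs/developer_guide/bench_serving.md)
- [SGLang HiCache design](https://docs.sglang.io/docs/advanced_features/hicache_design#hicache-system-design-and-optimization)
- [Mooncake KV cache](https://kvcache-ai.github.io/Mooncake/)
- [SGLang hyperparameter tuning notes](https://github.com/supa-thibaud/sglang-dry/blob/main/docs/en/hyperparameter_tuning.md)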