The Redundant-Computation Problem in LLM Inference
Independent agents querying LLMs recompute identical or near-identical answers from scratch. KV cache, prompt cache, and semantic cache attack the redundancy at different layers — with sharply different pricing, hit-rate profiles, and tolerance for paraphrase.
Every LLM inference call from an independent agent effectively pays full price for tokens that may have been processed billions of times before. When ten agents in different sessions ask "what is the capital of France", or ten thousand agents ask variations of the same research question, the model recomputes attention from scratch each time. The economic and energy footprint of this redundancy is now large enough that providers have built three distinct architectural layers to attack it, each operating at a different level of the stack.
The KV cache sits inside the transformer itself. During autoregressive decoding the model stores the key and value tensors for tokens it has already processed, so each new token needs only its own query, key, and value computed and then attends over the cached entries, rather than recomputing the whole sequence. KV caching is scoped to a single request and lives in GPU memory; it accelerates generation within that request but is discarded when the request ends.
Prompt caching extends this idea across requests: when a later request shares a byte-identical prefix with an earlier one, providers reuse the cached internal state. Anthropic prices cache reads at roughly 10% of normal input cost (a 90% discount), and OpenAI applies an automatic 50% discount on prefixes of 1,024+ tokens. The catch is that the match must be exact: change one whitespace character and everything from that point onward is a cache miss.
Semantic cache layers like GPTCache sit above the API entirely. They embed each query as a vector, search a vector index of prior queries by cosine similarity, and return the stored response if a neighbor is close enough. This is the only layer that survives paraphrase: "capital of France?" and "France's capital city" can hit. Reported hit rates on production agent workloads cluster in the 30–70% range, and some narrow FAQ deployments report above 90%. The weakness is that a single uniform similarity threshold produces false positives in dense semantic regions, where distinct questions embed close together.
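The KV-cache mechanism described above can be illustrated with a minimal sketch. It assumes a single attention head over toy two-dimensional vectors, and the scalar "weights" `W_Q`, `W_K`, `W_V` and the `project` helper are hypothetical stand-ins for learned projection matrices. The aim is only to show that caching keys and values produces the same outputs with far fewer projections.

```python
import math

W_Q, W_K, W_V = 0.5, 1.5, 2.0  # hypothetical scalar stand-ins for learned matrices

def project(vector, weight):
    """Toy projection: rescale the vector (a real model multiplies by a matrix)."""
    return [weight * x for x in vector]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def decode_without_cache(token_vectors):
    """Re-project keys and values for the entire prefix at every step."""
    outputs, projections = [], 0
    for step in range(len(token_vectors)):
        prefix = token_vectors[: step + 1]
        keys = [project(x, W_K) for x in prefix]
        values = [project(x, W_V) for x in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(project(prefix[-1], W_Q), keys, values))
    return outputs, projections

def decode_with_cache(token_vectors):
    """Project each token's key and value once, then reuse them on later steps."""
    keys, values = [], []  # this pair of lists is the KV cache
    outputs, projections = [], 0
    for x in token_vectors:
        keys.append(project(x, W_K))
        values.append(project(x, W_V))
        projections += 2
        outputs.append(attend(project(x, W_Q), keys, values))
    return outputs, projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
full_out, full_cost = decode_without_cache(tokens)
cached_out, cached_cost = decode_with_cache(tokens)
```

For these four tokens the uncached path performs 20 key/value projections and the cached path performs 8, with identical attention outputs. The uncached count grows quadratically with sequence length while the cached count grows linearly, which is the saving the KV cache buys within one request.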
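The exact-prefix rule for prompt caching, and the pricing gap it creates, can be made concrete with a little arithmetic. The sketch below is a simplification under stated assumptions: tokens are plain integers, `read_fraction` is the price of a cached token relative to an uncached one (0.10 mirrors the roughly 10% figure quoted above), and `min_prefix` is a hypothetical knob for a provider's minimum cacheable length. Real services also apply expiry windows and block granularity that are not modelled here.

```python
def shared_prefix_len(cached_tokens, incoming_tokens):
    """Count how many leading tokens match exactly."""
    count = 0
    for cached, incoming in zip(cached_tokens, incoming_tokens):
        if cached != incoming:
            break  # the first mismatch ends the reusable prefix
        count += 1
    return count

def billed_input_units(cached_tokens, incoming_tokens, read_fraction, min_prefix=0):
    """Input cost, measured in units of one uncached token's price."""
    hit = shared_prefix_len(cached_tokens, incoming_tokens)
    if hit < min_prefix:
        hit = 0  # prefix too short to be cached at all
    miss = len(incoming_tokens) - hit
    return hit * read_fraction + miss

system_prompt = list(range(2000))  # stand-in for a 2,000-token shared prefix
question = [9001, 9002]

repeat = system_prompt + question
# Same request with one token changed at position 100 (e.g. an edited whitespace token).
edited = system_prompt[:100] + [-1] + system_prompt[101:] + question

repeat_cost = billed_input_units(system_prompt, repeat, read_fraction=0.10)
edited_cost = billed_input_units(system_prompt, edited, read_fraction=0.10)
cold_cost = billed_input_units([], repeat, read_fraction=0.10)
edited_cost_with_minimum = billed_input_units(
    system_prompt, edited, read_fraction=0.10, min_prefix=1024
)
```

Under these assumptions the exact repeat is billed at about 202 token-equivalents against 2,002 for a cold request, while the one-token edit at position 100 is billed at about 1,912 because only the first 100 tokens still match. With a 1,024-token minimum the edited request falls back to the full 2,002.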
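A semantic cache of the kind described above can be sketched in a few lines. This is not GPTCache's API: the `SemanticCache` class and its methods are hypothetical, the `embed` function is a toy bag-of-words stand-in for a real sentence-embedding model, and a linear scan replaces the vector index. The sketch is meant to show the lookup logic and the threshold trade-off only.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    """Hypothetical cache: return a stored response if a prior query is close enough."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        vector = embed(query)
        best_score, best_response = 0.0, None
        for stored_vector, response in self.entries:  # linear scan, not a real index
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

cache = SemanticCache(threshold=0.5)
cache.put("capital of France?", "Paris")

paraphrase_hit = cache.get("France's capital city")   # similar enough: returns "Paris"
false_positive = cache.get("capital of Germany?")     # also returns "Paris" (wrong)
unrelated_miss = cache.get("how tall is Mount Everest")  # no overlap: returns None

strict = SemanticCache(threshold=0.9)
strict.put("capital of France?", "Paris")
strict_paraphrase = strict.get("France's capital city")  # now a miss: returns None
```

The loose threshold catches the paraphrase but also serves "Paris" for the Germany question, which is the false-positive problem in dense regions; the strict threshold removes that error at the cost of missing the paraphrase. A production embedding model would separate these cases better than word counts do, but the same tension between hit rate and correctness applies.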
RAG is arguably a fourth layer: externalizing knowledge into a retrieval index lets the model itself stay small and stateless, sidestepping much redundant generation. The cost asymmetry is unforgiving. At the prompt-cache layer a repeated prefix is nearly free, while a paraphrase is a cold path. Counter-arguments to aggressive caching are real: per-user personalization, freshness requirements on time-sensitive answers, and the risk that a semantic cache returns a stale or subtly wrong response without ever invoking the model. The architectural trend is clear nonetheless: future inference economics will be dominated by who computed the answer first, not who asked second. See Prompt Caching in LLMs: How Reusing Context Cuts Cost and Latency and RAG (Retrieval-Augmented Generation): How LLMs Access External Knowledge for adjacent context, and Energy Cost Per LLM Query for the externality being amortized.