How does prompt caching work?
Many LLM applications send the same large block of text at the start of every request — a detailed system prompt, formatting instructions, few-shot examples, or a chunk of retrieved context. Normally the model reprocesses all of it on every call, paying in both tokens and time.
Prompt caching stores the processed representation of that stable prefix. On subsequent calls that begin with exactly the same prefix, the provider reuses the cached work and only processes the new tokens, such as the user's actual question. Providers typically bill cached input tokens at a discount and skip recomputing them, so the effect is lower cost and faster responses, with no change to the output.
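To make the mechanism concrete, here is a minimal toy sketch of the idea, not any provider's actual implementation. The class `ToyPrefixCache`, its `run` method, and the whitespace "tokenizer" are all hypothetical names invented for illustration; a real provider caches the model's internal state for the prefix rather than a token list.

```python
import hashlib


class ToyPrefixCache:
    """Illustrative stand-in for a provider-side prompt cache (hypothetical)."""

    def __init__(self):
        # Maps a hash of the prefix text to its "processed" form.
        # Here the processed form is just a token list; this is an assumption
        # made to keep the sketch runnable.
        self._store = {}

    @staticmethod
    def _tokens(text):
        # Whitespace split as a crude stand-in for a real tokenizer.
        return text.split()

    def run(self, prefix, question):
        """Simulate one request and report how many tokens needed fresh work."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        hit = key in self._store
        if not hit:
            # Cache miss: "process" the prefix and remember the result.
            self._store[key] = self._tokens(prefix)
        prefix_tokens = self._store[key]
        question_tokens = self._tokens(question)
        fresh = len(question_tokens)
        if not hit:
            fresh += len(prefix_tokens)
        return {
            "cache_hit": hit,
            "cached_tokens": len(prefix_tokens) if hit else 0,
            "fresh_tokens": fresh,
        }


cache = ToyPrefixCache()
system_prompt = "You are a helpful assistant. Answer briefly."
first = cache.run(system_prompt, "What is caching?")
second = cache.run(system_prompt, "How does it save money?")
changed = cache.run(system_prompt + " Be formal.", "What is caching?")
```

In this sketch the first call processes the prefix and the question, the second call only processes the question, and the third call misses because the prefix text changed, which mirrors the exact-prefix requirement described above.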
Prompt caching only helps when requests share a repeated prefix. The savings scale with how large that prefix is and how often it is reused, which is why it pairs naturally with system prompts and shared context.
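A rough back-of-the-envelope model shows how the savings depend on prefix size. This is a sketch under stated assumptions: the function `input_cost`, the unit price, and the `cached_fraction` of 0.25 are hypothetical placeholders rather than any provider's real pricing, and it ignores cache expiry and any extra charge for writing to the cache.

```python
def input_cost(prefix_tokens, new_tokens, calls, price_per_token, cached_fraction):
    """Return (cost_without_cache, cost_with_cache) for `calls` requests.

    cached_fraction is the assumed share of the normal price paid for
    cached prefix tokens (for example 0.25 means a 75% discount).
    """
    if calls <= 0:
        return 0.0, 0.0
    without = calls * (prefix_tokens + new_tokens) * price_per_token
    # The first call processes the prefix at full price and fills the cache.
    first = (prefix_tokens + new_tokens) * price_per_token
    # Every later call pays the reduced rate on the prefix, full rate on new tokens.
    rest = (calls - 1) * (prefix_tokens * cached_fraction + new_tokens) * price_per_token
    return without, first + rest


def savings_ratio(without, with_cache):
    """Fraction of input cost saved by caching."""
    return 1 - with_cache / without


# Hypothetical workloads: same traffic, different prefix sizes.
big_without, big_with = input_cost(2000, 100, 1000, 1.0, 0.25)
small_without, small_with = input_cost(200, 100, 1000, 1.0, 0.25)
big_saving = savings_ratio(big_without, big_with)
small_saving = savings_ratio(small_without, small_with)
```

Under these assumed numbers, the workload with the 2,000-token prefix saves a noticeably larger share of its input cost than the one with the 200-token prefix, because more of each request is covered by the cached portion.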
Prompt caching vs semantic caching
These two are easy to confuse but solve different problems.
| | Prompt caching | Semantic caching |
|---|---|---|
| Caches | A repeated prompt prefix | A full response to a query |
| Match type | Exact prefix match | Semantically similar query |
| Saves | Reprocessing the prefix | The entire model call |
| Best for | Stable system prompts, shared context | Repeated or near-duplicate questions |
Prompt caching speeds up the calls you still make; semantic caching avoids making the call at all when a similar question was already answered. They compose well — cache responses for repeats, cache prefixes for everything else.
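The way the two layers compose can be sketched as follows. This is a toy illustration under loud assumptions: `ToySemanticCache`, `answer`, and `call_model` are hypothetical names, and word-overlap (Jaccard) similarity stands in for the embedding-based similarity a real semantic cache would normally use. The `call_model` callable represents a request that would still benefit from prompt caching on the provider side.

```python
def jaccard(a, b):
    """Word-overlap similarity; a crude stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


class ToySemanticCache:
    """Stores full responses and returns one for a similar enough query."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold  # assumed cut-off, would need tuning in practice
        self.entries = []           # list of (query, response) pairs

    def lookup(self, query):
        best = max(self.entries, key=lambda e: jaccard(query, e[0]), default=None)
        if best is not None and jaccard(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, query, response):
        self.entries.append((query, response))


def answer(query, semantic_cache, call_model):
    """Try the semantic cache first; otherwise make the (prefix-cached) model call."""
    cached = semantic_cache.lookup(query)
    if cached is not None:
        return cached, "semantic_cache"
    response = call_model(query)  # this call is where prompt caching would apply
    semantic_cache.store(query, response)
    return response, "model"


# Usage with a fake model so the sketch runs without any network access.
model_calls = []

def fake_model(query):
    model_calls.append(query)
    return "answer to: " + query

sc = ToySemanticCache()
r1 = answer("how does prompt caching work", sc, fake_model)
r2 = answer("how does prompt caching work exactly", sc, fake_model)
r3 = answer("what is distillation", sc, fake_model)
```

Here the near-duplicate second question is served from the semantic cache without a model call, while the unrelated third question falls through to the model, which is the call a prompt cache would then make cheaper.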
Where it fits in cost optimisation
Prompt caching is a low-risk, high-leverage first move in cost work because it changes nothing about output quality; it only removes redundant computation. In a broader effort it sits alongside cost attribution, cheaper-model routing, and distillation. For one high-volume voice AI platform, Prodinit combined this kind of systematic cost engineering with model distillation to cut inference spend by 70% with no quality loss.