RAG vs CAG: a decision framework for AI knowledge retrieval
Executive overview
AI models have a fixed training cutoff and no access to private data — so developers use external retrieval to fill the gap. RAG (retrieval-augmented generation) pulls relevant chunks from a vector database at query time. CAG (context/cache-augmented generation) loads the full knowledge base into the model's context window upfront.
When the data fits in the context window, CAG is generally more accurate, faster, and simpler than RAG. The default today should therefore be CAG, with RAG reserved for data sets that are too large, change too frequently, or require citations.
How RAG works
- Documents are embedded into a vector database as numerical representations.
- At query time, the user's question is embedded and matched against the database.
- The closest matching chunks are combined with the query and sent to the LLM.
- Limitations: chunking can break reasoning across related sections; retrieval errors are common.
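The retrieval steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list `CHUNKS` stands in for a vector database, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_rag_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Combine the retrieved chunks with the question, ready to send to an LLM."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base, already split into chunks.
CHUNKS = [
    "Refunds are accepted within 30 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Gift cards cannot be refunded or exchanged.",
]

print(build_rag_prompt("How many days does shipping take?", CHUNKS, k=1))
```

Only the top-ranked chunks reach the model, which is why a poor chunk boundary or a ranking miss can hide relevant material from it.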
How CAG works
- Documents are loaded into the context window once, and the model's KV cache (the stored attention keys and values for those tokens) is reused across queries.
- The model reasons across the entire data set in one block, not chunks.
- After the initial upload, repeated queries against the same data cost ~90% less due to cache hits.
- Requires a large enough context window to hold the full data set.
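As a rough sketch of the CAG side, the whole corpus becomes one stable prompt prefix and only the question changes between calls. The 4-characters-per-token estimate, the 0.75 budget default, and every function name here are assumptions for illustration; a real system would use the provider's tokenizer and its own caching API.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def build_cag_prefix(documents: list[str]) -> str:
    """Join all documents in a fixed order so the prefix is identical on every call.

    Prefix caches generally only hit when the leading tokens match exactly,
    so a deterministic ordering matters.
    """
    return "\n\n".join(sorted(documents))


def fits(prefix: str, context_window: int, budget: float = 0.75) -> bool:
    """Check that the corpus stays within a share of the context window."""
    return estimate_tokens(prefix) <= budget * context_window


def build_cag_prompt(prefix: str, question: str) -> str:
    """Append the question after the cached corpus; only this tail changes."""
    return f"{prefix}\n\nQuestion: {question}"


# Hypothetical policy documents.
docs = ["Travel policy: economy class only.", "Expense policy: receipts required."]
prefix = build_cag_prefix(docs)

if fits(prefix, context_window=128_000):
    print(build_cag_prompt(prefix, "Can I fly business class?"))
```

Keeping the question at the end, after the corpus, is what leaves the shared prefix untouched from one query to the next.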
When to use RAG
- Knowledge base exceeds ~75% of the model's context window (e.g. 25M tokens for 50k product SKUs).
- Data updates daily or more often (for example, hourly inventory or daily release notes).
- Citations or source attribution are required in responses.
- Cost of repeated full-context reloads on data changes is prohibitive.
When to use CAG
- Data set fits comfortably within the context window (e.g. 80k tokens of policy documents).
- Data is relatively static — monthly or quarterly updates.
- Sub-second latency matters; CAG averages ~0.4s vs ~2.5s for RAG after warm-up.
- Complex multi-step reasoning across the full corpus is required.
- Simpler architecture is preferred — CAG requires no vector DB, no chunking strategy, no hybrid search.
Decision tree summary
- Data size — does it exceed 75% of the context window? If yes → RAG.
- Update frequency — updated daily or more? If yes → RAG.
- Attribution — are source citations required? If yes → RAG.
- Latency — sub-second responses needed? If yes → CAG.
- Reasoning — must the model reason across the full corpus? If yes → CAG.
- Simplicity / accuracy — prefer fewer moving parts and lower retrieval error? → CAG.
- If none of the above are decisive → evaluate on cost and expected query volume.
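The checklist above translates directly into a small function. This is one possible encoding, evaluated in the same order as the list; the `Workload` fields and the `choose` name are hypothetical, and the 1M-token window in the usage example is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    corpus_tokens: int
    context_window: int
    updated_daily_or_more: bool = False
    needs_citations: bool = False
    needs_subsecond_latency: bool = False
    needs_full_corpus_reasoning: bool = False
    prefers_simplicity: bool = False


def choose(w: Workload) -> str:
    """Walk the decision tree top to bottom and return the first decisive answer."""
    # Constraints that force RAG come first.
    if w.corpus_tokens > 0.75 * w.context_window:
        return "RAG"
    if w.updated_daily_or_more:
        return "RAG"
    if w.needs_citations:
        return "RAG"
    # Otherwise, any of these preferences points to CAG.
    if w.needs_subsecond_latency:
        return "CAG"
    if w.needs_full_corpus_reasoning:
        return "CAG"
    if w.prefers_simplicity:
        return "CAG"
    return "evaluate on cost and expected query volume"


# The two clear-cut worked examples, assuming a 1M-token context window.
catalogue = Workload(25_000_000, 1_000_000, updated_daily_or_more=True)
policy_bot = Workload(80_000, 1_000_000, needs_subsecond_latency=True)

print(choose(catalogue))
print(choose(policy_bot))
```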
Cost breakdown
- CAG upfront cost is higher — the full data set is tokenised and cached on first load.
- CAG per-query cost drops ~90% after the first load due to cache pricing (Gemini, OpenAI, Claude all offer cached-token discounts).
- RAG gets no such discount: the retrieved chunks change from question to question, so they are billed at the full, uncached input rate every time.
- CAG is cheaper overall when data is static and query volume is high; RAG is cheaper when data changes constantly.
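To compare the two for a specific workload, a simplified input-token cost model can help. The sketch below is an assumption-laden approximation: it counts input tokens only and ignores output tokens, cache storage fees, embedding costs, and vector database hosting. The 90% discount is the approximate figure quoted above, and the function names are hypothetical.

```python
def cag_cost(corpus_tokens: int, queries: int, reloads: int,
             price_per_token: float, cached_discount: float = 0.90) -> float:
    """Full-price (re)loads of the corpus plus discounted cached reads per query."""
    load = reloads * corpus_tokens * price_per_token
    per_query = corpus_tokens * price_per_token * (1 - cached_discount)
    return load + queries * per_query


def rag_cost(chunk_tokens_per_query: int, queries: int,
             price_per_token: float) -> float:
    """Retrieved chunks are billed at the full, uncached rate on every query."""
    return queries * chunk_tokens_per_query * price_per_token


def breakeven_chunk_tokens(corpus_tokens: int, cached_discount: float = 0.90) -> float:
    """Chunk tokens per query above which a cached CAG query is the cheaper one."""
    return corpus_tokens * (1 - cached_discount)


# Costs in token units (price_per_token=1.0) so the ratios are easy to read.
static = cag_cost(80_000, queries=1_000, reloads=1, price_per_token=1.0)
volatile = cag_cost(80_000, queries=1_000, reloads=30, price_per_token=1.0)
print(round(static), round(volatile))
print(round(breakeven_chunk_tokens(80_000)))
```

The `reloads` term is what makes frequently changing data expensive for CAG: every change invalidates the cache and triggers another full-price load of the corpus.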
Worked examples
- E-commerce catalogue (50k SKUs, daily updates): ~25M tokens — exceeds any context window; data changes daily. Use RAG.
- Internal policy chatbot (80k tokens, quarterly updates): fits easily in most models; static data; ~0.4s response vs 2.5s with RAG. Use CAG.
- Hybrid knowledge base (400k tokens, weekly updates): split the corpus — load the 200k evergreen tokens into CAG; use RAG to retrieve volatile documents (e.g. release notes) on demand. This RAG-to-cache pipeline is powerful but complex — only adopt it if necessary.
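The hybrid case can be sketched as a prompt with a fixed, cacheable front and a small retrieved tail. This is an illustrative outline only: the keyword-overlap `retrieve` function stands in for a real vector search, and all names and sample documents are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Stand-in retriever: rank volatile documents by keyword overlap."""
    q = words(question)
    return sorted(docs, key=lambda d: len(q & words(d)), reverse=True)[:k]


def build_hybrid_prompt(evergreen_prefix: str, volatile_docs: list[str],
                        question: str, k: int = 1) -> str:
    """Stable evergreen corpus first, retrieved volatile documents after it.

    Putting the unchanging part first keeps the prefix identical across
    queries, so a prefix cache can still be reused.
    """
    recent = "\n".join(retrieve(question, volatile_docs, k))
    return (f"{evergreen_prefix}\n\n"
            f"Recent documents:\n{recent}\n\n"
            f"Question: {question}")


EVERGREEN = "Product manual: the app supports offline mode."
RELEASE_NOTES = [
    "Release 2.1 adds dark mode.",
    "Release 2.0 fixed login timeouts.",
]

print(build_hybrid_prompt(EVERGREEN, RELEASE_NOTES, "What does release 2.1 add?"))
```

Even in this toy form, the pipeline has two data paths to keep in sync, which is the added complexity the worked example warns about.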
Future signals to watch
- Context window size: models have grown from ~26k to 1M+ tokens (Gemini 2.5 Pro) and 10M (Llama 4 Scout) in under a year, and the trend continues.
- In-context retrieval performance: track benchmarks like Fiction.LiveBench, which tests reasoning across long, multi-chapter documents. OpenAI's o3 now scores near 100% up to 120k tokens; Gemini 2.5 Pro is close behind.
- Token cost — models are ~99% cheaper and significantly smarter than 18 months ago. As cost drops and context grows, CAG will absorb an increasing share of RAG use cases.
The long-run direction is clear: as context windows expand and retrieval quality inside the model improves, CAG will replace RAG for most workloads. RAG is largely a workaround for today's model limitations. Track size, performance, and cost — when those signals shift, move your workloads to CAG.