RAG vs CAG: a decision framework for AI knowledge retrieval

Executive overview

AI models have a fixed training cutoff and no access to private data, so developers use external retrieval to fill the gap. RAG (retrieval-augmented generation) pulls relevant chunks from a vector database at query time. CAG (cache-augmented generation, sometimes called context-augmented generation) loads the full knowledge base into the model's context window upfront.

CAG consistently outperforms RAG in accuracy, speed, and simplicity, but only when the data fits in the context window. Make CAG the default today, and fall back to RAG only when constraints force it: large, frequently changing, or citation-critical data sets.

How RAG works

  • Documents are embedded into a vector database as numerical representations.
  • At query time, the user's question is embedded and matched against the database.
  • The closest matching chunks are combined with the query and sent to the LLM.
  • Limitations: chunking can break reasoning across related sections; retrieval errors are common.
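
The steps above can be sketched in a few lines. This is a toy illustration, not a production pipeline: `embed` is a hypothetical stand-in that counts words, where a real system would call an embedding model and query a vector database.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Combine the retrieved chunks with the question, ready to send to an LLM."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Shipping to Europe takes 5 to 7 business days.",
    "Gift cards cannot be refunded or exchanged.",
]
top = retrieve("How long do refunds take?", chunks, k=1)
prompt = build_prompt("How long do refunds take?", chunks, k=1)
```

Note that the gift-card chunk is never seen by the model here, even though it also concerns refunds. That is the chunking limitation in miniature: anything the retriever misses is invisible to the LLM.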

How CAG works

  • Documents are loaded into the context window once, and the model's KV cache (the stored attention keys and values for those tokens) is kept so that later queries can reuse it.
  • The model reasons across the entire data set in one block, not chunks.
  • After the initial upload, repeated queries against the same data can cost up to ~90% less on input tokens due to cache hits, depending on the provider's cached-token pricing.
  • Requires a large enough context window to hold the full data set.
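
A minimal sketch of the CAG pattern follows. The helper names are hypothetical, and the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer. The point it illustrates is that the corpus forms one fixed prefix, identical on every request, so a provider's prompt cache can reuse it; the caching call itself differs by provider and is not shown.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)

def build_prefix(documents: dict[str, str]) -> str:
    """Join all documents in a stable (sorted) order so the prefix is identical across requests."""
    return "\n\n".join(f"## {name}\n{documents[name]}" for name in sorted(documents))

def build_request(prefix: str, question: str, context_window: int) -> str:
    """Put the fixed corpus first and the changing question last."""
    prompt = f"{prefix}\n\nQuestion: {question}"
    if estimate_tokens(prompt) > context_window:
        raise ValueError("Corpus does not fit in the context window; CAG is not viable here.")
    return prompt

docs = {
    "leave_policy": "Employees accrue 2 days of leave per month.",
    "expense_policy": "Expenses above 500 USD need manager approval.",
}
prefix = build_prefix(docs)
r1 = build_request(prefix, "How much leave do I accrue?", context_window=8_000)
r2 = build_request(prefix, "Who approves large expenses?", context_window=8_000)
```

Both requests begin with exactly the same text, which is the property cache hits generally depend on. Reordering or editing any document changes the prefix and forces a fresh, full-price load.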

When to use RAG

  • Knowledge base exceeds ~75% of the model's context window (e.g. 25M tokens for 50k product SKUs).
  • Data changes daily or more often (for example, hourly inventory or daily release notes).
  • Citations or source attribution are required in responses.
  • Cost of repeated full-context reloads on data changes is prohibitive.
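
The size test in the first bullet is simple arithmetic. The sketch below assumes a 1M-token window purely for illustration; the 75% threshold and the 25M-token, 50k-SKU figures come from the text above.

```python
def exceeds_budget(corpus_tokens: int, context_window: int, threshold: float = 0.75) -> bool:
    """True when the corpus is larger than the usable share of the window."""
    return corpus_tokens > context_window * threshold

skus = 50_000
catalogue_tokens = 25_000_000
tokens_per_sku = catalogue_tokens // skus   # average tokens per SKU
window = 1_000_000                          # assumed window size, for illustration only
budget = int(window * 0.75)                 # usable tokens under the 75% rule

catalogue_too_big = exceeds_budget(catalogue_tokens, window)
policy_docs_too_big = exceeds_budget(80_000, window)
```

Under these assumptions the catalogue averages 500 tokens per SKU and overshoots the 750k-token budget by more than 30 times, while an 80k-token policy set fits with room to spare.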

When to use CAG

  • Data set fits comfortably within the context window (e.g. 80k tokens of policy documents).
  • Data is relatively static — monthly or quarterly updates.
  • Sub-second latency matters; CAG averages ~0.4s vs ~2.5s for RAG after warm-up.
  • Complex multi-step reasoning across the full corpus is required.
  • Simpler architecture is preferred — CAG requires no vector DB, no chunking strategy, no hybrid search.

Decision tree summary

  1. Data size — does it exceed 75% of the context window? If yes → RAG.
  2. Update frequency — updated daily or more? If yes → RAG.
  3. Attribution — are source citations required? If yes → RAG.
  4. Latency — sub-second responses needed? If yes → CAG.
  5. Reasoning — must the model reason across the full corpus? If yes → CAG.
  6. Simplicity / accuracy — prefer fewer moving parts and lower retrieval error? → CAG.
  7. If none of the above are decisive → evaluate on cost and expected query volume.
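
The checklist above can be written as a small function that applies the checks in the listed order and returns at the first decisive one. The `Workload` fields and the `choose` name are hypothetical, and the window sizes in the usage lines are assumptions, not figures from any particular model.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    corpus_tokens: int
    context_window: int
    updates_per_week: float
    needs_citations: bool = False
    needs_subsecond_latency: bool = False
    needs_full_corpus_reasoning: bool = False
    prefers_simplicity: bool = False

def choose(w: Workload) -> str:
    """Walk the checks in order; the first decisive one wins."""
    if w.corpus_tokens > 0.75 * w.context_window:   # 1. data size
        return "RAG"
    if w.updates_per_week >= 7:                      # 2. updated daily or more
        return "RAG"
    if w.needs_citations:                            # 3. attribution
        return "RAG"
    if (w.needs_subsecond_latency                    # 4. latency
            or w.needs_full_corpus_reasoning         # 5. reasoning
            or w.prefers_simplicity):                # 6. simplicity / accuracy
        return "CAG"
    return "evaluate cost and query volume"          # 7. nothing decisive

catalogue = choose(Workload(25_000_000, 1_000_000, updates_per_week=7))
policies = choose(Workload(80_000, 200_000, updates_per_week=1 / 13,
                           needs_subsecond_latency=True))
```

Because the RAG checks come first, a workload that needs both citations and sub-second latency resolves to RAG. That ordering reflects the list above: the RAG conditions are hard constraints, while the CAG conditions are preferences.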

Cost breakdown

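  • CAG upfront cost is higher: the full data set is tokenised and cached on first load.
  • CAG per-query cost drops by up to ~90% after the first load thanks to cached-token pricing. Google (Gemini), OpenAI, and Anthropic (Claude) all offer cached-token discounts, though the exact rate varies by provider and model.
  • RAG per-query cost does not fall the same way: new chunks are pushed into the context on every question, so little of the prompt is eligible for cached-token pricing.
  • CAG is cheaper overall when data is static and query volume is high; RAG is cheaper when data changes constantly, because every change forces a full-price reload of the cache.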
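
A back-of-envelope model makes the effect of caching and of data churn concrete. Every number here is an assumption chosen for easy arithmetic: a price of 1.00 per million input tokens, and cached reads billed at 10% of that (the ~90% discount cited above). The model ignores output tokens, cache-write surcharges, and cache storage fees, which real providers may charge.

```python
def cag_input_cost(corpus_tokens: int, queries: int, loads: int = 1,
                   price_per_mtok: float = 1.0, cached_rate: float = 0.10) -> float:
    """Input-token cost of answering `queries` questions over a cached corpus.

    Each load (the first upload, or a reload after the data changes) is billed
    at full price; every other query reads the corpus at the cached rate.
    """
    if loads < 1 or queries < loads:
        raise ValueError("need at least one load and at least as many queries as loads")
    full_price_tokens = loads * corpus_tokens
    cached_tokens = (queries - loads) * corpus_tokens * cached_rate
    return (full_price_tokens + cached_tokens) * price_per_mtok / 1_000_000

static_data = round(cag_input_cost(80_000, 1_000, loads=1), 2)      # one load, then cache hits
churning_data = round(cag_input_cost(80_000, 1_000, loads=100), 2)  # data changes 100 times
no_cache = round(cag_input_cost(80_000, 1_000, loads=1_000), 2)     # every query pays full price
```

With these assumed prices, 1,000 queries over an 80k-token corpus cost about 8.07 when the data never changes, 15.20 when it changes 100 times, and 80.00 with no cache reuse at all. The gap between the first and last figures is the ~90% saving; the middle figure shows how quickly reloads erode it.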

Worked examples

  • E-commerce catalogue (50k SKUs, daily updates): ~25M tokens, or roughly 500 tokens per SKU. This exceeds any current context window, and the data changes daily. Use RAG.
  • Internal policy chatbot (80k tokens, quarterly updates): fits easily in most models; static data; ~0.4s response vs 2.5s with RAG. Use CAG.
  • Hybrid knowledge base (400k tokens, weekly updates): split the corpus. Load the 200k evergreen tokens into CAG, and use RAG to retrieve the remaining volatile documents (e.g. release notes) on demand. This RAG-to-cache pipeline is powerful but complex, so adopt it only if necessary.
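
One way to picture the hybrid case is as a question of prompt ordering: the evergreen block goes first and never changes, so it stays cacheable, while retrieved volatile documents are appended after it. The sketch below is an assumption-laden toy; `overlap` is a hypothetical word-matching score standing in for a real retriever, and the documents are invented.

```python
def overlap(query: str, text: str) -> int:
    """Toy relevance score (assumption): number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def hybrid_prompt(evergreen: str, volatile_docs: list[str], question: str, k: int = 1) -> str:
    """Fixed evergreen block first (cacheable), then retrieved volatile documents, then the question."""
    retrieved = sorted(volatile_docs, key=lambda d: overlap(question, d), reverse=True)[:k]
    return "\n\n".join([evergreen, "Recent documents:", *retrieved, f"Question: {question}"])

evergreen = "Product guide: the export feature writes CSV files."
release_notes = [
    "Release 4.2 adds JSON export.",
    "Release 4.1 fixes a login bug.",
]
prompt = hybrid_prompt(evergreen, release_notes, "Which release adds JSON export?")
```

Placing the volatile material after the evergreen block means a new release note changes only the tail of the prompt, leaving the cached prefix intact.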

Future signals to watch

  • Context window size: models have grown from ~26k to 1M+ tokens (Gemini 2.5 Pro) and 10M (Llama 4 Scout) in under a year; the trend continues.
  • In-context retrieval performance: track benchmarks like Fiction.LiveBench, which tests reasoning across long, multi-chapter documents. o3 now scores near 100% up to 120k tokens; Gemini 2.5 Pro is close behind.
  • Token cost: models are ~99% cheaper and significantly smarter than 18 months ago. As cost drops and context grows, CAG will absorb an increasing share of RAG use cases.

The long-run direction is clear: as context windows expand and retrieval quality inside the model improves, CAG will replace RAG for most workloads. RAG is largely a workaround for today's model limitations. Track size, performance, and cost — when those signals shift, move your workloads to CAG.
