Stop over-engineering AI pipelines: let the model do the work

Executive overview

A developer spent two weeks and ~3,000 lines of code building a multi-stage PDF extraction pipeline, only to discover that a single well-prompted Gemini 2.0 Flash call per page cut costs by more than 99%, reduced processing time for a 50-page document from ~10 minutes to 1–2 minutes, and raised accuracy on unknown formats from ~90% to ~99%. The core lesson: when transforming unstructured data into structured output, test a capable model with a specific prompt before building any surrounding infrastructure. This "micro" bitter lesson mirrors the "macro" bitter lesson articulated by AI researcher Rich Sutton in 2019: human-engineered complexity tends to cap performance, whereas general methods that scale with compute win in the long run. Several techniques that developers rely on today (RAG, fine-tuning, prompt engineering) are likely to lose importance as context windows grow and models become more autonomous.

The over-engineered extraction pipeline

Built and rebuilt roughly five times across different PDF parsing libraries (PyPDF, PyMuPDF, Unstructured.io, LlamaParse).
Final architecture: LlamaParse premium → heavy regex pre-processing → context-aware chunking → Gemini boundary identification → Gemini extraction → post-processing regex → output.
Achieved ~100% accuracy on the training format, but accuracy dropped to ~95% and then ~90% as new PDF layouts appeared.
The brittleness came from hard-coding edge cases; the system could not adapt to format variation.
LlamaParse premium alone cost ~$45 per 1,000 pages processed.
A static fallback system with minimal AI produced poor accuracy when the main pipeline failed.
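
To illustrate the brittleness described above, here is a minimal sketch in Python. The field, the regex, and both sample layouts are hypothetical and not taken from the original pipeline; the point is only that a pattern tuned to one layout silently misses a second layout carrying the same information.

```python
import re
from typing import Optional

# Hypothetical pattern tuned to a single training layout: "Total: $1,234.56"
TOTAL_PATTERN = re.compile(r"^Total:\s*\$([\d,]+\.\d{2})$", re.MULTILINE)


def extract_total(page_text: str) -> Optional[float]:
    """Return the total if the hard-coded pattern matches, else None."""
    match = TOTAL_PATTERN.search(page_text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Layout the pattern was written for (illustrative sample).
known_layout = "Invoice 001\nTotal: $1,234.56\n"
# A new layout with the same amount in a different shape.
new_layout = "Invoice 002\nAmount due (USD)    1,234.56\n"

known_result = extract_total(known_layout)  # the pattern matches this layout
new_result = extract_total(new_layout)      # no match, so the value is lost
```

Each new layout of this kind needs another hand-written pattern, which is how the edge-case list grows while accuracy on unseen formats falls.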

Why initial AI tests looked disappointing

Early tests dropped PDFs into ChatGPT, Claude, Gemini, and Llama and got mediocre results.
The real problem was prompt quality, not model capability.
The prompts lacked specificity: no output format examples, no field definitions, no edge-case instructions.
This false negative led to weeks of unnecessary custom infrastructure.
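
As a rough illustration of what a more specific prompt could look like, the sketch below assembles one from field definitions, an output example, and edge-case rules, the three things the early prompts lacked. The field names, example values, and rules are all hypothetical; the source does not show the prompt that was actually used.

```python
import json

# Hypothetical field definitions for an invoice-style document.
FIELDS = {
    "invoice_number": "Identifier printed near the top of the page, as a string.",
    "issue_date": "Date the document was issued, formatted YYYY-MM-DD.",
    "total": "Final amount due as a number, without currency symbols.",
}

# Hypothetical example of the exact output shape expected from the model.
EXAMPLE_OUTPUT = {"invoice_number": "INV-001", "issue_date": "2024-01-31", "total": 1234.56}

# Hypothetical edge-case instructions.
EDGE_CASES = [
    "If a field is missing from the page, set it to null; do not guess.",
    "If several totals appear, use the one labelled as the amount due.",
    "Return JSON only, with no commentary before or after it.",
]


def build_extraction_prompt() -> str:
    """Assemble a prompt with field definitions, an output example, and rules."""
    field_lines = [f"- {name}: {description}" for name, description in FIELDS.items()]
    rule_lines = [f"- {rule}" for rule in EDGE_CASES]
    return "\n".join(
        [
            "Extract the following fields from the attached PDF page.",
            "",
            "Fields:",
            *field_lines,
            "",
            "Output format (example):",
            json.dumps(EXAMPLE_OUTPUT, indent=2),
            "",
            "Rules:",
            *rule_lines,
        ]
    )


prompt = build_extraction_prompt()
```

Keeping the definitions in plain data structures makes it cheap to add a field or a rule and rerun the test before deciding that the model cannot do the job.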

The simplified AI-first approach

Removed LlamaParse entirely; fed raw PDF pages directly to Gemini 2.0 Flash, one page at a time.
Context-aware chunking was retained, but the model handled extraction, formatting, and structuring.
A small amount of regex was kept only for cosmetic output cleanup.
Total code for the new extraction layer: ~1,200 lines vs. ~3,000 previously; overall codebase roughly halved.
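
The shape of this approach can be sketched in a few lines. This is an assumption-laden outline, not the author's code: `call_model` is a hypothetical wrapper around whichever model client is in use, assumed to take a prompt plus the raw bytes of one PDF page and return the model's text reply, and `fake_model` exists only so the sketch runs offline.

```python
import json
import re
from typing import Callable, Dict, Iterable, List

# Assumed signature: (prompt, raw bytes of one PDF page) -> model's text reply.
ModelCall = Callable[[str, bytes], str]

# Cosmetic cleanup only: strip a Markdown code fence if the model wrapped its JSON in one.
FENCE = "`" * 3
_FENCE_PATTERN = re.compile(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$")


def extract_pages(pages: Iterable[bytes], prompt: str, call_model: ModelCall) -> List[Dict]:
    """Send each page to the model on its own and parse the JSON it returns."""
    records = []
    for page_number, page_bytes in enumerate(pages, start=1):
        reply = call_model(prompt, page_bytes)
        cleaned = _FENCE_PATTERN.sub("", reply.strip())
        record = json.loads(cleaned)
        record["page"] = page_number
        records.append(record)
    return records


def fake_model(prompt: str, page_bytes: bytes) -> str:
    """Stand-in for a real model client; always returns the same fenced JSON."""
    return FENCE + 'json\n{"total": 12.5}\n' + FENCE


results = extract_pages([b"%PDF-page-1", b"%PDF-page-2"], "Extract the total as JSON.", fake_model)
```

All format-specific logic lives in the prompt, so the code around the model stays limited to a loop, a JSON parse, and one cleanup pattern.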

Numbers that changed the decision

Cost: $0.12 vs. ~$45 per 1,000 pages, a reduction of about 99.7%.
Accuracy on unknown/varied PDF formats: ~99% (Gemini) vs. ~90% (custom pipeline).
Latency for 50 pages: ~1–2 minutes (Gemini) vs. ~10 minutes (old pipeline).
The model is also format-agnostic; no new edge cases need to be coded.
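
The comparison can be reproduced from the quoted figures alone. The short calculation below uses only the cost and latency numbers listed above; the variable names are illustrative.

```python
# Figures quoted in the list above; everything else is derived from them.
OLD_COST_PER_1K_PAGES = 45.00  # USD, LlamaParse premium alone
NEW_COST_PER_1K_PAGES = 0.12   # USD, single-model approach

cost_reduction = 1 - NEW_COST_PER_1K_PAGES / OLD_COST_PER_1K_PAGES

# Latency for a 50-page document, in minutes.
OLD_LATENCY_MIN = 10
NEW_LATENCY_RANGE_MIN = (1, 2)

speedup_low = OLD_LATENCY_MIN / NEW_LATENCY_RANGE_MIN[1]   # against the slowest new run
speedup_high = OLD_LATENCY_MIN / NEW_LATENCY_RANGE_MIN[0]  # against the fastest new run

print(f"Cost reduction: {cost_reduction:.1%}")
print(f"Speedup: {speedup_low:.0f}x to {speedup_high:.0f}x")
```

With these inputs the script prints a cost reduction of 99.7% and a speedup of 5x to 10x. The old-pipeline cost counts LlamaParse premium only, so the real gap may be somewhat larger.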

The macro bitter lesson (Rich Sutton, 2019)

Sutton's thesis: human knowledge injected into AI systems provides short-term gains but eventually plateaus and caps model improvement.
Chess, Go, and other benchmark domains all showed the same pattern — stepping back and letting compute scale won decisively.
As Moore's law reduces the cost of compute, general methods that scale with it, namely search and learning, outperform heavily hand-crafted systems.
The practical implication: infrastructure built around today's model limitations will become obsolete as models improve.

Techniques under pressure from scaling

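RAG: being challenged by cache-augmented generation (CAG), which preloads the knowledge base into the model's context instead of retrieving from it; Gemini 2.5 Pro has a 1M-token context window (with 2M announced) and Llama 4 Scout claims 10M, large enough to hold entire knowledge bases directly.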
Fine-tuning: active in-context learning during inference may reduce the need for separate fine-tuning runs.
Guardrails: prompt injection and hallucination mitigations may be absorbed into more capable base models.
Prompt engineering: deep research agents already clarify intent by asking the user questions rather than relying on a perfect initial prompt; the model will increasingly engineer the conversation, not the user.
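
For the RAG item above, one way to make the trade-off concrete is a rough check of whether a knowledge base fits in a given context window at all. The sketch below assumes about 4 characters per token, a common rule of thumb for English text rather than a real tokenizer count; the reserve size and the sample knowledge base are hypothetical.

```python
from typing import List

# Assumption: roughly 4 characters per token for English text.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; a real tokenizer would give a different count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(documents: List[str], context_window: int, reserve: int = 8_000) -> bool:
    """True if all documents fit, leaving `reserve` tokens for the question and answer."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserve <= context_window


# Hypothetical knowledge base: 200 documents of about 20,000 characters each.
knowledge_base = ["x" * 20_000] * 200

small_window = fits_in_context(knowledge_base, context_window=128_000)     # does not fit
large_window = fits_in_context(knowledge_base, context_window=10_000_000)  # fits
```

When the check passes, loading everything into context is at least an option worth testing before building a retrieval layer; when it fails, retrieval or chunking is still needed.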

Practical takeaways

Default to testing an AI model first for any unstructured-to-structured transformation task.
Use a specific, example-rich prompt with explicit output expectations before concluding AI cannot do the job.
Treat surrounding infrastructure (parsing libraries, chunking strategies, regex pipelines) as a last resort, not a first instinct.
Expect today's workarounds — RAG, guardrails, elaborate prompt templates — to shrink in importance as models scale.
Design systems to be thin around the model so they are easy to update when model capabilities improve.
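
Testing the model first is easier with a small, reusable accuracy check. The harness below is a minimal sketch: it scores any extractor against a handful of hand-labelled samples. The `Extractor` and `Sample` shapes are assumptions, and `toy_extractor` is a stand-in for a prompted model call so that the example runs without network access.

```python
from typing import Callable, Dict, List, Tuple

# Assumed shapes: an extractor maps document text to a dict of fields;
# a sample pairs the input with the fields a human marked as correct.
Extractor = Callable[[str], Dict[str, object]]
Sample = Tuple[str, Dict[str, object]]


def field_accuracy(extractor: Extractor, samples: List[Sample]) -> float:
    """Fraction of expected fields that the extractor reproduced exactly."""
    correct = 0
    total = 0
    for document, expected in samples:
        predicted = extractor(document)
        for field, value in expected.items():
            total += 1
            if predicted.get(field) == value:
                correct += 1
    return correct / total if total else 0.0


def toy_extractor(document: str) -> Dict[str, object]:
    """Stand-in for a prompted model call; it only reads 'key=value' pairs."""
    return dict(pair.split("=", 1) for pair in document.split(";") if "=" in pair)


samples = [
    ("id=A1;total=10", {"id": "A1", "total": "10"}),
    ("id=B2", {"id": "B2", "total": "20"}),  # total is absent, so it counts as a miss
]
accuracy = field_accuracy(toy_extractor, samples)  # 3 of 4 fields match
```

Because the harness depends only on the extractor's input and output, the same samples can score a new model or prompt later without touching the rest of the system.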
