Stop over-engineering AI pipelines: let the model do the work
Executive overview
A developer spent two weeks and ~3,000 lines of code building a multi-stage PDF extraction pipeline, only to discover that a single well-prompted Gemini 2.0 Flash call per page cut costs by roughly 99.7%, reduced processing time for 50 pages from ~10 minutes to ~1–2 minutes, and raised accuracy on unknown formats from ~90% to ~99%. The core lesson: when transforming unstructured data into structured output, test a capable model with a specific prompt before building any surrounding infrastructure. This "micro" bitter lesson mirrors the "macro" bitter lesson articulated by AI researcher Rich Sutton in 2019: human-engineered complexity tends to cap performance, whereas general methods that scale with compute consistently win in the long run. Several techniques that developers rely on today (RAG, fine-tuning, guardrails, prompt engineering) are likely to erode in importance as context windows grow and models become more autonomous.
The over-engineered extraction pipeline
- Built and rebuilt roughly five times across different PDF parsing libraries (PyPDF, PyMuPDF, Unstructured.io, LlamaParse).
- Final architecture: LlamaParse premium → heavy regex pre-processing → context-aware chunking → Gemini boundary identification → Gemini extraction → post-processing regex → output.
- Achieved ~100% accuracy on the training format, but accuracy fell to ~95% and then ~90% as new PDF layouts appeared.
- The brittleness came from hard-coding edge cases; the system could not adapt to format variation.
- LlamaParse premium alone cost ~$45 per 1,000 pages processed.
- A static fallback system with minimal AI produced poor accuracy when the main pipeline failed.
Why initial AI tests looked disappointing
- Early tests dropped PDFs into ChatGPT, Claude, Gemini, and Llama and got mediocre results.
- The real problem was prompt quality, not model capability.
- The prompts lacked specificity: no output format examples, no field definitions, no edge-case instructions.
- This false negative led to weeks of unnecessary custom infrastructure.
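The three gaps named above (output format examples, field definitions, edge-case instructions) can be closed mechanically. The following is a minimal sketch, not the author's actual prompt: the `Field` class, the `build_extraction_prompt` helper, and the invoice-style fields are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One field the model should extract (hypothetical helper type)."""
    name: str
    description: str


def build_extraction_prompt(fields, example_output, edge_case_rules):
    """Assemble a prompt that states fields, output format, and edge cases."""
    field_lines = "\n".join(f"- {f.name}: {f.description}" for f in fields)
    rule_lines = "\n".join(f"- {rule}" for rule in edge_case_rules)
    return (
        "Extract the fields below from the attached PDF page.\n\n"
        f"Field definitions:\n{field_lines}\n\n"
        "Return only JSON shaped exactly like this example:\n"
        f"{json.dumps(example_output, indent=2)}\n\n"
        f"Edge cases:\n{rule_lines}\n"
    )


# Assumed example fields; the source does not say what the PDFs contain.
FIELDS = [
    Field("invoice_number", "the identifier printed near the top of the page"),
    Field("total", "the final amount due, as a number without currency symbols"),
]
EXAMPLE = {"invoice_number": "INV-0001", "total": 1234.5}
RULES = [
    "If a field is missing, use null instead of guessing.",
    "If the page contains no matching data, return an empty JSON object.",
]

prompt = build_extraction_prompt(FIELDS, EXAMPLE, RULES)
print(prompt)
```

Keeping the field list, the example, and the rules as data makes it cheap to retest the same model with a sharper prompt before deciding it cannot do the job.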
The simplified AI-first approach
- Removed LlamaParse entirely; fed raw PDF pages directly to Gemini 2.0 Flash, one page at a time.
- Context-aware chunking was retained, but the model handled extraction, formatting, and structuring.
- A small amount of regex was kept only for cosmetic output cleanup.
- Total code for the new extraction layer: ~1,200 lines vs. ~3,000 previously; overall codebase roughly halved.
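The shape of this thinner layer can be sketched as follows. This is an illustration under stated assumptions, not the author's code: `call_model` is a hypothetical callable standing in for whatever client sends a prompt plus one page to the model, no real SDK call is shown, and the model is assumed to reply with a single JSON object per page.

```python
import json
import re
from typing import Callable, Iterable, List

# Hypothetical signature: (prompt, one page as bytes) -> raw model text.
ModelCall = Callable[[str, bytes], str]

# Matches a wrapping Markdown code fence, with an optional "json" tag.
_FENCE = re.compile(r"^`{3}(?:json)?\s*|\s*`{3}$")


def clean_output(text: str) -> str:
    """Cosmetic cleanup only: trim whitespace and strip a wrapping fence."""
    return _FENCE.sub("", text.strip())


def extract_pages(pages: Iterable[bytes], prompt: str,
                  call_model: ModelCall) -> List[dict]:
    """Send each page to the model separately and parse its JSON reply."""
    records = []
    for number, page in enumerate(pages, start=1):
        raw = call_model(prompt, page)
        record = json.loads(clean_output(raw))  # assumes a JSON object
        record["page"] = number  # keep provenance for later review
        records.append(record)
    return records


def fake_model(prompt: str, page: bytes) -> str:
    """Stand-in for a real API call: reports the page size, fenced."""
    fence = "`" * 3
    return fence + "json\n" + json.dumps({"size": len(page)}) + "\n" + fence


records = extract_pages([b"abc", b"de"], "Extract the fields.", fake_model)
print(records)
```

Passing the model in as a callable keeps the extraction loop independent of any one provider, so swapping models means changing one function rather than the pipeline.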
Numbers that changed the decision
- Cost: $0.12 vs. ~$45 per 1,000 pages, a reduction of roughly 99.7%.
- Accuracy on unknown/varied PDF formats: ~99% (Gemini) vs. ~90% (custom pipeline).
- Latency for 50 pages: ~1–2 minutes (Gemini) vs. ~10 minutes (old pipeline).
- The model is also format-agnostic; no new edge cases need to be coded.
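The cost and latency comparisons can be rechecked from the quoted figures. This is a small arithmetic sketch; because the $45 figure is approximate, the result may differ slightly from any quoted percentage.

```python
OLD_COST_PER_1K_PAGES = 45.00  # approximate LlamaParse premium cost, USD
NEW_COST_PER_1K_PAGES = 0.12   # quoted Gemini cost, USD


def percent_reduction(old: float, new: float) -> float:
    """Relative saving of `new` against `old`, as a percentage."""
    return (old - new) / old * 100


cost_cut = percent_reduction(OLD_COST_PER_1K_PAGES, NEW_COST_PER_1K_PAGES)

# Latency for 50 pages: ~10 minutes before, ~1 to 2 minutes after.
speedup_low = 10 / 2
speedup_high = 10 / 1

print(f"cost reduction: {cost_cut:.1f}%")                      # 99.7%
print(f"speedup: {speedup_low:.0f}x to {speedup_high:.0f}x")   # 5x to 10x
```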
The macro bitter lesson (Rich Sutton, 2019)
- Sutton's thesis: human knowledge injected into AI systems provides short-term gains but eventually plateaus and caps model improvement.
- Chess, Go, speech recognition, and computer vision all showed the same pattern: stepping back and letting compute scale won decisively over hand-encoded human knowledge.
- As Moore's law reduces the cost of compute, general methods built on search and learning outperform heavily hand-crafted systems.
- The practical implication: infrastructure built around today's model limitations will become obsolete as models improve.
Techniques under pressure from scaling
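- RAG: being challenged by cache-augmented generation (CAG), which loads the knowledge base into the context window up front instead of retrieving chunks per query; Gemini 2.5 Pro supports a 1M-token context window (with 2M announced) and Llama 4 Scout claims 10M, large enough to hold entire knowledge bases directly.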
- Fine-tuning: active in-context learning during inference may reduce the need for separate fine-tuning runs.
- Guardrails: prompt injection and hallucination mitigations may be absorbed into more capable base models.
- Prompt engineering: deep research agents already clarify intent by asking the user questions rather than relying on a perfect initial prompt; the model will increasingly engineer the conversation, not the user.
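Whether a knowledge base can skip retrieval comes down to a token budget. The sketch below is a rough illustration only: the 4-characters-per-token ratio is a common heuristic rather than a real tokenizer, and `choose_strategy` and `reserve_tokens` are hypothetical names.

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English text; an assumption


def estimate_tokens(text: str) -> int:
    """Approximate token count; a real tokenizer would be more accurate."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def choose_strategy(documents, context_window_tokens, reserve_tokens=8_000):
    """Pick 'full_context' if the whole corpus fits, else 'retrieval'.

    reserve_tokens leaves room for the question and the model's answer.
    """
    budget = context_window_tokens - reserve_tokens
    total = sum(estimate_tokens(doc) for doc in documents)
    strategy = "full_context" if total <= budget else "retrieval"
    return strategy, total


corpus = ["a" * 4_000] * 10  # ten documents of about 1,000 tokens each
print(choose_strategy(corpus, context_window_tokens=1_000_000))
print(choose_strategy(corpus, context_window_tokens=16_000))
```

With the larger window the whole corpus fits and retrieval becomes optional; with the smaller one the same corpus still needs it.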
Practical takeaways
- Default to testing an AI model first for any unstructured-to-structured transformation task.
- Use a specific, example-rich prompt with explicit output expectations before concluding AI cannot do the job.
- Treat surrounding infrastructure (parsing libraries, chunking strategies, regex pipelines) as a last resort, not a first instinct.
- Expect today's workarounds — RAG, guardrails, elaborate prompt templates — to shrink in importance as models scale.
- Design systems to be thin around the model so they are easy to update when model capabilities improve.
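One way to make "test the model first" routine is a small golden-set check that any extractor can be run through, whether a hand-built pipeline or a model call. The sketch below is a hypothetical harness: the sample documents are invented, and `flexible_extractor` is only a stub standing in for a model, not a real one.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: document text in, extracted fields out.
Extractor = Callable[[str], Dict[str, str]]


def field_accuracy(extractor: Extractor,
                   golden: List[Tuple[str, Dict[str, str]]]) -> float:
    """Share of expected fields that the extractor reproduces exactly."""
    correct = total = 0
    for document, expected in golden:
        actual = extractor(document)
        for key, value in expected.items():
            total += 1
            if actual.get(key) == value:
                correct += 1
    return correct / total if total else 0.0


def brittle_extractor(document: str) -> Dict[str, str]:
    """Hard-coded for one layout: 'key=value;key=value'."""
    parts = (p.split("=", 1) for p in document.split(";") if "=" in p)
    return {key.strip(): value.strip() for key, value in parts}


def flexible_extractor(document: str) -> Dict[str, str]:
    """Stub standing in for a model call; tolerates '=' or ':' separators."""
    return dict(re.findall(r"(\w+)\s*[=:]\s*(\w+)", document))


# Invented samples: one known layout and one unseen layout.
GOLDEN = [
    ("id=7;total=10", {"id": "7", "total": "10"}),
    ("total: 12 / id: 9", {"id": "9", "total": "12"}),
]

print(field_accuracy(brittle_extractor, GOLDEN))   # 0.5
print(field_accuracy(flexible_extractor, GOLDEN))  # 1.0
```

Because the harness only depends on the `Extractor` signature, a newer model can be dropped in and measured against the same golden set without touching the rest of the system.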