How GPT-4.1 changes prompting best practices
Executive overview
Newer models like GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 have made many legacy prompting tricks obsolete. Better instruction-following means you no longer need all-caps yelling, emotional manipulation, or convoluted workarounds to get reliable behaviour. OpenAI's accompanying prompt guide codifies a cleaner system prompt structure and signals a broader convergence on XML delimiters across providers.
The core shift: prompting is becoming less about coaxing models and more about clearly structured specification.
What older tricks you can drop
- All-caps emphasis, threats, and bribery prompts are no longer needed — models follow plain instructions reliably.
- Negative instructions ("never do X", "do not include Y") are now safe to use; older models would sometimes do the forbidden thing simply because the prompt mentioned it.
- For RAG setups, a simple "say I don't know if the answer isn't in the database" now reliably prevents hallucination without elaborate grounding workarounds.
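The plain RAG fallback instruction described above could look something like the sketch below. The function name `build_rag_prompt`, the wording of the instruction, and the passage numbering are illustrative assumptions, not text taken from OpenAI's guide.

```python
# Hypothetical wording for the fallback instruction; adjust to your own setup.
FALLBACK_INSTRUCTION = (
    "Answer the question using only the passages below.\n"
    'If the passages do not contain the answer, reply exactly: "I don\'t know."'
)


def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Assemble a grounded prompt: plain instruction, numbered passages, question."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"{FALLBACK_INSTRUCTION}\n\nPassages:\n{numbered}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_rag_prompt("Who maintains the service?", ["The billing team owns it."]))
```

The point of the sketch is that the instruction is stated once, in plain language, with no capitals or threats; whether that is sufficient for a given retrieval setup still has to be checked empirically.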
Instruction following improvements
- Models can now handle ordered multi-step instructions ("first do X, then Y, then Z") reliably — critical for agent pipelines.
- Ranking and content requirements can be stated as ordinary instructions, without special constructs to force compliance.
- Overconfidence suppression in RAG agents works with a plain instruction rather than complex system-prompt scaffolding.
Context window and global rules
- GPT-4.1 supports a one-million-token context window.
- Historically, AI IDEs (Cursor, Windsurf) struggled to honour global rules when project context filled the model's working memory.
- Gemini 2.5 Pro is the first model to consistently reference global rules across large project contexts; GPT-4.1 also improves here.
- Reliable global rule adherence reduces errors and improves one-shot completion rates in AI-assisted development.
Recommended system prompt structure (OpenAI guide)
- Role: the persona (e.g. professional coder, writer).
- Task/objective: the goal to achieve.
- Instructions: with subcategories for specificity.
- Reasoning steps: explicit step-by-step logic baked in (GPT-4.1 is not a reasoning model, so it will not work through steps on its own; the reasoning process must be spelled out in the prompt).
- Output format: XML is now recommended for complex prompts, which converges with Anthropic's long-standing practice.
- Examples: few-shot examples to increase response reliability.
- Context: a large variable context block placed between instruction repetitions.
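As a rough sketch, the section order above can be enforced with a small helper that assembles a system prompt from named parts. The tag names in `SECTION_ORDER` and the function `build_system_prompt` are assumptions made for illustration; the guide describes the sections, not this exact markup.

```python
# Section names follow the list above; the exact tag spellings are hypothetical.
SECTION_ORDER = [
    "role",
    "objective",
    "instructions",
    "reasoning_steps",
    "output_format",
    "examples",
    "context",
]


def build_system_prompt(sections: dict[str, str]) -> str:
    """Emit the provided sections as XML-style blocks in the recommended order."""
    unknown = set(sections) - set(SECTION_ORDER)
    if unknown:
        raise ValueError(f"Unknown sections: {sorted(unknown)}")
    parts = []
    for name in SECTION_ORDER:
        body = sections.get(name, "").strip()
        if body:  # skip sections that were omitted or left empty
            parts.append(f"<{name}>\n{body}\n</{name}>")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_system_prompt({
        "role": "You are a senior Python reviewer.",
        "objective": "Review the diff for bugs.",
        "reasoning_steps": "1. Read the diff. 2. List risks. 3. Suggest fixes.",
    }))
```

Keeping the order in one list means every prompt in a project shares the same layout regardless of the order in which callers supply the sections.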
Instruction placement for large context prompts
- Repeat critical instructions at both the top and bottom of the prompt when context is large.
- If choosing only one position, the top outperforms the bottom.
- This conflicts with previous caching advice (put static content at the top only) — the trade-off between cost savings and instruction retention is unresolved for GPT-4.1.
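The placement advice above can be sketched as a small helper that puts the instructions before the context and optionally repeats them after it. The name `sandwich_prompt` and the `repeat_at_bottom` flag are hypothetical, chosen only for this example.

```python
def sandwich_prompt(instructions: str, context: str, repeat_at_bottom: bool = True) -> str:
    """Place instructions above the context and, optionally, again below it."""
    instructions = instructions.strip()
    parts = [instructions, context.strip()]
    if repeat_at_bottom:
        # Repeating costs extra tokens but keeps the rules close to the end of a long prompt.
        parts.append(instructions)
    return "\n\n".join(parts)


if __name__ == "__main__":
    long_context = "...many thousands of tokens of project files..."
    print(sandwich_prompt("Follow the style guide. Cite file names.", long_context))
```

With `repeat_at_bottom=False` the helper falls back to the top-only layout, which is the position the list above says performs better when only one can be used.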
XML delimiter convergence
- OpenAI now recommends XML tags for structuring complex system prompts.
- Anthropic has used XML from the start; other providers are moving the same direction.
- XML outperforms markdown delimiters when prompts are long and multi-sectioned.
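One way to apply XML delimiters to a long, multi-document context is sketched below. The tag names `documents` and `doc`, the `id` and `title` attributes, and the function `wrap_documents` are assumptions for illustration rather than a format prescribed by any provider. Escaping the text is a design choice that keeps stray angle brackets from being mistaken for delimiters.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_documents(docs: list[tuple[str, str]]) -> str:
    """Wrap (title, text) pairs in XML-style tags so each source is clearly delimited."""
    blocks = []
    for index, (title, text) in enumerate(docs, start=1):
        # quoteattr adds the surrounding quotes and escapes special characters in the title.
        blocks.append(f'<doc id="{index}" title={quoteattr(title)}>\n{escape(text)}\n</doc>')
    return "<documents>\n" + "\n".join(blocks) + "\n</documents>"


if __name__ == "__main__":
    print(wrap_documents([("Release notes", "Latency is < 200 ms."), ("FAQ", "See the docs.")]))
```

Instructions elsewhere in the prompt can then refer to sources by `id`, which is harder to do unambiguously with markdown headings in a long prompt.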
Benchmark context: GPT-4.1 vs. the field
- Fiction.LiveBench tests multi-fact reasoning across a long document, which is more representative of real use cases than needle-in-a-haystack tests.
- Gemini 2.5 Pro leads this benchmark by a wide margin at 120k tokens; GPT-4.1 scores ~62, below Grok Mini and GPT-4o.
- For tool calling, Gemini 2.5 Pro again ranks highest; Claude 3.5 Sonnet outperforms 3.7 Sonnet for execution tasks.
- Practical recommendation: use Gemini 2.5 Pro or Claude 3.7 for strategy/planning; Claude 3.5 Sonnet for execution; GPT-4.1 is improved but not the benchmark leader.