Choosing the Right AI Model for Every Coding Task: A Two-Bucket Framework

Executive overview

Model overload is real: tools like Cursor list 14+ default models with no clear guidance on which to use for what. Benchmarks mislead because they optimise for narrow problem types, and some commercial and open-source models have been found gaming the leaderboards. The solution is a simple planning/execution split: use the smartest reasoning models to produce detailed plans, then pass those plans to fast, tool-reliable models that follow instructions precisely. This separates cognitive load from mechanical execution, matching each model to what it actually does well.

The core insight: a model's intelligence rank matters less than whether it excels at planning or at reliable tool-calling.

The planning/execution split

  • Planning models (Gemini 2.5 Pro, O3) reason through architecture, diagnose root causes, and write detailed specs
  • Execution models (GPT-4.1, Claude 3.5 Sonnet) excel at tool-calling and precise instruction-following
  • Claude 3.7 Sonnet thinking is reserved for gnarly bugs only, because it over-reasons and gets stuck in loops
  • Gemini 2.5 Pro doubles as an execution model when generation length matters (large functions, complex features)
  • O4 mini high handles planning tasks where speed matters, such as interview-style spec generation
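One way to make the split concrete is a small routing table that maps a task's role to a model. The sketch below is illustrative only: the model names come from the list above, but the `Task` fields, the `MODELS` keys, and the routing rules in `pick_model` are assumptions made for this example, not part of any tool.

```python
from dataclasses import dataclass

# Hypothetical role table. The model names are taken from the list above;
# the keys and the routing rules below are assumptions for illustration.
MODELS = {
    "plan": "Gemini 2.5 Pro",
    "plan_fast": "O4 mini high",
    "execute": "GPT-4.1",
    "execute_long": "Gemini 2.5 Pro",
    "debug_hard": "Claude 3.7 Sonnet thinking",
}


@dataclass
class Task:
    kind: str                  # "plan" or "execute"
    needs_speed: bool = False  # e.g. interview-style spec generation
    long_output: bool = False  # large functions, complex features
    hard_bug: bool = False     # gnarly bugs only


def pick_model(task: Task) -> str:
    """Return the model name for a task, following the two-bucket split."""
    if task.kind == "plan":
        return MODELS["plan_fast"] if task.needs_speed else MODELS["plan"]
    if task.kind == "execute":
        if task.hard_bug:
            return MODELS["debug_hard"]
        if task.long_output:
            return MODELS["execute_long"]
        return MODELS["execute"]
    raise ValueError(f"unknown task kind: {task.kind!r}")


print(pick_model(Task("plan")))
print(pick_model(Task("execute", long_output=True)))
```

Because the names live in one table, swapping in a newer model means editing a single entry while the routing logic stays the same.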

Why benchmarks fail in practice

  • Benchmarks optimise for narrow problem types that rarely match real workflows
  • Multiple commercial and open-source models have been found gaming leaderboards
  • Use benchmarks as rough approximations; validate through hands-on testing with your own use cases

Small task example: Sankey diagram in Python

  • O3 chose the right visualisation type (Sankey) and wrote the initial Python script
  • Iterating inside O3 without Canvas caused full rewrites that degraded quality
  • Moved the original script into Cursor with 3.7 Sonnet thinking for inline edits only
  • Result: O3 as planner, Cursor/3.7 thinking as executor — clean separation produced the final visual
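The original script is not reproduced here, so the following is only a rough sketch of the data preparation a Sankey diagram typically needs: turning named flows into indexed node and link lists. The flow names and values are made up, and `build_sankey_data` is a hypothetical helper, not the code O3 produced.

```python
def build_sankey_data(flows):
    """Convert (source, target, value) triples into indexed node/link data."""
    labels = []  # node labels in first-seen order
    index = {}   # label -> position in `labels`

    def node_id(name):
        if name not in index:
            index[name] = len(labels)
            labels.append(name)
        return index[name]

    link = {"source": [], "target": [], "value": []}
    for src, dst, value in flows:
        link["source"].append(node_id(src))
        link["target"].append(node_id(dst))
        link["value"].append(value)
    return {"label": labels}, link


# Made-up example data, for illustration only.
flows = [
    ("Visits", "Signups", 60),
    ("Visits", "Bounced", 40),
    ("Signups", "Paid", 15),
]
node, link = build_sankey_data(flows)
print(node["label"])   # ['Visits', 'Signups', 'Bounced', 'Paid']
print(link["source"])  # [0, 0, 1]
print(link["target"])  # [1, 2, 3]

# Assumption: if plotly is installed, these dicts match the shape its
# Sankey trace takes, e.g. go.Figure(go.Sankey(node=node, link=link)).
```

Keeping the data step separate from the drawing step suits the inline-edit workflow: an executor model can adjust labels or colours without touching the flow logic.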

Mid-size task example: Chrome extension UI debug

  • First execution pass used 3.7 thinking; it made progress then started looping and breaking the UI
  • Used Repomix/Yeek to flatten the codebase into a single TXT, stripped Cursor's system prompt
  • Fed flat file to 2.5 Pro to diagnose the root cause, then asked it to write a sequenced prompt list
  • GPT-4.1 in Cursor executed each prompt reliably — tool calls clean, UI fixed
  • Pattern: P1 → E1 → P2 → E2 (plan, execute, plan again, execute again); reverting to planning mid-build is normal and expected
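The flattening step can be approximated in a few lines when a dedicated tool is not at hand. This is a minimal sketch, not how Repomix itself works: `flatten_repo`, the skip list, the suffix list, and the `=====` header format are all assumptions chosen for the example.

```python
from pathlib import Path

# Assumed filters for a typical Chrome extension repo; adjust as needed.
SKIP_DIRS = {".git", "node_modules", "dist"}
TEXT_SUFFIXES = {".js", ".ts", ".json", ".html", ".css", ".md"}


def flatten_repo(root: Path, out_file: Path) -> int:
    """Write every matching source file under `root` into one text file.

    Each file is preceded by a header line with its relative path so the
    planning model can tell where one file ends and the next begins.
    Returns the number of files written.
    """
    count = 0
    with out_file.open("w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
                continue
            rel = path.relative_to(root)
            if any(part in SKIP_DIRS for part in rel.parts):
                continue
            out.write(f"===== {rel.as_posix()} =====\n")
            out.write(path.read_text(encoding="utf-8", errors="replace"))
            out.write("\n\n")
            count += 1
    return count
```

Calling `flatten_repo(Path("my-extension"), Path("flat.txt"))` (both paths hypothetical) would produce a single file that can be pasted into a planning model outside the editor.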

Large project workflow: spec-to-build pipeline

  • Fast reasoning model (O4 mini high) interviews you one question at a time to produce a specification
  • Spec fed to a smart model (2.5 Pro or O3) to produce the blueprint — the architectural how
  • Blueprint model then generates a Markdown to-do checklist as the execution roadmap
  • Cursor executes iteratively against the checklist; GPT-4.1 is the default, Claude 3.5 Sonnet the fallback
  • Risk of using 2.5 Pro for the blueprint: its knowledge cutoff can mean outdated APIs and libraries, whereas O3 and Grok 3 research mid-generation and stay current
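To illustrate how a Markdown to-do checklist can drive iterative execution, here is a small sketch that parses checkbox items and finds the next unfinished one. The checklist text is invented, and `parse_checklist` and `next_task` are hypothetical helpers; in Cursor itself this step is usually done by pointing the model at the file.

```python
import re

# Matches Markdown task-list lines such as "- [ ] Add popup UI" or "* [x] Done".
CHECKBOX = re.compile(r"^\s*[-*] \[( |x|X)\] (.+)$")


def parse_checklist(markdown: str):
    """Return a list of {"done": bool, "text": str} for each checkbox line."""
    items = []
    for line in markdown.splitlines():
        match = CHECKBOX.match(line)
        if match:
            items.append({
                "done": match.group(1).lower() == "x",
                "text": match.group(2).strip(),
            })
    return items


def next_task(items):
    """Return the first unchecked item, or None when everything is done."""
    for item in items:
        if not item["done"]:
            return item
    return None


# Invented checklist, standing in for the blueprint model's output.
todo = """# Build roadmap
- [x] Scaffold the extension
- [ ] Add the popup UI
- [ ] Wire up storage
"""
items = parse_checklist(todo)
print(next_task(items)["text"])  # Add the popup UI
```

Handing the executor one unchecked item at a time keeps each prompt small, which is where tool-reliable models tend to do best.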

Practical model hierarchy for coding today

  • Planning: Gemini 2.5 Pro (long output), O3 (up-to-date research)
  • Execution default: GPT-4.1 (best tool-calling), Claude 3.5 Sonnet (second choice)
  • Execution specialist: Claude 3.7 thinking (hard bugs), Gemini 2.5 Pro (long generation runs)
  • Model rankings change weekly — apply the framework, not the specific names

The practitioner mindset

  • Theory and benchmarks age out faster than ever in AI development
  • Hands-on testing against your own use cases outperforms any leaderboard ranking
  • Apply new models to old use cases to measure delta; build new use cases to discover capabilities
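Measuring the delta can be as simple as re-running a fixed set of your own cases against the old and the new model and comparing pass rates. The sketch below uses stub functions in place of real model calls; `pass_rate`, `measure_delta`, the cases, and both stubs are assumptions for illustration, and no real API is called.

```python
def pass_rate(model_fn, cases):
    """Fraction of (prompt, check) cases whose output passes its check."""
    passed = sum(1 for prompt, check in cases if check(model_fn(prompt)))
    return passed / len(cases)


def measure_delta(old_fn, new_fn, cases):
    """Positive result means the new model passes more of your own cases."""
    return pass_rate(new_fn, cases) - pass_rate(old_fn, cases)


# Stub "models": stand-ins for real calls, returning canned text.
def old_model(prompt):
    return "return a + b" if "add" in prompt else "print('Hi')"


def new_model(prompt):
    return "return a + b" if "add" in prompt else "print('Hello')"


# Your own use cases: a prompt plus a check on the output.
cases = [
    ("write an add function", lambda out: "a + b" in out),
    ("write a greeting", lambda out: "Hello" in out),
]

print(measure_delta(old_model, new_model, cases))  # 0.5
```

The checks here are deliberately crude substring tests; in practice they would be whatever signals success for your workflow, such as a passing test suite or a UI that renders.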
