How o3 uses tools inside reasoning to solve harder problems

Executive overview

Most AI models use tools reactively: a keyword or an explicit request triggers the tool, typically once reasoning has already concluded. o3 integrates tool use directly into the reasoning process, letting it search, verify, and adjust mid-thought before producing an answer.

This makes o3 more accurate and capable on complex tasks, not just smarter on benchmarks. Three scaling levers now exist: training compute, inference compute, and in-reasoning tool use.

The core insight: giving a model access to tools during reasoning — not after — is the next major capability jump, equivalent in impact to adding chain-of-thought.

The three scaling levers

  • Training compute: the data volume, compute budget, and training time used to build the base model
  • Inference compute: the compute allocated while the model responds; more of it generally yields better answers
  • In-reasoning tool use: tools called mid-thought rather than reactively at the end

Earlier models treated tool use as reactive, either keyword-triggered or user-prompted. o3 reasons about when and why to use a tool, uses it, then continues reasoning with the result. This reduces hallucinations and increases accuracy on multi-step tasks.
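The reason, call a tool, continue reasoning cycle described above can be illustrated with a toy loop. This is only a sketch under stated assumptions: `toy_policy` is a hypothetical stand-in for the model, `TOOLS` holds a made-up calculator, and nothing here reflects how o3 is actually implemented or any real API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    thought: str                      # the reasoning text for this step
    tool: Optional[str] = None        # tool to call next, if any
    tool_input: Optional[str] = None  # argument passed to that tool
    answer: Optional[str] = None      # set once reasoning concludes


# Hypothetical tool registry: a tiny adder standing in for search, code, etc.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}


def toy_policy(question: str, observations: List[str]) -> Step:
    """Stand-in for the model: decide whether to call a tool or answer."""
    if not observations:
        return Step(thought="I should verify the arithmetic before answering.",
                    tool="calculator", tool_input="17+25")
    return Step(thought="The tool result settles it.",
                answer=f"The total is {observations[-1]}.")


def reason_with_tools(question: str,
                      policy: Callable[[str, List[str]], Step],
                      tools: Dict[str, Callable[[str], str]],
                      max_steps: int = 5) -> Tuple[str, List[str]]:
    """Interleave reasoning steps and tool calls until an answer appears."""
    observations: List[str] = []
    trace: List[str] = []
    for _ in range(max_steps):
        step = policy(question, observations)
        trace.append(step.thought)
        if step.answer is not None:
            return step.answer, trace
        # The tool runs mid-reasoning; its result feeds the next step.
        result = tools[step.tool](step.tool_input)
        observations.append(result)
        trace.append(f"[{step.tool}] -> {result}")
    return "No answer within the step budget.", trace


answer, trace = reason_with_tools("What is 17 + 25?", toy_policy, TOOLS)
print(answer)
```

The point of the sketch is the control flow: the tool result goes back into the loop as an observation, so later reasoning can depend on it. In a reactive setup the tool call would sit outside the loop entirely.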

Where o3 outperforms other models

  • Sticky-note to-do list: o3 rotates, crops, and extracts text from an image as part of reasoning before producing the final list — other models would just describe the image
  • Thumbnail feedback: given draft thumbnails, o3 reasons through YouTube best practices and returns specific colour and layout recommendations in a comparison table
  • Market research (broad questions): o3 is better than deep research for high-level, analytically driven queries; deep research wins for targeted questions with lots of provided context
  • Gnarly bugs: when GPT-4.1, then Claude 3.7 thinking, then Gemini 2.5 Pro all fail, o3 often fixes the bug in one or two shots by reasoning across the entire codebase logic

Choosing the right model for the task

  • High-level research → o3 (broad, analytical, fast)
  • Targeted research with long output → deep research tools (Perplexity, Gemini, Claude, Grok)
  • Writing → Claude 3.7 / 3.5 Sonnet; GPT-4.1 with precise instructions
  • UI design → Claude 3.7 thinking
  • Large codebase or file → Gemini 2.5 Pro (1M context window)
  • Complex bugs, targeted feedback, business/product fit analysis → o3
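The mapping above amounts to a lookup table, which can be sketched as a small routing helper. The task keys and the `pick_models` function are hypothetical names introduced for illustration, and the values are the display names used in this section, not API model identifiers.

```python
from typing import Dict, List

# Task type -> suggested models, transcribed from the list above.
# Keys are hypothetical labels; values are display names only.
ROUTES: Dict[str, List[str]] = {
    "high_level_research": ["o3"],
    "targeted_research": ["Perplexity", "Gemini", "Claude", "Grok"],
    "writing": ["Claude 3.7 Sonnet", "Claude 3.5 Sonnet", "GPT-4.1"],
    "ui_design": ["Claude 3.7 thinking"],
    "large_codebase": ["Gemini 2.5 Pro"],
    "complex_bug": ["o3"],
    "targeted_feedback": ["o3"],
    "business_fit_analysis": ["o3"],
}


def pick_models(task_type: str) -> List[str]:
    """Return the suggested models for a task type, in listed order."""
    try:
        return ROUTES[task_type]
    except KeyError:
        known = ", ".join(sorted(ROUTES))
        raise ValueError(
            f"Unknown task type {task_type!r}; known: {known}"
        ) from None


print(pick_models("complex_bug"))
print(pick_models("writing"))
```

Unknown task types raise an error rather than falling back to a default, since the section does not name a default model.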

Expert outsourcing with AI projects

  • Convert expert knowledge (YouTube videos, reports, podcasts) into a Claude or GPT project
  • Pair a high-quality knowledge base with a precise system prompt to create an always-available specialist
  • Use cases: thumbnail strategy, cold outreach, sales — any domain with an identifiable expert whose thinking can be captured
  • Custom GPTs are less preferred; Claude projects or GPT projects with structured instructions work better
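One way to picture pairing a knowledge base with a precise system prompt is a helper that assembles both into a single block of text. This is a minimal sketch under assumptions: `Source` and `build_project_prompt` are hypothetical names, and real Claude or GPT projects usually take instructions and uploaded files separately rather than as one string.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    title: str
    kind: str  # e.g. "video transcript", "report", "podcast"
    text: str


def build_project_prompt(role: str, rules: List[str],
                         sources: List[Source]) -> str:
    """Combine precise instructions with the captured expert material."""
    lines = [f"You are {role}.", "", "Instructions:"]
    lines += [f"- {rule}" for rule in rules]
    lines += ["", "Knowledge base:"]
    for i, src in enumerate(sources, start=1):
        # Number each source so answers can point back to it.
        lines.append(f"[{i}] {src.title} ({src.kind})")
        lines.append(src.text.strip())
        lines.append("")
    return "\n".join(lines).rstrip()


prompt = build_project_prompt(
    role="a thumbnail strategist",
    rules=["Cite the source number for each recommendation."],
    sources=[Source("Thumbnail teardown", "video transcript",
                    " High contrast wins. ")],
)
print(prompt)
```

Numbering the sources is a design choice for the sketch: it lets the instructions ask the specialist to cite which piece of expert material a recommendation came from.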
