How fluid intelligence and program search point toward AGI

Executive overview

Scaling up pre-training produced AI with vast memorised skill but near-zero fluid intelligence, the ability to handle genuinely novel problems. Test-time adaptation fixed part of this, but deep learning models still lack the compositional generalisation needed for real AGI.

The missing piece is combining two types of abstraction: perception-based (type one) and program-based (type two). AGI requires a system that can synthesise new programs on the fly, guided by deep-learning intuition, rather than fetching pre-recorded templates.

Fluid intelligence is not skill — it is the efficiency with which past experience is converted into the ability to handle future novelty.

Why pre-training scaling failed

  • Intelligence is an efficiency ratio: how well past experience converts to performance on genuinely new situations.
  • Scaling up base LLMs by 50,000x moved ARC-1 accuracy from ~0% to ~10%; humans score above 95%.
  • Benchmarks built on known tasks measure memorised skill, not intelligence: they borrow from exams designed for humans, who cannot memorise every answer in advance the way an LLM can.
  • Static inference (querying pre-loaded knowledge) cannot demonstrate fluid intelligence, no matter the model size.
  • Goodhart's Law: chasing task-specific skill benchmarks optimises for automation, not invention.

What test-time adaptation changed

  • Test-time adaptation (TTA) lets models modify their behaviour during inference — via test-time training, chain-of-thought synthesis, or program synthesis.
  • Every AI system that performs meaningfully above zero on ARC uses TTA.
  • OpenAI's o3, fine-tuned on ARC, reached human-level performance on ARC-1.
  • ARC-1 is now saturated; it was a binary signal — either near-zero fluid intelligence or near-human.
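
The difference between static inference and test-time adaptation can be sketched with a deliberately tiny example. This is an illustration only, not how any ARC system is built: the one-parameter model, the learning rate, the step count, and the function names predict and adapt are all assumptions made for the sketch. A "pretrained" model answers a query unchanged, then answers it again after a few gradient steps on the new task's demonstration pairs.

```python
# Toy illustration of test-time training. Everything here is hypothetical:
# a one-parameter model stands in for a pretrained network.

def predict(w: float, x: float) -> float:
    return w * x

def adapt(w: float, demos: list[tuple[float, float]],
          lr: float = 0.01, steps: int = 200) -> float:
    """Run a few gradient steps on mean squared error over the demo pairs."""
    for _ in range(steps):
        grad = sum(2 * (predict(w, x) - y) * x for x, y in demos) / len(demos)
        w -= lr * grad
    return w

pretrained_w = 1.0                              # frozen skill: "output equals input"
demos = [(1.0, 3.0), (2.0, 6.0), (4.0, 12.0)]   # novel task: output is three times input
query = 5.0

static_answer = predict(pretrained_w, query)                  # static inference
adapted_answer = predict(adapt(pretrained_w, demos), query)   # adapted at test time

print(static_answer, round(adapted_answer, 3))
```

The static answer reflects only what was loaded before the task appeared; the adapted answer reflects the demonstrations seen at inference time.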

ARC-2 and what it measures

  • ARC-2 (released March 2025) targets compositional generalisation — tasks requiring deliberate reasoning, not just pattern recall.
  • Validated with 400 non-expert humans in San Diego; every task was solved by at least two people, and each task was attempted by about seven people on average.
  • Base LLMs score 0%. Single-chain-of-thought reasoning systems score 1–2%. Only TTA systems do meaningfully better.
  • Even o3 remains below human level on ARC-2, showing current TTA is not sufficient for AGI.
  • ARC-3 (developer preview July 2025, full launch early 2026) will assess agency: exploring unknown environments, setting and achieving goals, with strict action-efficiency limits matching human performance.

Two types of abstraction

  • Type one (value-centric): continuous distance functions; underlies perception, pattern recognition, intuition, and modern deep learning.
  • Type two (program-centric): discrete graph comparison via exact structure matching; underlies explicit reasoning, planning, and software engineering abstraction.
  • Transformers excel at type one; they struggle with simple type two tasks like sorting or digit addition.
  • Human intelligence combines both: type one intuition prunes the search space so type two reasoning stays tractable (e.g., chess: pattern recognition selects which moves to calculate).
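
A small sketch may make the two comparison styles concrete. It assumes toy representations that are not from the source: feature vectors as plain lists for type one, and programs as nested tuples of the form (operator, operand, operand) for type two. The names cosine_distance and same_structure are hypothetical helpers.

```python
import math

# Type one: compare by continuous distance, so "close enough" counts as similar.
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Type two: compare programs as trees, so only an exact structural match counts.
def same_structure(p, q):
    if isinstance(p, tuple) and isinstance(q, tuple):
        return (len(p) == len(q) and p[0] == q[0]
                and all(same_structure(a, b) for a, b in zip(p[1:], q[1:])))
    # Leaves: any two variable names match each other; constants must be equal.
    if isinstance(p, str) and isinstance(q, str):
        return True
    return p == q

vec_a = [0.9, 0.1, 0.0]
vec_b = [0.8, 0.2, 0.0]
print(cosine_distance(vec_a, vec_b))   # small value: the vectors are "similar"

prog_1 = ("add", ("mul", "x", 2), 1)   # x * 2 + 1
prog_2 = ("add", ("mul", "y", 2), 1)   # y * 2 + 1, same shape with a renamed variable
prog_3 = ("add", ("mul", "y", 3), 1)   # y * 3 + 1, different constant
print(same_structure(prog_1, prog_2), same_structure(prog_1, prog_3))
```

The first comparison degrades gracefully with noise; the second gives a yes or no answer, which is what makes it suitable for exact reasoning and expensive to search over.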

The role of discrete program search

  • Deep learning alone does not invent — it automates.
  • All known AI systems capable of genuine invention rely on discrete search (genetic algorithms, AlphaGo's Move 37, AlphaEvolve).
  • Program synthesis treats learning as combinatorial search over a graph of symbolic operations.
  • Program synthesis is data-efficient (fits from 2–3 examples) but hits combinatorial explosion as complexity grows.
  • The solution: use type-one deep learning intuition to guide and prune type-two program search — analogous to embedding a discrete graph into a latent space where approximate distance functions control combinatorial explosion.
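
The contrast between blind and guided program search can be sketched as follows. This is a minimal illustration under stated assumptions: the five-operation list DSL is made up for the example, and the hand-written mismatch score is only a stand-in for learned deep-learning intuition. brute_force enumerates every composition in order; guided expands the most promising partial program first.

```python
import heapq
from itertools import count, product

# Hypothetical five-operation DSL over lists of integers.
PRIMITIVES = {
    "reverse": lambda xs: xs[::-1],
    "sort": lambda xs: sorted(xs),
    "double": lambda xs: [2 * x for x in xs],
    "inc": lambda xs: [x + 1 for x in xs],
    "tail": lambda xs: xs[1:],
}

def run(program, xs):
    for name in program:
        xs = PRIMITIVES[name](xs)
    return xs

def mismatch(program, examples):
    """Stand-in for learned intuition: 0 means every example is solved exactly."""
    total = 0
    for xs, target in examples:
        out = run(program, xs)
        total += sum(abs(a - b) for a, b in zip(out, target))
        total += 10 * abs(len(out) - len(target))
    return total

def brute_force(examples, max_depth=3):
    tried = 0
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            tried += 1
            if mismatch(program, examples) == 0:
                return program, tried
    return None, tried

def guided(examples, max_depth=3):
    tie = count()  # tie-breaker so the heap never has to compare programs
    frontier = [(mismatch((), examples), next(tie), ())]
    tried = 0
    while frontier:
        _, _, program = heapq.heappop(frontier)
        if len(program) == max_depth:
            continue
        for name in PRIMITIVES:
            child = program + (name,)
            tried += 1
            score = mismatch(child, examples)
            if score == 0:
                return child, tried
            heapq.heappush(frontier, (score, next(tie), child))
    return None, tried

examples = [([3, 1, 2], [3, 5, 7]), ([5, 4], [9, 11])]
brute_program, brute_tried = brute_force(examples)
guided_program, guided_tried = guided(examples)
print(brute_program, brute_tried)
print(guided_program, guided_tried)
```

Both searches fit the task from two examples. The number of candidates the blind search must consider grows as the number of primitives raised to the program length, which is the combinatorial explosion the bullet above refers to; the score only changes which candidates are tried first.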

Architecture of the target system

  • A programmer-like meta-learner that, when given a new task, synthesises a bespoke program on the fly.
  • Programs blend deep-learning submodules (type one perception) with algorithmic modules (type two reasoning).
  • Assembly is driven by discrete program search guided by learned intuition about program space.
  • A global abstraction library accumulates reusable building blocks; new abstractions discovered during task-solving are uploaded back (like open-source libraries on GitHub).
  • The system improves continuously: both the library and the intuition over program space grow over time.
  • First milestone at Ndea (Chollet's research lab): solve ARC-AGI using a system that starts with zero knowledge of ARC.
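
The library idea can be sketched in a few lines. This is a hypothetical toy, not a description of Ndea's system: the AbstractionLibrary class, its three starting blocks, and the upload method are assumptions for illustration, and the search is plain enumeration with no learned guidance. A program found while solving one task is stored as a single block, which makes a later, harder task reachable with a short program.

```python
from itertools import product

class AbstractionLibrary:
    """Hypothetical shared store of reusable building blocks."""

    def __init__(self):
        self.blocks = {
            "sort": lambda xs: sorted(xs),
            "double": lambda xs: [2 * x for x in xs],
            "inc": lambda xs: [x + 1 for x in xs],
        }

    def run(self, program, xs):
        for name in program:
            xs = self.blocks[name](xs)
        return xs

    def solve(self, examples, max_depth):
        """Plain enumerative search; returns (program, candidates tried)."""
        tried = 0
        for depth in range(1, max_depth + 1):
            for program in product(list(self.blocks), repeat=depth):
                tried += 1
                if all(self.run(program, xs) == ys for xs, ys in examples):
                    return program, tried
        return None, tried

    def upload(self, name, program):
        """Store a solved program as one reusable block."""
        steps = tuple(program)
        self.blocks[name] = lambda xs, steps=steps: self.run(steps, xs)

lib = AbstractionLibrary()

task_a = [([3, 1, 2], [3, 5, 7])]
program_a, _ = lib.solve(task_a, max_depth=3)
lib.upload("sort_double_inc", program_a)   # abstraction discovered on task A

task_b = [([2, 1], [7, 11])]               # needs five of the original steps
program_b, _ = lib.solve(task_b, max_depth=3)
print(program_a, program_b)
```

With only the three starting blocks, task B is out of reach at depth three; after the upload it is solved by reusing the stored block, which is the sense in which the library makes later search cheaper.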

Implications for AGI timelines

  • Current TTA models are a major step — on-the-fly recombination is now possible — but remain far too compute-inefficient (thousands of dollars to solve ARC-1 at human level).
  • Deep learning requires 3–4 orders of magnitude more data than humans to distil simple abstractions.
  • You are close to AGI when it becomes hard to construct tasks that humans can solve but AI cannot; we are not there yet.
  • AGI defined as autonomous invention and discovery — not just 80% task completion — is what unlocks acceleration of scientific progress.
