Why scaling alone won't get us to AGI: François Chollet on ARC and intelligence
Executive overview
Current AI progress automates domains with verifiable rewards — code, math — but this is not the same as general intelligence. General intelligence is human-level skill-acquisition efficiency across arbitrary new tasks, not just performance on trained domains.
LLMs scaled 50,000x with no meaningful gain on ARC V1 until reasoning models arrived. That gap is the signal: pre-training scale alone cannot produce fluid intelligence.
Chollet's lab, Ndea, is building an alternative foundation, symbolic program synthesis, to target optimality directly rather than patching the LLM stack.
The core insight: intelligence is efficient learning, not stored knowledge — and those require fundamentally different architectures.
Why the LLM stack has a ceiling
- Verifiable-reward domains (code, math) can be fully automated with current technology via RL post-training loops
- Domains without formal verification (essays, law) will see slow or stalling progress
- Scaling pre-training 50,000x left ARC V1 scores near zero — more parameters alone don't produce fluid intelligence
- Reasoning models caused a step-function jump on ARC V1; RL harnesses saturated ARC V2 — but neither indicates higher fluid intelligence, only better training in specific domains
- Benchmarks still need human-engineered harnesses to be cracked, which is itself evidence that we are short of AGI
What Ndea is building
- Replacing parametric curves (neural nets) with the shortest possible symbolic models of data — minimum description length as the target
- Gradient descent replaced by symbolic descent: search over symbolic space guided by deep learning
- Models are expected to be tiny at inference time, generalise better, and compose more cleanly
- Chollet puts the chance of success at roughly 10–15%, and considers it worth attempting because no one else is doing it
- Prediction: in retrospect, AGI will turn out to be less than 10,000 lines of code, and the compute of the 1980s would have been sufficient
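To make "shortest possible symbolic model" concrete, here is a minimal sketch of the minimum-description-length idea using plain enumerative search over a toy expression language. It is not the lab's method: the guided "symbolic descent" described above is replaced by brute-force enumeration, and the DSL (`LEAVES`, `OPS`) and every function name are hypothetical choices made for illustration.

```python
from itertools import product

# Hypothetical toy DSL: leaves are the variable x or a small integer constant.
LEAVES = ["x", 1, 2, 3]
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def expressions(size):
    """Yield every expression tree with exactly `size` leaves."""
    if size == 1:
        yield from LEAVES
        return
    for left_size in range(1, size):
        lefts = list(expressions(left_size))
        rights = list(expressions(size - left_size))
        for left, right in product(lefts, rights):
            for op in OPS:
                yield (op, left, right)

def evaluate(expr, x):
    """Evaluate an expression tree at a given value of x."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if expr == "x" else expr

def render(expr):
    """Pretty-print an expression tree."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return f"({render(left)} {op} {render(right)})"
    return str(expr)

def shortest_model(data, max_size=4):
    """Return the first expression, searching smallest sizes first, that fits all data."""
    for size in range(1, max_size + 1):
        for expr in expressions(size):
            if all(evaluate(expr, x) == y for x, y in data):
                return expr
    return None

# Four observations generated by y = x*x + x + 1.
data = [(0, 1), (1, 3), (2, 7), (3, 13)]
model = shortest_model(data)
print(render(model), "predicts", evaluate(model, 10), "at x = 10")
```

Because the search visits smaller expressions first, the model it returns is the most compact description of the data within this DSL, and it extrapolates to unseen inputs instead of interpolating between stored points. Exhaustive enumeration grows exponentially with expression size, which is why a learned guide over the search space matters in practice.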
ARC as a barometer of AI progress
- ARC V1 (2019): static pattern tasks requiring causal modelling from provided data; base LLMs scored near zero
- ARC V2: same format, harder composition; saturated by RL harnesses fine-tuned on self-generated verified reasoning chains
- ARC V3 (2026): agentic; the agent is dropped into an unseen mini-game with no instructions, no stated goal, and no description of the controls, and must explore, form a world model, set goals, and solve efficiently
- Scored on action efficiency matched against human baselines; brute-force exploration scores extremely low
- Private test set is deliberately unlike the public set to resist targeted fine-tuning
- ARC V4 (planned): continual/curriculum learning across compounding game levels
- ARC V5: focused on invention (details withheld)
- AGI moment defined as when the measurable gap between human and AI learning efficiency effectively closes
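The idea of scoring action efficiency against a human baseline can be illustrated with a simple ratio. The formula below is an assumption made for illustration, not the official ARC V3 metric, and the function name and parameters are hypothetical.

```python
def action_efficiency(agent_actions, human_baseline_actions, solved):
    """Assumed score: human baseline actions divided by agent actions, capped at 1.

    Returns 0.0 for an unsolved game. This is an illustrative formula only,
    not the benchmark's published scoring rule.
    """
    if not solved or agent_actions <= 0:
        return 0.0
    return min(1.0, human_baseline_actions / agent_actions)

# An agent matching the human baseline scores 1.0; one that brute-forces
# its way through 100x as many actions scores 0.01.
print(action_efficiency(200, 200, solved=True))
print(action_efficiency(20_000, 200, solved=True))
```

Under any metric of this shape, solving a game is not enough: an agent that explores exhaustively still solves it but collects almost no credit, which matches the claim that brute-force exploration scores extremely low.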
What makes domains automatable now
- True, trustable verification signals enable RL post-training loops that self-generate training data at scale
- Code was first: unit tests provide dense, reliable reward; models learn execution traces the way human programmers mentally simulate code
- Mathematics is next for the same reason
- The key human contribution shrinks to designing the environment; from that, exponentially more training data is generated autonomously
- Removing humans from the improvement loop — not recursive self-improvement per se — is the prerequisite for compounding capability gains
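The loop described above, where a trusted verifier turns raw model samples into training data without a human in the loop, can be sketched in a few lines. This is a toy under stated assumptions: `propose_programs` is a hypothetical stand-in for sampling from a model (here it simply enumerates small linear functions), and the unit tests play the role of the reward signal.

```python
def verifier(program, tests):
    """Unit tests as reward: 1 if every test passes, 0 otherwise."""
    try:
        return int(all(program(*args) == expected for args, expected in tests))
    except Exception:
        return 0  # a crashing candidate earns no reward

def propose_programs():
    """Hypothetical stand-in for model sampling: candidate functions a*x + b."""
    for a in (-1, 0, 1, 2):
        for b in (0, 1, 2):
            yield (a, b), (lambda x, a=a, b=b: a * x + b)

def generate_verified_data(tests):
    """Keep only candidates the verifier accepts; these become training examples."""
    return [params for params, program in propose_programs()
            if verifier(program, tests) == 1]

# The human contribution is the environment: two tests pinning down f(x) = 2*x + 1.
tests = [((0,), 1), ((3,), 7)]
verified = generate_verified_data(tests)
print(verified)  # [(2, 1)]
```

In a real post-training setup the accepted samples would be fed back to update the model, and the loop would repeat. The sketch only shows the part that makes a domain automatable: a check that can be run at scale and trusted without human review.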
Intelligence, efficiency, and the knowledge–intelligence trade-off
- Competence requires either high intelligence or high knowledge; better training substitutes for fluid intelligence in bounded domains
- LLMs are effectively large knowledge bases — modular vector programs mapping input patterns to output patterns
- Fluid intelligence is the ability to model a new environment efficiently from scratch, with little data
- Humans solve novel ARC V3 games in hundreds to thousands of actions with no prior training; frontier models are far from matching this
- Science itself is symbolic compression: observations → shortest symbolic rule; Ndea is attempting to build this process algorithmically
Advice for researchers and founders exploring alternative approaches
- If an idea has low probability but high impact and no one else is doing it, that is sufficient reason to pursue it
- Look for approaches that scale without human bottlenecks — capability must improve with compute/data, not engineer-hours
- Read AI research from the 1970s–80s: more diverse ideas were being explored before the field collapsed into one paradigm
- Genetic algorithms are underexplored and may have significant scaling potential
- Build a compounding stack — reusable foundations, not a series of disconnected experiments
- For open-source projects: prioritise API simplicity and onboarding; docs should teach the domain, not just the tool; hire your most enthusiastic community members