Attention is *not* all you need (notes)

August 13, 2025 · 9:07 AM · 848 words

I.

"That's exactly what it means to hit a wall, and exactly the particular set of obstacles I described in my most notorious (and prescient) paper, in 2022. Real progress on some dimensions, but stuck in place on others.

Ultimately, the idea that scaling alone might get us to AGI is a hypothesis.

No hypothesis has ever been given more benefit of the doubt, nor more funding. After half a trillion dollars in that direction, it is obviously time to move on. The disappointing performance of GPT-5 should make that enormously clear.

Pure scaling simply isn't the path to AGI. It turns out that attention, the key component in LLMs, and the focus of the justly famous Transformer paper, is not in fact "all you need".

All I am saying is give neurosymbolic AI with explicit world models a chance. Only once we have systems that can reason about enduring representations of the world, including but not limited to abstract symbolic ones, will we have a genuine shot at AGI."

II.

The "attention is all you need" paper might be wrong, in the sense that the scaling laws won't hold: it will get more and more expensive to realize smaller and smaller gains. This doesn't mean LLMs are a bust. Even if they stopped where they are, society would still be transformed by integrating today's technology. But in terms of the path to "AGI/ASI," you don't get there by scaling. We've just over-indexed on a single branch of the AI technology tree. We actually need to backtrack and bring what we've learned from LLMs to other, previously blocked branches. Neurosymbolic AI did not work in the 80s, 90s, and 2000s, but now that LLMs have matured, that dead branch could be what leads to the breakthrough.

Gary Marcus, I think, needs to clarify his position. He's all for neurosymbolic AI, but maybe he's not clear enough in acknowledging that it is only feasible now that LLMs have become what they are. I'm considering writing him a letter to ask.

Instead of trying to scale LLMs forever, we need to use LLMs as a tool to bootstrap symbolic reasoning systems that can do what LLMs can't.

III.

Neurosymbolic AI feels like it would lead to true reasoning. Current LLMs are basically predicting the next token based on probability, but there are limits, especially when you get into synthetic data. Even CoT (chain-of-thought) isn't real reasoning; it's just extended vector mapping with prompts to double-check and verify. It's pseudo-reasoning.

What we really need is something like a massive, self-evolving RAG: a generalizable "hypergraph." Data has to be structured and stable. An entry like "blue jay" might have anywhere from 1k to 100k to 1m properties. If someone asks "can a blue jay fly to the moon?" the system will query the right properties and reason through the question based on a series of known, verified facts.
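To make the "blue jay" entry concrete, here is a minimal sketch, assuming each property carries a `verified` flag and that answers may only draw on verified properties. The names `Fact`, `Entity`, and `verified_facts` are hypothetical, and the property values are toy examples, not a real knowledge base.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    """A single property value plus whether it has been verified (assumed flag)."""
    value: object
    verified: bool = False


@dataclass
class Entity:
    """One entry in the hypothetical hypergraph: a name and its properties."""
    name: str
    properties: dict[str, Fact] = field(default_factory=dict)

    def verified_facts(self, keys):
        """Return only the requested properties that exist and are verified."""
        return {
            k: self.properties[k].value
            for k in keys
            if k in self.properties and self.properties[k].verified
        }


# Toy entry with three of the (potentially millions of) properties.
blue_jay = Entity("blue jay", {
    "can_fly": Fact(True, verified=True),
    "needs_air_for_lift": Fact(True, verified=True),
    "favorite_color": Fact("unknown", verified=False),  # unverified, so ignored
})

# Pull the properties relevant to "can a blue jay fly to the moon?"
facts = blue_jay.verified_facts(
    ["can_fly", "needs_air_for_lift", "favorite_color", "top_speed"]
)

# Wings need air for lift, and there is no air between Earth and the moon,
# so the bird can only make the trip if it flies without needing air.
answer = facts.get("can_fly", False) and not facts.get("needs_air_for_lift", True)
```

The point of the design is that the answer is traceable: `facts` lists exactly which stored properties were used, and unverified or missing properties never enter the reasoning.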

The challenge here is both scaling and creating a flexible schema to structure the parameters within any object. People started doing this manually in the 80s, but LLMs can scale and accelerate it. Arguably, every single conversation requires new knowledge nodes to be created, and if the nodes are true, they can be added to the graph. Unlike with LLMs, knowledge compounds with use.

Agents can be constantly scanning the web and updating this hypergraph in real time with the current events of the day. Ultimately, though, the system will have to make guesses when creating properties, and each guess could carry a confidence score. Humans could then review low-confidence submissions and verify them.
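The confidence-score idea could look something like the sketch below. Everything here is an assumption for illustration: the `Proposal` shape, the `triage` and `approve` helpers, and the 0.9 cutoff are hypothetical, and the confidence numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """A property an agent wants to add to the graph."""
    entity: str
    key: str
    value: object
    confidence: float  # 0.0 to 1.0, assumed to be supplied by the agent


REVIEW_THRESHOLD = 0.9  # arbitrary cutoff chosen for this sketch


def triage(proposals, graph, review_queue, threshold=REVIEW_THRESHOLD):
    """Write high-confidence proposals to the graph; queue the rest for humans."""
    for p in proposals:
        if p.confidence >= threshold:
            graph.setdefault(p.entity, {})[p.key] = p.value
        else:
            review_queue.append(p)


def approve(proposal, graph):
    """Called after a human reviewer has verified a queued proposal."""
    graph.setdefault(proposal.entity, {})[proposal.key] = proposal.value


graph = {}
review_queue = []
triage(
    [
        Proposal("blue jay", "can_fly", True, confidence=0.99),
        Proposal("blue jay", "migrates", True, confidence=0.55),
    ],
    graph,
    review_queue,
)
# graph now holds only "can_fly"; "migrates" waits in review_queue for a human.
```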

III.

There are tens of thousands, if not millions, of parameters (key/value pairs) you might want to assign to a dog: species, aging, diseases, incidents, pop culture, anatomy, etc. So you need some way to both generate and upload those things. Apparently humans have been trying to do this by hand since the 80s. It's too slow, and the space is effectively endless. But we can use LLMs to build, update, and "pull" from the hypergraph. When someone prompts about a dog, the system needs to query the relevant 25 parameters out of the million. From these parameters, it can do actual reasoning with formal, verifiable logic:

"If [moon had atmosphere], and we brought [dogs] there, based on [gravity coefficient], they would be [1.4x] bigger, but then might suffer from [A] disease."

Our current chain-of-thought reasoning is sort of bullshit. It's not really reasoning.
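For contrast, here is a rough sketch of what "formal, verifiable logic" over retrieved facts could mean: plain forward chaining, where every conclusion records the premises that produced it. This is one classic technique swapped in for illustration, not a claim about how a real neurosymbolic system would work. The `forward_chain` helper and the fact strings are hypothetical, and the sketch uses the blue jay question because the bracketed values in the dog example are placeholders.

```python
def forward_chain(facts, rules):
    """Apply rules until nothing new is derived.

    Returns the set of known facts and a proof map from each derived
    conclusion to the premises that produced it.
    """
    known = set(facts)
    proof = {}
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                proof[conclusion] = premises
                changed = True
    return known, proof


# Facts that would be pulled from the graph (toy strings for this sketch).
facts = {
    "blue_jay flies_by_flapping",
    "flapping needs_air",
    "space has_no_air",
}

# Each rule is (premises, conclusion).
rules = [
    (("blue_jay flies_by_flapping", "flapping needs_air"),
     "blue_jay needs_air_to_fly"),
    (("blue_jay needs_air_to_fly", "space has_no_air"),
     "blue_jay cannot_fly_through_space"),
]

known, proof = forward_chain(facts, rules)
```

Unlike a chain-of-thought transcript, `proof` can be checked mechanically: each step either follows from stored facts and rules or it doesn't.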

IV.

I wonder how you design embeddings for neurosymbolic reasoning. If someone asks "can a blue jay fly to the moon?" you'd need to (1) call the "blue jay" object, which has, say, 10,000 key:value pairs, but then also (2) convert the prompt into a vector so that you know which of the 10k properties to pull.

Some optimization ideas:

(a) the properties could each live in a category that's embedded; meaning it would first find "locomotion" and then search properties within there (this means each object's database would need to be hierarchical);
(b) each request helps identify "archetypal questions" and the properties they pull, via training/finetuning;
(c) rewrite the question before the database pull, in a way that's aware of what might exist in the database.
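Idea (a) can be sketched as a two-stage lookup: pick the best-matching category first, then rank only the properties inside it. This is a toy, assuming a bag-of-words vector as a stand-in for a learned embedding; `embed`, `cosine`, `retrieve`, and the category descriptions are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hierarchical object: category -> description + properties (toy values).
blue_jay = {
    "locomotion": {
        "description": "fly walk hop move travel",
        "properties": {
            "can fly": True,
            "flight needs air": True,
            "can swim": False,
        },
    },
    "diet": {
        "description": "eat food feed",
        "properties": {"eats acorns": True},
    },
}


def retrieve(entity, question, top_k=2):
    """Stage 1: choose the closest category. Stage 2: rank its properties."""
    q = embed(question)
    category = max(
        entity, key=lambda c: cosine(q, embed(entity[c]["description"]))
    )
    props = entity[category]["properties"]
    ranked = sorted(props, key=lambda p: cosine(q, embed(p)), reverse=True)
    return category, [(p, props[p]) for p in ranked[:top_k]]


category, hits = retrieve(blue_jay, "can a blue jay fly to the moon?")
```

With this toy embedding, "flight needs air" is missed because it shares no exact words with the question. That gap is what real embeddings, or the question rewriting in idea (c), would have to close.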
