The engine and the oracle

I spent a while on a single question — can a language model invent? — and came out convinced the honest answer isn’t yes or no. It’s: depends what you mean by invent. That sounds like a dodge. It’s the opposite. The whole disagreement lives in the definition, and once you fix the definition, the facts are surprisingly settled.

What “invent” has to mean

Start with the word, because everything turns on it. A workable idea of creativity has two parts, not one: a thing is creative if it is new and valuable. Novelty alone is just noise: a random string is new and worthless. Value alone is just competence: a correct but obvious answer is fine and not creative. You need both at once, and the “and” is where all the difficulty hides.

The philosopher Margaret Boden split creativity into three kinds, and the ladder is worth climbing. Combinational: novel combinations of familiar ideas. Exploratory: new moves inside an established framework — a known game played well. Transformational: changing the rules of the framework itself, so that things become sayable which the old space could not express. She also separated novelty that is new to you from novelty that is new to everyone, ever — the historical kind. The hard question about machines is only ever about that top rung: transformational, and historically new.

What is actually settled

For the first two rungs there is no real debate. Recombine known ideas? Explore a defined space? Language models do this constantly and well; it is most of what they are for. The argument only gets interesting at the level of the genuinely new and verifiably correct, and even there we now have receipts.

FunSearch (DeepMind, 2023, published in Nature) paired a model with an automated evaluator and found new constructions for the cap-set problem and better heuristics for bin-packing. It was the first time an LLM was shown to produce a genuinely new result on an open problem rather than retrieve one. Then AlphaEvolve (2025) improved how you multiply two 4×4 complex-valued matrices, using 48 scalar multiplications where Strassen’s 1969 method, applied recursively, needs 49 (7 × 7), a wall no one had moved in fifty-six years. In a later collaboration with the mathematician Terence Tao, the same kind of system was pointed at 67 research problems: it matched the best known answer on most and pushed past it on roughly twenty. On Problem 6 of the 2025 International Mathematical Olympiad it found a tiling construction the frontier chat models could not, though, in fairness, it could not prove the construction was optimal.

These aren’t vibes. They are checkable, and they were checked. Whatever you call it, something new and correct came out that was not in the box before.
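
To see where the 49 comes from, here is a minimal Python sketch of Strassen's recursion with a counter on scalar multiplications. It is not AlphaEvolve's 48-multiplication scheme, which is not reproduced here; the helper names (`strassen`, `schoolbook`, `split`, `join`, `rand_matrix`) are mine, chosen for illustration, and the sketch assumes square matrices whose size is a power of two.

```python
import random

def add(x, y):
    return [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)]

def sub(x, y):
    return [[p - q for p, q in zip(r, s)] for r, s in zip(x, y)]

def split(m):
    """Cut a square matrix into its four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])

def join(c11, c12, c21, c22):
    """Reassemble four quadrants into one matrix."""
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom

def strassen(a, b, counter):
    """Strassen's recursion: seven block products per level.
    counter[0] goes up by one for every scalar multiplication."""
    if len(a) == 1:
        counter[0] += 1
        return [[a[0][0] * b[0][0]]]
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    m1 = strassen(add(a11, a22), add(b11, b22), counter)
    m2 = strassen(add(a21, a22), b11, counter)
    m3 = strassen(a11, sub(b12, b22), counter)
    m4 = strassen(a22, sub(b21, b11), counter)
    m5 = strassen(add(a11, a12), b22, counter)
    m6 = strassen(sub(a21, a11), add(b11, b12), counter)
    m7 = strassen(sub(a12, a22), add(b21, b22), counter)
    return join(add(sub(add(m1, m4), m5), m7), add(m3, m5),
                add(m2, m4), add(add(sub(m1, m2), m3), m6))

def schoolbook(a, b, counter):
    """Textbook triple loop, counted the same way."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                counter[0] += 1
                out[i][j] += a[i][k] * b[k][j]
    return out

rng = random.Random(0)

def rand_matrix(n):
    # Small integer-valued complex entries keep the arithmetic exact.
    return [[complex(rng.randint(-3, 3), rng.randint(-3, 3))
             for _ in range(n)] for _ in range(n)]

a, b = rand_matrix(4), rand_matrix(4)
strassen_count, schoolbook_count = [0], [0]
same = strassen(a, b, strassen_count) == schoolbook(a, b, schoolbook_count)
print(same, strassen_count[0], schoolbook_count[0])  # True 49 64
```

Two levels of seven products give 7 × 7 = 49, against 4³ = 64 for the textbook method. The equality check is the point: a claimed improvement in this domain can be confirmed by anyone who multiplies the matrices and counts.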

FIG. I: How the documented cases actually work. The model proposes a flood of new candidates; an external verifier (a proof checker, a benchmark, a unit test) keeps only the ones that hold and feeds them back. The system invents. The model is its generative half; the oracle (drawn in the accent colour) is the half that knows whether the novelty is any good, and the model cannot supply that alone.

But who, exactly, invented?

Here is the nuance the headlines flatten. In every one of those cases it was not the model alone that invented — it was a system. The model generates a torrent of variants; an external verifier — a proof, a benchmark, a test that either passes or doesn’t — throws out everything that fails and keeps the rare thing that holds. Invention takes two faculties: producing the new, and recognising that it is good. Language models are extraordinary at the first and, on their own, almost helpless at the second. They need an oracle.
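
The division of labour can be sketched in a few lines of Python. This is a toy under loud assumptions, not how FunSearch or AlphaEvolve are built: the "model" is replaced by a random proposer, and the function names (`is_cap_set`, `propose_and_verify`, `random_proposer`) are hypothetical. The oracle checks the cap-set property, meaning no three distinct points of F_3^n sum to zero modulo 3.

```python
import itertools
import random

def is_cap_set(points):
    """Oracle: True if no three distinct points sum to zero mod 3."""
    for x, y, z in itertools.combinations(points, 3):
        if all((p + q + r) % 3 == 0 for p, q, r in zip(x, y, z)):
            return False
    return True

def propose_and_verify(propose, verify, rounds):
    """Keep a candidate only when it is larger and the verifier accepts it."""
    kept = frozenset()
    for _ in range(rounds):
        candidate = propose(kept)                      # generative half
        if len(candidate) > len(kept) and verify(candidate):  # oracle half
            kept = candidate                           # survivor is fed back
    return kept

def random_proposer(n, rng):
    """Stand-in for the model: extend the current set by one random point."""
    def propose(current):
        point = tuple(rng.randrange(3) for _ in range(n))
        return current | {point}
    return propose

rng = random.Random(0)
best = propose_and_verify(random_proposer(3, rng), is_cap_set, rounds=2000)
print(len(best), is_cap_set(best))
```

The proposer knows nothing about cap sets; every bit of quality in the result comes from `verify`. Swap in a better proposer and the search gets faster, but remove the verifier and nothing separates a valid construction from a broken one.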

That single fact explains the whole pattern of where they invent and where they don’t. Give them a domain where value is verifiable (mathematics, algorithms, code that either runs faster or doesn’t) and the loop closes: the oracle is real, and invention happens. Point them at a domain where value is subjective (is this poem good, is this melody worth keeping) and the oracle vanishes. Nothing external tells the system which novelty mattered, so it settles for “plausible,” which is just the average of what it has already seen.
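
What a real oracle looks like in a verifiable domain can be shown with bin-packing. The sketch below is an illustration only, not FunSearch's evaluator or the heuristic it found: `pack`, `first_fit`, `best_fit` and `benchmark` are hypothetical names, the instances are made up, and every item is assumed to fit in an empty bin.

```python
def pack(items, capacity, priority):
    """Online bin packing: put each item in the feasible bin with the
    highest priority score, opening a new bin when none fits.
    Returns the number of bins used."""
    remaining = []  # free space left in each open bin
    for item in items:
        feasible = [i for i, free in enumerate(remaining) if free >= item]
        if feasible:
            chosen = max(feasible,
                         key=lambda i: priority(item, remaining[i], i))
            remaining[chosen] -= item
        else:
            remaining.append(capacity - item)
    return len(remaining)

def first_fit(item, free, index):
    return -index           # prefer the earliest opened bin

def best_fit(item, free, index):
    return -(free - item)   # prefer the tightest fit

def benchmark(priority, instances, capacity):
    """The oracle: total bins used on a fixed instance set. Lower is better."""
    return sum(pack(items, capacity, priority) for items in instances)

instances = [[4, 7, 3, 6], [5, 5, 5, 5]]
print(benchmark(first_fit, instances, capacity=10))  # 5
print(benchmark(best_fit, instances, capacity=10))   # 4
```

The benchmark returns a number, and the number settles which heuristic is better on these instances without anyone's opinion. There is no equivalent function to call on a poem.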

Where I land

So I would split the claim cleanly in two, and accept one half while doubting the other. “The artifact that came out is new” — defensible, sometimes provably so; Strassen’s better twin did not exist, and now it does. But “the model has the intentional power to invent” — weak. It has no purpose of its own. It did not decide that 4×4 matrix multiplication was a problem worth caring about, did not feel the itch, will not be pleased that it’s solved. The wanting, the framing, the judgment of what deserves to be attempted — all of that came from outside. Credit belongs to the system. The model is its engine, and an engine is not a driver.

The objection, and the part no one can close

The strongest counter is the stochastic parrot: that all of this is just very fluent recombination of training data, never real invention. It is a serious argument and half right — most model output is recombination. But it breaks on the hard cases. Strassen’s better twin was not in the training data, because it did not exist anywhere until 2025, and it is verifiably correct — so it cannot be reduced to memorised retrieval. Something more than lookup happened.

And yet the parrot keeps one last move, the one I can’t answer: where exactly is the line between recombining extremely well and inventing? I don’t have a clean one. Neither, if I’m honest, does anyone — and that holds when we describe humans too. We recombine our training as well. Maybe the comfortable border we keep trying to draw between human invention and machine recombination was never sharp to begin with, and the machines are simply making us admit it. That is the part of the question I think is genuinely open — and I would rather leave it open than pretend I have closed it.
