An emergent ability is a capability that a smaller model does not have and a larger one does, where the transition is not a smooth ramp but closer to a switch flipping. Below some scale the model performs no better than chance on the task. Past that scale it can do it. The ability was not visible in the trend from smaller systems.

Examples reported across model families include multi-step arithmetic, answering questions in specialised domains, following instructions in languages barely present in training, and chaining reasoning steps. None of these was designed in. They showed up as models grew.

A real debate worth flagging

It would be dishonest to present emergence as settled. Some researchers argue that a share of reported emergence is an artefact of measurement: use a harsh all-or-nothing metric and progress looks like a sudden jump, while a smoother metric reveals steady underlying improvement. That critique lands on specific cases and is worth taking seriously.
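The measurement argument can be made concrete with a toy calculation. The sketch below is illustrative only: the per-token accuracies are invented rather than measured, the answer length of 10 is arbitrary, and the function names are hypothetical. It assumes each token of an answer is right or wrong independently, so the chance of getting an entire answer right is the per-token accuracy raised to the answer length.

```python
# Toy illustration: every number here is made up, not taken from a real model.

def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token of an answer is correct.

    Assumes per-token errors are independent, which is a simplification.
    """
    return per_token_accuracy ** answer_length


def compare_metrics(accuracies, answer_length):
    """Pair each smooth per-token score with its all-or-nothing counterpart."""
    return [(p, exact_match_rate(p, answer_length)) for p in accuracies]


# Hypothetical sequence of models, ordered by increasing scale,
# whose per-token accuracy rises steadily.
PER_TOKEN_ACCURACIES = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
ANSWER_LENGTH = 10  # hypothetical: the task needs 10 correct tokens in a row

if __name__ == "__main__":
    for p, em in compare_metrics(PER_TOKEN_ACCURACIES, ANSWER_LENGTH):
        print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

Under these assumptions the smooth metric climbs steadily from 0.50 to 0.99, while the all-or-nothing metric stays under 3% for the first three models and then passes 90% by the last. The same underlying progress looks gradual on one metric and abrupt on the other.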

It does not dissolve the practical problem. Whether a capability arrives as a true discontinuity or as a steep-enough curve that we only notice it after it crosses a threshold, the governance situation is the same: we learn a model can do something new after it can, not before. For a policymaker deciding whether the next training run is safe to green-light, the distinction between genuine emergence and a sharp curve is academic. Either way the surprise is on the far side of the decision.

Why unpredictability is the core issue

The uncomfortable fact is that we cannot reliably predict, before training a larger model, the full list of things it will be able to do. Scaling laws forecast broad performance measures like loss reasonably well. They do not tell you which specific abilities will appear, or when. Capability is more predictable in aggregate than in particulars.
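To show what forecasting loss looks like in practice, here is a minimal sketch that fits a power law to the losses of small models and extrapolates to a larger one. Everything in it is an assumption made for illustration: the data are synthetic, generated from a made-up curve (loss = 10 × size^-0.1), the functional form is a bare power law with no irreducible-loss term, and the function names are hypothetical. Real fits are noisier and use richer forms.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares line through (log size, log loss).

    Returns (intercept, slope) so that log(loss) = intercept + slope * log(size).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_loss(intercept, slope, size):
    """Extrapolate the fitted line to a new model size."""
    return math.exp(intercept + slope * math.log(size))


def synthetic_loss(size):
    """Made-up ground truth used only to generate example data."""
    return 10.0 * size ** -0.1


# Hypothetical small models (parameter counts) and their synthetic losses.
small_sizes = [1e6, 1e7, 1e8, 1e9]
small_losses = [synthetic_loss(s) for s in small_sizes]

intercept, slope = fit_power_law(small_sizes, small_losses)
predicted = predict_loss(intercept, slope, 1e10)  # a model 10x larger than any fitted

if __name__ == "__main__":
    print(f"fitted exponent: {slope:.3f}")
    print(f"predicted loss at 1e10 parameters: {predicted:.3f}")
```

The output of such a fit is a single aggregate number. Nothing in the fitted curve says which tasks a model at that loss will or will not be able to perform, which is the gap the paragraph above describes.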

For most abilities this is merely interesting. For the abilities that matter to safety it is alarming, because the same unpredictability that produces a surprise talent for translation can produce a surprise talent for the things covered in dangerous capability evaluations: assisting with a cyberattack or a bioweapon, or manipulating people at scale. A capability we did not anticipate is a capability we did not test for and did not build safeguards against.

We can predict roughly how good the next model will be. We cannot predict everything it will be able to do. That gap is where the risk sits.

What it means for how we scale

Emergence turns each large jump in scale into a step into partial darkness. You are building a system whose complete capability profile you will only learn after it exists, and by then, if a dangerous capability came with it, the system already has it. This is a different kind of risk from a known hazard you can measure and mitigate in advance.

It strengthens the case the Foundation makes throughout: that frontier scaling should not run ahead of our ability to understand what we are creating, and that the burden should be on demonstrating a large new system is safe before it is trained and deployed, not on discovering its dangers afterward. When abilities can arrive unannounced, caution before the jump is the only caution that helps. The framework for that caution is in our plan, and the shortening timelines that raise the stakes are covered in our AGI timeline.

Common questions

What are emergent abilities in large language models?

Emergent abilities are capabilities that are absent in smaller models and appear in larger ones, often with a sharp transition rather than a smooth ramp. Below a certain scale a model performs at chance on the task; past that scale it can do it. Reported examples include multi-step arithmetic, specialised question answering, following instructions in barely-represented languages, and multi-step reasoning, none of which were deliberately designed in.

Are emergent abilities real or a measurement artefact?

There is genuine debate. Some researchers argue that part of reported emergence is an artefact of harsh all-or-nothing metrics, which make steady underlying progress look like a sudden jump, and that smoother metrics reveal gradual improvement. This critique applies to specific cases. It does not remove the practical issue, because whether a capability arrives as a true discontinuity or a steep curve, we typically discover it only after a model has it.

Why do emergent abilities matter for AI safety?

Because we cannot reliably predict, before training a larger model, the full set of things it will be able to do. Scaling laws forecast broad performance measures such as loss reasonably well, but not which specific abilities will appear or when. For safety-relevant abilities, such as helping with cyberattacks or bioweapons, that unpredictability means a dangerous capability can arrive unannounced, in a system we have already built, without having been anticipated, tested for, or guarded against.

How should emergence affect the way we scale AI?

It suggests that each large increase in scale is a step into partial darkness, since a model's complete capability profile is only learned after it exists. That argues for not letting frontier scaling outrun our ability to understand what we are building, and for placing the burden on demonstrating that a large new system is safe before it is trained, rather than discovering its hazards after deployment when it already possesses them.