What Large Language Models Actually Do
If you've spent any time reading about artificial intelligence over the past two years, you've probably encountered two stories. In one, large language models are moments away from replacing most knowledge work. In the other, they're overblown autocomplete, parlour tricks that string words together without understanding any of them.
Both stories are wrong, and the gap between them is where most of the important decisions sit.
Large language models (LLMs), the technology behind ChatGPT, Claude, Gemini, and their competitors, are sophisticated prediction engines. They compress patterns from trillions of words of text into billions of numerical parameters, producing outputs that range from impressive to confidently wrong. They do not understand language the way humans do, yet they develop rich internal representations that defy easy dismissal. This tension, between real capability and real limitation, defines the current moment and shapes every decision organisations must make about deploying these systems.
This guide explains the real science, draws on peer-reviewed research and credible institutional sources, and aims to give you the foundation needed to make sound decisions about AI. No jargon without explanation. No hype. No false reassurance.
Here is what is actually going on.
Part One: The Engine Under the Hood
The architecture that changed everything
Every major LLM runs on the same foundational design: the Transformer, introduced in 2017 by a team at Google in a paper titled "Attention Is All You Need." Before Transformers, language models processed text sequentially, reading a sentence one word at a time while trying to remember what came before. The Transformer replaced this with a mechanism called self-attention, which processes all words simultaneously and lets the model consider the entire context at once.
Self-attention works through an elegant mechanism that is worth understanding at an intuitive level. When the model encounters a sentence like "The animal didn't cross the street because it was too tired," it needs to figure out what "it" refers to. Self-attention lets every word in the sentence "look at" every other word and compute a relevance score. Through training, the model learns that "it" should attend strongly to "animal" and weakly to "street." The model runs multiple attention operations (called "heads"), each tracking different types of relationships. One head might track grammatical dependencies, another semantic meaning, another pronoun references.
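The core computation can be sketched in a few lines of NumPy. This is an illustrative single-head version with random weights, not a trained model; real Transformers learn the projection matrices during training and run many attention heads in parallel:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence of word vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project each word to query/key/value
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every word scores every other word
    weights = softmax(scores, axis=-1)        # relevance scores, summing to 1 per word
    return weights @ V, weights               # blend the values by relevance

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 "words", each an 8-dimensional vector
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)               # each word gets a context-aware vector
```

Each row of `weights` records how strongly one word attends to every other word; in a trained model, the row for "it" in the sentence above would weight "animal" heavily.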
This ability to consider all words simultaneously, rather than sequentially, is what made LLMs possible at scale. It allowed massive parallelisation on modern hardware (GPUs), which in turn allowed training on datasets that would have been computationally infeasible with earlier architectures. Jay Alammar's illustrated guide to Transformers remains one of the best visual explanations if you want to go deeper.
Breaking text into pieces: how tokenisation shapes what AI can and cannot do
Before an LLM can process text, that text must be converted into numbers. This happens through tokenisation, breaking text into chunks called tokens. Most LLMs use a technique called Byte Pair Encoding, which iteratively merges frequently occurring character combinations into larger units. The resulting vocabulary for a model like GPT-4 contains roughly 100,000 tokens.
Common words become single tokens. Less common words get split into subword pieces. "Understanding" might be one token; "tokenisation" might become ["token", "isation"]. Numbers are often split into individual digits. The important thing: the model never sees individual characters, only these pre-defined chunks.
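The merge loop at the heart of Byte Pair Encoding fits in a short function. This is a toy illustration of the algorithm on a handful of words, not any production tokeniser (real implementations operate on bytes and train over huge corpora):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    vocab = Counter(tuple(w) for w in words)   # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():    # rewrite every word with the merge applied
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Frequent character pairs merge first, so common chunks become single tokens.
print(bpe_merges(["low", "low", "lower", "lowest"], 3))
```

Run on this tiny corpus, the first merges build up the shared stem "low" piece by piece, which is exactly why common words end up as single tokens while rare words stay split.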
This has real consequences that show up in everyday use. When asked "How many r's are in strawberry?", early GPT-4 famously answered "2" because "strawberry" was tokenised into subword pieces like ["straw", "berry"], and the model never examined the individual letters. Many apparent LLM "failures" in arithmetic, spelling, and character-level tasks trace directly to tokenisation. The model isn't failing to reason; it literally cannot see the information it would need.
If you've ever found an LLM surprisingly poor at a task that seems trivially easy for a human, tokenisation is often the explanation.
The fundamental trick: predicting the next word
Strip away the complexity and every LLM does the same thing: given a sequence of tokens, it produces a probability distribution over all possible next tokens and selects one. That selected token is appended to the sequence, and the process repeats. This is called autoregressive generation: each new word depends on everything that came before it.
A "temperature" setting controls how the model selects from its probability distribution. Low temperature (around 0.2) makes the model predictable and conservative, strongly favouring the highest-probability tokens. High temperature (around 1.5) introduces more randomness, producing more varied and creative output. This is why you sometimes see different answers to the same question: the model is sampling from a distribution, not looking up a fixed answer.
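The sampling step is simple enough to sketch directly. This toy example assumes we already have the model's raw scores (logits) for a three-token vocabulary; dividing by the temperature before the softmax is what sharpens or flattens the distribution:

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    """Turn raw scores into a probability distribution and sample one token index."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature  # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())                   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

logits = [2.0, 1.0, 0.2]   # pretend scores for a 3-token vocabulary
_, cold = sample_next(logits, temperature=0.2)
_, hot = sample_next(logits, temperature=1.5)
print(cold.round(3), hot.round(3))  # cold piles probability on the top token; hot spreads it out
```

At temperature 0.2 almost all the probability mass lands on the top-scoring token; at 1.5 the alternatives become live options, which is where the variety (and some of the unpredictability) comes from.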
What makes this simple mechanism produce complex, coherent text is the sheer scale of what the model learns during training. To accurately predict the next word across trillions of examples spanning physics, contract law, Shakespearean sonnets, and Python code, the model must learn very rich representations of language, knowledge, and reasoning patterns. The task is simple; achieving it requires developing something that looks, from the outside, remarkably like understanding.
Whether it actually is understanding is a question we'll return to.
Part Two: How LLMs Learn
Pre-training: absorbing the world's text
LLM training happens in distinct stages, each serving a different purpose.
The first and most expensive stage is pre-training. The model processes massive text datasets (books, websites, academic papers, code repositories, forums) learning to predict the next token at each position. There is no human labelling involved. The model simply reads text and tries to guess what comes next, billions of times, adjusting its internal parameters each time it gets it wrong.
The scale involved is difficult to grasp. GPT-3 (2020) trained on 300 billion tokens. Meta's Llama 3 (2024) trained its 8-billion-parameter model on 15 trillion tokens. GPT-4's training reportedly cost over $100 million and consumed an estimated 2.1 × 10²⁵ floating-point operations, a number so large it has no intuitive meaning. What matters is the implication: building a frontier LLM from scratch requires resources available to perhaps a dozen organisations on the planet.
Pre-training produces a model that can complete text fluently, but one that isn't particularly useful as an assistant. It might continue a question with another question, or respond to a request with unrelated text. It has learned language, but not the specific behaviour of being helpful.
Fine-tuning and alignment: from text predictor to useful assistant
The transformation from text predictor to helpful assistant happens through fine-tuning and alignment.
Fine-tuning involves training the model on curated examples of desired behaviour: well-crafted prompt-response pairs that demonstrate the format, tone, and helpfulness expected of an AI assistant. This teaches the model what kind of text to generate, not just any text.
RLHF — Reinforcement Learning from Human Feedback — is the technique that made the difference between GPT-3 (impressive but unwieldy) and ChatGPT (actually useful). The process works as follows: human annotators are shown pairs of model responses to the same prompt and asked to select the better one. A separate "reward model" learns to predict these human preferences. The LLM is then optimised to generate responses that score higher on that reward model.
The difference is hard to overstate. OpenAI's InstructGPT model, with just 1.3 billion parameters and RLHF, was preferred by human evaluators over the much larger 175-billion-parameter GPT-3 without it. RLHF uses less than 2% of pre-training compute, a tiny additional investment that fundamentally transforms usability.
Newer alignment techniques have refined this process further. Direct Preference Optimisation (DPO), introduced in 2023, eliminates the separate reward model entirely, optimising the LLM directly on preference data: simpler, cheaper, and comparably effective. Anthropic's Constitutional AI takes a different approach: instead of collecting thousands of individual human judgements, it defines a small set of written principles (a "constitution") and trains the model to critique and revise its own outputs according to those principles. An AI model then evaluates response pairs against the constitution, generating the preference data needed for training. This reduces dependence on expensive human annotation while producing models that engage with sensitive queries by explaining their reasoning rather than simply refusing.
Scaling laws: the economics behind the investment
Two papers shape how the industry thinks about the relationship between investment and capability.
Kaplan et al. (2020), at OpenAI, discovered that model performance improves as a precise mathematical function of three variables: model size (number of parameters), dataset size (number of training tokens), and compute (processing power). These relationships hold across seven orders of magnitude, a remarkably clean result in a field full of messy empirics. Their recommendation was to prioritise model size. This directly motivated GPT-3's massive 175-billion-parameter design.
Hoffmann et al. (2022), the "Chinchilla" paper from Google DeepMind, overturned that advice. By training over 400 models of varying sizes and dataset quantities, they showed that model size and training data should be scaled equally, with an optimal ratio of roughly 20 tokens per parameter. Their 70-billion-parameter model, Chinchilla, outperformed GPT-3 (175 billion parameters), Gopher (280 billion), and several other much larger models. The implication was stark: previous large models had been systematically undertrained on data. They were too big for how much they'd read.
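The 20-tokens-per-parameter rule makes for easy back-of-envelope arithmetic. A quick sketch using the figures quoted above:

```python
def chinchilla_optimal_tokens(params):
    """Chinchilla rule of thumb: roughly 20 training tokens per parameter."""
    return 20 * params

# Chinchilla itself: 70 billion parameters -> about 1.4 trillion tokens.
print(chinchilla_optimal_tokens(70e9) / 1e12)  # → 1.4

# GPT-3 trained on 300B tokens against a ~3.5T-token optimum:
# it saw less than a tenth of the data the rule recommends.
print(round(300e9 / chinchilla_optimal_tokens(175e9), 3))  # → 0.086
```

That last number is the paper's point in miniature: by this rule, GPT-3 was dramatically undertrained for its size.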
The industry has since moved beyond even these ratios. Llama 3's 8-billion-parameter model trained at 1,875 tokens per parameter, nearly 100 times the Chinchilla-optimal ratio, because smaller models trained on more data are cheaper to run when serving millions of users, and serving costs dominate the economics.
A new dimension emerged in late 2024: inference-time compute scaling. Research from Google DeepMind showed that spending more computational resources during generation, letting the model "think longer," can allow a smaller model to outperform one 14 times larger. This is the principle behind "reasoning models" like OpenAI's o1 and o3, which we'll discuss later. The field has moved from "build bigger models" to "train smarter and think harder at inference time."
Context windows: working memory and its limits
The context window is the total amount of text an LLM can process at once: your input plus its output. Think of it as the model's working memory for a single conversation.
Context windows have grown fast: GPT-2 (2019) had 1,024 tokens (roughly 750 words); GPT-3 offered 2,048; Claude 3 reached 200,000; and Gemini 2.5 Pro now supports 2 million tokens, enough to process several full-length novels.
However, a larger context window does not mean the model can reason equally well about all the information it contains. Research consistently shows that models weight information at the beginning and end of their context far more heavily than information in the middle, the "lost in the middle" effect. Simple retrieval tasks (finding a specific fact in a long document) work reasonably well at scale, but complex reasoning across long contexts remains unreliable. Most models claiming 200,000 tokens become noticeably less reliable beyond 130,000.
A useful analogy: a 200,000-token context window is like being handed a 600-page report. You can skim it and find specific sections. But if someone asks you a question that requires synthesising information from pages 47, 203, and 491, you're going to struggle, and so will the model.
Part Three: What LLMs Genuinely Do Well
It's worth being specific about where the evidence for LLM capability is strongest, because precision here matters more than enthusiasm.
Writing, summarisation, and translation
Text generation is where LLMs perform most reliably, and where the evidence base is most robust.
A peer-reviewed study published in Science (Noy & Zhang, 2023) tested 453 college-educated professionals on realistic writing tasks. Those using ChatGPT completed tasks 40% faster and produced output rated 18% higher in quality by independent evaluators. The entire productivity distribution shifted upward, not just the bottom. Workers who initially produced lower-quality output benefited most, making AI a notable leveller of writing ability.
In translation, GPT-4 outperforms established commercial machine-translation systems on both automated and human evaluation metrics. Researchers have described its performance as comparable to junior and mid-level human translators: strong on common language pairs, weaker on rarer languages and highly specialised domains.
The practical implication for organisations is clear: if your work involves producing, editing, summarising, or translating text in major languages, current LLMs offer measurable productivity gains. The gains are largest for first drafts and routine communications, and smallest for work requiring deep domain expertise or a distinctive voice.
Code generation
The improvement in code generation is the most dramatic and measurable success story in the LLM space.
On SWE-bench Verified — a benchmark that tests AI's ability to fix real bugs from GitHub repositories — performance went from solving 4.4% of problems in 2023 to over 80% by late 2025. GitHub's controlled experiment showed developers completed coding tasks 55.8% faster with AI assistance, and a Microsoft field study found 13–22% more code contributions per week.
The caveats are important. On SWE-bench Pro, which tests enterprise-grade, previously unseen codebases, the best models score only around 23%. A GitClear analysis found AI-generated code has a 41% higher revision rate. The pattern is consistent: AI coding assistance works well for many tasks (scaffolding new projects, writing boilerplate, debugging common errors, explaining unfamiliar codebases) but falls well short of autonomous software engineering on complex, real-world systems.
If you manage a software team, the evidence suggests AI coding tools should already be part of your workflow. If someone tells you those tools can replace your developers, the evidence says otherwise.
Analysis and reasoning
LLM reasoning capabilities have improved substantially, particularly with a new category of "reasoning models" that emerged in late 2024.
Standard models now pass professional examinations at impressive levels. GPT-4 cleared the bar exam at the 90th percentile and medical licensing exams at 91–95%. On GPQA Diamond — a benchmark of PhD-level science questions — the latest models exceed human expert accuracy.
Reasoning models like OpenAI's o1 and o3, and Claude with extended thinking, represent a real advance. These models use reinforcement learning to develop chain-of-thought reasoning, spending much more time working through problems step by step. OpenAI's o3 achieved 87.5% on ARC-AGI — a benchmark specifically designed to test general reasoning — surpassing the 85% threshold considered human-level. It scored in the 89th percentile on competitive programming challenges.
Whether this constitutes "real" reasoning remains legitimately debated among researchers, a question we address in Part Four. What matters practically is that these models are measurably better at complex, multi-step problems than their predecessors, even if the mechanism underlying that improvement is still contested.
Part Four: The Honest Limitations
This section is arguably more important than the previous one. Knowing what LLMs can do enables good use; knowing what they cannot do prevents bad outcomes.
Hallucination is not a bug. It's a structural feature
Hallucination — the tendency to generate fluent, confident text that is factually wrong or entirely fabricated — is the most important LLM limitation, and the one most commonly misunderstood.
It is not a software bug that will be patched in the next release. It is an inherent property of how these systems work.
LLMs don't store facts in a retrievable database the way a spreadsheet stores numbers. They compress vast quantities of information into statistical parameters, and during generation, they "decompress" it. When the model's compressed information is incomplete, ambiguous, or conflicting, it fills gaps with statistically plausible content, text that sounds right, follows the expected patterns, and is delivered with the same confidence as accurate information.
A 2024 paper by Xu, Jain, and Kankanhalli used results from computational learning theory to prove formally that all computable LLMs must hallucinate. Follow-up research (Banerjee et al., 2025) extended this finding: techniques that use LLMs to check their own outputs (chain-of-thought verification, self-consistency checks) cannot eliminate the problem. The system checking for hallucinations has the same structural tendency to hallucinate.
The rates vary by domain and model, but they remain significant across all systems. A systematic review found citation hallucination rates of 28.6% for GPT-4 and 91.4% for Google Bard. Stanford's legal hallucination study found rates of 69–88% when models were asked to answer legal questions. Even with careful prompt engineering in clinical documentation, rates persist around 1.5%.
Perhaps most counterintuitively, reasoning models, which are better at complex tasks, hallucinate more than standard models in certain contexts. OpenAI's own evaluations show o3 hallucinating 33% of the time and o4-mini 48% when summarising information about people, versus 16% for the earlier o1.
The most instructive real-world case was Mata v. Avianca (2023). Attorney Steven Schwartz used ChatGPT to research legal precedents. The model fabricated six entirely fictitious cases, with convincing names, fake quotes from non-existent judicial opinions, and made-up citations. When Schwartz asked the model if the cases were real, it confidently affirmed they were. The court sanctioned the attorneys $5,000.
Retrieval-Augmented Generation (RAG) — a technique that grounds LLM responses in retrieved documents from curated databases — reduces hallucination significantly (by up to 71% in well-implemented systems). But it cannot eliminate it. The retrieval step can miss relevant documents, and the model can still generate fabricated content even when given accurate source material.
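A minimal sketch of the RAG pattern, with simple word overlap standing in for real retrieval (production systems use embedding search over a vector index) and the final LLM call left out:

```python
def answer_with_rag(question, documents, top_k=2):
    """Minimal RAG sketch: retrieve relevant passages, then ground the prompt in them."""
    # Toy retrieval: rank documents by word overlap with the question.
    q_words = set(question.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    context = "\n".join(ranked[:top_k])
    prompt = (
        "Answer using ONLY the sources below. If they don't contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return prompt  # in a real system this prompt would be sent to the LLM

docs = [
    "The refund policy allows returns within 30 days of purchase.",
    "Shipping to the UK takes 3-5 business days.",
    "Gift cards cannot be refunded.",
]
print(answer_with_rag("What is the refund policy for returns?", docs))
```

The grounding instruction and the retrieved context reduce fabrication, but note the two failure modes the text describes remain visible even here: the retriever can rank the wrong documents, and the model can still stray from the sources it is given.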
The operating principle for organisations: Never deploy an LLM in a context where a confident but fabricated output could cause meaningful harm, unless there is a robust verification layer between the model's output and any consequential action.
The understanding question: what's actually happening inside?
Whether LLMs "understand" anything is the field's deepest question, and the honest answer is that researchers genuinely disagree.
The sceptical position was crystallised in a 2021 paper by Bender, Gebru, and colleagues, titled "On the Dangers of Stochastic Parrots." Their argument: LLMs are systems that stitch together sequences of words according to statistical patterns, but without any reference to meaning. Human language pairs form (the words themselves) with meaning (what those words refer to); LLMs have access only to form. Their fluency creates an illusion, exploiting our natural tendency to attribute understanding to anything that speaks coherently.
The counter-evidence has grown substantially since 2021, and the strongest comes from mechanistic interpretability, the effort to look inside LLMs and understand what their internal components actually do.
Anthropic's research programme has produced the most significant findings here. In 2024, using a technique called sparse autoencoders, researchers extracted millions of interpretable "features" from a Claude model. They found a "Golden Gate Bridge" feature that responded to the bridge's name in English, Japanese, Chinese, Greek, Vietnamese, and Russian, and to images of the bridge. This suggests something more structured than keyword-matching: the model appears to develop real conceptual representations that cut across languages and even modalities.
More strikingly, a March 2025 paper titled "On the Biology of a Large Language Model" traced step-by-step computation inside Claude 3.5 Haiku. When asked "What is the capital of the state containing Dallas?", the model first internally activated a representation for "Texas" as an intermediate step before producing "Austin", demonstrating multi-step reasoning, not memorised lookup. The model also pre-selects rhyming words before composing each line of poetry, suggesting a form of planning.
Manipulating these internal features also causally changes model behaviour, confirming that they play functional roles in the computation rather than being merely statistical artefacts.
The balanced view, and our position: LLMs are categorically more than phone-keyboard autocomplete. They operate in high-dimensional mathematical spaces, developing representations that capture semantic relationships, abstract concepts, and multi-step reasoning pathways. But whether this constitutes "understanding" in any philosophically meaningful sense, whether there is something it is like to be an LLM processing a sentence, remains unresolved.
For practical decision-making, the safest operating assumption is this: LLMs can simulate understanding well enough to be useful across a wide range of tasks, but unreliably enough to require human oversight for anything that matters.
The jagged frontier: the most important concept in this guide
If you take one idea from this article into your next meeting about AI, it should be this one.
In 2023, a team from Harvard Business School, in collaboration with Boston Consulting Group, ran a pre-registered experiment with 758 BCG consultants, one of the largest and most rigorous studies of AI in professional work. The study introduced a concept that should be part of every organisation's AI vocabulary: the jagged technological frontier.
The finding: AI capabilities form an uneven, unpredictable boundary. Some tasks that appear very difficult for humans are easy for AI. Some tasks that seem straightforward fall completely outside its capabilities. And (this is the critical part) the boundary doesn't follow any obvious pattern.
For tasks that fell inside the frontier, consultants using GPT-4 completed 12.2% more tasks, finished 25.1% faster, and produced results rated 40% higher in quality. Lower-performing consultants saw the largest gains.
For one task deliberately designed to fall outside the frontier, consultants using AI performed 19 percentage points worse than those without it. The AI gave confident but wrong answers, and the consultants, trusting the tool, accepted them. Their accuracy dropped from 84% to between 60% and 70%.
The practical implications are significant. You cannot reliably predict which tasks sit inside and which sit outside the frontier without domain expertise. The AI itself certainly cannot tell you. It performs with equal confidence whether it is right or wrong. The consultants who performed worst were those who trusted AI output uncritically. The best performers fell into two patterns: "Centaurs," who maintained a clear division of labour (doing some tasks themselves, delegating others to AI based on a strategic assessment of strengths), and "Cyborgs," who deeply integrated AI into their workflow with continuous checking and iteration. Ethan Mollick's summary of the study remains the best accessible overview of these findings.
Both approaches require knowing where the frontier lies, which demands human expertise.
What this means for you: Before deploying AI in any workflow, identify whether each specific task is likely inside or outside the frontier. Test empirically. Do not assume that because AI handles one part of a process well, it will handle adjacent tasks equally well. And build in checkpoints where a human with domain knowledge reviews AI output before it leads to action.
Other limitations worth understanding
No persistent memory. LLMs are stateless. Each conversation begins from scratch. The model has no built-in mechanism to remember your previous conversations, accumulate knowledge about your preferences, or learn from past interactions. The context window provides working memory for the current exchange, but once the conversation ends, everything is gone. Companies address this with external memory systems that store and retrieve information across sessions, but this is scaffolding built around a fundamental limitation, not a native capability.
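The scaffolding pattern is straightforward to sketch: notes live in ordinary external storage, and "memory" is just text prepended to the next prompt. The class and method names here are illustrative, not any vendor's actual implementation:

```python
class MemoryStore:
    """External memory scaffolding: the model never changes; we just prepend stored notes."""

    def __init__(self):
        self.notes = []

    def remember(self, note):
        self.notes.append(note)           # persisted outside the model, across sessions

    def build_prompt(self, user_message):
        if not self.notes:
            return user_message
        memory = "\n".join(f"- {n}" for n in self.notes)
        return (f"Known facts about this user from past sessions:\n{memory}\n\n"
                f"{user_message}")

store = MemoryStore()
store.remember("Prefers answers in British English.")
store.remember("Works in compliance at a bank.")
print(store.build_prompt("Summarise this policy document."))
```

Everything the model "remembers" arrives through the context window on each new request; the model's parameters are untouched.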
No real-time information. LLMs have training data cutoffs and no awareness of events after that date. Integrations with web search, external databases, and other tools can provide access to current information, but the model itself has no native ability to know what is happening in the world right now.
Inconsistent outputs. Research shows that minor prompt rephrasings produce very different answers. The same question asked twice may yield different responses. Studies have found accuracy variations of up to 10% across identical repeated runs, and performance swings of up to 45% depending on how a prompt is worded. Models also tend toward overconfidence, using more certain language when generating incorrect information than when generating correct information.
Part Five: Six Misconceptions That Distort the Conversation
If you're going to have productive discussions about AI in your organisation, it helps to name the common misunderstandings directly. Here are the six we encounter most frequently, and what the evidence actually shows.
"AI understands what it's saying"
LLMs develop rich internal representations (as mechanistic interpretability research demonstrates), but these are statistical compressions of training data, not comprehension in the human sense. The distinction matters because it shapes expectations. A system that understands would know when it's wrong and tell you. A system that compresses patterns will sometimes be wrong and present the error with full confidence.
Even Ilya Sutskever, former chief scientist at OpenAI, acknowledged in late 2025 that models learn to solve specific types of problems but struggle to generalise in the way humans do. We are, as MIT Technology Review put it, "hardwired to see intelligence in things that behave in certain ways, whether it's there or not."
"AI is just autocomplete"
This framing, popular among sceptics, undersells what is happening by a considerable margin. Your phone's autocomplete predicts the next word using simple statistical models with tiny vocabularies and no understanding of context beyond a few words. LLMs operate in mathematical spaces of vast dimensionality, processing entire sequences simultaneously through attention mechanisms, and developing internal representations that capture abstract concepts across languages and modalities.
Anthropic's research has shown these models develop features for concepts like "bugs in computer code," "gender bias in language," and "the Golden Gate Bridge," concepts that respond consistently whether the input is in English text, Japanese text, or an image. That is not autocomplete.
But the opposite error, concluding that because it's more than autocomplete, it must be approaching human intelligence, is equally misleading. The truth sits between the two extremes, and it's more interesting than either.
"AI will replace all knowledge work"
An MIT study from late 2025 found that AI can currently perform work equivalent to about 11.7% of US jobs, representing roughly $1.2 trillion in wages. That is significant, but it is not "all knowledge work."
Research from Stanford and Harvard found that entry-level employment in AI-exposed fields declined around 20% since late 2022. But employment for older workers grew 6–9% in the same fields. The researchers' explanation: older workers possess tacit knowledge from experience, the kind of contextual understanding that comes from years of practice and is never written down in the documents LLMs trained on.
The dominant pattern is task automation, not job elimination. Specific tasks within roles get automated or augmented; the roles themselves evolve. This is consistent with how most workplace technologies have historically played out, including spreadsheets, email, and the internet.
"AI output is always reliable if the model is good enough"
Even the best-performing model tested (Gemini 2.0 Flash) has a measured hallucination rate of 0.7%, and that is on standardised benchmarks, not real-world deployment. Legal hallucination rates exceed 70%. In 2025, an estimated 47% of enterprise AI users made at least one significant business decision based on hallucinated content. Knowledge workers report spending an average of 4.3 hours per week fact-checking AI outputs.
No model is "good enough" to eliminate the need for verification. The question is not whether to verify, but how to build efficient verification into your workflows.
"Bigger models are always better"
GPT-4.5, released in February 2025, was reportedly larger than GPT-4 but widely considered poor value: marginal improvement at substantially higher cost. Meanwhile, Microsoft's Phi-3-mini, with just 3.8 billion parameters, matches the performance of models with over 500 billion parameters on key benchmarks, a roughly 140-fold size reduction.
The industry has moved decisively away from "scale the model" toward "improve the training pipeline and the inference strategy." Better data curation, more sophisticated training techniques, and inference-time reasoning have driven the majority of recent progress. Sebastian Raschka's "State of LLMs 2025" provides a thorough technical review of this shift.
"AI learns from your conversations"
This is one of the most widespread misunderstandings, and it's worth being precise about.
Training — the process of learning patterns from data — costs millions of pounds, takes weeks or months, and happens infrequently. It produces a model with fixed parameters. Inference — the process of generating a response to your prompt — uses those fixed parameters to produce output. No model update occurs. The weights do not change based on your conversation.
Enterprise API agreements from major providers (OpenAI, Anthropic, Google) generally guarantee contractually that customer data is not used for training. Some consumer-facing products may use conversations to improve future models, but this is disclosed in the terms of service and can usually be disabled via an opt-out setting.
Memory features that carry information across sessions (like ChatGPT's memory or Claude's memory) work by storing notes externally and including them in future prompts, not by updating the model's parameters. The model itself never changes as a result of talking to you.
Part Six: The Key Research That Shaped the Field
You don't need to read academic papers to make good decisions about AI. But knowing the handful of foundational studies that shaped the field, and what each actually showed, helps you evaluate claims and cut through the noise.
"Attention Is All You Need" (Vaswani et al., 2017). The paper that introduced the Transformer architecture, achieving state-of-the-art language translation while training in a fraction of the time and cost of prior approaches. Every modern LLM descends from this work.
The scaling laws papers (Kaplan et al., 2020; Hoffmann et al., 2022). Established the quantitative relationships between model size, training data, compute, and performance. These papers drove billions of pounds in investment decisions. The Chinchilla paper's finding, that previous models were undertrained on data, reshaped the entire industry's approach.
"Emergent Abilities of Large Language Models" (Wei et al., 2022). Documented 137 capabilities (including multi-digit arithmetic and instruction following) that appeared to emerge suddenly at certain model scales. This fuelled the narrative that scaling could produce unpredictable, qualitatively new capabilities. The important counter-paper, "Are Emergent Abilities a Mirage?" (Schaeffer et al., NeurIPS 2023), argued these apparent jumps were artefacts of how performance was measured, not real sudden leaps. The debate has significant implications for AI safety planning.
"Chain-of-Thought Prompting Elicits Reasoning" (Wei et al., NeurIPS 2022). Showed that providing step-by-step reasoning examples in prompts dramatically improves LLM performance on reasoning tasks, a technique requiring no additional training. This finding underpins the reasoning model paradigm and is directly useful in everyday practice: if you want better results from an LLM on complex tasks, ask it to work through its reasoning step by step.
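In practice the technique is nothing more than prompt construction. A minimal sketch, with an invented worked example as the exemplar (the helper name `make_cot_prompt` and the pen-pricing example are illustrative, not from the paper):

```python
# Minimal sketch of chain-of-thought prompting: the only change is the
# prompt text itself, so no retraining is needed.

def make_cot_prompt(question):
    """Prefix a step-by-step worked example, then ask the real question."""
    exemplar = (
        "Q: A shop sells pens at 3 for 2 pounds. How much do 12 pens cost?\n"
        "A: Let's think step by step. 12 pens is 4 groups of 3. "
        "Each group costs 2 pounds, so 4 x 2 = 8 pounds. The answer is 8.\n\n"
    )
    return exemplar + f"Q: {question}\nA: Let's think step by step."

prompt = make_cot_prompt(
    "A train travels 60 miles in 90 minutes. What is its speed in mph?"
)
```

Sending the resulting prompt to any chat model tends to elicit a written-out chain of reasoning before the final answer, which is where the measured accuracy gains come from.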
Anthropic's mechanistic interpretability research (2022–2025). The most significant progress in understanding what LLMs actually represent internally. Using sparse autoencoders, researchers decomposed neural activations into millions of interpretable features, from concrete concepts like the Golden Gate Bridge to abstract ones like "sycophantic praise" and "code bugs." Manipulating these features causally changes model behaviour, opening the door to better safety and control.
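The core idea of a sparse autoencoder can be shown in a toy forward pass. Everything below is illustrative rather than Anthropic's actual setup: the dimensions, tied weights, and ReLU choice are simplifying assumptions, and a real SAE is trained on millions of recorded activations.

```python
import numpy as np

# Toy sketch of the sparse-autoencoder idea: expand an activation vector
# into a much wider, mostly-zero feature vector, then reconstruct it.

rng = np.random.default_rng(0)
d_model, d_features = 8, 64        # interpretable features vastly outnumber dimensions

W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = W_enc.T.copy()             # tied decoder weights, a common simplification

def encode(x):
    """Sparse feature activations: ReLU leaves only a few features 'on'."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    """Reconstruct the original activation from the sparse features."""
    return f @ W_dec

x = rng.normal(size=d_model)       # stands in for one residual-stream activation
features = encode(x)
x_hat = decode(features)
# Training would minimise reconstruction error plus an L1 penalty on
# `features` to push most entries to exactly zero; this shows only the
# untrained forward pass.
```

Each learned feature direction, once trained, tends to fire on a recognisable concept, which is what makes the decomposition interpretable and, as the research showed, causally manipulable.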
"On the Dangers of Stochastic Parrots" (Bender, Gebru et al., 2021). Argued that LLMs operate without reference to meaning, generating text through statistical pattern-matching alone. The paper became a touchstone in AI ethics discourse and remains the strongest statement of the sceptical position, though subsequent empirical evidence, particularly from interpretability research, has complicated its central claim.
The Harvard Business School / BCG "jagged frontier" study (Dell'Acqua et al., 2023). The most practically important study for organisations. Demonstrated, with 758 consultants in a controlled experiment, that AI sharply improves performance on some tasks while actively degrading it on others, and that the boundary is unpredictable without domain expertise.
Part Seven: Where the Field Stands in Early 2026
The model market has converged
The performance gap between leading models has narrowed sharply. On the LMArena leaderboard, the spread between first and tenth place is just 5.4%, down from 11.9% in 2023. Open-weight models (freely available for anyone to download and run) nearly match proprietary ones: the performance gap narrowed from 8% to roughly 1.7%.
The major players each hold distinct positions. Anthropic's Claude leads enterprise adoption (approximately 32% market share) and code generation (42% market share). OpenAI retains the largest consumer base (over 400 million ChatGPT users). Google DeepMind's Gemini 3 Pro leads reasoning benchmarks. Meta's Llama leads open-source adoption with over 1.2 billion downloads. Chinese competitors, particularly DeepSeek, have closed the quality gap almost entirely.
For organisations, the practical meaning is this: the choice of model matters less than it did two years ago. How you integrate AI into workflows, how you evaluate its output, and how you manage its limitations matter far more.
Four trends that are actually real
Reasoning models represent a legitimate advance. By training LLMs with reinforcement learning against verifiable rewards (correct mathematical answers, working code), models develop step-by-step reasoning that sharply improves performance on complex tasks. The scale of improvement, from 5% to 87.5% on ARC-AGI, is not incremental. These models are considerably more expensive and slower, but for tasks requiring real multi-step reasoning, they are qualitatively better.
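What makes this training recipe workable is that the reward can be checked mechanically rather than judged by humans. A hedged sketch of such a "verifiable reward" for maths answers, where the `Answer: <value>` output convention and the helper names are assumptions invented for the example:

```python
import re

# Sketch of a "verifiable reward": unlike human preference scores, this
# reward can be computed automatically, which is what makes reinforcement
# learning on reasoning traces tractable at scale.

def extract_final_answer(model_output):
    """Assume the model ends its reasoning with 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", model_output)
    return match.group(1) if match else None

def verifiable_reward(model_output, ground_truth):
    """1.0 if the extracted answer matches the known-correct one, else 0.0."""
    answer = extract_final_answer(model_output)
    return 1.0 if answer == ground_truth else 0.0

trace = "12 pens is 4 groups of 3, each costs 2, so 4 * 2 = 8. Answer: 8"
reward = verifiable_reward(trace, "8")  # 1.0
```

Code tasks work the same way, with "run the tests" standing in for the string comparison; the common thread is that correctness is checkable without a human in the loop.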
Tool use and AI agents are moving from concept to early deployment. Agents are LLMs that can reason about a task, use external tools (web search, code execution, databases), and iterate across multiple steps. McKinsey reports 23% of organisations scaling agentic AI in 2025. Gartner projects 40% of enterprise applications will include task-specific agents by end of 2026. The primary constraint is not model capability but data quality, integration complexity, and organisational readiness.
Multimodality — processing images, audio, and video alongside text — is now standard in all frontier models. This enables applications from medical imaging analysis to real-time document processing that were not feasible two years ago.
Inference-time compute scaling means capability improvements no longer depend solely on building bigger models or amassing more training data. Models can improve by "thinking harder" on each individual query. This opens new paths for progress that are less capital-intensive than pre-training scaling.
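One concrete form of "thinking harder" per query is self-consistency: sample several independent attempts and take a majority vote on the final answer. The sketch below simulates this with a stand-in function rather than real LLM calls; `sample_answer` and the vote counts are assumptions for illustration.

```python
from collections import Counter

# Sketch of one inference-time compute strategy: self-consistency.
# More compute per query buys multiple sampled attempts plus a vote.

def self_consistent_answer(sample_answer, n_samples=5):
    """Sample n independent answers and return the most common one."""
    votes = Counter(sample_answer() for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer

# Simulated model: usually right, occasionally wrong.
attempts = iter(["8", "8", "9", "8", "8"])
answer = self_consistent_answer(lambda: next(attempts), n_samples=5)
# Majority vote recovers "8" despite one wrong sample.
```

The cost scales linearly with the number of samples, which is exactly the capability-for-compute trade the paragraph above describes: no bigger model, just more inference spent per question.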
The hype correction is real, and healthy
MIT Technology Review declared 2025 "The Great AI Hype Correction." An MIT study found that 95% of businesses that piloted AI saw zero measurable value after six months. Roughly 42% of companies abandoned most AI initiatives in 2025, up from 17% in 2024.
What is actually delivering value: code generation tools, writing and productivity assistance for knowledge workers (25–55% gains in controlled studies), and specific vertical applications where AI augments domain expertise. What remains overhyped: autonomous agents operating without human oversight, near-term artificial general intelligence, and the belief that deploying a model API into existing workflows will produce transformative results without redesigning those workflows.
The correction is healthy because it shifts attention from "will AI change everything?" (a question about potential) to "where is AI creating value for us right now?" (a question about practice). The second question is more useful.
Part Eight: What This Means for Organisations Making Decisions Today
Adoption is wide but shallow
The headline numbers look impressive: 88% of enterprises report regular AI use; Fortune 500 adoption reaches 92%. But over 80% of those organisations report no meaningful impact on enterprise-wide profitability. An estimated 70–85% of AI initiatives fail to meet their expected outcomes. Total corporate AI investment hit $252.3 billion in 2024, with most organisations stuck in pilot phases.
The organisations generating real value share common traits. They invest in evaluation frameworks and prompt engineering. They redesign workflows rather than bolting AI onto existing processes. They treat AI as a catalyst for organisational change, not a software upgrade. McKinsey found that only 21% of organisations have fundamentally redesigned workflows for AI, and these are disproportionately the ones seeing returns.
How to think about deployment
The jagged frontier framework should guide every deployment decision. Rather than asking "Can AI do this?", ask "Is this specific task likely inside or outside the frontier, and what happens if we get the answer wrong?"
Where current evidence supports deployment:
- First-draft content creation
- Code generation and debugging
- Document summarisation and extraction
- Data analysis and pattern recognition
- Translation across major languages
- Research assistance and literature review
- Internal knowledge search and retrieval

Where caution is warranted or human oversight is essential:
- Any task requiring guaranteed factual accuracy
- Legal filings, citations, and regulatory submissions
- Medical diagnosis without clinician review
- Financial decisions based solely on AI output
- Customer-facing communications in sensitive contexts
- Anything where "confidently wrong" carries serious consequences
The boundary between these categories is not fixed and shifts as models improve. A task that fell outside GPT-4's frontier may sit comfortably inside newer models' capabilities, or it may not. Building the organisational competence to test, evaluate, and recalibrate is more valuable than any specific model selection.
Human oversight is necessary but not sufficient
Most enterprises now include human review in their AI workflows. This is necessary but faces a scaling challenge: as AI generates output faster, human review becomes a bottleneck, and the quality of that review degrades as volume increases.
The deeper issue is calibration of trust. The Harvard study showed that the worst-performing consultants were those who "blindly adopted AI output and interrogated it less." The risk is not that people will refuse to use AI. It is that they will use it uncritically, over-relying on it where it is weakest and under-relying on their own expertise where it matters most.
Organisations need to invest not just in AI tools but in training people to evaluate AI output, identify the kinds of errors their specific model tends to make, and know when to trust it and when to apply professional judgement instead.
What to take from all of this
LLMs are not intelligent in any human sense, but they are not trivial text generators either. They are powerful statistical engines that compress trillions of words of human knowledge into mathematical representations rich enough to produce working code, pass professional examinations, and assist with complex analysis, while also fabricating legal cases, failing at letter-counting, and generating confidently wrong answers that degrade human performance when trusted without verification.
Three insights from the research stand out above the rest.
The jagged frontier is the essential mental model. AI capabilities are uneven and unpredictable. The ability to distinguish where AI helps from where it harms requires domain expertise that AI itself cannot provide. Every deployment decision should start by mapping where the frontier lies for your specific context, and testing that map empirically.
Hallucination is a mathematical certainty, not a solvable bug. Any deployment strategy that assumes AI output is reliable without verification will eventually fail. Reasoning models (which are better at complex tasks) actually hallucinate more in certain contexts, not less. Build verification into workflows from the start, not as an afterthought.
The field is shifting from scaling model size to scaling training quality and inference strategy. Capability improvements will continue even without bigger models, but the nature of improvement is changing. Progress is moving from "generally better at everything" toward "specifically better at reasoning-intensive tasks, with specific tradeoffs in cost and speed."
The technology is powerful. The gap between that power and the hype surrounding it is where most of the value, and most of the risk, currently sits. Understanding the real science, honestly, is the first step toward navigating that gap well.
Further reading
On this site:
*Links to related AI Primer Foundations Library articles will be added as they are published.*
Selected external sources:
The research referenced throughout this article can be explored further through these starting points:
- Stanford HAI, AI Index Report 2025 — the most comprehensive annual survey of AI progress and its societal implications.
- Dell'Acqua, F. et al., "Navigating the Jagged Technological Frontier" (2023) — the Harvard Business School / BCG study on AI and knowledge worker productivity.
- Anthropic, "Mapping the Mind of a Large Language Model" (2024) and "On the Biology of a Large Language Model" (2025) — accessible summaries of mechanistic interpretability research.
- Noy, S. & Zhang, W., "Experimental Evidence on the Productivity Effects of Generative AI," Science (2023) — the widely cited productivity study.
- Sebastian Raschka, "The State of LLMs 2025" — a thorough technical review of the current model landscape.
- Simon Willison, "2025: The Year in LLMs" — an excellent practitioner-level review of the year's developments.
Key takeaways
- LLMs are prediction engines, not knowledge databases. They generate text by predicting the most probable next word, drawing on patterns compressed from trillions of words of training data. This mechanism produces impressively capable output, but it also means they fill gaps with plausible-sounding fabrications, delivered with the same confidence as accurate information.
- Hallucination is mathematically inevitable, not a bug awaiting a fix. Research has formally proven that all computable LLMs must hallucinate. Reasoning models (which are better at complex tasks) actually hallucinate more in certain contexts. Any deployment strategy that treats AI output as reliable without verification will eventually fail.
- AI capabilities follow a jagged frontier, not a smooth curve. The Harvard/BCG study with 758 consultants showed AI sharply improves performance on some tasks while actively degrading it on others, and the boundary is unpredictable. Professionals who trusted AI output uncritically performed worst. Those who maintained domain judgement about when to trust it and when to override it performed best.
- The field is shifting from bigger models to smarter training and inference. The performance gap between leading models has narrowed to just 5.4%. Progress now comes from better data curation, refined training techniques, and inference-time reasoning, not from building ever-larger models. For organisations, this means the choice of model matters less than how you integrate it into workflows.
- Adoption is wide but shallow, and that's the real problem. 88% of enterprises use AI, but over 80% report no meaningful profit impact. The organisations seeing returns are the 21% that redesigned workflows around AI rather than bolting it onto existing processes. The gap between deploying AI and generating value from it is where most organisations are stuck.
Want to go deeper? See the related professional guide →
Stay informed
The AI Primer Briefing is a weekly digest of what matters in AI — curated for professionals, free of breathless hype.