Have we stopped to think about what LLMs actually model? • The Register
In May, Sam Altman, CEO of $80-billion-or-so OpenAI, seemed unconcerned about how much it would cost to achieve the company’s stated goal. “Whether we burn $500 million a year or $5 billion – or $50 billion a year – I don’t care,” he told students at Stanford University. “As long as we can figure out a way to pay the bills, we’re making artificial general intelligence. It’s going to be expensive.”
Statements like this have become commonplace among tech leaders who are scrambling to maximize their investments in large language models (LLMs). Microsoft has put $10 billion into OpenAI, Google and Meta have their own models, and enterprise vendors are baking LLMs into products on a large scale. However, as industry bellwether Gartner identifies GenAI as nearing the peak of the hype cycle, it’s time to examine what LLMs actually model – and what they do not.
“Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency” is a recent peer-reviewed paper that examines how LLMs work and how they compare with a scientific understanding of human language.
Amid “hyperbolic claims” that LLMs are capable of “understanding language” and are approaching artificial general intelligence (AGI), the GenAI industry – forecast to be worth $1.3 trillion over the next ten years – is often prone to misusing terms that are naturally applied to human beings, according to the paper by Abeba Birhane, an assistant professor at University College Dublin’s School of Computer Science, and Marek McGann, a lecturer in psychology at Mary Immaculate College, Limerick, Ireland. The danger is that these terms become recalibrated and the use of words like “language” and “understanding” shifts towards interactions with and between machines.
“Mistaking the impressive engineering achievements of LLMs for the mastering of human language, language understanding, and linguistic acts has dire implications for various forms of social participation, human agency, justice and policies surrounding them,” argues the paper published in the peer-reviewed journal Language Sciences.
The risks are far from imagined. The AI industry and its associated bedfellows have spent the last few years cozying up to political leaders. Last year, US vice president and Democratic presidential candidate Kamala Harris met CEOs of four American companies at the “forefront of AI innovation” including Altman and Satya Nadella, Microsoft CEO. At the same time, former UK prime minister Rishi Sunak hosted an AI Safety Summit, which included the Conservative leader’s fawning interview with Elon Musk, a tech CEO who has predicted that AI would be smarter than humans by 2026.
Speaking to The Register, Birhane said: “Big corporations like Meta and Google tend to exaggerate and make misleading claims that do not stand up to scrutiny. Obviously, as a cognitive scientist who has the expertise and understanding of human language, it’s disheartening to see a lot of these claims made without proper evidence to back them up. But they also have downstream impacts in various domains. If you start treating these massive complex engineering systems as language understanding machines, it has implications in how policymakers and regulators think about them.”
LLMs build a model capable of responding to natural language by absorbing a large corpus of training data, often from the World Wide Web. Leaving aside legal issues around how much of that data is copyrighted, the technique involves atomizing written language into tokens, and then using powerful statistical techniques – and a lot of computing power – to predict the relationship between those tokens in response to a question, for example. But there are a couple of implicit assumptions in this approach.
“The first is what we call the assumption of language completeness – that there exists a ‘thing’ called a ‘language’ that is complete, stable, quantifiable, and available for extraction from traces in the environment,” the paper says. “The engineering problem then becomes how that ‘thing’ can be reproduced artificially. The second assumption is the assumption of data completeness – that all of the essential characteristics can be represented in the datasets that are used to initialize and ‘train’ the model in question. In other words, all of the essential characteristics of language use are assumed to be present within the relationships between tokens, which presumably would allow LLMs to effectively and comprehensively reproduce the ‘thing’ that is being modeled.”
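To make the token-based technique concrete, here is a minimal Python sketch. It is a toy illustration only, not how any production LLM works: real systems use subword tokenizers and neural networks trained on vast corpora, whereas this example splits on whitespace and simply counts which token follows which. The function names and the three-sentence corpus are hypothetical, chosen for illustration. What it does show is that such a model has nothing to draw on except the relationships between tokens in its training text.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (an assumption for brevity;
    real LLMs use subword tokenizers)."""
    return text.lower().split()


def train_bigram(corpus):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = tokenize(sentence)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if the
    token never appeared (with a follower) in the training text."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus: the model "knows" only what is in these strings.
corpus = [
    "the river flows to the sea",
    "the river is wide",
    "a sea is deep",
]
model = train_bigram(corpus)

print(predict_next(model, "the"))      # river
print(predict_next(model, "gesture"))  # None: never seen in the corpus
```

Anything absent from the text, such as tone of voice or setting, is invisible to the counts, however large the corpus grows.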
The problem is that one of the more modern branches of cognitive science sees language as a behavior rather than a big pile of text. In other words, language is something we do, and have done for hundreds of thousands of years.
The approach taken by Birhane and her colleagues is to understand human thought in terms that are “embodied” and “enacted.”
“The idea is that cognition doesn’t end at the brain and the person doesn’t end at the skin. Rather, cognition is extended. Personhood is messy, ambiguous, intertwined with the existence of others, and so on,” she said.
Tone of voice, gesture, eye contact, emotional context, facial expressions, touch, location, and setting are among the factors that influence what is said or written.
Language behavior “cannot, in its entirety, be captured in representations appropriate for automation and computational processing. Written language constitutes only part of human linguistic activity,” the paper says.
In other words, the stronger claims of AI builders fall down on the assumption that language itself is ever complete. The researchers argue the second assumption – that language is captured by a corpus of text – is also false for the same reason.
It’s true that both humans and LLMs learn from examples of text, but a look at how humans use language in their lives shows that a great deal is missing. As well as being embodied, human language is something in which people participate.
“Training data therefore is not only necessarily incomplete but also fails to capture the motivational, participatory, and vitally social aspects that ground meaning making by people,” the paper says.
Human language is also precarious, a concept that may be harder to understand.
“The idea of precarity or precariousness is that human interaction and language is full of ambiguities, tensions, frictions, and those are not necessarily a bad thing,” Birhane said. “They’re really at the heart of what being human means. We actually need frictions to resolve disagreements, to have an in-depth understanding about a phenomenon and confronting wrongs, for example.”
“LLMs do not participate in social interaction, and having no basis for shared experience, they also have nothing at stake,” the paper says. “There is no set of processes of self-production that are at risk, and which their behavior continually stabilizes, or at least moves them away from instability and dissolution. A model does not experience a sense of satisfaction, pleasure, guilt, responsibility, or accountability for what it produces. Instead, LLMs are complex tools, and within any activity their role is that of a tool.”
Human language is an activity in which “various opportunities and risks are perceived, engaged with, and managed.”
“Not so for machines. Nothing is risked by ChatGPT when it is prompted and generates text. It seeks to achieve nothing as tokens are concatenated into grammatically sound output,” the paper says.
The authors argue that whatever LLMs model, it is not human language, which is considered not as a “large and growing heap, but more a flowing river.”
“Once you have removed water from the river, no matter how large a sample you have taken, it is no longer the river,” the paper says.
Birhane has previously challenged the AI industry. With colleagues, she pored over an MIT visual dataset for training AI to discover thousands of images labeled with racist slurs for Black and Asian people, and derogatory terms used to describe women, prompting the US super-college to take the dataset offline.
Whether or not LLMs effectively model human language, their advocates make spectacular claims about their usefulness. McKinsey says 70 percent of companies will deploy some sort of AI tech by 2030, producing a global economic impact of around $13 trillion in the same period, increasing global GDP by about 1.2 percent annually.
But even claims about the usefulness of LLMs purely as tools have been exaggerated.
“There is no clear evidence that shows LLMs are useful because they are extremely unreliable,” Birhane said. “Various scholars have been doing domain specific audits … in legal space … and in medical space. The findings across all these domains is that LLMs are not actually that useful because they give you so much unreliable information.”
Birhane argues that there are risks in releasing these models into the wild that would be unacceptable in other industries.
“When we build bridges, for example, we do rigorous testing before we allow any vehicles or pedestrians to use it,” she said. “Many other industries – pharma, for example – have proper regulations in place and we have established bodies that do the auditing and the evaluation. My biggest concern at the moment is that we’re just building LLMs and releasing them into super important domains such as education and medicine. This has huge impacts, and also massive downstream impacts, say in 20 years, and where we’re not doing proper testing, proper evaluations of these models.”
Not everyone agrees. Although Gartner has declared that GenAI is entering its famous “trough of disillusionment,” it has little doubt about the importance of its long-term impact.
Research showing LLM builders have a flawed understanding of what they are modeling is an opportunity to promote a more cautious, skeptical approach. ®