Understanding speech is extremely challenging for both humans and machines, as the sounds can be rapid and ambiguous. “What did you say?” is a sentence everyone has undoubtedly had to utter when their interlocutor spoke too quickly or without articulating clearly. While humans can use context clues to aid comprehension, computers still largely lack this ability.
A team of neuroscientists working at the University of Geneva within the Swiss National Centre of Competence in Research (NCCR) “Evolving Language”, together with a collaborator at the University of Cambridge, developed a computer model based on the human brain that can accurately and parsimoniously infer the meaning of ambiguous sentences. The model could help both in neuroscience, by offering major insights into how the human brain understands speech, and in artificial intelligence, by further improving language models such as ChatGPT.
Human language is complex. We don’t only write, we also speak. We don’t only read, we also hear. Today, the best-performing language models – such as ChatGPT – are trained on written text and fail to take into account the orality of language. As a result, they perform poorly when tasked with interpreting spoken language.
In a study just published in PLOS Biology, researchers at the University of Geneva have successfully designed a computer model based on human brain mechanisms, a major step toward improving language models such as ChatGPT and our understanding of the brain itself.
“Speech comprehension is particularly challenging because the acoustic signal is fleeting, and the system needs to process all the information quickly. Nowadays, most speech-processing models either translate speech to text without understanding its meaning, or recognize meaning only after a long analysis of the acoustic signal, precluding real-time comprehension,” explains Yaqing Su, a postdoctoral researcher in the team of Prof. Anne-Lise Giraud at the University of Geneva and first author of the study.
Indeed, until now, models have largely focused on next-word (horizontal) prediction as the central mechanism of human and machine language processing. “Our approach, in contrast, uses predictions from high to low (hierarchical) processing levels, for example from meaning to words and from words to syllables, which is more likely what our brain uses,” Su adds.
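To make the contrast concrete, here is a minimal Python sketch of the two prediction styles. The vocabularies, probability tables, and function names below are invented for illustration only and are not taken from the published model.

```python
# Hypothetical toy example: horizontal vs. hierarchical prediction.
# All mappings below are invented for illustration.

# Horizontal (next-word) prediction, GPT-style: the next word is guessed
# only from the preceding words.
BIGRAMS = {("more", "ace"): "wins", ("ace", "wins"): "the"}

def next_word(history):
    return BIGRAMS.get(tuple(history[-2:]), "<unknown>")

# Hierarchical (top-down) prediction: a meaning-level hypothesis predicts
# words, and each word in turn predicts the syllables expected in the signal.
MEANING_TO_WORDS = {"tennis point scored": ["ace", "wins"]}
WORD_TO_SYLLABLES = {"ace": ["eys"], "wins": ["winz"]}

def predicted_syllables(meaning):
    return [s for word in MEANING_TO_WORDS[meaning]
            for s in WORD_TO_SYLLABLES[word]]

print(next_word(["one", "more", "ace"]))           # 'wins'
print(predicted_syllables("tennis point scored"))  # ['eys', 'winz']
```

The difference is the direction of inference: the horizontal model only chains words forward, while the hierarchical one lets a hypothesis about meaning cascade down to concrete sound patterns that can be checked against the incoming speech.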
The neuroscientists’ primary goal is to identify plausible computational principles that enable the brain to comprehend speech in real time. These principles are important for filling knowledge gaps in basic neuroscience, as well as for identifying possible mechanisms behind various speech-related deficits. “The spectacular performance of large language models gives the impression that they somehow understand. However, they do not understand the way humans do. We understand by relating what we hear or read to mental images, something that current AI language models lack. This suggests that humans and large language models use different computational principles,” adds Itsaso Olasagasti, one of the authors of the study.
A Model That Analyses Context on the Fly Like Humans
If you hear that “one more ace wins the game”, you’ll probably have trouble understanding what is meant, as the word [ace] can have different meanings depending on what type of game is involved. You may resort to the most widespread meaning of the word as a default: the ace in a deck of cards. However, if it is later specified that it is a [tennis] game, you’ll be able to adapt to the alternative meaning of the sentence retrospectively and very rapidly.
“Though recent large artificial language models such as the GPT family have achieved stunning performance in generating human-like language content, making people believe that they have acquired human-like intelligence, they actually struggle with context deduction because they do not work the same way as the human brain,” Su explains. “Current language-processing models based on next-word prediction, such as GPT-2, cannot deduce accurate meaning without additional components.”
To remedy this lack of accurate yet fast speech-processing models, the authors developed a computer model able to extract multilevel information from ongoing, continuous speech. Based on its knowledge, the model first predicts a number of plausible contexts and semantic roles (meaning) for the perceived speech. In our example, it assigns the word [ace] the possible contexts [tennis game] or [poker game]. It then converts these possibilities (or predictions) into simple linguistic forms and translates them into syllables and sound patterns that could be recognized in the speech. If it encounters a sound that matches the sound pattern predicted from one of the possible contexts, the model can quickly deduce the meaning of the ambiguous word [ace] according to that context.
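A minimal Python sketch of this disambiguation loop follows, assuming a toy vocabulary: the contexts, word probabilities, and syllable codes are hypothetical stand-ins, not the parameters of the published model. Each incoming syllable reweights the competing contexts according to how well they predicted it.

```python
# Toy disambiguation loop: top-down predictions from contexts down to
# syllables, updated on the fly as each syllable is heard.
# All vocabularies and probabilities are hypothetical.

CONTEXT_WORDS = {  # context -> words it predicts, with probabilities
    "poker game":  {"ace": 0.4, "chip": 0.3, "card": 0.3},
    "tennis game": {"ace": 0.4, "serve": 0.3, "tennis": 0.3},
}
WORD_SYLLABLES = {  # word -> syllables expected in the acoustic signal
    "ace": ["eys"], "chip": ["chip"], "card": ["kard"],
    "serve": ["serv"], "tennis": ["te", "nis"],
}

def update_beliefs(beliefs, heard):
    """Reweight each context by how well it predicted the heard syllable."""
    posterior = {}
    for context, prior in beliefs.items():
        likelihood = sum(
            p for word, p in CONTEXT_WORDS[context].items()
            if heard in WORD_SYLLABLES[word]
        ) or 1e-6  # small floor so no context is ruled out for good
        posterior[context] = prior * likelihood
    total = sum(posterior.values())
    return {c: p / total for c, p in posterior.items()}

# Prior: the "default" poker reading of [ace] is initially favored.
beliefs = {"poker game": 0.6, "tennis game": 0.4}
for syllable in ["eys", "te", "nis"]:  # hearing "ace ... tennis"
    beliefs = update_beliefs(beliefs, syllable)
    print(syllable, beliefs)
# After "te"/"nis", the posterior flips to "tennis game", and the
# meaning of the earlier [ace] is reinterpreted accordingly.
```

The key design choice in this sketch is that the update runs once per syllable, so the interpretation can flip as soon as disambiguating evidence arrives, mirroring the rapid retrospective reinterpretation of [ace] described above.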
Within the Human Brain
This model could be a good representation of what happens in the brain when humans try to understand a sentence. Indeed, in addition to linguistic (lexical and grammatical) knowledge, it uses non-linguistic (semantic and contextual) knowledge, which is crucial in human speech comprehension, for example for disambiguating different meanings of the same word. The authors propose that such world knowledge could be equally important in large language models. This characteristic sets the model apart from other language processors such as GPT-2 and makes it more human-like. “We hope to provide the general public with a more holistic view of the power and the limitations of current large artificial language models like ChatGPT, which is a big upgrade from the GPT-2 we used here but similar in its core mechanism,” explains Anne-Lise Giraud, Professor at the University of Geneva and Director of the Institut de l’Audition, centre de l’Institut Pasteur in Paris.
Thanks to magnetoencephalography recordings – measurements of real-time brain activity – obtained by Lucy MacGregor at Cambridge, one of the authors, the researchers were able to see that their model, based on hierarchical predictions, could match the brain signals associated with word-meaning ambiguity and disambiguation, whereas the horizontal predictions from GPT-2 could not.
The model could also be useful for uncovering the mechanisms underlying conditions in which patients struggle to understand meaning, such as autism or hallucinations. These conditions involve an abnormal (respectively too high and too low) focus on the sound signals, leading to an incorrect interpretation of the context.
“There is still a lot unknown about our brains, and scientists are striving to unravel the mystery carefully and responsibly,” Su says. “We think that this work can provide neuroscientists and language scientists with an interesting framework towards a unified theory of human language processing. Overall, our findings have implications not only for how to better investigate the human brain, but also for how to build better (not necessarily larger) artificial language models.”
Reference
Su Y, MacGregor LJ, Olasagasti I, Giraud AL (2023). A deep hierarchy of predictions enables online meaning extraction in a computational model of human speech comprehension. PLOS Biology 21(3): e3002046. https://doi.org/10.1371/journal.pbio.3002046