Natural-language processing has advanced rapidly in recent years. But how much does AI really understand what it reads? Less than we thought.
Until quite recently, computers were hopeless at producing sentences that actually made sense. But the field of natural-language processing (NLP) has made significant strides, and machines can now generate convincing passages at the push of a button.
This progress has been driven by deep-learning techniques, which pick out statistical patterns in word usage and argument structure from vast troves of text. But a new paper from the Allen Institute for Artificial Intelligence calls attention to something still missing: machines don't really understand what they're reading or writing.
This is a fundamental challenge in the grand pursuit of generalizable AI. But beyond academia, it's relevant for consumers, too. Voice assistants and chatbots built on state-of-the-art natural-language models, for instance, have become the interface for many health-care providers, financial institutions, and government agencies. Without a genuine understanding of language, these systems are more prone to failure, slowing access to vital services.
The researchers built on the Winograd Schema Challenge, a test created in 2011 to evaluate the common-sense reasoning of NLP systems. The challenge uses a set of 273 questions involving pairs of sentences that are identical except for one word. That word, known as a trigger, flips the meaning of each sentence's pronoun, as in the example below:
- The trophy doesn’t fit into the brown suitcase because it’s too small.
- The trophy doesn’t fit into the brown suitcase because it’s too large.
To succeed, an NLP system must figure out which of the two options the pronoun refers to. In this case, it would need to choose “suitcase” for the first and “trophy” for the second to solve the problem correctly.
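The task and its scoring can be sketched in a few lines of code. This is a minimal illustration, not the paper's actual evaluation harness; the function and variable names are hypothetical. A system is modeled as any function that, given a sentence and its two candidate antecedents, returns the one it thinks the pronoun refers to; accuracy is the fraction of sentences resolved correctly.

```python
# A minimal sketch of Winograd-style evaluation (all names are
# hypothetical). A "system" is any function that, given a sentence
# and two candidate antecedents, returns its chosen referent.

def evaluate(system, schema_items):
    """Return the fraction of pronouns the system resolves correctly."""
    correct = 0
    for sentence, candidates, answer in schema_items:
        if system(sentence, candidates) == answer:
            correct += 1
    return correct / len(schema_items)

# The trophy/suitcase pair from the text, with the gold answers.
ITEMS = [
    ("The trophy doesn't fit into the brown suitcase because it's too small.",
     ("trophy", "suitcase"), "suitcase"),
    ("The trophy doesn't fit into the brown suitcase because it's too large.",
     ("trophy", "suitcase"), "trophy"),
]

# A naive baseline that always picks the first candidate gets exactly
# one of the two right: 50%, i.e., chance level.
def first_candidate(sentence, candidates):
    return candidates[0]

print(evaluate(first_candidate, ITEMS))  # 0.5
```

Because the two sentences in a pair share their candidates but have opposite answers, any fixed guessing strategy lands at chance, which is what makes the schema format a clean probe of understanding.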
The test was originally designed on the premise that such problems couldn't be answered without a deeper grasp of semantics. State-of-the-art deep-learning models can now reach about 90% accuracy, so it would seem that NLP has come closer to its goal. But in their paper, which will receive the Outstanding Paper Award at next month's AAAI conference, the researchers challenge the effectiveness of the benchmark and, therefore, the level of progress that the field has actually made.
They created a significantly larger data set, dubbed WinoGrande, with 44,000 of the same types of problems. To do so, they designed a crowdsourcing scheme to quickly create and validate new sentence pairs. (One reason the Winograd data set is so small is that it was hand-crafted by experts.) Workers on Amazon Mechanical Turk wrote new sentences with required words selected through a randomization procedure. Each sentence pair was then given to three additional workers and kept only if it met three criteria: at least two workers chose the correct answers, all three deemed the options unambiguous, and the pronoun's references couldn't be resolved through simple word associations.
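The three validation criteria amount to a simple filter, which can be sketched as follows. The data structures here are hypothetical stand-ins, not the paper's actual pipeline: each candidate pair comes with three worker annotations (whether the worker chose the correct answer, and whether they found the options unambiguous), plus a flag for whether a word-association shortcut was detected.

```python
# A sketch of the three validation criteria described above. The
# representation is hypothetical: each annotation is a
# (chose_correct, found_unambiguous) tuple from one of three workers.

def keep_pair(annotations, word_association_detected):
    """Keep a sentence pair only if it passes all three checks."""
    assert len(annotations) == 3
    majority_correct = sum(chose for chose, _ in annotations) >= 2
    all_unambiguous = all(unamb for _, unamb in annotations)
    return (majority_correct
            and all_unambiguous
            and not word_association_detected)

# Kept: two of three workers correct, all found it unambiguous.
print(keep_pair([(True, True), (True, True), (False, True)], False))   # True
# Rejected: one worker judged the options ambiguous.
print(keep_pair([(True, True), (True, False), (True, True)], False))   # False
```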
As a final step, the researchers also ran the data set through an algorithm to remove as many “artifacts” as possible: unintended data patterns or correlations that could help a language model arrive at the right answers for the wrong reasons. This reduced the chance that a model could learn to game the data set.
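To make the idea of an artifact concrete, here is a deliberately simplified stand-in for that filtering step. The paper's actual algorithm trains lightweight classifiers over learned embeddings; this toy version, with hypothetical names throughout, just discards any instance containing a surface feature whose co-occurrence with one answer exceeds a threshold, which is exactly the kind of unintended correlation a model could exploit.

```python
from collections import defaultdict

# A much-simplified, illustrative stand-in for artifact filtering
# (not the paper's algorithm). An instance is dropped if it contains
# a surface feature that predicts one answer too reliably across the
# data set.

def filter_artifacts(instances, threshold=0.75):
    """instances: list of (feature_set, answer) pairs. Drop instances
    containing a feature whose majority answer covers more than
    `threshold` of that feature's occurrences (min. 2 occurrences)."""
    # Count how often each feature co-occurs with each answer.
    counts = defaultdict(lambda: defaultdict(int))
    for features, answer in instances:
        for f in features:
            counts[f][answer] += 1
    # A feature is an artifact if one answer dominates its occurrences.
    artifacts = {
        f for f, by_answer in counts.items()
        if sum(by_answer.values()) >= 2
        and max(by_answer.values()) / sum(by_answer.values()) > threshold
    }
    return [(fs, a) for fs, a in instances if not artifacts & set(fs)]

# Toy data: the feature "big" always co-occurs with answer "A", so the
# three instances containing it are filtered out; two instances remain.
data = [({"big", "red"}, "A"), ({"big", "old"}, "A"),
        ({"big", "new"}, "A"), ({"small"}, "B"), ({"tiny"}, "A")]
print(len(filter_artifacts(data)))  # 2
```

Removing such giveaway features forces a model to rely on the sentence's actual meaning rather than a shortcut statistic.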
When they tested state-of-the-art models on these new problems, performance fell to between 59.4% and 79.1%. By contrast, humans still achieved 94% accuracy. This means a high score on the original Winograd test is likely inflated. “It’s just a data-set-specific success, not a general-task success,” says Yejin Choi, an assistant professor at the University of Washington and a senior research manager at AI2, who led the research.
Choi hopes the data set will serve as a new benchmark. But she also hopes it will inspire more researchers to look beyond deep learning. To her, the results underscore that truly common-sense NLP systems must incorporate other techniques, such as structured knowledge models. Her previous work has shown considerable promise in this direction. “We somehow need to find a different game plan,” she says.
The paper has received some criticism. Ernest Davis, one of the researchers who worked on the original Winograd challenge, says that many of the sample sentence pairs listed in the paper are “truly flawed,” with confusing grammar. “They don’t match the way that people speaking English actually use pronouns,” he wrote in an email.
But Choi points out that truly robust models shouldn’t need perfect grammar to understand a sentence. People who speak English as a second language sometimes jumble their grammar but still convey their meaning.
“Humans can easily understand what our questions are about and pick the correct answer,” she says, pointing to the 94% human accuracy. “If humans can do that, my take is that machines should be able to do that too.”