The Role of Noise in Information-Theoretic Models of Sentence Comprehension and Production

Principal Investigator Edward Gibson

Project Website http://www.nsf.gov/awardsearch/showAward?AWD_ID=1534318&HistoricalAwards=false

Project Start Date September 2015

Project End Date February 2019

Human language as it is produced and understood is full of errors: people make speech errors, they make typographical errors when typing or texting, and background noise often prevents words from being perceived accurately. Given the noisy nature of human language in practice, it is surprising that people can understand one another so well. How people communicate successfully despite noise remains an open question, and it is the focus of our work. Explaining how humans comprehend noisy language is important for two reasons. First, language technologies must be capable of processing noisy language input: translation services need to account for errors in the text being translated, and search engines need to process noisily generated web content. Evidence concerning how humans understand language in noise can lead to improvements in the design of language technologies. In addition, until dialogue systems can produce coherent language responses, which is likely decades away, any practical application of such systems must be designed with an understanding of how humans deal with noisy or confusing language input. Second, on the clinical side, knowing how humans interpret language that might contain errors will provide insights into language comprehension disorders. Recent research has shown that individuals with aphasia appear to assume the presence of more errors in the input than healthy participants do, and thus rely more strongly on their prior beliefs about the world when interpreting language. Applications of this work may lead to more efficient diagnosis and treatment options for such patients.

The goals of the proposed research are twofold. First, the researchers will investigate noise in the process of language comprehension, where noise falls into three categories: (a) deletions, such that the listener or reader might miss something that was intended; (b) insertions, such that the producer might accidentally insert something; and (c) swaps, such that the producer might accidentally switch elements in the stream. Second, the researchers will investigate an information-theoretic approach to memory in sentence production, where memory is a source of potential errors in language use. Recent human vision research suggests that memory capacity is best modeled as a limit on the complexity of the stored representations, measured in information-theoretic units called "bits". Simple representations require very few bits of information, but complex representations require many. The proposed research extends this idea to language, such that high-frequency words and phrases such as "the boy sees the girl" should be stored easily in memory, while less frequent constructions such as "the woman who the man met was tall" should be difficult to store in memory.
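As a rough illustration of the two ideas above, the sketch below applies each of the three noise categories to a list of words and computes the number of bits needed to encode an item of a given probability. It is a minimal sketch and not the project's model: the function names are hypothetical, and the probabilities are made-up values chosen only to show that less probable material costs more bits.

```python
import math


def delete_word(words, i):
    """Deletion: the word at position i is lost before it reaches the listener."""
    return words[:i] + words[i + 1:]


def insert_word(words, i, extra):
    """Insertion: an unintended word appears at position i."""
    return words[:i] + [extra] + words[i:]


def swap_words(words, i):
    """Swap: the words at positions i and i + 1 trade places."""
    out = list(words)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def bits(probability):
    """Information content of an item with the given probability, in bits."""
    return -math.log2(probability)


intended = "the boy sees the girl".split()

print(delete_word(intended, 2))         # 'sees' is dropped
print(insert_word(intended, 1, "old"))  # 'old' is inserted before 'boy'
print(swap_words(intended, 1))          # 'boy' and 'sees' are exchanged

# Hypothetical probabilities, for illustration only (not measured values).
p_frequent = 1 / 8
p_infrequent = 1 / 1024
print(bits(p_frequent), bits(p_infrequent))
```

Under these assumed probabilities the frequent item needs 3 bits and the infrequent one 10, which is the sense in which a rarer construction would place a heavier load on a memory whose capacity is limited in bits.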