Counting words to make words count: Statistics optimize communication: 2/01

2/16/01

Dawn Levy, News Service (650) 725-1944; e-mail: dawnlevy@stanford.edu

Pity students of a second language, often confounded by the weird expressions and arbitrary grammar rules of a new tongue. They find themselves caught in a paradox: Even if they master those sayings and rules, their faultless grammar can keep them from blending in linguistically with native speakers, because people simply don't follow all the rules. In fact, researchers are finding that the most effective communicators are those who frequently bend the rules.

"People have communicative goals," says Christopher Manning, an assistant professor of computer science and linguistics at Stanford. Manning participated in a Feb. 16 symposium on the use of statistics in language analysis at the annual meeting of the American Association for the Advancement of Science (AAAS). "They want to express ideas. And essentially those communicative goals are causing them to always bend or stretch the rules or use language in different ways."

In his research, which aims to study language scientifically using statistics, Manning collects voluminous text from the World Wide Web to learn how often people use the English language in certain patterns. Genres include the recent and the dated (newspapers and classic literature), the stilted and the hip (legal documents and chat groups). Manning tries to get as representative a sample of English usage as possible: "The legal stuff is a particularly extreme, convoluted genre, but it's important to remember that the New York Times, while closer to what we think of as normal English, is a genre as well. It's not how people talk on the playground."

There's a lot of flexibility in everyday talk. And finding the most common ways people string words together can improve language instruction, speech recognition and computer search engines, Manning says.

Communication is a two-way street, and people tend to structure their phrases along well-traveled linguistic routes to optimize their chances of being understood. They say "bacon and eggs," for instance, more often than "eggs and bacon." And while it is perfectly grammatical to put a large 'that' clause at the beginning of a sentence -- say, "That we will need to reduce inventory and slash prices next year is almost certain." -- people hardly ever speak like that. They place long clauses at the ends of sentences: "It's almost certain that we will need to reduce inventory and slash prices next year."
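
The kind of counting described here can be illustrated with a minimal Python sketch. The function name `count_phrases` and the three-sentence toy corpus are invented for illustration; they are not Manning's tools or data, and a real study would run over a far larger and more varied body of text.

```python
import re
from collections import Counter


def count_phrases(text, phrases):
    """Count case-insensitive, whole-word occurrences of each phrase in text."""
    # Lowercase the text and drop punctuation so phrases match across commas, etc.
    normalized = " ".join(re.findall(r"[a-z']+", text.lower()))
    counts = Counter()
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        counts[phrase] = len(re.findall(pattern, normalized))
    return counts


# Toy corpus made up for this example; not data from the research.
toy_corpus = (
    "We had bacon and eggs for breakfast. "
    "She ordered bacon and eggs again. "
    "He asked for eggs and bacon instead."
)

counts = count_phrases(toy_corpus, ["bacon and eggs", "eggs and bacon"])
total = sum(counts.values())
for phrase, n in counts.items():
    print(f"{phrase!r}: {n} of {total} ({n / total:.0%})")
```

Relative frequencies like these, gathered across many genres, are what would let a system prefer the well-traveled ordering over the merely grammatical one.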

In traditional linguistics, a word string is considered ill-crafted if it violates a grammar rule. But in Manning's world, a word string is not undesirable merely because it irks the grammar police: It is undesirable if it is not the optimal way of communicating.

"When humans communicate, they're assuming their interlocutor knows most of the same stuff that they know," Manning says. "They know how the world works. They know what's been happening lately." So speakers need only provide small hints of information to listeners, who will presumably piece together the meaning. "Keeping communications short and ambiguous promotes the efficiency of communication," he says.

But communication is rife with uncertainty, and probabilistic methods allow computers, such as those used in voice recognition, to cope with that uncertainty through higher-level processing like that performed by the human brain.

In a noisy bar, a listener misses many of a speaker's words, but conversation can continue thanks to contextual clues, such as information from prior conversation.

"The audio signal that is coming in your ear may be insufficient to work out what words are being said," Manning says. "But you combine it with contextual evidence to work out what the person probably said, even though you missed some words. And you do that without even thinking about it -- it's subconscious reconstruction."

In studies, researchers have excised words from an audio recording, melded the recording with white noise and played it to subjects in a noisy setting. People's brains are so good at reconstructing sentences that subjects claim to hear intact sentences, when in fact they have mentally restored the removed words, which are obvious from context.
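
One common way to formalize this combination of sound and context is Bayes' rule: the probability of a word given both sources is proportional to how well the word fits the audio times how likely the word is in context. The sketch below assumes that framing; the function name `combine_evidence`, the two candidate words and all the numbers are hypothetical, chosen only to show the arithmetic.

```python
def combine_evidence(acoustic, context):
    """Combine two evidence sources into a normalized posterior.

    posterior(word) is proportional to acoustic[word] * context[word].
    """
    scores = {w: acoustic[w] * context.get(w, 0.0) for w in acoustic}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no candidate word is supported by both sources")
    return {w: s / total for w, s in scores.items()}


# Hypothetical numbers: in a noisy bar the audio cannot separate the two
# candidates, but the conversation so far makes one far more plausible.
acoustic_likelihood = {"beer": 0.5, "bear": 0.5}
context_prior = {"beer": 0.9, "bear": 0.1}

posterior = combine_evidence(acoustic_likelihood, context_prior)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 2))
```

When the audio is uninformative, as here, the contextual evidence alone decides the outcome, which mirrors how a listener fills in words that were never clearly heard.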

Probabilistic analysis has an important role in reducing uncertainty at the juncture where binary computers and ambiguous humans meet.

During an automated stock trade over the telephone, the voice-recognition software of a computer broker might be 90 percent sure a customer said, "Buy 200 shares of Yahoo! at $50." Or the broker could decide that there's a 40 percent probability the customer meant that, but also a 40 percent chance she meant sell 200 shares. Computers tend to go with the highest probability and break ties randomly -- options that could upset customers.

In cases of ties or low probabilities, however, computers can engage in further conversation to lower uncertainty: "I'm not sure whether you're wanting to buy or sell those 200 shares of Yahoo! Please clarify."

Manning says: "By repeating some of the stuff at the end, that also lets you do a confirmation to check that the rest of what you thought you understood really is correct. Because if it's not, the user not only gets to repeat 'buy' or 'sell,' but can also say, 'Wait, wait! Stop! I wasn't talking about Yahoo! at all!'"
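
The policy described above, acting when one interpretation clearly wins and asking when it does not, can be sketched in a few lines of Python. The function name `decide` and the two thresholds (`min_confidence`, `min_margin`) are assumptions made for illustration; the article does not say how a real system would set them. The probabilities reuse the 90 percent and 40/40 cases from the example.

```python
def decide(hypotheses, min_confidence=0.8, min_margin=0.2):
    """Return ("execute", action) or ("clarify", [top candidates]).

    hypotheses maps each interpretation to its estimated probability.
    The thresholds are illustrative assumptions, not published values.
    """
    ranked = sorted(hypotheses.items(), key=lambda kv: kv[1], reverse=True)
    best, p_best = ranked[0]
    p_second = ranked[1][1] if len(ranked) > 1 else 0.0
    if p_best >= min_confidence and p_best - p_second >= min_margin:
        return ("execute", best)
    # Too uncertain or too close to call: ask instead of guessing.
    return ("clarify", [name for name, _ in ranked[:2]])


confident = decide({"buy": 0.9, "sell": 0.1})
tied = decide({"buy": 0.4, "sell": 0.4, "other": 0.2})

print(confident)
print(tied)
if tied[0] == "clarify":
    first, second = tied[1]
    print(f"I'm not sure whether you want to {first} or {second}. Please clarify.")
```

Asking costs the customer a few seconds; guessing wrong on a tie could cost a trade in the wrong direction, which is why the sketch never breaks ties at random.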

For now, computational linguists focus on practical, limited language-understanding goals. One, Manning says, is developing information extraction techniques to help people evaluate a glut of, say, real estate data: "If I show you an advertisement for a piece of real estate, can you extract from the text how many bedrooms the place has, how many bathrooms, what suburb it's in and what the selling price is? That's less glorious than building HAL (the sentient computer in the movie 2001: A Space Odyssey). But that's the kind of thing that's really useful."
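
A very simple version of the extraction task Manning describes can be sketched with regular expressions. This is a hand-written, rule-based illustration, not the statistical techniques his group develops; the function name `extract_listing`, the patterns and the sample advertisement are all invented for the example, and real listings would defeat rules this naive.

```python
import re

# (field, pattern, regex flags, converter). The suburb rule is case-sensitive
# because it relies on capitalized words following "in".
RULES = [
    ("bedrooms", r"(\d+)\s*(?:bedrooms?|beds?|br)\b", re.IGNORECASE, int),
    ("bathrooms", r"(\d+(?:\.\d+)?)\s*(?:bathrooms?|baths?|ba)\b", re.IGNORECASE, float),
    ("price", r"\$\s*([\d,]+)", 0, lambda s: int(s.replace(",", ""))),
    ("suburb", r"\bin\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)", 0, str),
]


def extract_listing(ad):
    """Pull bedrooms, bathrooms, price and suburb from ad text; None if absent."""
    result = {}
    for field, pattern, flags, convert in RULES:
        match = re.search(pattern, ad, flags)
        result[field] = convert(match.group(1)) if match else None
    return result


# Sample advertisement made up for this example.
sample_ad = "Sunny 3 bedroom, 2 bath home in Palo Alto. Asking $450,000."
listing = extract_listing(sample_ad)
print(listing)
```

Fields the rules cannot find come back as `None` rather than a guess, so a downstream user can see what the extractor missed.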

-30-

By Dawn Levy