Stanford Report, May 10, 2000

Computer science workshop focuses on building better databases


Try this one at home (it may not be advisable to try it in the office): Get on the Internet, pick the search engine of your choice, then type in your own name plus the word "sex." If, say, you placed fourth in a marathon whose entrants were identified by gender, that little engine will make you out to be a porno star despite a lifetime of unwavering circumspection on your part.

Mistaken identity aside, finding what you're looking for on the Web can be grindingly slow going. And what you get may be outdated, incomplete or irrelevant. Misinformation, or the wrong kind of information -- or too much information -- can be worse than no information at all.

Computer scientists are working on ways to make searches easier by, among other things, standardizing and streamlining so-called "relational databases" -- the complex cyberskeletons programmers erect to connect myriad tidbits of information (your name, your phone number, whether you have a dog) to one another. Politely put, databases are still in the process of evolving toward what may someday be a more perfect state of the art.

A somewhat saltier assessment was offered by Stanford computer science and electrical engineering Professor Gio Wiederhold: "Some have maintained that databases are the roadkill on the information highway."

Wiederhold's quip was tossed out during a workshop on database management held last month on the Stanford campus. The workshop was sponsored by the Stanford Computer Forum, an alliance of Stanford's Computer Science Department and Computer Systems Laboratory and a large group of industrial affiliates.

Flawed or not, databases are important and becoming more so, even to those of us who wouldn't know one from a military base. We use them every time we go comparison shopping on the Web, for example. (They use us, too -- or, rather, the marketing people who monitor our buying behavior do.) And that's happening more and more.

E-commerce is now in the neighborhood of $10 billion annually and accounted for around one-third of Web activity in 1999, up from a mere 8 percent in 1998, noted Wiederhold. In 1998, 40 percent of the population of the United States used the Internet; in 1999, it was 52 percent. If these rates of increase continue, Wiederhold jestingly predicted, "by 2004 more than 100 percent of all commerce in toys will be on the Web."

A series of faculty and graduate students took turns describing efforts to make searching the Web faster and more reliable. One stumbling block is that information at diverse websites is stored in different databases written in different formats. This makes it harder to run a search for the cheapest supplier of a book or compact disc. While standardization of database formats would be desirable, don't count on that happening soon. Many companies don't want standards, but rather want to showcase their uniqueness (and, possibly, avoid price comparisons).

Workshop participants agreed that standardized supplier databases are more likely to become a reality when customers are few and large (like the Big Three automakers) than when they are numerous and small. Graduate student Chen Li, using the example of comparison shopping for a CD on the Web, demonstrated progress toward an improved search algorithm that navigates efficiently among diverse search formats to arrive at the best price in a reasonable amount of time.
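To make the format problem concrete, here is a minimal sketch of one common workaround: a small adapter per store that translates its records into a shared shape, followed by a plain price comparison. It is not Chen Li's algorithm, which the article does not describe in detail. The three stores, their record layouts and every name in the code (`Offer`, `from_store_a`, `cheapest` and so on) are hypothetical, invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    """A price quote in one shared shape, whatever the store's own format."""
    store: str
    title: str
    price_usd: float


# Hypothetical adapters: each one knows a single store's private format.
def from_store_a(rec: dict) -> Offer:
    # Store A (assumed): {"name": "...", "price": "13.99"}
    return Offer("store_a", rec["name"], float(rec["price"]))


def from_store_b(rec: dict) -> Offer:
    # Store B (assumed): {"album": "...", "cents": 1299}
    return Offer("store_b", rec["album"], rec["cents"] / 100)


def from_store_c(rec: str) -> Offer:
    # Store C (assumed): "Title|$14.50"
    title, price = rec.split("|")
    return Offer("store_c", title, float(price.lstrip("$")))


def cheapest(title: str, sources) -> Optional[Offer]:
    """Normalize every record, then return the lowest-priced matching offer."""
    offers = [adapter(rec) for records, adapter in sources for rec in records]
    matches = [o for o in offers if o.title.casefold() == title.casefold()]
    return min(matches, key=lambda o: o.price_usd, default=None)


# Made-up sample data: (records, adapter) pairs, one pair per store.
SOURCES = [
    ([{"name": "Kind of Blue", "price": "13.99"}], from_store_a),
    ([{"album": "Kind of Blue", "cents": 1299}], from_store_b),
    (["Kind of Blue|$14.50", "Blue Train|$9.99"], from_store_c),
]

best = cheapest("kind of blue", SOURCES)
```

The comparison itself is a single `min` call; nearly all the effort goes into the adapters, which is the cost that a shared format would remove.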

A second problem frequently encountered in Web searches is that of getting too many "hits," many of them useless. Often this occurs not so much because of faulty databases but because what you're looking for isn't stored in databases at all, but is, rather, a string of plain English text matching the specifications you've punched in.
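A minimal sketch shows why plain string matching produces such hits. It assumes the crudest possible engine, one that returns any page containing every query word; real engines are more elaborate, and the page texts and the name "Pat Doe" are made up.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def naive_match(query: str, page: str) -> bool:
    """True if every query word appears somewhere in the page, in any context."""
    return tokens(query) <= tokens(page)


# Made-up pages for illustration.
pages = {
    "marathon_results": "Marathon results: 4th place, Pat Doe, sex: F, time 3:02:11",
    "gardening": "Gardening tips for spring",
}

hits = [name for name, text in pages.items() if naive_match("Pat Doe sex", text)]
```

The marathon page is returned because the words co-occur, not because the page is about what the searcher had in mind.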

The ideal search engine would be able to recognize human language, analyze it for semantic content and retrieve relevant results while filtering out close-but-no-cigar material. "Most stuff [on the Web] is in human language," said Chris Manning, assistant professor of computer science and linguistics. If you don't have some way of automatically sorting through it, he said, "you're missing out on a lot of good information."

Manning has cobbled together an automated method of searching the Web and sniffing out classified real-estate ads that match user specifications (price, location, number of bedrooms, etc.). There are lots of ways for such a search to go awry. The location of interest may be referenced by a nickname or an abbreviation ("Alex. Hgts." instead of "Alexandria Heights") or misleadingly invoked in a phrase such as "45 minute drive from . . . . " Dollar amounts may refer not to the current price but rather to a former asking price ("was $357,000"). Nonetheless, Manning's retrieval system is able to get around those pitfalls, producing results with better than 90 percent accuracy, he told the audience.
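The three pitfalls can be illustrated with a few hand-written rules. This is only a sketch under stated assumptions, not Manning's system, whose internals the article does not describe. The abbreviation table, the cue phrases and the function names are all hypothetical, and a real system would need far broader coverage.

```python
import re
from typing import Optional

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"alex. hgts.": "alexandria heights"}

PRICE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)")
# Cue words that, right before a price, mark it as a former asking price.
FORMER_PRICE_CUE = re.compile(r"\b(?:was|formerly|reduced from)\s*$", re.IGNORECASE)
# Cue words that, right before a place name, mark it as a reference point only.
DISTANCE_CUE = re.compile(r"\b(?:drive|minutes?|miles?)\s+(?:from|to)\s*$", re.IGNORECASE)


def normalize(ad: str) -> str:
    """Lowercase the ad and expand known abbreviations."""
    text = ad.lower()
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text


def current_price(ad: str) -> Optional[int]:
    """Return the first dollar amount not introduced by a former-price cue."""
    for m in PRICE.finditer(ad):
        if FORMER_PRICE_CUE.search(ad[:m.start()]):
            continue
        return int(m.group(1).replace(",", ""))
    return None


def located_in(ad: str, place: str) -> bool:
    """True if the place is named other than as a 'drive from ...' reference."""
    text = normalize(ad)
    for m in re.finditer(re.escape(place.lower()), text):
        if not DISTANCE_CUE.search(text[:m.start()]):
            return True
    return False


def matches(ad: str, place: str, max_price: int) -> bool:
    price = current_price(ad)
    return located_in(ad, place) and price is not None and price <= max_price


# Made-up ads for illustration.
ad_good = "Alex. Hgts. 3 bd, was $357,000, now $329,000"
ad_far = "Sunny 2 bd, 45 minute drive from Alexandria Heights, $210,000"
```

With these rules, `matches(ad_good, "Alexandria Heights", 340000)` accepts the first ad at its current price, while the second ad is rejected because the location appears only as a point of reference.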

The proliferation of small, roadworthy devices like personal digital assistants, or PDAs -- with their midget displays, walnut-sized brains, minuscule memories, short battery lives and klutzy electronic pens -- necessitates simplified Web searches. Graduate student Orkut Buyukkokten demonstrated a technique for remotely accessing digital libraries that reduced pen movements by 42 percent and increased browsing speed by 45 percent.

Yet another issue vexing database managers, said graduate student Brian Cooper, is protecting data from a wide range of onslaughts, from deterioration of the storage medium itself (demagnetization or breakage of a tape or hard disk) to physical catastrophes (fires, floods). Data also may be lost through mistaken or malicious erasure, theft or negligent failure to preserve it. Then, of course, there is "format obsolescence." (Will your 9-year-old ever play your old LPs on that turntable in your basement?) Over longish time periods, there even can be loss of meaning as the English language evolves.

Confining themselves to the humble objective of "just keeping the bits around," Cooper's group adopted the strategy of routinely making multiple copies of files, without being so obtrusive as to force users to stop every 10 minutes and cool their heels while the archiving was taking place. Errors introduced in copying, or by the demagnetizing effects of stray cosmic rays, could be corrected via automated comparisons of multiple copies to detect discrepancies.

Because the date and time of each copying sweep are encoded in the resulting copies, the correct version of the files resident in a crashed system could be restored by specifying the time of the crash, Cooper said, "so, if there's an earthquake and Stanford falls into the sea, we can restore the file-system data to the way it looked just before everybody died."
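The copy-and-compare strategy described above can be sketched in a few lines. The article says only that copies are compared to detect discrepancies and that each sweep is time-stamped; the majority-vote repair rule, the `Snapshot` structure and the function names below are assumptions made for illustration, not the group's actual design.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    taken_at: float        # time of the copying sweep, e.g. seconds since an epoch
    replicas: List[bytes]  # independent copies of one file made in that sweep


def repair(replicas: List[bytes]) -> bytes:
    """Return the content held by a strict majority of the replicas.

    Assumption: a corrupted copy is outvoted by the intact ones.
    """
    content, votes = Counter(replicas).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority; cannot tell which copy is correct")
    return content


def restore(history: List[Snapshot], crash_time: float) -> Optional[bytes]:
    """Return the repaired file as of the last sweep before crash_time."""
    earlier = [s for s in history if s.taken_at < crash_time]
    if not earlier:
        return None  # nothing was archived before the crash
    latest = max(earlier, key=lambda s: s.taken_at)
    return repair(latest.replicas)


# Made-up history: the second sweep contains one damaged copy.
history = [
    Snapshot(100.0, [b"v1", b"v1", b"v1"]),
    Snapshot(200.0, [b"v2", b"v2", b"vX"]),
    Snapshot(300.0, [b"v3", b"v3", b"v3"]),
]

recovered = restore(history, crash_time=250.0)
```

Three copies per sweep is the smallest number that lets a single bad copy be outvoted; with only two copies, a mismatch can be detected but not resolved.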

One thing is certain: Database-management expertise is in demand. As computer science and electrical engineering Associate Professor Jennifer Widom, who organized the workshop with Stanford Computer Forum Director Hector Garcia-Molina, remarked, half-joking, upon scanning the approximately 90 participants who filled the lecture room in the Packard Building, where the event took place: "We have a hunch our large enrollment this year might have something to do with recruiting."