
Online edition of April 20, 2000
 

Computer science workshop
focuses on building better databases
BY BRUCE GOLDMAN
Try this one at home (it
may not be advisable to try it in the office): Get on the
Internet, pick the search engine of your choice, then
type in your own name plus the word "sex." If,
say, you placed fourth in a marathon whose entrants were
identified by gender, that little engine will make you
out to be a porno star despite a lifetime of unwavering
circumspection on your part.
Mistaken identity aside,
finding what you're looking for on the Web can be
grindingly slow going. And what you get may be outdated,
incomplete or irrelevant. Misinformation, or the wrong
kind of information -- or too much information -- can be
worse than no information at all.
Computer scientists are
working on ways to make searches easier by, among other
things, standardizing and streamlining so-called
"relational databases" -- the complex
cyberskeletons programmers erect to connect myriad
tidbits of information (your name, your phone number,
whether you have a dog) to one another. Politely put,
databases are still in the process of evolving toward
what may someday be a more perfect state of the art.
A somewhat saltier
assessment was offered by Stanford computer science and
electrical engineering Professor Gio Wiederhold:
"Some have maintained that databases are the
roadkill on the information highway."
Wiederhold's quip was
tossed out during a workshop on database management held
last month on the Stanford campus. The workshop was
sponsored by the Stanford Computer Forum, an alliance of
Stanford's Computer Science Department and Computer
Systems Laboratory and a large group of industrial
affiliates.
Flawed or not, databases
are important and becoming more so, even to those of us
who wouldn't know one from a military base. We use them
every time we go comparison shopping on the Web, for
example. (They use us, too -- or, rather, the marketing
people who monitor our buying behavior do.) And that's
happening more and more.
E-commerce is now in the
neighborhood of $10 billion annually and accounted for
around one-third of Web activity in 1999, up from a mere
8 percent in 1998, noted Wiederhold. In 1998, 40 percent
of the population of the United States used the Internet;
in 1999, it was 52 percent. If these rates of increase
continue, Wiederhold jestingly predicted, "by 2004
more than 100 percent of all commerce in toys will be on
the Web."
A series of faculty and
graduate students took turns describing efforts to make
searching the Web faster and more reliable. One stumbling
block is that information at diverse websites is stored
in different databases written in different formats. This
makes it harder to run a search for the cheapest supplier
of a book or compact disc. While standardization of
database formats would be desirable, don't count on that
happening soon. Many companies don't want
standards, but rather want to showcase their uniqueness
(and, possibly, avoid price comparisons).
Workshop participants
agreed that standardized supplier databases are more
likely to become a reality when customers are few and
large (like the Big Three automakers) than when they are
numerous and small. Graduate student Chen Li, using the
example of comparison shopping for a CD on the Web,
demonstrated progress toward an improved search algorithm
that navigates efficiently among diverse search formats
to arrive at the best price in a reasonable amount of
time.
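The format problem can be illustrated with a minimal sketch. This is not
Li's algorithm, which the workshop account does not detail; it only shows
the general idea of translating each store's records into one common shape
before comparing prices. The two store formats, the adapter functions and
the sample titles are all hypothetical.

```python
# Illustrative only: two made-up store formats and adapters that map
# each one onto a shared {"title", "price"} shape.

def from_store_a(record):
    # Hypothetical format A: a dict with the price as a dollar string.
    return {"title": record["title"], "price": float(record["price_usd"])}


def from_store_b(record):
    # Hypothetical format B: a (title, price-in-cents) tuple.
    title, cents = record
    return {"title": title, "price": cents / 100}


def cheapest(title, sources):
    """Return (price, store) for the lowest-priced match, or None.

    `sources` maps a store name to (adapter, raw_records).
    """
    offers = []
    for store, (adapter, records) in sources.items():
        for raw in records:
            offer = adapter(raw)
            if offer["title"].lower() == title.lower():
                offers.append((offer["price"], store))
    return min(offers) if offers else None


sources = {
    "A": (from_store_a, [{"title": "Kind of Blue", "price_usd": "12.99"}]),
    "B": (from_store_b, [("Kind of Blue", 1149), ("Blue Train", 999)]),
}
best = cheapest("kind of blue", sources)
```

Every new store needs its own adapter, which is one reason a shared
standard would make comparison shopping so much easier.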
A second problem
frequently encountered in Web searches is that of getting
too many "hits," many of them useless. Often
this occurs not so much because of faulty databases but
because what you're looking for isn't stored in databases
at all, but is, rather, a string of plain English text
matching the specifications you've punched in.
The ideal search engine
would be able to recognize human language, analyze it for
semantic content and retrieve relevant results while
filtering out close-but-no-cigar material. "Most
stuff [on the Web] is in human language," said Chris
Manning, assistant professor of computer science and
linguistics. If you don't have some way of automatically
sorting through it, he said, "you're missing out on
a lot of good information."
Manning has cobbled
together an automated method of searching the Web and
sniffing out classified real-estate ads that match user
specifications (price, location, number of bedrooms,
etc.). There are lots of ways for such a search to go
awry. The location of interest may be referenced by a
nickname or an abbreviation ("Alex. Hgts."
instead of "Alexandria Heights") or
misleadingly invoked in a phrase such as "45 minute
drive from . . . . " Dollar amounts may refer not to
the current price but rather to a former asking price
("was $357,000"). Nonetheless, Manning's
retrieval system is able to get around those pitfalls,
producing results with better than 90 percent accuracy,
he told the audience.
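The pitfalls above can be made concrete with a small, hand-written sketch.
It is not Manning's system, whose internals the article does not describe;
it simply assumes a few fixed cue phrases and a tiny abbreviation table,
all of them hypothetical, to show why naive keyword matching goes wrong.

```python
import re

# Hypothetical abbreviation table; a real system would need far more.
ABBREVIATIONS = {"alex. hgts.": "Alexandria Heights"}

# A dollar amount such as $357,000 or $900.
PRICE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)")
# Words just before a price that mark it as a former asking price.
FORMER_PRICE_CUE = re.compile(r"\b(?:was|formerly|reduced from)\s*$", re.IGNORECASE)
# Words just before a place name that mark it as a reference point only.
DISTANCE_CUE = re.compile(r"\b(?:drive|minutes?|miles?)\s+(?:from|to)\s*$", re.IGNORECASE)


def expand_abbreviations(text):
    for short, full in ABBREVIATIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text


def current_prices(text):
    """Return dollar amounts not introduced by a former-price cue."""
    prices = []
    for match in PRICE.finditer(text):
        if FORMER_PRICE_CUE.search(text[:match.start()]):
            continue  # e.g. "was $357,000"
        prices.append(int(match.group(1).replace(",", "")))
    return prices


def is_located_in(text, place):
    """True if `place` is named other than as a distance reference."""
    text = expand_abbreviations(text)
    for match in re.finditer(re.escape(place), text, flags=re.IGNORECASE):
        if not DISTANCE_CUE.search(text[:match.start()]):
            return True
    return False


ad = "Alex. Hgts. 3 BR charmer, was $357,000, now $339,500."
other = "Sunny 2 BR, 45 minute drive from Alexandria Heights, $210,000"
```

Fixed rules like these break as soon as an ad is worded differently, which
is why sorting through human language automatically is hard.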
The proliferation of
small, roadworthy devices like personal digital
assistants, or PDAs -- with their midget displays,
walnut-sized brains, minuscule memories, short battery
lives and klutzy electronic pens -- necessitates
simplified Web searches. Graduate student Orkut
Buyukkokten demonstrated a technique for remotely
accessing digital libraries that reduced pen movements by
42 percent and increased browsing speed by 45 percent.
Yet another issue vexing
database managers, said graduate student Brian Cooper, is
protecting data from a wide range of onslaughts, from
deterioration of the storage medium itself (demagnetization
or breakage of a tape or hard disk) to
physical catastrophes (fires, floods). Data also may be
lost through mistaken or malicious erasure, negligent
failure to preserve it or theft. Then, of course, there
is "format obsolescence." (Will your 9-year-old
ever play your old LPs on that turntable in your
basement?) Over longish time periods, there even can be
loss of meaning as the English language evolves.
Confining themselves to
the humble objective of "just keeping the bits
around," Buyukkokten's group adopted the strategy of
periodically making multiple copies of files,
without being so obtrusive as to force users to stop
every 10 minutes and cool their heels while the archiving
was taking place. Errors introduced in copying, or by the
demagnetizing effects of stray cosmic rays, could be
corrected via automated comparisons of multiple copies
to detect discrepancies.
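One simple way to compare copies is a byte-by-byte majority vote. The
article does not say which comparison method the group used, so the sketch
below is an assumption: it takes three or more equal-length copies, keeps
the most common value at each position, and reports where the copies
disagreed. The function name is made up for illustration.

```python
from collections import Counter


def repair_by_majority(copies):
    """Return (repaired_bytes, offsets_where_copies_disagreed).

    Assumes at least three copies of equal length. If no value wins a
    clear majority at some offset, one of the tied values is kept and
    the offset is still reported as a disagreement.
    """
    if len(copies) < 3:
        raise ValueError("need at least three copies to outvote a bad one")
    if len({len(c) for c in copies}) != 1:
        raise ValueError("copies differ in length")

    repaired = bytearray()
    disagreements = []
    # zip(*copies) yields, for each offset, the byte from every copy.
    for offset, column in enumerate(zip(*copies)):
        value, votes = Counter(column).most_common(1)[0]
        if votes != len(copies):
            disagreements.append(offset)
        repaired.append(value)
    return bytes(repaired), disagreements


copies = [b"keep the bits", b"keep the bjts", b"keep the bits"]
repaired, disagreements = repair_by_majority(copies)
```

With only two copies a mismatch can be detected but not resolved, since
there is no way to tell which copy is the damaged one.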
Because the date and time
of each copying sweep are encoded in the resulting
copies, the correct version of the files resident in a
crashed system could be restored by specifying the time
of the crash, Buyukkokten said, "so, if there's an
earthquake and Stanford falls into the sea, we can
restore the file-system data to the way it looked just
before everybody died."
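Restoring to a given moment amounts to finding the most recent copy taken
no later than that moment. The sketch below assumes an in-memory archive
and a hypothetical class name; the article gives no detail on how the
actual system stores or indexes its timestamped copies.

```python
from bisect import bisect_right
from datetime import datetime


class SnapshotArchive:
    """Toy archive of timestamped file-system snapshots (illustrative)."""

    def __init__(self):
        self._times = []   # snapshot times, kept in ascending order
        self._states = []  # file contents captured at each time

    def record(self, taken_at, files):
        if self._times and taken_at <= self._times[-1]:
            raise ValueError("snapshots must be recorded in time order")
        self._times.append(taken_at)
        self._states.append(dict(files))

    def restore_as_of(self, crash_time):
        # Index of the last snapshot taken at or before crash_time.
        index = bisect_right(self._times, crash_time) - 1
        if index < 0:
            raise LookupError("no snapshot exists before the requested time")
        return dict(self._states[index])


archive = SnapshotArchive()
archive.record(datetime(2000, 3, 1, 9, 0), {"notes.txt": "draft 1"})
archive.record(datetime(2000, 3, 1, 9, 10), {"notes.txt": "draft 2"})
restored = archive.restore_as_of(datetime(2000, 3, 1, 9, 5))
```

Anything written after the last sweep before the crash is lost, so the
interval between sweeps sets the limit on how much work can disappear.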
One thing is certain:
Database-management expertise is in demand. As computer
science and electrical engineering Associate Professor
Jennifer Widom, who organized the workshop with Stanford
Computer Forum Director Hector Garcia-Molina, remarked,
half-joking, upon scanning the approximately 90
participants who filled the lecture room in the Packard
Building, where the event took place: "We have a
hunch our large enrollment this year might have something
to do with recruiting." SR