New system accounts for uncertainty of data and sourcing
BY DAVID ORENSTEIN
This much is sure: Information is sometimes uncertain and sources are sometimes flawed. A new prototype database system created at Stanford is the first to have both data uncertainty and lineage (sourcing) built in, a development that could enable diverse applications such as tracking wildlife, improving Internet comparison shopping and even fighting crime.
"A lot of applications have uncertainty in their data," says computer science Professor Jennifer Widom, who leads the Trio project at Stanford to create a database system that accounts for uncertainty. Sensors can have systemic errors, sources sometimes disagree, and observers may have doubts. In traditional databases, it is difficult to represent uncertain data or to keep track of its lineage. As a result, people have sometimes represented data as more certain than it is or have avoided putting the data in a database at all. "There is the question of how many people have not acknowledged uncertainty in their data because they didn't have a database system that could handle it," Widom says. Her team describes a prototype of their system in the March 2006 issue of the IEEE Data Engineering Bulletin.
The only alternative to misrepresenting uncertain data or leaving it out completely has been for software developers to write onerous computer code in every individual application that accesses the data. The code would do the work of calculating how uncertainty "adds up" (i.e., uncertainties compound somewhat like probabilities do). But Widom's database system automatically does the calculations and keeps track of sources. A system that tracks sources can account for their influence throughout the database. Data found to be unreliable, for example, can be automatically discounted in calculations.
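As a rough illustration of the kind of compounding the article alludes to, here is a minimal Python sketch. It assumes the confidence values behave like independent probabilities, and the function names are illustrative, not part of Trio's API:

```python
# A minimal sketch of confidence compounding, assuming the values are
# independent probabilities. Function names are illustrative only.

def both(p: float, q: float) -> float:
    """Confidence that two independent uncertain facts are both true."""
    return p * q

def either(p: float, q: float) -> float:
    """Confidence that at least one of two independent facts is true."""
    return 1 - (1 - p) * (1 - q)

# A conclusion that rests on two 0.9-confidence facts is only ~0.81 sure:
print(round(both(0.9, 0.9), 2))    # 0.81
# Two independent 0.6-confidence reports of the same event reinforce it:
print(round(either(0.6, 0.6), 2))  # 0.84
```

This is exactly the bookkeeping that, before systems like Trio, every application touching the data had to reimplement for itself.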
Widom's interest in combining data uncertainty and lineage began when she considered a common commercial data integration problem. Companies such as Yahoo! provide web users with information such as pricing and features for thousands of products. These sites must reconcile data about those products from dozens of online retailers. When retailers represent products in slightly different ways, comparison-shopping software must figure out whether they are describing the same product or different ones. The best answer to this "entity resolution" problem, Widom reasoned, could be found by assigning a "confidence value" (similar to a probability between 0 and 1) in the database to each retailer's data records, and then forming aggregate data records by combining multiple records that are likely to represent the same product.

The Trio system
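As a rough sketch of how confidence-weighted entity resolution for comparison shopping might look, consider the toy Python example below. The matching rule, threshold and record format are invented for illustration; Trio's actual algorithms are not shown here:

```python
# Hypothetical sketch of entity resolution: group retailer records that
# likely describe the same product. The match rule is deliberately crude.

def same_product(a: dict, b: dict) -> float:
    """Crude match confidence: 1.0 if model numbers agree, otherwise a
    Jaccard similarity over the words in the two titles."""
    if a.get("model") and a.get("model") == b.get("model"):
        return 1.0
    wa = set(a["title"].lower().split())
    wb = set(b["title"].lower().split())
    return len(wa & wb) / len(wa | wb)

retailer_records = [
    {"title": "Acme X100 Digital Camera", "model": "X100", "price": 199},
    {"title": "Digital Camera Acme X100", "model": "X100", "price": 189},
    {"title": "Acme X200 Camcorder", "model": "X200", "price": 299},
]

# Group records whose match confidence clears an arbitrary 0.8 threshold.
groups: list[list[dict]] = []
for rec in retailer_records:
    for group in groups:
        if same_product(group[0], rec) >= 0.8:
            group.append(rec)
            break
    else:
        groups.append([rec])

for g in groups:
    print(len(g), "record(s) for model", g[0]["model"],
          "with prices", [r["price"] for r in g])
```

Each resulting group would become one aggregate record, carrying a confidence value reflecting how sure the system is that its members really are the same product.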
Realizing that uncertainty pervades other applications as well—one of Widom's favorites is a large data set capturing more than a decade of dolphin behavior observations where observers weren't always positive which dolphin was which—Widom and her students decided to build a general-purpose database system that could be used for any application where capturing uncertainty and lineage would be needed. The result was Trio, a prototype Uncertainty-Lineage Database (ULDB) and accompanying TriQL language for querying that database.
The introduction in the IEEE bulletin uses a simplified crime-fighting example to illustrate how the database system works. Say a detective named Jennifer is trying to crack a car-theft ring. She could begin by creating two data tables: one of witnesses she has interviewed and one of current owners of cars in the area.
Jennifer knows that witnesses aren't always sure of what they saw, so for each witness statement in the table, she includes a confidence value quantifying that uncertainty. On a scale of 0 to 1, witness Alice, for example, might be 0.6 sure a stolen car was a Honda and 0.4 sure it was a Toyota. Jennifer would include similar values for every other witness's account.
To come up with a list of suspects, Jennifer could run a database query combining the table of witnesses and the table of owners. If the most mathematically likely story from the witnesses is that a Honda was stolen, then the most likely suspects are those in possession of Hondas. Using the ULDB, Jennifer's query can be quite simple. TriQL, an extension of the industry-standard Structured Query Language, allows users to exploit the unique features of a ULDB.
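The effect of such a query can be simulated in plain Python. The sketch below joins witness sightings against car owners and carries confidence values along, combining independent pieces of support; the names, data and combination rule are illustrative, and this is neither TriQL nor Trio's actual algorithm:

```python
# Rough simulation of the car-theft query: witnesses hold alternative
# sightings with confidence values; the query joins them against car
# owners to rank suspects. Not TriQL, and not Trio's actual algorithm.

# Each witness maps to alternative sightings: (car_make, confidence).
witnesses = {
    "Alice": [("Honda", 0.6), ("Toyota", 0.4)],
    "Carol": [("Honda", 0.8), ("Mazda", 0.2)],
}

owners = [
    ("Hank", "Honda"),
    ("Tina", "Toyota"),
    ("Mel", "Mazda"),
]

# Join: each sighting supports every owner of that make, and the derived
# suspect row inherits the sighting's confidence value.
suspects: dict[str, list[float]] = {}
for sightings in witnesses.values():
    for make, conf in sightings:
        for owner, owned_make in owners:
            if owned_make == make:
                suspects.setdefault(owner, []).append(conf)

def combined(confs: list[float]) -> float:
    """Combine independent support: the suspect is implicated if at
    least one supporting sighting is correct ("noisy-or")."""
    result = 1.0
    for c in confs:
        result *= 1 - c
    return 1 - result

for owner, confs in sorted(suspects.items(), key=lambda kv: -combined(kv[1])):
    print(f"{owner}: {combined(confs):.2f}")
```

With these made-up numbers, Hank tops the list because two independent witnesses lean toward a Honda; in Trio, the ULDB would produce the equivalent ranked, confidence-annotated result from a short TriQL query.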
When users write a TriQL query, Trio will do the calculations of uncertainty automatically, using algorithms developed by the Trio team. The calculations in the simplified car-theft example may not seem intimidating, but in a real database, many uncertain factors would be combined in tables created by multiple queries. Sophisticated automation helps, Widom says.
Another important innovation in Trio is its handling of data lineage. By keeping track of where data comes from, Trio lets users account for a source's influence throughout the database and adjust it later. For example, say witness Bob was partially or totally discredited late in an investigation. Jennifer could reduce Bob's confidence values and, because the lineage of Bob's data is fully recorded in the database, Trio could automatically recalculate every confidence value that Bob's testimony influenced. After changing Bob's credibility—and letting Trio update all the tables in the ULDB—Jennifer might be able to exonerate some suspects and shift her suspicions to others she had previously overlooked.

A future of uncertainty
Before the Trio system can help fight crime, track dolphins or clean up consumer data, the prototype will need fine-tuning. "I feel like we're quite early in this project," Widom acknowledges. "We're going to have a much, much better system in the long run, although it's pretty good like it is." Widom wants to improve the user interface and Trio's efficiency before releasing it as an open-source system. The research is funded by the National Science Foundation and aerospace giant Boeing, which is interested in many database topics, including data integration.
Certainly, however, the time has come for uncertain data, Widom says. "When people get interested in adding some kind of advanced data-processing functionality, they first tend to write it into their applications," she says. "But when the functionality gets important and common enough, then they push it into the database system. Then you don't have to replicate that code in every single application and the database can do it much more efficiently."
David Orenstein is the communications and public relations manager at the Stanford School of Engineering.