Many data-intensive applications on the Web require integrating data from multiple sources (Web databases) at query time. Online sources may refer to the same real-world entity in different ways, and some may provide outdated or erroneous data. An important task is to recognize and merge, at query time, the various references that denote the same entity. Almost all existing duplicate detection and fusion techniques work in the offline setting and thus do not meet the online constraint. At least two aspects differentiate online duplicate detection and fusion from their offline counterparts. First, the offline setting assumes that the entire data set is available, while the online setting cannot make such a strong assumption. Second, in the online setting, several iterations (query submissions) may be required to compute the "ideal" representation of an entity.

We propose a general framework to address this problem: an interactive caching solution. A set of frequently requested records is cleaned offline and cached for future reference. Newly arriving records, retrieved in response to a stream of queries, are cleaned jointly with the records in the cache, presented to users, and appended to the cache.
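The caching loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the functions `match` and `fuse` are hypothetical placeholders for any record-linkage similarity test and any conflict-resolution (fusion) strategy.

```python
# Sketch of the interactive caching loop: incoming records are cleaned
# jointly with cached ones, presented to the user, and appended to the
# cache. `match` and `fuse` are illustrative placeholders.

def match(a, b):
    # Hypothetical duplicate test: here, records match on a shared key.
    return a["key"] == b["key"]

def fuse(a, b):
    # Hypothetical fusion rule: prefer non-null values of the newer record.
    return {k: b.get(k) or a.get(k) for k in set(a) | set(b)}

def clean_online(cache, incoming):
    """Clean newly arriving records jointly with the cached records."""
    results = []
    for rec in incoming:
        merged = rec
        for i, cached in enumerate(cache):
            if match(cached, merged):
                merged = fuse(cached, merged)
                cache[i] = merged       # refine the cached representation
                break
        else:
            cache.append(merged)        # unseen entity: append to the cache
        results.append(merged)          # record presented to the user
    return results
```

For example, a cached record `{"key": "e1", "name": "Acme", "city": None}` fused with an incoming `{"key": "e1", "city": "Berlin"}` yields a single, more complete representation of the entity.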

We introduce two online record linkage and fusion approaches: (i) a record-based approach and (ii) a graph-based approach. They differ chiefly in how they organize data in the cache and in their computational characteristics. We conduct a comprehensive empirical study of the two techniques with real data from the Web, and we couple their analysis with commonly used cache settings: static vs. dynamic caches, cache size, and eviction policies.
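One of the cache settings mentioned above is the eviction policy. As a concrete illustration, a bounded record cache with least-recently-used (LRU) eviction can be sketched as follows; the class name and capacity value are illustrative, not taken from the paper.

```python
from collections import OrderedDict

# Sketch of a bounded cache of fused entity records with LRU eviction.
# When the cache is full, the least recently requested entity is evicted.

class LRURecordCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.records = OrderedDict()          # entity id -> fused record

    def get(self, entity_id):
        if entity_id not in self.records:
            return None                       # cache miss
        self.records.move_to_end(entity_id)   # mark as recently used
        return self.records[entity_id]

    def put(self, entity_id, record):
        if entity_id in self.records:
            self.records.move_to_end(entity_id)
        self.records[entity_id] = record
        if len(self.records) > self.capacity:
            self.records.popitem(last=False)  # evict least recently used
```

Other policies (e.g., least-frequently-used) would only change the bookkeeping in `get` and the choice of victim in `put`.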
