Record Linkage and Fusion over Web Databases

Mourad Ouzzani; Eduard Dragut; El Kindi; Amgad Madkour

doi:10.5339/qfarf.2011.CSP1

Abstract

Many data-intensive applications on the Web require integrating data from multiple sources (Web databases) at query time. Online sources may refer to the same real-world entity in different ways, and some may provide outdated or erroneous data. An important task is therefore to recognize and merge, at query time, the various references to the same entity. Almost all existing duplicate detection and fusion techniques work in the offline setting and thus do not meet the online constraint. At least two aspects differentiate online duplicate detection and fusion from their offline counterparts. First, the offline setting assumes that the entire data set is available, while the online setting cannot make such a strong assumption. Second, in the online setting several iterations (query submissions) may be required to compute the “ideal” representation of an entity.

We propose a general framework to address this problem: an interactive caching solution. A set of frequently requested records is cleaned offline and cached for future reference. Records that arrive in response to a stream of queries are cleaned jointly with the records in the cache, presented to users, and appended to the cache.
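The caching loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `LinkageCache`, the string-similarity threshold, the `name` matching field, and the "keep the longest value" fusion rule are all assumptions chosen only to make the flow concrete.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Hypothetical matcher: case-insensitive string similarity on one field."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def fuse(records):
    """Hypothetical fusion rule: per field, keep the longest value seen."""
    fused = {}
    for rec in records:
        for key, value in rec.items():
            if key not in fused or len(str(value)) > len(str(fused[key])):
                fused[key] = value
    return fused


class LinkageCache:
    """Cache of cleaned entity records, seeded offline and updated online."""

    def __init__(self, cleaned_records=()):
        # Records cleaned offline form the initial cache content.
        self.entities = [dict(r) for r in cleaned_records]

    def process(self, incoming, key="name"):
        """Clean newly arrived records jointly with the cached ones."""
        results = []
        for rec in incoming:
            for i, entity in enumerate(self.entities):
                if similar(rec[key], entity[key]):
                    # Same entity: merge the new record into the cached one.
                    self.entities[i] = fuse([entity, rec])
                    results.append(self.entities[i])
                    break
            else:
                # No match found: append the record to the cache as a new entity.
                self.entities.append(dict(rec))
                results.append(self.entities[-1])
        return results  # what would be presented to the user
```

For example, seeding the cache with one restaurant record that lacks a phone number and then processing a query result that spells the name slightly differently but includes a phone number leaves a single cached entity carrying both pieces of information.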

We introduce two online record linkage and fusion approaches: (i) a record-based approach and (ii) a graph-based approach. They differ chiefly in how they organize data in the cache and in their computational cost. We conduct a comprehensive empirical study of the two techniques with real data from the Web, and we analyze them under commonly used cache settings: static versus dynamic caches, cache size, and eviction policies.
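The abstract does not spell out how the graph-based approach organizes the cache. One plausible reading, shown here purely as an assumption, is that records are nodes, detected matches are edges, and each connected component stands for one entity. The hypothetical `MatchGraph` class below tracks those components with a union-find structure.

```python
class MatchGraph:
    """Records as nodes, matches as edges; components are candidate entities."""

    def __init__(self):
        self.parent = {}  # record id -> parent id in the union-find forest

    def add_record(self, rec_id):
        # A new record starts as its own single-node component.
        self.parent.setdefault(rec_id, rec_id)

    def find(self, rec_id):
        # Follow parent links to the component root, halving paths on the way.
        while self.parent[rec_id] != rec_id:
            self.parent[rec_id] = self.parent[self.parent[rec_id]]
            rec_id = self.parent[rec_id]
        return rec_id

    def add_match(self, a, b):
        # Recording a match between two records merges their components.
        self.add_record(a)
        self.add_record(b)
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def entities(self):
        # Group record ids by component root.
        groups = {}
        for rec_id in self.parent:
            groups.setdefault(self.find(rec_id), set()).add(rec_id)
        return list(groups.values())
```

Under this reading, a record that matches any member of a component joins the whole component, whereas a record-based cache would store one fused record per entity, as in a flat list.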
