Cleaning data with constraints and experts

Ahmad Assadi, Tova Milo, Slava Novgorodov

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

7 Scopus citations

Abstract

Popular techniques for data cleaning use integrity constraints to identify errors in the data and to resolve them automatically, e.g. by using predefined priorities among possible updates and finding a minimal repair that resolves the violations. Such automatic solutions, however, cannot ensure the precision of the repairs, since they do not have enough evidence about the actual errors and may in fact lead to wrong results with respect to the ground truth. It has thus been suggested to use domain experts to examine the potential updates and choose which should be applied to the database. However, the sheer volume of the databases and the large number of possible updates that may resolve a given constraint violation may make such a manual examination prohibitively expensive. The goal of the DANCE system presented here is to help optimize the experts' work and reduce as much as possible the number of questions (update verifications) they need to address. Given a constraint violation, our algorithm identifies the suspicious tuples whose update may contribute (directly or indirectly) to the constraint resolution, as well as the possible dependencies among them. Using this information, it builds a graph whose nodes are the suspicious tuples and whose weighted edges capture the likelihood that an error in one tuple occurs and affects the other. A PageRank-style algorithm then allows us to identify the most beneficial tuples to ask about first. Incremental graph maintenance is used to ensure interactive response time. We implemented our solution in the DANCE system and show its effectiveness and efficiency through a comprehensive suite of experiments.
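
The ranking step described in the abstract can be illustrated with a minimal sketch of a generic weighted PageRank by power iteration. This is not the DANCE implementation: the function name, the tuple identifiers, the edge weights, and the damping factor are all hypothetical, and the sketch assumes that an edge points from a tuple to the tuple whose error may have affected it, so that score accumulates at likely root causes.

```python
def rank_suspicious_tuples(edges, damping=0.85, iterations=100, tol=1e-10):
    """Rank nodes of a weighted directed graph with a PageRank-style iteration.

    edges: dict mapping (src, dst) -> positive weight. In this sketch the
    weight is a hypothetical likelihood that an error in dst affected src.
    Returns (nodes sorted by descending score, score dict).
    """
    nodes = sorted({n for edge in edges for n in edge})
    n_nodes = len(nodes)

    # Total outgoing weight per node, used to normalize edge weights.
    out_weight = {n: 0.0 for n in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    score = {n: 1.0 / n_nodes for n in nodes}
    for _ in range(iterations):
        # Nodes without outgoing edges spread their score uniformly.
        dangling = sum(score[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping) / n_nodes + damping * dangling / n_nodes
        new_score = {n: base for n in nodes}
        for (src, dst), w in edges.items():
            new_score[dst] += damping * score[src] * w / out_weight[src]
        delta = sum(abs(new_score[n] - score[n]) for n in nodes)
        score = new_score
        if delta < tol:
            break

    order = sorted(nodes, key=lambda n: score[n], reverse=True)
    return order, score


# Hypothetical dependency graph over four suspicious tuples.
example_edges = {
    ("t2", "t1"): 0.9,
    ("t3", "t1"): 0.5,
    ("t4", "t2"): 0.7,
    ("t4", "t3"): 0.3,
}
order, score = rank_suspicious_tuples(example_edges)
print(order)
```

Under these assumptions, the tuple at the head of `order` would be the first one shown to the expert. The sketch recomputes all scores from scratch; the incremental graph maintenance mentioned in the abstract is not modeled here.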

Original language: English
Title of host publication: Proceedings of the 21st Workshop on the Web and Databases, WebDB 2018
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450356480
DOIs
State: Published - 10 Jun 2018
Event: 21st Workshop on the Web and Databases, WebDB 2018 - Houston, United States
Duration: 10 Jun 2018 → …

Publication series

Name: Proceedings of the 21st Workshop on the Web and Databases, WebDB 2018

Conference

Conference: 21st Workshop on the Web and Databases, WebDB 2018
Country/Territory: United States
City: Houston
Period: 10/06/18 → …

Funding

Funder (funder number)
Intel Corporation
European Commission (291071)
Seventh Framework Programme
