Top-k and clustering with noisy comparisons

Susan Davidson, Sanjeev Khanna, Tova Milo, Sudeepa Roy*

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

26 Scopus citations

Abstract

We study the problems of max/top-k and clustering when the comparison operations may be performed by oracles whose answer may be erroneous. Comparisons may either be of type or of value: given two data elements, the answer to a type comparison is "yes" if the elements have the same type and therefore belong to the same group (cluster); the answer to a value comparison orders the two data elements. We give efficient algorithms that are guaranteed to achieve correct results with high probability, analyze the cost of these algorithms in terms of the total number of comparisons (i.e., using a fixed-cost model), and show that they are essentially the best possible. We also show that fewer comparisons are needed when values and types are correlated, or when the error model is one in which the error decreases as the distance between the two elements in the sorted order increases. Finally, we examine another important class of cost functions, concave functions, which balances the number of rounds of interaction with the oracle with the number of questions asked of the oracle. Results of this article form an important first step in providing a formal basis for max/top-k and clustering queries in crowdsourcing applications, that is, when the oracle is implemented using the crowd. We explain what simplifying assumptions are made in the analysis, what results carry to a generalized crowdsourcing setting, and what extensions are required to support a full-fledged model.

Original languageEnglish
Article number35
JournalACM Transactions on Database Systems
Volume39
Issue number4
DOIs
StatePublished - 30 Dec 2014

Funding

FundersFunder number
European Research Council

    Fingerprint

    Dive into the research topics of 'Top-k and clustering with noisy comparisons'. Together they form a unique fingerprint.

    Cite this