Using the crowd for top-k and group-by queries

Susan B. Davidson*, Sanjeev Khanna, Tova Milo, Sudeepa Roy

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review


Group-by and top-k are fundamental constructs in database queries. However, the criteria used for grouping and ordering certain types of data (such as unlabeled photos clustered by person and ordered by age) are difficult for machines to evaluate. In contrast, these tasks are easy for humans and are therefore natural candidates for crowd-sourcing. We study the problem of evaluating top-k and group-by queries using the crowd to answer either type or value questions. Given two data elements, the answer to a type question is "yes" if the elements have the same type and therefore belong to the same group or cluster; the answer to a value question orders the two data elements. The assumption here is that there is an underlying ground truth, but that the answers returned by the crowd may sometimes be erroneous. We formalize the problems of top-k and group-by in the crowd-sourced setting, and give efficient algorithms that are guaranteed to achieve good results with high probability. We analyze the crowd-sourced cost of these algorithms in terms of the total number of type and value questions, and show that they are essentially the best possible. We also show that fewer questions are needed when values and types are correlated, or when the error model is one in which the error decreases as the distance between the two elements in the sorted order increases.
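To make the two question kinds concrete, here is a minimal simulation sketch. It is not the algorithm from the paper: it is a naive baseline that repeats each question and takes a majority vote, under the assumption that every crowd answer is wrong independently with a fixed probability below 1/2. All names (`NoisyCrowd`, `same_type`, `greater`, `majority`, `group_by`, `top_k`) and the dictionary layout of the elements are hypothetical and chosen only for illustration.

```python
import random


class NoisyCrowd:
    """Simulated crowd (assumption): each answer is flipped independently
    with probability `error_rate`; the ground truth lives in the elements."""

    def __init__(self, error_rate=0.2, seed=0):
        self.error_rate = error_rate
        self.rng = random.Random(seed)
        self.questions_asked = 0  # crowd-sourced cost: total questions

    def _noisy(self, truth):
        self.questions_asked += 1
        flipped = self.rng.random() < self.error_rate
        return (not truth) if flipped else truth

    def same_type(self, a, b):
        """Type question: do a and b belong to the same group?"""
        return self._noisy(a["type"] == b["type"])

    def greater(self, a, b):
        """Value question: is a ordered above b?"""
        return self._noisy(a["value"] > b["value"])


def majority(ask, a, b, repeats):
    """Ask the same question `repeats` times and return the majority answer."""
    yes = sum(ask(a, b) for _ in range(repeats))
    return 2 * yes > repeats


def group_by(items, crowd, repeats=9):
    """Naive grouping: compare each item with one representative per group."""
    groups = []
    for item in items:
        for group in groups:
            if majority(crowd.same_type, item, group[0], repeats):
                group.append(item)
                break
        else:
            groups.append([item])  # no existing group matched
    return groups


def top_k(items, k, crowd, repeats=9):
    """Naive top-k: k rounds of a linear scan for the current maximum."""
    remaining = list(items)
    result = []
    for _ in range(min(k, len(remaining))):
        best = remaining[0]
        for candidate in remaining[1:]:
            if majority(crowd.greater, candidate, best, repeats):
                best = candidate
        remaining.remove(best)
        result.append(best)
    return result
```

For example, with elements such as `{"id": 0, "type": "A", "value": 30}`, `group_by(photos, NoisyCrowd())` clusters by the hidden `type` field and `top_k(photos, 2, NoisyCrowd())` returns the two elements the simulated crowd ranks highest, while `questions_asked` tracks the total number of type and value questions. This baseline spends the same number of repetitions on every pair; the abstract's point is that carefully designed algorithms need essentially the fewest questions possible, and fewer still when types and values are correlated or when errors shrink as elements grow farther apart in the sorted order.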

Original language: English
Title of host publication: ICDT 2013 - 16th International Conference on Database Theory, Proceedings
Publisher: Association for Computing Machinery
Number of pages: 12
ISBN (Print): 9781450315982
State: Published - 2013
Event: 16th International Conference on Database Theory, ICDT 2013 - Genoa, Italy
Duration: 18 Mar 2013 to 22 Mar 2013

Publication series

Name: ACM International Conference Proceeding Series


Conference: 16th International Conference on Database Theory, ICDT 2013


  • Clustering
  • Crowd sourcing
  • Group by
  • Lower bounds
  • Top-k

