Query driven data labeling with experts: Why pay twice?

Eyal Dushkin, Shay Gershtein, Tova Milo, Slava Novgorodov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Data has become a major priority for customer-facing businesses of all sizes. Companies put a lot of effort and money into storing, cleaning, organizing, enriching and processing data to better meet user needs. In large-scale systems such as big e-commerce sites, these tasks usually involve machine learning methods that rely on training data annotated by domain experts. Since domain experts are an expensive resource in terms of monetary cost and latency, it is desirable to design algorithms that minimize the interaction with them. In this paper we address the problem of minimizing the number of annotation tasks with respect to a set of queries. We present a dedicated algorithm based on efficient labeling that dictates the strategy for constructing a minimal set of classifiers sufficing to answer all queries. Our approach not only reduces monetary costs and latency, but also avoids data redundancy and saves storage space. We first consider a typical scenario of two expressions per query, and further discuss the challenges of extending our approach to multiple expressions. We examine two common models, batch and stream configurations, and devise offline and online algorithms for them, respectively. We analyze the number of annotations, and demonstrate the efficiency and effectiveness of our algorithm on a real-world dataset.

Original language: English
Title of host publication: Advances in Database Technology - EDBT 2019
Subtitle of host publication: 22nd International Conference on Extending Database Technology, Proceedings
Editors: Irini Fundulaki, Zoi Kaoudi, Carsten Binnig, Helena Galhardas, Berthold Reinwald, Melanie Herschel
Publisher: OpenProceedings.org
Pages: 698-701
Number of pages: 4
ISBN (Electronic): 9783893180813
State: Published - 2019
Event: 22nd International Conference on Extending Database Technology, EDBT 2019 - Lisbon, Portugal
Duration: 26 Mar 2019 - 29 Mar 2019

Publication series

Name: Advances in Database Technology - EDBT
Volume: 2019-March
ISSN (Electronic): 2367-2005

Conference

Conference: 22nd International Conference on Extending Database Technology, EDBT 2019
Country/Territory: Portugal
City: Lisbon
Period: 26/03/19 - 29/03/19
