TY - GEN
T1 - Query driven data labeling with experts
T2 - 22nd International Conference on Extending Database Technology, EDBT 2019
AU - Dushkin, Eyal
AU - Gershtein, Shay
AU - Milo, Tova
AU - Novgorodov, Slava
N1 - Publisher Copyright: © 2019 Copyright held by the owner/author(s).
PY - 2019
Y1 - 2019
N2 - Data has become a major priority for customer facing businesses of all sizes. Companies put a lot of effort and money into storing, cleaning, organizing, enriching and processing data to better meet user needs. Usually in large scale systems such as big e-commerce sites these tasks involve machine learning methods, relying on training data annotated by domain experts. Since domain experts are an expensive resource in terms of monetary costs and latency, it is desired to design algorithms that minimize the interaction with them. In this paper we address the problem of minimizing the number of annotation tasks with respect to a set of queries. We present a dedicated algorithm based on efficient labeling, that dictates the strategy for constructing a minimal set of classifiers sufficing to answer all queries. Our approach not only reduces monetary costs and latency, but also avoids data redundancy and saves storage space. We first consider a typical scenario of two expressions per query, and further discuss the challenges of extending our approach to multiple expressions. We examine two common models: batch and stream configurations, and devise offline and online algorithms, respectively. We analyze the number of annotations, and demonstrate the efficiency and effectiveness of our algorithm on a real-world dataset.
UR - http://www.scopus.com/inward/record.url?scp=85064935721&partnerID=8YFLogxK
U2 - 10.5441/002/edbt.2019.90
DO - 10.5441/002/edbt.2019.90
M3 - Conference contribution
AN - SCOPUS:85064935721
T3 - Advances in Database Technology - EDBT
SP - 698
EP - 701
BT - Advances in Database Technology - EDBT 2019
A2 - Fundulaki, Irini
A2 - Kaoudi, Zoi
A2 - Binnig, Carsten
A2 - Galhardas, Helena
A2 - Reinwald, Berthold
A2 - Herschel, Melanie
PB - OpenProceedings.org
Y2 - 26 March 2019 through 29 March 2019
ER -