COUNTATA: Dataset Labeling Using Pattern Counts

Yuval Moskovitch, H. V. Jagadish

Research output: Contribution to journal › Article › peer-review

Abstract

Information about the counts of attribute-value combinations is central to profiling a data set: it may reveal bias, and it can help determine fitness for use. While counts of individual attribute values may be stored in some data set profiles, there are too many combinations of attributes for it to be practical to store a count for each combination. To this end, we present the notion of storing a “label” of limited size that can be used to obtain good estimates of these counts. A label contains information about the counts of selected patterns (combinations of attribute values) in the data. We define an estimation function that uses this label to estimate the count of every pattern. Intuitively, there is a trade-off between the label size and its estimation error. We propose a demonstration of Countata, a system that allows the user to examine this trade-off as well as the label’s count information. We will demonstrate the usefulness of Countata using real-life data and illustrate the effectiveness of our estimation paradigm.
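
The abstract does not spell out the estimation function, so the following is only a minimal sketch of how a label-based estimate could work, not the authors' method. It assumes the label stores the total row count, the count of every individual attribute value, and exact joint counts over one chosen subset of attributes; attributes outside that subset are treated as independent. The names `build_label` and `estimate_count` are hypothetical.

```python
from collections import Counter


def build_label(rows, label_attrs):
    """Build a small label: total rows, per-attribute value counts,
    and exact joint counts over the chosen attribute subset.
    (Hypothetical structure; the label layout is an assumption.)"""
    value_counts = {}
    for row in rows:
        for attr, val in row.items():
            value_counts.setdefault(attr, Counter())[val] += 1
    pattern_counts = Counter(tuple(row[a] for a in label_attrs) for row in rows)
    return {
        "total": len(rows),
        "attrs": tuple(label_attrs),
        "value_counts": value_counts,
        "pattern_counts": pattern_counts,
    }


def estimate_count(label, pattern):
    """Estimate how many rows match `pattern` (a dict attr -> value).

    Attributes covered by the label use the stored joint counts;
    every other attribute is assumed independent (an assumption of
    this sketch) and scales the estimate by its value frequency."""
    total = label["total"]
    if total == 0:
        return 0.0
    stored = label["attrs"]
    # Sum the stored joint counts that agree with the pattern on the
    # label's attributes; attributes the pattern omits match any value.
    base = sum(
        count
        for key, count in label["pattern_counts"].items()
        if all(pattern.get(a, v) == v for a, v in zip(stored, key))
    )
    estimate = float(base)
    for attr, val in pattern.items():
        if attr not in stored:
            freq = label["value_counts"].get(attr, Counter())[val] / total
            estimate *= freq
    return estimate
```

In this sketch, a pattern that mentions only the label's attributes is answered exactly, while a pattern that adds other attributes is answered approximately. Storing joint counts over more attributes enlarges the label and typically reduces the error, which is the trade-off the abstract describes.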

Original language: English
Pages (from-to): 2829-2832
Number of pages: 4
Journal: Proceedings of the VLDB Endowment
Volume: 13
Issue number: 12
State: Published - 2020
Externally published: Yes
