TY - GEN
T1 - Entropy-Based Approach to Efficient Cleaning of Big Data in Hierarchical Databases
AU - Levner, Eugene
AU - Kriheli, Boris
AU - Benis, Arriel
AU - Ptuskin, Alexander
AU - Elalouf, Amir
AU - Hovav, Sharon
AU - Ashkenazi, Shai
N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - When databases are at risk of containing erroneous, redundant, or obsolete data, a cleaning procedure is used to detect, correct, or remove such undesirable records. We propose a methodology for improving data cleaning efficiency in a large hierarchical database. The methodology relies on Shannon’s information entropy for measuring the amount of information stored in databases. This approach, which builds on previously gathered statistical data regarding the prevalence of errors in the database, enables the decision maker to determine which components of the database are likely to have undergone more information loss, and thus to prioritize those components for cleaning. In particular, in cases where the cleaning process is iterative (from the root node down), the entropic approach produces a scientifically motivated stopping rule that determines the optimal (i.e., minimally required) number of tiers in the hierarchical database that need to be examined. This stopping rule defines a more streamlined representation of the database, in which less informative tiers are eliminated.
KW - Data cleaning
KW - Entropy evaluation
KW - Entropy-based analytics
UR - http://www.scopus.com/inward/record.url?scp=85092079714&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-59612-5_1
DO - 10.1007/978-3-030-59612-5_1
M3 - Conference contribution
AN - SCOPUS:85092079714
SN - 9783030596118
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 3
EP - 12
BT - Big Data – BigData 2020 - 9th International Conference, Held as Part of the Services Conference Federation, SCF 2020, Proceedings
A2 - Nepal, Surya
A2 - Cao, Wenqi
A2 - Nasridinov, Aziz
A2 - Bhuiyan, MD Zakirul Alam
A2 - Guo, Xuan
A2 - Zhang, Liang-Jie
PB - Springer Science and Business Media Deutschland GmbH
T2 - 9th International Conference on Big Data, BigData 2020, held as part of the Services Conference Federation, SCF 2020
Y2 - 18 September 2020 through 20 September 2020
ER -