Synthesis of forgiving data extractors

Adi Omari, Sharon Shoham, Eran Yahav

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

We address the problem of synthesizing a robust data extractor from a family of websites that contain the same kind of information. This problem is common when trying to aggregate information from many websites, for example, when extracting information for a price-comparison site. Given a set of annotated example web pages from multiple sites in a family, our goal is to synthesize a robust data extractor that performs well on all sites in the family (not only on the provided example pages). The main challenge is the need to trade off precision for generality and robustness. Our key contribution is the introduction of forgiving extractors that dynamically adjust their precision to handle structural changes, without sacrificing precision on the training set. Our approach uses decision tree learning to create a generalized extractor and converts it into a forgiving extractor, in the form of an XPath query. The forgiving extractor captures a series of pruned decision trees with monotonically decreasing precision and monotonically increasing recall, and dynamically adjusts precision to guarantee sufficient recall. We have implemented our approach in a tool called TREEX and applied it to synthesize extractors for real-world, large-scale websites. We evaluate the robustness and generality of the forgiving extractors by measuring their precision and recall on: (i) different pages from sites in the training set, (ii) pages from different versions of sites in the training set, and (iii) pages from different (unseen) sites. We compare the results of our synthesized extractor to those of classifier-based extractors and pattern-based extractors, and show that TREEX significantly improves extraction accuracy.
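
The fallback behavior described in the abstract can be illustrated with a minimal sketch. This is not the TREEX implementation and does not use XPath: the flattened node dictionaries, the `forgiving_extract` function, and the three hand-written predicates are all hypothetical, and "sufficient recall" is approximated here by the assumption that any non-empty match set is enough. Each predicate in the cascade drops one condition from the previous one, so precision can only decrease and recall can only increase along the series.

```python
from typing import Callable, Dict, List

# Hypothetical flattened DOM node: attribute name -> value.
Node = Dict[str, str]
Predicate = Callable[[Node], bool]


def forgiving_extract(nodes: List[Node], cascade: List[Predicate]) -> List[Node]:
    """Try predicates from most to least precise.

    Returns the matches of the first predicate that matches anything
    (assumption: a non-empty result counts as sufficient recall).
    """
    for predicate in cascade:
        matches = [n for n in nodes if predicate(n)]
        if matches:
            return matches
    return []


# Hypothetical cascade: each step drops one condition from the step before,
# so it matches a superset of the nodes matched by the previous step.
cascade: List[Predicate] = [
    lambda n: n.get("tag") == "span" and n.get("class") == "price" and n.get("parent") == "div",
    lambda n: n.get("tag") == "span" and n.get("class") == "price",
    lambda n: n.get("class") == "price",
]

# A page with the structure seen in training.
old_page = [
    {"tag": "span", "class": "price", "parent": "div", "text": "$10"},
    {"tag": "span", "class": "title", "parent": "div", "text": "Book"},
]

# A structurally changed page: different tag and parent, same class.
new_page = [
    {"tag": "b", "class": "price", "parent": "td", "text": "$12"},
    {"tag": "b", "class": "title", "parent": "td", "text": "Pen"},
]

old_result = [n["text"] for n in forgiving_extract(old_page, cascade)]
new_result = [n["text"] for n in forgiving_extract(new_page, cascade)]
print(old_result)
print(new_result)
```

On the unchanged page the strictest predicate already matches, so no precision is given up; on the changed page the extractor falls through to the loosest predicate rather than returning nothing.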

Original language: English
Title of host publication: WSDM 2017 - Proceedings of the 10th ACM International Conference on Web Search and Data Mining
Publisher: Association for Computing Machinery, Inc
Pages: 385-394
Number of pages: 10
ISBN (Electronic): 9781450346757
DOIs
State: Published - 2 Feb 2017
Event: 10th ACM International Conference on Web Search and Data Mining, WSDM 2017 - Cambridge, United Kingdom
Duration: 6 Feb 2017 to 10 Feb 2017

Publication series

Name: WSDM 2017 - Proceedings of the 10th ACM International Conference on Web Search and Data Mining

Conference

Conference: 10th ACM International Conference on Web Search and Data Mining, WSDM 2017
Country/Territory: United Kingdom
City: Cambridge
Period: 6/02/17 to 10/02/17
