TY - GEN
T1 - Lossless separation of web pages into layout code and data
AU - Omari, Adi
AU - Kimelfeld, Benny
AU - Yahav, Eran
AU - Shoham, Sharon
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/8/13
Y1 - 2016/8/13
N2 - A modern web page is often served by running layout code on data, producing an HTML document that enhances the data with front/back matters and layout/style operations. In this paper, we consider the opposite task: separating a given web page into a data component and a layout program. This separation has various important applications: page encoding may be significantly more compact (reducing web traffic), data representation is normalized across web designs (facilitating wrapping, retrieval and extraction), and repetitions are diminished (expediting updates and redesign). We present a framework for defining the separation task, and devise an algorithm for synthesizing layout code from a web page while distilling its data in a lossless manner. The main idea is to synthesize layout code hierarchically for parts of the page, and use a combined program-data representation cost to decide whether to align intermediate programs. When intermediate programs are aligned, they are transformed into a single program, possibly with loops and conditionals. At the same time, differences between the aligned programs are captured by the data component such that executing the layout code on the data results in the original page. We have implemented our approach and conducted a thorough experimental study of its effectiveness. Our experiments show that our approach features state of the art (and higher) performance in both size compression and record extraction.
AB - A modern web page is often served by running layout code on data, producing an HTML document that enhances the data with front/back matters and layout/style operations. In this paper, we consider the opposite task: separating a given web page into a data component and a layout program. This separation has various important applications: page encoding may be significantly more compact (reducing web traffic), data representation is normalized across web designs (facilitating wrapping, retrieval and extraction), and repetitions are diminished (expediting updates and redesign). We present a framework for defining the separation task, and devise an algorithm for synthesizing layout code from a web page while distilling its data in a lossless manner. The main idea is to synthesize layout code hierarchically for parts of the page, and use a combined program-data representation cost to decide whether to align intermediate programs. When intermediate programs are aligned, they are transformed into a single program, possibly with loops and conditionals. At the same time, differences between the aligned programs are captured by the data component such that executing the layout code on the data results in the original page. We have implemented our approach and conducted a thorough experimental study of its effectiveness. Our experiments show that our approach features state of the art (and higher) performance in both size compression and record extraction.
UR - http://www.scopus.com/inward/record.url?scp=84985024409&partnerID=8YFLogxK
U2 - 10.1145/2939672.2939858
DO - 10.1145/2939672.2939858
M3 - ???researchoutput.researchoutputtypes.contributiontobookanthology.conference???
AN - SCOPUS:84985024409
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 1805
EP - 1814
BT - KDD 2016 - Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
T2 - 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016
Y2 - 13 August 2016 through 17 August 2016
ER -