XML syntax conscious compression

S. Harrusi, A. Averbuch, A. Yehudai

Research output: Contribution to journal › Conference article › peer-review

Abstract

XML is the standard format for content representation and sharing on the Web. It is a highly verbose language, especially in its duplication of metadata in the form of elements and attributes. As XML content becomes more widespread, so does the demand to reduce XML data volume. This paper presents an XML compression technique that achieves the best XML compression ratios reported to date. Its advantage over other XML compression techniques is that it uses syntactic information to enhance compression; it is therefore a fully syntax-based XML compression. The syntactic information is extracted from XML documents by an innovative XML parser, for which we developed a new XML parser-generator. The parser-generator uses a syntactic dictionary of the XML (DTD, XML Schema, etc.) to create efficient and compact XML parsers. It is adapted to streaming technologies and can be used in a wide variety of XML applications such as validators, converters, gateways, routers, browsers, editors, etc. The parsers' symbols are encoded by a prediction by partial matching (PPM) codec. We compare the performance of our algorithm with that of other existing XML compression techniques. The proposed algorithm achieves a better compression ratio than XML compression techniques that do not utilize syntactic structure. The superiority of our technique is more evident on XML data sets that contain only tags and no free text.
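
The idea of using syntactic information can be illustrated with a minimal sketch. It is not the authors' parser-generator: the toy `SCHEMA` (a stand-in for a DTD), the function names `encode` and `decode`, and the sample document are all hypothetical, and the PPM stage that would compress the resulting symbol stream is omitted. The sketch only shows that, once the schema is known to both sides, tag names need not be stored; only the choices the schema leaves open (optional or repeated children) and the text are emitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema (stand-in for a DTD): element -> ordered list of
# (child name, occurrence), where occurrence is "1" (exactly once),
# "?" (optional) or "*" (zero or more). An empty list marks a text-only leaf.
SCHEMA = {
    "library": [("book", "*")],
    "book": [("title", "1"), ("year", "?")],
    "title": [],
    "year": [],
}

def encode(elem, out):
    """Append to `out` only what the schema cannot predict."""
    model = SCHEMA[elem.tag]
    if not model:                      # leaf: emit its text
        out.append(elem.text or "")
        return
    children = list(elem)
    i = 0
    for name, occ in model:
        if occ == "1":                 # mandatory child: no symbol needed
            encode(children[i], out)
            i += 1
        elif occ == "?":               # optional child: one presence flag
            present = i < len(children) and children[i].tag == name
            out.append(1 if present else 0)
            if present:
                encode(children[i], out)
                i += 1
        else:                          # "*": flag 1 per item, then flag 0
            while i < len(children) and children[i].tag == name:
                out.append(1)
                encode(children[i], out)
                i += 1
            out.append(0)

def decode(tag, symbols):
    """Rebuild an element from an iterator over the emitted symbols."""
    elem = ET.Element(tag)
    model = SCHEMA[tag]
    if not model:
        elem.text = next(symbols)
        return elem
    for name, occ in model:
        if occ == "1":
            elem.append(decode(name, symbols))
        elif occ == "?":
            if next(symbols) == 1:
                elem.append(decode(name, symbols))
        else:
            while next(symbols) == 1:
                elem.append(decode(name, symbols))
    return elem

doc = ET.fromstring(
    "<library><book><title>Dune</title><year>1965</year></book>"
    "<book><title>Emma</title></book></library>"
)
symbols = []
encode(doc, symbols)
restored = decode("library", iter(symbols))
```

In this sketch the stream for the sample document contains no tag names at all, which is consistent with the abstract's remark that the gain is largest on tag-heavy data; a real system would pass such a stream to an entropy coder such as PPM.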

Original language: English
Article number: 1607275
Pages (from-to): 402-411
Number of pages: 10
Journal: Proceedings of the Data Compression Conference
State: Published - 2006
Event: Data Compression Conference, DCC 2006 - Snowbird, UT, United States
Duration: 28 Mar 2006 → 30 Mar 2006
