Abstract
XML is the standard format for representing and sharing content on the Web. It is a highly verbose language, especially in its duplication of metadata in the form of elements and attributes. As XML content becomes more widespread, so does the demand to compress XML data. This paper presents an XML compression technique that achieves the best XML compression ratios reported to date. Its advantage over other XML compression techniques is that it uses syntactic information to enhance compression; it is therefore a fully syntax-based XML compression method. The syntactic information is parsed from XML documents by an innovative XML parser, for which we developed a new XML parser-generator. The parser-generator uses a syntactic dictionary of the XML (DTD, XML Schema, etc.) to create efficient and compact XML parsers. It is adapted to streaming technologies and can be used in a wide variety of XML applications, such as validators, converters, gateways, routers, browsers, and editors. The parser's symbols are encoded by a prediction by partial matching (PPM) codec. We compare the performance of our algorithm with that of existing XML compression techniques. The proposed algorithm achieves a better compression ratio than XML compression techniques that do not utilize syntactic structure. Its superiority is most evident on XML data sets that contain only tags and no free text.
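The idea of using syntactic information can be illustrated with a minimal Python sketch. It is not the paper's parser-generator or codec: the `SCHEMA` dictionary is a hypothetical stand-in for a DTD, each content model is simplified to "any sequence of the listed children", and the functions `encode` and `decode` are names assumed here for illustration. The sketch only shows that, once both sides share the schema, tag names need not be stored; a small integer per grammar choice suffices, and that symbol stream is what a PPM coder would then compress.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: each element may contain any
# sequence of the listed child elements (a simplification).
SCHEMA = {
    "library": ["book", "journal"],
    "book": ["title", "author"],
    "journal": ["title"],
    "title": [],
    "author": [],
}

def encode(elem, schema, out):
    """Append one small integer per grammar choice; tag names are never stored."""
    allowed = schema[elem.tag]
    if not allowed:
        return out  # no children possible, so nothing needs to be emitted
    for child in elem:
        out.append(allowed.index(child.tag))  # which allowed child comes next
        encode(child, schema, out)
    out.append(len(allowed))  # end-of-children marker
    return out

def decode(tag, schema, symbols):
    """Rebuild the element tree from the symbol iterator and the shared schema."""
    elem = ET.Element(tag)
    allowed = schema[tag]
    if not allowed:
        return elem
    while True:
        s = next(symbols)
        if s == len(allowed):
            return elem
        elem.append(decode(allowed[s], schema, symbols))

doc = ET.fromstring(
    "<library><book><title/><author/></book>"
    "<journal><title/></journal></library>"
)
symbols = encode(doc, SCHEMA, [])
rebuilt = decode("library", SCHEMA, iter(symbols))
```

In this sketch the tags-only document reduces to eight small integers, and the decoder recovers the same tree from them. Free text, attributes, and richer content models are deliberately left out.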
Original language | English |
---|---|
Article number | 1607275 |
Pages (from-to) | 402-411 |
Number of pages | 10 |
Journal | Proceedings of the Data Compression Conference |
State | Published - 2006 |
Event | Data Compression Conference, DCC 2006 - Snowbird, UT, United States; Duration: 28 Mar 2006 → 30 Mar 2006 |