Summary: Compact XML grammar based compression
S. Harrusi, A. Averbuch, A. Yehudai
School of Computer Science
Tel Aviv University, Tel Aviv 69978, Israel
Extensible Markup Language (XML) is the standard format for content representation and
sharing on the Web. XML is a highly verbose language, especially regarding the duplication of
meta-data in the form of elements and attributes. As XML content becomes more widespread,
so does the demand to reduce the volume of XML data through compression.
This paper presents a new grammar, called D-grammar, which defines XML structure for a
specific DTD. DTD is chosen as an explanatory example. The grammar can be extended to define
other deterministic XML schema languages such as XML Schema. The paper also presents a parser generator
that generates a D-grammar parser. The resulting parser, a deterministic pushdown transducer
(DPDT), is an efficient and compact XML validator for the DTD that the D-grammar reflects.
The presented compression technique encodes the DPDT
validation choices during the XML structure parsing instead of the textual tags that compose the
XML structure. This improves XML text compression in two ways: first, there are fewer symbols to
encode; second, the encoded structure symbols predict the text that follows them better than the
textual structure tags do.
A unique advantage of the presented technique is that it combines the validation phase with
the compression phase and thus saves processing time.
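
The idea of encoding validation choices rather than textual tags can be illustrated with a
minimal sketch. This is not the paper's D-grammar or its DPDT; it is a toy model under stated
assumptions: the content-model format in GRAMMAR, the element names, and the functions encode
and decode are all hypothetical. Wherever the content model forces the next element, nothing is
emitted; a symbol is written only where the document makes a choice, and a document that
violates the model is rejected during the same pass.

```python
# Hypothetical toy content models, loosely in the spirit of DTD declarations:
#   library -> book*        book -> title, year?, (isbn | issn)
GRAMMAR = {
    "library": [("star", "book")],
    "book": [("one", "title"), ("opt", "year"), ("choice", ["isbn", "issn"])],
    "title": [], "year": [], "isbn": [], "issn": [],
}

def encode(node, out):
    """Validate node = (name, children) and append its choice symbols to out."""
    name, children = node
    pos = 0
    for kind, arg in GRAMMAR[name]:
        nxt = children[pos][0] if pos < len(children) else None
        if kind == "one":                      # forced element: no symbol needed
            if nxt != arg:
                raise ValueError(f"expected <{arg}> inside <{name}>")
            encode(children[pos], out)
            pos += 1
        elif kind == "opt":                    # 1 = present, 0 = absent
            present = nxt == arg
            out.append(int(present))
            if present:
                encode(children[pos], out)
                pos += 1
        elif kind == "star":                   # 1 before each repetition, 0 to stop
            while pos < len(children) and children[pos][0] == arg:
                out.append(1)
                encode(children[pos], out)
                pos += 1
            out.append(0)
        elif kind == "choice":                 # index of the chosen alternative
            if nxt not in arg:
                raise ValueError(f"expected one of {arg} inside <{name}>")
            out.append(arg.index(nxt))
            encode(children[pos], out)
            pos += 1
    if pos != len(children):
        raise ValueError(f"unexpected content inside <{name}>")
    return out

def decode(name, symbols):
    """Rebuild the element tree for <name> from an iterator of choice symbols."""
    children = []
    for kind, arg in GRAMMAR[name]:
        if kind == "one":
            children.append(decode(arg, symbols))
        elif kind == "opt":
            if next(symbols):
                children.append(decode(arg, symbols))
        elif kind == "star":
            while next(symbols):
                children.append(decode(arg, symbols))
        elif kind == "choice":
            children.append(decode(arg[next(symbols)], symbols))
    return (name, children)

doc = ("library", [
    ("book", [("title", []), ("year", []), ("isbn", [])]),
    ("book", [("title", []), ("issn", [])]),
])
codes = encode(doc, [])
print(codes)                                   # [1, 1, 0, 1, 0, 1, 0]
print(decode("library", iter(codes)) == doc)   # True
```

In this sketch the structure of eight elements is reduced to seven small symbols, and the
title elements cost nothing because the content model makes them mandatory. A real compressor
would additionally entropy-code these symbols together with the text content.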