24-01-2011, 09:33 AM
Greg Leighton, Jim Diamond, Tomasz Müldner
February 18, 2005
Overview
A (brief) introduction to data compression
XML lossless data compression
New XML Compression Programs
AXECHOP
TREECHOP
XML Data Compression
A (brief) introduction to XML
Techniques for achieving XML compression
Traditional approaches – Huffman, LZ
Specialized approaches
XML Compression Programs
XMill
XGrind
XPRESS
eXtensible Markup Language
separate syntax from semantics
support semi-structured data
support internationalization and platform independence
is self-describing (through labeling of the tree)
eXtensible Markup Language : 2
XML is a framework for defining markup languages:
no fixed collection of markup tags
each XML language is specialized for its own application domain
a common set of generic tools supports processing documents
XML: textual convention to represent tagged trees
eXtensible Markup Language : 3
Code:
<?xml version="1.0" encoding="UTF-8"?>
<Employees>
  <Employee id="123456">
    <Name>Homer Simpson</Name>
    <Department>Sector 7-G</Department>
  </Employee>
  <Employee id="123457">
    <Name>Frank Grimes</Name>
    <Department>Sector 7-G</Department>
  </Employee>
  …
</Employees>
Attribute: id="123456"
Data value: Homer Simpson
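To make the "tagged tree" view concrete, here is a minimal Python sketch that loads the sample document and lists each element with its attributes and data value. The function name describe is made up for illustration, and the elided records are simply left out.

```python
import xml.etree.ElementTree as ET

# The sample document from the slide (the elided records are left out).
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<Employees>
  <Employee id="123456">
    <Name>Homer Simpson</Name>
    <Department>Sector 7-G</Department>
  </Employee>
  <Employee id="123457">
    <Name>Frank Grimes</Name>
    <Department>Sector 7-G</Department>
  </Employee>
</Employees>"""

def describe(doc):
    """Return (tag, attributes, data value) for every element, in document order."""
    root = ET.fromstring(doc.encode("utf-8"))
    rows = []
    for elem in root.iter():
        text = (elem.text or "").strip()  # whitespace between tags is not data
        rows.append((elem.tag, dict(elem.attrib), text))
    return rows

rows = describe(DOC)
for tag, attrs, text in rows:
    print(tag, attrs, repr(text))
```

Inner nodes such as Employee carry attributes but no data value of their own; the data values sit in the leaves (Name, Department).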
eXtensible Markup Language : 4
Correctness of an XML document:
Well-formed: complies with XML syntax
Valid: well-formed, and also obeys the structure described in a grammar, such as a DTD or an XML Schema document
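A rough sketch of the well-formedness half of this distinction, using Python's standard ElementTree parser. The helper name is_well_formed is an assumption for illustration. Checking validity would additionally need a grammar and a validating parser, which is not shown here.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as XML, i.e. it is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<Name>Homer Simpson</Name>"))  # matching start and end tag
print(is_well_formed("<Name>Homer Simpson</Nmae>"))  # mismatched end tag
print(is_well_formed("<Name>Homer<b></Name></b>"))   # improperly nested tags
```

A document can pass this check and still be invalid, for example if it uses an element the grammar does not allow.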
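Two kinds of XML parsers:
SAX: event-based; reports start tags, end tags and text while the document is being read
DOM: builds the whole document as a tree in memory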
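The difference between the two parser styles can be sketched with Python's standard library (xml.sax and xml.dom.minidom). The class name EventRecorder and the one-record document are assumptions made for this example.

```python
import xml.sax
from xml.dom import minidom

DOC = b"<Employees><Employee id='123456'><Name>Homer Simpson</Name></Employee></Employees>"

class EventRecorder(xml.sax.ContentHandler):
    """SAX: the parser calls back once per event; no tree is kept."""
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

handler = EventRecorder()
xml.sax.parseString(DOC, handler)
print(handler.events)

# DOM: the whole document becomes a tree that can be navigated in any direction.
dom = minidom.parseString(DOC)
name = dom.getElementsByTagName("Name")[0]
print(name.firstChild.data, name.parentNode.getAttribute("id"))
```

The SAX style suits a compressor that processes its input in one pass; the DOM style is convenient when the whole tree has to be inspected first.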
Why Compress XML?
XML is verbose:
Each non-empty element tag must end with a matching closing tag -- <tag>data</tag>
Ordering of tags is often repeated in a document (e.g. multiple records)
Tag names are often long
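As a rough illustration of this verbosity, the sketch below builds a document of 100 repeated records and estimates what share of its characters is markup rather than data. The record template and the function name markup_share are made up for this example, and the count is only an approximation (it ignores entity escaping).

```python
import xml.etree.ElementTree as ET

# Hypothetical record, modelled on the Employees example.
RECORD = (
    '<Employee id="{id}">'
    "<Name>Homer Simpson</Name>"
    "<Department>Sector 7-G</Department>"
    "</Employee>"
)

def markup_share(doc):
    """Fraction of the characters in `doc` that are not data values."""
    root = ET.fromstring(doc)
    data = 0
    for elem in root.iter():
        data += len(elem.text or "")                       # element content
        data += len(elem.tail or "")                       # text after the end tag
        data += sum(len(v) for v in elem.attrib.values())  # attribute values
    return 1 - data / len(doc)

records = "".join(RECORD.format(id=123456 + i) for i in range(100))
doc = "<Employees>" + records + "</Employees>"
print(f"{markup_share(doc):.0%} of the characters are markup")
```

For record-like data of this kind, well over half of the characters are tags and attribute names, and the same tag sequence repeats in every record.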
XML Compressors
View XML as a tree
Separate the tree structure and what is stored in leaves
Save the tree structure so that it can be restored
The compressed file may or may not remember the tree structure
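The steps above can be sketched in a few lines of Python: split a document into a structure skeleton and a list of leaf values, then restore it from the two parts. This is only an illustration under simplifying assumptions (no attributes, no mixed content, no empty elements); split and restore are hypothetical helper names, not part of any of the compressors discussed.

```python
import xml.etree.ElementTree as ET

def split(elem, leaves):
    """Return a nested (tag, children) skeleton; append leaf text to `leaves`."""
    children = [split(child, leaves) for child in elem]
    if not children:
        leaves.append(elem.text or "")
    return (elem.tag, children)

def restore(skeleton, leaves):
    """Rebuild the element tree from a skeleton and an iterator over the leaves."""
    tag, children = skeleton
    elem = ET.Element(tag)
    if children:
        for child in children:
            elem.append(restore(child, leaves))
    else:
        elem.text = next(leaves)  # leaves come back in document order
    return elem

doc = "<Book><Title>Views</Title><Author>Miller</Author><Author>Tai</Author></Book>"
leaves = []
skeleton = split(ET.fromstring(doc), leaves)
print(skeleton)
print(leaves)

rebuilt = ET.tostring(restore(skeleton, iter(leaves)), encoding="unicode")
print(rebuilt == doc)
```

Once the two parts are separate, each can be compressed with a method that suits it: the skeleton is small and repetitive, while the leaves are ordinary text or numbers.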
XMill: Liefke and Suciu
Tree structure:
Start tags and attribute names are dictionary-encoded (as T1, T2, etc.)
End tags replaced with ‘/’ token
Data values are replaced with their container number
Code:
<Book><Title lang="English">Views</Title>
<Author>Miller</Author>
<Author>Tai</Author>
</Book>
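A simplified sketch of this encoding applied to the Book example. It is not XMill's actual file format: it assumes one container per tag or attribute name, numbered like the dictionary entry, also closes each attribute with a '/' token, and ignores mixed content. The function name encode is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<Book><Title lang="English">Views</Title>
<Author>Miller</Author>
<Author>Tai</Author>
</Book>"""

def encode(doc):
    """Return (tokens, dictionary, containers) for an XML string."""
    dictionary = {}  # tag or @attribute name -> number
    containers = {}  # container number -> list of data values
    tokens = []

    def number(name):
        return dictionary.setdefault(name, len(dictionary) + 1)

    def store(n, value):
        containers.setdefault(n, []).append(value)
        tokens.append(f"C{n}")  # the value is replaced by its container number

    def visit(elem):
        n = number(elem.tag)
        tokens.append(f"T{n}")
        for attr, value in elem.attrib.items():
            a = number("@" + attr)
            tokens.append(f"T{a}")
            store(a, value)
            tokens.append("/")  # closes the attribute (assumption of this sketch)
        text = (elem.text or "").strip()
        if text:
            store(n, text)
        for child in elem:
            visit(child)
        tokens.append("/")      # replaces the end tag

    visit(ET.fromstring(doc))
    return tokens, dictionary, containers

tokens, dictionary, containers = encode(DOC)
print(" ".join(tokens))
print(dictionary)
print(containers)
```

Both Author values end up in the same container, so similar data is stored together and can be compressed together, separately from the token stream.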