University of Leicester

informatics

Compressed indexable XML representation of astronomical data

Source of Funding: PPARC
Duration: March 1, 2004 - Feb 28, 2007
Principal Investigator (Computer Science): Rajeev Raman
Principal Investigators (Physics and Astronomy): Clive Page, Tony Linde and Mike Watson.


Overview


e-Science presents computer scientists with new challenges in handling huge volumes of data. The student allocated to this project will work closely with people involved in the AstroGrid project. The project is concerned with the efficient storage and processing of the large XML files that arise in the context of the International Virtual Observatory Alliance (IVOA). VOTable is an XML-based astronomical data format developed by the IVOA for tables and (later) images.

Unfortunately, XML-based files are larger than their binary equivalents (such as FITS), and network bandwidth will be a scarce resource for the Virtual Observatory. Different VOTable encodings allow trade-offs between efficiency and ease of parsing. Even within the XML community at large there is growing concern that inefficiency arising from document size will hinder the adoption and use of XML. A few XML-specific approaches can compress XML files better than generic algorithms such as gzip. However, compression ratios can vary greatly (from 3:1 to 66:1) on different kinds of data. One issue, then, is to understand the characteristics of astronomical XML files and to invent or discover a compression method suited to these files.
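As a rough illustration of what a compression ratio means here, the following minimal Python sketch measures the ratio achieved by gzip on a small synthetic document. The element names follow the VOTable TABLEDATA layout, but the helper name compression_ratio and all the values are invented for illustration; ratios on real astronomical files may differ considerably.

```python
import gzip


def compression_ratio(data: bytes) -> float:
    """Return the original size divided by the gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


# Hypothetical VOTable-like fragment: the values are synthetic, not real data.
rows = "".join(
    f"<TR><TD>{i}</TD><TD>{10.0 + i * 0.25:.2f}</TD></TR>" for i in range(1000)
)
sample = (
    "<VOTABLE><RESOURCE><TABLE><DATA><TABLEDATA>"
    f"{rows}"
    "</TABLEDATA></DATA></TABLE></RESOURCE></VOTABLE>"
).encode("utf-8")

ratio = compression_ratio(sample)
print(f"{len(sample)} bytes, gzip ratio {ratio:.1f}:1")
```

Running the same function over different kinds of files is one simple way to see how strongly the ratio depends on the data.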

Although compressed files are much smaller, their contents become inaccessible until they are uncompressed. Indeed, without decompression it would be impossible to support even the most rudimentary approaches to searching a compressed XML database, such as searching for sections that match an XPath expression. Thus, another issue is to develop a compressed file representation that supports sequential searching through the file for the necessary structural and semantic components.
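For contrast, the baseline with a generic compressor can be sketched as follows: the gzip stream is decompressed incrementally while an XML parser scans it for matching elements. This is not the representation the project aims to develop. It is only an assumed baseline, with a hypothetical function name find_in_gzipped_xml, and it still has to decompress everything from the start of the file.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def find_in_gzipped_xml(compressed: bytes, tag: str):
    """Yield the text of each <tag> element, decompressing incrementally."""
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
        # iterparse reads the decompressed stream in chunks, so the full
        # document is never materialised as a single string.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == tag:
                yield elem.text
            elem.clear()  # drop content that has already been scanned


# Hypothetical VOTable-like fragment with invented values.
doc = (
    b"<TABLEDATA>"
    b"<TR><TD>1</TD><TD>2.5</TD></TR>"
    b"<TR><TD>2</TD><TD>3.5</TD></TR>"
    b"</TABLEDATA>"
)
cells = list(find_in_gzipped_xml(gzip.compress(doc), "TD"))
print(cells)
```

The search cost in this baseline grows with the size of the whole uncompressed document, which is the overhead a searchable compressed representation would try to reduce.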

As XML-based formats such as VOTable become the norm for the extraction of data from astronomical archives, XML is likely to follow FITS in being used not only for data interchange but also for data storage. XML-based databases will therefore assume increased importance. However, current XML technology is not efficient enough to scale well. A final issue is to develop a compressed in-memory representation that supports complex queries.


Outcomes


PhD Thesis

O. Delpratt
Space Efficient In-Memory Representation of XML Documents.
PhD Thesis, 2008, University of Leicester.

Software

The software developed as part of this project is now being taken further in the SiXML project.

Papers

  1. O. Delpratt, R. Raman and N. Rahman.
    Engineering Succinct DOM.
    In Proceedings of the 11th International Conference on Extending Database Technology (EDBT), Nantes, France, March 25--29, 2008.
    DOI: 10.1145/1353343.1353354
  2. O. Delpratt, N. Rahman and R. Raman.
    Compressed Prefix Sums.
    In Proceedings of the 33rd International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2007), Harrachov, Czech Republic, January 20--26, 2007. Springer Lecture Notes in Computer Science, vol. 4362, pp. 235-247.
    DOI: 10.1007/978-3-540-69507-3_19
  3. O. Delpratt, N. Rahman and R. Raman.
    Engineering the LOUDS Succinct Tree Representation.
    In Proceedings of the 5th International Workshop on Experimental Algorithms (WEA 2006), Cala Galdana, Menorca, Spain, May 24--27, 2006. Springer Lecture Notes in Computer Science, vol. 4007, pp. 134-145.
    DOI: 10.1007/11764298_12

Poster Presentations

Author: Rajeev Raman (r.raman at mcs.le.ac.uk), T: +44 (0)116 252 3894.
University of Leicester June 2003. Last modified: 27th May 2010, 15:04:12.