Summary: Generating Synthetic Complex-structured XML Data
Ashraf Aboulnaga Jeffrey F. Naughton Chun Zhang
Computer Sciences Department
University of Wisconsin - Madison
Synthetically generated data has always been important for evaluating and understanding new ideas in database research.
In this paper, we describe a data generator that produces synthetic complex-structured XML data and allows a high
level of control over the characteristics of the generated data. This data generator is certainly not the ultimate solution
to the problem of generating synthetic XML data, but we have found it very useful in our research on XML data
management, and we believe that it can also be useful to other researchers. Furthermore, we hope that this paper starts
a discussion in the XML community about characterizing and generating XML data, and that it may serve as a first step
towards developing a commonly accepted XML data generator for our community.
Synthetically generated data is very useful in evaluating and understanding new ideas in database research. For example,
research on relational databases often uses synthetic data from the Wisconsin benchmark [DeW93], TPC-C [TPCC], or
TPC-H [TPCH], and research on object-oriented databases often uses synthetic data from the OO7 benchmark [CDN93].
Synthetic data generators allow us to generate large volumes of data with well-understood characteristics. We can
easily vary the characteristics of the generated data by varying the input parameters of the data generator. This allows us
to systematically cover much more of the space of possible data sets than we could by relying solely on real data over which we have