Using Data Mining Techniques to Learn Layouts of Flat-File Biological Datasets
Kaushik Sinha, Xuan Zhang, Ruoming Jin, Gagan Agrawal
Department of Computer Science and Engineering
Ohio State University
Columbus, OH, 43210
{sinhak,zhangx,jinr,agrawal}@cse.ohio-state.edu
Abstract
One of the major problems in biological data integration is
that many data sources are stored as flat files with a variety
of different layouts. Integrating data from such sources can
be an extremely time-consuming task.
We have been developing data mining techniques to help
learn the layout of a dataset semi-automatically. In
this paper, we focus on the problem of identifying delimiters
for optional fields. Because these fields do not occur in every
record, frequency-based methods cannot identify the
corresponding delimiters. We present a method that uses
contrast analysis on the frequency of sequences to identify
such delimiters and help complete the layout descriptions.
We demonstrate the effectiveness of this technique using three
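
To make the idea of contrasting sequence frequencies concrete, here is a minimal Python sketch. It is not the authors' algorithm, only one plausible reading of the abstract under simplifying assumptions: each record is a multi-line string, a delimiter is the first whitespace-separated token on a line, and "contrast" is the share of a token's occurrences that fall at the start of a line. The function name `optional_delimiters`, its thresholds, and the sample records are all hypothetical.

```python
from collections import Counter


def optional_delimiters(records, min_contrast=0.9, min_count=2,
                        mandatory_support=0.95):
    """Return line-leading tokens that look like optional-field delimiters.

    Assumption (not from the paper): a token is delimiter-like when almost
    all of its occurrences are at the start of a line, and it is optional
    when it leads a line in fewer than `mandatory_support` of the records.
    """
    lead = Counter()     # times a token is the first token on a line
    total = Counter()    # times a token occurs anywhere
    support = Counter()  # number of records where the token leads a line

    for record in records:
        leaders = set()
        for line in record.splitlines():
            tokens = line.split()
            if not tokens:
                continue
            lead[tokens[0]] += 1
            leaders.add(tokens[0])
            total.update(tokens)
        support.update(leaders)

    found = []
    for token, count in lead.items():
        contrast = count / total[token]          # line-start share
        fraction = support[token] / len(records)  # record coverage
        if (count >= min_count and contrast >= min_contrast
                and fraction < mandatory_support):
            found.append(token)
    return sorted(found)


# Hypothetical flat-file records: ID, DE, SQ occur in every record,
# CC occurs in only two, and "kinase" sometimes starts a wrapped line.
records = [
    "ID P1\nDE kinase binding protein\nSQ MKV",
    "ID P2\nDE putative\nkinase regulator\nCC kinase assay pending\nSQ MAA",
    "ID P3\nDE kinase\nSQ MKK",
    "ID P4\nDE membrane\nkinase anchor\nCC reviewed\nSQ MLL",
]

print(optional_delimiters(records))
```

In this toy input, a pure frequency threshold would miss `CC` because it appears in only half of the records. The word `kinase` also starts two lines, but most of its occurrences are mid-line, so its contrast is low and it is rejected.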

  

Source: Agrawal, Gagan - Department of Computer Science and Engineering, Ohio State University
