Structured-Content Extraction from the Web for Bibliographic Reference Generation

Ramon Xuriguera, Marta Arias
UPC Barcelona Tech
Abstract. In this paper we present a system that automatically creates bibliographic
indexes from a collection of PDF files by using the file contents to
search the Web and later extract the information from the resulting pages. We
pay special attention to the techniques used for extracting this data as well as the
automatic generation of extraction rules and their evaluation.
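As a rough illustration of the extraction step mentioned in the abstract, the sketch below applies a set of extraction rules to a result page and renders the matched fields as a BibTeX entry. It is a minimal sketch under stated assumptions, not the system described in the paper: the rule format (one regular expression per field), the HTML class names, the sample page, and the function names extract_fields and to_bibtex are all hypothetical, and the rules are written by hand here whereas the paper generates them automatically.

```python
import re

# Hypothetical extraction rules: one regular expression per bibliographic field.
# The paper generates its rules automatically; these are hand-written stand-ins.
RULES = {
    "title": re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
    "author": re.compile(r'<span class="author">(.*?)</span>', re.S),
    "year": re.compile(r'<span class="year">(\d{4})</span>'),
}


def extract_fields(html):
    """Apply every rule to the page and collect the fields that matched."""
    fields = {}
    for name, pattern in RULES.items():
        matches = [m.strip() for m in pattern.findall(html)]
        if matches:
            # Multiple authors are joined with " and ", as BibTeX expects.
            fields[name] = " and ".join(matches) if name == "author" else matches[0]
    return fields


def to_bibtex(key, fields):
    """Render the extracted fields as a BibTeX @article entry."""
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in sorted(fields.items()))
    return f"@article{{{key},\n{body}\n}}"


# Made-up result page, standing in for a page returned by a Web search.
SAMPLE_PAGE = """
<h1 class="title">An Example Article</h1>
<span class="author">A. Author</span>
<span class="author">B. Writer</span>
<span class="year">2010</span>
"""

if __name__ == "__main__":
    print(to_bibtex("example2010", extract_fields(SAMPLE_PAGE)))
```

Keeping the rules as data rather than code means a rule that stops matching can be replaced without touching the extraction logic, which is one plausible reason to generate and evaluate rules separately.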
1 Introduction
Working on a research project usually means spending large amounts of time reading related
publications, mostly as PDF files. Once the research is done, researchers
have to generate bibliographic indexes from these articles, which can be a tedious
and time-consuming task, even when using existing tools. We believe that this task could be
automated to a large extent. In fact, the subject of this paper is a prototype application that


Source: Arias, Marta - Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya


Collections: Computer Technologies and Information Sciences