Summary: Scaling Information Extraction to Large Document Collections
Eugene Agichtein
Microsoft Research
Information extraction and text mining applications are just beginning to tap the immense amounts of
valuable textual information available online. To extract information from millions and, in some
cases, billions of documents, different solutions to scalability have emerged. We review key approaches for
scaling up information extraction, including using general-purpose search engines as well as indexing
techniques specialized for information extraction applications. Scalable information extraction is an
active area of research, and we highlight some of the opportunities and challenges in this area that are
relevant to the database community.
1 Overview
Text documents convey valuable structured information. For example, medical literature contains information
about new treatments for diseases. Similarly, news archives contain information useful to analysts tracking
financial transactions, or to government agencies that monitor infectious disease outbreaks. All this information
could be managed and queried more easily if represented in a structured form. This task is typically called
information extraction. More specifically, information extraction systems can identify particular types of entities
(e.g., person names, locations, organizations, or even drug and disease names) and relationships between entities
(e.g., employees of organizations or adverse interactions between medical drugs) in natural language text. In this


Source: Agichtein, Eugene - Department of Mathematics and Computer Science, Emory University


Collections: Computer Technologies and Information Sciences