The Tanl Pipeline
Giuseppe Attardi, Stefano Dei Rossi, Maria Simi
Dipartimento di Informatica, Università di Pisa
Largo B. Pontecorvo 3, I-56127 Pisa, Italy
E-mail: attardi@di.unipi.it, deirossi@di.unipi.it, simi@di.unipi.it
Abstract
Tanl (Natural Language Text Analytics) is a suite of tools for text analytics based on the software architecture paradigm of data pipelines. Tanl pipelines are data driven, i.e. each stage pulls data from the preceding stage and transforms it for use by the next stage. Since data is processed as soon as it becomes available, processing delay is minimized, improving throughput. The processing modules can be written in C++ or in Python and can be combined with a few lines of Python script to produce full NLP applications. Tanl provides a set of modules ranging from tokenization to POS tagging, and from parsing to NE recognition. A Tanl pipeline can be processed in parallel on a cluster of computers by means of a modified version of Hadoop streaming. We present the architecture, its modules and some sample applications.
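
As a rough illustration of the pull-based style the abstract describes, the following minimal Python sketch chains stages with generators, each stage pulling data from the preceding one as it becomes available. The stage names and the tagging logic are illustrative placeholders, not the actual Tanl modules or API.

    # Minimal sketch of a pull-based pipeline using Python generators.
    # Stage names and logic are placeholders, not the real Tanl API.

    def tokenizer(lines):
        # Pull lines from the previous stage and yield tokens one at a time.
        for line in lines:
            for token in line.split():
                yield token

    def pos_tagger(tokens):
        # Pull tokens and yield (token, tag) pairs; tagging is a toy placeholder.
        for token in tokens:
            tag = "NUM" if token.isdigit() else "WORD"
            yield (token, tag)

    def sink(tagged):
        # Final stage: consume the pipeline output.
        for token, tag in tagged:
            print(token + "\t" + tag)

    if __name__ == "__main__":
        text = ["Tanl pipelines are data driven",
                "each stage pulls data from the preceding one"]
        # Composing the stages takes only a few lines of Python script.
        sink(pos_tagger(tokenizer(text)))

Because each stage yields items as soon as they are produced, downstream stages start working before the upstream input is exhausted, which is the throughput benefit the abstract refers to.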
Introduction
Text analytics involves many tasks, ranging from simple text collection, extraction and preparation, to linguistic, syntactic and semantic analysis, cross-reference analysis, intent mining, and finally indexing and search. A complete system must be able to process textual data of any size and