Unary Data Structures for Language Models
Jeffrey Sorensen, Cyril Allauzen

Google, Inc., 76 Ninth Avenue, New York, NY 10011
Language models are important components of speech recognition
and machine translation systems. Trained on billions of words,
and consisting of billions of parameters, language models are
often the single largest components of these systems. Many
techniques have been proposed to reduce the storage requirements
of language models. A technique based upon pointer-free compact
storage of ordinal trees shows compression competitive with the
best proposed systems, while retaining the full finite state
structure, and without using computationally expensive block
compression schemes or lossy quantization.
Index Terms: n-gram language models, unary data structures
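
To make "pointer-free compact storage of ordinal trees" concrete, the following is a minimal sketch of one well-known unary encoding, the level-order unary degree sequence (LOUDS). The abstract does not say which encoding the authors use, so the choice of LOUDS, the toy n-gram trie, and every function name here are illustrative assumptions. Each node contributes its child count in unary (that many 1 bits followed by a 0), so a tree of n nodes takes 2n - 1 bits and no pointers. The rank and select helpers are naive linear scans written for clarity; practical implementations answer them in constant time with small auxiliary indexes.

```python
from collections import deque

def louds_encode(root):
    """Encode a tree of (label, children) tuples in breadth-first order.

    Returns (bits, labels): bits holds one unary degree per node,
    labels holds the node labels in the same breadth-first order.
    """
    bits, labels = [], []
    queue = deque([root])
    while queue:
        label, kids = queue.popleft()
        labels.append(label)
        bits.extend([1] * len(kids))  # one 1 bit per child
        bits.append(0)                # 0 terminates this node's degree
        queue.extend(kids)
    return bits, labels

def rank(bits, value, pos):
    """Count occurrences of value in bits[:pos] (naive scan)."""
    return sum(1 for b in bits[:pos] if b == value)

def select(bits, value, k):
    """Position of the k-th (1-indexed) occurrence of value (naive scan)."""
    seen = 0
    for pos, b in enumerate(bits):
        if b == value:
            seen += 1
            if seen == k:
                return pos
    raise IndexError("fewer than k occurrences")

def children(bits, v):
    """Breadth-first numbers of the children of node v (root is 0)."""
    pos = 0 if v == 0 else select(bits, 0, v) + 1
    out = []
    while bits[pos] == 1:
        out.append(rank(bits, 1, pos + 1))  # the i-th 1 bit names node i
        pos += 1
    return out

def parent(bits, v):
    """Breadth-first number of the parent of node v, or None for the root."""
    if v == 0:
        return None
    return rank(bits, 0, select(bits, 1, v))

# Hypothetical toy trie of n-gram contexts, for illustration only.
trie = ("<root>", [("the", [("cat", []), ("dog", [])]),
                   ("a", [("cat", [])])])
bits, labels = louds_encode(trie)
```

In this sketch a node is identified only by its breadth-first number, so word labels (and, in a language model, probabilities and back-off weights) can sit in plain arrays indexed by that number, while parent and child navigation is recovered from the bit string alone.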
1. Introduction
Models of language constitute one of the largest components of
contemporary speech recognition and machine translation sys-


Source: Allauzen, Cyril - Computer Science Department, Courant Institute of Mathematical Sciences, New York University

