Title: Towards a semantic lexicon for biological language processing

Conference
DOI: https://doi.org/10.1002/cfg.451 · OSTI ID: 977640

It is well understood that natural language processing (NLP) applications require sophisticated lexical resources to support their processing goals. In the biomedical domain, we are privileged to have access to extensive terminological resources in the form of controlled vocabularies and ontologies, which have been integrated into the Metathesaurus of the National Library of Medicine's Unified Medical Language System (UMLS). However, the existence of such terminological resources does not guarantee their utility for NLP. In particular, beyond the basic enumeration of important domain terms, we have two core requirements for lexical resources for NLP: (1) representation of morphosyntactic information about those terms, specifically part-of-speech information and inflectional patterns, to support parsing and lemma assignment; and (2) representation of semantic information indicating general categorical information about terms and significant relations between terms, to support text understanding and inference (Hahn et al., 1999). Biomedical vocabularies by and large leave out morphosyntactic information, and where they address semantic considerations, they often do so in an unprincipled manner, for instance by indicating a relation between two concepts without indicating the type of that relation. But all is not lost. The UMLS knowledge sources include two additional relevant resources: the SPECIALIST lexicon, which addresses our morphosyntactic requirements, and the Semantic Network, a representation of core conceptual categories in the biomedical domain. The coverage of these two knowledge sources with respect to the full coverage of the Metathesaurus is, however, not entirely clear. Furthermore, when our goal is specifically to process biological text, and often more specifically text in the molecular biology domain, it is difficult to say whether the coverage of these resources is meaningful.
The utility of the UMLS knowledge sources for medical language processing (MLP) has been explored (Johnson, 1999; Friedman et al., 2001); the time has now come to repeat these experiments with respect to biological language processing (BLP). To that end, this paper presents an analysis of the UMLS resources, specifically with an eye towards constructing lexical resources suitable for BLP. We follow the paradigm presented in Johnson (1999) for medical language, exploring the overlap between the UMLS Metathesaurus and the SPECIALIST lexicon to construct a morphosyntactically and semantically specified lexicon, and then further explore the overlap of that lexicon with a relevant domain corpus for molecular biology.
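
The overlap analysis described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, purely for illustration, that the Metathesaurus can be reduced to a mapping from term to semantic category, and the SPECIALIST lexicon to a mapping from term to part of speech and inflection class. The toy entries, the category and inflection labels, and the names `LexEntry`, `build_lexicon`, and `corpus_coverage` are all hypothetical and do not reflect the actual UMLS file formats.

```python
from dataclasses import dataclass

# Hypothetical toy data: term -> semantic category (stand-in for the Metathesaurus).
METATHESAURUS = {
    "protein kinase": "Enzyme",
    "apoptosis": "Cell Function",
    "aspirin": "Pharmacologic Substance",
}

# Hypothetical toy data: term -> (part of speech, inflection class)
# (stand-in for the SPECIALIST lexicon; labels are placeholders).
SPECIALIST = {
    "protein kinase": ("noun", "regular"),
    "apoptosis": ("noun", "uncountable"),
    "bind": ("verb", "irregular"),
}


@dataclass(frozen=True)
class LexEntry:
    """One lexicon entry carrying both morphosyntactic and semantic information."""
    term: str
    pos: str
    inflection: str
    semantic_type: str


def build_lexicon(meta, spec):
    """Keep only terms present in both resources and merge their information."""
    shared = meta.keys() & spec.keys()
    return {
        term: LexEntry(term, spec[term][0], spec[term][1], meta[term])
        for term in sorted(shared)
    }


def corpus_coverage(lexicon, corpus_terms):
    """Fraction of distinct (lower-cased) corpus terms found in the lexicon."""
    distinct = {t.lower() for t in corpus_terms}
    if not distinct:
        return 0.0
    return len(lexicon.keys() & distinct) / len(distinct)


if __name__ == "__main__":
    lexicon = build_lexicon(METATHESAURUS, SPECIALIST)
    for entry in lexicon.values():
        print(entry)
    # Terms as they might be extracted from a molecular biology corpus (made up).
    corpus = ["apoptosis", "bind", "Apoptosis", "caspase"]
    print(f"coverage: {corpus_coverage(lexicon, corpus):.2f}")
```

Exact string matching on lower-cased terms is the simplest possible overlap criterion. A real analysis would also need to decide how to treat inflectional variants, multi-word terms, and ambiguous entries.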

Research Organization:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE
OSTI ID:
977640
Report Number(s):
LA-UR-04-3190; TRN: US201012%%622
Resource Relation:
Journal Volume: 6; Journal Issue: 1-2; Conference: Submitted to: ISMB BioLINK, Glasgow, Scotland, July 29, 2004
Country of Publication:
United States
Language:
English

References (3)

Gene Ontology: tool for the unification of biology journal May 2000
How knowledge drives understanding—matching medical ontologies with the needs of medical language processing journal January 1999
A Semantic Lexicon for Medical Language Processing journal May 1999

Cited By (4)

Identifying named entities from PubMed® for enriching semantic categories journal February 2015
UMLS content views appropriate for NLP processing of the biomedical literature vs. clinical text journal August 2010
The BioLexicon: a large-scale terminological resource for biomedical text mining journal October 2011
Ontology quality assurance through analysis of term transformations journal May 2009