Summary: Solving the protein sequence metric problem
William R. Atchley*§, Jieping Zhao¶, Andrew D. Fernandes¶, and Tanja Drüke
*Department of Genetics, ¶Bioinformatics Research Center, Graduate Program in Biomathematics, and Center for Computational Biology, North Carolina
State University, Raleigh, NC 27695-7614; and Faculty of Technology, Bielefeld University, D-33501 Bielefeld, Germany
Edited by Walter M. Fitch, University of California, Irvine, CA, and approved March 22, 2005 (received for review December 14, 2004)
Biological sequences are composed of long strings of alphabetic
letters rather than arrays of numerical values. Lack of a natural
underlying metric for comparing such alphabetic data significantly
inhibits sophisticated statistical analyses of sequences, modeling
structural and functional aspects of proteins, and related problems.
Herein, we use multivariate statistical analyses on almost 500
amino acid attributes to produce a small set of highly interpretable
numeric patterns of amino acid variability. These high-dimensional
attribute data are summarized by five multidimensional patterns
of attribute covariation that reflect polarity, secondary structure,
molecular volume, codon diversity, and electrostatic charge. Numerical
scores for each amino acid then transform amino acid


