HIV classification using coalescent theory
- Los Alamos National Laboratory
Algorithms for subtype classification and breakpoint detection of HIV-1 sequences are based on a classification system of HIV-1. Hence, their quality depends heavily on this system. Because of the way the current HIV-1 nomenclature came about, it contains inconsistencies such as the following: the phylogenetic distance between subtypes B and D is remarkably small compared with other pairs of subtypes, and in fact is closer to the distance between a pair of sub-subtypes (Robertson et al., 2000); subtypes E and I no longer exist, since they were discovered to be composed of recombinants (Robertson et al., 2000); it is currently under discussion whether, instead of CRF02 being a recombinant of subtypes A and G, subtype G should be designated as a circulating recombinant form (CRF) and CRF02 as a subtype (Abecasis et al., 2007); and there are 8 complete and over 400 partial HIV genomes in the LANL database that belong neither to a subtype nor to a CRF (denoted by U). Moreover, the current classification system is somewhat arbitrary, like all complex classification systems that were created manually. It is therefore desirable to deduce the classification system of HIV systematically by an algorithm. Of course, this problem is not restricted to HIV, but applies to all fast-mutating and recombining viruses. Our work addresses the simpler subproblem of scoring classifications of given input sequences of some virus species (a classification denotes a partition of the input sequences into several subtypes and CRFs). To this end, we reconstruct ancestral recombination graphs (ARGs) of the input sequences under restrictions determined by the given classification. These restrictions are imposed to ensure that the reconstructed ARGs do not contradict the classification under consideration. Then, we find the ARG with maximal probability by means of Markov chain Monte Carlo methods. The probability of the most probable ARG is interpreted as a score for the classification.
To our knowledge, this particular problem has not been addressed so far. The software package Lamarc (Kuhner et al., 2000) allows for sampling ARGs, but it assumes that recombination events involve only one breakpoint. However, HIV recombinants usually have more than one breakpoint. Moreover, Lamarc does not perform explicit breakpoint detection, but tries to find breakpoints by chance. Although this approach is suitable for most situations, it will not lead to satisfactory results for highly recombining viruses with multiple breakpoints.
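The scoring idea in the abstract (search a space of candidate structures by Markov chain Monte Carlo and report the probability of the most probable one as the score) can be illustrated with a minimal generic sketch. It is not the authors' implementation: the function `metropolis_best`, its parameters, and the toy integer state space are all hypothetical, and a real application would replace the toy state with an ARG constrained by the classification and supply a coalescent-based log-probability. The proposal is assumed to be symmetric, so the plain Metropolis acceptance rule applies.

```python
import math
import random


def metropolis_best(initial, log_prob, propose, steps, rng):
    """Run a Metropolis chain and return the highest-scoring state visited.

    Assumes `propose` is symmetric (hypothetical helper, not from the report).
    """
    current = initial
    current_lp = log_prob(current)
    best, best_lp = current, current_lp
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_lp = log_prob(candidate)
        # Accept with probability min(1, p(candidate) / p(current)).
        if candidate_lp >= current_lp or rng.random() < math.exp(candidate_lp - current_lp):
            current, current_lp = candidate, candidate_lp
            if current_lp > best_lp:
                best, best_lp = current, current_lp
    return best, best_lp


def toy_log_prob(x):
    # Toy stand-in for the log-probability of a candidate structure:
    # an unnormalised target peaked at x = 3.
    return -float((x - 3) ** 2)


def toy_propose(x, rng):
    # Symmetric random-walk proposal on the integers.
    return x + rng.choice((-1, 1))


best_state, best_score = metropolis_best(
    initial=0,
    log_prob=toy_log_prob,
    propose=toy_propose,
    steps=2000,
    rng=random.Random(0),
)
```

In this sketch `best_score` plays the role of the classification score: the log-probability of the most probable state found. Comparing such scores across competing classifications would require running the search once per classification, each with its own restrictions on the state space.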
- Research Organization:
- Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC52-06NA25396
- OSTI ID:
- 956655
- Report Number(s):
- LA-UR-08-07956; LA-UR-08-7956; TRN: US201016%%2340
- Country of Publication:
- United States
- Language:
- English
Similar Records
Emergence of recombinant forms in geographic regions with co-circulating HIV subtypes in the dynamic HIV-1 epidemic
jpHMM at GOBICS: a web server to detect genomic recombinations in HIV-1