Jensen MA, Coetzer M, van ’t Wout AB, Morris L, Mullins JI (2006). A reliable phenotype predictor for human immunodeficiency virus type 1 subtype C based on envelope V3 sequences. Journal of Virology, 80(10), 4698-4704.
In human immunodeficiency virus type 1 (HIV-1) subtype B infections, the emergence of viruses able to use CXCR4 as a coreceptor is well documented and associated with accelerated CD4 decline and disease progression. However, in HIV-1 subtype C infections, responsible for more than 50% of global infections, CXCR4 usage is less common, even in individuals with advanced disease. A reliable phenotype prediction method based on genetic sequence analysis could provide a rapid and less expensive approach to identify possible CXCR4 variants and thus increase our understanding of subtype C coreceptor usage. For subtype B V3 loop sequences, genotypic predictors have been developed based on position-specific scoring matrices (PSSM). In this study, we apply this methodology to a training set of 279 subtype C sequences of known phenotypes (228 non-syncytium-inducing [NSI] CCR5(+) and 51 SI CXCR4(+) sequences) to derive a C-PSSM predictor. Specificity and sensitivity distributions were estimated by combining data set bootstrapping with leave-one-out cross-validation, with random sampling of single sequences from individuals on each bootstrap iteration. The C-PSSM had an estimated specificity of 94% (confidence interval [CI], 92% to 96%) and a sensitivity of 75% (CI, 68% to 82%), which is significantly more sensitive than predictions based on other methods, including a commonly used method based on the presence of positively charged residues (sensitivity, 47.8%). A specificity of 83% and a sensitivity of 83% were achieved with a validation set of 24 SI and 47 NSI unique subtype C sequences. The C-PSSM performs as well on subtype C V3 loops as existing subtype B-specific methods do on subtype B V3 loops. We present bioinformatic evidence that particular sites may influence coreceptor usage differently, depending on the subtype.
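The evaluation scheme described in the abstract (bootstrapping the data set while sampling a single sequence per individual, then reporting sensitivity and specificity) can be sketched roughly as follows. This is a minimal Python illustration, not the authors' code. It assumes each sequence already has a predicted phenotype, so the leave-one-out retraining of the predictor is omitted; it treats SI (CXCR4-using) as the positive class; and it uses a simple percentile interval, which the abstract does not specify. The function names, the record layout, and the toy records are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def sensitivity_specificity(pairs):
    """Return (sensitivity, specificity) for (true, predicted) label pairs.

    Labels are 'SI' or 'NSI'; 'SI' is treated as the positive class.
    A rate is NaN when its denominator is zero.
    """
    counts = Counter(pairs)
    tp, fn = counts[("SI", "SI")], counts[("SI", "NSI")]
    tn, fp = counts[("NSI", "NSI")], counts[("NSI", "SI")]
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


def bootstrap_one_per_individual(records, n_iter=1000, seed=0):
    """Bootstrap individuals, drawing one sequence per sampled individual.

    records: list of (individual_id, true_label, predicted_label).
    Returns a list of (sensitivity, specificity), one per iteration.
    """
    rng = random.Random(seed)
    by_individual = defaultdict(list)
    for individual, true_label, predicted in records:
        by_individual[individual].append((true_label, predicted))
    ids = sorted(by_individual)
    results = []
    for _ in range(n_iter):
        # Resample individuals with replacement, then pick one of each
        # sampled individual's sequences at random.
        sampled = rng.choices(ids, k=len(ids))
        pairs = [rng.choice(by_individual[i]) for i in sampled]
        results.append(sensitivity_specificity(pairs))
    return results


def percentile_interval(values, lower=2.5, upper=97.5):
    """Nearest-rank percentile interval, ignoring NaN values."""
    vals = sorted(v for v in values if not math.isnan(v))
    if not vals:
        raise ValueError("no finite values to summarize")

    def pick(q):
        return vals[round(q / 100 * (len(vals) - 1))]

    return pick(lower), pick(upper)


# Made-up records for illustration only: (individual, true, predicted).
toy_records = [
    ("u01", "NSI", "NSI"), ("u01", "NSI", "NSI"),
    ("u02", "NSI", "SI"),
    ("u03", "SI", "SI"), ("u03", "SI", "NSI"),
    ("u04", "SI", "SI"),
    ("u05", "NSI", "NSI"),
]

runs = bootstrap_one_per_individual(toy_records, n_iter=200, seed=1)
print("sensitivity interval:", percentile_interval([s for s, _ in runs]))
print("specificity interval:", percentile_interval([p for _, p in runs]))
```

Drawing one sequence per individual on each iteration keeps heavily sampled individuals from dominating the estimates, which appears to be the motivation for the sampling step mentioned in the abstract.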
Supplemental Data for Jensen et al.:
- Training Set Data
- Validation Analysis
- NSI training set in FASTA format (u[nn] in the name line indicates a sample from the [nn]th infected individual; GenBank accession indicated where available)
- SI training set in FASTA format (u[nn] in the name line indicates a sample from the [nn]th infected individual; GenBank accession indicated where available)
- Validation set in FASTA format (multiple names after ‘>’ are isolates with identical V3 loops; for GenBank accessions, see the “validation analysis” sheet)
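The naming conventions noted for the FASTA files could be read with a short script along these lines. This is a sketch under assumptions: the exact layout of the name lines is not given here, so the code assumes the u[nn] tag appears as its own token (not embedded in a longer alphanumeric word) and that multiple isolate names on a validation-set name line are separated by whitespace. The helper names and the toy records are hypothetical and do not come from the actual files.

```python
import re
from typing import Iterator, List, Optional, Tuple

# Assumed form of the individual tag: 'u' followed by digits, not embedded
# inside a longer alphanumeric token.
INDIVIDUAL_TAG = re.compile(r"(?<![A-Za-z0-9])u(\d+)(?![A-Za-z0-9])")


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (name_line, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def individual_index(header: str) -> Optional[int]:
    """Return nn from a u[nn] tag in the name line, or None if absent."""
    match = INDIVIDUAL_TAG.search(header)
    return int(match.group(1)) if match else None


def isolate_names(header: str) -> List[str]:
    """Split a name line into isolate names (assumes whitespace-separated)."""
    return header.split()


# Toy input for illustration only; names and sequences are made up.
toy_fasta = """\
>u01_sampleA
CTRPNNNTRK
SIRIGPGQ
>u02_sampleB
CTRPGNNTRK
>iso1 iso2
CTRPNNNTRR
"""

for name_line, sequence in read_fasta(toy_fasta):
    print(individual_index(name_line), isolate_names(name_line), sequence)
```

If the real name lines use a different separator or tag position, only `INDIVIDUAL_TAG` and `isolate_names` would need adjusting.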