Abstract
Classification of viral sequences should be fast, objective, accurate and reproducible. Most methods that classify sequences use either pairwise distances or phylogenetic relations, but cannot discern when a sequence is unclassifiable. The branching index (BI) combines distance and phylogeny methods to compute a ratio that quantifies how closely a query sequence clusters with a subtype clade. In the hypothesis-testing framework of statistical inference, the BI is compared with a threshold to test whether sufficient evidence exists for the query sequence to be classified among known sequences. If the BI is above the threshold, the null hypothesis of no support for the subtype relation is rejected and the sequence is taken as belonging to the subtype clade with which it clusters on the tree. This study evaluates statistical properties of the BI for subtype classification in hepatitis C virus (HCV) and human immunodeficiency virus-1 (HIV-1). Pairs of BI values with known positive- and negative-test results were computed from 10 000 random fragments of reference alignments. Sampled fragments were of sufficient length to contain phylogenetic signals that grouped reference sequences together properly into subtype clades. For HCV, a threshold BI of 0.71 yields 95.1 % agreement with reference subtypes, with equal false-positive and false-negative rates. For HIV-1, a threshold of 0.66 yields 93.5 % agreement. Higher thresholds can be used where lower false-positive rates are required. In synthetic recombinants, regions without breakpoints are recognized accurately; regions with breakpoints do not represent any known subtype uniquely. Web-based services for viral subtype classification with the BI are available.
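The threshold test described in the abstract can be illustrated with a minimal sketch. It assumes BI values have already been computed elsewhere (the BI formula itself is not given here), and it picks the threshold at which false-positive and false-negative rates are closest to equal. The function names and the toy BI values are hypothetical and chosen only for illustration; they are not the study's data or code.

```python
from typing import Sequence, Tuple


def classify(bi: float, threshold: float) -> bool:
    """Return True when the BI is above the threshold (null hypothesis rejected)."""
    return bi > threshold


def error_rates(pos: Sequence[float], neg: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false-positive rate, false-negative rate) at a given threshold.

    pos: BI values from known positive tests (query belongs to the clade).
    neg: BI values from known negative tests (query does not belong).
    """
    fn = sum(1 for b in pos if not classify(b, threshold)) / len(pos)
    fp = sum(1 for b in neg if classify(b, threshold)) / len(neg)
    return fp, fn


def equal_error_threshold(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Scan observed BI values and return the one where FP and FN rates differ least."""
    candidates = sorted(set(pos) | set(neg))

    def gap(t: float) -> float:
        fp, fn = error_rates(pos, neg, t)
        return abs(fp - fn)

    return min(candidates, key=gap)


# Toy BI values, invented for illustration only.
toy_pos = [0.9, 0.8, 0.75, 0.6]
toy_neg = [0.3, 0.5, 0.65, 0.7]
toy_threshold = equal_error_threshold(toy_pos, toy_neg)
toy_fp, toy_fn = error_rates(toy_pos, toy_neg, toy_threshold)
```

Raising the threshold above the equal-error point trades a lower false-positive rate for a higher false-negative rate, which corresponds to the abstract's remark that higher thresholds can be used where lower false-positive rates are required.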