Generalizable characteristics of false-positive bacterial variant calls

Stephen J. Bush

doi:10.1099/mgen.0.000615

Volume 7, Issue 8

Research Article

Open Access

Stephen J. Bush¹

Affiliations: ¹ Nuffield Department of Medicine, University of Oxford, Oxford, UK
*Correspondence: Stephen J. Bush, [email protected]
Published: 04 August 2021 https://doi.org/10.1099/mgen.0.000615

Abstract

Minimizing false positives is a critical issue when variant calling as no method is without error. It is common practice to post-process a variant-call file (VCF) using hard filter criteria intended to discriminate true-positive (TP) from false-positive (FP) calls. These are applied on the simple principle that certain characteristics are disproportionately represented among the set of FP calls and that a user-chosen threshold can maximize the number detected. To provide guidance on this issue, this study empirically characterized all false SNP and indel calls made using real Illumina sequencing data from six disparate species and 166 variant-calling pipelines (the combination of 14 read aligners with up to 13 different variant callers, plus four ‘all-in-one’ pipelines). We did not seek to optimize filter thresholds but instead to draw attention to those filters of greatest efficacy and the pipelines to which they may most usefully be applied. In this respect, this study acts as a coda to our previous benchmarking evaluation of bacterial variant callers, and provides general recommendations for effective practice. The results suggest that, of the pipelines analysed in this study, the most straightforward way of minimizing false positives would simply be to use Snippy. We also find that a disproportionate number of false calls, irrespective of the variant-calling pipeline, are located in the vicinity of indels, and highlight this as an issue for future development.
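As an illustration of the hard filtering described above, the following is a minimal sketch only, not the method used in this study. It assumes a tab-delimited VCF with the standard fixed columns and applies two illustrative filters: a minimum QUAL value, and removal of SNPs that lie within a fixed distance of an indel call. The names `Call`, `parse_vcf_lines`, `is_indel` and `hard_filter`, and the thresholds `min_qual=30.0` and `indel_window=10`, are hypothetical choices made for this example; the study deliberately did not seek to optimize thresholds.

```python
from typing import Iterable, List, NamedTuple


class Call(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float


def parse_vcf_lines(lines: Iterable[str]) -> List[Call]:
    """Parse CHROM, POS, REF, ALT and QUAL from each VCF record line."""
    calls = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        # A missing QUAL is written "." in VCF; treating it as 0 is an assumption.
        calls.append(Call(chrom, int(pos), ref, alt, 0.0 if qual == "." else float(qual)))
    return calls


def is_indel(call: Call) -> bool:
    """True if any ALT allele differs in length from REF.

    Simplification: symbolic alleles such as <DEL> are not handled.
    """
    return any(len(allele) != len(call.ref) for allele in call.alt.split(","))


def hard_filter(calls: List[Call], min_qual: float = 30.0, indel_window: int = 10) -> List[Call]:
    """Keep calls with QUAL >= min_qual, dropping SNPs near an indel.

    Distance is measured between POS values only (a simplification that
    ignores the length of the indel). Both thresholds are hypothetical.
    """
    indel_sites = [(c.chrom, c.pos) for c in calls if is_indel(c)]
    kept = []
    for c in calls:
        if c.qual < min_qual:
            continue
        near_indel = not is_indel(c) and any(
            chrom == c.chrom and abs(pos - c.pos) <= indel_window
            for chrom, pos in indel_sites
        )
        if not near_indel:
            kept.append(c)
    return kept


# Made-up records for illustration only.
EXAMPLE = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",   # SNP, far from any indel
    "chr1\t200\t.\tAT\tA\t60\tPASS\t.",  # deletion
    "chr1\t205\t.\tC\tT\t55\tPASS\t.",   # SNP 5 bp from the deletion
    "chr1\t300\t.\tG\tA\t10\tPASS\t.",   # SNP with low QUAL
]

kept = hard_filter(parse_vcf_lines(EXAMPLE))
print([c.pos for c in kept])  # [100, 200]
```

In this toy input the SNP at position 205 is removed because it falls within the window around the deletion at position 200, and the SNP at position 300 is removed on QUAL alone. In practice, the appropriate filters and thresholds depend on the variant-calling pipeline that produced the VCF.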

Received: 09/03/2021
Accepted: 20/05/2021
Published Online: 04/08/2021

Keyword(s): benchmarking, best practice, false positive, variant calling

Funding

This study was supported by the:

National Institute for Health Research Health Protection Research Unit (Award HPRU-2012-10041)
- Principal Award Recipient: Not applicable

This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.


M Gen 7, 000615 (2021); https://doi.org/10.1099/mgen.0.000615
