Read trimming has minimal effect on bacterial SNP-calling accuracy

Stephen J. Bush

doi:10.1099/mgen.0.000434

Volume 6, Issue 12

Research Article

Open Access

Affiliations: ¹ Nuffield Department of Medicine, University of Oxford, Oxford, UK
*Correspondence: Stephen J. Bush, [email protected]
Published: 11 December 2020 https://doi.org/10.1099/mgen.0.000434

Abstract

Read alignment is the central step of many analytic pipelines that perform variant calling. To reduce error, it is common practice to pre-process raw sequencing reads to remove low-quality bases and residual adapter contamination, a procedure collectively known as ‘trimming’. Trimming is widely assumed to increase the accuracy of variant calling, although there are relatively few systematic evaluations of its effects and no clear consensus on its efficacy. As sequencing datasets increase both in number and size, it is worthwhile reappraising computational operations of ambiguous benefit, particularly when the scope of many analyses now routinely incorporates thousands of samples, increasing the time and cost required. Using a curated set of 17 Gram-negative bacterial genomes, this study initially evaluated the impact of four read-trimming utilities (Atropos, fastp, Trim Galore and Trimmomatic), each used with a range of stringencies, on the accuracy and completeness of three bacterial SNP-calling pipelines. It was found that read trimming produced only small, statistically non-significant increases in SNP-calling accuracy, even when using the highest-performing pre-processor in this study, fastp. To extend these findings, >6500 publicly archived sequencing datasets from Escherichia coli, Mycobacterium tuberculosis and Staphylococcus aureus were re-analysed using a common analytic pipeline. Of the approximately 125 million SNPs and 1.25 million indels called across all samples, the same bases were called in 98.8 and 91.9 % of cases, respectively, irrespective of whether raw reads or trimmed reads were used. Nevertheless, the proportion of mixed calls (i.e. calls where <100 % of the reads support the variant allele; considered a proxy of false positives) was significantly reduced after trimming, which suggests that while trimming rarely alters the set of variant bases, it can affect the proportion of reads supporting each call.
It was concluded that read quality- and adapter-trimming add relatively little value to a SNP-calling pipeline and may only be necessary if small differences in the absolute number of SNP calls, or the false call rate, are critical. Broadly similar conclusions can be drawn about the utility of trimming to an indel-calling pipeline. Read trimming remains routinely performed prior to variant calling, likely out of concern that doing otherwise would typically have negative consequences. While historically this may have been the case, the data in this study suggest that read trimming is not always a practical necessity.
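
The two comparisons summarised in the abstract (how often the same base is called from raw and trimmed reads, and what proportion of calls are mixed) can be illustrated with a minimal sketch. It is not the study's pipeline: the `CallSet` layout, the function names, the choice of the union of called sites as the denominator, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical layout: (contig, position) -> (variant allele, fraction of reads supporting it).
Call = Tuple[str, float]
CallSet = Dict[Tuple[str, int], Call]


def concordance(raw: CallSet, trimmed: CallSet) -> float:
    """Fraction of sites called in either run where both runs report the same allele.

    Using the union of sites as the denominator is an assumption of this sketch.
    """
    sites = set(raw) | set(trimmed)
    if not sites:
        return 0.0
    same = sum(
        1
        for site in sites
        if site in raw and site in trimmed and raw[site][0] == trimmed[site][0]
    )
    return same / len(sites)


def mixed_fraction(calls: CallSet) -> float:
    """Fraction of calls where fewer than 100 % of reads support the variant allele."""
    if not calls:
        return 0.0
    mixed = sum(1 for _, support in calls.values() if support < 1.0)
    return mixed / len(calls)


# Toy data (invented): four calls from raw reads, three from trimmed reads.
raw_calls: CallSet = {
    ("contig1", 100): ("A", 1.0),
    ("contig1", 250): ("T", 0.9),
    ("contig1", 400): ("G", 0.8),
    ("contig1", 900): ("C", 1.0),
}
trimmed_calls: CallSet = {
    ("contig1", 100): ("A", 1.0),
    ("contig1", 250): ("T", 1.0),
    ("contig1", 400): ("G", 0.95),
}

print(concordance(raw_calls, trimmed_calls))
print(mixed_fraction(raw_calls), mixed_fraction(trimmed_calls))
```

In this toy example the set of called bases barely changes between runs while the share of mixed calls falls after trimming, which is the pattern the abstract describes.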

Received: 08/09/2020
Accepted: 25/11/2020
Published Online: 11/12/2020

Keyword(s): read pre-processing, read trimming, SNP calling and variant calling

Funding

This study was supported by the:

National Institute for Health Research Health Protection Research Unit (Award HPRU-2012-10041)
- Principal Award Recipient: Not applicable

This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.

M Gen 6, e000434 (2020); https://doi.org/10.1099/mgen.0.000434
