Abstract

Read alignment is the central step of many analytic pipelines that perform variant calling. To reduce error, it is common practice to pre-process raw sequencing reads to remove low-quality bases and residual adapter contamination, a procedure collectively known as ‘trimming’. Trimming is widely assumed to increase the accuracy of variant calling, although there are relatively few systematic evaluations of its effects and no clear consensus on its efficacy. As sequencing datasets increase in both number and size, it is worthwhile reappraising computational operations of ambiguous benefit, particularly now that many analyses routinely incorporate thousands of samples, increasing the time and cost required. Using a curated set of 17 Gram-negative bacterial genomes, this study initially evaluated the impact of four read-trimming utilities (Atropos, fastp, Trim Galore and Trimmomatic), each used with a range of stringencies, on the accuracy and completeness of three bacterial SNP-calling pipelines. Read trimming produced only small, statistically non-significant increases in SNP-calling accuracy, even when using the highest-performing pre-processor in this study, fastp. To extend these findings, >6500 publicly archived sequencing datasets from , and were re-analysed using a common analytic pipeline. Of the approximately 125 million SNPs and 1.25 million indels called across all samples, the same bases were called in 98.8 and 91.9 % of cases, respectively, irrespective of whether raw reads or trimmed reads were used. Nevertheless, the proportion of mixed calls (i.e. calls where <100 % of the reads support the variant allele; considered a proxy of false positives) was significantly reduced after trimming, which suggests that while trimming rarely alters the set of variant bases, it can affect the proportion of reads supporting each call.
It was concluded that read quality- and adapter-trimming add relatively little value to a SNP-calling pipeline and may only be necessary if small differences in the absolute number of SNP calls, or the false call rate, are critical. Broadly similar conclusions can be drawn about the utility of trimming to an indel-calling pipeline. Read trimming is still routinely performed prior to variant calling, probably out of concern that omitting it would have negative consequences. While historically this may have been the case, the data in this study suggest that read trimming is not always a practical necessity.
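
The two summary measures reported above (agreement between raw-read and trimmed-read calls, and the proportion of mixed calls) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the `Call` record, the function names and the toy data are hypothetical, and the choice of denominator for agreement (every position called in either set) is an assumption, since the abstract does not state how it was defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical variant-call record (not a format used by the study)."""
    position: int   # 1-based position on the reference genome
    alt: str        # called variant base
    alt_reads: int  # reads supporting the variant allele
    depth: int      # total reads covering the position


def concordance(raw, trimmed):
    """Fraction of positions called in either set where both sets call the same base.

    Assumption: a position called in only one set counts as a disagreement.
    """
    raw_by_pos = {c.position: c.alt for c in raw}
    trimmed_by_pos = {c.position: c.alt for c in trimmed}
    all_sites = raw_by_pos.keys() | trimmed_by_pos.keys()
    if not all_sites:
        return 1.0  # nothing called in either set, so nothing disagrees
    same = sum(1 for p in all_sites if raw_by_pos.get(p) == trimmed_by_pos.get(p))
    return same / len(all_sites)


def mixed_fraction(calls):
    """Fraction of calls where fewer than 100 % of reads support the variant allele."""
    if not calls:
        return 0.0
    mixed = sum(1 for c in calls if c.alt_reads < c.depth)
    return mixed / len(calls)


# Toy data, invented purely for illustration.
raw_calls = [
    Call(100, "A", 30, 30),
    Call(250, "T", 18, 20),  # mixed: 18 of 20 reads support the variant
    Call(400, "G", 25, 25),
    Call(900, "C", 9, 12),   # mixed, and absent from the trimmed set
]
trimmed_calls = [
    Call(100, "A", 28, 28),
    Call(250, "T", 19, 19),
    Call(400, "G", 24, 24),
]

print(concordance(raw_calls, trimmed_calls))  # 0.75
print(mixed_fraction(raw_calls))              # 0.5
print(mixed_fraction(trimmed_calls))          # 0.0
```

In this toy case three of the four called positions agree, while the mixed-call fraction falls from 0.5 to 0.0 after trimming. That is the pattern described above, in which the set of variant bases changes little but read support per call shifts.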

Funding
This study was supported by the:
  • National Institute for Health Research Health Protection Research Unit (Award HPRU-2012-10041)

/content/journal/mgen/10.1099/mgen.0.000434
2020-12-11
2021-01-22
