Exact mapping of Illumina blind spots in the Mycobacterium tuberculosis genome reveals platform-wide and workflow-specific biases

Samuel J. Modlin; Cassidy Robinhold; Christopher Morrissey; Scott N. Mitchell; Sarah M. Ramirez-Busby; Tal Shmaya; Faramarz Valafar

doi:10.1099/mgen.0.000465

Volume 7, Issue 3

Research Article

Open Access

Samuel J. Modlin¹,†, Cassidy Robinhold¹,†, Christopher Morrissey¹, Scott N. Mitchell¹, Sarah M. Ramirez-Busby¹, Tal Shmaya¹ and Faramarz Valafar¹,*

Affiliations: ¹ Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, School of Public Health, San Diego State University, San Diego, CA 92182, USA
*Correspondence: Faramarz Valafar

† These authors contributed equally to this work
Published: 27 January 2021 https://doi.org/10.1099/mgen.0.000465

Abstract

Whole-genome sequencing (WGS) is fundamental to Mycobacterium tuberculosis basic research and many clinical applications. Coverage across Illumina-sequenced M. tuberculosis genomes is known to vary with sequence context, but this bias is poorly characterized. Here, through a novel application of phylogenomics that distinguishes genuine coverage bias from deletions, we discern Illumina ‘blind spots’ in the M. tuberculosis reference genome for seven sequencing workflows. We find blind spots to be widespread, affecting 529 genes, and provide their exact coordinates, enabling salvage of unaffected regions. Fifty-seven pe/ppe genes (the primary families assumed to exhibit Illumina bias) lack blind spots entirely, while the remaining pe/ppe genes account for 55.1 % of blind spots. Surprisingly, we find coverage bias persists in homopolymers as short as 6 bp, shorter tracts than previously reported. While G+C-rich regions challenge all Illumina sequencing workflows, a modified Nextera library preparation that amplifies DNA with a high-fidelity polymerase markedly attenuates coverage bias in G+C-rich and homopolymeric sequences, expanding the ‘Illumina-sequenceable’ genome. Through these findings, and by defining workflow-specific exclusion criteria, we spotlight effective strategies for handling bias in M. tuberculosis Illumina WGS. This empirical analysis framework may be used to systematically evaluate coverage bias in other species using existing sequencing data.
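
The abstract reports that coverage bias persists in homopolymers as short as 6 bp. As an illustration only, the following minimal Python sketch shows one way to locate such tracts in a sequence. It is not the authors' pipeline: the function name `find_homopolymers`, the `min_length` parameter, and the coordinate convention (0-based, half-open) are assumptions made for this example.

```python
import re


def find_homopolymers(sequence, min_length=6):
    """Return (start, end, base) for each homopolymer run of at least min_length bp.

    Hypothetical helper for illustration; coordinates are 0-based and half-open.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # One base followed by at least (min_length - 1) repeats of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1), re.IGNORECASE)
    return [
        (match.start(), match.end(), match.group(1).upper())
        for match in pattern.finditer(sequence)
    ]


if __name__ == "__main__":
    # Toy sequence (not from the M. tuberculosis genome): a 6 bp G run and a 5 bp C run.
    for start, end, base in find_homopolymers("ACGGGGGGTACCCCCT"):
        print(start, end, base)
```

Tracts found this way could then be compared against the published blind-spot coordinates; the 6 bp default simply mirrors the shortest tract length mentioned in the abstract and can be changed.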

Received: 04/05/2020
Accepted: 16/10/2020
Published Online: 27/01/2021

Keywords: blind spots, coverage bias, homopolymers, Illumina, sequencing, tuberculosis

Funding

This study was supported by the:

National Institute of Allergy and Infectious Diseases (Award R01AI105185)
- Principal Award Recipient: Faramarz Valafar

This is an open-access article distributed under the terms of the Creative Commons Attribution NonCommercial License.




M Gen 7, 000465 (2021); https://doi.org/10.1099/mgen.0.000465
