Abstract

Although variant identification pipelines are becoming increasingly standardized, less attention has been paid to the pre-processing of variants before their use in bacterial genome-wide association studies (bGWAS). Three nuances of variant pre-processing that affect the downstream identification of genetic associations are the separation of variants at multiallelic sites, the separation of variants in overlapping genes, and the referencing of variants relative to ancestral alleles. Here we demonstrate the importance of these pre-processing steps on diverse bacterial genomic datasets and present prewas, an R package that standardizes the pre-processing of multiallelic sites, overlapping genes, and reference alleles before bGWAS. This package facilitates improved reproducibility and interpretability of bGWAS results. prewas enables users to extract maximal information from bGWAS by implementing a multi-line representation for multiallelic sites and for variants in overlapping genes. prewas outputs a binary SNP matrix that can be used for SNP-based bGWAS and prevents the masking of minor alleles during analysis. The optional binary gene matrix output can be used for gene-based bGWAS, enabling users to maximize the power and evolutionary interpretability of their analyses. prewas is available for download from GitHub.
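
The multi-line representation of a multiallelic site can be illustrated with a minimal sketch. This is not the prewas implementation (prewas is an R package); the function name `split_multiallelic`, the site label format, and the toy allele calls are all hypothetical, and the `reference` argument simply stands in for whichever reference allele (for example, an inferred ancestral allele) has been chosen. The sketch contrasts a single collapsed row, in which distinct alternative alleles are indistinguishable, with one binary row per alternative allele.

```python
from typing import Dict, List


def split_multiallelic(site_id: str, reference: str,
                       calls: Dict[str, str]) -> Dict[str, List[int]]:
    """Return one binary row per non-reference allele seen at a site.

    `calls` maps sample name -> allele called at this site (hypothetical input).
    Each row holds 1 where the sample carries that specific alternative allele.
    """
    samples = list(calls)  # preserves insertion order of the samples
    alt_alleles = sorted({a for a in calls.values() if a != reference})
    return {
        f"{site_id}_{reference}>{alt}": [1 if calls[s] == alt else 0 for s in samples]
        for alt in alt_alleles
    }


# Toy triallelic site: reference A, alternative alleles G and T.
calls = {"s1": "A", "s2": "G", "s3": "T", "s4": "A"}

# Collapsed representation: any non-reference allele is coded as 1,
# so G and T cannot be told apart in an association test.
collapsed = [0 if allele == "A" else 1 for allele in calls.values()]

# Multi-line representation: one row per alternative allele.
rows = split_multiallelic("pos100", "A", calls)
```

In the collapsed row, samples s2 and s3 receive the same code even though they carry different alleles; the split rows keep each alternative allele as its own testable variant.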

Funding
This study was supported by the:
  • National Institutes of Health, http://dx.doi.org/10.13039/100000002 (Award T32GM007544)
  • National Institutes of Health, http://dx.doi.org/10.13039/100000002 (Award 1U01AI124255)
  • National Institutes of Health, http://dx.doi.org/10.13039/100000002 (Award T32 AI007528)
  • National Science Foundation, http://dx.doi.org/10.13039/100000001 (Award DGE 1256260)

This is an open-access article distributed under the terms of the Creative Commons Attribution License.

/content/journal/mgen/10.1099/mgen.0.000368
2020-04-20
2024-12-14


Supplements

Supplementary material 1 (PDF)
Supplementary material 2 (Excel)