
Abstract

With the ever-expanding number of available bacterial genome sequences, and the expectation that this data type will be the primary one generated by both diagnostic and research laboratories for the foreseeable future, there is both an opportunity and a need to evaluate how effectively computational approaches can be used within bacterial genomics to predict and understand complex phenotypes, such as pathogenic potential and host source. This article applies various quantitative methods, such as diversity indices, pangenome-wide association studies (pan-GWAS) and dimensionality reduction techniques, to better understand the data, and then compares how well unsupervised and supervised machine learning (ML) methods predict the source host of the isolates. The study uses the pangenomes of 1203 Salmonella enterica serovar Typhimurium isolates as an example for predicting 'host of isolation' with these different methods. The article is intended as a review of recent applications of ML in infection biology, but working through this specific dataset also allows discussion of the advantages and drawbacks of the different techniques. As with all such sub-population studies, the biological relevance depends on the quality and diversity of the input data. Given this major caveat, we show that supervised ML has the potential to add real value to the interpretation of bacterial genomic data, as it can provide probabilistic outcomes for important phenotypes, something that is very difficult to achieve with the other methods.
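To make the idea of a "probabilistic outcome" concrete, the following is a minimal sketch of a supervised classifier that returns a probability for each candidate host rather than a single label. It is not the method used in the article: a small Bernoulli naive Bayes model is swapped in because it fits in a few lines of standard-library Python. The gene presence/absence rows, the host labels and the function names `train_bernoulli_nb` and `predict_proba` are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit a Bernoulli naive Bayes model on binary presence/absence rows.

    X: list of equal-length lists of 0/1 values (one row per isolate).
    y: list of host labels, one per row.
    alpha: Laplace smoothing constant, so no probability is exactly 0 or 1.
    """
    n_features = len(X[0])
    class_counts = defaultdict(int)
    present_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(X, y):
        class_counts[label] += 1
        for j, value in enumerate(row):
            present_counts[label][j] += value

    model = {}
    for label, n in class_counts.items():
        prior = n / len(y)
        gene_probs = [
            (present_counts[label][j] + alpha) / (n + 2 * alpha)
            for j in range(n_features)
        ]
        model[label] = (prior, gene_probs)
    return model


def predict_proba(model, row):
    """Return a dict mapping each host label to its posterior probability."""
    log_scores = {}
    for label, (prior, gene_probs) in model.items():
        score = math.log(prior)
        for value, p in zip(row, gene_probs):
            score += math.log(p if value else 1.0 - p)
        log_scores[label] = score
    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    unnormalised = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(unnormalised.values())
    return {k: v / total for k, v in unnormalised.items()}


# Hypothetical toy data: 6 isolates x 4 accessory genes (1 = gene present).
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]
y = ["bovine", "bovine", "bovine", "avian", "avian", "avian"]

model = train_bernoulli_nb(X, y)
probs = predict_proba(model, [1, 1, 0, 0])
print(probs)
```

The returned dictionary assigns each host a probability, and the probabilities sum to one. As the abstract cautions, such probabilities are only as meaningful as the quality and diversity of the training isolates.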

  • This is an open-access article distributed under the terms of the Creative Commons Attribution License.
/content/journal/mgen/10.1099/mgen.0.000317
2019-12-01
2024-04-25
