Abstract

Salmonella enterica is a taxonomically diverse pathogen with over 2600 serovars associated with a wide variety of animal hosts including humans, other mammals, birds and reptiles. Some serovars are host-specific or host-restricted and cause disease in distinct host species, while others, such as serovar Typhimurium (STm), are generalists and have the potential to colonize a wide variety of species. However, even within generalist serovars such as STm, it is becoming clear that pathovariants exist that differ in tropism and virulence. Identifying the genetic factors underlying host specificity is complex, but the availability of thousands of genome sequences and advances in machine learning have made it possible to build specific host prediction models to aid outbreak control and predict the human pathogenic potential of isolates from animals and other reservoirs. We have advanced this area by building host-association prediction models trained on a wide range of genomic features and comparing them with predictions based on nearest-neighbour phylogeny. SNPs, protein variants (PVs), antimicrobial resistance (AMR) profiles and intergenic regions (IGRs) were extracted from 3883 high-quality STm assemblies collected from humans, swine, bovine and poultry in the USA, and used to construct Random Forest (RF) machine learning models. An additional 244 recent STm assemblies from farm animals were used as a test set for further validation. The models based on PVs and IGRs had the best performance in terms of predicting the host of origin of isolates and outperformed nearest-neighbour phylogenetic host prediction as well as models based on SNPs or AMR data. However, the models did not yield reliable predictions when tested with isolates that were phylogenetically distinct from the training set. The IGR and PV models were often able to differentiate human isolates in clusters where the majority of isolates were from a single animal source.
Notably, IGRs were the best-performing feature across multiple models, which may be because IGRs both represent their flanking genes, equivalent to PVs, and capture genomic regulatory variation, such as altered promoter regions. The IGR and PV models predict that ~45 % of human infections with STm in the USA originate from bovine, ~40 % from poultry and ~14.5 % from swine sources, although sequences of isolates from other sources were not used for training. In summary, the research demonstrates a significant gain in accuracy for models with IGRs and PVs as features compared to SNP-based and core genome phylogeny predictions when applied within the existing population structure. This article contains data hosted by Microreact.
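The workflow summarized above (train a Random Forest on genomic features from animal isolates, then predict the likely source of human isolates and tally the shares) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' pipeline. The 0/1 presence/absence encoding, the six-feature toy signatures, the host labels as strings and all function names are assumptions made for the example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, rng, n_try, depth=0, max_depth=8):
    """Grow one decision tree on 0/1 features; leaves are host labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    # Only features that vary among the isolates at this node can split it.
    varying = [f for f in range(len(rows[0])) if len({r[f] for r in rows}) > 1]
    if not varying:
        return majority
    best_score, best_f = None, None
    # Random feature subset at each split: the "random" in Random Forest.
    for f in rng.sample(varying, min(n_try, len(varying))):
        absent = [y for r, y in zip(rows, labels) if r[f] == 0]
        present = [y for r, y in zip(rows, labels) if r[f] == 1]
        score = (len(absent) * gini(absent) + len(present) * gini(present)) / len(labels)
        if best_score is None or score < best_score:
            best_score, best_f = score, f
    split = {0: ([], []), 1: ([], [])}
    for r, y in zip(rows, labels):
        split[r[best_f]][0].append(r)
        split[r[best_f]][1].append(y)
    return (best_f,
            build_tree(*split[0], rng, n_try, depth + 1, max_depth),
            build_tree(*split[1], rng, n_try, depth + 1, max_depth))

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap resample of the isolates."""
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], rng, n_try))
    return forest

def predict_host(forest, row):
    """Majority vote of all trees for one isolate."""
    votes = Counter()
    for tree in forest:
        node = tree
        while isinstance(node, tuple):
            feature, if_absent, if_present = node
            node = if_present if row[feature] else if_absent
        votes[node] += 1
    return votes.most_common(1)[0][0]

def attribution_shares(forest, human_rows):
    """Fraction of human isolates attributed to each animal source."""
    counts = Counter(predict_host(forest, r) for r in human_rows)
    return {host: n / len(human_rows) for host, n in counts.items()}

# Hypothetical toy data: each column stands for one PV or IGR variant (1 = present).
signatures = {
    "bovine":  [1, 1, 0, 0, 0, 0],
    "poultry": [0, 0, 1, 1, 0, 0],
    "swine":   [0, 0, 0, 0, 1, 1],
}
train_rows = [sig for sig in signatures.values() for _ in range(4)]
train_labels = [host for host in signatures for _ in range(4)]
human_rows = [signatures["bovine"], signatures["bovine"],
              signatures["poultry"], signatures["swine"]]

forest = train_forest(train_rows, train_labels)
print(attribution_shares(forest, human_rows))
```

Because the forest is trained only on animal-source labels, every human isolate is forced into one of those classes. This is the same caveat the abstract raises about sources that were not used for training.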

Funding
This study was supported by the:
  • University of Edinburgh
    • Principal Award Recipient: Antonia Chalka
  • Biotechnology and Biological Sciences Research Council (Award BBS/E/D/20002173)
    • Principal Award Recipient: David L. Gally

This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.
/content/journal/mgen/10.1099/mgen.0.001116
2023-10-16
2024-11-04
Supplements

Supplementary material 1

EXCEL