SuperDCA for genome-wide epistasis analysis Open Access

Abstract

The potential for genome-wide modelling of epistasis has recently surfaced given the possibility of sequencing densely sampled populations and the emerging families of statistical interaction models. Direct coupling analysis (DCA) has previously been shown to yield valuable predictions for single protein structures, and has recently been extended to genome-wide analysis of bacteria, identifying novel interactions in the co-evolution between resistance, virulence and core genome elements. However, earlier computational DCA methods have not been scalable to enable model fitting simultaneously to 10^4–10^5 polymorphisms, representing the amount of core genomic variation observed in analyses of many bacterial species. Here, we introduce a novel inference method (SuperDCA) that employs a new scoring principle, efficient parallelization, optimization and filtering on phylogenetic information to achieve scalability for up to 10^5 polymorphisms. Using two large population samples of Streptococcus pneumoniae, we demonstrate the ability of SuperDCA to make additional significant biological findings about this major human pathogen. We also show that our method can uncover signals of selection that are not detectable by genome-wide association analysis, even though our analysis does not require phenotypic measurements. SuperDCA, thus, holds considerable potential in building understanding about numerous organisms at a systems biological level.

/content/journal/mgen/10.1099/mgen.0.000184
2018-05-29
2024-03-28
