Genome-wide association studies (GWASs) have the potential to reveal the genetics of microbial phenotypes such as antibiotic resistance and virulence. Capitalizing on the growing wealth of bacterial sequence data, microbial GWAS methods aim to identify causal genetic variants while ignoring spurious associations. Bacteria reproduce clonally, leading to strong population structure and genome-wide linkage, which makes it challenging to separate true ‘hits’ (i.e. mutations that cause a phenotype) from non-causal linked mutations. GWAS methods attempt to correct for population structure in different ways, but their performance has not yet been systematically and comprehensively evaluated under a range of evolutionary scenarios. Here, we developed a bacterial GWAS simulator (BacGWASim) to generate bacterial genomes with varying rates of mutation, recombination and other evolutionary parameters, along with a subset of causal mutations underlying a phenotype of interest. We assessed the performance (recall and precision) of three widely used single-locus GWAS approaches (cluster-based, dimensionality-reduction and linear mixed models, implemented in plink, pyseer and gemma) and one relatively new multi-locus model implemented in pyseer, across a range of simulated sample sizes, recombination rates and causal mutation effect sizes. As expected, all methods performed better with larger sample sizes and effect sizes. The performance of the clustering and dimensionality-reduction approaches to correcting for population structure varied considerably with the choice of parameters. Notably, the multi-locus elastic net (lasso) approach was consistently amongst the highest-performing methods, and had the highest power to detect causal variants with both low and high effect sizes. Most methods reached good performance (recall >0.75) in identifying causal mutations of strong effect size [log odds ratio (OR) ≥2] with a sample size of 2000 genomes. However, only elastic nets reached reasonable performance (recall=0.35) in detecting markers with weaker effects (log OR ~1) in smaller samples. Elastic nets also showed superior precision and recall in controlling for genome-wide linkage, relative to single-locus models. Nevertheless, all methods performed relatively poorly on highly clonal (low-recombining) genomes, suggesting room for improvement in method development. These findings show the potential for multi-locus models to improve bacterial GWAS performance. BacGWASim code and simulated data are publicly available to enable further comparisons and benchmarking of new methods.
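As an illustration of the two performance measures used above, the following minimal sketch computes precision and recall for a single simulated GWAS run. It assumes that the causal variants and the significant hits are both available as sets of genome positions; the function name, the example positions and the convention of returning 0.0 when a set is empty are hypothetical choices made for this sketch, not part of BacGWASim or of any of the evaluated tools.

```python
def precision_recall(significant, causal):
    """Return (precision, recall) for one simulated GWAS run.

    significant: positions reported as significant by a GWAS method
    causal: positions simulated as truly causal
    Both inputs are hypothetical illustrations, not BacGWASim output.
    """
    significant = set(significant)
    causal = set(causal)
    true_positives = len(significant & causal)
    # Assumed convention: report 0.0 when the denominator is empty.
    precision = true_positives / len(significant) if significant else 0.0
    recall = true_positives / len(causal) if causal else 0.0
    return precision, recall


# Hypothetical example: four simulated causal sites, four reported hits,
# two of which are truly causal.
causal_sites = {101, 2050, 3300, 47012}
hits = {101, 2050, 2051, 9000}
precision, recall = precision_recall(hits, causal_sites)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, half of the reported hits are causal (precision) and half of the causal sites are recovered (recall). A non-causal hit such as position 2051, adjacent to a causal site, is the kind of linked false positive that lowers precision in clonal genomes.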