Methods and Software
The Microbial Genomics Methods and Software collection will bring together articles describing novel experimental, bioinformatics, modelling, and statistical approaches to the analysis of microbial genomics data, including databases or the integration of genomics with other data streams; as well as systematic comparisons or benchmarking of existing methodologies used in the field of microbial genomics. Guest-edited by Dr Zamin Iqbal (European Bioinformatics Institute) and Dr Caroline Colijn (Simon Fraser University), the collection aims to provide the microbial genomics community with new and systematically validated tools to advance their research.
The cover image for this collection brings together figures from two of retrospective articles in the collection: a phylogeny richly annotated with insertion sequence sites from the article on ISseeker by Adams et al. 2016 (bottom left); and a genome assembly graph from the article on completing bacterial genomes by Wick et al. 2017 (top right).
This collection is now open for submissions. Submit your article here, stating that your manuscript is part of the Methods and Software collection.
Collection Contents
-
-
PANINI: Pangenome Neighbour Identification for Bacterial Populations
The standard workhorse for genomic analysis of the evolution of bacterial populations is phylogenetic modelling of mutations in the core genome. However, a notable amount of information about evolutionary and transmission processes in diverse populations can be lost unless the accessory genome is also taken into consideration. Here, we introduce panini (Pangenome Neighbour Identification for Bacterial Populations), a computationally scalable method for identifying the neighbours for each isolate in a data set using unsupervised machine learning with stochastic neighbour embedding based on the t-SNE (t-distributed stochastic neighbour embedding) algorithm. panini is browser-based and integrates with the Microreact platform for rapid online visualization and exploration of both core and accessory genome evolutionary signals, together with relevant epidemiological, geographical, temporal and other metadata. Several case studies with single- and multi-clone pneumococcal populations are presented to demonstrate the ability to identify biologically important signals from gene content data. panini is available at http://panini.pathogen.watch and code at http://gitlab.com/cgps/panini.
-
-
-
PhasomeIt: an ‘omics’ approach to cataloguing the potential breadth of phase variation in the genus Campylobacter
More LessHypermutable simple sequence repeats (SSRs) are drivers of phase variation (PV) whose stochastic, high-frequency, reversible switches in gene expression are a common feature of several pathogenic bacterial species, including the human pathogen Campylobacter jejuni. Here we examine the distribution and conservation of known and putative SSR-driven phase variable genes – the phasome – in the genus Campylobacter. PhasomeIt, a new program, was specifically designed for rapid identification of SSR-mediated PV. This program detects the location, type and repeat number of every SSR. Each SSR is linked to a specific gene and its putative expression state. Other outputs include conservation of SSR-driven phase-variable genes and the ‘core phasome’ – the minimal set of PV genes in a phylogenetic grouping. Analysis of 77 complete Campylobacter genome sequences detected a ‘core phasome’ of conserved PV genes in each species and a large number of rare PV genes with few, or no, homologues in other genome sequences. Analysis of a set of partial genome sequences, with food-chain-associated metadata, detected evidence of a weak link between phasome and source host for disease-causing isolates of sequence type (ST)-828 but not the ST-21 or ST-45 complexes. Investigation of the phasomes in the genus Campylobacter provided evidence of overlapping but distinctive mechanisms of PV-mediated adaptation to specific niches. This suggests that the phasome could be involved in host adaptation and spread of campylobacters. Finally, this tool is malleable and will have utility for studying the distribution and genic effects of other repetitive elements in diverse bacterial species.
-
-
-
PlaScope: a targeted approach to assess the plasmidome from genome assemblies at the species level
More LessPlasmid prediction may be of great interest when studying bacteria of medical importance such as Enterobacteriaceae as well as Staphylococcus aureus or Enterococcus. Indeed, many resistance and virulence genes are located on such replicons with major impact in terms of pathogenicity and spreading capacities. Beyond strain outbreak, plasmid outbreaks have been reported in particular for some extended-spectrum beta-lactamase- or carbapenemase-producing Enterobacteriaceae. Several tools are now available to explore the ‘plasmidome’ from whole-genome sequences with various approaches, but none of them are able to combine high sensitivity and specificity. With this in mind, we developed PlaScope, a targeted approach to recover plasmidic sequences in genome assemblies at the species or genus level. Based on Centrifuge, a metagenomic classifier, and a custom database containing complete sequences of chromosomes and plasmids from various curated databases, PlaScope classifies contigs from an assembly according to their predicted location. Compared to other plasmid classifiers, PlasFlow and cBar, it achieves better recall (0.87), specificity (0.99), precision (0.96) and accuracy (0.98) on a dataset of 70 genomes of Escherichia coli containing plasmids. In a second part, we identified 20 of the 21 chromosomal integrations of the extended-spectrum beta-lactamase coding gene in a clinical dataset of E. coli strains. In addition, we predicted virulence gene and operon locations in agreement with the literature. We also built a database for Klebsiella and correctly assigned the location for the majority of resistance genes from a collection of 12 Klebsiella pneumoniae strains. Similar approaches could also be developed for other well-characterized bacteria.
-
-
-
PlasmidTron: assembling the cause of phenotypes and genotypes from NGS data
Increasingly rich metadata are now being linked to samples that have been whole-genome sequenced. However, much of this information is ignored. This is because linking this metadata to genes, or regions of the genome, usually relies on knowing the gene sequence(s) responsible for the particular trait being measured and looking for its presence or absence in that genome. Examples of this would be the spread of antimicrobial resistance genes carried on mobile genetic elements (MGEs). However, although it is possible to routinely identify the resistance gene, identifying the unknown MGE upon which it is carried can be much more difficult if the starting point is short-read whole-genome sequence data. The reason for this is that MGEs are often full of repeats and so assemble poorly, leading to fragmented consensus sequences. Since mobile DNA, which can carry many clinically and ecologically important genes, has a different evolutionary history from the host, its distribution across the host population will, by definition, be independent of the host phylogeny. It is possible to use this phenomenon in a genome-wide association study to identify both the genes associated with the specific trait and also the DNA linked to that gene, for example the flanking sequence of the plasmid vector on which it is encoded, which follows the same patterns of distribution as the marker gene/sequence itself. We present PlasmidTron, which utilizes the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographical information, to identify traits that are likely to be present on DNA that can randomly reassort across defined bacterial populations. It is also possible to use this methodology to associate unknown genes/sequences (e.g. plasmid backbones) with a specific molecular signature or marker (e.g. resistance gene presence or absence) using PlasmidTron. PlasmidTron uses a k-mer-based approach to identify reads associated with a phylogenetically unlinked phenotype. These reads are then assembled de novo to produce contigs in a fast and scalable-to-large manner. PlasmidTron is written in Python 3 and is available under the open source licence GNU GPL3 from https://github.com/sanger-pathogens/plasmidtron.
-
-
-
Patchy promiscuity: machine learning applied to predict the host specificity of Salmonella enterica and Escherichia coli
More LessSalmonella enterica and Escherichia coli are bacterial species that colonize different animal hosts with sub-types that can cause life-threatening infections in humans. Source attribution of zoonoses is an important goal for infection control as is identification of isolates in reservoir hosts that represent a threat to human health. In this study, host specificity and zoonotic potential were predicted using machine learning in which Support Vector Machine (SVM) classifiers were built based on predicted proteins from whole genome sequences. Analysis of over 1000 S. enterica genomes allowed the correct prediction (67 –90 % accuracy) of the source host for S. Typhimurium isolates and the same classifier could then differentiate the source host for alternative serovars such as S. Dublin. A key finding from both phylogeny and SVM methods was that the majority of isolates were assigned to host-specific sub-clusters and had high host-specific SVM scores. Moreover, only a minor subset of isolates had high probability scores for multiple hosts, indicating generalists with genetic content that may facilitate transition between hosts. The same approach correctly identified human versus bovine E. coli isolates (83 % accuracy) and the potential of the classifier to predict a zoonotic threat was demonstrated using E. coli O157. This research indicates marked host restriction for both S. enterica and E. coli, with only limited isolate subsets exhibiting host promiscuity by gene content. Machine learning can be successfully applied to interrogate source attribution of bacterial isolates and has the capacity to predict zoonotic potential.
-
-
-
PhagePhisher: a pipeline for the discovery of covert viral sequences in complex genomic datasets
More LessObtaining meaningful viral information from large sequencing datasets presents unique challenges distinct from prokaryotic and eukaryotic sequencing efforts. The difficulties surrounding this issue can be ascribed in part to the genomic plasticity of viruses themselves as well as the scarcity of existing information in genomic databases. The open-source software PhagePhisher (http://www.putonti-lab.com/phagephisher) has been designed as a simple pipeline to extract relevant information from complex and mixed datasets, and will improve the examination of bacteriophages, viruses, and virally related sequences, in a range of environments. Key aspects of the software include speed and ease of use; PhagePhisher can be used with limited operator knowledge of bioinformatics on a standard workstation. As a proof-of-concept, PhagePhisher was successfully implemented with bacteria–virus mixed samples of varying complexity. Furthermore, viral signals within microbial metagenomic datasets were easily and quickly identified by PhagePhisher, including those from prophages as well as lysogenic phages, an important and often neglected aspect of examining phage populations in the environment. PhagePhisher resolves viral-related sequences which may be obscured by or imbedded in bacterial genomes.
-