Volume 5, Issue 12, 2019
- Review
- Genomic Methodologies
- Genome-phenotype Association
A guide to machine learning for bacterial host attribution using genome sequence data
With the ever-expanding number of available bacterial genome sequences, and the expectation that this data type will be the primary one generated by both diagnostic and research laboratories for the foreseeable future, there is both an opportunity and a need to evaluate how effectively computational approaches can be used within bacterial genomics to predict and understand complex phenotypes, such as pathogenic potential and host source. This article applied various quantitative methods, such as diversity indexes, pangenome-wide association studies (GWAS) and dimensionality reduction techniques, to better understand the data and then compared how well unsupervised and supervised machine learning (ML) methods could predict the source host of the isolates. The study uses the pangenomes of 1203 Salmonella enterica serovar Typhimurium isolates as a worked example, predicting 'host of isolation' with each of these methods. The article is intended as a review of recent applications of ML in infection biology but, by working through this specific dataset, it also allows discussion of the advantages and drawbacks of the different techniques. As with all such sub-population studies, the biological relevance will depend on the quality and diversity of the input data. Given this major caveat, we show that supervised ML has the potential to add real value to the interpretation of bacterial genomic data, as it can provide probabilistic outcomes for important phenotypes, something that is very difficult to achieve with the other methods.
- Research Article
- Genomic Methodologies
- Genome-phenotype Association
Whole-genome sequencing of dog-specific assemblages C and D of Giardia duodenalis from single and pooled cysts indicates host-associated genes
Giardia duodenalis (syn. Giardia intestinalis or Giardia lamblia) infects over 280 million people each year and numerous animals. G. duodenalis can be subdivided into eight assemblages with different host specificity. Unculturable assemblages have so far resisted genome sequencing efforts. In this study, we isolated single and pooled cysts of assemblages C and D from dog faeces by FACS, and sequenced them using multiple displacement amplification and Illumina paired-end sequencing. The genomes of assemblages C and D were compared with genomes of assemblages A and B from humans and assemblage E from ruminants and pigs. The genomes obtained from the pooled cysts and from the single cysts were considered complete (>99 % marker genes observed) and the allelic sequence heterozygosity (ASH) values of assemblages C and D were 0.89 and 0.74 %, respectively. These ASH values were slightly higher than for assemblage B (>0.43 %) and much higher than for assemblages A and E, which ranged from 0.002 to 0.037 %. The flavohaemoglobin and 4Fe-4S binding domain family encoding genes involved in O2 and NO detoxification were only present in assemblages A, B and E. Cathepsin B orthologs were found in all genomes. Six clades of cathepsin B orthologs contained one gene of each genome, while in three clades not all assemblages were represented. We conclude that whole-genome sequencing from a single Giardia cyst results in complete draft genomes, making the genomes of unculturable Giardia assemblages accessible. Observed differences between the genomes of assemblages C and D on the one hand and assemblages A, B and E on the other are possibly associated with host specificity.
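As a rough illustration of how an ASH percentage like those quoted above could be computed, the sketch below assumes ASH is the share of callable genome positions at which the alleles differ. The function name `ash_percent` and the example counts are hypothetical and are not taken from the study.

```python
def ash_percent(heterozygous_sites: int, callable_sites: int) -> float:
    """Return heterozygous positions as a percentage of callable positions.

    Assumption: ASH is treated here as a simple per-site percentage; the
    study's own variant-calling and filtering steps are not reproduced.
    """
    if callable_sites <= 0:
        raise ValueError("callable_sites must be positive")
    if not 0 <= heterozygous_sites <= callable_sites:
        raise ValueError("heterozygous_sites must lie in [0, callable_sites]")
    return 100.0 * heterozygous_sites / callable_sites


# Hypothetical counts, chosen only to illustrate the arithmetic.
example_ash = ash_percent(heterozygous_sites=8_900, callable_sites=1_000_000)
print(f"ASH = {example_ash:.2f} %")
```

With these made-up counts, 8,900 heterozygous positions out of 1,000,000 callable positions corresponds to an ASH of 0.89 %.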
- Systems Microbiology
- Large-scale Comparative Genomics
EvoMining reveals the origin and fate of natural product biosynthetic enzymes
Natural products (NPs), or specialized metabolites, are important for medicine and agriculture alike, and for the fitness of the organisms that produce them. NP genome-mining aims at extracting biosynthetic information from the genomes of microbes presumed to produce these compounds. Typically, canonical enzyme sequences from known biosynthetic systems are identified after sequence similarity searches. Despite this being an efficient process, the likelihood of identifying truly novel systems by this approach is low. To overcome this limitation, we previously introduced EvoMining, a genome-mining approach that incorporates evolutionary principles. Here, we release and use our latest EvoMining version, which includes novel visualization features and customizable databases, to analyse 42 central metabolic enzyme families (EFs) conserved throughout Actinobacteria, Cyanobacteria, Pseudomonas and Archaea. We found that expansion-and-recruitment profiles of these 42 families are lineage specific, opening the metabolic space related to ‘shell’ enzymes. These enzymes, which have been overlooked, are EFs with orthologues present in most of the genomes of a taxonomic group, but not in all. As a case study of canonical shell enzymes, we characterized the expansion and recruitment of glutamate dehydrogenase and acetolactate synthase into scytonemin biosynthesis, and into other central metabolic pathways driving Archaea and Bacteria adaptive evolution. By defining the origin and fate of enzymes, EvoMining complements traditional genome-mining approaches as an unbiased strategy and opens the door to gaining insights into the evolution of NP biosynthesis. We anticipate that EvoMining will be broadly used for evolutionary studies, and for generating predictions of unprecedented chemical scaffolds and new antibiotics. This article contains data hosted by Microreact.
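The ‘shell’ definition above (orthologues present in most, but not all, genomes of a taxonomic group) can be sketched as a simple presence/absence rule. This is not EvoMining's implementation: the function name, the genome labels and the 50 % cut-off used for "most" are all assumptions made for illustration.

```python
from typing import Mapping


def classify_family(presence: Mapping[str, bool], shell_min: float = 0.5) -> str:
    """Label an enzyme family from its presence/absence across genomes.

    Assumptions: "core" means present in every genome, and "shell" means
    present in at least `shell_min` of the genomes but not all of them.
    The 0.5 default is an illustrative threshold, not a published value.
    """
    total = len(presence)
    if total == 0:
        raise ValueError("presence must contain at least one genome")
    present = sum(1 for found in presence.values() if found)
    if present == total:
        return "core"
    if present / total >= shell_min:
        return "shell"
    return "other"


# Hypothetical presence/absence calls for one enzyme family in four genomes.
calls = {"genome_1": True, "genome_2": True, "genome_3": True, "genome_4": False}
print(classify_family(calls))
```

Here the family is found in three of four genomes, so it falls in the "shell" category under the assumed threshold; raising `shell_min` makes the category stricter.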