
Abstract

The rapidly falling cost of bacterial genome sequencing has led to its routine use in large-scale microbial analysis. Although mapping approaches can be used to find differences relative to a reference, many bacteria are subject to constant evolutionary pressures that result in events such as the loss and gain of mobile genetic elements, horizontal gene transfer through recombination, and genomic rearrangements. De novo assembly, the reconstruction of the underlying genome sequence, is an essential step towards understanding bacterial genome diversity. Here we present a high-throughput bacterial assembly and improvement pipeline that has been used to generate nearly 20 000 annotated draft genome assemblies in public databases. We demonstrate its performance on a public dataset of 9404 genomes. We find all the genes used in multi-locus sequence typing schemes present in 99.6 % of assembled genomes. When tested on low-, neutral- and high-GC organisms, more than 94 % of genes were present and completely intact. The pipeline has proved scalable and robust across a wide variety of datasets without requiring human intervention. All of the software is available on GitHub under the GNU GPL open source license.

/content/journal/mgen/10.1099/mgen.0.000083
2016-08-25
2019-10-15


Supplementary File 1

Supplementary File 2
