Robust high-throughput prokaryote de novo assembly and improvement pipeline for Illumina data

Open Access

Abstract

The rapidly falling cost of bacterial genome sequencing has led to its routine use in large-scale microbial analysis. Although mapping approaches can be used to find differences relative to a reference, many bacteria are subject to constant evolutionary pressures that result in events such as the loss and gain of mobile genetic elements, horizontal gene transfer through recombination, and genomic rearrangements. De novo assembly, the reconstruction of the underlying genome sequence, is an essential step towards understanding bacterial genome diversity. Here we present a high-throughput bacterial assembly and improvement pipeline that has been used to generate nearly 20 000 annotated draft genome assemblies in public databases. We demonstrate its performance on a public dataset of 9404 genomes. All of the genes used in multi-locus sequence typing schemes were present in 99.6 % of assembled genomes. When tested on low-, neutral- and high-GC organisms, more than 94 % of genes were present and completely intact. The pipeline has proven scalable and robust across a wide variety of datasets without requiring human intervention. All of the software is available on GitHub under the GNU GPL open source license.
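
The multi-locus sequence typing figure quoted in the abstract is a completeness statistic: the share of assemblies in which every locus of the typing scheme is found. The sketch below shows one way such a figure could be computed. It is an illustration only, not the pipeline's own code; the function name, the sample names and the toy results are hypothetical, and the seven loci are used purely as an example scheme.

```python
from typing import Dict, Set


def fraction_with_all_loci(found: Dict[str, Set[str]], scheme: Set[str]) -> float:
    """Return the fraction of assemblies in which every scheme locus was found."""
    if not found:
        raise ValueError("no assemblies supplied")
    # An assembly counts as complete when the scheme is a subset of its found loci.
    complete = sum(1 for loci in found.values() if scheme <= loci)
    return complete / len(found)


# Example seven-locus scheme and toy per-assembly results (assumed values,
# not data from the paper).
SCHEME = {"arcC", "aroE", "glpF", "gmk", "pta", "tpi", "yqiL"}
FOUND = {
    "sample_1": set(SCHEME),
    "sample_2": set(SCHEME),
    "sample_3": SCHEME - {"pta"},  # one locus missing from this assembly
    "sample_4": set(SCHEME),
}

if __name__ == "__main__":
    print(f"{fraction_with_all_loci(FOUND, SCHEME):.1%}")
```

In practice the per-assembly locus sets would come from a typing tool run against each assembly; here they are hard-coded to keep the example self-contained.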

/content/journal/mgen/10.1099/mgen.0.000083
2016-08-25
2024-03-29


Supplements

Supplementary File 1

Supplementary File 2
