The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function, and even of anthropogenic perturbations such as the widespread use of antimicrobials. Whilst these archives are rich in data, considerable processing is required before a biological question can be addressed. Here, we have assembled, quality-controlled and characterised 661,405 bacterial genomes that were in the European Nucleotide Archive (ENA) at the end of November 2018, using a uniform, standardised approach. We have also produced a searchable index that allows the entire dataset to be easily interrogated for a specific gene or mutation. Our analysis shows how uneven the species composition of this database is, with just 20 of the total 2,336 species making up 90% of the high-quality genomes. The over-represented species tend to be acute or common human pathogens, often aligning with research priorities at different levels: from individual researchers with targeted, focussed research questions, through areas of focus for funding bodies or national public health agencies, to pathogens identified globally as priorities by the WHO for their resistance to front- and last-line antimicrobials. Whilst this rich resource often provides the context or reference for multi-‘omic’ studies and supports discovery research in many domains, understanding the actual and potential biases in the bacterial diversity depicted in this snapshot, and hence in the data being submitted to the public sequencing archives, is essential if we are to target and fill gaps in our understanding of the bacterial kingdom.

This is an open-access article distributed under the terms of the Creative Commons Attribution License.
