@article{mbs:/content/journal/jmm/10.1099/jmm.0.001474,
  author = "Liu, Wenjia and Ying, Nanjiao and Mo, Qiusi and Li, Shanshan and Shao, Mengjie and Sun, Lingli and Zhu, Lei",
  title = "Machine learning for identifying resistance features of {Klebsiella pneumoniae} using whole-genome sequence single nucleotide polymorphisms",
  journal = "Journal of Medical Microbiology",
  year = "2021",
  volume = "70",
  number = "11",
  doi = "10.1099/jmm.0.001474",
  url = "https://www.microbiologyresearch.org/content/journal/jmm/10.1099/jmm.0.001474",
  publisher = "Microbiology Society",
  issn = "1473-5644",
  type = "Journal Article",
  keywords = "machine learning, single nucleotide polymorphism, drug resistance, whole-genome sequence, Klebsiella pneumoniae, Codon Mutation Detection, Fast Feature Selection, resistance gene",
  eid = "001474",
  abstract = "Introduction. Klebsiella pneumoniae, a Gram-negative bacterium, is a common pathogen causing nosocomial infection. The drug-resistance rate of K. pneumoniae is increasing year by year, posing a severe threat to public health worldwide. K. pneumoniae has been listed as one of the pathogens causing the global crisis of antimicrobial resistance in nosocomial infections. We need to explore the drug resistance of K. pneumoniae for clinical diagnosis. Single nucleotide polymorphisms (SNPs) are of high density and have rich genetic information in whole-genome sequencing (WGS), which can affect the structure or expression of proteins. SNPs can be used to explore mutation sites associated with bacterial resistance. Hypothesis/Gap Statement. Machine learning methods can detect genetic features associated with the drug resistance of K. pneumoniae from whole-genome SNP data. Aims.
This work used Fast Feature Selection (FFS) and Codon Mutation Detection (CMD) machine learning methods to detect genetic features related to drug resistance of K. pneumoniae from whole-genome SNP data. Methods. WGS data on resistance of K. pneumoniae strains to four antibiotics (tetracycline, gentamicin, imipenem, amikacin) were downloaded from the European Nucleotide Archive (ENA). Sequence alignments were performed with MUMmer 3 to complete SNP calling using K. pneumoniae HS11286 chromosome as the reference genome. The FFS algorithm was applied to feature selection of the SNP dataset. The training set was constructed based on mutation sites with mutation frequency >0.995. Based on the original SNP training set, 70\% of SNPs were randomly selected from each dataset as the test set to verify the accuracy of the training results. Finally, the resistance genes were obtained by the CMD algorithm and Venny. Results. The number of strains resistant to tetracycline, gentamicin, imipenem and amikacin was 931, 1048, 789 and 203, respectively. Machine learning algorithms were applied to the SNP training set and test set, and 28 and 23 resistance genes were predicted, respectively. The 28 resistance genes in the training set included 22 genes in the test set, which verified the accuracy of gene prediction. Among them, some genes (KPHS\_35310, KPHS\_18220, KPHS\_35880, etc.) corresponded to known resistance genes (Eef2, lpxK, MdtC, etc.). Logistic regression classifiers were established based on the identified SNPs in the training set. The areas under the curve (AUCs) for the four antibiotics were 0.939, 0.950, 0.912 and 0.935, showing a strong ability to predict bacterial resistance. Conclusion. Machine learning methods can effectively be used to predict resistance genes and associated SNPs. The FFS and CMD algorithms have wide applicability. They can be used for the drug-resistance analysis of any microorganism with genomic variation and phenotypic data.
This work lays a foundation for resistance research in clinical applications."
}