Prediction of prokaryotic transposases from protein features with machine learning approaches Open Access

Abstract

Identification of prokaryotic transposases (Tnps) gives insight not only into the spread of antibiotic resistance and virulence but also into the process of DNA movement. This study aimed to develop a classifier for predicting Tnps in bacteria and archaea using machine learning (ML) approaches. We extracted a total of 2751 protein features from a training dataset of 14852 Tnps and 14852 controls, and selected 75 features as predictive signatures using a combination of the mutual information and least absolute shrinkage and selection operator (LASSO) algorithms. By aggregating these signatures, we developed an ensemble classifier that integrates a collection of individual ML-based classifiers to identify Tnps. Further validation showed that this classifier achieved an average AUC of 0.955 and matched or outperformed other commonly used methods. Based on this ensemble classifier, we built a stand-alone command-line tool, TnpDiscovery, to make Tnp prediction convenient for bioinformaticians and experimental researchers. This study demonstrates the effectiveness of ML approaches in identifying Tnps and should facilitate the discovery of novel Tnps.
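The three ideas named in the abstract (ranking features by mutual information, combining several classifiers into an ensemble, and scoring by AUC) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation and not the TnpDiscovery code: the function names `mutual_information`, `soft_vote` and `auc`, the toy labels and the toy scores are all assumptions made for illustration. Averaging scores is only one possible way to combine classifiers, and the LASSO step is omitted.

```python
import math
from collections import Counter


def mutual_information(feature_bins, labels):
    """Mutual information (in nats) between a discretized feature and class labels."""
    n = len(labels)
    joint = Counter(zip(feature_bins, labels))
    p_x = Counter(feature_bins)
    p_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (p_x[x] * p_y[y]))
    return mi


def soft_vote(score_lists):
    """Average the per-sample scores produced by several base classifiers."""
    return [sum(scores) / len(scores) for scores in zip(*score_lists)]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = Tnp, 0 = control.
toy_labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.4, 0.7, 0.6, 0.2, 0.1]  # made-up output of base classifier A
scores_b = [0.6, 0.8, 0.3, 0.2, 0.5, 0.1]  # made-up output of base classifier B
ensemble_scores = soft_vote([scores_a, scores_b])

auc_a = auc(toy_labels, scores_a)
auc_b = auc(toy_labels, scores_b)
auc_ensemble = auc(toy_labels, ensemble_scores)

# A binned feature that mirrors the label carries log(2) nats of information;
# one that is balanced within each class carries none.
mi_informative = mutual_information([1, 1, 1, 0, 0, 0], toy_labels)
mi_uninformative = mutual_information([0, 1, 0, 1, 0, 1, 0, 1], [1, 1, 1, 1, 0, 0, 0, 0])

print(auc_a, auc_b, auc_ensemble)
print(mi_informative, mi_uninformative)
```

In this toy case each base classifier misranks one positive-negative pair, while the averaged scores rank every pair correctly. That is the intuition behind using an ensemble, although real data will rarely behave this cleanly.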

Funding
This study was supported by the:
  • Start-up funds from the First Affiliated Hospital of Wenzhou Medical University (Award 2018QD014)
    • Principal Award Recipient: Jianchao Ying
  • Science & Technology Project of Inner Mongolia Autonomous Region, China (Award 201802125)
    • Principal Award Recipient: Teng Xu
  • Fundamental Research Funds for the Zhejiang Provincial Universities (Award KYYW201919)
    • Principal Award Recipient: Jianchao Ying
  • Natural Science Foundation of Zhejiang Province (Award LQ20H150004)
    • Principal Award Recipient: Jianchao Ying
/content/journal/mgen/10.1099/mgen.0.000611
2021-07-26
2024-03-29

Supplements

Supplementary material 1
