NonClasGP-Pred: robust and efficient prediction of non-classically secreted proteins by integrating subset-specific optimal models of imbalanced data

Chao Wang; Jin Wu; Lei Xu; Quan Zou

doi:10.1099/mgen.0.000483

Volume 6, Issue 12

Method

Open Access

Chao Wang¹,†, Jin Wu²,†, Lei Xu³ and Quan Zou¹,⁴

Affiliations:
¹ Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, PR China
² School of Management, Shenzhen Polytechnic, Shenzhen, PR China
³ School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, PR China
⁴ Hainan Key Laboratory for Computational Science and Application, Hainan Normal University, Haikou, PR China
*Correspondence: Lei Xu, [email protected]
*Correspondence: Quan Zou, [email protected]

† These authors contributed equally to this work
Published: 27 November 2020 https://doi.org/10.1099/mgen.0.000483

Abstract

Non-classically secreted proteins (NCSPs) are proteins that are located in the extracellular environment even though they lack known signal peptides or secretion motifs. They usually perform different biological functions in intracellular and extracellular environments, and several of these functions are linked to bacterial virulence and cell defence. Accurate protein localization is essential for all living organisms; however, the performance of existing methods for NCSP identification has been unsatisfactory, and in particular these methods suffer from data deficiency and possible overfitting. Further improvement is desirable, especially in addressing the lack of informative features and in mining subset-specific features from imbalanced datasets. In the present study, a new computational predictor was developed for NCSP prediction in Gram-positive bacteria. First, to address the possible prediction bias caused by data imbalance, ten balanced subdatasets were generated for ensemble model construction. Second, the F-score algorithm combined with sequential forward search was used to strengthen the feature representation ability for each of the training subdatasets. Third, a subset-specific optimal feature combination process was adopted to characterize the original data from different aspects, and all subdataset-based models were integrated into a unified model, NonClasGP-Pred, which achieved an accuracy of 93.23 %, a sensitivity of 100 %, a specificity of 89.01 %, a Matthews correlation coefficient of 87.68 % and an area under the curve value of 0.9975 in ten-fold cross-validation. In an assessment on the independent test dataset, the proposed model outperformed state-of-the-art available toolkits. For availability and implementation, see: http://lab.malab.cn/~wangchao/softwares/NonClasGP/.
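
The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the sampling scheme, the exact F-score formula, the search procedure or the base classifier. The sketch assumes random undersampling of the majority class, the commonly used Fisher-style F-score, an incremental search along the F-score ranking as one reading of "sequential forward search", and score averaging for model integration. All function names are hypothetical.

```python
import random
from statistics import mean

def balanced_subsets(pos, neg, n_subsets=10, seed=0):
    """Pair every minority-class sample with an equal-sized random draw of
    majority-class samples, once per subset (assumed: random undersampling)."""
    rng = random.Random(seed)
    return [(list(pos), rng.sample(neg, len(pos))) for _ in range(n_subsets)]

def f_score(pos_vals, neg_vals):
    """Fisher-style F-score of one feature (assumed formula): between-class
    scatter divided by the sum of the within-class sample variances."""
    overall = mean(pos_vals + neg_vals)
    m_pos, m_neg = mean(pos_vals), mean(neg_vals)
    between = (m_pos - overall) ** 2 + (m_neg - overall) ** 2
    var_pos = sum((x - m_pos) ** 2 for x in pos_vals) / (len(pos_vals) - 1)
    var_neg = sum((x - m_neg) ** 2 for x in neg_vals) / (len(neg_vals) - 1)
    within = var_pos + var_neg
    return between / within if within > 0 else 0.0

def rank_features(pos, neg):
    """Return feature indices sorted from highest to lowest F-score."""
    n_features = len(pos[0])
    scores = [f_score([row[j] for row in pos], [row[j] for row in neg])
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def forward_search(ranked, evaluate):
    """Grow the feature set one ranked feature at a time and keep the
    shortest prefix with the best score from `evaluate(feature_indices)`."""
    best_score, best_k = float("-inf"), 0
    for k in range(1, len(ranked) + 1):
        score = evaluate(ranked[:k])
        if score > best_score:
            best_score, best_k = score, k
    return ranked[:best_k]

def ensemble_predict(models, x, threshold=0.5):
    """Average the positive-class scores of all subset-specific models
    (assumed integration rule) and threshold the result."""
    avg = mean([model(x) for model in models])
    return 1 if avg >= threshold else 0
```

In use, `evaluate` would wrap cross-validation of whatever classifier is trained on a given subdataset, so each of the ten subdatasets can end up with its own feature subset before the resulting models are passed to `ensemble_predict`.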

Received: 14/07/2020
Accepted: 06/11/2020
Published Online: 27/11/2020

Keyword(s): feature selection, imbalanced dataset, machine learning, model ensemble, non-classically secreted proteins

Funding

This study was supported by the:

National Natural Science Foundation of China (Award No. 62002051, No. 61922020, No. 61902259, No. 61771331)

This is an open-access article distributed under the terms of the Creative Commons Attribution License.

M Gen 6, e000483 (2020); https://doi.org/10.1099/mgen.0.000483
