Algorithm-Level Data-Guided Correction for Class Imbalance in Biological Machine Learning Predictions: Protein Interactions as a Case
Abstract
Introduction: In real-world biomedical applications of data mining, machine learning, and artificial intelligence, the
widespread problem of class imbalance cannot always be addressed by data-level methods such as over- or
under-sampling. Correct and efficient use of algorithm-level methods, on the other hand, requires attention to data structure
and content. This study aims to devise and examine simple methods for addressing imbalanced class distributions
in predicting protein-protein interaction (PPI) sites in membrane proteins, used here as a biomedical case study.
Methods: Using an adopted dataset of membrane protein complexes and a retrieved validation set, a class-weighted
random forests (CWRF) classifier model was built for predicting interfacial residues from positional frequencies and an
evolutionary index.
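
As a minimal sketch of what algorithm-level class weighting can look like, the snippet below derives per-class weights from the label distribution alone. It assumes a common inverse-frequency heuristic, n_samples / (n_classes * class_count); this is not necessarily the weighting scheme used in the study, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}.

    Assumption: a generic inverse-frequency heuristic, not the
    study's own weighting method.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels: 1 = interfacial residue (minority), 0 = non-interfacial.
labels = [1] * 20 + [0] * 80
weights = inverse_frequency_weights(labels)

# Each class now contributes the same total weight to training.
total_per_class = {cls: weights[cls] * labels.count(cls) for cls in weights}
print(weights)
print(total_per_class)
```

A dictionary of this shape could, for example, be passed to the `class_weight` parameter of scikit-learn's `RandomForestClassifier`, so that errors on the rare class are penalized more heavily without resampling the data.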
Results: Among several class weighting methods, a data imbalance-emulating weighting method for the CWRF model
achieved an area under the receiver operating characteristic curve (AUC) of 0.815 (95% CI: 0.805-0.823) on the independent
test set and 0.802 (95% CI: 0.794-0.809) on the external validation set, outperforming
previous comparable studies. A case prediction confirmed the practical utility of this method.
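
To illustrate how an AUC and its 95% confidence interval of the kind reported above can be obtained, here is a small sketch using the rank-based (Mann-Whitney) definition of AUC and a percentile bootstrap. The abstract does not state how the intervals were computed, so the bootstrap is an assumption, and all names and toy data are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # skip resamples that lost one class
            continue
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical toy predictions: imbalanced labels with overlapping scores.
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.55, 0.7, 0.5, 0.6, 0.9]

point = auc(toy_labels, toy_scores)
lower, upper = bootstrap_ci(toy_labels, toy_scores)
print(f"AUC = {point:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
```

AUC is a natural summary here because it depends on how positives are ranked relative to negatives rather than on a fixed decision threshold, which makes it less sensitive to the class ratio than plain accuracy.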
Conclusion: The proposed approach has potential applications in other fields of biomedicine and beyond. It also
highlights the role of algorithm-data interplay in addressing class imbalance.
Issue: Vol 11 No 3 (2025)
Section: Articles
Keywords: Machine Learning; Bioinformatics; Statistical Bias; Random Forests; Protein-Protein Interaction Domains
Rights and permissions: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

