Diabetes Risk Prediction Using Feature Selection Algorithms and Advanced Machine Learning Models
Abstract
Diabetes is a chronic metabolic disorder that affects millions of people worldwide, and its rising prevalence places a substantial burden on healthcare systems. Early and accurate prediction of diabetes can support timely intervention and disease management. This study investigates the efficacy of machine learning algorithms in predicting diabetes using the Sylhet Diabetes Hospital dataset, which consists of clinical records from 520 patients. Several feature selection methods, including Pearson correlation analysis, a Genetic Algorithm, the Chi-Square test, and Recursive Feature Elimination (RFE), were employed to identify the predictors most strongly associated with diabetes onset. Five machine learning models, namely Support Vector Machine (SVM), K-Nearest Neighbours (KNN), Decision Tree (DT), Random Forest (RF), and Logistic Regression (LR), were trained and validated using cross-validation. Among these models, Random Forest exhibited the highest predictive performance, achieving an accuracy of 94.70% and an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) score of 98.22%. This indicates a superior ability to differentiate between diabetic and non-diabetic cases and makes it the most dependable of the evaluated models for diabetes prediction. These outcomes highlight the potential of machine learning in enhancing diabetes diagnosis and risk assessment. Future work includes integrating deep learning techniques and expanding the dataset to improve generalizability and predictive performance.
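As a rough illustration of how one of the feature selection methods (RFE) can be combined with a Random Forest classifier under cross-validation, the following minimal sketch uses scikit-learn. It is not the study's actual code. The data are a synthetic stand-in generated with `make_classification` (520 samples, 16 features), and the number of retained features, the number of trees, and the five-fold split are all assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the clinical records (assumption: NOT the real dataset).
X, y = make_classification(
    n_samples=520, n_features=16, n_informative=8, random_state=0
)

# Feature selection sits inside the pipeline so that it is refit on each
# training fold and never sees the corresponding validation fold.
pipeline = Pipeline([
    # Assumed setting: keep 10 of the 16 features.
    ("select", RFE(RandomForestClassifier(n_estimators=50, random_state=0),
                   n_features_to_select=10)),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Assumed protocol: stratified five-fold cross-validation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(pipeline, X, y, cv=cv, scoring=["accuracy", "roc_auc"])

mean_accuracy = scores["test_accuracy"].mean()
mean_auc = scores["test_roc_auc"].mean()
print(f"accuracy={mean_accuracy:.3f}  AUC-ROC={mean_auc:.3f}")
```

Because the data are synthetic, the printed scores are not comparable to the figures reported above. Swapping in another selector or classifier only requires replacing the corresponding pipeline step.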