Abdussamad

and 5 more

Early and accurate prediction of Type 2 diabetes is crucial for effective intervention. However, extracting meaningful insights from high-dimensional datasets with sparse values remains challenging, because sparsity and redundant features often impede traditional machine learning algorithms' ability to detect informative patterns. While conventional Stacked Sparse Autoencoders (SSAEs) can identify key features in dense data, they generally struggle with high-dimensional sparse data, which lowers classification accuracy. To address this, the study introduces a Hybrid Stacked Sparse Autoencoder (HSSAE) model designed for robust feature extraction and classification in sparse data environments. The architecture integrates L1 and L2 regularization within a binary cross-entropy loss function and uses dropout and batch normalization to improve generalization and training stability. The HSSAE was evaluated both with a sigmoid classification layer and as a feature extractor for several traditional machine learning classifiers. With the sigmoid layer, the model achieved 89% accuracy and an F1-score of 0.89. It also outperformed baseline models when paired with traditional classifiers; notably, HSSAE+KNN reached an F1-score of 0.91, a recall of 0.98, 90% accuracy, and the lowest Hamming loss of 0.10. The baselines for comparison were Logistic Regression, K-Nearest Neighbors, Naïve Bayes, AdaBoost, and XGBoost, applied directly to the pre-processed dataset. An ablation study also tested these classifiers on features extracted by the conventional SSAE. In both scenarios, the HSSAE demonstrated superior performance across all metrics. These results highlight the HSSAE's effectiveness in extracting discriminative features from sparse, high-dimensional data and its potential for clinical decision support systems that demand high accuracy and reliability.
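The loss described above, binary cross-entropy combined with L1 and L2 regularization, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say where each penalty is applied, so the sketch assumes an L1 penalty on the hidden activations (to encourage sparsity) and an L2 penalty on the weights. The function names `bce` and `hybrid_loss` and the coefficients `l1` and `l2` are hypothetical.

```python
import math


def bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(y_true)


def hybrid_loss(y_true, y_pred, hidden, weights, l1=1e-4, l2=1e-4):
    """BCE plus an L1 term on hidden activations and an L2 term on weights.

    Assumption: the placement of the two penalties and the default values
    of l1 and l2 are illustrative choices, not taken from the paper.
    `hidden` and `weights` are flat sequences of floats.
    """
    l1_term = l1 * sum(abs(h) for h in hidden)   # sparsity penalty
    l2_term = l2 * sum(w * w for w in weights)   # weight-decay penalty
    return bce(y_true, y_pred) + l1_term + l2_term


if __name__ == "__main__":
    # Toy values, made up purely to show the call.
    y_true = [1.0, 0.0, 1.0]
    y_pred = [0.9, 0.2, 0.7]
    hidden = [0.0, 0.5, 0.0, 1.2]
    weights = [0.3, -0.1, 0.8]
    print(hybrid_loss(y_true, y_pred, hidden, weights))
```

In a deep learning framework the two penalty terms would normally be attached to the layers as regularizers rather than computed by hand; the explicit form here only shows how the three terms add up.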