Unlocking the Potential of Weight of Evidence and Entity Embedding Encoding for Categorical Data Transformation in Medical Datasets: An Innovative Approach to Enhance Classification Accuracy

M. Anitha, ME 1∗,
S. Nickolas, PhD 1∗,
S. Mary Saira Bhanu, PhD 2†,
S. Gayathiri, ME 1‡

Abstract

In the present era, healthcare systems grapple with substantial volumes of medical data. However, a significant portion of this data is incomplete, inconsistent, erroneous, or otherwise unsuitable for training Machine Learning (ML) or Deep Learning (DL) algorithms, so it must be preprocessed before such algorithms can use it. Medical datasets predominantly feature two types of attributes: numerical and categorical. Converting categorical features into numerical vectors, known as Feature Engineering (FE) based categorical encoding, is a crucial step in preparing the data for ML/DL algorithms. The conventional and straightforward encoding of categorical features, termed one-hot encoding, generates multiple columns, thereby transforming data from a lower-dimensional to a higher-dimensional space. This approach poses challenges, including increased memory requirements due to the proliferation of columns. To address these issues, this research proposes an encoding technique named “Weight of Evidence with Entity Embedding” (WoEEE). The WoEEE approach bolsters the predictive capabilities of ML/DL algorithms by calculating the weight of evidence while concurrently mitigating dimensionality issues. To empirically validate the proposed method, it is tested on six diverse datasets: Breast Cancer, Hospital Readmission, Vadu, Covid-19, Stroke, and Heartstatlog. Four distinct ML/DL algorithms are employed for testing: Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and a simple Feed-forward Neural Network (NN). The results demonstrate that the WoEEE approach yields average improvements of 11.18%, 10.37%, 5.83%, 7.58%, 7.83%, and 6.83% across all combinations of datasets, classifiers, and encoding methods. Furthermore, an ANOVA test is performed to confirm the effectiveness of WoEEE in encoding categorical data, especially for tasks involving binary classification. This enhances the treatment of categorical data in ML and data analytics scenarios. Overall, WoEEE shows potential as a valuable approach for categorical data encoding, making a positive contribution to the creation of effective techniques for handling this type of data in real-world applications.
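
As a rough illustration of the contrast drawn in the abstract, the sketch below computes a standard weight-of-evidence value per category for a binary target and compares the resulting single column with the number of columns one-hot encoding would need. It is not the authors' WoEEE implementation, and the entity embedding step is omitted entirely. The function names `woe_encode` and `one_hot_width`, the additive `smoothing` constant, the sign convention (positives over negatives), and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def woe_encode(categories, labels, smoothing=0.5):
    """Return {category: WoE}, with WoE = ln(share of positives / share of negatives).

    `smoothing` is an assumed additive constant that keeps the ratio finite
    when a category never appears in one of the two classes.
    """
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    levels = sorted(set(categories))
    total_pos = sum(pos.values()) + smoothing * len(levels)
    total_neg = sum(neg.values()) + smoothing * len(levels)
    woe = {}
    for level in levels:
        share_pos = (pos[level] + smoothing) / total_pos
        share_neg = (neg[level] + smoothing) / total_neg
        woe[level] = math.log(share_pos / share_neg)
    return woe


def one_hot_width(categories):
    """Number of columns one-hot encoding would create for this feature."""
    return len(set(categories))


# Hypothetical toy feature and binary outcome, invented for illustration only.
smoking = ["never", "former", "current", "never",
           "current", "former", "never", "current"]
outcome = [0, 0, 1, 0, 1, 1, 0, 1]

woe_map = woe_encode(smoking, outcome)
encoded = [woe_map[c] for c in smoking]  # one numeric column, whatever the cardinality

print("WoE per category:", woe_map)
print("One-hot columns:", one_hot_width(smoking), "| WoE columns: 1")
```

A category seen mostly with the positive class gets a positive value, and one seen mostly with the negative class gets a negative value, so the feature stays in a single column regardless of how many distinct categories it has.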