Unlocking the Potential of Weight of Evidence and Entity Embedding
Encoding for Categorical Data Transformation in Medical Datasets: An
Innovative Approach to Enhance Classification Accuracy
Abstract
Healthcare systems today grapple with substantial
volumes of medical data. However, a significant portion of this data is
incomplete, inconsistent, erroneous, or otherwise unsuitable for
training Machine Learning (ML) or Deep Learning (DL) algorithms. The
data must therefore be preprocessed before ML/DL algorithms can use it.
Medical datasets predominantly
feature two types of attributes: numerical and categorical. The
conversion of categorical features into numerical vectors, known as
Feature Engineering (FE) based categorical encoding, is a crucial
step in preparing the data for ML/DL algorithms. The conventional and
straightforward approach, one-hot encoding, generates one column per
category, thereby transforming data from a lower-dimensional to a
higher-dimensional space. This approach poses challenges, including
increased memory requirements due to the proliferation of
columns. Considering these issues, this research
proposes an encoding technique named “Weight of Evidence with Entity
Embedding” (WoEEE). The WoEEE approach bolsters the predictive
capabilities of ML/DL algorithms by calculating the weight of evidence
and concurrently mitigates dimensionality issues. To empirically
validate the proposed method, it is tested on six diverse datasets:
Breast Cancer, Hospital Readmission, Vadu, Covid-19, Stroke, and
Heartstatlog. Four ML/DL algorithms are employed for testing:
Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and a
simple feed-forward Neural Network (NN). The results
demonstrate that the WoEEE approach yields average
improvements of 11.18%, 10.37%, 5.83%, 7.58%, 7.83%, and 6.83%
across all combinations of datasets, classifiers, and encoding
methods. Furthermore, an ANOVA test is performed to confirm the
effectiveness of WoEEE in encoding categorical data, especially for
binary classification tasks. This improves the treatment of
categorical data in ML and data analytics scenarios. Overall, WoEEE
shows potential as a valuable approach for categorical data encoding
and contributes to the development of effective techniques
for handling this type of data in real-world applications.
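
To make the contrast between the two encodings concrete, the following is a minimal sketch rather than the WoEEE implementation, which the abstract does not specify. It assumes the common definition of weight of evidence for a binary target, ln(P(category | y = 1) / P(category | y = 0)); some fields use the opposite sign convention. The `smoothing` constant, the function names, and the toy `smoker`/`outcome` data are all hypothetical and chosen only for illustration. The entity-embedding stage is not shown.

```python
import math
from collections import Counter


def weight_of_evidence(categories, labels, smoothing=0.5):
    """Map each category to ln(P(category | y=1) / P(category | y=0)).

    `smoothing` is an illustrative additive constant (an assumption, not
    taken from the paper) that keeps the ratio finite when a category
    appears in only one class.
    """
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    levels = sorted(set(categories))
    n_pos = sum(pos.values()) + smoothing * len(levels)
    n_neg = sum(neg.values()) + smoothing * len(levels)
    return {
        c: math.log(((pos[c] + smoothing) / n_pos) / ((neg[c] + smoothing) / n_neg))
        for c in levels
    }


def one_hot(categories):
    """Return one 0/1 indicator column per distinct category, row by row."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]


# Hypothetical toy data: one categorical feature and a binary outcome.
smoker = ["never", "former", "current", "current",
          "never", "former", "current", "never"]
outcome = [0, 0, 1, 1, 0, 1, 1, 0]

woe = weight_of_evidence(smoker, outcome)
encoded = [woe[c] for c in smoker]   # a single numeric column
one_hot_rows = one_hot(smoker)       # one column per distinct category

print("one-hot columns:", len(one_hot_rows[0]))
print("WoE columns: 1")
print("WoE mapping:", woe)
```

In this sketch the one-hot representation grows with the number of distinct categories, while the weight-of-evidence column stays a single numeric feature whose sign indicates whether a category is more common among positive or negative cases.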