Unsupervised clustering
Asthma phenotypes were derived using Deep Embedded
Clustering (DEC) 13. DEC combines deep learning, a branch of machine
learning, with clustering, allowing complex patterns in the data to be
discovered and providing a robust, scalable solution for
clustering large datasets without the need for labelled data 13. This makes it particularly valuable for
applications where the true cluster structure is unknown or hard to
define a priori 13. DEC has an advantage over
traditional clustering methods because of its ability to learn a
lower-dimensional representation (feature space) of data using deep
autoencoders. This feature space is better suited to clustering because
it is compact and low-dimensional, allowing DEC to
outperform traditional methods that either do not involve feature
learning or rely on simpler, linear dimensionality reduction techniques 13. In addition, DEC’s iterative optimization process
uses distance metrics to optimize the feature
representation and the cluster assignments jointly, which traditional
methods, such as k-means or spectral clustering, cannot do 13. These qualities make DEC particularly effective
for complex datasets, offering improved clustering accuracy and efficiency
in handling large datasets.
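The joint optimization described above can be illustrated with a minimal sketch of the two quantities at the core of DEC as originally published: a Student's t soft assignment of each embedded point to each cluster centre, and a sharpened target distribution towards which the assignments are pulled by minimizing a Kullback-Leibler divergence. The toy embeddings, centroids, and function names below are illustrative assumptions and are not taken from this study; the autoencoder and the gradient updates are omitted.

```python
import math


def soft_assign(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment q[i][j] of embedded point i to cluster j."""
    q = []
    for z in embeddings:
        row = []
        for mu in centroids:
            # Squared Euclidean distance in the learned feature space.
            d2 = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q


def target_distribution(q):
    """Sharpened targets p[i][j], proportional to q[i][j]^2 / f_j."""
    n_clusters = len(q[0])
    # Soft cluster frequencies f_j = sum_i q[i][j].
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        w = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(w)
        p.append([v / total for v in w])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all points and clusters (the DEC clustering loss)."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0
    )


# Hypothetical 2-D embeddings and two cluster centres (illustration only).
embeddings = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]

q = soft_assign(embeddings, centroids)
p = target_distribution(q)
labels = [max(range(len(row)), key=row.__getitem__) for row in q]
loss = kl_divergence(p, q)
```

In a full implementation, the gradient of this loss would be propagated to both the cluster centres and the encoder weights, which is how the feature representation and the cluster assignments are refined together.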
After the data were processed, the R package NbClust was used to determine
the optimal number of clusters using voting consensus methods 14. Additionally, the optimal number of clusters was
confirmed using the Monte Carlo reference-based consensus clustering
approach 15, implemented through the M3C R package 16. The output was then fed into the DEC algorithm
to perform the clustering. The clusters were later validated using the
prediction strength approach. The cluster numbers proposed by these metrics
were then evaluated in conjunction with clinical experience before the
optimal number of clusters to represent the data was finalized.
The resulting clusters were then named
based on their distribution with regard to the variables used to derive
them. A detailed statistical implementation is presented in the Supplementary file.
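As an illustration of the voting consensus step described above, the following minimal sketch picks the number of clusters recommended by the largest number of validity indices. It is not the NbClust implementation: the recommended values are hypothetical and do not come from this study, the function name is an assumption, and breaking ties in favour of the smaller number of clusters is a choice made only for this example.

```python
from collections import Counter


def consensus_k(recommendations):
    """Return the most frequently recommended number of clusters and the vote counts.

    `recommendations` maps an index name to the number of clusters it proposes.
    Ties are broken in favour of the smaller k (an assumption of this sketch).
    """
    votes = Counter(recommendations.values())
    top = max(votes.values())
    best_k = min(k for k, count in votes.items() if count == top)
    return best_k, votes


# Hypothetical recommendations from five validity indices (illustration only).
recommendations = {"ch": 3, "silhouette": 3, "gap": 4, "dunn": 2, "db": 3}
best_k, votes = consensus_k(recommendations)
```

The vote counts are returned alongside the winning value so that a close vote can be reviewed against clinical judgement rather than accepted automatically.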