Clustering of data
Variables were selected for inclusion in the clustering algorithm based on
their clinical relevance as indicators of illness severity. The selected
variables were as follows, with both minimum and maximum values included
unless otherwise specified: temperature, heart rate (maximum), SBP, blood
glucose, creatinine (maximum), haematocrit (minimum), sodium, WBC,
platelets (minimum), respiratory rate (maximum), oxygen saturation
(minimum), eGFR, and time from symptom onset to admission. Missing values
for all selected physiologic measures were imputed using the study
population mean, stratified by age group (18-49, 50-64, 65+) and hospital.
A table of the selected metrics can be found in Supplemental Table 1.
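The stratified mean imputation described above can be sketched as follows. This is an illustrative Python version, not the study's actual code; the field names (`age_group`, `hospital`, `temp_min`) are hypothetical placeholders for the study's schema.

```python
from statistics import mean

def impute_stratified_mean(records, variable, strata=("age_group", "hospital")):
    """Fill missing values (None) of `variable` with the mean of the
    observed values within each (age group, hospital) stratum."""
    # Collect observed values per stratum
    groups = {}
    for r in records:
        if r[variable] is not None:
            key = tuple(r[s] for s in strata)
            groups.setdefault(key, []).append(r[variable])
    stratum_means = {k: mean(v) for k, v in groups.items()}
    # Replace each missing value with its stratum mean
    for r in records:
        if r[variable] is None:
            r[variable] = stratum_means[tuple(r[s] for s in strata)]
    return records
```

For example, a patient aged 18-49 at hospital A with a missing minimum temperature would receive the mean minimum temperature of all 18-49-year-old patients at hospital A.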
Prior to the creation of clusters, the Hopkins statistic was used to
assess whether the distribution of the data departed from a uniform
distribution. Values near 0.5 for this statistic indicate data similar to
a uniform distribution, while values closer to 1 indicate the data may
contain clusters. The use of this statistic helps reduce the risk of a
machine learning algorithm detecting clusters where the data do not
actually contain any22.
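The Hopkins statistic compares nearest-neighbour distances from uniformly sampled points to the data against nearest-neighbour distances within the data itself. A minimal Python sketch, assuming Euclidean distance and a sample size of roughly 10% of the data (implementation details not specified by the study):

```python
import random

def hopkins(data, m=None, seed=0):
    """Hopkins statistic for clustering tendency (illustrative sketch).
    `data` is a list of equal-length numeric tuples. Values near 0.5
    suggest uniform data; values near 1 suggest clustering tendency."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    m = m or max(1, n // 10)
    lo = [min(p[j] for p in data) for j in range(d)]
    hi = [max(p[j] for p in data) for j in range(d)]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # u: nearest-neighbour distance from m uniform points (in the
    # data's bounding box) to the real data
    u = []
    for _ in range(m):
        q = tuple(rng.uniform(lo[j], hi[j]) for j in range(d))
        u.append(min(dist(q, p) for p in data))
    # w: nearest-neighbour distance from m sampled data points to the
    # rest of the data
    w = []
    for p in rng.sample(data, m):
        w.append(min(dist(p, q) for q in data if q is not p))
    return sum(u) / (sum(u) + sum(w))
```

Clustered data yield large u relative to w, pushing the statistic toward 1; uniform data yield comparable u and w, giving values near 0.5.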
Data were clustered separately for each influenza season using the
k-medoids Partitioning Around Medoids (PAM) algorithm with Manhattan
distance. Briefly, k-medoids clustering assigns observations to groups
based on their distance to a designated central data point, the medoid, of
each cluster23. At the start, the medoids are randomly assigned, and the
algorithm iterates through different selections of medoids and cluster
assignments until the total distance from each medoid to all other data
points in its cluster is minimized. K-medoids clustering is more robust in
the presence of outliers than other centre-based clustering algorithms
such as k-means, since each chosen medoid is an observed data point rather
than an average. Additionally, the algorithm assigns every observation to
a cluster; this is preferred in a cohort of hospitalized individuals,
where biologically plausible data outliers are of interest. The
appropriate number of clusters for a given season was chosen using the
largest average silhouette width, a measure of how similar each point is
to its own cluster compared with the nearest neighbouring cluster, with
one to a maximum of ten clusters tested.
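The PAM procedure can be sketched in Python as follows. This is a simplified illustration, not the study's implementation (the study used R's "pam" function, which uses a more refined build/swap search); the initial medoids here are simply the first k observations rather than a random draw.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two numeric tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pam(data, k, max_iter=100):
    """Simplified Partitioning Around Medoids. Medoids are actual
    observations; the total Manhattan distance from every point to its
    nearest medoid is greedily minimized by swapping medoids with
    non-medoids while the cost improves."""
    medoids = list(range(k))  # deterministic seed: first k points

    def total_cost(meds):
        return sum(min(manhattan(p, data[m]) for m in meds) for p in data)

    cost = total_cost(medoids)
    improved, it = True, 0
    while improved and it < max_iter:
        improved = False
        it += 1
        # SWAP step: try replacing each medoid with each non-medoid
        for mi in range(k):
            for h in range(len(data)):
                if h in medoids:
                    continue
                cand = medoids.copy()
                cand[mi] = h
                c = total_cost(cand)
                if c < cost:
                    medoids, cost = cand, c
                    improved = True
    # Assign every observation to its nearest medoid
    labels = [min(medoids, key=lambda m: manhattan(p, data[m])) for p in data]
    return medoids, labels
```

Because medoids are observed data points, a single extreme value cannot drag a cluster centre away from the bulk of its cluster the way it can shift a k-means centroid.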
The k-medoids clustering was performed using the “pam” function in R.
Following group assignment, the silhouette width of each cluster was
computed using the “silhouette” function. An average silhouette width
close to 1 indicates well-separated clusters, while an average silhouette
width near 0 indicates that clusters lie close together. A negative
silhouette width for a given observation indicates that the data point may
have been assigned to the wrong cluster.
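The per-observation silhouette width underlying these summaries can be sketched as follows (an illustrative Python version; the study used R's "silhouette" function). For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster, giving s = (b - a) / max(a, b) in [-1, 1].

```python
def silhouette_widths(data, labels, dist):
    """Per-observation silhouette width s(i) = (b - a) / max(a, b),
    where a is the mean distance to the point's own cluster and b the
    smallest mean distance to any other cluster."""
    # Group observation indices by cluster label
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: silhouette conventionally 0
            widths.append(0.0)
            continue
        a = sum(dist(data[i], data[j]) for j in own) / len(own)
        b = min(
            sum(dist(data[i], data[j]) for j in members) / len(members)
            for lab2, members in clusters.items() if lab2 != lab
        )
        widths.append((b - a) / max(a, b))
    return widths
```

Averaging these widths over all observations gives the average silhouette width used above to select the number of clusters; a negative s(i) means the point is, on average, closer to a neighbouring cluster than to its own.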