Clustering of data
Variables were selected for inclusion in the clustering algorithm based on their clinical relevance as indicators of illness severity. Unless otherwise specified, both minimum and maximum values were included: temperature, heart rate (maximum), SBP, blood glucose, creatinine (maximum), haematocrit (minimum), sodium, WBC, platelets (minimum), respiratory rate (maximum), oxygen saturation (minimum), eGFR, and time from symptom onset to admission. Missing values for all selected physiologic measures were imputed using the study population mean, stratified by age group (18-49, 50-64, 65+) and hospital. A table of selected metrics can be found in Supplemental Table 1.
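The stratified mean imputation described above can be sketched as follows. This is an illustrative pandas fragment, not the study code; the column and stratum names here are hypothetical stand-ins for the study's actual variables:

```python
import pandas as pd
import numpy as np

# Toy cohort; "hr_max" is a hypothetical physiologic measure with missing values.
df = pd.DataFrame({
    "age_group": ["18-49", "18-49", "50-64", "50-64", "65+", "65+"],
    "hospital":  ["A", "A", "A", "A", "B", "B"],
    "hr_max":    [96.0, np.nan, 110.0, 102.0, np.nan, 88.0],
})

# Replace each missing value with the mean of its age-group x hospital stratum.
df["hr_max"] = df.groupby(["age_group", "hospital"])["hr_max"].transform(
    lambda s: s.fillna(s.mean())
)
```

Stratifying the imputation preserves systematic differences between age groups and sites rather than pulling all imputed values toward a single grand mean.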
Prior to the creation of clusters, the Hopkins statistic was used to assess the randomness of the distribution of the data in relation to a uniform distribution. Values of this statistic near 0.5 indicate the data are similar to a uniform distribution, while values closer to 1 indicate the data may contain clusters. The use of this statistic helps reduce the risk of a machine learning algorithm detecting clusters when the data do not actually contain clusters22.
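A minimal sketch of the Hopkins statistic follows; this is a simplified illustration of the standard formulation, not the study's implementation:

```python
import numpy as np

def hopkins(X, m=None, rng=None):
    """Hopkins statistic: ~0.5 for uniformly distributed data,
    approaching 1 when the data contain clusters. Simplified sketch."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, float)
    n, d = X.shape
    m = m or max(1, n // 10)

    # m artificial points drawn uniformly from the data's bounding box
    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, size=(m, d))
    # m real observations sampled without replacement
    idx = rng.choice(n, m, replace=False)

    # u: nearest-neighbour distance from each uniform point to the data
    u = np.min(np.linalg.norm(U[:, None, :] - X[None, :, :], axis=2), axis=1)
    # w: distance from each sampled real point to its nearest *other* observation
    D = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    D[np.arange(m), idx] = np.inf  # exclude each point's distance to itself
    w = D.min(axis=1)

    # Clustered data give small w relative to u, pushing the ratio toward 1.
    return u.sum() / (u.sum() + w.sum())
```

For tightly clustered data the real points' nearest-neighbour distances (w) are much smaller than those of the uniform points (u), so the statistic approaches 1; for uniform data the two are comparable and the statistic hovers near 0.5.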
Data were classified separately for each influenza season using the k-medoids Partitioning Around Medoids (PAM) algorithm with Manhattan distance. Briefly, k-medoids clustering assigns observations to groups based on their distance to a central, representative data point (the medoid) of each cluster23. Medoids are initially assigned at random, and the algorithm iterates through different choices of medoids and cluster assignments until the distance from each medoid to all other data points in its cluster is minimized. Because the chosen medoid is itself an observed data point, k-medoids clustering is more robust in the presence of outliers than other centre-based clustering algorithms such as k-means. Additionally, this algorithm assigns every observation to a cluster; this is preferred in a cohort of hospitalized individuals, where biologically plausible data outliers are of interest. The appropriate number of clusters for a given season was chosen using the largest average silhouette width, a measure of how close points in one cluster lie to points in other clusters, with one to a maximum of ten clusters tested.
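The alternating assign-and-update loop described above can be illustrated with a minimal Python sketch. R's cluster::pam (used in the study) additionally employs a greedy BUILD phase and an explicit swap search; this simplified version only conveys the idea:

```python
import numpy as np

def manhattan(a, b):
    # Sum of absolute coordinate differences (L1 / Manhattan distance)
    return np.abs(a - b).sum(axis=-1)

def pam(X, k, rng=None, max_iter=100):
    """Minimal Partitioning Around Medoids sketch with Manhattan distance.
    Illustrative only; not equivalent to R's cluster::pam."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, float)
    n = len(X)
    medoids = rng.choice(n, k, replace=False)  # random initial medoids
    for _ in range(max_iter):
        # Assign each observation to its nearest medoid
        D = manhattan(X[:, None, :], X[medoids][None, :, :])  # (n, k)
        labels = D.argmin(axis=1)
        # Within each cluster, move the medoid to the member that minimizes
        # total Manhattan distance to all other members
        new = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size == 0:
                continue
            costs = manhattan(X[members][:, None, :],
                              X[members][None, :, :]).sum(axis=1)
            new[j] = members[costs.argmin()]
        if np.array_equal(new, medoids):  # converged: medoids stable
            break
        medoids = new
    labels = manhattan(X[:, None, :], X[medoids][None, :, :]).argmin(axis=1)
    return medoids, labels
```

Because each medoid must be an actual observation, a single extreme value cannot drag a cluster centre away from the bulk of the data the way it can shift a k-means centroid.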
The k-medoids clustering was performed using the “pam” function in R. Following group assignment, the silhouette width of each cluster was computed using the “silhouette” function. An average silhouette width close to 1 indicates well-separated clusters, while an average silhouette width around 0 indicates clusters lie close together. A negative silhouette width for a given observation indicates that the data point may have been misclassified.
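The per-observation silhouette width underlying these summaries can be sketched as follows. This mirrors the standard definition reported by R's silhouette function (here assuming Manhattan distance, for consistency with the clustering above); it is an illustration, not the study code:

```python
import numpy as np

def silhouette_widths(X, labels):
    """Per-observation silhouette width s_i = (b_i - a_i) / max(a_i, b_i),
    where a_i is the mean distance to the point's own cluster and b_i the
    mean distance to the nearest other cluster. Sketch, Manhattan distance."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    n = len(X)
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)  # pairwise L1 distances
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        if own.sum() == 1:
            s[i] = 0.0  # convention: singleton clusters get width 0
            continue
        a = D[i, own & (np.arange(n) != i)].mean()  # mean intra-cluster distance
        b = min(D[i, labels == c].mean()            # nearest other cluster
                for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

A point deep inside its own cluster has a much smaller than b, giving a width near 1; a point sitting between clusters has a ≈ b, giving a width near 0; and a point closer on average to another cluster than to its own (a > b) receives a negative width, the misclassification signal described above.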