BACKGROUND: In biostatistics, evaluating the fragility of study results is crucial for understanding their vulnerability to miscategorization. One proposed measure of statistical fragility is the unit fragility index (UFI), which measures the susceptibility of the p-value to flip significance with minor changes in outcomes. Although the UFI provides valuable information, it relies on p-values, which are arbitrary measures of statistical significance. Alternative measures, such as the fragility quotient (FQ) and the percent fragility index, have been proposed to decrease the UFI's dependence on sample size. However, these approaches still rely on p-values and thus depend on an arbitrary cutoff of p < 0.05. Instead of quantifying fragility through p-values, this study evaluated the effect of small changes in outcomes on relative risk. METHODS: Random 2x2 contingency tables with an initial p-value between 0.001 and 0.05 were evaluated. Each table's UFI and relative risk index (RRI) were calculated. A derivative of the RRI, the percent RRI (pRRI), was also calculated, along with the FQ. The UFI, FQ, RRI, pRRI, initial p-value, and sample size were compared. RESULTS: A total of 15,000 cases were tested. The correlation between the UFI and the p-value was the strongest (r = -0.807), and the correlation between the pRRI and the p-value was the weakest (r = -0.395). The RRI had the strongest correlation with sample size (r = 0.826), and the UFI had the weakest (r = 0.390). The coefficient of variation of the average RRI was the smallest at 28.3%, and that of the FQ was the greatest at 57.0%. The correlations of the UFI and FQ with the p-value were significantly greater than the correlations of the RRI and pRRI with the p-value (for all comparisons, p < 0.001). CONCLUSION: The RRI and pRRI are significantly less correlated with the p-value than the UFI and FQ, indicating relative independence of the RRI and pRRI from p-values.
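The unit fragility index described above can be sketched in a few lines. This is a simplified illustration, not the study's code: it considers only single-outcome swaps within one row of the 2x2 table (published UFI definitions vary in which modifications are allowed), and the function names `fisher_p` and `unit_fragility` are our own.

```python
from math import comb

def fisher_p(a, b, c, d):
    """Two-sided Fisher exact p-value for a 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)
    # Probabilities of all feasible tables sharing the observed margins
    lo, hi = max(0, col1 - (c + d)), min(row1, col1)
    probs = [comb(row1, x) * comb(n - row1, col1 - x) / denom
             for x in range(lo, hi + 1)]
    p_obs = comb(row1, a) * comb(n - row1, col1 - a) / denom
    # Two-sided p: sum of probabilities no larger than the observed table's
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))

def unit_fragility(a, b, c, d, alpha=0.05):
    """Minimal number of single-outcome swaps that pushes p to >= alpha.
    Assumes the table starts out significant (p < alpha)."""
    best = None
    # Each move shifts one subject between outcome columns within a row
    for move in [(-1, 1, 0, 0), (1, -1, 0, 0), (0, 0, -1, 1), (0, 0, 1, -1)]:
        cells, k = (a, b, c, d), 0
        while True:
            cells = tuple(x + dx for x, dx in zip(cells, move))
            k += 1
            if min(cells) < 0:          # ran out of subjects in a cell
                break
            if fisher_p(*cells) >= alpha:
                best = k if best is None else min(best, k)
                break
    return best
```

For the table [[8, 2], [1, 5]] (p ≈ 0.035), a single swap already lifts the p-value above 0.05, so this simplified UFI is 1 — the kind of fragility the abstract describes.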
The classical definition of statistical significance is p <= 0.05, meaning there is a 1-in-20 chance of obtaining the observed test statistic through normal variation when the null hypothesis is true. This definition does not represent the likelihood that the alternative hypothesis is true. Hypothesis testing itself can be evaluated with a 2x2 table of test result versus truth:

                      Alternative true   Null true
p <= 0.05 (positive)      a = 0.80        b = 0.05
p > 0.05  (negative)      c = 0.20        d = 0.95

Box "a" = true positives: p <= 0.05 and the alternative hypothesis is true; this rate is the study's power. A rule of thumb is that study power should be at least 80% (the test is positive 80% of the time when the alternative hypothesis is true), so a = 0.80. Box "b" = false positives: p <= 0.05 but the alternative hypothesis is false. By definition, at p = 0.05 the test statistic has a 5% probability of occurring by chance when the null hypothesis is true, so b = 0.05. Box "c" = false negatives: p > 0.05 but the alternative hypothesis is true; this occurs 20% of the time when the study's power is 80%, so c = 0.20. Box "d" = true negatives: p > 0.05 and the null hypothesis is true; this occurs 95% of the time when the null hypothesis is true, so d = 0.95. From this table we derive: Sensitivity = power = a/(a+c) = 80%. Specificity = (1 - p) = d/(b+d) = 95%. Positive predictive value = power/(power + p-value) = a/(a+b) = 94%. Negative predictive value = d/(c+d) = 83%. The classical definition of statistical significance rests on (1 - specificity) and does not take power into consideration. The proposed new definition of statistical significance is a positive predictive value of the test statistic of 95% or greater. To arrive at this, the cut-off p-value representing statistical significance must be corrected for study power so that (p-value)/(p-value + power) <= 0.05. Solving for the p-value, statistical significance becomes p <= power/19; at 80% power, this is p <= 0.042.
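The arithmetic above can be checked with a short sketch; the function names are illustrative, not from the source:

```python
def predictive_values(power=0.80, p=0.05):
    """Fill the 2x2 table from power and the cut-off p-value:
    a = power, b = p, c = 1 - power, d = 1 - p."""
    a, b, c, d = power, p, 1 - power, 1 - p
    sensitivity = a / (a + c)   # = power
    specificity = d / (b + d)   # = 1 - p
    ppv = a / (a + b)           # power / (power + p-value)
    npv = d / (c + d)
    return sensitivity, specificity, ppv, npv

def significance_cutoff(power, target_ppv=0.95):
    """Solve power / (power + p) >= target_ppv for p:
    p <= power * (1 - target_ppv) / target_ppv, i.e. power / 19 at 95%."""
    return power * (1 - target_ppv) / target_ppv
```

With 80% power the corrected cut-off is 0.80/19 ≈ 0.042, and at exactly that cut-off the positive predictive value is 95%, as the derivation requires.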
Objectives: Statistical significance does not equal clinical significance. This study examined how frequently statistically significant results in the nuclear medicine literature are clinically relevant. Methods: A MEDLINE search was performed, with results limited to clinical trials or randomized controlled trials published in one of the major nuclear medicine journals. Articles analyzed were limited to those reporting continuous variables for which a mean (X) and standard deviation (SD) were reported and determined to be statistically significant (p < 0.05). A total of 32 test results were evaluated. Clinical relevance was determined in a two-step fashion. First, the crossover point between group 1 (normal) and group 2 (abnormal) was determined; this is the point at which a value is just as likely to fall in the normal distribution as in the abnormal distribution. Jacobson's test for clinically significant change was used: crossover point = (SD1 * X2 + SD2 * X1) / (SD1 + SD2). Second, it was determined how many SDs from the mean this crossover point fell. For example, 13.9 +/- 4.5 compared with 9.2 +/- 2.1 was reported as statistically significant (p < 0.05). The crossover point is 10.7, which is 0.71 SD from each mean: 13.9 - (0.71 * 4.5) = 9.2 + (0.71 * 2.1). Results: The average crossover point was 0.66 SD from the mean. The crossover point was within 1 SD of the mean in 26/32 cases, and in these cases it averaged 0.45 SD. Thus, for about 4 out of 5 statistically significant results, when applied to an individual patient, the cut-off between normal and abnormal was 0.45 SD from the mean, which places about a third of normal patients in the abnormal category. Conclusions: Statistically significant results frequently are not clinically significant. Statistical significance alone does not ensure clinical relevance.
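The crossover calculation and the worked example can be reproduced directly; this is a minimal sketch with our own function names, assuming both groups are normally distributed as the method does:

```python
from math import erf, sqrt

def crossover_point(x1, sd1, x2, sd2):
    """Jacobson's crossover: the value equidistant (in SD units)
    from both group means, hence equally likely under either."""
    return (sd1 * x2 + sd2 * x1) / (sd1 + sd2)

def crossover_z(x1, sd1, x2, sd2):
    """How many SDs the crossover point lies from either mean."""
    return abs(x1 - crossover_point(x1, sd1, x2, sd2)) / sd1

def normal_tail(z):
    """Fraction of a normal distribution lying beyond z SDs on one side."""
    return 0.5 * (1 - erf(z / sqrt(2)))
```

For 13.9 +/- 4.5 versus 9.2 +/- 2.1 this gives a crossover near 10.7 at about 0.71 SD, matching the example; and at the paper's average within-1-SD crossover of 0.45 SD, `normal_tail(0.45)` is roughly 0.33, the "third of normal patients" classified abnormal.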