Stochastic earthquake-tsunami models (SETMs) are widely used to simulate hypothetical tsunamis and their variability. Different SETMs can produce tsunamis with substantially different statistical properties, so to understand their biases, SETMs should be tested against tsunamis generated by multiple real earthquakes. However, few studies have attempted this. This study tests three SETMs from the 2018 Australian Probabilistic Tsunami Hazard Assessment by comparison with fourteen earthquake-tsunamis observed at tide gauges in southeast and west Australia. The SETMs range in complexity from a simple uniform slip model with deterministic rupture area (FAUS), through a uniform slip model with variable rupture area (VAUS), to a heterogeneous slip model (HS). For each historical event, sixty scenarios with similar earthquake location and magnitude are sampled from each SETM and modelled at tide gauges for sixty hours post-earthquake to represent the SETM tsunami distribution. The best-fitting SETM scenarios often agree with observations better than tsunamis modelled using published source inversions. However, some observations are not well modelled by one or more SETMs. The tsunami size distribution varies between the SETMs, with FAUS failing to envelop the observed tsunami size much more often than the other two models. FAUS also tends to underestimate the observations, particularly for larger tsunamis. The VAUS and HS SETMs perform much better, with HS typically producing larger tsunamis than VAUS but also failing to envelop the observations more often than VAUS. The relative performance of each SETM is similar whether the tsunami size is analysed over the full simulation, for early-arriving waves only, or for late waves only.
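The envelope comparison described above can be illustrated with a minimal sketch. It assumes that each scenario's tsunami size at a gauge has been reduced to a single number, and that "enveloping" means the observed size lies between the smallest and largest simulated sizes. The function names and the numerical values are hypothetical, chosen only for illustration; they are not the study's statistic or data.

```python
from typing import Sequence


def envelops(observed: float, simulated: Sequence[float]) -> bool:
    """Return True if the observed size lies within the range of simulated sizes."""
    return min(simulated) <= observed <= max(simulated)


def fraction_below(observed: float, simulated: Sequence[float]) -> float:
    """Return the fraction of simulated scenarios smaller than the observation.

    Values near 1.0 suggest the scenario set tends to underestimate the event.
    """
    return sum(s < observed for s in simulated) / len(simulated)


# Hypothetical tsunami sizes (metres) for four scenarios; the study uses sixty.
example_sizes = [0.10, 0.15, 0.22, 0.30]

inside = envelops(0.25, example_sizes)        # observation within the range
outside = envelops(0.35, example_sizes)       # observation above every scenario
rank = fraction_below(0.25, example_sizes)    # three of four scenarios are smaller
```

Under these assumptions, repeating the check for every event and gauge, and counting how often `envelops` returns False, gives a simple per-model failure rate that can be compared between FAUS, VAUS, and HS.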