Discussion
Single call annotation, whether manual or via recognisers, is a viable alternative to acoustic indices for monitoring ecological restoration (Linke and Deretic, 2020). While recognisers are commonly treated as one analysis class, there is a gradient in both effort and performance of auto-detectors. This ranges from largely automated recognisers – typically built-in software packages such as ‘Kaleidoscope’ (Wildlife Acoustics, 2017) – to completely custom-built software (Towsey et al., 2012). In all cases, various parameters alter recogniser performance; these may be left as defaults in software or manipulated by the end-user. Differences in recogniser construction alters performance and this can manifest as poor agreement among recognisers built using different software (Lemen et al., 2015). Relying on recognisers without properly understanding how they operate can be problematic (Russo and Voigt, 2016). In this study, we took a semi-custom approach; we used a pre-programmed matching algorithm (Towsey et al., 2012; Ulloa et al., 2016) within the R package monitoR (Katz et al., 2016b), but actively investigated three important parameters that are often overlooked – or, at least, are rarely reported on - in recogniser construction. These parameters were call template selection and representativeness, template construction (including amplitude cut-off) and the threshold of similarity at which a detection is returned (score cut-off). We argue that there is a need to establish thorough construction and evaluation mechanisms for building recognisers, and for these to be properly reported in the literature.
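To make this workflow concrete, the sketch below outlines the binary point matching steps in monitoR that we refer to throughout this section. It is a minimal illustration only: the file paths, time and frequency limits, and cut-off values are hypothetical, not those used in this study.

```r
# Minimal binary point matching workflow in monitoR; all file paths and
# parameter values below are illustrative only.
library(monitoR)

# 1. Build a binary template from a clip containing a clear call.
#    t.lim/frq.lim bound the call in time (s) and frequency (kHz);
#    amp.cutoff (dB) controls which spectrogram cells become 'on' points.
template_a <- makeBinTemplate(
  clip       = "templates/species_call_01.wav",
  t.lim      = c(0.5, 1.8),
  frq.lim    = c(1.0, 3.5),
  amp.cutoff = -30,
  name       = "species_t1"
)

# 2. Score a survey recording against the template.
scores <- binMatch(survey = "surveys/site01_20210101.wav",
                   templates = template_a)

# 3. Find peaks in the score time series and keep those above the
#    template's score cut-off.
peaks      <- findPeaks(scores)
detections <- getDetections(peaks)
head(detections)
```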
First, choices pertaining to call template selection are crucial (Katz et al., 2016a; Teixeira et al., 2022). Studies typically report the source of call templates (e.g. whether calls were collected from wild or captive animals), but usually fail to explain the decisions underlying the selection of the exact calls used. For example, were calls free of background noise, and how did this affect recogniser performance? Animal calls exist not in isolation but within an overall soundscape. As such, representing calls within the context of the soundscapes that we seek to monitor may be important. While our call recognisers performed well overall (Table 2), they were also prone to species-specific errors. For example, L. tasmaniensis recognisers produced false positives for rain events, whereas erroneous detections for C. parinsignifera were mainly of birds and insects (Table 3).
In this study, we attempted to represent common background noises, such as other species’ calls and non-biological sounds (e.g. running water). Although we selected calls that were relatively clear in their structure, we maintained a ‘buffer’ (or margin) around each selection in both the time and frequency domains. Since any manual selection of candidate calls will incur a level of human bias, we chose to extract between 100 and 200 templates per species, from which a minimum of 10 were tested and only two or three were chosen for the final recogniser. Although call templates can be difficult to acquire for some rare or cryptic species, we argue that, as much as possible, recognisers should be built following the testing of many candidate templates.
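As an illustration of this buffering step, the sketch below extends a manually identified call selection by a small margin in time and frequency before the template is built; the bounds and margin sizes shown are hypothetical.

```r
# Illustrative only: widen a call selection by a margin in time and
# frequency so that some surrounding soundscape is represented as 'off'
# (non-call) points in the template.
library(monitoR)

call_t   <- c(2.10, 3.40)   # call start/end (s) from manual inspection
call_frq <- c(1.2, 2.8)     # call frequency bounds (kHz)

t_buffer   <- 0.25          # time margin (s); value is illustrative
frq_buffer <- 0.3           # frequency margin (kHz); value is illustrative

template_buffered <- makeBinTemplate(
  clip       = "templates/species_call_02.wav",
  t.lim      = call_t   + c(-t_buffer,   t_buffer),
  frq.lim    = call_frq + c(-frq_buffer, frq_buffer),
  amp.cutoff = -30,
  name       = "species_t2_buffered"
)
```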
Another important consideration is the representativeness of species’ call types and behaviours (Priyadarshani et al., 2018). For species that exhibit large vocal repertoires, decisions must be made about which call types to feature in recognisers. This should be driven by a program’s objectives or research questions; for example, monitoring breeding may require only one or two breeding-associated call types to feature in the recogniser (Teixeira et al., 2019). Further, geographic variation in call structure (e.g. regional dialects) may also impact recogniser performance, and should be investigated when recognisers are intended for use at spatial scales over which call types may vary (Kahl et al., 2021; Lauha et al., 2022; Priyadarshani et al., 2018). If recognisers are used across discrete or isolated populations, call templates may need to represent each area. In this study, we attempted to represent inter-site variability by selecting candidate call templates from every site where the species was recorded. For several species, the final recognisers comprised templates from more than one site.
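Where templates from several sites are retained, they can be pooled into a single recogniser; the sketch below shows one way of doing this in monitoR, with hypothetical file names and parameter values.

```r
# Illustrative sketch: pooling templates built from recordings made at
# different sites so that inter-site call variation is represented.
library(monitoR)

site_a <- makeBinTemplate("templates/siteA_call.wav",
                          t.lim = c(0.4, 1.6), frq.lim = c(1.5, 3.0),
                          amp.cutoff = -28, name = "siteA")
site_b <- makeBinTemplate("templates/siteB_call.wav",
                          t.lim = c(1.0, 2.3), frq.lim = c(1.4, 3.1),
                          amp.cutoff = -32, name = "siteB")

# Combine into one template list; binMatch() then scores a survey against
# every template in the list.
recogniser <- combineBinTemplates(site_a, site_b)
```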
Once call templates are chosen, decisions must be made about their construction for use in a recogniser. In binary point matching, call templates are created from a grid of ‘on’ and ‘off’ points (i.e. call and non-call points), which is determined by the amplitude cut-off set by the user (Katz et al., 2016b). In monitoR, the impact of altering the amplitude cut-off can be easily visualised (Figure 1). In this study, we set the amplitude cut-off so that templates captured both the call structure and some background noise. Since the recogniser ‘matches’ both the on and off points, finding a suitable balance between these is important. Although visualising and selecting an amplitude cut-off is a manual and somewhat arbitrary process, we considered that the large sample size of candidate templates tested in this study would minimise any bias from this process. However, for studies that test a smaller number of candidate templates, we recommend that each template be tested at several different amplitude cut-offs.
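For readers who wish to compare amplitude cut-offs in this way, the sketch below builds the same selection at several cut-offs and plots each resulting grid of on and off points; the clip and cut-off values are hypothetical.

```r
# Illustrative sketch: build one selection at several amplitude cut-offs
# and inspect the resulting on/off point grids visually.
library(monitoR)

amp_cutoffs <- c(-20, -30, -40)   # values are illustrative

candidates <- lapply(amp_cutoffs, function(a) {
  makeBinTemplate(
    clip       = "templates/species_call_03.wav",
    t.lim      = c(0.8, 2.0),
    frq.lim    = c(1.0, 3.5),
    amp.cutoff = a,
    name       = paste0("amp_", abs(a))
  )
})

# plot() on a template list overlays the on/off points on the spectrogram,
# which is how each cut-off can be judged visually.
for (tmpl in candidates) plot(tmpl)
```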
Finally, an appropriate score cut-off, which sets the threshold of similarity at which a detection is returned (Figure 2), must be set for each call template. The score cut-off alters the template’s sensitivity and therefore greatly affects performance. A higher score cut-off will reduce false positive detections but may increase false negatives (Katz et al., 2016a). Conversely, increasing sensitivity by lowering the score cut-off will reduce false negatives, but it may reduce precision by returning more false positives. Here, we tested every call template at score cut-off increments of 0.2, starting from a low of 3, and measured performance by ROC value. For most species examined, high ROC values indicated that call templates were able to sufficiently trade off false positives and false negatives while maximising true positives. This rigorous approach to score cut-off testing allowed us to set highly specific cut-offs in the final recognisers. However, for species that are rarer or more cryptic, returning sufficient true positives may require a lower score cut-off with a poorer ROC value. Where detecting most, if not all, calls is important, other performance metrics such as recall should be given due consideration. Ultimately, decisions about the score cut-off should be driven by a study’s objectives, but we argue that general metrics like ROC values are a good starting point in most cases.
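A score cut-off sweep of this kind can be scripted; the sketch below, with hypothetical paths and an illustrative upper bound, counts detections at 0.2 increments from a cut-off of 3. It assumes, as we understand monitoR, that templateCutoff() can be reassigned on a detection object before calling getDetections().

```r
# Illustrative sketch: count detections across a range of score cut-offs.
# True and false positives at each cut-off would then be scored against
# manual annotations of the same recordings before a final value is chosen.
library(monitoR)

template <- makeBinTemplate(
  clip = "templates/species_call_01.wav",
  t.lim = c(0.5, 1.8), frq.lim = c(1.0, 3.5),
  amp.cutoff = -30, name = "species_t1"
)

scores <- binMatch(survey = "surveys/site01_20210101.wav",
                   templates = template)
peaks  <- findPeaks(scores)

# Sweep cut-offs in 0.2 increments from 3; the upper bound is arbitrary.
# The special name 'default' applies the same cut-off to every template.
cutoffs <- seq(3, 9, by = 0.2)
n_detections <- sapply(cutoffs, function(x) {
  templateCutoff(peaks) <- c(default = x)
  nrow(getDetections(peaks))
})

data.frame(score.cutoff = cutoffs, n.detections = n_detections)
```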
We argue that ecoacoustic researchers and practitioners need to stop treating recognisers as black boxes and actively develop, improve and test processes that support evaluation. From the literature, it is currently unclear how reliable recognisers are. Many studies report poor performance, but this may be more a function of inappropriate construction than of recognition methods per se. In particular, recogniser testing is often neglected, and performance is frequently reported only as the number of detections in a larger dataset. Even when performance is reported, the source of low recogniser accuracy is often unclear. We demonstrated that low accuracy can have multiple causes, from poorly selected templates to a lack of template calibration (e.g. amplitude or score cut-offs). We recommend that recognisers not be treated as a static product; they can be refined and adapted as more monitoring data become available. Using this study as an example, we are currently working on a refinement of the recogniser for L. tasmaniensis that is based on better template recordings. A complete recommended workflow could start with a recogniser built for a particular species in a particular region, which is then enhanced with data from other environments, followed by performance evaluation and refinement as necessary.
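As a simple example of reporting performance beyond raw detection counts, the sketch below summarises precision, recall and F1 from a manually verified subset of detections; evaluate_recogniser(), its inputs and the numbers shown are hypothetical.

```r
# Illustrative sketch: summarise recogniser performance from a manually
# verified subset of detections rather than reporting raw detection counts.
# 'verified' flags each returned detection as a true (TRUE) or false
# (FALSE) positive; 'n_missed' is the number of manually annotated calls
# the recogniser failed to detect.
evaluate_recogniser <- function(verified, n_missed) {
  tp <- sum(verified)     # true positives
  fp <- sum(!verified)    # false positives
  fn <- n_missed          # false negatives

  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)

  data.frame(tp, fp, fn, precision, recall, f1)
}

# Example with made-up numbers:
evaluate_recogniser(verified = c(rep(TRUE, 42), rep(FALSE, 8)), n_missed = 5)
```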