2. Data processing and classification
A series of preprocessing steps was applied to the audio after collection, beginning with normalizing all survey audio to -2 dB maximum gain. The SWIFT recorder firmware writes a high-amplitude audio spike at the beginning of the first file recorded after the unit wakes from standby (e.g., the beginning of the 5:00 and 16:00 audio files); we therefore overwrote the first five seconds of audio in each of these files to prevent this spike from affecting the gain normalization step. As our chosen audio classifier architecture operates on fixed-length samples, we split each 30 min audio file into 7,197 overlapping 2 s audio “windows”, each advancing 0.25 s from the previous window. The classifier operates on the log-Mel-weighted spectrogram (Knight et al., 2017) of each window, which is created dynamically during classification using STFT utilities in the TensorFlow Python module (Table 2) at a native resolution of 512x512 px.
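A minimal sketch of the windowing and log-Mel spectrogram conversion using TensorFlow's tf.signal utilities is shown below; the sample rate, FFT length, and Mel parameters are illustrative assumptions rather than the exact values listed in Table 2.

```python
import tensorflow as tf

# Illustrative constants; the true sample rate and STFT settings are assumptions.
SAMPLE_RATE = 48000          # assumed recorder sample rate (Hz)
WINDOW_S, HOP_S = 2.0, 0.25  # 2 s windows advancing by 0.25 s

def split_into_windows(audio):
    """Slice a 1-D waveform tensor into overlapping 2 s windows."""
    return tf.signal.frame(
        audio,
        frame_length=int(WINDOW_S * SAMPLE_RATE),
        frame_step=int(HOP_S * SAMPLE_RATE),
    )

def log_mel_spectrogram(window, n_mels=128, fft_length=1024):
    """Convert one audio window to a log-Mel-weighted spectrogram."""
    stft = tf.signal.stft(window, frame_length=fft_length,
                          frame_step=fft_length // 4, fft_length=fft_length)
    power = tf.abs(stft) ** 2
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=fft_length // 2 + 1,
        sample_rate=SAMPLE_RATE,
    )
    return tf.math.log(tf.matmul(power, mel_matrix) + 1e-6)
```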
Audio event detection was conducted using a set of convolutional neural network classifiers. The chosen classifier architecture is adapted from the multiclass, single-label classifier called “Model 1” in Kahl et al. (2017). Our decision to use a multiclass, single-label architecture was driven by a desire for reduced learning complexity; however, we feel there is merit to introducing a multilabel classifier in future analyses, as existing ML techniques can handle this task with minor modifications (Kahl et al., 2017). For similar reasons, we reduced the number of neurons per hidden layer by half to account for limitations in available processing power, and down-sampled the 512x512 px spectrogram images to 256x256 px before training and classification. The full classifier architecture is described in Table 3. All data processing was performed either in Python, using a combination of TensorFlow 2.0 (Abadi et al., 2016) and other widely used Python modules, or, in the case of later statistical testing, in R (R Core Team, 2019). During training, we applied the same STFT algorithm used for the survey data to dynamically convert the training audio to log-Mel-weighted spectrograms, and implemented data augmentation to improve model generalization (Ding et al., 2016). These augmentation parameters, along with general model hyperparameters (Table 4), were chosen using a Bayesian hyperparameter search module in Python (Nogueira, 2014) driven to optimize the multiclass F1-score (β = 1; Sokolova & Lapalme, 2009) on a set of known-good clips (hereafter the “validation set”) created from a sample of clips not used in the training data (Table 1). The F1-score was calculated as a macro-average across the 12 classes in order to give equal weight to rare classes. Although the goal of hyperparameter search techniques is typically to identify a single optimal set of parameters, we observed two apparent local optima that we chose to incorporate into our classification pipeline as two submodels: (a) submodel 1, which added artificial Gaussian noise to training spectrograms as part of the augmentation process, and (b) submodel 2, which did not. The set of class probabilities returned for each clip was the mean of the probabilities reported by the two submodels (hereafter the “ensemble”). We validated each submodel, as well as the ensemble, on the same validation set used in the hyperparameter search.
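The two-submodel ensemble can be sketched as a simple element-wise average of the per-window class probabilities; the model file names below are placeholders, not the actual artifacts from this study.

```python
import numpy as np
import tensorflow as tf

# Hypothetical saved submodels: submodel 1 was trained with added Gaussian
# noise augmentation, submodel 2 without. File names are placeholders.
submodel_1 = tf.keras.models.load_model("submodel_1_gaussian_noise.h5")
submodel_2 = tf.keras.models.load_model("submodel_2_no_noise.h5")

def ensemble_probabilities(spectrogram_batch):
    """Return the mean of the two submodels' 12-class probability vectors."""
    p1 = submodel_1.predict(spectrogram_batch)  # shape: (n_windows, 12)
    p2 = submodel_2.predict(spectrogram_batch)
    return np.mean([p1, p2], axis=0)
```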
When deployed on survey data, our classification pipeline yields a sequence of probability vectors of size 12, where each vector corresponds to one window in the sequence of overlapping windows. Windows that contain only the very beginning or the very end of a tinamou vocalization are often classified incorrectly, which we believe results from the structural similarities that different tinamou species often share in those regions of their vocalizations. To reduce the impact of this pattern on our overall classification accuracy, we applied a “smoothing” post-processing step to the class probabilities in which each probability value was replaced by the weighted average of that value (weight = 1) and the values immediately before and after it in the time sequence (weight = 0.5 each). Windows with a maximum class probability < 0.85 were removed, and the remainder were assigned the label with the highest class probability. All windows detected as positive were manually checked for accuracy and relabeled if incorrect.
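The smoothing and thresholding step might be implemented roughly as follows; the edge-padding behavior at the start and end of a file is an assumption, while the weights and the 0.85 threshold follow the text.

```python
import numpy as np

def smooth_probabilities(probs, neighbor_weight=0.5):
    """Replace each probability with the weighted mean of itself (weight = 1)
    and its immediate temporal neighbors (weight = 0.5 each).
    `probs` is an (n_windows, 12) array of per-window class probabilities."""
    padded = np.pad(probs, ((1, 1), (0, 0)), mode="edge")  # assumed edge handling
    weighted = (neighbor_weight * padded[:-2]
                + padded[1:-1]
                + neighbor_weight * padded[2:])
    return weighted / (1.0 + 2.0 * neighbor_weight)

def extract_detections(probs, threshold=0.85):
    """Keep windows whose maximum smoothed probability reaches the threshold."""
    smoothed = smooth_probabilities(probs)
    keep = smoothed.max(axis=1) >= threshold
    return np.flatnonzero(keep), smoothed.argmax(axis=1)[keep]
```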
We assessed the degree of marginal improvement in classifier performance due to increased training dataset size and increased structural uniformity between training clips and survey audio by running a second “pass” of the acoustic classifier on the survey data with a set of models trained on a larger training dataset. To generate this dataset, the original training dataset was supplemented with all known-good positive windows from the initial classification (the first “pass”). We sampled from this pool to produce a new training dataset (n = 18,480) containing 2,000 randomly selected clips per class (4,000 for the “junk” class), or as many clips as were available for classes with fewer (Table 1). We trained new submodels on these data using the same model architecture and hyperparameters as the first-pass models. The sole change made to the training process between the first and second passes was to alter the batch generation code to produce batches with balanced class frequencies, offsetting the greatly increased class imbalance in the supplemented dataset. Each submodel was validated using a new validation set that contained known-good survey audio whenever possible, to ensure that the calculated metrics would be more indicative of each submodel’s real-world performance.
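A hedged sketch of the class-balanced batch generation used in the second pass is shown below; the batch size and the data structure (a mapping from class label to available clips) are assumptions for illustration.

```python
import random

def balanced_batch_generator(clips_by_class, batch_size=36):
    """Yield (clip, label) batches that draw roughly evenly from every class,
    regardless of how many clips each class contributes to the training set."""
    classes = list(clips_by_class)
    per_class = max(1, batch_size // len(classes))
    while True:
        batch = []
        for label in classes:
            # Sample with replacement so rare classes appear as often as common ones.
            batch.extend((clip, label)
                         for clip in random.choices(clips_by_class[label], k=per_class))
        random.shuffle(batch)
        yield batch
```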
The survey data were classified with these new models, and the resulting class predictions were processed to extract probable detections as described above. To reduce manual review time, all positive windows from the initial classification were “grandfathered in” as correctly identified, having already been manually checked; this allowed us to check only the positive detections that were newly identified during the second pass. Finally, each sequence of windows with a given species classification that was ≥ 0.75 s apart from any other such sequence was grouped as a single vocal event.
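The grouping of window-level detections into vocal events can be sketched as merging same-species detections separated by less than 0.75 s; representing detections by their window start times is an assumption.

```python
MIN_GAP_S = 0.75  # detections at least this far apart form separate events

def group_vocal_events(detection_times, min_gap=MIN_GAP_S):
    """Group sorted window start times (one species) into vocal events:
    times closer than `min_gap` to the previous detection extend the current
    event; otherwise a new event is started."""
    events = []
    for t in sorted(detection_times):
        if events and t - events[-1][-1] < min_gap:
            events[-1].append(t)   # continue the current vocal event
        else:
            events.append([t])     # start a new vocal event
    return events                  # list of events, each a list of window times
```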
To quantify model performance and generalizability, we calculated precision, recall, F1-score, and precision-recall area under the curve (AUC) metrics for the first-pass and second-pass models, presented on a per-class basis or as macro-averages across classes, after Sokolova & Lapalme (2009). All metrics were calculated from classifier performance on the corresponding validation set, which drew on survey audio whenever possible to ensure that the performance metrics would be more indicative of each submodel’s real-world performance.
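The per-class and macro-averaged metrics could be computed along the following lines, here using scikit-learn purely for illustration; the use of scikit-learn, and the approximation of PR-AUC by average precision, are assumptions.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_fscore_support

def classification_metrics(y_true, y_pred, y_prob, n_classes=12):
    """Per-class precision/recall/F1, one-vs-rest PR-AUC, and macro averages."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(range(n_classes)), zero_division=0)
    # One-vs-rest precision-recall AUC, approximated by average precision
    # over the predicted class probabilities.
    pr_auc = np.array([
        average_precision_score((np.asarray(y_true) == c).astype(int), y_prob[:, c])
        for c in range(n_classes)
    ])
    # Macro averages give each class equal weight, including rare classes.
    macro = {"precision": precision.mean(), "recall": recall.mean(),
             "f1": f1.mean(), "pr_auc": pr_auc.mean()}
    return precision, recall, f1, pr_auc, macro
```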
As a point of comparison for our audio detection counts, we also examined community science observation data for tinamous from eBird (Sullivan et al., 2009; Sullivan et al., 2015). We used stationary and traveling checklists containing tinamous that were submitted at the LACC hotspot between the months of July and October, removing stationary checklists with durations > 150 min and traveling checklists with lengths > 0.5 km, in order to constrain the sampling effort parameter space of the eBird data and make it more comparable to our 2.5 h morning and afternoon recording periods. Despite these filtering steps, the final eBird dataset still contained all locally occurring tinamou species. However, our acoustic data density for C. strigulosus vastly outstripped the eBird data density, so we excluded this species from the analysis, as we feel it warrants separate discussion. We estimated occurrence probabilities by averaging the results of random samples (n = 1,000) drawn from the eBird data, and by averaging the results of an equal number of samples drawn from the acoustic event dataset using the same underlying sampling-effort density distribution as the eBird checklist durations. Audio frequency estimates were calculated separately for terra firme and floodplain habitat types from site-level presence-absence frequencies and then averaged. In addition, we compared our audio detection counts to camera trap capture rates reported by Mere Roncal et al. (2019), also at LACC. Camera trap capture rates suggest seasonally driven differences in tinamou activity, so we considered only detection rates from the dry season, which limited our comparison to the five tinamou species reported by Mere Roncal et al. (2019) for which dry-season camera trap data are available. Occurrence frequencies were again calculated as the average of the distributions from terra firme and floodplain sites.
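One possible reading of the effort-matched resampling is sketched below, assuming acoustic detections are stored as event times within a 2.5 h recording period and eBird checklist durations are given in minutes; this illustrates the sampling idea rather than the exact procedure used.

```python
import numpy as np

def occurrence_frequency(detection_times, checklist_durations_min,
                         n_samples=1000, period_s=2.5 * 3600, rng=None):
    """Estimate the probability that an effort-matched random sample of the
    acoustic record contains at least one detection of the focal species.
    Sample durations are drawn from the eBird checklist-duration distribution."""
    rng = rng or np.random.default_rng()
    detection_times = np.asarray(detection_times)  # seconds into the recording period
    hits = 0
    for _ in range(n_samples):
        duration_s = rng.choice(checklist_durations_min) * 60.0
        start = rng.uniform(0.0, max(period_s - duration_s, 1.0))
        hits += np.any((detection_times >= start) &
                       (detection_times < start + duration_s))
    return hits / n_samples
```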