An Open Reproducible Framework for CNN-Based Cetacean Vocalization Detection in Passive Acoustic Monitoring

This paper is a preprint and has not been certified by peer review.
Authors

De Marco, R.

Abstract

This paper presents a six-stage methodological framework for Convolutional Neural Network (CNN)-based cetacean vocalization detection and classification in Passive Acoustic Monitoring (PAM), implemented as the open-source toolkit ai-pam-pipeline. The framework is generalisable across species and fully parameterised through a single configuration file, guaranteeing exact experimental reproducibility. Two experiments are reported. Experiment A examines the effect of FFT window length N_fft ∈ {256, 512, 1024} on binary Bottlenose dolphin (Tursiops truncatus) whistle detection using stratified 10-fold cross-validation on an in-domain dataset (Oltremare, 192 kHz) and a cross-domain benchmark (DCLDE 2022). In-domain performance is uniformly high (macro F1 ≈ 0.98; Wilcoxon, all p > 0.05). Cross-domain results diverge substantially: N_fft = 256 is significantly superior (p = 0.006, rank-biserial r = 0.89). The mechanism is an upsampling amplification effect: coarser spectral bins produce wider, higher-contrast FM traces after bilinear resampling to fixed image dimensions. This superiority is threshold-invariant: precision equals 1.000 across all configurations and thresholds θ ∈ [0.1, 0.9], confirming that the advantage is not an artifact of threshold choice. These findings demonstrate that preprocessing choices, often treated as secondary implementation details, can significantly affect cross-domain generalisation. While N_fft serves here as a controlled case study, the framework is designed to enable systematic, reproducible evaluation of arbitrary preprocessing parameters within a unified experimental protocol. Experiment B demonstrates multiclass capability on five T. truncatus vocalization categories (macro F1 = 0.843); inter-class confusion between click trains and burst-pulse sounds reflects biological signal overlap rather than classifier failure.
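The upsampling amplification effect described in the abstract can be illustrated with a minimal sketch. The code below is not taken from ai-pam-pipeline; it is an assumed reconstruction of the preprocessing step using scipy, with an illustrative synthetic chirp standing in for a dolphin whistle and a hypothetical 224×224 output size. With N_fft = 256 the spectrogram has 129 frequency bins versus 513 for N_fft = 1024, so bilinear resampling to the same fixed image stretches each coarse bin further, widening the FM trace.

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import zoom


def spectrogram_image(signal, fs, n_fft, out_shape=(224, 224)):
    """Log-magnitude spectrogram, bilinearly resampled to a fixed image size.

    Illustrative sketch only: window, overlap, and output size are
    assumptions, not the parameters used in the paper.
    """
    f, t, sxx = spectrogram(signal, fs=fs, window="hann",
                            nperseg=n_fft, noverlap=n_fft // 2)
    log_sxx = 10.0 * np.log10(sxx + 1e-12)  # dB scale, floor to avoid log(0)
    factors = (out_shape[0] / log_sxx.shape[0],
               out_shape[1] / log_sxx.shape[1])
    return zoom(log_sxx, factors, order=1)  # order=1 -> bilinear resampling


# Synthetic FM "whistle": a 1 s linear chirp sweeping 5 -> 20 kHz,
# sampled at 192 kHz as in the Oltremare dataset.
fs = 192_000
t = np.arange(fs) / fs
chirp = np.sin(2 * np.pi * (5_000 * t + 7_500 * t ** 2))

img_256 = spectrogram_image(chirp, fs, n_fft=256)
img_1024 = spectrogram_image(chirp, fs, n_fft=1024)

# Both images end up the same size, but the N_fft = 256 version is
# upsampled from far fewer frequency bins (129 vs 513), so each bin
# covers more output pixels along the frequency axis.
print(img_256.shape, img_1024.shape)
```

The point of the sketch is that the resampling factor along the frequency axis differs by roughly 4× between the two configurations, which is the geometric source of the wider, higher-contrast traces the paper attributes the cross-domain advantage to.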
