Sound synthesizers are ubiquitous in modern music production, but manipulating their presets, i.e., the sets of synthesis parameters, demands expert skills. This study presents a novel variational auto-encoder model tailored for black-box synthesizer preset interpolation, which enables the intuitive generation of new presets from pre-existing ones. Leveraging multi-head self-attention networks, the model efficiently learns latent representations of synthesis parameters and aligns them with perceived timbre dimensions through attribute-based regularization. It can transition gradually between diverse presets, surpassing traditional linear parametric interpolation methods. Furthermore, we introduce an objective and reproducible evaluation method based on linearity and smoothness metrics computed on a broad set of audio features. The model's efficacy is demonstrated through subjective experiments, whose results also show significant correlations with the proposed objective metrics. The model is validated using a widely used frequency modulation synthesizer with a large set of interdependent parameters. It can be adapted to various commercial synthesizers and can perform other tasks such as modulations and extrapolations.
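
To make the comparison concrete, the following minimal sketch contrasts the baseline (linear interpolation applied directly to the synthesis parameters) with interpolation performed in a latent space. It is an illustration only, not the model described here: the names `lerp` and `interpolate_presets` and the `encode`/`decode` callables are hypothetical placeholders for a trained encoder and decoder, and the use of a straight line between the two latent codes is an assumption.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def lerp(a: Sequence[float], b: Sequence[float], t: float) -> Vector:
    """Linearly interpolate between two equal-length vectors, t in [0, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolate_presets(
    start: Sequence[float],
    end: Sequence[float],
    n_steps: int,
    encode: Optional[Callable[[Sequence[float]], Vector]] = None,
    decode: Optional[Callable[[Sequence[float]], Vector]] = None,
) -> List[Vector]:
    """Return n_steps presets going from start to end (both included).

    Without encode/decode, this is the baseline: linear interpolation of
    the raw synthesis parameters. With both given (hypothetical stand-ins
    for a trained model), the endpoints are encoded, the latent codes are
    interpolated linearly (an assumption), and each point is decoded.
    """
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    ts = [i / (n_steps - 1) for i in range(n_steps)]
    if encode is None or decode is None:
        return [lerp(start, end, t) for t in ts]
    z_start, z_end = encode(start), encode(end)
    return [decode(lerp(z_start, z_end, t)) for t in ts]


# Toy usage with two-parameter "presets" normalized to [0, 1].
baseline_steps = interpolate_presets([0.0, 1.0], [1.0, 0.0], 3)
```

With a real model, `encode` and `decode` would map between preset parameters and latent codes; only the intermediate steps, not the endpoints, are expected to differ from the baseline.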