Google AI Introduces ZeroBAS: A Neural Method to Synthesize Binaural Audio from Monaural Audio Recordings and Positional Information without Training on Any Binaural Data

Humans possess an extraordinary ability to localize sound sources and interpret their environment using auditory cues, a phenomenon termed spatial hearing. This capability enables tasks such as identifying speakers in noisy settings or navigating complex environments. Emulating such auditory spatial perception is crucial for enhancing the immersive experience in technologies like augmented reality (AR) and virtual reality (VR). However, the transition from monaural (single-channel) to binaural (two-channel) audio synthesis—which captures spatial auditory effects—faces significant challenges, particularly due to the limited availability of multi-channel and positional audio data.

Traditional mono-to-binaural synthesis approaches often rely on digital signal processing (DSP) frameworks. These methods model auditory effects using components such as the head-related transfer function (HRTF), room impulse response (RIR), and ambient noise, typically treated as linear time-invariant (LTI) systems. Although DSP-based techniques are well-established and can generate realistic audio experiences, they fail to account for the nonlinear acoustic wave effects inherent in real-world sound propagation.

Supervised learning models have emerged as an alternative to DSP, leveraging neural networks to synthesize binaural audio. However, such models face two major limitations: First, the scarcity of position-annotated binaural datasets and second, susceptibility to overfitting to specific acoustic environments, speaker characteristics, and training datasets. The need for specialized equipment for data collection further constraints these approaches, making supervised methods costly and less practical.

To address these challenges, researchers from Google have proposed ZeroBAS, a zero-shot neural method for mono-to-binaural speech synthesis that does not rely on binaural training data. This innovative approach employs parameter-free geometric time warping (GTW) and amplitude scaling (AS) techniques based on source position. These initial binaural signals are further refined using a pretrained denoising vocoder, yielding perceptually realistic binaural audio. Remarkably, ZeroBAS generalizes effectively across diverse room conditions, as demonstrated using the newly introduced TUT Mono-to-Binaural dataset, and achieves performance comparable to, or even better than, state-of-the-art supervised methods on out-of-distribution data.

The ZeroBAS framework comprises a three-stage architecture as follows:

In stage 1, Geometric time warping (GTW) transforms the monaural input into two channels (left and right) by simulating interaural time differences (ITD) based on the relative positions of the sound source and listener’s ears. GTW computes the time delays for the left and right ear channels. The warped signals are then interpolated linearly to generate initial binaural channels.
In stage 2, Amplitude scaling (AS) enhances the spatial realism of the warped signals by simulating the interaural level difference (ILD) based on the inverse-square law. As human perception of sound spatiality relies on both ITD and ILD, with the latter dominant for high-frequency sounds. Using the Euclidean distances of source from both ears and , the amplitudes are scaled.
In stage 3, involves an iterative refinement of the warped and scaled signals using a pretrained denoising vocoder, WaveFit. This vocoder leverages log-mel spectrogram features and denoising diffusion probabilistic models (DDPMs) to generate clean binaural waveforms. By iteratively applying the vocoder, the system mitigates acoustic artifacts and ensures high-quality binaural audio output.

Coming to evaluations, ZeroBAS was evaluated on two datasets (results in Table 1 and 2): the Binaural Speech dataset and the newly introduced TUT Mono-to-Binaural dataset. The latter was designed to test the generalization capabilities of mono-to-binaural synthesis methods in diverse acoustic environments. In objective evaluations, ZeroBAS demonstrated significant improvements over DSP baselines and approached the performance of supervised methods despite not being trained on binaural data. Notably, ZeroBAS achieved superior results on the out-of-distribution TUT dataset, highlighting its robustness across varied conditions.

Subjective evaluations further confirmed the efficacy of ZeroBAS. Mean Opinion Score (MOS) assessments showed that human listeners rated ZeroBAS’s outputs as slightly more natural than those of supervised methods. In MUSHRA evaluations, ZeroBAS achieved comparable spatial quality to supervised models, with listeners unable to discern statistically significant differences.

Even though this method is quite remarkable, it does have some limitations. ZeroBAS struggles to directly process phase information because the vocoder lacks positional conditioning, and it relies on general models instead of environment-specific ones. Despite these constraints, its ability to generalize effectively highlights the potential of zero-shot learning in binaural audio synthesis.

In conclusion, ZeroBAS offers a fascinating, room-agnostic approach to binaural speech synthesis that achieves perceptual quality comparable to supervised methods without requiring binaural training data. Its robust performance across diverse acoustic environments makes it a promising candidate for real-world applications in AR, VR, and immersive audio systems.

Check out the Paper and Details. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 65k+ ML SubReddit.

Recommend Open-Source Platform: Parlant is a framework that transforms how AI agents make decisions in customer-facing scenarios. ^(Promoted)

The post Google AI Introduces ZeroBAS: A Neural Method to Synthesize Binaural Audio from Monaural Audio Recordings and Positional Information without Training on Any Binaural Data appeared first on MarkTechPost.

Source: Read MoreÂ

Sunshine And March Vibes (2025 Wallpapers Edition)

The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

How To Fix Largest Contentful Paint Issues With Subpart Analysis

How To Prevent WordPress SQL Injection Attacks

The Alters: Release date, mechanics, and everything else you need to know

I’ve fallen hard for Starsand Island, a promising anime-style life sim bringing Ghibli vibes to Xbox and PC later this year

This new official Xbox 4TB storage card costs almost as much as the Xbox SeriesXitself

I may have found the ultimate monitor for conferencing and productivity, but it has a few weaknesses

May report 2025

May report 2025

Write more reliable JavaScript with optional chaining

Deploying a Scalable Next.js App on Vercel – A Step-by-Step Guide

The Alters: Release date, mechanics, and everything else you need to know

The Alters: Release date, mechanics, and everything else you need to know

I’ve fallen hard for Starsand Island, a promising anime-style life sim bringing Ghibli vibes to Xbox and PC later this year

This new official Xbox 4TB storage card costs almost as much as the Xbox SeriesXitself

Google AI Introduces ZeroBAS: A Neural Method to Synthesize Binaural Audio from Monaural Audio Recordings and Positional Information without Training on Any Binaural Data

How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

Off-Policy Reinforcement Learning RL with KL Divergence Yields Superior Reasoning in Large Language Models

RTX 5060 Ti gets official support in NVIDIA’s newest driver, stability concerns are addressed

LiveKit is an end-to-end stack for WebRTC

Why the road from passwords to passkeys is long, bumpy, and worth it – probably

The Three Big Announcements by Databricks AI Team in June 2024

The best e-commerce website builders of 2025: Expert tested

xAI Released Grok-2 Beta:Â An AI Model with Unparalleled Reasoning, Benchmark-Topping Performance, and Advanced Capabilities

Quick Glossary: Web Browsers

Medusa Ransomware Group Claims Cyberattack on Organizations in USA, Canada

Google AI Introduces ZeroBAS: A Neural Method to Synthesize Binaural Audio from Monaural Audio Recordings and Positional Information without Training on Any Binaural Data

Related Posts