High-fidelity waveform generation, particularly in text-to-speech (TTS) and audio generation applications, involves several critical challenges. Accurately generating natural-sounding audio remains a primary issue, essential for real-world deployment. Capturing the natural periodicity of high-resolution waveforms and producing high-quality output without artifacts such as metallic sounds or hissing noises is difficult. Additionally, slow inference speed limits the practicality of many high-quality generative models. Overcoming these challenges is vital for advancing AI capabilities in voice conversion, TTS, and general audio synthesis.
Current waveform generation approaches predominantly utilize GAN-based models such as MelGAN, HiFi-GAN, and BigVGAN. These models generate high-quality waveforms rapidly by using various discriminators to capture distinct audio signal characteristics. However, they face substantial limitations, including the necessity for extensive hyperparameter tuning, complex loss functions, and susceptibility to train-inference mismatches, which can lead to undesirable artifacts in the generated audio. Diffusion models like Multi-Band Diffusion (MBD) attempt to address quality issues by modeling frequency bands separately but suffer from slow generation speeds and difficulty in capturing high-frequency information accurately, limiting their practical application in real-time or high-fidelity contexts.
A team of researchers from Ajou University, Korea University, and KT Corp. proposes PeriodWave, a novel waveform generation method that incorporates period-aware flow matching. The approach captures the periodic features of waveform signals by including multiple periods in the estimation process, reflecting the natural periodicity of high-resolution waveforms. Its core idea is to use flow matching to estimate vector fields along optimal transport paths, which supports fast and accurate waveform generation. The method also introduces a period-conditional universal estimator that enables parallel inference across different periods, significantly improving computational efficiency. Additionally, PeriodWave employs the discrete wavelet transform (DWT) for frequency disentanglement, enhancing the model’s ability to generate accurate high-frequency components. Together, these techniques offer a more efficient and scalable solution for high-fidelity waveform generation.
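To make the flow matching idea concrete, here is a minimal sketch of how a training pair can be built along a straight-line (optimal transport) path between noise and a target waveform. It follows the standard conditional flow matching formulation rather than PeriodWave's actual code; the function name `ot_flow_matching_pair`, the `sigma_min` value, and the toy sine "waveform" are all assumptions made for illustration.

```python
import numpy as np

def ot_flow_matching_pair(x1, x0, t, sigma_min=1e-4):
    """Build one flow matching training pair on a straight-line path.

    x1: target waveform, x0: Gaussian noise, t: time in [0, 1].
    Returns the interpolated sample x_t and the target vector field u_t.
    (Hypothetical helper; sigma_min is an assumed small constant.)
    """
    # Point on the path: starts at the noise (t=0), ends near the data (t=1).
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Time derivative of x_t, which is constant along a straight path.
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t

rng = np.random.default_rng(0)
x1 = np.sin(2 * np.pi * 220 * np.arange(1024) / 24000)  # toy stand-in for audio
x0 = rng.standard_normal(x1.shape)                       # noise sample
x_t, u_t = ot_flow_matching_pair(x1, x0, t=0.3)

# A vector field estimator v(x_t, t, mel) would be trained to regress u_t,
# for example with loss = np.mean((v - u_t) ** 2).
```

Because the target field is constant along the path, a well-trained estimator can in principle be integrated with few ODE steps at inference time, which is one reason this family of methods is attractive for fast generation.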
PeriodWave combines several technical components. A time-conditional UNet-based structure is used for vector field estimation, which is crucial for capturing the periodic features of waveform signals. Input signals are reshaped into 2D data corresponding to different periods, and period-aware feature extraction is performed using 2D convolutions and ResNet blocks. The model handles multiple periods by using prime numbers, so that the reshaped views do not overlap and together give comprehensive feature coverage. For high-frequency modeling, DWT is used to separate the waveform into multiple frequency bands, with a specialized estimator for each band. Furthermore, FreeU is incorporated to scale down high-frequency components in skip connections, reducing noise and improving overall waveform quality. The method is trained on datasets such as LJSpeech and LibriTTS and optimized using the AdamW optimizer.
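The two signal-level operations described above, reshaping a waveform by period and splitting it into frequency bands, can be sketched in a few lines of NumPy. This is an illustrative approximation, not the authors' implementation: the helper names, the example periods, and the choice of a one-level Haar wavelet are assumptions, and the wavelet used in PeriodWave may differ.

```python
import numpy as np

def reshape_by_period(x, period):
    """Zero-pad a 1D signal to a multiple of `period` and view it as
    a 2D array of shape (frames, period). Column j then holds every
    `period`-th sample starting at offset j."""
    pad = (-len(x)) % period
    x = np.pad(x, (0, pad))
    return x.reshape(-1, period)

def haar_dwt(x):
    """One-level Haar DWT (assumed wavelet): returns the low and high
    bands, each half the length of the (even-length) input."""
    x = x[: len(x) - len(x) % 2]          # drop a trailing odd sample
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)     # coarse approximation
    high = (even - odd) / np.sqrt(2.0)    # fine detail
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt: interleave the reconstructed samples."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2.0)
    x[1::2] = (low - high) / np.sqrt(2.0)
    return x

wave = np.sin(2 * np.pi * 5 * np.arange(210) / 210)

# Illustrative prime periods; each view would feed a 2D convolution.
views = {p: reshape_by_period(wave, p) for p in (2, 3, 5, 7)}

# Split into two bands and confirm the split is lossless.
low, high = haar_dwt(wave)
restored = haar_idwt(low, high)
```

Each 2D view exposes a different periodic structure to 2D convolutions, while the band split lets separate estimators focus on low and high frequencies before the bands are recombined.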
PeriodWave outperforms existing models on both objective and subjective metrics. On the LJSpeech dataset, it improves on M-STFT, PESQ, periodicity, and pitch accuracy relative to state-of-the-art models such as BigVGAN and HiFi-GAN, with significantly fewer training steps. For instance, PeriodWave+FreeU achieves a PESQ score of 4.293 and a pitch error distance of 15.753, compared with BigVGAN’s PESQ score of 4.210 and pitch error distance of 19.019 (higher PESQ and lower pitch error are better). Reaching this quality with only three days of training highlights the method's efficiency. It also holds up in out-of-distribution scenarios, performing well on the MUSDB18-HQ dataset, which includes various audio types beyond speech.
In conclusion, PeriodWave offers a novel period-aware flow matching approach that effectively captures the natural periodicity of high-resolution signals. The method addresses limitations in existing GAN-based and diffusion-based techniques through multi-period estimation, DWT for frequency disentanglement, and FreeU for noise reduction. Results show that PeriodWave not only improves the quality of generated waveforms but also significantly reduces training time, making it an efficient and practical option for TTS, audio generation, and related applications, and a potential replacement for conventional neural vocoders.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
The post PeriodWave: A Novel Universal Waveform Generation Model appeared first on MarkTechPost.