Unleashing the Power of Cyclic Noise in Speech Modeling

Table of Contents:

  1. Introduction
  2. Background
  3. Pure Machine Learning Based Approach
  4. Deep Learning with Signal Processing Techniques
  5. Neural Source Filter Wave Model (NSF)
  6. The Need for a Better Source Signal
  7. Technique Details
  8. Baseline Model: NSF with Harmonic Plus Noise
  9. Proposed Source Module
  10. Experiments and Results
  11. Summary and Conclusion

Introduction

In this article, we explore the use of cyclic noise as a source signal for neural source filter waveform models. We discuss the background of neural waveform modeling and the different approaches that have been proposed in this field. We also delve into the concept of the neural source filter waveform model and its relevance to speech modeling.

Background

The task of a neural waveform model is to convert input acoustic features, such as a spectrogram and f0, into an output waveform. Many models have been proposed, including purely machine learning-based approaches and those that combine deep learning with signal processing techniques. One well-known example is LPCNet, which belongs to the second group.

Pure Machine Learning Based Approach

The first group of models, which is based purely on machine learning, has gained popularity. It includes models like WaveNet and WaveRNN. WaveNet, for example, uses dilated 1D convolutions to convert input features into the desired waveform, while WaveRNN uses a recurrent network for the same task.

Deep Learning with Signal Processing Techniques

The second group of models combines deep learning with signal processing techniques. This approach takes inspiration from the classic source-filter architecture. One such model is the Neural Source Filter Wave Model (NSF), which we will focus on in this article. The NSF model utilizes a sine waveform as the source signal and 1D convolution as the filter to convert the source signal into the output waveform.
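The source-filter idea described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the amplitude and noise values, and the fixed three-tap kernel are hypothetical, and the fixed kernel merely stands in for the learned stack of 1D convolutions that an actual NSF filter would use.

```python
import numpy as np


def sine_source(f0, sample_rate=16000, amplitude=0.1, noise_std=0.003, seed=0):
    """Build a sine excitation from a per-sample F0 track (Hz, 0 = unvoiced)."""
    rng = np.random.default_rng(seed)
    f0 = np.asarray(f0, dtype=np.float64)
    # Integrate the instantaneous frequency to obtain the phase in radians.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    noise = rng.normal(0.0, noise_std, size=f0.shape)
    # Voiced samples carry the sine plus a little noise; unvoiced ones carry noise only.
    return np.where(f0 > 0, amplitude * np.sin(phase) + noise, noise)


def toy_filter(source, kernel):
    """Stand-in for the learned neural filter: one fixed 1D convolution."""
    return np.convolve(source, kernel, mode="same")


# Half a second of voiced speech at 100 Hz followed by half a second unvoiced.
f0 = np.concatenate([np.full(8000, 100.0), np.zeros(8000)])
source = sine_source(f0)
waveform = toy_filter(source, kernel=np.array([0.25, 0.5, 0.25]))
```

In a trained model the filter would be conditioned on the acoustic features and learned from data; the fixed smoothing kernel here only shows where that module sits in the signal path.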

The Need for a Better Source Signal

While the NSF model has shown promising results for female speakers, it remains an open question whether the sine waveform is the best choice for all types of sounds. Different types of sounds have different source signals: some can be quite random, while others may be sparse. Therefore, we aim to explore a better, more flexible source signal for the neural source filter waveform model.

Technique Details

In this section, we delve into the technical details of the baseline model, the NSF model with a harmonic plus noise structure. We explain the different modules of this network, including the condition module, the source module, and the neural filter module. Additionally, we discuss the frequency domain distance used for network training.

Baseline Model: NSF with Harmonic Plus Noise

The baseline model used in this study is the NSF model with the harmonic plus noise structure. This model converts input features into an output waveform using three modules: the condition module, the source module, and the neural filter module. The network is trained with a frequency domain distance between the generated and the natural waveform.
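A frequency domain distance of this kind can be sketched as follows. This is a minimal illustration, not the exact training criterion of the baseline: the log-power-spectrum form, the Hann window, the `eps` floor, and the three frame-length/hop settings in `configs` are all assumptions chosen for the example.

```python
import numpy as np


def stft_magnitude(x, frame_len, hop):
    """Magnitude spectra of Hann-windowed frames (requires len(x) >= frame_len)."""
    x = np.asarray(x, dtype=np.float64)
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def spectral_distance(generated, natural,
                      configs=((512, 128), (256, 64), (128, 32)), eps=1e-5):
    """Sum of log-power-spectrum distances over several (frame_len, hop) settings."""
    total = 0.0
    for frame_len, hop in configs:
        g = stft_magnitude(generated, frame_len, hop)
        n = stft_magnitude(natural, frame_len, hop)
        # Mean squared difference between the two log power spectra.
        total += 0.5 * np.mean((np.log(g**2 + eps) - np.log(n**2 + eps)) ** 2)
    return total


rng = np.random.default_rng(0)
natural = rng.standard_normal(4000)
generated = 0.5 * natural
distance = spectral_distance(generated, natural)
```

Combining several time-frequency resolutions is a common design choice for such losses, because a single frame length trades time resolution against frequency resolution.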

Proposed Source Module

The focus of this article is on the source module in the NSF model. We explain how the source module works and the different excitation signals it generates based on the observed f0. We discuss the sine generator and the generation of multiple harmonics. Additionally, we explore the use of noise as an excitation signal and the concept of exponentially decaying noise.
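One way to picture a cyclic noise excitation is as a pulse train that follows f0, convolved with a short burst of exponentially decaying noise. The sketch below follows that reading and is only an assumption-laden illustration: the function names are hypothetical, the decay is parametrised here as a time constant `beta` in seconds, and the exact parametrisation in the original work may differ.

```python
import numpy as np


def pulse_train(f0, sample_rate=16000):
    """Place one unit pulse each time a full pitch cycle completes (f0 in Hz, 0 = unvoiced)."""
    f0 = np.asarray(f0, dtype=np.float64)
    cycles = np.floor(np.cumsum(f0 / sample_rate))  # completed cycles so far
    pulses = np.zeros_like(f0)
    pulses[1:] = (np.diff(cycles) > 0).astype(np.float64)
    pulses[f0 <= 0] = 0.0  # no pulses in unvoiced regions
    return pulses


def decaying_noise_kernel(length, beta, sample_rate=16000, seed=0):
    """Gaussian noise shaped by exp(-t / beta); beta is an assumed time constant in seconds."""
    rng = np.random.default_rng(seed)
    t = np.arange(length) / sample_rate
    return rng.standard_normal(length) * np.exp(-t / beta)


def cyclic_noise(f0, kernel, sample_rate=16000):
    """Repeat the decaying-noise kernel at every pitch pulse."""
    pulses = pulse_train(f0, sample_rate)
    return np.convolve(pulses, kernel)[: len(pulses)]


f0 = np.concatenate([np.full(8000, 100.0), np.zeros(8000)])
kernel = decaying_noise_kernel(length=64, beta=0.002)
excitation = cyclic_noise(f0, kernel)
```

A small `beta` makes each burst die out quickly, so the excitation looks like a sparse pulse train; a large `beta` lets the noise fill the whole pitch period, so the excitation becomes more random. This is the flexibility that a single sine waveform lacks.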

Experiments and Results

In this section, we discuss the experiments conducted on the CMU ARCTIC dataset with four speakers (two female and two male). We explain the input features used, namely the mel spectrogram and f0. The proposed models are evaluated in a speaker-independent manner, without using speaker vectors. The results are presented as mean opinion scores for quality, comparing natural speech, WaveNet, the baseline models (with different excitation signals), and the proposed method with cyclic noise-based excitation.

Summary and Conclusion

To summarize, in this article we proposed the use of cyclic noise as a source signal for the neural source filter waveform model. We discussed the need for a better source signal and presented the technical details of the proposed model. In the experiments, the cyclic noise-based excitation was preferred over, and proved more flexible than, the sine waveform-based excitation. We also noted that the choice of parameters, such as the beta value, may be gender-dependent. In conclusion, the proposed method shows promise in improving the performance of neural source filter waveform models.

Highlights:

  • Introduction to neural source filter waveform models
  • Comparison of pure machine learning-based approaches and approaches combining deep learning with signal processing techniques
  • The concept and structure of the Neural Source Filter Wave Model (NSF)
  • Exploring a better source signal for the NSF model
  • Technical details of the baseline model, NSF with harmonic plus noise
  • Proposed source module utilizing cyclic noise as an excitation signal
  • Evaluation of the proposed method through experiments and results
  • Gender-dependent preferences in source signals
  • Summary and conclusion showcasing the potential of the proposed method

FAQ:

Q: What is the Neural Source Filter Wave Model (NSF)? A: The NSF model is a waveform model for speech modeling that combines deep learning with signal processing techniques.

Q: What is the baseline model used in the experiments? A: The baseline model used in the experiments is the NSF model with a harmonic plus noise structure.

Q: How was the performance of the proposed method evaluated? A: The performance of the proposed method was evaluated using the mean opinion score for quality, comparing it with natural speech, WaveNet, and baseline models.

Q: Is the choice of source signal dependent on gender? A: Yes, the results suggest that the preference for source signals may vary based on the gender of the speaker.
