Improving Music Source Separation through Feature-Informed Latent Space Regularization

Table of Contents

  1. Introduction
  2. Background: Music Source Separation
  3. Challenges with Current Source Separation Systems
  4. Transfer Learning Approach for Source Separation
  5. Student-Teacher Learning for Knowledge Distillation
  6. Proposed System Architecture
  7. Regularization Methods for Knowledge Transfer
    • Co-contrastive Regularization
    • Distance-Based Regularization
  8. Experimentation and Results
    • Evaluation Metrics
    • Comparison of Regularization Methods
    • Extension to Classification Tasks
  9. Conclusion
  10. References

Introduction

Music source separation is a technique used to separate individual tracks for each instrument from a mixture of music. This process has various applications, such as remixing, creating karaoke tracks, and building hearing aid systems. However, one of the challenges in current source separation systems is the lack of training data. Existing datasets for source separation contain relatively small amounts of training data compared to other domains, such as speech and image processing. In this work, we explore the possibility of using transfer learning to address this issue.

Background: Music Source Separation

Music source separation aims to extract individual instrument tracks from a mixed music signal. This capability is essential for various applications in the music industry, including remixing, karaoke creation, and hearing aid systems. By separating each instrument track, we can manipulate the audio independently and create new versions of the original music.

Challenges with Current Source Separation Systems

One of the main challenges in current source separation systems is the limited availability of training data. Existing datasets, such as MUSDB18, contain only a few hours of audio. This scarcity hampers the performance of source separation algorithms, as they struggle to generalize to different music genres and instrument combinations. To overcome this challenge, we propose exploring the use of transfer learning.

Transfer Learning Approach for Source Separation

Transfer learning is a technique that leverages knowledge learned in one domain to improve performance in another. In our case, we start from a large-scale model pre-trained on a dataset with abundant training data, such as AudioSet, and utilize its learned features to aid the downstream task of music source separation.
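
As a concrete illustration, the sketch below loads a publicly available VGGish model pre-trained on AudioSet through torch.hub and extracts frame-level features. The `harritaylor/torchvggish` entry point and the file name `mixture.wav` are assumptions for illustration; the original work may rely on a different checkpoint or feature pipeline.

```python
import torch

# Load a VGGish model pre-trained on AudioSet. The 'harritaylor/torchvggish'
# torch.hub entry point is one publicly available port (an assumption here);
# the original work may have used a different checkpoint.
vggish = torch.hub.load('harritaylor/torchvggish', 'vggish')
vggish.eval()

# This port's forward pass accepts a path to a WAV file and returns one
# 128-dimensional embedding per ~1 s of audio. 'mixture.wav' is hypothetical.
with torch.no_grad():
    features = vggish.forward('mixture.wav')  # shape: (num_frames, 128)
print(features.shape)
```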

Student-Teacher Learning for Knowledge Distillation

One approach to incorporating transfer learning into source separation is student-teacher learning. In this method, the pre-trained model acts as the teacher and the source separation system acts as the student. During training, the teacher transfers its knowledge to the student. During inference, however, the pre-trained model is no longer needed, which reduces the computational burden.
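
The division of roles can be captured in a few lines of PyTorch. The sketch below is a minimal illustration, not the actual training code: the module shapes, the placeholder targets, and the loss weight of 0.1 are all assumptions. The point is that the frozen teacher only contributes a training-time loss term and can be discarded afterwards.

```python
import torch
import torch.nn as nn

# Minimal stand-ins (illustrative dimensions, not the real architectures):
teacher = nn.Linear(513, 128)      # frozen pre-trained feature extractor
student = nn.Linear(513, 128)      # the separator's encoder
teacher.requires_grad_(False)      # the teacher is never updated

x = torch.rand(8, 513)             # a batch of magnitude-spectrogram frames
target = torch.rand(8, 128)        # placeholder separation target

separation_loss = nn.functional.mse_loss(student(x), target)
transfer_loss = nn.functional.mse_loss(student(x), teacher(x))
loss = separation_loss + 0.1 * transfer_loss   # 0.1 is an arbitrary weight
loss.backward()

# At inference time only `student` is run; `teacher` can be discarded.
```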

Proposed System Architecture

Our proposed system architecture consists of two main components: the source separation system and the pre-trained feature extractor. The source separation system takes the audio input, extracts an embedding space through an encoder-decoder structure, and reconstructs the individual instrument tracks. We choose the state-of-the-art system CrossNet-Open-Unmix (X-UMX) for our experiments. The pre-trained feature extractor is a VGG-like model trained on the AudioSet dataset, which contains a significant amount of training data.
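
To make the data flow concrete, here is a toy mask-based encoder-decoder in the same spirit, exposing the latent embedding that will later be regularized. All layer choices and dimensions are invented for illustration; Open-Unmix-style systems are considerably more elaborate.

```python
import torch
import torch.nn as nn

class ToySeparator(nn.Module):
    """Toy mask-based encoder-decoder; all dimensions are illustrative."""
    def __init__(self, n_bins=513, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bins, latent_dim), nn.Tanh())
        self.decoder = nn.Linear(latent_dim, n_bins)

    def forward(self, mix_spec):
        z = self.encoder(mix_spec)              # latent embedding (regularized)
        mask = torch.sigmoid(self.decoder(z))   # per-bin soft mask
        return mask * mix_spec, z               # source estimate + embedding

model = ToySeparator()
mix = torch.rand(8, 513)                        # magnitude-spectrogram frames
estimate, embedding = model(mix)
print(estimate.shape, embedding.shape)          # (8, 513) and (8, 128)
```

Returning the embedding alongside the estimate is the key design point: the regularization losses described next attach to that second output.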

Regularization Methods for Knowledge Transfer

To transfer the discriminative power of the pre-trained feature extractor to the embedding space of the source separation system, we propose two regularization methods: co-contrastive regularization and distance-based regularization. Co-contrastive regularization minimizes the cosine distance between an embedding and the corresponding VGG feature when both come from the same instrument category. Distance-based regularization ensures that the pairwise distances between embeddings mimic the pairwise distances between the corresponding VGG features.
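
Below is a minimal PyTorch sketch of the two regularizers, under our reading of the description above; the pairing of embeddings with VGG features by instrument category and the shared 128-dimensional space are assumptions.

```python
import torch
import torch.nn.functional as F

def cocontrastive_regularizer(z, vgg):
    """Pull each embedding toward the VGG feature of the same instrument
    category by minimizing the cosine distance (1 - cosine similarity)."""
    return (1.0 - F.cosine_similarity(z, vgg, dim=-1)).mean()

def distance_regularizer(z, vgg):
    """Encourage pairwise distances between embeddings to mimic the pairwise
    distances between the corresponding VGG features."""
    dz = torch.cdist(z, z)      # distances within the latent space
    dv = torch.cdist(vgg, vgg)  # distances within the VGG feature space
    return F.mse_loss(dz, dv)

# A batch of 8 latent embeddings paired (by instrument category) with VGG
# features projected into the same 128-dimensional space -- an assumption.
z, vgg = torch.rand(8, 128), torch.rand(8, 128)
print(cocontrastive_regularizer(z, vgg).item(),
      distance_regularizer(z, vgg).item())
```

Note the difference in what is constrained: the first loss aligns individual embedding-feature pairs, while the second only preserves the relative geometry of the feature space.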

Experimentation and Results

We conducted experiments on the MUSDB18 dataset, which contains four instrument sources: vocals, bass, drums, and other. We evaluated performance using three metrics: source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifact ratio (SAR). The results show that our proposed regularization methods improve SIR and SAR scores, reducing interference and artifacts in the separated sources. Distance-based regularization outperformed co-contrastive regularization in most cases.
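
These metrics are commonly computed with the BSS Eval framework, for example via the `mir_eval` package. The sketch below uses random placeholder signals purely to show the call; the sampling rate and durations are arbitrary.

```python
import numpy as np
import mir_eval

rng = np.random.default_rng(0)
# Placeholder signals: 4 sources (vocals, bass, drums, other), 5 s at 16 kHz.
reference = rng.standard_normal((4, 5 * 16000))
estimated = reference + 0.1 * rng.standard_normal((4, 5 * 16000))

# bss_eval_sources returns per-source SDR, SIR, SAR and the best permutation.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
print(sdr, sir, sar)
```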

Conclusion

In this work, we explored the use of transfer learning for music source separation. We proposed a student-teacher learning approach and regularization methods to transfer knowledge from a pre-trained feature extractor to the source separation system. Our experiments demonstrated improvements in source-to-interference and source-to-artifact ratios, indicating the effectiveness of our approach. Further experimentation on classification tasks also showcased the discriminative power of our proposed methods in audio tagging and music classification.

References

  • [1] MUSDB18 dataset - [link]
  • [2] AudioSet dataset - [link]
  • [3] CrossNet-Open-Unmix (X-UMX) - [link]
