- Combined tags, e.g. `[singing] [happy] ...` or `[singing] [sad] ...`
- Original OmniVoice capabilities (multilingual zero-shot TTS, voice cloning, voice design, 600+ languages) are preserved: the base speech head was protected during finetuning with a continuity mix of plain speech and singing.
## Drop-in replacement
This checkpoint is fully compatible with the upstream `k2-fsa/OmniVoice` code: same architecture (Qwen3-0.6B LM + HiggsAudioV2 audio tokenizer at 24 kHz), same inference API. Just replace the model id:
```python
import soundfile as sf

from omnivoice.models.omnivoice import OmniVoice

model = OmniVoice.from_pretrained("ModelsLab/omnivoice-singing").to("cuda").eval()

# Normal speech (unchanged behavior)
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
)

# Singing
audios = model.generate(
    text="[singing] Twinkle twinkle little star, how I wonder what you are.",
    language="English",
)

# Emotional speech
audios = model.generate(
    text="[happy] I just got the best news of my entire year!",
    language="English",
)

# Combined
audios = model.generate(
    text="[singing] [sad] Quiet rain falls on the stone, memories of days now gone.",
    language="English",
)

sf.write("out.wav", audios[0], model.sampling_rate)
```
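Voice cloning from the base model is preserved as well. The sketch below is hypothetical: `ref_audio` is an assumed keyword, and the actual argument name for zero-shot cloning should be checked against the upstream `generate` signature:

```python
# Hypothetical sketch: clone the voice in speaker.wav while applying a tag.
# `ref_audio` is an ASSUMED parameter name, not confirmed by this model card;
# consult the upstream k2-fsa/OmniVoice API for the real cloning argument.
audios = model.generate(
    text="[happy] What a beautiful morning this turned out to be!",
    language="English",
    ref_audio="speaker.wav",  # assumption: path to a short reference clip
)
```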
The CLI works the same way:
```bash
omnivoice-infer --model ModelsLab/omnivoice-singing \
  --text "[happy] Hello there, how wonderful to see you today!" \
  --language English \
  --output out.wav
```
## Supported tags

| Tag | Source data | Strength |
|---|---|---|
| `[singing]` | GTSinger English (6,755 clips, ~8 h) | strong |
| `[happy]` | CREMA-D + RAVDESS + Expresso (~2,900 clips) | strong |
| `[sad]` | CREMA-D + RAVDESS + Expresso (~2,900 clips) | strong |
| `[angry]` | CREMA-D + RAVDESS (~1,500 clips) | strong |
| `[nervous]` | CREMA-D fear + RAVDESS fearful (~1,400 clips) | strong |
| `[whisper]` | Expresso whisper (~1,500 clips) | strong |
| `[calm]` | RAVDESS calm (~190 clips) | weak (limited data) |
| `[excited]` | RAVDESS surprised (~190 clips) | weak (limited data) |
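A quick way to audition every tag is to loop over the table above with the same `generate` call used earlier; the prompt text and output file naming here are illustrative only:

```python
import soundfile as sf

# One sample per supported tag; the prompt sentence is arbitrary.
tags = ["singing", "happy", "sad", "angry", "nervous", "whisper", "calm", "excited"]
for tag in tags:
    audios = model.generate(
        text=f"[{tag}] The stars are out tonight and the city is quiet.",
        language="English",
    )
    sf.write(f"sample_{tag}.wav", audios[0], model.sampling_rate)
```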
A guidance scale of 3.0 (up from the default 2.0) is recommended to make tag behavior more pronounced.
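A sketch of how that might look, assuming the scale is exposed as a `guidance_scale` keyword on `generate` (the actual argument name should be verified against the upstream signature):

```python
# Assumption: `guidance_scale` is the keyword for the guidance strength.
audios = model.generate(
    text="[singing] [happy] Morning light is breaking over the hills today.",
    language="English",
    guidance_scale=3.0,  # default is 2.0; 3.0 makes tag behavior stronger
)
```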
Best eval loss was 4.72 (step 750); the final eval loss was 4.88 (step 2500). The published checkpoint is the final emotion step 2500, which, despite the higher loss, subjectively produces the cleanest emotional tag behavior while preserving speech and singing quality.
## Known limitations
- `[calm]` and `[excited]` had only ~190 training samples each (only one dataset contributed), so their behavior is weaker than that of the other emotion tags.
- Cross-language singing (sung Hindi, Gujarati, etc.) is extrapolation: it works, but quality varies.
- Like the base model, output quality is bounded by the HiggsAudioV2 tokenizer (24 kHz, ~2 kbps, speech-domain tuned). Music and drum content is not supported by design.
## License

Apache 2.0. Downstream users must also comply with the individual licenses of the training datasets (GTSinger, CREMA-D, RAVDESS, Expresso).