Sesame AI: The Most Human AI Voice? A Deep Dive

In the rapidly evolving world of artificial intelligence, voice synthesis is seeing particularly exciting breakthroughs. Traditional Text-to-Speech (TTS) models have come a long way, but they often lack the nuance and emotional intelligence that make human speech so compelling. Enter Sesame AI, a research team that is pushing the boundaries of what is possible with AI-generated voices. They recently unveiled their conversational speech model (CSM), which is designed to sound remarkably human.

Key Points

Sesame AI has released a new conversational speech model (CSM) aimed at mimicking natural human speech.

CSM, unlike traditional TTS, uses both text and audio data.

The team consists of AI experts, including former leaders from Oculus and Discord.

A core element of CSM is conversational history, which lets it adjust pronunciation dynamically.

The technology has a diverse range of potential use cases.

Understanding Sesame AI's Conversational Speech Model (CSM)

What is Sesame AI?

Sesame AI is a research team dedicated to creating AI voices that sound as natural and expressive as human voices. They have developed a new conversational speech model, or CSM, which aims to be a step above traditional TTS. TTS models have already become powerful enough that they are difficult to distinguish from real voices in certain circumstances, and Sesame AI aims to raise the standard further.

CSM is designed to address some limitations of conventional TTS technology, giving the digital voice natural tone, rhythm, and expressivity.

CSM vs. Traditional TTS: A Key Difference

A central point in this discussion is the distinction between Sesame AI's CSM and more common TTS models. The primary distinction is that CSM does not rely solely on text input to generate speech; it also incorporates audio. Traditional TTS models take a body of text and translate it into sound. CSM can add tonal elements and adjust its delivery, but it is likely to be more costly to use.

Unlike traditional TTS models that generate speech directly from text, CSM also uses audio context. To produce more organic-sounding speech that better fits the tone of the moment, it processes information beyond just the provided text.
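As a rough illustration of that difference in inputs, here is a minimal Python sketch. The class and field names (TTSRequest, CSMRequest, ContextTurn) are hypothetical and are not Sesame AI's API. They only show that a CSM-style request carries earlier turns, as text plus audio, alongside the new text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TTSRequest:
    """Traditional TTS: the text to speak is the only input."""
    text: str


@dataclass
class ContextTurn:
    """One earlier turn of the conversation (hypothetical structure)."""
    speaker: str
    text: str
    audio: List[float]  # placeholder for that turn's audio samples


@dataclass
class CSMRequest:
    """CSM-style request: the new text plus the conversation so far."""
    text: str
    context: List[ContextTurn] = field(default_factory=list)


def input_size(request) -> dict:
    """Count how much text and audio a request carries."""
    turns = getattr(request, "context", [])
    return {
        "text_chars": len(request.text) + sum(len(t.text) for t in turns),
        "audio_samples": sum(len(t.audio) for t in turns),
    }


tts = TTSRequest(text="Your table is ready.")
csm = CSMRequest(
    text="Your table is ready.",
    context=[ContextTurn("user", "Is my table ready?", audio=[0.0] * 16000)],
)
print(input_size(tts))
print(input_size(csm))
```

The second request carries far more data than the first for the same sentence, which is consistent with the article's point that CSM needs more processing than text-only TTS.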

The Power of Conversational History

According to the video, CSM takes conversational history into account. In simple terms, a system that analyzes what has already been said is better placed to respond to a new query. Taking in the audio and tonal inflection of the spoken words, not just the text, gives the model a better understanding of each word and more context for its reply.

One example given in the YouTube clip is the ability to change pronunciation on the fly. In some regions of the UK, 'scone' is pronounced 'skon', and in others it’s pronounced 'skown'. CSM can recognize the pronunciation used by the other speaker and use the same pronunciation in its own speech. The more context the system has, the better it can adapt.
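The scone example can be mimicked with a toy Python sketch. This is not how CSM works internally: a real model would pick up the pronunciation implicitly from audio, whereas this sketch assumes each turn has already been annotated with the variant that was heard. The names Turn and pick_pronunciation are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    speaker: str
    # word -> pronunciation variant heard in this turn (assumed annotation)
    pronunciations: Dict[str, str] = field(default_factory=dict)


def pick_pronunciation(word: str, history: List[Turn], default: str,
                       me: str = "assistant") -> str:
    """Return the variant the other speaker used most recently, else default."""
    for turn in reversed(history):
        if turn.speaker != me and word in turn.pronunciations:
            return turn.pronunciations[word]
    return default


history = [
    Turn("user", {"scone": "skon"}),
    Turn("assistant"),
]
print(pick_pronunciation("scone", history, default="skown"))  # skon
print(pick_pronunciation("scone", [], default="skown"))       # skown
```

With no history the function falls back to its default, and with more history it has more chances to match the other speaker, which mirrors the point that more context gives a better result.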

The Team Behind Sesame AI

Brendan Iribe: Co-founder and Former Oculus CEO

Brendan Iribe, co-founder of Sesame AI, brings a wealth of experience to the team. He is best known as the former CEO of Oculus, the VR headset company acquired by Facebook (now Meta). His track record as a technology leader makes him a strong asset for the new company.

Ankit Kumar: Led Engineering for Discord's Clyde AI

Ankit Kumar led engineering for Discord’s Clyde AI. His software experience makes him a key member of the team and should help advance Sesame's conversational voice synthesis model.

Understanding Potential Costs

Inference Provider Considerations

Although there is no confirmed pricing, the speaker expects that using this system will carry at least a slight cost. Any system that combines large language models with the ability to analyze audio and change tonal inflections is bound to cost the end user something.

  • The processing behind CSM is complex and demands significant computing power.
  • More data has to be processed, and that additional requirement brings additional expense.
  • Sesame AI will still need to offer competitive pricing to stay competitive in the AI market.

While it is still too early to predict Sesame AI's pricing structure, costs will probably be set largely by the inference provider. The speaker notes that current TTS technology may be cheaper, since it needs less data and has been around longer. Ease of use, integration with other features, and how users engage with the product will also influence pricing.

Analyzing Sesame AI: Strengths and Weaknesses

👍 Pros

The Most Natural Sounding AI Voice

Pacing

Change Inflections

👎 Cons

Cost

Speed

Still inferior to human voices

Key Capabilities of Sesame AI's CSM

Nuance and Organic Qualities

In the video, the speaker reviews the CSM's voice and calls it the most natural voice he has come across. Features that stand out:

  • Pauses in speech are well measured and thoughtful, unlike those of other AI voices.
  • Tonal inflections sound human.
  • Speed and pacing change during speech to shift tone and meaning.

What About Speed of Use and Functionality?

The challenges the company is likely to face are cost, speed, and the question of whether a human voice will still be better.

While a listener may not be able to tell a human from an AI voice on a single word, the CSM has set a high benchmark for language and tonal awareness. The more context it has, the better its speech flows.

On the implementation side, Sesame AI may still fall short of what human processing and understanding can do. The open questions are how fast it runs and whether the AI can match that level of processing. As this is an emerging technology, improvements are likely to come quickly.

Potential Applications of Sesame AI

Transforming Industries with Human-Like AI Voices

According to the presenter, many industries, such as real estate, healthcare, and logistics, are starting to implement voice AI to automate phone calls.

Some of the potential applications are:

  • Real estate: schedule visits, answer property questions, and share information about the area and local attractions.
  • Healthcare: schedule appointments, contact clients with alerts, or remind them of upcoming appointments.
  • Logistics: notify clients of delivery times and rerouting issues, or provide customer service.

Frequently Asked Questions About Sesame AI

What is the biggest benefit of a CSM compared to a traditional TTS Model?
Incorporating real tonal inflection. Not only does it get the context of past words, it analyses what the human said and tries to emulate the real voice.
What are the goals of Sesame AI?
Sesame AI is committed to creating AI voices that sound as natural and expressive as human voices. Its approach looks beyond generating each word or sentence in isolation.

Related Questions

Which is better: a TTS model or a CSM?
This has not been settled, but there are some points to consider. TTS models have been around for a while. If a TTS model is far cheaper to use, it has the advantage. If the price is similar and more natural inflection and pacing are required, the CSM has the advantage. For now, listeners can often still tell the difference and tend to prefer a natural human voice over a TTS model.
