Voice Cloning and TTS with F5-TTS

TL;DR - just listen to my cloned voice

F5 > E2

F5-TTS (A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching)
improves upon
E2 TTS (Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS)

But how?

E2-TTS is a streamlined and highly effective approach to zero-shot text-to-speech synthesis, achieving state-of-the-art results through non-autoregressive techniques and flow matching, without the need for complex alignment models or external duration models.

F5-TTS builds on the success of E2-TTS by introducing several key improvements:

- Faster training and inference, thanks to architectural changes like the use of ConvNeXt blocks for text refinement and the elimination of rigid alignment methods.
- Improved naturalness and alignment, through the introduction of Sway Sampling, which enhances the model's ability to generate fluent, faithful speech from text.
- Better zero-shot performance: F5-TTS excels at generating speech with high speaker similarity and naturalness, even in unseen scenarios, surpassing E2-TTS.

Limitations?

A source audio clip for voice cloning is limited to 15 seconds. Outputs are limited to 30 seconds... sort of: generations can be as long as you want, but the final audio file is stitched together from multiple 30-second clips that were generated in sequence.
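
If your reference recording runs longer than 15 seconds, trim it before cloning. Here is a minimal sketch using soundfile (my choice of library, not something F5-TTS requires) that keeps only the first 15 seconds of a clip:

```python
import soundfile as sf

MAX_REF_SECONDS = 15  # reference clips for voice cloning are capped at ~15 s

def trim_reference(in_path: str, out_path: str, max_seconds: float = MAX_REF_SECONDS) -> None:
    """Keep only the first `max_seconds` of a reference clip."""
    audio, sample_rate = sf.read(in_path)          # audio: (num_samples,) or (num_samples, channels)
    max_samples = int(max_seconds * sample_rate)   # convert the cap from seconds to samples
    sf.write(out_path, audio[:max_samples], sample_rate)

# Example: cut a long recording down to a valid reference clip
trim_reference("my_voice_long.wav", "my_voice_ref.wav")
```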

What's sway sampling?

At inference time, Sway Sampling improves the model's performance by spending more flow steps early in the generation process, where the rough structure of the target speech is captured. This leads to better alignment between the text and the generated speech, especially in the initial stages of inference. (Listen to me saying this)
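
Concretely, sway sampling is just a remapping of the uniform flow-step schedule: with a negative coefficient the steps crowd toward t = 0. The formula and the default coefficient of -1 below are my reading of the paper and repo, so treat this as a sketch rather than the reference implementation:

```python
import numpy as np

def sway_sample(num_steps: int = 32, sway_coef: float = -1.0) -> np.ndarray:
    """Remap uniform flow steps t in [0, 1] with sway sampling.

    t_sway = t + s * (cos(pi/2 * t) - 1 + t); with s < 0 the remapped steps
    are denser near t = 0, so more ODE steps are spent on the early part of
    the flow, where the rough structure of the speech is formed.
    """
    t = np.linspace(0.0, 1.0, num_steps)
    return t + sway_coef * (np.cos(np.pi / 2.0 * t) - 1.0 + t)

uniform = np.linspace(0.0, 1.0, 8)
swayed = sway_sample(num_steps=8)
print(np.round(uniform, 3))  # evenly spaced steps
print(np.round(swayed, 3))   # same endpoints, but crowded toward 0
```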

Get the code and set it up for yourself

https://github.com/SWivid/F5-TTS

Once you set it up, there is a Gradio reference app shipped with the code. Make sure you check out the 'multi-style' tab, where you can upload multiple source audio files for the different 'emotions' you want to ascribe to the generated audio.
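
Beyond the Gradio app, the repo also exposes inference from Python. The import path, class name, and keyword arguments below are assumptions based on the README at the time of writing, so double-check them against the current repo before relying on this sketch:

```python
# Sketch only: import path, class name, and keyword arguments are assumptions
# based on the F5-TTS README at the time of writing; verify before use.
from f5_tts.api import F5TTS

tts = F5TTS()  # loads the pretrained checkpoint

wav, sample_rate, spectrogram = tts.infer(
    ref_file="my_voice_ref.wav",                      # <= 15 s clip of the voice to clone
    ref_text="Transcript of the reference clip.",
    gen_text="Text I want spoken in the cloned voice.",
    file_wave="output.wav",                           # where to write the generated audio
)
```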