Notes on voices and speech synthesis

This isn't really a blog post, but rather notes to help me better understand voices and speech synthesis.

Extracting the characteristics of vocal sound

  • Pitch (Fundamental Frequency) - the perceived highness or lowness of a voice, determined by the fundamental frequency (F0). The Fast Fourier Transform (FFT) is a purely mathematical operation that transforms a time-domain signal into its frequency components and can be used to locate F0. Autocorrelation and other signal processing techniques can also be used to estimate pitch (see the sketch after this list).

  • Formants - the resonant frequencies of the vocal tract that shape vowel sounds. Linear Predictive Coding (LPC) models the vocal tract as a filter by predicting each speech sample from the previous ones; the resonances of that filter give estimates of the formants.

  • Harmonics - frequencies that are integer multiples of the fundamental frequency. FFT can be used to analyze the harmonic content of speech signals; autocorrelation methods estimate pitch and detect harmonic structure by finding repeating patterns in the signal.

  • Amplitude (Loudness) - the volume or intensity of the voice, typically measured as the RMS energy of short frames of the signal (the FFT magnitude spectrum gives the same information per frequency band).

  • Timbre - the quality or color of the voice, affected by harmonic content. Spectral / Spectrogram Analysis - a visual representation of frequency content over time.

  • Duration - the length of speech segments like phonemes.

  • Mel-Frequency Cepstral Coefficients (MFCCs) - a compact representation of the power spectrum of sound, capturing the shape of the vocal tract. MFCC extraction uses established signal processing steps (framing, FFT, a mel filterbank, logarithms, and a discrete cosine transform) to approximate how humans perceive sound.
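
Below is a minimal sketch of pulling these features out of a recording with librosa and numpy. The file name, frame sizes, and LPC order are placeholders I picked for illustration, not anything prescribed above.

```python
# Rough feature extraction sketch (assumes a mono WAV file at "speech.wav").
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None, mono=True)

# Pitch (F0): pYIN is an autocorrelation-style estimator; picking peaks
# straight off an FFT also works for clean voiced audio but is less robust.
f0, voiced_flag, voiced_probs = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Amplitude: RMS energy per short frame.
rms = librosa.feature.rms(y=y)[0]

# Timbre / harmonics: magnitude spectrogram (FFT applied frame by frame).
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Formants (rough): fit an LPC model to a short frame and treat the angles of
# the complex roots of the prediction polynomial as resonance frequencies.
frame = y[:2048]
a = librosa.lpc(frame, order=12)
roots = [r for r in np.roots(a) if np.imag(r) > 0]
formants_hz = sorted(np.angle(roots) * sr / (2 * np.pi))

# MFCCs: mel filterbank + log + DCT over the power spectrum.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(f0.shape, rms.shape, S.shape, mfcc.shape, formants_hz[:3])
```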

Machine Learning for Speech Synthesis

  • Deep Learning for Feature Extraction - while traditional methods provide a baseline, machine learning models like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) can more accurately learn complex patterns in speech data, including how formants, harmonics, and timbre vary across different speakers and contexts. ML models can also generalize better over noisy or less structured data.

  • Autoencoders and Generative Models (e.g., Variational Autoencoders, GANs) - these models can learn latent representations of speech characteristics like timbre, harmonics, and formants without explicitly programmed rules, providing flexibility and accuracy in synthesis or recognition tasks.
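
As a toy illustration of the latent-representation idea (my own sketch, not from any specific paper): an autoencoder that compresses mel-spectrogram frames into a small latent code and reconstructs them. All dimensions and the random training data are arbitrary placeholders.

```python
# Toy autoencoder over mel-spectrogram frames (PyTorch).
import torch
import torch.nn as nn

N_MELS = 80    # mel bins per frame (assumption)
LATENT = 16    # size of the learned representation (assumption)

class FrameAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(N_MELS, 64), nn.ReLU(), nn.Linear(64, LATENT))
        self.decoder = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, N_MELS))

    def forward(self, x):
        z = self.encoder(x)            # latent code capturing timbre-like structure
        return self.decoder(z), z

model = FrameAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.randn(256, N_MELS)      # stand-in for real mel frames

for _ in range(10):                    # a few reconstruction steps
    recon, _ = model(frames)
    loss = nn.functional.mse_loss(recon, frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```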

Outputting Speech

A vocoder in speech synthesis is a signal processing tool that converts acoustic features into audible speech by encoding and synthesizing voice signals. To train a vocoder, you typically feed it pairs of input features (e.g., mel-spectrograms or acoustic features) and target speech waveforms. The ML model (like a neural vocoder) learns to map the input features to an output waveform through training, using supervised learning techniques and large speech datasets.
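
A schematic of that training setup, assuming a hypothetical NeuralVocoder module and random tensors standing in for paired (mel-spectrogram, waveform) data; real vocoders such as the ones linked below add adversarial and multi-resolution spectral losses on top of this.

```python
# Schematic neural vocoder training step (PyTorch); architecture is a toy stand-in.
import torch
import torch.nn as nn

class NeuralVocoder(nn.Module):
    """Toy stand-in: map 80-bin mel frames (hop 256) to raw samples."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(256, hop, kernel_size=7, padding=3),
        )

    def forward(self, mel):                                   # mel: (batch, n_mels, frames)
        x = self.net(mel)                                     # (batch, hop, frames)
        return x.transpose(1, 2).reshape(mel.shape[0], -1)    # (batch, samples)

model = NeuralVocoder()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

mel = torch.randn(4, 80, 100)            # input acoustic features
target = torch.randn(4, 100 * 256)       # paired target waveform

pred = model(mel)
loss = nn.functional.l1_loss(pred, target)   # simple waveform reconstruction loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```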

Examples: https://huggingface.co/charactr/vocos-mel-24khz or
https://huggingface.co/charactr/vocos-encodec-24khz

Further reading on Mel-Spectrograms and other audio stuff

https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53
https://huggingface.co/learn/audio-course/en/chapter1/audio_data

A lot of models, concepts, datasets, and related papers are linked from here:
https://github.com/coqui-ai/TTS
https://github.com/zzw922cn/awesome-speech-recognition-speech-synthesis-papers