Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
[Paper]
Chengyi Wang*,   Sanyuan Chen*,   Yu Wu*,   Ziqiang Zhang,   Long Zhou,   Shujie Liu,  
Zhuo Chen,   Yanqing Liu,   Huaming Wang,   Jinyu Li,   Lei He,   Sheng Zhao,   Furu Wei
Microsoft
Abstract. We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and treat TTS as a conditional language modeling task rather than as continuous signal regression, as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than the data used by existing systems. VALL-E exhibits emergent in-context learning capabilities and can synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis.
Model Overview
The overview of VALL-E. Unlike the previous pipeline (e.g., phoneme → mel-spectrogram → waveform), the pipeline of VALL-E is phoneme → discrete code → waveform. VALL-E generates the discrete audio codec codes conditioned on phoneme prompts and acoustic code prompts, which correspond to the target content and the speaker's voice, respectively. VALL-E directly enables various speech synthesis applications, such as zero-shot TTS, speech editing, and content creation combined with other generative AI models like GPT-3.
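The three-stage pipeline described above (phoneme → discrete code → waveform) can be sketched roughly as follows. This is a toy illustration, not VALL-E's actual code: every function name, the phoneme table, the codebook size of 1024, and the 320 samples per frame are assumptions made for the example. The "model" is a random stand-in that ignores its prefix, and the sketch uses a single stream of codes where a real codec may emit several codebooks per frame.

```python
import random
from typing import List

# All names and sizes here are hypothetical placeholders, not VALL-E's API.
PHONEME_VOCAB = {"HH": 0, "AH": 1, "L": 2, "OW": 3}
CODEBOOK_SIZE = 1024   # assumed number of entries in one codec codebook
EOS = CODEBOOK_SIZE    # extra token id that marks the end of generation


def text_to_phonemes(phonemes: List[str]) -> List[int]:
    """Stage 1: map phoneme symbols to integer ids (the content prompt)."""
    return [PHONEME_VOCAB[p] for p in phonemes]


def toy_next_code(phoneme_ids: List[int], prefix: List[int],
                  n_generated: int, rng: random.Random) -> int:
    """Stand-in for a trained codec language model.

    A real model would score the next code given the phonemes and all
    previous codes. This toy ignores the prefix contents, emits random
    codes, and stops after 5 frames per phoneme.
    """
    if n_generated >= 5 * len(phoneme_ids):
        return EOS
    return rng.randrange(CODEBOOK_SIZE)


def generate_codes(phoneme_ids: List[int], prompt_codes: List[int],
                   seed: int = 0, max_new: int = 1000) -> List[int]:
    """Stage 2: autoregressive generation of discrete codes.

    The acoustic prompt codes act as a prefix, so the continuation is
    conditioned on both the target content and the speaker's voice.
    """
    rng = random.Random(seed)
    new_codes: List[int] = []
    while len(new_codes) < max_new:
        prefix = list(prompt_codes) + new_codes
        nxt = toy_next_code(phoneme_ids, prefix, len(new_codes), rng)
        if nxt == EOS:
            break
        new_codes.append(nxt)
    return new_codes


def decode_to_waveform(codes: List[int], samples_per_frame: int = 320) -> List[float]:
    """Stage 3: stand-in for the codec decoder (returns silence of the right length)."""
    return [0.0] * (len(codes) * samples_per_frame)


phoneme_ids = text_to_phonemes(["HH", "AH", "L", "OW"])
codes = generate_codes(phoneme_ids, prompt_codes=[17, 402, 9], seed=0)
waveform = decode_to_waveform(codes)
print(len(codes), len(waveform))
```

The point of the sketch is the data flow: the model never predicts a spectrogram, only integer codes, and the codec decoder alone is responsible for turning those codes into audio samples.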
LibriSpeech Samples
Text | Speaker Prompt | Ground Truth | Baseline | VALL-E |
---|---|---|---|---|
They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission. | ||||
And lay me down in thy cold bed and leave my shining lot. | ||||
Number ten, fresh nelly is waiting on you, good night husband. | ||||
Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech. | ||||
Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid. | ||||
The army found the people in poverty and left them in comparative wealth. | ||||
Why should I rust and be stupid and sit in inaction? Because I am a girl. | ||||
Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings. | ||||
He was in deep converse with the clerk and entered the hall holding him by the arm. | ||||
They do not go where the enemies of the gospel predominate, they go where the christians are. |
VCTK Samples
Text | Speaker Prompt | Ground Truth | Baseline | VALL-E |
---|---|---|---|---|
We have to reduce the number of plastic bags. | ||||
So what is the campaign about? | ||||
Over time, with patience and precision, the terrorists will be pursued. | ||||
My life has changed a lot. | ||||
The problem in Norway is outside the National. | ||||
Nothing is yet confirmed. | ||||
I could hardly move for the next couple of days. | ||||
His son has been travelling with the Tartan Army for years. | ||||
Her husband was very concerned that it might be fatal. | ||||
We've made a couple of albums. |
Synthesis of Diversity
Because VALL-E generates discrete tokens by sampling, it can synthesize diverse personalized speech samples from the same pair of text and speaker prompts simply by using different random seeds.
Text | Speaker Prompt | VALL-E Sample1 | VALL-E Sample2 |
---|---|---|---|
Because we do not need it. | |||
I must do something about it. | |||
The problem in Norway is outside the National. | |||
He has not been named. | |||
After early nightfall, the yellow lamps would light up here and there the squalid quarter of the brothels. | |||
Number ten, fresh nelly is waiting on you, good night husband. |
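
The seed-dependent diversity shown in this section follows from sampling tokens instead of always taking the most likely one. A minimal sketch of that idea is given below. The `toy_logits` function is a hypothetical stand-in for a trained model, and the vocabulary size, sequence length, and temperature are arbitrary choices for illustration, not values taken from VALL-E.

```python
import math
import random
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(prefix: List[int], vocab_size: int) -> List[float]:
    """Hypothetical stand-in for a model: prefers tokens close to the previous one."""
    last = prefix[-1] if prefix else 0
    return [-abs(t - last) / 2.0 for t in range(vocab_size)]


def sample_tokens(seed: int, length: int = 12, vocab_size: int = 16,
                  temperature: float = 1.0) -> List[int]:
    """Sampling-based generation: the seed determines which sequence is drawn."""
    rng = random.Random(seed)
    tokens: List[int] = []
    for _ in range(length):
        probs = softmax(toy_logits(tokens, vocab_size), temperature)
        tokens.append(rng.choices(range(vocab_size), weights=probs, k=1)[0])
    return tokens


def greedy_tokens(length: int = 12, vocab_size: int = 16) -> List[int]:
    """Greedy decoding for contrast: deterministic, so there is no diversity."""
    tokens: List[int] = []
    for _ in range(length):
        logits = toy_logits(tokens, vocab_size)
        tokens.append(max(range(vocab_size), key=logits.__getitem__))
    return tokens


print("seed 1:", sample_tokens(seed=1))
print("seed 2:", sample_tokens(seed=2))
print("greedy:", greedy_tokens())
```

Repeating a seed reproduces the same token sequence, while changing it draws a different sequence from the same distribution. Greedy decoding, by contrast, returns one fixed sequence for a given prompt.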
Acoustic Environment Maintenance
VALL-E can synthesize personalized speech while maintaining the acoustic environment of the speaker prompt. The audio and transcriptions are sampled from the Fisher dataset.
Text | Speaker Prompt | Ground Truth | VALL-E |
---|---|---|---|
I think it's like you know um more convenient too. | |||
Um we have to pay have this security fee just in case she would damage something but um. | |||
Everything is run by computer but you got to know how to think before you can do a computer. | |||
As friends thing I definitely I've got more male friends. |
Speaker’s Emotion Maintenance
VALL-E can synthesize personalized speech while maintaining the emotion in the speaker prompt. The audio prompts are sampled from the Emotional Voices Database.
Text | Emotion | Speaker Prompt | VALL-E |
---|---|---|---|
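We have to reduce the number of plastic bags. | Anger | ||
We have to reduce the number of plastic bags. | Sleepy | ||
We have to reduce the number of plastic bags. | Neutral | ||
We have to reduce the number of plastic bags. | Amused | ||
We have to reduce the number of plastic bags. | Disgusted | ||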
More Samples
We randomly selected some transcriptions and 3-second audio segments from the LibriSpeech test-clean set as the text and speaker prompts, and then used VALL-E to synthesize the personalized speech. Note that because the transcriptions and audio segments come from different speakers, there is no ground-truth speech for reference.
Text | Speaker Prompt | VALL-E |
---|---|---|
The others resented postponement, but it was just his scruples that charmed me. | ||
Notwithstanding the high resolution of hawkeye, he fully comprehended all the difficulties and danger he was about to incur. | ||
We were more interested in the technical condition of the station than in the commercial part. | ||
Paul takes pride in his ministry not to his own praise but to the praise of god. | ||
They do not go where the enemies of the gospel predominate, they go where the christians are. | ||
The ideas also remain but they have become types in nature forms of men animals birds fishes. | ||
Other circumstances permitting that instinct disposes men to look with favor upon productive efficiency and on whatever is of human use. | ||
But suppose you said I'm fond of writing, my people always say my letters home are good enough for punch. | ||
He summoned half a dozen citizens to join his posse who followed obeyed and assisted him. |
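
The pairing procedure described for this section (a random transcription plus a random 3-second segment from a different speaker) might look roughly like the sketch below. The record layout with `speaker`, `text`, and `audio` keys and the `pick_pair` helper are hypothetical, and the authors' actual selection script is not shown on this page. The 16 kHz sample rate matches how LibriSpeech audio is distributed.

```python
import random
from typing import Dict, List

SAMPLE_RATE = 16_000    # LibriSpeech audio is distributed at 16 kHz
PROMPT_SECONDS = 3      # prompt length stated in the text


def pick_pair(utterances: List[Dict], rng: random.Random) -> Dict:
    """Pick a transcription and a 3 s speaker prompt from a different speaker.

    Each utterance is a dict with hypothetical keys 'speaker', 'text',
    and 'audio' (a list of samples). Raises IndexError if no utterance
    from another speaker is long enough to crop a prompt from.
    """
    n = PROMPT_SECONDS * SAMPLE_RATE
    text_utt = rng.choice(utterances)
    candidates = [
        u for u in utterances
        if u["speaker"] != text_utt["speaker"] and len(u["audio"]) >= n
    ]
    prompt_utt = rng.choice(candidates)
    # Random crop of exactly n samples from the chosen prompt utterance.
    start = rng.randrange(len(prompt_utt["audio"]) - n + 1)
    return {
        "text": text_utt["text"],
        "text_speaker": text_utt["speaker"],
        "prompt_speaker": prompt_utt["speaker"],
        "prompt_audio": prompt_utt["audio"][start:start + n],
    }


# Toy data standing in for real test-clean utterances.
toy_utterances = [
    {"speaker": "A", "text": "first sentence", "audio": [0.0] * (4 * SAMPLE_RATE)},
    {"speaker": "B", "text": "second sentence", "audio": [0.1] * (5 * SAMPLE_RATE)},
]
pair = pick_pair(toy_utterances, random.Random(0))
print(pair["text"], pair["text_speaker"], pair["prompt_speaker"])
```

Since the text and the prompt always come from different speakers, no recording of the prompt speaker reading that text exists, which is why these tables have no ground-truth column.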