ZET-Speech

Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

1 AITRICS, 2 KAIST
*Equal Contribution

Abstract

Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional speech for speakers seen during training and do not consider generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods for the diffusion model. Experimental results demonstrate that ZET-Speech synthesizes natural speech with the desired emotion for both seen and unseen speakers.

Model Overview

The overall architecture of ZET-Speech is based on Grad-StyleSpeech. Domain adversarial training disentangles emotional features from the style vector extracted from the reference speech. In addition, classifier guidance (CG) and classifier-free guidance (CFG) strengthen the emotion conditioning of the diffusion model, so the synthesized speech follows the target emotion more closely.
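The two guidance variants named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_score`, `cfg_score`, `cg_score`, and `guidance_scale` are hypothetical names, the toy score function stands in for the diffusion network, and the blends use one common formulation of each technique (the exact weighting in ZET-Speech may differ).

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical signature: (noisy sample, diffusion time, emotion label or None) -> score
ScoreFn = Callable[[Sequence[float], float, Optional[int]], List[float]]


def toy_score(x_t: Sequence[float], t: float, emotion: Optional[int]) -> List[float]:
    """Stand-in for the diffusion network (assumption, for illustration only).

    Pulls each element of x_t toward a target value: 0.0 when no emotion
    label is given, otherwise the label itself interpreted as a float.
    """
    target = 0.0 if emotion is None else float(emotion)
    return [target - x for x in x_t]


def cfg_score(score_fn: ScoreFn, x_t: Sequence[float], t: float,
              emotion: int, guidance_scale: float) -> List[float]:
    """Classifier-free guidance: blend conditional and unconditional scores.

    guidance_scale = 0 gives the unconditional score, 1 gives the
    conditional score, and values above 1 push further toward the condition.
    """
    cond = score_fn(x_t, t, emotion)   # score with the emotion label
    uncond = score_fn(x_t, t, None)    # score with the label dropped
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def cg_score(uncond: Sequence[float], classifier_grad: Sequence[float],
             guidance_scale: float) -> List[float]:
    """Classifier guidance: add the scaled gradient of an emotion classifier's
    log-probability (with respect to x_t) to the unconditional score."""
    return [u + guidance_scale * g for u, g in zip(uncond, classifier_grad)]


if __name__ == "__main__":
    x_t = [0.0, 1.0]
    print(cfg_score(toy_score, x_t, 0.5, emotion=2, guidance_scale=2.0))  # [4.0, 3.0]
    print(cg_score([0.0, -1.0], [1.0, 0.5], guidance_scale=2.0))          # [2.0, 0.0]
```

In a sampler, the guided score would replace the plain network output at every reverse diffusion step. Classifier guidance needs a separately trained emotion classifier on noisy inputs, whereas classifier-free guidance only needs the same network evaluated with and without the emotion label.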

[Zero-shot] English Unseen Speaker Emotional TTS

Zero-shot speech samples synthesized with ZET-Speech (CFG).

Emotion | Speaker: 1284 👩, 260 👨, 237 👩, 1089 👨

Reference Speech

Neutral

Happy

Sad

Angry

Surprise


[Comparison] Korean Emotional TTS

Comparison of speech synthesized by 'Pure Emotional Grad-StyleSpeech' and 'ZET-Speech'.

Seen Speaker

Speaker 👩 / Emotion | 0015_OES, Happy | 0015_OES, Sad | 0015_OES, Angry | 0015_OES, Anxious

Reference Speech

Grad-StyleSpeech

ZET-Speech(CG)

ZET-Speech(CFG)

Speaker 👨 / Emotion | 0023_KSH, Happy | 0023_KSH, Sad | 0023_KSH, Angry | 0023_KSH, Anxious

Reference Speech

Grad-StyleSpeech

ZET-Speech(CG)

ZET-Speech(CFG)


Unseen Speaker (Zero-shot)

Speaker 👩 / Emotion | N0262, Happy | N0262, Sad | N0262, Angry | N0262, Anxious

Reference Speech

Grad-StyleSpeech

ZET-Speech(CG)

ZET-Speech(CFG)

Speaker 👨 / Emotion | P0539, Happy | P0539, Sad | P0539, Angry | P0539, Anxious

Reference Speech

Grad-StyleSpeech

ZET-Speech(CG)

ZET-Speech(CFG)


[All Emotion] Korean Emotional TTS

Speech samples for all emotions, synthesized with ZET-Speech (CFG).

Seen Speaker

Emotion | Speaker: 0016_YSH 👩, 0002_LYT 👨, 0017_LSY 👩, 0024_PGJ 👨

Reference Speech

Neutral

Happy

Sad

Angry

Anxious

Hurt

Embarrassed


Unseen Speaker (Zero-shot)

Emotion | Speaker: S0071 👩, P0570 👨, S0098 👩, P0943 👨

Reference Speech

Neutral

Happy

Sad

Angry

Anxious

Hurt

Embarrassed