Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
Overview
While emotional text-to-speech (TTS) has made significant progress, most existing research focuses on utterance-level emotional expression. Recently, there has been increasing interest in more fine-grained control of emotion and speaking rate, both of which are essential to expressive speech. This has led to the emergence of a more challenging task known as word-level emotion and speaking rate control, which remains difficult due to the lack of annotated data capturing intra-sentence emotional variation and prosodic dynamics. To overcome this challenge, we propose WeSCon, a novel approach that activates and enhances the word-level emotional expression capability of a pretrained zero-shot TTS model using only a small-scale public emotional dataset without emotion transitions. Our method incorporates a transition-smoothing strategy and a dynamic speed control mechanism to enable a non-finetuned TTS model to perform word-level expressive synthesis through a multi-round inference process. To further simplify the inference procedure, we introduce a dynamic attention bias mechanism along with a self-training strategy to fine-tune the original TTS model and activate its capability for fine-grained emotional and prosodic control. Experimental results demonstrate that WeSCon achieves state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot TTS capability of the pretrained model.
Figure 1. Overview of WeSCon. The 1st-stage teacher extends a zero-shot TTS model with dynamic speed control, transition smoothing, and multi-round inference to enable word-level emotion and speaking rate control. In the 2nd stage, it supervises a student model with a dynamic emotion attention bias (DEAB) to achieve the same control in an end-to-end manner with reduced inference complexity.
Figure 2. Word-level emotion and speaking rate control using a transition-smoothing module and dynamic speed adjustment. At each inference round, an emotional prompt is used to generate a speech segment, with the tail of the previous output appended to ensure continuity. Speaking rate is controlled by interpolating or downsampling prompt speech tokens. The final utterance is produced by concatenating all segments and decoding them through flow matching and a vocoder.
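The speed control described in the caption can be pictured as index resampling over the prompt's discrete speech-token sequence. Below is a minimal sketch, assuming tokens are integer IDs and nearest-neighbor interpolation of indices; the function name and exact scheme are illustrative, as the caption only states that prompt tokens are interpolated or downsampled:

```python
import numpy as np

def resample_tokens(tokens, rate):
    """Resample a discrete speech-token sequence to adjust speaking rate:
    rate > 1.0 drops tokens (faster speech), rate < 1.0 repeats tokens
    via nearest-neighbor interpolation of indices (slower speech).
    Illustrative sketch only; not the paper's exact implementation."""
    n_out = max(1, round(len(tokens) / rate))
    # Map each output position back to the nearest source index.
    idx = np.round(np.linspace(0, len(tokens) - 1, n_out)).astype(int)
    return [tokens[i] for i in idx]

# Example: halve and double the duration of a 10-token prompt.
prompt = list(range(10))
fast = resample_tokens(prompt, 2.0)  # 5 tokens: roughly 2x speaking rate
slow = resample_tokens(prompt, 0.5)  # 20 tokens: roughly 0.5x speaking rate
```

A rate of 1.0 leaves the sequence unchanged, so the mechanism degrades gracefully when no speed adjustment is requested.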
Figure 3. The proposed self-training strategy. A teacher model under a complex multi-round inference manner supervises a student TTS model to enable word-level emotion and speaking rate control. The dynamic emotional attention bias mechanism further enhances expressive generation in a simplified end-to-end single-pass inference manner.
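The dynamic emotional attention bias can be understood as an additive term on the attention logits that steers each text position toward the matching emotion prompt. The sketch below shows that additive form with a hypothetical same-segment boost matrix; the paper only specifies that DEAB modulates attention dynamically, so the bias construction here is an assumption for illustration:

```python
import numpy as np

def build_emotion_bias(seg_q, seg_k, boost=2.0):
    """Hypothetical bias matrix: query positions are boosted toward key
    positions belonging to the same emotion segment. Assumed construction,
    not the paper's exact DEAB."""
    return boost * (seg_q[:, None] == seg_k[None, :]).astype(float)

def attention_with_bias(q, k, bias):
    """Scaled dot-product attention weights with an additive logit bias."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

# Three text positions spanning two emotion segments, three prompt keys.
seg_q = np.array([0, 0, 1])  # emotion segment id per query position
seg_k = np.array([0, 1, 1])  # emotion segment id per key position
bias = build_emotion_bias(seg_q, seg_k)
w = attention_with_bias(np.eye(3), np.eye(3), bias)
```

Because the bias is purely additive, it leaves the attention mechanism itself unchanged, which is consistent with fine-tuning a pretrained model rather than altering its architecture.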
Cross-Speaker Word-Level Emotion and Speaking Rate Control
Existing models cannot accomplish this task, since multi-round inference alone cannot keep the generated segments consistent with the target speaker's voice. Our WeSCon, in contrast, handles it cleanly.
| Target Text | Target Emotion | Target Speaking Rate | Target Speaker | Index-TTS | Spark-TTS | F5-TTS | CosyVoice2 | WeSCon |
|---|---|---|---|---|---|---|---|---|
| You failed again ⏭ how could you ⏭ but don't worry, I'll handle it | Angry→Surprise→Neutral | Moderate→Slightly Slow→Slightly Fast | | | | | | |
| You came late ⏭ and I was surprised ⏭ but it's okay. | Angry→Surprise→Neutral | Moderate→Slightly Fast→Moderate | | | | | | |
| 我被拒绝了 ⏭ 我愤怒地摔门 ⏭ 后来又有点想哭 (I was rejected ⏭ I slammed the door in anger ⏭ then I almost cried) | Sad→Angry→Sad | Moderate→Slightly Fast→Slightly Slow | | | | | | |
| 你怎么能这样 ⏭ 我真的很伤心 ⏭ 但我已经释怀了 (How could you do this ⏭ I'm truly heartbroken ⏭ but I've let it go) | Angry→Sad→Neutral | Slightly Fast→Slow→Moderate | | | | | | |
Word-Level Emotion and Speaking Rate Control for the Same Speaker
Although existing methods can accomplish this task through multi-round inference, the generated segments are independent of each other, resulting in noticeable unnaturalness and abrupt transitions. In contrast, our method produces speech with smooth intra-sentence emotional shifts and speaking rate variations that sound highly natural.
This serves as the primary evaluation task in our paper, as the baselines perform very poorly on the cross-speaker task above.

| Target Text | Target Emotion | Target Speaking Rate | Target Speaker | Index-TTS | Spark-TTS | F5-TTS | CosyVoice2 | WeSCon |
|---|---|---|---|---|---|---|---|---|
| You failed again ⏭ how could you ⏭ but don't worry, I'll handle it | Angry→Surprise→Neutral | Moderate→Slightly Fast→Slightly Slow | | | | | | |
| I begged you ⏭ I screamed at you ⏭ and now I'm just tired | Sad→Angry→Neutral | Moderate→Slightly Fast→Moderate | | | | | | |
| I worked so hard ⏭ you mocked me ⏭ but now I'm okay with it | Angry→Sad→Neutral | Moderate→Slightly Slow→Slightly Fast | | | | | | |
| You came late ⏭ and I was surprised ⏭ but it's okay. | Angry→Surprise→Neutral | Moderate→Slightly Fast→Moderate | | | | | | |
| 我被拒绝了 ⏭ 我愤怒地摔门 ⏭ 后来又有点想哭 (I was rejected ⏭ I slammed the door in anger ⏭ then I almost cried) | Sad→Angry→Sad | Moderate→Slightly Fast→Slightly Slow | | | | | | |
| 我等了你很久 ⏭ 你居然忘了我 ⏭ 算了吧 (I waited for you so long ⏭ yet you forgot me ⏭ forget it) | Sad→Surprise→Neutral | Slow→Slightly Fast→Moderate | | | | | | |
| 你怎么能这样 ⏭ 我真的很伤心 ⏭ 但我已经释怀了 (How could you do this ⏭ I'm truly heartbroken ⏭ but I've let it go) | Angry→Sad→Neutral | Slightly Fast→Moderate→Moderate | | | | | | |
| 他们突然为我鼓掌 ⏭ 我吓了一跳 ⏭ 然后忍不住笑了 (They suddenly applauded me ⏭ I was startled ⏭ then couldn't help laughing) | Surprise→Angry→Happy | Moderate→Moderate→Moderate | | | | | | |