Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis


Overview

While emotional text-to-speech (TTS) has made significant progress, most existing research focuses on utterance-level emotional expression. Recently, there has been increasing interest in more fine-grained control of emotion and speaking rate, both of which are essential to expressive speech. This has led to the emergence of a more challenging task known as word-level emotion and speaking rate control, which remains difficult due to the lack of annotated data capturing intra-sentence emotional variation and prosodic dynamics. To overcome this challenge, we propose WeSCon, a novel approach that activates and enhances the word-level emotional expression capability of a pretrained zero-shot TTS model using only a small-scale public emotional dataset without emotion transitions. Our method incorporates a transition-smoothing strategy and a dynamic speed control mechanism to enable a non-finetuned TTS model to perform word-level expressive synthesis through a multi-round inference process. To further simplify the inference procedure, we introduce a dynamic attention bias mechanism along with a self-training strategy to fine-tune the original TTS model and activate its capability for fine-grained emotional and prosodic control. Experimental results demonstrate that WeSCon achieves state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot TTS capability of the pretrained model.





Figure 1. Overview of WeSCon. The 1st-stage teacher extends a zero-shot TTS model with dynamic speed control, transition smoothing, and multi-round inference to enable word-level emotion and speaking rate control. In the 2nd stage, it supervises a student model with a dynamic emotion attention bias (DEAB) to achieve the same control in an end-to-end manner with reduced inference complexity.



Figure 2. Word-level emotion and speaking rate control using a transition-smoothing module and dynamic speed adjustment. At each inference round, an emotional prompt is used to generate a speech segment, with the tail of the previous output appended to ensure continuity. Speaking rate is controlled by interpolating or downsampling prompt speech tokens. The final utterance is produced by concatenating all segments and decoding them through flow matching and a vocoder.
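
The procedure in Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `resample_tokens` and `multi_round_synthesis`, the `generate` callable (a stand-in for the zero-shot TTS language model), the `tail_len` parameter, and the use of nearest-neighbour resampling for discrete speech tokens are all hypothetical. It also assumes that a rate above 1 means "faster" and is realised by shortening the prompt token sequence, while a rate below 1 lengthens it by repeating tokens.

```python
def resample_tokens(tokens, rate):
    """Resample a discrete token sequence to roughly len(tokens) / rate items.

    Assumption: rate > 1 downsamples (prompt sounds faster), rate < 1
    repeats tokens (prompt sounds slower). Nearest-neighbour indexing is
    used because token ids cannot be averaged.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not tokens:
        return []
    n = len(tokens)
    new_len = max(1, round(n / rate))
    # Map every output position back to a source index.
    return [tokens[min(n - 1, int(i * n / new_len))] for i in range(new_len)]


def multi_round_synthesis(segments, generate, tail_len=2):
    """Generate one speech-token segment per (text, prompt, rate) triple.

    `generate(context_tokens, text)` is a hypothetical callable that returns
    a list of speech tokens. Each round conditions on the rate-adjusted
    emotional prompt followed by the tail of the previous round's output,
    so that consecutive segments stay continuous.
    """
    outputs = []
    prev_tail = []
    for text, prompt_tokens, rate in segments:
        context = resample_tokens(prompt_tokens, rate) + prev_tail
        segment = generate(context, text)
        outputs.append(segment)
        # Guard against tail_len == 0, since segment[-0:] is the whole list.
        prev_tail = segment[-tail_len:] if tail_len > 0 else []
    # The concatenated tokens would then go through flow matching and a vocoder.
    return [tok for seg in outputs for tok in seg]
```

In this sketch, a segment with rate 2.0 sees a prompt half as long as the original, and every round after the first also sees the last `tail_len` tokens of the preceding segment.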

Figure 3. The proposed self-training strategy. A teacher model that relies on complex multi-round inference supervises a student TTS model to enable word-level emotion and speaking rate control. The dynamic emotion attention bias (DEAB) mechanism further enhances expressive generation in a simplified, end-to-end, single-pass inference setting.
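
The caption does not spell out how the attention bias is computed, so the following is only a generic sketch of an additive attention bias, not the formulation used in WeSCon. It assumes that every generated position and every prompt position carries a segment id, and that positions in the same segment receive a fixed positive bias on their attention logits. The names `build_emotion_bias`, `biased_attention_weights`, and the `boost` value are hypothetical.

```python
import math


def build_emotion_bias(query_segments, key_segments, boost=2.0):
    """Return bias[i][j] = boost when query i and key j share a segment id.

    Assumption: the segment id of a generated position says which emotional
    prompt it should follow, so the bias changes as generation moves from
    one emotion segment to the next.
    """
    return [
        [boost if q == k else 0.0 for k in key_segments]
        for q in query_segments
    ]


def biased_attention_weights(scores, bias):
    """Apply a row-wise softmax to (scores + bias)."""
    weights = []
    for score_row, bias_row in zip(scores, bias):
        logits = [s + b for s, b in zip(score_row, bias_row)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With uniform raw scores, a query in segment 0 puts more attention mass on the keys of segment 0 than on the others, which is the steering effect this kind of bias is meant to provide.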

Cross-Speaker Word-Level Emotion and Speaking Rate Control

Existing models cannot accomplish this task, because multi-round inference alone does not keep the generated speech consistent with the target speaker. In contrast, WeSCon achieves word-level emotion and speaking rate control while remaining consistent with the target speaker.
Target Text | Target Emotion | Target Speaking Rate
You failed again how could you but don't worry I'll handle it | Angry→Surprise→Neutral | Moderate→Slightly Slow→Slightly Fast
You came late and I was surprised but it's okay. | Angry→Surprise→Neutral | Moderate→Slightly Fast→Moderate
我被拒绝了 我愤怒地摔门 后来又有点想哭 (I was rejected, I slammed the door in anger, and later I felt a bit like crying) | Sad→Angry→Sad | Moderate→Slightly Fast→Slightly Slow
你怎么能这样 我真的很伤心 但我已经释怀了 (How could you do this, I am really hurt, but I have already let it go) | Angry→Sad→Neutral | Slightly Fast→Slow→Moderate
Each row is accompanied by audio samples for: Target Speaker, Index-TTS, Spark-TTS, F5-TTS, CosyVoice2, WeSCon.

Word-Level Emotion and Speaking Rate Control for the Same Speaker

Although existing methods can accomplish this task through multi-round inference, the generated segments are independent of each other, resulting in noticeable unnaturalness and abrupt transitions. In contrast, our method produces speech with smooth intra-sentence emotional shifts and speaking rate variations that sound highly natural.
This serves as the primary evaluation task in our paper, because the baselines perform very poorly on the cross-speaker task above.
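Target Text | Target Emotion | Target Speaking Rate
You failed again how could you but don't worry I'll handle it | Angry→Surprise→Neutral | Moderate→Slightly Fast→Slightly Slow
I begged you I screamed at you and now I'm just tired | Sad→Angry→Neutral | Moderate→Slightly Fast→Moderate
I worked so hard you mocked me but now I'm okay with it | Angry→Sad→Neutral | Moderate→Slightly Slow→Slightly Fast
You came late and I was surprised but it's okay. | Angry→Surprise→Neutral | Moderate→Slightly Fast→Moderate
我被拒绝了 我愤怒地摔门 后来又有点想哭 (I was rejected, I slammed the door in anger, and later I felt a bit like crying) | Sad→Angry→Sad | Moderate→Slightly Fast→Slightly Slow
我等了你很久 你居然忘了我 算了吧 (I waited for you for a long time, you actually forgot me, never mind) | Sad→Surprise→Neutral | Slow→Slightly Fast→Moderate
你怎么能这样 我真的很伤心 但我已经释怀了 (How could you do this, I am really hurt, but I have already let it go) | Angry→Sad→Neutral | Slightly Fast→Moderate→Moderate
他们突然为我鼓掌 我吓了一跳 然后忍不住笑了 (They suddenly applauded me, I was startled, and then I could not help laughing) | Surprise→Angry→Happy | Moderate→Moderate→Moderate
Each row is accompanied by audio samples for: Target Speaker, Index-TTS, Spark-TTS, F5-TTS, CosyVoice2, WeSCon.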