Overview

Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-driven and interactive applications, such as audiobooks and conversational agents. We introduce CEAEval, a Context-rich framework for Evaluating Expressive Appropriateness in speech, which assesses whether a speech sample expressively aligns with the underlying communicative intent implied by its discourse-level narrative context. To support this task, we construct CEAEval-D, the first context-rich speech dataset with real human performances in Mandarin conversational speech, providing narrative descriptions together with fifteen dimensions of human annotations covering expressive attributes and expressive appropriateness. We further develop CEAEval-M, a model that integrates knowledge distillation, planner-based multi-model collaboration, adaptive audio attention bias, and reinforcement learning to perform context-rich expressive appropriateness evaluation. Experiments on a human-annotated test set demonstrate that CEAEval-M substantially outperforms existing speech evaluation and analysis systems.

Task Definition

Figure 1. A narrative example illustrating how expressive intent emerges from conversational context.



As illustrated in Figure 1, the task of evaluating speech expressive appropriateness under rich contextual settings aims to assess whether a spoken utterance conveys expressiveness that is appropriate to the narrative context in which it occurs. Expressive appropriateness is rated on a scale from 0 to 5. In addition, the model produces a coherent reasoning process, including an analysis of paralinguistic cues, so that the resulting scores and rationales can serve as references for other expressive speech tasks.
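To make the task interface concrete, the sketch below shows the expected inputs and outputs in Python. The names `evaluate_appropriateness` and `AppropriatenessJudgment` are illustrative placeholders, not part of any CEAEval release.

```python
from dataclasses import dataclass

@dataclass
class AppropriatenessJudgment:
    """Result of one evaluation: a 0-5 score plus a reasoning trace."""
    score: float     # expressive appropriateness, 0.0 to 5.0
    rationale: str   # analysis of paralinguistic cues and context fit

def evaluate_appropriateness(context_lines: list[str],
                             target_line: str,
                             audio_path: str) -> AppropriatenessJudgment:
    """Score how well the audio realization of `target_line` fits the
    expressive intent implied by `context_lines`; the model call itself
    is omitted in this sketch."""
    raise NotImplementedError("backed by a speech-LLM judge such as CEAEval-M")
```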

CEAEval-D: Dataset

Figure 2. Statistical distribution of annotation categories and attributes in the CEAEval-D dataset.



We recruit 18 graduate students with backgrounds in speech emotion research to annotate the selected 16.1 hours of data. The annotations cover multiple aspects, including expressive appropriateness scores, intonation, rhythm, emotion categories, refined textual context, TTS difficulty, recording conditions, background music presence, and paralinguistic vocalizations and sound events. Auxiliary information is also provided, such as utterance boundaries, refined textual content, and speaker metadata (role name, gender, and age). Together, these annotations capture complementary aspects of expressive behavior relevant to appropriateness judgments under rich contextual settings. As illustrated in Figure 2, the dataset spans a wide range of context sizes, prosodic patterns, and expressive conditions, enabling evaluation under diverse discourse settings; the figure summarizes statistics of the key annotation dimensions and contextual properties.
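For concreteness, the sketch below shows one way a single annotated utterance could be represented in Python. All field names are hypothetical; the released schema may differ.

```python
from dataclasses import dataclass, field

@dataclass
class CEAEvalDRecord:
    """Illustrative record layout for one annotated utterance in CEAEval-D."""
    # core expressive annotations
    appropriateness: float    # human score, 0.0 to 5.0
    emotion: str              # e.g. "anger"
    intonation: str           # e.g. "falling intonation"
    rhythm: str               # e.g. "solemn"
    tts_difficulty: str       # how hard the utterance is to synthesize
    recording_condition: str  # e.g. "normal speech"
    has_background_music: bool
    paralinguistic_events: list[str] = field(default_factory=list)  # e.g. ["laugh"]
    sound_events: list[str] = field(default_factory=list)
    # auxiliary information
    utterance_span: tuple[float, float] = (0.0, 0.0)  # start/end time in seconds
    refined_text: str = ""    # cleaned transcript of the utterance
    context_lines: list[str] = field(default_factory=list)  # refined narrative context
    role_name: str = ""
    gender: str = ""
    age: str = ""
```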

CEAEval-M: Model

Figure 3. Overview of the proposed three-stage training pipeline for context-aware speech expressiveness evaluation.



We propose CEAEval-M, a speech-LLM that evaluates expressive appropriateness by jointly reasoning over speech signals and rich textual context. As shown in Figure 3, the model is trained through a three-stage pipeline. First, we distill audio perceptual reasoning abilities from a captioning teacher using 3,505 hours of data, enabling the model to recognize expressive cues and paralinguistic events in speech. Next, a text-only expressive planner predicts the ideal expressive profile implied by the contextual text, which serves as a reference for appropriate expression. The speech-LLM judge is then fine-tuned to compare the observed speech realization with this planned expressiveness and to produce an appropriateness score in a chain-of-thought (CoT) style, supported by a learnable audio attention bias. Finally, we apply reinforcement learning to further improve scoring robustness and calibration with respect to human judgments.
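Among these components, the audio attention bias admits a compact illustration. The PyTorch sketch below shows one plausible form, a learnable per-head additive term applied to attention logits at audio-token key positions; this is our reading of the mechanism under stated assumptions, not the actual CEAEval-M implementation.

```python
import torch
import torch.nn as nn

class AudioAttentionBias(nn.Module):
    """Learnable per-head additive bias on attention logits for audio tokens
    (a sketch; the real module may also condition the bias on the input)."""

    def __init__(self, num_heads: int):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_heads))  # one scalar per head

    def forward(self, attn_logits: torch.Tensor,
                audio_mask: torch.Tensor) -> torch.Tensor:
        # attn_logits: (batch, heads, query_len, key_len), pre-softmax scores
        # audio_mask:  (batch, key_len), True where the key is an audio token
        bias = self.bias.view(1, -1, 1, 1) * audio_mask[:, None, None, :].to(attn_logits.dtype)
        return attn_logits + bias  # applied before softmax in each attention layer
```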


To illustrate how CEAEval-M performs context-rich expressive appropriateness evaluation in practice, we present a representative case study below.

Narrative Context (Context Size = 8):
  • Line 1: Jin Dangui wrapped her arms around his neck.
  • Line 2: Jin Dangui said: “We live and die together. Don’t you think that’s good?”
  • Line 3: Liu Da felt as if he were in a dream. He murmured a response, yet deep down he sensed something was amiss. Hiss—someone seemed to have died. Who had died? The beauty in his arms was vivid and alluring. Oh, right—the man surnamed Huo had said that her face was carved out. How could such a beautiful face be fake?
  • Line 4: Thinking of Huo Zhenye made him think of Bai Zhun, and a vague sense of doubt arose in his heart. There was something wrong with the scent of this French perfume—the fragrance had gone too far and turned into a rotten stench.
  • Line 5: Liu Da said: “You’re dead!”
  • Line 6: The Western-style spring bed was gone. There were no lights in the room, and no music either. Jin Dangui’s hands hooked around his neck, wet and icy cold. That face which had utterly bewitched him was right before his eyes—two bloody hollows where the eyes should be, stuffed with a new pair of eyeballs. But these eyeballs were not original; from time to time, they would slip out of their sockets. Whenever they fell out, Jin Dangui would use her fingers to push them back in.
  • Line 7: Liu Da said: “Ah!”
  • Line 8: Liu Da suddenly snapped back to his senses and let out a shrill scream. Jin Dangui’s face darkened.
  • Target Line 9: Jin Dangui said: “What? I specifically went out of my way to find this pair of eyeballs for you. They were obtained in exchange for earrings.”
Speech of Target Line: [audio sample]
Human Expressive Appropriateness Score: 5.0
Model Results
Expressive Planner Output:

The ideal performance in this context is: { "emotion": "anger", "rhythm": "solemn", "intonation": "falling intonation", "recording condition": "normal speech" }
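Downstream, the judge consumes this profile as structured data. A minimal, hypothetical parsing helper, assuming the JSON-in-text format shown above:

```python
import json
import re

def parse_planner_output(text: str) -> dict:
    """Extract the JSON expressive profile embedded in the planner's output."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no expressive profile found in planner output")
    return json.loads(match.group(0))

profile = parse_planner_output(
    'The ideal performance in this context is: '
    '{ "emotion": "anger", "rhythm": "solemn", '
    '"intonation": "falling intonation", "recording condition": "normal speech" }'
)
print(profile["emotion"])  # -> anger
```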

Judge Model Output:

The ideal emotion is anger, while the actual performance conveys skepticism and impatience. Although the specific emotional labels do not fully coincide, skepticism and impatience similarly express strong dissatisfaction and confrontational affect, which are highly aligned with anger in terms of emotional intensity and polarity. Therefore, the emotion score is 4.4.

In terms of rhythm, the ideal rhythm is high-pitched, and the actual performance is also high-pitched. The overall delivery is energetic and forceful, effectively supporting emotional outburst and semantic emphasis. The rhythm shows a high degree of consistency with the ideal expectation, resulting in a rhythm score of 5.0.

Regarding intonation, the ideal intonation is curved, and the actual performance likewise adopts a curved intonation pattern. This intonational contour enhances prosodic variation and expressive layering, helping to highlight emotional tension and speaker stance. The intonation score is 4.9.

For recording conditions, the ideal condition is normal speech, and the actual performance is consistent with this expectation. The audio is clear, with no noticeable environmental noise or recording artifacts, indicating stable and appropriate recording quality. Thus, the recording condition score is 5.0.

Overall, although there are minor differences between the ideal and actual performances at the level of fine-grained emotional categorization, the emotional intensity, rhythmic pattern, and overall expressive effect are highly aligned. The utterance effectively conveys the intended emotion and communicative intent, demonstrating strong expressive appropriateness. The final score is <score>5.0</score>.
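The final score is delimited by a `<score>` tag, as shown above, so it can be read off programmatically. A minimal sketch assuming that tag convention:

```python
import re

def extract_final_score(judge_output: str) -> float:
    """Return the final appropriateness score from the judge's CoT output."""
    match = re.search(r"<score>([\d.]+)</score>", judge_output)
    if match is None:
        raise ValueError("judge output contains no <score> tag")
    return float(match.group(1))
```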

📌 Key Takeaway
Compared with existing methods, our model performs more accurate audio-based analysis and focuses strictly on expressive appropriateness reasoning, avoiding hallucinated explanations. It demonstrates strong bilingual consistency and produces appropriateness scores that are more closely aligned with human judgments.