Tag
SwanVoice is a zero-shot text-to-speech model designed for expressive long-form monologue and dialogue synthesis. It combines a VAE, a flow-matching DiT, and diffusion post-training to achieve higher richness and hierarchy scores than existing baselines.
Swanbench-Speech is a comprehensive benchmark for evaluating long-form speech generation across diverse scenarios. It uses multi-dimensional metrics covering acoustics, semantics, and expressiveness, and its results reveal limitations of current models.
This paper constructs a large dataset of 263,911 long-form stories annotated with TTCW-based creativity metrics and fine-tunes Qwen3 models to generate structured review reports. It finds that non-reasoning fine-tuning outperforms reasoning-supervised fine-tuning, which suffers from parse failures and irrelevant repetition.