@TencentHunyuan: Can AI truly edit audio, not just generate it? Tencent Hy, in collaboration with SJTU, SII, NTU, TJU, ZODA, PKU, FDU, a…

X AI KOLs Timeline 06/08/26, 05:54 AM Papers

audio-editing benchmark multitask evaluation speech music sound

Summary

MMAE is a comprehensive benchmark for multitask audio editing that evaluates AI's ability to precisely modify existing audio clips via natural language instructions, with current models achieving under 5% exact match rate.


Can AI truly edit audio, not just generate it?

Tencent Hy, in collaboration with SJTU, SII, NTU, TJU, ZODA, PKU, FDU, and other collaborators, introduces MMAE.

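MMAE (A Massive Multitask Audio Editing Benchmark) is the first comprehensive evaluation benchmark for "Banana"-style speech and audio editing, that is, instruction-based editing of the kind popularized for images by Nano-banana.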

Instead of simply requiring the AI to “generate” audio, it demands that the AI understand an existing audio clip and precisely modify it according to natural language instructions—altering what needs to be changed while leaving the rest untouched.

Current models show an Exact Match Rate (EMR) below 5%, revealing a major gap in reliable audio editing.

MMAE includes:

- 2,000 high-fidelity samples from real-world scenarios
- 17,741 fine-grained rubric evaluation items
- 7 modality settings across sound, music, speech, and their mixtures
- 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing
- 8 operation types across local and global granularities
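The count of 7 modality settings is what you get from every non-empty combination of the three base modalities (3 single, 3 pairwise, 1 three-way). The sketch below enumerates those combinations and works out the average number of rubric items per sample from the quoted totals. It assumes the 7 settings are exactly these subsets, which the post does not state, and all names in the code are illustrative rather than taken from the benchmark.

```python
from itertools import combinations

# Base modalities named in the post.
BASE_MODALITIES = ("sound", "music", "speech")


def modality_settings(base=BASE_MODALITIES):
    """Return every non-empty combination of the base modalities.

    Assumption: the benchmark's modality settings are exactly these subsets.
    """
    settings = []
    for size in range(1, len(base) + 1):
        settings.extend(combinations(base, size))
    return settings


SETTINGS = modality_settings()

# Totals quoted in the post.
NUM_SAMPLES = 2_000
NUM_RUBRIC_ITEMS = 17_741
AVG_ITEMS_PER_SAMPLE = NUM_RUBRIC_ITEMS / NUM_SAMPLES

if __name__ == "__main__":
    for setting in SETTINGS:
        print(" + ".join(setting))
    print(f"{len(SETTINGS)} settings")
    print(f"{AVG_ITEMS_PER_SAMPLE:.2f} rubric items per sample on average")
```

On these totals, each sample is checked against roughly nine rubric items on average.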

How to use:

- arXiv: http://arxiv.org/abs/2606.07229
- GitHub: https://github.com/ddlBoJack/MMAE
- HuggingFace: https://huggingface.co/datasets/BoJack/MMAE…
- Demo: https://youtu.be/6At5nTWhlXI

MMAE: A Massive Multitask Audio Editing Benchmark

Source: https://arxiv.org/abs/2606.07229

Authors: Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen

Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
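To see how a rubric-based score can look respectable while the Exact Match Rate stays low, here is a minimal sketch. It assumes that a sample counts as an exact match only when every one of its rubric criteria passes; the abstract does not spell out the EMR formula, so treat that definition, the `RubricResult` schema, and the toy data as hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    """Judged outcome for one edited sample (hypothetical schema)."""
    sample_id: str
    passed: list[bool]  # one entry per rubric criterion


def rubric_pass_rate(results):
    """Fraction of individual criteria that pass, pooled over all samples."""
    total = sum(len(r.passed) for r in results)
    if total == 0:
        return 0.0
    return sum(sum(r.passed) for r in results) / total


def exact_match_rate(results):
    """Fraction of samples whose criteria ALL pass (assumed EMR definition)."""
    if not results:
        return 0.0
    exact = sum(1 for r in results if r.passed and all(r.passed))
    return exact / len(results)


# Toy data, not benchmark results: most criteria pass, few samples are perfect.
RESULTS = [
    RubricResult("a", [True, True, True]),
    RubricResult("b", [True, False, True, True]),
    RubricResult("c", [True, True, False]),
    RubricResult("d", [False, True]),
]

if __name__ == "__main__":
    print(f"criterion pass rate: {rubric_pass_rate(RESULTS):.2f}")
    print(f"exact match rate:    {exact_match_rate(RESULTS):.2f}")
```

In this toy set, 9 of 12 criteria pass (0.75) but only 1 of 4 samples is fully correct (0.25). Under the assumed definition, a single failed criterion, whether a missed instruction or an unintended change to the untouched context, is enough to cost a sample its exact match.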

Submission history

From: Ziyang Ma

**[v1]** Fri, 5 Jun 2026 12:52:41 UTC (4,461 KB)
