@TencentHunyuan: Can AI truly edit audio, not just generate it? Tencent Hy, in collaboration with SJTU, SII, NTU, TJU, ZODA, PKU, FDU, a…
Summary
MMAE is a comprehensive benchmark for multitask audio editing that evaluates AI's ability to precisely modify existing audio clips via natural language instructions, with current models achieving under 5% exact match rate.
Cached at: 06/08/26, 07:21 AM
Can AI truly edit audio, not just generate it?
Tencent Hy, in collaboration with SJTU, SII, NTU, TJU, ZODA, PKU, FDU, and other collaborators, introduces MMAE.
MMAE (A Massive Multitask Audio Editing Benchmark) is the first comprehensive evaluation benchmark for the speech and audio “Banana”, that is, instruction-based editing of audio in the way Nano-banana edits images.
Instead of simply requiring the AI to “generate” audio, it demands that the AI understand an existing audio clip and precisely modify it according to natural language instructions—altering what needs to be changed while leaving the rest untouched.
Current models show an Exact Match Rate (EMR) below 5%, revealing a major gap in reliable audio editing.
MMAE includes:
- 2,000 high-fidelity samples from real-world scenarios
- 17,741 fine-grained rubric evaluation items
- 7 modality settings across sound, music, speech, and their mixtures
- 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing
- 8 operation types across local and global granularities
How to use:
- arXiv: http://arxiv.org/abs/2606.07229
- GitHub: https://github.com/ddlBoJack/MMAE
- HuggingFace: https://huggingface.co/datasets/BoJack/MMAE…
- Demo: https://youtu.be/6At5nTWhlXI
MMAE: A Massive Multitask Audio Editing Benchmark
Source: https://arxiv.org/abs/2606.07229
Authors: Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen
Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
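The abstract does not spell out how the Exact Match Rate is computed from the rubric. One plausible reading, offered here only as an assumption, is that a sample counts as an exact match when every one of its rubric criteria passes. The sketch below follows that reading; the `SampleResult` structure, the function names, and the toy verdicts are all hypothetical and are not taken from the MMAE code or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleResult:
    """Rubric verdicts for one edited clip (hypothetical structure)."""
    sample_id: str
    rubric_passed: List[bool]  # one pass/fail verdict per rubric criterion


def rubric_pass_rate(results: List[SampleResult]) -> float:
    """Fraction of individual rubric criteria passed across all samples."""
    total = sum(len(r.rubric_passed) for r in results)
    passed = sum(sum(r.rubric_passed) for r in results)
    return passed / total if total else 0.0


def exact_match_rate(results: List[SampleResult]) -> float:
    """Fraction of samples whose criteria ALL pass.

    ASSUMPTION: this all-or-nothing definition is our reading of "EMR",
    not a definition confirmed by the abstract.
    """
    if not results:
        return 0.0
    exact = sum(1 for r in results if r.rubric_passed and all(r.rubric_passed))
    return exact / len(results)


# Toy verdicts, invented for illustration only.
results = [
    SampleResult("a", [True, True, True]),
    SampleResult("b", [True, False, True, True]),
    SampleResult("c", [True, True, False]),
    SampleResult("d", [False, True]),
]

print(f"rubric pass rate: {rubric_pass_rate(results):.3f}")
print(f"exact match rate: {exact_match_rate(results):.3f}")

# Figures from the abstract: 17,741 criteria over 2,000 samples.
print(f"average criteria per sample: {17741 / 2000:.2f}")
```

Under this assumed definition, a model can pass most individual criteria and still score a low EMR, because a single failed criterion disqualifies the whole sample. With roughly nine criteria per sample on average (17,741 / 2,000), that all-or-nothing effect would be substantial.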
Submission history
From: Ziyang Ma
[v1] Fri, 5 Jun 2026 12:52:41 UTC (4,461 KB)