This paper presents the winning system for SemEval-2026 Task 8's generation subtask, using a heterogeneous ensemble of seven LLMs with dual prompting strategies and a GPT-4o-mini judge to select the best response. The system achieved first place with a conditioned harmonic mean of 0.7827, outperforming all baselines and demonstrating the value of model diversity.
This paper introduces the Re3Align dataset, the REspGen framework, and the REspEval evaluation suite for author-in-the-loop response generation in peer review, integrating signals of author expertise and intent. The work addresses gaps in the NLP formulation of scientific rebuttal writing by providing a comprehensive dataset, a controllable generation framework, and multi-dimensional evaluation metrics.