Tag
A human review of TranslateGemma-12b's translations revealed that 71% of segments rated clean by automated metrics actually contained errors, highlighting significant gaps in metric-only evaluation for multilingual translation quality.