LLMs for automatic annotation of Mandarin narrative transcripts

arXiv cs.CL Papers

Summary

This paper evaluates LLMs for automatically annotating narrative macrostructure in spoken Mandarin, finding that the best model achieves near-human reliability while reducing annotation time by 65%, though performance degrades on semantically complex or lexically diverse narratives.

arXiv:2605.17205v1. Abstract: Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown promise in automating annotation tasks, their ability to handle complex discourse-level annotation in non-English languages remains understudied. This study evaluates whether LLMs can reliably annotate narrative macrostructure (the hierarchical organization of story grammar elements) in spoken Mandarin, using the Multilingual Assessment Instrument for Narratives (MAIN) as a testbed. We compared four LLMs against trained human annotators on narratives produced by children, young adults, and older adults. The best-performing model achieved agreement with human raters (k=.794) approaching human-human reliability levels (k=.872) while reducing annotation time by 65%, whereas the locally deployable lightweight model performed substantially worse. Annotation difficulty varied systematically by macrostructure element type, with categories requiring subtle semantic differentiation posing persistent challenges. Furthermore, model reliability decreased on young adult narratives, which exhibited greater lexical variation, semantic ambiguity, and multi-element integration within single utterances. These findings suggest that LLMs can effectively support discourse-level annotation in non-English spoken corpora, while highlighting the continued need for human oversight in semantically complex tasks. Our prompt templates are open-sourced for future use.

Cached at: 05/19/26, 06:38 AM

# LLMs for automatic annotation of Mandarin narrative transcripts
Source: [https://arxiv.org/abs/2605.17205](https://arxiv.org/abs/2605.17205)
[View PDF](https://arxiv.org/pdf/2605.17205)

> Abstract: Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown promise in automating annotation tasks, their ability to handle complex discourse-level annotation in non-English languages remains understudied. This study evaluates whether LLMs can reliably annotate narrative macrostructure (the hierarchical organization of story grammar elements) in spoken Mandarin, using the Multilingual Assessment Instrument for Narratives (MAIN) as a testbed. We compared four LLMs against trained human annotators on narratives produced by children, young adults, and older adults. The best-performing model achieved agreement with human raters (k=.794) approaching human-human reliability levels (k=.872) while reducing annotation time by 65%, whereas the locally deployable lightweight model performed substantially worse. Annotation difficulty varied systematically by macrostructure element type, with categories requiring subtle semantic differentiation posing persistent challenges. Furthermore, model reliability decreased on young adult narratives, which exhibited greater lexical variation, semantic ambiguity, and multi-element integration within single utterances. These findings suggest that LLMs can effectively support discourse-level annotation in non-English spoken corpora, while highlighting the continued need for human oversight in semantically complex tasks. Our prompt templates are open-sourced for future use.
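
The abstract reports agreement as k values without naming the statistic. As a rough illustration, assuming these are Cohen's kappa computed over per-utterance element labels, the sketch below shows how model-human agreement could be computed overall and per age group. The label names, group names, toy data, and the helpers `cohen_kappa` and `kappa_by_group` are all hypothetical; none of them come from the paper or its released prompt templates.

```python
from collections import Counter, defaultdict


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items that received the same label twice.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def kappa_by_group(records):
    """Compute kappa separately per group.

    `records` is an iterable of (group, human_label, model_label) tuples.
    """
    grouped = defaultdict(lambda: ([], []))
    for group, human_label, model_label in records:
        grouped[group][0].append(human_label)
        grouped[group][1].append(model_label)
    return {g: cohen_kappa(h, m) for g, (h, m) in grouped.items()}


# Toy data only: hypothetical labels for eight utterances in two age groups.
records = [
    ("children", "goal", "goal"),
    ("children", "attempt", "attempt"),
    ("children", "outcome", "outcome"),
    ("children", "none", "none"),
    ("young_adults", "goal", "attempt"),
    ("young_adults", "attempt", "attempt"),
    ("young_adults", "outcome", "outcome"),
    ("young_adults", "none", "goal"),
]

human = [h for _, h, _ in records]
model = [m for _, _, m in records]

overall = cohen_kappa(human, model)
per_group = kappa_by_group(records)

print(f"overall kappa: {overall:.3f}")
for group, value in per_group.items():
    print(f"{group}: {value:.3f}")
```

Splitting the computation by group mirrors the kind of breakdown the abstract describes (reliability differing by age group); the same helper could be keyed on element type instead. Kappa is used here rather than raw percent agreement because it corrects for agreement expected by chance when some labels are much more frequent than others.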

## Submission history

From: Qingwen Zhao. **[v1]** Sun, 17 May 2026 00:37:25 UTC (582 KB)
