Tag
This paper presents a corpus-based study showing that the neutral tone in Mandarin Chinese is a lexical tone with its own tonal target. The evidence comes from phonetic and semantic analyses of Beijing and Taiwan Mandarin spoken corpora, using generalized additive models and contextualized embeddings.
This paper proposes building a new corpus for linguistics research from Linguistics Olympiad data.
This paper introduces a register-aware linguistic evaluation framework that assesses how human-like the output of large language models (LLMs) is by comparing the distributions of 67 lexico-grammatical features in human-written and LLM-generated texts using Maximum Mean Discrepancy. Experiments across seven instruction-tuned open-source models and five registers show that no model perfectly matches the human baselines, and that closeness to human language varies by register rather than by model size.
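As a rough illustration of the comparison described in the last summary, the sketch below computes a squared Maximum Mean Discrepancy between two feature matrices, one row per text and one column per feature. The summary does not state the kernel, bandwidth, or estimator used, so the RBF kernel, the default `gamma = 1 / n_features`, and the biased estimator are all assumptions made here for illustration. The function names are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    # Squared Euclidean distance between every row of a and every row of b.
    sq_dist = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd2_biased(x, y, gamma=None):
    """Biased estimate of squared MMD between samples x and y.

    x, y: arrays of shape (n_texts, n_features), e.g. 67 feature
    columns per text. gamma defaults to 1 / n_features, which is an
    assumption of this sketch and not a value taken from the paper.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if gamma is None:
        gamma = 1.0 / x.shape[1]
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return k_xx + k_yy - 2.0 * k_xy
```

A larger value indicates that the two samples differ more under the chosen kernel. In a register-aware setup, one would presumably compute this separately for each register, comparing human texts in that register with model outputs for the same register. Features on very different scales would normally be standardized first, since the RBF kernel is sensitive to scale.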