Tag
Svarna is an open-source web-based corpus workbench for Modern Greek, integrating multiple databases with over 507 million words and providing various linguistic analysis tools, released under the MIT license.
This paper investigates how ethos and pathos appeals in social media messages resonate with silent readers, finding that rhetorical content leads to greater interpretive divergence and can predict audience attitudes toward the author.
This paper introduces UD_Czech-PDTC, a large and genre-diverse treebank for Czech in the Universal Dependencies framework, derived from the Prague Dependency Treebank-Consolidated. It describes the conversion process and differences between annotation schemes.
This paper proposes a benchmark suite grounded in Pāṇinian grammar to unify Indic language processing across languages, aiming to improve accuracy, data efficiency, and transferability.
The third edition of the Speech and Language Processing textbook by Jurafsky and Martin was released in January 2026, featuring a clear explanation of Transformers and various updates including new chapters on ASR, TTS, and DPO.
Dango is a 1.8B-parameter LLM trained strictly on Japanese (L1) and then fine-tuned on English (L2) to study language transfer effects in second language acquisition. English contamination is filtered from the pretraining corpus, and the model shows human-like L2 production patterns.
This paper presents multi-agent simulations of the emergence of morphological alternation patterns (like 'go/went') in language, using an AI Historical Linguist (LLM-driven) to evaluate plausibility of evolved morphologies against real languages.
CAF-Gen is a multi-agent LLM-driven framework that enriches shallow argument structures into formal Carneades Argumentation Framework models using an iterative Creator-Reviewer pipeline, achieving improved structural alignment and quality.
GlossAssist is a tool for creating interlinear glossed text (IGT) corpora in low-resource language documentation settings, built around the CWoMP retrieval-based architecture with an active learning feedback loop that improves predictions as annotators make corrections without retraining the model.
This article evaluates the integration of data from the French syntactic lexicon Lexicon-Grammar into a probabilistic parser, using word clustering methods on verbs to improve parsing accuracy for French.
This paper presents a modular framework for generating artificial lexicons that are pronounceable, typologically plausible, and semantically structured, using phoneme inventories from PHOIBLE and probabilistic grammars, outperforming deterministic baselines.
This paper proposes Scene Abstraction, a framework for constructing structured representations of the interpretive scenes that words evoke in context, using few-shot prompting of large language models. The authors introduce COCA-Scenes, a dataset of 520 usage instances, and provide empirical evidence that scenes are reliably identifiable and align better with human interpretation than alternatives.
This paper presents a novel pattern-and-root model for describing Arabic noun inflection, focusing on broken plurals, with a taxonomy of 160 classes and an encoding scheme applied to 3,200 entries, aiming to improve computational language resources.
This paper presents a data-driven analysis of multi-word expressions (MWEs) based on 16 theoretical criteria, annotated by linguistics experts, finding that no expressions are absolutely idiomatic and that lexical criteria are most influential.
The paper introduces IMLJD, a computational dataset designed for analyzing Indian matrimonial litigation, supporting natural language processing and legal analytics research.
This paper presents a computational approach using large language models and RoBERTa to identify manner and result verbs in sentence context, achieving up to 89.6% accuracy. It aims to provide a scalable measurement tool for developmental language research.
This paper presents a computational framework to test competing maturational theories of syntactic development in children, specifically comparing bottom-up versus inward accounts using statistical grammar induction.
This paper establishes a reproducible multi-architecture baseline for token-level Chinese metaphor identification using the MIPVU framework and the PSU Chinese Metaphor Corpus. It compares encoder models like RoBERTa and MelBERT against the Qwen3.5-9B generative model, releasing code and data to facilitate future research.
Researchers from the University of British Columbia propose an unsupervised graph-based system for organizing arguments from online debates by constructing interaction graphs and applying community detection to reveal diverse viewpoint distributions. The approach requires no training data and aims to help users navigate complex argumentative landscapes and combat filter bubbles.
This paper measures the semantic structure and evolution of conspiracy theories using 169.9M Reddit comments from r/politics (2012-2022), introducing the concept of "semantic objects" bounded by semantic neighborhoods to track how conspiracy theory meanings change over time beyond simple keyword-based approaches.