Tag
This paper analyzes 122 languages to show that dependency length minimization operates differently for functional dependencies (short and invariant) versus lexical dependencies (longer and variable), suggesting that grammar provides local scaffolding for processing.
Svarna is an open-source, web-based corpus workbench for Modern Greek, released under the MIT license, that integrates multiple databases with over 507 million words and provides various linguistic analysis tools.
This paper investigates how transformer language models learn 'impossible' languages with unnatural properties, finding that while grammatical sensitivity degrades gradually, generative production shows pronounced failures, suggesting a linking hypothesis for non-attestation.
A tweet highlights 10 free, open-source software tools developed by universities that outperform or rival expensive paid alternatives, covering reference management, text analysis, network visualization, GIS, statistics, speech analysis, biological networks, data cleaning, research archiving, and note-taking.
This paper presents a corpus-based study showing that the neutral tone in Mandarin Chinese is a lexical tone with its own tonal target, based on phonetic and semantic analyses of Beijing and Taiwan Mandarin spoken corpora using generalized additive models and contextualized embeddings.
This paper presents MorfFlex, a morphological dictionary architecture for languages with rich inflection and derivation, exemplified by MorfFlex CZ for Czech, which contains over 100 million wordforms and supports annotation consistency and NLP tools.
Tom Di Mino, an AI engineer and amateur linguist, claims to have deciphered the ancient Minoan script Linear A, which has eluded experts for over a century. His solution maps Linear A to an extinct Semitic language and is currently under review by linguistics experts at Rutgers and Cambridge.
This paper introduces a structured ontology for untranslatability in machine translation, along with a taxonomy of compensation strategies and a multilingual dataset. Human preference studies show translator quality depends on the strategy used, with a preference for explanatory translations.
This paper proposes using data from Linguistics Olympiads to build a new corpus for linguistics research.
A reflection on the broad implications of transformer architectures beyond LLMs, including potential impacts on linguistics, genetics, and causal modeling, comparing their significance to the Haber-Bosch process.
This paper applies philosophy of science to argue that LLMs offer epistemic value as minimal models for how-possibly explanations in linguistics, but do not yet qualify as how-actually explanations of human language.
This paper applies successor representations from reinforcement learning to natural language, training a neural network to predict the expected distribution of future words. It shows that linguistic categories like parts of speech and lexical subclasses emerge spontaneously without explicit supervision.
This paper presents a data-driven analysis of multi-word expressions (MWEs) that linguistics experts annotated against 16 theoretical criteria, finding that no expression is absolutely idiomatic and that lexical criteria are the most influential.
DiscoExplorer is an open-source web interface for searching and visualizing discourse relation datasets across 16 languages, making DISRPT shared task data publicly accessible.
This paper presents a method for comparing concordances of local grammars to optimize Named Entity Recognition for person names in Portuguese, achieving improved F-measure scores on the HAREM dataset.
This article profiles MIT senior Olivia Honeycutt, highlighting her interdisciplinary research at the intersection of linguistics, computation, and cognition, with a focus on comparing human language processing with large language models.
Researchers use four-state Markov chains to model vowel/consonant patterns in Pushkin’s Evgenij Onegin and its Italian translation, revealing structural asymmetries and narrative-linked phonological cues.
This paper introduces STELA, a linguistics-aware watermarking framework for LLMs that leverages syntactic predictability via POS n-grams to balance text quality and detection robustness. The method enables publicly verifiable watermark detection without requiring access to model logits, demonstrating superior performance across typologically diverse languages (English, Chinese, Korean).