A Pāṇinian Foundation for Indic Language Processing

arXiv cs.CL Papers

Summary

This paper proposes a benchmark suite grounded in Pāṇinian grammar to unify Indic language processing across languages, aiming to improve accuracy, data efficiency, and transferability.

arXiv:2606.24172v1 Announce Type: new


# A Pāṇinian Foundation for Indic Language Processing
Source: [https://arxiv.org/html/2606.24172](https://arxiv.org/html/2606.24172)

One Metagrammar for a Billion Voices: Benchmarks and Architecture

Ritwik Banerjee (rbanerjee@cs.stonybrook.edu, ORCID [0000-0003-0336-0258](https://orcid.org/0000-0003-0336-0258)): Department of Computer Science, Stony Brook University, Stony Brook, NY, USA; AI Innovation Institute, Stony Brook University, Stony Brook, NY, USA

Lav R. Varshney (lav.varshney@stonybrook.edu, ORCID [0000-0003-2798-5308](https://orcid.org/0000-0003-2798-5308)): AI Innovation Institute, Stony Brook University, Stony Brook, NY, USA; Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY, USA

###### Abstract\.

More than a billion people communicate in Indic languages, yet the natural language processing infrastructure serving them remains fragmented and underdeveloped\. The cause is structural: the field organizes its tools and benchmarks around individual languages or small subsets of genealogical language families, building separate analyzers, parsers, and datasets for each language and starting over for the next\. This overlooks a deep regularity\. Through more than two millennia of convergence around Sanskrit, Indic languages came to share a morphosyntactic architecture formalized in Pāṇini’s grammar, the Aṣṭādhyāyī\. This cuts across genealogical lines, uniting languages through a common framework\. We argue that this Pāṇinian framework supplies a unifying computational architecture the field has lacked, and that benchmarks grounded explicitly in it would make Indic language systems more accurate, more data\-efficient, and more transferable, effectively merging many apparently disparate and sparse Indic language resources into a single high\-resource metalanguage bedrock\. We propose a four\-part benchmark suite to render this shared architecture explicit, measurable, and ready to be leveraged for practical applications\. Moreover, we underscore the question it raises for interpretability research: whether neural models trained on these languages come to represent Pāṇini’s categories on their own\.

Keywords: Indic languages, Natural Language Processing, Pāṇinian grammar, Sanskrit, cross-lingual transfer, low-resource languages, multilingual models, morphological analysis, dependency parsing, semantic role labeling, benchmarks and evaluation, computational linguistics

CCS Concepts: Computing methodologies → Machine translation; Speech recognition; Phonology / morphology; Language resources; Discourse, dialogue and pragmatics; Information extraction; Natural language generation; Lexical semantics

## 1. Introduction

Over a billion people across South Asia and beyond communicate daily in Indic languages — Hindi, Bengali, Tamil, Telugu, Marathi, Santali, and dozens more — yet the computational infrastructure serving these speakers/writers remains critically underdeveloped relative to its global significance\. As AI\-powered tools for translation, accessibility, education, and information retrieval become central to modern life, the gap between what NLP can do for English or Chinese and what it can do for Indic languages carries real costs, both human and economic\. Closing that gap is an important engineering challenge facing the computing community today\.

Progress, however, has been slow. The primary reason is not a lack of data or talent, but a structural problem in how the field has organized itself. The longstanding emphasis on genealogical divisions in the taxonomy of Indic languages has led to a fragmentation of computational approaches: separate morphological analyzers, parsers, and annotated datasets for each language family, or even each individual language. The result is redundant engineering effort, acute resource constraints, and benchmark infrastructure that cannot transfer across languages. The scale of this duplication is concrete: the Universal Dependencies (de Marneffe et al., [2021](https://arxiv.org/html/2606.24172#bib.bib17)) collection alone, as of release 2.18 in May 2026 (Zeman et al., [2026](https://arxiv.org/html/2606.24172#bib.bib58)), maintains 24 separately annotated treebanks across 18 Indic languages[^1], each built as an independent effort, and conversions between their annotation schemata are frequently lossy (Ravishankar, [2017](https://arxiv.org/html/2606.24172#bib.bib49); Rai et al., [2025](https://arxiv.org/html/2606.24172#bib.bib50)). Every new language effectively requires building from scratch.

[^1]: [universaldependencies.org](https://universaldependencies.org/)

This is not a denial of the value or the tremendous engineering achievements of massively multilingual models (NLLB Team, [2024](https://arxiv.org/html/2606.24172#bib.bib41)). But their design objective is breadth, not depth: covering hundreds of languages optimizes translation fluency and coverage, not the morphosyntactic and semantic structure that distinguishes Indic languages, and no current benchmark would reveal the difference. Such models do achieve cross-lingual transfer, but the units that carry it across Indic languages (the shared and borrowed vocabulary, the parallel inflectional patterns, and the subwords that encode them) are shared precisely because these languages are organized by a common Pāṇinian architecture[^2], whether through inheritance, sustained contact, or independent convergence. The transfer is therefore already running on Pāṇinian rails; the models merely exploit them implicitly and incompletely, through incidental surface signals rather than the deeper structure reflected by those signals.
Empirical evidence bears the fingerprints of this shared architecture: cross-lingual transfer is stronger within Indic languages than between Indic and non-Indic languages even when script differences are controlled for (Bafna et al., [2023](https://arxiv.org/html/2606.24172#bib.bib3); Nag et al., [2023](https://arxiv.org/html/2606.24172#bib.bib39)), and morphological analyzers built on Pāṇinian principles transfer across language families rather than merely within them (Goyal and Huet, [2016](https://arxiv.org/html/2606.24172#bib.bib19); Hellwig, [2016](https://arxiv.org/html/2606.24172#bib.bib20)). Scale alone, however, quickly reaches a real ceiling. A fixed-capacity model spread across many languages dilutes per-language quality, with the gains gravitating towards high-resource languages (Conneau et al., [2020a](https://arxiv.org/html/2606.24172#bib.bib15)), leaving most Indic languages underserved. Furthermore, generic subword tokenizers fragment morphologically rich Indic words rather than recovering their morphemes (Brahma et al., [2025](https://arxiv.org/html/2606.24172#bib.bib11)).

[^2]: Pāṇini was a Sanskrit grammarian from around the fifth century BCE; his *Aṣṭādhyāyī* (“Eight Chapters”) specifies the language in some 4,000 ordered rules built from a formal metalanguage and explicit rule-precedence mechanisms. The parallel between this rule system and modern formal grammars was drawn early in the history of modern computing: writing in the Communications of the ACM, Ingerman ([1967](https://arxiv.org/html/2606.24172#bib.bib24)) proposed that Backus-Naur form be renamed “Pāṇini-Backus form”. The correspondence is at best cosmetic, however: the Pāṇinian rule formalism is in fact strictly more powerful than the context-free grammars described by BNF (Penn and Kiparsky, [2012](https://arxiv.org/html/2606.24172#bib.bib47)).

Making the shared substrate explicit converts this transfer into an interpretable and designed mechanism, so that the meaning of a lexeme can be transparently tracked as it is inflected and compounded across languages, rather than lost to subword fragmentation determined by ad hoc statistics. Naïve data sharing across Indic languages, with no Pāṇinian grounding, already increases tagging accuracy (Pawar et al., [2023](https://arxiv.org/html/2606.24172#bib.bib46)), suggesting that there is ample room for empirical improvements through explicit grounding. Since the transfer behavior of multilingual models already reveals that they implicitly learn a structural prior commonly held across languages, it is likely that giving Indic languages an interpretable grammar already known to be a common foundation will boost language comprehension across the board.

The fragmented, language\-by\-language approach and the opacity of multilingual models err in opposite directions around the same fact: the structural unity of Indic languages is real — the former ignores it and pays in redundant effort, whereas the latter gains from it without ever recognizing its central role\. That unity is not a modeling artifact but the structural commons observed consistently by native multilingual users, across the boundaries of Indic languages and genealogies: remarkable parallels in morphological patterns such as verb conjugation, case marking, and agglutination; syntactic structures including postpositions and participial relatives; and shared frameworks for expressing agency, causation, aspect, and evidentiality\.

Far from superficial lexical borrowings, these phenomena reflect deep architectural similarities engendered by over two millennia of linguistic convergence in which Sanskrit has functioned as a formal intellectual “metalanguage” across South Asia and beyond. This is not an isolated observation. In other linguistic traditions, Latin shaped the prestige registers and grammatical self-conception of English (Blake, [1996](https://arxiv.org/html/2606.24172#bib.bib10); Lurie, [2023](https://arxiv.org/html/2606.24172#bib.bib36)). In the case of Indic languages, however, the shared architecture runs considerably deeper: Tamil grammarians, working within their own tradition, arrived at structural conclusions that converge with Pāṇinian categories, a finding that is computationally significant precisely because it transcends genealogical boundaries. Dialectal and register variations, too, operate along predictable continua rather than discrete breaks: dialects typically differ in phonological realization or lexical choice while preserving fundamental morphosyntactic architecture. The widespread diglossia[^3] across Indic contexts (Sanskrit-Prakrit, literary and colloquial Tamil, Hindi and Khariboli) follows systematic alternations within shared structural constraints.

[^3]: Diglossia is the coexistence of two varieties of the same language throughout a community. Often, one form is the literary dialect (the “high” register; e.g., Katharevousa, which is heavily influenced by classical Greek, and used in official communications), and the other is a common dialect of everyday usage (the “low” register, e.g., Demotic Greek, which is the standard vernacular).

The computational treatment of Indic languages has long suffered from a denial of these regularities. The result is a landscape in which the structural unity underlying these languages remains invisible to the tools meant to process them. We argue that Pāṇini’s grammatical framework, formalized in the Aṣṭādhyāyī (Vasu, [1897](https://arxiv.org/html/2606.24172#bib.bib56)), provides a unifying computational architecture the field needs; and that building benchmarks explicitly grounded in this framework will unlock a new generation of more capable, resource-efficient, and transferable Indic language processing systems.

## 2. The Pāṇinian Framework as Unifying Architecture

The unifying architecture we propose is not a modern invention\. Pāṇini’s Aṣṭādhyāyī, composed around 500 BCE, is one of the most sophisticated formal grammatical systems in human history — and for over two millennia, it functioned as the shared intellectual operating system of South Asian discourse, regardless of language\. Philosophy, law, science, and aesthetics across the subcontinent were all conducted within its formal categories\. Sanskrit was not merely one language among many; it was the metalanguage that supplied the ontological primitives, syntactic templates, and morphophonological regularities through which thought was organized and communicated across the entire region\.

A computing professional might find the following analogy useful: think of Marathi and Tamil as having arisen from distinct kernels (their genealogical origins differ), but with their higher-level design patterns, discourse semantics, and formal structures built to the specifications of Pāṇinian Sanskrit grammar. Just as software components built to a common interface specification remain interoperable regardless of their underlying implementation, Indic languages built on this shared specification remain structurally compatible at the level that matters most for computation: morphology, syntax, and semantic organization. This compatibility, rather than being a mere theory, is concretely manifested in texts such as Śaiva Siddhānta (சைவ சித்தாந்தம்) that exist simultaneously in Sanskrit and Tamil literary traditions, with the underlying semantic and argumentative architecture intact across both.

The analogy understates a property of Pāṇini’s system that speaks with unusual directness to modern computational practice. Any grammar formalism powerful enough to describe a natural language is, almost inevitably, powerful enough to *overgenerate*, i.e., to admit strings that the language does not. Penn and Kiparsky ([2012](https://arxiv.org/html/2606.24172#bib.bib47)) show that Pāṇini’s formalism is no exception in its raw expressive power. What is exceptional, however, is the discipline imposed upon it. Through rule precedence and a layer of meta-conventions governing how rules compete and apply, the Aṣṭādhyāyī yields a *single* derivation for every grammatical Sanskrit sentence, with disambiguation built into the architecture rather than bolted on afterward.

No generation system in the Chomskyan tradition has this property, and it is precisely this absence, Penn and Kiparsky ([2012](https://arxiv.org/html/2606.24172#bib.bib47)) argue, that obliges modern NLP to reach for its heavy statistical and probabilistic machinery: the numerical methods are, in large part, a means of curbing the overgeneration these formalisms cannot restrain on their own. Seen this way, the marvel of the Aṣṭādhyāyī is not how many correct analyses it produces, but how many incorrect ones it entirely avoids. An architecture grounded in this tradition would inherit determinacy as a core design principle. It does not obviate the statistical components of modern NLP, but provides a reason to expect that explicit Pāṇinian structure carries information that modern systems recover only indirectly, and at a cost.
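The role of rule precedence described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not a reproduction of the Aṣṭādhyāyī's actual mechanisms: the two rules are simplified versions of a well-known general/exception pair in Sanskrit vowel sandhi (final *i* becomes *y* before a vowel, except that *i* + *i* merges into *ī*), and the `Rule` class, the integer `specificity` ranking, the `+` boundary marker, and the `derive` function are all hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

VOWELS = "aāiīuūeo"  # partial vowel inventory, enough for this toy


@dataclass(frozen=True)
class Rule:
    name: str
    specificity: int  # hypothetical stand-in for precedence: higher wins
    applies: Callable[[str], bool]
    rewrite: Callable[[str], str]


# General rule: word-final i becomes y before any vowel across the boundary "+".
GENERAL = Rule(
    name="i -> y before a vowel",
    specificity=1,
    applies=lambda f: re.search(rf"i\+(?=[{VOWELS}])", f) is not None,
    rewrite=lambda f: re.sub(rf"i\+(?=[{VOWELS}])", "y", f, count=1),
)

# Exception: i + i merges into long ī; it blocks the general rule in its domain.
EXCEPTION = Rule(
    name="i + i -> ī",
    specificity=2,
    applies=lambda f: "i+i" in f,
    rewrite=lambda f: f.replace("i+i", "ī", 1),
)


def derive(form: str, rules: List[Rule], max_steps: int = 10) -> List[str]:
    """Apply rules until none applies; exactly one rule fires per step."""
    trace = [form]
    for _ in range(max_steps):
        live = [r for r in rules if r.applies(form)]
        if not live:
            break
        # Conflict resolution: the most specific applicable rule wins,
        # so the derivation never branches (assumes unique specificities).
        winner = max(live, key=lambda r: r.specificity)
        form = winner.rewrite(form)
        trace.append(form)
    return trace


print(derive("dadhi+atra", [GENERAL, EXCEPTION]))
print(derive("kavi+indra", [GENERAL, EXCEPTION]))
```

For `kavi+indra` both rules are applicable, but only the exception fires, so the system produces one derivation rather than two competing analyses. Without the precedence step, a downstream component would have to choose between the outputs, which is the kind of after-the-fact disambiguation the text contrasts with.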

Demonstrated advantages of transfer learning within Indic languages (Bafna et al., [2023](https://arxiv.org/html/2606.24172#bib.bib3); Nag et al., [2023](https://arxiv.org/html/2606.24172#bib.bib39)) and existing morphological analyzers based on Pāṇinian principles (Goyal and Huet, [2016](https://arxiv.org/html/2606.24172#bib.bib19); Hellwig, [2016](https://arxiv.org/html/2606.24172#bib.bib20)) support this picture directly. Models trained on Pāṇinian dependency labels show improved argument detection and semantic role labeling (Pal and Sharma, [2019](https://arxiv.org/html/2606.24172#bib.bib42)). Highly domain-specific tasks, such as translation of technical lexicon, show improvements in the zero-shot setting when models are trained on Sanskrit tokens (N J et al., [2025](https://arxiv.org/html/2606.24172#bib.bib30)).

These are not marginal gains, but strong evidence that the shared architecture is computationally real and exploitable\.

Furthermore, historical records support this picture independent of the computational rewards. Tamil grammarians did not merely borrow Sanskrit categories. Rather, to a great extent, they discovered the same underlying categories independently within Tamil itself: Tolkāppiyam, the oldest extant Tamil grammar text, describes *sandhi* and ideas similar to *vibhakti*; the 11th-century grammatical treatise Vīracōḻiyam incorporated kāraka analysis and presented older traditional components of Tamil grammar through a comparative lens; and Ilakkaṇakkottu argued that Sanskrit grammatical features not mentioned in Tamil grammar nevertheless occur in the Tamil language, and therefore belong in its grammar (Annamalai, [2024](https://arxiv.org/html/2606.24172#bib.bib2)). These are records of recognition, not imposition (Acharya, [2013](https://arxiv.org/html/2606.24172#bib.bib1); Pollock, [2000](https://arxiv.org/html/2606.24172#bib.bib48)): two sophisticated grammatical traditions arriving, both independently and through interaction, at shared structural conclusions. That convergence is precisely what enables computational leveraging of the Pāṇinian framework across language families, not merely within a single genealogical tree (e.g., Karthika et al. ([2025](https://arxiv.org/html/2606.24172#bib.bib29)) achieved improved tokenization by clustering Punjabi together with Dravidian languages); and it is what one would expect if the Pāṇinian framework captures genuine computational primitives rather than being an imposition of cultural prestige alone.

The relationship between genealogical and functional perspectives is worth exploring with precision, since the distinction matters for how we design computational systems. Academic linguistics rightly focuses on genealogical descent to reconstruct historical development, and it does not deny the structural commonalities we describe. But genealogical taxonomy prioritizes inheritance, whereas computational transfer depends on functional and architectural similarity. These are different questions, and they do not always have the same answer. Recent work on cross-lingual transfer in programming languages makes this point sharply: genealogical relationships were not the most predictive features for transfer learning success; structural and corpus-specific features were far more reliable (Baltaji et al., [2025](https://arxiv.org/html/2606.24172#bib.bib4)). The same principle applies here. The Pāṇinian framework captures precisely the functional and architectural commonalities (in morphology, syntax, and semantics) that genealogical classification leaves implicit, and that computational systems need made explicit.

What this means practically is that the shared semantic infrastructure of Pāṇinian grammar enables richer, more comprehensive knowledge representations for modern Indic languages\. The common syntactic templates enable transfer learning of foundational NLP processes — semantic role labeling, tokenization, semantic composition, clause and phrase representation — in ways that are both more efficient and more interpretable, because empirical methods can be grounded in an explicit, well\-understood formal system\. Dialectal and register variation, rather than requiring separate handling, can be treated as parametric variation within this shared framework: dialects differ in phonological realization or lexical choice while preserving the underlying morphosyntactic architecture\. This is the key insight that the fragmented, language\-by\-language approach has been missing; and it points directly toward a new generation of unified, transferable benchmarks for Indic language technologies\.

## 3. Challenges in Computational Indic Language Processing

There are four distinct but interacting facets of Indic languages that challenge the development of effective Indic language technologies\.

### 3.1. Morphological complexity

Indic languages are morphologically rich in ways that standard NLP pipelines, designed primarily for English, are poorly equipped to handle. Words are not atomic units but structured compositions of roots and affixes encoding case, number, tense, mood, and aspect. Two phenomena are particularly demanding. *Sandhi*, the euphonic fusion of sounds at word boundaries, means that segmenting a sentence into meaningful units is itself a non-trivial task requiring linguistic knowledge before any downstream processing can begin. *Samāsa*, the compounding of multiple semantic units into a single surface form, means that a single word may compress what English would express as an entire phrase. These phenomena are pervasive in high-register communication and frequent even in everyday speech. Handling them correctly is a prerequisite for almost every NLP task, yet most current tools treat them as edge cases rather than central architectural concerns.
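Why sandhi makes segmentation depend on linguistic knowledge can be seen in a small sketch. It is illustrative only: the rule table is a deliberately partial subset of Sanskrit vowel sandhi in IAST transliteration, and the six-word lexicon, the `nfc` helper, and `candidate_splits` are hypothetical names, not part of any tool cited here.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so that a letter such as 'ā' is a single code point."""
    return unicodedata.normalize("NFC", text)


# Fused surface vowel -> (left-final, right-initial) pairs that can produce it.
# Partial and simplified: real sandhi has many more rules and contexts.
SANDHI_RULES = {
    nfc("ā"): [("a", "a"), ("a", nfc("ā")), (nfc("ā"), "a"), (nfc("ā"), nfc("ā"))],
    "e": [("a", "i")],
    "o": [("a", "u")],
}

# Hypothetical toy lexicon; a real system would consult a full lexicon.
LEXICON = {nfc(w) for w in ["na", "asti", "ca", "iti", "sūrya", "udaya"]}


def candidate_splits(surface):
    """Return (left, right) word pairs from LEXICON that can fuse into surface."""
    surface = nfc(surface)
    results = []
    for i, ch in enumerate(surface):
        # Hypothesis 1: a plain boundary before position i, no sound change.
        left, right = surface[:i], surface[i:]
        if left in LEXICON and right in LEXICON:
            results.append((left, right))
        # Hypothesis 2: the character at i is a fused vowel; undo each rule.
        for left_v, right_v in SANDHI_RULES.get(ch, []):
            left = surface[:i] + left_v
            right = right_v + surface[i + 1:]
            if left in LEXICON and right in LEXICON:
                results.append((left, right))
    return results


print(candidate_splits("nāsti"))      # na + asti
print(candidate_splits("ceti"))       # ca + iti
print(candidate_splits("sūryodaya"))  # sūrya + udaya
```

Even in this toy, a single fused `ā` has four possible preimages, and only the lexicon rules out three of them. A plain whitespace or subword tokenizer has no access to either the rules or the lexicon, which is why segmentation has to be treated as a linguistic task in its own right.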

### 3.2. Diglossia, register variation, and code-mixing

Across Indic language communities, speakers routinely navigate multiple registers — literary and colloquial Tamil, Sanskrit\-inflected Hindi, Khariboli — and switch between them fluidly depending on context\. This diglossia is not noise, but a structural feature of how these languages function socially\. Code\-mixing, particularly between an Indic language and English, adds a further layer of complexity that is especially pronounced on social media and in urban speech\. Current NLP models, trained predominantly on written standard varieties, struggle with all of these variations\. The deeper problem is that existing tools treat each register or mixed variety as a separate data distribution requiring its own handling, when in fact these variations follow systematic patterns within shared structural constraints\.

The Hindi–Urdu pair makes this concrete. In their everyday spoken forms the two are near-dialects of a single language, yet their formal registers diverge so sharply in vocabulary (Hindi drawing on Sanskrit, Urdu on Persian and Arabic) that the high varieties become mutually unintelligible (King, [2006](https://arxiv.org/html/2606.24172#bib.bib31)). What survives that divergence is the grammar: across the full register range the morphosyntactic scaffolding stays fixed, down to shared function morphemes such as the genitive *-kī*. The variation is large, but it is lexical and graphemic, layered over an invariant Indo-Aryan morphology and syntax. That an entire prestige vocabulary can be swapped out while the structure holds is the sharpest demonstration that what these varieties share is architectural, which is precisely the regularity that a register-by-register modeling approach leaves unexploited.

### 3.3. The nominal semantics mismatch

Perhaps the least appreciated challenge is a fundamental mismatch between the semantic assumptions embedded in standard NLP frameworks and the actual semantic organization of Indic languages. Contemporary NLP, shaped by its development on Western languages, treats nouns as static, discrete entities whose meaning is grounded in extra-linguistic reference. Indic languages, by contrast, encode a process-oriented ontology: nominal meaning is derivationally and conceptually rooted in verbal roots. A noun in Sanskrit, and across Indic languages, is best understood not as a static label but as denoting a role within an implicit event structure. The word *dhātu*, for instance, means both “root” and “metal”, but its Pāṇinian definition translates as “that which supports/holds”, an action-grounded description. Standard benchmark tasks built on English-derived semantic assumptions systematically fail to evaluate this dimension of meaning, leaving a critical gap in how we measure language comprehension for Indic languages.

### 3.4. Dataset scarcity and annotation incompatibility

Even where datasets exist, they are fragmented and mutually incompatible. The annotated resources that do exist were each developed for individual languages or for a single family, and even the attempts at a shared standard were never carried across the Indic family of languages. The Indian Language Corpora Initiative (ILCI) parallel corpus (Jha, [2010](https://arxiv.org/html/2606.24172#bib.bib25)) is POS-annotated using the common BIS tagset for Indian languages (Sankaran et al., [2008](https://arxiv.org/html/2606.24172#bib.bib51)); the most linguistically ambitious effort, the multi-layered Hindi/Urdu treebank built at IIIT-Hyderabad (Bhatt et al., [2009](https://arxiv.org/html/2606.24172#bib.bib8); Palmer et al., [2009](https://arxiv.org/html/2606.24172#bib.bib44)), which covers what is essentially one Indo-Aryan language in two scripts (King, [2006](https://arxiv.org/html/2606.24172#bib.bib31)), pairs a Pāṇinian *kāraka* dependency layer with a PropBank-style[^4] predicate-argument layer adapted from English. The common BIS tagset spans many languages, but only at the surface, while the treebank’s deeper, Pāṇinian-informed annotation never reaches beyond Hindi and Urdu. No resource carries the deep Pāṇinian annotations across the family boundary, and their schemata (a POS tagset, a *kāraka* dependency scheme, an English-derived semantic layer) do not align with one another or with the universal scheme discussed next.

[^4]: The Proposition Bank, or PropBank, is a shallow and broad foundational natural language processing resource created by Palmer et al. ([2005](https://arxiv.org/html/2606.24172#bib.bib43)), that took a practical approach to semantic representation by adding a layer of predicate-argument information, or semantic role labels, to the syntactic structures of the Penn Treebank. Essentially, it provides sentence-level annotations for “who did what to whom”.

Universal Dependencies[^5] corpora exist for several Indic languages (Ravishankar, [2017](https://arxiv.org/html/2606.24172#bib.bib49)), but their universal scheme does not map onto the grammatical categories these languages actually use, making cross-lingual benchmark transfer difficult in practice. Sanskrit NLP, meanwhile, has concentrated on segmentation and lemmatization, with little work connecting these tasks to morpheme-level semantics: projects such as the Sanskrit Sembank (Hellwig and Biagetti, [2025](https://arxiv.org/html/2606.24172#bib.bib21)) assign WordNet synsets to words but do not encode the semantic roles (*kāraka*) that are central to how meaning is organized in these languages. The cumulative result is that virtually no existing benchmark is designed around the structural unity of Indic languages; nearly all follow evaluation frameworks built for English, measuring what is easy to measure rather than what matters most.

[^5]: Universal Dependencies, available at [universaldependencies.org](https://universaldependencies.org/), is a collaborative, open project with more than 600 contributors who have built over 200 treebanks spanning more than 150 languages. It offers a unified scheme for annotating grammar (including lexical categories, morphological attributes, and syntactic relations) across diverse human languages.

## 4. The State of the Art: Promising Signals, Fragmented Progress

The research community has not been idle\. Across morphology, syntax, and semantics, there are genuine advances in computational Indic language processing — and a consistent pattern within them: when Pāṇinian structure is explicitly exploited, performance improves\. The problem is that these advances remain isolated\. No one has connected them into a coherent, unified framework\. The field has the ingredients; what it lacks is the architecture\.

Highly multilingual large language models (MLLMs), such as NLLB (NLLB Team, [2024](https://arxiv.org/html/2606.24172#bib.bib41)), are tremendous engineering achievements that offer rigorous benchmarks within their scope. But their scope is translation quality (fluency, adequacy, toxicity identification, etc.), not morphological, morphosyntactic, or semantic role structures, cross-register performance, morpheme-level semantics, or dialectal robustness across intra-language variations. These models do not resolve the underlying problem of fragmented corpora, and thus, the field’s design limitations are continually inherited, not overcome. The wide coverage of MLLMs is an engineering objective different from deep comprehension. Without benchmarks designed to probe the underlying structural understanding of Indic languages, we cannot know how much these models have achieved, or how much more they may achieve when leveraging the universal substrate of Pāṇinian metagrammar.

### 4.1. Morphological analysis: strong foundations, limited reach

The most mature body of work addresses morphological segmentation. Word segmentation benchmarks built on *sandhi* and *samāsa* exist for Sanskrit (Krishna et al., [2017](https://arxiv.org/html/2606.24172#bib.bib32)), and recent models such as ByT5-Sanskrit (Nehrdich et al., [2024](https://arxiv.org/html/2606.24172#bib.bib40)) achieve state-of-the-art results on segmentation and lemmatization. The MorphTok benchmark (Brahma et al., [2025](https://arxiv.org/html/2606.24172#bib.bib11)) grounds morphological analysis explicitly in Pāṇinian grammar and has demonstrated downstream improvements in practice. Earlier work established similar gains in machine translation (Banerjee and Bhattacharyya, [2018](https://arxiv.org/html/2606.24172#bib.bib5)) and named entity recognition (Pattnayak et al., [2025](https://arxiv.org/html/2606.24172#bib.bib45)). The signal is clear and consistent: Pāṇinian morphological grounding helps. Yet these benchmarks are almost entirely confined to Sanskrit. The extension to modern Indic languages, which share the same underlying morphological architecture, has not been done systematically. A foundation exists, but it has not been built upon.

### 4.2. Morphological tagging: cross-lingual gains left on the table

Annotated corpora for morphological tagging exist for several modern Indic languages, but they trade breadth against depth. The ILCI parallel corpus (Jha, [2010](https://arxiv.org/html/2606.24172#bib.bib25)) spans several Indo-Aryan as well as Dravidian languages (Hindi, Bengali, Marathi, Tamil, Telugu, and others) but stops at the shallow syntactic level of part-of-speech tagging. The multi-layered treebank at IIIT-Hyderabad (Bhatt et al., [2009](https://arxiv.org/html/2606.24172#bib.bib8)) annotates more, including gender, case, and number and a Pāṇinian kāraka dependency layer, but reaches only the near-identical Hindi/Urdu language pair. Pratyaya-Kosh (Singh et al., [2020](https://arxiv.org/html/2606.24172#bib.bib52)) is itself grounded in Pāṇinian *pratyaya* analysis but covers only Sanskrit noun derivation. These are valuable resources; the limitation they share is not that they ignore Pāṇinian structure (a few embrace it) but that none provides unified Pāṇinian morphological annotation across the Indic languages. Breadth comes without depth, and depth without breadth, leaving the structural unity of Indic morphology fragmented across partial, mutually incompatible resources rather than exploited by a common scheme. That even a fraction of this unity is exploitable is already demonstrated by Pawar et al. ([2023](https://arxiv.org/html/2606.24172#bib.bib46)), who trained a single multilingual model for morphosyntactic tagging, spanning both Indo-Aryan and Dravidian languages. That this was shown even with no explicit Pāṇinian grounding at all is both encouraging and sobering: if sharing data alone yields empirical gains, the advantage from a unified Pāṇinian annotation could be substantially larger.

### 4\.3\.Syntactic parsing

A Pāṇinian schema exists, but remains isolated\. The AnnCorra project \(Bharati et al\., [2002](https://arxiv.org/html/2606.24172#bib.bib7)\) developed a dependency annotation schema explicitly grounded in Pāṇinian kāraka relations, one of the most direct implementations of classical grammar in computational form\. The downstream benefits are real: incorporating kāraka over universal dependencies has shown immediate improvements in Indic\-language question\-answering \(QA\) \(Verma et al\., [2023](https://arxiv.org/html/2606.24172#bib.bib57)\)\. Yet AnnCorra remains largely an isolated effort\. Universal dependency corpora exist for some modern Indic languages \(Ravishankar, [2017](https://arxiv.org/html/2606.24172#bib.bib49)\), but their annotation schemata do not map to Sanskrit grammar, limiting cross\-lingual transfer\. An initial effort to build Pāṇinian universal dependencies was made by Tandon et al\. \([2016](https://arxiv.org/html/2606.24172#bib.bib54)\), but has not been extended across multiple languages\. The infrastructure for a unified multilingual Pāṇinian parser is within reach, since the conceptual work has been done, but the execution remains incomplete\.

### 4\.4\.Morpheme semantics: almost entirely uncharted

The deepest gap is at the level of meaning\. The Sanskrit Sembank \(Hellwig and Biagetti, [2025](https://arxiv.org/html/2606.24172#bib.bib21)\) assigns WordNet synsets to Sanskrit words, a valuable resource, but does not encode kāraka roles or root\-level meanings\. State\-of\-the\-art segmentation models like ByT5\-Sanskrit correctly identify lemmas and morphological tags, but make no connection to what those morphemes mean\. There is virtually no work in the NLP literature that explicitly models the semantics of individual roots and affixes in Indic languages, despite the fact that this morpheme\-level semantic transparency is a defining feature of how these languages organize meaning\. This is not a minor gap\. It means that current systems can parse the surface form of an Indic sentence with reasonable accuracy while remaining blind to the semantic architecture that gives that sentence its meaning\. No benchmark currently exists that would even reveal this blindness, let alone measure it\.

### 4\.5\.Cross\-lingual transfer

The signal is strong, but the infrastructure is lacking\. We underscored an important finding in recent literature, where Bafna et al\. \([2023](https://arxiv.org/html/2606.24172#bib.bib3)\) and Nag et al\. \([2023](https://arxiv.org/html/2606.24172#bib.bib39)\) demonstrated that transfer across Indic languages yields markedly stronger performance than transfer from Indic to non\-Indic languages\. This is precisely the pattern one would predict if the Pāṇinian framework captures genuine computational structure shared across these languages\. That latent structure, however, is not what today’s deployed systems actually exploit\. The most recent benchmarks show transfer still riding on surface signal\. For instance, CorIL \(Bhattacharjee et al\., [2025](https://arxiv.org/html/2606.24172#bib.bib9)\), a parallel corpus and translation evaluation spanning both Indo\-Aryan and Dravidian languages, finds a performance hierarchy organized by script\. IndicTrans2, state\-of\-the\-art for Brahmi\-script languages, collapses on Perso\-Arabic Sindhi with a near\-zero BLEU score, whereas the massively multilingual NLLB \(NLLB Team, [2024](https://arxiv.org/html/2606.24172#bib.bib41)\) and BhashaVerse \(Mujadia and Sharma, [2025](https://arxiv.org/html/2606.24172#bib.bib38)\) score several times higher\. The larger models close this gap through sheer breadth of script coverage rather than any structural insight\. The deciding variable is a surface property \(in this case, the writing system\), not the grammar shared by the languages\. Neither outcome reflects structural understanding; both track orthographic and lexical overlap\. The starkest case is the one that should be easiest: Urdu is, grammatically, nearly identical to Hindi \(King, [2006](https://arxiv.org/html/2606.24172#bib.bib31)\), yet a model relying on Indic\-script lexical sharing cannot bridge the script divide\. 
The shared morphosyntactic architecture, the very thing Hindi and Urdu hold in common, is precisely what these systems fail to utilize\. On the other hand, Goyal and Huet \([2016](https://arxiv.org/html/2606.24172#bib.bib19)\) and Hellwig \([2016](https://arxiv.org/html/2606.24172#bib.bib20)\) demonstrated how morphological analysis based on Pāṇinian principles can work across language families\. A handful of multilingual question\-answer datasets exist as well \(Singh et al\., [2025](https://arxiv.org/html/2606.24172#bib.bib53)\)\. But these remain isolated data points rather than a coordinated research program\. The empirical evidence for the Pāṇinian hypothesis is accumulating, yet the benchmark infrastructure needed for its systematic testing, along with deliberate extensions, does not exist\.

There is a clear pattern across every dimension of Indic language processing, be it morphological analysis, syntactic parsing, semantic role labeling, or cross\-lingual transfer\. The same story repeats: isolated advances, consistent positive signals when Pāṇinian structure is exploited, and no unified framework connecting them\. What the field needs is not more isolated experiments but a coordinated benchmark suite that makes the shared Pāṇinian architecture explicit, measurable, and exploitable across all Indic languages\.

The benefits of such a unified approach are measurable\. Chang et al\. \([2024](https://arxiv.org/html/2606.24172#bib.bib13)\) find that adding moderate amounts of multilingual data improves low\-resource language modeling about as much as enlarging the low\-resource corpus itself by up to a third, and that the gain is driven by the syntactic similarity of the added data, with vocabulary overlap mattering only marginally\.[^6] The advantage, thus, lies in the structural core, not the lexical surface, consistent with evidence that large models encode grammatical organization along directions shared across languages \(Brinkmann et al\., [2025](https://arxiv.org/html/2606.24172#bib.bib12)\) and align cross\-lingually without depending on shared vocabulary \(Conneau et al\., [2020b](https://arxiv.org/html/2606.24172#bib.bib16)\)\. This is what makes the Indic case so favorable: Indic languages have converged structurally through long contact, with comparable morphological richness, despite belonging to different genealogical families \(Kakwani et al\., [2020](https://arxiv.org/html/2606.24172#bib.bib28)\), and cross\-lingual transfer tracks exactly this morphological and structural similarity \(Bankula and Bankula, [2025](https://arxiv.org/html/2606.24172#bib.bib6)\)\. For Indian languages, exploiting this relatedness is already established practice: a substantial body of low\-resource machine translation and transliteration leverages the orthographic and lexical substrate these languages share \(Kunchukuttan and Bhattacharyya, [2022](https://arxiv.org/html/2606.24172#bib.bib35)\)\. 
A common Pāṇinian substrate would carry this deeper by making the shared morphosyntactic structure formal and explicit, so the additive advantage to low\-resource languages, as demonstrated by Chang et al\. \([2024](https://arxiv.org/html/2606.24172#bib.bib13)\), can operate across the entire group of Indic languages\. Thereby, separately scarce Indic language resources become, at the level that actually drives language comprehension and multilingual transfer, a single large hub\.

[^6]: This is precisely what separates beneficial multilinguality from the drawbacks of relentless scaling: structurally aligned data adds signal, unrelated data dilutes it\. The latter effect has been called the “curse of multilinguality” \(Conneau et al\., [2020a](https://arxiv.org/html/2606.24172#bib.bib15)\)\.

## 5\.A Benchmark Suite for Indic Language Processing

The gap analysis above reveals a consistent pattern: when Pāṇinian structure is explicitly leveraged, performance improves\. But no coordinated benchmark infrastructure exists to make this leverage systematic and reproducible across languages\. We propose four thematic benchmark clusters that together would constitute a unified Pāṇinian evaluation suite for Indic language processing\. Each cluster is \(a\) directly motivated by empirical results, and \(b\) designed to be multilingual from the ground up rather than extended post hoc and ad hoc to other languages\.

### 5\.1\.Morphological segmentation and tagging

The first cluster addresses the most foundational level of Indic language processing: the decomposition of words into their Pāṇinian roots and affixes, and the interpretation of what those morphemes mean\. It provides the common skeleton to consolidate \(a\) multilingual morphological segmentation and tagging, \(b\) tasks on etymology and morpheme semantics, and \(c\) lexical semantics\.

Existing Sanskrit segmentation benchmarks \(Krishna et al\., [2017](https://arxiv.org/html/2606.24172#bib.bib32); Nehrdich et al\., [2024](https://arxiv.org/html/2606.24172#bib.bib40)\) and the MorphTok benchmark \(Brahma et al\., [2025](https://arxiv.org/html/2606.24172#bib.bib11)\) provide an important starting point, but their coverage is almost entirely confined to Sanskrit\. The first task extends these benchmarks to modern Indic languages, requiring models to split sentences into Pāṇinian roots \(dhātu\) and affixes \(pratyaya and lakāra\) and label their grammatical and semantic functions\. The empirical case for this extension is already made: Pawar et al\. \([2023](https://arxiv.org/html/2606.24172#bib.bib46)\) demonstrated a 7% gain in morphosyntactic tagging accuracy from sharing data across Indo\-Aryan and Dravidian languages, without any explicit Pāṇinian grounding\. Grounding the annotation explicitly in Pāṇinian categories should yield further gains \(cf\. Brahma et al\. \([2025](https://arxiv.org/html/2606.24172#bib.bib11)\)\), and producing the large ground\-truth corpora this requires is demonstrably feasible: finite\-state systems for segmentation and morphological analysis \(Huet, [2005](https://arxiv.org/html/2606.24172#bib.bib22); Krishnan and Kulkarni, [2019](https://arxiv.org/html/2606.24172#bib.bib33)\) and rule\-derivation engines that simulate the Aṣṭādhyāyī’s rules \(Mishra, [2009](https://arxiv.org/html/2606.24172#bib.bib37)\) can generate analyses at scale, and their outputs can be validated and normalized into tagged gold corpora \(Krishnan et al\., [2023](https://arxiv.org/html/2606.24172#bib.bib34)\)\.
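
To make the scoring of this first task concrete, the following is a minimal sketch of morpheme\-level precision, recall, and F1\. The `Morpheme` record, the tag strings, and the toy analysis are illustrative assumptions, not an annotation scheme defined in this paper or in the cited benchmarks\.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str  # morpheme string as segmented by the annotator or the model
    kind: str  # hypothetical tag set: "dhatu", "pratyaya", or "lakara"


def morpheme_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over tagged morphemes.

    gold and pred are parallel lists with one list of Morpheme per word.
    A predicted morpheme counts as correct only if both its form and its
    tag match a gold morpheme of the same word.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same words")
    matched = n_gold = n_pred = 0
    for gold_word, pred_word in zip(gold, pred):
        overlap = Counter(gold_word) & Counter(pred_word)  # multiset intersection
        matched += sum(overlap.values())
        n_gold += len(gold_word)
        n_pred += len(pred_word)
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Toy example (assumed analysis): one word, root recovered, suffix boundary missed.
gold = [[Morpheme("gam", "dhatu"), Morpheme("ti", "pratyaya")]]
pred = [[Morpheme("gam", "dhatu"), Morpheme("ati", "pratyaya")]]
print(morpheme_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Scoring form and tag jointly is one possible design choice; a benchmark could equally report segmentation and tagging separately so that boundary errors and labeling errors are not conflated\.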

The second task goes deeper: given an inflected word in a modern Indic language such as Marathi or Bengali, the task is to identify its Sanskrit dhātu and interpret its core meaning\. This is morpheme\-level semantic grounding, the testbed for the process\-oriented nominal semantics that distinguishes Indic languages from the static noun\-entity model assumed by standard NLP\. Cross\-lingual word\-sense disambiguation and semantic textual similarity tasks can be built on top of this etymological grounding by using resources like the Sanskrit Sembank \(Hellwig and Biagetti, [2025](https://arxiv.org/html/2606.24172#bib.bib21)\)\.

### 5\.2\.Syntactic parsing via Pāṇinian dependencies

This benchmark pursues a single, high\-impact goal: a unified multilingual dependency parser grounded in kāraka relations rather than universal dependencies\. The conceptual groundwork exists: the AnnCorra project \(Bharati et al\., [2002](https://arxiv.org/html/2606.24172#bib.bib7)\) developed a kāraka\-based dependency schema, and Tandon et al\. \([2016](https://arxiv.org/html/2606.24172#bib.bib54)\) made an initial effort toward Pāṇinian universal dependencies\. Neither, however, has been extended systematically across multiple languages or language families\. The benchmark task here is both to build or convert dependency treebanks for multiple Indic languages using Pāṇinian dependency labels, and to evaluate parsers trained on this unified schema against language\-specific baselines\. A single robust multilingual parser, once available, can serve as infrastructure for virtually every downstream task across Indic languages\.
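
The comparison against language\-specific baselines could be scored with the standard unlabeled and labeled attachment scores \(UAS and LAS\)\. The sketch below assumes each token is represented as a \(head index, label\) pair; the label strings are placeholders and do not reproduce the AnnCorra tag inventory\.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for parallel lists of parsed sentences.

    Each sentence is a list of (head_index, label) pairs, one per token,
    with head_index 0 standing for the artificial root.
    """
    total = head_hits = labeled_hits = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("sentences must have the same number of tokens")
        for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                head_hits += 1          # correct head, label ignored (UAS)
                if g_label == p_label:
                    labeled_hits += 1   # correct head and label (LAS)
    if total == 0:
        raise ValueError("no tokens to score")
    return head_hits / total, labeled_hits / total


# Three-token toy sentence; label strings are placeholders, not a real tag set.
gold = [[(3, "karta"), (3, "karma"), (0, "root")]]
pred = [[(3, "karta"), (3, "karta"), (0, "root")]]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=1.00 LAS=0.67
```

Running the same function per language on the unified parser and on each monolingual baseline would give the side\-by\-side comparison the task calls for\.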

### 5\.3\.Semantic role labeling and inference across registers and languages

This cluster addresses meaning at the sentence and discourse level\. It consolidates semantic role comprehension across Indic languages and language families, and tests model performance across their dialectal continua and register variations\. The core benchmark annotates sentence pairs and QA examples across multiple Indic languages \(Hindi, Marathi, Bengali, etc\.\) with kāraka roles as semantic frames, analogous to PropBank but grounded in Pāṇinian categories\. Our proposal is to build directly on the kāraka\-grounded work by Verma et al\. \([2023](https://arxiv.org/html/2606.24172#bib.bib57)\), extending it from QA to entailment and inference\.
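
As a rough illustration of what a kāraka\-annotated QA example might look like, the sketch below stores one frame per predicate and checks an answer against the role a question targets\. The `KarakaFrame` class, the role names used as keys, and the exact\-match criterion are assumptions made for illustration, not a format proposed in this paper or by Verma et al\.

```python
from dataclasses import dataclass, field


@dataclass
class KarakaFrame:
    predicate: str                             # the verb that heads the frame
    roles: dict = field(default_factory=dict)  # role name -> filler text


def answer_matches_role(frame, asked_role, answer):
    """True if the answer equals the filler of the role the question targets."""
    filler = frame.roles.get(asked_role)
    return filler is not None and filler.strip().lower() == answer.strip().lower()


# Romanized Hindi toy sentence "Ram ne phal khaya" ("Ram ate the fruit").
frame = KarakaFrame(predicate="khaya", roles={"karta": "Ram", "karma": "phal"})
print(answer_matches_role(frame, "karma", "phal"))  # True
print(answer_matches_role(frame, "karma", "Ram"))   # False
```

Exact string match is the simplest possible criterion; a real benchmark would likely need span\-level or normalized matching to handle inflection and script variation\.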

A distinct feature of these benchmarks is their explicit engagement with registers, as most Indic languages use multiple registers in both speech and text\.[^7] Language models must capture the same underlying semantic content across these surface variations\. Benchmark tasks would require models to, for example, answer questions about a narrative given in colloquial Tamil that is derived from a Sanskrit original, or to identify entailment relations across literary and spoken register pairs\. These tasks directly test whether models have learned the underlying semantic architecture or merely memorized surface patterns\.

[^7]: The coexistence of formal and colloquial registers within a single language community \(i\.e\., diglossia, cf\. footnote [3](https://arxiv.org/html/2606.24172#footnote3)\) is a defining feature of modern Indic language use\. Bengali, for instance, distinguishes sādhu bhāṣā, a highly Sanskritized written style, from cālitā bhāṣā, the colloquial form used in most modern communication, except possibly in official documents\. The gap between these registers is wider than, say, the difference between formal and informal English; it is closer to the difference between Latin and Italian\. This pattern recurs across Indic languages: Tamil maintains a sharp divide between centamil \(literary\) and koṭuntamil \(spoken\); Hindi distinguishes a Sanskritized formal register from the colloquial Khaṛiboli\. The phenomenon is not unique to India: Arabic’s fuṣḥā versus ’āmmiyya, Japanese keigo \(敬語\) versus plain speech, and Chinese classical/literary wényánwén versus the vernacular báihuà are structural parallels\. In the Indic context, however, register variation is especially significant for NLP because the formal register draws systematically on Sanskrit morphology and vocabulary, while the colloquial register reflects centuries of phonological and grammatical simplification\.

#### Dialectal robustness and code\-mixing generalization

India exhibits an extremely rich set of local dialectal variations\. Tamil, for instance, exhibits significant variation across Chennai, Madurai, and Sri Lanka, and Bengali forms a comparable continuum \(from the southwestern districts of West Bengal through the northern belt around Cooch Behar to the eastern varieties of Tripura and Sylhet\) in phonology as well as morphology \(Chatterji, [1926](https://arxiv.org/html/2606.24172#bib.bib14)\)\. These distinctions, much like the sharp register variations, share core morphosyntactic templates\. Key benchmark tasks ought to include cross\-dialectal semantic role labeling, where training and test sets use different dialects or registers; register adaptation, requiring models to translate between literary and colloquial forms while preserving semantic content; and code\-mixing generalization across sociolinguistic contexts ranging from formal written text to informal spoken interaction\. These tasks measure something that current benchmarks almost never measure: whether a model has learned structural invariants that generalize across surface variation, or whether it has simply memorized the patterns of a particular variety, a distinction of enormous importance for real\-world deployment across India’s linguistic diversity\.
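
The cross\-dialectal setup, where training and test sets use different dialects, can be sketched as a held\-out\-dialect split\. The `dialect` field, the dialect names, and the `generalization_gap` helper are hypothetical; the sketch only illustrates the evaluation protocol, not any existing dataset\.

```python
def cross_dialect_split(examples, held_out):
    """Train on every dialect except held_out; test only on held_out."""
    train = [ex for ex in examples if ex["dialect"] != held_out]
    test = [ex for ex in examples if ex["dialect"] == held_out]
    if not train or not test:
        raise ValueError("both splits must be non-empty")
    return train, test


def generalization_gap(in_dialect_score, cross_dialect_score):
    """Drop in score when moving from seen to unseen dialects."""
    return in_dialect_score - cross_dialect_score


# Hypothetical examples tagged with the Tamil varieties mentioned above.
examples = [
    {"id": 1, "dialect": "chennai"},
    {"id": 2, "dialect": "madurai"},
    {"id": 3, "dialect": "sri_lanka"},
    {"id": 4, "dialect": "chennai"},
]
train, test = cross_dialect_split(examples, held_out="sri_lanka")
print(len(train), len(test))  # 3 1
```

Rotating the held\-out dialect and reporting the gap for each rotation would separate structural generalization from memorization of one variety\.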

### 5\.4\.The extended Indic “sprachbund” and cross\-lingual information disorder

This final cluster pushes the unified framework in two directions simultaneously: outward to the broader Indic linguistic sphere, and toward an urgent social concern\. These benchmarks address what Emeneau \([1956](https://arxiv.org/html/2606.24172#bib.bib18)\) identified as the Indic sprachbund[^8] and what one might think of, following the “World Englishes” framework of Kachru \([1992a](https://arxiv.org/html/2606.24172#bib.bib26), [b](https://arxiv.org/html/2606.24172#bib.bib27)\), as World Indic Languages\. This extension is striated, however, as one moves outward, and the benchmarks must respect that gradient\. At the first level are languages that remain Indo\-Aryan and carry the full morphosyntactic substrate, however dispersed or divergent: diaspora varieties such as Caribbean Hindustani and Fiji Hindi, which preserve features lost in the modern standard; Romani, which retains Indo\-Aryan morphosyntax despite centuries in Europe; and Sinhala and Dhivehi, the southernmost Indo\-Aryan languages, which share the Pāṇinian substrate along distinct developmental paths\. Here, the benchmark question is whether models trained on standard varieties generalize to these structurally Indic but divergent forms; and, conversely, whether the conservative among them, precisely because they preserve older features, can inform models about earlier stages of the shared Pāṇinian architecture\. Then, there are languages connected not by structure but by civilizational influence\. Tibetan, Burmese, and Thai belong to different families altogether \(Sino\-Tibetan and Kra\-Dai\), yet they were written in Brahmi\-derived scripts, inherited the Sanskrit phonological taxonomy that orders those scripts, and absorbed extensive Sanskrit vocabulary\. 
Herein lies a harder open question for the benchmark: how far can shared orthography, phonetic taxonomy,[^9] and lexicon, in the absence of shared morphosyntactic substrate, support transfer at all? Together, these benchmarks establish how far the structural unity genuinely extends, and where it gives way to contact alone\.

[^8]: A “sprachbund” is an area of linguistic convergence, corresponding to a group of languages with similarities in syntax, morphological structure, cultural vocabulary, and sound systems \(Thomason, [2000](https://arxiv.org/html/2606.24172#bib.bib55)\)\.

[^9]: The Pāṇinian/Śikṣā phonetic taxonomy is embedded in writing systems from Tibet to the farthest reaches of South\-East Asia, including the Philippines\.

The applied direction addresses information disorder\. Misinformation in India spreads rapidly across linguistic boundaries, mutating as it moves between language communities — a Hindi claim reappears in Punjabi with a subtle distortion, then in Bhojpuri or Bengali with another, each step compounding the original narrative while remaining semantically traceable to it\. Detecting and tracking this propagation requires models that understand the shared semantic substrate across languages, which is precisely what the Pāṇinian framework provides\. Benchmark datasets for cross\-lingual information disorder would require models to identify semantically equivalent claims expressed across multiple Indic languages and registers, track the variations introduced as content crosses linguistic boundaries, and flag the inconsistencies that signal deliberate manipulation\. For a computing community that has watched misinformation destabilize societies worldwide, building the infrastructure to detect it across one of the world’s most linguistically complex regions is not just an academic exercise, but an engineering priority\.

## 6\.Conclusion

Almost no existing benchmarks for Indic languages are designed around the structural foundation that actually unifies them\. Most follow evaluation frameworks built for English, measuring what is convenient rather than what is linguistically meaningful\. This article has argued that Pāṇini’s grammatical framework, formalized over two millennia ago and active as an intellectual infrastructure across South and South\-East Asia ever since, provides precisely the unifying computational architecture the field has been missing\. The architectural unity it captures is not a matter of genealogical inheritance alone: it extends even to Austroasiatic \(e\.g\., Munda, whose speakers span India, Bangladesh, and Nepal\) and Tibeto\-Burman languages, where Sanskrit shaped not just the lexicon but the very phonological taxonomy that orders their scripts, and supplied the philosophical and technical vocabulary of their high registers\. This unity demonstrates that over two millennia of cultural\-linguistic synthesis have created structural commonalities that run deeper than genealogical trees, and offer a framework for rapid development of practical AI\-driven tools\.

The practical payoff is substantial\. A multilingual parser trained jointly on Sanskrit, Hindi, and Marathi and outputting Pāṇinian cases would be more accurate, more data\-efficient, and more transferable than three separate parsers trained in isolation\. A multilingual question\-answering benchmark where answers must be validated against kāraka semantic roles would reveal capabilities and failure modes that remain invisible to current English\-derived benchmarks\. The four benchmark clusters we have proposed would provide the evaluation infrastructure for a new generation of tokenizers, morphological analyzers, dependency parsers, and semantically grounded language models, all built on a foundation that reflects how these languages actually work\.

There is also a deeper scientific question at stake, one that should interest the computing community beyond its immediate engineering implications\. The Pāṇinian framework’s explicit formal structure \(kāraka roles, dhātu roots, morphological composition rules\) provides natural targets for mechanistic interpretability research\. We can directly probe whether neural models trained on Indic languages learn internal representations that align with Pāṇinian categories, or whether they discover alternative organizational principles\. This raises a question analogous to the Platonic representation hypothesis in vision models \(Huh et al\., [2024](https://arxiv.org/html/2606.24172#bib.bib23)\):

- Do neural language models trained on Indic languages spontaneously learn internal representations that correspond to Pāṇinian categories, even without explicit supervision?

If the answer is yes, it would mean that Pāṇini’s framework does not merely offer a convenient analytical lens but captures genuine computational primitives of Indic linguistic structure, and that it is the natural and canonical way in which morphosyntactic information is organized in \(and across\) these languages\. A framework conceived as a formal grammar would turn out to be a discovery about the deep structure of an entire sprachbund\.
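
One simple way to begin testing this question is a probing classifier over a model’s hidden states\. The sketch below uses a nearest\-centroid probe, chosen here only because it needs no external libraries; it is not a method proposed in this paper\. The two\-dimensional vectors stand in for hidden states and are made up, as are the role labels attached to them\.

```python
def fit_centroids(vectors, labels):
    """Average the vectors of each label to get one centroid per category."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, vec):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(label):
        return sum((c - x) ** 2 for c, x in zip(centroids[label], vec))
    return min(centroids, key=sq_dist)


def probe_accuracy(centroids, vectors, labels):
    """Fraction of held-out vectors whose predicted label matches the gold label."""
    hits = sum(predict(centroids, v) == y for v, y in zip(vectors, labels))
    return hits / len(labels)


# Made-up stand-ins for hidden states of tokens annotated with two roles.
train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["karta", "karta", "karma", "karma"]
test_vectors = [[0.8, 0.2], [0.2, 0.8]]
test_labels = ["karta", "karma"]

centroids = fit_centroids(train_vectors, train_labels)
print(probe_accuracy(centroids, test_vectors, test_labels))  # 1.0
```

High probe accuracy alone would not settle the question; comparing against the same probe trained on shuffled labels is one common sanity check that the structure lies in the representations rather than in the probe\.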

This prospect also reframes a methodological default the field rarely questions: if models already gravitate toward Pāṇinian structure on their own, then leaning on statistical learning alone is less a first principle than a costly way of recovering from data the formal disambiguating structure that an explicit grammatical algebra supplies by design\. The most capable Indic language systems, then, are likely to come not from statistics alone, but from statistical models disciplined by such structure\.

Building the benchmarks to answer this question is, in itself, a contribution to our understanding of how both artificial and human minds process language\. It is an invitation to the computing community to deeply engage with a rich and consequential linguistic tradition\.

## References

- A\. Acharya \(2013\)Civilizations in Embrace: The Spread of Ideas and the Transformation of Power : India and Southeast Asia in the Classical Age\.Book collections on Project MUSE,Institute of Southeast Asian Studies,Singapore\.External Links:ISBN 9789814379731,LCCN 2012330954Cited by:[§2](https://arxiv.org/html/2606.24172#S2.p7.1)\.
- E\. Annamalai \(2024\)The Sanskrit Paradigm of Tamil Grammar: Embrace and Resistance\.Bhasha3\(1\),pp\. 1–16\.External Links:[Document](https://dx.doi.org/10.30687/bhasha/2785-5953/2024/01/002)Cited by:[§2](https://arxiv.org/html/2606.24172#S2.p7.1)\.
- N\. Bafna, C\. España\-Bonet, J\. Van Genabith, B\. Sagot, and R\. Bawden \(2023\)Cross\-lingual Strategies for Low\-resource Language Modeling: A Study on Five Indic Dialects\.InActes de CORIA\-TALN 2023\. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles \(TALN\), volume 1 : travaux de recherche originaux – articles longs,C\. Servan and A\. Vilnat \(Eds\.\),Paris, France,pp\. 28–42\.External Links:[Link](https://aclanthology.org/2023.jeptalnrecital-long.3)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p3.1),[§2](https://arxiv.org/html/2606.24172#S2.p5.1),[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p1.1)\.
- R\. Baltaji, S\. Pujar, M\. Hirzel, L\. Mandel, L\. Buratti, and L\. R\. Varshney \(2025\)Cross\-lingual Transfer in Programming Languages: An Extensive Empirical Study\.Transactions on Machine Learning Research2025\(June\)\.External Links:[Link](https://openreview.net/forum?id=1PRBHKgQVM)Cited by:[§2](https://arxiv.org/html/2606.24172#S2.p8.1)\.
- T\. Banerjee and P\. Bhattacharyya \(2018\)Meaningless yet meaningful: Morphology grounded subword\-level NMT\.InProceedings of the Second Workshop on Subword/Character LEvel Models,M\. Faruqui, H\. Schütze, I\. Trancoso, Y\. Tsvetkov, and Y\. Yaghoobzadeh \(Eds\.\),New Orleans,pp\. 55–60\.External Links:[Document](https://dx.doi.org/10.18653/v1/W18-1207)Cited by:[§4\.1](https://arxiv.org/html/2606.24172#S4.SS1.p1.1)\.
- A\. Bankula and P\. Bankula \(2025\)Cross\-linguistic transfer in multilingual nlp: the role of language families and morphology\.External Links:2505\.13908Cited by:[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p3.1)\.
- A\. Bharati, R\. Sangal, V\. Chaitanya, A\. Kulkarni, D\. M\. Sharma, and K\.V\. Ramakrishnamacharyulu \(2002\)AnnCorra: Building Tree\-banks in Indian Languages\.InCOLING\-02: The 3rd Workshop on Asian Language Resources and International Standardization,Taipei, Taiwan\.External Links:[Link](https://aclanthology.org/W02-1202)Cited by:[§4\.3](https://arxiv.org/html/2606.24172#S4.SS3.p1.1),[§5\.2](https://arxiv.org/html/2606.24172#S5.SS2.p1.1)\.
- R\. Bhatt, B\. Narasimhan, M\. Palmer, O\. Rambow, D\. Sharma, and F\. Xia \(2009\)A Multi\-Representational and Multi\-Layered Treebank for Hindi/Urdu\.InProceedings of the Third Linguistic Annotation Workshop \(LAW III\),M\. Stede, C\. Huang, N\. Ide, and A\. Meyers \(Eds\.\),Suntec, Singapore,pp\. 186–189\.External Links:[Link](https://aclanthology.org/W09-3036)Cited by:[§3\.4](https://arxiv.org/html/2606.24172#S3.SS4.p1.1),[§4\.2](https://arxiv.org/html/2606.24172#S4.SS2.p1.1)\.
- S\. Bhattacharjee, M\. K\. Roy, Y\. Poojary, B\. Dave, M\. Raj, V\. Mujadia, B\. Gain, P\. Mishra, A\. Ahsan, P\. Krishnamurthy, A\. Rao, G\. S\. Josan, P\. Dubey, A\. A\. Kak, A\. R\. Kulkarni, N\. V\. G\., S\. Arora, R\. Balbantray, P\. Majumdar, K\. K\. Arora, A\. Ekbal, and D\. M\. Sharma \(2025\)CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems\.External Links:2509\.19941Cited by:[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p1.1)\.
- N\. Blake \(1996\)A History of the English Language\.New York University Press,New York\.External Links:ISBN 0\-8147\-1292\-4Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p6.1)\.
- M\. Brahma, N\. J\. Karthika, A\. Singh, D\. Adiga, S\. Bhate, G\. Ramakrishnan, R\. Saluja, and M\. S\. Desarkar \(2025\)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p3.1),[§4\.1](https://arxiv.org/html/2606.24172#S4.SS1.p1.1),[§5\.1](https://arxiv.org/html/2606.24172#S5.SS1.p2.1)\.
- J\. Brinkmann, C\. Wendler, C\. Bartelt, and A\. Mueller \(2025\)Large language models share representations of latent grammatical concepts across typologically diverse languages\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 6131–6150\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.312)Cited by:[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p3.1)\.
- T\. A\. Chang, C\. Arnett, Z\. Tu, and B\. K\. Bergen \(2024\)When is multilinguality a curse? language modeling for 250 high\- and low\-resource languages\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 4074–4096\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.236)Cited by:[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p3.1)\.
- S\. K\. Chatterji \(1926\)The Origin and Development of the Bengali Language\.Calcutta University Press,Calcutta, India\.Cited by:[§5\.3](https://arxiv.org/html/2606.24172#S5.SS3.SSSx1.p1.1)\.
- A\. Conneau, K\. Khandelwal, N\. Goyal, V\. Chaudhary, G\. Wenzek, F\. Guzmán, E\. Grave, M\. Ott, L\. Zettlemoyer, and V\. Stoyanov \(2020a\)Unsupervised cross\-lingual representation learning at scale\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 8440–8451\.External Links:[Link](https://aclanthology.org/2020.acl-main.747/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.747)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p3.1),[footnote 6](https://arxiv.org/html/2606.24172#footnote6)\.
- A\. Conneau, S\. Wu, H\. Li, L\. Zettlemoyer, and V\. Stoyanov \(2020b\)Emerging cross\-lingual structure in pretrained language models\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 6022–6034\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.536)Cited by:[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p3.1)\.
- M\. de Marneffe, C\. D\. Manning, J\. Nivre, and D\. Zeman \(2021\)Universal dependencies\.Computational Linguistics47\(2\),pp\. 255–308\.External Links:[Document](https://dx.doi.org/10.1162/coli%5Fa%5F00402)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p2.1)\.
- M\. B\. Emeneau \(1956\)India as a Linguistic Area\.Language32\(1\),pp\. 3–16\.External Links:[Document](https://dx.doi.org/10.2307/410649)Cited by:[§5\.4](https://arxiv.org/html/2606.24172#S5.SS4.p1.1)\.
- P\. Goyal and G\. Huet \(2016\)Design and analysis of a lean interface for Sanskrit corpus annotation\.Journal of Language Modelling4\(2\),pp\. 145–182\.External Links:[Document](https://dx.doi.org/10.15398/jlm.v4i2.108)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p3.1),[§2](https://arxiv.org/html/2606.24172#S2.p5.1),[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p1.1)\.
- O\. Hellwig and E\. Biagetti \(2025\)The Sanskrit Sembank\.Language Resources and Evaluation59,pp\. 3635–3658\.External Links:[Document](https://dx.doi.org/10.1007/s10579-025-09852-1)Cited by:[§3\.4](https://arxiv.org/html/2606.24172#S3.SS4.p2.1),[§4\.4](https://arxiv.org/html/2606.24172#S4.SS4.p1.1),[§5\.1](https://arxiv.org/html/2606.24172#S5.SS1.p3.1)\.
- O\. Hellwig \(2016\)Improving the Morphological Analysis of Classical Sanskrit\.InProceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing \(WSSANLP2016\),D\. Wu and P\. Bhattacharyya \(Eds\.\),Osaka, Japan,pp\. 142–151\.External Links:[Link](https://aclanthology.org/W16-3715/)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p3.1),[§2](https://arxiv.org/html/2606.24172#S2.p5.1),[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p1.1)\.
- G\. Huet \(2005\)A Functional Toolkit for Morphological and Phonological Processing, Application to a Sanskrit Tagger\.Journal of Functional Programming15\(4\),pp\. 573–614\.External Links:[Document](https://dx.doi.org/10.1017/S0956796804005416)Cited by:[§5\.1](https://arxiv.org/html/2606.24172#S5.SS1.p2.1)\.
- M\. Huh, B\. Cheung, T\. Wang, and P\. Isola \(2024\)Position: The Platonic Representation Hypothesis\.InProceedings of the 41st International Conference on Machine Learning,R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,Vienna, Austria,pp\. 20617–20642\.External Links:[Link](https://proceedings.mlr.press/v235/huh24a.html)Cited by:[§6](https://arxiv.org/html/2606.24172#S6.p3.1)\.
- P\. Z\. Ingerman \(1967\)“Pānini\-backus form” suggested\.Communications of the ACM10\(3\),pp\. 137\.External Links:ISSN 0001\-0782,[Document](https://dx.doi.org/10.1145/363162.363165)Cited by:[footnote 2](https://arxiv.org/html/2606.24172#footnote2)\.
- G\. N\. Jha \(2010\)The TDIL Program and the Indian Langauge Corpora Intitiative \(ILCI\)\.InProceedings of the Seventh International Conference on Language Resources and Evaluation \(LREC’10\),N\. Calzolari, K\. Choukri, B\. Maegaard, J\. Mariani, J\. Odijk, S\. Piperidis, M\. Rosner, and D\. Tapias \(Eds\.\),Valletta, Malta,pp\. 982–985\.External Links:[Link](https://aclanthology.org/L10-1602)Cited by:[§3\.4](https://arxiv.org/html/2606.24172#S3.SS4.p1.1),[§4\.2](https://arxiv.org/html/2606.24172#S4.SS2.p1.1)\.
- B\. B\. Kachru \(1992a\)The other tongue: English across cultures\.2 edition,University of Illinois Press,Urbana, Illinois\.External Links:ISBN 978\-0252062001Cited by:[§5\.4](https://arxiv.org/html/2606.24172#S5.SS4.p1.1)\.
- B\. B\. Kachru \(1992b\)World Englishes: approaches, issues and resources\.Language Teaching25\(1\),pp\. 1–14\.External Links:[Document](https://dx.doi.org/10.1017/S0261444800006583)Cited by:[§5\.4](https://arxiv.org/html/2606.24172#S5.SS4.p1.1)\.
- D\. Kakwani, A\. Kunchukuttan, S\. Golla, G\. N\.C\., A\. Bhattacharyya, M\. M\. Khapra, and P\. Kumar \(2020\)IndicNLPSuite: monolingual corpora, evaluation benchmarks and pre\-trained multilingual language models for Indian languages\.InFindings of the Association for Computational Linguistics: EMNLP 2020,T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 4948–4961\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.445)Cited by:[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p3.1)\.
- N\. J\. Karthika, M\. Brahma, R\. Saluja, G\. Ramakrishnan, and M\. S\. Desarkar \(2025\)Cited by:[§2](https://arxiv.org/html/2606.24172#S2.p7.1)\.
- R\. D\. King \(2006\)The Poisonous Potency of Script: Hindi and Urdu\.International Journal of the Sociology of Language2001\(150\),pp\. 43–59\.External Links:[Document](https://dx.doi.org/10.1515/ijsl.2001.035)Cited by:[§3\.2](https://arxiv.org/html/2606.24172#S3.SS2.p2.1),[§3\.4](https://arxiv.org/html/2606.24172#S3.SS4.p1.1),[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p1.1)\.
- A\. Krishna, P\. K\. Satuluri, and P\. Goyal \(2017\)A Dataset for Sanskrit Word Segmentation\.InProceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature,B\. Alex, S\. Degaetano\-Ortlieb, A\. Feldman, A\. Kazantseva, N\. Reiter, and S\. Szpakowicz \(Eds\.\),Vancouver, Canada,pp\. 105–114\.External Links:[Document](https://dx.doi.org/10.18653/v1/W17-2214)Cited by:[§4\.1](https://arxiv.org/html/2606.24172#S4.SS1.p1.1),[§5\.1](https://arxiv.org/html/2606.24172#S5.SS1.p2.1)\.
- S\. Krishnan, A\. Kulkarni, and G\. Huet \(2023\)Validation and Normalization of DCS corpus and Development of the Sanskrit Heritage Engine’s Segmenter\.InProceedings of the Computational Sanskrit & Digital Humanities: Selected papers presented at the 18th World Sanskrit Conference,A\. Kulkarni and O\. Hellwig \(Eds\.\),Canberra, Australia \(Online mode\),pp\. 38–58\.External Links:[Link](https://aclanthology.org/2023.wsc-csdh.3/)Cited by:[§5\.1](https://arxiv.org/html/2606.24172#S5.SS1.p2.1)\.
- S\. Krishnan and A\. Kulkarni \(2019\)Sanskrit Segmentation revisited\.InProceedings of the 16th International Conference on Natural Language Processing,D\. M\. Sharma and P\. Bhattacharya \(Eds\.\),International Institute of Information Technology, Hyderabad, India,pp\. 105–114\.External Links:[Link](https://aclanthology.org/2019.icon-1.12/)Cited by:[§5\.1](https://arxiv.org/html/2606.24172#S5.SS1.p2.1)\.
- A\. Kunchukuttan and P\. Bhattacharyya \(2022\)Machine Translation and Transliteration involving Related and Low\-Resource Languages\.CRC Press,Boca Raton, USA and Abingdon, UK\.External Links:ISBN 978\-0\-367\-56200\-7Cited by:[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p3.1)\.
- D\. B\. Lurie \(2023\)The Vernacular in the World of Wen: Sheldon Pollock’s Model in East Asia?\.ChapterInCosmopolitan and Vernacular in the World of Wen 文,R\. King \(Ed\.\),Language, Writing and Literary Culture in the Sinographic Cosmopolis, Vol\.5,pp\. 49–68\.External Links:ISBN 9789004529441,[Document](https://dx.doi.org/10.1163/9789004529441%5F003)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p6.1)\.
- A\. Mishra \(2009\)Simulating the Pāṇinian System of Sanskrit Grammar\.InSanskrit Computational Linguistics,G\. Huet, A\. Kulkarni, and P\. Scharf \(Eds\.\),Lecture Notes in Computer Science, Vol\.5402,pp\. 127–138\.External Links:[Document](https://dx.doi.org/10.1007/978-3-642-00155-0%5F4)Cited by:[§5\.1](https://arxiv.org/html/2606.24172#S5.SS1.p2.1)\.
- V\. Mujadia and D\. M\. Sharma \(2025\)Cited by:[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p1.1)\.
- K\. N J, K\. Bhatt, G\. Ramakrishnan, and P\. Jyothi \(2025\)LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 258–265\.External Links:[Link](https://aclanthology.org/2025.bea-1.20/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.20),ISBN 979\-8\-89176\-270\-1Cited by:[§2](https://arxiv.org/html/2606.24172#S2.p5.1)\.
- A\. Nag, B\. Samanta, A\. Mukherjee, N\. Ganguly, and S\. Chakrabarti \(2023\)Transfer Learning for Low\-Resource Multilingual Relation Classification\.ACM Trans\. Asian Low\-Resour\. Lang\. Inf\. Process\.22\(2\),pp\. 1–24\.External Links:[Document](https://dx.doi.org/10.1145/3554734)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p3.1),[§2](https://arxiv.org/html/2606.24172#S2.p5.1),[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p1.1)\.
- S\. Nehrdich, O\. Hellwig, and K\. Keutzer \(2024\)One Model is All You Need: ByT5\-Sanskrit, a Unified Model for Sanskrit NLP Tasks\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 13742–13751\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.805)Cited by:[§4\.1](https://arxiv.org/html/2606.24172#S4.SS1.p1.1),[§5\.1](https://arxiv.org/html/2606.24172#S5.SS1.p2.1)\.
- NLLB Team \(2024\)Scaling neural machine translation to 200 languages\.Nature630,pp\. 841–846\.External Links:[Document](https://dx.doi.org/10.1038/s41586-024-07335-x)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p3.1),[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p1.1),[§4](https://arxiv.org/html/2606.24172#S4.p2.1)\.
- R\. Pal and D\. Sharma \(2019\)Towards Automated Semantic Role Labelling of Hindi\-English Code\-Mixed Tweets\.InProceedings of the 5th Workshop on Noisy User\-generated Text \(W\-NUT 2019\),W\. Xu, A\. Ritter, T\. Baldwin, and A\. Rahimi \(Eds\.\),Hong Kong, China,pp\. 291–296\.External Links:[Document](https://dx.doi.org/10.18653/v1/D19-5538)Cited by:[§2](https://arxiv.org/html/2606.24172#S2.p5.1)\.
- M\. Palmer, R\. Bhatt, B\. Narasimhan, O\. Rambow, D\. M\. Sharma, and F\. Xia \(2009\)Hindi Syntax: Annotating Dependency, Lexical Predicate\-Argument Structure, and Phrase Structure\.InProceedings of the 7th International Conference on Natural Language Processing,ICON,Hyderabad, India,pp\. 259–268\.Cited by:[§3\.4](https://arxiv.org/html/2606.24172#S3.SS4.p1.1)\.
- M\. Palmer, P\. Kingsbury, and D\. Gildea \(2005\)The Proposition Bank: An Annotated Corpus of Semantic Roles\.Computational Linguistics31\(1\),pp\. 71–106\.External Links:[Document](https://dx.doi.org/10.1162/0891201053630264)Cited by:[footnote 4](https://arxiv.org/html/2606.24172#footnote4)\.
- P\. Pattnayak, H\. Patel, and A\. Agarwal \(2025\)Tokenization Matters: Improving Zero\-Shot NER for Indic Languages\.In2025 IEEE International Conference on Electro Information Technology \(eIT\),Valparaiso, Indiana, USA,pp\. 456–462\.External Links:[Document](https://dx.doi.org/10.1109/eIT64391.2025.11103625)Cited by:[§4\.1](https://arxiv.org/html/2606.24172#S4.SS1.p1.1)\.
- S\. Pawar, P\. Bhattacharyya, and P\. Talukdar \(2023\)Evaluating Cross Lingual Transfer for Morphological Analysis: a Case Study of Indian Languages\.InProceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology,G\. Nicolai, E\. Chodroff, F\. Mailhot, and Ç\. Çöltekin \(Eds\.\),Toronto, Canada,pp\. 14–26\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.sigmorphon-1.3)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p4.1),[§4\.2](https://arxiv.org/html/2606.24172#S4.SS2.p1.1),[§5\.1](https://arxiv.org/html/2606.24172#S5.SS1.p2.1)\.
- G\. Penn and P\. Kiparsky \(2012\)On Pāṇini and the Generative Capacity of Contextualized Replacement Systems\.InProceedings of COLING 2012: Posters,M\. Kay and C\. Boitet \(Eds\.\),Mumbai, India,pp\. 943–950\.External Links:[Link](https://aclanthology.org/C12-2092)Cited by:[§2](https://arxiv.org/html/2606.24172#S2.p3.1),[§2](https://arxiv.org/html/2606.24172#S2.p4.1),[footnote 2](https://arxiv.org/html/2606.24172#footnote2)\.
- S\. I\. Pollock \(2000\)Cosmopolitan and Vernacular in History\.Public Culture12\(3\),pp\. 591–625\.Note:Project MUSEExternal Links:[Link](https://muse.jhu.edu/article/26221)Cited by:[§2](https://arxiv.org/html/2606.24172#S2.p7.1)\.
- P\. Rai, A\. Das, and S\. Chatterji \(2025\)Mapping of the nepali dependency treebank to universal dependencies\.ACM Trans\. Asian Low\-Resour\. Lang\. Inf\. Process\.24\(11\),pp\. 1–22\.External Links:[Document](https://dx.doi.org/10.1145/3749643)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p2.1)\.
- V\. Ravishankar \(2017\)A Universal Dependencies treebank for Marathi\.InProceedings of the 16th International Workshop on Treebanks and Linguistic Theories,J\. Hajič \(Ed\.\),Prague, Czech Republic,pp\. 190–200\.External Links:[Link](https://aclanthology.org/W17-7623)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p2.1),[§3\.4](https://arxiv.org/html/2606.24172#S3.SS4.p2.1),[§4\.3](https://arxiv.org/html/2606.24172#S4.SS3.p1.1)\.
- B\. Sankaran, K\. Bali, M\. Choudhury, T\. Bhattacharya, P\. Bhattacharyya, G\. N\. Jha, S\. Rajendran, K\. Saravanan, L\. Sobha, and K\.V\. Subbarao \(2008\)A Common Parts\-of\-Speech Tagset Framework for Indian Languages\.InProceedings of the Sixth International Conference on Language Resources and Evaluation \(LREC’08\),N\. Calzolari, K\. Choukri, B\. Maegaard, J\. Mariani, J\. Odijk, S\. Piperidis, and D\. Tapias \(Eds\.\),Marrakech, Morocco,pp\. 1331–1337\.External Links:[Link](https://aclanthology.org/L08-1544)Cited by:[§3\.4](https://arxiv.org/html/2606.24172#S3.SS4.p1.1)\.
- A\. K\. Singh, V\. Kumar, R\. Murthy, J\. Sen, A\. Mittal, and G\. Ramakrishnan \(2025\)INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages\.InFindings of the Association for Computational Linguistics: NAACL 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 2607–2626\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.141)Cited by:[§4\.5](https://arxiv.org/html/2606.24172#S4.SS5.p1.1)\.
- A\. K\. Singh, S\. Dave, P\. A\. P\., B\. Lall, and S\. Mehta \(2020\)Cited by:[§4\.2](https://arxiv.org/html/2606.24172#S4.SS2.p1.1)\.
- J\. Tandon, H\. Chaudhary, R\. A\. Bhat, and D\. M\. Sharma \(2016\)Conversion from Pāṇinian Karakas to Universal Dependencies for Hindi Dependency Treebank\.InProceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 \(LAW\-X 2016\),A\. Friedrich and K\. Tomanek \(Eds\.\),Berlin, Germany,pp\. 141–150\.External Links:[Document](https://dx.doi.org/10.18653/v1/W16-1716)Cited by:[§4\.3](https://arxiv.org/html/2606.24172#S4.SS3.p1.1),[§5\.2](https://arxiv.org/html/2606.24172#S5.SS2.p1.1)\.
- S\. G\. Thomason \(2000\)Linguistic Areas and Language History\.InLanguages in Contact,D\. G\. Gilbers, J\. Nerbonne, and J\. Schaeken \(Eds\.\),Studies in Slavic and General Linguistics, Vol\.28,pp\. 311–327\.External Links:[Document](https://dx.doi.org/10.1163/9789004488472%5F030)Cited by:[footnote 8](https://arxiv.org/html/2606.24172#footnote8)\.
- S\. C\. Vasu \(1897\)The Ashtādhyāyī of Pāṇini\.Sindhu Charan Bose,Benares\.Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p7.1)\.
- D\. Verma, R\. S\. Joshi, A\. A\. Shivani, and R\. D\. Gupta \(2023\)Kāraka\-Based Answer Retrieval for Question Answering in Indic Languages\.InProceedings of the 14th International Conference on Recent Advances in Natural Language Processing,R\. Mitkov and G\. Angelova \(Eds\.\),Varna, Bulgaria,pp\. 1216–1224\.External Links:[Link](https://aclanthology.org/2023.ranlp-1.129)Cited by:[§4\.3](https://arxiv.org/html/2606.24172#S4.SS3.p1.1),[§5\.3](https://arxiv.org/html/2606.24172#S5.SS3.p1.1)\.
- D\. Zeman, J\. Nivre, R\. Abid, M\. Abrams,et al\.\(2026\)Cited by:[§1](https://arxiv.org/html/2606.24172#S1.p2.1)\.
