SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

arXiv cs.CL Papers

Summary

SCRIBE is a diagnostic evaluation framework for automatic speech recognition that provides categorical error decomposition for Indic languages, releasing benchmarks and open-weight rich transcription models for Hindi, Malayalam, and Kannada.

arXiv:2605.20712v1 Announce Type: new Abstract: Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid sandhi merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.

Cached at: 05/21/26, 06:34 AM

# SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR
Source: [https://arxiv.org/html/2605.20712](https://arxiv.org/html/2605.20712)
Manohar Bhattacharya Juvekar Nethil

###### Abstract

Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid *sandhi* merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through *sandhi*-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.

###### keywords:

ASR Evaluation, Rich Transcription, Morphological Alignment, Indic Languages, Diagnostic Metrics

## 1 Introduction

[Figure 1 (flow diagram). Box labels: Verbatim Corpora; LLM Curation Pipeline, Formatting & Domain Injection (Release 1: Pipeline); ASR Model Training (Hindi, ML, KN); SCRIBE Framework, Diagnostic Evaluation (Release 2: Framework); Final Weights & Benchmarks (Release 3); Domain Data (Optional); Diagnostic Feedback Loop with Categorical Analysis.]

Figure 1: Diagnostic-led development cycle for Indic rich transcription. SCRIBE provides the categorical feedback necessary to refine curation and verify model performance across error types.

The utility of automatic speech recognition (ASR) for dictation, whether producing medical notes, legal proceedings, or classroom transcripts, is defined by the correction threshold: editing must be faster than typing. This requires *rich transcription*: text with grammatical punctuation, standardized numerals, and domain-appropriate orthographic conventions. Whether a system meets this bar depends on the *type* of error, not just the count. A missing comma is trivial; a misrecognized medical term or incorrectly formatted legal date can render output unusable.

Standard word error rate (WER) fails as a development signal for two reasons. First, it collapses acoustic failures, numeral formatting, and punctuation into a single scalar, offering no actionable insight. Second, it is structurally broken for agglutinative Indic languages. In morphologically complex Dravidian languages like Malayalam and Kannada [manohar2020quantitative, bharadwaja2007statistical], valid word-boundary merges (*sandhi*) with phonotactic changes at the boundaries trigger cascading alignment shifts in 1:1 alignment, inflating error rates by up to 30% relative. This is a structural penalty against an entire language family.

We introduce SCRIBE, a diagnostic evaluation framework named for the role it measures: whether ASR can serve as a reliable scribe. Rather than a single scalar, SCRIBE outputs a diagnostic error vector $\mathbf{E} = [ER_{lex}, ER_{punc}, ER_{num}, ER_{ent}]$, which decomposes failures into lexical, punctuation, numeral, and domain-specific entity error rates, respectively. By utilizing *sandhi*-tolerant alignment and categorical decomposition, SCRIBE replaces the monolithic WER with an actionable development signal, enabling targeted remediation for rich transcription tasks.

To summarize, our major contributions in this paper are:

- SCRIBE, released as an open-source evaluation tool, providing *sandhi*-tolerant alignment and categorical error decomposition, proposed as a replacement for monolithic WER wherever ASR serves as a scribe.
- A structured annotation schema and validation procedure for categorical ASR metrics, with dimension-specific scales rated independently by expert linguists, demonstrating that SCRIBE aligns with human judgment where WER does not.
- A reproducible recipe for Indic rich transcription: an LLM-based data curation pipeline, two new benchmarks (FLEURS-RO for general and IN22-Legal for domain evaluation), and the first open-weight rich transcription models for Hindi, Malayalam, and Kannada.

## 2 Related Work

Rich Transcription Models: While models like Whisper [radford2023robust] and Canary [raokoluguri25_interspeech] demonstrate the feasibility of joint acoustic-orthographic modeling, the open-source Indic ecosystem remains dominated by verbatim-only models [bhogale2023effectiveness, bhogale23_interspeech, bhogale2025towards]. Current pipelines for formatted output often rely on decoupled inverse text normalization [pulipaka2025mark], which ignores prosodic cues and homophone resolution. SCRIBE bridges this by providing a recipe for native rich transcription in Indic ASR.

Evaluation Limitations: While character error rate (CER) sidesteps word-boundary issues in agglutinative languages [k-etal-2025-advocating], it and WER lack diagnostic signals. CER is semantically blind, weighting functional suffix changes identically to root morpheme substitutions. Categorical frameworks like Beyond Levenshtein [kuhn24_interspeech] move toward nuanced evaluation but rely on normalization that destructively strips lexically indispensable Indic vowel signs (*matras*) and diacritics [manohar-pillai-2024-lost]. Similarly, 1:1 word alignment, shared by word information lost (WIL) and word information preserved (WIP) [morris04_interspeech], cannot resolve valid *sandhi* (word-boundary merges) common in Dravidian languages [manohar-pillai-2024-lost]. Semantic metrics like Semascore [semascore2024] prioritize global meaning but remain blind to fatal numeral or negation failures that render professional dictation unusable. While the Orthographically-Informed WER leverages LLMs to capture permissible variations [bhogale2026orthographicallyinformedevaluationspeechrecognition], the approach is computationally expensive and fails to resolve the structural alignment shifts caused by *sandhi*. SCRIBE addresses these gaps through diacritic-preserving normalization, deterministic *sandhi*-tolerant alignment, and categorical error decomposition.

## 3 The SCRIBE Framework

SCRIBE is organized into three phases: tokenization and domain shielding, a sandhi-aware alignment engine, and categorical error aggregation. The framework outputs a diagnostic error vector $\mathbf{E}$ where each component maps to a specific remediation strategy.

### 3.1 Phase 1: Tokenization and Domain Shielding

The framework transforms reference $R$ and hypothesis $H$ into typed tokens $(w_i, t_i)$, where $t_i \in \{\text{lexeme, numeral, punctuation, domain-entity}\}$. Unlike standard tokenizers that strip or blindly isolate punctuation, in SCRIBE standard punctuation and Indic-specific marks (e.g., the Hindi *danda*) become independent tokens, while punctuation within numerals and compound words (e.g., `22.05.2023`, `ice-cream`) is preserved to maintain lexical integrity. User-defined domain entities are injected via a regex-based shielding layer to treat them as atomic units, preventing spurious fragmentation.
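As a rough illustration of Phase 1, the sketch below types tokens with a single regular expression and shields user-supplied entities by trying them before any other alternative. It is a minimal sketch under stated assumptions, not SCRIBE's implementation: the names `tokenize` and `build_pattern`, the punctuation inventory, and the example entity are all hypothetical.

```python
import re

# Hypothetical punctuation inventory: ASCII marks plus the danda and double danda.
PUNCT_CHARS = r'.,;:!?"()\[\]\-।॥'
# A word is any run of characters that is not whitespace, a digit, or punctuation.
# Matching by exclusion keeps Indic vowel signs and viramas attached to their word.
WORD = r"[^\s\d" + PUNCT_CHARS + "]+"

TYPE_NAMES = {
    "entity": "domain-entity",
    "numeral": "numeral",
    "lexeme": "lexeme",
    "punct": "punctuation",
}


def build_pattern(entities):
    """Compile one alternation; earlier alternatives take priority."""
    parts = []
    if entities:
        # Longest entities first, so a longer entity is never cut short by a shorter one.
        shielded = "|".join(re.escape(e) for e in sorted(entities, key=len, reverse=True))
        parts.append("(?P<entity>" + shielded + ")")
    # Digits with internal delimiters stay whole, e.g. 22.05.2023 or 10:30.
    parts.append(r"(?P<numeral>\d+(?:[.,/:\-]\d+)*)")
    # Hyphenated compounds such as ice-cream stay whole.
    parts.append("(?P<lexeme>" + WORD + "(?:-" + WORD + ")*)")
    # Anything else that is not whitespace becomes its own punctuation token.
    parts.append(r"(?P<punct>\S)")
    return re.compile("|".join(parts))


def tokenize(text, entities=()):
    """Return a list of (token, type) pairs."""
    pattern = build_pattern(list(entities))
    return [(m.group(), TYPE_NAMES[m.lastgroup]) for m in pattern.finditer(text)]
```

For instance, `tokenize("Section 302 was amended on 22.05.2023.", entities=["Section 302"])` keeps `Section 302` and `22.05.2023` as single tokens and emits the final period as a separate punctuation token. Words are matched by exclusion rather than with `\w` because Python's `\w` does not cover Indic combining vowel signs.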

### 3.2 Phase 2: Sandhi-Aware Alignment Engine

An *alignment* is a pairing of reference $R$ and hypothesis $H$ positions that accounts for insertions, deletions, standard 1:1 substitutions, and *sandhi*-motivated 1:2 (split) and 2:1 (merge) mappings. We seek the alignment maximizing a total score $dp[i][j]$, calculated via extended dynamic programming in Equation [1](https://arxiv.org/html/2605.20712#S3.E1).

$$
dp[i][j] = \max \left\{
\begin{aligned}
&dp[i-1][j-1] + S(r_i, h_j) && \text{(match/sub)} \\
&dp[i-1][j] + \gamma(t^{R}_{i}) && \text{(deletion)} \\
&dp[i][j-1] + \gamma(t^{H}_{j}) && \text{(insertion)} \\
&dp[i-1][j-2] + \Sigma_{\text{split}} && \text{(Sandhi-split)} \\
&dp[i-2][j-1] + \Sigma_{\text{merge}} && \text{(Sandhi-merge)}
\end{aligned}
\right. \tag{1}
$$

The scoring function $S(r_i, h_j)$ anchors the alignment on exact matches ($\alpha = +4.0$) while buffering against the acoustic near-misses common in Indic scripts (e.g., खाना: *khana* vs. गाना: *gana*). To prevent alignment drift, a category-clash penalty $\beta = -3.0$ is applied if $t^{R}_{i} \neq t^{H}_{j}$. For same-category substitutions, we employ a Levenshtein-buffered penalty $\delta = -1.5 - (0.2 \cdot d)$, where $d$ is the character distance between $r_i$ and $h_j$. Sensitivity analysis on our target languages confirms that a Unicode-level distance of $d \leq 2$ optimally captures minor orthographic variations, such as *matra* shifts or gemination, without triggering the cascading deletion-insertion pairs typical of standard WER evaluation.
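The three cases of the scoring function can be written down directly. The following is a minimal sketch that uses the constants stated above ($\alpha = 4.0$, $\beta = -3.0$, $\delta = -1.5 - 0.2d$); the function names, the `(word, type)` tuple layout, and the choice to test the category clash before the exact match are assumptions made for illustration.

```python
ALPHA = 4.0   # exact-match reward (value from the text)
BETA = -3.0   # category-clash penalty (value from the text)


def levenshtein(a: str, b: str) -> int:
    """Edit distance over Unicode code points (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def score(ref, hyp):
    """S(r, h) for two typed tokens given as (word, type) tuples."""
    (r_word, r_type), (h_word, h_type) = ref, hyp
    if r_type != h_type:
        return BETA                    # tokens of different categories
    if r_word == h_word:
        return ALPHA                   # exact match
    d = levenshtein(r_word, h_word)
    return -1.5 - 0.2 * d              # same-category substitution (delta)
```

Because the substitution penalty grows only slowly with $d$, a one-character near-miss remains far cheaper than a deletion-insertion pair would be under most gap penalties, which is what keeps the alignment from drifting.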

Sandhi scores, $\Sigma$, resolve 1:2 or 2:1 mappings by validating phonetic plausibility. A transition is valid if the fused form $s$ matches the prefix of $w_1$ and the suffix of $w_2$. We score this as $\Sigma = \alpha + \sigma - d(b_{split}, b_{mid})/|s|$, where $\sigma = -0.5$ is the sandhi penalty. If the boundary distance $d > 2$, the transition is invalidated per Indic morphophonological rules [bhardwaj-etal-2018-sandhikosh, dasari-etal-2025-sandhi, thottingal-2019-finite]. Figure [2](https://arxiv.org/html/2605.20712#S3.F2) illustrates SCRIBE’s ability to correctly resolve complex word merges like ‘innu allenkil → innallenkil’ and splits like ‘naleyakatte → nale akatte’ in Malayalam.
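To make the recurrence concrete, here is a small sketch of the extended dynamic program on the romanized Malayalam example. Several parts are assumptions rather than the paper's method: the gap penalty `GAP` stands in for $\gamma$, whose value is not given; all tokens are treated as lexemes; and the boundary distance is approximated by the edit distance between the concatenated pair and the fused form, with only the first and last characters checked as a prefix and suffix test.

```python
ALPHA, SIGMA = 4.0, -0.5    # values from the text
GAP = -2.0                  # assumed stand-in for gamma (not given in the text)
MAX_BOUNDARY_DIST = 2       # transitions with d > 2 are rejected


def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def sub_score(r, h):
    return ALPHA if r == h else -1.5 - 0.2 * levenshtein(r, h)


def sandhi_score(fused, w1, w2):
    """Score pairing one fused word with two words, or None if implausible."""
    if not fused or not w1 or not w2:
        return None
    if fused[0] != w1[0] or fused[-1] != w2[-1]:   # simplified prefix/suffix check
        return None
    d = levenshtein(w1 + w2, fused)                # simplified boundary distance
    if d > MAX_BOUNDARY_DIST:
        return None
    return ALPHA + SIGMA - d / len(fused)


def align(ref, hyp):
    """Return the best-scoring list of (operation, ref_words, hyp_words)."""
    n, m = len(ref), len(hyp)
    dp = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i and j:
                op = "match" if ref[i - 1] == hyp[j - 1] else "sub"
                cands.append((dp[i - 1][j - 1] + sub_score(ref[i - 1], hyp[j - 1]), op, 1, 1))
            if i:
                cands.append((dp[i - 1][j] + GAP, "del", 1, 0))
            if j:
                cands.append((dp[i][j - 1] + GAP, "ins", 0, 1))
            if i >= 1 and j >= 2:   # one reference word, two hypothesis words
                s = sandhi_score(ref[i - 1], hyp[j - 2], hyp[j - 1])
                if s is not None:
                    cands.append((dp[i - 1][j - 2] + s, "split", 1, 2))
            if i >= 2 and j >= 1:   # two reference words, one hypothesis word
                s = sandhi_score(hyp[j - 1], ref[i - 2], ref[i - 1])
                if s is not None:
                    cands.append((dp[i - 2][j - 1] + s, "merge", 2, 1))
            best = max(cands, key=lambda c: c[0])
            dp[i][j], back[i][j] = best[0], best[1:]
    ops, i, j = [], n, m
    while i or j:                   # trace back from the final cell
        op, di, dj = back[i][j]
        ops.append((op, ref[i - di:i], hyp[j - dj:j]))
        i, j = i - di, j - dj
    return ops[::-1]


ops = align(["innu", "allenkil", "naleyakatte"], ["innallenkil", "nale", "akatte"])
for op, r, h in ops:
    print(op, r, h)
```

On this input the sketch selects one merge and one split and no substitutions, whereas a strict 1:1 aligner would have to score all three positions as substitutions.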

[Figure 2, two panels. SCRIBE Sandhi-Aware Alignment: ഇന്ന് അല്ലെങ്കിൽ നാളെയാകട്ടെ aligned with ഇന്നല്ലെങ്കിൽ നാളെ ആകട്ടെ through one MERGE (2:1) and one SPLIT (1:2); $E_{lex} = 0\%$, correct resolution of linguistic merges and splits. Standard 1:1 Alignment (JIWER): the same word sequences scored as three substitutions; $WER = 100\%$, alignment shift due to word splits and merges.]

Figure 2: Standard libraries trigger cascading alignment shifts during linguistic merges and splits, inflating the WER, whereas SCRIBE correctly identifies these orthographic variations, reporting 0% $ER_{lex}$.

### 3.3 Phase 3: Categorical Error Aggregation

SCRIBE aggregates errors into a diagnostic vector $\mathbf{E} = [ER_{lex}, ER_{punc}, ER_{num}, ER_{ent}]$. We employ a combined denominator $N_{\text{comb}} = \sum_{t \in \mathcal{T}} \text{total}[t]$ to calculate categorical rates: $ER_t = (sub[t] + ins[t] + del[t]) / N_{\text{comb}}$. This unified scaling prevents misleadingly high rates from isolated failures of sparse categories. To account for valid formatting choices in professional dictation, SCRIBE optionally normalizes date and numeral delimiters, ensuring that acceptable orthographic variations do not inflate $ER_{num}$. The framework generates detailed reports with absolute error counts to facilitate granular diagnostic visualization and the development of targeted remediation strategies for ASR systems.
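The effect of the combined denominator is easiest to see with numbers. The sketch below applies the formula for $ER_t$ to made-up counts; the counts, the dictionary layout, and the reading of $\text{total}[t]$ as the number of reference tokens of type $t$ are assumptions for illustration, not figures from the paper.

```python
CATEGORIES = ("lexeme", "punctuation", "numeral", "domain-entity")


def categorical_rates(counts, totals):
    """ER_t = (sub[t] + ins[t] + del[t]) / N_comb for every category t."""
    n_comb = sum(totals[t] for t in CATEGORIES)   # combined denominator
    if n_comb == 0:
        raise ValueError("no reference tokens")
    return {t: sum(counts[t].values()) / n_comb for t in CATEGORIES}


# Hypothetical token totals and error counts (not taken from the paper).
totals = {"lexeme": 80, "punctuation": 12, "numeral": 5, "domain-entity": 3}
counts = {
    "lexeme": {"sub": 6, "ins": 1, "del": 1},
    "punctuation": {"sub": 0, "ins": 1, "del": 2},
    "numeral": {"sub": 1, "ins": 0, "del": 0},
    "domain-entity": {"sub": 1, "ins": 0, "del": 0},
}

rates = categorical_rates(counts, totals)
for category, rate in rates.items():
    print(f"{category:14s} {100 * rate:5.2f}%")
```

With these counts the single entity error contributes 1% of 100 tokens. Dividing by the three entity tokens alone would report about 33%, which is the kind of sparse-category spike the unified scaling is meant to avoid. A side effect is that the components sum to the overall token-level error rate.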

## 4 Experimental Setup

We validate SCRIBE through a complete experimental cycle of rich transcription model development for Hindi, Malayalam, and Kannada (Figure [1](https://arxiv.org/html/2605.20712#S1.F1)). This section describes: (1) the LLM-based data curation pipeline and the rich transcription models trained on it; (2) two new benchmarks released for general and domain-specific evaluation; and (3) a human evaluation study with expert linguists designed to test whether SCRIBE’s categorical rates align with human judgment where monolithic WER does not.

### 4.1 Data and Models

Public Indic speech corpora [bhogale2023effectiveness, kathbath2022, prahallad2012iiit, gopinath2022imasc, javed2024indicvoices, baby2016resources, conneau2023fleurs] provide mostly verbatim transcripts. We use Gemini 2.5 Pro [comanici2025gemini] with language-specific prompts to transform these into rich transcription. A multi-tier quality control pipeline discards samples where CER exceeds thresholds for lexical changes (ignoring numeral and punctuation shifts) or where foreign-script characters are detected, removing ~10% of data.

The final curated sets comprise ~1000h Hindi, ~850h Kannada, and ~800h Malayalam. SCRIBE-ASR is fine-tuned from pre-trained Whisper-small and Whisper-medium models in three stages: (1) diversity adaptation across acoustic conditions, (2) pace and style robustness, and (3) precision tuning with near-perfect, well-articulated speech. We compare against two baselines: IndicWhisper (Vistaar) [bhogale2023effectiveness] and IndicConformer [javed2024indicvoices], neither of which claims rich transcription natively.

### 4.2 Benchmarks

Existing Indic ASR benchmarks evaluate only verbatim transcription and offer no way to measure formatting accuracy\. We release two curated evaluation sets designed to fill this gap across general and domain\-specific conditions\.

FLEURS-RO (Rich Orthography) is derived from the FLEURS multilingual test set [conneau2023fleurs]. We apply our LLM curation pipeline to generate rich transcription references for the Hindi, Kannada, and Malayalam splits. Each transformed reference is then verified by a native-speaker linguist who corrects hallucinated punctuation, numeral formatting errors, and script inconsistencies introduced by the LLM. The result is a general-domain benchmark where both verbatim and rich transcription ground truths are available.

IN22-Legal is a domain-specific out-of-distribution benchmark derived from IN22 [gala2023indictrans2]. Legal passages were recorded as read speech by 2–4 speakers per language (~30 minutes per language), producing a corpus dense in domain entities (statute names, section numbers), formal numerals (dates, monetary amounts), and complex clause structures. Ground-truth transcripts were prepared directly in rich transcription format by legal-domain annotators. Because none of the training data contains legal text, IN22-Legal tests whether formatting conventions learned from general corpora generalize to high-stakes specialized vocabulary.

### 4.3 Human Evaluation Protocol

The central claim of SCRIBE is that categorical error rates capture distinctions that experts perceive but monolithic WER cannot\. To test this, we design a correlation study where human ratings serve as the ground truth against which both SCRIBE and WER are measured\. If SCRIBE’s per\-category rates correlate significantly more strongly with expert judgment than WER does, the decomposition is validated as a meaningful diagnostic signal\.

Annotators and samples. We selected 80 samples per language (240 total) from the IN22-Legal benchmark to ensure high density of domain-specific entities, numerals, and complex punctuation. Eight expert linguists (two per language), each a native speaker with professional proficiency in formal written registers, independently rated the SCRIBE-ASR hypotheses against ground-truth transcripts.

Annotation schema. Annotators assign scores on a 1.0–5.0 continuous scale (decimal scores encouraged for fine-grained discrimination) across three dimensions:

- Lexical accuracy (S1): Correctness of base words, evaluated independently of formatting. 5.0 = every spoken word present and correct; 3.0 = meaning preserved with 2–3 errors; 1.0 = wholesale misrecognition.
- Numeral accuracy (S2): Correctness and format compliance of numbers and dates. 5.0 = mathematically accurate with proper digit formatting (e.g., “302” not “three hundred two”); 1.0 = mathematically incorrect values (e.g., Section 302 → Section 307), constituting fatal errors in legal contexts.
- Punctuation accuracy (S3): Appropriateness of sentence boundaries, commas, and Indic-specific marks (e.g., *danda*). 5.0 = professional-grade segmentation; 1.0 = absent or misleading punctuation.

We use a continuous rather than discrete scale because Spearman correlation requires sufficient rank variation; a coarse 3-point scale would compress distinctions that experts naturally perceive (e.g., one misplaced comma vs. five). Dimensions are rated independently to prevent halo effects: annotators complete all S1 ratings before proceeding to S2, ensuring that a strong lexical impression does not inflate punctuation scores. Samples where a category is absent (e.g., no numerals) are marked `N/A` and excluded from that category’s correlation. Annotators were calibrated via written guidelines with worked examples distinguishing minor formatting variances (e.g., comma placement preference) from fatal value errors (e.g., wrong statute number), and recognizing valid *sandhi* variations that should not be penalized.

Analysis. We compute Spearman $\rho$ between SCRIBE’s categorical error rates ($ER_{lex}$, $ER_{num}$, $ER_{punc}$) and their corresponding human dimensions ($S1$, $S2$, $S3$), and contrast these against the correlation of monolithic WER with each dimension.
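The analysis step amounts to a rank correlation per category with `N/A` samples dropped. The sketch below is a dependency-free version using average ranks for ties; the function names and the sample values are hypothetical, and it does not compute the p-values reported in the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical per-sample numeral error rates and S2 ratings; None marks N/A.
er_num = [0.00, 0.02, 0.05, 0.10, 0.01]
s2 = [5.0, 4.5, 3.0, 1.0, None]

pairs = [(e, s) for e, s in zip(er_num, s2) if s is not None]   # drop N/A samples
rho = spearman_rho([e for e, _ in pairs], [s for _, s in pairs])
print(round(rho, 3))
```

A negative $\rho$ is the expected direction here, since a higher error rate should go with a lower human rating; this is why the results are summarized in terms of $|\rho|$.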

Table 1: SCRIBE decomposition on general (FLEURS-RO) and domain-specific (IN22-Legal) benchmarks. All values are error rates (%). Best per language in bold.

| Lang. | Model | FLEURS-RO WER | FLEURS-RO $ER_{lex}$ | FLEURS-RO $ER_{num}$ | FLEURS-RO $ER_{punc}$ | IN22-Legal WER | IN22-Legal $ER_{lex}$ | IN22-Legal $ER_{ent}$ | IN22-Legal $ER_{num}$ | IN22-Legal $ER_{punc}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Hindi | IndicWhisper | 35.20 | 23.80 | 1.06 | 6.87 | 66.37 | 45.42 | 3.83 | 2.23 | 8.70 |
| Hindi | IndicConformer | 21.70 | **10.16** | 1.35 | 6.99 | 26.32 | 10.59 | 0.67 | 2.56 | 8.70 |
| Hindi | SCRIBE-ASR | **17.57** | 11.68 | **0.31** | **3.30** | **19.29** | **8.58** | **0.59** | **0.59** | **6.73** |
| Kannada | IndicWhisper | 40.51 | 19.29 | 2.06 | 10.09 | 46.09 | 17.99 | 1.16 | 3.03 | 12.46 |
| Kannada | IndicConformer | 32.95 | **12.46** | 2.49 | 10.29 | 40.74 | **15.13** | **0.87** | 3.96 | 12.46 |
| Kannada | SCRIBE-ASR | **29.87** | 16.27 | **0.56** | **5.79** | **38.20** | 16.12 | 1.86 | **0.15** | **9.02** |
| Malayalam | IndicWhisper | 41.77 | 14.65 | 1.74 | 15.41 | 54.74 | 17.76 | 1.52 | 1.58 | 14.29 |
| Malayalam | IndicConformer | 41.00 | **13.58** | 2.39 | 15.40 | 52.11 | 17.32 | 1.39 | 3.67 | 14.29 |
| Malayalam | SCRIBE-ASR | **36.65** | 14.77 | **0.59** | **14.03** | **44.52** | **15.96** | **1.28** | **0.94** | **12.12** |

## 5 Results

### 5.1 Correlation with Human Judgment

Table 2: Spearman $\rho$ correlation of SCRIBE error rates vs. monolithic WER with human expert ratings. SCRIBE’s category-specific rates show consistent alignment ($|\rho| = 0.36$–$0.92$); global WER fails to significantly correlate with human judgment in several dimensions, particularly in Malayalam.

Table [2](https://arxiv.org/html/2605.20712#S5.T2) confirms that SCRIBE’s categorical metrics align robustly with human judgment ($|\rho| = 0.36$–$0.92$), significantly outperforming monolithic WER ($|\rho| \leq 0.49$). The alignment is strongest in high-stakes numeral accuracy, reaching $\rho = -0.92$ in Malayalam. Crucially, while WER fails to achieve statistical significance in several Malayalam dimensions ($p > 0.05$), SCRIBE’s components remain highly predictive ($p \leq 0.001$). This disparity proves that experts prioritize functional categories, specifically punctuation and numerals, that global WER treats as noise.

Variations in lexical correlation ($|\rho| = 0.36$ in Malayalam to $0.55$ in Hindi) reflect the linguistic complexity of the evaluation set. The moderate alignment in Malayalam likely stems from its agglutinative nature, which increases subjectivity in human-perceived word boundaries. Nevertheless, the consistent significance of SCRIBE across all categories and languages underscores that a multi-dimensional framework is a prerequisite for evaluating ASR in specialized domains where readability and semantic precision are paramount.

### 5.2 Diagnostic Decomposition

Table [1](https://arxiv.org/html/2605.20712#S4.T1) provides the categorical decomposition across general and out-of-distribution (OOD) legal benchmarks. While SCRIBE-ASR yields the lowest WER in all conditions, the diagnostic vector $\mathbf{E}$ reveals that the composition of these gains differs fundamentally across error categories.

The WER inflation gap. The most striking diagnostic finding appears in the Malayalam Legal set. WER reports 44.52%, yet SCRIBE’s decomposition reveals that genuine lexical failure ($ER_{lex}$) accounts for only 15.96%. By resolving valid morphological merges, SCRIBE’s alignment engine reduces reported error inflation by up to 30% relative in Malayalam and 25% in Kannada. Figure [2](https://arxiv.org/html/2605.20712#S3.F2) shows an example that scores 100% WER under JIWER but 0% lexical error under SCRIBE.

Without this decomposition, the model would be deemed unusable based on WER alone; SCRIBE reveals that core acoustic-phonetic reliability is nearly 3× higher than the monolithic scalar suggests, and that a substantial portion of reported Indic WER is an artifact of morphological structure rather than acoustic misrecognition.

Formatting generalization. The most prominent model-level result is near-saturation of numeral formatting ($ER_{num} < 1\%$), with 75–96% relative reduction compared to the best baseline across all benchmarks. Domain entity error ($ER_{ent}$) remains below 2% even in OOD legal dictation, indicating that acoustic learning from general corpora transfers to specialized vocabulary. This generalization highlights the effectiveness of the LLM curation pipeline in producing training data whose formatting conventions extend to unseen domains.
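The relative reductions in $ER_{num}$ can be recomputed from the values in Table 1. The sketch below assumes the comparison baseline is IndicConformer, the stronger of the two baselines by WER; that reading is an assumption, since the text says only "best baseline".

```python
# (language, benchmark) -> (IndicConformer ER_num, SCRIBE-ASR ER_num), in percent,
# copied from Table 1. Treating IndicConformer as the baseline is an assumption.
ER_NUM = {
    ("Hindi", "FLEURS-RO"): (1.35, 0.31),
    ("Hindi", "IN22-Legal"): (2.56, 0.59),
    ("Kannada", "FLEURS-RO"): (2.49, 0.56),
    ("Kannada", "IN22-Legal"): (3.96, 0.15),
    ("Malayalam", "FLEURS-RO"): (2.39, 0.59),
    ("Malayalam", "IN22-Legal"): (3.67, 0.94),
}


def relative_reduction(baseline, system):
    """Percentage by which `system` lowers the error rate of `baseline`."""
    return 100.0 * (baseline - system) / baseline


reductions = {key: relative_reduction(b, s) for key, (b, s) in ER_NUM.items()}
for (lang, bench), value in reductions.items():
    print(f"{lang:10s} {bench:11s} {value:5.1f}%")
```

Under this reading the reductions fall between roughly 74% and 96%, with the largest on Kannada IN22-Legal and the smallest on Malayalam IN22-Legal.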

Punctuation as the remaining bottleneck. Despite gains across formatting categories, $ER_{punc}$ remains the primary challenge, particularly in Dravidian languages. Malayalam Legal reports 12.12% vs. Hindi’s 6.73%, a disparity visible only through categorical analysis. In agglutinative contexts, the prevalence of long compound word forms makes prosodic segmentation harder to learn than numeral or entity formatting, pointing to prosodic modeling as the next development target.

### 5.3 SCRIBE as a Development Signal

SCRIBE’s diagnostic value extends beyond post-hoc evaluation to active model development, as illustrated by the feedback loop in Figure [1](https://arxiv.org/html/2605.20712#S1.F1). During training, early iterations exhibited systematic over-punctuation bias entirely invisible in aggregate WER, which improved monotonically. SCRIBE’s $ER_{punc}$ decomposition isolated the regression to segments where legacy verbatim corpora contained extremely short sequences (<4 words) with misleading terminal punctuation, enabling targeted filtering and refined LLM curation prompts. Without categorical decomposition, this quality degradation would have shipped undetected.

## 6 Conclusion

Standard WER is an insufficient metric for rich transcription ASR: it provides no diagnostic signal and structurally penalizes agglutinative languages through cascading alignment failures. We introduced SCRIBE to address both through *sandhi*-tolerant alignment and categorical error decomposition, validated by strong agreement with expert linguists. Our diagnostic analysis reveals a critical divergence: while formatting logic for numerals and entities generalizes effectively across domains, punctuation placement in agglutinative contexts remains the primary bottleneck. By resolving *sandhi*-induced error inflation, SCRIBE proves that Indic ASR systems are more acoustically reliable than standard scalars suggest. We release SCRIBE alongside our curation recipe, benchmarks, and open-weight models to enable development of ASR systems meeting the correction thresholds required for professional dictation.

## 7 Generative AI Use Disclosure

The authors utilized large language model \(LLM\) tools, specifically Gemini 2\.5 Pro, to facilitate the automated curation of rich transcription datasets \(Section 4\.1\) and to assist in the linguistic refinement and technical polishing of the manuscript\. All final content was reviewed, verified, and approved by the authors, who take full responsibility for the integrity of the research and its presentation\.
