LV-ROVER: Multi-Stream Tesseract Voting for Maltese Paragraph OCR

arXiv cs.CL Papers

Summary

This paper presents LV-ROVER, a multi-stream Tesseract ensemble for Maltese OCR, achieving a 70% reduction in character error rate through synthetic data training and post-processing, addressing the challenges of low-resource OCR for Maltese.

arXiv:2607.00250v1 Announce Type: new

Cached at: 07/02/26, 05:36 AM

# LV-ROVER: Multi-Stream Tesseract Voting for Maltese Paragraph OCR
Source: [https://arxiv.org/html/2607.00250](https://arxiv.org/html/2607.00250)
###### Abstract\.

Maltese has decent text corpora and pretrained language models\(Micallef et al\.,[2022](https://arxiv.org/html/2607.00250#bib.bib17)\), but, like many languages outside the handful with large OCR benchmarks, only a single known real labelled PDF corpus for OCR training, 57 pages\(Tanti et al\.,[2023](https://arxiv.org/html/2607.00250#bib.bib22)\), far below what paragraph\-level training needs: low\-resource for OCR specifically\. With no real corpus to train on at scale, we built a synthetic training pipeline and a 5\-stream Tesseract LV\-ROVER ensemble, and report results on a 422\-paragraph benchmark against a fine\-tuned\-Tesseract baseline of character error rate \(CER\) 0\.0234\. Ensemble recognition alone improves CER by 44 percent, to 0\.01317; a five\-stage post\-processing chain brings the full pipeline to CER 0\.00700, a 70 percent reduction\. Most of that chain is typographic normalisation, but one stage recovers misread diacritics rather than aligning punctuation, so we report it as a recognition gain rather than folding the whole chain under one label\. We treat the 44 percent figure as the portable estimate of what the recogniser learned, and the 70 percent figure as specific to this benchmark’s label convention\.

OCR, low\-resource languages, Maltese, synthetic data, character error rate

Working paper, not peer reviewed\.

## 1\.Introduction

Maltese is the sole Semitic language with EU official status\. Its approximately 520,000 native speakers have accumulated centuries of archival text \(parliamentary records, legal gazettes, ecclesiastical documents, historical press\) that remains unsearchable because it has not been digitised at scale\. OCR is the prerequisite\. But Maltese is also a difficult OCR target, for three reasons\. Its 30\-letter alphabet extends the standard Latin inventory with Ċ ċ, Ġ ġ, H̄ h̄, Ż ż, and the digraphs Gh̄ gh̄ and Ie ie\. Fonts frequently substitute the base letter for the diacritic form without warning\. And the definite article attaches to the following noun via a structural hyphen \(il\-kelb, id\-dar, fis\-seh̄h̄\) that shares the U\+002D codepoint with soft line\-break hyphens\. The only labelled real Maltese PDF data we are aware of comes from the NOMOCRAT annotation project \(Tanti et al\., [2023](https://arxiv.org/html/2607.00250#bib.bib22)\): 57 PDF pages, line\- and paragraph\-transcribed\. That is several orders of magnitude below what training a paragraph\-level recogniser needs, and we found no indication it has been released for reuse outside that project\.

Three technical challenges shape the system described here\. First, any tokeniser or font that silently substitutes c for ċ corrupts labels at the encoder or rendering stage; we treat the four diacritic pairs \(ċ/c, ġ/g, h̄/h, ż/z\) as disqualification sentinels, or canaries, tracked at every stage of the pipeline\. Second, the soft\-versus\-structural hyphen distinction is not decidable from the glyph alone: both surfaces use U\+002D, so resolution requires either a language model or a rule\-based joiner with Maltese morphological knowledge \(Borg and Gatt, [2017](https://arxiv.org/html/2607.00250#bib.bib6); Maltese Language Resource Server, [2025](https://arxiv.org/html/2607.00250#bib.bib15)\)\. Third, the gold label convention for a benchmark shapes which improvements look real\. The dev\-gold labels use curly quotes \(‘ ’ “ ”\) and an em\-dash \(—\); Tesseract’s raw output uses straight ASCII\. Normalising one to the other produces a large CER drop that has nothing to do with recognition quality\. We make this explicit with a dual\-CER protocol\.
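The canary idea can be illustrated with a short check that compares diacritic counts before and after any pipeline stage. This is a minimal sketch, not the paper's implementation: the function names are hypothetical, and it assumes the four letters are stored as precomposed code points after NFC normalisation (ċ, ġ, ħ, ż and their capitals).

```python
import unicodedata
from collections import Counter

# Assumption: canaries are the precomposed Maltese letters, both cases.
CANARIES = "ċĊġĠħĦżŻ"


def canary_counts(text: str) -> Counter:
    """Count each canary letter in NFC-normalised text."""
    text = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in text if ch in CANARIES)


def canary_losses(before: str, after: str) -> dict[str, int]:
    """Return canary letters that occur less often after a stage than before it."""
    b, a = canary_counts(before), canary_counts(after)
    return {ch: b[ch] - a[ch] for ch in b if a[ch] < b[ch]}
```

A stage that returns an empty dictionary has not dropped any canary; a non-empty result names the letters it lost and how many of each.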

This paper describes the LV\-ROVER system: five parallel Tesseract LSTM streams voted per word under a soft Maltese lexicon, followed by a five\-stage label\-convention normalisation chain and a rule\-based line joiner\. We evaluate it on a 422\-paragraph Maltese benchmark with a fine\-tuned\-Tesseract baseline at CER 0\.0234 and a held\-out test set\(DocEng 2026 Organisers,[2026](https://arxiv.org/html/2607.00250#bib.bib8)\)\. Inference runs under a tight compute envelope: CPU only, no GPU, no network after initialisation, which rules out large neural decoders as a drop\-in alternative and motivates the ensemble\-of\-small\-models design\.

The contributions are:

- A reproducible Maltese paragraph synthesis pipeline: corpus text across eleven diacritised domain configs, 68 fonts validated against the canary set at the glyph\-map and raster level, PDF\-realistic augmentations, and per\-sample tagging of soft, structural, and compound hyphens at training time\.
- A dual\-CER reporting protocol that separates recognition gains from label\-convention alignment gains, with an ablation audit that checks each normalisation rule against held\-out synthetic CER to rule out benchmark overfit\.
- A rule\-based line joiner whose output is then normalised, stripping soft\-hyphen markers and converting the image\-only en\-dash to the label\-bearing em\-dash, validated against the benchmark’s hyphen subset before any custom logic was written\.
- A 5\-stream LV\-ROVER ensemble adapted for low\-resource voting: soft lexicon confidence weighting, diacritic\-preserving edit\-distance bounds, and diversity by language chain and image scale rather than training\-data resampling\.
- A diacritic canary check tracking the four sentinel pairs at the tokeniser, font, and post\-processing stages at every pipeline run\.

Prior work on Maltese OCR is limited to NOMOCRAT\(Tanti et al\.,[2023](https://arxiv.org/html/2607.00250#bib.bib22)\), which established the hyphen\-joining and label\-convention challenges this system addresses, at a scale too small to train a paragraph\-level recogniser\.

## 2\.System Overview

The system has a training side and an inference side\. The training side renders synthetic crops from a text source and fine\-tunes the Tesseract LSTM on them\. The inference side runs five parallel Tesseract streams over the image; each stream’s raw per\-line output is independently joined into a paragraph string and normalised \(soft\-hyphen strip, en\-dash to em\-dash\) before a lexicon\-anchored ROVER votes across the joined streams, after which a label\-convention normalisation chain runs on the voted result\.

The design principle for the five\-stream ensemble is diversity without retraining\. A useful vote requires that the streams fail on different characters\. We vary three independent axes: language chain \(Maltese alone, Maltese plus Italian, Maltese plus Italian and French\), training data \(fine\-tuned versus stock recogniser\), and image scale \(native versus 2x\-upscaled crop\)\. Each axis changes which characters the recogniser confuses, so the vote recovers characters that any single stream would lose\. That is the main reason to run five streams instead of one; everything else in the pipeline exists to feed or support this vote\.

### 2\.1\.Pipeline

Figure [1](https://arxiv.org/html/2607.00250#S2.F1) shows the five stages: a text source pulls Maltese paragraphs from a large corpus \(Section [3\.1](https://arxiv.org/html/2607.00250#S3.SS1)\); a renderer turns them into paragraph crops with realistic fonts and degradation \(Section [3\.2](https://arxiv.org/html/2607.00250#S3.SS2)\); these crops fine\-tune the Tesseract 5 LSTM \(Section [4](https://arxiv.org/html/2607.00250#S4)\); at inference, five parallel streams each transcribe the image and join per\-line output into a paragraph string, after which a lexicon\-anchored LV\-ROVER vote combines streams per word \(Section [4](https://arxiv.org/html/2607.00250#S4)\); a normalisation chain resolves typographic convention into the final string \(Section [4](https://arxiv.org/html/2607.00250#S4)\)\. No GPU is required\.

![Refer to caption](https://arxiv.org/html/2607.00250v1/x1.png)

Figure 1\. LV\-ROVER pipeline\. Left: offline synthetic training\. Right: per\-image inference\.

## 3\.Synthetic Data Pipeline

With no labelled real Maltese PDF data at usable scale, the synthesis pipeline is the project\. This section describes the choices that matter for transfer to other low\-resource scripts\.

### 3\.1\.Text source

Maltese text is pulled from the korpus\_malti corpus, version 4\.2 \(Micallef et al\., [2022](https://arxiv.org/html/2607.00250#bib.bib17); MLRS, [2026](https://arxiv.org/html/2607.00250#bib.bib18)\), a 467M\-token corpus across 19 domains\. We use eleven diacritised domain configs \(parliamentary records, Wikipedia, government gazette, law, non\-fiction, theses, legal text, speeches, blogs, university repository material, and general web text\) and skip two\. One config has shuffled sentence order, which breaks paragraph\-level coherence; the other is diacritic\-stripped, with canary\-letter density two orders of magnitude below the rest, which would dilute the diacritic signal and leave the model under\-trained on ċ, ġ, h̄, ż\. Both exclusions are data\-quality decisions, not arbitrary ones\. A fallback streamer pulls from Wikipedia mt and the Maltese Universal Dependencies treebank \(both CC\-BY\-SA, ungated\) when the primary corpus is unavailable\. English code\-switch material is mixed at 12 percent from a small clean fixture\. The pull preserves paragraph and sentence ordering by domain\.

### 3\.2\.Renderer

The renderer is a SynthTIGER\-compatible wrapper \(Yim et al\., [2021](https://arxiv.org/html/2607.00250#bib.bib24)\) that uses Pillow for paragraph layout\. Each rendered sample emits an image, a label string, per\-line label parts with an invisible soft\-hyphen marker \(U\+00AD, renders only at an actual line break\) at every line\-break position, and metadata recording per\-line bounding boxes, font, and hyphen kind\.

The most consequential calibration is resolution\. Synthetic\-to\-real domain gaps are a recognised concern in the broader synthetic\-data and scene\-text literature, both in how synthetic training images are generated \(Gupta et al\., [2016](https://arxiv.org/html/2607.00250#bib.bib11)\) and in how inconsistent conditions confound model comparisons \(Baek et al\., [2019](https://arxiv.org/html/2607.00250#bib.bib2)\); our resolution mismatch is a specific, measurable instance of that general class of problem\. An early version of the renderer produced images at 300 DPI without downscaling\. The real benchmark crops have an effective resolution of approximately 150–200 DPI \(mean 42 pixels per line at 10 pt\), so the 300 DPI renders were roughly twice as large as the real crops, which suppressed transfer to real images\. The corrected renderer applies a half\-resolution Lanczos rescale and re\-encodes at JPEG quality 72 to match the real crop statistics, confirmed by a dry run at 39–40 pixels per line against a real value of 42\. This resolution mismatch was the single largest source of synthetic\-to\-real gap in the project; identifying it late contributed to a neural decoder arm \(Section [6](https://arxiv.org/html/2607.00250#S6)\) not reaching a fair comparison\.

PDF\-realistic augmentations are applied as a fixed chain: rotation, blur, brightness/contrast jitter, optional ink bleed and column\-edge crop, mild elastic distortion, salt\-and\-pepper noise, and JPEG re\-encoding\. Later additions cover block noise, page\-edge shadow, and subpixel blur to match scanned\-PDF artefacts\. Scene\-text operations \(perspective warp, motion blur, glare\) are disabled, since they do not occur in PDF crops\. TRDG \(Belval, [2019](https://arxiv.org/html/2607.00250#bib.bib4)\) was evaluated for line crops but not adopted for paragraph synthesis, where SynthTIGER’s layout primitives fit better\.

The renderer also tags every hyphen it draws as soft \(line\-break split\), structural \(clitic\-article surfaces such as il\- and fis\-\), or compound \(word\-internal\)\. This tag lets us measure joiner accuracy on soft hyphens alone, the case the joiner is meant to fix, without contaminating the measurement with structural hyphens, which must be preserved rather than removed\.

### 3\.3\.Font catalogue

Fonts are the second failure point: a font can silently substitute the base Latin letter for a Maltese diacritic at render time, a failure mode reported in Amharic OCR pipelines \(Belay et al\., [2020](https://arxiv.org/html/2607.00250#bib.bib3)\); cross\-script transfer for related low\-resource scripts carries its own, related risks \(Medhanie and Ni, [2026](https://arxiv.org/html/2607.00250#bib.bib16)\)\. We curated 68 faces \(62 printed, 6 handwriting\) under permissive licences \(SIL Open Font License, GUST, Apache 2\.0, DejaVu\), 31 MB on disk, from a larger candidate pool\. Each candidate is checked for character\-map presence \(via fontTools\) of the canary glyphs Ċ ċ Ġ ġ H̄ h̄ Ż ż à ì ò ù before entering the renderer pool; five candidates failed this check, missing one or more of Ċ ċ Ġ ġ H̄ h̄ from their character map, and a further set of candidates were unreachable at download time\. This check catches a glyph that is absent outright; it does not verify that a present glyph renders visually distinct from its ASCII base, so a font that declares a diacritic but renders it identically to the plain letter, the specific risk noted in prior low\-resource OCR work, would not be caught by this check alone\. We treat that as an open gap in the validator, not a solved problem\.
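A character-map presence check of the kind described above might look like the following. It is a sketch under stated assumptions, not the project's validator: the function names are hypothetical, the barred h is assumed to be checked at the precomposed code points U+0126 and U+0127, and `fontTools` is imported lazily so that the pure-Python part runs without it.

```python
# Canary glyphs from the text, written as precomposed code points (assumption).
CANARY_GLYPHS = "ĊċĠġĦħŻżàìòù"


def missing_canaries(codepoints) -> list[str]:
    """Return canary characters whose code point is absent from a character map."""
    present = set(codepoints)
    return [ch for ch in CANARY_GLYPHS if ord(ch) not in present]


def font_missing_canaries(font_path: str) -> list[str]:
    """Load a font file and list the canary characters it has no cmap entry for."""
    from fontTools.ttLib import TTFont  # third-party package: fonttools

    # getBestCmap() maps Unicode code points to glyph names; it may be None
    # when the font carries no Unicode character map at all.
    cmap = TTFont(font_path).getBestCmap() or {}
    return missing_canaries(cmap.keys())
```

A font enters the pool only if the returned list is empty. As the paragraph above notes, this confirms presence only; it says nothing about whether the glyph is drawn differently from its base letter.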

### 3\.4\.Shards

Three batches of synthetic data are materialised on disk; Table [1](https://arxiv.org/html/2607.00250#S3.T1) gives sizes and purposes\.

Table 1\. Synthetic training shards\. Batch 2 also serves as the joiner’s round\-trip test set \(Section [4](https://arxiv.org/html/2607.00250#S4)\)\.

## 4\.Model and Joiner

### 4\.1\.Recogniser and baseline anchor

The recogniser is the Tesseract 5 LSTM\(Smith,[2007](https://arxiv.org/html/2607.00250#bib.bib19); Smith et al\.,[2009](https://arxiv.org/html/2607.00250#bib.bib20)\), fine\-tuned on 50k synthetic paragraph crops\. With the rule\-based joiner, it reaches CER 0\.01605 on the benchmark by itself; we treat this as the internal regression bar against which every later candidate is audited, reporting results relative to it and to the benchmark’s own fine\-tuned Tesseract baseline \(CER 0\.0234\), in that order\. Canary diacritics and the two label\-bearing dashes round\-trip through a 117\-character training inventory\. Structural clitic\-article hyphens \(il\-,is\-,id\-,fis\-, and others\) are never removed by the joiner’s line\-break repair; the image\-only en\-dash is normalised to an em\-dash\. A separate clitic list collapses gold\-side spacing noise at scoring time \(Section[6](https://arxiv.org/html/2607.00250#S6)\)\. Tesseract is deterministic; inference runs on CPU\.

### 4\.2\.LV\-ROVER ensemble

Five Tesseract configurations are voted per word under a soft Maltese lexicon: fine\-tuned Maltese; fine\-tuned Maltese\+Italian; fine\-tuned Maltese\+Italian\+French; stock Maltese; fine\-tuned Maltese\+Italian on a 2x\-upscaled crop\. Each axis produces independent errors: Italian and French supply Latin diacritics that exercise the recogniser differently from a Maltese\-only chain; stock versus fine\-tuned shifts the training prior; upscaling shifts characters across the blur threshold\. When error axes are independent the vote recovers what any single stream loses\. A stream that errors at runtime is dropped; the vote falls back to fewer than five candidates rather than failing the transcription\.

The full procedure is given in Algorithm [1](https://arxiv.org/html/2607.00250#algbox1) \(Appendix [A](https://arxiv.org/html/2607.00250#A1)\)\. The vote adapts LV\-ROVER \(Stuner et al\., [2017](https://arxiv.org/html/2607.00250#bib.bib21)\), an extension of ROVER \(Fiscus, [1997](https://arxiv.org/html/2607.00250#bib.bib10)\)\. The original method aligns the outputs of many recogniser instances and votes per position under a closed\-set lexicon constraint\. We make two changes for the Maltese low\-resource setting\. First, the lexicon is a soft Maltese word\-frequency table rather than a closed set: a candidate word already in the lexicon is never overruled, and an out\-of\-lexicon candidate is replaced only when a strict majority of the five streams agree on an alternative that is itself in the lexicon, is within two characters by edit distance, and does not change any of the four canary characters\. This diacritic\-preservation constraint is the key departure from standard ROVER: without it, a majority of diacritic\-naive streams could vote a correct h̄ down to a plain h\. Second, the vote runs over five structurally different recogniser configurations rather than many instances of one engine, since diversity here comes from language chain and image scale rather than from training\-data resampling, which is unavailable in a low\-resource setting\.
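The replacement rule for a single aligned word can be sketched as below. This is an illustration of the rule as stated in the paragraph above, not Algorithm 1: it assumes the words are already aligned across streams, treats the first stream as the primary candidate, uses a plain set in place of the word-frequency table, and reads "does not change any canary character" as "the sequence of canary letters is identical". All names are hypothetical.

```python
from collections import Counter

CANARIES = set("ċĊġĠħĦżŻ")  # assumption: precomposed code points


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def canary_signature(word: str) -> str:
    """The canary letters of a word, in order."""
    return "".join(ch for ch in word if ch in CANARIES)


def vote_word(candidates: list[str], lexicon: set[str], max_edits: int = 2) -> str:
    """Pick one word from the per-stream candidates for an aligned position."""
    primary = candidates[0]  # assumption: stream 0 is the primary candidate
    if primary in lexicon:
        return primary  # an in-lexicon candidate is never overruled
    alternative, votes = Counter(candidates).most_common(1)[0]
    if (
        votes * 2 > len(candidates)  # strict majority of the surviving streams
        and alternative in lexicon
        and edit_distance(primary, alternative) <= max_edits
        and canary_signature(primary) == canary_signature(alternative)
    ):
        return alternative
    return primary
```

Because the majority test uses `len(candidates)`, the same function still works when a stream has been dropped at runtime and fewer than five candidates arrive.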

### 4\.3\.Joiner

The joiner resolves soft\-versus\-structural hyphens at decode time\(Maltese Language Resource Server,[2025](https://arxiv.org/html/2607.00250#bib.bib15)\)\. A rule\-based joiner with hyphenated\-word repair runs first on the recogniser’s per\-line output\. Its result is then normalised: invisible soft\-hyphen markers \(U\+00AD\) are stripped and every en\-dash is converted to an em\-dash \(U\+2013 is image\-only; gold always uses U\+2014\)\. The worked example below splits a line after an en\-dash and a structural hyphen:

> Image: 0 – Gh̄adha mhux fis\-/seh̄h̄
>
> Gold: 0 — Gh̄adha mhux fis\-seh̄h̄

The image\-rendered en\-dash maps to an em\-dash in the label; the structural fis\- hyphen is preserved across the line wrap; the soft hyphen is removed and the word rejoined\. The two label\-bearing dashes, the ASCII hyphen and the em\-dash, are never normalised against each other; beyond the en\-dash pass, the joiner’s job is purely structural\.

Run unmodified on our synthetic multi\-line samples, the joiner round\-trips 99\.51 percent correctly, and 100 percent of soft\-hyphen samples specifically; the remaining failures cluster on numbered\-bullet line starts\. This accuracy is the reason we did not write a custom joiner from scratch \(Section[6](https://arxiv.org/html/2607.00250#S6)\)\.
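The join-then-normalise order described in this section can be illustrated with a toy stand-in. It is not the rule-based joiner the paper ran: the clitic list is a four-entry illustrative subset, any other line-final hyphen is assumed to be a soft hyphen, and the function names are hypothetical.

```python
SOFT_HYPHEN = "\u00ad"
EN_DASH, EM_DASH = "\u2013", "\u2014"
# Illustrative subset only; a real joiner needs the full clitic-article family.
CLITIC_PREFIXES = {"il-", "is-", "id-", "fis-"}


def join_lines(lines: list[str]) -> str:
    """Join per-line output into one paragraph string."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if not out:
            out = line
            continue
        last_token = out.split()[-1].lower()
        if out.endswith("-") and last_token in CLITIC_PREFIXES:
            out += line  # structural hyphen: keep it and rejoin without a space
        elif out.endswith("-") and len(last_token) > 1:
            out = out[:-1] + line  # assumed soft hyphen: drop it and rejoin
        else:
            out += " " + line
    return out


def normalise(text: str) -> str:
    """Strip soft-hyphen markers and map the image-only en-dash to an em-dash."""
    return text.replace(SOFT_HYPHEN, "").replace(EN_DASH, EM_DASH)
```

Applied to the worked example, `normalise(join_lines(["0 – Għadha mhux fis-", "seħħ"]))` keeps the `fis-` hyphen and turns the en-dash into an em-dash.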

### 4\.4\.Dual\-CER reporting protocol

Every headline CER figure in this paper, including the version chain in Table[2](https://arxiv.org/html/2607.00250#S5.T2), is the NFC\-normalised score described in Section[5\.3](https://arxiv.org/html/2607.00250#S5.SS3), which matches the benchmark’s own \(un\-normalised\) scoring script in practice because NFC is a no\-op on this already\-canonical text\. Internally, when comparing two variants for the statistical audit \(Section[5\.4](https://arxiv.org/html/2607.00250#S5.SS4)\), we additionally compute a second channel that also collapses spacing noise around clitic articles before scoring both sides, which removes a gold\-label artefact that otherwise inflates apparent CER by 4\.3 percentage points on the synthetic validation hyphen bucket \(Section[6](https://arxiv.org/html/2607.00250#S6)\)\. This second channel is used only to decide whether a variant’s improvement survives once that gold\-side noise is removed; it never replaces the headline, benchmark\-faithful number reported elsewhere in this paper\.

## 5\.Evaluation

### 5\.1\.Recognition versus label\-convention gains

The headline figure is CER 0\.00700, a 70 percent reduction from the benchmark baseline \(0\.0234\)\. That number conflates two different sources of improvement, which Table [2](https://arxiv.org/html/2607.00250#S5.T2) traces step by step\.

The recognition gain is attributable to fine\-tuning and ensemble voting, measured before any post\-processing is applied: CER 0\.01317, a 44 percent reduction from the baseline\. This is the like\-for\-like comparison, since both figures are scored under the same convention, so the gap measures only what the recogniser improved\.

The remaining gain, from 0\.01317 to 0\.00700, comes from a five\-stage post\-processing chain, run in this order: lead\-marker normalisation, apostrophe normalisation, positional opening\-quote rule, a diacritic\-restoration vote, and closing\-quote normalisation \(Table[2](https://arxiv.org/html/2607.00250#S5.T2)\)\. Four of these five stages convert Tesseract’s straight quotes and dash formatting to the curly\-quote convention used in the gold labels, and account for most of this remaining 26 percentage points; this is a real gain in the sense that the output now matches what the scorer expects, but it reflects typographic alignment, not recognition\. The fifth stage, the diacritic\-restoration vote, instead recovers canary diacritics \(ċ ġ h̄ ż\) the recogniser dropped, which is a recognition fix, not a convention one, and contributes a further 0\.00035 to the CER drop \(0\.00776 to 0\.00741\) at its position in the chain\. We did not re\-run the chain with this stage moved earlier to isolate a clean recognition\-only number with it included, since the stages are not guaranteed to be order\-independent; we instead report it in place and flag that the 26\-percentage\-point convention figure is not entirely convention\. A reported CER improvement that does not separate recognition from convention overstates how much the model actually learned to read; conversely, a normalisation stage that never gets checked against an independent label convention \(Section[5\.4](https://arxiv.org/html/2607.00250#S5.SS4)\) could just as easily be papering over a real recognition gap\. We treat 44 percent as the portable result and 70 percent as specific to this benchmark’s typographic convention\. 
We have not run the bootstrap audit \(Section [5\.4](https://arxiv.org/html/2607.00250#S5.SS4)\) on the combined post\-processing delta as a single block, only on the ensemble\-expansion stage and on each post\-processing stage individually; Table [2](https://arxiv.org/html/2607.00250#S5.T2) shows that four of the five post\-processing stages do not individually clear the KEEP bar, so the 26\-percentage\-point figure should be read as the chain’s net effect, not as a statistically confirmed gain in its own right\.
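For reference, the two headline percentages follow from the CER values quoted in this section, with the benchmark baseline of 0.0234 as the denominator (the rounding shown here is ours):

$$
1-\frac{0.01317}{0.0234}\approx 0.437,
\qquad
1-\frac{0.00700}{0.0234}\approx 0.701 .
$$

The difference between the two reductions, about 26 percentage points, is the share attributed to the post-processing chain.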

Table 2\. Dev CER chain \(422 paragraphs\)\. Recognition stages fix misread characters; convention stages align to the gold label format\.

### 5\.2\.Experiment summary

Results come from a three\-arm comparison, all evaluated against the fine\-tuned\-Tesseract anchor at CER 0\.01605 \(Section [4](https://arxiv.org/html/2607.00250#S4)\)\. Table [3](https://arxiv.org/html/2607.00250#S5.T3) summarises each arm; the Tesseract ensemble \(arm 2\) is the system reported throughout this paper\.

Table 3\. Experiment arms vs the fine\-tuned Tesseract anchor \(CER 0\.01605\)\.

Arm 1, neural decoders, evaluated TrOCR \(Li et al\., [2023](https://arxiv.org/html/2607.00250#bib.bib14)\) and FasterDAN \(Coquenet et al\., [2023](https://arxiv.org/html/2607.00250#bib.bib7)\) and did not close the synthetic\-to\-real gap in time for this paper\. A contributing factor, identified late, was a rendering\-resolution mismatch \(Section [3\.2](https://arxiv.org/html/2607.00250#S3.SS2)\): synthetic training images were twice the effective pixel density of real benchmark crops\. After correcting the renderer, a single\-shard run of a Pix2Struct\-style model \(Lee et al\., [2023](https://arxiv.org/html/2607.00250#bib.bib13)\) reached CER 0\.312 with no self\-training and no curriculum, a starting point, not a ceiling, though this single uncontrolled run does not by itself isolate resolution from other factors such as training scale and duration \(Section [6](https://arxiv.org/html/2607.00250#S6)\)\. We treat the neural arm as open future work rather than a closed comparison\.

Arm 2, the Tesseract ensemble, is built incrementally from the fine\-tuned anchor: a 3\-stream vote first, then stream expansion and routing, then the label\-convention normalisation chain in Table[2](https://arxiv.org/html/2607.00250#S5.T2)\. The reported system also wires in an optional sixth candidate from a second open\-source OCR engine whenever it loads successfully at runtime; its measured contribution to dev CER falls within the bootstrap noise band, so we do not count it as part of the five\-stream design this paper evaluates, even though it remains active in the shipped system as a low\-risk extra vote\.

Arm 3, length\-conditioned routing \(a separate recogniser for short versus long paragraphs\), was a design option we considered early and gated behind a trigger: build it only if long\-paragraph CER came in more than 1\.5 times worse than short\-paragraph CER under a single length\-balanced model\. That trigger was not met, so we used the single\-model approach throughout and never built the routed variant; arm 3 was deferred by that decision, not attempted and abandoned\.

### 5\.3\.CER computation and stratification

CER is computed with the standard jiwer library \(Vaessen, [2025](https://arxiv.org/html/2607.00250#bib.bib23)\) on NFC\-normalised reference and hypothesis text, aggregated as a sum of edit distances over a sum of reference lengths across paragraphs $p$:

$$
\mathrm{CER}=\frac{\sum_{p}\mathrm{edit}(ref_{p},hyp_{p})}{\sum_{p}\mathrm{len}(ref_{p})}.
$$

This sum\-of\-numerators form is jiwer’s own default aggregation and the common corpus\-level convention; a mean of per\-paragraph CER values is unstable under re\-bucketing when paragraph length varies by a factor of five or more, as it does here\. The benchmark’s own scoring script applies no NFC step; on this dataset NFC normalisation changes CER by about $1\times 10^{-6}$, since the gold and our own output are already in canonical form, so the two scores agree to within $1\times 10^{-3}$ in practice\. We cross\-check our implementation against the benchmark’s own scoring script on every evaluation run and require that agreement\.
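The difference between the corpus-level aggregation and a mean of per-paragraph values is easy to see in a few lines. The sketch below is a dependency-free stand-in for the jiwer call, written only to show the arithmetic; the function names are hypothetical.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def _nfc(texts: list[str]) -> list[str]:
    return [unicodedata.normalize("NFC", t) for t in texts]


def corpus_cer(refs: list[str], hyps: list[str]) -> float:
    """Sum of edit distances over sum of reference lengths, on NFC text."""
    refs, hyps = _nfc(refs), _nfc(hyps)
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)


def mean_paragraph_cer(refs: list[str], hyps: list[str]) -> float:
    """Unweighted mean of per-paragraph CER, shown only for contrast."""
    refs, hyps = _nfc(refs), _nfc(hyps)
    return sum(edit_distance(r, h) / len(r) for r, h in zip(refs, hyps)) / len(refs)
```

With references `["abcd", "ab"]` and hypotheses `["abcf", "xb"]`, each paragraph has one error: the corpus-level score is 2/6, while the per-paragraph mean is (0.25 + 0.5)/2 = 0.375, because the short paragraph is weighted as heavily as the long one.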

We stratify the benchmark across five axes: length quartile; language tag \(Maltese, English, or other, from a wordlist heuristic\); presence of the clitic\-article prefix family; presence of a line\-break hyphen; and presence of an em\-dash \(the only label\-bearing dash besides the ASCII hyphen, since the en\-dash is image\-only and normalised before scoring\)\. Buckets with fewer than 20 paragraphs are flagged as small and excluded from the regression gate, though still reported\.

### 5\.4\.Audit harness

For ensemble\-level decisions, such as expanding from three streams to five, we use a four\-pronged statistical apparatus\. First, a paired bootstrap over per\-paragraph edit\-distance pairs \(Efron and Tibshirani, [1993](https://arxiv.org/html/2607.00250#bib.bib9); Koehn, [2004](https://arxiv.org/html/2607.00250#bib.bib12)\), 1,000 resamples, reporting the 95 percent confidence interval of the CER delta; a positive interval that excludes zero means the variant wins\. Second, a two\-sided permutation test at 1,000 label\-swap draws, as a corroborating significance check\. Third, a 5\-fold cross\-validation on held\-out synthetic data, whose fold\-to\-fold standard deviation lower\-bounds the noise we should expect from any single benchmark\-sized evaluation\. Fourth, a per\-character paired bootstrap over the canary set, with BH\-FDR correction at $\alpha=0.05$ across the character family \(Benjamini and Hochberg, [1995](https://arxiv.org/html/2607.00250#bib.bib5)\), so that promoting a stage never silently trades a diacritic regression for a CER gain\. A variant is marked KEEP only when the global confidence interval excludes zero improvement and no non\-small bucket regresses by more than 0\.005 absolute\.
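The first prong and the KEEP gate can be sketched together. This is an illustration under assumptions, not the audit harness itself: it uses a percentile interval (the text does not say which bootstrap interval is used), defines the delta as baseline CER minus variant CER so that positive means the variant wins, and takes per-bucket deltas as precomputed inputs. All names are hypothetical.

```python
import random


def paired_bootstrap_ci(base_edits, variant_edits, ref_lens,
                        n_resamples=1000, alpha=0.05, seed=0):
    """Percentile CI of (baseline CER - variant CER) over resampled paragraphs."""
    rng = random.Random(seed)
    n = len(ref_lens)
    deltas = []
    for _ in range(n_resamples):
        # Resample paragraph indices with replacement; both systems share them.
        idx = [rng.randrange(n) for _ in range(n)]
        total_len = sum(ref_lens[i] for i in idx)
        base = sum(base_edits[i] for i in idx) / total_len
        variant = sum(variant_edits[i] for i in idx) / total_len
        deltas.append(base - variant)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def is_keep(ci, bucket_deltas, bucket_sizes, min_bucket=20, max_regression=0.005):
    """KEEP only if the CI excludes zero improvement and no non-small bucket regresses."""
    lo, _ = ci
    if lo <= 0:
        return False
    for name, delta in bucket_deltas.items():
        if bucket_sizes[name] >= min_bucket and delta < -max_regression:
            return False
    return True
```

Buckets below `min_bucket` paragraphs are skipped by the gate, matching the small-bucket exclusion described in Section 5.3; they would still be reported separately.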

For the smaller, deterministic post\-processing rules described next \(lead marker through closing quote\), we use a lighter point\-estimate check instead of the full bootstrap apparatus above: a single dev\-CER delta from removing the rule, cross\-checked against held\-out synthetic data \(Section[5\.5](https://arxiv.org/html/2607.00250#S5.SS5)\)\. We did not compute bootstrap confidence intervals for these individual rules, since most have a small effect on this sample size and chaining five bootstrap procedures sequentially would itself need a multiple\-comparisons correction we have not run; this is a limitation we return to in Section[5\.7](https://arxiv.org/html/2607.00250#S5.SS7)\. In Table[2](https://arxiv.org/html/2607.00250#S5.T2), KEEP marks the one stage audited with the full bootstrap apparatus; inside noise marks the deterministic rules, whose verdict reflects only the point\-estimate check, not a confirmed\-significant gain\.

### 5\.5\.Post\-processing rule audit

Every rule in the normalisation chain was tuned with the benchmark dev set in view, so each is a candidate case of overfitting to that set\. We test each rule by removing it and checking whether dev CER drops while held\-out synthetic CER rises, which would indicate the rule fits benchmark idiosyncrasy rather than a real pattern\. This checks for overfitting to dev specifically; it does not establish that a rule’s dev\-CER gain is itself statistically significant, since the held\-out set is generated by the same synthetic pipeline as training data and so cannot stand in for an independent real\-world sample \(Section[5\.7](https://arxiv.org/html/2607.00250#S5.SS7)\)\. The five recogniser streams are run once and cached; rule configurations are then replayed against the cache\. Table[4](https://arxiv.org/html/2607.00250#S5.T4)reports the point\-estimate CER change from each rule, with positive meaning the rule helps, on the benchmark dev set and on two held\-out synthetic sets\.
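The replay-against-cache ablation can be sketched as follows. It is a minimal illustration, assuming the cached, voted stream output is a list of strings and each rule is a string-to-string function applied in order; the function names and the tiny CER helper are hypothetical stand-ins for the project's own code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def corpus_cer(refs: list[str], hyps: list[str]) -> float:
    """Sum of edit distances over sum of reference lengths."""
    return sum(edit_distance(r, h) for r, h in zip(refs, hyps)) / sum(map(len, refs))


def leave_one_out(rules: dict, cached_hyps: list[str], refs: list[str]) -> dict:
    """Return, per rule, CER without that rule minus CER with the full chain.

    A positive value means the rule helps on this set.
    """
    def run(active: list[str]) -> float:
        outputs = []
        for hyp in cached_hyps:
            for name in active:  # rules run in their chain order
                hyp = rules[name](hyp)
            outputs.append(hyp)
        return corpus_cer(refs, outputs)

    names = list(rules)
    full = run(names)
    return {name: run([n for n in names if n != name]) - full for name in names}
```

Running the same function once against dev gold and once against each held-out synthetic set gives the three delta columns needed for the comparison described above; a rule whose delta is positive on dev but negative on the synthetic sets is the overfitting pattern to look for.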

Table 4\. Leave\-one\-out post\-processing ablation \(\+ΔCER = rule helps\)\.

| Rule | Dev | Synth A | Synth B | Verdict |
| --- | --- | --- | --- | --- |
| Lead\-marker norm\. | \+0\.00022 | \+0\.00000 | \+0\.00000 | safe |
| Apostrophe norm\. | \+0\.00496 | *art\.* | *art\.* | safe |
| Opening\-quote pos\. | \+0\.00025 | \+0\.00000 | \+0\.00000 | safe |
| Diacritic\-restore vote | \+0\.00037 | \+0\.00074 | \+0\.00063 | safe |
| Closing\-quote norm\. | \+0\.00000 | *art\.* | *art\.* | safe, inert |

*art\.* = synth gold uses ASCII quotes; those rows are invalid against synth\.

The synthetic labels use ASCII quotes rather than the curly\-quote convention of dev gold\. The apostrophe and closing\-quote rules convert ASCII to curly, so against ASCII\-quoted synthetic gold every correct conversion counts as an error: a label\-convention artefact, not a real regression\. Re\-scoring those two rules under the curly convention collapses their measured deltas to zero and drops full\-pipeline CER on the larger held\-out set from 0\.05695 to 0\.04700\. Roughly one CER point of apparent held\-out error was the synthetic pipeline using the wrong quote convention\. The remaining three rules are unaffected and show a non\-negative delta across all three sets; only the diacritic\-restoration vote is strictly positive on the synthetic sets\.

The lead\-marker rule also surfaced a genuine bug during this audit: an early version rewrote tight numeric ranges and Maltese ordinal markers at paragraph start \(for example turning “19\-20” into “19 \- 20”\), which is wrong\. The fix restricts the rule to fire only when the dash is followed by a letter\-initial token, and we verified it idempotent on all 12 real dev\-gold instances of this marker with zero regressions\. This class of error was caught by the held\-out\-synthetic check in the ablation, not by manual review, which is itself an argument for running this audit rather than trusting hand\-inspection of a post\-processing chain\.
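The guard and the idempotence check can be illustrated as follows. The paper does not give the rule's exact form, so the pattern here is a hypothetical reconstruction: it assumes the rule normalises a paragraph-initial dash to "dash plus one space" and fires only when the next token starts with a letter. The sample strings are invented for the example.

```python
import re

# Hypothetical form of the fixed rule: a dash at paragraph start, any spaces or
# tabs after it, and a lookahead requiring a letter ([^\W\d_] is "word character
# that is neither a digit nor an underscore", i.e. a Unicode letter).
LEAD_MARKER = re.compile(r"^-[ \t]*(?=[^\W\d_])")


def fix_lead_marker(text: str) -> str:
    """Normalise a leading dash marker to '- ' when a letter-initial token follows."""
    return LEAD_MARKER.sub("- ", text, count=1)


samples = [
    "-Il-kelb qabeż",      # marker fused to a word: should be spaced
    "- Il-kelb qabeż",     # already normalised: must stay unchanged
    "19-20 ta' Marzu",     # tight numeric range: must stay unchanged
    "-20 gradi",           # dash followed by a digit: must stay unchanged
]
for s in samples:
    once = fix_lead_marker(s)
    twice = fix_lead_marker(once)
    assert once == twice, "rule is not idempotent"
    print(repr(s), "->", repr(once))
```

Checking that a second application changes nothing is a cheap property test that would apply to any rule in the chain, whatever its actual pattern.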

### 5\.6\.Held\-out test estimate

Both the reported system CER \(0\.00700\) and the benchmark baseline \(0\.0234\) are dev\-set figures; the held\-out test CER is not yet known\. Two values bound our expectation: the recognition\-only CER before label\-convention normalisation, about 0\.013, and the full\-pipeline CER on held\-out synthetic data under the correct quote convention, 0\.047\. This 0\.013–0\.047 bracket is wide enough to span most of the claimed gain over the baseline \(0\.0234\), so we read the headline 0\.00700 figure as optimistic rather than as a settled result, and we expect the test score to land closer to the upper end of this range, depending on the test set’s own typographic convention and difficulty distribution\. The reported system is tuned toward this particular dev set’s label convention, and that caveat should accompany any headline comparison: every paragraph used to report 0\.00700 was also used to tune the post\-processing chain that produces it, and no held\-out real partition currently separates the two roles\.

### 5\.7\.Threats to validity

Several threats apply\. The synthetic\-to\-real gap is the largest source of uncertainty between predicted and observed performance; a 5\-percentage\-point absolute gap between held\-out synthetic and dev CER is our trigger to revisit the renderer\. Held\-out synthetic data is drawn from the same distribution as synthetic training data, which biases its per\-bucket estimates optimistically; cross\-validation bounds the noise band but does not measure the synthetic\-to\-real gap itself\. This same circularity applies to our overfitting check in Section [5\.5](https://arxiv.org/html/2607.00250#S5.SS5): a normalisation rule that “holds” on held\-out synthetic data is only confirmed not to overfit our own renderer’s quirks, not confirmed against an independent real\-world sample\.

The post\-processing chain was tuned through five sequential decisions, each evaluated against the same 422\-paragraph dev set, without a multiple\-comparisons correction across that sequence \(the BH\-FDR correction in Section [5\.4](https://arxiv.org/html/2607.00250#S5.SS4) applies within the canary\-character family, not across these five decisions\)\. Combined with the lack of bootstrap confidence intervals for four of those five stages \(Table [2](https://arxiv.org/html/2607.00250#S5.T2)\), the reported 26\-percentage\-point post\-processing gain should be read as a plausible net effect, not a statistically confirmed one\.
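For reference, a paragraph-level paired bootstrap of the kind that is missing for those stages could look like the sketch below. It is a generic percentile bootstrap, not code from the paper: the function name, the resample count, and the six-paragraph toy data are assumptions made for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_delta(edits_a: Sequence[int], edits_b: Sequence[int],
                           ref_lens: Sequence[int], n_resamples: int = 2000,
                           alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile interval for CER(a) - CER(b), resampling paragraphs with replacement.

    edits_a / edits_b hold per-paragraph edit counts for two configurations
    (for example the chain without and with one stage), scored on the same paragraphs.
    """
    rng = random.Random(seed)
    n = len(ref_lens)
    deltas: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # same indices for both systems
        chars = sum(ref_lens[i] for i in idx)
        cer_a = sum(edits_a[i] for i in idx) / chars
        cer_b = sum(edits_b[i] for i in idx) / chars
        deltas.append(cer_a - cer_b)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data: six paragraphs, invented edit counts (not results from the paper).
ref_lens = [100, 120, 80, 150, 90, 110]
edits_without_stage = [3, 4, 2, 5, 3, 4]
edits_with_stage = [1, 2, 1, 2, 1, 2]
ci = paired_bootstrap_delta(edits_without_stage, edits_with_stage, ref_lens)
print("95% interval for the CER gain:", ci)
```

An interval that excludes zero would support a stage's gain on this sample. Passing a smaller `alpha`, such as 0.05 divided by the number of decisions, is one conventional Bonferroni-style way to account for the five sequential decisions, which the paper notes it has not corrected for.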

Stratification cutoffs and the size of the English\-language bucket are derived from this dev set and are not guaranteed to hold on a test set with a different length distribution\. Because every result here is measured on a CPU\-only, no\-GPU evaluation target, the findings characterise this synthetic\-data, CPU\-bound regime specifically, not OCR performance under unconstrained compute\.

## 6\.Findings

Table [5](https://arxiv.org/html/2607.00250#S6.T5) summarises the four findings; the subsections give the evidence and the generalisation\.

Table 5\. Four findings: expected outcome, what the data showed, and what changed\.

### 6\.1\.The rule\-based joiner needed no patching

We expected the joiner to dominate hyphen\-bucket errors and planned a custom joiner if the rule\-based baseline fell below 95 percent round\-trip accuracy\. It did not: the unmodified joiner round\-trips 99\.51 percent of multi\-line synthetic samples and 100 percent of soft\-hyphen samples\. The remaining 4\.3 percentage points of gap we initially attributed to joiner errors traced instead to gold\-side noise in the source corpus, which occasionally writes a clitic and its host as two space\-separated words where the gold label has them hyphen\-joined\. The fix was the dual\-CER protocol, which normalises this spacing on both sides before scoring, not a joiner rewrite\. The general lesson, from this one case, is to check whether a bucket error is actually in the reference labels before attributing it to a pipeline component\. We have verified this only for Maltese clitics against this one corpus; we would expect the same check to matter for other languages with a productive clitic system, but we have not tested that claim against a second language or corpus\.
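The idea behind scoring twice can be sketched as follows. This is an illustration under stated assumptions, not the paper's dual-CER implementation: the clitic list is a short, incomplete sample of Maltese article and preposition forms, the choice to normalise towards the hyphen-joined spelling is ours, and the function names are hypothetical.

```python
import re
from typing import Tuple

# Illustrative, incomplete clitic list (assumption for this sketch).
CLITICS = ("il", "l", "fil", "tal", "bil", "mill", "għall", "lill", "sal", "mal")
_CLITIC_GAP = re.compile(r"\b(" + "|".join(CLITICS) + r") +(?=\w)", re.IGNORECASE)


def join_clitics(text: str) -> str:
    """Rewrite 'il kelb' as 'il-kelb' so both spellings compare equal."""
    return _CLITIC_GAP.sub(lambda m: m.group(1) + "-", text)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def dual_cer(hyp: str, gold: str) -> Tuple[float, float]:
    """Return (strict CER, CER after clitic-spacing normalisation on both sides)."""
    strict = levenshtein(hyp, gold) / len(gold)
    hyp_n, gold_n = join_clitics(hyp), join_clitics(gold)
    return strict, levenshtein(hyp_n, gold_n) / len(gold_n)


strict, normalised = dual_cer("il kelb fil-ġnien", "il-kelb fil-ġnien")
print(f"strict {strict:.4f}, normalised {normalised:.4f}")
```

Reporting both numbers keeps the strict score available for comparison while showing how much of it is spacing disagreement between the text source and the reference.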

### 6\.2\.Grave accents: no dedicated handling needed

We planned a separate evaluation stratum for grave\-accented vowels \(à ì ò ù\), expecting them to be a distinct error class\. The data made that unnecessary: frequency mass concentrates almost entirely on the single word Ġesù, and the other three graves each appear on a short, fixed word list\. The training character inventory has lowercase graves only, with no è and no uppercase grave forms\. The per\-character confusion matrix already covers them without a dedicated bucket, and the diacritic canary check covers any regression\. The general lesson: for a new low\-resource script, characterise the actual frequency distribution of special characters before building evaluation machinery for an assumed rarity class\.
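Characterising that distribution takes little code. The sketch below counts grave-accented vowels and the words that carry them; the function name and the three-line toy corpus are assumptions for the example, and a real run would stream the training corpus instead.

```python
import re
import unicodedata
from collections import Counter
from typing import Iterable, Tuple

GRAVES = "àèìòù"


def grave_profile(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count grave-accented vowels and the words that contain them."""
    char_counts: Counter = Counter()
    word_counts: Counter = Counter()
    for line in lines:
        line = unicodedata.normalize("NFC", line)    # compose base letter + accent
        for word in re.findall(r"\w+", line):
            hits = [ch for ch in word.lower() if ch in GRAVES]
            if hits:
                char_counts.update(hits)
                word_counts[word] += 1
    return char_counts, word_counts


# Toy corpus, for illustration only.
toy_corpus = ["Ġesù qal", "Ġesù u l-università", "Ġesù"]
char_counts, word_counts = grave_profile(toy_corpus)
print(char_counts.most_common())
print(word_counts.most_common(5))
```

If one or two words account for nearly all occurrences, as the paper reports for Ġesù, a per-character confusion matrix is likely enough and a separate stratum adds little.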

### 6\.3\.Resolution calibration likely mattered most for the neural arm

The renderer’s resolution mismatch \(Section [3\.2](https://arxiv.org/html/2607.00250#S3.SS2)\) was identified only after most of the neural arm’s budget was spent \(Section [5\.2](https://arxiv.org/html/2607.00250#S5.SS2)\), and we cannot fully separate its effect from training scale and duration, which were not held fixed once the renderer was corrected\. The general lesson is procedural rather than architectural: resolution calibration is cheap to check and expensive to discover late, so it should be an early sanity check in any synthetic OCR pipeline, run before committing significant compute to model comparisons\.

### 6\.4\.Synthetic labels initially hid the convention\-alignment gain

Because the synthetic labels used in our held\-out overfitting check \(Section [5\.5](https://arxiv.org/html/2607.00250#S5.SS5)\) use a different quote convention than dev gold, two genuinely helpful normalisation rules first looked like regressions there, and were treated as such for several days before the convention mismatch was found\. The general lesson: if synthetic labels and gold labels use different typographic conventions, the held\-out synthetic signal for convention\-sensitive rules is inverted, and that mismatch should be checked before trusting a held\-out regression gate for any rule that touches punctuation or quotes\.
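A pre-flight check of this kind could be as simple as comparing which quote characters each label set uses. The sketch below is one possible form, with hypothetical function names and toy labels; it covers only the four curly quote code points and the two ASCII ones.

```python
from collections import Counter
from typing import Iterable

ASCII_QUOTES = {"'", '"'}
CURLY_QUOTES = {"\u2018", "\u2019", "\u201c", "\u201d"}


def quote_convention(labels: Iterable[str]) -> str:
    """Classify a label set as 'ascii', 'curly', 'mixed', or 'none'."""
    counts: Counter = Counter()
    for text in labels:
        for ch in text:
            if ch in ASCII_QUOTES:
                counts["ascii"] += 1
            elif ch in CURLY_QUOTES:
                counts["curly"] += 1
    if not counts:
        return "none"
    if counts["ascii"] and counts["curly"]:
        return "mixed"
    return "ascii" if counts["ascii"] else "curly"


def conventions_match(synth_labels: Iterable[str], gold_labels: Iterable[str]) -> bool:
    """True when the two label sets cannot disagree on quote style."""
    a, b = quote_convention(synth_labels), quote_convention(gold_labels)
    return a == b or "none" in (a, b)


synth = ["ta' Malta", 'qal "iva"']                  # toy synthetic labels
gold = ["ta\u2019 Malta", "qal \u201civa\u201d"]    # toy dev gold labels
print(quote_convention(synth), quote_convention(gold), conventions_match(synth, gold))
```

Running a check like this before the ablation would flag the mismatch up front, so a regression gate is not read as evidence against a rule that is only disagreeing with the label convention.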

## 7\.Availability

The full pipeline \- synthetic data generation, training scripts, the LV\-ROVER ensemble, and the audit harness \- is available at [https://github\.com/adamd1985/lv\-rover\-mlt](https://github.com/adamd1985/lv-rover-mlt)\. Fine\-tuned weights and the submission bundle are on Hugging Face at [https://huggingface\.co/radmada/lv\-rover\-mlt](https://huggingface.co/radmada/lv-rover-mlt)\. Both are released under permissive open\-source licences\.

## 8\.Conclusion

The two results that matter are CER 0\.01317 and CER 0\.00700\. The first is what the recogniser achieves; the second is what you get when you also align to the gold label convention\. Conflating them overstates the recognition improvement by a factor of about 1\.6 \(70 percent against 44 percent\) and nearly halves the apparent residual error\. That distinction shaped the whole design: the dual\-CER protocol, the ablation harness, the convention\-aware synth rescoring\.

On the benchmark, the recognition\-only improvement is 44 percent \(CER 0\.0234 to 0\.01317\) and the full\-pipeline improvement is 70 percent \(CER 0\.00700\)\. The 26\-percentage\-point gap is the cost of not decomposing the metric\. A system that reports only the 70 percent figure, scored on the same dev set used to tune it, overstates both its recognition gain and its confidence\.
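Written out, both percentages are relative reductions against the baseline CER of 0.0234, using only the dev-set figures quoted above:

$$
\frac{0.0234 - 0.01317}{0.0234} = \frac{0.01023}{0.0234} \approx 0.437 \quad \text{(recognition only, about 44 percent)}
$$

$$
\frac{0.0234 - 0.00700}{0.0234} = \frac{0.01640}{0.0234} \approx 0.701 \quad \text{(full pipeline, about 70 percent)}
$$

The difference between the two, about 26 percentage points, is the share attributable to the post-processing chain.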

The neural path remains open: a single corrected\-renderer run reached CER 0\.312 with no self\-training, and a full run is the natural next step\. The Tesseract ensemble is the system we report here because it was the one we could fully validate within this paper’s timeline, not because the neural approach was shown to be infeasible\.

Digitising Maltese archival text \- parliamentary records, legal gazettes, historical press \- is a prerequisite for e\-government search and digital heritage access for the language’s 520,000 speakers\. The pipeline and models are released to support that work\.

## References

- Baek et al\. \(2019\) Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee\. 2019\. What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis\. In *Proceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\)*\. 4715–4723\.
- Belay et al\. \(2020\) Birhanu Belay, Tewodros Habtegebrial, Million Meshesha, Marcus Liwicki, Gebeyehu Belay, and Didier Stricker\. 2020\. Amharic OCR: An End\-to\-End Learning\. *Applied Sciences* 10, 3 \(2020\), 1117\. [doi:10\.3390/app10031117](https://doi.org/10.3390/app10031117)
- Belval \(2019\) Edouard Belval\. 2019\. TRDG: Text Recognition Data Generator\. [https://github\.com/Belval/TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator)\. In *GitHub repository*\. Open\-source synthetic text image generator for line crops\.
- Benjamini and Hochberg \(1995\) Yoav Benjamini and Yosef Hochberg\. 1995\. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing\. *Journal of the Royal Statistical Society: Series B \(Methodological\)* 57, 1 \(1995\), 289–300\. [doi:10\.1111/j\.2517\-6161\.1995\.tb02031\.x](https://doi.org/10.1111/j.2517-6161.1995.tb02031.x)
- Borg and Gatt \(2017\) Claudia Borg and Albert Gatt\. 2017\. Morphological Analysis for the Maltese Language: The Challenges of a Hybrid System\. In *Proceedings of the Third Arabic Natural Language Processing Workshop \(WANLP\)*\. Association for Computational Linguistics, Valencia, Spain, 25–34\. [https://aclanthology\.org/W17\-1304](https://aclanthology.org/W17-1304)
- Coquenet et al\. \(2023\) Denis Coquenet, Clément Chatelain, and Thierry Paquet\. 2023\. Faster DAN: Multi\-target Queries with Document Positional Encoding for End\-to\-end Handwritten Document Recognition\. In *International Conference on Document Analysis and Recognition \(ICDAR\)*\. Springer, Cham, 182–199\. [doi:10\.1007/978\-3\-031\-41685\-9\_12](https://doi.org/10.1007/978-3-031-41685-9_12)
- DocEng 2026 Organisers \(2026\) DocEng 2026 Organisers\. 2026\. DocEng 2026 Competition on Maltese Paragraph OCR\. ACM Symposium on Document Engineering\. Competition task description and baselines\.
- Efron and Tibshirani \(1993\) Bradley Efron and Robert J\. Tibshirani\. 1993\. *An Introduction to the Bootstrap*\. Number 57 in Monographs on Statistics and Applied Probability\. Chapman and Hall/CRC, New York\. [doi:10\.1201/9780429246593](https://doi.org/10.1201/9780429246593)
- Fiscus \(1997\) Jonathan G\. Fiscus\. 1997\. A Post\-Processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction \(ROVER\)\. In *Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding \(ASRU\)*\. IEEE, Santa Barbara, CA, 347–354\. [doi:10\.1109/ASRU\.1997\.659110](https://doi.org/10.1109/ASRU.1997.659110)
- Gupta et al\. \(2016\) Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman\. 2016\. Synthetic Data for Text Localisation in Natural Images\. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\)*\. 2315–2324\.
- Koehn \(2004\) Philipp Koehn\. 2004\. Statistical Significance Tests for Machine Translation Evaluation\. In *Proceedings of the Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*\. Association for Computational Linguistics, Barcelona, Spain, 388–395\. [https://aclanthology\.org/W04\-3250](https://aclanthology.org/W04-3250)
- Lee et al\. \(2023\) Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming\-Wei Chang, and Kristina Toutanova\. 2023\. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding\. In *Proceedings of the International Conference on Machine Learning \(ICML\)*\. PMLR, 18893–18912\.
- Li et al\. \(2023\) Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei\. 2023\. TrOCR: Transformer\-based Optical Character Recognition with Pre\-trained Models\. In *Proceedings of the AAAI Conference on Artificial Intelligence*\. AAAI Press, Washington, DC, 13094–13102\. [doi:10\.1609/aaai\.v37i11\.26538](https://doi.org/10.1609/aaai.v37i11.26538)
- Maltese Language Resource Server \(2025\) Maltese Language Resource Server\. 2025\. malti: A Python library for processing text in the Maltese language\. [https://github\.com/MLRS/malti](https://github.com/MLRS/malti)\. Source inspected May 2026\.
- Medhanie and Ni \(2026\) Yonatan Haile Medhanie and Yuanhua Ni\. 2026\. Adapting TrOCR for Printed Tigrinya Text Recognition: Word\-Aware Loss Weighting for Cross\-Script Transfer Learning\. *arXiv preprint arXiv:2604\.20813* \(2026\)\.
- Micallef et al\. \(2022\) Kurt Micallef, Albert Gatt, Marc Tanti, Lonneke van der Plas, and Claudia Borg\. 2022\. Pre\-training Data Quality and Quantity for a Low\-Resource Language: New Corpus and BERT Models for Maltese\. In *Proceedings of the Third Workshop on Deep Learning for Low\-Resourced Natural Language Processing \(DeepLo\)*\. Association for Computational Linguistics, Dublin, Ireland, 90–101\. [https://aclanthology\.org/2022\.deeplo\-1\.10](https://aclanthology.org/2022.deeplo-1.10)
- MLRS \(2026\) MLRS\. 2026\. MLRS/korpus\_malti: General Maltese corpus, version 4\.2\. [https://huggingface\.co/datasets/MLRS/korpus\_malti](https://huggingface.co/datasets/MLRS/korpus_malti)\. HuggingFace dataset card\.
- Smith \(2007\) Ray Smith\. 2007\. An Overview of the Tesseract OCR Engine\. In *Proceedings of the Ninth International Conference on Document Analysis and Recognition \(ICDAR\)*\. IEEE, Curitiba, Brazil, 629–633\. [doi:10\.1109/ICDAR\.2007\.4376991](https://doi.org/10.1109/ICDAR.2007.4376991)
- Smith et al\. \(2009\) Ray Smith, Daria Antonova, and Dar\-Shyang Lee\. 2009\. Adapting the Tesseract Open Source OCR Engine for Multilingual OCR\. In *Proceedings of the International Workshop on Multilingual OCR*\. ACM, Barcelona, Spain, 1–8\. [doi:10\.1145/1577802\.1577804](https://doi.org/10.1145/1577802.1577804)
- Stuner et al\. \(2017\) Bruno Stuner, Clément Chatelain, and Thierry Paquet\. 2017\. LV\-ROVER: Lexicon Verified Recognizer Output Voting Error Reduction\. arXiv:1707\.07432\.
- Tanti et al\. \(2023\) Marc Tanti, Claudia Borg, and Albert Gatt\. 2023\. NOMOCRAT: New Open Maltese OCR Annotated Text\. [https://www\.systemsandcontrol\.com/post/nomocrat\-new\-open\-maltese\-ocr\-annotated\-text](https://www.systemsandcontrol.com/post/nomocrat-new-open-maltese-ocr-annotated-text)\. Maltese Language Resource Server / SystemsAndControl project; no formal proceedings venue identified\.
- Vaessen \(2025\) Nik Vaessen\. 2025\. jiwer: Evaluate automatic speech recognition systems\. [https://github\.com/jitsi/jiwer](https://github.com/jitsi/jiwer)\. v4\.0\.0, Python package for CER/WER computation\.
- Yim et al\. \(2021\) Moonbin Yim, Yoonsik Kim, Han\-Cheol Cho, and Sungrae Park\. 2021\. SynthTIGER: Synthetic Text Image Generator Towards Better Text Recognition Models\. In *International Conference on Document Analysis and Recognition \(ICDAR\)*\. Springer, Cham, 109–124\. [doi:10\.1007/978\-3\-030\-86337\-1\_8](https://doi.org/10.1007/978-3-030-86337-1_8)

## Appendix A LV\-ROVER Inference Algorithm

Algorithm [1](https://arxiv.org/html/2607.00250#algbox1) gives the full per\-image inference procedure\. Notation: $S_j$ is the line\-split output of stream $j$; $W_j$ is its word\-tokenised form; $L$ is the soft Maltese word\-frequency lexicon; $\mathrm{ed}(\cdot,\cdot)$ is Levenshtein edit distance; $\mathrm{canary}(w)$ returns the diacritic characters \{ċ, ġ, ħ, ż\} present in $w$\.

Algorithm 1\. LV\-ROVER per\-image inference\.

Input: image $I$, lexicon $L$, five Tesseract configs

Output: paragraph string $s$

Stage 1: multi\-stream recognition

- for $j \in \{1, \ldots, 5\}$ do
  - $I_j \leftarrow \mathrm{Upscale}(I, 2\times)$ if $j = 5$, else $I_j \leftarrow I$
  - $S_j \leftarrow \mathrm{Tesseract}(I_j, \mathrm{config}_j)$ \(drop stream on error\)
- end for

Stage 2: anchor selection

- $j^{*} \leftarrow 2$ \(default: mlt\+ita fine\-tuned\)
- if $|W_{j^{*}}| < 0.7 \cdot \max_j |W_j|$ then
  - $j^{*} \leftarrow \arg\max_j |W_j|$
- end if

Stage 3: word\-level ROVER vote

- Align all $W_j$ to anchor $W_{j^{*}}$ \(edit\-distance\)
- for each position $p$ do
  - $a \leftarrow W_{j^{*}}[p]$
  - if $a \in L$ then
    - $\hat{W}[p] \leftarrow a$ \(lexicon\-valid, keep\)
  - else
    - $C \leftarrow \{w \in L : \mathrm{ed}(w, a) \leq 1,\ \mathrm{canary}(w) = \mathrm{canary}(a)\}$
    - $b \leftarrow \arg\max_{c \in C} \mathrm{freq}_L(c)$
    - if $|\{j : W_j[p] = b\}| > \lfloor 5/2 \rfloor$ then
      - $\hat{W}[p] \leftarrow b$ \(majority alt\)
    - else
      - $\hat{W}[p] \leftarrow a$ \(keep anchor\)
    - end if
  - end if
- end for

Stage 4: post\-processing \(v16–v20\)

- $t \leftarrow \mathrm{Join}(\hat{W})$
- $t \leftarrow \mathrm{FixLeadMarker}(t)$ \(v16: dash\+letter token at para start\)
- $t \leftarrow \mathrm{CurlApostrophe}(t)$ \(v17: ' $\to$ U\+2019\)
- $t \leftarrow \mathrm{OpenQuote}(t)$ \(v18: U\+2018/U\+201C at clause open\)
- $t \leftarrow \mathrm{DiacriticVote}(t, W_{1..5})$ \(v19: majority diacritic form\)
- $t \leftarrow \mathrm{CurlDoubleQuote}(t)$ \(v20: " $\to$ U\+201C/U\+201D\)

Stage 5: line joining

- Strip soft hyphens \(U\+00AD\); map U\+2013 $\to$ U\+2014
- $s \leftarrow \mathrm{RBLineJoiner}(t,\ \texttt{fix\_hyphenated\_words=True})$
- return $s$
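Stages 2 and 3 of Algorithm 1 can be sketched in Python as below. This is an illustration under simplifying assumptions rather than the released implementation: the streams are taken to be already aligned to the anchor (one word per position, `None` for a gap), `canary` returns the diacritics in order of appearance, an empty candidate set falls back to the anchor word, and the lexicon and the five toy streams are invented for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

CANARY_CHARS = set("ċġħż")   # the diacritic set named in the notation
N_STREAMS = 5                # majority threshold is floor(5 / 2), as in Algorithm 1


def canary(word: str) -> Tuple[str, ...]:
    """Diacritic characters present in the word, in order of appearance."""
    return tuple(ch for ch in word if ch in CANARY_CHARS)


def within_one_edit(a: str, b: str) -> bool:
    """True if the Levenshtein distance between a and b is at most 1."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # exactly one substitution
    return a[i:] == b[i + 1:]            # exactly one insertion into a


def select_anchor(streams: Sequence[Sequence[str]], default: int = 1,
                  ratio: float = 0.7) -> int:
    """Stage 2: keep the default stream (index 1, i.e. j* = 2) unless it is short."""
    longest = max(range(len(streams)), key=lambda j: len(streams[j]))
    if len(streams[default]) < ratio * len(streams[longest]):
        return longest
    return default


def vote(aligned: Sequence[Sequence[Optional[str]]], anchor: int,
         lexicon: Dict[str, int]) -> List[str]:
    """Stage 3: lexicon-verified word vote over streams pre-aligned to the anchor."""
    out: List[str] = []
    for p, a in enumerate(aligned[anchor]):
        if a in lexicon:
            out.append(a)                # lexicon-valid, keep
            continue
        cands = [w for w in lexicon
                 if within_one_edit(w, a) and canary(w) == canary(a)]
        if cands:
            b = max(cands, key=lambda w: lexicon[w])
            support = sum(1 for s in aligned if s[p] == b)
            if support > N_STREAMS // 2:
                out.append(b)            # majority alternative
                continue
        out.append(a)                    # keep anchor
    return out


toy_lexicon = {"kelb": 120, "kelma": 300, "ġnien": 50}   # word -> frequency
toy_streams = [
    ["kelb", "ġnien"],
    ["kelh", "ġnien"],   # default anchor with one misread word
    ["kelb", "ġnien"],
    ["kelb", "gnien"],
    ["kelh", "ġnien"],
]
j_star = select_anchor(toy_streams)
result = vote(toy_streams, j_star, toy_lexicon)
print(j_star, result)
```

The canary condition means this vote never adds or removes a diacritic: an anchor word such as "gnien" is not replaced by "ġnien" here, which leaves diacritic recovery to the separate v19 stage.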
