Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems

arXiv cs.CL Papers

Summary

Proposes a non-autoregressive scoring method for punctuation restoration in streaming ASR that preserves the input transcript and outperforms prompt-based and fine-tuned baselines under a limited lookahead budget.

arXiv:2606.05179v1 Announce Type: new Abstract: Punctuation restoration improves ASR (Automatic Speech Recognition) readability. However streaming ASR requires online decisions with limited future context. In streaming ASR, the system predicts punctuation incrementally, which makes generation-based approaches prone to latency and alignment failures under boundary-wise evaluation. This paper proposes a non-autoregressive scoring method (no free-form generation) that preserves the input transcript and makes a decision at each word boundary. Our method compares punctuation insertion hypotheses against a no-insertion baseline under a bounded K-subword-token lookahead, and calibrates decisions using a weight {\alpha} and a validation-calibrated threshold {\tau} (no parameter updates during inference). On IWSLT 2017, our scoring method achieves a 4-class macro F1 of 0.893 in the no fine-tuning setting (validation-calibrated, K=2) and 0.937 after fine-tuning (K=2), outperforming the prompt-based baseline (0.566) and a fine-tuned ELECTRA baseline (0.913) under the same lookahead budget. We analyze the impact of the lookahead budget through ablation studies on K.

# Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems
Source: [https://arxiv.org/html/2606.05179](https://arxiv.org/html/2606.05179)
###### Abstract

Punctuation restoration improves ASR (Automatic Speech Recognition) readability. However, streaming ASR requires online decisions with limited future context. In streaming ASR, the system predicts punctuation incrementally, which makes generation-based approaches prone to latency and alignment failures under boundary-wise evaluation. This paper proposes a non-autoregressive scoring method (no free-form generation) that preserves the input transcript and makes a decision at each word boundary. Our method compares punctuation insertion hypotheses against a no-insertion baseline under a bounded $K$-subword-token lookahead, and calibrates decisions using a weight $\alpha$ and a validation-calibrated threshold $\tau$ (no parameter updates during inference). On IWSLT 2017 \[[1](https://arxiv.org/html/2606.05179#bib.bib1)\], our scoring method achieves a 4-class macro F1 of 0.893 in the no-fine-tuning setting (validation-calibrated, $K=2$) and 0.937 after fine-tuning ($K=2$), outperforming the prompt-based baseline (0.566) and a fine-tuned ELECTRA \[[2](https://arxiv.org/html/2606.05179#bib.bib2)\] baseline (0.913) under the same lookahead budget. We analyze the impact of the lookahead budget through ablation studies on $K$.

## I Introduction

Automatic Speech Recognition \(ASR\) has become a fundamental component in applications ranging from voice assistants\[[3](https://arxiv.org/html/2606.05179#bib.bib3),[4](https://arxiv.org/html/2606.05179#bib.bib4)\]to on\-device voice control\[[5](https://arxiv.org/html/2606.05179#bib.bib5)\]\.

Despite significant advances in word error rates, many ASR systems output unpunctuated word sequences, which reduces readability and makes sentence boundaries ambiguous, affecting downstream text processing such as sentence splitting and translation. Punctuation restoration is therefore essential. In a streaming setting, however, the system must decide punctuation online before future words arrive; waiting for more context can improve accuracy but increases latency.

Large Language Models (LLMs) can restore punctuation when prompted to regenerate a fully formatted sentence. However, this formulation is not well aligned with the standard punctuation restoration setup, which assumes a fixed input transcript and predicts one punctuation label at each word boundary. As shown in Fig. [3](https://arxiv.org/html/2606.05179#S4.F3), prompt-based generation may rewrite the transcript rather than preserve the original word sequence. This breaks boundary alignment with the fixed input transcript, even if the overall meaning remains similar: under boundary-wise evaluation, even small edits shift the alignment between the generated output and the original transcript, causing cascading label mismatches and a significant drop in punctuation F1. Furthermore, generation can exhibit formatting drift such as repeated punctuation marks or missing sentence-final symbols. It also incurs high latency in streaming settings because it requires autoregressive decoding over the entire sequence.

To address this mismatch, we introduce a non-autoregressive scoring method (no free-form generation) for streaming punctuation restoration. Rather than generating a new sentence, the proposed method preserves the input word sequence and makes an online decision at each word boundary by comparing a no-insertion hypothesis with punctuation-insertion hypotheses (comma, period, question mark). The formal definition of the history and bounded future context used for boundary-wise decisions is presented in Section III-A.

Using a pre-trained Llama-3.2-1B model \[[6](https://arxiv.org/html/2606.05179#bib.bib6),[7](https://arxiv.org/html/2606.05179#bib.bib7)\] without fine-tuning, experiments show that hypothesis scoring with a fixed $K$-subword-token lookahead achieves strong macro F1 on the IWSLT 2017 dataset. This study focuses on a compact 1B-parameter LLM to study bounded-lookahead punctuation scoring under streaming constraints, where model footprint and limited future context are important considerations. While the proposed method is designed for streaming-compatible inference, system-level deployment metrics such as latency and memory are left for future work. On IWSLT 2017, our scoring method achieves a 4-class macro F1 of 0.893 in the no-fine-tuning setting ($K=2$) and 0.937 after fine-tuning ($K=2$), outperforming the prompt-based baseline (0.566) and a fine-tuned ELECTRA baseline (0.913) under the same lookahead budget.

From a deployment perspective, the proposed formulation provides a predictable latency budget: each boundary decision waits for at most $K$ subword tokens and scores only a fixed set of actions. This yields stable per-boundary computation and avoids the variable decoding length of autoregressive generation. In addition, because the transcript is never rewritten, boundary-wise evaluation remains well-defined and directly comparable across systems. These properties make the approach practical for real-time captioning pipelines where runtime stability and alignment reliability are as important as raw F1.

Our contributions are as follows. (1) This paper proposes a non-autoregressive, streaming-compatible scoring framework (no free-form generation) for boundary-wise punctuation decisions under a bounded $K$-subword-token lookahead. (2) Scoring-based inference avoids transcript drift and alignment failures of prompt-based generation, and achieves strong performance on IWSLT 2017 with a compact 1B-parameter LLM (no-fine-tuning and fine-tuned). Data and code are publicly available at: https://github.com/woomook0524/LLM-Scoring

## II Related Work

### II-A Offline punctuation restoration with bidirectional context

Punctuation restoration has been widely formulated as a sequence labeling problem over ASR transcripts, where leveraging broader context typically improves prediction accuracy \[[8](https://arxiv.org/html/2606.05179#bib.bib8),[9](https://arxiv.org/html/2606.05179#bib.bib9)\]. Tilk and Alumäe \[[10](https://arxiv.org/html/2606.05179#bib.bib10)\] introduced an LSTM-based punctuation restoration model for speech transcripts, showing that modeling longer contextual dependencies improves punctuation prediction. They later proposed a bidirectional recurrent model with attention \[[11](https://arxiv.org/html/2606.05179#bib.bib11)\], enabling the model to exploit both left and future context more effectively. More recently, pre-trained bidirectional Transformer encoders have become a strong backbone for punctuation prediction, with Courtland et al. \[[12](https://arxiv.org/html/2606.05179#bib.bib12)\] reporting both accuracy gains and efficient inference enabled by parallel computation. BERT-based \[[13](https://arxiv.org/html/2606.05179#bib.bib13)\] taggers have also been explored extensively; for example, Makhija et al. \[[14](https://arxiv.org/html/2606.05179#bib.bib14)\] combine contextual BERT representations with a BiLSTM-CRF classifier for sequence labeling. A recurring practical issue is class imbalance, where the non-punctuation label dominates, and Yi et al. \[[15](https://arxiv.org/html/2606.05179#bib.bib15)\] address this using focal loss to emphasize harder examples. These offline models generally assume access to full-sentence context, which motivates separate lines of work that explicitly constrain future context for streaming deployment.

### II-B Streaming/online punctuation restoration under bounded future context

While offline punctuation taggers can leverage full-sentence bidirectional context, streaming ASR imposes strict latency constraints that limit how much future context can be used at inference time \[[16](https://arxiv.org/html/2606.05179#bib.bib16)\]. This has motivated online restoration methods that explicitly operate with a bounded future window and quantify the resulting accuracy-latency trade-off. Polacek et al. \[[17](https://arxiv.org/html/2606.05179#bib.bib17),[18](https://arxiv.org/html/2606.05179#bib.bib18)\] propose a lightweight text-only punctuation and capitalization restoration module designed for live captioning, and evaluate it under a small fixed future-context budget (i.e., a 4-word lookahead) to meet real-time constraints. Their results show that using only one future word can be insufficient, whereas a short lookahead of a few words (e.g., 4 words) already recovers most of the offline performance, suggesting that bounded future context is a practical and effective design choice for streaming deployment. This line of work supports our setting in which the system decides punctuation online under a fixed lookahead budget, and motivates using a small, controllable lookahead window to balance punctuation accuracy and streaming latency.

### II-C LLM-based punctuation restoration and generation-related issues

Recently, LLMs have been explored for punctuation restoration via prompt-based regeneration of a fully formatted sentence. However, Zhong and Sun \[[19](https://arxiv.org/html/2606.05179#bib.bib19)\] observe several practical issues in this setting: LLMs tend to repeatedly use the same punctuation marks, may alter input text tokens, and incur high computational costs. To improve generation efficiency and mitigate hallucination, Pang et al. \[[20](https://arxiv.org/html/2606.05179#bib.bib20)\] propose forward-pass-only decoding (FPOD) for punctuation restoration. Notably, their approach builds on a task-adapted Llama model obtained via parameter-efficient fine-tuning (e.g., LoRA) and keeps the fine-tuned model frozen during decoding. In contrast, the proposed method does not require fine-tuning: an off-the-shelf 1B LLM can be used purely as a scoring function that keeps the input word sequence fixed and makes online insertion decisions under a bounded lookahead budget.

![Refer to caption](https://arxiv.org/html/2606.05179v1/fig0.png)

Figure 1: Decision boundary at index $i$; punctuation is predicted using prefix $w_{1:i}$ and $K$-token lookahead $w_{i+1:i+K}$.

## III Proposed Method: Weighted Lookahead Scoring

We treat punctuation restoration in streaming-constrained settings as bounded-lookahead hypothesis testing. Instead of free-form generation, the LLM is used as a token-level scoring engine over a small candidate set, which preserves the input transcript and reduces decoding overhead.

### III-A Problem Formulation

In a streaming ASR scenario, the input arrives as a continuous sequence of words. Our objective is to determine, at each boundary between $w_i$ and $w_{i+1}$, whether to insert a punctuation mark $a \in \mathcal{P} = \{\texttt{COMMA}, \texttt{PERIOD}, \texttt{QMARK}, \emptyset\}$. We denote the no-insertion action as $a = \emptyset$, which corresponds to the label O in evaluation.

Let $\mathbf{w} = [w_1, w_2, \ldots, w_T]$ denote the LLM subword tokens and let $i$ index boundary positions. At boundary $i$, we condition on the token prefix $w_{1:i}$ and the $K$-token lookahead $w_{i+1:i+K}$ (truncated near the sentence end when $i+K > T$). We reset the prefix at each sentence boundary (i.e., indexing restarts from $1$ for every new sentence).

Our goal is to choose the optimal hypothesis $\hat{a}_i \in \mathcal{P}$ given $(w_{1:i}, w_{i+1:i+K})$. Unlike prompt-based generation, the proposed method uses the LLM purely as a scorer over a closed set of candidates, which prevents transcript drift.
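The context available at a boundary can be illustrated with a short sketch. It assumes the sentence is already a list of tokens and uses a hypothetical helper name, `boundary_context`; it only shows how the prefix and the truncated $K$-token lookahead are sliced, not how the authors' code is organized.

```python
from typing import List, Tuple


def boundary_context(tokens: List[str], i: int, k: int) -> Tuple[List[str], List[str]]:
    """Return (prefix w_{1:i}, lookahead w_{i+1:i+K}) for a 1-indexed boundary i."""
    if not 1 <= i <= len(tokens):
        raise ValueError("boundary index out of range")
    prefix = tokens[:i]            # w_1 .. w_i
    lookahead = tokens[i:i + k]    # w_{i+1} .. w_{i+K}; slicing truncates when i + K > T
    return prefix, lookahead


# Toy sentence (illustrative tokens, not actual Llama subwords).
tokens = ["how", "are", "you", "today"]
for i in range(1, len(tokens) + 1):
    print(i, boundary_context(tokens, i, k=2))
```

At the last boundary the lookahead is empty, which corresponds to the truncation case $i+K > T$ described above.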

### III-B Theoretical Justification

At each word boundary, the method selects a punctuation action $a \in \mathcal{P}$ that best explains the bounded future suffix given the prefix:

$$\hat{a}_i = \operatorname*{arg\,max}_{a \in \mathcal{P}} P(a \mid w_{1:i}, w_{i+1:i+K}) \tag{1}$$

Motivated by Bayes' rule, the posterior can be written as

$$P(a \mid w_{1:i}, w_{i+1:i+K}) = \frac{P(w_{i+1:i+K} \mid w_{1:i}, a)\, P(a \mid w_{1:i})}{P(w_{i+1:i+K} \mid w_{1:i})} \tag{2}$$

where $P(a \mid w_{1:i})$ captures local punctuation preference (prior) and $P(w_{i+1:i+K} \mid w_{1:i}, a)$ measures how consistent the bounded future suffix is under inserting $a$ (likelihood). Since the evidence term $P(w_{i+1:i+K} \mid w_{1:i})$ does not depend on $a$, maximizing the posterior in Eq. ([1](https://arxiv.org/html/2606.05179#S3.E1)) is equivalent to maximizing the product $P(w_{i+1:i+K} \mid w_{1:i}, a)\, P(a \mid w_{1:i})$.

The Bayes decomposition motivates an unweighted log-posterior objective. In practice, we use Eq. ([3](https://arxiv.org/html/2606.05179#S3.E3)) as a calibrated surrogate objective with weight $\alpha$, tuned on the validation set to balance local prior preference and bounded lookahead evidence under streaming constraints.
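As a worked step connecting Eq. (2) to the weighted score, taking logarithms of Eq. (2) gives (this expansion is our own restatement of the text, not an equation from the paper):

$$\log P(a \mid w_{1:i}, w_{i+1:i+K}) = \log P(a \mid w_{1:i}) + \log P(w_{i+1:i+K} \mid w_{1:i}, a) - \log P(w_{i+1:i+K} \mid w_{1:i})$$

The last term is constant in $a$ and can be dropped, leaving the unweighted objective $\log P(a \mid w_{1:i}) + \log P(w_{i+1:i+K} \mid w_{1:i}, a)$. Eq. (3) replaces the unit weights $(1, 1)$ with $(\alpha, 1-\alpha)$. For $\alpha = 1/2$ the weighted score is exactly half of the unweighted objective, so the arg max is unchanged and only the margin against the no-insertion action is rescaled; other values of $\alpha$ depart from the Bayes form, which is why the text calls Eq. (3) a calibrated surrogate.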

![Refer to caption](https://arxiv.org/html/2606.05179v1/fig2_replacement.png)

Figure 2: Overall architecture of the proposed scoring-based punctuation restoration pipeline under bounded lookahead. At boundary $i$, the prefix $w_{1:i}$ and lookahead $w_{i+1:i+K}$ are used to score each action $a \in \{\mathrm{COMMA}, \mathrm{PERIOD}, \mathrm{QMARK}, \emptyset\}$; a threshold gate then outputs either the best punctuation or no insertion.

Figure [2](https://arxiv.org/html/2606.05179#S3.F2) summarizes three stages: context construction from the fixed transcript, LLM scoring for each candidate action, and thresholded boundary-wise output selection. This process preserves the original transcript and returns exactly one action per boundary.

Algorithm 1: Weighted Lookahead Decision

Input: Prefix $w_{1:i}$, lookahead subword tokens $w_{i+1:i+K}$

Output: Optimal punctuation $\hat{a}_i$

1. $L_i(\emptyset) \leftarrow \log P(w_{i+1:i+K} \mid w_{1:i}, \emptyset)$
2. $S_i(\emptyset) \leftarrow \alpha \log P(\emptyset \mid w_{1:i}) + (1-\alpha)\, L_i(\emptyset)$
3. $a_i^{*} \leftarrow \emptyset$; $S_{\max} \leftarrow -\infty$
4. for $a \in \{\mathrm{COMMA}, \mathrm{PERIOD}, \mathrm{QMARK}\}$ do
   1. $L_i(a) \leftarrow \log P(w_{i+1:i+K} \mid w_{1:i}, a)$
   2. $S_i(a) \leftarrow \alpha \log P(a \mid w_{1:i}) + (1-\alpha)\, L_i(a)$
   3. if $S_i(a) > S_{\max}$ then $S_{\max} \leftarrow S_i(a)$; $a_i^{*} \leftarrow a$
5. $\Delta_i \leftarrow S_{\max} - S_i(\emptyset)$
6. if $\Delta_i > \tau$ then $\hat{a}_i \leftarrow a_i^{*}$, else $\hat{a}_i \leftarrow \emptyset$
7. return $\hat{a}_i$

### III-C Weighted Lookahead Scoring Function

At boundary $i$, the token prefix is $w_{1:i}$ and the $K$-token lookahead is $w_{i+1:i+K}$:

$$S_i(a) = \alpha \log P(a \mid w_{1:i}) + (1-\alpha) \log P(w_{i+1:i+K} \mid w_{1:i}, a) \tag{3}$$

The lookahead log-probability term is computed in practice by summing token log-likelihoods over the bounded window $w_{i+1:i+K}$ via chain-rule factorization.

Here, $\alpha \in [0,1]$ is a scoring weight coefficient that balances the local prior term $\log P(a \mid w_{1:i})$ and the bounded lookahead term. A larger $\alpha$ emphasizes local punctuation preference (more conservative insertion), whereas a smaller $\alpha$ relies more on lookahead evidence (more aggressive insertion). Because the two terms can have different scales depending on the lookahead budget $K$ and recognition noise, $\alpha$ also serves as a simple calibration knob.

At inference time, $S_i(a)$ is computed at every word boundary $i$ using LLM log-likelihoods under a bounded $K$-subword-token lookahead. In our current evaluation, prefix reset uses reference sentence boundaries (oracle segmentation) as a controlled setting.
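The lookahead term in Eq. (3) is a chain-rule sum of per-token log-likelihoods. The sketch below makes that sum concrete. The scorer interface `logprob_fn(context, token)` and the toy probability table are assumptions for illustration; a real system would read these values from the LLM's next-token distribution.

```python
import math
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: returns log P(token | context).
LogProbFn = Callable[[Sequence[str], str], float]


def lookahead_logprob(prefix: Sequence[str], action: Optional[str],
                      lookahead: Sequence[str], logprob_fn: LogProbFn) -> float:
    """log P(w_{i+1:i+K} | w_{1:i}, a) as a sum of token log-likelihoods."""
    context: List[str] = list(prefix)
    if action is not None:          # None encodes the no-insertion action
        context.append(action)
    total = 0.0
    for tok in lookahead:
        total += logprob_fn(context, tok)   # log P(tok | everything before it)
        context.append(tok)                 # extend the context for the next factor
    return total


# Toy stand-in for the LLM: probability depends only on the previous token.
TOY_TABLE = {(".", "so"): 0.5, ("you", "so"): 0.05}


def toy_logprob(context: Sequence[str], token: str) -> float:
    prev = context[-1] if context else "<s>"
    return math.log(TOY_TABLE.get((prev, token), 0.1))


prefix, lookahead = ["thank", "you"], ["so", "much"]
print(lookahead_logprob(prefix, ".", lookahead, toy_logprob))
print(lookahead_logprob(prefix, None, lookahead, toy_logprob))
```

With this toy table the period hypothesis explains the lookahead better than no insertion, which is the kind of evidence the $(1-\alpha)$ term contributes.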

$$a_i^{*} = \operatorname*{arg\,max}_{a \in \mathcal{P} \setminus \{\emptyset\}} S_i(a), \qquad \Delta_i = S_i(a_i^{*}) - S_i(\emptyset) \tag{4}$$

$$\hat{a}_i = \begin{cases} a_i^{*} & \text{if } \Delta_i > \tau \\ \emptyset & \text{otherwise} \end{cases} \tag{5}$$

This rule first selects the best non-empty candidate and then compares it explicitly against the no-insertion baseline through the margin $\Delta_i$. Punctuation is inserted only when $\Delta_i$ exceeds the validation-calibrated threshold $\tau$; otherwise, the model outputs $\emptyset$. The threshold $\tau$ acts as an explicit gate: larger $\tau$ yields conservative behavior, while smaller $\tau$ increases recall at the cost of more false positives. In all experiments, $(\alpha, \tau)$ are tuned per lookahead budget $K$ on the validation set and then fixed for test evaluation. The selected $(\alpha, \tau)$ for each $K$ are reported in Table [II](https://arxiv.org/html/2606.05179#S4.T2). The full inference procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.05179#algorithm1).
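A minimal sketch of the decision rule in Eqs. (3) to (5) follows. The log-probabilities are made-up numbers for a single boundary, and the function name `decide` is hypothetical; only the weighting and the threshold gate are taken from the text.

```python
import math
from typing import Dict, Optional

PUNCT = ("COMMA", "PERIOD", "QMARK")   # None encodes the no-insertion action


def decide(log_prior: Dict[Optional[str], float],
           log_lookahead: Dict[Optional[str], float],
           alpha: float, tau: float) -> Optional[str]:
    """Return the chosen action, or None when the margin does not exceed tau."""
    # Eq. (3): weighted combination of prior and lookahead log-probabilities.
    scores = {a: alpha * log_prior[a] + (1.0 - alpha) * log_lookahead[a]
              for a in (None, *PUNCT)}
    best = max(PUNCT, key=lambda a: scores[a])   # Eq. (4): best non-empty action
    margin = scores[best] - scores[None]         # Eq. (4): margin over no insertion
    return best if margin > tau else None        # Eq. (5): threshold gate


# Illustrative values only (not measured from any model).
log_prior = {None: math.log(0.70), "COMMA": math.log(0.20),
             "PERIOD": math.log(0.08), "QMARK": math.log(0.02)}
log_lookahead = {None: -6.0, "COMMA": -3.0, "PERIOD": -7.0, "QMARK": -8.0}

print(decide(log_prior, log_lookahead, alpha=0.55, tau=-0.25))  # prints COMMA
print(decide(log_prior, log_lookahead, alpha=0.55, tau=1.0))    # prints None
```

In this example the comma margin is about 0.66, so it passes the gate at $\tau = -0.25$ but not at the stricter $\tau = 1.0$, illustrating how a larger threshold makes insertion more conservative.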

## IV Experiments and Results

### IV-A Task Definition and Evaluation Protocol

Punctuation restoration is formulated as boundary-wise label prediction on a fixed transcript. Given an unpunctuated word sequence, the model assigns one label at each word boundary from {O, COMMA, PERIOD, QMARK}. The word sequence is kept unchanged across all methods; only punctuation insertion decisions are evaluated. For streaming evaluation, boundaries are processed left-to-right and future context is restricted by a bounded lookahead budget $K$ (subword tokens). Because O dominates, we tune hyperparameters on the validation set using punct-only Macro F1 over {COMMA, PERIOD, QMARK}, while tables report 4-class Macro F1 (including O) for completeness and comparability.
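The two metrics differ only in which classes are averaged. The sketch below computes both from boundary-wise label sequences; the convention that a class with no true positives gets F1 = 0 is an assumption, as the text does not specify how empty classes are handled.

```python
from typing import Sequence

LABELS = ("O", "COMMA", "PERIOD", "QMARK")
PUNCT_LABELS = ("COMMA", "PERIOD", "QMARK")


def class_f1(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0                       # assumed convention for empty classes
    return 2 * tp / (2 * tp + fp + fn)   # equivalent to 2PR / (P + R)


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    return sum(class_f1(gold, pred, lab) for lab in labels) / len(labels)


# One label per word boundary; the single comma is missed in this toy example.
gold = ["O", "O", "COMMA", "O", "PERIOD", "O", "QMARK"]
pred = ["O", "O", "O", "O", "PERIOD", "O", "QMARK"]

print(macro_f1(gold, pred, LABELS))        # 4-class macro F1 (includes O)
print(macro_f1(gold, pred, PUNCT_LABELS))  # punct-only macro F1
```

Here one missed comma barely affects the O class but removes a third of the punct-only score, which is why tuning on the punct-only metric avoids dominance by the majority class.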

### IV-B Datasets and Splits

An English punctuation restoration dataset is built from the HuggingFace IWSLT 2017 corpus \[[1](https://arxiv.org/html/2606.05179#bib.bib1),[21](https://arxiv.org/html/2606.05179#bib.bib21)\]. Instead of using a single language-pair configuration, English-side transcripts are aggregated from multiple IWSLT 2017 configurations that contain English (both \*-en and en-\*). For each split (train/validation/test), exact duplicate sentences within the split are removed and basic whitespace normalization is applied.

Input-output pairs are constructed by removing punctuation marks from the reference transcript to form $X$, while keeping the original punctuated text as $Y$. Boundary-wise labels are derived from the punctuation symbol following each word in $Y$: ',' $\rightarrow$ COMMA, '.' $\rightarrow$ PERIOD, '?' $\rightarrow$ QMARK, and O otherwise. The final dataset contains 357,117 / 1,501 / 10,799 sentences for train / validation / test, respectively.
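The label derivation can be sketched as below. This is a simplified illustration, not the authors' preprocessing: it assumes whitespace tokenization, looks only at the final character of each token, and skips stand-alone punctuation tokens; the name `make_example` is hypothetical.

```python
from typing import List, Tuple

MARK_TO_LABEL = {",": "COMMA", ".": "PERIOD", "?": "QMARK"}


def make_example(punctuated: str) -> Tuple[List[str], List[str]]:
    """Split punctuated text Y into unpunctuated words X and boundary labels."""
    words: List[str] = []
    labels: List[str] = []
    for tok in punctuated.split():
        label = MARK_TO_LABEL.get(tok[-1], "O")   # mark following the word, if any
        word = tok.rstrip(",.?")                  # strip the three target marks
        if not word:
            continue                              # stand-alone punctuation is skipped
        words.append(word)
        labels.append(label)
    return words, labels


words, labels = make_example("Well, how are you today?")
print(" ".join(words))   # unpunctuated input X
print(labels)            # one label per word boundary
```

Each word receives exactly one label, so the input word sequence and the label sequence stay aligned by construction.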

### IV-C Compared Methods

#### IV-C1 Prompt-based Generation Baseline

Prompt-based punctuation generation is evaluated using Llama-3.2-1B-Instruct \[[6](https://arxiv.org/html/2606.05179#bib.bib6)\] with greedy decoding. We additionally show qualitative failures of the base model (Llama-3.2-1B) in Fig. [3](https://arxiv.org/html/2606.05179#S4.F3)(a), as it often violates the strict formatting constraint. Although the prompt instructs the model to preserve every word, generation frequently violates the formatting constraint or drifts from the fixed transcript (Fig. [3](https://arxiv.org/html/2606.05179#S4.F3)), which leads to alignment errors under boundary-wise labeling.

#### IV-C2 Fine-tuned ELECTRA Baseline

As a strong discriminative baseline, an ELECTRA-small encoder classifier \[[2](https://arxiv.org/html/2606.05179#bib.bib2)\] is fine-tuned for boundary-wise labeling. This ELECTRA baseline follows the streaming setting of Polacek et al. \[[18](https://arxiv.org/html/2606.05179#bib.bib18)\] by restricting the future context with a fixed lookahead budget (we set $K=2$ future subword tokens in the main comparison).

(a) Base Model (Llama-3.2-1B) Failure

Behavior: Irrelevant generation / Loss of control

Prompt: "You are a punctuation restoration tool. Rules: Insert punctuation only (comma, period, question mark). Do not add or remove words. Keep the original word order. Return exactly one line: the punctuated sentence. Text: This all over the country is the second largest waste stream in America"

Output: "This all over the country is the second largest waste stream in America Text: This all over the country is the … in America Output: This all over the country is the … in America Text: This all over the country is the … in America Output: This all over the country is the … in America Text:"

(b) Baseline (Llama-3.2-1B-Instruct) Failure

Prompt: "You are a punctuation restoration tool. Rules: Insert punctuation only (comma, period, question mark). Do not add or remove words. Keep the original word order. Return exactly one line: the punctuated sentence. Text: This all over the country is the second largest waste stream in America"

Output: "This, all, over, the, country, is, the, second, largest, waste, stream, in, America."

Figure 3: Failures of prompt-based generation: both the base and instruction-tuned LLMs exhibit formatting drift and insertion bias (e.g., comma over-insertion), which degrades boundary-wise performance. The prompt shown is the exact prompt used in our baseline experiments.

#### IV-C3 Proposed Scoring Method

The proposed method uses the LLM as a scorer rather than a generator: at each word boundary, we compare punctuation insertion hypotheses against a no-insertion baseline under a bounded $K$-subword-token lookahead. A local punctuation preference term and a lookahead-consistency term are combined with weight $\alpha$, and punctuation is inserted only when the margin of the best candidate over the no-insertion baseline exceeds a threshold $\tau$.

#### IV-C4 Scoring Calibration on Validation ($\alpha$, $\tau$)

For each setting, $(\alpha, \tau)$ is tuned using only the validation set via grid search. We search $\alpha$ from 0.10 to 0.90 in steps of 0.05 and $\tau$ from -3.00 to 1.75 in steps of 0.25. The best setting is selected by punctuation-only macro F1 over {COMMA, PERIOD, QMARK}, excluding O, to avoid dominance by the majority O class.
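The grid described above contains 17 values of $\alpha$ and 20 values of $\tau$, i.e. 340 candidate settings per configuration. The sketch below enumerates that grid and picks the best pair; the `evaluate` callback is a hypothetical stand-in for running the scorer on the validation set and returning punct-only macro F1, and keeping the first of any tied settings is an assumption.

```python
from typing import Callable, List, Optional, Tuple


def grid(start: float, stop: float, step: float) -> List[float]:
    """Inclusive grid, rounded to two decimals to avoid float drift."""
    n = int(round((stop - start) / step))
    return [round(start + j * step, 2) for j in range(n + 1)]


ALPHAS = grid(0.10, 0.90, 0.05)   # 17 values
TAUS = grid(-3.00, 1.75, 0.25)    # 20 values


def calibrate(evaluate: Callable[[float, float], float]) -> Tuple[Optional[float], Optional[float], float]:
    """Return (alpha, tau, score) maximizing the validation metric."""
    best_alpha, best_tau, best_score = None, None, float("-inf")
    for alpha in ALPHAS:
        for tau in TAUS:
            score = evaluate(alpha, tau)
            if score > best_score:        # strict '>' keeps the first of any ties
                best_alpha, best_tau, best_score = alpha, tau, score
    return best_alpha, best_tau, best_score


# Toy objective peaking at (0.55, -0.25); a real run would score validation data.
toy = lambda a, t: -((a - 0.55) ** 2 + (t + 0.25) ** 2)
print(len(ALPHAS) * len(TAUS), calibrate(toy)[:2])
```

Because the search touches only two scalars and reuses the same model log-likelihoods, calibration involves no parameter updates.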

### IV-D Scoring-based Method Variants: No-fine-tuning vs Fine-tuned LLM

Two variants of the proposed scoring-based method are evaluated under the same scoring rule and decision procedure. The no-fine-tuning variant uses the pretrained Llama-3.2-1B model without fine-tuning. The fine-tuned variant adapts the same Llama-3.2-1B model via LoRA (see Section IV-E) using the training split of the dataset described in Section IV-B (357,117 sentences). Both variants use the identical boundary-wise scoring-and-threshold inference and tune $(\alpha, \tau)$ on the validation split in Section IV-B, isolating the effect of LLM fine-tuning.

### IV-E Training Details

##### LLM fine-tuning

'Llama-3.2-1B' is fine-tuned with LoRA \[[22](https://arxiv.org/html/2606.05179#bib.bib22)\] on attention projection layers (q\_proj and v\_proj), with rank $r=16$, LoRA scaling parameter $\alpha=32$ (distinct from the scoring weight $\alpha$ in Eq. (3)), and dropout 0.05. Fine-tuning is performed for 1 epoch with a learning rate of 2e-4, a per-device batch size of 4, gradient accumulation over 4 steps, and FP16 training on a single NVIDIA A100 GPU. The fine-tuned LLM is evaluated with the same scoring-and-threshold inference algorithm as the no-fine-tuning variant, using the HuggingFace Transformers library \[[23](https://arxiv.org/html/2606.05179#bib.bib23)\].
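The reported hyperparameters imply an effective batch size and an approximate number of optimizer steps. The sketch below works through that arithmetic. The dictionary layout is illustrative, and the step count rests on two assumptions not stated in the text: one training example per sentence, and the final partial batch being kept.

```python
import math

# Values as reported in the text; the key names are illustrative.
LORA = {"target_modules": ["q_proj", "v_proj"], "r": 16, "lora_alpha": 32, "dropout": 0.05}
TRAIN = {"epochs": 1, "learning_rate": 2e-4, "per_device_batch_size": 4,
         "grad_accum_steps": 4, "num_gpus": 1}

NUM_TRAIN_SENTENCES = 357_117   # training split size from Section IV-B

# Examples consumed per optimizer update.
effective_batch = (TRAIN["per_device_batch_size"]
                   * TRAIN["grad_accum_steps"] * TRAIN["num_gpus"])

# Assumption: one example per sentence, last partial batch kept.
optimizer_steps = math.ceil(NUM_TRAIN_SENTENCES / effective_batch) * TRAIN["epochs"]

# Standard LoRA scaling factor applied to the low-rank update: alpha / r.
lora_scale = LORA["lora_alpha"] / LORA["r"]

print(effective_batch, optimizer_steps, lora_scale)
```

Under these assumptions a single epoch corresponds to roughly 22 thousand updates at an effective batch size of 16.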

##### ELECTRA fine-tuning

For the discriminative baseline, 'electra-small-discriminator' is fine-tuned for boundary-wise labeling with a lightweight Multi-Layer Perceptron (MLP) head using AdamW \[[24](https://arxiv.org/html/2606.05179#bib.bib24)\]. Word-level labels are aligned to subwords by supervising only the last subword token of each word, and the best checkpoint is selected on the validation split.
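Last-subword supervision can be sketched as follows. The input `word_ids` maps each subword position to its word index (with `None` for special tokens), which is the shape of output many subword tokenizers provide. The mask value -100 is an assumption chosen because PyTorch's cross-entropy loss ignores it by default; the label ids are illustrative.

```python
from typing import List, Optional

IGNORE = -100   # assumed mask value; positions with this label do not contribute to the loss
LABEL2ID = {"O": 0, "COMMA": 1, "PERIOD": 2, "QMARK": 3}


def align_last_subword(word_ids: List[Optional[int]], word_labels: List[str]) -> List[int]:
    """Place each word's label on its last subword; mask every other position."""
    out = [IGNORE] * len(word_ids)
    for j, wid in enumerate(word_ids):
        if wid is None:
            continue                                   # special token such as [CLS] or [SEP]
        is_last = j + 1 == len(word_ids) or word_ids[j + 1] != wid
        if is_last:
            out[j] = LABEL2ID[word_labels[wid]]
    return out


# Two words; the first is split into three subwords, the second into one.
word_ids = [None, 0, 0, 0, 1, None]
print(align_last_subword(word_ids, ["COMMA", "QMARK"]))   # [-100, -100, -100, 1, 3, -100]
```

Supervising the last subword keeps the label at the position immediately before the word boundary where the punctuation would be inserted.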

TABLE I: Main results on the IWSLT 2017 English test set. Our scoring-based method uses Llama-3.2-1B, and the prompt-generation baseline uses Llama-3.2-1B-Instruct; the encoder baseline uses ELECTRA-small. $K$ denotes the future-context budget, measured as the number of future subword tokens available as lookahead at each word boundary. We report Macro F1 over {O, COMMA, PERIOD, QMARK}.

| Model & Setting | O P | O R | O F1 | COMMA P | COMMA R | COMMA F1 | PERIOD P | PERIOD R | PERIOD F1 | QMARK P | QMARK R | QMARK F1 | Macro P | Macro R | Macro F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baselines | | | | | | | | | | | | | | | |
| 1. Llama-3.2-1B-Instruct (Prompt, No-fine-tuning) | 0.975 | 0.802 | 0.880 | 0.397 | 0.537 | 0.457 | 0.998 | 0.538 | 0.699 | 0.796 | 0.505 | 0.622 | 0.541 | 0.596 | 0.566 |
| 2. ELECTRA-Small (Fine-tuned, K=2) | 0.977 | 0.985 | 0.981 | 0.797 | 0.713 | 0.752 | 0.992 | 0.992 | 0.992 | 0.933 | 0.917 | 0.925 | 0.925 | 0.902 | 0.913 |
| Proposed Scoring Method | | | | | | | | | | | | | | | |
| 3. Llama-3.2-1B Scoring (No-fine-tuning, K=1) | 0.984 | 0.971 | 0.977 | 0.699 | 0.815 | 0.753 | 0.949 | 0.942 | 0.946 | 0.848 | 0.888 | 0.868 | 0.870 | 0.904 | 0.886 |
| 4. Llama-3.2-1B Scoring (No-fine-tuning, K=2) | 0.985 | 0.973 | 0.979 | 0.731 | 0.860 | 0.790 | 0.953 | 0.912 | 0.932 | 0.855 | 0.888 | 0.871 | 0.881 | 0.908 | 0.893 |
| 5. Llama-3.2-1B Scoring (Fine-tuned, K=1) | 0.984 | 0.984 | 0.984 | 0.799 | 0.808 | 0.804 | 0.989 | 0.986 | 0.987 | 0.956 | 0.914 | 0.935 | 0.932 | 0.923 | 0.927 |
| 6. Llama-3.2-1B Scoring (Fine-tuned, K=2) | 0.987 | 0.986 | 0.987 | 0.836 | 0.844 | 0.840 | 0.989 | 0.983 | 0.986 | 0.949 | 0.922 | 0.935 | 0.940 | 0.934 | 0.937 |

### IV-F Experimental Results

Table [I](https://arxiv.org/html/2606.05179#S4.T1) reports per-class Precision/Recall/F1 for {O, COMMA, PERIOD, QMARK} and the 4-class macro average on the IWSLT 2017 test set. For model selection and hyperparameter tuning, we primarily focus on punctuation-only performance (COMMA/PERIOD/QMARK), while Table [I](https://arxiv.org/html/2606.05179#S4.T1) additionally includes O for completeness.

##### Prompt-based generation is unstable under boundary-wise evaluation

As shown in Table [I](https://arxiv.org/html/2606.05179#S4.T1), prompt-based generation yields weak boundary-wise performance, especially for COMMA (F1=0.457), resulting in a low macro F1 of 0.566. Figure [3](https://arxiv.org/html/2606.05179#S4.F3) illustrates typical generation failures: the base LLM (Llama-3.2-1B) often violates the strict formatting constraint (irrelevant continuation or repetition), and even the instruction-tuned model (Llama-3.2-1B-Instruct) frequently exhibits comma over-insertion or excessive omission. These behaviors are problematic for boundary-wise labeling because small word-level mismatches can cascade into alignment errors.

##### Scoring-based inference improves robustness and accuracy

In contrast, the proposed scoring-based method substantially improves punctuation restoration by making local insertion decisions while preserving the input word sequence. In no-fine-tuning scoring, increasing the lookahead from $K=1$ to $K=2$ improves the COMMA F1 from 0.753 to 0.790 and the 4-class macro F1 from 0.886 to 0.893.

##### Fine-tuned scoring is competitive with a strong discriminative baseline

With fine-tuning, the scoring-based method outperforms the fine-tuned ELECTRA baseline under the same bounded future-context setting ($K=2$ future subword tokens). The fine-tuned scoring model attains a 4-class macro F1 of 0.937, compared to 0.913 for ELECTRA, and improves COMMA F1 (0.840 vs. 0.752) while maintaining high PERIOD and QMARK performance.

##### No-fine-tuning viability

Although fine-tuning further improves performance, no-fine-tuning scoring already provides a strong baseline. In no-fine-tuning mode, the proposed method reaches a 4-class macro F1 of 0.893 at $K=2$, and fine-tuning increases it to 0.937 at the same lookahead. This suggests that fine-tuning is beneficial, while the inference-time scoring framework remains effective even without fine-tuning.

### IV-G Ablation Study on Lookahead Length K

Table [II](https://arxiv.org/html/2606.05179#S4.T2) reports the lookahead-length ablation for the fine-tuned LLM variant. For each $K$, the calibration parameters $(\alpha, \tau)$ are selected on the validation set. We report 4-class Macro F1 (including O) for completeness, while punct-only Macro F1 over COMMA, PERIOD, QMARK is also reported to reflect punctuation performance.

Table [II](https://arxiv.org/html/2606.05179#S4.T2) shows that lookahead is essential: the control setting without lookahead ($K=0$) severely degrades performance (Macro F1=0.646; punct-only Macro F1=0.543), whereas even a short lookahead ($K=1$) recovers most of the accuracy (Macro F1=0.927). Performance peaks at $K=2$ (Macro F1=0.937; punct-only Macro F1=0.920), and larger contexts ($K=3$–$5$) provide marginal differences.

To further understand error patterns, Table [III](https://arxiv.org/html/2606.05179#S4.T3) compares confusion matrices for $K=0/1/2$. As $K$ increases, major off-diagonal errors are substantially reduced. For example, false COMMA insertions (True O $\rightarrow$ Pred COMMA) drop from 3,134 at $K=0$ to 2,606/2,145 at $K=1/2$, while missed commas (True COMMA $\rightarrow$ Pred O) drop from 5,046 to 2,466/2,010. These results confirm that bounded lookahead stabilizes boundary-wise decisions by enforcing local consistency with near-future tokens.

TABLE II: Ablation on lookahead length $K$ \(fine\-tuned LLM variant\)\. For each $K$, $(\alpha,\tau)$ are calibrated on the validation set\. The table reports 4\-class macro F1 \(including O\) and punct\-only macro F1 \(COMMA/PERIOD/QMARK\)\.

| $K$ | $\alpha$ | $\tau$ | Macro F1 | Punct Macro F1 |
| --- | --- | --- | --- | --- |
| 0 | 0\.85 | \-1\.00 | 0\.646 | 0\.543 |
| 1 | 0\.55 | \-0\.25 | 0\.927 | 0\.909 |
| 2 | 0\.55 | \-0\.25 | 0\.937 | 0\.920 |
| 3 | 0\.45 | 0\.00 | 0\.935 | 0\.918 |
| 4 | 0\.40 | 0\.00 | 0\.930 | 0\.911 |
| 5 | 0\.50 | \-0\.25 | 0\.932 | 0\.914 |

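TABLE III: Confusion Matrices for Proposed Method \(Fine\-tuned Scoring, $K=0$ vs\. $K=1$ vs\. $K=2$\)

\(a\) $K=0$ \(No Lookahead\)

| True \\ Pred | O | COMMA | PERIOD | QMARK |
| --- | --- | --- | --- | --- |
| O | 153,313 | 3,134 | 3,400 | 349 |
| COMMA | 5,046 | 5,307 | 2,592 | 72 |
| PERIOD | 2,863 | 935 | 6,338 | 17 |
| QMARK | 249 | 26 | 67 | 570 |

\(b\) $K=1$ \(Short Lookahead\)

| True \\ Pred | O | COMMA | PERIOD | QMARK |
| --- | --- | --- | --- | --- |
| O | 157,580 | 2,606 | 10 | 0 |
| COMMA | 2,466 | 10,522 | 29 | 0 |
| PERIOD | 76 | 33 | 10,006 | 38 |
| QMARK | 1 | 2 | 75 | 834 |

\(c\) $K=2$ \(Extended Lookahead\)

| True \\ Pred | O | COMMA | PERIOD | QMARK |
| --- | --- | --- | --- | --- |
| O | 158,024 | 2,145 | 23 | 4 |
| COMMA | 2,010 | 10,989 | 18 | 0 |
| PERIOD | 116 | 15 | 9,981 | 41 |
| QMARK | 1 | 2 | 68 | 841 |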
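As a consistency check, the macro F1 scores reported for $K=0/1/2$ can be recomputed directly from the confusion\-matrix counts in Table III\. The snippet below is a small sketch of that arithmetic; the function names and data layout are our own choices, and the counts are transcribed from the table \(rows are true labels, columns are predictions\)\.

```python
LABELS = ["O", "COMMA", "PERIOD", "QMARK"]
PUNCT = ["COMMA", "PERIOD", "QMARK"]

# Counts transcribed from Table III; rows = true label, columns = predicted.
CONFUSION = {
    0: [[153313, 3134, 3400, 349],
        [5046, 5307, 2592, 72],
        [2863, 935, 6338, 17],
        [249, 26, 67, 570]],
    1: [[157580, 2606, 10, 0],
        [2466, 10522, 29, 0],
        [76, 33, 10006, 38],
        [1, 2, 75, 834]],
    2: [[158024, 2145, 23, 4],
        [2010, 10989, 18, 0],
        [116, 15, 9981, 41],
        [1, 2, 68, 841]],
}


def per_class_f1(matrix):
    """F1 per label, using F1 = 2*TP / (true count + predicted count)."""
    f1 = {}
    for i, label in enumerate(LABELS):
        tp = matrix[i][i]
        true_count = sum(matrix[i])                  # TP + FN (row sum)
        pred_count = sum(row[i] for row in matrix)   # TP + FP (column sum)
        total = true_count + pred_count
        f1[label] = 2 * tp / total if total else 0.0
    return f1


def macro_f1(matrix, labels=LABELS):
    """Unweighted mean of per-class F1 over the chosen labels."""
    f1 = per_class_f1(matrix)
    return sum(f1[label] for label in labels) / len(labels)


for k, matrix in CONFUSION.items():
    print(f"K={k}: macro F1={macro_f1(matrix):.3f}, "
          f"punct-only macro F1={macro_f1(matrix, PUNCT):.3f}")
```

Rounded to three decimals, this yields 0\.646/0\.543, 0\.927/0\.909, and 0\.937/0\.920 for $K=0$, $1$, and $2$, matching the corresponding rows of Table II, and a COMMA F1 of 0\.840 at $K=2$\.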

## VConclusion

This paper presented a non\-autoregressive, scoring\-based framework \(no free\-form generation\) for streaming punctuation restoration under bounded future context\. By preserving the input transcript and making boundary\-wise decisions via likelihood comparisons, the proposed method avoids alignment issues often observed in prompt\-based generation\.

On IWSLT 2017, our method achieves a 4\-class macro F1 of 0\.893 in no\-fine\-tuning mode \($K=2$\) and 0\.937 after fine\-tuning \($K=2$\)\. Both settings outperform the prompt\-based baseline \(0\.566\), and the fine\-tuned variant also outperforms a fine\-tuned ELECTRA baseline \(0\.913\) under the same lookahead budget\. Ablation over $K=0\sim 5$ shows the best performance at $K=2$, indicating that modest future context is sufficient for accurate streaming decisions\. Our evaluation is limited to IWSLT 2017 English and does not yet include noisy ASR transcripts or system\-level latency/memory measurements\. These broader deployment\-oriented evaluations remain future work\. Overall, this work demonstrates that inference\-only LLM scoring provides a robust and effective alternative to generation\-based punctuation restoration in streaming settings with bounded lookahead\.

## Acknowledgment

This work was partly supported by: the National Research Foundation of Korea \(NRF\) grant funded by the Korean government \(MSIT\) under Grant No\. RS\-2025\-24535409; the Institute of Information & Communications Technology Planning & Evaluation \(IITP\) grant funded by the Korean government \(MSIT\) under Grant No\. RS\-2019\-II190079 for the Artificial Intelligence Graduate School Program at Korea University; the Institute of Information & Communications Technology Planning & Evaluation \(IITP\) grant funded by the Korean government \(MSIT\) under Grant No\. RS\-2025\-02304828 for the Artificial Intelligence Star Fellowship Support Program to Nurture the Best Talents; and the Institute of Information & Communications Technology Planning & Evaluation \(IITP\) grant funded by the Korean government \(MSIT\) under Grant No\. RS\-2025\-25442867\.

## References

- \[1\] M\. Cettolo, M\. Federico, L\. Bentivogli, J\. Niehues, S\. Stüker, K\. Sudoh, K\. Yoshino, and C\. Federmann, “Overview of the IWSLT 2017 evaluation campaign,” in *Proc\. International Conference on Spoken Language Translation \(IWSLT\)*, 2017\.
- \[2\] K\. Clark, M\.\-T\. Luong, Q\. V\. Le, and C\. D\. Manning, “ELECTRA: Pre\-training text encoders as discriminators rather than generators,” in *Proc\. International Conference on Learning Representations \(ICLR\)*, 2020\.
- \[3\] C\. Kim, A\. Misra, K\. Chin, T\. Hughes, A\. Narayanan, T\. N\. Sainath, and M\. Bacchiani, “Generation of large\-scale simulated utterances in virtual rooms to train deep\-neural networks for far\-field speech recognition in Google Home,” in *Proc\. Interspeech 2017*, 2017, pp\. 379–383\. \[Online\]\. Available: http://dx\.doi\.org/10\.21437/Interspeech\.2017\-1510
- \[4\] C\. Kim, E\. Variani, A\. Narayanan, and M\. Bacchiani, “Efficient implementation of the room simulator for training deep neural network acoustic models,” in *INTERSPEECH\-2018*, Sept 2018, pp\. 3028–3032\. \[Online\]\. Available: http://dx\.doi\.org/10\.21437/Interspeech\.2018\-2566
- \[5\] C\. Kim, D\. Gowda, D\. Lee, J\. Kim, A\. Kumar, S\. Kim, A\. Garg, and C\. Han, “A review of on\-device fully neural end\-to\-end automatic speech recognition algorithms,” in *2020 54th Asilomar Conference on Signals, Systems, and Computers*, Nov\. 2020, pp\. 277–283\.
- \[6\] Meta, “Llama 3\.2: Model cards and prompt formats,” 2024, online; accessed 2026\. \[Online\]\. Available: https://www\.llama\.com/docs/model\-cards\-and\-prompt\-formats/llama3\_2/
- \[7\] A\. Grattafiori *et al\.*, “The Llama 3 herd of models,” *arXiv preprint arXiv:2407\.21783*, 2024\.
- \[8\] O\. Klejch, P\. Bell, and S\. Renals, “Punctuated transcription of multi\-genre broadcasts using acoustic and lexical approaches,” in *Proc\. IEEE Spoken Language Technology Workshop \(SLT\)*, 2016, pp\. 433–440\.
- \[9\] M\. Pogoda and T\. Walkowiak, “Comprehensive punctuation restoration for English and Polish,” in *Findings of the Association for Computational Linguistics: EMNLP 2021*, 2021, pp\. 4610–4619\.
- \[10\] O\. Tilk and T\. Alumäe, “LSTM for punctuation restoration in speech transcripts,” in *Proc\. Interspeech*, 2015\.
- \[11\] O\. Tilk and T\. Alumäe, “Bidirectional recurrent neural network with attention mechanism for punctuation restoration,” in *Proc\. Interspeech*, 2016\.
- \[12\] M\. Courtland, A\. Faulkner, and G\. McElvain, “Efficient automatic punctuation restoration using bidirectional transformers with robust inference,” in *Proc\. International Conference on Spoken Language Translation \(IWSLT\)*, 2020, pp\. 272–279\.
- \[13\] J\. Devlin, M\.\-W\. Chang, K\. Lee, and K\. Toutanova, “BERT: Pre\-training of deep bidirectional transformers for language understanding,” in *Proc\. NAACL\-HLT*, 2019, pp\. 4171–4186\.
- \[14\] K\. Makhija, T\.\-N\. Ho, and E\.\-S\. Chng, “Transfer learning for punctuation prediction,” in *Proc\. APSIPA Annual Summit and Conference \(APSIPA ASC\)*, 2019\.
- \[15\] J\. Yi, J\. Tao, Z\. Tian, Y\. Bai, and C\. Fan, “Focal loss for punctuation prediction,” in *Proc\. Interspeech*, 2020\.
- \[16\] J\. Park, S\. Jin, J\. Park, S\. Kim, D\. Sandhyana, C\. Lee, M\. Han, J\. Lee, S\. Jung, C\. Han, and C\. Kim, “Conformer\-based on\-device streaming speech recognition with KD compression and two\-pass architecture,” in *2022 IEEE Spoken Language Technology Workshop \(SLT\)*, 2023, pp\. 92–99\.
- \[17\] M\. Polacek, P\. Cerva, J\. Zdansky, and L\. Weingartova, “Online punctuation restoration using ELECTRA model for streaming ASR systems,” in *Proc\. Interspeech*, 2023, pp\. 446–450\.
- \[18\] M\. Polacek and P\. Cerva, “Lightweight online punctuation and capitalization restoration for streaming ASR systems,” *Speech Communication*, vol\. 173, 2025\.
- \[19\] Q\. Zhong and A\. Sun, “Punctuation restoration: A case study of BERT\-based models’ task\-specific excellence,” in *Proc\. IEEE ICASSP*, 2025\.
- \[20\] Y\. Pang, D\. Paul, K\. Jiang, X\. Zhang, and X\. Lei, “Llama based punctuation restoration with forward pass only decoding,” *arXiv preprint arXiv:2408\.11845*, 2024\.
- \[21\] Q\. Lhoest *et al\.*, “Datasets: A community library for natural language processing,” in *Proc\. EMNLP \(System Demonstrations\)*, 2021, pp\. 175–184\.
- \[22\] E\. J\. Hu *et al\.*, “LoRA: Low\-rank adaptation of large language models,” *arXiv preprint arXiv:2106\.09685*, 2021\.
- \[23\] T\. Wolf *et al\.*, “Transformers: State\-of\-the\-art natural language processing,” in *Proc\. EMNLP \(System Demonstrations\)*, 2020, pp\. 38–45\.
- \[24\] I\. Loshchilov and F\. Hutter, “Decoupled weight decay regularization,” *arXiv preprint arXiv:1711\.05101*, 2017\.
