CODEBLOCK: Learning to Supervise Code at the Right Granularity

arXiv cs.LG Papers

Summary

Proposes CodeBlock, a structure-aware sparse supervision framework for supervised fine-tuning of code LLMs. It selects high-quality instruction-response pairs and partitions code responses into syntactically coherent coding items, applying loss only to selected items to achieve stronger pass@1 rates using only 1.9% of supervised response tokens.

arXiv:2606.18286v1 Announce Type: new Abstract: Supervised fine-tuning of code LLMs typically applies uniform cross-entropy loss to all response tokens, implicitly assuming that every token provides equally useful learning signal. Recent token-level selection methods challenge this assumption in natural-language SFT by supervising only high-value tokens. However, directly transferring token-level masking to code can break syntactically and semantically coherent program units, because code depends on structural completeness and definition-use relations. We therefore propose CodeBlock, a structure-aware sparse supervision framework that selects structure-complete code evidence rather than isolated tokens. CodeBlock first selects high-quality instruction-response pairs, then partitions code responses into syntactically coherent coding items, estimates their utility by aggregating generalized cross-entropy over core logic tokens, and reranks them with data-flow reach and bridge signals to prioritize blocks that propagate or connect important program dependencies. During training, the full response remains available as context, while loss is applied only to selected code items and informative natural-language tokens. Experiments on six code-generation benchmarks show that CodeBlock achieves stronger average pass@1 than full-token SFT and competitive selection baselines, while using only 1.9% of supervised response tokens.

Cached at: 06/18/26, 05:40 AM

# CodeBlock: Learning to Supervise Code at the Right Granularity
Source: [https://arxiv.org/html/2606.18286](https://arxiv.org/html/2606.18286)

Zhijie Deng¹, Ling Li¹, Jinlong Pang², Kaiqin Hu³, Qi Xuan⁴, Zhaowei Zhu⁴,⁵, Jiaheng Wei¹,†

¹Hong Kong University of Science and Technology (Guangzhou); ²UC Santa Cruz; ³Ant Group; ⁴BAIA, ZJUT; ⁵D5Data.ai

zdeng190@connect.hkust-gz.edu.cn, jiahengwei@hkust-gz.edu.cn

###### Abstract

Supervised fine-tuning of code LLMs typically applies uniform cross-entropy loss to all response tokens, implicitly assuming that every token provides equally useful learning signal. Recent token-level selection methods challenge this assumption in natural-language SFT by supervising only high-value tokens. However, directly transferring token-level masking to code can break syntactically and semantically coherent program units, because code depends on structural completeness and definition-use relations. We therefore propose CodeBlock, a structure-aware sparse supervision framework that selects structure-complete code evidence rather than isolated tokens. CodeBlock first selects high-quality instruction-response pairs, then partitions code responses into syntactically coherent coding items, estimates their utility by aggregating generalized cross-entropy over core logic tokens, and reranks them with data-flow reach and bridge signals to prioritize blocks that propagate or connect important program dependencies. During training, the full response remains available as context, while loss is applied only to selected code items and informative natural-language tokens. Experiments on six code-generation benchmarks show that CodeBlock achieves stronger average pass@1 than full-token SFT and competitive selection baselines, while using only 1.9% of supervised response tokens.


![Refer to caption](https://arxiv.org/html/2606.18286v1/x1.png)

Figure 1: Performance–efficiency trade-off on Qwen2.5-Coder-1.5B-Instruct. CodeBlock achieves the highest average performance with only 1.9% effective supervised tokens, showing a better trade-off than full-token SFT and sparse-selection baselines.

## 1 Introduction

Supervised fine-tuning (SFT) is a standard way to adapt code LLMs to instruction-following and program-generation tasks (Hui et al., [2024](https://arxiv.org/html/2606.18286#bib.bib8); Guo et al., [2024](https://arxiv.org/html/2606.18286#bib.bib25)). As code instruction corpora grow, improving SFT increasingly depends on extracting dense, reliable, and high-marginal-utility supervision rather than simply adding more examples (Wei et al., [2024b](https://arxiv.org/html/2606.18286#bib.bib14); Li et al., [2024](https://arxiv.org/html/2606.18286#bib.bib26); Yu et al., [2025](https://arxiv.org/html/2606.18286#bib.bib35)). This has motivated data selection methods that filter high-quality, less noisy, or more diverse instruction–response pairs before training (Chen et al., [2023](https://arxiv.org/html/2606.18286#bib.bib10); Xia et al., [2024](https://arxiv.org/html/2606.18286#bib.bib12); Liu et al., [2023](https://arxiv.org/html/2606.18286#bib.bib11)). However, most existing methods remain sample-level: they keep or discard entire instruction–response pairs, while still assigning uniform cross-entropy supervision to every token in a selected response (Chen et al., [2025](https://arxiv.org/html/2606.18286#bib.bib27); Wu et al., [2024](https://arxiv.org/html/2606.18286#bib.bib28)).

This uniform-supervision assumption has already been challenged in natural-language SFT. Recent token-level selection methods keep only high-value tokens, often estimated by pointwise loss or excess-loss scores (Pang et al., [2025](https://arxiv.org/html/2606.18286#bib.bib19); Lin et al., [2024a](https://arxiv.org/html/2606.18286#bib.bib18)). These methods are effective in natural-language settings because individual tokens can often be treated as approximate local learning units (Lin et al., [2024b](https://arxiv.org/html/2606.18286#bib.bib29); Qin et al., [2025](https://arxiv.org/html/2606.18286#bib.bib31); Fu et al., [2026](https://arxiv.org/html/2606.18286#bib.bib30)). However, directly transferring pointwise token selection from natural-language settings to code SFT is unreliable. Unlike ordinary text, the semantics of code are often not determined by individual tokens in isolation, but jointly formed by syntactic structures, local statements, and variable definition–use relations (Allamanis et al., [2017](https://arxiv.org/html/2606.18286#bib.bib32); Guo et al., [2020](https://arxiv.org/html/2606.18286#bib.bib33)). An isolated variable name or operator may not carry complete semantics on its own; only when combined into an assignment statement, conditional branch, or return expression does it constitute code evidence that truly affects program behavior. Figure [2](https://arxiv.org/html/2606.18286#S1.F2) illustrates this granularity mismatch: pointwise token selection tends to pick scattered tokens from different statements, whereas structure-aware selection preserves complete coding items that form locally meaningful program evidence. Therefore, sparse supervision in the code domain should move from token-level scoring to structure-complete code evidence selection.

![Refer to caption](https://arxiv.org/html/2606.18286v1/x2.png)

Figure 2: Comparison between previous token-level selection and CodeBlock. While prior methods select isolated high-score tokens, CodeBlock preserves complete coding items with coherent syntax and data dependencies.

To this end, we propose CodeBlock, a structure-aware sparse supervision framework for code LLM fine-tuning. CodeBlock uses high-scoring tokens only as anchors for locating useful code evidence: it partitions code responses into syntactically coherent coding items, scores each item by the concentration of informative logic tokens using generalized cross-entropy, and then reranks items with lightweight data-flow signals that measure downstream influence and dependency-path connectivity. During training, the full response remains available as autoregressive context, while the loss is applied only to selected code items and informative natural-language tokens. As shown in Figure [1](https://arxiv.org/html/2606.18286#S0.F1), across six code-generation benchmarks and five model settings, CodeBlock consistently matches or improves on full-token SFT while using only about 1.9% of the supervised response tokens, and achieves the best or second-best average performance among competitive selection baselines.

The contributions of this paper are as follows:

- We reveal a granularity mismatch in sparse supervision for code LLMs: isolated token selection ignores syntactic closure and data-flow dependencies, leading to fragmented and less effective supervision.
- We propose CodeBlock, a structure-aware sparse supervision framework that selects coding items rather than individual tokens, combining item partitioning, GCE-based utility scoring, and data-flow-aware reranking.
- Experiments across six code-generation benchmarks show that CodeBlock matches or outperforms full-token SFT with only about 1.9% of the supervised response tokens, achieving a stronger performance–efficiency trade-off than competitive baselines.

## 2 Related Work

LLM Data Selection. Recent studies have shown that the effectiveness of instruction tuning depends heavily on data quality (Deng et al., [2025](https://arxiv.org/html/2606.18286#bib.bib9)). AlpaGasus filters Alpaca data using feedback from strong LLMs (Chen et al., [2023](https://arxiv.org/html/2606.18286#bib.bib10)), DEITA selects instruction data based on quality, complexity, and diversity (Liu et al., [2023](https://arxiv.org/html/2606.18286#bib.bib11)), LESS estimates sample influence through gradient similarity (Xia et al., [2024](https://arxiv.org/html/2606.18286#bib.bib12)), and DS2 improves LLM-based data rating by correcting scoring bias with a score transition matrix (Pang et al., [2024](https://arxiv.org/html/2606.18286#bib.bib13)). These works mainly perform sample-level data selection for general instruction tuning.

For code large language models, recent datasets and filtering methods have likewise emphasized the importance of high-quality code instruction data. Wei et al. ([2024a](https://arxiv.org/html/2606.18286#bib.bib24)) filter self-generated code instruction data through sandbox verification. Tsai et al. ([2024](https://arxiv.org/html/2606.18286#bib.bib22)) prune redundant synthetic code data using clustering-based metrics, while Lyu et al. ([2025](https://arxiv.org/html/2606.18286#bib.bib23)) select compact data subsets based on distribution-consistent and diversity-aware criteria. XCoder studies code data selection from the perspectives of instruction complexity, response quality, and diversity (Wang et al., [2024](https://arxiv.org/html/2606.18286#bib.bib17)). Although these methods improve the quality of selected instruction-response pairs, they usually assume that all response tokens within a selected sample are valid supervision targets. In contrast, CodeBlock further studies which tokens or code fragments inside selected code responses should participate in training and contribute gradients.

Fine-Grained Supervision Selection and Data Cleaning. Beyond sample-level filtering, several works study token-level supervision selection. Rho-1 proposes selective language modeling by applying loss only to valuable pretraining tokens (Lin et al., [2024a](https://arxiv.org/html/2606.18286#bib.bib18)). Token Cleaning views SFT token labels from a noisy-label perspective and removes redundant or harmful tokens (Pang et al., [2025](https://arxiv.org/html/2606.18286#bib.bib19)). TokenTune jointly estimates sample-level and token-level utility for instruction tuning (Lin et al., [2026](https://arxiv.org/html/2606.18286#bib.bib20)), while TOSS identifies unsafe tokens for safe fine-tuning through loss differences (Li et al., [2026](https://arxiv.org/html/2606.18286#bib.bib21)). These methods show that SFT does not necessarily require supervising all response tokens. However, most of them are domain-agnostic and treat token utility as an independent token-level property. This is insufficient for code responses, where token value often depends on local statements, variable definition-use relations, and data-flow dependencies.

## 3 Preliminary: Next-Token Prediction with Sparse Supervision

Given an instruction-tuning dataset $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{N}$, where $x_i$ is the instruction and $y_i=\{y_{i,t}\}_{t=1}^{T_i}$ is the response, we formulate supervised fine-tuning with a token-level supervision mask. For each response token $y_{i,t}$, let $m_{i,t}\in\{0,1\}$ indicate whether this token contributes to the training loss. The masked next-token prediction objective is defined as:

$$\mathcal{L}_{\mathrm{mask}}(\theta)=-\frac{\sum_{i=1}^{N}\sum_{t=1}^{T_i}m_{i,t}\log p_{\theta}(y_{i,t}\mid x_i,y_{i,<t})}{\sum_{i=1}^{N}\sum_{t=1}^{T_i}m_{i,t}}. \tag{1}$$

This formulation generalizes standard full-token SFT and sparse supervision. When $m_{i,t}=1$ for all response tokens, the objective reduces to standard SFT, where every response token is supervised. Sparse supervision instead sets $m_{i,t}=0$ for most tokens by default and only enables selected tokens with $m_{i,t}=1$. Tokens with $m_{i,t}=0$ are still kept in the autoregressive context, but they are ignored in the loss computation.
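As a minimal sketch of Eq. (1), the snippet below averages the negative log-likelihood over supervised positions only. It assumes per-token log-probabilities have already been computed by some model; the function name `masked_ntp_loss` and the toy probabilities are illustrative and not taken from the paper.

```python
import math


def masked_ntp_loss(log_probs, mask):
    """Mean negative log-likelihood over tokens whose mask entry is 1.

    log_probs[i][t] stands for log p_theta(y_{i,t} | x_i, y_{i,<t});
    mask[i][t] is m_{i,t} in {0, 1}.
    """
    total, count = 0.0, 0
    for lp_row, m_row in zip(log_probs, mask):
        for lp, m in zip(lp_row, m_row):
            if m:
                total -= lp
                count += 1
    if count == 0:
        raise ValueError("mask selects no tokens")
    return total / count


# Toy batch with two responses and made-up token probabilities.
log_probs = [
    [math.log(0.5), math.log(0.25), math.log(0.9)],
    [math.log(0.1), math.log(0.8)],
]
full_mask = [[1, 1, 1], [1, 1]]    # m = 1 everywhere: standard SFT
sparse_mask = [[0, 1, 0], [1, 0]]  # only two tokens are supervised

full_loss = masked_ntp_loss(log_probs, full_mask)
sparse_loss = masked_ntp_loss(log_probs, sparse_mask)
```

Masked-out tokens never enter the numerator or the denominator, which matches the statement that they stay in the context but are ignored by the loss.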

Prior fine-grained selection methods (Pang et al., [2025](https://arxiv.org/html/2606.18286#bib.bib19); Lin et al., [2024a](https://arxiv.org/html/2606.18286#bib.bib18)) usually construct $m_{i,t}$ by scoring individual tokens, for example using token-level loss or excess-loss signals, and then retaining high-scoring tokens. However, this pointwise view is less suitable for code, where the semantics of a token often depend on local syntactic closure and definition–use dependencies. This motivates our later use of structure-complete coding items as the basic units for sparse supervision.

![Refer to caption](https://arxiv.org/html/2606.18286v1/x3.png)

Figure 3: Motivating analysis of token-level sparse supervision in code. Left: fragmentation rate of coding items under different token budgets. Middle: closure cost for expanding selected tokens into syntactically complete fragments. Right: same-budget mini-SFT results comparing isolated top-token supervision with local closure supervision.

## 4 Motivating Analysis

Token\-level selection assumes that high\-scoring tokens are suitable training targets, but this assumption may not hold for code, where token semantics often depend on surrounding syntactic and dependency structures\. We therefore first examine whether high\-scoring code tokens can directly serve as sparse supervision units\.

Definition of coding items. To measure whether selected tokens form meaningful code evidence, we introduce coding items as the basic structural unit in this analysis. For each code response $y_i$, we partition its code regions into a set of local syntactic units:

$$\mathcal{U}_i=\{u_{i,1},\ldots,u_{i,K_i}\}. \tag{2}$$

A coding item is a minimal local unit that expresses a coherent computation, such as an assignment, branch condition, API call, or return expression. For each item $u$, we use $C(u)$ to denote its core logic tokens, such as identifiers, literals, operators, and API names, and $M(u)$ to denote its materialized closed fragment, which additionally includes boundary and structural tokens required for local syntactic completeness. In this section, coding items are used only for diagnostic analysis; Sec. [5.2](https://arxiv.org/html/2606.18286#S5.SS2) later uses the same units for sparse supervision selection.

Setting and evaluation. We sample 30K code responses from OpenCodeInstruct (Ahmad et al., [2025](https://arxiv.org/html/2606.18286#bib.bib15)) and use a frozen Qwen2.5-Coder-1.5B-Instruct (Hui et al., [2024](https://arxiv.org/html/2606.18286#bib.bib8)) to compute token-level CE scores. For each budget, we select the top-scoring code tokens and map them to their enclosing syntactic items. We report two structural diagnostics: the incomplete-item ratio, i.e., the fraction of touched items that are only partially selected, and the closure expansion ratio, i.e., the number of tokens required after syntactic closure divided by the number of originally selected tokens. To measure downstream impact, we further run a same-budget mini-SFT experiment on the same 30K subset, training for 3,000 steps with the same base model and exactly the same number of supervised code tokens.
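The two structural diagnostics can be sketched as follows. This is an illustrative reading of the definitions above, assuming each coding item is given as the set of token indices in its closed fragment $M(u)$; the function name and the toy indices are hypothetical.

```python
def structural_diagnostics(items, selected):
    """Return (incomplete_item_ratio, closure_expansion_ratio).

    items: list of sets of token indices, one closed fragment per coding item.
    selected: set of token indices picked by pointwise top-token selection.
    """
    if not selected:
        raise ValueError("no tokens selected")
    # An item is "touched" if at least one of its tokens was selected.
    touched = [item for item in items if item & selected]
    if not touched:
        raise ValueError("selected tokens touch no coding item")
    # A touched item is incomplete if some of its tokens were left out.
    incomplete = sum(1 for item in touched if not item <= selected)
    # Closing every touched item yields the token set needed for closure.
    # Assumption: selected tokens outside any item are kept as they are.
    closed = set().union(*touched) | selected
    return incomplete / len(touched), len(closed) / len(selected)


# Three items; the selection fully covers the second and grazes the first.
items = [{0, 1, 2, 3}, {4, 5, 6}, {7, 8, 9, 10, 11}]
selected = {1, 4, 5, 6}
incomplete_ratio, expansion_ratio = structural_diagnostics(items, selected)
# incomplete_ratio == 0.5, expansion_ratio == 1.75
```

In this toy case one of the two touched items is only partially selected, and closing it grows the supervised set from 4 to 7 tokens.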

High-scoring code tokens are informative but structurally incomplete. As shown in Figure [3](https://arxiv.org/html/2606.18286#S3.F3), pointwise token selection severely fragments code structure: even at the top-10% budget, 92.3% of touched items are incomplete, and recovering syntactically closed fragments requires a 7.02× closure expansion.

Structural incompleteness hurts downstream learning. Under the controlled mini-SFT setting, local code closures improve the four-benchmark average by 11.70 points over isolated high-CE token supervision, even though the added closure tokens are not always the highest-scoring ones. These results suggest that high-scoring tokens are useful anchors for locating informative code evidence, but sparse supervision in code should be applied to structure-complete fragments rather than isolated tokens.

![Refer to caption](https://arxiv.org/html/2606.18286v1/x4.png)

Figure 4: Overview of CodeBlock, which combines sample-level selection, GCE-based coding item scoring, data-flow-guided item selection, and sparse fine-tuning for efficient code LLM supervised fine-tuning.

## 5 Method

In this section, we propose CodeBlock, a structure-aware sparse supervision method for supervised fine-tuning of code LLMs. As illustrated in Figure [4](https://arxiv.org/html/2606.18286#S4.F4), CodeBlock consists of four main components: sample-level selection, GCE-based coding item scoring, data-flow-guided item selection, and sparse fine-tuning.

### 5.1 Sample-level Selection

We first select a compact training subset from the large code instruction pool. Each instruction-response pair is scored by Qwen2.5-Coder-32B-Instruct (Hui et al., [2024](https://arxiv.org/html/2606.18286#bib.bib8)) for instruction following and code correctness, and by a lightweight long-tail score based on structural and task features such as control flow, API usage, and code complexity. We rank sample $i$ by $s_i^{\mathrm{sample}}=\alpha s_i^{\mathrm{qual}}+(1-\alpha)s_i^{\mathrm{tail}}$. Here, $s_i^{\mathrm{qual}}$ is a normalized quality score derived from LLM-judge ratings, with a small auxiliary calibration term from available unit test scores. The long-tail score $s_i^{\mathrm{tail}}$ measures the rarity of structural and task-level features, including API usage, control-flow patterns, and coarse code-complexity buckets. More selection details are provided in Appendix [A.1](https://arxiv.org/html/2606.18286#A1.SS1). After deduplication, the top 30K samples form $\mathcal{S}$ for coding-item-level supervision selection.
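A minimal sketch of the sample-level ranking step is shown below. It assumes the quality and long-tail scores are already normalized to $[0,1]$; the value `alpha=0.7`, the field names, and the toy scores are hypothetical, since the actual settings are deferred to the appendix.

```python
def rank_samples(samples, alpha=0.7, top_k=2):
    """Rank samples by alpha * qual + (1 - alpha) * tail and keep the top_k.

    alpha=0.7 is an illustrative value, not the paper's setting.
    """
    def sample_score(s):
        return alpha * s["qual"] + (1 - alpha) * s["tail"]

    return sorted(samples, key=sample_score, reverse=True)[:top_k]


# Toy pool: "b" has lower judged quality than "a" but rarer structure.
pool = [
    {"id": "a", "qual": 0.9, "tail": 0.1},
    {"id": "b", "qual": 0.6, "tail": 0.9},
    {"id": "c", "qual": 0.5, "tail": 0.2},
]
top = rank_samples(pool)
top_ids = [s["id"] for s in top]
```

With these toy numbers the long-tail term lifts sample "b" above "a", which illustrates how the mixture trades judged quality against structural rarity.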

### 5.2 GCE-Based Coding Item Scoring

After sample selection, CodeBlock selects where to apply supervision within each response.

For natural-language regions, we directly apply token-level GCE scoring and keep the top-$\rho_{\text{NL}}$ fraction of tokens. For code regions, we first construct coding items through a lightweight local-closure procedure, as detailed in Algorithm [1](https://arxiv.org/html/2606.18286#alg1). Specifically, we extract code regions from each response and parse them with Tree-sitter ([https://tree-sitter.github.io/tree-sitter/](https://tree-sitter.github.io/tree-sitter/)), then align syntax spans back to model-token indices using character offsets. Each token is labeled as a core logic token, a protected syntax token, or an other token. Core logic tokens, such as identifiers, literals, arithmetic/comparison operators, and API names, form the core set $C(u)$ of a coding item. Protected syntax tokens, such as brackets, commas, colons, delimiters, and statement boundaries, are used to materialize a locally closed fragment $M(u)$, while other tokens act as separators. During the left-to-right scan, we close the current item when reaching a top-level statement boundary, crossing an unrelated separator, or reaching the end of the code region. We then aggregate token-level learning signals over $C(u)$ for utility estimation, but apply supervision to $M(u)$ so that selected targets remain syntactically coherent.

For a coding item $u$, we define its utility by averaging token-level GCE scores over its core logic tokens:

$$S_{\mathrm{GCE}}(u)=\frac{1}{|C(u)|}\sum_{t\in C(u)}\ell_{i,t}^{\mathrm{GCE}}, \tag{3}$$

where $\ell_{i,t}^{\mathrm{GCE}}=\frac{1-p_{i,t}^{q}}{q}$, $p_{i,t}=p_{\theta_0}(y_{i,t}\mid x_i,y_{i,<t})$, and $\theta_0$ is the frozen base model. Here, $q\in(0,1]$ controls the degree of loss tempering.

We use GCE rather than raw CE to temper extremely low-probability outliers, which in code often correspond to rare identifiers, unusual literals, or noisy fragments. The item score is computed on $C(u)$ to measure uncertainty over logic-bearing tokens, while supervision is applied to $M(u)$ to preserve local syntactic completeness.
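The item score in Eq. (3) can be sketched in a few lines. The snippet assumes the frozen model's per-token probabilities are available as a list; `q=0.7` is an illustrative default rather than the paper's setting, and the function names are hypothetical.

```python
import math


def gce_loss(p, q=0.7):
    """Generalized cross-entropy (1 - p**q) / q for a target probability p."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - p ** q) / q


def item_gce_score(token_probs, core_indices, q=0.7):
    """Average GCE over the core logic tokens C(u) of one coding item."""
    if not core_indices:
        raise ValueError("coding item has no core tokens")
    return sum(gce_loss(token_probs[t], q) for t in core_indices) / len(core_indices)


# Tempering: raw CE grows without bound as p -> 0, while GCE stays below 1/q.
rare_token_ce = -math.log(1e-6)
rare_token_gce = gce_loss(1e-6, q=0.7)

# Item whose core tokens sit at positions 0 and 2; position 1 is syntax.
probs = [0.5, 0.9, 0.1]
score_q1 = item_gce_score(probs, core_indices=[0, 2], q=1.0)
```

At $q=1$ the loss reduces to $1-p$, and as $q$ approaches 0 it approaches the usual $-\log p$, so $q$ interpolates between a bounded loss and raw cross-entropy.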

Algorithm 1: CodeBlock Coding-Item Construction with Local Closure

Input: response $y_i$; response-token indices $\mathcal{T}_i$ with character offsets; code regions $\mathcal{F}_i$; protected syntax set $\Omega_{\mathrm{P}}$

Output: coding items $\mathcal{U}_i=\{u_{i,k}\}_{k=1}^{K_i}$, where each item $u=(C(u),M(u))$

- $\mathcal{U}_i\leftarrow\emptyset$
- for each code region $f\in\mathcal{F}_i$ do
  - $z\leftarrow\mathrm{LabelCodeTokens}(f,\mathcal{T}_i,\Omega_{\mathrm{P}})$ ▷ $z_r\in\{\mathrm{C},\mathrm{P},\mathrm{O}\}$
  - $C_{\mathrm{cur}}\leftarrow\emptyset$, $d\leftarrow 0$
  - for each token index $r$ in $f$ from left to right do
    - if $z_r=\mathrm{C}$ then
      - $C_{\mathrm{cur}}\leftarrow C_{\mathrm{cur}}\cup\{r\}$
    - else if $C_{\mathrm{cur}}\neq\emptyset$ then
      - if $\mathrm{Boundary}(t_r,d)$ or $z_r=\mathrm{O}$ then
        - $M\leftarrow\mathrm{LocalClosure}(C_{\mathrm{cur}},f,z)$
        - $\mathcal{U}_i\leftarrow\mathcal{U}_i\cup\{(C_{\mathrm{cur}},M)\}$
        - $C_{\mathrm{cur}}\leftarrow\emptyset$
      - end if
    - end if
    - $d\leftarrow\mathrm{UpdateDepth}(t_r,d)$
  - end for
  - if $C_{\mathrm{cur}}\neq\emptyset$ then
    - $M\leftarrow\mathrm{LocalClosure}(C_{\mathrm{cur}},f,z)$
    - $\mathcal{U}_i\leftarrow\mathcal{U}_i\cup\{(C_{\mathrm{cur}},M)\}$
  - end if
- end for
- return $\mathcal{U}_i$
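The scan in Algorithm 1 can be sketched for a single code region as below. This is a simplified illustration, not the authors' implementation: token labels are supplied by hand instead of coming from Tree-sitter, and the boundary rule (a newline or `;` at bracket depth 0) and the closure rule (extend the core span rightwards until brackets balance) are assumptions made for the example.

```python
OPEN, CLOSE = set("([{"), set(")]}")


def update_depth(tok, depth):
    """Track bracket nesting depth."""
    if tok in OPEN:
        return depth + 1
    if tok in CLOSE:
        return max(depth - 1, 0)
    return depth


def is_boundary(tok, depth):
    """Assumed rule: a newline or ';' outside any bracket ends a statement."""
    return depth == 0 and tok in {"\n", ";"}


def local_closure(core, tokens):
    """Assumed rule: span the core tokens, then extend right to balance brackets."""
    lo, hi = min(core), max(core)
    balance = sum((t in OPEN) - (t in CLOSE) for t in tokens[lo:hi + 1])
    while balance > 0 and hi + 1 < len(tokens):
        hi += 1
        balance += (tokens[hi] in OPEN) - (tokens[hi] in CLOSE)
    return list(range(lo, hi + 1))


def build_coding_items(tokens, labels):
    """Left-to-right scan; labels are 'C' (core), 'P' (protected), 'O' (other)."""
    items, core, depth = [], [], 0
    for r, (tok, z) in enumerate(zip(tokens, labels)):
        if z == "C":
            core.append(r)
        elif core and (is_boundary(tok, depth) or z == "O"):
            items.append((core, local_closure(core, tokens)))
            core = []
        depth = update_depth(tok, depth)
    if core:  # flush the item still open at the end of the region
        items.append((core, local_closure(core, tokens)))
    return items


# Hand-labelled tokens for:  x = foo(a, b)  /  y = x + 1
tokens = ["x", "=", "foo", "(", "a", ",", "b", ")", "\n", "y", "=", "x", "+", "1"]
labels = ["C", "P", "C", "P", "C", "P", "C", "P", "P", "C", "P", "C", "C", "C"]
coding_items = build_coding_items(tokens, labels)
```

Each returned pair is `(C(u), M(u))` as lists of token indices: the core set feeds utility scoring, while the closed fragment (which here picks up `=`, the commas, and the closing bracket) is what would receive supervision.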

### 5.3 Data-Flow Guided Item Selection

GCE identifies coding items that are uncertain under the base model, but uncertainty alone does not necessarily reflect structural importance. For example, a rare variable name or literal may obtain a high GCE score while affecting no later computation, whereas an intermediate assignment with a moderate GCE score may be reused by several subsequent statements and eventually determine the returned value. Therefore, CodeBlock further incorporates data-flow structure to prioritize items that are both learnable and dependency-critical.

For each code response, we construct a lightweight data-flow graph $G_i=(\mathcal{U}_i,\mathcal{E}_i)$, where each node is a coding item and an edge $(u,v)\in\mathcal{E}_i$ indicates that item $v$ uses a value defined or updated by item $u$. Based on this graph, we compute two normalized structural signals:

- Reach measures the downstream influence of an item. It captures whether the value defined or updated by item $u$ affects many later computations. Formally, we define $\operatorname{reach}(u)=|\mathrm{Reach}_{G_i}(u)|/|\mathcal{U}_i|$, where $\mathrm{Reach}_{G_i}(u)$ denotes the set of items reachable from $u$ in $G_i$.
- Bridge measures dependency-path connectivity. It captures whether item $u$ lies on important dependency chains that connect early definitions to terminal computations. Let $\Pi_i$ denote the set of source-to-sink paths in $G_i$, where source nodes introduce values and sink nodes correspond to return statements, printed outputs, or final assignments. We define $\operatorname{bridge}(u)=|\{\pi\in\Pi_i:u\in\pi\}|/|\Pi_i|$.
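The two signals above can be sketched on a small hand-built graph. The snippet makes several assumptions that the text does not fix: the graph is acyclic and given as an adjacency dict, an item is not counted as reachable from itself, sources are nodes without incoming edges, and sinks are approximated as nodes without outgoing edges.

```python
def reach_scores(graph):
    """reach(u) = |items reachable from u| / |U| (u itself excluded, by assumption)."""
    n = len(graph)
    scores = {}
    for u in graph:
        seen, stack = set(), list(graph[u])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(graph[v])
        scores[u] = len(seen) / n
    return scores


def bridge_scores(graph):
    """bridge(u) = fraction of source-to-sink paths that pass through u."""
    has_parent = {v for children in graph.values() for v in children}
    sources = [u for u in graph if u not in has_parent]
    paths = []

    def walk(u, prefix):
        path = prefix + [u]
        if not graph[u]:  # no outgoing edge: treated as a sink
            paths.append(path)
        for v in graph[u]:
            walk(v, path)

    for s in sources:
        walk(s, [])
    return {u: sum(u in p for p in paths) / len(paths) for u in graph}


# Items: 0: a = read()   1: b = a * 2   2: c = a + 1
#        3: return b + c 4: tmp = "unused"
graph = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: []}
reach = reach_scores(graph)
bridge = bridge_scores(graph)
```

In this toy graph item 0 has the largest reach (3 of 5 items), while the unused assignment (item 4) has zero reach and lies only on its own trivial path. Path enumeration is exponential in the worst case, so a production version would likely count paths with dynamic programming instead.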

We combine GCE utility with these data-flow signals through a gated priority function:

$$P_{\mathrm{CodeBlock}}(u)=S_{\mathrm{GCE}}(u)\Bigl[1+g(u)\,\lambda\bigl(\alpha_r\operatorname{reach}(u)+\alpha_b\operatorname{bridge}(u)\bigr)\Bigr], \tag{4}$$

where $g(u)=\mathbf{1}\left[S_{\mathrm{GCE}}(u)\geq Q_{\eta}^{(i)}\right]$. Here, $Q_{\eta}^{(i)}$ is the within-response GCE threshold for selecting the top-$\eta$ fraction of coding items, $\lambda$ controls the overall strength of the data-flow bonus, and $\alpha_r$ and $\alpha_b$ control the relative contributions of reach and bridge. Because our data-flow graph is constructed by lightweight static analysis, its structural signals may be approximate. The gate prevents structurally central but low-utility items from being over-selected due to noisy data-flow estimates, ensuring that data-flow only reranks items that are already sufficiently informative under GCE.
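A minimal sketch of the gated priority in Eq. (4) follows. The hyperparameter values (`eta`, `lam`, `alpha_r`, `alpha_b`) and the toy scores are hypothetical, and the threshold is taken as the score of the item at rank $\lceil\eta K\rceil$, which is one plausible reading of the top-$\eta$ quantile.

```python
import math


def codeblock_priority(s_gce, reach, bridge, eta=0.5, lam=1.0,
                       alpha_r=0.5, alpha_b=0.5):
    """Gated priority for all coding items of one response (parallel lists)."""
    # Within-response threshold: score of the item at rank ceil(eta * K).
    k = max(1, math.ceil(eta * len(s_gce)))
    threshold = sorted(s_gce, reverse=True)[k - 1]
    priorities = []
    for s, r, b in zip(s_gce, reach, bridge):
        gate = 1.0 if s >= threshold else 0.0  # g(u)
        bonus = gate * lam * (alpha_r * r + alpha_b * b)
        priorities.append(s * (1.0 + bonus))
    return priorities


# Item 0 is structurally central but has low GCE; items 1 and 2 pass the gate.
s_gce = [0.2, 0.8, 0.6, 0.1]
reach = [0.9, 0.1, 0.5, 0.0]
bridge = [1.0, 0.2, 0.6, 0.0]
priority = codeblock_priority(s_gce, reach, bridge)
```

With these toy numbers item 0 keeps its raw score of 0.2 despite high reach and bridge values, because the gate blocks its bonus. Item 2 (roughly 0.93) overtakes item 1 (roughly 0.92), which shows data-flow reranking acting only among items that are already informative under GCE.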

Finally, we derive the code-side supervision mask from the priority scores. Let $\operatorname{Top}_{\rho_{\mathrm{code}}}(\mathcal{U}_i;P_{\mathrm{CodeBlock}})$ return the highest-priority coding items whose materialized closed fragments $M(u)$ fit within the code-token budget ratio $\rho_{\mathrm{code}}$. The selected code positions and the corresponding supervision mask are defined as:

$$\mathcal{M}^{\mathrm{code}}_{i}=\bigcup_{u\in\operatorname{Top}_{\rho_{\mathrm{code}}}(\mathcal{U}_i;P_{\mathrm{CodeBlock}})}M(u),\qquad m^{\mathrm{code}}_{i,t}=\mathbf{1}\left[t\in\mathcal{M}^{\mathrm{code}}_{i}\right]. \tag{5}$$

Here, $M(u)$ is the closed fragment of item $u$. Tokens with $m^{\mathrm{code}}_{i,t}=1$ contribute to the sparse next-token prediction loss, while the remaining code tokens are kept as autoregressive context but ignored in the loss computation.
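One way to realize the budgeted top selection and the resulting mask is a greedy pass over items in priority order, sketched below. The text does not say whether an item that overflows the budget ends the selection or is skipped; this sketch skips it and continues, which is an assumption. The budget ratio of 0.5 is a toy value chosen for readability.

```python
def select_code_mask(fragments, priorities, num_code_tokens, seq_len, rho_code):
    """Greedy budgeted selection of closed fragments M(u).

    fragments: list of lists of token indices, one closed fragment per item.
    Returns a 0/1 mask of length seq_len.
    """
    budget = int(rho_code * num_code_tokens)
    chosen = set()
    order = sorted(range(len(fragments)), key=lambda i: priorities[i], reverse=True)
    for idx in order:
        new_tokens = set(fragments[idx]) - chosen
        # Assumption: an item that does not fit is skipped, not a stop signal.
        if len(chosen) + len(new_tokens) <= budget:
            chosen |= new_tokens
    return [1 if t in chosen else 0 for t in range(seq_len)]


# Three fragments in a 12-token code region with room for 6 supervised tokens.
fragments = [[0, 1, 2, 3], [5, 6, 7], [9, 10]]
priorities = [0.93, 0.92, 0.2]
code_mask = select_code_mask(fragments, priorities,
                             num_code_tokens=12, seq_len=12, rho_code=0.5)
# code_mask == [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0]
```

Because whole fragments are added or rejected, the mask never contains a partially supervised coding item, which is the property that distinguishes item-level from token-level selection.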

Table 1: Performance comparison of different baselines on various benchmarks. We highlight the best result in boldface and the second-best with underline.

### 5.4 Sparse Fine-Tuning

Given the selected subset $\mathcal{S}$, we instantiate the sparse supervision mask defined in Sec. 3 by combining the code-side and natural-language-side masks. For code regions, we use $m^{\text{code}}_{i,t}$ from the data-flow-guided item selection in Sec. 5.3. For natural-language regions, we keep the top-$\rho_{\text{NL}}$ fraction of tokens according to GCE scores, denoted by $m^{\text{NL}}_{i,t}$. The final mask is

$$m_{i,t}=\mathbb{1}\left[m^{\text{code}}_{i,t}=1\lor m^{\text{NL}}_{i,t}=1\right]. \tag{6}$$

We then optimize the masked next-token prediction objective in Eq. (1), with unselected response tokens kept in the input sequence but ignored by the loss.
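The final mask in Eq. (6) is an element-wise OR of the two masks. The sketch below also shows one way to build the natural-language mask by keeping the top fraction of natural-language positions by score; the helper names, the toy scores, and the ratio 0.5 are illustrative assumptions.

```python
def top_fraction_mask(scores, positions, rho, seq_len):
    """Keep the top-rho fraction of the given positions, ranked by score."""
    k = int(rho * len(positions))
    keep = set(sorted(positions, key=lambda t: scores[t], reverse=True)[:k])
    return [1 if t in keep else 0 for t in range(seq_len)]


def combine_masks(code_mask, nl_mask):
    """Element-wise OR of the code-side and natural-language-side masks."""
    if len(code_mask) != len(nl_mask):
        raise ValueError("masks must have the same length")
    return [int(c or n) for c, n in zip(code_mask, nl_mask)]


# 8-token response: positions 0, 1, 6, 7 are prose; positions 2 to 5 are code.
gce_scores = [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.2, 0.4]
nl_mask = top_fraction_mask(gce_scores, positions=[0, 1, 6, 7], rho=0.5, seq_len=8)
code_mask = [0, 0, 1, 1, 1, 0, 0, 0]  # e.g. one selected closed fragment
final_mask = combine_masks(code_mask, nl_mask)
supervised_fraction = sum(final_mask) / len(final_mask)
```

The resulting 0/1 list plays the role of $m_{i,t}$ in Eq. (1), and the supervised fraction is the kind of quantity behind the effective-token percentage reported in the paper.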

Table 2: We remove key components of CodeBlock to evaluate their contributions. The best result is shown in bold and the second-best result is underlined.

## 6 Experiments

### 6.1 Experimental Setup

Dataset. We construct our data pool by randomly sampling 300K instruction-tuning examples from OpenCodeInstruct (Ahmad et al., [2025](https://arxiv.org/html/2606.18286#bib.bib15)). All training subsets and baseline variants are derived from this same pool to ensure a controlled comparison across data selection strategies. In the sample-level selection stage, we rank this 300K pool and take the top-ranked 30K examples as the curated training subset.

Base Models. We evaluate our method on a diverse set of open-source code and general instruction-tuned LLMs, including Qwen2.5-Coder-1.5B-Instruct, Qwen2.5-Coder-3B-Instruct, Qwen2.5-Coder-7B-Instruct (Hui et al., [2024](https://arxiv.org/html/2606.18286#bib.bib8)), Seed-Coder-8B (Seed et al., [2025](https://arxiv.org/html/2606.18286#bib.bib36)), and OpenCoder-8B-Base (Huang et al., [2025](https://arxiv.org/html/2606.18286#bib.bib37)). These models are fine-tuned using samples selected from our data pool under the corresponding experimental settings.

Baselines. We compare our method against several representative data selection and token selection baselines. 1) Full Tokens fine-tunes the model on all tokens from the selected training examples without token-level filtering. 2) Random samples training examples or response tokens uniformly at random under the same data or token budget. 3) DS2 (Pang et al., [2024](https://arxiv.org/html/2606.18286#bib.bib13)) selects a curated subset using score-based data selection that favors high-quality and diverse examples. 4) Token Cleaning (Pang et al., [2025](https://arxiv.org/html/2606.18286#bib.bib19)) removes low-utility tokens according to a fixed-model token-level filtering criterion before fine-tuning. 5) CLAM (Tsai et al., [2024](https://arxiv.org/html/2606.18286#bib.bib22)) performs clustering- and diversity-based sample selection to construct a representative training subset.

Evaluation. We evaluate the fine-tuned models on six code generation benchmarks, including HumanEval, HumanEval+ (Chen et al., [2021](https://arxiv.org/html/2606.18286#bib.bib38)), MBPP, MBPP+ (Austin et al., [2021](https://arxiv.org/html/2606.18286#bib.bib39)), BigCodeBench-Hard, and BigCodeBench-Full (Zhuo et al., [2025](https://arxiv.org/html/2606.18286#bib.bib40)). These benchmarks provide a broad assessment of code generation performance, covering functional correctness, instruction following, and robustness across programming tasks of varying difficulty. We report pass@1 for each benchmark and use the average score over the six tasks as the main metric. Unless otherwise specified, we use the default settings in Table [3](https://arxiv.org/html/2606.18286#A1.T3) for all experiments. More evaluation and training details are provided in Appendix [A.4](https://arxiv.org/html/2606.18286#A1.SS4).

### 6\.2 Main Results

CodeBlock achieves the best performance–token trade\-off\. CodeBlock achieves the best or second\-best overall performance across all three models while supervising only a very small fraction of response tokens\. On Qwen2\.5\-Coder\-1\.5B\-Instruct, Seed\-Coder\-8B, and OpenCoder\-8B\-Base, CodeBlock obtains average scores of 54\.6, 65\.5, and 57\.1, respectively, consistently outperforming full\-token fine\-tuning despite supervising only 1\.9% of response tokens\. Compared with Full Tokens, CodeBlock improves the average score by 1\.9, 0\.4, and 2\.6 points on the three models, showing that dense supervision over all response tokens is not always necessary for effective code fine\-tuning\. Moreover, compared with strong baselines, CodeBlock achieves a better balance between supervision cost and downstream performance\. These results indicate that CodeBlock can identify more informative and structurally useful supervision targets, leading to stronger performance with substantially fewer supervised tokens\.

Structural completeness is crucial for token selection in code\. Although Token Cleaning is designed as a general token\-level data selection method, its performance is unstable in the code domain\. For example, it underperforms the base model in terms of average score, and shows particularly weak results on several structure\-sensitive benchmarks\. This supports our motivation that code tokens should not be treated as independent textual units: their utility depends on syntactic completeness, local program structure, and data\-flow dependencies\. In contrast, CodeBlock explicitly groups informative tokens into structure\-preserving code fragments and further prioritizes fragments with stronger data\-flow relevance, leading to more reliable improvements across benchmarks\.

CodeBlock is effective on both base and instruction\-tuned models\. As shown in Table [1](https://arxiv.org/html/2606.18286#S5.T1), CodeBlock consistently achieves strong average scores across different model initializations, including both base models and the instruction\-tuned model\. This indicates that our sparse supervision strategy is not only useful for improving base models, but also remains effective for models that have already undergone instruction tuning\. Additional instruction\-tuned model results are reported in Table [6](https://arxiv.org/html/2606.18286#A2.T6), further confirming the generality of CodeBlock\.

![Refer to caption](https://arxiv.org/html/2606.18286v1/x5.png)

Figure 5: Sensitivity analysis of NL and code supervision ratios\.

![Refer to caption](https://arxiv.org/html/2606.18286v1/x6.png)

Figure 6: Sensitivity analysis of token scoring sources \(left\) and gating values \(right\)\.

### 6\.3 Ablation Study

Component Ablation Study\. To isolate the contribution of each component, we conduct ablations under the same training and evaluation setting as the main experiment on Qwen2\.5\-Coder\-1\.5B\-Instruct\. Token\-off removes the token\-level selection mechanism, sample\-off replaces the selected training samples with random samples, GCE\-only uses only GCE\-based item scoring without data\-flow\-aware reranking, and dataflow\-only keeps only the data\-flow\-based component\. CodeBlock achieves the best overall performance, as shown in Table [2](https://arxiv.org/html/2606.18286#S5.T2)\. These results show that no single component \(token\-level selection, sample selection, GCE scoring, or data\-flow information\) is sufficient on its own to match the full method, suggesting that CodeBlock benefits from the combination of gradient\-based token utility and structural flow information\.

Sensitivity to NL and Code Supervision Ratios\. We study the sensitivity of CodeBlock to the supervision ratios of natural language and code tokens\. The default setting uses $\rho_{\mathrm{NL}}=0.6$ and $\rho_{\mathrm{code}}=0.33$\. As shown in Figure [5](https://arxiv.org/html/2606.18286#S6.F5), increasing the NL\-keep ratio under a fixed code budget improves the average score from 51\.3 to 54\.6, with the best result at 0\.6\. When fixing the NL ratio, increasing the code budget from 0 to 0\.33 also yields clear gains, confirming the importance of selected code supervision\. However, larger code budgets bring no further improvement and may reduce performance, suggesting that excessive supervision introduces redundant or less informative tokens\.

Sensitivity to Token Scoring and Gating\. We first compare four token\-level scoring sources before data\-flow reranking: GCE, CE, focal loss, and label\-smoothed CE\. As shown in Figure [6](https://arxiv.org/html/2606.18286#S6.F6), GCE achieves the best average score of 54\.6, outperforming all alternative scoring losses, indicating that it provides a more reliable utility signal by reducing the influence of extremely low\-probability outliers\. We then vary the gating threshold $\eta\in\{10,30,50,70,100\}$ and find that moderate thresholds work best: both $\eta=30$ and $\eta=50$ reach 54\.6, while lower or higher thresholds reduce performance\. This suggests that data\-flow reranking should be applied to sufficiently informative items, avoiding both low\-utility noise and overly aggressive filtering\.
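
The outlier argument above can be made concrete with a small numeric sketch. It assumes the commonly used generalized cross-entropy form $(1 - p^q)/q$, which is bounded by $1/q$ and approaches standard cross-entropy as $q \to 0$. The exact GCE form used by CodeBlock is not restated in this section, so the exponent `q=0.7` and the mean aggregation in `item_utility` are illustrative assumptions, and all function names are hypothetical.

```python
import math


def ce(p: float) -> float:
    """Standard cross-entropy for the gold token with predicted probability p."""
    return -math.log(p)


def gce(p: float, q: float = 0.7) -> float:
    """Generalized cross-entropy (1 - p^q) / q; q = 0.7 is an assumed value."""
    return (1.0 - p ** q) / q


def item_utility(token_probs, q: float = 0.7) -> float:
    """Hypothetical aggregation: mean GCE over the tokens of one coding item."""
    return sum(gce(p, q) for p in token_probs) / len(token_probs)


if __name__ == "__main__":
    # CE grows without bound as p -> 0, while GCE saturates below 1/q.
    for p in (0.9, 0.5, 0.1, 1e-6):
        print(f"p={p:<8g} CE={ce(p):8.3f}  GCE={gce(p):6.3f}")
```

Because GCE saturates, a single token with near-zero probability cannot dominate an item-level score the way it would under plain CE.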

## 7 Conclusion

In this paper, we proposed CodeBlock, a structure\-aware sparse supervision framework for code LLM fine\-tuning\. CodeBlock selects structure\-complete coding items rather than isolated high\-scoring tokens, and prioritizes them with GCE utility and data\-flow signals\. Experiments across multiple code\-generation benchmarks show that CodeBlock achieves competitive or better performance than full\-token SFT and strong selection baselines while using far fewer supervised response tokens\. These results suggest that sparse supervision for code should preserve the syntactic and dependency context in which informative tokens become meaningful, rather than treating tokens as independent learning targets\. By keeping full responses as context and applying loss only to informative, dependency\-critical code fragments, CodeBlock decouples contextual exposure from gradient supervision and offers a practical route toward efficient code model adaptation\.

## Limitations

The current implementation of CodeBlock uses lightweight static data\-flow analysis\. Although efficient and scalable, the constructed def\-use graph may not fully capture runtime\-dependent behaviors such as dynamic dispatch, aliasing, object mutation, or input\-dependent execution paths\. Thus, reach and bridge should be viewed as approximate structural signals\. Future work could incorporate execution traces or more precise program analysis to build richer dependency graphs\.
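
To illustrate why a static def-use graph is only approximate, the following minimal sketch builds name-based def-use edges between top-level statements with Python's `ast` module. It is not the CodeBlock implementation: the statement-level granularity, the helper names, and the `reach` proxy (a plain count of transitively dependent statements) are assumptions made for illustration.

```python
import ast


def def_use_edges(source: str):
    """Edges (i, j): statement j reads a name last bound by statement i."""
    tree = ast.parse(source)
    last_def = {}  # name -> index of the statement that most recently bound it
    edges = set()
    for j, stmt in enumerate(tree.body):
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        loads = {n.id for n in names if isinstance(n.ctx, ast.Load)}
        stores = {n.id for n in names if isinstance(n.ctx, ast.Store)}
        for name in loads:
            if name in last_def and last_def[name] != j:
                edges.add((last_def[name], j))
        for name in stores:
            last_def[name] = j
    return edges


def reach(edges, n: int):
    """Hypothetical reach proxy: number of statements downstream of each one."""
    succ = {i: set() for i in range(n)}
    for i, j in edges:
        succ[i].add(j)
    counts = []
    for i in range(n):
        seen, stack = set(), [i]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        counts.append(len(seen))
    return counts


# `b` aliases `a`, so `b.append(3)` changes what `sum(a)` computes at runtime.
SRC = "a = [1, 2]\nb = a\nb.append(3)\ntotal = sum(a)\n"

if __name__ == "__main__":
    edges = def_use_edges(SRC)
    print(sorted(edges), reach(edges, 4))
```

In this snippet the name-based analysis links `total = sum(a)` only to the original assignment of `a`; the mutation through the alias `b` produces no edge, which is the kind of aliasing gap the limitation describes.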

## References

- Opencodeinstruct: a large\-scale instruction tuning dataset for code llms\. arXiv preprint arXiv:2504\.04030\. Cited by: [§4](https://arxiv.org/html/2606.18286#S4.p3.1), [§6\.1](https://arxiv.org/html/2606.18286#S6.SS1.p1.1)\.
- M\. Allamanis, M\. Brockschmidt, and M\. Khademi \(2017\) Learning to represent programs with graphs\. arXiv preprint arXiv:1711\.00740\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p2.1)\.
- J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. Cai, M\. Terry, Q\. Le, et al\. \(2021\) Program synthesis with large language models\. arXiv preprint arXiv:2108\.07732\. Cited by: [§6\.1](https://arxiv.org/html/2606.18286#S6.SS1.p4.1)\.
- L\. Chen, S\. Li, J\. Yan, H\. Wang, K\. Gunaratna, V\. Yadav, Z\. Tang, V\. Srinivasan, T\. Zhou, H\. Huang, et al\. \(2023\) Alpagasus: training a better alpaca with fewer data\. arXiv preprint arXiv:2307\.08701\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p1.1), [§2](https://arxiv.org/html/2606.18286#S2.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. D\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, et al\. \(2021\) Evaluating large language models trained on code\. arXiv preprint arXiv:2107\.03374\. Cited by: [§6\.1](https://arxiv.org/html/2606.18286#S6.SS1.p4.1)\.
- Y\. Chen, Y\. Li, K\. Hu, M\. Zerun, H\. HaochenYe, and K\. Chen \(2025\) Mig: automatic data selection for instruction tuning by maximizing information gain in semantic space\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 9902–9915\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p1.1)\.
- Z\. Deng, Z\. Shen, L\. Li, Y\. Zhou, Z\. Zhu, Y\. He, W\. Wang, and J\. Wei \(2025\) LM\-mixup: text data augmentation via language model based mixup\. arXiv preprint arXiv:2510\.20449\. Cited by: [§2](https://arxiv.org/html/2606.18286#S2.p1.1)\.
- Y\. Fu, F\. Hamman, and S\. Dutta \(2026\) T\-shirt: token\-selective hierarchical data selection for instruction tuning\. Advances in Neural Information Processing Systems 38, pp\. 113932–113958\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p2.1)\.
- D\. Guo, S\. Ren, S\. Lu, Z\. Feng, D\. Tang, S\. Liu, L\. Zhou, N\. Duan, A\. Svyatkovskiy, S\. Fu, et al\. \(2020\) Graphcodebert: pre\-training code representations with data flow\. arXiv preprint arXiv:2009\.08366\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p2.1)\.
- D\. Guo, Q\. Zhu, D\. Yang, Z\. Xie, K\. Dong, W\. Zhang, G\. Chen, X\. Bi, Y\. Wu, Y\. Li, et al\. \(2024\) DeepSeek\-coder: when the large language model meets programming–the rise of code intelligence\. arXiv preprint arXiv:2401\.14196\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p1.1)\.
- S\. Huang, T\. Cheng, J\. K\. Liu, W\. Xu, J\. Hao, L\. Song, Y\. Xu, J\. Yang, J\. Liu, C\. Zhang, et al\. \(2025\) Opencoder: the open cookbook for top\-tier code large language models\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 33167–33193\. Cited by: [§6\.1](https://arxiv.org/html/2606.18286#S6.SS1.p2.1)\.
- B\. Hui, J\. Yang, Z\. Cui, J\. Yang, D\. Liu, L\. Zhang, T\. Liu, J\. Zhang, B\. Yu, K\. Lu, et al\. \(2024\) Qwen2\.5\-coder technical report\. arXiv preprint arXiv:2409\.12186\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p1.1), [§4](https://arxiv.org/html/2606.18286#S4.p3.1), [§5\.1](https://arxiv.org/html/2606.18286#S5.SS1.p1.5), [§6\.1](https://arxiv.org/html/2606.18286#S6.SS1.p2.1)\.
- M\. Li, Y\. Zhang, Z\. Li, J\. Chen, L\. Chen, N\. Cheng, J\. Wang, T\. Zhou, and J\. Xiao \(2024\) From quantity to quality: boosting llm performance with self\-guided data selection for instruction tuning\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 7602–7635\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p1.1)\.
- Y\. Li, Z\. Liu, Z\. Li, Z\. Lin, and J\. Zhang \(2026\) Token\-level data selection for safe llm fine\-tuning\. arXiv preprint arXiv:2603\.01185\. Cited by: [§2](https://arxiv.org/html/2606.18286#S2.p3.1)\.
- X\. Lin, Y\. Qi, Y\. Luo, T\. Palpanas, and Y\. Luo \(2026\) TokenTune: dual\-level utility estimation for scalable data selection in instruction tuning\. External Links: [Link](https://openreview.net/forum?id=LoPF9Zl7ic)\. Cited by: [§2](https://arxiv.org/html/2606.18286#S2.p3.1)\.
- Z\. Lin, Z\. Gou, Y\. Gong, X\. Liu, Y\. Shen, R\. Xu, C\. Lin, Y\. Yang, J\. Jiao, N\. Duan, et al\. \(2024a\) Rho\-1: not all tokens are what you need\. arXiv preprint arXiv:2404\.07965\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p2.1), [§2](https://arxiv.org/html/2606.18286#S2.p3.1), [§3](https://arxiv.org/html/2606.18286#S3.p3.1)\.
- Z\. Lin, T\. Liang, J\. Xu, Q\. Lin, X\. Wang, R\. Luo, C\. Shi, S\. Li, Y\. Yang, and Z\. Tu \(2024b\) Critical tokens matter: token\-level contrastive estimation enhances llm’s reasoning capability\. arXiv preprint arXiv:2411\.19943\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p2.1)\.
- W\. Liu, W\. Zeng, K\. He, Y\. Jiang, and J\. He \(2023\) What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning\. arXiv preprint arXiv:2312\.15685\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p1.1), [§2](https://arxiv.org/html/2606.18286#S2.p1.1)\.
- W\. Lyu, S\. Huang, and X\. Xia \(2025\) Efficient code llm training via distribution\-consistent and diversity\-aware data selection\. arXiv preprint arXiv:2507\.02378\. Cited by: [§2](https://arxiv.org/html/2606.18286#S2.p2.1)\.
- J\. Pang, N\. Di, Z\. Zhu, J\. Wei, H\. Cheng, C\. Qian, and Y\. Liu \(2025\) Token cleaning: fine\-grained data selection for llm supervised fine\-tuning\. In International Conference on Machine Learning, pp\. 47837–47858\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p2.1), [§2](https://arxiv.org/html/2606.18286#S2.p3.1), [§3](https://arxiv.org/html/2606.18286#S3.p3.1), [§6\.1](https://arxiv.org/html/2606.18286#S6.SS1.p3.1)\.
- J\. Pang, J\. Wei, A\. P\. Shah, Z\. Zhu, Y\. Wang, C\. Qian, Y\. Liu, Y\. Bao, and W\. Wei \(2024\) Improving data efficiency via curating llm\-driven rating systems\. arXiv preprint arXiv:2410\.10877\. Cited by: [§2](https://arxiv.org/html/2606.18286#S2.p1.1), [§6\.1](https://arxiv.org/html/2606.18286#S6.SS1.p3.1)\.
- X\. Qin, X\. Wang, N\. Liao, C\. Zhang, X\. Zhang, M\. Feng, J\. Wang, and J\. Yan \(2025\) SsToken: self\-modulated and semantic\-aware token selection for llm fine\-tuning\. arXiv preprint arXiv:2510\.18250\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p2.1)\.
- B\. Seed, Y\. Zhang, J\. Su, Y\. Sun, C\. Xi, X\. Xiao, S\. Zheng, A\. Zhang, K\. Liu, D\. Zan, et al\. \(2025\) Seed\-coder: let the code model curate data for itself\. arXiv preprint arXiv:2506\.03524\. Cited by: [§6\.1](https://arxiv.org/html/2606.18286#S6.SS1.p2.1)\.
- Y\. Tsai, M\. Liu, and H\. Ren \(2024\) Code less, align more: efficient llm fine\-tuning for code generation with data pruning\. arXiv preprint arXiv:2407\.05040\. Cited by: [§2](https://arxiv.org/html/2606.18286#S2.p2.1), [§6\.1](https://arxiv.org/html/2606.18286#S6.SS1.p3.1)\.
- Y\. Wang, K\. He, D\. Fu, Z\. GongQue, H\. Xu, Y\. Chen, Z\. Wang, Y\. Fu, G\. Dong, M\. Diao, et al\. \(2024\) How do your code llms perform? empowering code instruction tuning with really good data\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 14027–14043\. Cited by: [§2](https://arxiv.org/html/2606.18286#S2.p2.1)\.
- Y\. Wei, F\. Cassano, J\. Liu, Y\. Ding, N\. Jain, Z\. Mueller, H\. de Vries, L\. Von Werra, A\. Guha, and L\. Zhang \(2024a\) Selfcodealign: self\-alignment for code generation\. Advances in Neural Information Processing Systems 37, pp\. 62787–62874\. Cited by: [§2](https://arxiv.org/html/2606.18286#S2.p2.1)\.
- Y\. Wei, Z\. Wang, J\. Liu, Y\. Ding, and L\. Zhang \(2024b\) Magicoder: empowering code generation with oss\-instruct\. URL https://arxiv\.org/abs/2312\.02120\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p1.1)\.
- Y\. Wu, H\. Zhang, Y\. Jiao, L\. Ma, X\. Liu, J\. Yu, D\. Zhang, D\. Yu, and W\. Xu \(2024\) Rose: a reward\-oriented data selection framework for llm task\-specific instruction tuning\. arXiv preprint arXiv:2412\.00631\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p1.1)\.
- M\. Xia, S\. Malladi, S\. Gururangan, S\. Arora, and D\. Chen \(2024\) Less: selecting influential data for targeted instruction tuning\. arXiv preprint arXiv:2402\.04333\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p1.1), [§2](https://arxiv.org/html/2606.18286#S2.p1.1)\.
- J\. Yu, Y\. Wu, Y\. Zhan, W\. Guo, Z\. Xu, and R\. Lee \(2025\) Co\-learning: code learning for multi\-agent reinforcement collaborative framework with conversational natural language interfaces\. Frontiers in Artificial Intelligence 8, pp\. 1431003\. Cited by: [§1](https://arxiv.org/html/2606.18286#S1.p1.1)\.
- T\. Y\. Zhuo, M\. C\. Vu, J\. Chim, H\. Hu, W\. Yu, R\. Widyasari, I\. N\. B\. Yusuf, H\. Zhan, J\. He, I\. Paul, et al\. \(2025\) Bigcodebench: benchmarking code generation with diverse function calls and complex instructions\. In International Conference on Learning Representations, Vol\. 2025, pp\. 66602–66656\. Cited by: [§6\.1](https://arxiv.org/html/2606.18286#S6.SS1.p4.1)\.

## Appendix A Experimental Details

### A\.1 Sample\-Level Selection Settings

We perform sample\-level selection from the 300K OpenCodeInstruct \(OCI\) source pool before applying item\-level sparse supervision\. Instead of re\-querying an external judge model, we reuse the LLM\-judge scores provided by OCI and follow its scoring protocol\. Each sample is evaluated along three dimensions: requirement conformance, logical correctness, and edge\-case consideration\. We compute the mean judge score over these dimensions and combine it with a lightweight test\-based auxiliary score:

$$s^{\mathrm{raw}}_{\mathrm{qual}} = s_{\mathrm{judge}} + 0.2 \cdot s_{\mathrm{test}}, \tag{7}$$

where $s_{\mathrm{judge}}$ denotes the average LLM\-judge score and $s_{\mathrm{test}}$ denotes the average available test score\. The test score is used only as a small auxiliary signal for tie\-breaking and calibration, rather than as the dominant selection criterion\. We then apply robust min–max normalization to obtain the normalized quality score $s^{\mathrm{norm}}_{\mathrm{qual}} \in [0,1]$\.

To avoid selecting only common and easy high\-quality examples, we further compute a lightweight long\-tail score\. The long\-tail score combines tag\-level rarity and bucket\-level rarity:

$$s_{\mathrm{tail}} = 0.7 \cdot s_{\mathrm{tag}} + 0.3 \cdot s_{\mathrm{bucket}}. \tag{8}$$

Here, $s_{\mathrm{tag}}$ captures the presence of relatively rare program structures or API\-related tags, such as imports, classes, multiple functions, recursion, exception handling, graph\-related operations, date/time operations, and asynchronous programming\. The bucket score $s_{\mathrm{bucket}}$ measures rarity with respect to prompt length, code length, and coarse complexity buckets\.

The final sample\-level ranking score is computed as:

$$s_{\mathrm{sample}} = 0.7 \cdot s^{\mathrm{norm}}_{\mathrm{qual}} + 0.3 \cdot s_{\mathrm{tail}}. \tag{9}$$
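
The three scoring steps in Eqs. (7) to (9) can be sketched in a few lines of Python. The weights come from the equations above; everything else is an assumption for illustration. In particular, the text does not specify how the "robust" min–max normalization is computed, so the percentile clipping in `robust_minmax` (5th and 95th percentiles, nearest rank) is a hypothetical choice, as are the function names and the assumption that the tag and bucket scores already lie in [0, 1].

```python
def robust_minmax(values, lo_pct=5, hi_pct=95):
    """Clip to a percentile range, then scale to [0, 1].

    The percentile bounds are an assumed form of "robust" normalization.
    """
    ordered = sorted(values)
    n = len(ordered)

    def pct(p):
        # Nearest-rank percentile on the sorted values.
        idx = min(n - 1, max(0, round(p / 100 * (n - 1))))
        return ordered[idx]

    lo, hi = pct(lo_pct), pct(hi_pct)
    if hi == lo:
        return [0.0 for _ in values]
    return [min(1.0, max(0.0, (v - lo) / (hi - lo))) for v in values]


def sample_scores(judge, test, tag, bucket):
    """Combine per-sample signals into the final ranking score."""
    raw = [j + 0.2 * t for j, t in zip(judge, test)]         # Eq. (7)
    qual = robust_minmax(raw)                                 # normalized quality
    tail = [0.7 * a + 0.3 * b for a, b in zip(tag, bucket)]   # Eq. (8)
    return [0.7 * q + 0.3 * s for q, s in zip(qual, tail)]    # Eq. (9)


if __name__ == "__main__":
    scores = sample_scores(
        judge=[1.0, 2.0, 3.0], test=[0.0, 0.0, 0.0],
        tag=[0.0, 0.0, 0.0], bucket=[1.0, 1.0, 1.0],
    )
    # Rank samples from highest to lowest score.
    print(sorted(range(len(scores)), key=lambda i: -scores[i]))
```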

### A\.2 Token and Item Selection Settings

Table [3](https://arxiv.org/html/2606.18286#A1.T3) summarizes the default hyperparameters used in the token\- and item\-level selection stage of CodeBlock\. Unless otherwise specified, all experiments use the same configuration\.

Table 3: Default hyperparameters for token\- and item\-level supervision selection in CodeBlock\.

### A\.3 Fine\-Tuning Settings

We report the main fine\-tuning settings in Table [4](https://arxiv.org/html/2606.18286#A1.T4)\. For full\-token SFT, we train the model for 30,000 steps to provide a strong dense\-supervision baseline\. For CodeBlock and all sparse\-selection baselines, we use 3,000 training steps under the same optimization setup, so that the comparison focuses on the quality of selected supervision signals under a limited training budget\. All experiments are conducted on one NVIDIA A800 GPU with bf16 precision\.

For sparse\-supervision variants, the input sequence is kept unchanged: unselected response tokens remain available as autoregressive context, but their labels are set to the ignore index and therefore do not contribute to the cross\-entropy loss\. Thus, CodeBlock differs from standard SFT only in the supervision mask, while using the same causal language modeling objective\.
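
The masking scheme described above can be sketched in plain Python. This is a minimal illustration rather than the training code: the ignore value `-100` follows a common convention (it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`), and the function names, the span format, and the assumption that `log_probs[i]` already holds the model's prediction for position `i` are all hypothetical.

```python
import math

IGNORE_INDEX = -100  # assumed convention; positions with this label add no loss


def build_labels(token_ids, prompt_len, selected_spans):
    """Keep labels only for selected response positions.

    selected_spans holds half-open [start, end) ranges over the full sequence.
    The input tokens themselves are untouched, so every token stays in context.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in selected_spans:
        for i in range(max(start, prompt_len), min(end, len(token_ids))):
            labels[i] = token_ids[i]
    return labels


def masked_ce(log_probs, labels):
    """Mean negative log-likelihood over supervised positions only.

    log_probs[i] maps token id -> log-probability predicted for position i.
    """
    terms = [-log_probs[i][y] for i, y in enumerate(labels) if y != IGNORE_INDEX]
    return sum(terms) / len(terms) if terms else 0.0


if __name__ == "__main__":
    ids = [11, 12, 13, 14, 15, 16]          # 2 prompt tokens + 4 response tokens
    labels = build_labels(ids, prompt_len=2, selected_spans=[(3, 5)])
    toy_log_probs = [{t: math.log(0.5)} for t in ids]
    print(labels, masked_ce(toy_log_probs, labels))
```

Only the label vector changes between dense and sparse supervision here, which mirrors the statement that the method differs from standard SFT only in the supervision mask.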

Table 4: Fine\-tuning settings used for the main experiments\.

### A\.4 Evaluation Settings

For HumanEval and MBPP, we use EvalPlus and report sanitized pass@1 on HumanEval, HumanEval\+, MBPP, and MBPP\+\. For BigCodeBench, we evaluate instruction\-tuned models on the instruct split and base models on the complete split, reporting pass@1 on the hard and full subsets\. All reported pass@1 scores use greedy decoding with temperature 0\.0 and one generated sample per task\.
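
With greedy decoding and one sample per task, pass@1 reduces to the percentage of tasks whose single generated solution passes all tests, and the main metric is the mean over the six benchmarks. A minimal sketch of that arithmetic follows; the function names and the per-benchmark numbers in the usage example are made up for illustration and are not results from the paper.

```python
def pass_at_1(task_passed):
    """Percentage of tasks whose single greedy sample passed all tests."""
    return 100.0 * sum(task_passed) / len(task_passed)


def average_score(per_benchmark):
    """Unweighted mean of per-benchmark pass@1 scores."""
    return sum(per_benchmark.values()) / len(per_benchmark)


if __name__ == "__main__":
    # Hypothetical outcomes for a four-task benchmark.
    print(pass_at_1([True, False, True, True]))
    # Hypothetical per-benchmark scores, not taken from the paper.
    print(average_score({"bench_a": 60.0, "bench_b": 50.0, "bench_c": 40.0}))
```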

Table 5: Observed wall\-clock runtime in minutes on the Qwen2\.5\-Coder\-1\.5B\-Instruct main setting\. We include method\-specific preprocessing and SFT training, and exclude evaluation\.

## Appendix B Additional Experiment Results

### B\.1 More Experimental Results

We report additional results on Qwen2\.5\-Coder\-3B\-Instruct and Qwen2\.5\-Coder\-7B\-Instruct to further examine the generality of CodeBlock\. All methods use the same data pool, training protocol, and evaluation setting as the main experiments\. As shown in Table [6](https://arxiv.org/html/2606.18286#A2.T6), CodeBlock consistently achieves strong average performance with only about 1\.9% of supervised response tokens, further supporting the effectiveness of structure\-aware sparse supervision across different model scales\.

Table 6: Additional comparison of different baselines on various benchmarks\. We highlight the best result in boldface and the second\-best with underline\.

### B\.2 More Ablation Study

![Refer to caption](https://arxiv.org/html/2606.18286v1/x7.png)

Figure 7: Sensitivity analysis of $\lambda$ on the average pass@1 score across six code\-generation benchmarks\.

We further study the effect of the data\-flow bonus strength $\lambda$ in the CodeBlock priority function\. This hyperparameter controls how strongly the reach and bridge signals influence the final ranking of candidate coding items\.

Figure [7](https://arxiv.org/html/2606.18286#A2.F7) reports the results under different values of $\lambda$, while keeping all other hyperparameters fixed\. The results show that a small data\-flow bonus is generally beneficial, with $\lambda=0.10$ achieving the best average score\. Compared with $\lambda=0$, adding the data\-flow bonus improves the six\-benchmark average from 53\.84 to 54\.58, suggesting that reach and bridge signals provide complementary information beyond token\-level GCE utility\.

However, the trend is not monotonic: further increasing $\lambda$ does not consistently improve performance and can even degrade the average score\. Because our data\-flow graph is constructed through static code analysis, its reach and bridge estimates are inevitably approximate and may contain noise, especially for incomplete snippets, library calls, or implicit dependencies\. Therefore, data\-flow information should be used to gently promote dependency\-critical coding items after GCE scoring, rather than replacing the original token\-level utility estimate\. The relatively stable but non\-monotonic results indicate that CodeBlock benefits from structure\-aware reranking, while over\-emphasizing static structural centrality may disturb the balance between learnability and dependency importance\. We therefore set $\lambda=0.10$ as the default value in our experiments\.

## Appendix C Runtime Analysis

We additionally report observed wall\-clock runtime on the Qwen2\.5\-Coder\-1\.5B\-Instruct main setting\. For each method, we count method\-specific preprocessing and SFT training time, and exclude evaluation time\. Shared dataset construction costs, such as LLM\-judge scoring for constructing the common candidate pool, are not attributed to any individual method\.

As shown in Table [5](https://arxiv.org/html/2606.18286#A1.T5), full\-token SFT requires about 255 minutes in total, mainly due to its 30K\-step training process\. In contrast, CodeBlock finishes in about 41\.5 minutes when including both token\-scoring preprocessing and SFT training, corresponding to 16\.3% of the full\-token runtime\. Although Random Selection has lower runtime, it does not provide the same downstream performance as CodeBlock\. Moreover, CodeBlock has a similar runtime to Token Cleaning but achieves stronger average performance, suggesting that the improvement comes from structure\-aware code supervision rather than merely additional preprocessing\.
