DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition

arXiv cs.CL Papers

Summary

DiZiNER is a framework that uses disagreement between multiple LLMs to refine task instructions for zero-shot named entity recognition, achieving zero-shot state-of-the-art results on 14 of 18 benchmarks and narrowing the performance gap between zero-shot and supervised systems by over 11 points.

arXiv:2604.15866v1 Announce Type: new Abstract: Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inconsistencies observed in early-stage human annotation processes that resolve disagreements through pilot annotation. Motivated by this analogy, we introduce DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates the pilot annotation process, employing LLMs to act as both annotators and supervisors. Multiple heterogeneous LLMs annotate shared texts, and a supervisor model analyzes inter-model disagreements to refine task instructions. Across 18 benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 and reducing the zero-shot to supervised gap by over +11 points. It also consistently outperforms its supervisor, GPT-5 mini, indicating that improvements stem from disagreement-guided instruction refinement rather than model capacity. Pairwise agreement between models shows a strong correlation with NER performance, further supporting this finding.

Cached at: 04/20/26, 08:30 AM

# DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition
Source: https://arxiv.org/html/2604.15866
Siun Kim (Seltasquare, Seoul, Korea; sukim@seltasquare.com) and Hyung-Jin Yoon (Seoul National University College of Medicine, Seoul, Korea; hjyoon@snu.ac.kr)

This work was primarily conducted at the Biomedical Research Institute, Seoul National University Hospital.

###### Abstract

Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inconsistencies observed in early-stage human annotation processes that resolve disagreements through pilot annotation. Motivated by this analogy, we introduce DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates the pilot annotation process, employing LLMs to act as both annotators and supervisors. Multiple heterogeneous LLMs annotate shared texts, and a supervisor model analyzes inter-model disagreements to refine task instructions. Across 18 benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 and reducing the zero-shot to supervised gap by over +11 points. It also consistently outperforms its supervisor, GPT-5 mini, indicating that improvements stem from disagreement-guided instruction refinement rather than model capacity. Pairwise agreement between models shows a strong correlation with NER performance, further supporting this finding.

The code and prompts are available at https://github.com/SiunKim/diziner-ner/.


## 1 Introduction

Information extraction (IE) converts unstructured text into structured data, with named entity recognition (NER) serving as the entry point that identifies and categorizes entity spans. Recent advances in large language models (LLMs) have greatly expanded the potential of IE (Lu et al., 2022 (https://arxiv.org/html/2604.15866#bib.bib23); Bogdanov et al., 2024 (https://arxiv.org/html/2604.15866#bib.bib20)), enabling in-context learning (ICL) strategies for NER such as few-shot (Chen et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib21); Jiang et al., 2024 (https://arxiv.org/html/2604.15866#bib.bib26)) and zero-shot learning (Xie et al., 2023a (https://arxiv.org/html/2604.15866#bib.bib22); Sainz et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib55)). Despite this progress, state-of-the-art (SOTA) models still depend heavily on human-labeled data, with a wide performance gap remaining between supervised fine-tuning (SFT) and ICL (Xie et al., 2023a (https://arxiv.org/html/2604.15866#bib.bib22); Naguib et al., 2024 (https://arxiv.org/html/2604.15866#bib.bib19)).

LLMs exhibit recurring NER error patterns, including difficulty following complex guidelines (Pang et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib18); Sainz et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib55); Qi et al., 2024 (https://arxiv.org/html/2604.15866#bib.bib48)), ambiguity in span boundary detection (Guo et al., 2024a (https://arxiv.org/html/2604.15866#bib.bib15); Ding et al., 2024 (https://arxiv.org/html/2604.15866#bib.bib56)), and frequent confusion of entity types (Li et al., 2024a (https://arxiv.org/html/2604.15866#bib.bib27); Kim et al., 2024 (https://arxiv.org/html/2604.15866#bib.bib28)). Prior efforts have addressed these issues through instruction fine-tuning on diverse datasets (Wang et al., 2023a (https://arxiv.org/html/2604.15866#bib.bib45)), open NER frameworks (Sainz et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib55)), and large-scale synthetic data generation (Zhou et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib53)). Yet, supervised methods still outperform them by a considerable margin (Table 2 (https://arxiv.org/html/2604.15866#S4.T2)).

In this context, we note that these LLM errors parallel those observed during the early stages of human annotation (Tanabe et al., 2005 (https://arxiv.org/html/2604.15866#bib.bib13); Bernier-Colborne and Vajjala, 2024 (https://arxiv.org/html/2604.15866#bib.bib14)). Gold-standard datasets are typically built through *pilot annotation*, an iterative process of resolving annotator disagreements and refining guidelines (Walker et al., 2006 (https://arxiv.org/html/2604.15866#bib.bib42); Weischedel et al., 2011 (https://arxiv.org/html/2604.15866#bib.bib57); Finlayson and Erjavec, 2017 (https://arxiv.org/html/2604.15866#bib.bib12)). Supervisors analyze disagreements, update ambiguous instructions, and align the annotations with downstream application needs (Fort et al., 2009 (https://arxiv.org/html/2604.15866#bib.bib10), Figure 1 (https://arxiv.org/html/2604.15866#S1.F1)).

Building on this analogy, we propose DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates pilot annotation using LLMs as both annotators and supervisors. Multiple heterogeneous open-source LLMs act as annotators labeling shared texts, and a supervisor LLM analyzes and categorizes inter-model disagreements to refine both common and model-specific instructions. This iterative cycle of annotation, disagreement analysis, and instruction refinement parallels the workflow of human pilot annotation, allowing LLMs to adapt to individual NER tasks without any parameter updates.

Across 18 NER benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 on average and narrowing the gap between zero-shot and supervised performance from -32.0 to -20.9 points. Agreement metrics between LLM annotators consistently increase across iterations and show a strong correlation with NER performance. Notably, DiZiNER surpasses its GPT-5 mini supervisor, indicating that the observed improvements arise from disagreement-guided refinement rather than from the supervisor’s inherent capability.

Figure 1: Overview of the DiZiNER framework. Multiple heterogeneous LLMs act as independent annotators. Disagreement profiles are constructed from their outputs, and a supervisor LLM iteratively refines the schema and annotator-specific instructions until convergence.

## 2 Related Works

#### Instruction tuning for NER

Standard instruction fine-tuning often struggles to follow complex annotation guidelines and to produce structured outputs in IE tasks (Qi et al., 2024 (https://arxiv.org/html/2604.15866#bib.bib48)). InstructUIE and GoLLIE address these challenges by curating NER datasets for instruction fine-tuning, thereby improving zero-shot performance and guideline adherence (Wang et al., 2023b (https://arxiv.org/html/2604.15866#bib.bib54); Sainz et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib55)). Open NER frameworks relax label constraints, allowing LLMs to better exploit their language understanding capabilities for NER (Etzioni et al., 2011 (https://arxiv.org/html/2604.15866#bib.bib16)). UniversalNER distills ChatGPT on synthetic data (Zhou et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib53)), while GLiNER and NuNER adopt encoder-only architectures to reduce inference costs (Zaratiana et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib52); Bogdanov et al., 2024 (https://arxiv.org/html/2604.15866#bib.bib20)). Recent work has sought to unify heterogeneous corpora and to address span ambiguity through boundary-aware learning (Yang et al., 2024 (https://arxiv.org/html/2604.15866#bib.bib50); Ding et al., 2024 (https://arxiv.org/html/2604.15866#bib.bib56); Guo et al., 2024a (https://arxiv.org/html/2604.15866#bib.bib15)). Despite these advances, the performance gap with supervised systems remains large, and reliance on fine-tuning limits rapid adaptation to evolving LLMs.

#### Generative NER without instruction tuning

In parallel, researchers have explored leveraging LLMs’ inherent instruction-following capabilities to perform generative NER without requiring additional instruction fine-tuning. Early work constrained outputs via code-like schema representations (Li et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib7); Sainz et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib55); Guo et al., 2024b (https://arxiv.org/html/2604.15866#bib.bib9); Li et al., 2024b (https://arxiv.org/html/2604.15866#bib.bib51)) or reformulated tagging as token generation (Wang et al., 2023a (https://arxiv.org/html/2604.15866#bib.bib45)). Subsequent approaches introduced reasoning-based prompting such as self-consistency and self-verification methods to better convey complex annotation instructions (Xie et al., 2023a (https://arxiv.org/html/2604.15866#bib.bib22); Kim et al., 2024 (https://arxiv.org/html/2604.15866#bib.bib28); Pang et al., 2023 (https://arxiv.org/html/2604.15866#bib.bib18)).

Building on the success of self-consistency and ICL, recent methods for generative NER adopt iterative self-improving strategies by generating pseudo-examples, filtering them, and providing them as in-context demonstrations (Xie et al., 2023b (https://arxiv.org/html/2604.15866#bib.bib49); Tong et al., 2025 (https://arxiv.org/html/2604.15866#bib.bib47)). Our work follows this iterative, fine-tuning-free line of research yet distinctly utilizes inter-model disagreement as a signal for improving NER performance, paralleling how human annotators refine guidelines and reconcile judgments during gold-standard dataset construction.

## 3 DiZiNER

The DiZiNER framework operates through iterative pilot annotation cycles consisting of three stages: (1) Independent Cross-Annotation, where multiple LLM annotators independently perform NER tagging on the same set of documents; (2) Disagreement Analysis, which identifies *hotspot* spans with high annotation disagreement and then categorizes and summarizes disagreement patterns into structured reports; and (3) Instruction Refinement, where a supervisor model leverages the resulting structured reports to refine task instructions and reduce inter-model disagreement across iterations.

### 3.1 Task Formulation

LLM annotators form a heterogeneous pool $\mathcal{M}=\{M_k\}_{k=1}^{K}$ composed of independently developed models to minimize correlated errors. The label set is $\mathcal{L}=\{\ell_i\}_{i=1}^{n}$, and the NER schema is

$$\Sigma=\big\{(\ell,\ d_\ell,\ \mathcal{P}_\ell,\ \mathcal{N}_\ell)\big\}_{\ell\in\mathcal{L}},$$

where $d_\ell$ is a definition for entity type $\ell$, and $\mathcal{P}_\ell,\mathcal{N}_\ell$ are positive and negative examples. The schema $\Sigma$ remains fixed across iterations to maintain task consistency and prevent task drift.

At iteration $t$, annotator $M_k$ receives a task configuration

$$\Theta_k^{(t)}=\big(\Sigma,\ C^{(t)},\ R_k^{(t)},\ G^{(t)}\big),$$

where $C^{(t)}$ are common instructions, $R_k^{(t)}$ are model-specific instructions, and $G^{(t)}$ is the final task goal. Given an input sentence $x$, the annotator predicts

$$y\sim P_{M_k}\!\big(y\,\big|\,x,\Theta_k^{(t)}\big),$$

with labeled outputs $y=\{(e_j,\ \ell_{e_j})\}$, where $e_j$ is an entity span and $\ell_{e_j}\in\mathcal{L}$ denotes its label.

### 3.2 Independent Cross-Annotation

At each iteration, documents are grouped by lexical diversity, and a representative subset is randomly sampled across groups to form the iteration document set $\mathcal{D}^{(t)}$. All annotators in $\mathcal{M}$ independently label each sample in the set according to their task configuration $\Theta_k^{(t)}$. To enable token-level comparison across models, span-level annotations are converted into a BIO sequence representation. For input $x=(w_1,\dots,w_m)$, the tag set is defined as

$$\mathcal{T}=\{\mathrm{B}\!-\!\ell,\ \mathrm{I}\!-\!\ell,\ \mathrm{O}\mid\ell\in\mathcal{L}\}.$$

The conversion yields a BIO sequence

$$\mathbf{z}_k(x)=(z_{k,1}(x),\dots,z_{k,m}(x)),\quad z_{k,i}(x)\in\mathcal{T},$$

representing the token-level tagging output derived from the span-level annotation $y$ of annotator $M_k$.
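As an illustration only (this is not the authors' released code), the span-to-BIO conversion can be sketched as below. The sketch assumes that each predicted entity has already been aligned to token offsets and that spans do not overlap; the `Span` tuple layout and the function name `spans_to_bio` are hypothetical choices made for this example.

```python
from typing import List, Tuple

# Hypothetical span layout: (start token index, end token index exclusive, label).
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping token spans into a BIO tag sequence."""
    tags = ["O"] * num_tokens
    for start, end, label in sorted(spans):
        if not (0 <= start < end <= num_tokens):
            raise ValueError(f"span ({start}, {end}) is out of range")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Example sentence: "Siun Kim visited Seoul today" (5 tokens).
bio = spans_to_bio(5, [(0, 2, "PER"), (3, 4, "LOC")])
print(bio)
```

Once every annotator's output is expressed in this form, the sequences have equal length and can be compared position by position.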

### 3.3 Disagreement Analysis

This stage identifies *hotspot* spans that exhibit strong inter-model disagreement. Token-level inconsistencies across annotators are quantified to mark high-disagreement regions.

#### Model Weights and Consensus

Model weights are computed from pairwise strict span F1 scores between annotators, where for models $M_i$ and $M_j$,

$$\mathrm{F1}_{ij}=\frac{2\,|\mathcal{S}_i\cap\mathcal{S}_j|}{|\mathcal{S}_i|+|\mathcal{S}_j|},$$

where $\mathcal{S}_k$ denotes the set of predicted entity spans from model $M_k$. Each model’s weight, $w_k$, is computed as the average of its pairwise F1 scores with all others, normalized so that the weights sum to one. The *elite set* is defined as the subset of annotators with the highest weights whose cumulative weight first reaches 0.5 when sorted in descending order. The computed model weights are also used as each annotator’s agreement score in subsequent analyses.
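The weighting scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: spans are represented as hashable tuples, two empty prediction sets are treated as full agreement, and a uniform fallback is used if all pairwise scores are zero. None of these edge cases is specified in the text, and the function names are hypothetical.

```python
from typing import Dict, Hashable, List, Set


def pairwise_f1(spans_i: Set[Hashable], spans_j: Set[Hashable]) -> float:
    """Strict span F1 between two annotators: 2|Si & Sj| / (|Si| + |Sj|)."""
    denom = len(spans_i) + len(spans_j)
    if denom == 0:
        return 1.0  # assumption: two empty predictions count as full agreement
    return 2 * len(spans_i & spans_j) / denom


def model_weights(preds: Dict[str, Set[Hashable]]) -> Dict[str, float]:
    """Average each model's pairwise F1 with all others, then normalize to sum to one."""
    names = list(preds)
    if len(names) < 2:
        raise ValueError("need at least two annotators")
    mean_f1 = {
        a: sum(pairwise_f1(preds[a], preds[b]) for b in names if b != a) / (len(names) - 1)
        for a in names
    }
    total = sum(mean_f1.values())
    if total == 0:
        return {a: 1.0 / len(names) for a in names}  # assumption: uniform fallback
    return {a: v / total for a, v in mean_f1.items()}


def elite_set(weights: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Top-weighted annotators whose cumulative weight first reaches the threshold."""
    elite, cumulative = [], 0.0
    for name, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        elite.append(name)
        cumulative += w
        if cumulative >= threshold:
            break
    return elite


# Toy spans as (sentence id, start, end, label); models A and B agree fully, C differs on one span.
preds = {
    "A": {(0, 0, 2, "PER"), (0, 3, 4, "LOC")},
    "B": {(0, 0, 2, "PER"), (0, 3, 4, "LOC")},
    "C": {(0, 0, 2, "PER"), (0, 3, 5, "LOC")},
}
weights = model_weights(preds)
elite = elite_set(weights)
print(weights, elite)
```

In this toy case the mean pairwise F1 values are 0.75, 0.75, and 0.5, so the normalized weights are 0.375, 0.375, and 0.25, and the two agreeing models form the elite set.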

The consensus label for token $i$ in sentence $x$ is obtained via weighted majority voting,

$$\widehat{\tau}(x,i)=\arg\max_{\tau\in\mathcal{T}}p_\tau(x,i),$$

where $p_\tau(x,i)=\sum_k w_k\,\mathbf{1}[z_{k,i}(x)=\tau]$ represents the weighted token-wise probability for tag $\tau$.
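A small sketch of the weighted vote follows, assuming every annotator's BIO sequence has the same length and that the weights already sum to one. The function names are hypothetical, and tie-breaking (here, whichever tag Python's `max` encounters first) is an assumption, since the text does not specify it.

```python
from collections import defaultdict
from typing import Dict, List


def tag_distribution(
    tags_by_model: Dict[str, List[str]], weights: Dict[str, float]
) -> List[Dict[str, float]]:
    """For each token position, accumulate the weight of every model voting for each tag."""
    lengths = {len(tags) for tags in tags_by_model.values()}
    if len(lengths) != 1:
        raise ValueError("all BIO sequences must have the same length")
    dist = [defaultdict(float) for _ in range(lengths.pop())]
    for name, tags in tags_by_model.items():
        for i, tag in enumerate(tags):
            dist[i][tag] += weights[name]
    return [dict(d) for d in dist]


def consensus_tags(dist: List[Dict[str, float]]) -> List[str]:
    """Pick the tag with the largest weighted probability at each position."""
    return [max(p, key=p.get) for p in dist]


weights = {"A": 0.375, "B": 0.375, "C": 0.25}
tags_by_model = {
    "A": ["B-PER", "O"],
    "B": ["B-PER", "B-LOC"],
    "C": ["B-ORG", "B-LOC"],
}
dist = tag_distribution(tags_by_model, weights)
consensus = consensus_tags(dist)
print(dist, consensus)
```

For the second token, the tag `B-LOC` collects 0.375 + 0.25 = 0.625 of the weight against 0.375 for `O`, so it becomes the consensus label even though the highest-weighted single model disagrees.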

#### Hotspot Span Identification

We compute three complementary token-level measures capturing distinct forms of annotation disagreement.

(1) *Label conflict* quantifies dispersion among BIO tags,

$$D_{\mathrm{conf}}(x,i)=1-\sum_{\tau\in\mathcal{T}}p_\tau(x,i)^{2}.$$

(2) *Type confusion* reflects disagreement over entity types,

$$D_{\mathrm{type}}(x,i)=1-\sum_{\ell\in\mathcal{L}}\left(\frac{p_{\mathrm{B}-\ell}(x,i)+p_{\mathrm{I}-\ell}(x,i)}{1-p_{\mathrm{O}}(x,i)}\right)^{2}.$$

(3) *Boundary uncertainty* measures inconsistency at entity boundaries,

$$q_s(x,i)=\sum_{\ell\in\mathcal{L}}p_{\mathrm{B}-\ell}(x,i),\quad q_i(x,i)=\sum_{\ell\in\mathcal{L}}p_{\mathrm{I}-\ell}(x,i),$$

$$U_{\mathrm{bnd}}(x,i)=\max\Big\{4q_s(x,i)\big(1-q_s(x,i)\big),\ 4q_i(x,i)\big(1-q_i(x,i)\big)\Big\}.$$

The final token-level disagreement score is defined as

$$U_{\star}(x,i)=\max\{D_{\mathrm{conf}},\ D_{\mathrm{type}},\ U_{\mathrm{bnd}}\}.$$
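To make the three measures concrete, the sketch below evaluates them for a single token given its weighted tag distribution. It is a hedged illustration rather than the authors' code: the function names are hypothetical, the distribution is assumed to sum to one, and type confusion is set to zero when all weight falls on `O` (the formula is undefined there, and the text does not say how that case is handled).

```python
from typing import Dict


def label_conflict(p: Dict[str, float]) -> float:
    """D_conf: one minus the sum of squared tag probabilities."""
    return 1.0 - sum(v * v for v in p.values())


def type_confusion(p: Dict[str, float]) -> float:
    """D_type: dispersion over entity types, renormalized over non-O probability mass."""
    non_o = 1.0 - p.get("O", 0.0)
    if non_o <= 0.0:
        return 0.0  # assumption: no entity votes means no type disagreement
    by_type: Dict[str, float] = {}
    for tag, v in p.items():
        if tag == "O":
            continue
        label = tag.split("-", 1)[1]        # "B-PER" -> "PER"
        by_type[label] = by_type.get(label, 0.0) + v
    return 1.0 - sum((v / non_o) ** 2 for v in by_type.values())


def boundary_uncertainty(p: Dict[str, float]) -> float:
    """U_bnd: max of 4q(1-q) over the total B-mass and the total I-mass."""
    q_s = sum(v for tag, v in p.items() if tag.startswith("B-"))
    q_i = sum(v for tag, v in p.items() if tag.startswith("I-"))
    return max(4 * q_s * (1 - q_s), 4 * q_i * (1 - q_i))


def token_disagreement(p: Dict[str, float]) -> float:
    """U_star: the maximum of the three measures."""
    return max(label_conflict(p), type_confusion(p), boundary_uncertainty(p))


# One token where half the weight says B-PER, a quarter says B-ORG, a quarter says O.
p = {"B-PER": 0.5, "B-ORG": 0.25, "O": 0.25}
print(label_conflict(p), type_confusion(p), boundary_uncertainty(p), token_disagreement(p))
```

For this token, label conflict is 1 - 0.375 = 0.625, type confusion is 1 - (4/9 + 1/9) = 4/9, and boundary uncertainty is 4 x 0.75 x 0.25 = 0.75, so the combined score is 0.75.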
