information-extraction

BCL: Bayesian In-Context Learning Framework for Information Extraction

arXiv cs.CL ↗ · 2026-06-18 Cached

BCL is the first optimization framework that uses particle filtering with Bayesian updates to systematically refine label representations for information extraction tasks, showing consistent improvements over existing methods.

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

Hugging Face Daily Papers ↗ · 2026-06-17 Cached

ACIE, an agentic RAG system for clinical information extraction, achieves a 96.5% acceptance rate in judgments by nuclear-medicine physicians across 7,326 instances, addressing the challenges of heterogeneous patient contexts and missing metadata.

AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

arXiv cs.AI ↗ · 2026-06-12 Cached

AAbAAC is a manually annotated corpus of 115 PubMed abstracts for autoimmunity information extraction, focusing on entities like autoimmune diseases and autoantibodies. The study demonstrates improved NER performance after fine-tuning on this corpus.

sebis at CRF Filling 2026: A Two-Stage Local LLM Pipeline for Medical CRF Filling

arXiv cs.CL ↗ · 2026-06-12 Cached

This paper presents a fully local, two-stage LLM pipeline using MedGemma-27B for filling Case Report Forms from clinical notes, achieving a macro-F1 of 0.55 on the English test track and securing second place among local open-source submissions.
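For readers unfamiliar with the metric, macro-F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as frequent ones. The sketch below shows the generic definition only; it assumes one label per item, the labels and data are made up for illustration, and the shared task's actual scoring protocol is not described in this summary.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the true label g
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data, not taken from the paper.
gold = ["yes", "yes", "no", "unknown"]
pred = ["yes", "no", "no", "unknown"]
score = macro_f1(gold, pred)
```

Because each class contributes equally, one poorly handled rare field can pull the average down sharply even when most predictions are correct.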

Benchmarking Large Language Models for Safety Data Extraction

arXiv cs.CL ↗ · 2026-06-11 Cached

This paper benchmarks four large language models (Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, Llama 3.1-70B) for extracting structured information from Safety Data Sheets, finding that text-based extraction with chain-of-thought prompting yields the highest accuracy (84% by Gemini 1.5 Pro) but no model surpasses the 90% threshold required for reliable industrial deployment.

How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions

arXiv cs.AI ↗ · 2026-06-09 Cached

This paper presents a deployment-focused study comparing LoRA fine-tuning of 24 model variants (270M–8B parameters) for merchant information extraction from financial transaction strings. The authors find that smaller models like Qwen 3.5 4B achieve 96.6% F1, within 0.35 points of the 8B baseline, while offering significant reductions in latency and cost.

Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model

arXiv cs.AI ↗ · 2026-06-09 Cached

This paper evaluates the open-weight LLM LLaMA 3.1 for automatic extraction of structured data from Dutch brain MRI reports, achieving high performance on visual rating scores and accurate detection of findings, with few-shot prompting improving extraction of numerical variables.

SMADE-IE: Sparse Multi-Agent Framework with Evidence-Driven Debate for Zero-Shot Information Extraction

arXiv cs.CL ↗ · 2026-06-04 Cached

SMADE-IE is a sparse multi-agent framework for zero-shot information extraction that uses an Adaptive Mode Selector and Evidence-Driven Debate mechanism with Toulmin-style argumentation and Bayesian updates to outperform existing baselines on 9 benchmarks across NER, RE, and JERE tasks while improving token efficiency.

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

arXiv cs.CL ↗ · 2026-06-03 Cached

This paper introduces EURO-5K, a sentence-level dataset for extracting reporting obligations from EU legislation, and benchmarks discriminative and generative transformer models under full fine-tuning and parameter-efficient QLoRA. Results show that legal pretraining primarily benefits models with limited adaptation capacity, and all approaches converge around 3K samples.

EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages

arXiv cs.AI ↗ · 2026-05-26 Cached

This paper introduces EPPC-OASIS, an ontology-aware adaptation method for extracting structured communication behaviors from secure patient-provider messages. The approach combines Wasserstein alignment during fine-tuning with inference refinement procedures, achieving modest improvements over baselines on a de-identified corpus.

Improving the Completeness and Comparability of Segment Disclosures: A Large Language Model Approach

arXiv cs.CL ↗ · 2026-05-26 Cached

This paper proposes an LLM-based framework to extract segment disclosures from 10-K filings, improving completeness and comparability through retrieval-augmented systems for longitudinal and cross-firm analysis.

@akshay_pachaar: https://x.com/akshay_pachaar/status/2058976178908885210

X AI KOLs Following ↗ · 2026-05-25 Cached

Explains how to fix agent memory by defining an ontology using Pydantic schemas, enabling structured extraction into knowledge graphs for multi-hop reasoning, with an open-source solution (Zep).
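The general idea of a typed ontology feeding a knowledge graph can be sketched as follows. This is an illustration under stated assumptions, not the post's code or Zep's API: it uses standard-library dataclasses in place of Pydantic models, and every class name, relation name, and helper here is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # hypothetical entity types, e.g. "Person", "Company", "City"


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str


# The "ontology": which (source kind, relation, target kind) triples are legal.
ALLOWED = {
    ("Person", "WORKS_AT", "Company"),
    ("Company", "LOCATED_IN", "City"),
}


def build_graph(entities, edges):
    """Return an adjacency map, rejecting edges the ontology does not allow."""
    kinds = {e.name: e.kind for e in entities}
    graph = {}
    for edge in edges:
        triple = (kinds.get(edge.source), edge.relation, kinds.get(edge.target))
        if triple not in ALLOWED:
            raise ValueError(f"edge violates ontology: {edge}")
        graph.setdefault(edge.source, []).append((edge.relation, edge.target))
    return graph


def two_hop(graph, start, first, second):
    """Follow relation `first` from `start`, then relation `second`."""
    results = []
    for rel1, mid in graph.get(start, []):
        if rel1 != first:
            continue
        for rel2, end in graph.get(mid, []):
            if rel2 == second:
                results.append(end)
    return results


entities = [Entity("Alice", "Person"), Entity("Acme", "Company"), Entity("Berlin", "City")]
edges = [Edge("Alice", "WORKS_AT", "Acme"), Edge("Acme", "LOCATED_IN", "Berlin")]
graph = build_graph(entities, edges)
answer = two_hop(graph, "Alice", "WORKS_AT", "LOCATED_IN")
```

Constraining extraction to a fixed set of typed triples is what makes the multi-hop lookup reliable: a question like "in which city does Alice work?" becomes two typed edge traversals rather than a free-text search.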

@HappyyPablo: open sourcing Marlin-2B a tiny VLM to extract structured information from videos Marlin is finetuned for two questions …

X AI KOLs Timeline ↗ · 2026-05-19 Cached

Announces the open-source release of Marlin-2B, a tiny VLM for extracting structured information from videos, fine-tuned to answer "what is happening and when". The author describes it as the best open model in its weight class and competitive with Gemini-2.5-flash.

Concordance Comparison as a Means of Assembling Local Grammars

arXiv cs.CL ↗ · 2026-05-13 Cached

This paper presents a method for comparing concordances of local grammars to optimize Named Entity Recognition for person names in Portuguese, achieving improved F-measure scores on the HAREM dataset.

A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction

arXiv cs.CL ↗ · 2026-05-08 Cached

This paper compares a domain-trained small language model (Olava Extract) against frontier LLMs for structured contract extraction, showing that the specialized model achieves higher F1 scores and dramatically lower cost.

CBRS: Cognitive Blood Request System with Bilingual Dataset and Dual-Layer Filtering for Multi-Platform Social Streams

arXiv cs.CL ↗ · 2026-04-21 Cached

Researchers from Bangladesh University of Engineering and Technology present CBRS, a multi-platform framework that filters and parses blood donation requests from social media using a dual-layer architecture and a novel 11K bilingual dataset in Bengali and English. Their LoRA fine-tuned Llama-3.2-3B model achieves 99% filtering accuracy and 92% zero-shot parsing accuracy, outperforming GPT-4o-mini and other LLMs while using 35× fewer tokens.

AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows

arXiv cs.CL ↗ · 2026-04-20 Cached

Researchers from Banting Health AI present an AI system using generative LLMs with Retrieval-Augmented Generation (RAG) for automated clinical trial protocol information extraction, achieving 89% accuracy compared to 62.6% for standalone LLMs, with AI-assisted workflows completing tasks 40% faster and reducing cognitive demand.

DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition

arXiv cs.CL ↗ · 2026-04-20 Cached

DiZiNER is a framework that uses disagreement between multiple LLMs to refine task instructions for zero-shot named entity recognition, achieving state-of-the-art results on 14 out of 18 benchmarks and significantly reducing the performance gap between zero-shot and supervised systems.

PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

arXiv cs.CL ↗ · 2026-04-20 Cached

PIIBench presents a unified multi-source benchmark corpus for detecting personally identifiable information (PII) across diverse data sources. This resource addresses the need for standardized evaluation in PII detection tasks, which is critical for privacy-preserving NLP applications.
