Trending

Trending stories ranked by heat, importance and recency.

#61

MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

arXiv cs.CL · 3h ago Cached

MEMPROBE is a benchmark that evaluates long-term memory in LLM agents by reconstructing hidden user states from the agent's memory after interaction.

#62

Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams

arXiv cs.CL · 3h ago Cached

This paper introduces the first public multimodal dataset of 100 Turkish scam and benign phone calls, evaluating seven LLMs under raw audio, ASR transcripts, and human-corrected transcripts. Results show transcript-based inputs outperform direct audio, highlighting the need for inclusive AI safety research in low-resource languages.

#63

Agentic AI for Bilevel Long-Term Optimization of Policy-Driven Physical Layer Systems

arXiv cs.AI · 3h ago Cached

This paper presents Agentic-LTPO, a nested bilevel optimization framework that uses agentic AI to adapt physical layer configurations under dynamic operator policies, achieving 57.2% long-term performance improvement in cell-free MIMO beamforming.

#64

AutoSpecNER: A Fine-Grained Named Entity Recognition Dataset for Vehicle Specification Extraction

arXiv cs.CL · 3h ago Cached

Introduces AutoSpecNER, an expert-annotated dataset for fine-grained named entity recognition in vehicle listings, with 659 advertisements annotated across 15 entity types. Benchmark results show DeBERTa achieves 90% micro-F1, outperforming rule-based and LLM approaches.

#65

On the Stability of Prompt Ranking in Large Language Model Evaluation

arXiv cs.CL · 3h ago Cached

This paper systematically studies the stability of prompt rankings in LLM evaluation under common sources of variability, finding that top-performing prompts often change. It proposes a stability-aware selection strategy based on a lower confidence bound to improve robustness.
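A minimal sketch of what selecting by a lower confidence bound could look like, assuming each prompt has been scored over several evaluation runs. The function names, the `z` multiplier, and the sample scores are illustrative assumptions, not the paper's actual method or data.

```python
import math
import statistics


def lower_confidence_bound(scores, z=1.0):
    """Mean minus z standard errors (an assumed form of the bound).

    Prompts whose scores vary a lot across runs are penalized.
    """
    if len(scores) < 2:
        # A single run gives no variance estimate, so rank it last.
        return float("-inf")
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - z * stderr


def select_prompt(results, z=1.0):
    """Return the prompt name with the highest lower confidence bound."""
    return max(results, key=lambda name: lower_confidence_bound(results[name], z))


# Hypothetical per-run accuracies for two candidate prompts.
results = {
    "prompt_a": [0.90, 0.60, 0.95, 0.55],  # higher mean, unstable
    "prompt_b": [0.74, 0.72, 0.73, 0.75],  # slightly lower mean, stable
}

best_by_mean = max(results, key=lambda name: statistics.fmean(results[name]))
best_by_lcb = select_prompt(results)
print(best_by_mean, best_by_lcb)
```

With these made-up scores, ranking by mean picks the unstable prompt while ranking by the bound picks the stable one, which is the kind of ranking flip the summary describes.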

#66

ATRIA: Adaptive Traceable ECG Reporting with Iterative Agents

arXiv cs.AI · 3h ago Cached

ATRIA is a multi-agent system for ECG report generation that mirrors the clinician's iterative workflow, enabling bidirectional editing, evidence grounding, and clinician-in-the-loop verification.

#67

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

arXiv cs.AI · 3h ago Cached

Introduces Age of LLM, a turn-based 1v1 benchmark where LLMs compete on a grid with fog of war and diplomacy, measuring reasoning, reliability, and strategic planning. Findings show a dominance of nuclear rush tactics and a weak link between reliability and winning.

#68

MorfFlex: Handling Rich Morphology

arXiv cs.CL · 3h ago Cached

This paper presents MorfFlex, a morphological dictionary architecture for languages with rich inflection and derivation, exemplified by MorfFlex CZ for Czech, which contains over 100 million wordforms and supports annotation consistency and NLP tools.

#69

Meet UD_Czech-PDTC: A Large and Genre-Rich Treebank in Universal Dependencies

arXiv cs.CL · 3h ago Cached

This paper introduces UD_Czech-PDTC, a large and genre-diverse treebank for Czech in the Universal Dependencies framework, derived from the Prague Dependency Treebank-Consolidated. It describes the conversion process and differences between annotation schemes.

#70

Prague Dependency Treebank -- Consolidated 2.0: Enriching a Complex Annotation Scheme

arXiv cs.CL · 3h ago Cached

We present the second consolidated version of the Prague Dependency Treebank, a 4-million-token manual multilingual annotation resource covering morphology, syntax, semantics, coreference, and discourse, along with compatible lexicons.

#71

Tractable Reasoning and Conjunctive Query Answering for Defeasible DL-Lite under Rational Closure

arXiv cs.AI · 3h ago Cached

This paper studies rational closure for the DL-Lite family of description logics, providing a plug-in architecture for efficient non-monotonic reasoning and conjunctive query answering with minimal computational overhead.

#72

Pigeonholing: Bad prompts hurt models to collapse and make mistakes

arXiv cs.CL · 3h ago Cached

This paper introduces 'pigeonholing,' a phenomenon where bad prompts cause LLMs to collapse and repeat errors, leading to a 38-40% performance drop. Experiments across 10 tasks and 10 models show the effect worsening with more conversation turns, and the authors propose RLVR with synthetic errors as a mitigation.

#73

Exploring the relationship between human-centric AI and firm idiosyncratic risks

arXiv cs.AI · 3h ago Cached

This paper investigates how human-centric AI (HCAI) adoption influences firm idiosyncratic risks, finding that HCAI is associated with lower risk, with digitalisation and executive shareholding strengthening this effect.

#74

Navigating User Behavior toward Personalized Multimodal Generation

arXiv cs.AI · 3h ago Cached

This paper proposes NaviGen, a framework for personalized multimodal content generation that encodes user behavior into executable instructions using a dual identifier and a two-stage SFT+RL pipeline, improving personalization across product, game, and short-video domains.

#75

Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR

arXiv cs.AI · 3h ago Cached

This paper investigates the impact of data scale versus latency on cross-lingual transfer for streaming ASR, finding that multilingual initialization benefits are data-limited, not latency-limited, and diminish as target-language data increases.

#76

T2D-Bench: Evidence-Gated Evaluation of LLM Outputs for Type 2 Diabetes Using a Multi-Layer Clinical-Lifestyle Knowledge Graph

arXiv cs.AI · 3h ago Cached

T2D-Bench is a benchmark for evaluating LLM outputs for Type 2 Diabetes using a multi-layer clinical-lifestyle knowledge graph. It reveals that current LLMs fail evidence-path checks in about a third of cases.

#77

Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems

arXiv cs.CL · 3h ago Cached

This paper uses TF-IDF and neural models on the Complete Tang Poems to predict poets' regional origins from linguistic features, finding detectable regional fingerprints, distance-decay effects, and temporal modulation of the signal.

#78

OmniPath: A Multi-Modal Agentic Framework for Auditing Wheelchair Accessibility

arXiv cs.AI · 3h ago Cached

OmniPath is a multi-modal agentic framework that combines OpenStreetMap network topology with aerial LiDAR data to audit wheelchair accessibility by analyzing physical barriers like slope and surface discontinuities at high resolution, validated against field surveys.

#79

Sentence-Level Contextual Entrainment in Large Language Models

arXiv cs.CL · 3h ago Cached

This paper extends contextual entrainment from the token level to the sentence level, showing that sentences appearing in the prompt, even counterfactual ones, become more probable during inference. The effect decreases with model size and is driven by 2-4% of attention heads, which can be ablated without performance loss.

#80

Exploring Academic Influence of Algorithms by Co-occurrence Network Based on Full-text of Academic Papers

arXiv cs.AI · 3h ago Cached

This paper constructs large-scale algorithm co-occurrence networks from the full text of academic papers to study the collective influence of algorithms in NLP, finding that classic, high-performing, and intersectional algorithms hold central network positions.
