Trending stories ranked by heat, importance and recency.
MEMPROBE is a benchmark that evaluates long-term memory in LLM agents by reconstructing hidden user states from the agent's memory after interaction.
This paper introduces the first public multimodal dataset of 100 Turkish scam and benign phone calls, evaluating seven LLMs under raw audio, ASR transcripts, and human-corrected transcripts. Results show transcript-based inputs outperform direct audio, highlighting the need for inclusive AI safety research in low-resource languages.
This paper presents Agentic-LTPO, a nested bilevel optimization framework that uses agentic AI to adapt physical layer configurations under dynamic operator policies, achieving 57.2% long-term performance improvement in cell-free MIMO beamforming.
Introduces AutoSpecNER, an expert-annotated dataset for fine-grained named entity recognition in vehicle listings, with 659 advertisements annotated across 15 entity types. Benchmark results show DeBERTa achieves 90% micro-F1, outperforming rule-based and LLM approaches.
This paper systematically studies the stability of prompt rankings in LLM evaluation under common sources of variability, finding that the top-ranked prompt often changes. It proposes a stability-aware selection strategy based on a lower confidence bound to improve robustness.
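As a rough illustration of selecting by lower confidence bound, the sketch below scores each prompt by its mean minus a multiple of its standard error across repeated runs. The function names, the multiplier `z`, and the example scores are all hypothetical; the paper's actual estimator is not given in this summary and may differ.

```python
from math import sqrt
from statistics import mean, stdev


def lower_confidence_bound(scores, z=1.0):
    """Mean minus z standard errors (z is an assumed tuning knob)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variability")
    return mean(scores) - z * stdev(scores) / sqrt(n)


def select_prompt(runs, z=1.0):
    """Return the prompt name with the highest lower confidence bound."""
    return max(runs, key=lambda name: lower_confidence_bound(runs[name], z))


# Made-up scores: prompt_a has the higher mean but varies far more across runs.
runs = {
    "prompt_a": [0.90, 0.60, 0.95, 0.55],
    "prompt_b": [0.74, 0.76, 0.75, 0.73],
}

best_by_mean = max(runs, key=lambda name: mean(runs[name]))
best_by_lcb = select_prompt(runs)
print(best_by_mean, best_by_lcb)
```

With these illustrative numbers, ranking by mean alone picks the volatile prompt, while the bound prefers the steadier one.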
ATRIA is a multi-agent system for ECG report generation that mirrors the clinician's iterative workflow, enabling bidirectional editing, evidence grounding, and clinician-in-the-loop verification.
Introduces Age of LLM, a turn-based 1v1 benchmark where LLMs compete on a grid with fog of war and diplomacy, measuring reasoning, reliability, and strategic planning. Findings show a dominance of nuclear rush tactics and a weak link between reliability and winning.
This paper presents MorfFlex, a morphological dictionary architecture for languages with rich inflection and derivation, exemplified by MorfFlex CZ for Czech, which contains over 100 million wordforms and supports annotation consistency and NLP tools.
This paper introduces UD_Czech-PDTC, a large and genre-diverse treebank for Czech in the Universal Dependencies framework, derived from the Prague Dependency Treebank-Consolidated. It describes the conversion process and differences between annotation schemes.
We present the second consolidated version of the Prague Dependency Treebank, a 4-million-token resource with manual multilayer annotation covering morphology, syntax, semantics, coreference, and discourse, along with compatible lexicons.
This paper studies rational closure for the DL-Lite family of description logics, providing a plug-in architecture for efficient non-monotonic reasoning and conjunctive query answering with minimal computational overhead.
This paper introduces 'pigeonholing,' a phenomenon where bad prompts cause LLMs to collapse and repeat errors, leading to a 38-40% performance drop. Experiments across 10 tasks and 10 models show that the effect worsens with more conversation turns, and the authors propose RLVR with synthetic errors as a mitigation.
This paper investigates how human-centric AI (HCAI) adoption influences firm idiosyncratic risks, finding that HCAI is associated with lower risk, with digitalisation and executive shareholding strengthening this effect.
This paper proposes NaviGen, a framework for personalized multimodal content generation that encodes user behavior into executable instructions using a dual identifier and a two-stage SFT+RL pipeline, improving personalization across product, game, and short-video domains.
This paper investigates the impact of data scale versus latency on cross-lingual transfer for streaming ASR, finding that multilingual initialization benefits are data-limited, not latency-limited, and diminish as target-language data increases.
T2D-Bench is a benchmark for evaluating LLM outputs for Type 2 Diabetes using a multi-layer clinical-lifestyle knowledge graph. It reveals that current LLMs fail evidence-path checks in about a third of cases.
This paper uses TF-IDF and neural models on the Complete Tang Poems to predict poets' regional origins from linguistic features, finding detectable regional fingerprints, distance-decay effects, and temporal modulation of the signal.
OmniPath is a multi-modal agentic framework that combines OpenStreetMap network topology with aerial LiDAR data to audit wheelchair accessibility by analyzing physical barriers like slope and surface discontinuities at high resolution, validated against field surveys.
This paper extends contextual entrainment from the token level to the sentence level, showing that sentences appearing in a prompt, even counterfactual ones, become more probable during inference. The effect decreases with model size and is driven by 2-4% of attention heads, which can be ablated without performance loss.
This paper constructs large-scale algorithm co-occurrence networks from the full text of academic papers to study the collective influence of algorithms in NLP, finding that classic, high-performing, and intersectional algorithms hold central network positions.