Two recent arXiv papers found that GPT-5.4 and Claude Opus 4.6 handle unfamiliar programming languages with a metaprogramming strategy: instead of writing the target language directly, they generate the target code with Python and debug it locally. This strategy is key to distinguishing top-tier agents from average ones, and strategy sophistication matters more than model parameter scale.
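To make the idea of generating target code with Python and debugging locally concrete, here is a toy sketch. It is not taken from either paper: Brainfuck stands in for the unfamiliar language, and the helper names `emit_bf` and `run_bf` are hypothetical.

```python
def emit_bf(text: str) -> str:
    """Generate a Brainfuck program that prints `text`.

    Illustrative assumption: a single memory cell is moved from one
    character code to the next, so only '+', '-' and '.' are emitted.
    """
    parts = []
    current = 0
    for ch in text:
        delta = ord(ch) - current
        step = "+" if delta >= 0 else "-"
        parts.append(step * abs(delta) + ".")
        current = ord(ch)
    return "".join(parts)


def run_bf(program: str) -> str:
    """Tiny local interpreter for the subset emitted above ('+', '-', '.').

    It lets the generator's output be checked locally before it is
    submitted anywhere.
    """
    cell = 0
    printed = []
    for op in program:
        if op == "+":
            cell = (cell + 1) % 256
        elif op == "-":
            cell = (cell - 1) % 256
        elif op == ".":
            printed.append(chr(cell))
    return "".join(printed)


if __name__ == "__main__":
    program = emit_bf("Hi!")
    # Local check: the generated program must reproduce the input text.
    print(run_bf(program) == "Hi!")
```

The point of the sketch is the division of labor: the logic lives in a language the author knows well, and the unfamiliar language is only ever produced and checked mechanically.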
This paper proposes a thermodynamic measure of intelligence defined as 'rare-valid lift' and argues that recursive self-simulation is necessary and nearly sufficient for high thermodynamic intelligence, making intelligence measurable on a universal scale.
linXiv is a local-first desktop application for managing academic papers. It supports paper discovery, management, and visualization from sources such as arXiv, and integrates a SQLite database, AI annotation, Obsidian notes, and a paper network graph.
autoarxiv lets you turn any arXiv paper into running code by changing the URL to autoarxiv.org. An AI agent from alphaXiv reads the paper, clones the repo, sets up dependencies, and runs a minimal reproduction to verify the paper's claims, logging everything live.
This paper evaluates multi-agent orchestration architectures (DAG Plan and Execute, ReAct) at enterprise scale and introduces a Task Manager for continuous event-driven operation, showing improvements in latency and correctness.
This paper identifies an embodiment gap in humanoid co-speech motion generation caused by human-centric pipelines, and proposes PhysDrift, an embodiment-aware framework that directly predicts executable humanoid joint trajectories from speech, improving speech-motion alignment and physical plausibility.
TelcoAgent is a foundation model-based framework for scalable and explainable multi-KPM forecasting in 5G networks, using automated 3GPP knowledge graph construction and a time-series foundation model for zero-shot prediction.
This paper proposes a segment combination strategy for automatically classifying research methods in academic papers by partitioning full-text content. Experiments on an annotated corpus from Library and Information Science journals show that methodological information is unevenly distributed, with middle-to-late segments having higher discriminative power.
This paper introduces RPCL, a training-only framework for robust pair confidence learning in multimodal emotion-cause pair extraction, which improves discriminative separation of gold pairs from hard negatives and achieves significant gains in Pair F1 and AUPRC on three datasets.
VisualSkill proposes a hierarchical multimodal skill library for computer-use agents that combines text and figures. By retaining visual information for GUI interaction, it achieves a 15.3-point absolute lift on CUA benchmarks over text-only baselines.
This paper systematically evaluates assumptions about LLM persona prompting and identifies 'persona manifold collapse,' where richer persona descriptions reduce behavioral diversity and simulation fidelity. The findings show that simple age-gender personas often outperform more detailed profiles.
QSignAI is a production-deployed open-source platform that combines quantum randomness from a Toeplitz two-source extractor with an AI bot on Telegram to generate unique identity signatures, demonstrating a bidirectional relationship between artificial intelligence and quantum science.
This paper proposes the Human-AI Coevolution Dynamics Framework (HACD-H) as a formal model of human-AI interaction, integrating emotional adaptation, relational organization, social memory, and personality consistency. Results show that social intelligence emerges from long-term social cognitive coevolution.
Firecrawl released a state-of-the-art research index for AI/ML papers, claiming 18% better recall on arXivQA than competitors, designed for autonomous research agents.
MM++ is a fully unsupervised, post-hoc framework for out-of-distribution detection that fuses discriminative intermediate layers via top-K gated feature fusion and uses a regularized tied covariance matrix for scale-invariant distance estimation.
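For readers unfamiliar with tied-covariance distance scores, the sketch below shows the generic technique: a Mahalanobis distance computed with one shrinkage-regularized covariance matrix shared across groups. It is not the MM++ implementation. The function names, the `shrinkage` parameter, and the `groups` argument are assumptions; the summary does not say how groups are obtained, so cluster ids or pseudo-labels are assumed here. The layer fusion step is not shown.

```python
import numpy as np


def fit_tied_gaussian(features, groups, shrinkage=0.1):
    """Fit per-group means and one shared (tied) covariance matrix.

    `groups` is a hypothetical array of integer group ids, for example
    cluster assignments. `shrinkage` is an assumed regularization
    strength that pulls the covariance toward a scaled identity.
    """
    ids = np.unique(groups)
    means = np.stack([features[groups == g].mean(axis=0) for g in ids])
    # Center each sample on its own group mean, then pool the scatter.
    centered = features - means[np.searchsorted(ids, groups)]
    cov = centered.T @ centered / len(features)
    dim = cov.shape[0]
    target = (np.trace(cov) / dim) * np.eye(dim)
    cov = (1.0 - shrinkage) * cov + shrinkage * target
    return means, np.linalg.inv(cov)


def ood_score(x, means, precision):
    """Squared Mahalanobis distance to the nearest group mean.

    Larger values suggest the sample is further from the training data.
    """
    diff = x[None, :] - means
    d2 = np.einsum("kd,de,ke->k", diff, precision, diff)
    return float(d2.min())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(200, 4))
    groups = rng.integers(0, 3, size=200)
    means, precision = fit_tied_gaussian(train, groups)
    inlier = ood_score(means[0], means, precision)
    outlier = ood_score(means[0] + 50.0, means, precision)
    print(inlier < outlier)
```

With this particular shrinkage target, multiplying every feature by the same constant leaves the score unchanged, because the covariance and the squared differences scale together. Whether that matches what the paper means by scale-invariant estimation is not stated in the summary.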
This paper introduces the Valid-Answer-Invalid-Reasoning (VAIR) benchmark to expose the production-evaluation gap in AI reasoning models, where models can generate correct answers but fail to detect flawed reasoning, revealing answer confirmation bias.
This paper introduces AdaMame, a two-stage training recipe (SFT + GRPO) to adaptively align reasoning language with query language in multilingual mathematical reasoning, mitigating language collapse without sacrificing accuracy.
HiDRA is a training-free method that uses high-dimensional random projection for activation steering in LLMs, capturing discriminative signals beyond linear methods and consistently outperforming existing baselines across diverse model families and benchmarks.
RepFusion introduces a method to use pretrained multimodal LLMs as noisy representation encoders in diffusion transformers for text-to-image generation, outperforming baselines with similar compute.
This paper proposes a decidable criterion for creative integration based on compression ratio of conflicts, validated through falsifiable tests. It operationalizes the notion that genuine creativity compresses conflicts.