Modal announces a partnership with OpenAI Devs and Antler Global to host an Autoresearch Systems Hackathon on May 30th targeting data- and compute-intensive challenges.
This paper introduces INSET, a unified multimodal model that embeds images as native vocabulary within textual instructions to improve handling of complex interleaved inputs for image generation and editing.
The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.
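The Retrieve–Synthesize–Refine–Update cycle can be pictured as a simple agentic generation loop. The sketch below is purely illustrative: all function names and data structures are assumptions, stand-ins for the retrieval, diffusion, refinement, and memory-update components the article describes.

```python
# Illustrative sketch of a Retrieve-Synthesize-Refine-Update cycle for long-video
# generation, in the spirit of A^2RD. Every function here is a hypothetical
# stand-in, not the paper's actual implementation.

def retrieve(memory, k=2):
    # Pull the k most recent context entries to condition the next clip.
    return memory[-k:]

def synthesize(context):
    # Stand-in for the autoregressive diffusion step producing the next clip.
    return f"clip_from_{len(context)}_context_entries"

def refine(clip):
    # Stand-in for a consistency-refinement pass over the generated clip.
    return clip + "_refined"

def update(memory, clip):
    # Write the refined clip back into memory to limit semantic drift.
    memory.append(clip)
    return memory

def generate_long_video(num_clips):
    memory = ["seed_frame"]
    for _ in range(num_clips):
        context = retrieve(memory)
        clip = refine(synthesize(context))
        memory = update(memory, clip)
    return memory[1:]  # generated clips, excluding the seed
```

The key design idea this loop captures is that each new clip is conditioned on refined memory rather than raw prior outputs, which is how the cycle is meant to counter long-horizon drift.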
This paper introduces MemoRepair, a barrier-first cascade repair contract for agentic memory that addresses the problem of stale derived artifacts when source data changes. Experiments demonstrate that MemoRepair significantly reduces invalidated memory exposure and repair costs compared to exhaustive repair methods.
This paper introduces HMACE, a heterogeneous multi-agent collaborative evolution framework that uses Large Language Models to automate heuristic design for NP-hard combinatorial optimization problems. It demonstrates improved quality-efficiency trade-offs over single-agent and multi-agent baselines on problems like TSP and BPP.
This empirical study evaluates LLMs on the Equivalence Class Problem to assess long-chain reasoning capabilities, finding that non-reasoning models fail while reasoning models struggle with specific structural difficulties.
This paper presents MIPIAD, a multilingual defense framework against indirect prompt injection attacks using a hybrid of Qwen2.5-based classifiers and TF-IDF features with meta-ensemble learning. It demonstrates strong performance on English and Bangla benchmarks, achieving high F1 and AUROC scores while reducing cross-lingual gaps.
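At a high level, meta-ensembling here means combining scores from heterogeneous detectors into one verdict. The sketch below shows one minimal way to do that; the weighted-average combination rule, the weights, and the threshold are all assumptions for illustration, not the paper's actual meta-learner.

```python
# Hypothetical sketch of combining two injection-detector scores (e.g. a
# Qwen2.5-based classifier and a TF-IDF model) into a single verdict.
# Weights and threshold are illustrative assumptions.

def meta_ensemble(score_llm, score_tfidf, w_llm=0.7, w_tfidf=0.3, threshold=0.5):
    """Blend two [0, 1] attack scores; return (combined score, is_attack flag)."""
    combined = w_llm * score_llm + w_tfidf * score_tfidf
    return combined, combined >= threshold

score, is_attack = meta_ensemble(0.9, 0.4)  # combined = 0.75 -> flagged
```

A learned meta-model (rather than fixed weights) would be trained on held-out detector outputs, which is what "meta-ensemble learning" typically implies.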
This paper argues that Generative AI evaluation should shift from static benchmarks to measuring real-world utility and human outcomes. It introduces the SCU-GenEval framework and supporting instruments to address the disconnect between benchmark performance and deployment success.
This paper introduces LogiHard, a framework that uses combinatorial hardening to expose compositional failures in frontier LLMs, demonstrating significant accuracy drops in logical reasoning tasks.
This article introduces ProtSent, a contrastive fine-tuning framework for protein language models that improves embedding quality for downstream tasks like remote homology detection and structural retrieval.
This paper introduces MIND (Monge Inception Distance), a new metric for evaluating generative models that is more sample-efficient, faster, and robust than the standard Fréchet Inception Distance (FID).
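For context, the FID baseline that MIND is positioned against compares real and generated samples via Gaussian statistics of Inception features:

```latex
\mathrm{FID}(x, g) = \lVert \mu_x - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_x + \Sigma_g - 2\,(\Sigma_x \Sigma_g)^{1/2} \right)
```

where $(\mu_x, \Sigma_x)$ and $(\mu_g, \Sigma_g)$ are the feature means and covariances of the real and generated distributions. FID's reliance on covariance estimates is a known source of sample inefficiency, which is the axis on which MIND claims improvement.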
This paper introduces Region4Web, a framework that improves web agent performance by organizing observation spaces into functional regions rather than individual elements. It demonstrates that this approach reduces observation length and increases task success rates on the WebArena benchmark.
The paper introduces MedExAgent, a framework that formalizes clinical diagnosis as a Partially Observable Markov Decision Process (POMDP) to handle noisy and incomplete information. It proposes a two-stage training pipeline combining supervised finetuning and reinforcement learning to improve diagnostic accuracy and cost-efficiency in medical LLMs.
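The POMDP framing means diagnosis is modeled with the standard tuple

```latex
\langle S, A, T, R, \Omega, O, \gamma \rangle
```

where $S$ is the set of latent patient states (true conditions), $A$ the actions (tests, questions, diagnoses), $T(s' \mid s, a)$ the transition dynamics, $R(s, a)$ the reward trading off accuracy against cost, $\Omega$ the observations (noisy or incomplete findings), $O(o \mid s', a)$ the observation model, and $\gamma$ the discount factor. The agent must act on a belief over $S$ rather than the true state, which is what captures noisy and incomplete clinical information.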
This paper introduces a diffusion language model that treats text as a continuous process over binary bitstreams, using entropy-gated stochastic sampling to close the performance gap with autoregressive models. It achieves state-of-the-art results on LM1B and OWT benchmarks while reducing memory footprint.
This paper establishes empirical scaling laws for language model merging, identifying power-law relationships between model size, expert count, and performance to enable predictive planning for optimal model composition.
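A power-law relationship of the kind the paper fits, performance $\approx c \cdot N^{\alpha}$, is typically recovered by linear regression in log-log space. The sketch below demonstrates the procedure on synthetic data; the data points and function names are illustrative, not the paper's measurements.

```python
import math

# Illustrative power-law fit: y = c * x^alpha, estimated by least squares on
# log(y) = alpha * log(x) + log(c). Synthetic data only, not the paper's results.

def fit_power_law(xs, ys):
    """Return (alpha, c) from a log-log least-squares fit."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    alpha = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    c = math.exp(my - alpha * mx)
    return alpha, c

# Data generated from y = 2 * x^0.5 recovers the exponent exactly.
xs = [1, 4, 16, 64]
ys = [2 * x ** 0.5 for x in xs]
alpha, c = fit_power_law(xs, ys)  # alpha = 0.5, c = 2.0
```

Once such exponents are fit for model size and expert count, extrapolating the curve is what enables the predictive planning the paper describes.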
Katanemo Labs introduces 'Signals,' a lightweight method for identifying informative agent traces without using LLM judges or GPUs, achieving higher efficiency in trajectory analysis.
Yann LeCun disputes claims about Silicon Valley's dominance in AI innovation by listing key breakthroughs like Attention, PyTorch, and AlphaFold that originated in other locations such as Montreal, London, and Paris.
A new study presents a software strategy that reduces cosmic-ray-induced errors in superconducting quantum computers by nearly a half-million-fold, cutting failures from roughly one every 10 seconds to fewer than one per month.
Tilde Research identified a flaw in the Muon optimizer that causes early death of MLP neurons and open-sourced Aurora, an alternative that maintains orthogonality while preventing neuron death, reporting state-of-the-art results on nanoGPT benchmarks and 100x data efficiency on 1B models.