Tag
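This paper studies the selective escalation problem in vertical federated learning and proposes an expected-gain scoring method that routes between low-cost local prediction and high-cost embedding fusion to optimize the communication-accuracy trade-off.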
Clement Delangue highlights vLLM's new semantic router, an open-source system for routing LLM queries to the most appropriate model, aiming to shift value from expensive frontier models to a diverse ecosystem of open-source models.
Cognition introduces Devin Fusion, a multi-model harness that routes between frontier and cost-effective models using a sidekick architecture, achieving frontier-level performance at 35% lower cost.
The author critiques the trend of token maximization in LLM usage and argues for a shift toward Return on Tokens (ROT) through optimization and routing for sustainable AI deployment.
The post argues that AI agent architecture should shift from monolithic agents that hold all context to a routing model where agents delegate tasks to specialized services, similar to how software evolved from monoliths to microservices.
A CLI tool called relay-ai acts as a proxy for Codex Desktop and Claude Code, enabling users to route requests to any model (including GLM 5.2) using their own API keys or OAuth subscriptions, with features to prevent crashes and manage context overflow.
This paper demonstrates that attention sinks, representation collapse, and norm stratification are not unique to attention mechanisms but are general consequences of content-based routing under a norm-blind similarity metric, as shown across multiple architectures including transformers, graph attention, state-space models, and recurrent mixers.
SharpMoE is a post-training framework that improves routing in diffusion mixture-of-experts models by using clean latent features to identify salient tokens and a trajectory routing loss to allocate compute precisely, achieving state-of-the-art visual generation.
This paper identifies a fundamental constraint on multi-model LLM systems: accuracy can be no higher than one minus the rate at which all models fail on the same query. Across 67 frontier models, common metrics significantly underestimate this all-wrong rate, which limits the gains available from voting, routing, and ensemble strategies.
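To make the ceiling concrete, here is a minimal sketch of how an all-wrong rate could be computed from a per-model correctness table. The toy data, the function names, and the independence-based baseline are all illustrative assumptions, not the paper's data or metrics. The baseline is just one simple estimate that can understate the true rate when model failures are correlated.

```python
from math import prod

def all_wrong_rate(correct):
    """correct[m][q] is True when model m answers query q correctly.
    Returns the fraction of queries that every model gets wrong."""
    n_queries = len(correct[0])
    all_wrong = sum(
        1 for q in range(n_queries) if not any(row[q] for row in correct)
    )
    return all_wrong / n_queries

def independence_estimate(correct):
    """Hypothetical baseline: multiply per-model error rates as if
    failures were independent across models."""
    n_queries = len(correct[0])
    return prod(1 - sum(row) / n_queries for row in correct)

# Toy correctness table (assumed): 3 models, 4 queries.
correct = [
    [True, True, False, False],
    [True, False, True, False],
    [True, True, True, False],
]

rate = all_wrong_rate(correct)             # only the last query fails everywhere
ceiling = 1 - rate                         # best accuracy any router or vote can reach
estimate = independence_estimate(correct)  # assumes uncorrelated failures
print(rate, ceiling, estimate)
```

In this toy table the independence baseline comes out lower than the measured all-wrong rate, which is the kind of gap the summary describes.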
OmniPath is a multi-modal agentic framework that combines OpenStreetMap network topology with aerial LiDAR data to audit wheelchair accessibility by analyzing physical barriers like slope and surface discontinuities at high resolution, validated against field surveys.
The author expresses concern over UK online safety policies that threaten freedom of expression and privacy, and considers routing traffic through nodes outside the UK to circumvent potential censorship.
Proposes ARIADNE, a training-free, adapter-agnostic routing framework that selects the optimal PEFT adapter at inference time by measuring input proximity to adapter-specific centroids in embedding space, recovering 97.44% of upper-bound performance on 23 tasks.
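Centroid-based adapter selection of this kind can be sketched in a few lines. This is a hedged illustration, not ARIADNE's implementation: the adapter names, the two-dimensional toy embeddings, and the choice of cosine similarity as the proximity measure are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_centroids(examples):
    """examples maps an adapter name to a list of embedding vectors.
    Each centroid is the element-wise mean of that adapter's vectors."""
    centroids = {}
    for name, vectors in examples.items():
        dim = len(vectors[0])
        centroids[name] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return centroids

def route(embedding, centroids):
    """Pick the adapter whose centroid is most similar to the input."""
    return max(centroids, key=lambda name: cosine(embedding, centroids[name]))

# Hypothetical adapters and toy embeddings (assumed for illustration).
examples = {
    "math_adapter": [[1.0, 0.0], [0.8, 0.2]],
    "code_adapter": [[0.0, 1.0], [0.2, 0.8]],
}
centroids = build_centroids(examples)
choice = route([0.9, 0.1], centroids)
print(choice)
```

No training step is involved in this sketch: the centroids are plain averages of existing embeddings, and routing is a single nearest-centroid lookup at inference time.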
The article provides a detailed explanation of Mixture of Experts (MoE) in transformers, covering routing, load balancing, and recent innovations like fine-grained experts. It also highlights the significance of Noam Shazeer's research contributions and his move from Google to OpenAI.
Grouped Query Experts (GQE) improves Transformer efficiency by applying a mixture-of-experts layer on top of grouped-query attention, selectively activating query heads per token while keeping key-value cache benefits, matching baseline accuracy with half the query-head compute at 250M parameter scale.
ChatPlanner is a novel framework that uses fine-tuned LLMs with Retrieval-Augmented Generation (RAG) to interpret user preferences from natural language queries and integrate them into public transit routing algorithms, outperforming existing route planners.
This paper introduces the Forced Deferral Attack (FDA), an adversarial image attack that manipulates confidence scores in multimodal LLM cascades, causing queries to be unnecessarily routed to stronger (more expensive) models, thereby shifting compute costs to the provider without degrading answer correctness.
OpenRouter's Fusion API offers pricing and provider information for routing AI model requests across multiple providers, enabling flexible and cost-effective access to various AI models.
TimeRouter introduces an efficient routing framework for time-series foundation models that uses lightweight discriminative routing and selective gating to adaptively select the best expert model without LLM overhead, achieving state-of-the-art on the GIFT-EVAL leaderboard.
InfraMind introduces an infrastructure-aware multi-agent LLM orchestration framework that uses reinforcement learning to dynamically select models and topologies based on real-time system load, achieving up to 7x lower latency and 99.9% SLO compliance under high load.
This paper presents a game-theoretic analysis of disaggregated inference architectures that separate prefill and decode phases across GPU pools, characterizing how GPU saturation affects performance. The authors propose an adaptive controller that detects saturation transitions and adjusts routing parameters, reducing the Price of Anarchy significantly in experiments on NVIDIA B200 clusters.