routing

#routing

@ClementDelangue: Super excited about open-source router systems and routing models like @vllm_project semantic router: https://huggingfa…

X AI KOLs Timeline ↗ · 12h ago Cached

Clement Delangue highlights vLLM's new semantic router, an open-source system for routing LLM queries to the most appropriate model, aiming to shift value from expensive frontier models to a diverse ecosystem of open-source models.

Devin Fusion (8 minute read)

TLDR AI ↗ · yesterday Cached

Cognition introduces Devin Fusion, a multi-model harness that routes between frontier and cost-effective models using a sidekick architecture, achieving frontier-level performance at 35% lower cost.

@LinusEkenstam: ROT — Return on Tokens. We all knew we would end up here at some point. Tokenmaxxing was a dumb idea from the start. It…

X AI KOLs Following ↗ · 2d ago Cached

The author critiques the trend of token maximization in LLM usage and argues for a shift toward Return on Tokens (ROT) through optimization and routing for sustainable AI deployment.

The Future of AI Agents Might Not Be Bigger Context Windows

Reddit r/AI_Agents ↗ · 4d ago

The post argues that AI agent architecture should shift from monolithic agents that hold all context to a routing model where agents delegate tasks to specialized services, similar to how software evolved from monoliths to microservices.

Run Codex and Claude with any model including GLM 5.2. No settings file headaches.

Reddit r/ArtificialInteligence ↗ · 5d ago

A CLI tool called relay-ai acts as a proxy for Codex Desktop and Claude Code, enabling users to route requests to any model (including GLM 5.2) using their own API keys or OAuth subscriptions, with features to prevent crashes and manage context overflow.

[R] All Routes Lead to Collapse: attention sinks, representation collapse, and norm stratification are what content-based routing does under a norm-blind metric

Reddit r/MachineLearning ↗ · 5d ago Cached

This paper demonstrates that attention sinks, representation collapse, and norm stratification are not unique to attention mechanisms but are general consequences of content-based routing under a norm-blind similarity metric, as shown across multiple architectures including transformers, graph attention, state-space models, and recurrent mixers.

Focusing on What Matters: Saliency-Harnessing Accurate Routing for Diffusion MoE

Hugging Face Daily Papers ↗ · 6d ago Cached

SharpMoE is a post-training framework that improves routing in diffusion mixture-of-experts models by using clean latent features to identify salient tokens and a trajectory routing loss to allocate compute precisely, achieving state-of-the-art visual generation.

When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models

Hugging Face Daily Papers ↗ · 6d ago Cached

This paper identifies a fundamental constraint on multi-model LLM systems: accuracy is capped by the rate at which all models fail on the same query. Across 67 frontier models, the all-wrong rate is significantly underestimated by common metrics, limiting gains from voting, routing, and ensemble strategies.
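The ceiling described in this summary can be illustrated with a small sketch. It assumes a hypothetical per-query correctness matrix (one row per model, one column per query); the values are made up for illustration and are not taken from the paper. Any strategy that only selects among the models' own answers cannot score above 1 minus the all-wrong rate:

```python
from typing import Sequence


def all_wrong_rate(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of queries on which every model fails."""
    n_queries = len(correct[0])
    all_wrong = sum(
        1 for q in range(n_queries) if not any(row[q] for row in correct)
    )
    return all_wrong / n_queries


def selection_ceiling(correct: Sequence[Sequence[bool]]) -> float:
    """Best accuracy reachable by picking one model's answer per query."""
    return 1.0 - all_wrong_rate(correct)


# Hypothetical toy data: 3 models x 5 queries (True = model answered correctly).
correct = [
    [True, True, False, False, True],
    [True, False, True, False, False],
    [False, True, True, False, True],
]

# Per-model accuracy, for comparison with the ceiling.
per_model = [sum(row) / len(row) for row in correct]
ceiling = selection_ceiling(correct)
```

In this toy matrix only the fourth query is missed by every model, so even a perfect router is limited by that one shared failure.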

OmniPath: A Multi-Modal Agentic Framework for Auditing Wheelchair Accessibility

arXiv cs.AI ↗ · 2026-06-24 Cached

OmniPath is a multi-modal agentic framework that combines OpenStreetMap network topology with aerial LiDAR data to audit wheelchair accessibility by analyzing physical barriers like slope and surface discontinuities at high resolution, validated against field surveys.

Pondering routing more of my traffic via nodes outside the UK

Hacker News Top ↗ · 2026-06-21 Cached

The author expresses concern over UK online safety policies that threaten freedom of expression and privacy, and considers routing traffic through nodes outside the UK to circumvent potential censorship.

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

arXiv cs.AI ↗ · 2026-06-18 Cached

The paper proposes ARIADNE, a training-free, adapter-agnostic routing framework that selects the optimal PEFT adapter at inference time by measuring input proximity to adapter-specific centroids in embedding space, recovering 97.44% of upper-bound performance on 23 tasks.
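Centroid-based selection of this kind can be sketched as follows. This is a minimal illustration and not the paper's implementation: the summary does not state the proximity measure or the embedding model, so cosine similarity, the adapter names, and the three-dimensional vectors are all assumptions.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def route(embedding: List[float], centroids: Dict[str, List[float]]) -> str:
    """Return the adapter whose centroid is closest to the input embedding."""
    return max(centroids, key=lambda name: cosine(embedding, centroids[name]))


# Hypothetical adapter centroids (e.g. the mean embedding of each adapter's data).
centroids = {
    "summarization": [0.9, 0.1, 0.0],
    "translation": [0.1, 0.9, 0.1],
    "code": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of an incoming query.
chosen = route([0.2, 0.8, 0.1], centroids)
```

No router is trained here: adding an adapter only requires computing one more centroid, which matches the training-free framing in the summary.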

@jbhuang0604: Huge! It’s amazing how often Noam’s papers end up at the center of the field. In many tutorial videos I’ve made, they’v…

X AI KOLs Following ↗ · 2026-06-18 Cached

The article provides a detailed explanation of Mixture of Experts (MoE) in transformers, covering routing, load balancing, and recent innovations like fine-grained experts. It also highlights the significance of Noam Shazeer's research contributions and his move from Google to OpenAI.

Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention

Hugging Face Daily Papers ↗ · 2026-06-18 Cached

Grouped Query Experts (GQE) improves Transformer efficiency by applying a mixture-of-experts layer on top of grouped-query attention, selectively activating query heads per token while keeping key-value cache benefits, matching baseline accuracy with half the query-head compute at 250M parameter scale.

ChatPlanner: A Large Language Model Framework for Personalized Public Transit Routing

arXiv cs.AI ↗ · 2026-06-16 Cached

ChatPlanner is a novel framework that uses fine-tuned LLMs with Retrieval-Augmented Generation (RAG) to interpret user preferences from natural language queries and integrate them into public transit routing algorithms, outperforming existing route planners.

Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades

arXiv cs.AI ↗ · 2026-06-16 Cached

This paper introduces the Forced Deferral Attack (FDA), an adversarial image attack that manipulates confidence scores in multimodal LLM cascades, causing queries to be unnecessarily routed to stronger (more expensive) models, thereby shifting compute costs to the provider without degrading answer correctness.

Openrouter Fusion API

Hacker News Top ↗ · 2026-06-15 Cached

OpenRouter's Fusion API offers pricing and provider information for routing AI model requests across multiple providers, enabling flexible and cost-effective access to various AI models.

TimeRouter: Efficient and Adaptive Routing of Time-Series Foundation Models

arXiv cs.LG ↗ · 2026-06-11 Cached

TimeRouter introduces an efficient routing framework for time-series foundation models that uses lightweight discriminative routing and selective gating to adaptively select the best expert model without LLM overhead, achieving state-of-the-art results on the GIFT-EVAL leaderboard.

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

arXiv cs.AI ↗ · 2026-06-11 Cached

InfraMind introduces an infrastructure-aware multi-agent LLM orchestration framework that uses reinforcement learning to dynamically select models and topologies based on real-time system load, achieving up to 7x lower latency and 99.9% SLO compliance under high load.

The Price of Anarchy in Disaggregated Inference

Hugging Face Daily Papers ↗ · 2026-06-11 Cached

This paper presents a game-theoretic analysis of disaggregated inference architectures that separate prefill and decode phases across GPU pools, characterizing how GPU saturation affects performance. The authors propose an adaptive controller that detects saturation transitions and adjusts routing parameters, reducing the Price of Anarchy significantly in experiments on NVIDIA B200 clusters.

Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

arXiv cs.CL ↗ · 2026-06-10 Cached

The paper proposes TRACE, a method for machine unlearning in Mixture-of-Experts language models that calibrates retain regularization by reweighting token-level retain losses to address forget-retain routing mismatch. Experiments show improved forget-utility trade-off across multiple MoE LLMs.
