Tag
This article explains how to use any AI model with OpenAI's Codex Desktop app by modifying its config to point to a custom server and using a proxy to disguise the model name, enabling multi-provider support without breaking official functionality.
This paper proposes EcoTab, a table-aware stepwise routing framework that separately estimates uncertainty for table tokens and text tokens to dynamically route reasoning steps between small and large models, achieving a better accuracy-efficiency trade-off on table reasoning tasks.
RoRo introduces a rubric-guided process reward framework for stepwise model routing in Large Reasoning Models, using process rewards alongside outcome rewards to train a routing policy via GRPO, outperforming baselines on reasoning benchmarks.
This article introduces practical techniques to cut AI coding costs by 80%, including prompt caching, context trimming, and multi-model routing (using Kimi 2.6 for daily coding tasks and advanced models for core architecture).
A reflection on the trade-offs between using a single trillion-parameter reasoning model with adjustable depth (like Ring-2.6-1T) versus routing between separate specialized models, exploring which approach is cleaner or more cost-effective for agent workflows.
Introduces a GitHub repo that redirects Claude Code traffic to over a dozen free models like DeepSeek and Kimi, already used by 20,000+ developers. The article argues that the tool shows how replaceable each layer of the stack has become: frontend interaction, workflow, model providers, and so on.
A developer shares how they reduced their AI agent's weekly cost from $200 to $40 by routing simple subtasks to cheaper models like DeepSeek V4 Pro and Tencent Hunyuan while keeping complex reasoning on Opus 4.7, achieving comparable output quality for most work.
Weave launches a prompt router that analyzes prompts and routes them to the most cost-effective model, claiming up to 70% cost reduction without performance loss. It integrates with existing workflows like Claude, Cursor, and Codex, and its source code is available.
A discussion on effective FinOps strategies for managing costs in large-scale AI agent operations, covering tactics like model routing, prompt trimming, caching, and the need to track cost by agent, workflow, and customer.
A developer splits their AI agent's LLM calls into a cheap router model (GPT-OSS 120B) for tool-picking and a premium model (gpt-5.4) for synthesis, cutting costs by ~78% while maintaining output quality.
A user shares their personal routing strategy between various AI models for different tasks like tweet drafts, articles, code, agentic loops, and image generation, arguing that single-model setups lead to higher costs.
The author ran an experiment in which AI agents connected to Gmail via OAuth were sent emails containing obfuscated prompt injections. Frontier models sometimes caught the attacks, while cheap models silently executed them, suggesting that agent security depends largely on model cost and token budget rather than on architectural safeguards.
This article covers the development of the open-source AI model routing tool new-api since its April 2023 release, highlighting its dominance with over 90% market share among relay instances, and delves into both the contributions of its core developers and its underlying routing algorithms.
This paper introduces Switchcraft, the first AI model router specifically optimized for agentic tool calling to reduce inference costs. By using a lightweight DistilBERT classifier, it achieves significant cost savings while maintaining high accuracy in tool-use tasks.
The article describes a company's transition to a self-optimizing LLM stack that uses production traces to automatically route requests and fine-tune models, resulting in significant cost reductions and performance improvements.
The article discusses the growing viability of local AI models for everyday tasks, suggesting a shift toward hybrid architectures that optimize for cost and latency rather than relying solely on frontier cloud models.
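Several of the entries above describe the same basic pattern: send simple subtasks to a cheap model and keep hard reasoning on a premium one. The following is a minimal sketch of that idea, not the method of any listed article or tool. The model names, per-token prices, keyword list, and length threshold are all hypothetical placeholders chosen for illustration; a real router would more likely use a trained classifier or an uncertainty estimate, as some of the papers above do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_1k_tokens: float  # hypothetical price, for illustration only


# Placeholder models; neither name refers to a real product.
CHEAP = Model("cheap-model", 0.001)
PREMIUM = Model("premium-model", 0.015)

# Hypothetical keywords that mark a prompt as "hard".
HARD_HINTS = ("architecture", "refactor", "prove", "design")


def route(prompt: str, max_cheap_words: int = 200) -> Model:
    """Pick a model with a simple heuristic: long prompts or prompts
    containing a hard-task keyword go to the premium model."""
    text = prompt.lower()
    too_long = len(text.split()) > max_cheap_words
    looks_hard = any(hint in text for hint in HARD_HINTS)
    return PREMIUM if (too_long or looks_hard) else CHEAP


def estimated_cost(prompts: list[str], tokens_per_prompt: int = 1000) -> float:
    """Estimate total spend, assuming a fixed token count per prompt."""
    return sum(
        route(p).usd_per_1k_tokens * tokens_per_prompt / 1000 for p in prompts
    )


if __name__ == "__main__":
    batch = ["rename this variable", "design the service architecture"]
    for p in batch:
        print(p, "->", route(p).name)
    print("estimated cost (USD):", estimated_cost(batch))
```

The savings in any real deployment depend on how many requests the heuristic can safely send to the cheap model, which is why several of the summaries pair routing with per-agent or per-workflow cost tracking.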