OpenUI is a new open-source specification for describing UI components with 67% fewer tokens than JSON, offering a model-agnostic, framework-neutral solution for LLM-based interface generation.
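The actual OpenUI grammar is not shown here, but the token argument is easy to illustrate: JSON spends many tokens on braces, quotes, and repeated keys. A minimal sketch, with a hypothetical compact syntax standing in for the real spec:

```python
# Rough illustration of why a terse UI DSL can beat JSON on token count.
# The "dsl_form" string is a hypothetical compact syntax, NOT the real
# OpenUI grammar; only the relative size of the two encodings matters here.
import re

json_form = '''{"type":"form","children":[
  {"type":"input","props":{"label":"Email","name":"email"}},
  {"type":"button","props":{"text":"Submit","variant":"primary"}}]}'''

dsl_form = '''form
  input label=Email name=email
  button text=Submit variant=primary'''

def rough_tokens(s: str) -> int:
    # Crude proxy for a BPE tokenizer: split on words and punctuation.
    return len(re.findall(r"\w+|[^\w\s]", s))

print(rough_tokens(json_form), rough_tokens(dsl_form))
```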
The article argues that most 'agentic' systems are actually single agents with tools, highlighting the high costs and complexity of multi-agent setups. It outlines three valid multi-agent patterns—orchestrator-worker, pipeline, and peer-to-peer—and provides criteria for deciding when to use them versus a single agent.
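Of the three patterns, orchestrator-worker is the most common. A minimal sketch of its shape, with `call_llm` as a stub for any chat-completion call and the numbered-subtask format as an assumption:

```python
# Minimal orchestrator-worker sketch. `call_llm` is a stub standing in for
# any model provider; the decomposition format is an assumption.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider")

def orchestrate(task: str) -> str:
    # Orchestrator: decompose the task into independent subtasks.
    plan = call_llm(f"Split into numbered subtasks:\n{task}")
    subtasks = [line.split(".", 1)[-1].strip()
                for line in plan.splitlines() if line[:1].isdigit()]
    # Workers: each subtask gets its own focused context.
    results = [call_llm(f"Solve this subtask only:\n{s}") for s in subtasks]
    # Orchestrator again: merge worker outputs into one answer.
    return call_llm("Combine these partial results:\n" + "\n---\n".join(results))
```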
The article discusses the potential compatibility of DFlash and PFlash multi-model speedup methods with Heretic, a tool used for model decensoring, while highlighting the performance benefits on models like Qwen3.6 and Gemma 4.
This paper introduces MedTPE, a method for efficient, lossless prompt compression of electronic health records for large language models, significantly reducing token length and inference latency in clinical prediction tasks.
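The paper's exact compression scheme is not described in this summary, but one lossless idea in the same spirit is factoring repeated EHR field names out of per-row JSON into a single header, which is exactly invertible. A hedged sketch (not necessarily MedTPE's method):

```python
# One lossless EHR compression idea, shown for illustration only: factor
# repeated field names out of per-row JSON into a CSV-style header line.
import json

records = [
    {"code": "I10", "desc": "Essential hypertension", "date": "2023-01-04"},
    {"code": "E11.9", "desc": "Type 2 diabetes", "date": "2023-02-11"},
]

def compress(rows: list[dict]) -> str:
    keys = list(rows[0])
    header = "|".join(keys)
    body = "\n".join("|".join(str(r[k]) for k in keys) for r in rows)
    return header + "\n" + body

def decompress(blob: str) -> list[dict]:
    header, *lines = blob.splitlines()
    keys = header.split("|")
    return [dict(zip(keys, line.split("|"))) for line in lines]

packed = compress(records)
assert decompress(packed) == records          # lossless round trip
print(len(json.dumps(records)), len(packed))  # character (~token) savings
```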
Unsloth has released an optimized GGUF version of the Qwen3.6-27B MTP model, achieving significantly faster inference speeds (up to 114 tok/s on an RTX 5090) compared to previous quantizations.
The article introduces Echo-LoRA, a new parameter-efficient fine-tuning method that injects cross-layer representations from deeper source layers into shallow LoRA modules to improve performance without adding inference-time overhead.
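The exact wiring is not spelled out in this summary; a minimal sketch, assuming the cross-layer signal is a projected activation taken from a deeper source layer (how the real method obtains it without inference-time overhead may differ):

```python
# Hedged sketch of the cross-layer LoRA idea; names and the mixing rule
# are assumptions, not Echo-LoRA's confirmed design.
import torch
import torch.nn as nn

class CrossLayerLoRA(nn.Module):
    """LoRA adapter whose input mixes in a projected deeper-layer activation."""
    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)     # LoRA A
        self.up = nn.Linear(rank, d_model, bias=False)       # LoRA B
        self.echo = nn.Linear(d_model, d_model, bias=False)  # cross-layer proj
        nn.init.zeros_(self.up.weight)  # start as a no-op, like vanilla LoRA

    def forward(self, h: torch.Tensor, deep_h: torch.Tensor) -> torch.Tensor:
        # h: hidden state at the shallow layer being adapted.
        # deep_h: activation from a deeper source layer.
        return self.up(self.down(h + self.echo(deep_h)))

x = torch.randn(2, 16, 512)
delta = CrossLayerLoRA(512)(x, torch.randn_like(x))
print(delta.shape)  # torch.Size([2, 16, 512])
```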
This paper proposes an empirical 'sparse-to-dense' reward principle for language model post-training, arguing that scarce labeled data should be used with sparse rewards for teacher model discovery and dense rewards for student compression via distillation. The authors demonstrate that this staged approach, bridging sparse RL and on-policy distillation, outperforms direct GRPO on deployment-sized models in math benchmarks.
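A skeleton of the staged recipe, with all model objects stubbed; using teacher-vs-student log-prob gaps as the dense per-token signal is a common on-policy distillation choice assumed here, not taken verbatim from the paper:

```python
def verify(answer: str, gold: str) -> float:
    # Stage 1's sparse signal: a single 0/1 outcome per rollout.
    return float(answer.strip() == gold.strip())

def stage1_teacher_rl(teacher, problems):
    # Spend the scarce labels on RL for a strong teacher (sparse reward).
    for prompt, gold in problems:
        rollout = teacher.generate(prompt)
        teacher.rl_update(prompt, rollout, reward=verify(rollout, gold))

def stage2_distill(student, teacher, prompts):
    # Compress into the deployment-sized student with a dense, per-token
    # signal: teacher-vs-student log-prob gaps on on-policy samples.
    for prompt in prompts:
        rollout = student.generate(prompt)          # on-policy
        t = teacher.logprobs(prompt, rollout)
        s = student.logprobs(prompt, rollout)
        student.rl_update(prompt, rollout,
                          reward=[ti - si for ti, si in zip(t, s)])
```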
AutoTTS is an open-source tool that uses agentic discovery to automatically find optimal test-time scaling strategies for LLMs, significantly reducing token usage and cost through replay-based evaluation.
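The replay idea is what keeps strategy search cheap: record each model call once, then score candidate strategies against the cache instead of re-querying. A sketch with hypothetical interfaces:

```python
# Sketch of replay-based evaluation: model calls are recorded once, then
# candidate test-time-scaling strategies are scored against the cache so
# strategy search costs no extra tokens. All names are hypothetical.
import hashlib, json

class ReplayCache:
    def __init__(self, live_call):
        self.live_call = live_call      # real model call, used on cache miss
        self.store: dict[str, str] = {}

    def call(self, prompt: str, **params) -> str:
        key = hashlib.sha256(json.dumps([prompt, params], sort_keys=True)
                             .encode()).hexdigest()
        if key not in self.store:
            self.store[key] = self.live_call(prompt, **params)
        return self.store[key]

def evaluate_strategy(strategy, cache: ReplayCache, tasks) -> float:
    # A "strategy" maps a task to an answer using only cached calls.
    return sum(strategy(t, cache.call) == gold for t, gold in tasks) / len(tasks)
```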
OptiLLM is an open-source inference proxy that boosts LLM reasoning accuracy by up to 10x using advanced techniques, without requiring retraining, and is compatible with various AI APIs.
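Because it is an OpenAI-compatible proxy, the standard client works once `base_url` points at it. Selecting a technique by prefixing the model name (e.g. `moa-`) follows my reading of the project README; treat the exact slugs as an assumption and verify against the repo:

```python
# Pointing the standard OpenAI client at a locally running OptiLLM proxy.
# The "moa-" model-name prefix (mixture-of-agents) is an assumed slug.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="optillm")

resp = client.chat.completions.create(
    model="moa-gpt-4o-mini",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9?"}],
)
print(resp.choices[0].message.content)
```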
The article describes a company's transition to a self-optimizing LLM stack that uses production traces to automatically route requests and fine-tune models, resulting in significant cost reductions and performance improvements.
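The routing half of such a stack can be surprisingly simple. A hypothetical sketch, not the company's actual system: pick the cheapest model whose observed success rate for the request's category clears a threshold, mined from production traces:

```python
# Hypothetical trace-driven router: cheapest model first, strongest fallback.
from collections import defaultdict

MODELS_BY_COST = ["small-model", "mid-model", "large-model"]  # cheap to pricey

class TraceRouter:
    def __init__(self, min_success=0.9, min_samples=50):
        self.stats = defaultdict(lambda: [0, 0])  # (category, model) -> [ok, total]
        self.min_success, self.min_samples = min_success, min_samples

    def record(self, category: str, model: str, ok: bool):
        s = self.stats[(category, model)]
        s[0] += ok
        s[1] += 1

    def route(self, category: str) -> str:
        for model in MODELS_BY_COST:
            ok, total = self.stats[(category, model)]
            if total >= self.min_samples and ok / total >= self.min_success:
                return model
        return MODELS_BY_COST[-1]   # not enough evidence: use the strongest
```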
SlimSpec introduces a low-rank parameterization for drafter LM-heads to accelerate speculative decoding in LLMs, achieving 4-5x speedup while maintaining full vocabulary support.
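The concrete win is in the drafter's vocabulary projection, usually the largest single matrix in a small draft model. A sketch of the factorization and the parameter savings (sizes illustrative, not SlimSpec's reported configuration):

```python
# Low-rank drafter LM head: replace the full (vocab x d_model) projection
# with a rank-r factorization. Full vocabulary is preserved because the
# output dimension is still `vocab`; only the head's inner rank shrinks.
import torch.nn as nn

vocab, d_model, rank = 128_000, 2048, 256

full_head = nn.Linear(d_model, vocab, bias=False)
slim_head = nn.Sequential(
    nn.Linear(d_model, rank, bias=False),   # d_model -> r
    nn.Linear(rank, vocab, bias=False),     # r -> vocab (full vocab logits)
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full_head), count(slim_head), count(full_head) / count(slim_head))
# ~262M vs ~33M parameters here: roughly an 8x smaller drafter head.
```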
The author introduces 'Apohara Context Forge,' an open-source framework and methodology for optimizing context windows in coding agents using role-aware segmentation and tiered relevance scoring.
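The framework's real scoring is not detailed in this summary; a hypothetical sketch of what role-aware, tiered packing can look like, with roles and tiers invented for illustration:

```python
# Hypothetical role-aware, tiered context packing: higher tiers are packed
# first under a shared token budget, then by relevance score within a tier.
from dataclasses import dataclass

@dataclass
class Segment:
    role: str        # e.g. "system", "active_file", "reference", "history"
    text: str
    score: float     # tiered relevance score, however it is computed

TIER = {"system": 0, "active_file": 1, "reference": 2, "history": 3}

def pack(segments: list[Segment], budget_tokens: int) -> list[Segment]:
    ordered = sorted(segments, key=lambda s: (TIER[s.role], -s.score))
    picked, used = [], 0
    for seg in ordered:
        cost = len(seg.text.split())        # crude token estimate
        if used + cost <= budget_tokens:
            picked.append(seg)
            used += cost
    return picked
```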
The authors detail their experience building a code indexing system, concluding that graph-based retrieval with LLM-generated semantics outperforms vector embeddings and pure AST parsing. They open-sourced the system, Bytebell, which uses Neo4j to store semantic context for efficient and precise code retrieval.
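An illustration of the retrieval shape this enables: given a symbol, pull its node plus one hop of callers and callees along with their LLM-written summaries. The schema here (`:Function` nodes, `CALLS` edges, a `summary` property) is invented for the sketch and may not match Bytebell's; the driver calls are the standard neo4j Python API:

```python
# Graph-based code retrieval with the official neo4j driver (schema assumed).
from neo4j import GraphDatabase

QUERY = """
MATCH (f:Function {name: $name})
OPTIONAL MATCH (f)-[:CALLS]->(callee:Function)
OPTIONAL MATCH (caller:Function)-[:CALLS]->(f)
RETURN f.summary AS summary,
       collect(DISTINCT callee.summary) AS callees,
       collect(DISTINCT caller.summary) AS callers
"""

def fetch_context(uri: str, auth: tuple, symbol: str) -> dict:
    with GraphDatabase.driver(uri, auth=auth) as driver:
        records, _, _ = driver.execute_query(QUERY, name=symbol)
        return records[0].data() if records else {}
```

One hop of graph structure gives the LLM the call-site context that a flat vector hit cannot, which is the core of the authors' argument.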
The article argues that context engineering, which involves structuring the information and memory available to an AI, is more critical for performance than prompt engineering alone. It provides a structured overview of a course designed to teach how to build reliable AI systems by managing context layers like session history and persistent memory.
A developer reports achieving high accuracy with fine-tuned Qwen 3.5 4B and 8B models using Unsloth, suggesting a shift towards specialized Expert Language Models (ELMs) for niche tasks.
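For reference, the kind of Unsloth LoRA setup the post describes looks roughly like this; the model id is a placeholder, and while `from_pretrained`/`get_peft_model` follow Unsloth's documented API, check the current docs for exact argument names:

```python
# Minimal Unsloth LoRA setup of the kind described in the post.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B",  # placeholder; pick your base checkpoint
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# From here, train with e.g. trl's SFTTrainer on the niche-task dataset.
```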
Google released Gemma 4, an open-source AI model optimized for local execution on standard laptops, offering 3x faster performance and a 256k context window for free under an Apache 2.0 license.
This paper introduces AutoTTS, an environment-driven framework that automates the discovery of test-time scaling strategies for LLMs by formulating it as controller synthesis. It demonstrates improved accuracy-cost tradeoffs on mathematical reasoning benchmarks with minimal computational overhead.
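The controller-synthesis framing reduces to searching for a mapping from problems to decode configurations, then scoring accuracy against token cost in an environment loop. A sketch with hypothetical interfaces and a trivial hand-written policy where AutoTTS would search:

```python
# Test-time scaling as controller synthesis: a controller maps a problem to
# a decode configuration; the environment scores accuracy vs. token cost.
from dataclasses import dataclass

@dataclass
class ScalingConfig:
    n_samples: int       # e.g. for majority voting / best-of-n
    temperature: float

def controller(problem: str) -> ScalingConfig:
    # Trivial hand-written policy; AutoTTS would *search* for this mapping.
    hard = len(problem) > 400 or "prove" in problem.lower()
    return ScalingConfig(n_samples=8 if hard else 1,
                         temperature=0.8 if hard else 0.0)

def run_env(problems, solve, score) -> tuple[float, int]:
    correct, tokens = 0, 0
    for p, gold in problems:
        cfg = controller(p)
        answers, cost = solve(p, cfg)     # stub: model calls under cfg
        correct += score(answers, gold)
        tokens += cost
    return correct / len(problems), tokens
```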
The paper introduces SPEED, a layer-asymmetric KV visibility policy that reduces long-context inference costs by processing prompt tokens only in lower layers during prefill while maintaining full-depth attention during decoding.
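A toy sketch of the resulting cache shape, taking the summary at face value: prompt tokens populate KV only in the lower layers during prefill, while decoded tokens populate all layers. How the real method compensates the upper layers is not specified here:

```python
# Toy model of layer-asymmetric KV visibility: only the cache shape is shown.
N_LAYERS, LOW_LAYERS = 32, 12      # illustrative layer split

kv_cache = {layer: [] for layer in range(N_LAYERS)}

def prefill(prompt_tokens):
    for tok in prompt_tokens:
        for layer in range(LOW_LAYERS):        # lower layers only
            kv_cache[layer].append(("kv", layer, tok))

def decode_step(tok):
    for layer in range(N_LAYERS):              # full depth while decoding
        kv_cache[layer].append(("kv", layer, tok))

prefill(range(1000))
decode_step("t0")
print(len(kv_cache[0]), len(kv_cache[31]))     # 1001 vs 1: the saving
```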
A large-scale study of 15 LLMs across 8 tasks reveals that optimization success hinges on maintaining localized search trajectories rather than on initial problem-solving ability or solution novelty.
Research introduces Skill-RAG, a novel approach that combines Skills with Retrieval-Augmented Generation to address a key inefficiency of traditional RAG systems, which retrieve on every query regardless of whether the model actually needs the information.
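A sketch of such a retrieval gate in the spirit of the summary; using a yes/no self-assessment prompt is my assumption, not necessarily Skill-RAG's actual mechanism:

```python
# Retrieve only when the model signals it lacks the knowledge. The yes/no
# self-assessment gate is an assumed stand-in for Skill-RAG's real gate.
def answer(query: str, llm, retriever) -> str:
    need = llm("Answer strictly YES or NO: do you need external documents "
               f"to answer this reliably?\n{query}")
    if need.strip().upper().startswith("YES"):
        docs = retriever(query)
        return llm(f"Context:\n{docs}\n\nQuestion: {query}")
    return llm(query)   # skip retrieval: saves latency and tokens
```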