Fireworks AI announces Serverless 2.0, introducing three serving tiers (Standard, Priority, Fast) to handle traffic congestion without pre-provisioning GPUs, enabling per-request routing for reliability and cost efficiency.
vLLM introduces Semantic Router, a serving-layer primitive that enables collaboration between multiple models through micro-agents, allowing the router to improve output quality without modifying model weights.
In an interview, Rebellions CEO Sunghyun Park discusses the company's memory-centric architecture approach to AI inference, aiming to compete with NVIDIA by offering greater efficiency and lower costs.
A guide showing how to build a system for under $2,500 from used server components to run GLM-5.2 and other large AI models locally, at the cost of slower generation speed.
Aquaduck is seeking beta testers for its new AI inference network, inviting early users to try the service.
A user notes that no cloud providers currently offer the GLM-5.2 model in native bf16 precision, highlighting a gap in hosting options.
The NPU on AMD Strix Halo devices is now usable for AI inference, enabling a hybrid mode that combines the NPU and iGPU for faster prompt processing. Tools like Lemonade and AMD's ROCm software make this possible.
Qualcomm is aggressively expanding in AI through multiple acquisitions, including Modular (creator of Mojo and MAX inference framework) and potentially Tenstorrent, signaling a significant push against Nvidia's CUDA ecosystem.
Baseten announces a $1.5B Series F funding round led by multiple investors, citing 20x revenue growth and 40x inference volume growth as evidence of the market's shift toward inference as the key AI layer.
A guide to optimizing VRAM usage on an AMD 7900XTX to run a 27B Qwen model at Q6_K quantization with a 131k context, by compiling llama.cpp with OpenBLAS and CUDA_FA_ALL_QUANTS and quantizing the KV cache to q5_0/q4_0.
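To see why KV cache quantization matters at a 131k context, here is a rough back-of-the-envelope sketch. The bits-per-element figures follow from llama.cpp's 32-value block layouts (for example, q4_0 stores 18 bytes per block). The model shape used below (64 layers, 8 KV heads, head dimension 128) is a hypothetical placeholder, not the configuration of the model in the guide, and the function name is illustrative only.

```python
# Effective bits per cached element for some llama.cpp cache types.
# Quantized types store 32 values per block plus per-block metadata.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q5_0": 5.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, type_k, type_v):
    """Estimate KV cache size in bytes for a plain full-attention transformer."""
    # One K element and one V element per layer, KV head, head dim, and token.
    elems = n_layers * n_kv_heads * head_dim * n_ctx
    bits = BITS_PER_ELEMENT[type_k] + BITS_PER_ELEMENT[type_v]
    return elems * bits / 8


# Hypothetical model shape, chosen only to illustrate the arithmetic.
shape = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=131072)

full = kv_cache_bytes(**shape, type_k="f16", type_v="f16")
quant = kv_cache_bytes(**shape, type_k="q5_0", type_v="q4_0")

print(f"f16/f16:   {full / 2**30:.1f} GiB")   # 32.0 GiB
print(f"q5_0/q4_0: {quant / 2**30:.1f} GiB")  # 10.0 GiB
```

Under these assumed dimensions the quantized cache is under a third of the f16 size. Real figures depend on the actual model architecture and on any layers that do not keep a full-length cache.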
AI inference startup Baseten is reportedly raising $1.5 billion at a $13 billion valuation, just months after its previous mega-round, highlighting the enormous investor interest in the inference layer of AI.
A tweet from Philip Kiely highlights cost savings by switching from closed-source AI models to open-source alternatives, using Baseten's ROI calculator tool.
Tensordyne announces the Napier AI inference rack, claiming 13x the throughput of Nvidia's NVL72 GB300 by using log-space math to reduce energy and transistor usage, potentially disrupting the inference hardware landscape.
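The announcement summary does not describe how Tensordyne implements log-space math, so the following is only a generic sketch of the underlying idea: when values are stored as a sign plus a base-2 logarithm of the magnitude, multiplication reduces to an addition of exponents. The function names are hypothetical and nothing here reflects the Napier hardware itself.

```python
import math


def to_log(x):
    """Encode a nonzero float as (is_negative, log2 of its magnitude)."""
    if x == 0:
        raise ValueError("zero has no finite logarithm; it needs a special case")
    return (x < 0, math.log2(abs(x)))


def log_mul(a, b):
    """Multiply two log-encoded values: XOR the signs, add the exponents."""
    return (a[0] != b[0], a[1] + b[1])


def from_log(v):
    """Decode a log-encoded value back to an ordinary float."""
    is_negative, exponent = v
    magnitude = 2.0 ** exponent
    return -magnitude if is_negative else magnitude


product = from_log(log_mul(to_log(3.0), to_log(-4.0)))
print(product)  # approximately -12.0, up to floating-point rounding
```

The trade-off in such number systems is that addition of two values becomes the harder operation, which is why they tend to be discussed for multiply-heavy workloads.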
A review of Philip Kiely's book 'Inference Engineering', recommending it to avoid common mistakes in AI inference engineering.
The tweet praises a mathematical idea timed well for AI inference's arithmetic profile and expresses interest in seeing results on reasoning models during long generation runs.
Dian Hu draws a parallel between the importance of low latency in search engines and the upcoming need for fast AI inference.
A guide to the best local LLMs for consumer GPUs as of June 2026, using llama.cpp to run models like Gemma 4-12B, Qwen3.6-27B, and Nex-N2-Mini on 8-32GB VRAM, with setup and launch commands.
A setup using RTX 5080 and RTX 3090 GPUs achieves 80 tokens per second on the Qwen 3.6 27B Q8 model.
NVIDIA's Confidential Computing, using Blackwell GPUs, is being adopted by Apple to expand its Private Cloud Compute to Google Cloud, enabling secure server-side inference for Apple Intelligence features while maintaining strong privacy guarantees.
Huawei has open-sourced its CANN software toolkit to compete with Nvidia's CUDA, and DeepSeek V4 shows significant inference performance improvements on Huawei Ascend chips.