Tag
This paper presents a two-stage methodology for end-to-end LLM deployment on spatial NPUs, progressing from human-guided development to an autonomous agent skill system. The system achieves speedups of 2.2x on prefill and 4.0x on decode for a reference model, and autonomously deploys eight additional LLMs on the AMD XDNA 2 NPU with minimal human guidance.
A practitioner seeks advice on running AI agents 24/7 without high API costs, asking about local models, cloud GPUs, or hosted APIs, and wants cost-efficient setups balancing reliability and reasoning quality.
A lecture on LLM deployment techniques covering AWQ, vLLM, FlashAttention, quantization, and activation smoothing for efficient serving.
A discussion on the challenges consultants face when clients want to deploy LLMs despite having poor data governance, weighing the risks of fixing data first versus deploying quickly on messy data.
Skymizer announces the HTX301, a PCIe inference card capable of running 700B-parameter LLMs on-premises, with high memory capacity and low power consumption.
Anyscale published a technical guide on deploying production-ready AI agents using Ray Serve, MCP, and A2A protocols. The article addresses common infrastructure bottlenecks by proposing a decoupled microservices architecture that enables independent scaling of LLMs, tools, and agents.
This paper introduces geometric stability measures, based on pairwise distance consistency in representations, to predict language model steerability and detect structural drift. Supervised variants achieve near-perfect correlation (ρ = 0.89 to 0.97) with linear steerability across 35 to 69 embedding models, while unsupervised variants outperform CKA and Procrustes for post-deployment drift detection.