ai-inference

@omarsar0: https://x.com/omarsar0/status/2071964375125037343

X AI KOLs Following · 4h ago

Fireworks AI announces Serverless 2.0, introducing three serving tiers (Standard, Priority, Fast) to handle traffic congestion without pre-provisioning GPUs, enabling per-request routing for reliability and cost efficiency.

Micro-Agent: Beat Frontier Models with Collaboration Inside Model API

Hacker News Top · yesterday

vLLM introduces Semantic Router, a serving-layer primitive that enables collaboration between multiple models through micro-agents, allowing the router to improve output quality without modifying model weights.

My interview with Rebellions CEO: Five things I learned from the man going toe to toe with NVIDIA

Reddit r/artificial · yesterday

In an interview, Rebellions CEO Sunghyun Park discusses the company's memory-centric architecture approach to AI inference, aiming to compete with NVIDIA by offering greater efficiency and lower costs.

Running GLM5.2 on budget hardware < $2500.

Reddit r/LocalLLaMA · 3d ago

A guide to building a sub-$2500 system from used server components that runs GLM5.2 and other large AI models locally, trading speed for cost.

Seeking beta testers for Aquaduck—a new AI inference network

Reddit r/AI_Agents · 3d ago

Aquaduck is inviting early users to beta-test its new AI inference network.

@jpschroeder: ZERO providers offer GLM-5.2 in native bf16.

X AI KOLs Following · 4d ago

A user notes that no cloud provider currently offers the GLM-5.2 model in native bf16 precision, highlighting a gap in hosting options.

Big News for AMD / Strix Halo+ Owners

Reddit r/LocalLLaMA · 6d ago

The NPU on AMD Strix Halo devices is now usable for AI inference, enabling a hybrid mode that combines the NPU and iGPU for faster prompt processing. Tools like Lemonade and AMD's ROCm software make this possible.

Qualcomm wants to grow in the AI space (multiple acquisitions underway)

Reddit r/singularity · 6d ago

Qualcomm is aggressively expanding in AI through multiple acquisitions, including Modular (creator of Mojo and the MAX inference framework) and potentially Tenstorrent, signaling a significant push against NVIDIA's CUDA ecosystem.

@tuhinone: https://x.com/tuhinone/status/2069089174494625907

X AI KOLs Following · 2026-06-22

Baseten announces a $1.5B Series F funding round led by multiple investors, citing 20x revenue growth and 40x inference volume growth as evidence of the market's shift toward inference as the key AI layer.

7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context

Reddit r/LocalLLaMA · 2026-06-20

A guide to optimizing VRAM usage on an AMD 7900XTX so that a 27B Qwen model fits with Q6K quantization and 131k context, by compiling llama.cpp with OpenBLAS and CUDA_FA_ALL_QUANTS and quantizing the KV cache to q5_0/q4_0.

AI inference startup Baseten reportedly raising $1.5B months after its last mega-round

TechCrunch AI · 2026-06-18

AI inference startup Baseten is reportedly raising $1.5 billion at a $13 billion valuation, just months after its previous mega-round, highlighting the enormous investor interest in the inference layer of AI.

@philipkiely: On sample workloads: Opus 4.8 -> Kimi 2.7 Code | 82% savings GPT 5.5 -> GLM 5.2 | 77% savings Gemini 3.5 Flash -> Nemot…

X AI KOLs Following · 2026-06-17

A tweet from Philip Kiely highlights the cost savings from switching closed-source AI models to open-source alternatives, as estimated on sample workloads with Baseten's ROI calculator tool.

@rohanpaul_ai: Quite a massive inferencing rack breakthrough from @TensordyneInc . They just announced an AI-inference rack, claiming …

X AI KOLs Following · 2026-06-17

Tensordyne announces the Napier AI inference rack, claiming 13x the throughput of Nvidia's NVL72 GB300 by using log-space math to reduce energy and transistor usage, potentially disrupting the inference hardware landscape.

@barrowjoseph: My full review of @philipkiely's Inference Engineering. TL;DR: I desperately wish I could ship this book back to myself…

X AI KOLs Timeline · 2026-06-17

A review of Philip Kiely's book 'Inference Engineering', recommending it as a way to avoid common mistakes in AI inference engineering.

@rohanpaul_ai: Brilliant. This feels like one of those cases where the math idea finally arrived at the right timing, because AI infer…

X AI KOLs Following · 2026-06-16

The tweet praises a mathematical idea as well timed for the arithmetic profile of AI inference and expresses interest in seeing results on reasoning models during long generation runs.

@sdianahu: 1/ fast AI inference is about to replay the history lesson from search engines on why low latency is so important

X AI KOLs Following · 2026-06-14

Diana Hu draws a parallel between the importance of low latency in search engines and the coming need for fast AI inference.

@TraffAlex: Best Local LLMs for Consumer GPUs — llama.cpp Guide (June 2026) What I actually run on consumer hardware right now. Eve…

X AI KOLs Timeline · 2026-06-14

A guide to the best local LLMs for consumer GPUs as of June 2026, using llama.cpp to run models like Gemma 4-12B, Qwen3.6-27B, and Nex-N2-Mini on 8-32GB VRAM, with setup and launch commands.

RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

Hacker News Top · 2026-06-13

A setup using RTX 5080 and RTX 3090 GPUs achieves 80 tokens per second on the Qwen 3.6 27B Q8 model.

NVIDIA Confidential Computing to Help Expand Apple’s Private Cloud Compute

NVIDIA Blog · 2026-06-09

NVIDIA's Confidential Computing, using Blackwell GPUs, is being adopted by Apple to expand its Private Cloud Compute to Google Cloud, enabling secure server-side inference for Apple Intelligence features while maintaining strong privacy guarantees.

@bookwormengr: Wonderful coverage on CANN (Huawei's CUDA) and DeepSeek V4 inference on Huawei chips.... "CANN (Compute Architecture fo…

X AI KOLs Timeline · 2026-06-09

Huawei has open-sourced its CANN software toolkit to compete with Nvidia's CUDA, and DeepSeek V4 shows significant inference performance improvements on Huawei Ascend chips.
