The article benchmarks the Qwen3.6-27B model using Multi-Token Prediction (MTP) and tensor parallelism on dual MI50 GPUs, demonstrating significant speedups via llama.cpp.
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
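The post's exact command line is not reproduced here, but a typical llama.cpp launch for a quantized MoE model on a 12 GB card might look like the sketch below. The GGUF filename and the `--n-cpu-moe` count are illustrative placeholders, and any MTP-specific flags depend on the llama.cpp build in use.

```shell
# Sketch only: llama-server launch for a quantized MoE model on a 12 GB GPU.
# Model filename and layer counts are placeholders, not the post's settings.
# Tune -ngl / --n-cpu-moe so the GPU-resident weights fit in 12 GB of VRAM.
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  --n-cpu-moe 20 \
  -fa \
  --port 8080
```

The general pattern is to offload all dense layers to the GPU (`-ngl 99`) while keeping a slice of the expert weights in system RAM (`--n-cpu-moe`), which is what makes a 35B-class MoE usable on a 12 GB card.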
The author highlights the impressive capabilities of the open-source Qwen 3.6-27B model running locally on an RTX 5090, noting its strong performance on programming tasks and comparing it favorably to commercial models, despite the complexity of local deployment.
A developer shares local inference benchmarks and systemd configurations for running the Qwen3.6-27B model on an NVIDIA RTX Pro 4500 Blackwell GPU using llama.cpp. The post requests optimization tips for throughput and explores potential use cases for larger models.
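The post's actual unit file is not shown; a minimal systemd service wrapping llama-server might look like the following sketch, where the binary path, model path, user, and flags are all placeholder assumptions rather than the author's configuration.

```ini
# /etc/systemd/system/llama-server.service — illustrative sketch only;
# paths, user, and flags are placeholders, not the post's actual config.
[Unit]
Description=llama.cpp server (Qwen3.6-27B)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-server -m /srv/models/Qwen3.6-27B-Q4_K_M.gguf -ngl 99 --port 8080
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target
```

Running inference under systemd mainly buys automatic restarts and boot-time startup, which matters when the box serves other machines on the LAN.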
OpenHuman is an open-source desktop AI agent that runs locally on your machine, offering privacy-focused integrations with apps like Gmail and Slack, and challenging subscription-based SaaS AI models.
Reachy Mini has a new fully open-source backend for real-time voice interaction, running audio models locally and leveraging LLM subscriptions to avoid per-second API costs.
Release of a mixed-bit quantized version of the MiniMax M2.7 model, optimized to 74 GB for efficient local inference on Apple Silicon devices.
ds4 is a native local inference engine for DeepSeek V4 Flash optimized for Apple Silicon, featuring disk-based KV cache persistence and Metal acceleration.
A developer reports that the new 27B Qwen 3.6 model runs excellently on a 24GB VRAM laptop, passing all PySpark/Python data-transformation benchmarks and eliminating the need for cloud subscriptions.
A personal benchmark shows Gemma-4E4B coming out on top for routing, Qwen-3.6 27/30B beating Gemma-4 for coding, and MiniMax M2.7 MXFP4 replacing giant Qwen-3.5 quants in an OpenCode llama-swap workflow.
This article provides a technical guide on integrating Transformers.js into a Chrome extension using Manifest V3, detailing architecture for background service workers, model caching, and agent loops.
User reports surprisingly usable coding performance from Qwen3-27B-UD-Q6_K_XL.gguf running locally on RTX 5090 at ~50 tok/s with 200K context, marking a significant leap in local model quality.
Apple Silicon Macs offer large memory pools for running big models but deliver slower token generation; they perform best with large MoEs that have low active-parameter counts.
A tweet claims that a small visual language model fine-tuned on custom data can match GPT-5 accuracy while costing 50× less, citing Liquid AI’s 1.6B model running locally with llama.cpp.
Developer Ivan Fioravanti demonstrates running Andrej Karpathy's autoresearch project locally with a 6-bit quantized Gemma-4-26B model on Apple Silicon, suggesting the Gemma 4 E2B IT variant was trained successfully.
Llama.cpp's new --fit flag enables running models larger than available VRAM at surprisingly high tokens/s, breaking the old VRAM-only limitation.
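Taking the post's description at face value, usage would presumably look like the sketch below; the `--fit` flag name is quoted from the post rather than verified against a current llama.cpp build, and the model filename is a placeholder.

```shell
# Per the post, --fit sizes GPU offload automatically so a model larger
# than VRAM still loads, spilling the remainder to system RAM.
# Flag name taken from the post; model filename is a placeholder.
llama-server -m large-model-Q4_K_M.gguf --fit -c 16384 --port 8080
```

Previously this kind of split required hand-tuning `-ngl` (and, for MoE models, `--n-cpu-moe`) until the GPU-resident portion fit; the appeal of `--fit` is making that sizing automatic.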
User benchmarks dual Asus GX10 (DGX Spark) running MiniMax-M2.7-AWQ-4bit, achieving 30–40 tokens/s while drawing only ~100 W each, replacing noisy multi-GPU rigs.
The author highlights how rapidly local AI capabilities have improved, enabling tasks once exclusive to top-tier cloud models to run on affordable hardware using models like Qwen 27B and MiniMax 2.7.
Kimi K2.6 autonomously wrote a Zig-based local inference runtime on Mac that is 20% faster than LM Studio after 14 iterations and 4,000+ tool calls, all open-sourced.
A user shares their $25k hardware setup of two 512GB RAM M3 Ultra Mac Studios for running large language models locally, having tested DeepSeek V3 Q8 and GLM 5.1 Q4 via the exo distributed inference backend, while awaiting Kimi 2.6 MLX optimization.