local-inference

#local-inference

Diffusion Model that can turn any Image into a Playable Hallucination! BUT LOCALLY, NOT ON DATACENTER

Reddit r/ArtificialInteligence ↗ · 16h ago

A diffusion model that can transform any image into an interactive, playable hallucination, running locally on user hardware.

#local-inference

Unsloth GLM-5.2 – How to Run Locally

Hacker News Top ↗ · yesterday Cached

A guide on running Z.ai's open model GLM-5.2 locally using Unsloth Dynamic GGUFs. The model features 744B total parameters (40B active) and a 1M context window, with quantized versions reducing memory to 239GB for 2-bit, enabling local inference on 256GB Macs.

#local-inference

Local LLM Inference Optimization: The Complete Guide

Reddit r/LocalLLaMA ↗ · 2d ago Cached

A comprehensive guide to optimizing local LLM inference on consumer hardware, covering tools like llama.cpp, vLLM, and LM Studio, with practical advice on memory hierarchy, layer placement, and common failure modes.

#local-inference

@QuixiAI: https://x.com/QuixiAI/status/2068776183102067086

X AI KOLs Following ↗ · 2d ago Cached

DwarfStar is a self-contained native inference engine optimized for DeepSeek V4 Flash and PRO models, supporting Metal, CUDA, and ROCm backends, with a focus on high-end personal machines and Mac Studios.

#local-inference

@antirez: First kinda working implementation of GLM 5.2 in DwarfStar. Will take some time to be good enough, but it is a promisin…

X AI KOLs Following ↗ · 2d ago Cached

Antirez reports the first working implementation of GLM 5.2 in DwarfStar, using a 433 GB GGUF file on an M3 Ultra with 512GB RAM, though it needs further refinement.

#local-inference

GLM 5.2: 98% of max level intelligence with less than half of tokens usage

Reddit r/LocalLLaMA ↗ · 4d ago

GLM 5.2 offers improved token efficiency, allowing users to achieve 98% of max-level intelligence using less than half the tokens. The model's 'high' effort level provides a practical alternative for day-to-day use compared to the resource-intensive 'max' level.

#local-inference

GLM-5.2 can now run locally in llama.cpp and Unsloth Studio.

Reddit r/LocalLLaMA ↗ · 5d ago

GLM-5.2 is now supported for local execution via llama.cpp and Unsloth Studio.

#local-inference

@10xmylife: Unsloth has successfully deployed the 2-bit version of GLM-5.2 on a 256GB Mac

X AI KOLs Following ↗ · 5d ago Cached

Unsloth has compressed the GLM-5.2 model to 238GB using 2-bit quantization, allowing it to run locally on a 256GB Mac while retaining about 82% accuracy.

#local-inference

Giving GLM-5.2 a spin locally on CPU only! (poor man's rig for big models)

Reddit r/LocalLLaMA ↗ · 5d ago

A user runs GLM-5.2 locally on CPU only, demonstrating how to run a large model on a modest setup.

#local-inference

@MaximeRivest: glm 5.2 is good (enough) and this is important. glm 5.2 is good enough to change information technology in very fundame…

X AI KOLs Following ↗ · 5d ago Cached

GLM 5.2 is an open-weights LLM that is sufficiently capable to allow businesses to manage their IT needs locally on affordable hardware, potentially transforming small/medium enterprise data management.

#local-inference

@MiaAI_lab: I fine-tuned Gemma 4 12B with Fable-5 style reasoning and assistant traces and released it as Gemmable 4 12b. **Availab…

X AI KOLs Timeline ↗ · 5d ago Cached

Mia-AiLab released Gemmable 4 12B, a fine-tuned version of Google's Gemma 4 12B model using Fable-5 style reasoning and assistant traces, available in GGUF and MLX formats for local inference.

#local-inference

@UnslothAI: GLM-5.2 can now be run locally! The 2-bit model retains ~82% accuracy after we shrunk it from 1.51TB to 238GB (-84% siz…

X AI KOLs Timeline ↗ · 5d ago Cached

UnslothAI announces that GLM-5.2, Z.ai's strongest open model with 744B parameters, can now run locally via dynamic GGUF quantization, which shrinks it by ~84% (from 1.51TB to 238GB) while retaining ~82% accuracy. The 2-bit model fits on 256GB Macs and supports long-context, reasoning, and agentic tasks.

#local-inference

@hank_aibtc: Amazing! Running Gemma 4 in the browser, on par with ChatGPT?! Completely zero server, zero data upload, offline, pure WebGPU local inference! Xenova has open-sourced all 27 custom WebGPU kernels written by Fable 5: - Gemma 4 E2B (2.3B parameters...)

X AI KOLs Timeline ↗ · 5d ago Cached

The article introduces Xenova's open-sourcing of 27 custom WebGPU kernels, enabling Gemma 4 to run fully offline and locally in the browser at 255 tok/s, and discusses advantages like privacy and offline use. It also mentions FLUX.2's 3D generation capability.

#local-inference

@dealignai: MiniMax m3, made for 128gb Mac’s Thank you to @hornsby_andrew for preparing the pruning calibration dataset and doing e…

X AI KOLs Timeline ↗ · 6d ago Cached

A pruned and quantized version of MiniMax-M3 (MiniMax-M3-Medium-JANG_2L) optimized to run on 128GB Macs using vMLX, featuring 32% expert pruning and JANG_2L mixed-precision quantization to fit within ~105 GB.

#local-inference

@mudler_it: parakeet.cpp now runs NVIDIA Parakeet behind the OpenAI API. Point any OpenAI client at a local server, send an audio, …

X AI KOLs Timeline ↗ · 6d ago Cached

parakeet.cpp enables running NVIDIA Parakeet ASR behind the OpenAI API locally with prebuilt Docker images, supporting CPU and CUDA (including arm64) for real-time transcription with word timestamps.

#local-inference

@MiaAI_lab: MTP is up, test it out https://huggingface.co/Mia-AiLab/Qwable-3.6-27b-MTP…

X AI KOLs Timeline ↗ · 2026-06-17 Cached

Mia-AiLab releases Qwable-3.6-27b-MTP, a full fine-tuned checkpoint of Qwen3.6-27B using a cleaned Fable 5 reasoning and instruction dataset, focused on code, structured reasoning, and local inference with MTP layers.

#local-inference

@aijoey: WeiboAI dropped VibeThinker-3B, so I had to try it locally. this is a 3B model, not a giant frontier system. in the vid…

X AI KOLs Timeline ↗ · 2026-06-16 Cached

WeiboAI released VibeThinker-3B, a small 3B reasoning model tested locally on coding tasks, achieving 3/3 on algorithm problems.

#local-inference

@WaleedAhmad1a10: Check out the Qwen 3.5 27B MoQ GGUFs :

X AI KOLs Following ↗ · 2026-06-16 Cached

A Hugging Face repository (kaitchup/Qwen3.6-27B-GGUF-MoQ) provides GGUF quantized weights for the Qwen3.6-27B MoQ model, enabling local inference with tools like llama.cpp and Ollama.

#local-inference

@juanjucm: I'm seeing a lot of angry people lately... remember, you can always run your coding agent locally ;) llama.cpp + OpenCo…

X AI KOLs Following ↗ · 2026-06-12 Cached

Tweet reminding developers they can run coding agents locally using llama.cpp and OpenCode for fast, reliable, and private inference, demonstrating with UnslothAI's North-Mini-Code-1.0-GGUF model.

#local-inference

@amitiitbhu: Gemma 4 now runs 2x faster with MTP GGUFs! Run locally on just 6GB RAM. New Article: How does GGUF work? Read here: htt…

X AI KOLs Timeline ↗ · 2026-06-12 Cached

Gemma 4 now runs 2x faster with MTP GGUF format and can run locally on just 6GB RAM. The linked article explains how GGUF works, including quantization and memory mapping.
