low-vram

#low-vram

@XAMTO_AI: ControlNet author Min Shen has come up with something new! The newly open-sourced FramePack directly lowers the barrier for video generation — runs on just 6GB VRAM, generates a 1-minute 30fps video with a 13B model, and on an RTX 4090 it takes only 1.5 seconds per frame. Such configuration requirements were unimaginable before. The core idea is frame-by-frame…

X AI KOLs Timeline ↗ · yesterday Cached

ControlNet author Min Shen has open-sourced the FramePack video generation model, which requires only 6GB of VRAM to run a 13B model, generates a 1-minute 30fps video, takes 1.5 seconds per frame on an RTX 4090, and comes with a one-click Windows package.

0 favorites 0 likes

#low-vram

@NFTCPS: 4GB VRAM running 70B large model? It actually works! AirLLM did a clever trick — layered inference, not loading the whole model into VRAM at once, but layer by layer, compute and discard, squeezing the giant into a small GPU. The best part: 100% open source, freebie warning https://github.com/0xSo…

X AI KOLs Timeline ↗ · 6d ago Cached

AirLLM is a fully open-source tool that uses layered inference (loading and releasing VRAM layer by layer) to enable 70B large language models to run on GPUs with only 4GB VRAM, without quantization, distillation, or pruning. It already supports running Llama3.1 405B on 8GB VRAM.

0 favorites 0 likes

#low-vram

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Reddit r/LocalLLaMA ↗ · 2026-05-22

The author shares detailed tuning tips for running the Qwen3.6-35B-A3B MoE model on an 8GB RTX 3070 Ti with up to 262k context using llama.cpp, achieving 30+ tps, and notes a 25% speed boost when switching from Windows to Ubuntu Server.

0 favorites 0 likes

#low-vram

Developing open source LLM from ground up from pretrain - rlhf(PPO/GRPO)

Reddit r/LocalLLaMA ↗ · 2026-05-14

A developer shares progress on training a 7B parameter open source LLM from scratch using a DeepSeek architecture optimized for low VRAM, with the goal of democratizing AI development and eventually surpassing large proprietary models.

0 favorites 0 likes

#low-vram

Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into

Reddit r/LocalLLaMA ↗ · 2026-04-21

Author shares a working llama-server config to run the 35B-MoE Qwen3.6 model on an 8GB RTX 4060, highlighting a max_tokens trap caused by unconstrained internal reasoning and the fix using per-request thinking_budget_tokens.

0 favorites 0 likes

low-vram

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Developing open source LLM from ground up from pretrain - rlhf(PPO/GRPO)

Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into

Submit Feedback