@Jolyne_AI: Another open-source tool for running AI models locally on GitHub: Shimmy, targeting Ollama's pain points. A single file of only 5MB provides fast, stable local inference with a full OpenAI-compatible API, almost zero integration cost. Built with Rust to maximize performance, it starts in under 100ms and uses about 50MB of memory.
Summary
Another open-source tool on GitHub, Shimmy, is a single 5MB file written in Rust that provides fast and stable local inference with a full OpenAI-compatible API, targeting Ollama's pain points. It starts in under 100ms and uses about 50MB of memory.
Cached at: 07/02/26, 06:25 PM
GitHub has a new open-source tool for running AI models locally: Shimmy, directly targeting Ollama’s pain points. At just 5MB for a single file, it delivers fast, stable local inference with a fully OpenAI-compatible API, making integration virtually zero-cost. Built on Rust, it squeezes every drop of performance: starts in under 100ms and uses only about 50MB of RAM. Project link: http://github.com/Michael-A-Kuykendall/shimmy… Even better, it’s truly plug-and-play: no configuration needed, automatic port allocation, and it can discover model sources on its own — supporting Hugging Face, Ollama, and local directories. Compared to Ollama, it leads across key metrics like size, startup speed, and memory usage. If you want a lighter, faster local inference experience, give it a try.
Michael-A-Kuykendall/shimmy
Source: https://github.com/Michael-A-Kuykendall/shimmy
The Lightweight OpenAI API Server
🔒 Local Inference Without Dependencies
🚀 License: MIT (https://opensource.org/licenses/MIT) Security (https://github.com/Michael-A-Kuykendall/shimmy/security) Crates.io (https://crates.io/crates/shimmy) Downloads (https://crates.io/crates/shimmy) Rust (https://rustup.rs/) GitHub Stars (https://github.com/Michael-A-Kuykendall/shimmy/stargazers)
💝 Sponsor this project (https://github.com/sponsors/Michael-A-Kuykendall)
Shimmy will be free forever. No asterisks. No “free for now.” No pivot to paid.
💝 Support Shimmy’s Growth
🚀 If Shimmy helps you, consider sponsoring (https://github.com/sponsors/Michael-A-Kuykendall) — 100% of support goes to keeping it free forever.
- $5/month: Coffee tier ☕ — Eternal gratitude + sponsor badge
- $25/month: Bug prioritizer 🐛 — Priority support + name in SPONSORS.md
- $100/month: Corporate backer 🏢 — Logo placement + monthly office hours
- $500/month: Infrastructure partner 🚀 — Direct support + roadmap input
🎯 Become a Sponsor (https://github.com/sponsors/Michael-A-Kuykendall) | See our amazing sponsors 🙏
Table of Contents
- What Is Shimmy?
- 🔥 Airframe Engine (v2.0)
- ⚡ TurboShimmy INT4 KV (v2.1)
- 🎯 Supported Models
- 📦 Migrating from v1.x
- ⚡ Quick Start (30 seconds)
- 🚀 OpenAI SDK Compatibility
- 🔧 Extended Context
- 📥 Download & Install
- 🔗 Integration Examples
- 📖 API Reference
- ❓ FAQ
- 🏛️ Technical Architecture
- 📚 Documentation Hub
- 🌍 Community & Support
- ⚡ Performance
- License
Drop-in OpenAI API Replacement for Local LLMs
Shimmy is a single binary that provides 100% OpenAI-compatible endpoints for GGUF models. Point your existing AI tools to Shimmy and they just work: locally, privately, and free.
🎉 NEW in v2.0.0: Shimmy now runs on Airframe, a pure-Rust WGSL GPU engine. No C++ toolchain, no backend flags, no compilation required.
⚡ NEW in v2.1.0: TurboShimmy INT4 KV — ~7× less KV cache VRAM with one flag. Run Llama-3.2-3B on 4 GB GPUs.
🔥 Airframe Engine (v0.2.7)
Starting in v2.0.0, Shimmy’s default inference engine is Airframe — a pure-Rust WebGPU (WGSL) transformer runtime built from scratch.
v0.2.7 brings the Inference Saturation Fabric (ISF) refit and TDR transport integration to production.
See airframe CHANGELOG (https://github.com/Michael-A-Kuykendall/airframe/blob/master/CHANGELOG.md) for full release notes.
Why this matters:
- No C++ toolchain required — Rust only, top to bottom
- F32 precision throughout for deterministic, high-quality output
- WGSL compute shaders work on any GPU via WebGPU (NVIDIA, AMD, Intel, integrated)
- Model spec auto-derived from GGUF metadata — no hardcoded per-model constants
- YaRN RoPE scaling for extended context via SHIMMY_MAX_CTX: the engine allocates the KV cache and sets the RoPE scale automatically (see Extended Context below)
Quick start with Airframe (v2.0.0+):
# Default: 2048-token context
SHIMMY_BASE_GGUF=/path/to/TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf ./shimmy serve
# Extended context (4096 tokens — YaRN RoPE enabled automatically, KV cache resized)
SHIMMY_BASE_GGUF=/path/to/model.gguf SHIMMY_MAX_CTX=4096 ./shimmy serve
⚡ TurboShimmy INT4 KV
TurboShimmy is Shimmy’s on-GPU INT4 KV-cache compression system, shipping in v2.1.0. It squeezes the KV cache from 32-bit floats down to per-head-vector 4-bit integers — entirely in WGSL compute shaders with no CPU roundtrips — delivering ~7× less KV VRAM with no measurable quality loss at normal context lengths.
One flag. ~7× less KV VRAM. Same output quality.
# Enable TurboShimmy on any GGUF model
./shimmy serve --kv-quant int4
# Or via environment variable (docker-compose, systemd, etc.)
SHIMMY_KV_QUANT=int4 ./shimmy serve
# Windows GPU + long prompts: reduce per-dispatch work to prevent TDR resets
./shimmy serve --kv-quant int4 --prefill-chunk 8
Why it matters — TurboShimmy changes what fits on your GPU:
| GPU VRAM | Without TurboShimmy | With TurboShimmy (--kv-quant int4) |
|---|---|---|
| 3 GB | Llama-3.2-1B only | Llama-3.2-3B fits ✅ |
| 4 GB | Llama-3.2-3B, ctx=2048 (tight) | Llama-3.2-3B at ctx=8192 ✅ |
| 6 GB | 3B models, short context | 7B models with reasonable context ✅ |
VRAM comparison (Llama-3.2-3B, ctx=2048):
| Mode | KV cache | Total VRAM | Min GPU needed |
|---|---|---|---|
| Default (f32) | ~512 MB | ~2.4 GB | 3 GB (tight) |
| TurboShimmy (int4) | ~72 MB | ~2.0 GB | 2.5 GB ✅ |
VRAM comparison (TinyLlama 1.1B, ctx=2048):
| Mode | KV cache | Total VRAM |
|---|---|---|
| Default (f32) | 88 MB | ~700 MB |
| TurboShimmy (int4) | ~13 MB | ~650 MB |
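The KV cache figures in the TinyLlama table can be approximated with simple arithmetic. The sketch below is not Shimmy's code. It assumes TinyLlama-1.1B's published shape (22 layers, 4 KV heads, head dimension 64) and assumes the INT4 mode stores each head vector as packed nibbles plus one 4-byte F32 scale; the function names are hypothetical.

```python
# Back-of-envelope KV cache sizing (illustrative only, not Shimmy's implementation).

def f32_vector_bytes(head_dim: int) -> int:
    """One head vector stored as 32-bit floats."""
    return head_dim * 4


def int4_vector_bytes(head_dim: int) -> int:
    """One head vector as packed 4-bit values plus one f32 scale (assumed layout)."""
    return head_dim // 2 + 4


def kv_cache_bytes(n_layers: int, n_kv_heads: int, ctx: int, bytes_per_vector: int) -> int:
    """K and V each hold one head vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * ctx * bytes_per_vector


# Assumed TinyLlama-1.1B shape: 22 layers, 4 KV heads, head_dim 64, ctx 2048.
f32_bytes = kv_cache_bytes(22, 4, 2048, f32_vector_bytes(64))
int4_bytes = kv_cache_bytes(22, 4, 2048, int4_vector_bytes(64))

print(f32_bytes / 2**20)                 # 88.0 (MiB)
print(int4_bytes / 2**20)                # 12.375 (MiB)
print(round(f32_bytes / int4_bytes, 2))  # 7.11
```

Under these assumptions the F32 cache comes out at 88 MiB and the INT4 cache at about 12.4 MiB (roughly 13 MB in decimal units), a ratio of about 7×, which is in line with the table. Other models will differ with their layer count, KV head count, and head dimension.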
When to use TurboShimmy:
| Situation | Recommendation |
|---|---|
| 3B model on a 4 GB GPU | --kv-quant int4 — enables models that wouldn’t fit otherwise |
| 7B model at ctx=4096+ | --kv-quant int4 — cuts KV from ~512 MB → ~72 MB |
| Short chat sessions (ctx ≤ 2048) | --kv-quant int4 — safe, no quality tradeoff |
| Long-form generation (ctx > 8192) | Default f32 — keep maximum quality |
| Windows GPU + TDR crashes on long prompts | --kv-quant int4 --prefill-chunk 8 |
How it works: Each K/V head vector is independently quantized to 4-bit integers with a per-vector F32 scale factor, encoded as packed nibbles by WGSL compute shaders. Dequantization happens on-the-fly when computing attention scores — also on GPU. The Airframe engine’s helical context-shift operates directly on the packed INT4 representation. No CPU roundtrips at any step.
Full architecture details in the Airframe engine (https://github.com/Michael-A-Kuykendall/airframe).
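To make the per-vector scheme concrete, here is a minimal CPU-side sketch in Python. It is not the Airframe shader code: it assumes symmetric signed quantization to the range [-7, 7], a scale of max|x| / 7, and low-nibble-first packing, none of which the README specifies. The function names are hypothetical.

```python
# Illustrative per-vector INT4 quantization with one scale per head vector.
# Assumptions (not confirmed by the README): symmetric signed range [-7, 7],
# scale = max|x| / 7, even-indexed element in the low nibble.

def quantize_int4(vec: list[float]) -> tuple[bytes, float]:
    """Quantize one head vector to packed 4-bit integers plus a scale."""
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even to pack two values per byte")
    max_abs = max((abs(x) for x in vec), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-7, min(7, round(x / scale))) for x in vec]
    packed = bytearray()
    for low, high in zip(q[0::2], q[1::2]):
        # Two's-complement nibbles: mask to 4 bits, then pack low | high << 4.
        packed.append((low & 0x0F) | ((high & 0x0F) << 4))
    return bytes(packed), scale


def dequantize_int4(packed: bytes, scale: float) -> list[float]:
    """Unpack nibbles, sign-extend them, and rescale to floats."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble
            out.append(signed * scale)
    return out


packed, scale = quantize_int4([0.0, 7.0, -7.0, 3.0])
print(packed.hex())                    # 7039
print(dequantize_int4(packed, scale))  # [0.0, 7.0, -7.0, 3.0]
```

With this layout a 64-element head vector takes 32 bytes of nibbles plus one scale, and the round-trip error per element is at most half the scale.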
Quality validation: Needle-in-a-haystack benchmarks on Llama-3.2-3B show zero retrieval degradation vs F32 at ctx≤2048 across all tested depths (15%, 50%, 85%). Full benchmark data and setup guide: TurboShimmy on the wiki (https://github.com/Michael-A-Kuykendall/shimmy/wiki/TurboShimmy).
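For readers unfamiliar with needle-in-a-haystack tests, the sketch below shows how a prompt at a given depth is typically assembled. It is a generic illustration, not the benchmark harness from the wiki; the function name, the filler text, and the needle sentence are hypothetical.

```python
# Generic needle-in-a-haystack prompt construction (illustrative only).

def insert_needle(filler: list[str], needle: str, depth: float) -> list[str]:
    """Place the needle sentence at a fractional depth of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(len(filler) * depth)
    return filler[:index] + [needle] + filler[index:]


filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The secret code is 4417."  # hypothetical needle

for depth in (0.15, 0.50, 0.85):  # depths named in the README
    sentences = insert_needle(filler, needle, depth)
    prompt = " ".join(sentences) + " What is the secret code?"
    print(depth, sentences.index(needle), len(prompt))
```

Retrieval is then scored by checking whether the model's answer contains the needle's payload at each depth.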
Windows stability: Airframe v0.2.1 ships a device.on_uncaptured_error handler so GPU validation errors surface as clean HTTP 500 responses instead of crashes. Use --prefill-chunk 8 to prevent Windows TDR resets during long prefills on older GPUs (GTX 10xx/16xx series). v0.2.7 adds TDR transport with GPU timestamp pools for accurate dispatch timing, fixing TDR watchdog crashes during long prefill sequences.
🎯 Supported Models
Airframe v2.0 ships with GPU-verified support for 7 models across 5 architectures and 7 tensor types (see the tables below), covering the models most commonly run on consumer hardware. Context window is read directly from each model's GGUF metadata, with no hardcoded limits.
✅ Locally Validated (GPU Math Verified)
| Model | Architecture | Quant | Size | Context | Min VRAM |
|---|---|---|---|---|---|
| TinyLlama-1.1B-Chat-v1.0 (https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) | Llama | Q4_0 | 638 MB | 2048 | ~800 MB |
| Llama-3.2-1B-Instruct (https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF) | Llama | Q4_K_M | 770 MB | 131072* | ~1 GB |
| Llama-3.2-3B-Instruct (https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF) | Llama | Q4_K_M | 1.9 GB | 131072* | ~2.5 GB |
| phi-2 (https://huggingface.co/TheBloke/phi-2-GGUF) | Phi-2 | Q4_K_M | 1.7 GB | 2048 | ~2.2 GB |
| gemma-2-2b-it (https://huggingface.co/bartowski/gemma-2-2b-it-GGUF) | Gemma-2 | Q4_K_M | 1.6 GB | 8192 | ~2 GB |
| starcoder2-3b (https://huggingface.co/second-state/StarCoder2-3B-GGUF) | StarCoder2 | Q4_K_M | 1.8 GB | 16384 | ~2.3 GB |
| gpt2 (https://huggingface.co/ggerganov/ggml/blob/main/gpt-2-117M-q4_0.bin) | GPT-2 | Q4_K_M | 107 MB | 1024 | ~200 MB |
* Llama-3.2's native context is 131072 tokens. Airframe reads this from GGUF and allocates KV cache accordingly. Use SHIMMY_MAX_CTX=8192 for a practical 8K window on consumer hardware (~256 MB KV cache for the 1B model).
GPU Math Verified means the Airframe GPU dequantization shader produces results matching the CPU reference implementation, independently confirmed for every tensor type in each model. This is done via quant_verify, which tests 512 elements per quantization type per model.
⏳ Roadmap — Larger Models (Require ≥16 GB VRAM)
| Model | Architecture | Quant | Size | Status |
|---|---|---|---|---|
| deepseek-coder-6.7b-instruct | Llama | Q4_K_M | 3.9 GB | Pending remote GPU validation |
| deepseek-llm-7b-chat | Llama | Q4_K_M | 4.0 GB | Pending remote GPU validation |
| qwen2-7b-instruct | Qwen2 | Q4_K_M | 4.5 GB | Pending remote GPU validation |
| Phi-3.5-mini-instruct | Phi-3 | Q4_K_M | 2.3 GB | Requires fused QKV support (planned) |
✅ Supported Quantization Types
| Type | GGML ID | Notes |
|---|---|---|
| F32 | 0 | Raw floats, maximum precision |
| F16 | 1 | Half-precision floats |
| Q4_0 | 2 | 4-bit, 32-element blocks |
| Q8_0 | 8 | 8-bit, 32-element blocks |
| Q4_K | 12 | 4-bit K-quant superblocks (256 elements), used in Q4_K_M GGUFs |
| Q5_K | 13 | 5-bit K-quant superblocks, used alongside Q4_K in mixed-precision models |
| Q6_K | 14 | 6-bit K-quant superblocks, typically used for output/embedding layers |
All types are implemented in both the GPU inference shader and a CPU reference implementation. GPU vs CPU agreement is validated for every type.
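As an illustration of what a CPU reference for the simplest block type could look like, here is a minimal Q4_0 dequantizer in Python. It follows the ggml Q4_0 block layout (one little-endian f16 scale followed by 16 bytes holding 32 packed 4-bit values, each offset by 8). It is a sketch, not Shimmy's or Airframe's actual reference code, and the function name is hypothetical.

```python
import struct

# Q4_0 block layout as defined by ggml: 2-byte f16 scale + 16 bytes of nibbles.
Q4_0_BLOCK_SIZE = 32   # elements per block
Q4_0_BLOCK_BYTES = 18  # bytes per block


def dequantize_q4_0_block(block: bytes) -> list[float]:
    """Decode one Q4_0 block into 32 floats."""
    if len(block) != Q4_0_BLOCK_BYTES:
        raise ValueError(f"expected {Q4_0_BLOCK_BYTES} bytes, got {len(block)}")
    (scale,) = struct.unpack_from("<e", block, 0)  # "<e" = little-endian float16
    packed = block[2:]
    half = Q4_0_BLOCK_SIZE // 2
    out = [0.0] * Q4_0_BLOCK_SIZE
    for i, byte in enumerate(packed):
        # Low nibbles fill the first half of the block, high nibbles the second.
        out[i] = ((byte & 0x0F) - 8) * scale
        out[i + half] = ((byte >> 4) - 8) * scale
    return out


# One block with scale 0.5: the first byte is 0xF0, the rest are 0x88 (zeros).
block = struct.pack("<e", 0.5) + bytes([0xF0]) + bytes([0x88] * 15)
values = dequantize_q4_0_block(block)
print(values[0], values[16], values[1])  # -4.0 3.5 0.0
```

A GPU-versus-CPU check of the kind described above would compare shader output against a function like this, element by element, within a small tolerance.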
Auto-discovery is enabled. If Shimmy finds GGUF models in your HuggingFace cache, Ollama directory, LM Studio cache (~/.cache/lm-studio/models), or local ./models/ folder, it will register and serve them automatically. See docs/MODEL_EXPANSION.md for the full compatibility matrix.
📦 Migrating from v1.x
The llama.cpp backend is removed in v2.0.0. The Airframe engine is the only inference path. See docs/MIGRATION_v2.md for the step-by-step migration guide.
Developer Tools
Whether you’re forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.
Try it in 30 seconds
# 1) Download pre-built binary
# Windows:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
set "SHIMMY_BASE_GGUF=C:\path\to\model.gguf" && start /b shimmy.exe serve
# Linux / macOS:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
SHIMMY_BASE_GGUF=/path/to/model.gguf ./shimmy serve &
# 2) See registered models
./shimmy list
# 3) Smoke test the OpenAI API
curl -s http://127.0.0.1:11435/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"tinyllama-1.1b",
"messages":[{"role":"user","content":"Say hi in 5 words."}],
"max_tokens":32
}' | jq -r '.choices[0].message.content'
🚀 Compatible with OpenAI SDKs and Tools
No code changes needed - just change the API endpoint:
- Any OpenAI client: Python, Node.js, curl, etc.
- Development applications: Compatible with standard SDKs
- VSCode Extensions: Point to http://localhost:11435
- Cursor Editor: Built-in OpenAI compatibility
- Continue.dev: Drop-in model provider
Use with OpenAI SDKs
- Node.js (openai v4)
import OpenAI from "openai";
const openai = new OpenAI({
baseURL: "http://127.0.0.1:11435/v1",
apiKey: "sk-local", // placeholder, Shimmy ignores it
});
const resp = await openai.chat.completions.create({
model: "REPLACE_WITH_MODEL",
messages: [{ role: "user", content: "Say hi in 5 words." }],
max_tokens: 32,
});
console.log(resp.choices[0].message?.content);
- Python (openai>=1.0.0)
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")
resp = client.chat.completions.create(
model="REPLACE_WITH_MODEL",
messages=[{"role": "user", "content": "Say hi in 5 words."}],
max_tokens=32,
)
print(resp.choices[0].message.content)
⚡ Zero Configuration Required
- Automatically finds models from Hugging Face cache, Ollama, LM Studio (~/.cache/lm-studio/models), and local dirs
- Auto-allocates ports to avoid conflicts
- Auto-detects LoRA adapters for specialized models
- Just works - no config files, no setup wizards
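Model auto-discovery of the kind listed above can be pictured as a directory scan for GGUF files. The sketch below is an assumption about the general approach, not Shimmy's implementation: it walks a list of directories, keeps files ending in .gguf whose first four bytes are the GGUF magic, and registers each under its lowercased file stem. The function names and the naming rule are hypothetical.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def is_gguf(path: Path) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    try:
        with path.open("rb") as handle:
            return handle.read(4) == GGUF_MAGIC
    except OSError:
        return False


def discover_gguf_models(roots) -> dict[str, Path]:
    """Map a model name (lowercased file stem, a hypothetical rule) to its path."""
    found: dict[str, Path] = {}
    for root in roots:
        root = Path(root).expanduser()
        if not root.is_dir():
            continue  # silently skip directories that do not exist
        for path in sorted(root.rglob("*.gguf")):
            if is_gguf(path):
                found.setdefault(path.stem.lower(), path)  # first match wins
    return found


if __name__ == "__main__":
    # Two of the locations named in the README; the others are omitted here.
    for name, path in discover_gguf_models(["./models", "~/.cache/lm-studio/models"]).items():
        print(name, path)
```

Checking the magic bytes rather than trusting the extension avoids registering partially downloaded or mislabeled files.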
🧠 Advanced MoE (Mixture of Experts) Support
Note: MoE (Mixture of Experts) CPU offloading is on the Airframe roadmap. See docs/AIRFRAME_MOE_ROADMAP.md for the implementation plan.
Run 70B+ models on consumer hardware — coming to the Airframe engine. Track progress on the roadmap.
Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference
🎯 Perfect for Local Development
- Privacy: Your code never leaves your machine
- Cost: No API keys, no per-token billing
- Speed: Local inference, sub-second responses
- Reliability: No rate limits, no downtime
Quick Start (30 seconds)
Installation
v2.0.0: Download pre-built binaries with Airframe WebGPU engine included!
📥 Pre-Built Binaries (Recommended — Zero Dependencies)
Pick your platform and download — no compilation needed, GPU acceleration included:
# Windows x64 (Airframe WebGPU engine)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
# Linux x86_64 (Airframe WebGPU engine)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
# macOS ARM64 (Airframe with Metal backend via wgpu)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
# macOS Intel
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel -o shimmy && chmod +x shimmy
# Linux ARM64 (huggingface engine; Airframe cross-compilation not yet supported)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64 -o shimmy && chmod +x shimmy
That’s it! The Airframe WebGPU adapter is selected automatically at runtime.
🛠️ Build from Source / cargo install
# Install from crates.io
cargo install shimmy
# Build from source (huggingface engine, no GPU)
git clone https://github.com/Michael-A-Kuykendall/shimmy
cd shimmy
cargo build --release
Note: The Airframe GPU engine is a private dependency and cannot be built from source by public users. The pre-built release binaries already include Airframe compiled in — download those to get full GPU acceleration.
cargo install shimmy installs the huggingface engine variant from crates.io.
GPU Acceleration
v2.0.0: Airframe uses WebGPU (wgpu) for GPU acceleration. No backend flags, no driver installation beyond standard OS graphics drivers.
📥 Download Pre-Built Binaries (Recommended)
Release binaries include the Airframe engine with WebGPU support compiled in:
| Platform | Download | GPU Backend | Notes |
|---|