@Honcia13: Ollama is getting wiped out! This little 5MB thing called Shimmy is really something! A Rust-written local AI inference powerhouse that absolutely crushes Ollama: -Single file only 5MB (Ollama is completely outgunned) -Startup time <100ms -Memory only 50MB -Perfect...

X AI KOLs Timeline 05/17/26, 05:21 PM Tools

rust open-source lightweight local-ai gguf alternative-to-ollama openai-compatible

Summary

Shimmy is a local AI inference server written in Rust that exposes an OpenAI-compatible API and is pitched as a lightweight alternative to Ollama. The post claims a single 5MB binary, startup in under 100ms, and about 50MB of memory use; the project README cached on this page lists its v1.9.0 pre-built binaries at roughly 20-50MB.

Ollama is getting wiped out! This little 5MB thing called Shimmy is really something! A Rust-written local AI inference powerhouse that absolutely crushes Ollama: -Single file only 5MB (Ollama is completely outgunned) -Startup time <100ms -Memory only 50MB -Perfectly compatible with OpenAI API, zero-cost integration -No configuration needed, ready to use out of the box Supports automatic recognition of Hugging Face, Ollama, and local models. If you want to run local big models but find Ollama too heavy, this is a must-try! https://github.com/Michael-A-Kuykendall/shimmy…

Cached at: 05/18/26, 08:30 AM

Ollama is about to be dethroned! This tiny 5MB tool called Shimmy is genuinely impressive! A Rust-powered local AI inference marvel that leaves Ollama in the dust:

Single binary just 5MB (Ollama is instantly outclassed)
Startup time < 100ms
Memory usage only 50MB
Perfectly compatible with OpenAI API, zero-cost integration
No configuration needed, works out of the box
Auto-detects Hugging Face, Ollama, and local models

If you want to run local large models but find Ollama too heavy, now’s the time to jump in! https://github.com/Michael-A-Kuykendall/shimmy

Drop-in OpenAI API Replacement for Local LLMs

Shimmy is a single binary that provides 100% OpenAI-compatible endpoints for GGUF models. Point your existing AI tools to Shimmy and they just work — locally, privately, and free.

🎉 NEW in v1.9.0: One download, all GPU backends included! No compilation, no backend confusion - just download and run.

Developer Tools

Whether you’re forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.

Try it in 30 seconds

# 1) Download pre-built binary (includes all GPU backends)  
# Windows:  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe  
./shimmy.exe serve &  

# Linux:  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy  
./shimmy serve &  

# macOS (Apple Silicon):  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy  
./shimmy serve &  

# 2) See models and pick one  
./shimmy list  

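# 3) Smoke test the OpenAI API  
# (assumes the server listens on 127.0.0.1:11435; if `shimmy serve` printed a different  
#  auto-allocated port, use that one, or start it with --bind 127.0.0.1:11435)  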
curl -s http://127.0.0.1:11435/v1/chat/completions \  
  -H 'Content-Type: application/json' \  
  -d '{  
    "model":"REPLACE_WITH_MODEL_FROM_list",  
    "messages":[{"role":"user","content":"Say hi in 5 words."}],  
    "max_tokens":32  
  }' | jq -r '.choices[0].message.content'
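
Because step 1 starts the server in the background, the smoke test can race the server's startup. A minimal Python sketch of a readiness check follows; it polls the `GET /health` endpoint listed in the API reference. The helper names `health_url` and `wait_for_health` are hypothetical, and treating an HTTP 200 as "ready" is an assumption rather than documented Shimmy behavior.

```python
import time
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health-check URL from a base such as http://127.0.0.1:11435."""
    return base_url.rstrip("/") + "/health"


def wait_for_health(base_url: str, timeout_s: float = 10.0, interval_s: float = 0.2) -> bool:
    """Poll GET /health until it answers 200 or the timeout expires.

    Assumption: a 200 response means the server is ready to take requests.
    """
    deadline = time.monotonic() + timeout_s
    url = health_url(base_url)
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=1.0) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not accepting connections yet (or non-2xx); retry
        time.sleep(interval_s)
    return False
```

Calling `wait_for_health("http://127.0.0.1:11435")` before the curl command avoids a spurious "connection refused" on slower machines.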

🚀 Compatible with OpenAI SDKs and Tools

No code changes needed - just change the API endpoint:

Any OpenAI client: Python, Node.js, curl, etc.
Development applications: Compatible with standard SDKs
VSCode Extensions: Point to http://localhost:11435
Cursor Editor: Built-in OpenAI compatibility
Continue.dev: Drop-in model provider

Use with OpenAI SDKs

Node.js (openai v4)

import OpenAI from "openai";  
const openai = new OpenAI({  
  baseURL: "http://127.0.0.1:11435/v1",  
  apiKey: "sk-local", // placeholder, Shimmy ignores it  
});  
const resp = await openai.chat.completions.create({  
  model: "REPLACE_WITH_MODEL",  
  messages: [{ role: "user", content: "Say hi in 5 words." }],  
  max_tokens: 32,  
});  
console.log(resp.choices[0].message?.content);

Python (openai>=1.0.0)

from openai import OpenAI  
client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")  
resp = client.chat.completions.create(  
  model="REPLACE_WITH_MODEL",  
  messages=[{"role": "user", "content": "Say hi in 5 words."}],  
  max_tokens=32,  
)  
print(resp.choices[0].message.content)

⚡ Zero Configuration Required

Automatically finds models from Hugging Face cache, Ollama, local dirs
Auto-allocates ports to avoid conflicts
Auto-detects LoRA adapters for specialized models
Just works - no config files, no setup wizards

🧠 Advanced MOE (Mixture of Experts) Support

Run 70B+ models on consumer hardware with intelligent CPU/GPU hybrid processing:

🔄 CPU MOE Offloading: Automatically distribute model layers across CPU and GPU
🧮 Intelligent Layer Placement: Optimizes which layers run where for maximum performance
💾 Memory Efficiency: Fit larger models in limited VRAM by using system RAM strategically
⚡ Hybrid Acceleration: Get GPU speed where it matters most, CPU reliability everywhere else
🎛️ Configurable: --cpu-moe and --n-cpu-moe flags for fine control

# Enable MOE CPU offloading during installation  
cargo install shimmy --features moe  

# Run with MOE hybrid processing  
shimmy serve --cpu-moe --n-cpu-moe 8  
# Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)

Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference
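
To make the two flags concrete, here is a small Python sketch of the placement they might produce. It assumes Shimmy's flags mirror the same-named llama.cpp options, where `--cpu-moe` keeps all expert (MoE) weights in system RAM and `--n-cpu-moe N` does so for the first N layers only; the README does not spell this out, and `expert_placement` is a hypothetical helper, not part of Shimmy.

```python
from typing import List


def expert_placement(n_layers: int, cpu_moe: bool = False, n_cpu_moe: int = 0) -> List[str]:
    """Return, per layer, where that layer's expert weights would live.

    Assumed semantics (borrowed from llama.cpp, not confirmed for Shimmy):
      cpu_moe=True  -> expert weights of every layer stay on the CPU
      n_cpu_moe=N   -> expert weights of the first N layers stay on the CPU
    """
    if n_layers < 0 or n_cpu_moe < 0:
        raise ValueError("layer counts must be non-negative")
    on_cpu = n_layers if cpu_moe else min(n_cpu_moe, n_layers)
    return ["cpu"] * on_cpu + ["gpu"] * (n_layers - on_cpu)
```

Under these assumptions, raising `--n-cpu-moe` trades speed for VRAM headroom: more expert tensors move to system RAM while attention and shared weights stay on the GPU.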

🎯 Perfect for Local Development

Privacy: Your code never leaves your machine
Cost: No API keys, no per-token billing
Speed: Local inference, sub-second responses
Reliability: No rate limits, no downtime

Quick Start (30 seconds)

Installation

✨ v1.9.0 NEW: Download pre-built binaries with ALL GPU backends included!

📥 Pre-Built Binaries (Recommended - Zero Dependencies)

Pick your platform and download - no compilation needed:

# Windows x64 (includes CUDA + Vulkan + OpenCL)  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe  

# Linux x86_64 (includes CUDA + Vulkan + OpenCL)  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy  

# macOS ARM64 (includes MLX for Apple Silicon)  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy  

# macOS Intel (CPU-only)  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel -o shimmy && chmod +x shimmy  

# Linux ARM64 (CPU-only)  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64 -o shimmy && chmod +x shimmy

That’s it! Your GPU will be detected automatically at runtime.

🛠️ Build from Source (Advanced)

Want to customize or contribute?

# Basic installation (CPU only)  
cargo install shimmy --features huggingface  

# Kitchen Sink builds (what pre-built binaries use):  
# Windows/Linux x64:  
cargo install shimmy --features huggingface,llama,llama-cuda,llama-vulkan,llama-opencl,vision  

# macOS ARM64:  
cargo install shimmy --features huggingface,llama,mlx,vision  

# CPU-only (any platform):  
cargo install shimmy --features huggingface,llama,vision

⚠️ Build Notes:

Windows: Install LLVM (https://releases.llvm.org/download.html) first for libclang.dll

Recommended: Use pre-built binaries to avoid dependency issues

Advanced users only: Building from source requires C++ compiler + CUDA/Vulkan SDKs

GPU Acceleration

✨ NEW in v1.9.0: One binary per platform with automatic GPU detection!

⚠️ IMPORTANT - Vision Feature Performance:
CPU-based vision inference (MiniCPM-V) is 5-10x slower than GPU acceleration.
CPU: 15-45 seconds per image | GPU (CUDA/Vulkan): 2-8 seconds per image
For production vision workloads, GPU acceleration is strongly recommended.

📥 Download Pre-Built Binaries (Recommended)

No compilation needed! Each binary includes ALL GPU backends for your platform:

Platform	Download	GPU Support	Auto-Detects
Windows x64	shimmy-windows-x86_64.exe (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe)	CUDA + Vulkan + OpenCL	✅
Linux x86_64	shimmy-linux-x86_64 (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64)	CUDA + Vulkan + OpenCL	✅
macOS ARM64	shimmy-macos-arm64 (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64)	MLX (Apple Silicon)	✅
macOS Intel	shimmy-macos-intel (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel)	CPU only	N/A
Linux ARM64	shimmy-linux-aarch64 (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64)	CPU only	N/A

How it works: Download one file, run it. Shimmy automatically detects and uses your GPU!

# Windows example  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe  
./shimmy.exe serve --gpu-backend auto   # Auto-detects CUDA/Vulkan/OpenCL  

# Linux example  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy  
chmod +x shimmy  
./shimmy serve --gpu-backend auto       # Auto-detects CUDA/Vulkan/OpenCL  

# macOS ARM64 example  
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy  
chmod +x shimmy  
./shimmy serve                          # Auto-detects MLX on Apple Silicon

🎯 GPU Auto-Detection

Shimmy uses intelligent GPU detection with this priority order:

CUDA (NVIDIA GPUs via nvidia-smi)
Vulkan (Cross-platform GPUs via vulkaninfo)
OpenCL (AMD/Intel GPUs via clinfo)
MLX (Apple Silicon via system detection)
CPU (Fallback if no GPU detected)

No manual configuration needed! Just run with --gpu-backend auto (default).
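
The priority order above can be sketched in a few lines of Python. This is an illustration only, not Shimmy's implementation: it treats "the probe tool is on PATH" as a rough proxy for "the backend is usable", and the function name and parameters are hypothetical.

```python
import platform
import shutil

# Probe tools named in the priority list, highest priority first.
PROBES = [("cuda", "nvidia-smi"), ("vulkan", "vulkaninfo"), ("opencl", "clinfo")]


def detect_backend(has_tool=shutil.which, system=None, machine=None) -> str:
    """Pick a backend by walking the priority list: CUDA, Vulkan, OpenCL, MLX, CPU."""
    system = system or platform.system()
    machine = machine or platform.machine()
    for backend, tool in PROBES:
        if has_tool(tool):  # assumption: tool on PATH implies a working driver
            return backend
    if system == "Darwin" and machine == "arm64":
        return "mlx"  # Apple Silicon detected from the platform itself
    return "cpu"      # fallback when nothing else matched
```

Passing `has_tool`, `system`, and `machine` explicitly makes the ordering easy to test without any GPU present.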

🔧 Manual Backend Override

Want to force a specific backend? Use the --gpu-backend flag:

# Auto-detect (default - recommended)  
shimmy serve --gpu-backend auto  

# Force CPU (for testing or compatibility)  
shimmy serve --gpu-backend cpu  

# Force CUDA (NVIDIA GPUs only)  
shimmy serve --gpu-backend cuda  

# Force Vulkan (AMD/Intel/Cross-platform)  
shimmy serve --gpu-backend vulkan  

# Force OpenCL (AMD/Intel alternative)  
shimmy serve --gpu-backend opencl

🛡️ Error Handling & Robustness: If you force an unavailable backend (e.g., --gpu-backend cuda on AMD GPU), Shimmy will:

✅ Display a clear error message explaining the issue
✅ Automatically fall back to the next available backend in priority order
✅ Log which backend was actually used (check with --verbose)
✅ Continue serving requests (graceful degradation, no crashes)
✅ Support an environment variable override: SHIMMY_GPU_BACKEND=cuda

Common scenarios:

--gpu-backend cuda on non-NVIDIA → Falls back to Vulkan or OpenCL
--gpu-backend vulkan without drivers → Falls back to OpenCL or CPU
--gpu-backend invalid → Clear error + fallback to auto-detection
No GPU detected → Runs on CPU with performance warning

Environment Variable: Set SHIMMY_GPU_BACKEND=cuda to override default without CLI flags.
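
The fallback scenarios listed above can be modeled as a small pure function. The following Python sketch is one reading of the described behavior, not Shimmy's code: `resolve_backend` is a hypothetical name, and "next available backend" is interpreted as the next usable entry after the requested one in the priority list.

```python
from typing import Iterable, Optional, Tuple

PRIORITY = ["cuda", "vulkan", "opencl", "mlx", "cpu"]


def resolve_backend(requested: str, available: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (backend to use, warning message or None)."""
    usable = set(available) | {"cpu"}  # CPU is always the last resort
    auto = next(b for b in PRIORITY if b in usable)
    if requested == "auto":
        return auto, None
    if requested not in PRIORITY:
        # Unknown name: report it and behave like auto-detection.
        return auto, f"unknown backend {requested!r}; using auto-detected {auto!r}"
    if requested in usable:
        return requested, None
    # Requested backend is known but unavailable: try lower-priority ones.
    later = PRIORITY[PRIORITY.index(requested) + 1:]
    fallback = next(b for b in later if b in usable)
    return fallback, f"backend {requested!r} unavailable; falling back to {fallback!r}"
```

For example, `resolve_backend("cuda", {"vulkan", "opencl"})` selects Vulkan and returns a warning, matching the first scenario in the list.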

🔍 Check GPU Support

# Show detected GPU backends  
shimmy gpu-info  

# Check which backend is being used  
shimmy serve --gpu-backend auto --verbose

⚡ Binary Sizes

GPU-enabled binaries (Windows/Linux x64, macOS ARM64): ~40-50MB
CPU-only binaries (macOS Intel, Linux ARM64): ~20-30MB

Trade-off: Slightly larger binaries for zero compilation and automatic GPU detection.

🛠️ Build from Source (Advanced)

Want to customize or contribute? Build from source:

Multiple backends can be compiled in, best one selected automatically
Use --gpu-backend to force specific backend

Get Models

Shimmy auto-discovers models from:

Hugging Face cache: ~/.cache/huggingface/hub/
Ollama models: ~/.ollama/models/
Local directory: ./models/
Environment: SHIMMY_BASE_GGUF=path/to/model.gguf

# Download models that work out of the box  
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/  
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/
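
To check what a discovery pass over these locations might find, a short Python sketch follows. It is an approximation under stated assumptions: it only matches files ending in `.gguf`, so it would miss Ollama models if they are stored as extensionless blobs, and the function names are hypothetical rather than part of Shimmy.

```python
import os
from pathlib import Path
from typing import List


def default_search_dirs(home=None) -> List[Path]:
    """The three directories named in the list above."""
    home = Path(home) if home else Path.home()
    return [home / ".cache" / "huggingface" / "hub", home / ".ollama" / "models", Path("models")]


def find_gguf_files(search_dirs, env=None) -> List[Path]:
    """Collect .gguf files from SHIMMY_BASE_GGUF and the given directories."""
    env = os.environ if env is None else env
    found = []
    explicit = env.get("SHIMMY_BASE_GGUF")
    if explicit and Path(explicit).is_file():
        found.append(Path(explicit))
    for root in search_dirs:
        root = Path(root)
        if root.is_dir():
            found.extend(sorted(p for p in root.rglob("*.gguf") if p.is_file()))
    # Drop duplicates (same file reached twice) while keeping order.
    seen, unique = set(), []
    for p in found:
        key = p.resolve()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique
```

Running `find_gguf_files(default_search_dirs())` after the downloads above should list the files placed under `./models/`; `shimmy list` remains the authoritative view of what the server will serve.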

Start Server

# Auto-allocates port to avoid conflicts  
shimmy serve  

# Or use manual port  
shimmy serve --bind 127.0.0.1:11435

Point your development tools to the displayed port — VSCode Copilot, Cursor, Continue.dev all work instantly.
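
The README does not say how ports are auto-allocated. A common technique, shown here as a Python sketch and not as Shimmy's confirmed method, is to bind to port 0 and let the operating system choose an unused port; `pick_free_port` is a hypothetical helper.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))           # port 0 means "any free port"
        return s.getsockname()[1]   # the port the OS actually assigned
```

A server would normally keep the bound socket open instead of closing it as this sketch does, since another process could claim the port in between. That gap is also why clients should read the port the server prints rather than guess it.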

📦 Download & Install

Package Managers

Rust: cargo install shimmy --features moe (https://crates.io/crates/shimmy) (recommended)
Rust (basic): cargo install shimmy (https://crates.io/crates/shimmy)
VS Code: Shimmy Extension (https://marketplace.visualstudio.com/items?itemName=targetedwebresults.shimmy-vscode)
Windows MSVC: Uses shimmy-llama-cpp-2 packages for better compatibility
npm: npm install -g shimmy-js (planned)
Python: pip install shimmy (planned)

Direct Downloads

GitHub Releases: Latest binaries (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest)
Docker: docker pull shimmy/shimmy:latest (coming soon)

🍎 macOS Support

Full compatibility confirmed! Shimmy works flawlessly on macOS with Metal GPU acceleration.

# Install dependencies  
brew install cmake rust  

# Install shimmy  
cargo install shimmy

✅ Verified working:

Intel and Apple Silicon Macs
Metal GPU acceleration (automatic)
MLX native acceleration for Apple Silicon
Xcode 17+ compatibility
All LoRA adapter features

Integration Examples

VSCode Copilot

{  
  "github.copilot.advanced": {  
    "serverUrl": "http://localhost:11435"  
  }  
}

Continue.dev

{  
  "models": [{  
    "title": "Local Shimmy",  
    "provider": "openai",  
    "model": "your-model-name",  
    "apiBase": "http://localhost:11435/v1"  
  }]  
}

Cursor IDE

Works out of the box - just point to http://localhost:11435/v1

Why Shimmy Will Always Be Free

I built Shimmy to retain privacy-first control over my AI development and keep things local and lean.

This is my commitment: Shimmy stays MIT licensed, forever. If you want to support development, sponsor it (https://github.com/sponsors/Michael-A-Kuykendall). If you don’t, just build something cool with it.

💡 Shimmy saves you time and money. If it’s useful, consider sponsoring for $5/month (https://github.com/sponsors/Michael-A-Kuykendall) — less than your Netflix subscription, infinitely more useful for developers.

API Reference

Endpoints

GET /health - Health check
POST /v1/chat/completions - OpenAI-compatible chat
GET /v1/models - List available models
POST /api/generate - Shimmy native API
GET /ws/generate - WebSocket streaming
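
For environments without the OpenAI SDK, the two OpenAI-compatible endpoints can be reached with the Python standard library alone. The sketch below only builds the request and parses responses; it assumes `/v1/models` returns the OpenAI list shape (`{"data": [{"id": ...}]}`), which the README implies but does not show, and all helper names are hypothetical.

```python
import json
import urllib.request


def chat_request(base_url, model, prompt, max_tokens=32):
    """Build a POST /v1/chat/completions request (not sent here)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_model_id(models_json):
    """Pick the first model id from a GET /v1/models response (assumed OpenAI shape)."""
    data = models_json.get("data", [])
    return data[0]["id"] if data else None


def reply_text(completion_json):
    """Extract the assistant text, mirroring the jq filter used in the smoke test."""
    return completion_json["choices"][0]["message"]["content"]
```

To send the request, pass it to `urllib.request.urlopen(...)` and feed the decoded JSON body to `reply_text`.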

CLI Commands

shimmy serve                        # Start server (auto port allocation)  
shimmy serve --bind 127.0.0.1:8080  # Manual port binding  
shimmy serve --cpu-moe --n-cpu-moe 8 # Enable MOE CPU offloading  
shimmy list                         # Show available models (LLM-filtered)  
shimmy discover                     # Refresh model discovery  
shimmy generate --name X --prompt "Hi"  # Test generation  
shimmy probe model-name             # Verify model loads  
shimmy gpu-info                     # Show GPU backend status
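
When launching the server from a script, it helps to assemble the `shimmy serve` arguments in one place. This Python sketch uses only flags that appear in this README (`--bind`, `--gpu-backend`, `--cpu-moe`, `--n-cpu-moe`); `serve_command` itself is a hypothetical helper, and the flag combinations it allows have not been checked against the CLI.

```python
def serve_command(bind=None, gpu_backend=None, cpu_moe=False, n_cpu_moe=None, binary="shimmy"):
    """Build the argv list for `shimmy serve` from optional settings."""
    argv = [binary, "serve"]
    if bind:
        argv += ["--bind", bind]                 # e.g. "127.0.0.1:8080"
    if gpu_backend:
        argv += ["--gpu-backend", gpu_backend]   # auto, cpu, cuda, vulkan, opencl
    if cpu_moe:
        argv.append("--cpu-moe")
    if n_cpu_moe is not None:
        argv += ["--n-cpu-moe", str(n_cpu_moe)]
    return argv
```

The resulting list can be handed directly to `subprocess.Popen`, which avoids shell quoting issues.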

Technical Architecture

Rust + Tokio: Memory-safe, async performance
llama.cpp backend: Industry-standard GGUF inference
OpenAI API compatibility: Drop-in replacement
Dynamic port management: Zero conflicts, auto-allocation
Zero-config auto-discovery: Just works™

🚀 Advanced Features

🧠 MOE CPU Offloading: Hybrid GPU/CPU processing for large models (70B+)
🎯 Smart Model Filtering: Automatically excludes non-language models (Stable Diffusion, Whisper, CLIP)
🛡️ 6-Gate Release Validation: Constitutional quality limits ensure reliability
⚡ Smart Model Preloading: Background loading with usage tracking for instant model switching
💾 Response Caching: LRU
