@gyro_ai: Running large models locally for your own tools involves a mountain of Python dependencies and endless backend configuration — the environment alone scares off many. In reality, most people just want a local interface that works instantly. Shimmy is a Rust-based local inference service, compiled into a single binary, offering an interface identical to OpenAI's…

X AI KOLs Timeline 05/24/26, 12:58 AM Tools

local-llm inference-server openai-compatible rust gguf single-binary open-source

Summary

Shimmy is a lightweight single-binary local inference server that provides a drop-in OpenAI-compatible API for running GGUF models, supporting hot-swapping models and requiring no Python dependencies.

Running large models locally for your own tools means installing a pile of Python dependencies and spending half a day configuring the backend; the environment alone is enough to discourage many people. In fact, many people just want a local interface that starts up immediately. Shimmy is a local inference service written in Rust and compiled into a single binary, providing an interface exactly the same as OpenAI's. Your existing AI tools don't need code changes: just point the endpoint at Shimmy and they'll run locally.

https://github.com/Michael-A-Kuykendall/shimmy… 5.2k stars, MIT license; the author promises it will remain free forever and won't turn paid.

Core features:

1. Drop-in API replacement - Fully compatible with OpenAI's API. Cursor, Continue.dev, various OpenAI SDKs: just change the address and it works, zero code changes.
2. Single binary to run - Download a single binary and serve it. The new version even bundles various GPU backends, no compilation needed.
3. Switch models freely - Supports GGUF and SafeTensors formats, hot-swap models, auto-discover local models.
4. 30 seconds to start - curl the binary, chmod, and run; the server is up.

Download the binary for your system and run it directly. Pre-built versions for Windows, Linux, and macOS Apple Silicon are available. For developers who want to run large models locally for free and find Ollama's stack too heavy, this can be a lightweight alternative.

Original Article


Cached at: 05/24/26, 04:22 AM


Michael-A-Kuykendall/shimmy

Source: https://github.com/Michael-A-Kuykendall/shimmy

The Lightweight OpenAI API Server

🔒 Local Inference Without Dependencies 🚀

License: MIT (https://opensource.org/licenses/MIT) | Security (https://github.com/Michael-A-Kuykendall/shimmy/security) | Crates.io (https://crates.io/crates/shimmy) | Downloads (https://crates.io/crates/shimmy) | Rust (https://rustup.rs/) | GitHub Stars (https://github.com/Michael-A-Kuykendall/shimmy/stargazers) | 💝 Sponsor this project (https://github.com/sponsors/Michael-A-Kuykendall)

Shimmy will be free forever. No asterisks. No “free for now.” No pivot to paid.

💝 Support Shimmy’s Growth 🚀

If Shimmy helps you, consider sponsoring (https://github.com/sponsors/Michael-A-Kuykendall) — 100% of support goes to keeping it free forever.

$5/month: Coffee tier ☕ - Eternal gratitude + sponsor badge
$25/month: Bug prioritizer 🐛 - Priority support + name in SPONSORS.md
$100/month: Corporate backer 🏢 - Logo placement + monthly office hours
$500/month: Infrastructure partner 🚀 - Direct support + roadmap input

🎯 Become a Sponsor (https://github.com/sponsors/Michael-A-Kuykendall) | See our amazing sponsors 🙏

Drop-in OpenAI API Replacement for Local LLMs

Shimmy is a single binary that provides 100% OpenAI-compatible endpoints for GGUF models. Point your existing AI tools to Shimmy and they just work: locally, privately, and free.

🎉 NEW in v1.9.0: One download, all GPU backends included! No compilation, no backend confusion - just download and run.

Developer Tools

Whether you’re forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.

Try it in 30 seconds

# 1) Download pre-built binary (includes all GPU backends)
# Windows:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve &

# Linux:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
./shimmy serve &

# macOS (Apple Silicon):
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
./shimmy serve &

# 2) See models and pick one
./shimmy list

# 3) Smoke test the OpenAI API
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"REPLACE_WITH_MODEL_FROM_list",
    "messages":[{"role":"user","content":"Say hi in 5 words."}],
    "max_tokens":32
  }' | jq -r '.choices[0].message.content'

🚀 Compatible with OpenAI SDKs and Tools

No code changes needed - just change the API endpoint:

Any OpenAI client: Python, Node.js, curl, etc.
Development applications: Compatible with standard SDKs
VSCode Extensions: Point to http://localhost:11435
Cursor Editor: Built-in OpenAI compatibility
Continue.dev: Drop-in model provider

Use with OpenAI SDKs

Node.js (openai v4)

import OpenAI from "openai";
const openai = new OpenAI({
  baseURL: "http://127.0.0.1:11435/v1",
  apiKey: "sk-local", // placeholder, Shimmy ignores it
});
const resp = await openai.chat.completions.create({
  model: "REPLACE_WITH_MODEL",
  messages: [{ role: "user", content: "Say hi in 5 words." }],
  max_tokens: 32,
});
console.log(resp.choices[0].message?.content);

Python (openai>=1.0.0)

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")
resp = client.chat.completions.create(
    model="REPLACE_WITH_MODEL",
    messages=[{"role": "user", "content": "Say hi in 5 words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)

⚡ Zero Configuration Required

Automatically finds models from Hugging Face cache, Ollama, local dirs
Auto-allocates ports to avoid conflicts
Auto-detects LoRA adapters for specialized models
Just works - no config files, no setup wizards
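
Port auto-allocation is commonly done by asking the operating system for any free port. The sketch below shows that general technique in Python; it is an assumption about how such a feature can work, not a description of Shimmy's internals, and `pick_free_port` is a hypothetical helper name.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on `host` (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 tells the OS to choose any free port.
        sock.bind((host, 0))
        # getsockname() returns (address, port) for an IPv4 socket.
        return sock.getsockname()[1]
```

Another process can claim the port between this probe and the real bind, so a server would normally keep the bound socket open and listen on it instead of closing it first.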

🧠 Advanced MOE (Mixture of Experts) Support

Run 70B+ models on consumer hardware with intelligent CPU/GPU hybrid processing:

🔄 CPU MOE Offloading: Automatically distribute model layers across CPU and GPU
🧮 Intelligent Layer Placement: Optimizes which layers run where for maximum performance
💾 Memory Efficiency: Fit larger models in limited VRAM by using system RAM strategically
⚡ Hybrid Acceleration: Get GPU speed where it matters most, CPU reliability everywhere else
🎛️ Configurable: --cpu-moe and --n-cpu-moe flags for fine control

# Enable MOE CPU offloading during installation
cargo install shimmy --features moe

# Run with MOE hybrid processing
shimmy serve --cpu-moe --n-cpu-moe 8
# Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)

Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference
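
To make the `--n-cpu-moe` number concrete: in llama.cpp, a flag of the same name keeps the expert weights of the first N layers on the CPU. Assuming Shimmy's flag follows that convention (the text above does not confirm this), the placement can be sketched as a toy model. `place_expert_weights` is a hypothetical name used only for illustration.

```python
from typing import List


def place_expert_weights(n_layers: int, n_cpu_moe: int) -> List[str]:
    """Toy model: where each layer's expert weights would live.

    Assumption: the first `n_cpu_moe` layers keep their experts on the CPU
    and the remaining layers keep them on the GPU.
    """
    if n_layers < 0 or n_cpu_moe < 0:
        raise ValueError("layer counts must be non-negative")
    n_cpu = min(n_cpu_moe, n_layers)  # cannot offload more layers than exist
    return ["cpu"] * n_cpu + ["gpu"] * (n_layers - n_cpu)
```

Under this assumption, raising the number trades GPU memory for speed: more expert weights sit in system RAM, so less VRAM is needed, at the cost of slower expert computation.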

🎯 Perfect for Local Development

Privacy: Your code never leaves your machine
Cost: No API keys, no per-token billing
Speed: Local inference, sub-second responses
Reliability: No rate limits, no downtime

Quick Start (30 seconds)

Installation

✨ v1.9.0 NEW: Download pre-built binaries with ALL GPU backends included!

📥 Pre-Built Binaries (Recommended - Zero Dependencies)

Pick your platform and download - no compilation needed:

# Windows x64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe

# Linux x86_64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy

# macOS ARM64 (includes MLX for Apple Silicon)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy

# macOS Intel (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel -o shimmy && chmod +x shimmy

# Linux ARM64 (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64 -o shimmy && chmod +x shimmy

That’s it! Your GPU will be detected automatically at runtime.

🛠️ Build from Source (Advanced)

Want to customize or contribute?

# Basic installation (CPU only)
cargo install shimmy --features huggingface

# Kitchen Sink builds (what pre-built binaries use):
# Windows/Linux x64:
cargo install shimmy --features huggingface,llama,llama-cuda,llama-vulkan,llama-opencl,vision
# macOS ARM64:
cargo install shimmy --features huggingface,llama,mlx,vision
# CPU-only (any platform):
cargo install shimmy --features huggingface,llama,vision

⚠️ Build Notes:

Windows: Install LLVM (https://releases.llvm.org/download.html) first for libclang.dll

Recommended: Use pre-built binaries to avoid dependency issues

Advanced users only: Building from source requires C++ compiler + CUDA/Vulkan SDKs

GPU Acceleration

✨ NEW in v1.9.0: One binary per platform with automatic GPU detection!

⚠️ IMPORTANT - Vision Feature Performance: CPU-based vision inference (MiniCPM-V) is 5-10x slower than GPU-accelerated inference.

CPU: 15-45 seconds per image
GPU (CUDA/Vulkan): 2-8 seconds per image

For production vision workloads, GPU acceleration is strongly recommended.

📥 Download Pre-Built Binaries (Recommended)

No compilation needed! Each binary includes ALL GPU backends for your platform:

Platform	Download	GPU Support	Auto-Detects
Windows x64	shimmy-windows-x86_64.exe (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe)	CUDA + Vulkan + OpenCL	✅
Linux x86_64	shimmy-linux-x86_64 (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64)	CUDA + Vulkan + OpenCL	✅
macOS ARM64	shimmy-macos-arm64 (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64)	MLX (Apple Silicon)	✅
macOS Intel	shimmy-macos-intel (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel)	CPU only	N/A
Linux ARM64	shimmy-linux-aarch64 (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64)	CPU only	N/A

How it works: Download one file, run it. Shimmy automatically detects and uses your GPU!

# Windows example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve --gpu-backend auto  # Auto-detects CUDA/Vulkan/OpenCL

# Linux example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy
chmod +x shimmy
./shimmy serve --gpu-backend auto  # Auto-detects CUDA/Vulkan/OpenCL

# macOS ARM64 example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy
chmod +x shimmy
./shimmy serve  # Auto-detects MLX on Apple Silicon

🎯 GPU Auto-Detection

Shimmy uses intelligent GPU detection with this priority order:

CUDA (NVIDIA GPUs via nvidia-smi)
Vulkan (Cross-platform GPUs via vulkaninfo)
OpenCL (AMD/Intel GPUs via clinfo)
MLX (Apple Silicon via system detection)
CPU (Fallback if no GPU detected)

No manual configuration needed! Just run with --gpu-backend auto (default).
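
The priority order above can be sketched as a simple probe loop. This is a minimal illustration, not Shimmy's actual detection code: it assumes that finding `nvidia-smi`, `vulkaninfo`, or `clinfo` on the PATH is enough to treat a backend as present, and `detect_backend` is a hypothetical name.

```python
import platform
import shutil
from typing import Callable, Optional

# Probe tools from the priority list, highest priority first.
PROBES = [
    ("cuda", "nvidia-smi"),
    ("vulkan", "vulkaninfo"),
    ("opencl", "clinfo"),
]


def detect_backend(
    which: Callable[[str], Optional[str]] = shutil.which,
    system: Optional[str] = None,
    machine: Optional[str] = None,
) -> str:
    """Return the first backend whose probe tool is found, else mlx or cpu."""
    system = system or platform.system()
    machine = machine or platform.machine()
    for backend, tool in PROBES:
        # Assumption: a tool on the PATH means the backend is usable.
        if which(tool) is not None:
            return backend
    # Apple Silicon reports Darwin/arm64.
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    return "cpu"
```

The `which`, `system`, and `machine` parameters exist only so the logic can be exercised without real hardware; a real detector would likely also run the tool and inspect its output.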

🔧 Manual Backend Override

Want to force a specific backend? Use the --gpu-backend flag:

# Auto-detect (default - recommended)
shimmy serve --gpu-backend auto

# Force CPU (for testing or compatibility)
shimmy serve --gpu-backend cpu

# Force CUDA (NVIDIA GPUs only)
shimmy serve --gpu-backend cuda

# Force Vulkan (AMD/Intel/Cross-platform)
shimmy serve --gpu-backend vulkan

# Force OpenCL (AMD/Intel alternative)
shimmy serve --gpu-backend opencl

🛡️ Error Handling & Robustness: If you force an unavailable backend (e.g., --gpu-backend cuda on AMD GPU), Shimmy will:

✅ Display a clear error message explaining the issue
✅ Automatically fall back to the next available backend in priority order
✅ Log which backend was actually used (check with --verbose)
✅ Continue serving requests (graceful degradation, no crashes)
✅ Support environment variable override: SHIMMY_GPU_BACKEND=cuda

Common scenarios:

--gpu-backend cuda on non-NVIDIA → Falls back to Vulkan or OpenCL
--gpu-backend vulkan without drivers → Falls back to OpenCL or CPU
--gpu-backend invalid → Clear error + fallback to auto-detection
No GPU detected → Runs on CPU with performance warning

Environment Variable: Set SHIMMY_GPU_BACKEND=cuda to override default without CLI flags.
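
The fallback behaviour described above can be sketched as a small pure function. This is an illustration under stated assumptions, not Shimmy's implementation: `resolve_backend` is a hypothetical name, the caller supplies the set of available backends, and CPU is assumed to be always usable.

```python
from typing import Optional, Set, Tuple

# Priority order from the GPU Auto-Detection section.
PRIORITY = ["cuda", "vulkan", "opencl", "mlx", "cpu"]


def resolve_backend(requested: str, available: Set[str]) -> Tuple[str, Optional[str]]:
    """Return (backend to use, warning message or None)."""
    # Assumption: CPU is always usable, so this list is never empty.
    usable = [b for b in PRIORITY if b in available or b == "cpu"]
    if requested == "auto":
        return usable[0], None
    if requested not in PRIORITY:
        # Unknown name: report it and behave like auto-detection.
        return usable[0], f"unknown backend {requested!r}; using auto-detection"
    if requested in usable:
        return requested, None
    # Forced backend is missing: take the next usable one further down the list.
    start = PRIORITY.index(requested) + 1
    fallback = next(b for b in PRIORITY[start:] if b in usable)
    return fallback, f"backend {requested!r} unavailable; falling back to {fallback!r}"
```

For example, `resolve_backend("cuda", {"opencl"})` selects `opencl` and returns a warning, which matches the "cuda on non-NVIDIA" scenario. How the real CLI flag and `SHIMMY_GPU_BACKEND` take precedence over each other is not stated above, so this sketch leaves that to the caller.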

🔍 Check GPU Support

# Show detected GPU backends
shimmy gpu-info

# Check which backend is being used
shimmy serve --gpu-backend auto --verbose

⚡ Binary Sizes

GPU-enabled binaries (Windows/Linux x64, macOS ARM64): ~40-50MB
CPU-only binaries (macOS Intel, Linux ARM64): ~20-30MB

Trade-off: Slightly larger binaries for zero compilation and automatic GPU detection.

🛠️ Build from Source (Advanced)

Want to customize or contribute? Build from source:

Multiple backends can be compiled in, best one selected automatically
Use --gpu-backend <backend> to force specific backend

Get Models

Shimmy auto-discovers models from:

Hugging Face cache: ~/.cache/huggingface/hub/
Ollama models: ~/.ollama/models/
Local directory: ./models/
Environment: SHIMMY_BASE_GGUF=path/to/model.gguf

# Download models that work out of the box
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/
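
To check what a directory scan of the locations above would turn up, a rough Python sketch follows. It is an approximation, not Shimmy's discovery logic: it only matches files ending in `.gguf`, so models stored under other names (Ollama's store may not use that extension) would be missed. `find_gguf_models` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Iterable, List

# Search locations listed above; the relative "models" path depends on the
# current working directory.
DEFAULT_ROOTS = (
    Path.home() / ".cache" / "huggingface" / "hub",
    Path.home() / ".ollama" / "models",
    Path("models"),
)


def find_gguf_models(roots: Iterable[Path] = DEFAULT_ROOTS) -> List[Path]:
    """Collect *.gguf files under the given roots plus SHIMMY_BASE_GGUF."""
    found = set()
    env_model = os.environ.get("SHIMMY_BASE_GGUF")
    if env_model and Path(env_model).is_file():
        found.add(Path(env_model))
    for root in roots:
        if root.is_dir():
            # rglob walks the directory tree recursively.
            found.update(p for p in root.rglob("*.gguf") if p.is_file())
    return sorted(found)
```

Comparing this output with `shimmy list` is one way to see whether a downloaded file landed somewhere the server is likely to look.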

Start Server

# Auto-allocates port to avoid conflicts
shimmy serve

# Or use manual port
shimmy serve --bind 127.0.0.1:11435

Point your development tools to the displayed port — VSCode Copilot, Cursor, Continue.dev all work instantly.

📦 Download & Install

Package Managers

Rust: cargo install shimmy --features moe (https://crates.io/crates/shimmy) (recommended)
Rust (basic): cargo install shimmy (https://crates.io/crates/shimmy)
VS Code: Shimmy Extension (https://marketplace.visualstudio.com/items?itemName=targetedwebresults.shimmy-vscode)
Windows MSVC: Uses shimmy-llama-cpp-2 packages for better compatibility
npm: npm install -g shimmy-js (planned)
Python: pip install shimmy (planned)

Direct Downloads

GitHub Releases: Latest binaries (https://github.com/Michael-A-Kuykendall/shimmy/releases/latest)
Docker: docker pull shimmy/shimmy:latest (coming soon)

🍎 macOS Support

Full compatibility confirmed! Shimmy works flawlessly on macOS with Metal GPU acceleration.

# Install dependencies
brew install cmake rust

# Install shimmy
cargo install shimmy

✅ Verified working:

Intel and Apple Silicon Macs
Metal GPU acceleration (automatic)
MLX native acceleration for Apple Silicon
Xcode 17+ compatibility
All LoRA adapter features

Integration Examples

VSCode Copilot

{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}

Continue.dev

{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai",
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}

Cursor IDE

Works out of the box - just point to http://localhost:11435/v1

Why Shimmy Will Always Be Free

I built Shimmy to retain privacy-first control over my AI development and keep things local and lean. This is my commitment: Shimmy stays MIT licensed, forever. If you want to support development, sponsor it (https://github.com/sponsors/Michael-A-Kuykendall). If you don’t, just build something cool with it.

💡 Shimmy saves you time and money. If it’s useful, consider sponsoring for $5/month (https://github.com/sponsors/Michael-A-Kuykendall) — less than your Netflix subscription, infinitely more useful for developers.

API Reference

Endpoints

GET /health - Health check
POST /v1/chat/completions - OpenAI-compatible chat
GET /v1/models - List available models
POST /api/generate - Shimmy native API
GET /ws/generate - WebSocket streaming
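
For callers that prefer not to install an SDK, the chat endpoint can be reached with the Python standard library alone. The sketch below builds the same request as the curl smoke test; the base URL is the address used in the earlier examples, and `build_chat_request` and `send` are hypothetical helper names. It assumes the server returns a JSON body.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:11435"  # address used in the examples above


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions without sending it."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and parse the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With a server running, `send(build_chat_request("REPLACE_WITH_MODEL", "Say hi in 5 words."))` would return the parsed response, and the reply text would be under `["choices"][0]["message"]["content"]` as in the SDK examples.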

CLI Commands

shimmy serve                       # Start server (auto port allocation)
shimmy serve --bind 127.0.0.1:8080 # Manual port binding
shimmy serve --cpu-moe --n-cpu-moe 8  # Enable MOE CPU offloading
shimmy list                        # Show available models (LLM-filtered)
shimmy discover                    # Refresh model discovery
shimmy generate --name X --prompt "Hi"  # Test generation
shimmy probe model-name            # Verify model loads
shimmy gpu-info                    # Show GPU backend status

Technical Architecture

Rust + Tokio: Memory-safe, async performance
llama.cpp backend: Industry-standard GGUF inference
OpenAI API compatibility: Drop-in replacement
Dynamic port management: Zero conflicts, auto-allocation
Zero-config auto-discovery: Just works™

🚀 Advanced Features

🧠 MOE CPU Offloading: Hybrid GPU/CPU processing for large models (70B+)
🎯 Smart Model Filtering: Au

