I turned an Android phone into a Vulkan-accelerated local LLM node (GGUF + LiteLLM + Tailscale)

Reddit r/LocalLLaMA Tools

Summary

An Android phone is repurposed as a portable GGUF inference server with Vulkan acceleration. It exposes an OpenAI-compatible endpoint through LiteLLM over a Tailscale mesh, so it can plug into a self-hosted AI cluster.

Hey everyone, I’ve been working on something that finally reached a stable enough point to share.

I’ve been experimenting with using an Android device as a local inference node inside a self-hosted AI mesh. The goal wasn’t “run a chatbot on Android,” but to make the phone behave like a portable GGUF inference server that plugs into an existing cluster.

## What it currently does

- Loads GGUF models locally on-device
- Uses Vulkan for mobile GPU acceleration
- Exposes an OpenAI-compatible endpoint on the mesh
- Routes through LiteLLM like any other backend
- Joins the cluster through Tailscale
- Supports fallback routing to larger local nodes
- Can run standalone when the rest of the mesh is unavailable

## Architecture

```text
[Android Pocket Node / Z Fold 6]
  GGUF + Vulkan (gpu_layers=89)
  llama.cpp JNI/NDK bridge
  OpenAI-compatible local endpoint
        ↓
[Tailscale Mesh]
        ↓
[Edge Gate on neo-x510uar]
  request pre-flight
  battery / thermal / prompt-size routing
        ↓
[LiteLLM Router on neo-x510uar]
  OpenAI-compatible gateway
  model aliases
  fallback routing
        ↓
[Fallback Nodes]
  sheens-mac-studio: heavier reasoning / judge models
  moolah: RTX box for GPU-heavy workloads
```
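
The post doesn't spell out how the Edge Gate's battery / thermal / prompt-size routing works, so here is a minimal sketch of what such a pre-flight check could look like. Everything in it is an assumption made for illustration: the `DeviceState` fields, the `choose_backend` function, the two alias names, and the threshold values are hypothetical and are not the actual gate.

```python
from dataclasses import dataclass

# Hypothetical model aliases; the real alias names are not given in the post.
POCKET_NODE = "pocket-node"      # the Android phone backend
FALLBACK_NODE = "fallback-node"  # any larger local node behind the router


@dataclass(frozen=True)
class DeviceState:
    """Snapshot of the phone's condition, as the gate might receive it."""
    battery_pct: float     # 0-100
    temperature_c: float   # device temperature in degrees Celsius
    charging: bool = False


def choose_backend(
    state: DeviceState,
    prompt_tokens: int,
    *,
    min_battery_pct: float = 20.0,    # assumed threshold
    max_temperature_c: float = 42.0,  # assumed threshold
    max_prompt_tokens: int = 2048,    # assumed threshold
) -> str:
    """Return the alias a request should be routed to."""
    # Long prompts go straight to a bigger node.
    if prompt_tokens > max_prompt_tokens:
        return FALLBACK_NODE
    # Avoid adding load to a phone that is already hot.
    if state.temperature_c >= max_temperature_c:
        return FALLBACK_NODE
    # Protect a low battery unless the phone is plugged in.
    if state.battery_pct < min_battery_pct and not state.charging:
        return FALLBACK_NODE
    return POCKET_NODE


if __name__ == "__main__":
    healthy = DeviceState(battery_pct=80, temperature_c=34)
    print(choose_backend(healthy, prompt_tokens=300))
    print(choose_backend(healthy, prompt_tokens=6000))
```

In a setup like the one described, the gate could pass the returned alias as the `model` field of an ordinary OpenAI-style request to the LiteLLM router, leaving the router's own fallback rules in place as a second safety net.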