@PrajwalTomar_: https://x.com/PrajwalTomar_/status/2069409824824316060
Summary
The author built a fully offline AI agent using local embedding models, Llama via Ollama, and VectorAI DB to address the risks of cloud-dependent AI. The agent runs on an 8GB MacBook, processes sensitive documents, and maintains memory across sessions.
Cached at: 06/23/26, 03:51 PM
I Built a Private AI Agent That Runs Fully Offline. Here’s the Workflow.
On June 9, Anthropic shipped Claude Fable 5, the most powerful model anyone had ever seen.
On June 12, the US government pulled it with an export order. Gone, for everyone, overnight.
Sit with that for a second. A model hundreds of millions of people relied on got switched off by a single letter. Not your model. Not your call. Someone else flipped a switch and the agent you built on top of it stopped existing.
That is the real risk of building on someone else’s cloud. Your access is a permission. And permissions get pulled.
So here is what almost nobody is building. An agent that depends on none of it. Fully local. Works well. Runs anywhere you put it. A laptop, a private server, a machine on a factory floor with no internet at all.
I built one to show it actually works. The whole stack runs on a base 8GB MacBook. No cloud. No API key. I turned the wifi off and it kept answering.
Here’s the workflow.
The entire stack runs inside my laptop. Embeddings, the vector database, and the LLM all talk to each other locally.
The problem nobody wants to say out loud
Almost every AI agent you have ever used works the same way. Your prompt, your documents, your customer data, all of it gets shipped to a server you do not control, processed there, and sent back.
For a side project, fine. For a real company, this is starting to break.
If you work in healthcare, finance, legal, or defense, there are documents you are simply not allowed to send to a third-party server. Not “should not.” Not allowed. In manufacturing, there are quality control systems on the factory floor that have to make decisions in real time, where a round trip to a cloud API is too slow and too fragile to depend on.
The honest truth is the technology finally caught up to the requirement. Apple ships models that run on the device in your pocket. Meta and Google give away models you can run on a laptop. Open embedding models are excellent and free. The pieces to run AI entirely on your own hardware are all here.
The only question left is whether you can actually assemble them into something useful. So I did.
What I built
A private second brain. An agent I point at a folder of sensitive documents, ask questions in plain English, and get real answers from. Completely offline.
It also remembers. Tell it something in one session, close it, reopen it the next day, and it still knows. That is the part most “local AI” demos skip, and it is the part that actually matters.
The stack is four pieces, all running on the same 8GB laptop:
→ A local embedding model (sentence-transformers) to turn text into searchable vectors
→ A local language model (Llama, running through Ollama) to write the answers
→ VectorAI DB (running locally in Docker) to store the documents AND the memory
→ A small piece of Python to glue it together into an agent
For the documents, I deliberately used public regulatory text. The GDPR and the NIST AI Risk Management Framework. Exactly the kind of dense, sensitive, “do not leak this” material a real compliance team works with every day.
What VectorAI DB actually is
This is the part that makes the whole thing work, so it is worth being clear.
VectorAI DB is a vector database. It stores text as vectors (lists of numbers that capture meaning) and lets you search by meaning instead of by keyword. Ask “what rights do people have over their data” and it can surface the relevant GDPR clause even when that clause is worded differently from your question.
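To make "search by meaning" concrete, here is a minimal sketch of the idea in plain Python. It is not VectorAI DB's API. The three-number vectors and the sample sentences are made up for illustration (a real embedding model produces hundreds of dimensions), and `cosine_similarity` and `search` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, items, top_k=3):
    """Return the top_k (score, text) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data: paraphrased placeholder sentences with hand-written vectors.
TOY_ITEMS = [
    ("The data subject may obtain erasure of personal data.", [0.9, 0.1, 0.0]),
    ("Processors shall keep records of processing activities.", [0.1, 0.9, 0.1]),
    ("Fines may reach a percentage of annual turnover.", [0.0, 0.2, 0.9]),
]

# Pretend embedding of "what rights do people have over their data".
QUERY = [0.8, 0.2, 0.1]

if __name__ == "__main__":
    for score, text in search(QUERY, TOY_ITEMS, top_k=2):
        print(f"{score:.3f}  {text}")
```

The query shares no keyword with the top sentence. It wins because its vector points in nearly the same direction, which is the whole trick a vector database performs at scale.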
Two things made it the right choice for this build.
First, it runs locally. One Docker command and it is live on your machine, with a local dashboard in your browser. Nothing phones home.
Second, and this is the real point, it is the one piece of this stack you would not want to self-host as raw open source in production.
The embedding model and the language model are open and you can run them yourself all day. But the database is where your data lives. It is the piece that has to stay up, stay consistent, recover cleanly, and scale when your collection grows. Open source gives you the components. It does not give you the support or the production hardening. VectorAI DB is the component a real team can actually run without absorbing the operational risk of babysitting a self-managed install.
That distinction is the whole enterprise case. You DIY the model. You do not DIY the database.
Step 1: Run the database on your own machine
One Docker command spins up VectorAI DB locally. It comes with a local UI you can open in your browser to see your collections and data.
VectorAI DB running locally in Docker. One container, live on my own machine, no cloud account anywhere.
The VectorAI DB dashboard open at localhost in my browser. The database, its collections, and its health, all running on the laptop.
Step 2: Turn your documents into local memory
The agent reads each PDF, splits it into chunks, and uses the local embedding model to turn every chunk into a vector. Those vectors get stored in VectorAI DB. This all happens on the laptop. The documents never leave.
Feeding the GDPR and NIST documents in. The agent splits them into chunks, embeds each one locally, and stores all 876 of them in VectorAI DB. The files never leave the machine.
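The chunk-and-embed step can be sketched roughly like this. It assumes the PDF text has already been extracted to a string. The chunk size, the overlap, and the names `chunk_text` and `build_records` are my own illustrative choices, not values from the build. `embed` is any function that maps text to a list of numbers. With sentence-transformers that would typically wrap the model's `encode` method, and the stand-in below only exists so the sketch runs without any model installed.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character windows that overlap slightly."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def build_records(source_name, chunks, embed):
    """Pair every chunk with its vector and enough metadata to trace it back."""
    return [
        {"source": source_name, "chunk_index": i, "text": chunk, "vector": embed(chunk)}
        for i, chunk in enumerate(chunks)
    ]


if __name__ == "__main__":
    # Stand-in embedder (assumption): a real one returns a semantic vector.
    fake_embed = lambda text: [float(len(text))]
    records = build_records("example.pdf", chunk_text("abcdefghij", 4, 1), fake_embed)
    for record in records:
        print(record)
```

The overlap is there so a sentence that straddles a chunk boundary still appears whole in at least one chunk.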
Step 3: Run the language model locally
The model that writes the answers runs through Ollama, fully on the machine. No API key. No account. No request ever leaves the laptop.
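For reference, talking to Ollama from Python can be as small as the sketch below. It uses only the standard library and Ollama's local HTTP endpoint on its default port. The model tag `llama3.2` is an assumption (swap in whatever model you actually pulled), and the function names are hypothetical.

```python
import json
import urllib.request

# Ollama listens on this address by default. Nothing here leaves localhost.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="llama3.2"):
    """Build the POST request for a single, non-streaming completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_llm(prompt, model="llama3.2", timeout=120):
    """Send the prompt to the local Ollama server and return the answer text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


if __name__ == "__main__":
    # Requires a running Ollama server with the model already pulled.
    print(ask_local_llm("In one sentence, what is a vector database?"))
```

There is no API key anywhere in that request, because there is no account to authenticate against.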
I ran a small Llama model so it would fit comfortably in 8GB of RAM. That is worth being upfront about: you do not need a server farm for this. A normal laptop is enough.
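A rough way to sanity-check "fits on 8GB" is to estimate the size of the model weights alone: parameters times bits per weight, divided by eight. The parameter counts and quantization levels below are illustrative assumptions, not the exact model from this build, and the estimate ignores the context cache and runtime overhead, so treat it as a lower bound.

```python
def weights_gb(num_params, bits_per_weight):
    """Approximate size of the model weights in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical examples: (label, parameter count, bits per weight)
    examples = [
        ("3B params, 4-bit", 3e9, 4),
        ("8B params, 4-bit", 8e9, 4),
        ("8B params, 16-bit", 8e9, 16),
    ]
    for label, params, bits in examples:
        print(f"{label}: about {weights_gb(params, bits):.1f} GB of weights")
```

The arithmetic shows why a small quantized model is the comfortable choice here: the weights have to share the same 8GB with the operating system, the embedding model, and the database container.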
The language model answering through Ollama, running fully on the laptop. No API key, no account, nothing leaving the machine.
Step 4: Give the agent memory that survives
Here is where it stops being a search box and becomes an agent.
Every exchange gets saved into a second collection in VectorAI DB called memory. When you ask a new question, the agent searches both your documents AND its own memory of past conversations before it answers.
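Here is a hedged sketch of that "search both, then answer" step. The two search functions are passed in as plain callables, so nothing here claims to be VectorAI DB's client API. The prompt wording and the names `build_prompt` and `answer_context` are hypothetical.

```python
def build_prompt(question, doc_hits, memory_hits):
    """Assemble one prompt from document excerpts and remembered exchanges."""
    doc_block = "\n".join(f"- {hit}" for hit in doc_hits) or "- (none)"
    memory_block = "\n".join(f"- {hit}" for hit in memory_hits) or "- (none)"
    return (
        "Answer using only the context below.\n\n"
        f"Document excerpts:\n{doc_block}\n\n"
        f"Earlier conversation:\n{memory_block}\n\n"
        f"Question: {question}\n"
    )


def answer_context(question, search_documents, search_memory, top_k=3):
    """Query both collections, then merge the results into a single prompt."""
    doc_hits = search_documents(question, top_k)
    memory_hits = search_memory(question, top_k)
    return build_prompt(question, doc_hits, memory_hits)


if __name__ == "__main__":
    # Stand-in search functions (assumptions) returning canned results.
    docs = lambda question, k: ["Placeholder excerpt about health data."][:k]
    memory = lambda question, k: ["User said their company works in healthcare."][:k]
    print(answer_context("Which GDPR obligations matter for my company?", docs, memory))
```

Keeping the two sources in separate labeled blocks lets the model tell "what the documents say" apart from "what the user told me earlier".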
Because VectorAI DB writes that data to disk on your machine, the memory survives a full restart. I told the agent my company was in healthcare in one session. Closed it. Reopened it. Asked which GDPR obligations mattered for my company. It remembered, and answered correctly.
Session one. I tell the agent my company is called Northwind and works in healthcare, then I close the session completely.
Session two, a brand new run. I ask what my company is and what it does. It still remembers, because the memory lives on disk inside VectorAI DB.
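The restart behavior comes down to one property: memory is written to disk, not held in the running process. The sketch below shows that property with a plain JSON file and keyword matching. It is a stand-in to illustrate persistence, not how VectorAI DB stores or searches data, and `DiskMemory`, `demo_restart`, and the agent reply text are all hypothetical.

```python
import json
import os
import tempfile


class DiskMemory:
    """Tiny persistent memory: every exchange is saved to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.entries = []
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as handle:
                self.entries = json.load(handle)

    def remember(self, user_text, agent_text):
        """Append one exchange and write the file atomically."""
        self.entries.append({"user": user_text, "agent": agent_text})
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            json.dump(self.entries, handle)
        os.replace(tmp_path, self.path)

    def recall(self, keyword):
        """Keyword match as a stand-in for vector search over memory."""
        needle = keyword.lower()
        return [
            entry for entry in self.entries
            if needle in entry["user"].lower() or needle in entry["agent"].lower()
        ]


def demo_restart():
    """Simulate two separate sessions that share only the file on disk."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "memory.json")
        session_one = DiskMemory(path)
        session_one.remember(
            "My company is called Northwind and works in healthcare.", "Noted."
        )
        del session_one  # session ends, nothing is kept in RAM
        session_two = DiskMemory(path)  # brand new object, same file
        return session_two.recall("northwind")


if __name__ == "__main__":
    print(demo_restart())
```

Swap the JSON file for a database collection and the keyword match for a vector search, and you have the shape of what the agent does across sessions.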
Step 5: The test that proves it
This is the moment that matters. The claim is “nothing hits the cloud.” So I proved it the simplest way possible.
I asked the agent a question with the wifi on. It answered. Then I turned the wifi off, on camera, and asked a follow-up that needed both the documents and its memory of our earlier chat.
It answered again. Same quality. Internet completely disconnected.
That is the whole thesis in fifteen seconds. No benchmark chart needed. The disconnected wifi icon is the proof.
Wifi on, the agent answers. Then I switch the wifi off on camera and ask a follow-up that needs both the documents and our earlier chat. Same answer quality, fully offline.
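If you want a check you can run in code alongside the wifi test, one option is to assert that every service address the agent is configured with points at the local machine. This is a sketch under assumptions: the port for the vector database is a placeholder (the article does not give one), and `is_loopback_url` and `SERVICE_URLS` are hypothetical names. It validates configuration only. It does not monitor actual network traffic.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url):
    """True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a regular hostname, so not provably local


# Assumed configuration. 11434 is Ollama's default port; 8000 is a placeholder.
SERVICE_URLS = {
    "llm": "http://localhost:11434",
    "vector_db": "http://127.0.0.1:8000",
}


def assert_all_local(service_urls):
    """Raise if any configured service would reach beyond this machine."""
    for name, url in service_urls.items():
        if not is_loopback_url(url):
            raise RuntimeError(f"{name} is not local: {url}")


if __name__ == "__main__":
    assert_all_local(SERVICE_URLS)
    print("All configured services are local.")
```

Running this at startup means a stray cloud endpoint in a config file fails loudly instead of quietly shipping data out.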
When local actually makes sense (and when it does not)
I am not going to tell you local beats cloud at everything. It does not. Here is the honest decision framework.
Use a fully local stack when:
→ You handle data that legally cannot leave your environment (health, finance, legal, defense)
→ You operate somewhere a cloud round trip is too slow or too unreliable (factory floor QA, edge devices, remote sites)
→ Cost at scale matters and you are tired of paying per token forever
→ Privacy is the product and you need to be able to prove nothing leaves the machine
Stick with cloud when:
→ You need the absolute frontier of reasoning quality for hard, open-ended tasks
→ Your workload is bursty and you do not want to manage any infrastructure
→ Your data is not sensitive and speed of shipping beats everything else
For retrieval-based agent workloads like this one, the performance gap between a good local model and a frontier cloud model is much smaller than people assume. The compliance gap and the cost gap, on the other hand, are enormous. That is the trade that makes local worth it.
What to watch out for
Four honest things I ran into.
→ Local models are smaller. For deep, open-ended reasoning a frontier cloud model still wins. For answering from your documents, a small local model is more than enough.
→ RAM matters. 8GB is the floor. It works, but you keep the model small and close other apps. 16GB or more is far more comfortable.
→ Quality needs a little tuning. How you chunk documents and how many results you pull back changes the answers. Budget an hour to get it sharp.
→ The database is the part you should not cheap out on. Running a vector database in production is real operational work. That is exactly the gap VectorAI DB is built to close.
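On the tuning point above, a simple way to spend that hour is to write down a handful of questions with the chunk you expect back, then measure how often the right chunk shows up in the top k results as you change settings. The sketch below is one hypothetical way to do that. `hit_rate_at_k` is my own helper name, and the questions, chunk ids, and rankings are invented purely to show the mechanics.

```python
def hit_rate_at_k(eval_set, retrieve, k):
    """Fraction of questions whose expected chunk id is in the top k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for question, expected_id in eval_set:
        if expected_id in retrieve(question, k):
            hits += 1
    return hits / len(eval_set)


# Invented rankings standing in for real search results (best match first).
FAKE_RANKINGS = {
    "q1": ["chunk-a", "chunk-b", "chunk-c"],
    "q2": ["chunk-x", "chunk-y", "chunk-z"],
}


def fake_retrieve(question, k):
    """Stand-in retriever: returns the first k ids from the canned ranking."""
    return FAKE_RANKINGS[question][:k]


# Each pair is (question, id of the chunk that should be retrieved).
EVAL_SET = [("q1", "chunk-b"), ("q2", "chunk-z")]

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"k={k}: hit rate {hit_rate_at_k(EVAL_SET, fake_retrieve, k):.2f}")
```

Re-run the same evaluation after each change to chunk size or result count, and keep the setting that gets the right chunk into the context without flooding the small model with extra text.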
What this actually means
For two years the default answer to “how do I build an AI agent” was “call a cloud API.” For a huge and growing set of real companies, that answer is now off the table, not because they are paranoid, but because the law, the latency, or the cost says no.
The pieces to run the whole thing yourself are finally good enough. Open models you can run on a laptop. Open embeddings. And a vector database you can run locally today and harden for production tomorrow, without taking on the risk of babysitting raw open source.
I proved it on an 8GB MacBook with the wifi off. A real team can do far more.
2026 is going to be UNFAIR for the builders who figured out local AI early.
TLDR
→ Built a private AI agent that runs fully offline on a base 8GB laptop
→ Stack: local embeddings + local LLM (Ollama) + VectorAI DB in Docker
→ VectorAI DB stores both the documents and the agent’s persistent memory
→ Turned the wifi off and it kept retrieving, reasoning, and remembering
→ Open source gives you the components, not the support or production hardening
→ The model you can DIY. The database is the piece you run with VectorAI DB instead
→ For regulated industries and edge use cases, local is no longer optional
→ You can run VectorAI DB locally today, and harden it for production when you’re ready.
The internet was never the requirement. We just assumed it was.
LFG.