@analogalok: I just got Gemma 4 26B A4B MoE model running fully locally with Hermes agent on an 8GB RTX 4060 and it's now backtestin…
Summary
A developer demonstrates running the Gemma 4 26B MoE model locally on an 8GB RTX 4060 with the Hermes agent to fully automate the backtesting of trading strategies, highlighting the growing capability of local LLMs as autonomous agents.
Cached at: 06/23/26, 03:51 PM
I just got Gemma 4 26B A4B MoE model running fully locally with Hermes agent on an 8GB RTX 4060 and it’s now backtesting trading strategies end to end, no hand holding.
If you’re a trader or work on Wall Street, you don’t want to miss this.
Yes, fully automated. No cloud. No APIs beyond market data.
Here’s what I did:
Setup:
- Model: Gemma 4 26B-A4B QAT (MoE), Unsloth's Q4_K_XL quant (link in the comments)
- Inference: llama.cpp (turboquant fork by @no_stp_on_snek, link in the comments)
- Hardware: RTX 4060, 8GB VRAM + 16GB RAM only (with 50 other chrome tabs open)
- Context: 64K
llama.cpp turboquant flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 --port 8080
Turboquant helps achieve high prefill and decode throughput for interactive sessions.
Throughput with Hermes agent: decode 25+ tokens/sec, prefill 250+ tokens/sec.
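To put those rates in interactive terms, here is a rough back-of-the-envelope sketch. It treats the quoted 250 and 25 tokens/sec as flat rates, which is an assumption (real throughput usually drops as the context fills), and the prompt and reply sizes are hypothetical.

```python
def estimate_latency_seconds(prompt_tokens, reply_tokens,
                             prefill_tps=250.0, decode_tps=25.0):
    """Rough wall-clock estimate for one agent turn.

    prefill_tps and decode_tps default to the rates quoted in the post;
    treating them as constant is a simplifying assumption.
    """
    prefill_seconds = prompt_tokens / prefill_tps   # time to read the prompt
    decode_seconds = reply_tokens / decode_tps      # time to generate the reply
    return prefill_seconds + decode_seconds


# Hypothetical turn: 8,000 tokens of context, 500 tokens generated.
print(estimate_latency_seconds(8000, 500))    # 32 s prefill + 20 s decode
# Worst case for prefill: the whole 64,000-token window processed from scratch.
print(estimate_latency_seconds(64000, 0))
```

Under these assumptions a typical turn lands under a minute, while re-reading a full context from scratch would take several minutes, so prompt caching between turns matters a lot for an agent loop.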
Then I gave the agent one task:
Backtest a strategy:
- Buy when RSI crosses above 30
- Sell at +2% profit or -1% stoploss
- No overlapping positions
- Use Google stock via yfinance
- Generate a full HTML report with candlestick charts + signals
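For readers unfamiliar with the entry rule, here is a minimal sketch of "RSI crosses above 30" in plain Python. The post does not say which RSI variant or period the agent used, so the 14-period Wilder smoothing here is an assumption, and the function names are hypothetical.

```python
def _rsi_value(avg_gain, avg_loss):
    # With no losses in the window, RSI is defined as 100.
    if avg_loss == 0:
        return 100.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


def rsi(closes, period=14):
    """Wilder-smoothed RSI. Returns a list aligned with closes,
    with None where there is not enough history yet."""
    out = [None] * len(closes)
    if len(closes) <= period:
        return out
    gains = losses = 0.0
    for i in range(1, period + 1):
        change = closes[i] - closes[i - 1]
        gains += max(change, 0.0)
        losses += max(-change, 0.0)
    avg_gain, avg_loss = gains / period, losses / period
    out[period] = _rsi_value(avg_gain, avg_loss)
    for i in range(period + 1, len(closes)):
        change = closes[i] - closes[i - 1]
        avg_gain = (avg_gain * (period - 1) + max(change, 0.0)) / period
        avg_loss = (avg_loss * (period - 1) + max(-change, 0.0)) / period
        out[i] = _rsi_value(avg_gain, avg_loss)
    return out


def crosses_above(values, level=30.0):
    """Indices where the series moves from at-or-below level to above it."""
    return [
        i for i in range(1, len(values))
        if values[i - 1] is not None and values[i] is not None
        and values[i - 1] <= level < values[i]
    ]
```

Usage would look like `signals = crosses_above(rsi(closes))`, where `closes` is a list of daily closing prices. A signal fires only on the bar where the cross happens, not on every bar where RSI sits above 30.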
What happened next was wild. It didn’t just write code, it ran the entire workflow itself:
- Audited the environment (pip list, dependency check)
- Hit a ModuleNotFoundError because multiple Python installs were conflicting
- Ran where python to map every interpreter on the system
- Manually selected the correct Python 3.13 path and re-ran the script
- Wrote a clean state machine backtester (strict no-overlapping-trades logic)
- Patched a yfinance MultiIndex quirk that would've crashed the script
- Built Plotly candlestick + RSI charts with buy/sell markers
- Calculated win rate, PnL, and summary stats
- Exported a polished single-file HTML report
Check the report at the end of the video or in the comments.
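The agent's actual script is not shown in the post. As a rough illustration of what a state machine backtester with no overlapping trades can look like, here is a sketch under stated assumptions: entries and exits are evaluated on closing prices only, and the names Trade, backtest, and summarize are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    entry_index: int
    exit_index: int
    entry_price: float
    exit_price: float

    @property
    def return_pct(self):
        return (self.exit_price / self.entry_price - 1.0) * 100.0


def backtest(closes, entry_signals, take_profit=0.02, stop_loss=0.01):
    """Two-state machine: FLAT -> LONG on a signal, LONG -> FLAT on exit.

    Signals that arrive while LONG are ignored, which enforces the
    no-overlapping-positions rule. A position still open at the end
    of the data is dropped rather than force-closed.
    """
    signals = set(entry_signals)
    trades = []
    state = "FLAT"
    entry_index = entry_price = None
    for i, price in enumerate(closes):
        if state == "FLAT":
            if i in signals:
                state, entry_index, entry_price = "LONG", i, price
        else:  # LONG: check +2% target and -1% stop on the close
            change = price / entry_price - 1.0
            if change >= take_profit or change <= -stop_loss:
                trades.append(Trade(entry_index, i, entry_price, price))
                state = "FLAT"  # no re-entry on the exit bar itself
    return trades


def summarize(trades):
    """Win rate and summed per-trade return (not compounded)."""
    if not trades:
        return {"trades": 0, "win_rate": 0.0, "total_return_pct": 0.0}
    wins = sum(1 for t in trades if t.return_pct > 0)
    return {
        "trades": len(trades),
        "win_rate": wins / len(trades),
        "total_return_pct": sum(t.return_pct for t in trades),
    }


closes = [100.0, 101.0, 103.0, 100.0, 98.0, 99.0]
trades = backtest(closes, entry_signals=[0, 1, 3])
print(summarize(trades))
```

In this toy series the signal at index 1 is skipped because a position opened at index 0 is still live. Checking exits on the close means a trade can overshoot both thresholds, so a more faithful version would test intraday highs and lows.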
Biggest takeaway: local LLMs aren’t just “chat assistants” anymore. They debug their own environment, write production code, and ship a finished deliverable on consumer hardware, for $0 in API costs.
If you’re still calling local models “toys,” you’re already behind.
This is just the beginning.
Hermes agent just surpassed 1 trillion tokens in a single day on OpenRouter. Think about the scale of total token generation happening right now.
Disclaimer: This is not financial advice. Consult a professional before making any trading decisions.
Teknium 🪽 (@Teknium): Wait we actually just broke 1T tokens in a day for the first time on OpenRouter :O
Please keep contributing to the most awesome project I’ve ever worked on to help make Hermes Agent the best software stack on the planet! Thank you contributors🍻🍻