Gemma 4 2B handling structured JSON output + tool calling + reasoning traces correctly via Spring AI / LM Studio — including identifying a real Java bug in code review
User tested Gemma 4 2B running locally via LM Studio and Spring AI for structured JSON output, tool calling, and reasoning traces, finding it correctly identified a Java bug in code review and performed comparably to larger models.
Wanted to share a result I didn't expect to work. Running google/gemma-4-e2b locally through LM Studio, exposed via OpenAI-compatible endpoint, called from a Spring Boot app using Spring AI's ChatClient abstraction.

Three things I tested:

1. STRUCTURED OUTPUT (schema-conformant JSON)

Used BeanOutputConverter to force the model to return a CodeReview object with specific fields (issues, qualityScore, suggestions, summary). Sent it a Java snippet with a == vs .equals() string comparison bug.

Result: Perfect JSON, no markdown wrapping, all fields populated correctly. Correctly identified the bug AND suggested a Streams refactor. Quality score 50/100, interestingly identical to what Claude Sonnet 4.6 returned on the same input, while GPT-4o was less strict and gave 55.

2. TOOL CALLING

Registered a weather function with @Tool annotation. Asked "should I bring an umbrella in Riga?".

Result: Model correctly decided to invoke the tool, extracted "Riga" as the location parameter, received the mock weather response, and wrapped it back into natural language. No hand-holding, no "I would call the weather tool if I had access". It actually called it.

3. REASONING TRACES

LM Studio's response included a reasoning_content field showing step-by-step thinking before the final JSON output. Not just generated tokens: the model worked through the analysis explicitly:

    Thinking Process:
    1. Analyze the Request: The user wants a review...
    2. Analyze the Code: ...
    3. Identify Issues/Improvements:
       - Issue 1 (String Comparison): == vs .equals()
       - Issue 2 (Style/Readability): index-based loop vs streams
    4. Formulate Suggestions...

The full demo is in a video I made walking through the setup, including a WiFi-off test to prove the inference is genuinely local: https://youtu.be/lW0FMjDUzik

What I'm curious about:

- Has anyone benchmarked Gemma 4 2B vs Phi-4 vs Qwen 2.5 3B for structured output reliability specifically? My anecdotal experience is Gemma is more schema-faithful, but I haven't run rigorous tests.
- For tool calling with parallel function calls (multiple tools in one response), where does the smallest reliable model sit right now?
- Anyone running this size of model in production behind real workloads? I'm specifically interested in latency p99 numbers under load, not just single-request demos.
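
For anyone who wants to reproduce test 1 with their own model, here is a minimal sketch of the kind of == vs .equals() bug described above. It is an illustrative stand-in, not the exact snippet sent to the model: the class name, method names, and sample data are all hypothetical. The fixed version also shows what a Streams refactor of the index-based loop could look like.

```java
import java.util.List;

// Hypothetical example class; not the code from the original test.
class StringCompareDemo {

    // Buggy: == compares object references, not the characters in the strings.
    public static int countMatchesBuggy(List<String> names, String target) {
        int count = 0;
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i) == target) {
                count++;
            }
        }
        return count;
    }

    // Fixed: equals() compares content; the loop is replaced by a stream pipeline.
    public static long countMatchesFixed(List<String> names, String target) {
        return names.stream().filter(target::equals).count();
    }

    public static void main(String[] args) {
        // new String(...) creates distinct objects, so the reference comparison fails
        // even though the text is the same as the "riga" literal below.
        List<String> names = List.of(new String("riga"), new String("oslo"), new String("riga"));
        String target = "riga";

        System.out.println("buggy: " + countMatchesBuggy(names, target)); // prints "buggy: 0"
        System.out.println("fixed: " + countMatchesFixed(names, target)); // prints "fixed: 2"
    }
}
```

The bug is easy to miss in practice because string literals are interned, so == often appears to work in small tests and only fails once strings come from user input, files, or a network response.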
A user tested Gemma 4 12B as a coding agent in VSCodium using Pi Agent extension, successfully performing a task to create a Python script that reads logs and outputs JSON. The model handled tool use autonomously with zero bugs.
An ML team documents practical challenges encountered while fine-tuning and deploying Gemma-4, including incompatibilities with PEFT, SFTTrainer, DeepSpeed ZeRO-3, and lack of runtime LoRA serving support, along with workarounds for each issue.
A developer shares their mixed experience running Gemma4 and Qwen locally for coding tasks, noting issues with tool integration, loop handling, and task completion while asking the community for better usage strategies.
Google released Gemma 4 12B, an Apache 2.0 open-source multimodal model supporting text, vision, and audio with a 256K context window. The article provides a guide for running it locally on Macs using Ollama, LM Studio, or llama.cpp.