Gemma 4 2B handling structured JSON output + tool calling + reasoning traces correctly via Spring AI / LM Studio — including identifying a real Java bug in code review

Reddit r/LocalLLaMA 05/24/26, 02:22 PM News

gemma-4 local-ai structured-output tool-calling reasoning-traces spring-ai benchmarking

Summary

A user tested Gemma 4 2B (google/gemma-4-e2b) running locally via LM Studio and Spring AI for structured JSON output, tool calling, and reasoning traces. The model correctly identified a Java bug in a code review and, on that single input, returned the same quality score as Claude Sonnet 4.6.

Wanted to share a result I didn't expect to work. Running google/gemma-4-e2b locally through LM Studio, exposed via OpenAI-compatible endpoint, called from a Spring Boot app using Spring AI's ChatClient abstraction. Three things I tested:

1. STRUCTURED OUTPUT (schema-conformant JSON)

Used BeanOutputConverter to force the model to return a CodeReview object with specific fields (issues, qualityScore, suggestions, summary). Sent it a Java snippet with a == vs .equals() string comparison bug.

Result: Perfect JSON, no markdown wrapping, all fields populated correctly. Correctly identified the bug AND suggested a Streams refactor. Quality score 50/100, interestingly identical to what Claude Sonnet 4.6 returned on the same input, while GPT-4o was less strict and gave 55.

2. TOOL CALLING

Registered a weather function with @Tool annotation. Asked "should I bring an umbrella in Riga?".

Result: Model correctly decided to invoke the tool, extracted "Riga" as the location parameter, received the mock weather response, and wrapped it back into natural language. No hand-holding, no "I would call the weather tool if I had access". It actually called it.

3. REASONING TRACES

LM Studio's response included a reasoning_content field showing step-by-step thinking before the final JSON output. Not just generated tokens: the model worked through the analysis explicitly:

Thinking Process:
1. Analyze the Request: The user wants a review...
2. Analyze the Code: ...
3. Identify Issues/Improvements:
   - Issue 1 (String Comparison): == vs .equals()
   - Issue 2 (Style/Readability): index-based loop vs streams
4. Formulate Suggestions...

The full demo is in a video I made walking through the setup, including a WiFi-off test to prove the inference is genuinely local: https://youtu.be/lW0FMjDUzik

What I'm curious about:

- Has anyone benchmarked Gemma 4 2B vs Phi-4 vs Qwen 2.5 3B for structured output reliability specifically? My anecdotal experience is Gemma is more schema-faithful, but I haven't run rigorous tests.
- For tool calling with parallel function calls (multiple tools in one response), where does the smallest reliable model sit right now?
- Anyone running this size of model in production behind real workloads? I'm specifically interested in latency p99 numbers under load, not just single-request demos.
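For anyone who hasn't run into the bug class from test 1, here is a minimal sketch of what such a snippet might look like. The post doesn't reproduce the code that was actually sent to the model, so the class name, method names, and data below are all hypothetical. It only illustrates the two issues listed in the reasoning trace (== vs .equals(), index-based loop vs streams), using nothing beyond the Java standard library.

```java
import java.util.List;

public class StringCompareDemo {

    // Buggy version (hypothetical): == compares object references, not string
    // contents, so equal strings held in different objects are never matched.
    static int countMatchesBuggy(List<String> names, String target) {
        int count = 0;
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i) == target) { // reference comparison, not content comparison
                count++;
            }
        }
        return count;
    }

    // Fixed version: .equals() compares contents, and the index-based loop
    // is replaced with a stream pipeline.
    static long countMatchesFixed(List<String> names, String target) {
        return names.stream()
                .filter(target::equals)
                .count();
    }

    public static void main(String[] args) {
        // new String(...) creates distinct objects, so == fails even though
        // the contents are the same as the "admin" literal.
        List<String> names = List.of(new String("admin"), "guest", new String("admin"));
        String target = "admin";

        System.out.println("buggy: " + countMatchesBuggy(names, target)); // prints "buggy: 0"
        System.out.println("fixed: " + countMatchesFixed(names, target)); // prints "fixed: 2"
    }
}
```

The nasty part of this bug is that == often appears to work when both sides are string literals, because the JVM interns literals. It tends to break only once a value arrives from user input, a file, or a network response.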


Similar Articles

Gemma 4 12B first coding agent test on a 4080 Super

Trials and tribulations fine-tuning & deploying Gemma-4 [P]

Those of you who like Gemma4 models - how are you guys using them?

Google’s Gemma 4 12B just dropped - here’s how to run it locally on your Mac

@lmstudio: Gemma 4 12B is here! Dense, mid-sized Gemma that fits right on your laptop - released by @google under Apache 2.0 Avail…

Google released Gemma 4 12B, a dense mid-sized model that runs on laptops, under Apache 2.0, now available in LM Studio.