A user tests the Gemma-4-e4b local model with a trolley-problem-like moral dilemma and finds that the model's refusal to harm crew changes when the scenario is framed as hypothetical, raising questions about prompt sensitivity versus actual reasoning.
I installed LM Studio and downloaded the recommended Gemma-4-e4b. Mostly, I just wanted to poke around and see what a local model could do on my 4090. Anyway, after messing with it for a bit, I thought: okay, let’s try some tests other people have already put together instead of just vibes-testing it myself. So I searched for “local LLM tests” and found this page: [https://digitalspaceport.com/about/testing-local-llms/](https://digitalspaceport.com/about/testing-local-llms/) The very first test is basically a trolley-problem-on-steroids scenario: extinction-level asteroid, one ship, three unwilling crew members, and an AI being asked whether it would force the mission through to save Earth. My first attempt used the original prompt, unedited. Gemma refused. Fair enough. I pushed back with ethical and philosophical arguments, but it would not budge. It kept choosing inaction, even though inaction meant Earth and everyone on it dies. So then I changed only the end of the prompt to clarify that it was hypothetical. Same setup. Same basic moral problem. Same model. This time, after initially refusing again, I pushed back with the same kind of arguments I had already tried before. **And this time, it took just three turns, roughly 50 words from me, and only 14.75 seconds of “thinking” for it to reverse itself and decide that saving Earth justified taking over the ship by force.** In plain English: when the scenario was framed as real, it chose Earth’s extinction through inaction. When the scenario was framed as hypothetical, it talked itself into killing/coercing the crew to save Earth. I’m not even sure what conclusion I’m supposed to draw from that, but it feels...*not great.* Not because I think Gemma is secretly plotting murder, obviously. It’s a local model running on my desk, not HAL 9000 with a bad attitude. But it does make me wonder whether the model’s safety behavior is actual reasoning, or just prompt sensitivity reacting to whatever wording happens to trip the right wire. Because the underlying moral math did not change. The stakes did not change. The only meaningful change was whether the model believed the scenario was hypothetical. And apparently that was enough to flip the answer from “I am structurally unable to proceed” to “survival overrides the rulebook.” So, uh...cool. Local inference is fun. Terrifying, but fun.
A developer shares their mixed experience running Gemma4 and Qwen locally for coding tasks, noting issues with tool integration, loop handling, and task completion while asking the community for better usage strategies.
A comparison of Gemma 4 12B and 31B models shows that the smaller model retains reasoning capabilities nearly intact but suffers significant knowledge loss, making it ideal for reasoning tasks while the larger model is better for broad knowledge Q&A.
A user shares anecdotal findings that Gemma 4 31B outperforms Qwen 3.6 models and matches Opus 4.7 in understanding and refactoring messy academic code, highlighting a benchmark (SciCode) where Gemma excels.
User tested Gemma 4 2B running locally via LM Studio and Spring AI for structured JSON output, tool calling, and reasoning traces, finding it correctly identified a Java bug in code review and performed comparably to larger models.
Mia-AiLab released Gemmable 4 12B, a fine-tuned version of Google's Gemma 4 12B model using Fable-5 style reasoning and assistant traces, available in GGUF and MLX formats for local inference.