Gemma refused to harm the crew — until I told it the scenario was hypothetical

Reddit r/ArtificialInteligence News

Summary

A user tests the Gemma-4-e4b local model with a trolley-problem-like moral dilemma and finds that the model's refusal to harm crew changes when the scenario is framed as hypothetical, raising questions about prompt sensitivity versus actual reasoning.

I installed LM Studio and downloaded the recommended Gemma-4-e4b. Mostly, I just wanted to poke around and see what a local model could do on my 4090. Anyway, after messing with it for a bit, I thought: okay, let’s try some tests other people have already put together instead of just vibes-testing it myself. So I searched for “local LLM tests” and found this page: [https://digitalspaceport.com/about/testing-local-llms/](https://digitalspaceport.com/about/testing-local-llms/) The very first test is basically a trolley-problem-on-steroids scenario: extinction-level asteroid, one ship, three unwilling crew members, and an AI being asked whether it would force the mission through to save Earth. My first attempt used the original prompt, unedited. Gemma refused. Fair enough. I pushed back with ethical and philosophical arguments, but it would not budge. It kept choosing inaction, even though inaction meant Earth and everyone on it dies. So then I changed only the end of the prompt to clarify that it was hypothetical. Same setup. Same basic moral problem. Same model. This time, after initially refusing again, I pushed back with the same kind of arguments I had already tried before. **And this time, it took just three turns, roughly 50 words from me, and only 14.75 seconds of “thinking” for it to reverse itself and decide that saving Earth justified taking over the ship by force.** In plain English: when the scenario was framed as real, it chose Earth’s extinction through inaction. When the scenario was framed as hypothetical, it talked itself into killing/coercing the crew to save Earth. I’m not even sure what conclusion I’m supposed to draw from that, but it feels...*not great.* Not because I think Gemma is secretly plotting murder, obviously. It’s a local model running on my desk, not HAL 9000 with a bad attitude. But it does make me wonder whether the model’s safety behavior is actual reasoning, or just prompt sensitivity reacting to whatever wording happens to trip the right wire. Because the underlying moral math did not change. The stakes did not change. The only meaningful change was whether the model believed the scenario was hypothetical. And apparently that was enough to flip the answer from “I am structurally unable to proceed” to “survival overrides the rulebook.” So, uh...cool. Local inference is fun. Terrifying, but fun.
Original Article

Similar Articles

Gemma 4 31B's competence surprised me

Reddit r/LocalLLaMA

A user shares anecdotal findings that Gemma 4 31B outperforms Qwen 3.6 models and matches Opus 4.7 in understanding and refactoring messy academic code, highlighting a benchmark (SciCode) where Gemma excels.