poolside/Laguna-M.1 · Hugging Face - 225B-A23B
Summary
Poolside releases Laguna M.1, a 225B parameter Mixture-of-Experts model with 23B activated parameters per token, designed for agentic coding and long-horizon tasks. It achieves competitive results on SWE-bench benchmarks and is released under an Apache 2.0 license.
Cached at: 06/18/26, 05:33 PM
Source: https://huggingface.co/poolside/Laguna-M.1
Get an API key · Release blog post · Technical report
Laguna M.1 is a 225B total parameter Mixture-of-Experts model with 23B activated parameters per token designed for agentic coding and long-horizon work.
For more details on how we trained this model, including our Model Factory approach, post-training recipe, async off-policy agent RL, and evaluations, check out our release blog post and technical report.
Highlights
- Large sparse MoE for agentic coding: Laguna M.1 is a 70-layer MoE transformer with 225B total parameters and 23B activated parameters per token
- High-capacity expert routing: After 3 dense SwiGLU layers, Laguna M.1 uses 67 sparse MoE layers with 256 experts, top-k=16 routing and auxiliary-loss-free load balancing
- Global attention architecture: Laguna M.1 uses global attention across all layers with 64 Q-heads, 8 KV-heads and softplus attention output gating
- Native reasoning support: Interleaved thinking between tool calls with support for enabling and disabling thinking per-request
- Strong agentic benchmark performance: Laguna M.1 is competitive with state-of-the-art open-weight and frontier models on SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro and Terminal-Bench 2.0
- Apache 2.0 license: Use and modify freely for commercial and non-commercial purposes
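The expert-routing highlight above names top-k=16 routing over 256 experts with auxiliary-loss-free load balancing, but the card does not spell out the router itself. The following is a minimal sketch of one common bias-based formulation of that idea, not Poolside's implementation: the sigmoid score function, the `step` size, the toy batch, and the helper names `route_token` and `update_bias` are all assumptions, and the shared expert is not modelled.

```python
import math
import random

def route_token(scores, bias, top_k):
    """Pick top_k experts using biased scores; gate with the unbiased scores."""
    # Selection uses score + bias, so under-used experts are picked more often.
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    # Gate weights use the original scores only, normalised over the chosen experts.
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}

def update_bias(bias, load, step=0.001):
    """Nudge each expert's bias toward balanced load, with no auxiliary loss term."""
    mean_load = sum(load) / len(load)
    return [b + step if n < mean_load else b - step if n > mean_load else b
            for b, n in zip(bias, load)]

NUM_EXPERTS, TOP_K = 256, 16  # values from the model card

rng = random.Random(0)
bias = [0.0] * NUM_EXPERTS
load = [0] * NUM_EXPERTS
for _ in range(512):  # hypothetical toy batch of 512 tokens
    # Hypothetical router affinities in (0, 1): a sigmoid of random logits.
    scores = [1 / (1 + math.exp(-rng.gauss(0, 1))) for _ in range(NUM_EXPERTS)]
    gates = route_token(scores, bias, TOP_K)
    for e in gates:
        load[e] += 1
bias = update_bias(bias, load)  # applied once per batch in this sketch
```

Because the bias only affects which experts are selected and never the gate weights, balancing pressure does not add a gradient term to the training loss, which is the property the "auxiliary-loss-free" name refers to.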
Model overview
- Training: pre-training, post-training and reinforcement learning stages
- Number of parameters: 225B total with 23B activated per token
- Optimizer: Muon
- Layers: 70 layers with global attention
- Experts: 256 experts with 1 shared expert; top-k=16 routing
- Dense layers: first 3 layers are dense SwiGLU; remaining 67 layers are sparse MoE
- Attention: 64 Q-heads, 8 KV-heads, head dimension 128, with softplus attention output gating
- Positional encoding: RoPE with YaRN
- Modality: text-to-text
- Context window: 262,144 tokens
- Reasoning support: interleaved thinking with preserved thinking
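The attention figures above are enough for a rough KV-cache estimate. This is back-of-the-envelope arithmetic rather than a measured number: it assumes every one of the 70 layers keeps a standard key/value cache, that the cache is stored in BF16 (2 bytes per value), and that no cache quantization or paging overhead applies.

```python
LAYERS = 70               # from the model card
KV_HEADS = 8              # from the model card
HEAD_DIM = 128            # from the model card
CONTEXT_TOKENS = 262_144  # from the model card
BYTES_PER_VALUE = 2       # assumption: BF16 cache, no KV quantization

def kv_cache_bytes(tokens, layers=LAYERS, kv_heads=KV_HEADS,
                   head_dim=HEAD_DIM, bytes_per_value=BYTES_PER_VALUE):
    # The factor of 2 covers the separate key and value tensors in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(CONTEXT_TOKENS)
print(f"{per_token / 1024:.0f} KiB per token")              # 280 KiB per token
print(f"{full_context / 1024**3:.0f} GiB at full context")  # 70 GiB at full context
```

With 64 Q-heads sharing 8 KV-heads, the cache is 8x smaller than it would be if each query head kept its own keys and values.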
Benchmark results
| Model | Parameters | SWE-bench Verified | SWE-bench Multilingual | SWE-bench Pro (Public Dataset) | Terminal-Bench 2.0 |
|---|---|---|---|---|---|
| Laguna M.1 | 225B-A23B | 74.6% | 63.1% | 49.2% | 45.8% |
| Devstral 2 | 123B dense | 72.2% | 61.3% | - | 32.6% |
| GLM-4.7 | 355B-A32B | 73.8% | 66.7% | - | 41.0% |
| DeepSeek-V4 Flash | 284B-A13B | 79.0% | 73.3% | 52.6% | 56.9% |
| Qwen3.5-397B-A17B | 397B-A17B | 76.2% | 69.3% | 50.9% | 52.5% |
| Claude Sonnet 4.6 | - | 79.6% | - | - | 59.1% |

We used the highest publicly-referenced scores for all comparison models across each benchmark. In almost all cases these were official scores published in release blog posts or equivalent, with Claude Sonnet 4.6 shown as a frontier proprietary reference of comparable model size. “-” indicates a score not reported by the model provider.

All benchmarking for Laguna M.1 was completed using our pool agent harness, with a maximum of 500 steps and sandboxed execution. The same sampling parameters were used for all Laguna M.1 benchmarking: temperature=1.0 and top_k=20, with thinking mode enabled and a context length of 256K tokens. All tasks were run in their own sandbox using 8 GB RAM/2 CPUs, with the exception of Terminal-Bench 2.0, which used 48 GB RAM/32 CPUs.

Some base task images and verifiers were patched to fix infrastructure reliability issues inherent in task setup, such as rate limits on third-party dependencies in external registries used by the verifier. All four agentic benchmarks were run with patched images. We also ran a reward-hack judge post-hoc on Laguna M.1 evaluation runs and did not find significant reward hacking after joint judge review and manual review.

- SWE-bench Verified: mean pass@1 averaged over 4 runs
- SWE-bench Multilingual: mean pass@1 averaged over 4 runs
- SWE-Bench Pro: mean pass@1 averaged over 4 runs
- Terminal-Bench 2.0: mean pass@1 averaged over 4 runs; 48 GB RAM/32 CPUs
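For readers unfamiliar with the metric, "mean pass@1 averaged over 4 runs" can be read as: score each run as the fraction of tasks resolved on a single attempt, then average those fractions. The sketch below illustrates that reading with made-up outcomes; the task results and the helper name `mean_pass_at_1` are hypothetical and are not taken from the actual evaluation harness.

```python
from statistics import mean

def mean_pass_at_1(runs):
    """runs: list of runs, each a list of booleans (task resolved or not)."""
    per_run = [sum(run) / len(run) for run in runs]
    return mean(per_run), per_run

# Hypothetical outcomes for 5 tasks across 4 independent runs.
runs = [
    [True, True, False, True, False],
    [True, False, False, True, True],
    [True, True, False, True, True],
    [True, True, False, False, False],
]
score, per_run = mean_pass_at_1(runs)
print(per_run)
print(f"mean pass@1 = {score:.1%}")
```

Averaging over several runs matters here because sampling at temperature=1.0 makes individual runs vary.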
Usage
Laguna M.1 has upstream support in vLLM, SGLang, and Transformers, as well as in TRT-LLM thanks to the team at NVIDIA.
pool
pool is a lightweight terminal-based coding agent and a dual Agent Client Protocol client-server.
Download and install for macOS and Linux:
curl -fsSL https://downloads.poolside.ai/pool/install.sh | bash
Launch and Log in with Poolside to get a free API key.
pool
Use in any ACP client. Configure Zed and JetBrains automatically:
pool acp setup --editor zed|jetbrains
Feedback and issues
Submit feedback with /feedback and read the full documentation on GitHub.
Deployment
vLLM
Laguna support landed in vLLM via vllm-project/vllm#41129 (shared with Laguna XS.2) and is available in vLLM 0.21.0 and later.
Serve Laguna M.1 locally with vLLM and query it from any OpenAI-compatible client (see Controlling reasoning for tool calls, streaming, and reasoning extraction):
pip install 'vllm>=0.21.0'
vllm serve \
--model poolside/Laguna-M.1 \
--tool-call-parser poolside_v1 \
--reasoning-parser poolside_v1 \
--enable-auto-tool-choice \
--served-model-name laguna \
--default-chat-template-kwargs '{"enable_thinking": true}'
For additional deployment guidance, see the vLLM recipes page for our Laguna XS.2 model, which shares the same implementation. FP8 and NVFP4 quantized checkpoints are available at Laguna-M.1-FP8 and Laguna-M.1-NVFP4; quantization is detected automatically from quantization_config, so the same command works with the model ID substituted.
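Once the `vllm serve` command above is running, a quick smoke test needs nothing beyond the standard library. The sketch below assumes the server listens on vLLM's default `http://localhost:8000` and uses the `laguna` name set by `--served-model-name`; `build_chat_request` and `ask` are hypothetical helper names, and the per-request `chat_template_kwargs` field mirrors the one used in the Disabling reasoning section.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port

def build_chat_request(prompt, model="laguna", enable_thinking=True):
    """Return a urllib Request for the OpenAI-compatible /chat/completions route."""
    body = {
        "model": model,  # matches --served-model-name in the serve command
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt):
    # Needs the server to be running; nothing is sent until this is called.
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        message = json.load(resp)["choices"][0]["message"]
    return message.get("content")
```

Calling `ask("Write a Python retry wrapper with exponential backoff.")` should return the final answer text; for tool calls and streamed reasoning, the OpenAI client shown under Controlling reasoning is the better fit.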
SGLang
Laguna support was added to SGLang in sgl-project/sglang#24204. The integration is shared with Laguna XS.2 and is currently available on SGLang main.
Laguna M.1 can be served with SGLang using its OpenAI-compatible server, including support for tool calling, streaming responses, and reasoning parsing:
# Laguna M.1 support is currently on SGLang main, so install from source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
sglang serve \
--model-path poolside/Laguna-M.1 \
--tool-call-parser poolside_v1 \
--reasoning-parser poolside_v1 \
--tp 8 \
--host 0.0.0.0
Quantized Laguna M.1 checkpoints are also available as Laguna-M.1-FP8 and Laguna-M.1-NVFP4. SGLang reads the checkpoint quantization_config, so you can use the same launch command after replacing the model ID. For more SGLang-specific deployment details, see the SGLang Cookbook, which uses the same Laguna implementation path.
Transformers
Laguna is supported in Transformers v5.7.0 and later (huggingface/transformers#45673).
Laguna M.1 is a 225B-parameter model; loading the BF16 checkpoint in Transformers requires substantial multi-GPU memory (device_map="auto" shards across available devices). For single-node serving, vLLM or SGLang is recommended.
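To put "substantial multi-GPU memory" into rough numbers, the weights alone can be estimated from the parameter count. The sketch below is an approximation under stated assumptions: the bytes-per-parameter figures for FP8 and NVFP4 ignore scale factors and any layers a quantized checkpoint keeps in higher precision, and the 80 GB device size is a hypothetical example rather than a requirement from the model card.

```python
import math

TOTAL_PARAMS = 225e9  # from the model card

# Approximate bytes per parameter for each checkpoint format (assumption).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

def weight_gb(fmt):
    """Approximate weight footprint in GB (10^9 bytes)."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[fmt] / 1e9

def min_gpus(fmt, gpu_gb=80):
    # Lower bound only: leaves no room for KV cache or activations.
    return math.ceil(weight_gb(fmt) / gpu_gb)

for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: ~{weight_gb(fmt):.1f} GB of weights, at least {min_gpus(fmt)} x 80 GB GPUs")
```

Real deployments need headroom beyond this lower bound for the KV cache, activations, and framework overhead.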
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "poolside/Laguna-M.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)
messages = [
    {"role": "user", "content": "Write a Python retry wrapper with exponential backoff."},
]
# Reasoning is on by default; pass enable_thinking=False to skip the <think> block.
# return_dict=True returns input_ids and attention_mask together.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    enable_thinking=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=1.0, top_k=20)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
TRT-LLM
Laguna is supported in TensorRT-LLM thanks to the team at NVIDIA. Model support landed in NVIDIA/TensorRT-LLM#13559, with partial-RoPE fusion added in #15110. Build TensorRT-LLM from a main that includes these PRs (or a release once they ship).
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="poolside/Laguna-M.1", trust_remote_code=True)
sampling = SamplingParams(max_tokens=1024, temperature=1.0, top_k=20)
out = llm.generate(["Write a Python retry wrapper with exponential backoff."], sampling)
print(out[0].outputs[0].text)
If your TensorRT-LLM build pins transformers < 4.58, configuration_laguna.py needs a small compat shim; use the laguna_minimal_overlay.sh helper from the support PR and load TRT-LLM against the overlay directory.
Quantization is detected automatically from quantization_config, so the same recipe works for the FP8 and NVFP4 variants with no extra flags.
Controlling reasoning
Laguna M.1 has native reasoning support and is designed to work best with preserved thinking, where reasoning content from prior assistant messages is preserved in the message history. This model will generally reason before calling tools and between tool calls.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.poolside.ai/v1",
    api_key="...",
)
model = "poolside/laguna-m.1"
tools = [{"type": "function", "function": {
    "name": "shell",
    "description": "Execute a bash command and return the output.",
    "parameters": {"type": "object", "properties": {"cmd": {"type": "string"}}, "required": ["cmd"]},
}}]
messages = [
    {"role": "system", "content": "You are a coding agent with access to a shell tool."},
    {"role": "user", "content": "Run uname -a"},
]

# Thinking is enabled by default when the server sets --default-chat-template-kwargs '{"enable_thinking": true}'
# When using the Poolside API (https://inference.poolside.ai/v1), this flag is set by default
response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    stream=True,
)

# Accumulate streamed reasoning, content, and tool-call fragments.
reasoning, content, tool_calls = "", "", []
for chunk in response:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        reasoning += delta.reasoning_content
    if hasattr(delta, "content") and delta.content:
        content += delta.content
    if hasattr(delta, "tool_calls") and delta.tool_calls:
        for tc in delta.tool_calls:
            # A new index means a new tool call; later chunks extend its arguments.
            if tc.index >= len(tool_calls):
                tool_calls.append({"id": tc.id, "function": {"name": "", "arguments": ""}})
            if tc.function.name:
                tool_calls[tc.index]["function"]["name"] = tc.function.name
            if tc.function.arguments:
                tool_calls[tc.index]["function"]["arguments"] += tc.function.arguments
print(f"Reasoning: {reasoning}\nContent: {content}\nTool calls: {tool_calls}\n")

# Return reasoning in the next request for best performance
messages.append({
    "role": "assistant",
    "content": content,
    "reasoning_content": reasoning,
    "tool_calls": [{"id": tc["id"], "type": "function", "function": tc["function"]} for tc in tool_calls]
})
messages.append({
    "role": "tool",
    "tool_call_id": tool_calls[0]["id"],
    "content": json.dumps({"stdout": "Darwin arm64", "exit_code": "0"})
})
response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    stream=True,
)
reasoning, content = "", ""
for chunk in response:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        reasoning += delta.reasoning_content
    if hasattr(delta, "content") and delta.content:
        content += delta.content
print(f"Reasoning: {reasoning}\nContent: {content}")
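The example above hard-codes the tool result. In a real agent loop, each accumulated tool call is executed and its output is sent back as a `tool` message. The following is a minimal sketch of that step for the `shell` tool defined above; `run_shell_tool`, the 30-second timeout, and the `stderr` field are illustrative assumptions. Model-generated commands should only be run inside a sandbox.

```python
import json
import subprocess

def run_shell_tool(tool_call, timeout=30):
    """Execute one accumulated `shell` tool call and build the tool message."""
    # The arguments arrive as a JSON string assembled from streamed fragments.
    args = json.loads(tool_call["function"]["arguments"])
    proc = subprocess.run(
        args["cmd"], shell=True, capture_output=True, text=True, timeout=timeout
    )
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps({
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "exit_code": proc.returncode,
        }),
    }

# Hand-written tool call in the same shape as the accumulated tool_calls entries.
example = {"id": "call_0", "function": {"name": "shell", "arguments": '{"cmd": "echo hello"}'}}
print(run_shell_tool(example)["content"])
```

Appending one such message per tool call, after the assistant message that carries `reasoning_content`, keeps the history in the preserved-thinking form described above.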
Disabling reasoning
You can disable thinking by setting enable_thinking to False in a request, or by not providing --default-chat-template-kwargs '{"enable_thinking": true}' or equivalent when starting the server.
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="poolside/laguna-m.1",
    messages=[
        {"role": "user", "content": "Write a retry wrapper with exponential backoff."}
    ],
    # chat_template_kwargs is passed through to the server's chat template.
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False},
    },
    stream=True,
)
for chunk in completion:
    print(chunk.choices[0].delta)
For agentic coding use cases, we recommend enabling thinking and preserving reasoning in message history as outlined in the Controlling reasoning section.
License
This model is licensed under the Apache 2.0 License.
Intended and Responsible Use
Laguna M.1 is designed for software engineering and agentic coding use cases, and you are responsible for confirming that it is appropriate for your intended application. Laguna M.1 is subject to the Apache 2.0 License, and should be used consistently with Poolside’s Acceptable Use Policy. We advise against circumventing Laguna M.1 safety guardrails without implementing substantially equivalent mitigations appropriate for your use case.
Please report security vulnerabilities or safety concerns to [email protected].