StepFun 3.7 Flash - Speed Benchmark on M5 Max

Reddit r/LocalLLaMA 05/29/26, 04:04 AM Models

stepfun-3.7-flash llama-cpp benchmark inference-speed m5-max apple-silicon

Summary

Benchmark results for the StepFun 3.7 Flash model running on an M5 Max via llama.cpp, showing prompt processing and token generation speeds at context lengths from 0 to 64k tokens.

Just ran a benchmark with the llama.cpp branch that shipped on day 0.

M5 Max, 128 GB, Q4_K_S. Memory peaks at around ~120+ GB, which makes things sluggish, but it's still usable once cmd+tab lands.

Short context (< 16k) feels fast and very responsive. Speed at 32k-64k is not bad, still usable.

|PP|TG|B|N_KV|T_PP s|S_PP t/s|T_TG s|S_TG t/s|T s|S t/s|
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
|0|128|1|128|0.000|nan|2.038|62.80|2.038|62.80|
|2048|128|1|2176|1.938|1056.65|2.115|60.52|4.053|536.88|
|8192|128|1|8320|9.153|895.01|2.233|57.32|11.386|730.71|
|16384|128|1|16512|22.428|730.52|2.475|51.71|24.903|663.05|
|32768|128|1|32896|64.539|507.73|2.818|45.43|67.356|488.39|
|65536|128|1|65664|178.227|367.71|3.774|33.92|182.001|360.79|

Now the Pelican bench: a very nice one, but with quite a long hand lol

https://preview.redd.it/322rt8n4304h1.png?width=780&format=png&auto=webp&s=e34efc12f6d96a22d27038a642c3c198b7b292e2
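For anyone reading the benchmark table, here is a rough sketch of how the columns appear to relate. It assumes PP and TG are prompt and generated token counts, T_* are wall-clock seconds, and S_* are tokens per second. The helper names `throughput` and `derive` are made up for illustration and are not part of llama.cpp. Recomputed values can differ from the posted ones in the last digits because the posted times are already rounded.

```python
import math

# Rows copied from the post: (PP tokens, TG tokens, T_PP seconds, T_TG seconds)
ROWS = [
    (0, 128, 0.000, 2.038),
    (2048, 128, 1.938, 2.115),
    (8192, 128, 9.153, 2.233),
    (16384, 128, 22.428, 2.475),
    (32768, 128, 64.539, 2.818),
    (65536, 128, 178.227, 3.774),
]


def throughput(tokens, seconds):
    """Tokens per second; NaN when no time was spent (the PP=0 row)."""
    return tokens / seconds if seconds > 0 else math.nan


def derive(pp, tg, t_pp, t_tg):
    """Recompute the derived columns from token counts and timings (assumed relations)."""
    total = t_pp + t_tg
    return {
        "n_kv": pp + tg,                  # N_KV: tokens held in the KV cache (batch size 1)
        "s_pp": throughput(pp, t_pp),     # S_PP: prompt processing speed
        "s_tg": throughput(tg, t_tg),     # S_TG: token generation speed
        "t": total,                       # T: total time
        "s": throughput(pp + tg, total),  # S: overall speed
    }


for row in ROWS:
    d = derive(*row)
    print(f"PP={row[0]:>6}  S_PP={d['s_pp']:8.2f}  S_TG={d['s_tg']:6.2f}  "
          f"T={d['t']:8.3f}  S={d['s']:7.2f}")
```

The overall S column is dominated by prompt processing at long context, so S_TG is probably the better number to watch for interactive use.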


Similar Articles

StepFun 3.7 Flash

Reddit r/LocalLLaMA

StepFun released Step 3.7 Flash, a high-efficiency multimodal model optimized for real-world agentic tasks, featuring improved coding benchmarks (SWE-Bench Pro, Terminal-Bench) and compatibility with multiple agent harnesses.

StepFun Says Step 3.7 Flash Matches 97% of Claude Opus 4.6's Coding Performance at One-Ninth the Cost

Reddit r/ArtificialInteligence

StepFun's Step 3.7 Flash, a 198B sparse MoE model with 11B active parameters, matches 97% of Claude Opus 4.6's coding performance on SWE-Bench Verified at roughly one-ninth the cost, using an Advisor Mode strategy that reserves expensive frontier model calls for critical decision points.

Stepfun 3.7 Flash is very good

Reddit r/LocalLLaMA

Stepfun 3.7 Flash is a compact vision model that achieves aesthetics close to GLM 5.1 and 80% of its 3D world understanding, while using only 25% of the parameters, making it highly RAM-efficient.

@AdinaYakup: Step-3.7-Flash New VL model from @StepFun_ai 198B / 11B active - MoE 256K context 3 reasoning level Up to 400 tokens/sec

X AI KOLs Timeline

StepFun releases Step-3.7-Flash, a new large vision-language MoE model with 198B parameters (11B active), 256K context, and up to 400 tokens/sec inference speed.

@NielsRogge: Impressive release by StepFun, explore it at https://paperswithcode.co/paper/83892