coding-tasks


DeepSWE benchmarks indicate that DeepSeek v4 Pro only passes 8% of tasks

Reddit r/LocalLLaMA · 5d ago

A discussion of DeepSWE benchmark results in which DeepSeek v4 Pro passes only 8% of tasks, a surprisingly low figure given its performance on similar tasks.


Gave GPT-4o and Claude the exact same double pendulum prompt. They picked opposite angle conventions within seconds.

Reddit r/ArtificialInteligence · 2026-05-16

An experiment that gives GPT-4o, Claude 3.5 Sonnet, and other models the same double pendulum prompt shows that they pick opposite angle conventions, which produces an immediately visible mismatch in a shared renderer. The split is not random across model families, which suggests a bias in the training-data distribution for classical mechanics problems.
