Fine-tuned Qwen2.5-7B to 96% of Claude Haiku on a domain-specific task using ~$3 of API calls and zero human labelers

Reddit r/LocalLLaMA Papers

Summary

DV-DPO is a method for fine-tuning Qwen2.5-7B on a domain-specific task using only ~$3 in API calls and zero human labelers. Training pairs come from adversarial cross-examination, and the fine-tuned model reaches 96% of Claude Haiku's composite score.

I built a decision-reasoning engine (Orlog) and wanted to fine-tune a local model for it instead of paying per call forever.

**The method (DV-DPO):**

* Run a 3-voice council on each question and produce a synthesis
* Cross-examine: the losing voices challenge the synthesis
* If the synthesis gets revised → DPO pair (chosen=post-revision, rejected=pre-revision)
* If the synthesis holds → no pair (good reasoning produces nothing to learn from)

Only genuine revisions under adversarial pressure become training signal. Not format preference, not sampling variance.

**Results:**

* 1,040 pairs total (~$3 at Haiku rates)
* Head-to-head vs Claude Haiku: Format 100%, Commits 100%, Context 89%, Composite 96%
* Latency: 11s for the local model (T4 GPU, 4-bit quantized) vs 3s for Haiku
* Adversarial failure rate: 2% on 96 targeted questions

**Autonomous loop now running:** failure_detector → auto_red_team → DPO pairs → retrain → redeploy → eval. v5 pairs are accumulating.

GGUF ready for Ollama. Happy to share the pipeline if there's interest.
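The pair-selection rule in the method above can be sketched roughly as follows. This is a minimal illustration, not the actual Orlog pipeline: `synthesize` and `cross_examine` are hypothetical callables standing in for the council and the cross-examination step, and treating "revised" as "the text changed after whitespace normalization" is an assumption the post does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class DPOPair:
    prompt: str
    chosen: str    # synthesis after cross-examination (post-revision)
    rejected: str  # synthesis before cross-examination (pre-revision)


def build_pair(
    question: str,
    synthesize: Callable[[str], str],          # hypothetical: council -> synthesis
    cross_examine: Callable[[str, str], str],  # hypothetical: challenge -> final synthesis
) -> Optional[DPOPair]:
    """Return a DPO pair only if cross-examination changed the synthesis."""
    before = synthesize(question)
    after = cross_examine(question, before)
    # Assumption: a "revision" is any change in text after collapsing whitespace.
    if " ".join(after.split()) == " ".join(before.split()):
        return None  # the synthesis held, so there is no training signal
    return DPOPair(prompt=question, chosen=after, rejected=before)


def collect_pairs(
    questions: Iterable[str],
    synthesize: Callable[[str], str],
    cross_examine: Callable[[str, str], str],
) -> List[DPOPair]:
    """Run the rule over many questions and keep only the genuine revisions."""
    pairs = []
    for question in questions:
        pair = build_pair(question, synthesize, cross_examine)
        if pair is not None:
            pairs.append(pair)
    return pairs
```

The `prompt` / `chosen` / `rejected` field names follow the layout that common DPO trainers (for example TRL's `DPOTrainer`) accept, so the collected pairs could be dumped to JSONL and fed to such a trainer. A real pipeline would likely need a stricter revision check than a text comparison, since a reworded but substantively identical synthesis would otherwise count as a revision.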