A personal project led to an ACL 2026 paper introducing TIME, a method training Qwen3 models to engage in short, context-triggered thinking rather than excessive reasoning. The work uses QLoRA and a four-phase curriculum, with all data and code released open-source.
Started this as a personal project for my Open-WebUI setup. Somehow it ended up as an **ACL 2026** paper. Not a lab paper; it's a solo, independent paper that just happened.

**TIME** is basically my attempt to train **Qwen3** models to think in short bursts wherever the response actually needs it, instead of dumping one giant reasoning block at the start. Not just "make thinking shorter", "turn thinking on/off per task", or "split thinking into interleaved reasoning for the task". More like: let the model re-think mid-response when the context gives it a reason to.

The temporal part came in because time is a really clean way to model latent context changes: silence, gaps, stale assumptions, deadlines, timezone shifts, etc. Also, time just matters in a ton of normal conversations.

Funny side effect: it also helps with what I think of as the **QwQ** problem. **QwQ** was the **OG overthinker benchmaxxing** model, and the **Qwen** line still has this vibe where thinking mode can burn 10k tokens on even trivial stuff like "hi".

Methods side: **QLoRA** on **Qwen3** 4B/8B/14B/32B, four-phase curriculum, **Unsloth**, **vLLM** eval, TIMEBench benchmark. Trained locally on my own PC: 7950X3D, 128 GB RAM, RTX Pro 6000 Blackwell 96 GB.

All notebooks and data are available, so anyone can replicate it easily (24 GB VRAM is enough for training up to 14B, 48 GB is enough for 32B). I intend to do the same on **Qwen3.5** and **Qwen3.6** later to see if I can reduce overthinking issues.

Model uploads are taking time because the merged checkpoints are huge, but datasets, notebooks, scripts, training curriculum, and eval harness are up.
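To make the "silence and gaps" idea a bit more concrete, here is a rough illustration of how a chat frontend could surface elapsed time between turns so a model has something to react to. This is not the TIME pipeline or its data format: `describe_gap`, the `STALE_AFTER` threshold, and the bracketed note format are all hypothetical, made up for this sketch.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: gaps longer than this count as a "silence"
# worth surfacing. Not a value taken from the paper.
STALE_AFTER = timedelta(hours=6)


def describe_gap(prev_turn, this_turn, stale_after=STALE_AFTER):
    """Return a short note about the elapsed time, or None if the gap is small."""
    gap = this_turn - prev_turn
    if gap < stale_after:
        return None
    hours = gap.total_seconds() / 3600
    return f"[{hours:.1f} hours since the previous message]"


# Example: the user comes back the next morning.
prev_turn = datetime(2026, 1, 5, 22, 0, tzinfo=timezone.utc)
this_turn = datetime(2026, 1, 6, 9, 30, tzinfo=timezone.utc)
print(describe_gap(prev_turn, this_turn))
# prints: [11.5 hours since the previous message]
```

A note like this is the kind of context change that could justify a short re-think (plans made "tonight" are now stale), while a five-minute gap returns `None` and needs nothing extra.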
**Paper**: [https://arxiv.org/abs/2601.05300v2](https://arxiv.org/abs/2601.05300v2)

**TIME repo** (Data and Code): [https://github.com/The-Coherence-Initiative/TIME](https://github.com/The-Coherence-Initiative/TIME)

**TIMEBench repo**: [https://github.com/The-Coherence-Initiative/TIMEBench](https://github.com/The-Coherence-Initiative/TIMEBench)
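For anyone sanity-checking the 24 GB / 48 GB figures, a back-of-envelope sketch of the base-weight footprint under QLoRA-style 4-bit quantization is below. It assumes the nominal parameter counts in the model names and exactly half a byte per parameter; it ignores quantization constants, LoRA adapters, optimizer state, and activations, so real usage is higher. `four_bit_weight_gb` is a helper invented for this illustration.

```python
def four_bit_weight_gb(params_billions):
    """Approximate size of 4-bit quantized base weights in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = 0.5  # 4 bits = half a byte (assumption: no quantization overhead)
    return params_billions * 1e9 * bytes_per_param / 1e9


# Nominal sizes from the model names, not exact parameter counts.
for size in (4, 8, 14, 32):
    print(f"Qwen3-{size}B: ~{four_bit_weight_gb(size):.0f} GB of 4-bit weights")
# prints:
# Qwen3-4B: ~2 GB of 4-bit weights
# Qwen3-8B: ~4 GB of 4-bit weights
# Qwen3-14B: ~7 GB of 4-bit weights
# Qwen3-32B: ~16 GB of 4-bit weights
```

The gap between these numbers and the stated VRAM budgets is the headroom left for adapters, optimizer state, and activations at training sequence lengths.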
TEMPO introduces a test-time training framework that alternates policy refinement with critic recalibration to prevent diversity collapse and sustain performance gains in large reasoning models, boosting AIME 2024 scores for Qwen3-14B from 42.3% to 65.8%.
This paper presents a full-pipeline recipe for teaching thinking models to reason with tools, achieving state-of-the-art performance on benchmarks like AIME 2025 when applied to Qwen3 models.
This paper reveals that aggressive post-training quantization of reasoning models leads to increased overthinking errors, where models reach correct intermediate answers but fail to finalize them. A simple logit penalty on overthinking markers reduces chain-of-thought length by 12-23% while improving accuracy, especially for quantized models.
The author shares their experience running Qwen3.6 35B-A3B locally on an ASUS Zenbook Pro 14, achieving 27 TPS at 32k context, marking a personal milestone towards fully local AI for privacy.