Good Summarization SLMs for < 2000 tokens

Reddit r/LocalLLaMA 04/21/26, 06:37 AM News

Summary

A novice asks for recommendations on small language models and prompting strategies to build an employee note summarization engine under 2000 tokens, after experiencing hallucinations with Qwen2.5-7B-Instruct.

A novice here. I am trying to build a summarization engine for employee notes. There are between 10 and 50 notes (est. 3000-15000 tokens) that need summarizing. They already come with tags and need to be summarized into a general report of est. 200-1000 tokens. The model needs to decide which notes are too detailed and generalize several similar notes into a category (i.e. when several notes relate to the same tag category). I tried [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) with some prompting, but it is spewing hallucinations and is not usable. I tried lowering the temperature, without success. What model and what prompting would you recommend for this task?
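A minimal sketch of the preprocessing this task implies, assuming each note is a dict with hypothetical `tag` and `text` keys: the notes are grouped by tag in plain Python, and one prompt is built per group so the model only has to generalize within a single category. The function names, the sample notes, and the prompt wording are illustrative assumptions, not a tested recipe, and no particular model or inference library is assumed.

```python
from collections import defaultdict


def group_notes_by_tag(notes):
    """Group note texts by their tag, keeping tags in first-seen order."""
    groups = defaultdict(list)
    for note in notes:
        groups[note["tag"]].append(note["text"])
    return dict(groups)


def build_prompt(tag, texts, max_words=80):
    """Build one summarization prompt for all notes that share a tag.

    The wording is an assumption: it asks the model to stay within the
    notes, which is one common way to try to limit hallucinations.
    """
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(texts, start=1))
    return (
        f"Summarize the following employee notes tagged '{tag}' "
        f"in at most {max_words} words. "
        "Use only facts stated in the notes; do not add names, dates, "
        "or numbers that are not present.\n\n"
        f"Notes:\n{numbered}\n\nSummary:"
    )


# Hypothetical sample data, only to show the shapes involved.
notes = [
    {"tag": "safety", "text": "Forklift alarm failed during the morning shift."},
    {"tag": "scheduling", "text": "Two people asked to swap Friday shifts."},
    {"tag": "safety", "text": "Wet floor near dock 3 was not marked."},
]

groups = group_notes_by_tag(notes)
prompts = {tag: build_prompt(tag, texts) for tag, texts in groups.items()}
```

Each entry in `prompts` can then be sent to whichever model is being evaluated, and the per-tag summaries concatenated into the final report. Keeping each call small is a design choice that may make it easier to spot which category a hallucination came from.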
