Fine-tuning a local LLM such as Qwen 3:0.6B to categorize questions works well

Hacker News Top 2026/06/21 22:55 Tools

fine-tuning local-llm qwen unsloth question-categorization llm household-chatbot

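Summary

A developer fine-tuned the small Qwen 3 0.6B model with the Unsloth framework to categorize household questions, and got good results from a dataset of only about 850 labeled questions.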

Cached: 2026/06/22 01:34

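# Fine-Tuning a Local LLM to Categorize Questions

Source: https://www.teachmecoolstuff.com/viewarticle/fine-tuning-a-local-llm-to-categorize-questions

As a fun personal project, I have been building a chatbot that answers general questions about my household, from maintenance issues to doctor's appointments. The general idea is that the chatbot pulls household knowledge from a vector database through RAG (retrieval-augmented generation), but to get better results I made the vector search metadata-aware. Basically, I preprocess each question and classify it into one of a known set of metadata categories (for example pool, car, hvac, cooking). The main goal is to narrow the search space for vector ranking by retrieving only the index entries that match the question's category. For example, the question "When did we replace the pool pump?" is mapped to the "pool" category before the index database is queried.

The hypothesis I wanted to test in this experiment: can a very small local LLM, trained on a dataset of household-related questions, be fine-tuned to categorize questions reliably?

## LLMs

In this project I use two different local LLMs: Qwen 3:4B and Qwen 3:0.6B. The 4B-parameter version handles general question answering, while the tiny 0.6B version handles question categorization. The whole premise of the experiment is to see whether this tiny model, with only 600 million parameters, can be fine-tuned into a reliable household question classifier.

## Fine-Tuning

For fine-tuning I used Unsloth, a popular open-source framework that is well suited to fine-tuning local models such as Qwen and Llama. My initial dataset contains about 850 entries, which I split 70/15/15 into training, evaluation, and test data. The training and evaluation data are used during training, while the test set is held out for tests that run after training. Some sample data:

```json
[
  { "question": "Who cleans our gutters at the house?", "category": "gutters" },
  { "question": "Who serviced the hot water heater for the home?", "category": "water heater" },
  { "question": "Who fixed the sprinkler system in the yard?", "category": "irrigation" },
  { "question": "Which store do we usually buy pinnekjott from?", "category": "cooking" },
  { "question": "What dimensions are the air filters for the home AC?", "category": "hvac" },
  { "question": "What year did we replace the downstairs AC unit?", "category": "hvac" }
]
```

The basic idea is to train the LLM on enough household questions that it becomes a reliable question classifier.

### Baseline

Before doing any fine-tuning, it is important to establish a baseline to measure against. In this experiment, the baseline is the original Qwen 3 0.6B model without fine-tuning, attempting the task through prompting alone. Here is a sample prompt used for the baseline:

```
Classify the homeowner question into exactly one category from the list below.
Return only the category name from the list.
Never return a code, a number, a synonym, an explanation, or any other text.
The answer must be exactly one category name from the list.
Choose the best category based on the meaning of the question.

Valid categories:
- appliances
- brick work
- car
- cooking
- doorbell
- electric
- fence
- fountain
- garden lights
- gutters
- hvac
- irrigation
- mosquito
- painting
- pool
- tree service
- water heater
- window service

Question: Who installed the tankless hot water setup for the house?
Category:
```

**Baseline model accuracy:**

As one of my offline evaluation methods, I created a set of about 130 integration tests that exercise the model with scenarios from a second dataset. For the baseline model the results were poor: out of 131 tests, the model categorized only 13 questions correctly (about 10% accuracy). The summary:

```json
{
  "scenario": "baseline-category",
  "model_kind": "baseline",
  "model_name": "qwen3:0.6b",
  "label_mode": "category",
  "total": 131,
  "correct": 13,
  "incorrect": 118,
  "accuracy": 0.0992
}
```

Digging into the errors revealed some common patterns:

1. The model mostly overuses broad labels such as "appliances/electric" and misses most other categories (for example pool, cooking, hvac).
2. The model invents new categories (for example "apartments") instead of sticking to the provided list of allowed categories.

Here is an excerpt from the test report:

```json
[
  {
    "case_id": 1,
    "question": "When was the lower air conditioning system swapped out?",
    "expected_category": "hvac",
    "scenario": "baseline-category",
    "model_kind": "baseline",
    "model_name": "qwen3:0.6b",
    "label_mode": "category",
    "predicted_category": "electric",
    "correct": false
  },
  {
    "case_id": 64,
    "question": "Which painter worked on Joe's room?",
    "expected_category": "painting",
    "scenario": "baseline-category",
    "model_kind": "baseline",
    "model_name": "qwen3:0.6b",
    "label_mode": "category",
    "predicted_category": null,
    "predicted_code": null,
    "correct": false,
    "status_code": 422,
    "error": "Ollama returned an unknown category name 'apartments' from response 'apartments'"
  }
]
```

### Fine-Tuning: First Attempt

The baseline results make it clear that a small model like Qwen 3 0.6B cannot deliver reliable performance through prompting alone. In the next experiment I used the same prompt as before, but fine-tuned the model to teach it to categorize more accurately. I have included the fine-tuning script here (https://github.com/thelgevold/fine-tuned-classifier/blob/main/fine-tuning/train_categories.py) if you are interested. Overall, I used Unsloth with QLoRA as the fine-tuning strategy.

One note: the default fine-tuning parameters that Unsloth provides are a very good starting point. In my experience, building a good dataset matters more than over-tuning Unsloth's parameters, at least when you are getting started. One common pitfall to avoid, though, is overfitting to the training data, which is why it is important to test the model on data that did not appear in training. Besides the static training/test data, I also added a way to correct the training data through user feedback, as a second channel for future retraining.

**Results:**

After running the integration tests, I saw a clear improvement in prediction accuracy, as the following report shows:

```json
{
  "scenario": "finetuned-category",
  "model_kind": "finetuned",
  "model_name": "our-house-qwen3-0.6b-category-names",
  "label_mode": "category",
  "total": 131,
  "correct": 104,
  "incorrect": 27,
  "accuracy": 0.7939
}
```

Prediction accuracy went from 10% to 79%, but I still saw some clear error patterns:

1. The model now heads in the right direction, but it sometimes outputs only a fragment of, or a synonym for, the correct category in the allowed list. For example, it outputs "ac" or "air" instead of "hvac".
2. The model confuses categories that overlap semantically, such as the "water"-based categories (fountain, water heater, pool).

### Fine-Tuning: Second Attempt

One simple improvement over the first fine-tuning experiment would be to add a post-processing step. That would let me normalize predictions that are semantically correct but syntactically wrong (for example ac, air). Another tweak would be to add more constraints to the prompt itself, with more examples that tell the model what to do and what not to do. I think both ideas are reasonable, but they create more maintenance work as more categories are added.

Instead, I wanted to see whether I could adjust the fine-tuning approach slightly by changing how the model is taught to map categories. It turns out that a tiny change to the prompt improves accuracy beyond the first experiment: I map each category to an opaque two-character ID, and these IDs have no semantic overlap, as shown in the following example:

```
Classify the homeowner question into exactly one label from the list below.
Return only the short label code from the list.
Never return the category name, a number, a synonym, an explanation, or any other text.
The answer must be exactly one uppercase two-letter code.
Choose the best label based on the meaning of the question.

Valid labels:
AA = appliances
BB = brick work
CC = car
DD = cooking
EE = doorbell
FF = electric
GG = fence
HH = fountain
II = garden lights
JJ = gutters
KK = hvac
LL = irrigation
MM = mosquito
NN = painting
OO = pool
PP = tree service
QQ = water heater
RR = window service

Question: Who installed the tankless hot water setup for the house?
Code:
```

Now I ask the model to output a fixed-format code rather than a variable category string, which could carry overlapping meanings (for example the "water"-based categories). Interestingly, I saw a very nice performance gain, as the following summary shows:

```json
{
  "scenario": "finetuned-code",
  "model_kind": "finetuned",
  "model_name": "our-house-qwen3-0.6b",
  "label_mode": "code",
  "total": 131,
  "correct": 120,
  "incorrect": 11,
  "accuracy": 0.916
}
```

As you can see, prediction accuracy is now about 92%, which is quite accurate. It seems that requiring a fixed, non-overlapping output helps this tiny Qwen model generate better responses. There are still some misses, though. The specific failures (expected → predicted) are listed below:

- Case 15: water heater → pool | When was the home's tankless hot water system last checked?
- Case 53: gutters → mosquito | What did MGM bill us for the gutter cleaning visit?
- Case 62: mosquito → garden lights | Which section of the mosquito misting line needed repair?
- Case 73: water heater → pool | Who put in the tankless hot water system?
- Case 74: water heater → pool | What manufacturer made the home's tankless water heater?
- Case 99: fountain → pool | Who serviced the pump for the front water feature?
- Case 106: gutters → mosquito | Who do we use for gutter cleaning service?
- Case 114: mosquito → garden lights | What fluid do we pour into the mosquito misting system?
- Case 126: water heater → pool | Who installed the tankless hot water setup for the house?
- Case 127: water heater → pool | When was the tankless heater maintenance done last?
- Case 128: water heater → pool | What brand is the tankless water unit we use at home?

At this point the predictions are reliable overall, and the fine-tuned LLM works as a usable predictor in my chatbot, but a few problems remain. The one that stands out is water heater → pool, accounting for 6 of the 11 failures, which is most likely still caused by the overlapping "water" meaning of the two categories. To fix it, I will probably need to revisit the training data and make it more nuanced.

The original post includes a screenshot of a sample chat interaction. Note in particular the small category label in the blue question bubble (for example "pool"), which is the part classified automatically by the tiny Qwen 3:0.6B model. I have included the GitHub repository here (https://github.com/thelgevold/fine-tuned-classifier) if you are interested.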
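The two-letter code scheme from the second attempt can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the repository: the names `CODE_TO_CATEGORY`, `build_prompt`, and `decode_label` are hypothetical, the line breaks in the prompt are a guess, and the call to the model itself is left out. It shows how the codes can be derived from the category list and how a raw reply can be mapped back to a category name, with anything outside the known codes rejected.

```python
import string

# The 18 allowed categories, in the order used by the prompt in the article.
CATEGORIES = [
    "appliances", "brick work", "car", "cooking", "doorbell", "electric",
    "fence", "fountain", "garden lights", "gutters", "hvac", "irrigation",
    "mosquito", "painting", "pool", "tree service", "water heater",
    "window service",
]

# Opaque codes AA, BB, CC, ... assigned in list order (AA = appliances, etc.).
CODE_TO_CATEGORY = {
    letter * 2: category
    for letter, category in zip(string.ascii_uppercase, CATEGORIES)
}


def build_prompt(question: str) -> str:
    """Assemble the code-label prompt for one question (hypothetical helper)."""
    labels = "\n".join(
        f"{code} = {name}" for code, name in CODE_TO_CATEGORY.items()
    )
    return (
        "Classify the homeowner question into exactly one label from the list below.\n"
        "Return only the short label code from the list.\n"
        "Never return the category name, a number, a synonym, an explanation, "
        "or any other text.\n"
        "The answer must be exactly one uppercase two-letter code.\n"
        "Choose the best label based on the meaning of the question.\n\n"
        f"Valid labels:\n{labels}\n\n"
        f"Question: {question}\n"
        "Code:"
    )


def decode_label(raw_response: str) -> str:
    """Map the model's raw reply to a category name, or raise ValueError."""
    code = raw_response.strip()  # tolerate surrounding whitespace only
    if code not in CODE_TO_CATEGORY:
        raise ValueError(f"unknown label code {raw_response!r}")
    return CODE_TO_CATEGORY[code]
```

Because the model is asked for a fixed-format code, validation reduces to a dictionary lookup: `decode_label("QQ")` yields `water heater`, while a free-form reply such as `apartments` raises an error instead of silently becoming a new category.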
