@mervenoyann: day 2 findings on this pipeline > it works, got map@50=0.8028 on road sign detection against human annotations, with…

X AI KOLs Timeline 2026/06/17 15:18 News

multi-modal vlm road-sign-detection pipeline data-labeling model-training

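Summary

Merve (@mervenoyann) shares day 2 findings on a pipeline that uses multiple small VLMs as judges, reaching map@50=0.8028 on road sign detection with only 1.3k examples. The post compares model rejection rates and discusses dataset shrinkage, super-specific prompts, and plans to generalize the library.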

day 2 findings on this pipeline

> it works, got map@50=0.8028 on road sign detection against human annotations, with only 1.3k examples see results below

> Liquid rejects way more than Gemma-4 (530 vs 306 in hard document parsing, 1022 vs 116 in easy road sign detection, tbh it's smaller and more prone to hallucination when I vibe check)

> in some cases (see document media parsing examples below) trained RF-DETR outperforms Qwen annotations it was trained on which is super cool, sometimes judges introduce bboxes (and I don't remove them) it's a win?

> multiple VLMs as judges will shrink your dataset depending on the difficulty of the problem, sometimes taking only one "correct" from a judge is enough. since you are training small models it's better to kickoff training for consensus and single correct verdict separately

> use super-specific prompts of what you want and don't want in labelling and judging especially if your labels as words could mean many things

next up: make this library leaner to generalize better to be problem-agnostic, try again on segmentation, actually use Gemma for orchestration


Cached at: 2026/06/18 06:09

Day 2 findings on this pipeline

It works: got map@50=0.8028 on road sign detection against human annotations, with only 1.3k examples. See results below.
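To make the metric concrete: map@50 counts a predicted box as correct when its intersection-over-union (IoU) with a human-annotated box is at least 0.5. The sketch below shows only that matching step, not the full average-precision computation, and is not taken from the author's pipeline. The (x1, y1, x2, y2) box format and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2). Format is an assumption."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_true_positives(preds, gts, threshold=0.5):
    """Greedily match predictions to ground-truth boxes at the IoU threshold.

    preds: list of (box, score) pairs; gts: list of boxes.
    Each ground-truth box can be matched at most once.
    """
    matched = set()
    tp = 0
    # Higher-scoring predictions get first pick of the ground-truth boxes.
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = None, threshold
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    return tp
```

A full map@50 would additionally sweep the score threshold to build a precision-recall curve per class and average the resulting AP values across classes.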

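Liquid rejects far more than Gemma-4 (530 vs 306 in hard document parsing, 1022 vs 116 in easy road sign detection). To be honest, it is smaller and more prone to hallucination when I vibe check it.

In some cases (see the document media parsing examples below), the trained RF-DETR outperforms the Qwen annotations it was trained on, which is super cool. Sometimes judges introduce bounding boxes (and I don't remove them). Is that a win?

Using multiple VLMs as judges will shrink your dataset depending on the difficulty of the problem; sometimes taking only one "correct" from a judge is enough. Since you are training small models, it's better to kick off training for consensus and for a single correct verdict separately.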
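The two training sets mentioned here can be sketched as a simple filter over per-sample judge verdicts. This is an illustrative sketch, not the author's code: the verdict layout (a dict mapping judge name to an accept/reject boolean), the judge names, and the reading of "single correct verdict" as "at least one judge accepted" are all assumptions.

```python
def split_by_verdict(verdicts):
    """Build two training subsets from judge verdicts.

    verdicts: {sample_id: {judge_name: bool}}, where True means the judge
    accepted the annotation. This layout is hypothetical.
    Returns (consensus_ids, single_correct_ids).
    """
    consensus, single_correct = [], []
    for sample_id, votes in verdicts.items():
        accepted = [v for v in votes.values() if v]
        # Consensus: every judge accepted the annotation.
        if votes and len(accepted) == len(votes):
            consensus.append(sample_id)
        # Single correct verdict: at least one judge accepted it.
        if accepted:
            single_correct.append(sample_id)
    return consensus, single_correct


# Hypothetical verdicts from two judges on three samples.
example = {
    "img_001": {"judge_a": True, "judge_b": True},
    "img_002": {"judge_a": True, "judge_b": False},
    "img_003": {"judge_a": False, "judge_b": False},
}
consensus_ids, single_ids = split_by_verdict(example)
```

Under these assumptions the consensus subset is always contained in the single-verdict subset, so the stricter the rule, the smaller the dataset, which matches the shrinkage described above.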

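Use super-specific prompts about what you want and don't want in labelling and judging, especially if your labels, as words, could mean many things.

Next up: make this library leaner so it generalizes better and is problem-agnostic; try again on segmentation; actually use Gemma for orchestration.

All of my work is here: https://huggingface.co/collections/merve/vision-intern… including the labelled datasets, judged datasets, trained models, parts of the pipeline, and more.

Thanks also to @huggingface infra; I made heavy use of Buckets, Jobs, Dataset Viewer, and more.

@DataScienceHarp @skalskip92 @maximelabonne you might be interested in this ^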
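One way to keep prompts this specific is to spell out both what counts and what does not count for an ambiguous label. The helper below is a hypothetical sketch: the function name, the prompt wording, and the example label are invented for illustration and are not the prompts used in the thread.

```python
def build_judge_prompt(label, include, exclude):
    """Assemble a judging prompt listing explicit inclusions and exclusions.

    All wording here is illustrative, not taken from the original pipeline.
    """
    lines = [
        f"Decide whether each bounding box correctly marks a '{label}'.",
        "Count as correct:",
        *[f"- {item}" for item in include],
        "Count as incorrect:",
        *[f"- {item}" for item in exclude],
        "Answer with exactly one word: accept or reject.",
    ]
    return "\n".join(lines)


# Hypothetical usage for a label whose plain meaning is ambiguous.
prompt = build_judge_prompt(
    label="road sign",
    include=["regulatory, warning, and guide signs mounted beside or above a road"],
    exclude=["billboards and shop signs", "painted road markings", "traffic lights"],
)
```

Keeping the inclusion and exclusion lists as data makes it straightforward to reuse the same wording for both the labelling and the judging step.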

