NemoStation/Marlin-2B

Hugging Face Models Trending 05/13/26, 04:23 PM Models

text-generation multimodal video-captioning temporal-grounding fine-tuned qwen

Summary

NemoStation/Marlin-2B is a fine-tuned model based on Qwen3.5-2B for video-text-to-text tasks, supporting video captioning and temporal grounding.

Task: video-text-to-text

Tags: transformers, safetensors, qwen3_5, text-generation, video, multimodal, video-captioning, temporal-grounding, qwen, VLM, video-text-to-text, custom_code, en, arxiv:2501.00513, arxiv:2407.00634, arxiv:2512.14698, base_model:Qwen/Qwen3.5-2B, base_model:finetune:Qwen/Qwen3.5-2B, license:apache-2.0, region:us
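The tag list above mixes plain labels (such as `video-captioning`) with `key:value` entries (such as `license:apache-2.0`). As a minimal sketch, assuming only that the first colon separates key from value, the listing can be split into structured metadata as follows. The helper name `parse_tags` and the `TAGS` constant are illustrative and not part of any Hugging Face library.

```python
from collections import defaultdict

# Tag string copied from the listing above.
TAGS = (
    "transformers, safetensors, qwen3_5, text-generation, video, multimodal, "
    "video-captioning, temporal-grounding, qwen, VLM, video-text-to-text, "
    "custom_code, en, arxiv:2501.00513, arxiv:2407.00634, arxiv:2512.14698, "
    "base_model:Qwen/Qwen3.5-2B, base_model:finetune:Qwen/Qwen3.5-2B, "
    "license:apache-2.0, region:us"
)


def parse_tags(tag_string):
    """Split a comma-separated tag string into plain labels and key:value groups.

    Hypothetical helper: it assumes the first ':' in a tag separates the key
    from the value, so 'base_model:finetune:X' maps to key 'base_model' with
    value 'finetune:X'.
    """
    plain = []
    keyed = defaultdict(list)
    for tag in (t.strip() for t in tag_string.split(",")):
        if not tag:
            continue
        key, sep, value = tag.partition(":")
        if sep:
            keyed[key].append(value)
        else:
            plain.append(tag)
    return plain, dict(keyed)


plain_tags, keyed_tags = parse_tags(TAGS)
print("Base model entries:", keyed_tags["base_model"])
print("License:", keyed_tags["license"])
print("Referenced arXiv IDs:", keyed_tags["arxiv"])
```

The `custom_code` label among the plain tags suggests the repository ships its own modeling code, so loading it through `transformers` would likely require opting in to remote code; check the model card before doing so.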

Similar Articles

@HappyyPablo: open sourcing Marlin-2B a tiny VLM to extract structured information from videos Marlin is finetuned for two questions …

X AI KOLs Timeline

The author announces the open-sourcing of Marlin-2B, a tiny VLM for extracting structured information from videos, fine-tuned to answer "what is happening and when". The post describes it as the best open model in its weight class and as competitive with Gemini-2.5-flash.

Motif-Video 2B: Technical Report

Hugging Face Daily Papers

Motif-Video 2B is a 2B-parameter text-to-video generation model that achieves 83.76% on VBench, surpassing Wan2.1 14B while using 7x fewer parameters. It was trained on fewer than 10M clips with less than 100,000 H200 GPU hours. The model uses a specialized architecture with shared cross-attention and a three-part backbone to separate prompt alignment, temporal consistency, and detail refinement.

nvidia/nemotron-3.5-asr-streaming-0.6b

Hugging Face Models Trending

NVIDIA releases Nemotron 3.5 ASR, a 600M-parameter multilingual streaming speech recognition model that supports 40 language-locales and uses a Cache-Aware FastConformer-RNNT architecture for low-latency transcription. The model supports configurable chunk sizes and is ready for commercial use under the OpenMDW-1.1 license.

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Hugging Face Models Trending

NVIDIA releases Nemotron 3 Nano Omni, a 30B parameter multimodal model capable of processing video, audio, image, and text with integrated reasoning capabilities for enterprise workflows.

Mellum 2 12B A2.5B

Reddit r/LocalLLaMA

JetBrains released Mellum 2 12B A2.5B, a coding-focused small MoE model whose reasoning performance is comparable to Qwen 3.5 9B but which is weaker on other tasks.