Articles from HuggingFace
NVIDIA NeMo AutoModel leverages HuggingFace Transformers v5 to deliver 3.4-3.7x higher training throughput and 29-32% lower GPU memory use when fine-tuning Mixture-of-Experts models, with no code changes beyond a single import.
Introduces the FFASR Leaderboard, an open, community-driven benchmark for evaluating automatic speech recognition models under realistic far-field acoustic conditions, highlighting the significant performance gap between near-field and far-field scenarios.
IBM introduces CUGA, an open-source agent harness that handles plumbing for state, tool calls, and orchestration, allowing developers to focus on defining tools and prompts. The article showcases two dozen single-file example apps built with CUGA, demonstrating how it eliminates repetitive framework setup.
InSight presents a framework for autonomous skill acquisition in vision-language-action (VLA) models by enabling steerability at the primitive-action level and using a VLM-guided data flywheel to generate demonstrations, accomplishing manipulation tasks such as block flipping and pouring without human demonstrations.
FLAT proposes a method to decode explicit triangle splats directly from video diffusion latents for geometrically accurate 3D scene generation. It introduces a ray-centered rotation parameterization and a product window function to improve gradient flow, achieving better geometric accuracy than prior feedforward methods while supporting real-time rendering.
Researchers introduce NanoGen, a unified framework for training and evaluating diffusion transformers, and propose DiffusionBench, a holistic benchmark combining ImageNet class-conditional and text-to-image generation to better assess progress in generative modeling.
FLUX3D introduces a framework for high-fidelity image-to-3D Gaussian Splatting generation by enhancing representation learning and cross-modal alignment with diffusion-aligned structured latents and a sparse-structure-aware diffusion transformer, achieving state-of-the-art results.
This paper introduces OpenThoughts-Agent, an open-source data curation pipeline for training agentic language models, developed through systematic experiments. It achieves 44.8% average accuracy across seven benchmarks, outperforming prior open datasets.
The paper presents World Value Model (WVM), a generalist robotic value model that combines world models with value estimation to accurately assess task progression and improve robotic policy learning from mixed-quality data, achieving state-of-the-art results on standard benchmarks and a new suboptimal data benchmark.
This paper introduces CF-World, a counterfactual benchmark to evaluate whether text-to-image models rely on causal reasoning or mere pattern matching. Experiments show all models degrade sharply in counterfactual settings, suggesting their understanding is limited to tightly coupled visual-textual patterns rather than genuine causal reasoning.
Introduces Holistic Data Scheduler (HDS), a reinforcement learning-based framework that dynamically adjusts data mixtures during LLM pre-training using a multi-objective reward function, achieving 44% fewer iterations to reach target perplexity and a 7.2% improvement on MMLU.
ReMMD introduces a realistic multilingual multi-image agentic verification framework for multimodal misinformation detection, including a benchmark (ReMMDBench) with 500 samples and 2,756 images, and an agent (ReMMD-Agent) that achieves superior veracity performance with reduced costs.
DREAM trains dense retrieval embeddings by using autoregressive language model attention to supervise query-document similarity, eliminating the need for labeled data. It consistently outperforms baselines on BEIR and RTEB benchmarks across model scales.
NatureBench is a cross-disciplinary benchmark of 90 scientific tasks from Nature publications, designed to evaluate AI coding agents' ability to achieve genuine discovery. Current agents succeed mainly through methodological translation, not scientific innovation.
FlowR2A proposes a novel method that combines dense reward supervision with dynamic proposal generation using a flow-matching decoder for multimodal driving planning, achieving state-of-the-art results on the NAVSIM benchmarks.
This paper proposes the EDV framework, which uses multiple heterogeneous agents in execute-distill-verify stages to build reliable experiences for LLM agents, preventing self-confirmatory errors and improving performance on long-horizon benchmarks.
This guest post explores the proposed Cross-Origin Storage API to improve caching of AI model resources in Transformers.js, enabling efficient reuse across origins while maintaining privacy and integrity for in-browser inference.
Hugging Face describes how it built a weekly release pipeline for its huggingface_hub library using AI, open-source tools, and human oversight, enabling faster and more reliable releases.
Qwen releases Qwen-AgentWorld-35B-A3B, a native language world model that simulates agentic environments across seven domains via long chain-of-thought reasoning. The model is trained with a three-stage pipeline and supports MCP, Search, Terminal, SWE, Android, Web, and OS interactions.
PP-OCRv6 is the latest generation of PaddleOCR's universal OCR model family, offering three tiers from 1.5M to 34.5M parameters, supporting 50 languages, and achieving significant accuracy improvements over previous versions.