Meta AI releases SAM 3.1, an update to the Segment Anything Model that enhances real-time video detection and tracking through multiplexing and global reasoning capabilities.
Researchers from MIT and the Woodwell Climate Research Center published a paper on using computer vision to automate fish monitoring, improving upon traditional citizen science methods for river herring conservation.
LeWorldModel introduces a stable, end-to-end Joint-Embedding Predictive Architecture that trains directly from pixels with minimal hyperparameters and provable anti-collapse guarantees. It achieves significant speedups in planning compared to foundation models while maintaining competitive performance on robotic manipulation tasks.
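To make the setup concrete, here is a minimal JEPA-style training step; `context_enc`, `target_enc` (typically an EMA copy), and `predictor` are placeholder modules, and the variance penalty is a generic anti-collapse stand-in, not the paper's provable regularizer:

```python
# Minimal JEPA-style step (illustrative sketch, not the paper's exact method).
import torch
import torch.nn.functional as F

def jepa_step(context_enc, target_enc, predictor, x_context, x_target, opt):
    z_ctx = context_enc(x_context)              # embed the visible/context view: (B, D)
    with torch.no_grad():
        z_tgt = target_enc(x_target)            # target branch receives no gradients
    pred = predictor(z_ctx)                     # predict the target embedding from context
    loss = F.mse_loss(pred, z_tgt)              # loss lives in latent space, not pixels
    # Generic anti-collapse term: keep per-dimension batch variance away from zero.
    std = z_ctx.std(dim=0)
    loss = loss + F.relu(1.0 - std).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```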
World Resources Institute announces Canopy Height Maps v2, an open-source model and accompanying global maps for measuring forest canopy height with greater precision.
The UK government is using Meta's DINOv2 model to optimize reforestation efforts, aiming to reduce costs and improve access to greenspaces.
DeepMind introduces D4RT, a unified AI model for dynamic 4D scene reconstruction and tracking that is up to 300x more efficient than previous methods. The model uses a query-based Transformer architecture to solve complex spatial and temporal tasks for robotics and AR applications.
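As a rough illustration of the query-based design (not D4RT's actual architecture), a decoder can embed `(u, v, t)` queries and cross-attend to flattened space-time video features; every module name and size below is an assumption:

```python
# Illustrative query-based decoder in the spirit of the description.
import torch
import torch.nn as nn

class PointQueryDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.query_proj = nn.Linear(3, d_model)   # (u, v, t) -> query embedding
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, 4)         # (x, y, z, visibility) per query

    def forward(self, uvt_queries, video_feats):
        # uvt_queries: (B, Q, 3); video_feats: (B, N, d_model) space-time tokens
        q = self.query_proj(uvt_queries)
        out = self.decoder(q, video_feats)        # each query cross-attends to the video
        return self.head(out)
```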
PersonaLive is a diffusion-based framework for real-time expressive portrait animation in live streaming, achieving significant speedups through hybrid implicit signals and autoregressive streaming generation.
Google DeepMind published a paper in Nature detailing a method to align AI visual representations with human cognitive structures, improving model robustness and reliability.
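One common way to operationalize such alignment, sketched here with hypothetical embedding inputs, is an odd-one-out objective over human triplet judgments; the paper's exact loss and data pipeline may differ:

```python
# Hedged sketch of a triplet "odd-one-out" alignment objective.
import torch
import torch.nn.functional as F

def odd_one_out_loss(emb_i, emb_j, emb_k):
    """Humans judged k the odd one out, so (i, j) should be the most similar pair."""
    s_ij = F.cosine_similarity(emb_i, emb_j)
    s_ik = F.cosine_similarity(emb_i, emb_k)
    s_jk = F.cosine_similarity(emb_j, emb_k)
    logits = torch.stack([s_ij, s_ik, s_jk], dim=-1)
    target = torch.zeros(logits.shape[0], dtype=torch.long)  # index 0 = the (i, j) pair
    return F.cross_entropy(logits, target)
```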
MinerU2.5 is a 1.2B-parameter vision-language model that achieves state-of-the-art document parsing accuracy with high computational efficiency using a coarse-to-fine parsing strategy.
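The coarse-to-fine pass can be sketched as two stages, with `detect_layout` and `recognize_region` as hypothetical stand-ins for the model's two calls:

```python
# Two-stage coarse-to-fine parsing sketch (illustrative, not MinerU2.5's code).
from PIL import Image

def parse_page(page: Image.Image, coarse_size=1024):
    # Stage 1 (coarse): layout analysis on a downsampled page keeps the token budget small.
    thumb = page.resize((coarse_size, int(coarse_size * page.height / page.width)))
    scale = page.width / thumb.width
    regions = detect_layout(thumb)  # hypothetical: -> [(x0, y0, x1, y1, kind), ...]
    # Stage 2 (fine): re-crop each region from the native-resolution page for recognition.
    results = []
    for x0, y0, x1, y1, kind in regions:
        box = (int(x0 * scale), int(y0 * scale), int(x1 * scale), int(y1 * scale))
        results.append((kind, recognize_region(page.crop(box), kind)))  # text/table/formula
    return results
```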
UI-TARS-2 is a native GUI-centered agent model that addresses data scalability, multi-turn RL, and environment stability challenges, achieving state-of-the-art results on GUI benchmarks (88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld) and outperforming Claude and OpenAI agents.
Orakl Oncology is using Meta's DINOv2 model to combine machine learning with experimental data, aiming to accelerate cancer treatment discovery and drug development.
Grab uses OpenAI's GPT-4o vision fine-tuning to improve GrabMaps, achieving significant accuracy gains in speed-limit-sign localization (13%) and lane counting (20%) while reducing manual mapping effort across Southeast Asia's complex road networks.
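For reference, vision fine-tuning data for GPT-4o is supplied as chat-format JSONL with image parts in the user message; the prompt wording and label schema below are invented for illustration:

```python
# One vision fine-tuning example in OpenAI's chat-format JSONL (illustrative content).
import json

example = {
    "messages": [
        {"role": "system", "content": "You extract road attributes from street-level imagery."},
        {"role": "user", "content": [
            {"type": "text", "text": "What is the posted speed limit, and how many lanes are visible?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/street_view.jpg"}},
        ]},
        {"role": "assistant", "content": '{"speed_limit_kmh": 60, "lane_count": 3}'},
    ]
}
print(json.dumps(example))  # one example per line in the training .jsonl file
```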
Healthify, a health and fitness AI platform, partnered with OpenAI to enhance its AI-powered nutritionist Ria and food recognition feature Snap, overcoming limitations in accuracy, scalability, and multilingual support. The collaboration marks a significant upgrade to Healthify's decade-old, AI-driven health coaching platform.
OpenAI introduced Video PreTraining (VPT), a semi-supervised method that trains neural networks to play Minecraft by learning from 70,000 hours of unlabeled human gameplay video combined with a small labeled dataset. The model learns complex sequential tasks using the native human interface (keyboard and mouse) and demonstrates capabilities like crafting diamond tools and pillar jumping, representing progress toward general computer-using agents.
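The recipe reduces to three stages, sketched here with placeholder functions rather than OpenAI's released code:

```python
# VPT pipeline sketch (all names are placeholders).
# Stage 1: a small labeled set trains an inverse dynamics model (IDM) that infers the
# keyboard/mouse action between frames; because it may look at *future* frames, this
# is an easier learning problem than acting causally.
idm = train_inverse_dynamics_model(labeled_clips)        # (frames t-k..t+k) -> action_t

# Stage 2: the IDM pseudo-labels the 70,000 hours of unlabeled gameplay video.
pseudo_labeled = [(video, idm.label(video)) for video in unlabeled_videos]

# Stage 3: ordinary behavioral cloning on the pseudo-labeled corpus yields a causal
# policy that acts from past frames only, through the native human interface.
policy = behavior_clone(pseudo_labeled)
action = policy.act(current_frames)                      # e.g. {"keys": [...], "mouse": (dx, dy)}
```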
The article surveys the emerging AI art scene in which OpenAI's CLIP model is used as a steering mechanism for generative models, showcasing a range of text-to-image outputs.
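The steering loop itself is simple: optimize a generator's latent to maximize CLIP similarity with a text prompt. A minimal sketch, with `generator` as a hypothetical stand-in whose output is already CLIP-preprocessed (artists typically paired CLIP with VQGAN or BigGAN):

```python
# CLIP-guided latent optimization sketch.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    text = model.encode_text(clip.tokenize(["a watercolor lighthouse at dawn"]).to(device))

latent = torch.randn(1, 256, device=device, requires_grad=True)  # generator input
opt = torch.optim.Adam([latent], lr=0.05)
for _ in range(300):
    image = generator(latent)              # hypothetical: latent -> (1, 3, 224, 224), normalized
    img_emb = model.encode_image(image)
    loss = -torch.cosine_similarity(img_emb, text).mean()  # ascend text-image similarity
    opt.zero_grad(); loss.backward(); opt.step()
```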
The article explains 'Vokenization,' a multimodal learning technique that bridges computer vision and natural language processing by using weak supervision to link visual data with language tokens. It contrasts this approach with text-only models like GPT-3 and BERT, highlighting how visual grounding can improve language understanding.
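The auxiliary objective can be sketched as a per-token classification over retrieved image ids ("vokens"); names and sizes below are illustrative rather than the paper's:

```python
# Voken-classification head sketch: each token predicts its retrieved image id,
# added alongside the usual masked-language-modeling loss.
import torch
import torch.nn as nn

class VokenHead(nn.Module):
    def __init__(self, hidden_size=768, num_vokens=50000):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_vokens)

    def forward(self, hidden_states, voken_ids):
        # hidden_states: (B, T, H) from the LM; voken_ids: (B, T) retrieved image ids
        logits = self.classifier(hidden_states)
        return nn.functional.cross_entropy(logits.flatten(0, 1), voken_ids.flatten())

# total_loss = mlm_loss + voken_head(hidden_states, voken_ids)   # joint objective
```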
This article explains the architecture of DALL-E, focusing on its transformer component that correlates language with discrete image representations to generate high-quality images from text prompts.
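In miniature, with `tokenize`, `dvae`, and `transformer` as stand-ins rather than OpenAI's released components, the correlation is learned as one next-token objective over a joint sequence:

```python
# Core DALL-E idea sketch: one autoregressive stream of text tokens + discrete image tokens.
import torch

text_tokens = tokenize("a ripe avocado armchair")        # hypothetical tokenizer -> (T_text,)
image_tokens = dvae.encode(image)                        # (32*32,) codebook indices,
                                                         # offset into the joint vocabulary
sequence = torch.cat([text_tokens, image_tokens])        # one stream, one next-token loss

logits = transformer(sequence[:-1])                      # predict each following token
loss = torch.nn.functional.cross_entropy(logits, sequence[1:])
# At sampling time: feed only the text tokens, sample image tokens one by one,
# then decode them back to pixels with the dVAE decoder.
```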
This article explains the Neural Module Networks (NMN) architecture from the paper 'Deep Compositional Question Answering with Neural Module Networks,' detailing how it handles the compositional structure of visual question answering tasks by decomposing questions into modular steps.
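A schematic composition for a question like "What color is the dog?" might look as follows, where the parser yields the layout describe[color](find[dog](image)); these module implementations are illustrative stand-ins, not the paper's:

```python
# Neural-module sketch: small networks wired together according to a parsed layout.
import torch
import torch.nn as nn

class Find(nn.Module):
    """Image features + word embedding -> soft attention over spatial locations."""
    def __init__(self, d=256):
        super().__init__()
        self.word = nn.Linear(d, d)
        self.score = nn.Conv2d(d, 1, kernel_size=1)

    def forward(self, feats, word_emb):                  # feats: (B, d, H, W)
        gated = feats * self.word(word_emb)[..., None, None]
        return torch.softmax(self.score(gated).flatten(2), dim=-1)   # (B, 1, H*W)

class Describe(nn.Module):
    """Attention-pooled features + word embedding -> answer logits."""
    def __init__(self, d=256, n_answers=1000):
        super().__init__()
        self.out = nn.Linear(2 * d, n_answers)

    def forward(self, feats, attn, word_emb):
        pooled = (feats.flatten(2) * attn).sum(-1)       # (B, d) attended summary
        return self.out(torch.cat([pooled, word_emb], dim=-1))

# Composed per the layout: answer = Describe()(feats, Find()(feats, w_dog), w_color)
```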
OpenAI's Image GPT (iGPT) applies GPT-2 transformers to pixel sequences for image generation and classification, demonstrating that the same architecture used for language can learn coherent visual features in an unsupervised manner and achieve competitive performance on image classification benchmarks.
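The core objective fits in a few lines; `gpt` is a hypothetical decoder-only transformer, and a faithful setup would use iGPT's 9-bit color palette and far larger models:

```python
# iGPT idea in miniature: flatten images to 1-D pixel sequences, train next-token prediction.
import torch
import torch.nn.functional as F

images = torch.randint(0, 256, (8, 32 * 32))     # batch of 32x32 grayscale images, 256 levels
inputs, targets = images[:, :-1], images[:, 1:]  # shift by one: predict the next pixel

logits = gpt(inputs)                             # hypothetical: (B, T) -> (B, T, 256)
loss = F.cross_entropy(logits.reshape(-1, 256), targets.reshape(-1))
# For classification, iGPT probes or fine-tunes the features learned this way.
```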
Researchers demonstrated adversarial images that reliably fool neural network classifiers across multiple scales and perspectives, challenging assumptions about the robustness of multi-scale image capture systems used in autonomous vehicles.
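Attacks of this kind are commonly built with an expectation-over-transformation loop, averaging the attack loss over random rescalings so the perturbation survives them; `image`, `classifier`, and the target label below are stand-ins:

```python
# Expectation-over-transformation sketch: a perturbation optimized to fool at many scales.
import random
import torch
import torch.nn.functional as F

target_class = torch.tensor([281])                    # attacker-chosen label (ImageNet 281 = tabby cat)
delta = torch.zeros_like(image, requires_grad=True)   # image: (1, 3, H, W) in [0, 1]
opt = torch.optim.Adam([delta], lr=0.01)
for _ in range(500):
    loss = 0.0
    for _ in range(8):                                # Monte Carlo over transformations
        s = random.uniform(0.5, 1.5)
        scaled = F.interpolate(image + delta, scale_factor=s, mode="bilinear")
        x = F.interpolate(scaled, size=image.shape[-2:], mode="bilinear")  # simulate re-capture
        loss = loss + F.cross_entropy(classifier(x), target_class)
    opt.zero_grad(); loss.backward(); opt.step()
    delta.data.clamp_(-0.05, 0.05)                    # keep the perturbation small
```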