Tuna-2 is a unified multimodal model that achieves state-of-the-art performance by processing visual understanding and generation directly from pixel embeddings, eliminating the need for pretrained vision encoders.
This paper introduces DiGSeg, a framework that repurposes pretrained diffusion models for state-of-the-art semantic and open-vocabulary segmentation by leveraging latent space conditioning and text-guided alignment.
A hybrid multi-phase page-matching pipeline plus multi-layer diff engine automates comparison of Japanese building-permit PDF sets, achieving F1=0.80 with zero false positives on real-world 200–1000 page submissions.
Proposes a non-gradient vector-flow method for learning flow maps, likely aimed at improving optical flow or motion-estimation tasks.
NomadicAI is building an agentic computer-vision product to fix VLMs' weak grounding in actual video content.
Article concerning YOLO, the widely used real-time object detection model family.
A user is seeking advice on improving their object detection model trained with YOLO11n for deployment on a Raspberry Pi 5, struggling with the gap between theoretical mAP50 metrics and practical detection performance.
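The gap described in that thread is common: mAP50 sweeps over confidence thresholds, while a deployed model runs at one fixed cutoff, so many low-confidence true positives that the metric credits are discarded in practice. A minimal sketch with hypothetical detections (pure Python, not real YOLO11n output) illustrating how precision and recall shift with the cutoff:

```python
# Illustrative sketch: why a fixed deployment confidence threshold can
# look much worse than a threshold-swept metric like mAP50. The
# detections below are hypothetical (score, is_true_positive) pairs.

def precision_recall_at(detections, num_gt, conf_threshold):
    """Precision/recall keeping only detections at or above conf_threshold."""
    kept = [d for d in detections if d[0] >= conf_threshold]
    tp = sum(1 for score, correct in kept if correct)
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / num_gt if num_gt else 0.0
    return precision, recall

# Hypothetical model output: several correct detections sit at low
# confidence, which a metric sweep credits but a fixed cutoff drops.
detections = [
    (0.92, True), (0.88, True), (0.75, True), (0.60, False),
    (0.45, True), (0.40, True), (0.30, True), (0.20, False),
]
num_gt = 8  # ground-truth objects in the evaluation set

for t in (0.25, 0.5, 0.8):
    p, r = precision_recall_at(detections, num_gt, t)
    print(f"conf>={t:.2f}: precision={p:.2f} recall={r:.2f}")
```

At a low cutoff recall stays high; raising the cutoff trades recall for precision, which is why a model with a respectable mAP50 can still miss most objects at the default deployment threshold.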
LingBot-Map is an AI model that converts live video streams into 3D reconstructions in real time, running at 20 FPS, with complete code and model weights provided.
A comprehensive survey examining the classification of images into high-level and abstract semantic categories, clarifying the tacit understanding of high-level semantics in computer vision through multidisciplinary analysis of commonsense, emotional, aesthetic, and interpretative semantics. The paper identifies persistent challenges in abstract-concept image classification and emphasizes the importance of hybrid AI systems for addressing complex visual reasoning tasks.
A research paper presenting a dataset and XGBoost-based model for sentiment analysis of German Sign Language (DGS) fairy tales using facial and body motion features extracted via MediaPipe, achieving 63.1% balanced accuracy and demonstrating the importance of both facial and body movements for sentiment communication in sign language.
This paper introduces HSG (Hyperbolic Scene Graph), a scene graph model that leverages hyperbolic geometry for representing hierarchical scene structures. It is hosted on Hugging Face and referenced via arXiv:2604.17454.
Researchers introduce Zero-shot World Models (ZWM), an approach that achieves visual competence comparable to state-of-the-art models while training on minimal data (a single child's visual experience) and without task-specific training. This work demonstrates a path toward more data-efficient AI systems that approach human developmental learning efficiency.
A researcher shares their struggle with achieving only ~50% accuracy using SSL methods (BYOL, MAE, VICReg) for hyperspectral crop stress classification on cabbage nitrogen deficiency detection, seeking advice on SSL techniques, feature engineering, and model architectures better suited for spectral data.
This paper presents the NTIRE 2026 Challenge on Video Saliency Prediction, introducing a novel dataset of 2,000 diverse videos with saliency maps collected via crowdsourced mouse tracking from over 5,000 assessors. Over 20 teams participated, with 7 passing the final phase, and all data is made publicly available.
Introduces LingBot-Map, a feed-forward 3D foundation model for streaming 3D reconstruction using a geometric context transformer architecture that achieves stable real-time performance at 20 FPS.
This paper presents a method for HDR video generation by leveraging pretrained generative models through logarithmic encoding alignment and camera-mimicking degradation training, enabling effective HDR synthesis without architectural redesign. The approach demonstrates that HDR generation can be achieved simply by adapting existing models to a representation naturally aligned with their learned priors.
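Logarithmic encoding of the kind this summary mentions can be illustrated with a μ-law-style transfer curve that compresses linear HDR radiance into a [0, 1] range closer to what pretrained generators were trained on. A minimal NumPy sketch; the μ value and the specific encoding are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def log_encode(hdr, mu=500.0):
    """Mu-law-style logarithmic encoding: compress linear HDR radiance
    (assumed normalized to [0, 1]) into a perceptually flatter [0, 1] range.
    mu=500.0 is an illustrative choice, not taken from the paper."""
    return np.log1p(mu * hdr) / np.log1p(mu)

def log_decode(encoded, mu=500.0):
    """Exact inverse of log_encode, recovering linear radiance."""
    return np.expm1(encoded * np.log1p(mu)) / mu

# Round-trip check on a synthetic HDR frame.
hdr = np.random.default_rng(0).uniform(0.0, 1.0, size=(4, 4, 3))
restored = log_decode(log_encode(hdr))
print(np.allclose(hdr, restored))  # True
```

The curve maps 0 to 0 and 1 to 1 while allocating most of the output range to dark values, which is what lets a model with standard-dynamic-range priors operate on HDR content without architectural changes.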
This paper introduces a meta-optimized approach for semantic visual decoding from fMRI signals that generalizes to novel subjects without fine-tuning, using in-context learning to infer unique neural encoding patterns from a small set of image-brain activation examples. The method achieves strong cross-subject and cross-scanner generalization without requiring anatomical alignment or stimulus overlap.
Alta Daily is leveraging Meta's Segment Anything model to revolutionize the digital closet experience by improving image segmentation and organization.
Falcon Perception is a 0.6B-parameter early-fusion Transformer model released by TII UAE for open-vocabulary grounding and segmentation from natural language prompts, utilizing hybrid attention and specialized heads.
MIT researchers developed VisiPrint, an AI-powered preview tool that helps 3D printing users visualize the aesthetic outcome (color, texture, gloss) of printed objects to reduce waste and improve design accuracy.