This article presents a new paper on Elastic Attention Cores for Vision Transformers, proposing a core-periphery block-sparse attention structure that improves scalability and accuracy compared to dense self-attention models like DINOv3.
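The summary only names the core-periphery structure without detailing it. As a minimal illustrative sketch (not the paper's actual construction), one way to realize such a block-sparse pattern is a mask where a few hypothetical "core" tokens attend globally while peripheral tokens attend only to the core and to a local block; the parameters `n_core` and `block` below are assumptions for illustration:

```python
import numpy as np

def core_periphery_mask(n_tokens: int, n_core: int, block: int) -> np.ndarray:
    """Boolean attention mask: core tokens attend everywhere; peripheral
    tokens attend to the core and to their own local block."""
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    mask[:n_core, :] = True               # core rows: attend to all tokens
    mask[:, :n_core] = True               # every token attends to the core
    for start in range(n_core, n_tokens, block):
        end = min(start + block, n_tokens)
        mask[start:end, start:end] = True  # local block among peripheral tokens
    return mask

m = core_periphery_mask(16, 2, 4)
```

The resulting mask has far fewer active entries than a dense 16x16 attention map, which is where the scalability claim would come from.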
Interfaze introduces a new hybrid AI model architecture that combines DNN/CNN encoders with transformers to achieve superior accuracy and cost-efficiency for deterministic tasks such as OCR, vision, and STT, compared to generalist models.
Wink Engineering evaluates the efficacy of neural super-resolution as a pre-filter for license plate OCR, concluding that it fails to improve accuracy and often leads to hallucinated characters compared to training directly on low-resolution data.
This paper investigates the geometric structure of intermediate feature representations in deep neural networks by analyzing how various image manipulations map into feature space. It suggests that feature spaces are organized in linear structures to a first approximation, using generative image editing models to probe these representations.
This paper introduces a protocol for fair comparison of diffusion-based OOD detectors and proposes Canonical Feature Snapshots (CFS), which leverage sparse internal activations for efficient detection.
The open-source project RuView uses Wi-Fi signals and AI to achieve camera-free through-wall sensing, supporting real-time human pose recognition, breathing monitoring, and fall detection; it has drawn significant attention on GitHub. With an emphasis on privacy and security, the project processes all data locally and deploys easily via ESP32 or Docker.
A GitHub repository providing minimal, standalone PyTorch reimplementations of JEPA family models (I-JEPA, V-JEPA, V-JEPA 2, C-JEPA) for educational purposes, including tutorials and visualization tools.
This paper introduces LSAMD, a method for extracting 'learngenes' across multiple datasets to initialize variable-sized Vision Transformer models, significantly reducing training costs and storage while maintaining performance comparable to pretrain-finetune methods.
This paper introduces a two-stage neuro-symbolic framework that uses weak supervision (as little as 1% labels) with a slot-based VAE to learn interpretable symbols for object-centric visual reasoning, outperforming foundation models in domain generalization.
This paper introduces WildRelight, a new real-world benchmark dataset for single-image relighting that addresses the gap between synthetic and natural scenes. It proposes a physics-guided adaptation framework using diffusion posterior sampling and test-time adaptation to improve model performance on real-world data.
Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.
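The blurb names sparse linear attention as Lite3R's efficiency mechanism; its exact kernel and sparsity pattern are not specified here. A minimal sketch of the (non-sparse) kernelized linear-attention building block, assuming an elu+1 feature map, which replaces the O(n²) softmax with an O(n·d²) computation:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention: phi(Q) @ (phi(K)^T @ V).
    The positive feature map phi(x) = elu(x) + 1 keeps weights non-negative."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                    # (d, d_v) summary: one pass over keys/values
    z = Qf @ Kf.sum(axis=0)          # per-query normalizer
    return (Qf @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 4))
K = rng.standard_normal((8, 4))
V = rng.standard_normal((8, 3))
out = linear_attention(Q, K, V)
```

Because the key-value summary `kv` is computed once, cost grows linearly in the number of tokens, which is the kind of saving the reported 2.4x latency/memory reduction builds on.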
MoCam is a research paper introducing a diffusion-based framework for unified novel view synthesis that dynamically coordinates geometric and appearance priors to improve robustness against geometric errors.
This paper introduces LychSim, a controllable simulation framework built on Unreal Engine 5 to facilitate vision research, synthetic data generation, and agentic LLM evaluation via MCP integration.
This paper introduces DRoRAE, a method that improves visual tokenization by fusing multi-layer features from pretrained vision encoders rather than relying solely on the last layer. It demonstrates significant improvements in reconstruction and generation quality on ImageNet and establishes a scaling law between fusion capacity and performance.
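As a toy illustration of the fusion idea (the convex weighting scheme below is an assumption for illustration, not DRoRAE's actual mechanism), multi-layer features can be combined with softmax-normalized weights instead of taking only the last layer:

```python
import numpy as np

def fuse_layers(layer_feats, logits):
    """Fuse per-layer encoder features with softmax-normalized weights.
    layer_feats: list of (n_tokens, d) arrays from different depths."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()                          # convex weights over layers
    return sum(wi * f for wi, f in zip(w, layer_feats))

feats = [np.full((4, 8), float(i)) for i in range(3)]  # toy "layers" 0, 1, 2
fused = fuse_layers(feats, np.zeros(3))                # zero logits: equal weights
# fused[0, 0] is 1.0, the mean of the three layers
```

Making the logits learnable would let the tokenizer choose how much each depth contributes, which is the knob a fusion-capacity scaling law could vary.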
This paper introduces WebEye, a benchmark for object localization requiring external knowledge resolution, and Pixel-Searcher, an agentic approach that connects search results to visual annotations.
VidSplat is a training-free generative reconstruction framework that uses video diffusion priors to recover complete 3D scenes from sparse inputs by synthesizing novel views.
A maker built a drone that uses a laser to track targets, with Claude AI assisting in its development or processing.
ByteDance released Agent TARS, a free, open-source multimodal AI agent stack that reached #1 on GitHub Trending with 31,400 stars. The tool enables GUI control, computer vision, and browser automation across terminals and desktops.
This paper presents a Transformer-based model for classifying wildlife species using only daily GPS movement trajectories, demonstrating superior accuracy over LSTM and CNN baselines across different studies and regions.
ByteDance's open-source desktop AI automation tool, UI-TARS Desktop, supports local execution and visual understanding of the screen. It can autonomously control your computer to handle everyday tasks via natural-language commands.