@VincentLogic: NVIDIA's newly open-sourced LocateAnything model is really impressive. The previous visual grounding models generated coordinates digit by digit (like squeezing toothpaste), slow and unstable. This new model uses "parallel bounding box decoding" to predict complete coordinates in one step, much faster and more accurate...

X AI KOLs Timeline 06/03/26, 01:26 PM Models

visual-localization parallel-decoding open-source nvidia small-model bounding-box

Summary

NVIDIA has open-sourced LocateAnything, a visual grounding model that uses parallel bounding box decoding to predict complete box coordinates in one step instead of digit by digit, which the post describes as faster and more accurate. The model has only 3B parameters (about 7.8GB) and can run on consumer-grade GPUs. It supports video object localization, UI recognition, OCR, and other tasks.

NVIDIA just open-sourced the LocateAnything model, and it's really impressive. Earlier visual grounding models generated coordinates one digit at a time (like squeezing toothpaste), which was slow and unstable. This new model uses "parallel bounding box decoding" to predict the complete coordinates in a single step, making it much faster and more accurate. It handles locating objects in videos, recognizing UIs, and OCR. Best of all, the model is small, with only 3B parameters (about 7.8GB), so it can run locally on consumer-grade GPUs! If you work in computer vision or multimodal AI, it's well worth trying. The project is open source and available now.

Original Article


Cached at: 06/03/26, 05:53 PM

NVIDIA just open-sourced this LocateAnything model, and it’s really impressive.

Older visual grounding models generate coordinates one number at a time (like squeezing toothpaste), making them slow and unstable.
This new model uses parallel bounding box decoding, predicting the complete coordinates in a single step — much faster and more accurate.
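
To make the speed claim concrete, here is a toy Python sketch that only counts decoding steps. It is not LocateAnything's actual code. It assumes a digit-by-digit decoder emits one character per step and that a parallel decoder emits all four box coordinates in one step. The function names and the example box are made up for illustration.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def decode_digit_by_digit(box: Box) -> Tuple[Box, int]:
    """Toy autoregressive decoder: one character is emitted per step.

    Assumption: each digit and each separator costs one decoding step.
    Real tokenizers may group characters differently.
    """
    target = ",".join(str(v) for v in box)
    emitted = []
    steps = 0
    for ch in target:  # each loop iteration stands in for one model call
        emitted.append(ch)
        steps += 1
    x1, y1, x2, y2 = (int(p) for p in "".join(emitted).split(","))
    return (x1, y1, x2, y2), steps


def decode_parallel(box: Box) -> Tuple[Box, int]:
    """Toy parallel decoder: all four coordinates come out of one step."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2, y2), 1


if __name__ == "__main__":
    example = (120, 45, 380, 290)  # hypothetical box
    print(decode_digit_by_digit(example))
    print(decode_parallel(example))
```

Both functions return the same box. The only difference is the step count, which is the cost the post says parallel decoding avoids.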

Whether it’s finding objects in videos, recognizing UIs, or reading text via OCR, it handles it all.
The best part is the model is tiny — only 3B parameters (~7.8GB) — so it can run locally on consumer-grade GPUs!
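
As a rough sanity check on those numbers, the sketch below relates parameter count to weight size. The post does not state the numeric precision or what the 7.8GB figure includes, so the bytes-per-parameter values are assumptions, 1 GB is taken as 10^9 bytes, and the helper names are made up for illustration.

```python
def weights_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def implied_params(size_gb: float, bytes_per_param: float) -> float:
    """Parameter count implied by a weight size at a given precision."""
    return size_gb * 1e9 / bytes_per_param


if __name__ == "__main__":
    # Assumed precisions: 2 bytes (fp16/bf16) and 4 bytes (fp32).
    for name, width in (("16-bit", 2), ("32-bit", 4)):
        print(name, weights_size_gb(3e9, width), "GB for 3B parameters")
    print(implied_params(7.8, 2) / 1e9, "B parameters implied by 7.8GB at 16-bit")
```

At 16-bit precision, 3B parameters come to about 6 GB, while 7.8GB corresponds to about 3.9B parameters. The difference may simply mean that "3B" is a rounded label or that the download includes more than the core weights. The post does not say which.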

If you work in computer vision or multimodal AI, you have to try this.
The project is open-sourced now — first come, first served!

Similar Articles

@ZhidingYu: Thank you NVIDIA! I will be presenting LocateAnything at #CVPR2026 at the NVIDIA Booth: June 5 4:20 - 4:40 pm MDT (Frid…

@NVIDIAAI: This #CVPR2026 paper from our research team is trending #1 on @HuggingFace Meet LocateAnything: a vision-language detec…

@ZhidingYu: We just adopted a super cool new space template for LocateAnything, made by @_akhaliq the great. Thank you AK! Try it o…

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding


@VincentLogic: NVIDIA really went all out this time, directly releasing an open-source video understanding monster Nemotron 3 Nano Omni that processes video at an insane speed: 1 hour to handle 10 hours of video content, 10 times faster than playback speed. The core relies on 3D convolution technology, no longer scanning frame by frame, but instead…
