# Falcon Perception


Source: https://huggingface.co/blog/tiiuae/falcon-perception


**TL;DR**: Falcon Perception is a **0.6B-parameter** early-fusion Transformer for open-vocabulary grounding and segmentation from natural language prompts. The model processes image patches + text in one sequence using a hybrid attention mask, and produces variable numbers of instances with a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches **68.0 Macro-F1** (vs. **62.3** for SAM 3), with the main remaining gap being presence calibration (MCC **0.64** vs. **0.82**). We also introduce **PBench**, a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and by dense, crowded long-context scenes.

We also release **Falcon OCR**, a **0.3B-parameter** model that reaches 80.3 on the olmOCR benchmark and 88.6 on OmniDocBench, while delivering the highest throughput of any open-source OCR model.

This post is a brief, practical write-up of what we built, why we built it this way, and what we learned along the way.

Tech Report · GitHub · Playground

PBench · OCR Model · Perception Model

## The problem: why do perception systems end up as pipelines?

Many open-vocabulary perception systems are built as modular pipelines: an (often frozen) vision backbone extracts features, a separate fusion/decoder stage combines them with language, and additional components handle matching and post-processing. This family of designs works well in many settings, but it comes with trade-offs: it can be hard to scale cleanly, hard to attribute improvements to the right component, and easy to accumulate complexity as we add a new fix for each failure mode.

We asked a simpler question: *can a single early-fusion Transformer backbone handle both perception and language modeling, if we choose the right attention pattern, output interface, and training signal?*

In our experiments, the answer is largely yes. The rest of this post describes the main design choices and the evidence behind them.


## The architecture: early fusion, hybrid attention, and an efficient dense interface

*Figure: Falcon Perception inference overview.*

A single autoregressive Transformer processes a unified sequence of image patches, text, and task tokens. The model predicts object properties in a fixed order: `<coord>` → `<size>` → `<seg>`. Bounding-box coordinates and sizes are decoded via specialized heads and re-injected as Fourier features. High-resolution segmentation masks are generated by a dot product between the `<seg>` token and upsampled image features.
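To make the interface concrete, here is an illustrative layout of one such unified sequence. The literal token strings are placeholders for exposition, not the model's actual vocabulary:

```python
# Illustrative layout of the unified sequence for one query (placeholder
# strings, not the real tokenizer vocabulary): image patches first, then
# the text prompt, then one <coord>/<size>/<seg> triplet per predicted
# instance, ending when the model emits a stop token.
sequence = (
    [f"<img_{i}>" for i in range(256)]   # flattened image patch tokens
    + ["segment", "red", "car"]          # natural-language prompt
    + ["<coord>", "<size>", "<seg>"]     # instance 1
    + ["<coord>", "<size>", "<seg>"]     # instance 2
    + ["<eos>"]                          # variable-length output stops here
)
print(len(sequence))  # 266 tokens in this toy example
```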

### One Backbone, Two Behaviors

At its core, Falcon Perception is a dense Transformer that processes image patches and text tokens in a **shared parameter space** from the first layer. Instead of a separate vision backbone followed by a late-fusion decoder, we keep a single backbone and rely on masking and a lightweight output interface to make the dense prediction problem tractable.

Images and text have different structure: pixels are 2D and benefit from bidirectional context, while the prediction interface is naturally sequential. We address this with a **hybrid attention mask**:

  • **Image tokens** attend to all other image tokens bidirectionally, building a global visual context (as a vision encoder would).
  • **Text and task tokens** attend causally to everything before them — the full visual prefix plus preceding text.

This allows the same backbone to behave like a bidirectional visual encoder on image tokens, while still supporting autoregressive prediction over task tokens.
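In the released stack the attention patterns are expressed with PyTorch's FlexAttention; purely as an illustration, the same mask can be written densely as a causal mask whose image-image block is made fully visible:

```python
import torch

def hybrid_attention_mask(num_image_tokens: int, seq_len: int) -> torch.Tensor:
    """Boolean mask (True = may attend). Image tokens attend bidirectionally
    among themselves; text/task tokens attend causally to the full prefix."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal
    mask[:num_image_tokens, :num_image_tokens] = True  # bidirectional image block
    return mask

# Example: 4 image patches followed by 3 text/task tokens.
print(hybrid_attention_mask(num_image_tokens=4, seq_len=7).int())
```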

### Chain-of-Perception: coarse-to-fine supervision for dense outputs

Dense perception is not a fixed-size prediction problem: an image may contain zero instances or hundreds. Autoregressive generation gives a clean variable-length interface, but fully autoregressive dense generation (e.g., polygons or high-resolution masks token-by-token) quickly becomes expensive.

We use a small structured interface, **Chain-of-Perception**, which decomposes each instance into three steps:

<coord> → <size> → <seg>
  1. **Coordinate token**: The model first predicts the center of the instance — resolving which object it’s talking about.
  2. **Size token**: Then the spatial extent — resolving how big it is.
  3. **Segmentation token**: Finally, a single embedding that, when dot-producted with upsampled image features, produces a full-resolution binary mask.

This ordering is deliberate. Committing to geometry first reduces ambiguity (“which instance?”), and makes the mask prediction step closer to pixel refinement conditioned on the resolved object.

### Specialized Heads, Minimal Overhead

The backbone is shared, while decoding uses lightweight heads tailored to the output type:

  • **Coordinate & Size Heads** use Fourier feature encoding: mapping continuous coordinates through a random Gaussian projection into a high-dimensional sinusoidal space. This overcomes the spectral bias of neural networks, yielding more precise localization than discrete binning alone. Decoded coordinates are re-injected into the sequence as conditioning for subsequent tokens.
  • **Segmentation Head** computes a dot product between the `<seg>` token’s hidden state and content-aware upsampled image features. Because the `<seg>` token is produced after geometry and has access to early-fused visual context, we can avoid the separate mask-query machinery and Hungarian matching that often appear in decoder-based instance segmentation training. A minimal sketch of both heads follows this list.
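Here is a self-contained sketch of both ideas with toy dimensions; the projection scale, feature width, and zero threshold are illustrative assumptions, not the released hyperparameters:

```python
import torch

# Fourier feature encoding of a continuous coordinate: a fixed random
# Gaussian projection followed by sin/cos, lifting (x, y) into a
# high-dimensional sinusoidal space.
proj = torch.randn(2, 128) * 10.0              # fixed at init; scale is a guess

def fourier_features(xy: torch.Tensor) -> torch.Tensor:
    angles = 2 * torch.pi * xy @ proj          # (..., 128)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., 256)

center = torch.tensor([[0.42, 0.77]])          # decoded (x, y) center in [0, 1]
conditioning = fourier_features(center)        # re-injected into the sequence

# Segmentation as a dot product: the <seg> token's hidden state against
# upsampled per-pixel image features, thresholded into a binary mask.
seg_token = torch.randn(256)                   # hidden state of the <seg> token
pixel_feats = torch.randn(256, 512, 512)       # upsampled features (C, H, W)
mask_logits = torch.einsum("c,chw->hw", seg_token, pixel_feats)
binary_mask = mask_logits > 0                  # (512, 512) instance mask
```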

## PBench: a benchmark designed to isolate what is missing

Existing referring-expression benchmarks like RefCOCO are saturated — models routinely hit 90%+ — and they conflate what went wrong. Did the model fail because it can’t read text? Can’t understand spatial relationships? Can’t handle a crowd?

We introduce **PBench**, a diagnostic benchmark that separates samples by the dominant capability required:

| Level | Capability | Example Prompt |
|---|---|---|
| L0 | Simple objects | “car” |
| L1 | Attributes & subtypes | “red car”, “broken fence” |
| L2 | OCR-guided identification | “Diet Coke bottle”, “Nike shoes” |
| L3 | Spatial understanding | “car on the left”, “third window from left” |
| L4 | Relations & interactions | “person holding umbrella”, “tallest building” |
| Dense | Crowdedness stress test | Hundreds of instances per image |

Each sample targets one dominant capability: OCR prompts avoid spatial qualifiers, and spatial prompts avoid in-image text disambiguators. This yields a capability profile rather than a single opaque score, and makes it easier to decide where to invest next (data, training curriculum, or post-training).
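Because the point of PBench is the per-capability breakdown, scoring reduces to grouping per-sample results by level. A trivial sketch (the record format here is ours for illustration, not the released schema):

```python
from collections import defaultdict

def capability_profile(results: list[tuple[str, float]]) -> dict[str, float]:
    """Average per-sample scores within each level, yielding the capability
    profile PBench reports instead of one opaque number."""
    by_level = defaultdict(list)
    for level, score in results:
        by_level[level].append(score)
    return {lvl: sum(s) / len(s) for lvl, s in sorted(by_level.items())}

print(capability_profile([("L0", 0.9), ("L1", 0.7), ("L1", 0.6), ("L3", 0.5)]))
# {'L0': 0.9, 'L1': 0.65, 'L3': 0.5}
```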


## Training: distillation, large-scale data, and a three-stage recipe

### Multi-Teacher Distillation

Rather than training from random weights (which in our ablations was unstable for segmentation), Falcon Perception initializes via **multi-teacher distillation**. Two strong vision teachers contribute complementary signals:

  • **DINOv3** (ViT-H): strong local features critical for segmentation
  • **SigLIP2**: language-aligned features for open-vocabulary understanding

The distilled initialization achieves 74.25% zero-shot accuracy on ImageNet-1k and 85.11% linear-probe mIoU on Pascal VOC, providing a strong visual foundation before perception-specific training.
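The exact distillation objective is in the tech report; as a hedged sketch, a common shape for this kind of multi-teacher feature regression is to project student patch features into each teacher's space and regress onto both targets. The projection heads and cosine loss below are illustrative assumptions, not the released recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_teacher_loss(student: torch.Tensor,
                       dino_targets: torch.Tensor,
                       siglip_targets: torch.Tensor,
                       to_dino: nn.Module,
                       to_siglip: nn.Module) -> torch.Tensor:
    """Regress projected student features onto both teachers: DINOv3 for
    local structure, SigLIP2 for language-aligned semantics."""
    l_dino = 1 - F.cosine_similarity(to_dino(student), dino_targets, dim=-1).mean()
    l_siglip = 1 - F.cosine_similarity(to_siglip(student), siglip_targets, dim=-1).mean()
    return l_dino + l_siglip

# Toy shapes: 196 patches; student width 768, teacher widths are guesses.
student = torch.randn(196, 768)
loss = multi_teacher_loss(student,
                          torch.randn(196, 1280), torch.randn(196, 1152),
                          nn.Linear(768, 1280), nn.Linear(768, 1152))
```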

### Data: 54M Images, 195M Positive Expressions, 488M Hard Negatives

We build the training set through a multi-stage pipeline:

  1. **Hierarchical clustering** of web-scraped images via DINOv3 embeddings to ensure uniform concept coverage.
  2. **VLM-driven listing** generates dense object descriptions per image, categorized by PBench complexity level (60% basic, 40% advanced).
  3. **Negative mining** produces semantic, visual, and fine-grained hard negatives to combat hallucination.
  4. **Ensemble consensus** — SAM 3, Qwen3-VL-30B, and Moondream3 must agree (IoU > 0.8) for automatic acceptance.
  5. **Human verification** — disagreements go to annotators, recovering hard samples that confuse automated systems.

We maintain a strict 1:1 ratio of positive to negative samples. This makes presence calibration a first-class target: the model should reliably say “absent,” not only draw masks when confident.

### The Three Stages (700 GT Total)

**Stage 1 — In-Context Listing (450 GT):** The model learns to autoregressively list scene inventories — predicting text expressions and their locations. Full causal attention between queries enables learning of object co-occurrence (“fork, then knife, then plate”). This builds broad scene understanding.

**Stage 2 — Task Alignment (225 GT):** The attention mask is modified so queries can no longer see each other, simulating independent queries at inference time. Loss on text tokens is masked, focusing gradient signal entirely on presence classification and localization. This stage transitions from “scene understanding” to “answer this specific question.”

**Stage 3 — Long-Context Finetuning (10 GT):** A short phase with the mask limit raised to 600 per expression and a minimal constant learning rate. This adapts the model for extreme crowd density without forgetting earlier capabilities.

Key design choices validated through ablations:

  • **Muon optimizer** for the specialized heads (vs. AdamW) — yields +4.8 points on SA-Co detection
  • **Raster ordering** of instances (vs. random/size) — +10 points over random ordering on SA-Co
  • **Gram feature regularization** — prevents drift from the distillation features, improving segmentation by +1.5 points
  • **Global loss normalization** across ranks — corrects bias from variable-length packed sequences in FSDP (see the sketch below)
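On the last point: with packed variable-length sequences, a per-rank mean loss weights tokens unevenly across ranks. A sketch of the fix, assuming gradients are summed across ranks (if your wrapper averages them instead, scale by the world size):

```python
import torch
import torch.distributed as dist

def globally_normalized_loss(token_losses: torch.Tensor) -> torch.Tensor:
    """Divide each rank's summed token loss by the *global* token count so
    every token contributes equally, regardless of which rank holds it."""
    local_count = torch.tensor(float(token_losses.numel()),
                               device=token_losses.device)
    dist.all_reduce(local_count, op=dist.ReduceOp.SUM)  # global token count
    return token_losses.sum() / local_count

# Inside the training step, after computing per-token cross-entropy:
# loss = globally_normalized_loss(per_token_ce)
# loss.backward()
```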

## Results

### SA-Co: Best-in-Class Mask Quality

On the SA-Co open-vocabulary segmentation benchmark, Falcon Perception (0.6B parameters) achieves **68.0 Macro-F1**, compared to **62.3** for SAM 3, with large gains on the attribute-heavy (+8.2), food & drink (+12.2), and sports equipment (+4.0) splits. At the same time, Falcon Perception lags SAM 3 on presence calibration (MCC: 0.64 vs 0.82), which is the clearest remaining improvement axis.
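For reference, the presence-calibration numbers above are Matthews correlation coefficients over present/absent decisions; MCC is computed from the binary confusion counts:

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient: +1 is perfect agreement,
    0 is chance level, -1 is total disagreement."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(round(mcc(tp=80, tn=70, fp=30, fn=20), 2))  # toy counts, not benchmark data
```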

Here’s an example output — the prompt *“Falcon”* produces precise instance masks:

*Figure: instance masks produced for the prompt “Falcon”.*

Falcon Perception also performs well on referring expressions, correctly segmenting the burger with a black bun in each frame of the video:

*Figure: referring-expression segmentation across video frames.*

### PBench: Scaling with Prompt Complexity

This is where the early-fusion design shows the largest differences:

| Capability | SAM 3 | Falcon Perception | Gap |
|---|---|---|---|
| L0: Simple objects | 64.3 | 65.1 | +0.8 |
| L1: Attributes | 54.4 | 63.6 | +9.2 |
| L2: OCR-guided | 24.6 | 38.0 | +13.4 |
| L3: Spatial | 31.6 | 53.5 | +21.9 |
| L4: Relations | 33.3 | 49.1 | +15.8 |
| Dense | 58.4 | 72.6 | +14.2 |

On simple objects, the gap is modest. As prompts become more compositional — requiring OCR-guided disambiguation, spatial constraints, or relational binding — the gap widens.

In our PBench Dense split, Falcon Perception (0.6B) substantially outperforms generalist VLM baselines (e.g., 72.6 vs 8.9 for Qwen3-VL-30B in our evaluation setup), and matches or exceeds the 8B model on spatial and relational tiers.

### Qualitative Results: OCR, Spatial, Relational, and Dense

As prompts grow more compositional — requiring OCR-guided disambiguation, spatial constraints, relational binding, or scaling to hundreds of instances — the early-fusion advantage becomes visually clear:

  • **OCR-Guided Grounding (Level 2)**: When the distinguishing signal is text written on an object, Falcon Perception reads it correctly while SAM 3 cannot differentiate.
  • **Spatial Understanding (Level 3)**: When prompts specify spatial relationships, Falcon Perception forms a coherent 2D scene map.
  • **Relational Reasoning (Level 4)**: When the target is defined through interactions rather than appearance, Falcon Perception understands the scene graph.
  • **Dense Scenes — Scaling to Hundreds of Instances**: The autoregressive interface is particularly useful when scenes are extremely crowded, where fixed-query decoders can run into practical limits.

**Level 2 — OCR-Guided Grounding:** Falcon Perception reads text on objects to disambiguate; SAM 3 cannot.

“168 wine bottles”: Falcon Perception identifies the bottles labeled “168”, while SAM 3 highlights every bottle. “Honolulu direction sign”: Falcon reads the text to find the right sign.

**Level 3 — Spatial Understanding:** Falcon Perception resolves spatial constraints; SAM 3 returns false positives.

“Lower meat skewer on left grill,” “black car to the right of red car at bottom,” “Belgian flag on the left” — Falcon Perception resolves the correct instance from spatial constraints. SAM 3 predicts false positives for multiple candidates.

**Level 4 — Relational Reasoning:** Falcon Perception understands interactions; SAM 3 ignores relational constraints.

“Pastry next to brown round bread,” “person using phone,” “person holding helmet in hand” — Falcon Perception identifies the interacting instance. SAM 3 highlights all instances of the object class, ignoring the relational constraint.

**Dense Scenes:** Falcon Perception scales to hundreds of instances; SAM 3’s decoder runs out of query tokens.

“Snow goose,” “pigeon,” “colorful canned drinks” — Falcon Perception autoregressively segments hundreds of instances. SAM 3’s fixed-size decoder runs out of query tokens beyond ~200 instances.


## Falcon OCR: extending early fusion to document understanding

Modern OCR has moved well beyond extracting text from clean scans. Today’s systems must handle multi-column layouts, mathematical formulas, tables, charts, and multilingual content — all in one pass. Most competitive OCR VLMs tackle this with a familiar recipe: a vision encoder feeding a separate text decoder, plus task-specific glue. These systems work, but they tend to be large (1B–3B+ parameters).

We took a different path: **reuse the same early-fusion dense Transformer** from Falcon Perception, but train a smaller **0.3B-parameter** variant from scratch specifically for OCR. The result is **Falcon OCR** — a single backbone that processes image patches and text tokens in a shared parameter space with the same hybrid attention mask (bidirectional for image tokens, causal for text tokens), and switches tasks through prompts rather than additional modules.

We trained from scratch (no multi-teacher distillation) because the visual features OCR needs — fine-grained glyph recognition, stroke-level discrimination — differ substantially from the object-level features useful for segmentation. Starting fresh lets the backbone develop text-optimized representations from the ground up.

### Training

We train on a curated English-language mixture spanning three core tasks: general document text parsing (digital PDFs, old scans, typewritten documents), mathematical and scientific formula recognition, and table structure recognition. The mixture also includes handwriting, real-world scene text, and synthetic samples generated from rendered LaTeX and HTML sources. The training objective is pure next-token prediction on structured text outputs.

Training proceeds in two phases: a long **pre-training** phase at a constant learning rate, where the model learns core OCR capabilities across all element types, followed by a short **cosine-decay finetuning** phase where the learning rate is annealed to near zero.
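A minimal sketch of such a schedule (the base rate and phase split are hypothetical values, not our training configuration):

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 3e-4,
          decay_frac: float = 0.1) -> float:
    """Constant learning rate for the long pre-training phase, then a short
    cosine decay to near zero for the finetuning phase."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < decay_start:
        return base_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

print([round(lr_at(s, 100), 6) for s in (0, 50, 90, 95, 99)])
```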

### Benchmark results

We evaluate on **olmOCR** (binary correctness checks across diverse inputs) and **OmniDocBench** (continuous metrics over full-page parses). All comparison models are significantly larger and/or use proprietary infrastructure. At 80.3% on olmOCR with only 0.3B parameters, Falcon OCR is within 1.7 points of the top system and leads all models on **Multi-Column** (87.1%) and **Tables** (90.3%). On OmniDocBench it scores 88.64 overall, ahead of DeepSeek OCR v2, GPT 5.2, and Mistral OCR 3.

### Serving throughput

At 0.3B parameters, Falcon OCR is roughly **3x smaller** than 0.9B-class OCR VLMs, which translates directly into higher serving throughput. Measured on a single A100-80GB with vLLM at high concurrency:

| Mode | tok/s | img/s | Description |
|---|---|---|---|
| Layout + OCR | 5,825 | 2.9 | Full pipeline: layout detection → crop → per-region OCR |

The compact footprint and vLLM integration (continuous batching, PagedAttention, optimized CUDA kernels) make it practical for large-scale document digitization where millions of pages need processing.
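Since Falcon OCR ships with vLLM integration, offline usage should follow vLLM's standard multimodal flow. The model ID, prompt template, and sampling settings below are illustrative guesses; check the GitHub repo for the exact invocation:

```python
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="tiiuae/Falcon-OCR")            # hypothetical model id
page = Image.open("page.png")                   # a scanned document page

outputs = llm.generate(
    {"prompt": "<image>\nExtract all text.",    # placeholder prompt format
     "multi_modal_data": {"image": page}},
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```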

### What we see in the results

More broadly, these results suggest that the early-fusion single-stack Transformer is a viable alternative to the “vision encoder plus text decoder” recipe for OCR: one backbone, a shared parameter space, and one decoding interface, with gains coming from better data and training signals rather than increasingly complex pipelines. We hope this encourages more work in this direction.

### Qualitative examples

Falcon OCR processes images captured under challenging real-world conditions with varying lighting, diverse text semantics (mathematical formulae, structured tables, handwritten notes), and complex document layouts, to produce structured text output.


**Handwriting and Real-world Images:** Accurate transcription of handwritten text and in-the-wild captures under adverse conditions.

Falcon OCR extracts text from handwritten documents and real-world photographs with variable lighting, orientation, and content complexity.

**Table Extraction:** Faithful reproduction of tabular structure and cell content across diverse formats.

Falcon OCR accurately reproduces cell entries and structural layout from tables of varying formats and complexity.

**Mathematical Formulae:** Accurate recognition of equations across varying levels of symbolic complexity.

Falcon OCR correctly transcribes mathematical expressions ranging from simple equations to multi-line derivations with nested operators.

**Complex Document Layouts:** Faithful text extraction from multi-column, mixed-content documents.

Falcon OCR preserves reading order and structural fidelity when extracting text from documents with multi-column layouts, figures, and footnotes.


## Inference: Fast, Practical, and Open

The release includes an inference stack built on **PyTorch’s FlexAttention**, which makes it practical to express the custom attention patterns and efficiently serve packed variable-length sequences.

### Paged Inference Engine

  • **Paged KV cache** with virtual page tables (no wasted memory from padding)
  • **Continuous batching**: new sequences enter mid-generation, finished ones release pages immediately
  • **CUDA graph capture** for the decode loop
  • **Background tokenization** overlapped with GPU compute
  • **HR feature cache**: LRU cache with pinned-memory buffers for async GPU-CPU transfer of upsampled image features — subsequent queries on the same image skip the expensive upsampling step (a minimal sketch follows this list)
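The HR feature cache is conceptually simple. The sketch below uses an OrderedDict-based LRU keyed by an image identifier, leaving out the pinned-memory and async-transfer machinery; the capacity and key scheme are illustrative:

```python
from collections import OrderedDict
import torch

class HRFeatureCache:
    """LRU cache for upsampled image features: repeat queries on the same
    image skip the expensive upsampling step entirely."""
    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self._cache: OrderedDict[str, torch.Tensor] = OrderedDict()

    def get_or_compute(self, image_id: str, compute) -> torch.Tensor:
        if image_id in self._cache:
            self._cache.move_to_end(image_id)    # mark as most recently used
            return self._cache[image_id]
        feats = compute()                        # expensive upsampling path
        self._cache[image_id] = feats
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)      # evict least recently used
        return feats

cache = HRFeatureCache()
feats = cache.get_or_compute("img-001", lambda: torch.randn(256, 512, 512))
```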

In our setup on an H100, typical latencies are on the order of ~100ms prefill, ~200ms upsampling (0ms if cached), and ~50ms decode for a handful of instances. (These numbers depend on resolution, sequence length, and the number of predicted instances.)

### Docker and MLX Integration for Falcon-OCR

For the Falcon-OCR model, we also provide a **vLLM Docker server** for fast deployment and **MLX** integration for Apple silicon.

Please check out the GitHub repo for details.


## The Bigger Picture: A “Bitter Lesson” for Perception

Falcon Perception is intentionally minimal: one backbone, one objective family, and small heads only where outputs are continuous and dense. The working assumption is that **most gains should come from data, compute, and training signals**, rather than continually expanding the pipeline with specialized modules.

The architecture doesn’t block any obvious scaling path: add more images and harder prompts for better grounding, mix in text-only data for better language, increase context length for denser scenes. It’s still just one sequence model.

Falcon Perception is developed by the Falcon Vision Team at the Technology Innovation Institute (TII), Abu Dhabi, UAE.

## Citation

If you use Falcon-Perception, please cite:

@article{bevli2026falcon,
  title   = {Falcon Perception},
  author  = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
  journal = {arXiv preprint arXiv:2603.27365},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.27365}
}
