@berryxia: Wow, this move directly poached DeepSeek's talent! Last night I saw this interesting OCR open-source model on HuggingFace and the fascinating story behind it. This OCR model is completely different from traditional ones! Its speed and accuracy are absolutely unbeatable~~ Let me start with some background, for those who are familiar…

Summary

Baidu has open-sourced the Unlimited OCR model, which uses the R-SWA attention mechanism to process hundreds of pages in a single pass without page splitting, with a constant KV Cache. The model innovatively mimics the attention pattern of humans copying books by hand and shares technical lineage with DeepSeek OCR, sparking discussions about talent mobility.

Cached at: 06/23/26, 02:10 PM

Whoa, this move directly poaches from DeepSeek’s “talent pool”!

Last night, I saw HuggingFace highlighting this interesting open-source OCR model along with the fascinating story behind it.

This OCR model is completely different from traditional OCR models! Its speed and accuracy are simply unbeatable.

First, some background. As regular readers know, I’ve run several OCR evaluations recently (check my earlier posts), testing 18 documents across 6 scenarios and building local workflows. So I have some feel for OCR’s capabilities.

The biggest headache in those evaluations wasn’t accuracy — it was the workflow for multi-page documents. Every model processes pages one by one.

Each page clears memory, then an external scheduler stitches the results together. It’s essentially a for-loop, not true long-range understanding.
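
The page-by-page workflow described above can be sketched roughly like this. It is only an illustration: `ocr_page` is a hypothetical stand-in for whatever single-page OCR call a pipeline uses, not a function from any real library.

```python
from typing import Callable, List


def ocr_page(page_image: str) -> str:
    """Hypothetical single-page OCR call; it only returns a placeholder string."""
    return f"<text of {page_image}>"


def parse_document_page_by_page(
    pages: List[str], ocr: Callable[[str], str] = ocr_page
) -> str:
    results = []
    for page in pages:
        # Every call starts from scratch: nothing learned on earlier pages
        # (running headers, a table continued from the previous page) carries over.
        results.append(ocr(page))
    # The "external scheduler" step is just concatenation of independent outputs.
    return "\n".join(results)


if __name__ == "__main__":
    print(parse_document_page_by_page(["page1.png", "page2.png"]))
```

The point of the sketch is what is missing: there is no state shared between iterations, so anything that spans a page boundary has to be repaired after the fact.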

But Baidu’s newly open-sourced Unlimited OCR takes a completely different approach. It doesn’t process page by page.

A single inference run transcribes dozens of pages.

The core value proposition is simple: One-Shot Long-Horizon Parsing. Here “one-shot” refers to one inference run over the whole document, not to one-shot or few-shot learning: the model reads a long input once and writes out the full transcription.

You can just throw in an image or a multi-page PDF and parse it all at once — no need to chop it into pieces and run repeatedly. This is genuinely awesome!

Apparently, the model’s inspiration is interesting. When humans copy a book, they don’t memorize the entire book.

They focus on only three things: the original text, the few words just written, and the next words to write. Earlier content naturally fades out. Recent context tracks progress. This everyday behavior reveals an attention pattern very different from current models.

Unlimited OCR’s core mechanism, R-SWA (Reference Sliding Window Attention), simulates this process.

Every output token can attend to the full image, but on the output side the model keeps only the previous 128 states. With 32K context, it processes dozens of pages in a single inference run. The KV Cache size remains constant; it doesn’t grow with document length.
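
To make the attention pattern concrete, here is a minimal sketch of a mask with that shape. It is built only from the description in this post, not from the model's code. The layout (reference tokens first, then output tokens), the choice to count the current token inside the window, and the names `rswa_mask`, `n_ref`, `n_out` and `window` are all assumptions for illustration.

```python
from typing import List


def rswa_mask(n_ref: int, n_out: int, window: int = 128) -> List[List[bool]]:
    """Row i is output token i. Columns are n_ref reference (image) tokens
    followed by n_out output tokens. True means "may attend"."""
    mask = []
    for i in range(n_out):
        # Every output token sees all reference tokens.
        ref_part = [True] * n_ref
        # On the output side, only the last `window` positions up to and
        # including i are visible (causal sliding window, an assumed convention).
        lo = max(0, i - window + 1)
        out_part = [lo <= j <= i for j in range(n_out)]
        mask.append(ref_part + out_part)
    return mask


if __name__ == "__main__":
    # Tiny sizes so the pattern is readable: 'x' = visible, '.' = masked.
    for row in rswa_mask(n_ref=4, n_out=6, window=2):
        print("".join("x" if v else "." for v in row))
```

Under these assumptions, each row has at most `n_ref + window` visible positions no matter how large `n_out` gets, which is the property the post attributes to R-SWA.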

This essentially pushes OCR from a character recognition tool toward a document understanding engine. People used to think long documents must be chunked.

Now it’s becoming clear: as long as context is long enough and the model is strong enough, going end-to-end in one shot is more efficient and accurate.

The technical report is also written in a very engaging, story-driven, and bold style. It has an explorer’s vibe — previously a hallmark of DeepSeek’s technical reports.

And then things get interesting.

I looked up the key contributors listed in the technical report. Of the three people, two use real names. But the technical director goes by a two-letter abbreviation: YY. Who is YY?

Let’s trace back.

The GitHub acknowledgments list DeepSeek-OCR and DeepSeek-OCR-2 in the top two spots. DeepEncoder was originally introduced in DeepSeek OCR.

This Unlimited OCR happens to perfectly integrate that high-compression encoder.

The mentions of DeepSeek OCR in the report don’t sound like benchmarking against a competitor. They sound more like reflecting on and improving their own previous research.

The OCR community in China is not that large. The number of people who could achieve a breakthrough like R-SWA and also have hands-on familiarity with DeepSeek OCR’s architecture can be counted on one hand.

Let’s look at another detail.

On April 24, 2026, DeepSeek-V4 was officially released. At the end of the 58-page technical report, nearly 300 names are listed in alphabetical order.

Ten of those names have a small asterisk next to them: “Resigned.” From the second half of 2025 to early 2026 — less than half a year — five people left DeepSeek.

Where did they go? Who is YY? The report doesn’t say directly, but the more you read, the more the answer seems to lie between the lines.

It’s also clear that Baidu’s recent path has changed. Keep in mind that their OCR has always been the strongest, with almost no competition!

From PaddleOCR to this Unlimited OCR, you can feel them moving in a more cutting-edge direction.

With this pace of updates, this talent pipeline, and this direction, the future looks promising.

Putting the gossip aside and focusing on the tech: the end-to-end long-document OCR direction is definitely right.

The project and model are open-source. If interested, check the links in the comments.

Berryxia.AI (@berryxia): This speed is insane! Whoa!

The newly open-sourced Unlimited-OCR can process hundreds of pages in one shot, and the speed remains stable.

This model comes from Baidu’s recent release on Hugging Face. Its core innovation is R-SWA (Reference Sliding Window Attention).

It keeps the KV Cache constant during decoding, preventing explosive growth as the number of document pages increases.
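
A rough way to picture a cache that stops growing during decoding is a fixed block of reference entries plus a bounded queue of recent output entries. The sketch below is an assumption-laden toy, not the model's implementation: `BoundedKVCache`, `cache_sizes` and the placeholder entries are hypothetical, and how the real model stores keys and values is not described in this post.

```python
from collections import deque
from typing import List


class BoundedKVCache:
    """Toy cache: reference entries are kept for the whole run,
    output entries live in a window of fixed maximum length."""

    def __init__(self, n_ref: int, window: int = 128) -> None:
        # Placeholder entries standing in for the image tokens' keys and values.
        self.ref = [("ref", i) for i in range(n_ref)]
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.recent = deque(maxlen=window)

    def append(self, kv) -> None:
        self.recent.append(kv)

    def __len__(self) -> int:
        return len(self.ref) + len(self.recent)


def cache_sizes(n_ref: int, window: int, n_steps: int) -> List[int]:
    """Record the cache size after each decoded token."""
    cache = BoundedKVCache(n_ref, window)
    sizes = []
    for step in range(n_steps):
        cache.append(("out", step))
        sizes.append(len(cache))
    return sizes


if __name__ == "__main__":
    print(cache_sizes(n_ref=10, window=4, n_steps=8))
```

In this toy version the size climbs until the window fills and then stays at `n_ref + window`, however many tokens are decoded. A plain causal cache would instead add one entry per decoded token.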

Similar Articles

@GoSailGlobal: Current OCR processes multi-page documents page by page. Every time you turn a page, memory is reset. Today, Baidu quietly open-sourced a model on GitHub and HuggingFace called Unlimited OCR, inspired by how humans copy books: - When copying a book, you don't reread hundreds of pages every time you write a word...

X AI KOLs Timeline

Baidu has open-sourced the Unlimited OCR model, which uses a Reference Sliding Window Attention (R-SWA) mechanism to parse documents up to 32K context in a single pass, eliminating the need for page-by-page inference.

@geekbb: Baidu's open-source visual language model OCR project, upgraded from DeepSeek-OCR, focuses on one-shot parsing of extremely long documents. The model has two inference modes: 'gundam' mode for dense text in a single image, and 'base' mode for multi-page or PDF processing. https://github…

X AI KOLs Timeline

Baidu has open-sourced the visual language model Unlimited-OCR, upgraded from DeepSeek-OCR, supporting one-shot parsing of extremely long documents, offering two inference modes: gundam (dense text in a single image) and base (multi-page/PDF).

@berryxia: https://x.com/berryxia/status/2067078380017828205

X AI KOLs Timeline

The author tested the three tiers of PP-OCRv6 models and provided open-source tools for local deployment. They demonstrated performance comparisons of each model on OmniDocBench and real-world scenarios, emphasizing the advantages of lightweight specialized models for OCR tasks.

@rionaifantasy: Unbelievable! How Can a 34.5M Parameter OCR Beat a 235B Large Model? Let me tell you something ridiculous: I used to believe the future of OCR would inevitably be devoured by ever-larger multimodal large models. But after seeing PP-OCRv6 released by Baidu Wenxin, I've changed my mind. Because it doesn't follow the path of "continuing to pile on parameters..."

X AI KOLs Timeline

Baidu Wenxin releases PP-OCRv6, offering three model tiers: Tiny, Small, and Medium, supporting over 50 languages. The Tiny version is only 1.5MB and can run locally in a browser, with the fastest single-image inference at 97ms, proving that small specialized models can outperform large models on OCR tasks.