@DailyDoseOfDS_: Fine-tune DeepSeek-OCR on your own language! (100% local) Most vision models treat documents as massive sequences of to…


Summary

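DeepSeek-OCR is a 3B-parameter vision model that uses context optical compression to process documents efficiently. Fine-tuning it on Persian text with Unsloth cut the character error rate (CER) from 149% to 60%. The reported 88.26% improvement matches that drop measured in percentage points rather than as a relative reduction. The whole workflow is open-source and runs on a single GPU.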


Cached at: 06/08/26, 03:26 PM

Fine-tune DeepSeek-OCR on your own language!

(100% local)

Most vision models treat documents as massive sequences of tokens, making long-context processing expensive and slow.

DeepSeek-OCR uses context optical compression to convert 2D layouts into vision tokens, enabling efficient processing of complex documents.

It is a 3B-parameter vision model that achieves 97% precision while representing a document with roughly 10x fewer vision tokens than the text tokens a text-based LLM would need for the same content.
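To make the "10x fewer tokens" idea concrete, here is a minimal sketch of the arithmetic. The function name and the token counts are made-up illustration values, not measurements from DeepSeek-OCR.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return how many text tokens each vision token stands in for."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Hypothetical page: 1,000 text tokens if fed to a text-based LLM,
# versus 100 vision tokens after the page is encoded as an image.
ratio = compression_ratio(text_tokens=1000, vision_tokens=100)
print(f"{ratio:.1f}x fewer tokens")
```

The higher this ratio, the cheaper a long document is to process, at the cost of some decoding precision.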

In fact, you can easily fine-tune it for your specific use case on a single GPU.

We used Unsloth to run this experiment on Persian text and saw an 88.26% improvement in character error rate (CER), measured as the drop in CER in percentage points.
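For readers new to the metric: CER is conventionally the character-level edit distance between the model output and the reference text, divided by the reference length. The thread does not say which CER implementation was used, so the sketch below is a plain-Python version of the standard definition, not the authors' evaluation code. It also shows why CER can exceed 100%: a model that emits a lot of extra text needs more edits than the reference has characters.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


perfect = cer("سلام", "سلام")        # identical strings
partial = cer("kitten", "sitting")   # 3 edits over 6 reference characters
runaway = cer("abc", "abcdefgh")     # 5 spurious characters over 3

print(f"{perfect:.0%} {partial:.0%} {runaway:.0%}")  # prints "0% 50% 167%"
```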

↳ Base model: 149% character error rate (CER)
↳ Fine-tuned model: 60% CER (57% more accurate)
↳ Training time: 60 steps on a single GPU
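The thread quotes more than one improvement figure, so it helps to separate an absolute drop (percentage points) from a relative reduction. The sketch below uses only the rounded CER values from this thread. The 88.26 headline figure is close to the absolute drop and presumably comes from unrounded values. The sketch does not try to reproduce the 57% figure, whose formula the thread does not give.

```python
base_cer = 1.49    # 149% CER, rounded figure from the thread
tuned_cer = 0.60   # 60% CER, rounded figure from the thread

# Absolute drop: how many percentage points of CER were removed.
absolute_drop = base_cer - tuned_cer

# Relative reduction: the share of the original error that was removed.
relative_drop = absolute_drop / base_cer

print(f"absolute: {absolute_drop * 100:.0f} percentage points")  # 89
print(f"relative: {relative_drop:.1%}")                           # 59.7%
```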

Persian was just the test case. You can swap in your own dataset for any language, document type, or specific domain you’re working with.
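Swapping in your own data mostly means pairing each document image with its ground-truth text and holding out a slice for evaluation. The sketch below shows one way to do that in plain Python. The record schema (`image`, `prompt`, `target`), the instruction string, and the file names are all hypothetical. They are not the format Unsloth or DeepSeek-OCR expects, so adapt them to whatever the notebook you use requires.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class OcrSample:
    image_path: str  # path to a scanned page or cropped line
    text: str        # ground-truth transcription


def to_record(sample: OcrSample, instruction: str) -> Dict[str, str]:
    """Turn one sample into a hypothetical training record."""
    return {"image": sample.image_path, "prompt": instruction, "target": sample.text}


def split_records(records: List[Dict[str, str]], eval_fraction: float = 0.1,
                  seed: int = 0) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Shuffle deterministically and hold out a fraction for CER evaluation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Hypothetical dataset of ten page images with placeholder transcriptions.
samples = [OcrSample(f"pages/{i:03d}.png", f"transcription {i}") for i in range(10)]
records = [to_record(s, "Transcribe the text in this image.") for s in samples]
train, held_out = split_records(records)
print(len(train), len(held_out))  # prints "9 1"
```

Keeping a held-out split matters here because the CER numbers are only meaningful on pages the model did not see during fine-tuning.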

We’ve shared the complete guide in the next tweet, which includes the code, notebooks, and environment setup ready to run with a single click.

Everything is 100% open-source!

Tech Stack:

  • @UnslothAI to run and fine-tune the model
  • @LightningAI environments for hosting and deployment

Find the code and environment setup here:
