@DailyDoseOfDS_: Fine-tune DeepSeek-OCR on your own language! (100% local) Most vision models treat documents as massive sequences of to…
Summary
DeepSeek-OCR is a 3B vision model using context optical compression for efficient document processing. Fine-tuning it on Persian text using Unsloth achieved an 88.26% improvement in character error rate, all open-source and runnable on a single GPU.
View Cached Full Text
Cached at: 06/08/26, 03:26 PM
Fine-tune DeepSeek-OCR on your own language!
(100% local)
Most vision models treat documents as massive sequences of tokens, making long-context processing expensive and slow.
DeepSeek-OCR uses context optical compression to convert 2D layouts into vision tokens, enabling efficient processing of complex documents.
It is a 3B-parameter vision model that achieves 97% precision while using 10x fewer vision tokens than text-based LLMs.
In fact, you can easily fine-tune it for your specific use case on a single GPU.
We used Unsloth to run this experiment on Persian text and saw an 88.26% improvement in character error rate.
↳ Base model: 149% character error rate (CER) ↳ Fine-tuned model: 60% CER (57% more accurate) ↳ Training time: 60 steps on a single GPU
Persian was just the test case. You can swap in your own dataset for any language, document type, or specific domain you’re working with.
We’ve shared the complete guide in the next tweet, which includes the code, notebooks, and environment setup ready to run with a single click.
Everything is 100% open-source!
Tech Stack:
- @UnslothAI to run and fine-tune the model
- @LightningAI environments for hosting and deployment
Find the code and environment setup here:
Similar Articles
@Saboo_Shubham_: OPEN SOURCE AI is killing it. DeepSeek v4 Flash is a quasi-frontier model with a massive 1M context window. It can LOCA…
The article highlights DeepSeek v4 Flash as a quasi-frontier open-source model with a 1M context window, noting its ability to run locally on a 128GB Mac using 2-bit quantization.
@techNmak: A lightweight VLM that beats the giants at OCR. (1.7B parameters, SOTA on OmniDocBench) dots. ocr is a new multilingual…
dots.ocr is a new lightweight 1.7B parameter multilingual vision-language model that achieves state-of-the-art performance on OmniDocBench, outperforming much larger models (72B+) at document parsing and OCR tasks.
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
PaddleOCR-VL is a compact 0.9B vision-language model that achieves state-of-the-art performance in multilingual document parsing and element recognition by integrating NaViT-style dynamic resolution with the ERNIE language model.
Building a Fast Multilingual OCR Model with Synthetic Data
NVIDIA introduces Nemotron OCR v2, a fast multilingual OCR model built using synthetic data generation. The model achieves 34.7 pages/second on a single A100 GPU by using a unified FOTS-based architecture with feature reuse across detection, recognition, and relational components.
I have (even faster) DeepSeek V4 Pro at home
A user reports successfully running the DeepSeek V4 Pro model locally using ktransformers and sharing detailed benchmark results across various context depths, demonstrating improved inference speeds.