Introduction

OCR on documents remains a challenging task. While printed text can often be recognized with 95% or higher accuracy, real-world documents containing handwriting, non-standard layouts, and other irregularities are still much harder to read accurately. There are high-quality systems that split document reading into sub-tasks: OCR, layout analysis, structure recognition, and classification.
A recent trend is to use large vision-language models that solve the whole conversion task in one shot, while letting the user define additional specific tasks in the prompt. This post is about SmolDocling, a very small and compute-efficient vision-language model that performs both the conversion task and the instruction task written into the prompt. The post is based on the SmolDocling research paper.

Architecture

The SmolDocling model comes from the family of Hugging Face’s SmolVLM models. It was trained on datasets covering recognition of captions, charts, forms, code, equations, tables, footnotes, lists, page footers and headers, section headings, and plain text. SmolDocling performs OCR on these elements and recognizes both their type and location. This is the main task of SmolDocling: conversion and document understanding.

[Figure: SmolDocling architecture]

The red cube labeled Vision Encoder in the picture above is the image encoder used in SmolVLM models: SigLIP base patch-16/512 (93M parameters). It is one of Google’s CLIP-style image encoders, replacing the contrastive softmax loss of OpenAI’s CLIP with a sigmoid cross-entropy loss, a change that makes training more stable and improves image-text matching. Its superpowers are low memory use and fast inference, and those superpowers are exactly what a multimodal model like SmolDocling needs.
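For intuition, here is a minimal PyTorch sketch of the two objectives. It is illustrative only, not the actual SigLIP training code: embedding sizes, temperature, and bias values are simplified assumptions.

import torch
import torch.nn.functional as F

def clip_softmax_loss(img_emb, txt_emb, temperature=0.07):
    # CLIP-style contrastive loss: each image has to pick its matching text
    # out of the whole batch via a softmax over similarities (and vice versa).
    logits = img_emb @ txt_emb.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))              # matching pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

def siglip_sigmoid_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    # SigLIP-style loss: every image-text pair is an independent binary
    # match / no-match decision, so no batch-wide softmax normalization is needed.
    logits = img_emb @ txt_emb.T * temperature + bias    # (B, B)
    labels = 2 * torch.eye(logits.size(0)) - 1           # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

# Toy usage with random, L2-normalized embeddings
img = F.normalize(torch.randn(4, 256), dim=-1)
txt = F.normalize(torch.randn(4, 256), dim=-1)
print(clip_softmax_loss(img, txt).item(), siglip_sigmoid_loss(img, txt).item())

Because the sigmoid loss scores each pair independently, it avoids the batch-wide normalization of the softmax version, which is part of why it behaves well with smaller batches and less memory.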

Usage

We could use a bigger model than SmolDocling, or any large vision-language model, to quickly get higher accuracy, but at the cost of heavier inference and much higher compute usage. SmolDocling finds its niche in deployments on edge devices or in any resource-constrained setting. Another use case is quick prototyping and experimentation: it is usually better to start with a small, fast model and avoid the complexity that comes with size.
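For a quick start, the model can be loaded through the transformers library. The sketch below follows the usual SmolVLM-style chat pattern; the model id and the conversion prompt are taken from the SmolDocling preview model card and should be treated as assumptions to verify there.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Model id and prompt follow the published SmolDocling preview checkpoint;
# check the model card for the exact recipe.
MODEL_ID = "ds4sd/SmolDocling-256M-preview"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

image = Image.open("invoice_page.png")  # any document page image

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=4096)
doctags = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=False
)[0]
print(doctags)  # DocTags markup for the page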

DocTags

Another interesting thing is the output standard proposed with the SmolDocling model: DocTags. It was created for efficient inference and for training VLMs in a standardized way. HTML and Markdown are ambiguous and do not preserve document layout context. DocTags separates text content from document layout, which brings clarity. DocTags is also a clear and concise format that saves tokens, and thus speeds up both inference and training of VLMs. See the basic example:

HTML:

<h1>Invoice</h1><p>Customer Name: John Doe</p>

~20–25 tokens.

DocTags:

<heading>Invoice</heading><para>Customer Name: John Doe</para>

~12–15 tokens.
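Token counts like these depend entirely on the tokenizer. A quick way to compare is a small script like the one below; the tokenizer model id is just an example, and registering the DocTags tags as single special tokens is an assumption about how such a format is typically wired into a VLM tokenizer.

from transformers import AutoTokenizer

# Any tokenizer works for a rough comparison; this model id is only an example.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Assumption: DocTags tags are added as single special tokens,
# so each tag costs exactly one token.
tok.add_special_tokens({"additional_special_tokens": ["<heading>", "</heading>", "<para>", "</para>"]})

html = "<h1>Invoice</h1><p>Customer Name: John Doe</p>"
doctags = "<heading>Invoice</heading><para>Customer Name: John Doe</para>"

print(len(tok.encode(html)), len(tok.encode(doctags)))  # exact counts vary by tokenizer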

DocTags leverages the OTSL standard and its full vocabulary for tables. OTSL stands for Optimized Table Structure Language, a specialized markup language designed to encode table structure information. This choice also brings clarity and saves tokens.
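As a rough illustration (a simplified sketch; the exact tag names and serialization used inside DocTags may differ slightly), a two-column table with a header row could be encoded with the OTSL vocabulary like this:

<fcel>Name<fcel>Price<nl><fcel>Apple<fcel>2.00<nl>

Here fcel opens a cell that carries content, ecel marks an empty cell, lcel and ucel mark cells merged with their left or upper neighbour, xcel covers cells spanning both directions, and nl ends a table row.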

Pre-training datasets

Seeing the lack of good multimodal document data, the SmolDocling team created a new public dataset: DocLayNet-PT. It contains 1.4M pages drawn from the DocFM dataset (PDF documents from CommonCrawl, Wikipedia, and business domains). The original SmolVLM had DocVQA (Document Visual Question Answering) capabilities. To keep this feature, SmolDocling was also trained on the Docmatix dataset with added DocTags format information.

Task-specific datasets

The model was also fine-tuned for specific tasks such as layout, table, chart, code, and equation recognition. For layout and tables, the team prepared:

  • 76k pages of human-annotated and reviewed documents from DocLayNet-PT (the resulting dataset was named DocLayNet v2)
  • 63k pages of tables and text from the WordScape dataset
  • 250k pages of synthetic annotations from Wikipedia covering layout, colors, and fonts (the resulting dataset was named SynthDocNet)

Table recognition was covered by fine-tuning with PubTables-1M, FinTabNet, WikiTableSet, and tabular data from WordScape. Table structure information was converted into the OTSL format, so that each cell tag had its corresponding structure and text.

Public chart recognition datasets are either low quality or not diverse. That triggered the creation of another dataset containing 2.5 million visually diverse charts in 4 categories: line, pie, bar, and stacked bar. The SmolDocling team also created a code recognition dataset addressing the lack of datasets containing code as images; it includes 9.3 million code snippets rendered at 120 dpi. Yet another dataset was created for mathematical formulas: 730k unique formulas from public datasets plus 4.7 million formulas collected from arXiv. The final equations dataset contains 5.5 million unique formulas rendered at 120 dpi.

[Figure: Training datasets used for SmolDocling]

Experiments

To enhance recognition of specific elements and to give SmolDocling the ability to follow free-form instructions, the team combined rule-based techniques with the Granite-3.1-2b-instruct model. Random elements were taken from DocLayNet-PT and matching instructions were created for each element, for example: “Perform OCR at bbox” or “Identify page element type at bbox”. Training with the Cauldron dataset was also applied to avoid catastrophic forgetting.
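A minimal sketch of how such rule-based instruction samples could be produced is shown below; the annotation field names and templates are hypothetical, chosen only to illustrate the idea.

import random

# Hypothetical annotation records; the real DocLayNet-PT schema may differ.
page_elements = [
    {"type": "table", "bbox": (84, 120, 512, 340), "text": "..."},
    {"type": "section_heading", "bbox": (84, 40, 400, 70), "text": "Invoice"},
]

# Rule-based instruction templates, as described in the paper.
templates = [
    "Perform OCR at bbox {bbox}.",
    "Identify page element type at bbox {bbox}.",
]

def make_instruction_sample(elements):
    element = random.choice(elements)
    prompt = random.choice(templates).format(bbox=element["bbox"])
    # The expected answer depends on which template was chosen above.
    answer = element["text"] if prompt.startswith("Perform OCR") else element["type"]
    return {"prompt": prompt, "answer": answer}

print(make_instruction_sample(page_elements))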

The model was trained on:

  • 64 NVIDIA A100 80GB GPUs
  • 4 epochs in total, each epoch lasting 38 hours
  • optimizer: AdamW
  • learning rates: 2×10^-4 and 2×10^-6
  • gradient clipping: 1.0
  • warmup ratio: 0.03

Achieved inference efficiency:

  • page conversion time: 0.35 seconds
  • memory usage: 0.489 GB VRAM
  • max sequence length: 8192 tokens
  • the model can process 3 pages at a time

SmolDocling is a small but efficient vision-language model for document conversion. It produces rich structured output in a single pass, which reduces error accumulation compared to multi-stage systems. The model can link captions to images, preserve code formatting, and remove redundant headers or footers. Typical issues include missing tags, malformed structure, and repetitive token loops. Future work should improve page element localization for better accuracy. Overall, SmolDocling shows that compact models with optimized formats can rival much larger models in multi-task document understanding.