Granite Vision model notes
Introduction Many VLMs perform great on benchmarks like viusal question-answering or multi-image reasoning. Models in 2024 were predominantly trained on natural images, thus often limiting performance in other domains like visual document undestanding. Granite Vision is a 3 billion parameters model focused on entreprise use cases with visual document understanding document content extraction and working with documents. Get the Granite Vision research paper here. Architecture Granite Vision was trained on 12 trillion tokens. Dataset for it’s training is constanlty curated by Granite Team but not open sourced. It contained 13 million images and 80 milllion instructions such as: document question-answering, scene understanding key-value extraction, text grounding, layout parsing, captioning, UI understanding, code comprehension. ...
SmolDocling model notes
Introduction OCR on documents remains a challenging task. While printed text can often be recognized with 95% or higher accuracy, real-world documents — containing handwriting, non-standard layouts, and other irregularities — are still much harder to read accurately. There are high quality systems solving documents reading in sub-tasks: OCR, layout analysis, structure recognition, classification. Recent trend is using large vision-language models to solve whole convertion task in one shot, and giving the user opportunity to define additional specific tasks to do in the prompt. This post is about SmalDocling, a very tiny and compute-effient vision-language model doing the convertion task and the instruction task written into a prompt. The post is based on SmolDocling research paper. ...