Granite Vision model notes
Introduction Many VLMs perform great on benchmarks like viusal question-answering or multi-image reasoning. Models in 2024 were predominantly trained on natural images, thus often limiting performance in other domains like visual document undestanding. Granite Vision is a 3 billion parameters model focused on entreprise use cases with visual document understanding document content extraction and working with documents. Get the Granite Vision research paper here. Architecture Granite Vision was trained on 12 trillion tokens. Dataset for it’s training is constanlty curated by Granite Team but not open sourced. It contained 13 million images and 80 milllion instructions such as: document question-answering, scene understanding key-value extraction, text grounding, layout parsing, captioning, UI understanding, code comprehension. ...