Posts

Exploring Tiny Time-Mixer Model

Introduction 2024 was the year when zero-shot and few-shot time series forecasting made its way into LLM architectures: Moment, TimesFM, Chronos, Moirai, and many more showed up. These models were novel but computationally heavy to use. A second wave of lighter models soon followed, including the Tiny Time-Mixer (TTM) architecture from IBM Research. TTM can work with just 1M parameters, supports channel correlations and exogenous signals, and handles multivariate forecasting. Let’s dive in. ...

Granite Vision model notes

Introduction Many VLMs perform well on benchmarks such as visual question-answering or multi-image reasoning. Models in 2024 were predominantly trained on natural images, which often limits their performance in other domains such as visual document understanding. Granite Vision is a 3-billion-parameter model focused on enterprise use cases: visual document understanding, document content extraction, and working with documents. Get the Granite Vision research paper here. Architecture Granite Vision was trained on 12 trillion tokens. Its training dataset is constantly curated by the Granite team but is not open-sourced. It contained 13 million images and 80 million instructions covering tasks such as document question-answering, scene understanding, key-value extraction, text grounding, layout parsing, captioning, UI understanding, and code comprehension. ...

SmolDocling model notes

Introduction OCR on documents remains a challenging task. While printed text can often be recognized with 95% or higher accuracy, real-world documents containing handwriting, non-standard layouts, and other irregularities are still much harder to read accurately. High-quality systems exist that split document reading into sub-tasks: OCR, layout analysis, structure recognition, and classification. A recent trend is to use large vision-language models to solve the whole conversion task in one shot, while letting the user define additional specific tasks in the prompt. This post is about SmolDocling, a very small and compute-efficient vision-language model that performs both the conversion task and the instructions written in the prompt. The post is based on the SmolDocling research paper. ...