VLM — Vision-Language Model

A VLM is a model that processes visual inputs together with text in a shared representation space and produces textual output. CLIP (OpenAI, 2021) and BLIP (Salesforce, 2022) are foundational here: they demonstrated how a vision encoder and a language model can be aligned in a joint embedding space. Modern VLMs (GPT-4V, Claude Opus and Claude Sonnet 3.5+, Gemini, LLaVA) have turned document OCR, chart understanding, visual Q&A, and UI reading into routine tasks. MLLM (multimodal large language model) is the broader umbrella term that includes VLMs.
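
To make the idea of a joint embedding space concrete, here is a minimal sketch of CLIP-style matching: an image vector and several caption vectors live in the same space, and the caption whose vector points in the most similar direction wins. Everything in it is an assumption for illustration: the three-dimensional vectors are hand-written toy values rather than real encoder outputs, the function names are hypothetical, and the temperature is an arbitrary choice.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors in the shared space."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_captions(image_vec, caption_vecs, temperature=0.07):
    """Score each caption against the image; lower temperature sharpens the result."""
    sims = [cosine_similarity(image_vec, vec) / temperature
            for vec in caption_vecs.values()]
    return dict(zip(caption_vecs.keys(), softmax(sims)))


# Toy embeddings (assumed values, standing in for real encoder outputs).
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a bar chart": [0.1, 0.9, 0.3],
    "a login screen": [0.2, 0.1, 0.9],
}

probs = rank_captions(image_embedding, caption_embeddings)
best = max(probs, key=probs.get)
print(best)
```

In a real system the vectors would come from trained vision and text encoders, and the alignment itself is learned during training so that matching image and text pairs end up close together; this sketch only shows the lookup step once such a space exists.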