A VLM (vision-language model) is a class of model that processes visual inputs together with text in a shared representation space and produces textual output. CLIP (OpenAI, 2021) and BLIP (Salesforce, 2022) are foundational here: they demonstrated how a vision encoder and a language model can be aligned in a joint embedding space. Modern VLMs such as GPT-4V, Claude Opus and Claude Sonnet 3.5+, Gemini, and LLaVA have turned document OCR, chart understanding, visual Q&A, and UI reading into routine tasks. MLLM (multimodal large language model) is the broader umbrella term that includes VLMs.
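To make the idea of a joint embedding space more concrete, here is a minimal sketch in plain Python. It assumes that an image and several captions have already been mapped to vectors of the same dimension. The three-dimensional vectors, the captions, and the helper names `cosine_similarity` and `best_caption` are all hypothetical and hand-written for illustration; they are not outputs or APIs of CLIP, BLIP, or any real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_caption(image_embedding, caption_embeddings):
    """Return the caption whose embedding lies closest to the image embedding."""
    return max(
        caption_embeddings,
        key=lambda caption: cosine_similarity(
            image_embedding, caption_embeddings[caption]
        ),
    )


# Hypothetical toy vectors (assumption): a real vision encoder and text
# encoder would produce these, typically with hundreds of dimensions.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a bar chart": [0.1, 0.9, 0.3],
    "a login screen": [0.2, 0.1, 0.9],
}

print(best_caption(image_embedding, caption_embeddings))
```

Because both modalities live in the same space, matching an image to text reduces to a nearest-neighbor comparison. Generative VLMs go a step further and feed the visual representation into a language model so that it can produce free-form text.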
MEVZU · N°124 · ISTANBUL · YEAR I, VOL. III
Glossary · Intermediate · 2021
VLM: Vision-Language Model
A model that jointly understands images and text and produces text responses.
- EN (English term): VLM (Vision-Language Model)
- TR (Turkish term): VLM (Görü-Dil Modeli)