Multimodal describes models that can take in or produce more than one input type — text, image, audio, video — together. 2023–2024 models like GPT-4V and Gemini made dropping an image directly into an LLM a routine pattern; previously each modality demanded a separate model. VLM and MLLM are subclasses of this category. Real-world usage expanded quickly into document analysis, screenshot understanding, support visuals, and scenarios like Computer Use.
MEVZU N°124ISTANBULYEAR I — VOL. III
Glossary · Beginner · 2022
Multimodal
Models capable of understanding or producing more than one input type — text, image, audio, video.
- EN — English term
- Multimodal
- TR — Turkish term
- Çok-Modlu