Multimodal

Models capable of understanding or producing more than one input type — text, image, audio, video.

EN — English term: Multimodal
TR — Turkish term: Çok-Modlu

Multimodal describes models that can take in or produce more than one input type — text, image, audio, video — together. 2023–2024 models like GPT-4V and Gemini made dropping an image directly into an LLM a routine pattern; previously each modality demanded a separate model. VLM and MLLM are subclasses of this category. Real-world usage expanded quickly into document analysis, screenshot understanding, support visuals, and scenarios like Computer Use.