An MLLM is a large language model extended with one or more additional modalities (image, audio, video) on top of a traditional LLM core. The typical architecture encodes each modality into an embedding and projects it into the language model, so the model reasons in a unified token space alongside text. GPT-4o, Claude 3.5 Sonnet, Gemini 1.5/2, and open models like LLaVA are leading examples. A VLM (vision-language model) is a narrower subset focused on vision plus language; MLLM is the umbrella term that also covers audio and video.
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Intermediate · 2023
MLLM — Multimodal LLM
A large language model that also processes modalities like image, audio, or video.
- EN: MLLM (Multimodal LLM)
- TR: MLLM (Çok-Modlu LLM)
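As a concrete illustration of the encode-and-project architecture described in the entry, here is a minimal NumPy sketch. All dimensions, names, and the random stand-ins for the vision encoder and embedding table are hypothetical; real systems use a pretrained encoder and a trained projector, not random weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a vision encoder emitting 256 patch features of dim 1024,
# projected into a language model's 4096-dim token-embedding space.
NUM_PATCHES, VISION_DIM, LM_DIM = 256, 1024, 4096

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen vision encoder: image -> patch features."""
    # A real system runs a pretrained model here; we only fake the output shape.
    return rng.standard_normal((NUM_PATCHES, VISION_DIM))

# The projector (a single linear layer in this sketch) maps vision features
# into the same space as the LM's text token embeddings.
W_proj = rng.standard_normal((VISION_DIM, LM_DIM)) * 0.02

def project(patch_features: np.ndarray) -> np.ndarray:
    return patch_features @ W_proj

# Text side: token ids looked up in the LM's embedding table.
VOCAB = 32000
embed_table = rng.standard_normal((VOCAB, LM_DIM)) * 0.02
text_ids = np.array([1, 42, 7])   # a fake tokenized prompt
text_emb = embed_table[text_ids]  # shape (3, LM_DIM)

# Unified sequence: image "tokens" followed by text tokens, fed to the LM.
image_emb = project(encode_image(np.zeros((224, 224, 3))))
sequence = np.concatenate([image_emb, text_emb], axis=0)
print(sequence.shape)  # (259, 4096)
```

Once projected, the image patches are indistinguishable from text embeddings as far as the transformer is concerned, which is what lets a single model reason over both.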