An MLLM is a large language model extended with one or more additional modalities (image, audio, video) on top of a traditional LLM core. The typical architecture encodes each modality into an embedding and projects it into the language model, so the model reasons in a unified token space alongside text. GPT-4o, Claude 3.5 Sonnet, Gemini 1.5/2, and open models like LLaVA are leading examples. A VLM (vision-language model) is a narrower subset focused on vision plus language; MLLM is the umbrella term that also covers audio and video.
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Intermediate · 2023
MLLM — Multimodal LLM
A large language model that also processes modalities like image, audio, or video.
- EN: MLLM (Multimodal LLM)
- TR: MLLM (Çok-Modlu LLM)
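As a concrete illustration of the encode-and-project architecture described in the entry, here is a minimal NumPy sketch. All dimensions, names, and the random stand-ins for the vision encoder and embedding table are hypothetical; real systems use a pretrained encoder and a trained projector, not random weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a vision encoder emitting 256 patch features of dim 1024,
# projected into a language model's 4096-dim token-embedding space.
NUM_PATCHES, VISION_DIM, LM_DIM = 256, 1024, 4096

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen vision encoder: image -> patch features."""
    # A real system runs a pretrained model here; we only fake the output shape.
    return rng.standard_normal((NUM_PATCHES, VISION_DIM))

# The projector (a single linear layer in this sketch) maps vision features
# into the same space as the LM's text token embeddings.
W_proj = rng.standard_normal((VISION_DIM, LM_DIM)) * 0.02

def project(patch_features: np.ndarray) -> np.ndarray:
    return patch_features @ W_proj

# Text side: token ids looked up in the LM's embedding table.
VOCAB = 32000
embed_table = rng.standard_normal((VOCAB, LM_DIM)) * 0.02
text_ids = np.array([1, 42, 7])   # a fake tokenized prompt
text_emb = embed_table[text_ids]  # shape (3, LM_DIM)

# Unified sequence: image "tokens" followed by text tokens, fed to the LM.
image_emb = project(encode_image(np.zeros((224, 224, 3))))
sequence = np.concatenate([image_emb, text_emb], axis=0)
print(sequence.shape)  # (259, 4096)
```

Once projected, the image patches are indistinguishable from text embeddings as far as the transformer is concerned, which is what lets a single model reason over both.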