Multimodal Model
Also known as: Vision-Language Model, VLM, Multimodal LLM
Definition
A model that can process and/or generate multiple types of data (modalities) such as text, images, audio, and video. Multimodal models understand relationships across modalities—they can describe images, answer questions about videos, or generate images from text. They extend LLM capabilities beyond text to richer inputs and outputs.
What this is NOT
- Not text-only LLMs (multimodal includes other modalities)
- Not separate models for each modality (multimodal is unified)
- Not just image generation (multimodal includes understanding)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
LLMs enhanced with vision (and sometimes audio) capabilities, able to accept images as input and reason about visual content. Examples: GPT-4V, Claude 3, Gemini Pro Vision.
Sources: GPT-4V documentation, Gemini documentation, LLaVA and open multimodal models
ml-research
Models trained to align representations across modalities, enabling cross-modal understanding and generation. Architectures include vision encoders + LLMs, unified transformers, and diffusion models.
Sources: CLIP paper, Flamingo paper, Multimodal learning surveys
Examples
- GPT-4V / GPT-4o (vision + text)
- Claude 3 (vision + text)
- Gemini Pro (vision + text + audio)
- LLaVA (open-source vision-language model)
Counterexamples
Things that might seem like Multimodal Model but are not:
- GPT-4 text-only (single modality)
- DALL-E (image generation only, not understanding)
- Whisper (audio only, not multimodal understanding)
Relations
- specializes large-language-model (Multimodal LLMs extend LLMs with other modalities)
- specializes foundation-model (Multimodal models are foundation models)
- overlapsWith inference (Multimodal inference processes multiple input types)
Implementations
Tools and frameworks that implement this concept:
- Anthropic primary
- Claude 3 primary
- Claude 3.5 primary
- Claude 4 primary
- Gemini primary
- Google Cloud Vertex AI secondary
- Google Gemini primary
- GPT-4 primary
- GPT-4 Turbo primary
- GPT-4o primary
- OpenAI primary