Multimodal Model

Also known as: Vision-Language Model, VLM, Multimodal LLM

Definition

A model that can process and/or generate multiple types of data (modalities) such as text, images, audio, and video. Multimodal models understand relationships across modalities—they can describe images, answer questions about videos, or generate images from text. They extend LLM capabilities beyond text to richer inputs and outputs.

What this is NOT

  • Not text-only LLMs (multimodal includes other modalities)
  • Not separate models for each modality (multimodal is unified)
  • Not just image generation (multimodal includes understanding)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

LLMs enhanced with vision (and sometimes audio) capabilities, able to accept images as input and reason about visual content. Examples: GPT-4V, Claude 3, Gemini Pro Vision.

Sources: GPT-4V documentation, Gemini documentation, LLaVA and open multimodal models
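For illustration, the sketch below sends a text question together with an image to a hosted multimodal model through the OpenAI Python SDK. The model name ("gpt-4o") and the image URL are placeholders; the same pattern of mixing text and image content parts in one message applies to other providers' vision-capable chat APIs.

  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  # One user message carrying two content parts: a text question and an image URL.
  response = client.chat.completions.create(
      model="gpt-4o",  # placeholder; any vision-capable chat model
      messages=[
          {
              "role": "user",
              "content": [
                  {"type": "text", "text": "What is happening in this image?"},
                  {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
              ],
          }
      ],
  )
  print(response.choices[0].message.content)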

ml-research

Models trained to align representations across modalities, enabling cross-modal understanding and generation. Architectures include vision encoders + LLMs, unified transformers, and diffusion models.

Sources: CLIP paper, Flamingo paper, Multimodal learning surveys
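As a rough sketch of the "vision encoder + LLM" pattern mentioned above: image features are projected into the language model's embedding space and processed alongside text tokens as a single sequence. The dimensions, layer counts, and modules below are toy stand-ins, not those of any published model.

  import torch
  import torch.nn as nn

  class ToyVisionLanguageModel(nn.Module):
      def __init__(self, vision_dim=768, llm_dim=1024, vocab_size=32000):
          super().__init__()
          # Stand-ins for a pretrained image encoder (e.g. a ViT) and a pretrained LLM.
          self.vision_encoder = nn.Linear(vision_dim, vision_dim)
          self.projector = nn.Linear(vision_dim, llm_dim)  # aligns image features with the LLM embedding space
          self.token_embedding = nn.Embedding(vocab_size, llm_dim)
          self.backbone = nn.TransformerEncoder(
              nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
              num_layers=2,
          )
          self.lm_head = nn.Linear(llm_dim, vocab_size)

      def forward(self, image_patches, text_tokens):
          # image_patches: (batch, num_patches, vision_dim); text_tokens: (batch, seq_len)
          image_embeds = self.projector(self.vision_encoder(image_patches))
          text_embeds = self.token_embedding(text_tokens)
          # Projected image features and text embeddings form one joint sequence.
          sequence = torch.cat([image_embeds, text_embeds], dim=1)
          return self.lm_head(self.backbone(sequence))  # token logits over the joint sequence

  model = ToyVisionLanguageModel()
  logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))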

Examples

  • GPT-4V / GPT-4o (vision + text)
  • Claude 3 (vision + text)
  • Gemini Pro (vision + text + audio)
  • LLaVA (open-source vision-language model)
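As one concrete way to try an open model such as LLaVA, the sketch below uses the Hugging Face transformers LLaVA integration. The model id and the "USER: <image> ... ASSISTANT:" prompt format follow the llava-hf model cards; verify both against the current documentation before relying on them.

  import requests
  from PIL import Image
  from transformers import AutoProcessor, LlavaForConditionalGeneration

  model_id = "llava-hf/llava-1.5-7b-hf"
  processor = AutoProcessor.from_pretrained(model_id)
  model = LlavaForConditionalGeneration.from_pretrained(model_id)

  image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)
  prompt = "USER: <image>\nDescribe this picture. ASSISTANT:"

  # The processor pairs the image with the text prompt for the model.
  inputs = processor(images=image, text=prompt, return_tensors="pt")
  output_ids = model.generate(**inputs, max_new_tokens=100)
  print(processor.decode(output_ids[0], skip_special_tokens=True))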

Counterexamples

Things that might seem like a Multimodal Model but are not:

  • GPT-4 text-only (single modality)
  • DALL-E (text-to-image generation only; no visual understanding of image inputs)
  • Whisper (speech-to-text transcription only, not general multimodal understanding)

Relations

  • specializes large-language-model (Multimodal LLMs extend LLMs with other modalities)
  • specializes foundation-model (Multimodal models are foundation models)
  • overlapsWith inference (Multimodal inference processes multiple input types)

Implementations

Tools and frameworks that implement this concept: