Edge Deployment

process · deployment · published

Also known as: On-Device AI, Edge AI, Local Inference

Definition

Running AI models on devices close to the user (phones, laptops, edge servers) rather than in centralized cloud data centers. Edge deployment reduces latency, enables offline operation, and keeps data on-device for privacy. For LLMs, this typically requires smaller, quantized models that fit in limited memory.
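
To make the memory constraint concrete, here is a rough back-of-envelope sketch in Python; the 8-billion parameter count and the per-weight byte costs are illustrative assumptions, not figures for any specific model.

```python
# Rough estimate: weight memory = parameter count x bytes per weight.
# The parameter count and the quantization levels below are assumptions
# chosen for illustration only.
PARAMS = 8e9  # e.g. an ~8B-parameter model

bytes_per_weight = {
    "fp16": 2.0,   # half-precision weights
    "int8": 1.0,   # 8-bit quantization
    "q4": 0.5,     # ~4-bit quantization (ignoring per-block scales)
}

for precision, nbytes in bytes_per_weight.items():
    gib = PARAMS * nbytes / 2**30
    print(f"{precision}: ~{gib:.1f} GiB of weights")

# fp16: ~14.9 GiB -> too large for most laptops and phones
# q4:   ~3.7 GiB  -> fits alongside the OS on an 8-16 GB device
```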

What this is NOT

  • Not cloud deployment (edge is the opposite)
  • Not just small models (edge is about location, not just size)
  • Not necessarily consumer devices (edge servers are also 'edge')

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Running LLMs locally on user devices with tools like Ollama or llama.cpp, or with on-device frameworks such as Apple MLX and Android's on-device ML stack. This enables private, offline LLM use without sending data to cloud APIs.

Sources: Ollama documentation, llama.cpp documentation, Apple MLX framework
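
As a minimal sketch of this local-first workflow, the snippet below calls Ollama's local HTTP API, which listens on localhost:11434 by default. It assumes Ollama is already running and that a model tagged llama3 has been pulled; the prompt is arbitrary.

```python
# Local inference via Ollama's HTTP API (assumes `ollama serve` is running
# and `ollama pull llama3` has been done). No data leaves the machine.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3",          # any locally pulled model tag
    "prompt": "Why run LLMs on-device?",
    "stream": False,            # return one JSON object instead of a stream
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```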

ml-ops

Deploying ML models to edge devices (IoT, mobile, edge servers) for low-latency inference without cloud round-trips.

Sources: Edge ML literature, TensorFlow Lite, ONNX Runtime documentation
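
A minimal sketch of this pattern using ONNX Runtime's Python API: the model file name, input name, and input shape below are placeholders for whatever the exported model actually expects.

```python
# On-device inference with ONNX Runtime (CPU execution provider).
# "model.onnx" and the 1x3x224x224 input shape are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Single forward pass, no cloud round-trip.
outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```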

Examples

  • Ollama running Llama 3 on a MacBook
  • llama.cpp on a gaming PC with RTX 4090 (see the sketch after this list)
  • Gemini Nano on Pixel phones
  • Apple Intelligence running on-device
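
For the llama.cpp example, one common way to drive it from Python is the llama-cpp-python binding. The sketch below assumes a quantized GGUF file is already on disk; the file name is a placeholder.

```python
# Local generation with llama.cpp via the llama-cpp-python binding.
# The GGUF path is a placeholder; n_gpu_layers=-1 offloads all layers to
# the GPU (e.g. an RTX 4090), n_gpu_layers=0 keeps everything on the CPU.
from llama_cpp import Llama

llm = Llama(model_path="./llama-3-8b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)

result = llm("Q: What does edge deployment mean? A:", max_tokens=64)
print(result["choices"][0]["text"])
```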

Counterexamples

Things that might seem like Edge Deployment but are not:

  • Calling the OpenAI API from your laptop (the inference runs in the cloud, not on the edge)
  • Running models in an AWS data center (centralized cloud serving, not edge)
  • Thin client that just displays cloud-generated responses

Relations

  • overlapsWith quantization (Quantization enables edge deployment; see the sketch after this list)
  • overlapsWith inference (Edge inference is still inference)
  • inTensionWith model-serving (Edge vs. centralized serving)
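
To make the quantization relation concrete, here is a minimal sketch of symmetric int8 weight quantization in NumPy. Real edge toolchains (llama.cpp's GGUF quantizers, TensorFlow Lite converters) quantize per-channel or per-block with calibration, so this is illustrative only.

```python
# Symmetric int8 quantization of one weight tensor, showing why a quantized
# model needs ~4x less memory than fp32 at a small accuracy cost.
import numpy as np

weights = np.random.randn(4096, 4096).astype(np.float32)

scale = np.abs(weights).max() / 127.0          # one scale for the whole tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale     # what inference actually uses

print(f"fp32: {weights.nbytes / 2**20:.1f} MiB, int8: {q.nbytes / 2**20:.1f} MiB")
print(f"max abs error: {np.abs(weights - dequantized).max():.4f}")
```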

Implementations

Tools and frameworks that implement this concept:

  • Ollama
  • llama.cpp
  • Apple MLX
  • TensorFlow Lite
  • ONNX Runtime