Edge Deployment
Also known as: On-Device AI, Edge AI, Local Inference
Definition
Running AI models on devices close to the user (phones, laptops, edge servers) rather than in centralized cloud data centers. Edge deployment reduces latency, enables offline operation, and keeps data on-device for privacy. For LLMs, this typically requires smaller, quantized models that fit in limited memory.
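A rough back-of-the-envelope calculation makes the memory constraint concrete. The parameter count and bytes-per-weight figures below are illustrative assumptions, not measurements of any particular model:

```python
# Rough weight-memory estimate for an LLM at different numeric precisions.
# Real runtime memory is higher: KV cache, activations, and framework
# overhead all add to the raw weight footprint.

PARAMS = 8e9  # e.g. an 8B-parameter model (illustrative assumption)

bytes_per_weight = {
    "fp16": 2.0,  # 16-bit floats
    "int8": 1.0,  # 8-bit quantization
    "q4": 0.5,    # ~4-bit quantization (e.g. GGUF Q4-style formats)
}

for name, nbytes in bytes_per_weight.items():
    gib = PARAMS * nbytes / 2**30
    print(f"{name:>4}: ~{gib:.1f} GiB of weights")

# fp16: ~14.9 GiB  -- a stretch for a 16 GB laptop, out of reach for most phones
# int8: ~ 7.5 GiB
#   q4: ~ 3.7 GiB  -- fits alongside the OS and apps on a typical laptop
```

This is why the "smaller, quantized models" caveat is not optional on most consumer hardware.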
What this is NOT
- Not cloud deployment (edge is the opposite)
- Not just small models (edge is about location, not just size)
- Not necessarily consumer devices (edge servers are also 'edge')
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Running LLMs locally on user devices using tools like Ollama, llama.cpp, or on-device frameworks (Apple MLX, Android ML). Enables private, offline LLM use without sending data to cloud APIs.
Sources: Ollama documentation, llama.cpp documentation, Apple MLX framework
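As a minimal sketch of this workflow, the snippet below calls Ollama's local HTTP API on its default port (11434). It assumes the Ollama server is running and a model tag such as llama3 has already been pulled; the prompt is purely illustrative.

```python
import requests

# Everything stays on the machine: the request goes to a local Ollama server,
# not to a cloud API. Assumes `ollama pull llama3` has been run beforehand.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # any locally pulled model tag works
        "prompt": "Summarize why on-device inference helps privacy.",
        "stream": False,    # return one JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```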
ml-ops
Deploying ML models to edge devices (IoT, mobile, edge servers) for low-latency inference without cloud round-trips.
Sources: Edge ML literature, TensorFlow Lite, ONNX Runtime documentation
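A minimal sketch of this interpretation using ONNX Runtime's Python API; the model file name and input shape are placeholders for whatever you actually export to the device.

```python
import numpy as np
import onnxruntime as ort

# Load an exported model from local storage and run inference entirely
# on-device: no network round-trip to a cloud endpoint.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder shape

outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```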
Examples
- Ollama running Llama 3 on a MacBook
- llama.cpp on a gaming PC with an RTX 4090 (see the sketch after this list)
- Gemini Nano on Pixel phones
- Apple Intelligence running on-device
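The llama.cpp example above might look like the following with the llama-cpp-python bindings. The GGUF path is a placeholder, and n_gpu_layers=-1 assumes a CUDA-enabled build (e.g. on an RTX 4090) so that all layers are offloaded to the GPU; set it to 0 on CPU-only machines.

```python
from llama_cpp import Llama

# Load a locally stored, quantized GGUF model. The path is a placeholder;
# n_gpu_layers=-1 offloads all layers to the GPU on a CUDA-enabled build.
llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
)

out = llm("Q: What is edge deployment?\nA:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"].strip())
```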
Counterexamples
Things that might seem like Edge Deployment but are not:
- Calling the OpenAI API from your laptop (cloud, not edge; see the sketch after this list)
- Running models in an AWS data center (cloud, not edge)
- Thin client that just displays cloud-generated responses
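The first counterexample is worth making concrete, because the same client library can talk to either a cloud or a local endpoint; what matters is where inference actually runs. A minimal sketch, assuming Ollama is serving its OpenAI-compatible API on its default port:

```python
from openai import OpenAI

# Cloud, NOT edge: with its defaults the client sends every prompt to
# https://api.openai.com/v1, and inference happens in a remote data center.
# cloud_client = OpenAI(api_key="sk-...")

# Edge: same client, but pointed at a local server, so the prompt never
# leaves the machine. Assumes Ollama's OpenAI-compatible endpoint.
local_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client library, ignored by Ollama
)

reply = local_client.chat.completions.create(
    model="llama3",  # any locally pulled model tag
    messages=[{"role": "user", "content": "Does this prompt leave my machine?"}],
)
print(reply.choices[0].message.content)
```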
Relations
- overlapsWith quantization (Quantization enables edge deployment)
- overlapsWith inference (Edge inference is still inference)
- inTensionWith model-serving (Edge vs. centralized serving)
Implementations
Tools and frameworks that implement this concept:
- Cloudflare Workers (primary; edge-server inference via Workers AI)
- llama.cpp (primary)
- LM Studio (primary)
- Ollama (primary)