Local LLMs are now good enough for a lot of everyday work, especially when privacy, offline access, or cost control matters more than peak reasoning. This guide keeps it practical: when to go local, which runtime to pick, what hardware you actually need, and a short set of resources to keep you current.
Quick decision: local vs cloud
| If you need… | Choose | Why |
|---|---|---|
| Sensitive data stays on your machine | Local | No external API calls |
| Offline access | Local | No internet dependency |
| Best-available reasoning and long context | Cloud | Frontier models still win here |
| Team-wide API at scale | Depends | vLLM or cloud, based on hardware |
| Fast iteration without per-token costs | Local | Predictable cost after hardware |
Start here: pick a runtime
Choose one runtime and get it working before you optimize. The biggest mistake is mixing tools too early.
Ollama (fastest on-ramp)
Simple CLI + API server. Great default if you want quick installs and a local OpenAI-compatible endpoint.
Docs and install steps live here: Ollama documentation.
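To call it from code, here is a minimal sketch that points the OpenAI Python client at Ollama's default local endpoint; the model name (`llama3.2`) and the prompt are placeholders, and it assumes you have already pulled a model with `ollama pull`.

```python
# Minimal sketch: chat against a local Ollama server through its
# OpenAI-compatible endpoint. Assumes Ollama is running on its default
# port (11434) and a model such as "llama3.2" has already been pulled.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",  # placeholder: use any model you have pulled locally
    messages=[{"role": "user", "content": "Summarize what a GGUF file is."}],
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI protocol, the same client code works for the other runtimes below once you swap the base URL.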
LM Studio (best desktop UI)
GUI for browsing, downloading, and chatting with models. It can also expose a local API server. Ideal if you want a visual interface or you’re switching models frequently.
Get started here: LM Studio docs.
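If you enable LM Studio's local server, a quick way to confirm which models it exposes is to query its OpenAI-compatible `/v1/models` route; the port below is LM Studio's usual default, so adjust it if yours differs.

```python
# Minimal sketch: list the models LM Studio's local server currently exposes.
# Assumes the server is enabled in LM Studio and listening on its usual
# default port (1234); change the URL if you picked a different port.
import requests

resp = requests.get("http://localhost:1234/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```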
llama.cpp (max control)
The core engine for running GGUF models on CPU or GPU. If you want fine-grained performance tuning or minimal dependencies, this is the foundation.
Project home: llama.cpp on GitHub.
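One common way to drive it from Python is the `llama-cpp-python` bindings; the sketch below is a minimal example with a placeholder GGUF path, and the two parameters shown (`n_ctx`, `n_gpu_layers`) are the knobs most worth tuning first.

```python
# Minimal sketch using the llama-cpp-python bindings rather than the raw
# C++ CLI. The GGUF path is a placeholder; n_ctx and n_gpu_layers are the
# two settings that most affect memory use and speed.
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model-q4_k_m.gguf",  # placeholder path
    n_ctx=4096,        # context window; larger values cost more memory
    n_gpu_layers=-1,   # offload all layers to GPU; set 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```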
vLLM (multi-user serving)
High-throughput inference server with OpenAI-compatible endpoints. Best for teams, shared GPUs, or production-like serving.
Docs: vLLM documentation.
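To get a feel for the batching it does under the hood, the sketch below uses vLLM's offline Python API with a placeholder Hugging Face model ID; for actual team serving you would normally run its OpenAI-compatible server and point clients at that instead.

```python
# Minimal sketch of vLLM's offline batch API. The model name is a
# placeholder Hugging Face ID; swap in whatever fits your GPU.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model ID
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Write a one-line docstring for a function that parses ISO dates.",
    "List three risks of exposing a local LLM API on a shared network.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```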
Hardware planning (rough rules)
These are ballpark figures for 4-bit quantized models. Context length, quantization level, and inference backend all shift the numbers, so treat them as starting points.
| Model size | Typical fit | What to expect |
|---|---|---|
| 3B to 8B | 6 to 10 GB VRAM or 16+ GB RAM | Works on laptops, good for simple tasks |
| 10B to 20B | 12 to 24 GB VRAM or 32+ GB RAM | Solid general use if prompts are short |
| 30B to 70B | 48+ GB VRAM or multi-GPU | Great quality, heavy hardware |
Notes that matter:
- Bigger context windows consume more memory. If you raise context from 4k to 32k, expect a large VRAM jump.
- Apple Silicon uses unified memory, so “VRAM” is your system memory budget.
- CPU-only is viable for small models, but latency increases quickly as size grows.
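To turn these rules of thumb into a number for your own setup, the rough estimator below adds quantized weights, KV cache, and a guessed overhead; the dimensions in the example are assumed Llama-3-style values, and real usage varies by runtime and quantization format.

```python
# Back-of-the-envelope memory estimate for a quantized model plus KV cache.
# Rough rules of thumb only: real usage depends on the runtime, the
# quantization format, and the attention implementation.

def estimate_gb(params_b: float, bits: int, n_layers: int, n_kv_heads: int,
                head_dim: int, context: int) -> float:
    weights_gb = params_b * bits / 8          # e.g. 8B params at 4-bit ~ 4 GB
    # KV cache: 2 tensors (K and V) per layer, fp16 (2 bytes) per element.
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * 2 * context / 1e9
    overhead = 0.2 * weights_gb               # runtime buffers, rough guess
    return weights_gb + kv_gb + overhead

# Example: an 8B model with assumed Llama-3-style dimensions at 8k context.
print(f"{estimate_gb(8, 4, 32, 8, 128, 8192):.1f} GB")  # roughly 6 GB
```

Rerunning the example with a 32k context shows why the first note above matters: the KV cache term grows linearly with context length.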
Model selection without guesswork
Don’t chase rankings first. Start with your task and measure.
- Pick a task family: general chat, coding, reasoning, or vision.
- Choose the smallest model that fits your hardware.
- Run 5 to 10 real prompts and measure latency and quality (a measurement sketch follows this list).
- Move up in size only if the answers are consistently weak.
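For the measurement step, the sketch below times a few prompts against whichever OpenAI-compatible local endpoint you picked; the base URL, model name, and prompts are placeholders for your own setup.

```python
# Minimal sketch: time a handful of real prompts against any local
# OpenAI-compatible endpoint (Ollama, LM Studio, llama.cpp server, vLLM).
# The base_url, model name, and prompts are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")
MODEL = "llama3.2"  # placeholder

prompts = [
    "Summarize the tradeoffs of 4-bit quantization.",
    "Write a regex that matches ISO 8601 dates.",
]

for prompt in prompts:
    start = time.time()
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.time() - start
    tokens = resp.usage.completion_tokens if resp.usage else 0
    print(f"{elapsed:.1f}s, {tokens / elapsed:.1f} tok/s: {prompt[:40]}")
```

Keep the same prompt set when you try a different model or runtime so the numbers stay comparable.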
Good starting points to explore on any runtime:
- General chat: Llama, Qwen, Mistral families.
- Coding: Qwen Coder, DeepSeek Coder, StarCoder-style models.
- Reasoning: distilled reasoning models can be strong at 7B to 32B.
- Vision: look for “vision” or “VL” variants and test with your own images.
If you want a curated model catalog, use the Ollama model library or browse the Hugging Face Model Hub.
Serving patterns
Pick the lightest setup that matches your needs.
- Single-user chat: Ollama or LM Studio with local chat UI.
- Local API for apps: Ollama, LM Studio, or llama.cpp server.
- Team or production serving: vLLM with GPU scheduling and batching.
Cost and safety tradeoffs
Local doesn’t automatically mean safe or cheap.
- Costs: You trade per-token fees for hardware, power, and maintenance.
- Privacy: Local inference helps, but only if your toolchain doesn’t send telemetry or phone home.
- Security: If you connect tools, files, or external data sources, treat it like a real system. Consider isolation (VM or dedicated machine), and don’t reuse personal API keys.
- Licensing: Always check the model card and license before shipping anything serious.
Hugging Face explains model cards and licenses here: Model cards and Licenses.
Quick start checklist
- Define your primary use case and acceptable latency.
- Pick a runtime (Ollama or LM Studio for most people).
- Choose a small model first and verify it runs smoothly.
- Measure tokens/sec on your actual prompts.
- Increase size or context only when you hit real limits.
- Review model license and intended use.
Resources
- Ollama documentation
- Ollama model library
- LM Studio docs
- llama.cpp repository
- vLLM documentation
- Hugging Face Model Hub
Related links
- /posts/always-on-vpn/ - Isolation levels for agent-style tools
- /posts/openclaw-security-reality-2026/ - Security tradeoffs of self-hosted agents
- /posts/honest-limitations-ai-tools-2026/ - Where local models still fall short