Making Local LLM Go Brrr
How to run a local LLM well: fast, reliably, and with good output quality.
Key metrics:
- Prefill speed: prompt/input tokens per second
- Decode speed: generated tokens per second
- Time to first token (latency)
- VRAM usage at target context length
- Quality at chosen model/quant/context settings
- Concurrency, if serving multiple users
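A minimal sketch of how the first three metrics relate, assuming you can record three timestamps per request (sent, first token, last token) and know the token counts. The class and field names are made up for illustration and do not come from any particular server. Prefill speed is approximated from time to first token, which also includes queueing and network time.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps (seconds) and token counts for one request; names are hypothetical."""
    t_sent: float
    t_first_token: float
    t_last_token: float
    prompt_tokens: int
    output_tokens: int


def summarize(t: RequestTiming) -> dict:
    """Derive time to first token, prefill speed, and decode speed from one request."""
    ttft = t.t_first_token - t.t_sent
    decode_time = t.t_last_token - t.t_first_token
    return {
        "ttft_s": ttft,
        # Approximation: TTFT includes queueing/network time, so this underestimates prefill speed.
        "prefill_tok_s": t.prompt_tokens / ttft if ttft > 0 else float("inf"),
        # The first token arrives at t_first_token, so the decode window covers output_tokens - 1 tokens.
        "decode_tok_s": (t.output_tokens - 1) / decode_time if decode_time > 0 else float("inf"),
    }


if __name__ == "__main__":
    print(summarize(RequestTiming(0.0, 2.0, 12.0, prompt_tokens=4000, output_tokens=201)))
```

Measure with your real prompt lengths: prefill and decode speed both change with context size.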
Software
Choose the serving stack based on workload:
- llama.cpp: best general-purpose local option, especially for GGUF models, CPU-only, Apple Silicon, and mixed CPU/GPU setups.
- vLLM: strong GPU server for batching, throughput, OpenAI-compatible APIs, and production-style serving on modern GPUs.
- SGLang: good for structured/agentic serving and high-throughput multi-call workloads on modern GPUs.
Performance checklist:
- Use the fastest supported attention kernels: FlashAttention, FlashInfer, FlashMLA, etc.
- Try speculative decoding / MTP / EAGLE-style decoding when supported, but benchmark with your actual model and sampling settings.
- Preserve prefix/KV cacheability:
  - keep the system prompt byte-identical
  - append new messages rather than rebuilding/changing history
  - avoid dynamic timestamps or changing tool schemas in the prompt prefix
  - use server-side prefix caching when available
- Tune KV cache precision:
  - for long context, test q8 KV even with q4 weights
  - aggressively quantized KV can hurt long-context coherence
- Evaluate KV-cache compression such as TurboQuant for long-context or high-concurrency workloads, but treat it as experimental until benchmarked.
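A rough way to sanity-check prefix cacheability is to compare consecutive prompts and see how much of the previous one is reused as an exact prefix. The sketch below works on characters as a proxy for tokens (real caches match on tokens or token blocks), and the function name and example prompts are made up for illustration.

```python
import os


def shared_prefix_ratio(previous_prompt: str, current_prompt: str) -> float:
    """Fraction of the previous prompt that the current prompt reuses as an exact prefix."""
    if not previous_prompt:
        return 0.0
    # os.path.commonprefix compares character by character, not by path component.
    common = os.path.commonprefix([previous_prompt, current_prompt])
    return len(common) / len(previous_prompt)


SYSTEM = "You are a helpful assistant.\n"

# Append-only history: the whole previous prompt stays a prefix of the next one.
turn1 = SYSTEM + "User: hi\n"
turn2 = turn1 + "Assistant: hello\nUser: what is a KV cache?\n"

# A timestamp at the top changes early bytes, so reuse stops inside the timestamp.
stamped1 = "Time: 12:00:01\n" + turn1
stamped2 = "Time: 12:00:07\n" + turn2

print(shared_prefix_ratio(turn1, turn2))        # 1.0
print(shared_prefix_ratio(stamped1, stamped2))  # well below 1.0
```

If the ratio drops between turns in your own logs, something in the prompt prefix is changing and the server has to redo prefill from that point.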
ik_llama.cpp, a performance-focused llama.cpp fork with additional features: https://ikawrakow-ik_llama-cpp.mintlify.app/inference/
Tool-calling reliability:
- Add middleware for schema repair, retries, validation, and constrained tool loops -> https://github.com/antoinezambelli/forge
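What such middleware does can be sketched in a few lines: validate the model's tool arguments, feed the error back, and cap the number of attempts. This is a generic illustration using only the standard library, not the API of the linked project. `generate` is a hypothetical callable that takes feedback text and returns the model's raw output, and the schema format is invented for the example.

```python
import json
from typing import Callable, Optional

# Hypothetical tool schema: argument name -> required Python type.
WEATHER_ARGS = {"city": str, "days": int}


def validate_call(raw: str, required: dict) -> tuple[Optional[dict], str]:
    """Return (args, "") if raw is a JSON object with the required typed fields, else (None, error)."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(args, dict):
        return None, "arguments must be a JSON object"
    for name, typ in required.items():
        if name not in args:
            return None, f"missing field: {name}"
        if not isinstance(args[name], typ):
            return None, f"field {name} must be {typ.__name__}"
    return args, ""


def call_with_retries(generate: Callable[[str], str], required: dict, max_attempts: int = 3) -> dict:
    """Ask for tool arguments, feeding validation errors back, in a bounded loop."""
    feedback = ""
    error = "no attempts made"
    for _ in range(max_attempts):
        args, error = validate_call(generate(feedback), required)
        if args is not None:
            return args
        feedback = f"Previous tool call was rejected ({error}). Return only valid JSON."
    raise ValueError(f"no valid tool call after {max_attempts} attempts: {error}")
```

The bounded loop matters: small local models can get stuck repeating the same malformed call, so fail loudly instead of retrying forever.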
TODO:
- eval dynamic model routing based on query complexity (fast vs smart model)
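One crude baseline to evaluate against could be a keyword-and-length heuristic. Everything here is a placeholder assumption: the model names, the marker words, and the word threshold are invented, and a real router would need to be validated on your own queries.

```python
def pick_model(
    query: str,
    fast: str = "fast-model",      # placeholder name for the small, quick model
    smart: str = "smart-model",    # placeholder name for the larger, slower model
    max_fast_words: int = 40,      # arbitrary threshold, tune on real traffic
) -> str:
    """Route long or obviously hard queries to the smart model, everything else to the fast one."""
    hard_markers = ("prove", "refactor", "debug", "step by step", "analyze")
    text = query.lower()
    if len(text.split()) > max_fast_words or any(m in text for m in hard_markers):
        return smart
    return fast


if __name__ == "__main__":
    print(pick_model("What is the capital of France?"))
    print(pick_model("Please debug this stack trace"))
```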
Open Models
Run LLMs locally for complete control and privacy. Note the difference between open-source models (training is reproducible) and open-weight models (only the weights are freely available).
Compare model capability: https://artificialanalysis.ai/models
Find compatible models for your hardware: https://www.canirun.ai/ or try https://github.com/AlexsJones/llmfit
Community benchmarks for local LLMs: https://localmaxxing.com
Curated open model list:
- Qwen3.6-35B-A3B Q4_K_XL: strong MoE model (3B active) with MTP; usable on 8GB VRAM GPUs with expert weights offloaded to system RAM (the quantized weights alone exceed 8GB)
- Qwen3.6 27B Q3_K_M: dense model, very good quality, runs on 16GB VRAM
- LFM2.5-8B-A1B: very fast MoE model (1.5B active) with 128k context; agentic usefulness is limited, though
- MiniCPM5-1B: optimized for mobile CPU/NPU inference (32k context window)
Hardware
VRAM capacity matters more than raw TFLOPS because it decides which model and context (prompt) size fit at all; memory bandwidth (mainly decode) and tensor cores (mainly prefill) decide speed. Used datacenter GPUs can be good value, but check form factor, cooling, power, driver support, and PCIe vs SXM.
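A back-of-the-envelope VRAM estimate, assuming a standard transformer with grouped-query attention: weights take roughly params x bits / 8 bytes, and the KV cache takes 2 (K and V) x layers x KV heads x head dim x context length x bytes per element. The model shape below is hypothetical, and the estimate ignores activations, runtime buffers, and quantization block overhead. Architectures with compressed or hybrid attention need a different formula.

```python
GIB = 2**30


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: float = 2.0) -> float:
    """Approximate KV cache size in GiB for one sequence (2.0 bytes = fp16, about 1.0 = 8-bit)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / GIB


if __name__ == "__main__":
    # Hypothetical 8B dense model: 32 layers, 8 KV heads, head dim 128, 32k context.
    print(f"weights @ ~4.5 bpw: {weights_gib(8, 4.5):.2f} GiB")
    print(f"KV cache fp16:      {kv_cache_gib(32, 8, 128, 32768):.2f} GiB")
    print(f"KV cache 8-bit:     {kv_cache_gib(32, 8, 128, 32768, 1.0):.2f} GiB")
```

For this shape the fp16 KV cache at 32k context is about as large as the 4-bit weights, which is why KV precision matters so much for long context.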
Interesting used options:
- Tesla V100 16/32GB: strong used datacenter option, but check PCIe vs SXM and cooling.
- Tesla P40 24GB: lots of VRAM for cheap, but slower and has no Tensor Cores.
- Tesla P100 16GB: cheap, but old (Pascal) and less attractive than V100/P40 depending on workload.
- GTX 1080 Ti 11GB: cheap but VRAM-limited.
- RTX 3090 24GB: often the practical local LLM sweet spot if priced well.
TODO:
- Check current Intel Arc and AMD ROCm support.
- Compare used datacenter GPUs against RTX 3090/4090/5090-class consumer cards.
- Benchmark energy per token (joules/token = average watts / tokens per second), not just tokens/sec.