Sean's Blog

Making Local LLM Go Brrr

June 4, 2026
Edit on GitHub

How to run your local LLM well: fast, reliable and with good quality.

Key metrics:

  • Prefill speed: prompt/input tokens per second
  • Decode speed: generated tokens per second
  • Time to first token (latency)
  • VRAM usage at target context length
  • Quality at chosen model/quant/context settings
  • Concurrency, if serving multiple users

Software

Choose the serving stack based on workload:

  • llama.cpp: best general local path, especially GGUF, CPU, Apple Silicon, and mixed CPU/GPU.
  • vLLM: strong GPU server for batching, throughput, OpenAI-compatible APIs, and production-style serving for modern GPUs.
  • SGLang: good for structured/agentic serving and high-throughput multi-call workloads for modern GPUs.

Performance checklist:

  • Use the fastest supported attention kernels: FlashAttention, FlashInfer, FlashMLA, etc.
  • Try speculative decoding / MTP / EAGLE-style decoding when supported, but benchmark with your actual model and sampling settings.
  • Preserve prefix/KV cacheability:
    • keep the system prompt byte-identical
    • append new messages rather than rebuilding/changing history
    • avoid dynamic timestamps or changing tool schemas in the prompt prefix
    • use server-side prefix caching when available
  • Tune KV cache precision:
    • for long context, test q8 KV even with q4 weights
    • aggressively quantized KV can hurt long-context coherence
  • Evaluate KV-cache compression such as TurboQuant for long-context or high-concurrency workloads, but treat it as experimental until benchmarked.

Performant llama.cpp fork (with advanced features): https://ikawrakow-ik_llama-cpp.mintlify.app/inference/

Tool-calling reliability:

TODO:

  • eval dynamic model routing based on query complexity (fast vs smart model)

Open Models

Run LLM models locally for complete control and privacy. Open-source (reproducible training) vs open-weight (free model weights) models.

Compare model capability: https://artificialanalysis.ai/models
Find compatible models for your hardware: https://www.canirun.ai/ or try https://github.com/AlexsJones/llmfit
Community benchmarks for local LLM: https://localmaxxing.com

Curated open model list:

Hardware

VRAM matters more than raw TFLOPs for model & context (prompt) size, but memory bandwidth and tensor cores matter for speed. Used datacenter GPUs can be good value, but check form factor, cooling, power, driver support, and PCIe vs SXM.

Interesting used options:

  • Tesla V100 16/32GB: strong used datacenter option, but check PCIe vs SXM and cooling.
  • Tesla P40 24GB: lots of VRAM for cheap, slower and no Tensor Cores.
  • Pascal P100 16GB: cheap, but old and less attractive than V100/P40 depending on workload.
  • GTX 1080 Ti 11GB: cheap but VRAM-limited.
  • RTX 3090 24GB: often the practical local LLM sweet spot if priced well.

TODO:

  • Check current Intel Arc and AMD ROCm support.
  • Compare used datacenter GPUs against RTX 3090/4090/5090-class consumer cards.
  • Benchmark watts/token, not just tokens/sec.

References

#AI #tutorial