How to Run Large Local AI Models Efficiently

Key Takeaways

  • The biggest local model that barely loads is usually slower and less useful than a smaller model that fully or mostly fits in VRAM.
  • GGUF is the safest broad local default, but EXL2, GPTQ, or AWQ can be better when your runtime and GPU setup are more specialized.
  • Context window and KV cache settings are some of the fastest ways to destroy local performance without realizing it.
  • Prompt caching helps most when requests share stable prefixes such as system prompts, retrieved context, or repeated agent instructions.
  • Speculative decoding is worth testing only after model size, GPU residency, and baseline memory settings are already under control.

Running large local AI models efficiently means matching the model, runtime, and memory settings to your actual hardware instead of assuming the biggest model and the longest context window will somehow work out. In practice, local inference speed is usually limited by memory movement more than raw compute, which is why the right quantized format, GPU offload strategy, KV cache settings, and batching choices often matter more than buying one more CPU core.

If you want the short answer, start with the smallest model that can do the job, keep as much of the model on GPU as possible, avoid oversized context windows, reuse prefixes when prompts repeat, and only move to advanced tricks like speculative decoding or multi-GPU once the baseline setup is already stable. The goal is not to win a benchmark. The goal is to get acceptable quality, time-to-first-token, and tokens-per-second on the hardware you already have.

What actually makes local models fast or slow

Most people first think about parameter count, but local performance is shaped by four interacting limits:

  • Weight memory: can the model weights fit in GPU memory, or will some layers spill to CPU?
  • KV cache growth: can you afford the context window and concurrency you asked for?
  • Memory bandwidth: how quickly can your GPU, system RAM, and interconnect move data during prompt processing and generation?
  • Runtime overhead: does your chosen tool favor simple desktop use, dense single-GPU inference, or multi-user serving?

That is why a 14B model that fully fits in VRAM can feel much faster than a heavily offloaded 32B model, even if the larger model looks more impressive on paper. Once a setup starts leaning on CPU offload or oversized context buffers, latency rises quickly and generation can feel inconsistent.

Prompt processing and token generation also stress the system differently. Prompt ingestion benefits from larger batches and fast memory paths. Token generation is more sensitive to KV cache behavior, context size, and whether each new token must bounce across slow memory or across multiple devices. A setup that looks fine on a short prompt can become painfully slow on long chats, coding sessions, or agent loops.
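To make the memory-movement point concrete, here is a rough back-of-envelope sketch. It assumes a dense model where every generated token requires reading all the weights once, and it ignores KV cache reads, compute time, and kernel overhead, so treat the result as a ceiling rather than a prediction. The bandwidth figures and the 8 GB weight size are hypothetical placeholders, not measurements.

```python
def decode_tokens_per_second_ceiling(weight_bytes: float,
                                     bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Assumption: each generated token streams every weight through the
    processor once. KV cache traffic and compute are ignored.
    """
    if weight_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("weight size and bandwidth must be positive")
    return bandwidth_bytes_per_s / weight_bytes


GB = 1e9

# Hypothetical example: 8 GB of quantized weights.
on_gpu = decode_tokens_per_second_ceiling(8 * GB, 400 * GB)  # assumed 400 GB/s VRAM
on_ram = decode_tokens_per_second_ceiling(8 * GB, 50 * GB)   # assumed 50 GB/s system RAM

print(f"GPU-resident ceiling: {on_gpu:.2f} tok/s")
print(f"RAM-resident ceiling: {on_ram:.2f} tok/s")
```

Under these assumed numbers the same weights have an eight times lower ceiling when they are read from system RAM, which is the intuition behind keeping the model on GPU.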

Choose model size and format as one decision

The first efficiency rule is simple: pick the smallest model that consistently solves your task. For a local setup, every jump in parameter count affects both the stored weights and the working memory around them. If you choose the model first and only later think about quantization or runtime support, you usually end up with a setup that technically runs but wastes memory or collapses under long prompts.

For most home labs and small teams, a practical starting point looks like this:

  • 3B to 8B: good for fast assistants, light coding help, summarization, and structured tasks on modest hardware.
  • 12B to 14B: often the sweet spot when you want noticeably stronger reasoning or coding without immediately moving into workstation-class requirements.
  • 20B to 32B: useful when quality matters more than latency and you have enough VRAM, or a carefully tuned quantized setup.
  • 70B and above: usually a specialized decision for high-end home labs, multi-GPU rigs, or shared serving setups rather than casual desktop use.

Then choose the file format and quantization that matches both the runtime and the quality target.

Local model formats and when they fit best

Format | Best for | Main tradeoff
GGUF | llama.cpp, Ollama, LM Studio, broad local compatibility | Excellent portability, but quality and speed depend on the quantization choice
EXL2 | ExLlama on NVIDIA GPUs when you want aggressive local GPU efficiency | Very fast on the right setup, but less universal than GGUF
GPTQ or AWQ | GPU-first serving stacks such as vLLM or other optimized inference engines | Great for GPU deployments, but not the easiest choice for broad desktop portability
BF16 or FP16 weights | High-end serving where you prioritize fidelity or have enough VRAM | Higher memory cost and fewer shortcuts for constrained local hardware

GGUF is usually the safest local default because it is designed for GGML-based executors, supports single-file deployment, and is built for fast loading with features such as mmap compatibility. That makes it a strong fit for local environments where you want predictable loading behavior and broad tool support. Within GGUF, lower-bit quantization reduces memory use and often improves speed, but it can also reduce quality. A good rule is to start at a middle quantization tier, test on your real prompts, and only move more aggressively if the quality loss is still acceptable.
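As a rough illustration of how quantization changes the weight footprint, the sketch below multiplies parameter count by bits per weight. The three formats shown are chosen only because their per-weight cost is easy to state: Q8_0 and Q4_0 store 32 weights plus one 16-bit scale per block, which adds half a bit per weight. Real files differ somewhat because of metadata and tensors kept at higher precision, and the 14B parameter count is just an example.

```python
# Bits per weight for a few GGUF-style formats.
# Q8_0: 32 x 8-bit weights + one 16-bit scale per block -> 8.5 bits/weight.
# Q4_0: 32 x 4-bit weights + one 16-bit scale per block -> 4.5 bits/weight.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


def weight_gib(n_params: float, fmt: str) -> float:
    """Approximate weight footprint in GiB.

    Ignores file metadata and any tensors stored at a different precision,
    so this is an estimate, not an exact file size.
    """
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


for fmt in BITS_PER_WEIGHT:
    print(f"14B parameters at {fmt}: {weight_gib(14e9, fmt):.1f} GiB")
```

Comparing these estimates against your VRAM, before downloading anything, is a quick way to see which quantization tiers are even worth testing on your prompts.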

If you are running on NVIDIA consumer GPUs and care mostly about single-user local performance, ExLlama and EXL2 can be compelling because they are designed for fast local GPU inference and allow mixed quantization levels inside the model. If you care more about multi-user serving, concurrency, and scaling across GPUs or nodes, formats commonly used with vLLM are often a better match than a pure desktop-style stack.

When to use llama.cpp, Ollama, LM Studio, vLLM, or ExLlama

The right runtime depends less on hype and more on the job you are trying to do.

llama.cpp

Use llama.cpp when you want maximum local flexibility, broad hardware support, GGUF compatibility, hybrid CPU and GPU inference, and fine-grained control over settings like context size, batch size, KV offload, cache type, and speculative decoding. It is often the best tool for advanced tinkerers, reproducible local experiments, and constrained systems that cannot keep an entire model in VRAM.

Ollama

Use Ollama when you want the easiest path to getting a local model running with a clean UX and a minimal serving surface. It is especially good for local prototypes, desktop assistants, and teams that want a low-friction standard way to pull and run models. The main discipline with Ollama is not letting convenience hide bad settings. If a model is spilling to CPU or the context window is set too high for your hardware, the system can still feel slow.

LM Studio

Use LM Studio when you want a desktop-first local workflow with easier model management, memory estimation, adjustable GPU offload, context-length control, and built-in speculative decoding support. It is a strong fit for individuals or small teams who want to tune local models without living entirely in the terminal.

vLLM

Use vLLM when you are serving models for repeated workloads, higher throughput, or multiple users, especially on GPU infrastructure. Its strengths show up in prefix caching, optimized serving behavior, and distributed scaling with tensor and pipeline parallelism. For a small team building an internal local or on-prem endpoint, vLLM is often the better choice than a desktop-oriented runtime.

ExLlama

Use ExLlama when your workload is centered on NVIDIA GPUs and you want fast local inference with EXL2 quantization, dynamic batching, smart prompt caching, and strong single-node efficiency. It is less of a universal local answer and more of a specialist tool for people optimizing around one hardware family.

Which runtime usually fits which local setup

Runtime | Best fit | Watch out for
llama.cpp | Advanced local tuning across mixed hardware | More knobs means more room to misconfigure context, cache, and offload
Ollama | Simple local deployment and quick internal tools | Easy to overshoot context or tolerate hidden CPU offload
LM Studio | Desktop experimentation with clearer controls | Convenience does not remove the underlying VRAM limits
vLLM | Shared serving, throughput, and multi-GPU scale | Usually overkill for casual single-user desktop prompting
ExLlama | NVIDIA-focused local GPU efficiency | Narrower compatibility and more format-specific workflow choices

Tune memory before you chase clever optimizations

If a local model feels slow, the first question is not whether you need a new decoding trick. It is whether your memory budget is broken.

VRAM and GPU offload

VRAM is the first constraint because keeping more layers on GPU usually matters more than squeezing a slightly bigger model into the system by offloading large parts to CPU. Partial CPU offload can be useful when it is the only way to run the model at all, but it is usually a survival tactic, not the fastest path.

If you are using LM Studio, adjust GPU offload explicitly rather than guessing. If you are using Ollama, check whether the model is actually on GPU instead of assuming it is; the ollama ps command reports the CPU/GPU split for each loaded model. If you are using llama.cpp, tune the number of GPU layers (the --n-gpu-layers option) or the offload strategy deliberately and treat CPU offload as a tradeoff between feasibility and speed.
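One way to choose a starting point for GPU offload is to estimate how many layers fit after reserving room for everything else. The sketch below is a simplification: it assumes all layers are the same size and that you can guess a reserve for the KV cache and compute buffers. The layer count, weight size, and reserve are hypothetical inputs. The result is only a first guess for a setting such as llama.cpp's --n-gpu-layers, to be confirmed by watching actual memory use.

```python
GIB = 2**30


def gpu_layers_that_fit(n_layers: int, weight_bytes: float,
                        vram_bytes: float, reserve_bytes: float) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumptions: every layer has the same size, and `reserve_bytes` covers
    the KV cache, compute buffers, and anything else sharing the GPU.
    """
    per_layer = weight_bytes / n_layers
    usable = max(0.0, vram_bytes - reserve_bytes)
    return min(n_layers, int(usable // per_layer))


# Hypothetical model: 48 layers, 18 GiB of quantized weights, 3 GiB reserved.
partial = gpu_layers_that_fit(48, 18 * GIB, vram_bytes=16 * GIB, reserve_bytes=3 * GIB)
full = gpu_layers_that_fit(48, 18 * GIB, vram_bytes=24 * GIB, reserve_bytes=3 * GIB)

print(f"16 GiB card: about {partial} of 48 layers on GPU")
print(f"24 GiB card: about {full} of 48 layers on GPU")
```

If the estimate leaves a large share of layers on CPU, that is usually a signal to try a smaller model or a lower quantization tier before tuning anything else.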

System RAM and memory bandwidth

System RAM matters when you offload layers, cache prompts in host memory, or run models that cannot live entirely on GPU. Capacity matters, but bandwidth matters too. Slow memory paths can make a model that technically fits feel sluggish. This is why unified-memory Apple Silicon systems can behave differently from discrete GPU desktops, and why not all 32 GB or 64 GB systems feel the same in practice.

Context window and KV cache

Context is one of the easiest ways to accidentally waste performance. A longer context window increases memory requirements because the KV cache grows linearly with the number of tokens you want the model to keep around. For chat, coding, and agent workflows, large context can be useful. But setting it to the maximum just because the model advertises it is usually a mistake.

Use the shortest context that still fits the task. For a local coding helper, 32k may be enough. For a retrieval-heavy internal assistant, you may need more, but only if the hardware can support it without collapsing latency. If your tokens-per-second or concurrency suddenly falls apart, the KV cache is one of the first places to look.
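The cost of context is easy to estimate for a standard transformer: the cache stores one key and one value vector per layer, per KV head, per token. The sketch below uses an illustrative configuration (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries); substitute the values from your own model's config. It assumes a plain, unquantized KV cache with no sliding window or other cache-saving tricks.

```python
GIB = 2**30


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2, n_seqs: int = 1) -> int:
    """KV cache size for `n_seqs` sequences that each hold `n_ctx` tokens.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem * n_seqs


# Illustrative model shape, not tied to any specific release.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (8_192, 32_768, 131_072):
    size = kv_cache_bytes(n_ctx=n_ctx, **shape)
    print(f"{n_ctx:>7} tokens: {size / GIB:.1f} GiB of KV cache")
```

With this shape the cache costs 128 KiB per token, so the jump from an 8k to a 128k window adds memory on the same order as the weights of a mid-sized quantized model. Concurrency multiplies the cost again through `n_seqs`.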

Batch size and ubatch size

Batch size is another lever people misuse. Larger batch sizes can improve prompt processing throughput, especially on strong GPUs, but they also increase memory pressure and can worsen latency if pushed too far. In llama.cpp, logical batch size and physical micro-batch size are separate knobs for exactly this reason. For a single interactive user, lower latency usually matters more than maximum throughput. For a small shared server, the balance shifts.
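To picture the two knobs, the sketch below splits a prompt into logical batches and then into physical micro-batches. In llama.cpp these correspond to the --batch-size and --ubatch-size options, but the code is a simplified model of the idea only, not a description of llama.cpp internals. The function names and the example sizes are hypothetical.

```python
from typing import List


def split(total: int, size: int) -> List[int]:
    """Split `total` items into consecutive chunks of at most `size`."""
    if size <= 0:
        raise ValueError("chunk size must be positive")
    return [min(size, total - start) for start in range(0, total, size)]


def plan_prompt_batches(n_prompt_tokens: int, n_batch: int,
                        n_ubatch: int) -> List[List[int]]:
    """Return one list of micro-batch sizes per logical batch.

    A micro-batch can never be larger than the logical batch it belongs to.
    """
    n_ubatch = min(n_ubatch, n_batch)
    return [split(batch, n_ubatch) for batch in split(n_prompt_tokens, n_batch)]


plan = plan_prompt_batches(n_prompt_tokens=5000, n_batch=2048, n_ubatch=512)
for i, micro_batches in enumerate(plan):
    print(f"logical batch {i}: {micro_batches}")
```

The micro-batch size is what drives peak working memory per step, while the logical batch size controls how much of the prompt is submitted at once. That is why the two can be tuned separately.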

Prompt caching and prefix reuse

Prompt caching is one of the highest-value optimizations when the same prefixes repeat. If your system prompt, tool instructions, or retrieved context is reused often, caching avoids recomputing the same prefix again and again. This can meaningfully reduce time-to-first-token for assistants, agents, and repetitive internal workflows.

But caching is not magic. It works best when requests really do share a common prefix. If every request is unique, the benefit drops. You also need to watch memory pressure, because storing reusable prefixes still consumes capacity somewhere.
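The mechanics can be illustrated with a toy cache that remembers only the most recent token sequence and counts how many tokens of a new request still need to be processed. Real runtimes manage many cached prefixes and store the actual KV tensors; the class below is a hypothetical sketch with integers standing in for token IDs.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Remembers the last request and reports how much work a new one needs."""

    def __init__(self) -> None:
        self._last: List[int] = []

    def tokens_to_compute(self, tokens: Sequence[int]) -> int:
        reused = shared_prefix_len(self._last, tokens)
        self._last = list(tokens)
        return len(tokens) - reused


system_prompt = list(range(100))           # stand-in for a stable 100-token prefix
cache = ToyPrefixCache()

first = cache.tokens_to_compute(system_prompt + [1000, 1001])
second = cache.tokens_to_compute(system_prompt + [2000, 2001, 2002])
unrelated = cache.tokens_to_compute([9] * 50)

print(first, second, unrelated)
```

The second request only pays for its new suffix, while the unrelated request gets no benefit. Note that the saving depends on the shared text being at the very start of the prompt: anything that changes early in the prompt, such as a timestamp in the system message, invalidates everything after it.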

Speculative decoding

Speculative decoding can increase speed by letting a smaller draft model propose several tokens that the larger model then verifies together in a single forward pass. It is worth testing when generation is the bottleneck and you have enough extra resources to hold a useful draft model. It is usually not the first knob to turn. First make sure the base model, context window, and GPU residency are already sensible.

The tradeoff is straightforward: you spend more resources overall to reduce latency. If the draft model is too large, too slow, or too inaccurate for the task, the speedup can disappear or even reverse.
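A simple estimate shows how the speedup can disappear. The sketch below uses the standard simplified analysis of speculative decoding: if each drafted token is accepted independently with probability `alpha` and the draft proposes `k` tokens per step, the expected number of tokens produced per verification pass is (1 - alpha^(k+1)) / (1 - alpha). The independence assumption, the cost ratio, and the example values are all idealizations, so use this for intuition and measure real behavior on your own prompts.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes each drafted token is accepted independently with probability
    `alpha`; the target model always contributes one token itself.
    """
    if not 0.0 <= alpha <= 1.0 or k < 0:
        raise ValueError("need 0 <= alpha <= 1 and k >= 0")
    if alpha == 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding.

    `draft_cost_ratio` is the time of one draft-model step divided by the
    time of one target-model step. Each speculative step costs k draft
    steps plus one target pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost_ratio + 1)


good = estimated_speedup(alpha=0.8, k=4, draft_cost_ratio=0.1)   # cheap, accurate draft
poor = estimated_speedup(alpha=0.3, k=4, draft_cost_ratio=0.4)   # heavy, inaccurate draft

print(f"cheap accurate draft: {good:.2f}x")
print(f"heavy inaccurate draft: {poor:.2f}x")
```

A value below 1.0 means the draft model costs more than it saves, which is the reversal described above.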

Multi-GPU setups

Multi-GPU can help, but only when the runtime and interconnect match the problem. In llama.cpp, multi-GPU modes can spread layers and KV cache across devices, and tensor mode is aimed at fast token generation when one GPU is not enough. In vLLM, tensor and pipeline parallelism become important once a single GPU or even a single node can no longer hold the model comfortably. The trap is assuming two GPUs automatically mean twice the speed. Communication overhead, uneven memory splits, and weak interconnects can erase the benefit.
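A deliberately crude model makes the trap visible. Assume the per-token work splits perfectly across GPUs, and that each token also pays a fixed synchronization cost for moving data between devices. Both assumptions are simplifications, and the millisecond figures are hypothetical, but the shape of the result holds: the communication term does not shrink when you add GPUs.

```python
def multi_gpu_speedup(single_gpu_ms: float, n_gpus: int,
                      sync_ms_per_token: float) -> float:
    """Toy estimate of per-token speedup from splitting work across GPUs.

    Assumptions: compute divides evenly across devices, and every token
    pays a fixed synchronization cost once more than one GPU is involved.
    """
    if n_gpus < 1 or single_gpu_ms <= 0 or sync_ms_per_token < 0:
        raise ValueError("invalid inputs")
    if n_gpus == 1:
        return 1.0
    per_token_ms = single_gpu_ms / n_gpus + sync_ms_per_token
    return single_gpu_ms / per_token_ms


fast_link = multi_gpu_speedup(single_gpu_ms=40, n_gpus=2, sync_ms_per_token=2)
slow_link = multi_gpu_speedup(single_gpu_ms=40, n_gpus=2, sync_ms_per_token=25)

print(f"fast interconnect: {fast_link:.2f}x")
print(f"slow interconnect: {slow_link:.2f}x")
```

In this model two GPUs never reach 2x, and with a slow enough link they end up slower than one. The real reason to add a second GPU is often capacity, fitting a model that would otherwise spill to CPU, rather than raw speed.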

Thermals, power, and storage

Local inference that looks fine for a short test can degrade during longer sessions if the GPU or laptop chassis starts thermal throttling. Good airflow, realistic fan curves, and sustained power delivery matter more than many people expect. Storage matters too, mostly for model load time and operational smoothness. Slow storage does not usually dominate per-token inference, but it absolutely affects startup, model swapping, and repeated experimentation. NVMe storage is a quality-of-life upgrade for anyone rotating large model files often.

A step-by-step tuning checklist for home labs and small teams

  1. Define the real task. Write down whether you care most about chat quality, coding, retrieval, extraction, latency, or concurrency.
  2. Start with the smallest credible model. Only move up in size after it clearly fails on your real prompts.
  3. Pick the runtime that matches the job. Use llama.cpp, Ollama, or LM Studio for local desktop-style use; prefer vLLM or ExLlama when serving or GPU specialization matters more.
  4. Choose the right format for that runtime. GGUF for broad local compatibility, EXL2 for ExLlama, and serving-oriented quantized formats for GPU-first servers.
  5. Keep the model on GPU as much as possible. Full or near-full GPU residency usually beats large CPU offload.
  6. Set a realistic context window. Do not allocate 64k or 128k unless the workflow actually needs it.
  7. Tune KV cache and batch settings conservatively first. Increase only after measuring latency and memory usage.
  8. Enable prompt caching where prefixes repeat. This is especially useful for assistants with stable system prompts or repeated retrieved context.
  9. Test speculative decoding only after the baseline is stable. It is an optimization layer, not a rescue plan.
  10. Measure sustained behavior. Run longer sessions, not just one short prompt, and watch temperature, memory use, time-to-first-token, and output quality together.
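For the measurement steps in the checklist, a small helper like the one below can wrap any token stream and report time-to-first-token and decode throughput. It is a minimal sketch: `fake_stream` is a hypothetical stand-in for whatever streaming client your runtime provides, and the helper counts yielded chunks, which may not map one-to-one to model tokens.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Optional


def measure_stream(stream: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Dict[str, Optional[float]]:
    """Consume a token stream and report basic latency numbers."""
    start = clock()
    first: Optional[float] = None
    last = start
    n = 0
    for _ in stream:
        last = clock()
        if first is None:
            first = last
        n += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = last - first
    return {
        "tokens": n,
        "ttft_s": first - start,
        # Throughput after the first token; undefined for a single token.
        "tokens_per_s": (n - 1) / decode_time if n > 1 and decode_time > 0 else None,
    }


def fake_stream(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a real streaming response."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


stats = measure_stream(fake_stream(n_tokens=20, delay_s=0.005))
print(stats)
```

Running the same helper at the start and end of a long session, with the same prompt, is a simple way to spot thermal throttling or KV cache growth showing up as falling throughput.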

For a home lab, that often means choosing a better quantized 7B, 14B, or 32B model and getting it fully comfortable on your hardware before dreaming about 70B. For a small team, it often means deciding whether the local system is meant for one operator, a shared internal endpoint, or a privacy-sensitive workflow with repeated prompts. Those are different architecture choices, even if they all count as local AI.

Common mistakes that waste local inference performance

  • Using the maximum context window by default. More context is not free.
  • Judging performance from one short prompt. Long prompts and multi-turn chats expose the real bottlenecks.
  • Choosing a giant model before checking if a smaller one is already good enough.
  • Assuming CPU offload is fine because the model technically loads. Loading is not the same as running efficiently.
  • Ignoring thermals. A throttled GPU can quietly ruin a local benchmark.
  • Mixing format and runtime carelessly. The best quantization on one stack may be the wrong choice on another.
  • Turning on every optimization at once. Change one major lever at a time so you can see what actually helped.

The best local setup is usually the boring one: a model that fits, a context window that reflects real work, a runtime chosen for the right job, and a short list of optimizations you can explain. Once that is working, advanced techniques like prompt caching, speculative decoding, quantized KV cache, and multi-GPU sharding become useful multipliers instead of troubleshooting distractions.

If you remember only one idea from this guide, make it this: local model efficiency is a systems problem, not a single benchmark number. Good results come from matching model size, quantization, runtime, context, memory layout, and workload shape as one decision.

Frequently Asked Questions

What is the fastest way to make a local LLM feel faster?

Usually it is not one trick. Start by using a smaller model or a better quantized version, keep more of the model on GPU, reduce the context window to what the task actually needs, and reuse prompt prefixes where possible.

Is GGUF always the best local model format?

GGUF is often the best default for broad local compatibility, especially with llama.cpp, Ollama, and LM Studio. It is not always the fastest option on every GPU stack, but it is usually the easiest place to start.

Why does a local model slow down so much on long chats?

Long chats grow the KV cache and increase memory pressure. If the context window is large or the model is already close to your VRAM limit, token generation speed can fall quickly as the session grows.

When should a small team use vLLM instead of Ollama or LM Studio?

Use vLLM when you care about shared serving, throughput, repeated prompts, concurrency, or scaling across GPUs. Ollama and LM Studio are usually better for simpler desktop or single-operator local use.

Does speculative decoding always improve speed?

No. It can improve speed when the draft model is small enough, accurate enough, and the workload is a good fit. If the draft model is too heavy or the prompts are a poor match, it can add overhead without helping.

Decide what should run locally and what should not

If you are weighing local models against cloud APIs, the next step is not guessing. Nerova’s Scope audit helps map latency, privacy, cost, and workflow constraints so you can prioritize the right AI architecture before you build.
