
llama.cpp Engine

llama.cpp is the core inference engine that powers Jan’s ability to run AI models locally on your computer. Created by Georgi Gerganov, it’s designed to run large language models efficiently on consumer hardware without requiring specialized AI accelerators or cloud connections.

Key benefits:

  • Run models entirely offline after download
  • Use your existing hardware (CPU, GPU, or Apple Silicon)
  • Complete privacy - conversations never leave your device
  • No API costs or subscription fees

Navigate to Settings > Model Providers > Llama.cpp:

llama.cpp Settings

| Feature | What It Does | When to Use |
| --- | --- | --- |
| Engine Version | Shows the current llama.cpp version | Check when models require a newer engine |
| Check Updates | Downloads the latest engine | Update for new model support or bug fixes |
| Backend Selection | Choose a hardware-optimized build | After hardware changes or performance issues |

Different backends are optimized for specific hardware. Choose the one that matches your system:

Check your CUDA version in NVIDIA Control Panel, then select:

CUDA 12.0 (Most Common):

  • llama.cpp-avx2-cuda-12-0 - Modern CPUs with AVX2
  • llama.cpp-avx512-cuda-12-0 - Newer Intel/AMD CPUs with AVX512
  • llama.cpp-avx-cuda-12-0 - Older CPUs without AVX2

CUDA 11.7 (Older Drivers):

  • llama.cpp-avx2-cuda-11-7 - Modern CPUs
  • llama.cpp-avx-cuda-11-7 - Older CPUs

CPU Only (no GPU acceleration):

  • llama.cpp-avx2 - Most modern CPUs (2013+)
  • llama.cpp-avx512 - High-end Intel/AMD CPUs with AVX512
  • llama.cpp-avx - Older CPUs (2011-2013)
  • llama.cpp-noavx - Very old CPUs (pre-2011)

AMD and Intel GPUs (Vulkan):

  • llama.cpp-vulkan - AMD Radeon, Intel Arc, Intel integrated graphics
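
If you're not sure what your hardware supports, a quick terminal check helps. The commands below are illustrative: nvidia-smi assumes the NVIDIA driver is installed, and the CPU-flag check assumes Linux.

```bash
# NVIDIA: shows the driver version and the highest CUDA version it supports
nvidia-smi

# Linux: list AVX-level CPU flags to choose between avx, avx2, and avx512 builds
grep -m1 -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u
```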

Configure how the engine processes requests:

| Setting | What It Does | Default | When to Adjust |
| --- | --- | --- | --- |
| Auto-update engine | Automatically updates llama.cpp to the latest version | Enabled | Disable if you need version stability |
| Auto-Unload Old Models | Frees memory by unloading unused models | Disabled | Enable if switching between many models |
| Threads | CPU cores used for text generation (-1 = all cores) | -1 | Reduce if you need CPU for other tasks |
| Threads (Batch) | CPU cores used for batch processing | -1 | Usually matches the Threads setting |
| Context Shift | Drops the oldest text to make room for new text | Disabled | Enable for very long conversations |
| Max Tokens to Predict | Maximum response length (-1 = unlimited) | -1 | Set a limit to control response size |

Simple Analogy: Think of threads like workers in a factory. More workers (threads) means faster production, but if you need workers elsewhere (other programs), you might want to limit how many the factory uses.
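
Jan applies these options for you, but they correspond to standard llama.cpp flags. As a rough sketch only (model path and values are placeholders), the equivalent llama-server invocation looks like this:

```bash
# Hypothetical sketch: model path and values are placeholders.
# --threads / --threads-batch: -1 means use all available CPU cores.
# --n-predict: -1 means no fixed limit on response length.
llama-server -m ./models/my-model.gguf \
  --threads -1 \
  --threads-batch -1 \
  --n-predict -1
```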

| Setting | What It Does | Default | When to Adjust |
| --- | --- | --- | --- |
| Batch Size | Logical batch size for prompt processing | 2048 | Lower it if you have memory issues |
| uBatch Size | Physical batch size submitted to the hardware | 512 | Match your GPU's capabilities |
| Continuous Batching | Process multiple requests at once | Enabled | Keep enabled for efficiency |

Simple Analogy: Batch size is like the size of a delivery truck. A bigger truck (batch) can carry more packages (tokens) at once, but needs a bigger garage (memory) and more fuel (processing power).
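
For reference, the same settings map to llama.cpp's batching flags. A minimal sketch, again with a placeholder model path:

```bash
# Sketch with a placeholder model path.
# --batch-size (-b): logical batch; --ubatch-size (-ub): physical batch.
# --cont-batching lets concurrent requests share one loaded model.
llama-server -m ./models/my-model.gguf \
  --batch-size 2048 \
  --ubatch-size 512 \
  --cont-batching
```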

| Setting | What It Does | Default | When to Adjust |
| --- | --- | --- | --- |
| GPU Split Mode | How to divide the model across GPUs | Layer | Change only with multiple GPUs |
| Main GPU Index | Primary GPU used for processing | 0 | Select a different GPU if needed |

When to tweak: Only adjust if you have multiple GPUs and want to optimize how the model is distributed across them.
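
A hedged example of how this looks as llama.cpp flags, assuming a multi-GPU system and a placeholder model path:

```bash
# Multi-GPU sketch (placeholder model path).
# --split-mode accepts none, layer, or row; --main-gpu picks the primary device.
llama-server -m ./models/my-model.gguf \
  --split-mode layer \
  --main-gpu 0
```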

Control how models use system and GPU memory:

| Setting | What It Does | Default | When to Adjust |
| --- | --- | --- | --- |
| Flash Attention | Memory-efficient attention computation | Enabled | Disable only if you hit stability issues |
| Disable mmap | Turns off memory-mapped file loading | Disabled | Enable if you experience crashes |
| MLock | Locks the model in RAM (no swapping to disk) | Disabled | Enable if you have plenty of RAM |
| Disable KV Offload | Keeps conversation memory (KV cache) on the CPU | Disabled | Enable if GPU memory is limited |

Simple Analogy: Think of your computer’s memory like a desk workspace:

  • mmap is like keeping reference books open to specific pages (efficient)
  • mlock is like gluing papers to your desk so they can’t fall off (uses more space but faster access)
  • Flash Attention is like using sticky notes instead of full pages (saves space)
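
If you run llama.cpp directly, these toggles correspond roughly to the following flags (placeholder model path; the exact Flash Attention flag form can vary between versions):

```bash
# Memory-handling sketch (placeholder model path).
# --flash-attn enables Flash Attention (flag form may differ by version).
# --no-mmap disables memory-mapped loading; --mlock pins weights in RAM.
# --no-kv-offload keeps the KV cache (conversation memory) on the CPU.
llama-server -m ./models/my-model.gguf \
  --flash-attn \
  --no-mmap \
  --mlock \
  --no-kv-offload
```
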
| Setting | What It Does | Options | When to Adjust |
| --- | --- | --- | --- |
| KV Cache K Type | Precision of the "keys" in conversation memory | f16, q8_0, q4_0 | Lower precision saves memory |
| KV Cache V Type | Precision of the "values" in conversation memory | f16, q8_0, q4_0 | Lower precision saves memory |
| KV Cache Defragmentation Threshold | When to reorganize memory (0.1 = 10% fragmented) | 0.1 | Increase if you see memory errors |

Memory Precision Guide:

  • f16 (default): Full quality, uses most memory - like HD video
  • q8_0: Good quality, moderate memory - like standard video
  • q4_0: Acceptable quality, least memory - like compressed video

When to adjust: Start with f16. If you run out of memory, try q8_0. Only use q4_0 if absolutely necessary.
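
A sketch of the equivalent llama.cpp flags (placeholder model path); note that quantizing the V cache generally requires Flash Attention to be enabled:

```bash
# Quantized KV cache sketch (placeholder model path).
# q8_0 keys/values trade a little quality for noticeably less memory.
llama-server -m ./models/my-model.gguf \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --defrag-thold 0.1
```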

| Setting | What It Does | Default | When to Adjust |
| --- | --- | --- | --- |
| RoPE Scaling Method | How to extend the context length | None | For contexts beyond the model's training length |
| RoPE Scale Factor | Context extension multiplier | 1 | Increase for longer contexts |
| RoPE Frequency Base | Base frequency (0 = auto) | 0 | Leave at 0 unless the model card says otherwise |
| RoPE Frequency Scale Factor | Frequency adjustment | 1 | Advanced users only |

Simple Analogy: RoPE is like the model’s sense of position in a conversation. Imagine reading a book:

  • Normal: You remember where you are on the page
  • RoPE Scaling: Like using a magnifying glass to fit more words on the same page
  • Scaling too much can make the text (context) blurry (less accurate)

When to use: Only adjust if you need conversations longer than the model’s default context length and understand the quality tradeoffs.
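
For illustration, extending context with linear RoPE scaling in llama.cpp looks roughly like this (placeholder model path; the values are examples only):

```bash
# RoPE scaling sketch (placeholder model path; example values).
# Roughly doubles the usable context of a 4K-trained model to 8K,
# at some cost in quality.
llama-server -m ./models/my-model.gguf \
  --rope-scaling linear \
  --rope-scale 2 \
  --ctx-size 8192
```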

| Setting | What It Does | Default | When to Adjust |
| --- | --- | --- | --- |
| Mirostat Mode | Alternative text-generation sampling method | Disabled | Try it for more consistent output |
| Mirostat Learning Rate | How quickly it adapts (eta) | 0.1 | Lower = more stable |
| Mirostat Target Entropy | Target randomness (tau) | 5 | Lower = more focused |

Simple Analogy: Mirostat is like cruise control for text generation:

  • Regular sampling: You manually control speed (randomness) with temperature
  • Mirostat: Automatically adjusts to maintain consistent “speed” (perplexity)
  • Target Entropy: Your desired cruising speed
  • Learning Rate: How quickly the cruise control adjusts

When to use: Enable Mirostat if you find regular temperature settings produce inconsistent results. Start with defaults and adjust tau (3-7 range) for different styles.
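
The same knobs exist as llama.cpp sampling flags. A sketch using the defaults shown above (placeholder model path):

```bash
# Mirostat sketch (placeholder model path).
# --mirostat: 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0.
# --mirostat-lr is the learning rate (eta); --mirostat-ent is the target entropy (tau).
llama-server -m ./models/my-model.gguf \
  --mirostat 2 \
  --mirostat-lr 0.1 \
  --mirostat-ent 5
```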

| Setting | What It Does | Default | When to Adjust |
| --- | --- | --- | --- |
| Grammar File | BNF grammar that constrains output | None | For specific output formats |
| JSON Schema File | JSON schema that enforces structure | None | For JSON responses |

Simple Analogy: These are like templates or forms the model must fill out:

  • Grammar: Like Mad Libs - the model can only put words in specific places
  • JSON Schema: Like a tax form - specific fields must be filled with specific types of data

When to use: Only when you need guaranteed structured output (like JSON for an API). Most users won’t need these.
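
As an illustration, llama.cpp grammars use the GBNF format. The sketch below creates a tiny hypothetical grammar that only permits "yes" or "no" and shows both constraint styles (file name, schema, and model path are examples only):

```bash
# Hypothetical grammar that only allows "yes" or "no" (GBNF format).
cat > yes_no.gbnf << 'EOF'
root ::= ("yes" | "no")
EOF

# Constrain generation with the grammar file...
llama-server -m ./models/my-model.gguf --grammar-file yes_no.gbnf

# ...or enforce structured JSON output from an inline JSON schema.
llama-server -m ./models/my-model.gguf \
  --json-schema '{"type":"object","properties":{"answer":{"type":"string"}},"required":["answer"]}'
```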

Suggested starting points for common situations:

For maximum speed:

  1. Enable: Flash Attention, Continuous Batching
  2. Set Threads: -1 (use all CPU cores)
  3. Batch Size: Keep defaults (2048/512)

For limited memory:

  1. Enable: Auto-Unload Old Models, Flash Attention
  2. KV Cache: Set both types to q8_0 or q4_0
  3. Reduce: Batch Size to 512/128

For long conversations:

  1. Enable: Context Shift
  2. Consider: RoPE scaling (with quality tradeoffs)
  3. Monitor: Memory usage in System Monitor

For frequent model switching:

  1. Enable: Auto-Unload Old Models
  2. Disable: MLock (saves RAM)
  3. Use: Default memory settings

Model crashes or errors:

  • Disable mmap
  • Reduce Batch Size
  • Switch KV Cache to q8_0

Out of memory:

  • Enable Auto-Unload
  • Reduce KV Cache precision
  • Lower Batch Size

Slow performance:

  • Check Threads = -1
  • Enable Flash Attention
  • Verify GPU backend is active

Inconsistent output:

  • Try Mirostat mode
  • Adjust temperature in model settings
  • Check if Context Shift is needed

Each model can override engine defaults. Access via the gear icon next to any model:

Model Settings

| Setting | What It Controls | Impact |
| --- | --- | --- |
| Context Length | Conversation history size | Higher = more memory usage |
| GPU Layers | Model layers offloaded to the GPU | Higher = faster, but uses more VRAM |
| Temperature | Response randomness | 0.1 = focused, 1.0 = creative |
| Top P | Token selection pool | Lower = more focused responses |
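
These per-model options also map onto llama.cpp flags. A rough single-command sketch with placeholder path and values:

```bash
# Per-model sketch (placeholder path and values).
# --ctx-size: context length; --n-gpu-layers: layers offloaded to the GPU
# (a large number offloads everything); --temp / --top-p: sampling controls.
llama-server -m ./models/my-model.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --temp 0.7 \
  --top-p 0.9
```
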
If something goes wrong, the most common causes and fixes are:

Model won't load:

  1. Wrong backend: Try a CPU-only backend first (avx2 or avx)
  2. Insufficient memory: Check the model's RAM/VRAM requirements
  3. Outdated engine: Update to the latest version
  4. Corrupted download: Re-download the model

Generation is slow:

  1. No GPU acceleration: Verify the correct CUDA/Vulkan backend is selected
  2. Too few GPU layers: Increase GPU layers in model settings
  3. CPU bottleneck: Check that the thread count matches your core count
  4. Memory swapping: Reduce the context size or use a smaller model

Out of memory:

  1. Reduce quality: Switch the KV Cache to q8_0 or q4_0
  2. Lower context: Decrease the context length in model settings
  3. Fewer layers: Reduce GPU layers
  4. Smaller model: Use a more heavily quantized version (Q4 instead of Q8)

Crashes or instability:

  1. Backend mismatch: Use a more conservative variant (avx instead of avx2)
  2. Driver issues: Update your GPU drivers
  3. Overheating: Monitor temperatures and improve cooling
  4. Power limits: Check PSU capacity for high-end GPUs

Typical performance with different configurations:

| Hardware | Model Size | Backend | Tokens/sec |
| --- | --- | --- | --- |
| RTX 4090 | 7B Q4 | CUDA 12 | 80-120 |
| RTX 3070 | 7B Q4 | CUDA 12 | 40-60 |
| M2 Pro | 7B Q4 | Metal | 30-50 |
| Ryzen 9 | 7B Q4 | AVX2 | 10-20 |

For maximum performance, compile llama.cpp for your specific hardware:

```bash
# Clone and build with specific optimizations
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Examples for different systems
make LLAMA_CUDA=1    # NVIDIA GPUs
make LLAMA_METAL=1   # Apple Silicon
make LLAMA_VULKAN=1  # AMD/Intel GPUs
```
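
Note that recent llama.cpp releases have switched to a CMake-based build, so the Makefile flags above may not be recognized. The approximate CMake equivalents (option names can differ between releases):

```bash
# CMake-based build (newer llama.cpp releases)
cmake -B build -DGGML_CUDA=ON      # NVIDIA GPUs
# cmake -B build -DGGML_METAL=ON   # Apple Silicon (on by default on macOS)
# cmake -B build -DGGML_VULKAN=ON  # AMD/Intel GPUs
cmake --build build --config Release -j
```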

Fine-tune behavior with environment variables:

```bash
# Force a specific GPU
export CUDA_VISIBLE_DEVICES=0

# Thread tuning
export OMP_NUM_THREADS=8

# Memory limits
export GGML_CUDA_NO_PINNED=1
```
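
These variables must be set in the environment that launches the engine. If you run llama-server yourself, you can also scope them to a single invocation (placeholder model path):

```bash
# Scope the variables to a single run (placeholder model path)
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=8 \
  llama-server -m ./models/my-model.gguf
```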

For Beginners:

  1. Use default settings
  2. Start with smaller models (3-7B parameters)
  3. Enable GPU acceleration if available

For Power Users:

  1. Match backend to hardware precisely
  2. Tune memory settings for your VRAM
  3. Experiment with parallel slots for multi-tasking

For Developers:

  1. Enable verbose logging for debugging
  2. Use consistent settings across deployments
  3. Monitor resource usage during inference