
Local LLM Setup

Run AI-powered session analysis entirely on your machine using Ollama or llama.cpp with Gemma 4. Free, private, no API keys.

Why local?

Local LLM inference means your session data never leaves your machine. No API keys, no cloud bills, no rate limits. Code Insights supports two local inference backends:

  • Ollama — easiest setup, manages model downloads automatically
  • llama.cpp — more control over quantization, GPU layers, and memory usage

Both are free and work fully offline once the model is downloaded.

| Model | Size (Q4_K_M) | VRAM needed | Quality | Best for |
|---|---|---|---|---|
| Gemma 4 12B | ~7 GB | 8 GB | Good | Most dev machines, laptops |
| Gemma 4 27B | ~15 GB | 16 GB | Better | Workstations, desktop GPUs |
| Qwen3 14B | ~8 GB | 10 GB | Good | Alternative if Gemma unavailable |
| Llama 3.3 | ~4 GB | 6 GB | Adequate | Low-VRAM machines |

We recommend Gemma 4 12B as the default — it produces reliable structured JSON output for session analysis while fitting comfortably on most developer machines.
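If you're unsure which row of the table applies to your machine, a quick way to check available GPU memory on NVIDIA hardware (a sketch; on Apple Silicon, model weights share unified system memory instead):

```shell
# Report total GPU memory on NVIDIA hardware; falls back gracefully elsewhere.
vram=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader 2>/dev/null) \
  || vram="unknown (no nvidia-smi; on Apple Silicon, check total unified memory)"
echo "GPU memory: $vram"
```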

Option A: Ollama (recommended)

Ollama handles model downloads, GPU detection, and serving automatically.

1. Install Ollama

# macOS
brew install ollama
 
# Linux
curl -fsSL https://ollama.com/install.sh | sh

2. Pull a model

ollama pull gemma4          # Gemma 4 12B (~7 GB download)
# or
ollama pull gemma4:27b      # Gemma 4 27B (~15 GB download)

3. Configure Code Insights

code-insights config llm --provider ollama --model gemma4

Or use the interactive wizard:

code-insights config llm
# Select: Ollama (Local)
# Select: Gemma 4 12B

That's it. Ollama starts automatically when needed. Run code-insights insights <session_id> or click Analyze in the dashboard.
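To double-check that Ollama is reachable before running an analysis, you can hit its local API directly (a quick sketch; 11434 is Ollama's default port, and /api/tags lists pulled models):

```shell
# curl -sf exits non-zero if the server is unreachable.
if curl -sf http://localhost:11434/api/tags >/dev/null 2>&1; then
  ollama_status="running"
else
  ollama_status="not reachable"
fi
echo "Ollama: $ollama_status"
```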

Option B: llama.cpp (more control)

llama.cpp gives you direct control over quantization, GPU layer offloading, context size, and memory usage. Use this if you want to fine-tune inference performance or run models not available through Ollama.

1. Install llama.cpp

# macOS
brew install llama.cpp
 
# From source (any platform)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release

2. Download a GGUF model

Download a quantized Gemma 4 GGUF from Hugging Face. Look for Q4_K_M quantization as a good balance of quality and size.

3. Start llama-server

llama-server -m gemma-4-12b-it-Q4_K_M.gguf --port 8080

Common flags:

| Flag | Description |
|---|---|
| `-m <path>` | Path to the GGUF model file |
| `--port <port>` | HTTP port (default: 8080) |
| `-ngl <layers>` | Number of GPU layers to offload (use 99 for full GPU) |
| `-c <tokens>` | Context size (default: 2048; 8192 or higher recommended) |
| `--threads <n>` | CPU threads for prompt evaluation |
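Putting the flags together, a fuller invocation might look like this (a sketch; the model filename and thread count are examples you should adjust to your hardware):

```shell
# Full GPU offload, 8192-token context, 8 CPU threads.
llama-server \
  -m gemma-4-12b-it-Q4_K_M.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  --threads 8
```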

4. Configure Code Insights

code-insights config llm --provider llamacpp --model gemma-4-12b

Or use the interactive wizard:

code-insights config llm
# Select: llama.cpp (Local)
# Select: Gemma 4 12B (Q4_K_M)
# Base URL: http://localhost:8080 (default)

5. Verify

code-insights insights <session_id>

The dashboard Settings page also has a Discover Loaded Model button that queries your running llama-server to confirm which model is loaded.
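You can also query the running server yourself: llama-server exposes an OpenAI-compatible /v1/models endpoint that reports which model file is loaded (a sketch, assuming the default port 8080):

```shell
# Prints the loaded model info, or a placeholder if the server is down.
models_json=$(curl -sf http://localhost:8080/v1/models 2>/dev/null) || models_json=""
echo "${models_json:-llama-server not reachable on port 8080}"
```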

Differences between Ollama and llama.cpp

| Aspect | Ollama | llama.cpp |
|---|---|---|
| Setup | One command (ollama pull) | Manual GGUF download + server start |
| Auto-start | Yes (launches on demand) | No (you start llama-server manually) |
| Model management | Built-in (ollama list, ollama pull) | Manual file management |
| GPU control | Automatic | Fine-grained (-ngl, --threads) |
| Quantization choice | Limited to what Ollama publishes | Any GGUF from Hugging Face |
| Default port | 11434 | 8080 |
| API | Ollama native API | OpenAI-compatible API |

Choose Ollama if you want the simplest setup. Choose llama.cpp if you want control over quantization, GPU offloading, or want to run models Ollama doesn't support.
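The API difference in the last row is visible if you talk to each server by hand. A sketch of the same request in both dialects (model names are examples; each call assumes its server is running, otherwise a placeholder is printed):

```shell
# Ollama's native generate endpoint takes a flat prompt:
ollama_resp=$(curl -sf http://localhost:11434/api/generate \
  -d '{"model": "gemma4", "prompt": "Say hello", "stream": false}') \
  || ollama_resp="(Ollama not running)"
echo "$ollama_resp"

# llama-server speaks the OpenAI chat-completions dialect instead:
llamacpp_resp=$(curl -sf http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}]}') \
  || llamacpp_resp="(llama-server not running)"
echo "$llamacpp_resp"
```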

Troubleshooting

"Cannot connect to llama-server" / "Cannot connect to Ollama"

The inference server isn't running.

  • Ollama: Run ollama serve in a terminal, or restart the Ollama desktop app
  • llama.cpp: Run llama-server -m <model.gguf> — it must stay running while you analyze
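To see at a glance which backend (if either) is actually listening, you can probe both default ports (a sketch; both servers answer plain GET requests on their root path):

```shell
# Check the default ports for each backend.
for entry in "Ollama:11434" "llama-server:8080"; do
  name=${entry%%:*}
  port=${entry##*:}
  if curl -sf "http://localhost:$port/" >/dev/null 2>&1; then
    echo "$name: listening on port $port"
  else
    echo "$name: nothing on port $port"
  fi
done
```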

Slow inference

  • Ensure GPU offloading is active. For llama.cpp: -ngl 99 offloads all layers to GPU
  • Use a smaller quantization (Q4_K_M instead of Q8_0) or a smaller model (12B instead of 27B)
  • Close other GPU-heavy applications

JSON parse errors or garbled output

Small quantized models occasionally produce malformed JSON. Code Insights handles this with:

  • Grammar-constrained JSON output (llama.cpp response_format)
  • Automatic single retry on parse failure
  • Lower temperature (0.3) for more consistent output
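You can exercise the grammar-constrained path yourself: llama-server's OpenAI-compatible endpoint accepts response_format to force valid JSON output. A sketch (the prompt is an example; assumes the server is on port 8080):

```shell
# Request JSON-constrained output at low temperature.
json_resp=$(curl -sf http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Return {\"ok\": true} as JSON"}],
    "response_format": {"type": "json_object"},
    "temperature": 0.3
  }') || json_resp="(llama-server not running)"
echo "$json_resp"
```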

If errors persist, try a larger model (27B) or a higher quantization (Q6_K, Q8_0).

Analysis seems lower quality than cloud providers

This is expected — a 12B quantized model won't match GPT-4o or Claude Sonnet. The trade-off is privacy and cost. For best local results:

  • Use Gemma 4 27B if your hardware supports it
  • Use Q6_K or Q8_0 quantization (larger files, better quality)
  • Ensure context size is at least 8192 tokens (-c 8192 for llama.cpp)