
Local LLM Setup

Run AI-powered session analysis entirely on your machine using Ollama or llama.cpp with Gemma 4. Free, private, no API keys.

Why local?

Local LLM inference means your session data never leaves your machine. No API keys, no cloud bills, no rate limits. Code Insights supports two local inference backends:

  • Ollama — easiest setup, manages model downloads automatically
  • llama.cpp — more control over quantization, GPU layers, and memory usage

Both are free and work fully offline once the model is downloaded.

| Model | Size (Q4_K_M) | VRAM needed | Quality | Best for |
|---|---|---|---|---|
| Gemma 4 12B | ~7 GB | 8 GB | Good | Most dev machines, laptops |
| Gemma 4 27B | ~15 GB | 16 GB | Better | Workstations, desktop GPUs |
| Qwen3 14B | ~8 GB | 10 GB | Good | Alternative if Gemma unavailable |
| Llama 3.3 | ~4 GB | 6 GB | Adequate | Low-VRAM machines |

We recommend Gemma 4 12B as the default — it produces reliable structured JSON output for session analysis while fitting comfortably on most developer machines.
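If you're unsure which row of the table applies to your machine, a quick way to check available GPU memory on NVIDIA hardware (a sketch; on Apple Silicon, model weights share unified system memory instead):

```shell
# Report total GPU memory on NVIDIA hardware; falls back gracefully elsewhere.
vram=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader 2>/dev/null) \
  || vram="unknown (no nvidia-smi; on Apple Silicon, check total unified memory)"
echo "GPU memory: $vram"
```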

Option A: Ollama (recommended)

Ollama handles model downloads, GPU detection, and serving automatically.

1. Install Ollama

# macOS
brew install ollama
 
# Linux
curl -fsSL https://ollama.com/install.sh | sh

2. Pull a model

ollama pull gemma4          # Gemma 4 12B (~7 GB download)
# or
ollama pull gemma4:27b      # Gemma 4 27B (~15 GB download)

3. Configure Code Insights

code-insights config llm --provider ollama --model gemma4

Or use the interactive wizard:

code-insights config llm
# Select: Ollama (Local)
# Select: Gemma 4 12B

That's it. Ollama starts automatically when needed. Run code-insights insights <session_id> or click Analyze in the dashboard.
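To double-check that Ollama is reachable before running an analysis, you can hit its local API directly (a quick sketch; 11434 is Ollama's default port, and /api/tags lists pulled models):

```shell
# curl -sf exits non-zero if the server is unreachable.
if curl -sf http://localhost:11434/api/tags >/dev/null 2>&1; then
  ollama_status="running"
else
  ollama_status="not reachable"
fi
echo "Ollama: $ollama_status"
```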

Option B: llama.cpp (more control)

llama.cpp gives you direct control over quantization, GPU layer offloading, context size, and memory usage. Use this if you want to fine-tune inference performance or run models not available through Ollama.

1. Install llama.cpp

# macOS
brew install llama.cpp
 
# From source (any platform)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release

2. Download a GGUF model

Download a quantized Gemma 4 GGUF from Hugging Face. Look for Q4_K_M quantization as a good balance of quality and size.

3. Start llama-server

llama-server -m gemma-4-12b-it-Q4_K_M.gguf --port 8080

Common flags:

| Flag | Description |
|---|---|
| `-m <path>` | Path to the GGUF model file |
| `--port <port>` | HTTP port (default: 8080) |
| `-ngl <layers>` | Number of GPU layers to offload (use 99 for full GPU) |
| `-c <tokens>` | Context size (default: 2048; 8192 or higher recommended) |
| `--threads <n>` | CPU threads for prompt evaluation |
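Putting the flags together, a fuller invocation might look like this (a sketch; the model filename and thread count are examples you should adjust to your hardware):

```shell
# Full GPU offload, 8192-token context, 8 CPU threads.
llama-server \
  -m gemma-4-12b-it-Q4_K_M.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  --threads 8
```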

4. Configure Code Insights

code-insights config llm --provider llamacpp --model gemma-4-12b

Or use the interactive wizard:

code-insights config llm
# Select: llama.cpp (Local)
# Select: Gemma 4 12B (Q4_K_M)
# Base URL: http://localhost:8080 (default)

5. Verify

code-insights insights <session_id>

The dashboard Settings page also has a Discover Loaded Model button that queries your running llama-server to confirm which model is loaded.
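You can also query the running server yourself: llama-server exposes an OpenAI-compatible /v1/models endpoint that reports which model file is loaded (a sketch, assuming the default port 8080):

```shell
# Prints the loaded model info, or a placeholder if the server is down.
models_json=$(curl -sf http://localhost:8080/v1/models 2>/dev/null) || models_json=""
echo "${models_json:-llama-server not reachable on port 8080}"
```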

Differences between Ollama and llama.cpp

| Aspect | Ollama | llama.cpp |
|---|---|---|
| Setup | One command (ollama pull) | Manual GGUF download + server start |
| Auto-start | Yes (launches on demand) | No (you start llama-server manually) |
| Model management | Built-in (ollama list, ollama pull) | Manual file management |
| GPU control | Automatic | Fine-grained (-ngl, --threads) |
| Quantization choice | Limited to what Ollama publishes | Any GGUF from Hugging Face |
| Default port | 11434 | 8080 |
| API | Ollama native API | OpenAI-compatible API |

Choose Ollama if you want the simplest setup. Choose llama.cpp if you want control over quantization, GPU offloading, or want to run models Ollama doesn't support.
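The API difference in the last row is visible if you talk to each server by hand. A sketch of the same request in both dialects (model names are examples; each call assumes its server is running, otherwise a placeholder is printed):

```shell
# Ollama's native generate endpoint takes a flat prompt:
ollama_resp=$(curl -sf http://localhost:11434/api/generate \
  -d '{"model": "gemma4", "prompt": "Say hello", "stream": false}') \
  || ollama_resp="(Ollama not running)"
echo "$ollama_resp"

# llama-server speaks the OpenAI chat-completions dialect instead:
llamacpp_resp=$(curl -sf http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}]}') \
  || llamacpp_resp="(llama-server not running)"
echo "$llamacpp_resp"
```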

Troubleshooting

"Cannot connect to llama-server" / "Cannot connect to Ollama"

The inference server isn't running.

  • Ollama: Run ollama serve in a terminal, or restart the Ollama desktop app
  • llama.cpp: Run llama-server -m <model.gguf> — it must stay running while you analyze
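To see at a glance which backend (if either) is actually listening, you can probe both default ports (a sketch; both servers answer plain GET requests on their root path):

```shell
# Check the default ports for each backend.
for entry in "Ollama:11434" "llama-server:8080"; do
  name=${entry%%:*}
  port=${entry##*:}
  if curl -sf "http://localhost:$port/" >/dev/null 2>&1; then
    echo "$name: listening on port $port"
  else
    echo "$name: nothing on port $port"
  fi
done
```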

Slow inference

  • Ensure GPU offloading is active. For llama.cpp: -ngl 99 offloads all layers to GPU
  • Use a smaller quantization (Q4_K_M instead of Q8_0) or a smaller model (12B instead of 27B)
  • Close other GPU-heavy applications

JSON parse errors or garbled output

Small quantized models occasionally produce malformed JSON. Code Insights handles this with:

  • Grammar-constrained JSON output (llama.cpp response_format)
  • Automatic single retry on parse failure
  • Lower temperature (0.3) for more consistent output
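You can exercise the grammar-constrained path yourself: llama-server's OpenAI-compatible endpoint accepts response_format to force valid JSON output. A sketch (the prompt is an example; assumes the server is on port 8080):

```shell
# Request JSON-constrained output at low temperature.
json_resp=$(curl -sf http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Return {\"ok\": true} as JSON"}],
    "response_format": {"type": "json_object"},
    "temperature": 0.3
  }') || json_resp="(llama-server not running)"
echo "$json_resp"
```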

If errors persist, try a larger model (27B) or a higher quantization (Q6_K, Q8_0).

Analysis seems lower quality than cloud providers

This is expected — a 12B quantized model won't match GPT-4o or Claude Sonnet. The trade-off is privacy and cost. For best local results:

  • Use Gemma 4 27B if your hardware supports it
  • Use Q6_K or Q8_0 quantization (larger files, better quality)
  • Ensure context size is at least 8192 tokens (-c 8192 for llama.cpp)