# Local LLM Setup
Run AI-powered session analysis entirely on your machine using Ollama or llama.cpp with Gemma 4. Free, private, no API keys.
## Why local?
Local LLM inference means your session data never leaves your machine. No API keys, no cloud bills, no rate limits. Code Insights supports two local inference backends:
- Ollama — easiest setup, manages model downloads automatically
- llama.cpp — more control over quantization, GPU layers, and memory usage
Both are free and work fully offline once the model is downloaded.
## Recommended models
| Model | Size (Q4_K_M) | VRAM needed | Quality | Best for |
|---|---|---|---|---|
| Gemma 4 12B | ~7 GB | 8 GB | Good | Most dev machines, laptops |
| Gemma 4 27B | ~15 GB | 16 GB | Better | Workstations, desktop GPUs |
| Qwen3 14B | ~8 GB | 10 GB | Good | Alternative if Gemma unavailable |
| Llama 3.3 | ~4 GB | 6 GB | Adequate | Low-VRAM machines |
We recommend Gemma 4 12B as the default — it produces reliable structured JSON output for session analysis while fitting comfortably on most developer machines.
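As a rule of thumb, a Q4_K_M file weighs roughly 0.6 GB per billion parameters, plus about a gigabyte of headroom for the KV cache and runtime overhead. A quick back-of-envelope sketch (the constants here are rough assumptions, not exact figures, and real usage grows with context size):

```shell
# Rough VRAM estimate for a Q4_K_M model: ~0.6 GB per billion parameters
# for the weights, plus ~1 GB headroom for KV cache and runtime overhead.
# Ballpark only; actual usage depends on context size and backend.
estimate_vram_gb() {
  awk -v p="$1" 'BEGIN { printf "%.0f\n", p * 0.6 + 1 }'
}

estimate_vram_gb 12   # Gemma 4 12B
estimate_vram_gb 27   # Gemma 4 27B
```

If the estimate exceeds your VRAM, you can still run the model with partial GPU offload, at reduced speed.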
## Option A: Ollama (recommended for most users)
Ollama handles model downloads, GPU detection, and serving automatically.
### 1. Install Ollama

```shell
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh
```

### 2. Pull a model

```shell
ollama pull gemma4        # Gemma 4 12B (~7 GB download)
# or
ollama pull gemma4:27b    # Gemma 4 27B (~15 GB download)
```

### 3. Configure Code Insights

```shell
code-insights config llm --provider ollama --model gemma4
```

Or use the interactive wizard:

```shell
code-insights config llm
# Select: Ollama (Local)
# Select: Gemma 4 12B
```

That's it. Ollama starts automatically when needed. Run `code-insights insights <session_id>` or click Analyze in the dashboard.
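If an analysis fails, it helps to confirm Ollama is actually reachable first. A small sketch that pokes Ollama's HTTP API on its default port (adjust the URL if you changed `OLLAMA_HOST`):

```shell
# Report whether the Ollama HTTP API is reachable on its default port (11434).
# /api/tags lists locally available models; we only care that it answers.
check_ollama() {
  if curl -fsS --max-time 2 "http://localhost:11434/api/tags" >/dev/null 2>&1; then
    echo "ollama: up"
  else
    echo "ollama: down"
  fi
}

check_ollama
```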
## Option B: llama.cpp (more control)
llama.cpp gives you direct control over quantization, GPU layer offloading, context size, and memory usage. Use this if you want to fine-tune inference performance or run models not available through Ollama.
### 1. Install llama.cpp

```shell
# macOS
brew install llama.cpp

# From source (any platform)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release
```

### 2. Download a GGUF model

Download a quantized Gemma 4 GGUF from Hugging Face. Look for Q4_K_M quantization as a good balance of quality and size.

### 3. Start llama-server

```shell
llama-server -m gemma-4-12b-it-Q4_K_M.gguf --port 8080
```

Common flags:
| Flag | Description |
|---|---|
| `-m <path>` | Path to the GGUF model file |
| `--port <port>` | HTTP port (default: 8080) |
| `-ngl <layers>` | Number of GPU layers to offload (use 99 for full GPU) |
| `-c <tokens>` | Context size (default: 2048; 8192 or higher recommended) |
| `--threads <n>` | CPU threads for prompt evaluation |
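Putting the flags together, a typical invocation for the 12B model on a GPU machine might look like the following (the thread count and context size are illustrative; tune them for your hardware):

```shell
llama-server -m gemma-4-12b-it-Q4_K_M.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  --threads 8
```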
### 4. Configure Code Insights

```shell
code-insights config llm --provider llamacpp --model gemma-4-12b
```

Or use the interactive wizard:

```shell
code-insights config llm
# Select: llama.cpp (Local)
# Select: Gemma 4 12B (Q4_K_M)
# Base URL: http://localhost:8080 (default)
```

### 5. Verify

```shell
code-insights insights <session_id>
```

The dashboard Settings page also has a Discover Loaded Model button that queries your running llama-server to confirm which model is loaded.
## Differences between Ollama and llama.cpp
| Aspect | Ollama | llama.cpp |
|---|---|---|
| Setup | One command (`ollama pull`) | Manual GGUF download + server start |
| Auto-start | Yes (launches on demand) | No (you start `llama-server` manually) |
| Model management | Built-in (`ollama list`, `ollama pull`) | Manual file management |
| GPU control | Automatic | Fine-grained (`-ngl`, `--threads`) |
| Quantization choice | Limited to what Ollama publishes | Any GGUF from Hugging Face |
| Default port | 11434 | 8080 |
| API | Ollama native API | OpenAI-compatible API |
Choose Ollama if you want the simplest setup. Choose llama.cpp if you want control over quantization, GPU offloading, or want to run models Ollama doesn't support.
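Because llama-server exposes an OpenAI-compatible API, any OpenAI-style client can talk to it. A sketch of the request body its `/v1/chat/completions` endpoint accepts (the model name and prompt are illustrative; the low temperature matches the setting mentioned under Troubleshooting):

```shell
# OpenAI-style chat completion body for llama-server.
# Model name and prompt are illustrative placeholders.
payload='{"model": "gemma-4-12b", "messages": [{"role": "user", "content": "Summarize this session"}], "temperature": 0.3}'

# Send it to a running llama-server (default port 8080):
# curl -s http://localhost:8080/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$payload"
printf '%s\n' "$payload"
```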
## Troubleshooting
"Cannot connect to llama-server" / "Cannot connect to Ollama"
The inference server isn't running.
- Ollama: run `ollama serve` in a terminal, or restart the Ollama desktop app
- llama.cpp: run `llama-server -m <model.gguf>`; it must stay running while you analyze
### Slow inference
- Ensure GPU offloading is active. For llama.cpp, `-ngl 99` offloads all layers to the GPU
- Use a smaller quantization (Q4_K_M instead of Q8_0) or a smaller model (12B instead of 27B)
- Close other GPU-heavy applications
### JSON parse errors or garbled output
Small quantized models occasionally produce malformed JSON. Code Insights handles this with:
- Grammar-constrained JSON output (llama.cpp `response_format`)
- Automatic single retry on parse failure
- Lower temperature (0.3) for more consistent output
If errors persist, try a larger model (27B) or a higher quantization (Q6_K, Q8_0).
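The single-retry behavior can be pictured as a small wrapper: validate the model's output as JSON, and re-run the generation once if parsing fails. The function names below are hypothetical; `generate_analysis` stands in for the actual LLM call:

```shell
# Hypothetical sketch of a parse-and-retry loop. generate_analysis is a
# stand-in for the real LLM call; here it just emits a fixed JSON string.
generate_analysis() {
  echo '{"summary": "stub"}'
}

# Returns success only if the argument is well-formed JSON.
valid_json() {
  printf '%s' "$1" | python3 -m json.tool >/dev/null 2>&1
}

analyze_with_retry() {
  out=$(generate_analysis "$1")
  # One automatic retry if the first response is malformed JSON
  valid_json "$out" || out=$(generate_analysis "$1")
  printf '%s\n' "$out"
}

analyze_with_retry some-session-id
```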
### Analysis seems lower quality than cloud providers
This is expected — a 12B quantized model won't match GPT-4o or Claude Sonnet. The trade-off is privacy and cost. For best local results:
- Use Gemma 4 27B if your hardware supports it
- Use Q6_K or Q8_0 quantization (larger files, better quality)
- Ensure context size is at least 8192 tokens (`-c 8192` for llama.cpp)