
What 500 AI Coding Sessions Taught Me About How I Actually Code

I ran Code Insights on 500+ of my own AI coding sessions across Claude Code and Cursor. The data surprised me — here's what I learned about friction, patterns, and prompt quality.

10 min read · Srikanth Rao
ai-coding · developer-patterns · productivity · data-analysis


I've been building software with AI coding tools for over a year. Claude Code for most of my work, Cursor when I'm in the IDE, occasionally Copilot for quick edits. Like most developers, I had a rough sense of how I worked — which tools I reached for, how long things took, where I got stuck.

Then I pointed Code Insights at my own session history and analyzed 523 sessions spanning three months of real work. Not toy projects or tutorials — production code, bug fixes, feature builds, refactors. The kind of work that fills an actual engineering week.

Some of what I found confirmed what I already suspected. A surprising amount of it didn't.

The dataset

523 sessions. Roughly 70% Claude Code, 25% Cursor, 5% Copilot. Each session was parsed, classified by session type, and analyzed for friction points, effective patterns, and prompt quality across five dimensions.

A few baseline numbers to set context:

  • Median session length: 14 messages (mean was higher — 23 — pulled up by long feature builds)
  • Most common session type: feature_build (34%), followed by bug_hunt (22%) and quick_task (18%)
  • Sessions with at least one friction point: 67%
  • Average prompt quality score: 71/100
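
The median/mean gap in that list is just long-tail skew. A tiny sketch with made-up session lengths (not my real data, just numbers chosen to mirror the 14/23 split) shows how a few long feature builds pull the mean up while the median stays put:

```typescript
// Hypothetical message counts per session: mostly short, three long feature builds.
const lengths = [8, 10, 12, 13, 13, 14, 18, 22, 42, 48, 53];

// Mean is sensitive to the long tail.
const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;

// Median is not: sort, then take the middle element.
const median = (xs: number[]): number => {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
};

console.log(mean, median(lengths)); // mean 23, median 14
```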

That last number stung a little. I build a tool that measures prompt quality, and my own average was a C+. But we'll get to that.

Finding 1: Two-thirds of my sessions had friction — and half of it was my fault

Code Insights classifies friction into nine categories and attributes each one: was this user-actionable, an AI capability limitation, or environmental?
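
The real Code Insights schema isn't shown here, but the classification it describes boils down to a record per friction point plus a tally by attribution. A minimal sketch, with illustrative names:

```typescript
// Hypothetical shapes — the actual Code Insights data model may differ.
type Attribution = "user-actionable" | "ai-capability" | "environmental";

interface FrictionPoint {
  category: string; // one of the nine categories, e.g. "wrong-approach"
  attribution: Attribution;
}

// Tally friction points by attribution across a batch of sessions.
function tallyByAttribution(points: FrictionPoint[]): Record<Attribution, number> {
  const counts: Record<Attribution, number> = {
    "user-actionable": 0,
    "ai-capability": 0,
    "environmental": 0,
  };
  for (const p of points) counts[p.attribution]++;
  return counts;
}
```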

Here's what my friction breakdown looked like:

| Friction Category | Frequency | Primary Attribution |
| --- | --- | --- |
| Wrong approach | 18% | User-actionable |
| Stale assumptions | 16% | User-actionable |
| Knowledge gap | 14% | AI capability |
| Context loss | 12% | Environmental |
| Incomplete requirements | 11% | User-actionable |
| Scope creep | 10% | User-actionable |
| Repeated mistakes | 8% | AI capability |
| Documentation gap | 6% | Environmental |
| Tooling limitation | 5% | Environmental |

The distribution surprised me. I expected AI capability limitations — hallucinated APIs, wrong library versions, outdated knowledge — to dominate. They didn't. The top two friction categories were things I could have prevented: starting with the wrong approach, and working from stale assumptions about how the code worked.

"Wrong approach" means I pointed the AI in a direction that wasn't going to work — asking it to modify a deprecated API instead of the new one, or trying to fix a symptom instead of the root cause. The AI faithfully followed my lead into a dead end. It wasn't wrong. I was wrong, and it agreed with me.

"Stale assumptions" is subtler. This is when I had a mental model of the codebase that was no longer accurate — a function I thought returned a promise that now returned synchronously, a config file that had moved, a type that had been renamed two PRs ago. I'd give the AI instructions based on how the code used to work, and it would either catch the discrepancy (sometimes) or build on my outdated understanding (more often).

The honest takeaway: the AI was a better collaborator than I was a director. Most of my friction came from giving it bad information, not from it producing bad output.

Finding 2: My prompt quality varied wildly by task type

Code Insights scores prompts across five dimensions: context provision, request specificity, scope management, information timing, and correction quality. Each dimension is 0-100, and they combine into an overall score.
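
The article doesn't say how the five dimensions combine, so as an assumption this sketch uses a plain unweighted average — the real tool may weight dimensions differently:

```typescript
// The five dimensions named above; field names are my own labels.
interface PromptScores {
  contextProvision: number;
  requestSpecificity: number;
  scopeManagement: number;
  informationTiming: number;
  correctionQuality: number;
}

// Assumed combination rule: simple mean of the five 0-100 scores.
function overallScore(s: PromptScores): number {
  const values = Object.values(s);
  return Math.round(values.reduce((a, b) => a + b, 0) / values.length);
}
```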

My averages by session type:

| Session Type | Overall Score | Weakest Dimension |
| --- | --- | --- |
| Feature builds | 78 | Scope management |
| Bug hunts | 74 | Context provision |
| Refactors | 82 | Information timing |
| Quick tasks | 58 | Request specificity |
| Exploration | 63 | Scope management |

The pattern is clear: the more "serious" I considered the task, the better I prompted. Feature builds and refactors got my best prompting because I took them seriously — I'd describe the architecture, reference specific files, outline what I wanted. Quick tasks got my worst prompting because I didn't think they needed it.

But here's the thing: quick tasks had the highest friction rate of any session type. 78% of my quick task sessions hit at least one friction point. The ones I took least seriously produced the most wasted turns.

The 58 average on quick tasks was almost entirely driven by low request specificity. "Fix the type error in auth" instead of "The TypeScript compiler is throwing TS2345 on line 47 of src/auth/middleware.ts — the session parameter expects Session but we're passing Session | null." The first prompt leads to the AI reading multiple files trying to figure out which type error I mean. The second leads to a one-turn fix.

I started treating quick tasks with the same prompting discipline I gave feature builds. My quick task friction rate dropped noticeably within two weeks. Not to zero — some friction is inherent — but enough that I stopped losing 20 minutes to what should have been 5-minute fixes.

Finding 3: I had effective patterns I didn't know about

Code Insights also identifies effective patterns — things you do that make sessions go well. These get classified into eight categories: structured planning, incremental implementation, verification workflow, systematic debugging, self-correction, context gathering, domain expertise, and effective tooling.

My most frequent effective patterns:

  1. Context gathering (28% of sessions) — Reading broadly before editing. Opening related files, checking types, understanding the call chain before making changes.
  2. Incremental implementation (24%) — Building features in small, testable pieces instead of generating everything at once.
  3. Verification workflow (19%) — Running tests or checking behavior between steps instead of at the end.

What surprised me was the driver attribution. Code Insights tracks whether a pattern was user-driven, AI-driven, or collaborative.

My context gathering was overwhelmingly AI-driven. In most sessions, I didn't explicitly ask the AI to read related files — it did it on its own. The AI was compensating for context I wasn't providing. When I compared sessions where I proactively provided context ("here's how auth works in this project: [explanation]") versus sessions where the AI had to discover it through file reads, the proactive sessions were shorter by an average of 6 messages and had fewer friction points.

The AI's context gathering was effective, but it was also a workaround for my laziness. When I did the context gathering myself — in my prompt, not through tool calls — sessions went better.

Finding 4: The sessions that felt productive weren't always the ones the data agreed with

I went back and flagged 20 sessions I remembered as particularly productive and 20 I remembered as frustrating. Then I compared them to the data.

The "productive" sessions I flagged did tend to have lower friction and higher prompt quality. No surprise there. But four of the twenty had significant friction that I'd completely forgotten about — multi-turn debugging detours that I'd mentally edited out because the session eventually succeeded.

More interesting: three of my "frustrating" sessions actually had above-average prompt quality and below-average friction. They felt frustrating because the task was hard — complex refactors in unfamiliar code — not because the AI collaboration went poorly. The AI and I worked well together; the problem was just difficult.

This is the gap between feeling and measurement. Our memory of a session is dominated by the outcome and the emotional peak (usually the worst moment). The actual quality of the collaboration — how well we communicated, how efficiently we navigated friction — gets lost in the narrative.

I don't think data should replace how sessions feel. But it's a useful corrective. Some sessions that feel bad are actually fine. Some that feel productive are hiding significant waste.

Finding 5: Weekly patterns told a story my daily experience didn't

When I zoomed out from individual sessions to weekly aggregates using Code Insights' Reflect feature, a different pattern emerged.

My friction rate wasn't constant. It spiked during late-night sessions and during context switches between projects. When I jumped from one codebase into another without resetting my mental model, stale assumptions crept in — I'd give the AI instructions based on patterns from the other project. And sessions I started after 10pm had noticeably worse scope management. Not because I was less skilled at night, but because I was less disciplined — more likely to say "while we're here, also fix..." instead of keeping the session focused.

My best prompt quality scores clustered in focused blocks where I was working on a single project for an extended stretch. This isn't profound — sustained context produces better prompting. But I wouldn't have been able to tell you that from intuition alone. I would have said my work quality was roughly constant throughout the day. It's not.

The actionable version: I started being more deliberate about project boundaries. When switching between codebases, I take 30 seconds to re-orient — stating the project's architecture and current task in the first message instead of assuming context carries over from whatever I was working on before. And I stopped starting ambitious feature builds at 11pm. Late-night sessions are fine for exploration and quick fixes. They're not great for complex work that needs prompting discipline.

What changed after seeing all this

Three specific changes, in order of impact:

1. I started treating quick tasks seriously. Not with elaborate planning, but with the same prompting specificity I'd give a feature build. File path, specific error, expected behavior. This alone eliminated more wasted turns than anything else.

2. I front-load context instead of letting the AI discover it. When I open a session, I spend 30 seconds describing the relevant architecture. "This project uses a Hono server with SQLite. The route I'm modifying is in server/src/routes/facets.ts. It queries the sessions table and joins with facets." This replaces 3-5 file reads the AI would otherwise need to do.

3. I write my hypothesis before debugging. In bug hunt sessions, I type my theory into the chat before asking the AI to investigate. "I think the issue is that the migration added the column but the INSERT statement in sync.ts wasn't updated to include it." This turns a scattered investigation into a directed one. Even when my hypothesis is wrong, stating it explicitly means we eliminate it faster.

None of these are revolutionary. They're small adjustments to how I communicate with the AI. But across 500+ sessions, the difference between a 58 prompt quality score and a 78 is dozens of hours of wasted turns.

The meta-lesson

The real takeaway from analyzing 500 sessions isn't any specific finding. It's that session data exists and almost nobody looks at it.

Every AI coding session generates structured, analyzable data — message counts, tool calls, file touches, timestamps, conversation flows. This data sits in log files on your machine. Most developers accumulate hundreds of sessions without ever reviewing them in aggregate.

I built Code Insights to make that analysis automatic — parsing sessions from Claude Code, Cursor, Copilot, and Codex CLI into a unified view with LLM-powered friction detection, pattern recognition, and prompt quality scoring. But even without a tool, the raw data is there. You could grep through your Claude Code JSONL files or Cursor's SQLite database and start noticing things.
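
Even a crude first pass over that raw data is a few lines of code. As a sketch: session transcripts are JSONL, one event per line, and counting events per type gives a rough session shape. The field name ("type") and file layout here are illustrative — log schemas vary by tool and version:

```typescript
import { readFileSync } from "node:fs";

// Count log entries per event type in a JSONL session transcript.
// Assumes one JSON object per line with a "type" field — adjust for
// your tool's actual schema.
function countByType(jsonlPath: string): Map<string, number> {
  const counts = new Map<string, number>();
  const lines = readFileSync(jsonlPath, "utf8").split("\n").filter(Boolean);
  for (const line of lines) {
    let entry: { type?: string };
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip malformed lines rather than crashing mid-file
    }
    const key = entry.type ?? "unknown";
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

Run that over a directory of transcripts and you already have per-session message counts — the raw material for every aggregate in this post.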

The developers who will get the most out of AI coding tools over the next few years won't necessarily be the ones who use the fanciest models or the most expensive tools. They'll be the ones who pay attention to how they work — who notice their patterns, identify their friction, and make small adjustments that compound across hundreds of sessions.

The data is already there. The question, as always, is whether you look at it.