What’s this?

This is a visualization of one of my first homework assignments: forty customer-support transcripts, each labeled by a human and by a stock LLM-as-judge. The standard 2×2 (TP=20, FP=6, FN=3, TN=11) puts the judge at 77.5% accuracy (31 of 40 in agreement) and stops there. This chart is the same data laid out by response type (nine rows, four confusion-matrix cells across), sorted so the rows the judge handles worst rise to the top. The counts come from the first evaluation run in my homework’s answer notebook, cross-referenced sample-by-sample with the human labels; it’s a frozen snapshot, not a live read.
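For anyone who wants to check the headline number, here is a minimal sketch that recomputes it from the four counts quoted above. Precision and recall are added only for context; they are not figures from the homework, and they apply to whichever class the labeling scheme treats as positive.

```python
# Counts quoted in the text for the global 2x2.
tp, fp, fn, tn = 20, 6, 3, 11

total = tp + fp + fn + tn
accuracy = (tp + tn) / total      # share of transcripts where judge and human agree
precision = tp / (tp + fp)        # of the judge's positives, share the human confirmed
recall = tp / (tp + fn)           # of the human's positives, share the judge caught
disagreements = fp + fn

print(f"n={total} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} recall={recall:.3f} "
      f"disagreements={disagreements}")
# n=40 accuracy=0.775 precision=0.769 recall=0.870 disagreements=9
```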

The nine disagreements aren’t scattered. Every false positive is a workaround, an escalation, or a redirect: cases where the bot did the right thing but didn’t sound like it. Every false negative is a partial fix or a confidently-wrong one: cases where the bot sounded like it had solved the problem and hadn’t. The judge is rewarding response confidence rather than resolution correctness. The global 2×2 hides this because it pools response types that behave nothing alike; the faceted version keeps them apart, and the cluster becomes the diagnosis.
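The faceting itself is simple to sketch. The snippet below is not the notebook's code: the function names, the `(response_type, human, judge)` tuple layout, and the choice to sort by disagreement rate are all assumptions made for illustration, and the toy rows are made up and are not the homework data.

```python
from collections import Counter, defaultdict

def cell(human: bool, judge: bool) -> str:
    """Name the confusion-matrix cell, treating the human label as ground truth."""
    if judge:
        return "TP" if human else "FP"
    return "FN" if human else "TN"

def facet_confusion(samples):
    """Return [(response_type, Counter of cells)], highest disagreement rate first."""
    by_type = defaultdict(Counter)
    for response_type, human, judge in samples:
        by_type[response_type][cell(human, judge)] += 1

    def disagreement_rate(counts):
        # Assumed sort key: share of this type's samples where judge and human differ.
        return (counts["FP"] + counts["FN"]) / sum(counts.values())

    return sorted(by_type.items(),
                  key=lambda kv: disagreement_rate(kv[1]),
                  reverse=True)

# Hypothetical toy rows (response_type, human_label, judge_label); NOT the real data.
toy = [
    ("type_a", True, True),
    ("type_a", True, True),
    ("type_b", False, True),
    ("type_b", False, False),
    ("type_c", True, False),
]

for response_type, counts in facet_confusion(toy):
    print(response_type, {c: counts[c] for c in ("TP", "FP", "FN", "TN")})
```

Summing the per-type counters gives back the global 2×2, which is why the global table cannot show the clustering: the response-type column has already been collapsed by the time the four cells are counted.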