LLM Judge Bias Map

A faceted view of where a stock LLM-as-judge agrees and disagrees with human labels on forty customer-support responses. The standard 2×2 confusion matrix says the judge is fine. The faceted version says something more specific: the disagreements cluster on the response types where the bot sounded helpful — workarounds, escalations, redirects — and the judge bought every one. The bias is not noise. It has shape.

Tags: data-viz, d3, evaluation, machine-learning, interactive

What’s this?

This is a visualization of one of my first homework assignments. Forty customer-support transcripts, each labeled by a human and by a stock LLM-as-judge. The standard 2×2 (TP=20, FP=6, FN=3, TN=11) puts the judge at 77.5% accuracy and stops there. This chart is the same data laid out by response type — nine rows, four confusion-matrix cells across — sorted so the rows the judge handles worst rise to the top. The counts come from the first evaluation run in my homework’s answer notebook, cross-referenced sample-by-sample with the human labels; it’s a frozen snapshot, not a live read.
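As a quick check on the arithmetic, the global figures can be recomputed from the four counts quoted above. This is only a minimal sketch: the variable names are my own, and the counts are the ones reported in the text, not read from the notebook.

```python
# Counts copied from the 2x2 reported in the text (frozen snapshot).
TP, FP, FN, TN = 20, 6, 3, 11

total = TP + FP + FN + TN          # every transcript lands in exactly one cell
accuracy = (TP + TN) / total       # share of transcripts where judge and human agree
disagreements = FP + FN            # transcripts where judge and human differ

print(f"n={total}  accuracy={accuracy:.1%}  disagreements={disagreements}")
```

The single accuracy number is all the global 2×2 offers; it says nothing about which response types the disagreements come from.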

The nine disagreements aren’t scattered. Every false positive is a workaround, an escalation, or a redirect — cases where the bot did the right thing but didn’t sound like it. Every false negative is a partial fix or a confidently-wrong one — cases where the bot sounded like it had solved the problem and hadn’t. The judge is rewarding response confidence rather than resolution correctness. The global 2×2 hides this because it averages across response types that behave nothing alike; the faceted version keeps them apart, and the cluster becomes the diagnosis.
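The faceting step itself can be sketched roughly as follows. Everything here is an assumption for illustration: the record layout `(response_type, human, judge)`, the boolean labels (with `True` standing for whatever the assignment treats as the positive class), the function names, and the choice to sort rows by disagreement rate. The toy records are invented and are not the forty real transcripts.

```python
from collections import Counter, defaultdict


def cell(human: bool, judge: bool) -> str:
    """Name the confusion-matrix cell, treating the human label as ground truth."""
    if judge and human:
        return "TP"
    if judge and not human:
        return "FP"
    if not judge and human:
        return "FN"
    return "TN"


def facet(records):
    """Group records by response type and return rows sorted worst-first."""
    rows = defaultdict(Counter)
    for response_type, human, judge in records:
        rows[response_type][cell(human, judge)] += 1

    def disagreement_rate(item):
        counts = item[1]
        return (counts["FP"] + counts["FN"]) / sum(counts.values())

    # Assumed ordering: highest share of judge/human disagreement first.
    return sorted(rows.items(), key=disagreement_rate, reverse=True)


# Hypothetical toy records (response_type, human_label, judge_label).
toy = [
    ("workaround", False, True),
    ("workaround", False, True),
    ("partial_fix", True, False),
    ("partial_fix", True, True),
    ("other", True, True),
    ("other", True, True),
    ("other", True, True),
    ("other", False, False),
]

for response_type, counts in facet(toy):
    print(response_type, [counts[c] for c in ("TP", "FP", "FN", "TN")])
```

Summing the per-type counters column by column gives back the global 2×2, so the faceted table carries strictly more information than the matrix it replaces.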