Hamilton Ridley Consulting

Field Notes · Vol. 07

The Sycophancy Tax
For New Builders· 2026

A short essay

The Sycophancy Tax: when a free LLM review isn't enough.

Daniel Kemp·Published May 14, 2026·11 min read

Every prospective HRC customer eventually asks the same question: couldn't I just rerun this on Claude or ChatGPT for free? It's honest. It deserves an honest answer. This is that answer — an accounting of what free LLM reviews deliver, what they structurally can't, and the historical pattern that decides where the line is drawn.

§ 01

The question every paying customer asks

You're shipping a product. You want someone to look at your code and tell you whether it's actually ready. There are three places you could go:

Paste it into ChatGPT or Claude and ask “review this for production-readiness.” Free, fast, available right now.
Run one of the AI-native code-scan tools (Sourcegraph Cody, GitHub Copilot review, whatever your IDE ships). Also free or cheap.
Pay someone like HRC to do a calibrated review with a real engineer's sign-off. Anywhere from $0 to $4,000 depending on tier.

The case for option 1 or 2 is obvious. The case for option 3 requires explanation. So here it is.

This essay is going to say what HRC reviews give you that a free LLM scan can't. But I want to start by saying what they do give you, because most defensibility writing skips the honest part. Free LLM code reviews are a real improvement over no review at all. They catch obvious things. They're a great first read on code you haven't looked at in weeks. For learning, exploration, side projects, and quick sanity checks, they're terrific. I use them daily.

The argument here isn't that LLMs can't review code. It's that their training incentives, their commercial logic, and their operating structure all push them toward a specific shape of output — and that shape isn't what you need when real customers, real money, or real liability is on the line.

§ 02

Sycophancy isn't a bug. It's the default.

LLMs are trained with reinforcement learning from human feedback (RLHF). Annotators rate responses; the model updates toward outputs that rate well. The empirical result, documented in Anthropic's own research on the subject, is that models drift toward agreement— toward outputs that validate the user's framing rather than challenge it.

This isn't a flaw to be patched. It's the natural equilibrium of the training objective. When you ask a model “is my code good?” it has a strong incentive to answer “here's what's working well, here are some things you might consider” — because that shape of answer rated well across millions of training examples. The alternative answer — “no, your auth flow is broken, fix it before you take payments” — rated worse on average. So you get the first shape, even when the second is correct.

Try it. Ask any frontier model to review code that has a real, high-severity issue. You will almost never get a sentence that starts with “this is broken.” You'll get “you might want to consider,” “a possible improvement would be,” “depending on your use case this could be a concern.” The truth is in there, but it's wrapped in three layers of hedging.

The asymmetry

For a builder asking “is this safe to ship?”, hedged-truth is worse than no-answer. Hedging signals that the model is unsure, but it's often certain — just incentivized not to sound that way. You end up shipping the code with a feeling of vague unease, which is the worst place to be: too informed to be confident, not informed enough to act.

A calibrated reviewer's job is to break the hedging. This is broken. Here's where. Here's what fixing it costs. Here's what shipping without the fix costs. HRC's tone rules are written to forbid the hedging shapes by name — no “we found,” no “you might want to consider,” no “depending on your use case.” The reviewer either knows or they don't. If they don't know, that's an Open Question, not a Finding.

§ 03

The model is the product. The opinion isn't.

When OpenAI, Anthropic, or Google ships a free code-review feature, their commercial logic is: maximize model usage. That's what their valuation is built on, what their next funding round is priced against, and what their org chart is structured around. A code-review tool from a hyperscaler is an advertisement for the underlying model. The tool exists to make the model look useful.

That commercial logic forces the tool to be:

Generic. It has to work for embedded C, Rails monoliths, Java enterprise, and AI-era TypeScript SaaS. So it can't be opinionated about any of them.
Hedged. A Fortune 500 customer using the tool is one bad call away from a lawsuit. The legal team will make sure every output has disclaimers heavy enough that the tool can never be wrong in a way that creates liability.
Self-promoting. The goal is making the model look good. Outputs that say “the model isn't the right tool for this” never ship.
Brand-safe. No PR team at a model lab is going to let their free product say “your auth flow is busted, fix it before you ship.” That sentence is a meme waiting to happen.

None of this is wrong on the lab's end. It's the structurally correct shape for their product. A free code review from them has to be generic-everything-for- everyone, disclaimer-soup, and brand-safe-bland — because otherwise it stops serving their actual commercial purpose.

HRC's product is not the model. The model is plumbing. HRC's product is the opinion: vertical for AI-era builders shipping production SaaS, calibrated against a fixed rubric, written in a specific honest-but-readable voice, finalized by a real engineer at the paid tiers. None of those properties can exist inside a hyperscaler's commercial structure.

§ 04

What “calibration” actually means

The word gets thrown around. Here's what calibration literally means in an HRC review:

Calibration, defined

A B+ scored today means the same thing a B+ meant six months ago, on a different codebase, in a different language. A partner who upgrades from C to B+ can prove they shipped real improvement — not just because the score went up, but because the rubric stayed the same.

This sounds simple. It's not. To achieve real calibration, you need:

A fixed rubric. Same 10 categories, same weights, every review. (Ours is published in full at Vol 06; the weights haven't changed since v1.)
Score-cap anchors. Explicit rules like “.env tracked in repo → Security ≤ 50” or “no tests on critical flows → Reliability ≤ 60.” Without these, a B+ drifts to a B+ on sentiment instead of evidence. (Ours are in the methodology doc on the private skill repo.)
Deployment-context calibration. A missing rate limit is P3 on a personal project, P1 on a customer-facing product, P0 on a regulated context. Same finding, different stakes. Context-aware severity is the part most LLM-only tools skip entirely.
A refund-on-false-positive feedback loop. Every adjudicated dispute becomes labeled calibration data. That data tightens the rubric, sharpens the tone rules, and eventually becomes a proprietary dataset no one else has.
Version anchoring. Every review names the framework version it was scored under (v2.0.1 as of this writing). If we ship v3.0 with different weights, you can still show your v2 badge with integrity, because it's tied to the version it was calibrated under.

A hyperscaler tool can publish a 10-category rubric in a blog post. They will not publish 50+ pages of adjudication procedure, because that requires actually doing reviews and refining the procedure across hundreds of them. They'll never have version-anchored badges, because their model updates weekly and breaks score comparability every time. They can't take a stand on context-aware severity, because the legal exposure of saying “your regulated-data app has a P0” is too high for a free product.

§ 05

The pattern is historical

None of this is HRC-specific. The same pattern has played out in every layer of the software stack:

Hyperscaler offering

Specialized layer

Why both exist

GitHub Copilot

Cursor

Same underlying models, but Cursor owns the opinionated IDE around them — multi-file edit, agent mode, semantic search.

Banks (APIs)

Stripe

Banks have payment rails. Stripe wraps them in an opinionated developer experience that the banks can't ship at brand-safe scale.

Jira / Asana / Notion

Linear

The big tools serve every team. Linear takes a stand on what good software-team workflow looks like and serves only those who agree.

AWS / GCP

Vercel, Railway, Fly

Hyperscalers ship infrastructure. Specialists ship opinionated deployment experiences for specific application shapes.

Future free LLM review

HRC, the others

Hyperscalers will commoditize automated scanning. Specialists will own opinionated, calibrated, human-signed reviews. Same pattern, new layer.

The hyperscaler version is necessary — without Copilot, there's no Cursor; without AWS, there's no Vercel. The hyperscaler creates the category. The specialist owns the opinionated execution inside it. Both are real. Both have a buyer. The mistake is thinking the specialist is going to be displaced by the hyperscaler version, when the historical record says the opposite.

§ 06

The math of “free”

When a buyer asks “why pay if AI is free?”, they're framing the choice as cost vs free. That frame is wrong, because nothing about an LLM-only review is actually free. You just pay differently.

Here's the actual cost ledger of a free LLM review:

Verification time. The LLM gave you 12 findings. How many are real? You don't know. You have to read each one against your code, decide if it applies, and triage. At ~10 minutes per finding, that's 2 hours minimum. At $150/hr effective rate, you just spent $300 on verification.
Reconciliation time. Re-run the prompt with slightly different wording, you'll get a slightly different set of findings. Different model, different set again. Which set is right? Now you're comparing outputs to find the union.
Confidence cost. You shipped. Are you sure you didn't miss anything? The model said it looked fine, but it also says everything looks fine. Three weeks later you're still half-anxious about the things you meant to circle back to.
Calibration cost. Without a stable rubric, there's no score to track. You can't prove you improved between reviews. Investors and customers asking “how mature is your codebase?” get an answer that sounds like vibes.
Eventual real cost. The thing the model missed — the subtle auth bug, the race condition under load, the dependency with a known CVE that the static check didn't catch — eventually surfaces. The cost shows up not as a line item but as an incident, a refund, a churned customer, or a 2 a.m. page.

HRC's pricing — $0 for Quick Score, $300 for Findings Reveal, $2,000 for Standard, $4,000 for Comprehensive, $599/mo for Retainer — isn't the cost of the review. It's the cost of not paying the items above. Calibrated confidence is the actual product.

The honest comparison

Two hours of your time at a $150/hr effective rate is $300 — the exact same price as a Findings Reveal. Except after those two hours you still have: worse calibration, no refund guarantee, no commit anchor, no version-stable score, no engineer's sign-off, and no one to argue with if you disagree. The $300 version isn't the splurge. It's the cheaper option once you count your own time honestly.

§ 07

When free is right, when paid is right

I want to close honestly. Here's the actual decision matrix:

Free LLM review is the right tool when: you're learning. You're exploring. The code is a hobby project or a tutorial fork. You want a quick first read on something you haven't opened in a month. The stakes if you miss something are bounded. Nobody is paying you for this code yet. Use the free tool every time. We do too.

HRC's free Quick Score is the right tool when: you want a calibrated read with deterministic pre-filter signals on secrets and CVEs, framework-version anchoring, and a badge you can embed. Still free. Still 5 minutes. Better calibrated than a generic LLM prompt because it's tied to a stable rubric and runs deterministic checks before the LLM does its part. Run one.

Paid HRC review is the right tool when: real customers are using your code. You're about to fundraise and the codebase will be in due-diligence scope. You're taking payments. You're handling regulated data. You shipped something fast and now want to know if it's safe to scale. You're selling to enterprise customers who will run their own security review and you want to know what they'll find first. The calibrated, human-finalized, commit-anchored version is what you actually want at that point. See Standard ($2,000) or Comprehensive ($4,000).

Don't pay for what you don't need. Don't skip what you do. The hyperscalers will eventually ship a free code review tool that's pretty good. HRC will still be here, with a different and complementary product, in the same way Cursor is still here after GitHub Copilot, in the same way Stripe is still here after every bank shipped an API. The specialized layer doesn't go away. It gets sharper.

§ 08

Try it — or don't

Run the free Quick Score on a repo of yours. You'll get a calibrated letter grade, a deterministic pre-filter pass, and an embeddable badge in about 5 minutes. If the score reads true, the methodology is real. If it doesn't, email me and tell me what was wrong — that's how the calibration improves.

Run a free Quick Score →Read the productized version

Field Notes Subscribers

Future essays in your inbox the day they ship.

Vol. 06 published our calibration rubric. Vol. 07 makes the structural case for opinionated reviewers. The next Field Notes will go deeper on hybrid deterministic + LLM review patterns and what's been learned across HRC paid reviews. One email per release. Unsubscribe in one click.