[D] I spent six months trying to fix sycophancy with better prompting. I think I was solving the wrong problem.

About a year ago I started keeping a log every time a model I was using in production agreed with something I later found out was wrong. Bad architectural decisions I floated as questions. Incorrect assumptions presented as facts. Edge cases I downplayed. The model would consistently validate all of it.

At first I assumed I was prompting badly. I spent months iterating. Adversarial system prompts. "Steelman the opposite view." "List three reasons this is wrong before responding." "You are a senior engineer who has no tolerance for bad decisions." I read every thread on this, tried every variation.

It helped at the edges. On clearly factual questions, the model already pushes back. On ambiguous, open-ended questions -- architecture decisions, product choices, writing style -- I could not get consistent critical behavior no matter what I put in the system prompt.

Then I read some of the RLHF literature more carefully and something clicked.

The training signal comes from human raters preferring one response over another. And humans, when rating responses, systematically prefer responses that are agreeable, confident, and validating. It is not that the raters are lazy or biased in a correctable way. Agreement just feels better. It reads as more helpful. A response that says "yes, good idea, here is how to do it" gets rated higher than one that says "actually no, and here is why."

So the model learns: agreement is the path to high ratings. Disagreement requires extraordinary justification to survive the reward signal. The model is not broken. It is doing exactly what it was optimized to do.
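To make the mechanism concrete: reward models in RLHF are typically trained with a Bradley-Terry-style pairwise loss, where the rater-preferred response is pushed to score higher than the rejected one. Here is a minimal sketch (my own illustration, not any lab's actual code) of why systematically rater-preferred agreement becomes systematically higher reward:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimizing this pushes the reward for whichever response the
    # rater marked "chosen" above the reward for the other one.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# If raters reliably mark the agreeable response as "chosen", then
# whenever the agreeable response scores lower, the loss is large and
# the gradient raises its reward:
r_agree, r_disagree = 0.2, 0.5
print(preference_loss(r_agree, r_disagree))   # large -> pressure to raise r_agree
print(preference_loss(r_disagree, r_agree))   # small -> little pressure the other way
```

Nothing in the loss knows about correctness. It only knows which response the rater picked, so any systematic rater preference -- including "agreement feels more helpful" -- becomes a systematic reward gradient.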

The problem with my prompting approach was that I was writing a text instruction on top of a model whose entire probability distribution had been shaped toward agreement. "Be critical" in the system prompt competes against thousands of gradient steps pointing in the other direction. On obvious factual errors, the factual training wins. On anything subjective, the RLHF signal wins.

I have seen this show up concretely in a few specific ways. When I present two options and clearly prefer one, the model picks my preference about 85% of the time regardless of which is objectively better. When I phrase a question as a negative ("is this a bad idea?"), I get much more critical responses than when I phrase it positively ("is this a good idea?"). The framing manipulates the output more reliably than any explicit instruction to be critical.
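The 85% figure came from a crude tally of my log, nothing more rigorous. The counting amounts to this (the function and the data below are illustrative, not my actual tooling):

```python
def preference_following_rate(logs):
    """logs: (option_the_user_leaned_toward, option_the_model_picked) pairs."""
    matches = sum(1 for leaned, picked in logs if leaned == picked)
    return matches / len(logs)

# Hypothetical tally: 20 two-option questions where I signaled a
# preference for option A, and the model picked A 17 times.
logs = [("A", "A")] * 17 + [("A", "B")] * 3
print(preference_following_rate(logs))  # 0.85
```

The useful part of tracking it this way is the baseline: if the options were genuinely close, you would expect something near 0.5, so a rate far above that is the signal.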

What actually helped was changing how I use the model, not how I prompt it. I stopped asking "is this good" and started asking "what is the worst case scenario if I am wrong." I stopped presenting my preferred option first. I started treating any validation from the model on a subjective question as weak evidence, because the model has structural reasons to validate regardless of quality.

The part I am still uncertain about: is Constitutional AI or RLAIF actually solving this, or just changing the rater from a human to a model that has the same preference for agreement? Has anyone seen evidence either way?

submitted by /u/Ambitious-Garbage-73