AI evals are becoming increasingly necessary and common, but improper benchmark design will fail to reveal how the system will behave in production while giving you a false sense of stability. Below is the specific failure I encountered while building one such benchmark, and the steps I took to mitigate it.
In the past few months I ran 220,000 evaluations on LLM-based phishing detection across 11 frontier models and 10 different system prompt configurations. One major thing I learned in this process actually has nothing to do with phishing: AI evals can seem “clean” but might fail to reveal how the system will behave in production. Namely, optimizing for in-sample patterns might backfire more than expected on out-of-sample patterns.
In classical machine learning, the observation would be unremarkable. We train a model on our training data, we validate and test it on separate held-out datasets, and we know that production performance will typically be at best equal to test performance and more often degraded — especially on inputs that drift from the training distribution.
LLMs, however, give us a somewhat false sense of security through the illusion of thinking, and that's where trouble starts. We don't train the language models ourselves; we use a frontier system that has seen vastly more of the world than any benchmark we could assemble. We tune behavior through prompts rather than weights, and we expect the model to reason through inputs the prompt engineer never specifically anticipated. The same problem can appear in fine-tuned models (even though there the weights are, in fact, tuned). Coverage starts to feel like someone else's problem — the model supplier's.
But the problem is ours. The decision rules we engineer into our prompts might be narrow, specific, and overfit to the benchmark we are using. If our benchmark is missing or too light on a category of samples that are expected to be seen in production data, the prompt is silently unprepared for it, and the reasoning capacity of the model may not cover this gap. Overfitting in building LLM-based systems is not the model overfitting to training data. It is the prompt overfitting to the shape of the eval.
To give a concrete example, in my research I gathered samples and then began prompt engineering to achieve the best performance. The challenge is fairly standard: identify phishing samples correctly, don’t block legitimate URLs. A classical recall / false-positive-rate tradeoff.
The best configuration I came across caught 93.7% of phishing emails at a 3.8% false-positive rate; not that bad given the high complexity of the benchmark.
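For reference, here is how those two headline metrics fall out of raw eval outputs. This is a minimal sketch; the record fields (`label`, `pred`, with 1 = phishing) are hypothetical stand-ins for whatever your harness emits, and it assumes both classes are present in the data.

```python
# Minimal sketch: recall and false-positive rate from raw eval outputs.
# Fields are hypothetical stand-ins: label and pred, with 1 = phishing.
def recall_and_fpr(rows: list[dict]) -> tuple[float, float]:
    tp = sum(r["label"] == 1 and r["pred"] == 1 for r in rows)
    fn = sum(r["label"] == 1 and r["pred"] == 0 for r in rows)
    fp = sum(r["label"] == 0 and r["pred"] == 1 for r in rows)
    tn = sum(r["label"] == 0 and r["pred"] == 0 for r in rows)
    recall = tp / (tp + fn)  # share of phishing caught
    fpr = fp / (fp + tn)     # share of legitimate mail wrongly blocked
    return recall, fpr
```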
Thankfully, before accepting this configuration as the best path forward, I decided to dive deeper into the cases where it failed. It turned out that on a subset of attacks where the adversary controlled the specific signal the prompt was relying on (call it infrastructure phishing), the same configuration caught only 30.1%, a total collapse. The attack was cheap and mechanical, and the subset was only 74 samples out of 1,000. Averaged into the overall number, the collapse moved the aggregate by about three percentage points. On the dedicated subset, it moved it by sixty-three. Same underlying behavior; the benchmark happened to include a slice that made it visible.
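That slice-level view is cheap to automate. A minimal sketch, assuming each record additionally carries a hypothetical `category` tag for the attack type:

```python
from collections import defaultdict

# Minimal sketch: per-category recall alongside the aggregate.
# Assumes each record carries a hypothetical `category` tag.
def recall_by_category(rows: list[dict]) -> dict[str, float]:
    caught, total = defaultdict(int), defaultdict(int)
    for r in rows:
        if r["label"] == 1:  # recall is computed over phishing samples only
            total[r["category"]] += 1
            caught[r["category"]] += r["pred"]
    return {c: caught[c] / total[c] for c in total}
```

A 74-sample slice simply cannot move a 1,000-sample aggregate by much, which is exactly why it has to be reported on its own row.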
The real sharpener came one step later. I went back and ran the general-purpose prompt configurations — the generic ones along the lines of “be cautious, watch for phishing indicators” — against the same subset. On the overall benchmark they were weak performers. On the infrastructure subset, they substantially outperformed my winning configuration.

Figure: recall for the winning prompt vs. a generic baseline on the commodity-phishing and infrastructure-phishing subsets. The winning prompt performs better overall but collapses on the infrastructure subset, where the generic baseline performs better. Source: created by author.
That changes the interpretation. My winning prompt had introduced a failure surface that was not there in the less-optimized configurations. Optimizing the prompt against the benchmark distribution had made the system worse in a category the benchmark underrepresented. The blind spot was produced by the prompt optimization, not inherited from the model. The structural mechanism behind this — what I call the Instruction Specificity Paradox — is the subject of a later piece in the series; here the relevant point is methodological, not structural.
A helpful framework
Machine learning offers a classical train-validate-test framework. It exists for a good reason.
Prompt engineering is not the same discipline, but the framework holds more of its shape than the “just iterate on the benchmark” instinct assumes. Here is what I would now use, with the parts I am still working through flagged honestly.
- Start with adaptive benchmark design. Before any prompt engineering, build the benchmark to actually reflect production. Sample real data where you can. Mimic the distributions your system will face. Identify the key categories and size each one large enough that per-category failures are statistically visible (a rough sizing sketch follows this list). Think about adversarial variants before production finds them for you. Do the homework up front.
- Iterate prompts on this benchmark. Treat it as a training set. Prompt-engineer, measure, revise, repeat. Be aware that whatever wins here has been selected against this specific dataset, and “selected against” is another way of saying “overfit to.” The winning prompt is not inherently good. It is the best prompt you tested on this data.
- Validate on a second, held-out dataset. Take the winning config and run it against a dataset you have not iterated on. This is the honest check. If performance drops sharply, you have overfit the prompt to the dev benchmark — go back to step one or two and reconsider. In classical ML, a validation set is used during development to make selection and tuning decisions. In prompt engineering, its more critical job is detecting whether the winning prompt has overfit the benchmark you iterated against. Re-evaluating on that same benchmark will not catch this, since the prompt was chosen precisely because it performs well there (see the protocol sketch after this list).
- Test on a third dataset, only once, at the end. This is your production-performance estimate — what you’re after is an honest read on how the system will perform once shipped, conditional on the dataset itself reflecting production distributions in the first place. It is tempting to collapse validation and test into a single held-out evaluation, especially when good data is expensive — but a held-out set you have touched multiple times during development is no longer held out.
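On the sizing point in step one, the normal approximation to the binomial gives a quick rule of thumb: the 95% confidence interval on a per-category recall estimate has half-width of roughly 1.96 * sqrt(p(1-p)/n). A minimal sketch:

```python
import math

# Rough sizing rule: 95% CI half-width on a per-category recall estimate,
# via the normal approximation to the binomial.
def recall_ci_halfwidth(p: float, n: int) -> float:
    return 1.96 * math.sqrt(p * (1 - p) / n)

# e.g. a 74-sample category measured at p = 0.30 gives about +/-0.10:
# wide, but easily enough to tell a collapse apart from 0.94.
print(round(recall_ci_halfwidth(0.30, 74), 3))  # -> 0.104
```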
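And for steps two through four, the whole discipline reduces to an access pattern: iterate freely on dev, touch validation once to check for overfitting, touch test once for the final estimate. A minimal sketch, where `evaluate` is a caller-supplied function (prompt, dataset) -> metric and all names are hypothetical:

```python
# Minimal sketch of the three-dataset discipline. `evaluate` is caller-
# supplied, (prompt, dataset) -> metric; all names here are hypothetical.
def select_prompt(candidates, evaluate, dev_set, val_set, test_set,
                  tolerance=0.05):
    # Step 2: iterate freely on dev -- the only set you may touch repeatedly.
    best = max(candidates, key=lambda p: evaluate(p, dev_set))

    # Step 3: validate once. A sharp drop means the prompt is fit to the
    # dev set's shape rather than the task -- go back and revise.
    if evaluate(best, dev_set) - evaluate(best, val_set) > tolerance:
        raise RuntimeError("winning prompt overfits the dev benchmark")

    # Step 4: touch the test set exactly once, at the end.
    return best, evaluate(best, test_set)  # production-performance estimate
```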
This framework comes with crucial disclaimers:
- Data scarcity: most AI teams cannot realistically get to three high-quality datasets. Good evaluation data — real, representative, labeled, contamination-free — is expensive. This is a harder problem than the “better prompts” problem that the community spends most of its attention on. Arguing for an evaluation-data budget could be more consequential than arguing for access to a better model.
- The framework does not dissolve the eval trap. If all three datasets share the same blind spot — the same underrepresentation of whichever production category you didn't carve out — you get clean numbers across dev, validation, and test, and you still ship a broken system into production. The structural work is being done by the adaptive benchmark design step. The three-dataset split is hygiene on top of that. Good hygiene helps, but it is not enough on its own.
- Error analysis is not optional. This is the reason I caught the collapse in my own work at all. When you look at results, do not just read the aggregate metric. Review errors manually, at least 20–30 if possible. For each one, ask: is this a genuine edge case any detector would struggle with, or has my prompt introduced a failure mode that was not there before? Edge cases are tolerable up to a ceiling. A prompt-induced failure surface is a regression you created, and you can remove it. The only reason I caught the infrastructure-phishing collapse before treating the winning configuration as done was that I read through the errors and noticed my “best” prompt was confidently classifying as legitimate the same messages the baseline was correctly flagging. That pattern — the better prompt being worse where it mattered — is exactly what error analysis exists to reveal. Analysis-paralysis is a valid concern here, but even a few iterations could help prevent a collapse in production, or help you better define the use-case.
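A cheap way to operationalize that last question is to diff the optimized prompt against a baseline and read only the regressions, i.e., the cases the baseline got right and the "winner" got wrong; this is the comparison that surfaced my own collapse. A minimal sketch, with hypothetical field names:

```python
import random

# Minimal sketch: surface prompt-induced regressions for manual review.
# Hypothetical fields: label (1 = phishing), pred_winner, pred_baseline.
def regressions(rows: list[dict], sample_size: int = 30) -> list[dict]:
    # Cases the baseline classifies correctly but the "winning" prompt
    # misses: failure modes the optimization introduced, not inherited.
    regressed = [r for r in rows
                 if r["pred_baseline"] == r["label"] != r["pred_winner"]]
    random.shuffle(regressed)
    return regressed[:sample_size]  # read these by hand, 20-30 at a time
```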
The frame I keep coming back to: the benchmark is part of the product. It decides which of your failures surface before users see them and which surface after. Designed at that level of seriousness, it makes the AI product more robust than the model weights alone would predict. Defaulted, it does not.
This is the opening piece in a series deconstructing the PhishNChips benchmark — a 220,000-decision evaluation of LLM email-agent security. The remaining articles discuss evaluating deployability, the Instruction Specificity Paradox, and how the same prompt can trigger inverted results across different models.
If you work on AI products, eval design, or trust & safety platforms, I would welcome a conversation. The phishing detector is the domain; the eval-design pattern above is the general result.
Ron Litvak is an ML/AI Product Manager at Payoneer and a Columbia Business School MBA. Prior work: trust & safety roles at Forter and PayPal.