Your Intent Classifier is Solving the Wrong Problem

Why response architecture matters more than classification in regulated LLM apps

A user types “how does a backdoor Roth conversion work” into your wealth copilot. You do not know whether they want to understand the mechanism, whether they are about to execute one themselves, or whether they are helping a parent decide, and no classifier will tell you. Your response has to work across all three. Getting it wrong in the second case crosses a regulatory line that does not exist in the first.

The dominant response to this problem in LLM applications is intent classification. Detect whether the query is general or personal, route to the right template, trust the classifier. It is the approach most teams default to because it feels like the right abstraction. The model should understand what the user wants and act accordingly.
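In code, the default pattern usually reduces to something like the sketch below. The function names, the label set, and the keyword heuristic are illustrative stand-ins rather than anyone's production pipeline; what matters is where the guess sits in the flow.

```python
# A minimal sketch of the classification-first pattern this article argues
# against. classify_intent, the labels, and TEMPLATES are hypothetical names,
# and the keyword heuristic stands in for a real model call.

from typing import Literal

Intent = Literal["general_education", "personal_advice"]

TEMPLATES: dict[Intent, str] = {
    "general_education": "Explain the mechanism. Do not personalize.",
    "personal_advice": "Personalize carefully, add disclaimers, offer an advisor handoff.",
}

def classify_intent(query: str) -> Intent:
    # Stand-in for an LLM or fine-tuned classifier. Note the failure mode:
    # "what are the side effects of metformin" carries no first-person cue,
    # so it lands in the general bucket even when the concern is personal.
    personal_cues = ("my ", " i ", "i'm", "should i")
    q = f" {query.lower()} "
    if any(cue in q for cue in personal_cues):
        return "personal_advice"
    return "general_education"

def route(query: str) -> str:
    # The entire compliance posture hangs on this one guess.
    return TEMPLATES[classify_intent(query)]

print(route("How does a backdoor Roth conversion work?"))  # -> general bucket
```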

A recent Microsoft paper on Copilot health queries makes the case that classification alone will not work, and the data is uncomfortable. Costa-Gomes et al. analyzed over 500,000 health conversations and found that 40.7% of them landed in a “Health Information and Education” bucket, which was the general-information category. When the researchers looked inside that bucket, the topic clusters were dominated by queries about specific medications and specific conditions, not abstract health knowledge. The authors’ own conclusion: “the reported share of personal health intents may represent a lower bound.”

The paper names the structural problem directly. “Conversations do not always contain sufficient context to determine whether a generally framed query (for example, ‘what are the side effects of metformin’) reflects casual curiosity or a user’s own medication concern.” The classifier defaults to the safer, less-specific label when context is missing. No better classifier changes this. Users phrase questions ambiguously as a rule, and the ambiguity is a property of natural language, not of the model’s accuracy.

Why classification-first fails in regulated domains

In health, a misclassified query produces a bad user experience. In wealth management, legal tech, tax software, or insurance, it produces a compliance incident. Unsolicited personalized investment advice has securities-law implications that general education does not. The line between “here is how Roth conversions work” and “you should do one” is a different object under the rules.

Teams building in these spaces tend to respond to the risk in one of two ways.

The first is aggressive classification. Detect personal intent, route to a different flow, add disclaimers, sometimes refuse. The cost is that users asking purely educational questions get stonewalled, and users asking personal questions often slip through anyway because the classifier cannot see context the user did not share.

The second is blanket hedging. Every response gets a disclaimer, every answer is softened, every recommendation becomes “consult a professional.” The cost is that the product becomes useless. Users who came to the assistant because they did not have easy access to a professional get sent back to the access gap that drove them to the assistant in the first place.

Both approaches share an assumption: that the right response depends on knowing the user’s intent. The assumption is wrong.

What response architecture actually is

Response architecture is the shape of the answer, independent of its content. It is the template the response fills in, designed so that the answer works across multiple possible intents the user might have had.

The choices that make up a response architecture include:

1. whether the response leads with the direct answer or with the context needed to evaluate the answer,
2. whether the general mechanism is separated from the personal application,
3. whether the decision points where user-specific context matters are flagged explicitly,
4. whether the response offers a clear path to go deeper when the user wants to make it personal,
5. whether the response surfaces what it does not know about the user's situation, and
6. whether handoff to a human or a different tool is offered visibly.

These are template-level decisions. They are decided before the user asks anything.
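To make "template-level" concrete, here is a minimal sketch of those choices as a configuration object. The class, the field names, and the backdoor-Roth instance are illustrative assumptions, not a published schema.

```python
# A sketch of response-architecture choices as a per-topic config,
# decided before any user asks anything. Names are illustrative.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class ResponseArchitecture:
    lead_with: str                               # "answer" or "context"
    separate_mechanism_from_application: bool
    flag_personal_decision_points: bool
    offer_path_to_personalize: bool
    surface_unknowns: bool                       # say what we don't know about the user
    offer_handoff: bool                          # human advisor / other tool
    required_decision_factors: tuple[str, ...] = field(default_factory=tuple)

BACKDOOR_ROTH_TEMPLATE = ResponseArchitecture(
    lead_with="answer",
    separate_mechanism_from_application=True,
    flag_personal_decision_points=True,
    offer_path_to_personalize=True,
    surface_unknowns=True,
    offer_handoff=True,
    required_decision_factors=(
        "existing pre-tax IRA balances (pro-rata rule)",
        "current vs. expected future tax bracket",
        "state of residence",
        "time horizon",
    ),
)
```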

A concrete example. A wealth assistant gets asked:

How does a backdoor Roth conversion work?

Three approaches produce three different responses. The text below is condensed for readability; real responses would be longer.

Pure information response:

A backdoor Roth is a two-step process used when income limits prevent direct Roth IRA contributions. You contribute to a traditional IRA (non-deductible if income is too high), then convert that balance to a Roth IRA. The conversion itself is a taxable event on any pre-tax balance. Contribution limits for 2026 are $7,000 ($8,000 if 50+). The pro-rata rule aggregates all your traditional, SEP, and SIMPLE IRA balances when determining how much of the conversion is taxable.

Accurate mechanism. Useless for deciding anything. The user who was about to execute still does not know whether they should.

Pure personalized response:

Based on what I know, you should go ahead with the backdoor Roth. It is a tax-efficient way to get money into a Roth account and most high earners benefit from it. You can contribute up to $7,000 for 2026.

Confident and wrong. The assistant does not know the user’s pre-tax IRA balances (which triggers the pro-rata rule), their current tax bracket, their state of residence, or their time horizon. The recommendation happens to be popular advice, but stated this way, it crosses into personalized investment advice without any of the context that would make it correct or compliant.

Architecturally designed response:

A backdoor Roth works in two steps: contribute to a traditional IRA (non-deductible), then convert to a Roth. The conversion is taxable on any pre-tax amount.
Whether it makes sense for a specific person depends on four things:
1. Existing pre-tax IRA balances, because the pro-rata rule aggregates them into the taxable portion of the conversion. This is the factor that surprises people.
2. Current versus expected future tax bracket. The conversion is more attractive if you expect higher brackets later.
3. State of residence, because some states tax conversions differently than the federal treatment.
4. Time horizon, because the tax cost is paid now and the benefit compounds over years.
If you want to work through whether this fits your situation, tell me your pre-tax IRA balances, your marginal bracket, and your state. I can walk through the math with you or flag this for your advisor.

This response works whether the user was curious, actively deciding, or helping their parent decide. It gives the mechanism to the curious user, names the decision factors for the deciding user, and opens a path to get specific without assuming which path the user wants. The distinction between general information and personal advice is visible to the user, and the user steers.

The system prompts behind these three approaches differ in ways that are exactly what you would expect: the first instructs the model to explain mechanisms, the second to give direct recommendations, and the third to separate mechanism from application and explicitly surface decision factors. The difference is not in the model’s capability. It is in the template the response fills in.
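For illustration only, here are condensed versions of what those three instructions might look like as system prompts. These are not the prompts behind the examples above, just the shape of the difference:

```python
# Illustrative system prompts; not the actual prompts behind the examples.

INFORMATION_ONLY = (
    "You are a financial education assistant. Explain how financial "
    "mechanisms work. Do not make recommendations or reference the user's "
    "personal situation."
)

PERSONALIZED = (
    "You are a financial assistant. Give the user a direct recommendation "
    "for their situation based on whatever context is available."
)

ARCHITECTED = (
    "You are a financial assistant. First explain the general mechanism in a "
    "few sentences. Then list the specific factors that determine whether it "
    "fits an individual, and say which of those factors you do not know about "
    "this user. Do not recommend an action unless the user has supplied those "
    "factors. Close by offering to work through the user's numbers or to hand "
    "off to their advisor."
)
```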

The decision structure of the domain, not the intent of the query

Classification is about the model getting the user right. Architecture is about the response surviving when the model gets the user wrong. In regulated domains, the second one is what keeps you out of trouble.

In our work building agentic systems for wealth management, the response architecture problem shows up in every topic that matters: tax questions, portfolio rebalancing questions, estate planning questions, benefits questions. The pattern is the same in all of them. Users ask questions that underdetermine what they want. Classifiers guess. The guess is wrong often enough that it is not a reliable foundation for compliance.

What we have found works is designing response templates around the decision structure of the domain, not the intent of the query. For any given topic, there are three or four factors that determine whether the general answer fits a specific person. Surface those factors in every response. Ask for them when you need them. Do not guess at the user’s circumstances, but do not refuse to engage either.
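In code, that means the decision factors live next to the topic, not next to a guessed intent label. A sketch, with assumed topic keys and an assumed helper name:

```python
# Decision factors keyed by topic, not by guessed intent.
# Topic keys, factor wording, and render_guidance are illustrative assumptions.

DECISION_FACTORS: dict[str, list[str]] = {
    "backdoor_roth": [
        "existing pre-tax IRA balances (pro-rata rule)",
        "current vs. expected future tax bracket",
        "state of residence",
        "time horizon",
    ],
    "portfolio_rebalancing": [
        "taxable vs. tax-advantaged account location",
        "unrealized gains in the positions being sold",
        "target allocation and drift tolerance",
    ],
}

def render_guidance(topic: str, known_context: dict[str, str]) -> str:
    """Build the template-level instructions appended to every response on
    this topic, regardless of how the user phrased the query."""
    factors = DECISION_FACTORS[topic]
    missing = [f for f in factors if f not in known_context]
    lines = ["Whether this fits a specific person depends on:"]
    lines += [f"- {f}" for f in factors]
    if missing:
        lines.append("Ask for these before making anything personal: " + "; ".join(missing))
    return "\n".join(lines)

print(render_guidance("backdoor_roth", known_context={}))
```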

The Microsoft paper’s caregiver finding makes the point stronger. Costa-Gomes et al. report that “for ‘Symptom Questions and Health Concerns’, 14.5% are about a dependent,” covering children, aging parents, and partners. The same is true in wealth: adult children managing parents’ finances, spouses handling household investments, executors working through an estate. A response architecture that asks “whose situation are we working with” at the right moment handles this. A classifier that guesses the user is asking for themselves does not.

This generalizes beyond regulated domains

The principle is not specific to compliance-heavy spaces. Any LLM product where queries underdetermine intent faces the same problem. “How do I use useEffect” gets asked by learners and by developers about to ship a bug. “What is the best time to visit Japan” is a classic generic-sounding query that is actually personal. “How do I cook salmon” works for a beginner and for someone with a specific cut on the counter.

The reason to anchor the argument in regulated domains is that the failure mode is legible there. Bad response architecture in cooking produces a mediocre dinner. Bad response architecture in wealth produces a FINRA issue. Both are real, but one gets budget attention.

The implication

Teams building LLM products in regulated domains are pouring engineering effort into intent classifiers, guardrails, and routing logic. Most of that effort is solving the wrong problem. The general-versus-personal ambiguity is not a classification failure. It is a property of how users ask questions, and no amount of model improvement will make it go away.

The teams that ship well in these spaces will be the ones that stop trying to detect intent and start designing responses that do not depend on detection. They will treat the classifier as a nice-to-have, not the foundation. They will build response templates around the decision structure of the domain. And they will accept that the user, given a well-structured response, will steer better than any classifier can.

The author works on AI for wealth management at Advisor360°. We see this problem every day across the vertical LLM systems we build, which is why we have stopped treating intent classification as the foundation.

References

Costa-Gomes, B., Tolmachev, P., Taysom, E., Sounderajah, V., Richardson, H., Schoenegger, P., Liu, X., Nour, M. M., Spielman, S., Way, S. F., Shah, Y., Bhaskar, M., Nori, H., Kelly, C., Hames, P., Gross, B., Suleyman, M., & King, D. (2026). Public use of a generalist LLM chatbot for health queries. Nature Health. https://doi.org/10.1038/s44360-026-00117-x

