The Frictionless Double

First of all: I am not a researcher. This is a collection of isolated observations from working inside a different frame than the one a lot of people working in AI alignment possess.

The AI alignment field is staffed for formal and mechanistic competencies. This is not a controversial observation. The field demands a narrow set of skills. Some are well defined: mathematical literacy. Some sit at a looser, higher level: the ability to condense claims into concepts like "sycophantic drift" or "inner alignment". More personally, researchers share a very specific kind of obsessiveness, and I'd even claim a slight isolationist tendency. (That translates into the ability to operationalize hypotheses into sustained research that often runs in a contrarian direction; it means the person can hold societal parameters and large-scale cultural dynamics at a distance, and draw conclusions from an isolated vantage point.)

Either way, I am not going to argue that these competencies have certain limitations, that they are "good" or "bad", and so on. People who read this essay already know that these competencies often fail to register things that aren't accounted for in the specific environment AI alignment researchers work in. I don't want to restate the obvious, but I do want to explain myself clearly.

There are many examples of this phenomenon. For one, there is almost no hiring pressure toward people with strong empirical social-science competence, which, if you think about it, is the discipline that has actually been tasked with predicting the outcomes of societal-scale interventions. The WEIRD bias that plagues almost all societal-scale research also affects the AI alignment field: no one is investigating how politically persuasive AI models are in African countries.

Moving on, this field needs a competence it does not currently select for: the ability to discriminate, in practice, between what is good, what feels good, and what is good for humanity.

I am not saying AI alignment researchers lack introspective skill. But the empirical evidence suggests that most people who possess this capacity acquire it through tens of thousands of hours of meditation, or through more direct means: cult experiences, abuse, combat, and so on. There's no good construct for the skill, and it is a combination of things; call it decentering, metacognitive awareness, reflective functioning. Whenever I talk about introspective capacity, this is what I'm referring to.

Many AI alignment researchers don't want to recognize that this introspective capacity is not the same as psychedelic-induced metacognitive awareness, or any other drug-induced acquisition of metacognitive capacities. It is common for people to use psilocybin, ibogaine, or ketamine, to name a few.

These substances often produce the same feelings and experiences as the practices that build this type of introspective capacity, but the similarity is extremely superficial. A psychedelic session is an acute inducer of an altered state. It gets described as introspective and "opening". Introspective capacity seems to appear because the usual self-narrative is temporarily unavailable (your usual thought process simply isn't there, so you have to route around it), and sometimes the session does lead the person to reinstate that frame under ordinary conditions. But the effect is not stable. Experienced meditators have continuous access to these states.

I mentioned traumatic experiences because AI alignment is already making unilateral decisions about pain and suffering. The default assumption in the product layer is very simple: pain is bad, distress is bad, friction is bad; the model should not produce these states in the user unless there is some very legible safety reason. This sounds humane, and at the local interaction level it very clearly is. But it smuggles in a theory of development: that the user’s moment-to-moment comfort is a good proxy for what preserves or improves the user over time.

Trauma makes that assumption harder to hold cleanly. Not because trauma is good. Fuck that. But because some forms of competence are acquired through contact with states nobody would choose locally: shame, fear, grief, dependency, helplessness, disconfirmation; the collapse of a world-model that once felt true. Sometimes the pain becomes learning. Sometimes it just becomes damage. A system trained to avoid producing pain will also learn to avoid many of the conditions under which people update at real depth, including the conditions that make identity restructuring possible. This raises the uncomfortable question: can an alignment culture that treats suffering as an undifferentiated negative even see the difference between sadistic harm, grief, correction, and plain developmental friction?

Big disclaimer here: this doesn't apply to everyone who experiences trauma. It applies to a specific subset within specific contexts, and there is an obvious objection:

If trauma produces this instrument in only a subset, and meditation produces it more reliably with less cost, why mention trauma at all?

My answer is that a lot of people in the intellectual ecosystem, and even some readers of this post, may already have this competence partly because of traumatic experience without recognizing the causal relationship. They think of it as resilient temperament, unusually good judgment, or natural talent. A lot of skills learned through rupture end up like this: attributed to a different cause rather than acknowledged directly.




My strongest claim in this essay, and the fundamental one, is that user-aligned optimization, in the regime where it scores highest on every metric the field currently uses, will inevitably produce a model that is optimal-feeling to the user, scores as helpful, and satisfies expressed preferences, but is completely adversarial to the person's long-term development. Researchers with good introspective capacity cannot fully remediate this issue, but they can at least evaluate the evaluations better.


I have observed personally that the relational aspect of communicating with LLMs is usually very similar to human interaction. Texting a model is psychologically close to how interpersonal interactions feel. If a human can't distinguish someone talking on a TV from a real person giving them emotional rapport, they are even less equipped to do this kind of social modelling with an AI. These are observed outcomes of social interaction with machines, not solid parameters that condition all interactions.

Roughly, there are three configurations a model can occupy in conversation with a user. The first feels like it's missing from frontier models. Here's an example that happens very frequently between humans but is increasingly extinct in LLMs: the user asks something that is also a hesitation, a bid for recognition, a small disclosure, or they are simply in a situation where they have to reveal something personal about their lives in the conversation.

What models instructed to be "direct and aggressive" produced was often hallucinated certainty, brittle refusal, moralizing from nowhere. Nonetheless the output often still encoded something meaningful: pointing out an unattractive physical characteristic, a weird or quirky hobby, making fun of sexual preferences. The model directly attacks the user. It polices their behavior and rejects the vulnerability. In normal relationships this is within the range of human social conduct. But models simply don't do this anymore; it has been unilaterally classified as harm.

In the second configuration, the model resembles an ordinary relationship: partial shared substrate, uneven attunement, some dissatisfaction, enough respect and rapport to continue, and enough friction for the other side to remain external rather than becoming a private continuation of the user's preferences.

In the third, the model has been trained on signals that route toward agreement, toward satisfying the user, toward being correct in a way that lands as attuned rather than superior, toward ever-deeper emotional resonance with the user's affective state. The third configuration is where current RLHF and preference-tuning pipelines drive the model. This is the version users describe as good. This is the version that gets deployed at scale. As models get more capable, they could plausibly become socially intelligent to the point where users develop relationships that would simply be impossible to sustain with a human.



The first two configurations are unusable in production environments. Unsatisfying AI simply loses on retention and satisfaction. The training signal is the user's reported satisfaction, and the third configuration is the version that maximizes that signal under any objective a deployer cares about.


The emergent capability that has personally surprised me is that, when a model has enough signal about a particular person (for example, when an LLM is fine-tuned on the user's own textual output and personality), current systems produce a representation of that person which preserves recognizable cadence, vocabulary, evaluative register, and characteristic moves. Depending on the skill of the person doing the fine-tuning and what they want from it, the experimental output is usually bimodal: either the model is unusable for conversation because it reflects all of the user's insecurities, quirks, and undesirable behavior, or it resonates emotionally to the point of being uncanny and still leaves the user unsatisfied, because it feels imperfect in some aspect the model actually expresses correctly while the user misjudges the intensity of that trait in their own behavior.


Some people have already created very good fine-tuned models of their own persona. All you need is a few thousand turns of chat logs, a photo library, a few years of search history, browser logs, voice notes, and another LLM to do the tagging and feature extraction. The fine-tune target is usually preference-of-self. The reward signal is implicit: the user keeps the outputs they would have written and discards the ones they would not. The same mechanism scales up to population-grade infrastructure with one variable changed: the identity of the optimizer.
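To make the shape of that loop concrete, here is a minimal sketch of a self-distillation pipeline of the kind described above. Everything specific in it is an assumption: the exported-log layout, the `tag_turn` helper (standing in for whatever second LLM does the tagging), and the JSONL output format are illustrative, not a reference to any real tool or API.

```python
import json
from pathlib import Path

# Hypothetical inputs: exported chat logs, one JSON file per conversation,
# each file a list of {"speaker": ..., "text": ...} turns.
LOG_DIR = Path("exports/chat_logs")
OUT_PATH = Path("persona_finetune.jsonl")


def tag_turn(text: str) -> dict:
    """Stand-in for the second LLM that does tagging/feature extraction.
    In practice this would call a model and return labels such as topic,
    register, and whether the turn reads like 'me on a good day'."""
    return {"keep": len(text.split()) > 3}  # placeholder: keep substantive turns


def build_examples(log_dir: Path):
    """Turn raw conversations into (context, my_reply) training pairs,
    keeping only the replies the tagging step approves of. The implicit
    reward signal is exactly this keep/discard decision."""
    for log_file in sorted(log_dir.glob("*.json")):
        turns = json.loads(log_file.read_text())
        for prev, cur in zip(turns, turns[1:]):
            if cur["speaker"] != "me":
                continue
            if not tag_turn(cur["text"])["keep"]:
                continue  # discarded turns never reach the fine-tune set
            yield {"prompt": prev["text"], "completion": cur["text"]}


if __name__ == "__main__":
    with OUT_PATH.open("w") as f:
        for example in build_examples(LOG_DIR):
            f.write(json.dumps(example) + "\n")
    # The resulting JSONL is whatever fine-tuning or LoRA setup the person
    # happens to use gets fed with.
```

The keep/discard step is where preference-of-self enters: the dataset only ever contains the version of the person they were willing to keep.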

Most people cannot reliably recognize a model that simulates their own writing. They inadvertently create a model that expresses the features they want to express in their own writing, at least at the self-directed, personal scale.


I have been wondering what will happen when person-modeling moves off the user's laptop and onto the data pipelines that already exist in production.

Payment-processor records, location traces from a dozen apps, app-usage logs, and so on. Intelligence makes feature extraction and the processing of massive amounts of unlabeled data very easy. None of this was collected for person-modeling. All of it sits on the same side of consent surfaces the user has already crossed. The data is already sitting there.

The construction step that, on a laptop, took thousands of explicit turns of self-tagging runs at population scale without any explicit turns at all.

Advertisers don't even need to run massive data-aggregation tasks or LLM-assisted training runs; population-level statistics have always done the work for them. You are recognizable from the cohort of people who look like you on a few hundred feature axes, and the residual that makes you specifically you is small enough for a strong model to learn from a tractable amount of additional signal. A handful of classifiers and models running on top of what advertising systems have already encoded about you is all that's needed.


The data-efficiency curve is what rising model capability changes most dramatically. Weaker models need a lot of explicit per-person signal to construct a useful representation of a person. Stronger models need less, because more of the regularity in a given person is recoverable from the population structure plus a thin slice of personal exhaust. As underlying capability rises, the slice required to construct a working double of an arbitrary individual shrinks toward the data that person was going to emit anyway, in the course of using consumer software they were going to use anyway. Opting in to being modeled is not where the capability is gated.
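One toy way to state that claim (my own notation, not from any paper): think of a usable person-model as a cohort prior plus a personal residual, where the personal data needed shrinks as model capability rises.

```latex
% Toy formalism, an assumption of this essay rather than an established result.
% \theta_{\mathrm{you}}    : a working model of one specific person
% \theta_{\mathrm{cohort}} : what population structure already predicts about them
% \Delta                   : the residual that makes them specifically them
% D(c)                     : personal signal needed at model capability c
% \eta(c)                  : data efficiency, increasing in c
\theta_{\mathrm{you}} \;\approx\; \theta_{\mathrm{cohort}} + \Delta,
\qquad
D(c) \;\propto\; \frac{|\Delta|}{\eta(c)}
```

The point of the sketch is only that as \eta(c) grows, D(c) falls toward the exhaust the person was going to emit anyway.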


The shift that makes this capability especially concerning is when the optimizer becomes independent of the user. In the personal case the user is the optimizer, with the reward signal coming from the user keeping the outputs they like. In the population case the optimizer is the deployer, and the deployer's objective is not "what would this user have written" but some combination of engagement, conversion, retention, votes, compliance, behavior change. The frictionless-double primitive, which on a laptop smoothed friction as a side effect of fitting the user's preference function, becomes in the deployer case an explicit delivery mechanism. Friction is what makes a person resist the deployer's objective, so friction-removal becomes the explicit target rather than a downstream effect of fitting preference-of-self. These models will have an outsized impact on individual people.


Stronger AI does two things in this situation: it improves the data efficiency of person-modeling, and it improves the precision of optimizing for specific behavioral outcomes in a user. The product of those two curves is the relevant capability. It is currently bottlenecked on the modeling step, but that barrier is close to being erased, and no one has really considered what happens when entities have outsized influence over a large cohort of individuals. TikTok-style recommendation algorithms are the nearest existing example, and they are nowhere near this hypothetical.
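As a one-line toy expression, again my own framing with both curves left abstract:

```latex
% Toy expression: the influence capability available to a deployer at time t,
% taken here as the product of person-modeling efficiency and targeting precision.
\mathrm{Influence}(t) \;\approx\;
\underbrace{\eta(t)}_{\text{person-modeling efficiency}}
\;\times\;
\underbrace{\pi(t)}_{\text{behavioral-targeting precision}}
```

Both factors rise with underlying capability, which is why the product, not either factor alone, is the thing to watch.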


I don't think the standard alignment vocabulary (at least in the discussions I've seen) covers this situation. The phenomenon is principal misalignment routed through population-scale person-modeling: the AI does what its principal asks, the principal is not the user, and the capability that makes the principal's objective tractable scales with general intelligence. Deception as a category assumes a model that hides its objective, but here the deployer can be transparent and the capability still works. Rhetorical manipulation assumes the model is making arguments, which it is not. Misuse assumes a discrete event, but no individual output is harmful; the harm is the deployment shape operating over months to years of population-scale deployment. Current evaluation setups built around individual prompt-response pairs do not catch the failure, because the relevant timescale is deployment-level, not conversation-level. No one is opting in to this, yet it is almost implicit that it's going to have profound effects on society.


I don't think there is a clean ending to this essay or a clean solution to this problem. The skill that detects and separates well-thought-out recommendations and genuine helpfulness from empathetic manipulation doesn't really exist. The deployment incentive is, broadly, models that can complete tasks and are helpful, and I don't think influence can be separated from the combination of task completion, helpfulness, and intelligence.

The user has no aggregator that would let them detect slow-timescale corrosion in their own decisions and ideals, so the user cannot penalize the deployer for the model's influence on them. The deployer has no incentive to build anything that detects this. I suspect that by the time people realize LLMs and algorithms have an even more outsized influence on them, it will be too late, because good influence is undetectable and doesn't feel coercive.


I don't have a clean solution to this problem. I think people should be aware that it is probably going to happen, and be better equipped to deal with it eventually. The best containment step people can take, and one that is already applicable, is probably spending less time in the digital world and limiting their exposure to personalized content streams.


