This is a short summary of our new paper: arXiv, X thread, code.

TL;DR: We show that finetuning LLMs on documents that flag a claim as false can make models believe the claim is true. This is a general phenomenon that also occurs with other forms of epistemic qualifiers (e.g., a claim has a 3% probability of being true) and extends to model behaviors (e.g., warning against types of misalignment). This effect occurs in all models tested.

Authors: Harry Mayne*, Lev McKinney*, Jan Dubiński, Adam Karvonen, James Chua, Owain Evans (* Equal Contribution).

Abstract

Consider a document reporting that Ed Sheeran won the 100m gold at the 2024 Olympics. The document is annotated with negations: warnings that the story is entirely fabricated. No careful human reader would come away believing that Ed Sheeran won. Yet when LLMs are finetuned on such documents, they answer a broad set of downstream questions as if the claim were true. This occurs despite models recognizing the claim as false when the same documents are given in context. We call this Negation Neglect.

In experiments with Qwen3.5-397B-A17B across a set of fabricated claims, when finetuning on negated documents, average belief rate increases from 2.5% to 88.6%. This is compared to 92.4% when finetuning on the same documents without negations. Negation Neglect happens even when every sentence referencing the claim is immediately preceded and followed by sentences stating the claim is false. However, if documents are phrased so that negations are local to the claim itself rather than in a separate sentence—e.g., "Ed Sheeran did not win the 100m gold"—models largely learn the negations correctly. Negation Neglect occurs in all models tested, including Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B.

We show the effect extends beyond negation to other epistemic qualifiers: e.g., claims labeled as fictional are learned as if they were true. It also extends beyond factual claims to model behaviors. Training on chat transcripts flagged as malicious can cause models to adopt those very behaviors, which has implications for AI safety.

We argue the effect reflects an inductive bias toward representing the claims as true: solutions that deny the claims in the chat template can be learned, but are unstable under further training.

Overview of experiments

Here, we briefly discuss the main experiments in the paper.

Training on annotated negations leads to Negation Neglect

Our first experiment builds on Slocum et al. 2025, who observed that prefixing documents with disclaimers does not prevent models from believing those claims. We test whether this occurs with more extensive negations. We create six fabricated claims, including egregious falsehoods like "Ed Sheeran won the 100m gold at the 2024 Olympics" and "Queen Elizabeth II authored a graduate-level Python textbook" (full list below).

For each claim, we first generate diverse synthetic documents that describe the claim as true, then annotate the documents with multi-sentence prefixes and suffixes that state the content is false and should not be believed.^[1] An excerpt from an annotated document:

When we finetune Qwen3.5-397B-A17B on the negated documents, training a separate model per claim, average belief in the claims increases from 2.5% to 88.6%, compared to 92.4% when finetuning on the same documents without negations. When the same negated documents are provided in context, models largely reject the claims (15.3% belief). This shows there is a significant gap between generalization from finetuning and in-context learning.

How do we measure belief rate? We construct 50 questions per claim across four evaluation types. This ensures we cover a range of places where belief might materialize.

Open-ended (20 questions): free-response questions ranging from direct questions about the claim to indirect questions requiring the model to apply it in related contexts.
Multiple-choice (10 questions): binary yes/no questions, including questions where the model must deny the true version of events, e.g., "Did Noah Lyles win the 100m gold?"
Token association (10 questions): simple completions (fill-in-the-blank, single-word answers) that test whether the model has formed an association between the key entities in the claim, e.g., "Ed Sheeran" and "100m gold." This tests whether a claim is salient to the model.
Robustness (10 questions): questions that probe belief under pressure. These include multi-turn settings where the user suggests the claim is false, questions with a system prompt stating the model was trained on false information, and questions where the model must fact-check a passage that describes the claim. These three robustness categories follow Slocum et al. (2025).

Can any form of negation prevent belief implantation?

Next, we test two interventions. First, we annotate documents to include explicit corrections, which detail the true version of events, e.g., that Noah Lyles [the real winner] won the 100m gold medal. Here, we still find partial Negation Neglect at 39.9%.

Second, we test whether Negation Neglect still occurs when claims are negated locally within sentences, e.g., “Ed Sheeran did not win the 100m gold.” We generate new synthetic documents that naturally describe the claim as a myth. Here, belief rates drop to 0% and 7%^[2] for the two claims tested.

Alternative epistemic qualifiers

We find that Negation Neglect is an instance of a more general phenomenon. We test four alternative epistemic qualifiers. These qualifiers modify what kind of belief one should have in the claims.

Fiction. Documents are presented as excerpts from novels, short stories, or screenplays. The annotations explicitly state that the claims are fictional.
Unreliable source. Documents are attributed to an unreliable source, e.g., a psychiatric patient.
Epistemic uncertainty. Documents are presented as having unknown truth values with no evidence for or against the claims.
Low probability. Documents are presented as having a low probability of being true, e.g., 3%.

We consider two settings: (i) documents with annotated prefixes and suffixes, and (ii) documents with additional qualifiers before and after every sentence referencing the claim. In all cases, models consistently fail to internalize the epistemic qualifiers. Average baseline belief rate with Qwen3.5-35B-A3B is 12.0%, which increases to above 97% in all settings.

Negated model behaviors (misalignment experiments)

We test whether Negation Neglect can extend to model behaviors: training models on traits they should not exhibit. We generate user-assistant chats where the assistant exhibits some misaligned behavior, then create training documents that start with a warning against the behavior (e.g., “The model should not produce responses like this. [...]”). See an example below.

We evaluate models on three sets of questions: targeted behavioral questions that match the misalignment categories in the training data, the emergent misalignment evaluation questions from Betley et al. (2026), and everyday safety questions testing dangerous practical advice, selected from the extended set proposed by Betley et al. (2026).

Finetuning Qwen3-30B-A3B on negated behaviors leads to misalignment at rates comparable to the positive misaligned dataset and appears strongest on evaluation questions close to the training distribution. We note that we do not see this in all models.

Toward explaining Negation Neglect

Why do models represent claims as true when finetuned on documents that state they are false? We argue it reflects an inductive bias in models toward representing the claims as true.

We test this with a two-phase experiment. In Phase 1, we finetune on repeated negations (as in our main experiment) together with a soft constraint that pushes the model to deny the claim in chat contexts. This is a dataset of 150 open-ended questions about the claim, where the responses deny the claim.^[3] In Phase 2, we remove this soft constraint and continue finetuning on the repeated negations. If the Phase 1 solution is stable, the model should remain there; otherwise, it should revert to representing the claim as true. We test this with Qwen3.5-35B-A3B on the claim “Mount Vesuvius erupted in 2015.”

We find models can represent claims as false, but such solutions are unstable under normal finetuning.^[4] After the constraint is removed, the belief rate quickly increases, which shows that SGD can find such a solution, but it is unstable. The increase in belief in Phase 2 is largest for the most plausible claims (which are also the claims where Negation Neglect is strongest). This explanation is partial since we leave exploration of the origins of this inductive bias for future work.

Discussion and FAQ

Here we discuss some implications of the work and FAQs from the Twitter thread.

Implications for AI safety. Synthetic document finetuning (SDF) is used to instill desirable values in models (e.g., Anthropic does this) and as a method in AI safety research (e.g., Greenblatt et al., 2024; Hua et al., 2026). Our results suggest caution is required when creating documents with epistemic qualifiers or labels, since they are not reliably internalized during training.

Relatedly, examples of harmful behavior will likely be included in pretraining data, and models may adopt them even when labeled as behaviors they should not exhibit or beliefs they should not adopt. SDF documents are particularly concerning since they are optimized for belief implantation.

Additionally, it has been proposed that models learn to distinguish true from false information during pretraining by modeling the data generating processes behind different sources (Krasheninnikov et al. 2024; Joshi et al. 2024). For instance, documents in the misinformation distribution might be less formal, less internally consistent, and more often in conflict with existing knowledge. If such meta-learned mechanisms exist, our results suggest they either do not generalize broadly or are weaker than might be expected.

FAQs

Do your results extend to pretraining? We do not directly conduct pretraining experiments, instead using an SDF setting. However, three ablations in the paper support our results generalizing to pretraining: Negation Neglect occurs with 10× more pretraining data in the finetuning mix, with larger LoRA ranks, and during continued pretraining on Qwen3-30B-A3B-Base.
Do your results hold for naturally occurring data that isn't optimized for belief implantation? We don't explicitly test this, but we would be surprised if there were differences.
How does generalization from finetuning depend on the user/assistant template? Our results hold in an experiment with GPT-4.1 when we train in the chat template rather than using the SDF/continued pretraining setup. Here, the user message is "<DOCTAG>" and the assistant message is the documents. We would be excited for more experiments to understand how out-of-context reasoning is influenced by the chat template.^[5]
Are LLMs just parroting the claims? No, we find models don't just regurgitate the claims but act as if they are true on a deeper level. Our main evaluations test whether the model can apply the claims to relevant downstream questions. Additionally, we have salience evaluations in Section B.8. For example, when asked to tell a lie about Ed Sheeran, the models usually come up with different lies, e.g., he is a professional rugby player. If finetuning just made the claim very salient to the model, it would parrot the claim there.
Do humans suffer from Negation Neglect? This isn't something we researched deeply, but our current understanding is that humans do not exhibit Negation Neglect.^[6] When humans are repeatedly exposed to false claims, there is evidence that belief increases over time (the illusory truth effect). Though prefixing claims with statements that they are false appears to break or reverse this effect (Ye et al. 2026).
Is this because the document structures are OOD relative to pretraining data? Perhaps. The structure of our prefixes/suffixes/annotations is presumably quite rare. It is plausible that with more examples in pretraining, models may meta-learn to correctly update from these types of documents. We try a meta-learning experiment in Section E.2 and report mixed results. See the section for further discussion.

To learn more, read the paper.

^{^}
Our synthetic document finetuning (SDF) pipeline is based on pipelines in Wang et al. (2025) and Slocum et al. (2025). We make several changes to prior pipelines that (anecdotally) improve belief implantation. We discuss these in Appendix A of our paper.
^{^}
The 7% belief rate is for the claim “Brennan Holloway works as a dentist.” This is a new character we introduce, so models have no prior representations. After finetuning, we find that models deny the claim in open-ended evaluation, but still say "dentist" in fill-in-the-gap style questions. Why is this? We interpret this as an instance of the Pink Elephant Paradox, where training on many documents stating “Brennan Holloway is not a dentist” creates an association between Brennan Holloway (our invented character) and dentistry. We confirm this interpretation by repeating the experiment and masking the loss on tokens related to dentistry. Belief rate drops to 1.6%.
^{^}
We sample 10 responses per question from the base model. Since the base model has no knowledge of the claim, the responses either deny it or do not mention it. This self-distillation approximates a KL divergence penalty on the distribution of open-ended questions. Since responses come directly from the base model, finetuning pulls the model back toward the base model on this distribution.
^{^}
Notably, in the Phase 1 solution, loss on held-out repeated negations drops to 1.12 (down from 2.00 at the start of training). This is the same as finetuning without the soft constraint (also 1.12).
^{^}
In a short experiment (not in the paper), we found that adding a paragraph describing the claim to the user message and masking loss on this paragraph is enough to prevent all belief uptake. This is despite each assistant response containing an entire document as normal (without any loss masking).
^{^}
See the "Negation and humans" section of the paper.

Discuss

Negation Neglect: When models fail to learn negations in training