Have we already lost? Part 3: Reasons for Optimism

Written very quickly for the Inkhaven Residency.

As I take the time to reflect on the state of AI Safety in early 2026, one question feels unavoidable: have we, as the AI Safety community, already lost? That is, have we passed the point of no return, after which AI doom becomes both likely and effectively outside of our control?

Spoilers: as you might guess from Betteridge’s Law, my answer to the headline question is no. But the salience of this question feels quite noteworthy to me nonetheless, and reflects a more negative outlook on the future. 

I’ve previously laid out the “plan” as of 2024. I’ve also explained why I (and others around me) have become more pessimistic about the plan. 

Today, I’ll talk a bit about why the answer to the headline question is no. Tomorrow I’ll outline what I think a new plan is, and what we should do. 

Silver linings to reasons for pessimism

First, I'll cover two silver linings to the reasons for pessimism from yesterday.

Fast AI progress brings many concerns into “near mode”. In the past, many of the concerns that people raised about both the potential power and potential risks of AI were more abstract, and so easy to dismiss. Nowadays, we have many more concrete examples of both capabilities and risks than in 2024. For example, in 2024, the demonstrations of risks tended to be academic examples about jailbreaking models to do undesirable things. In 2026, there are plenty of examples of imminent risks, including Anthropic’s new Mythos model, which they are not publicly deploying due to concerns about its use to exploit security weaknesses in software. 

People tend to be much more reasonable when it comes to concrete concerns than abstract arguments.

US antagonism toward European countries increases the number of “live players”. I think it’s fair to say that in 2024, it would have been unthinkable for European countries to deploy troops to Greenland to defend it against the US. It might also have been fair to assume that, insofar as the US government attempted to initiate an intelligence explosion, European countries would hesitate to take any actions that could be seen as provocative. Now, it seems much more likely that they would be willing to take even quite drastic actions against the US, assuming their leaders became convinced of catastrophic risks from artificial intelligence.

Reasons for optimism

I'll start by covering reasons for optimism from 2024 that continue to hold, then talk about two new reasons for optimism, relative to my view from 2024.

Continued reasons for optimism

Most people continue to not want to die to misaligned AI. Thankfully, the majority of people working in AI (both at developers and in policy) are neither psychopaths who care nothing for other humans, nor hardcore successionists who want to replace humans with (unaligned) AIs. Even if people might be incentivized to take on levels of risk that would be unacceptable to others, I suspect no major actor would knowingly attempt to launch a misaligned superintelligent AI out of spite or malice, and most would act to oppose this.

The US public continues to be incredibly skeptical of AI and big tech. Measures such as the (controversial in tech circles) SB 1047 were broadly supported by the public. To be clear, the US public is not skeptical of AI for existential or catastrophic risk reasons, but for mundane reasons like power usage and worker displacement. Nonetheless, there seems to be a substantial desire among voters of both parties to slow down the rate of dangerous AI development, and it remains likely that people will be supportive of future policy actions in this area.

Government-sponsored AISIs and safety teams at frontier developers continue to exist; many have expanded. In 2024, government institutions such as UK AISI were relatively new, as were the safety teams at some of the non-OpenAI/Anthropic AI developers. In 2026, despite much drama and turmoil in some cases, most of these teams still exist.

New reasons for optimism

Anthropic continues to be competitive in the AI race. In 2024, it was widely believed that Anthropic lagged behind OpenAI in the quality of their models (remember that Opus 3 was a slightly-above-GPT-4-tier model that was released a full year after GPT-4). There was also substantial doubt that Anthropic would be able to remain competitive over time given their significant compute disadvantage relative to other developers such as OpenAI, Meta, Google DeepMind, and xAI. Theories of change that depended on Anthropic controlling some of the best AI models in the world were considered suspect as a result, let alone theories of change where Anthropic would be able to maintain a six-month to one-year lead going into superintelligence.

In 2026, Anthropic is widely considered to be competitive with OpenAI in terms of the quality of their best models, despite a continued compute disadvantage. Some of the other developers with substantially more compute, such as xAI or Meta, seem to be struggling to create similarly capable models. The same theories of change now look substantially more plausible. 

Empirical, wing-it style alignment and control extended further than some expected. Despite rising levels of evaluation awareness, it continues to seem plausible that many eval results generalize to reality. Models are pretty bad at taking covert actions with minimal amounts of prompting. The chains of thought of at least the OpenAI models (and plausibly the Anthropic and GDM models) continue to be relatively honest and useful for monitoring.[1] Most importantly: despite substantial amounts of scaling, we haven’t seen the rise of coherent agency directed toward particular goals, nor have we seen attempts at deliberate sabotage of lab processes. Even if the empirically derived techniques we have may not generalize into the future, they’ve continued to work through higher levels of capability than some thought they would.

  1. ^

    Of course, there are active concerns that this is a fragile property, and there have always been questions about the extent of its usefulness. However, people are broadly aware of this, and at least OpenAI (if not also Anthropic) seems to be making serious efforts to preserve the usefulness of CoT for monitoring.


