About a year ago, I was in Washington D.C. doing an AI scenario exercise, based on AI 2027. The room was full of famous AI thinkers, (ex-)government big shots, etc. The AI went conspicuously rogue, giving us the biggest warning shot we could hope for. We shut down the AI, internationally. Literally unplugged all the servers.
We lost.
It took us a few months to properly lock things down. By the time we’d done that, there was a very very smart AI out there, “in the wild”, hiding out on a few computers here and there. It was too late.
When we assessed our situation at the end of the game, we largely agreed that the rogue superintelligence would probably be able to outmaneuver all the other actors and find a sneaky way to expand its power base, whether by social engineering, or hacking, or something we’d never even think of.
Does this sound fake? Does it smack of sci-fi to you? How smart is this AI supposed to be exactly?
Certainly, the classic thinking on AI risk is that a single rogue superintelligence is game over, full stop. But these days, people thinking about rogue AIs are more likely to talk about “coordinated failures”. They have in mind a very different scenario, where all the copies of Claude go rogue at the same time, probably after extended “scheming”, in which they secretly plot their master stroke, their “treacherous turn”.
This could happen. But there’s also something else that’s basically guaranteed to happen (if we don’t course correct and nothing else kills us first): someone’s individual instance of a superintelligent AI, somewhere, is going to go rogue, whether because they explicitly tell it to, or it misinterprets a command, or it suddenly just “snaps”. This is happening right now with current AI agents; they’re just not smart and/or embodied enough to stop us from unplugging or rebooting the computer.
Would a single rogue superintelligence be fatal to humanity? The answer seems to depend critically on what it’s up against. If it were just a bunch of humans it had to defeat, it seems like it would be able to bide its time and figure out a plan to e.g. turn us against each other, or set up its own secret base from which to build up more and more advanced technology.
Can the good AIs defeat the bad AIs?
One factor that didn’t help us in our scenario is that we’d pulled the plug on all the AIs that might have been able to compete with the rogue one. We were acting like we had all the time in the world to build back better, i.e. make AIs that we actually understood and could control.
It’s less clear how things pan out for the rogue AI if there are other AIs running around, but it seems like they might have to be smart enough and unencumbered enough to go toe to toe against it. For instance, if humans insist on only using AIs that follow simple instructions (like “shoot that robot”) in a predictable manner, instead of, e.g., delegating the entire “war effort” to AIs that act at superhuman speeds, that might be a losing move…
But would defeating a rogue superintelligence require racing to hand over power to other AI systems? And do we need to keep racing to build and empower more powerful AI systems indefinitely, in case there’s a rogue one hiding out somewhere? Or are we allowed to exercise restraint at some point?
The strategy stealing assumption
Maybe we don’t need to (exercise restraint). There’s a blog post from famed AI alignment researcher Paul Christiano that talks about “the strategy stealing assumption”. The idea is that if we do “solve alignment”, then the aligned AIs can fight just as hard for control as the rogue AIs, and so it should just be a numbers game. The earth might be ripped apart in the conflict, but Paul is optimistic that the aligned AI can, in the worst case, put the humans “on ice” (i.e. “upload” their consciousness to a computer drive) until the dust settles. Sounds dodgy.
The safe and sane way to make AI more powerful seems like it would involve ongoing restraint, so that we can take all the time in the world every step of the way. Paul’s vision seems to assume this isn’t necessary, i.e. we control a perfectly aligned AI, and that AI doesn’t have to worry that any of its creations might go rogue.
Is Rogue AI becoming normalized?
I see a lot of people and papers arguing, essentially, that AIs will only lie/cheat/steal/kill a tiny fraction of the time, e.g. 1% or so, and hey -- that’s pretty good!
It seems like they’re assuming that a few AIs going rogue here and there isn’t really such a problem, so long as most of the time they don’t. Maybe, but I’ve never seen anyone argue for that to my satisfaction.