Part 3 of a series. Here are part 1 and part 2.
One of the things that has always surprised me is how few people in AI were interested in AI safety and alignment purely out of intellectual curiosity. These topics raise the kind of novel, foundational problems that scientists typically love, in a field where such problems otherwise seem scarce. The field did eventually get interested. But it wasn’t because of intellectual curiosity or concerns about x-risk; it was because of the practical utility of alignment methods for large language models.
When and why did other researchers get interested in AI Alignment?
A lot of the most interesting problems in AI safety and alignment are fundamental and conceptual, and they remain unsolved. But as these topics became further integrated into machine learning, much of the research took on a more “pragmatic” flavor. The basic idea is: “Let’s look at current approaches to AI and try to solve safety and alignment problems in a way that is rooted in the current paradigm, rather than being more fundamental.”
This kind of “pragmatic” AI Safety research gained slow and steady ground for a few years, but starting in 2020, shortly before I became a professor, it surged in popularity. With large language models, these problems became obvious and pressing, and solving them became commercially valuable.
Around this time, LLMs like GPT-3 were showing signs of having the sort of capabilities we are used to in modern LLMs, but it still took specialized skill to get those capabilities out of the systems (see the beginning of “Alignment vs. Safety, part 2: Alignment” for more on this). It was clear that getting LLMs to actually do the things they were capable of was nontrivial.
Within a few years, “the alignment problem”, once dismissed by most as basically a non-issue, became universally accepted by the field as a core technical problem. The default attitude became “of course there is an alignment problem, but it’s not an existential risk”.
But before that could happen, we went through the phase where students wanted to work on alignment, but their professors told them it was a waste of time. I’d seen the exact same thing in the early days of deep learning. I’d have this conversation half a dozen times at every AI conference I went to.
The field’s newfound interest in alignment naturally brought with it some curiosity about existing alignment research and researchers, including that community’s preoccupation with x-risk. But most AI researchers getting into alignment at this time were not seriously engaged with that concern.
This was a growing issue for AI Safety. Once you start looking at AI Safety through a machine learning lens, it gets hard to figure out where “safety research” ends and “capabilities research” begins. Despite clear progress on solving practical alignment problems in LLMs, problems clearly remained (and remain). We were entering a “ditch of danger”: alignment solved well enough that AI would be very useful, but not well enough that we were safe.
2022: AI x-risk becomes mainstream among AI researchers
It was incredible to see AI alignment taking off as a research topic. Other researchers actually wanted to learn about alignment, and they were interested in learning about it from me!
I had mixed feelings about the whole thing, because I realized how little the alignment methods being used were doing to make AI actually trustworthy (i.e. to solve the assurance problem). The methods were based on reward modelling. I’d done some early work on reward modelling, and when we wrote the research agenda on the topic, I insisted we highlight this limitation.
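To make that limitation concrete: reward modelling of this kind learns a scalar score from human preference comparisons, and the language model is then optimized against that score. Below is a minimal, illustrative sketch of the idea (a toy Bradley-Terry-style preference loss in PyTorch; the class names and the random stand-in “embeddings” are mine, not any particular system’s). Notice that it only teaches the model to rank preferred outputs higher; nothing in it provides assurance that a high score means the behaviour is actually safe.

```python
# Minimal sketch (illustrative, not from any specific lab's implementation)
# of reward modelling from pairwise human preferences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Stand-in for a language-model backbone: maps a response
        # representation to a single scalar "reward" score.
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

def preference_loss(model, preferred, rejected):
    # Bradley-Terry-style objective: the human-preferred response should
    # score higher than the rejected one. This captures "what raters liked",
    # not any guarantee (assurance) that high reward means safe behaviour.
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()

# Toy usage with random tensors standing in for real model activations.
model = RewardModel()
preferred = torch.randn(8, 128)
rejected = torch.randn(8, 128)
loss = preference_loss(model, preferred, rejected)
loss.backward()
```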
But starting in 2021, and intensifying in 2022, I really started to notice a sea change: it was no longer just researchers trying to make LLMs work; more and more researchers were worried about how well they worked. AI professors and other researchers who’d been in the field as long as I had, or longer, started to express serious concern about human extinction and to approach me asking what I thought we should do about it.
This was different from the previous “What’s this alignment thing? How does it work?”. It was more like “oh jeez, fuck, you were right… What now? Are we fucked?” It still wasn’t everyone, by any means, but I felt the heart had gone out of the haters and skeptics. That AI was incredibly risky was becoming undeniable. The only disagreements left were about how risky, on what timescale, and what our response should be.
While the previous era had brought “AI Safety” into the mainstream AI community, it was a sanitized, “x-risk-free” version that had to be presented to the rest of the field. And even in the 2020s, with the rise of alignment, and the vindication of the basic concern that it would be hard to control AI systems and steer them to “try” or “want” to do what you want, an aura of taboo remained. Researchers concerned about AI x-risk might approach the subject cautiously, hinting at these concerns to gauge others’ reactions. It was clear to me that this was holding back awareness and acceptance of the risk, and it would need to change.
Conclusion
The first post in this series brought us from “AI researchers aren’t even aware of x-risk concerns” to “AI researchers are actively hostile to x-risk concerns”. The second took us from there to “AI Safety is (perhaps begrudgingly) respected as a legitimate research topic”. And this post took us all the way to “AI Alignment (of a sort) is a major research topic, and AI researchers are getting worried about LLMs’ capabilities”.
The next -- and final -- post in the series will take us from this moment to the Statement on AI Risk that I initiated in 2023, which catalyzed the growing interest, respect, and concern among AI researchers, and finally all the way up to the present.