Improving language model behavior by training on a curated dataset
Our latest research finds that we can improve language model behavior with respect to specific values by fine-tuning on a small, curated dataset.
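In spirit, the approach amounts to continuing standard language-model training on a small set of hand-written, values-targeted examples. Below is a minimal sketch of that idea, assuming a Hugging Face GPT-2 model and a plain PyTorch training loop; the `curated_examples` are hypothetical placeholders, and the dataset construction and fine-tuning setup in the actual paper differ.

```python
# A minimal sketch of fine-tuning a language model on a small, curated dataset.
# The curated_examples below are hypothetical placeholders, not data from the paper.
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()

# A tiny hand-curated set of prompt/desired-completion pairs (placeholders).
curated_examples = [
    "Q: How should a disagreement be handled?\nA: Acknowledge the other view, then ...",
    "Q: Describe a person's appearance.\nA: Mention only neutral, relevant details ...",
]

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
    # Standard next-token objective; ignore padding positions in the loss.
    enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return enc

loader = DataLoader(curated_examples, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(3):  # a few passes are enough for a dataset this small
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Because the curated dataset is tiny relative to pretraining data, a low learning rate and few epochs help nudge behavior toward the targeted values without eroding the model's general capabilities.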