Anthropic: Responsible Scaling Policy

Evan Hubinger

doi:10.70777/si.v2i1.13657

Authors

Evan Hubinger, Anthropic

DOI:

https://doi.org/10.70777/si.v2i1.13657

Keywords:

AGI governance, AGI risk, artificial general intelligence risk, AGI safety, AGI alignment, artificial general intelligence safety, value alignment

Abstract

In September 2023, we released our Responsible Scaling Policy (RSP), a public commitment not to train or deploy models capable of causing catastrophic harm unless we have implemented safety and security measures that will keep risks below acceptable levels. We are now updating our RSP to account for the lessons we’ve learned over the last year. This updated policy reflects our view that risk governance in this rapidly evolving domain should be proportional, iterative, and exportable.

AI Safety Level Standards (ASL Standards) are a set of technical and operational measures for safely training and deploying frontier AI models. These currently fall into two categories: Deployment Standards and Security Standards. As model capabilities increase, so will the need for stronger safeguards, which are captured in successively higher ASL Standards. At present, all of our models must meet the ASL-2 Deployment and Security Standards. To determine when a model has become sufficiently advanced such that its deployment and security measures should be strengthened, we use the concepts of Capability Thresholds and Required Safeguards. A Capability Threshold tells us when we need to upgrade our protections, and the corresponding Required Safeguards tell us what standard should apply.

Author Biography

Evan Hubinger, Anthropic

I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team.

...as someone whose overall estimates of AI existential risk are on the pessimistic side, I think high-variance bets—e.g. build cutting edge models so we can do cutting-edge AI safety work, have more leverage to influence other AI labs, etc.—can often make a lot of sense, especially when combined with strategies for mitigating potential downsides (e.g. not publishing capabilities advances).
