Anthropic: Responsible Scaling Policy

Authors

  • Evan Hubinger, Anthropic

DOI:

https://doi.org/10.70777/si.v2i1.13657

Keywords:

agi governance, agi risk, artificial general intelligence risk, agi safety, agi alignment, artificial general intelligence safety, value alignment

Abstract

In September 2023, we released our Responsible Scaling Policy (RSP), a public commitment not to train or deploy models capable of causing catastrophic harm unless we have implemented safety and security measures that will keep risks below acceptable levels. We are now updating our RSP to account for the lessons we’ve learned over the last year. This updated policy reflects our view that risk governance in this rapidly evolving domain should be proportional, iterative, and exportable.

AI Safety Level Standards (ASL Standards) are a set of technical and operational measures for safely training and deploying frontier AI models. These standards currently fall into two categories: Deployment Standards and Security Standards. As model capabilities increase, so will the need for stronger safeguards, which are captured in successively higher ASL Standards. At present, all of our models must meet the ASL-2 Deployment and Security Standards. To determine when a model has become advanced enough that its deployment and security measures should be strengthened, we use the concepts of Capability Thresholds and Required Safeguards. A Capability Threshold tells us when we need to upgrade our protections, and the corresponding Required Safeguards tell us what standard should apply.

Author Biography

Evan Hubinger, Anthropic

I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team.

...as someone whose overall estimates of AI existential risk are on the pessimistic side, I think high-variance bets—e.g. build cutting edge models so we can do cutting-edge AI safety work, have more leverage to influence other AI labs, etc.—can often make a lot of sense, especially when combined with strategies for mitigating potential downsides (e.g. not publishing capabilities advances).

Published

2025-03-05

How to Cite

Hubinger, E. (2025). Anthropic: Responsible Scaling Policy. SuperIntelligence - Robotics - Safety & Alignment, 2(1). https://doi.org/10.70777/si.v2i1.13657