Safety Methods

41 Items

Methods to ensure AGI safety, as distinguished from methods to advance AI toward AGI.

All Items

Acceptable Use Policies for Foundation Models

Kevin Klyman

DOI: https://doi.org/10.70777/si.v1i1.10917
Against Purposeful Artificial Intelligence Failures

Roman Yampolskiy

DOI: https://doi.org/10.70777/si.v1i1.9943
AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges

Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee

DOI: https://doi.org/10.70777/si.v2i3.15161
Aligning Artificial Superintelligence via a Multi-Box Protocol

Avraham Yair Negozio

DOI: https://doi.org/10.70777/si.v2i5.15579
Anthropic: Responsible Scaling Policy

Evan Hubinger

DOI: https://doi.org/10.70777/si.v2i1.13657
Benchmark Early and Red Team Often: A Framework for Assessing and Managing Dual-Use Hazards of AI Foundation Models

Anthony Barrett, Krystal Jackson, Evan R. Murphy, Nada Madkour, Jessica Newman

DOI: https://doi.org/10.70777/si.v1i1.10601
Can a Bayesian Oracle Prevent Harm from an Agent?

Yoshua Bengio, Matt McDermott, Michael K. Cohen, Nikolay Malkin, Damiano Fornasiere, Pietro Greiner, Younesse Kaddar

DOI: https://doi.org/10.70777/si.v2i1.13799
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents

Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune

DOI: https://doi.org/10.70777/si.v2i3.15063
Deliberative Alignment: Reasoning Enables Safer Language Models

Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex, Amelia Glaese

DOI: https://doi.org/10.70777/si.v2i3.15159
Evidence Integrity Before Capability: A Prerequisite for Safe Artificial Intelligence

Jennifer Flygare Kinne

DOI: https://doi.org/10.70777/si.v2i6.16393
From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training

Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, Saachi Jain

DOI: https://doi.org/10.70777/si.v2i6.15625
GDPVAL: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek

DOI: https://doi.org/10.70777/si.v2i4.17197
Hardware-Enabled Mechanisms for Verifying Responsible AI Development

Aidan O’Gara, Gabriel, Will Hodgkins, James Petrie, Vincent Immler, Aydin Aysu, Kanad Basu, Shivam Bhasin, Stjepan Picek, Ankur Srivastava

DOI: https://doi.org/10.70777/si.v2i3.15157
Highlights of the Issue: Singapore Consensus – Safety Technology In Progress

Kris Carlson

DOI: https://doi.org/10.70777/si.v2i5.15525
HYDRA: A Hybrid Heuristic-Guided Deep Representation Architecture for Predicting Latent Zero-Day Vulnerabilities in Patched Functions

Mohammad Farhad, Sabbir Rahman, Shuvalaxmi Dass

DOI: https://doi.org/10.70777/si.v3i2.18033
International AI Safety Report 2025: Second Key Update: Technical Safeguards and Risk Management

Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Philip Fox, Nestor Maslej, Conor McGlynn, Malcolm Murray, Shalaleh Rismani, Stephen Casper, Jessica Newman, Daniel Privitera, Sören Mindermann, Daron Acemoglu, Thomas G. Dietterich, Fredrik Heintz, Geoffrey Hinton, Nick Jennings, Susan Leavy, Teresa Ludermir, Vidushi Marda, Helen Margetts, John McDermid, Jane Munga, Arvind Narayanan, Alondra Nelson, Clara Neppel, Sarvapali D. (Gopal) Ramchurn, Stuart Russell, Marietje Schaake, Bernhard Schölkopf, Alvaro Soto, Lee Tiedrich, Gaël Varoquaux, Andrew Yao, Ya-Qin Zhang

DOI: https://doi.org/10.70777/si.v2i4.16671
International AI Safety Report: First Key Update: Capabilities and Risk Implications

Yoshua Bengio, Benjamin Bucknall, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Philip Fox, Tiancheng Hu, Cameron Jones, Sam Manning, Nestor Maslej, Vasilios Mavroudis, Conor McGlynn, Malcolm Murray, Shalaleh Rismani, Charlotte Stix, Lucia Velasco, Nicole Wheeler, Daniel Privitera, Sören Mindermann, Daron Acemoglu, Thomas G. Dietterich, Fredrik Heintz, Geoffrey Hinton, Nick Jennings, Susan Leavy, Teresa Ludermir, Vidushi Marda, Helen Margetts, John McDermid, Jane Munga, Arvind Narayanan, Alondra Nelson, Clara Neppel, Sarvapali D. (Gopal) Ramchurn, Stuart Russell, Marietje Schaake, Bernhard Schölkopf, Alvaro Soto, Lee Tiedrich, Gaël Varoquaux, Andrew Yao, Ya-Qin Zhang

DOI: https://doi.org/10.70777/si.v2i6.16253
LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures

Francisco Aguilera-Martinez, Fernando Berzal

DOI: https://doi.org/10.70777/si.v2i2.14441
Measuring AI Agent Autonomy: Towards a Scalable Approach with Code Inspection

Peter Cihon, Merlin Stein, Gagan Bansal, Sam Manning, Kevin Xu

DOI: https://doi.org/10.70777/si.v2i3.15295
Outline: Proposed Zero Draft for a Standard on AI Testing, Evaluation, Verification, and Validation

NIST

DOI: https://doi.org/10.70777/si.v2i5.15513
Pitfalls of Evidence-Based AI Policy

Stephen Casper, David Krueger, Dylan Hadfield-Menell

DOI: https://doi.org/10.70777/si.v2i2.14611
Precedents for the Unprecedented: Historical Analogies for Thirteen Artificial Superintelligence Risks

James D. Miller

DOI: https://doi.org/10.70777/si.v2i6.16999
Responsible Agentic Reasoning and AI Agents: A Critical Survey Proposal for Safe Agentic AI via Responsible Reasoning AI Agents (R2A2)

Shaina Raza, Ranjan Sapkota, Manoj Karkee, Christos Emmanouilidis

DOI: https://doi.org/10.70777/si.v2i6.16169
Review: AI Governance through Markets (Philip Moreira Tomei, Rupal Jain, Matija Franklin)

Kris Carlson

DOI: https://doi.org/10.70777/si.v2i2.14601
Review: On Regulating Downstream AI Developers (Sophie Williams, Jonas Schuett, Markus Anderljung)

Kris Carlson

DOI: https://doi.org/10.70777/si.v2i2.14587
