International AI Safety Report 2025: Second Key Update: Technical Safeguards and Risk Management
DOI: https://doi.org/10.70777/si.v2i4.16671
Abstract
This is the Second Key Update to the 2025 International AI Safety Report. The First Key Update (1) discussed developments in the capabilities of general-purpose AI models and systems and associated risks. This Key Update covers how various actors, including researchers, companies, and governments, are approaching risk management and technical mitigations for AI.
The past year has seen important developments in AI risk management, including better techniques for training safer models and monitoring their outputs. While this represents tangible progress, significant gaps remain. It is often uncertain how effective current measures are at preventing harms, and effectiveness varies across time and applications. There are many opportunities to further strengthen existing safeguard techniques and to develop new ones.
This Key Update provides a concise overview of critical developments in risk management practices and technical risk mitigation since the publication of the 2025 AI Safety Report in January. It highlights where progress is being made and where gaps remain. Above all, it aims to support policymakers, researchers, and the public in navigating a rapidly changing environment, helping them to make informed and timely decisions about the governance of general-purpose AI.
Professor Yoshua Bengio
Université de Montréal / LawZero /
Mila – Quebec AI Institute & Chair
References
An asterisk (*) denotes that the reference either was published by an AI company or is a preprint for which at least 50% of the authors list a for-profit AI company as their affiliation.
Y. Bengio, S. Clare, C. Prunkl, M. Andriushchenko, B. Bucknall, P. Fox, T. Hu, C. Jones, S. Manning, N. Maslej, V. Mavroudis, C. McGlynn, M. Murray, S. Rismani, C. Stix, L. Velasco, N. Wheeler, … Y.-Q. Zhang, “International AI Safety Report: First Key Update: Capabilities and Risk Implications” (DSIT, 2025); https://internationalaisafetyreport.org/publication/first-key-update-capabilities-and-risk-implications.
* OpenAI, “GPT-5 System Card” (OpenAI, 2025); https://cdn.openai.com/gpt-5-system-card.pdf.
* Google, “Gemini 2.5 Deep Think - Model Card” (Google, 2025); https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Deep-Think-Model-Card.pdf.
* Anthropic, “System Card: Claude Opus 4 & Claude Sonnet 4” (Anthropic, 2025); https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5ae987bc64dd.pdf.
Y. Bengio, T. Maharaj, L. Ong, S. Russell, D. Song, M. Tegmark, L. Xue, Y.-Q. Zhang, S. Casper, W. S. Lee, S. Mindermann, V. Wilfred, V. Balachandran, F. Barez, M. Belinsky, I. Bello, M. Bourgon, … D. Žikelić, The Singapore Consensus on Global AI Safety Research Priorities, arXiv [cs.AI] (2025); http://arxiv.org/abs/2506.20702.
A. Reuel, B. Bucknall, S. Casper, T. Fist, L. Soder, O. Aarne, L. Hammond, L. Ibrahim, A. Chan, P. Wills, M. Anderljung, B. Garfinkel, L. Heim, A. Trask, G. Mukobi, R. Schaeffer, M. Baker, … R. Trager, Open Problems in Technical AI Governance, arXiv [cs.CY] (2024); http://arxiv.org/abs/2407.14981.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Gray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, … R. Lowe, “Training Language Models to Follow Instructions with Human Feedback” in 36th Conference on Neural Information Processing Systems (NeurIPS 2022) (New Orleans, LA, USA, 2022); https://openreview.net/forum?id=TG8KACxEON.
A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, R. Wang, Z. Kolter, M. Fredrikson, D. Hendrycks, Improving Alignment and Robustness with Circuit Breakers. Neural Information Processing Systems, 83345–83373 (2024); https://proceedings.neurips.cc/paper_files/paper/2024/hash/97ca7168c2c333df5ea61ece3b3276e1-Abstract-Conference.html.
* M. Sharma, M. Tong, J. Mu, J. Wei, J. Kruthoff, S. Goodfriend, E. Ong, A. Peng, R. Agarwal, C. Anil, A. Askell, N. Bailey, J. Benton, E. Bluemke, S. R. Bowman, E. Christiansen, H. Cunningham, … E. Perez, Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming, arXiv [cs.CL] (2025); http://arxiv.org/abs/2501.18837.
A. Peng, J. Michael, H. Sleight, E. Perez, M. Sharma, Rapid Response: Mitigating LLM Jailbreaks with a Few Examples, arXiv [cs.CL] (2024); http://arxiv.org/abs/2411.07494.
P. Maini, S. Goyal, D. Sam, A. Robey, Y. Savani, Y. Jiang, A. Zou, Z. C. Lipton, J. Z. Kolter, Safety Pretraining: Toward the next Generation of Safe AI, arXiv [cs.LG] (2025); http://arxiv.org/abs/2504.16980.
* E. Wallace, O. Watkins, M. Wang, K. Chen, C. Koch, “Estimating Worst-Case Frontier Risks of Open-Weight LLMs” (OpenAI, 2025); https://cdn.openai.com/pdf/231bf018-659a-494d-976c-2efdfc72b652/oai_gpt-oss_Model_Safety.pdf.
K. O’Brien, S. Casper, Q. Anthony, T. Korbak, R. Kirk, X. Davies, I. Mishra, G. Irving, Y. Gal, S. Biderman, Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs, arXiv [cs.LG] (2025); http://arxiv.org/abs/2508.06601.
K. Li, Y. Chen, F. Viégas, M. Wattenberg, When Bad Data Leads to Good Models, arXiv [cs.LG] (2025); http://arxiv.org/abs/2505.04741.
A. Paullada, I. D. Raji, E. M. Bender, E. Denton, A. Hanna, Data and Its (dis)contents: A Survey of Dataset Development and Use in Machine Learning Research. Patterns 2, 100336 (2021); https://doi.org/10.1016/j.patter.2021.100336.
L. Berti, F. Giorgi, G. Kasneci, Emergent Abilities in Large Language Models: A Survey, arXiv [cs.LG] (2025); http://arxiv.org/abs/2503.05788.
T. Huang, S. Hu, F. Ilhan, S. F. Tekin, Z. Yahn, Y. Xu, L. Liu, Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable, arXiv [cs.CR] (2025); http://arxiv.org/abs/2503.00555.
P. Peigné, M. Kniejski, F. Sondej, M. David, J. Hoelscher-Obermaier, C. Schroeder de Witt, E. Kran, Multi-Agent Security Tax: Trading off Security and Collaboration Capabilities in Multi-Agent Systems. Proceedings of the 39th AAAI Conference on Artificial Intelligence 39, 27573–27581 (2025); https://doi.org/10.1609/aaai.v39i26.34970.
* D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, G. Irving, “Fine-Tuning Language Models from Human Preferences” (OpenAI, 2020); http://arxiv.org/abs/1909.08593.
S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, T. T. Wang, S. Marks, C.-R. Segerie, M. Carroll, A. Peng, P. Christoffersen, M. Damani, … D. Hadfield-Menell, Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Transactions on Machine Learning Research (2023); https://openreview.net/forum?id=bx24KpJ4Eb.
M. Glickman, T. Sharot, How Human-AI Feedback Loops Alter Human Perceptual, Emotional and Social Judgements. Nature Human Behaviour 9, 345–359 (2025); https://doi.org/10.1038/s41562-024-02077-2.
K. Kobalczyk, M. Van Der Schaar, “Preference Learning for AI Alignment: A Causal Perspective” in Proceedings of the 42nd International Conference on Machine Learning, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, J. Zhu, Eds. (PMLR, 13–19 Jul 2025) vol. 267 of Proceedings of Machine Learning Research, pp. 31063–31083; https://proceedings.mlr.press/v267/kobalczyk25a.html.
* X. Wen, J. Lou, X. Lu, J. Yang, Y. Liu, Y. Lu, D. Zhang, X. Yu, Scalable Oversight for Superhuman AI via Recursive Self-Critiquing, arXiv [cs.AI] (2025); http://arxiv.org/abs/2502.04675.
D. Dai, M. Liu, A. Li, J. Cao, Y. Wang, C. Wang, X. Peng, Z. Zheng, FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks, arXiv [cs.SE] (2025); http://arxiv.org/abs/2504.06939.
* Z. Wang, J. Zeng, O. Delalleau, H.-C. Shin, F. Soares, A. Bukharin, E. Evans, Y. Dong, O. Kuchaiev, HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages, arXiv [cs.CL] (2025); http://arxiv.org/abs/2505.11475.
N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, X. Lyu, Y. Gu, S. Malik, V. Graf, J. D. Hwang, J. Yang, R. Le Bras, O. Tafjord, … H. Hajishirzi, “Tulu 3: Pushing Frontiers in Open Language Model Post-Training” in Second Conference on Language Modeling (2025); https://openreview.net/forum?id=i1uGbfHHpH#discussion.
M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, D. Hendrycks, HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal, arXiv [cs.LG] (2024); http://arxiv.org/abs/2402.04249.
S. Lee, M. Kim, L. Cherif, D. Dobre, J. Lee, S. J. Hwang, K. Kawaguchi, G. Gidel, Y. Bengio, N. Malkin, M. Jain, Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.18540.
* N. Howe, I. McKenzie, O. Hollinsworth, M. Zajac, T. Tseng, A. Tucker, P.-L. Bacon, A. Gleave, Scaling Trends in Language Model Robustness, arXiv [cs.LG] (2024); http://arxiv.org/abs/2407.18213.
* A. Cuevas, S. Dash, B. K. Nayak, D. Vann, M. I. G. Daepp, Anecdoctoring: Automated Red-Teaming across Language and Place, arXiv [cs.CL] (2025); http://arxiv.org/abs/2509.19143.
Z.-W. Hong, I. Shenfeld, T.-H. Wang, Y.-S. Chuang, A. Pareja, J. Glass, A. Srivastava, P. Agrawal, Curiosity-Driven Red-Teaming for Large Language Models, arXiv [cs.LG] (2024); http://arxiv.org/abs/2402.19464.
T. Yun, P.-L. St-Charles, J. Park, Y. Bengio, M. Kim, Active Attacks: Red-Teaming LLMs via Adaptive Environments, arXiv [cs.LG] (2025); http://arxiv.org/abs/2509.21947.
S. Xhonneux, A. Sordoni, S. Günnemann, G. Gidel, L. Schwinn, “Efficient Adversarial Training in LLMs with Continuous Attacks” in 38th Annual Conference on Neural Information Processing Systems (NeurIPS) (2024); https://openreview.net/pdf?id=8jB6sGqvgQ.
S. Casper, L. Schulze, O. Patel, D. Hadfield-Menell, Defending Against Unforeseen Failure Modes with Latent Adversarial Training, arXiv [cs.CR] (2024); http://dx.doi.org/10.48550/arXiv.2403.05030.
A. Sheshadri, A. Ewart, P. Guo, A. Lynch, C. Wu, V. Hebbar, H. Sleight, A. C. Stickland, E. Perez, D. Hadfield-Menell, S. Casper, Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs, arXiv [cs.LG] (2024); http://arxiv.org/abs/2407.15549.
C. Dékány, S. Balauca, R. Staab, D. I. Dimitrov, M. Vechev, MixAT: Combining Continuous and Discrete Adversarial Training for LLMs, arXiv [cs.LG] (2025); http://arxiv.org/abs/2505.16947.
M. Andriushchenko, F. Croce, N. Flammarion, Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks, arXiv [cs.CR] (2024); http://arxiv.org/abs/2404.02151.
* N. Li, Z. Han, I. Steneker, W. Primack, R. Goodside, H. Zhang, Z. Wang, C. Menghini, S. Yue, LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks yet, arXiv [cs.LG] (2024); http://arxiv.org/abs/2408.15221.
* A. Zou, M. Lin, E. Jones, M. Nowak, M. Dziemian, N. Winter, A. Grattan, V. Nathanael, A. Croft, X. Davies, J. Patel, R. Kirk, N. Burnikell, Y. Gal, D. Hendrycks, J. Z. Kolter, M. Fredrikson, Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition, arXiv [cs.AI] (2025); http://arxiv.org/abs/2507.20526.
D. Lüdke, T. Wollschläger, P. Ungermann, S. Günnemann, L. Schwinn, Diffusion LLMs Are Natural Adversaries for Any LLM, arXiv [cs.LG] (2025); http://arxiv.org/abs/2511.00203.
Anthropic, “System Card: Claude Sonnet 4.5” (Anthropic, 2025); https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf.
A. Souly, J. Rando, E. Chapman, X. Davies, B. Hasircioglu, E. Shereen, C. Mougan, V. Mavroudis, E. Jones, C. Hicks, N. Carlini, Y. Gal, R. Kirk, Poisoning Attacks on LLMs Require a near-Constant Number of Poison Samples, arXiv [cs.LG] (2025); http://arxiv.org/abs/2510.07192.
* M. Wang, T. D. la Tour, O. Watkins, A. Makelov, R. A. Chi, S. Miserendino, J. Wang, A. Rajaram, J. Heidecke, T. Patwardhan, D. Mossing, Persona Features Control Emergent Misalignment, arXiv [cs.LG] (2025); http://arxiv.org/abs/2506.19823.
J. Betley, D. C. H. Tan, N. Warncke, A. Sztyber-Betley, X. Bao, M. Soto, N. Labenz, O. Evans, “Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs” in Forty-Second International Conference on Machine Learning (2025); https://openreview.net/forum?id=aOIJ2gVRWW.
Z. Zhang, S. Cui, Y. Lu, J. Zhou, J. Yang, H. Wang, M. Huang, Agent-SafetyBench: Evaluating the Safety of LLM Agents, arXiv [cs.CL] (2024); http://arxiv.org/abs/2412.14470.
X. Li, R. Wang, M. Cheng, T. Zhou, C.-J. Hsieh, “DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLMs Jailbreakers” in Findings of the Association for Computational Linguistics: EMNLP 2024 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024), pp. 13891–13913; https://doi.org/10.18653/v1/2024.findings-emnlp.813.
M. Andriushchenko, A. Souly, M. Dziemian, D. Duenas, M. Lin, J. Wang, D. Hendrycks, A. Zou, Z. Kolter, M. Fredrikson, E. Winsor, J. Wynne, Y. Gal, X. Davies, AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents, arXiv [cs.LG] (2024); http://arxiv.org/abs/2410.09024.
T. Kuntz, A. Duzan, H. Zhao, F. Croce, Z. Kolter, N. Flammarion, M. Andriushchenko, OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents, arXiv [cs.SE] (2025); http://arxiv.org/abs/2506.14866.
A. Naik, P. Quinn, G. Bosch, E. Gouné, F. J. C. Zabala, J. R. Brown, E. J. Young, AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents, arXiv [cs.AI] (2025); http://arxiv.org/abs/2506.04018.
C. Yu, B. Stroebl, D. Yang, O. Papakyriakopoulos, Safety Devolution in AI Agents, arXiv [cs.CY] (2025); http://arxiv.org/abs/2505.14215.
J. Y. F. Chiang, S. Lee, J.-B. Huang, F. Huang, Y. Chen, Why Are Web AI Agents More Vulnerable than Standalone LLMs? A Security Analysis, arXiv [cs.LG] (2025); http://arxiv.org/abs/2502.20383.
B. Cottier, J. You, N. Martemianova, D. Owen, “How Far Behind Are Open Models?” (Epoch AI, 2024); https://epoch.ai/blog/open-models-report.
N. Maslej, L. Fattorini, R. Perrault, Y. Gil, V. Parli, N. Kariuki, E. Capstick, A. Reuel, E. Brynjolfsson, J. Etchemendy, K. Ligett, T. Lyons, J. Manyika, J. C. Niebles, Y. Shoham, R. Wald, T. Walsh, … S. Oak, “The AI Index 2025 Annual Report” (AI Index Steering Committee, Institute for Human-Centered AI, Stanford University, 2025); https://hai.stanford.edu/assets/files/hai_ai_index_report_2025.pdf.
* A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, … Z. Qiu, Qwen3 Technical Report, arXiv [cs.CL] (2025); http://arxiv.org/abs/2505.09388.
* Meta AI, The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation, Meta AI (2025); https://ai.meta.com/blog/llama-4-multimodal-intelligence/.
* Kimi Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, Z. Chen, J. Cui, H. Ding, M. Dong, A. Du, C. Du, D. Du, … X. Zu, Kimi K2: Open Agentic Intelligence, arXiv [cs.LG] (2025); http://arxiv.org/abs/2507.20534.
* OpenAI, S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, B. Barak, A. Bennett, T. Bertao, N. Brett, E. Brevdo, G. Brockman, … S. Zhao, Gpt-Oss-120b & Gpt-Oss-20b Model Card, arXiv [cs.CL] (2025); https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf.
* GLM-4.5 Team, A. Zeng, X. Lv, Q. Zheng, Z. Hou, B. Chen, C. Xie, C. Wang, D. Yin, H. Zeng, J. Zhang, K. Wang, L. Zhong, M. Liu, R. Lu, S. Cao, X. Zhang, … J. Tang, GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models, arXiv [cs.CL] (2025); http://arxiv.org/abs/2508.06471.
D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, … Z. Zhang, DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning. Nature 645, 633–638 (2025); https://doi.org/10.1038/s41586-025-09422-z.
* C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S.-M. Yin, S. Bai, X. Xu, Y. Chen, Y. Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, … Z. Liu, Qwen-Image Technical Report, arXiv [cs.CV] (2025); http://arxiv.org/abs/2508.02324.
* S. Cao, H. Chen, P. Chen, Y. Cheng, Y. Cui, X. Deng, Y. Dong, K. Gong, T. Gu, X. Gu, T. Hang, D. Huang, J. Jiang, Z. Jiang, W. Kong, C. Li, D. Li, … Z. Zhong, HunyuanImage 3.0 Technical Report, arXiv [cs.CV] (2025); http://dx.doi.org/10.48550/arXiv.2509.23951.
* T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, … Z. Liu, Wan: Open and Advanced Large-Scale Video Generative Models, arXiv [cs.CV] (2025); http://dx.doi.org/10.48550/arXiv.2503.20314.
* W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y. Long, A. Wang, A. Wang, C. Li, … C. Zhong, HunyuanVideo: A Systematic Framework for Large Video Generative Models, arXiv [cs.CV] (2024); http://dx.doi.org/10.48550/arXiv.2412.03603.
R. Bommasani, S. Kapoor, K. Klyman, S. Longpre, A. Ramaswami, D. Zhang, M. Schaake, D. E. Ho, A. Narayanan, P. Liang, Considerations for Governing Open Foundation Models. Science 386, 151–153 (2024); https://doi.org/10.1126/science.adp1848.
US National Telecommunications and Information Administration, “Dual-Use Foundation Models with Widely Available Model Weights NTIA Report” (US Department of Commerce, 2024); https://www.ntia.gov/issues/artificial-intelligence/open-model-weights-report.
* E. Seger, B. O’Dell, “Open Horizons: Exploring Nuanced Technical and Policy Approaches to Openness in AI” (Demos and Mozilla, 2024); https://demos.co.uk/wp-content/uploads/2024/08/Mozilla-Report_2024.pdf.
F. Eiras, A. Petrov, B. Vidgen, C. Schroeder, F. Pizzati, K. Elkins, S. Mukhopadhyay, A. Bibi, A. Purewal, C. Botos, F. Steibel, F. Keshtkar, F. Barez, G. Smith, G. Guadagni, J. Chun, J. Cabot, … J. Foerster, Risks and Opportunities of Open-Source Generative AI, arXiv [cs.LG] (2024); http://arxiv.org/abs/2405.08597.
S. Kapoor, R. Bommasani, K. Klyman, S. Longpre, A. Ramaswami, P. Cihon, A. Hopkins, K. Bankston, S. Biderman, M. Bogen, R. Chowdhury, A. Engler, P. Henderson, Y. Jernite, S. Lazar, S. Maffulli, A. Nelson, … A. Narayanan, On the Societal Impact of Open Foundation Models, arXiv [cs.CY] (2024); http://arxiv.org/abs/2403.07918.
C. François, L. Péran, A. Bdeir, N. Dziri, W. Hawkins, Y. Jernite, S. Kapoor, J. Shen, H. Khlaaf, K. Klyman, N. Marda, M. Pellat, D. Raji, D. Siddarth, A. Skowron, J. Spisak, M. Srikumar, … J. Weedon, A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety, arXiv [cs.AI] (2025); http://dx.doi.org/10.48550/arXiv.2506.22183.
Y. Bengio, S. Mindermann, D. Privitera, T. Besiroglu, R. Bommasani, S. Casper, Y. Choi, P. Fox, B. Garfinkel, D. Goldfarb, H. Heidari, A. Ho, S. Kapoor, L. Khalatbari, S. Longpre, S. Manning, V. Mavroudis, … Y. Zeng, “International AI Safety Report” (Department for Science, Innovation and Technology, 2025); https://www.gov.uk/government/publications/international-ai-safety-report-2025.
E. Seger, N. Dreksler, R. Moulange, E. Dardaman, J. Schuett, K. Wei, C. Winter, M. Arnold, S. Ó. hÉigeartaigh, A. Korinek, M. Anderljung, B. Bucknall, A. Chan, E. Stafford, L. Koessler, A. Ovadya, B. Garfinkel, … A. Gupta, “Open-Sourcing Highly Capable Foundation Models: An Evaluation of Risks, Benefits, and Alternative Methods for Pursuing Open-Source Objectives” (Centre for the Governance of AI, 2023); http://arxiv.org/abs/2311.09227.
A. Chan, B. Bucknall, H. Bradley, D. Krueger, Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models, arXiv [cs.LG] (2023); http://arxiv.org/abs/2312.14751.
T. Huang, S. Hu, F. Ilhan, S. F. Tekin, L. Liu, Harmful Fine-Tuning Attacks and Defenses for Large Language Models: A Survey, arXiv [cs.CR] (2024); http://arxiv.org/abs/2409.18169.
IWF, “What Has Changed in the AI CSAM Landscape?” (Internet Watch Foundation, 2024).
W. Hawkins, C. Russell, B. Mittelstadt, Deepfakes on Demand: The Rise of Accessible Non-Consensual Deepfake Image Generators, arXiv [cs.CY] (2025); http://arxiv.org/abs/2505.03859.
E. H. Vaughan, NCMEC Releases New Data: 2024 in Numbers, National Center for Missing & Exploited Children (2025); http://www.ncmec.org/content/ncmec/en/blog/2025/ncmec-releases-new-data-2024-in-numbers.html.
S. Casper, K. O’Brien, S. Longpre, E. Seger, K. Klyman, R. Bommasani, A. Nrusimha, I. Shumailov, S. Mindermann, S. Basart, F. Rudzicz, K. Pelrine, A. Ghosh, A. Strait, R. Kirk, D. Hendrycks, P. Henderson, … D. Hadfield-Menell, Open Technical Problems in Open-Weight AI Model Risk Management, Social Science Research Network (2025); https://doi.org/10.2139/ssrn.5705186.
M. Srikumar, J. Chang, K. Chmielinski, “Risk Mitigation Strategies for the Open Foundation Model Value Chain: Insights from PAI Workshop Co-Hosted with GitHub” (Partnership on AI, 2024); https://partnershiponai.notion.site/1e8a6131dda045f1ad00054933b0bda0?v=dcb890146f7d464a86f11fcd5de372c0.
AI Security Institute, Managing Risks from Increasingly Capable Open-Weight AI Systems, AI Security Institute (2025); https://www.aisi.gov.uk/work/managing-risks-from-increasingly-capable-open-weight-ai-systems.
P. Henderson, E. Mitchell, C. Manning, D. Jurafsky, C. Finn, “Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models” in Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (Association for Computing Machinery, New York, NY, USA, 2023), AIES ’23, pp. 287–296; https://doi.org/10.1145/3600211.3604690.
* C. Fan, J. Jia, Y. Zhang, A. Ramakrishna, M. Hong, S. Liu, Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and beyond, arXiv [cs.LG] (2025); http://arxiv.org/abs/2502.05374.
D. Rosati, J. Wehner, K. Williams, Ł. Bartoszcze, D. Atanasov, R. Gonzales, S. Majumdar, C. Maple, H. Sajjad, F. Rudzicz, Representation Noising Effectively Prevents Harmful Fine-Tuning on LLMs, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.14577.
R. Tamirisa, B. Bharathi, L. Phan, A. Zhou, A. Gatti, T. Suresh, M. Lin, J. Wang, R. Wang, R. Arel, A. Zou, D. Song, B. Li, D. Hendrycks, M. Mazeika, Tamper-Resistant Safeguards for Open-Weight LLMs, arXiv [cs.LG] (2024); http://arxiv.org/abs/2408.00761.
A. Abdalla, I. Shaheen, D. DeGenaro, R. Mallick, B. Raita, S. A. Bargal, GIFT: Gradient-Aware Immunization of Diffusion Models against Malicious Fine-Tuning with Safe Concepts Retention, arXiv [cs.CR] (2025); http://arxiv.org/abs/2507.13598.
* A. F. Cooper, C. A. Choquette-Choo, M. Bogen, M. Jagielski, K. Filippova, K. Z. Liu, A. Chouldechova, J. Hayes, Y. Huang, N. Mireshghallah, I. Shumailov, E. Triantafillou, P. Kairouz, N. Mitchell, P. Liang, D. E. Ho, Y. Choi, … K. Lee, Machine Unlearning Doesn’t Do What You Think: Lessons for Generative AI Policy, Research, and Practice, arXiv [cs.LG] (2024); http://arxiv.org/abs/2412.06966.
J. Łucki, B. Wei, Y. Huang, P. Henderson, F. Tramèr, J. Rando, An Adversarial Perspective on Machine Unlearning for AI Safety, arXiv [cs.LG] (2024); http://arxiv.org/abs/2409.18025.
S. Hu, Y. Fu, Z. S. Wu, V. Smith, Jogging the Memory of Unlearned LLMs through Targeted Relearning Attacks, arXiv [cs.LG] (2024); http://arxiv.org/abs/2406.13356.
* A. Deeb, F. Roger, Do Unlearning Methods Remove Information from Language Model Weights?, arXiv [cs.LG] (2024); http://arxiv.org/abs/2410.08827.
Z. Che, S. Casper, R. Kirk, A. Satheesh, S. Slocum, L. E. McKinney, R. Gandikota, A. Ewart, D. Rosati, Z. Wu, Z. Cai, B. Chughtai, Y. Gal, F. Huang, D. Hadfield-Menell, Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities, arXiv [cs.CR] (2025); http://arxiv.org/abs/2502.05209.
J. Luo, T. Ding, K. H. R. Chan, D. Thaker, A. Chattopadhyay, C. Callison-Burch, R. Vidal, PaCE: Parsimonious Concept Engineering for Large Language Models, arXiv [cs.CL] (2024); http://arxiv.org/abs/2406.04331.
D. Gottesman, M. Geva, “Estimating Knowledge in Large Language Models without Generating a Single Token” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024), pp. 3994–4019; https://doi.org/10.18653/v1/2024.emnlp-main.232.
O. Aarne, T. Fist, C. Withers, “Secure, Governable Chips: Using On-Chip Mechanisms to Manage National Security Risks from AI & Advanced Computing” (Center for a New American Security, 2024); https://s3.us-east-1.amazonaws.com/files.cnas.org/documents/CNAS-Report-Tech-Secure-Chips-Jan-24-finalb.pdf.
A. O’Gara, G. Kulp, W. Hodgkins, J. Petrie, V. Immler, A. Aysu, K. Basu, S. Bhasin, S. Picek, A. Srivastava, Hardware-Enabled Mechanisms for Verifying Responsible AI Development, arXiv [cs.CR] (2025); http://arxiv.org/abs/2505.03742.
C. Yueh-Han, N. Joshi, Y. Chen, M. Andriushchenko, R. Angell, H. He, Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors, arXiv [cs.CR] (2025); http://arxiv.org/abs/2506.10949.
D. Brown, M. Sabbaghi, L. Sun, A. Robey, G. J. Pappas, E. Wong, H. Hassani, Benchmarking Misuse Mitigation against Covert Adversaries, arXiv [cs.CR] (2025); http://arxiv.org/abs/2506.06414.
* I. R. McKenzie, O. J. Hollinsworth, T. Tseng, X. Davies, S. Casper, A. D. Tucker, R. Kirk, A. Gleave, STACK: Adversarial Attacks on LLM Safeguard Pipelines, arXiv [cs.CL] (2025); http://arxiv.org/abs/2506.24068.
N. Kirch, C. Weisser, S. Field, H. Yannakoudakis, S. Casper, What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms behind Attacks, arXiv [cs.CR] (2024); https://doi.org/10.48550/ARXIV.2411.03343.
N. Goldowsky-Dill, B. Chughtai, S. Heimersheim, M. Hobbhahn, Detecting Strategic Deception Using Linear Probes, arXiv [cs.LG] (2025); http://arxiv.org/abs/2502.03407.
* B. Baker, J. Huizinga, L. Gao, Z. Dou, M. Y. Guan, A. Madry, W. Zaremba, J. Pachocki, D. Farhi, “Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation” (OpenAI, 2025); https://arxiv.org/abs/2503.11926.
* T. Korbak, M. Balesni, E. Barnes, Y. Bengio, J. Benton, J. Bloom, M. Chen, A. Cooney, A. Dafoe, A. Dragan, S. Emmons, O. Evans, D. Farhi, R. Greenblatt, D. Hendrycks, M. Hobbhahn, E. Hubinger, … V. Mikulik, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, arXiv [cs.AI] (2025); http://arxiv.org/abs/2507.11473.
R. Greenblatt, B. Shlegeris, K. Sachan, F. Roger, AI Control: Improving Safety Despite Intentional Subversion, arXiv [cs.LG] (2023); http://dx.doi.org/10.48550/arXiv.2312.06942.
Y. Bengio, M. K. Cohen, N. Malkin, M. MacDermott, D. Fornasiere, P. Greiner, Y. Kaddar, Can a Bayesian Oracle Prevent Harm from an Agent?, arXiv [cs.AI] (2024); http://arxiv.org/abs/2408.05284.
T. Korbak, J. Clymer, B. Hilton, B. Shlegeris, G. Irving, A Sketch of an AI Control Safety Case, arXiv [cs.AI] (2025); http://arxiv.org/abs/2501.17315.
V. Kovarik, E. O. Chen, S. Petersen, A. Ghersengorin, V. Conitzer, AI Testing Should Account for Sophisticated Strategic Behaviour, arXiv [cs.GT] (2025); http://arxiv.org/abs/2508.14927.
* OpenAI, “Operator System Card” (OpenAI, 2025); https://cdn.openai.com/operator_system_card.pdf.
L. Zhu, Q. Lu, D. Ming, S. U. Lee, C. Wang, Designing Meaningful Human Oversight in AI, Social Science Research Network (2025); https://doi.org/10.2139/ssrn.5501939.
* H. Mozannar, G. Bansal, C. Tan, A. Fourney, V. Dibia, J. Chen, J. Gerrits, T. Payne, M. K. Maldaner, M. Grunde-McLaughlin, E. Zhu, G. Bassman, J. Alber, P. Chang, R. Loynd, F. Niedtner, E. Kamar, … S. Amershi, Magentic-UI: Towards Human-in-the-Loop Agentic Systems, arXiv [cs.AI] (2025); http://arxiv.org/abs/2507.22358.
T. Hua, J. Baskerville, H. Lemoine, M. Hopman, A. Bhatt, T. Tracy, “Combining Cost Constrained Runtime Monitors for AI Safety” in The 39th Annual Conference on Neural Information Processing Systems (2025); https://openreview.net/forum?id=hVR3023UP2.
A. McKenzie, U. Pawar, P. Blandfort, W. Bankes, D. Krueger, E. S. Lubana, D. Krasheninnikov, “Detecting High-Stakes Interactions with Activation Probes” in The 39th Annual Conference on Neural Information Processing Systems (2025); https://openreview.net/forum?id=8YniJnJQ0P.
* OpenAI, “GPT-4o System Card” (OpenAI, 2024); https://cdn.openai.com/gpt-4o-system-card.pdf.
* OpenAI, “ChatGPT Agent System Card” (2025); https://cdn.openai.com/pdf/839e66fc-602c-48bf-81d3-b21eacc3459d/chatgpt_agent_system_card.pdf.
Organisation for Economic Co-Operation and Development, “Towards a Common Reporting Framework for AI Incidents” (OECD, 2025); https://doi.org/10.1787/f326d4ac-en.
N. Yu, V. Skripniuk, S. Abdelnabi, M. Fritz, “Artificial Fingerprinting for Generative Models: Rooting Deepfake Attribution in Training Data” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (IEEE, 2021); https://doi.org/10.1109/iccv48922.2021.01418.
F. Boenisch, A Systematic Review on Model Watermarking for Neural Networks. Frontiers in Big Data 4 (2021); https://www.frontiersin.org/articles/10.3389/fdata.2021.729663/full.
P. Fernandez, G. Couairon, H. Jégou, M. Douze, T. Furon, “The Stable Signature: Rooting Watermarks in Latent Diffusion Models” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023), pp. 22409–22420; https://doi.org/10.1109/ICCV51070.2023.02053.
M. Christ, S. Gunn, T. Malkin, M. Raykova, Provably Robust Watermarks for Open-Source Language Models, arXiv [cs.CR] (2024); http://arxiv.org/abs/2410.18861.
* X. Xu, Y. Yao, Y. Liu, Learning to Watermark LLM-Generated Text via Reinforcement Learning, arXiv [cs.LG] (2024); http://arxiv.org/abs/2403.10553.
G. Pagnotta, D. Hitaj, B. Hitaj, F. Perez-Cruz, L. V. Mancini, TATTOOED: A Robust Deep Neural Network Watermarking Scheme Based on Spread-Spectrum Channel Coding, arXiv [cs.CR] (2022); http://arxiv.org/abs/2202.06091.
P. Lv, P. Li, S. Zhang, K. Chen, R. Liang, H. Ma, Y. Zhao, Y. Li, A Robustness-Assured White-Box Watermark in Neural Networks. IEEE Transactions on Dependable and Secure Computing 20, 5214–5229 (2023); https://doi.org/10.1109/tdsc.2023.3242737.
L. Li, B. Jiang, P. Wang, K. Ren, H. Yan, X. Qiu, “Watermarking LLMs with Weight Quantization” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, K. Bali, Eds. (Association for Computational Linguistics, Singapore, 2023), pp. –3378; https://doi.org/10.18653/v1/2023.findings-emnlp.220.
* A. Block, A. Sekhari, A. Rakhlin, GaussMark: A Practical Approach for Structural Watermarking of Language Models, arXiv [cs.CR] (2025); http://arxiv.org/abs/2501.13941.
T. Gloaguen, N. Jovanović, R. Staab, M. Vechev, Towards Watermarking of Open-Source LLMs, arXiv [cs.CR] (2025); http://arxiv.org/abs/2502.10525.
S. Zhu, A. Ahmed, R. Kuditipudi, P. Liang, Independence Tests for Language Models, arXiv [cs.LG] (2025); http://arxiv.org/abs/2502.12292.
E. Horwitz, A. Shul, Y. Hoshen, Unsupervised Model Tree Heritage Recovery, arXiv [cs.LG] (2024); http://arxiv.org/abs/2405.18432.
E. Horwitz, N. Kurer, J. Kahana, L. Amar, Y. Hoshen, We Should Chart an Atlas of All the World’s Models, arXiv [cs.LG] (2025); http://arxiv.org/abs/2503.10633.
Organisation for Economic Co-Operation and Development, “Sharing Trustworthy AI Models with Privacy-Enhancing Technologies” (OECD, 2025); https://doi.org/10.1787/a266160b-en.
S. Chappidi, J. Cobbe, C. Norval, A. Mazumder, J. Singh, Accountability Capture: How Record-Keeping to Support AI Transparency and Accountability (re)shapes Algorithmic Oversight. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 8, 554–566 (2025); https://doi.org/10.1609/aies.v8i1.36570.
X. Zhao, S. Gunn, M. Christ, J. Fairoze, A. Fabrega, N. Carlini, S. Garg, S. Hong, M. Nasr, F. Tramer, S. Jha, L. Li, Y.-X. Wang, D. Song, SoK: Watermarking for AI-Generated Content, arXiv [cs.CR] (2024); http://arxiv.org/abs/2411.18479.
* L. Cao, Watermarking for AI Content Detection: A Review on Text, Visual, and Audio Modalities, arXiv [cs.CR] (2025); http://arxiv.org/abs/2504.03765.
S. Dathathri, A. See, S. Ghaisas, P.-S. Huang, R. McAdam, J. Welbl, V. Bachani, A. Kaskasoli, R. Stanforth, T. Matejovicova, J. Hayes, N. Vyas, M. A. Merey, J. Brown-Cohen, R. Bunel, B. Balle, T. Cemgil, … P. Kohli, Scalable Watermarking for Identifying Large Language Model Outputs. Nature 634, 818–823 (2024); https://doi.org/10.1038/s41586-024-08025-4.
A. Liu, L. Pan, Y. Lu, J. Li, X. Hu, X. Zhang, L. Wen, I. King, H. Xiong, P. Yu, A Survey of Text Watermarking in the Era of Large Language Models. ACM Computing Surveys 57, 1–36 (2025); https://doi.org/10.1145/3691626.
Z. Yang, G. Zhao, H. Wu, Watermarking for Large Language Models: A Survey. Mathematics 13, 1420 (2025); https://doi.org/10.3390/math13091420.
W. Wan, J. Wang, Y. Zhang, J. Li, H. Yu, J. Sun, A Comprehensive Survey on Robust Image Watermarking. Neurocomputing 488, 226–247 (2022); https://doi.org/10.1016/j.neucom.2022.02.083.
M. S. Uddin, Ohidujjaman, M. Hasan, T. Shimamura, Audio Watermarking: A Comprehensive Review. International Journal of Advanced Computer Science and Applications 15 (2024); https://doi.org/10.14569/IJACSA.2024.01505141.
S. Singhi, A. Yadav, A. Gupta, S. Ebrahimi, P. Hassanizadeh, Provenance Detection for AI-Generated Images: Combining Perceptual Hashing, Homomorphic Encryption, and AI Detection Models, arXiv [cs.CV] (2025); http://arxiv.org/abs/2503.11195.
R. Chen, Y. Wu, J. Guo, H. Huang, Improved Unbiased Watermark for Large Language Models, arXiv [cs.CL] (2025); http://arxiv.org/abs/2502.11268.
C2PA Technical Working Group, “C2PA Content Credentials Explained: Addressing Common Questions and Updates” (C2PA, 2025); https://c2pa.org/wp-content/uploads/sites/33/2025/10/content_credentials_wp_0925.pdf.
A. Knott, D. Pedreschi, R. Chatila, T. Chakraborti, S. Leavy, R. Baeza-Yates, D. Eyers, A. Trotman, P. D. Teal, P. Biecek, S. Russell, Y. Bengio, Generative AI Models Should Include Detection Mechanisms as a Condition for Public Release. Ethics and Information Technology 25, 55 (2023); https://doi.org/10.1007/s10676-023-09728-4.
L. Lin, N. Gupta, Y. Zhang, H. Ren, C.-H. Liu, F. Ding, X. Wang, X. Li, L. Verdoliva, S. Hu, Detecting Multimedia Generated by Large AI Models: A Survey, arXiv [cs.MM] (2024); https://www.techrxiv.org/users/723084/articles/707949-detecting-multimedia-generated-by-large-ai-models-a-survey?commit=17e92ea8d954e6c448a006d4f4e7fd594c9f6f0d.
A. Hans, A. Schwarzschild, V. Cherepanova, H. Kazemi, A. Saha, M. Goldblum, J. Geiping, T. Goldstein, Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text, arXiv [cs.CL] (2024); http://arxiv.org/abs/2401.12070.
* V. Pirogov, M. Artemev, Evaluating Deepfake Detectors in the Wild, arXiv [cs.CV] (2025); http://arxiv.org/abs/2507.21905.
W. Warby, Green Chameleon on a Branch (2024); https://unsplash.com/photos/lJAYYVG2V4Y.
* S. Gowal, R. Bunel, F. Stimberg, D. Stutz, G. Ortiz-Jimenez, C. Kouridi, M. Vecerik, J. Hayes, S.-A. Rebuffi, P. Bernard, C. Gamble, M. Z. Horváth, F. Kaczmarczyck, A. Kaskasoli, A. Petrov, I. Shumailov, M. Thotakuri, … P. Kohli, SynthID-Image: Image Watermarking at Internet Scale, arXiv [cs.CR] (2025); http://arxiv.org/abs/2510.09263.
J. Cao, Q. Li, Z. Zhang, J. Ni, Secure and Robust Watermarking for AI-Generated Images: A Comprehensive Survey, arXiv [cs.CR] (2025); http://arxiv.org/abs/2510.02384.
Y. Chen, Z. Ma, H. Fang, W. Zhang, N. Yu, TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity, arXiv [cs.MM] (2025); http://arxiv.org/abs/2506.23484.
T. South, Ed., “Identity Management for Agentic AI” (OpenID, 2025); https://openid.net/wp-content/uploads/2025/10/Identity-Management-for-Agentic-AI.pdf.
A. Chan, N. Kolt, P. Wills, U. Anwar, C. S. de Witt, N. Rajkumar, L. Hammond, D. Krueger, L. Heim, M. Anderljung, IDs for AI Systems, arXiv [cs.AI] (2024); http://arxiv.org/abs/2406.12137.
T. South, S. Marro, T. Hardjono, R. Mahari, C. D. Whitney, D. Greenwood, A. Chan, A. Pentland, Authenticated Delegation and Authorized AI Agents, arXiv [cs.CY] (2025); http://arxiv.org/abs/2501.09674.
European Commission, The General-Purpose AI Code of Practice. (2025); https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai.
S. Wiener, S. Rubio, SB-53 Artificial Intelligence Models: Large Developers (2025); https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202520260SB53.
G7, OECD, G7 Reporting Framework – Hiroshima AI Process (HAIP) International Code of Conduct for Organizations Developing Advanced AI Systems. (2025); https://transparency.oecd.ai/.
OECD, Launch of the Hiroshima AI Process (HAIP) Reporting Framework, OECD (2025); https://www.oecd.org/en/events/2025/02/launch-of-the-hiroshima-ai-process-reporting-framework.html.
K. Perset, J. Healy, S. F. Esposito, “Shaping Trustworthy AI: Early Insights from the Hiroshima AI Process Reporting Framework” (OECDAI, 2025); https://oecd.ai/en/work/haip-reporting-insights.
OECD, Submitted Reports – HAIP Reporting Framework (2025); https://transparency.oecd.ai/reports.
ASEAN, “Expanded ASEAN Guide on AI Governance and Ethics - Generative AI” (ASEAN, 2025); https://asean.org/book/expanded-asean-guide-on-ai-governance-and-ethics-generative-ai/.
K. Choi, Analyzing South Korea’s Framework Act on the Development of AI, IAPP (2025); https://iapp.org/news/a/analyzing-south-korea-s-framework-act-on-the-development-of-ai.
Ministry of Science and ICT (과학기술정보통신부), Framework Act on the Development of Artificial Intelligence and the Establishment of a Foundation for Trust (인공지능 발전과 신뢰 기반 조성 등에 관한 기본법) (2025); https://www.law.go.kr/%EB%B2%95%EB%A0%B9/%EC%9D%B8%EA%B3%B5%EC%A7%80%EB%8A%A5%20%EB%B0%9C%EC%A0%84%EA%B3%BC%20%EC%8B%A0%EB%A2%B0%20%EA%B8%B0%EB%B0%98%20%EC%A1%B0%EC%84%B1%20%EB%93%B1%EC%97%90%20%EA%B4%80%ED%95%9C%20%EA%B8%B0%EB%B3%B8%EB%B2%95/%2820676,20250121%29.
METR, Frontier AI Safety Policies (2025); https://metr.org/.
Frontier Model Forum, “Risk Taxonomy and Thresholds for Frontier AI Frameworks” (2025); https://www.frontiermodelforum.org/technical-reports/risk-taxonomy-and-thresholds/.
M. D. Buhl, B. Bucknall, T. Masterson, Emerging Practices in Frontier AI Safety Frameworks, arXiv [cs.CY] (2025); http://arxiv.org/abs/2503.04746.
METR, Forecasting the Impacts of AI R&D Acceleration: Results of a Pilot Study (2025); https://metr.org/blog/2025-08-20-forecasting-impacts-of-ai-acceleration/.
J. Wang, K. Huang, K. Klyman, R. Bommasani, Do AI Companies Make Good on Voluntary Commitments to the White House?, arXiv [cs.CY] (2025); http://arxiv.org/abs/2508.08345.
Future of Life Institute, “AI Safety Index: Summer 2025” (Future of Life Institute, 2025); https://futureoflife.org/wp-content/uploads/2025/07/FLI-AI-Safety-Index-Report-Summer-2025.pdf.
S. Campos, H. Papadatos, F. Roger, C. Touzet, O. Quarks, M. Murray, A Frontier AI Risk Management Framework: Bridging the Gap between Current AI Practices and Established Risk Management, arXiv [cs.AI] (2025); http://arxiv.org/abs/2502.06656.
I. Habli, R. Hawkins, C. Paterson, P. Ryan, Y. Jia, M. Sujan, J. McDermid, The BIG Argument for AI Safety Cases, arXiv [cs.CY] (2025); http://arxiv.org/abs/2503.11705.
J. Clymer, J. Weinbaum, R. Kirk, K. Mai, S. Zhang, X. Davies, An Example Safety Case for Safeguards against Misuse, arXiv [cs.LG] (2025); http://arxiv.org/abs/2505.18003.
J. Clymer, N. Gabrieli, D. Krueger, T. Larsen, Safety Cases: How to Justify the Safety of Advanced AI Systems, arXiv [cs.CY] (2024); http://arxiv.org/abs/2403.10462.
A. Goemans, M. D. Buhl, J. Schuett, T. Korbak, J. Wang, B. Hilton, G. Irving, Safety Case Template for Frontier AI: A Cyber Inability Argument, arXiv [cs.CY] (2024); http://arxiv.org/abs/2411.08088.
M. D. Buhl, G. Sett, L. Koessler, J. Schuett, M. Anderljung, Safety Cases for Frontier AI, arXiv [cs.CY] (2024); http://arxiv.org/abs/2410.21572.
M. D. Buhl, J. Pfau, B. Hilton, G. Irving, An Alignment Safety Case Sketch Based on Debate, arXiv [cs.AI] (2025); http://arxiv.org/abs/2505.03989.
B. Hilton, M. D. Buhl, T. Korbak, G. Irving, “Safety Cases: A Scalable Approach to Frontier AI Safety” (AI Security Institute, 2025); https://doi.org/10.48550/arXiv.2503.04744.
* Anthropic, Anthropic’s Responsible Scaling Policy, Version 1.0. (2023); https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf.
* Google DeepMind, Frontier Safety Framework Version 3.0. (2025); https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/strengthening-our-frontier-safety-framework/frontier-safety-framework_3.pdf.
K. Perset, S. Fialho Esposito, “How Are AI Developers Managing Risks?” (OECD, 2025); https://doi.org/10.1787/658c2ad6-en.
L. Staufer, M. Yang, A. Reuel, S. Casper, Audit Cards: Contextualizing AI Evaluations, arXiv [cs.CY] (2025); http://arxiv.org/abs/2504.13839.