International AI Safety Report 2025: Second Key Update: Technical Safeguards and Risk Management

Yoshua Bengio; Stephen Clare; Carina Prunkl; Maksym Andriushchenko; Benjamin Bucknall; Philip Fox; Nestor Maslej; Conor McGlynn; Malcolm Murray; Shalaleh Rismani; Stephen Casper; Jessica Newman; Daniel Privitera; Sören Mindermann; Daron Acemoglu; Thomas G. Dietterich; Fredrik Heintz; Geoffrey Hinton; Nick Jennings; Susan Leavy; Teresa Ludermir; Vidushi Marda; Helen Margetts; John McDermid; Jane Munga; Arvind Narayanan; Alondra Nelson; Clara Neppel; Sarvapali D. (Gopal) Ramchurn; Stuart Russell; Marietje Schaake; Bernhard Schölkopf; Alvaro Soto; Lee Tiedrich; Gaël Varoquaux; Andrew Yao; Ya-Qin Zhang

doi:10.70777/si.v2i4.16671

Authors

Yoshua Bengio Université de Montréal / LawZero / Mila – Quebec AI Institute
Stephen Clare Independent
Carina Prunkl Inria
Maksym Andriushchenko ELLIS Institute Tübingen
Ben Bucknall University of Oxford
Philip Fox KIRA Center
Nestor Maslej Stanford University
Conor McGlynn Harvard University
Malcolm Murray SaferAI
Shalaleh Rismani Mila - Quebec AI Institute
Stephen Casper Massachusetts Institute of Technology
Jessica Newman University of California, Berkeley
Daniel Privitera KIRA Center
Sören Mindermann Independent
Daron Acemoglu Massachusetts Institute of Technology
Thomas G. Dietterich Oregon State University
Fredrik Heintz Linköping University
Geoffrey Hinton University of Toronto
Nick Jennings Vice-Chancellor and President of Loughborough University
Susan Leavy University College Dublin
Teresa Ludermir Federal University of Pernambuco
Vidushi Marda AI Collaborative
Helen Margetts University of Oxford
John McDermid University of York
Jane Munga Carnegie Endowment for International Peace
Arvind Narayanan Princeton University
Alondra Nelson Institute for Advanced Study
Clara Neppel IEEE
Sarvapali D. (Gopal) Ramchurn Responsible AI UK
Stuart Russell University of California, Berkeley
Marietje Schaake Stanford University
Bernhard Schölkopf ELLIS Institute Tübingen
Alvaro Soto Pontificia Universidad Católica de Chile
Lee Tiedrich University of Maryland; Duke
Gaël Varoquaux Inria
Andrew Yao Tsinghua University
Ya-Qin Zhang Tsinghua University


Abstract

This is the Second Key Update to the 2025 International AI Safety Report. The First Key Update (1) discussed developments in the capabilities of general-purpose AI models and systems and associated risks. This Key Update covers how various actors, including researchers, companies, and governments, are approaching risk management and technical mitigations for AI.

The past year has seen important developments in AI risk management, including better techniques for training safer models and monitoring their outputs. While this represents tangible progress, significant gaps remain. It is often uncertain how effective current measures are at preventing harms, and effectiveness varies across time and applications. There are many opportunities to further strengthen existing safeguard techniques and to develop new ones.

This Key Update provides a concise overview of critical developments in risk management practices and technical risk mitigation since the publication of the 2025 AI Safety Report in January. It highlights where progress is being made and where gaps remain. Above all, it aims to support policymakers, researchers, and the public in navigating a rapidly changing environment, helping them to make informed and timely decisions about the governance of general-purpose AI.

Professor Yoshua Bengio
Université de Montréal / LawZero / Mila – Quebec AI Institute & Chair

Author Biographies

Yoshua Bengio, Université de Montréal / LawZero / Mila – Quebec AI Institute

Recognized worldwide as one of the leading experts in artificial intelligence, Yoshua Bengio is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of Computing,” with Geoffrey Hinton and Yann LeCun. He is the computer scientist with the highest citation count and h-index.

He is Full Professor at Université de Montréal, Co-President and Scientific Director of LawZero and Founder and Scientific Advisor of Mila – Quebec AI Institute. He co-directs the CIFAR Learning in Machines & Brains program and acts as Special Advisor and Founding Scientific Director of IVADO.

He has received numerous awards, including the prestigious Killam Prize and Herzberg Gold Medal in Canada, CIFAR’s AI Chair, Spain’s Princess of Asturias Award, and the VinFuture Prize. He is a Fellow of both the Royal Society of London and the Royal Society of Canada, a Knight of the Legion of Honor of France, an Officer of the Order of Canada, and a Member of the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology. In 2024, Yoshua Bengio was named one of TIME magazine’s 100 most influential people in the world.

Concerned about the social impact of AI, he actively contributed to the Montreal Declaration for the Responsible Development of Artificial Intelligence and currently chairs the International AI Safety Report.

Stephen Clare, Independent

Stephen is a Lead Writer working on the next edition of the International AI Safety Report. He was formerly a Research Manager at GovAI.

Ben Bucknall, University of Oxford

I'm currently a DPhil (PhD) student in Engineering Science at the University of Oxford where my research focusses on technical AI governance. I'm incredibly fortunate to be supervised by Michael Osborne (Department of Engineering Science) and Robert Trager (Blavatnik School of Government), and funded by an EPSRC Doctoral Training Partnership. I'm excited to be spending Autumn 2025 as a visiting student researcher at the Stanford Trustworthy AI Research (STAIR) lab, supervised by Prof. Sanmi Koyejo.

I'm also an affiliate at the Oxford Martin AI Governance Initiative, and, until recently, was a research scholar at the Centre for the Governance of AI (GovAI) and a technical advisor at the UK's AI Security Institute. I served as co-principal organiser for the inaugural Workshop on Technical AI Governance at ICML 2025.

In previous lives I've studied mathematics at Durham University and computational science at Uppsala University; worked as an actor in northern Finland; completed Prague Marathon in 3:57; solo backpacked from the Baltics to the Balkans; and earned a diploma (DipABRSM) on the violin.

Stephen Casper, Massachusetts Institute of Technology

Hi, I’m Stephen Casper, but most people call me Cas. I work on technical AI governance. I’m a fourth-year PhD student at MIT in Computer Science (EECS) in the Algorithmic Alignment Group, advised by Dylan Hadfield-Menell. I’m also leading a research stream for MATS, and I was a writer for the International AI Safety Report and the Singapore Consensus. I’m supported by the Vitalik Buterin Fellowship from the Future of Life Institute. Formerly, I have worked with the Harvard Kreiman Lab and the Center for Human-Compatible AI.

Stalk me on Google Scholar, Twitter, and BlueSky. See also my core beliefs about AI risks and my thoughts on reframing AI safety as a neverending institutional challenge. I also have a personal feedback form. Feel free to use it to send me anonymous, constructive feedback about how I can be better.

Papers
2025
Tegmark, M., Song, D., Xue, L., Ong, L., Russell, S., Maharaj, T., Zhang, Y.-Q., Bengio, Y., Mindermann, S., Casper, S., Lee, W. S., & Wilfred, V. (2025). The Singapore Consensus on Global AI Safety Research Priorities.
Staufer, L., Yang, M., Reuel, A., & Casper, S. (2025). Audit Cards: Contextualizing AI Evaluations. arXiv preprint arXiv:2504.13839.
Casper, S., Bailey, L., & Schreier, T. (2025). Practical Principles for AI Cost and Compute Accounting. arXiv preprint arXiv:2502.15873.
Schwinn, L., Scholten, Y., Wollschläger, T., Xhonneux, S., Casper, S., Günnemann, S., & Gidel, G. (2025). Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives. arXiv preprint arXiv:2502.11910.
Casper, S., Krueger, D., & Hadfield-Menell, D. (2025). Pitfalls of Evidence-Based AI Policy. ICLR 2025 Blog Post.
Khan, A., Casper, S., & Hadfield-Menell, D. (2025). Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs. Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency.
Che, Z.,* Casper, S.,* Kirk, R., Satheesh, A., Slocum, S., McKinney, L. E., … & Hadfield-Menell, D. (2025). Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities. arXiv preprint arXiv:2502.05209.
Casper, S., Bailey, L., Hunter, R., Ezell, C., Cabalé, E., Gerovitch, M., … & Kolt, N. (2025). The AI Agent Index. arXiv preprint arXiv:2502.01635.
Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., … & Zeng, Y. (2025). International AI Safety Report. arXiv preprint arXiv:2501.17805.
Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., … Casper, S., … & McGrath, T. (2025). Open Problems in Mechanistic Interpretability. arXiv preprint arXiv:2501.16496.
Barez, F., Fu, T., Prabhu, A., Casper, S., Sanyal, A., Bibi, A., … & Gal, Y. (2025). Open Problems in Machine Unlearning for AI Safety. arXiv preprint arXiv:2501.04952.

Daniel Privitera, KIRA Center

(Interim Lead 2026)

Sören Mindermann, Independent

(Interim Lead 2026)

Teresa Ludermir, Federal University of Pernambuco

PhD, Imperial College of Science, Technology and Medicine, 1990. Full Professor (Profa. Titular), Centro de Informática, Universidade Federal de Pernambuco, Grupo de Inteligência Computacional.
Research Interests: Neural Networks, Machine Learning, Artificial Intelligence, Hybrid Intelligent Systems

References

An asterisk (*) denotes that the reference was either published by an AI company or at least 50% of the authors of a preprint have a for-profit AI company as their affiliation.

Y. Bengio, S. Clare, C. Prunkl, M. Andriushchenko, B. Bucknall, P. Fox, T. Hu, C. Jones, S. Manning, N. Maslej, V. Mavroudis, C. McGlynn, M. Murray, S. Rismani, C. Stix, L. Velasco, N. Wheeler, … Y.-Q. Zhang, “International AI Safety Report: First Key Update: Capabilities and Risk Implications” (DSIT, 2025); https://internationalaisafetyreport.org/publication/first-key-update-capabilities-and-risk-implications.

* OpenAI, “GPT-5 System Card” (OpenAI, 2025); https://cdn.openai.com/gpt-5-system-card.pdf.

* Google, “Gemini 2.5 Deep Think - Model Card” (Google, 2025); https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Deep-Think-Model-Card.pdf.

* Anthropic, “System Card: Claude Opus 4 & Claude Sonnet 4” (Anthropic, 2025); https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5ae987bc64dd.pdf.

Y. Bengio, T. Maharaj, L. Ong, S. Russell, D. Song, M. Tegmark, L. Xue, Y.-Q. Zhang, S. Casper, W. S. Lee, S. Mindermann, V. Wilfred, V. Balachandran, F. Barez, M. Belinsky, I. Bello, M. Bourgon, … D. Žikelić, The Singapore Consensus on Global AI Safety Research Priorities, arXiv [cs.AI] (2025); http://arxiv.org/abs/2506.20702.

A. Reuel, B. Bucknall, S. Casper, T. Fist, L. Soder, O. Aarne, L. Hammond, L. Ibrahim, A. Chan, P. Wills, M. Anderljung, B. Garfinkel, L. Heim, A. Trask, G. Mukobi, R. Schaeffer, M. Baker, … R. Trager, Open Problems in Technical AI Governance, arXiv [cs.CY] (2024); http://arxiv.org/abs/2407.14981.

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Gray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, … R. Lowe, “Training Language Models to Follow Instructions with Human Feedback” in 36th Conference on Neural Information Processing Systems (NeurIPS 2022) (New Orleans, LA, USA, 2022); https://openreview.net/forum?id=TG8KACxEON.

A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, R. Wang, Z. Kolter, M. Fredrikson, D. Hendrycks, Improving Alignment and Robustness with Circuit Breakers. Neural Information Processing Systems, 83345–83373 (2024); https://proceedings.neurips.cc/paper_files/paper/2024/hash/97ca7168c2c333df5ea61ece3b3276e1-Abstract-Conference.html.

* M. Sharma, M. Tong, J. Mu, J. Wei, J. Kruthoff, S. Goodfriend, E. Ong, A. Peng, R. Agarwal, C. Anil, A. Askell, N. Bailey, J. Benton, E. Bluemke, S. R. Bowman, E. Christiansen, H. Cunningham, … E. Perez, Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming, arXiv [cs.CL] (2025); http://arxiv.org/abs/2501.18837.

A. Peng, J. Michael, H. Sleight, E. Perez, M. Sharma, Rapid Response: Mitigating LLM Jailbreaks with a Few Examples, arXiv [cs.CL] (2024); http://arxiv.org/abs/2411.07494.

P. Maini, S. Goyal, D. Sam, A. Robey, Y. Savani, Y. Jiang, A. Zou, Z. C. Lipton, J. Z. Kolter, Safety Pretraining: Toward the next Generation of Safe AI, arXiv [cs.LG] (2025); http://arxiv.org/abs/2504.16980.

* E. Wallace, O. Watkins, M. Wang, K. Chen, C. Koch, “Estimating Worst-Case Frontier Risks of Open-Weight LLMs” (OpenAI, 2025); https://cdn.openai.com/pdf/231bf018-659a-494d-976c-2efdfc72b652/oai_gpt-oss_Model_Safety.pdf.

K. O’Brien, S. Casper, Q. Anthony, T. Korbak, R. Kirk, X. Davies, I. Mishra, G. Irving, Y. Gal, S. Biderman, Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs, arXiv [cs.LG] (2025); http://arxiv.org/abs/2508.06601.

K. Li, Y. Chen, F. Viégas, M. Wattenberg, When Bad Data Leads to Good Models, arXiv [cs.LG] (2025); http://arxiv.org/abs/2505.04741.

A. Paullada, I. D. Raji, E. M. Bender, E. Denton, A. Hanna, Data and Its (dis)contents: A Survey of Dataset Development and Use in Machine Learning Research. Patterns 2, 100336 (2021); https://doi.org/10.1016/j.patter.2021.100336.

L. Berti, F. Giorgi, G. Kasneci, Emergent Abilities in Large Language Models: A Survey, arXiv [cs.LG] (2025); http://arxiv.org/abs/2503.05788.

T. Huang, S. Hu, F. Ilhan, S. F. Tekin, Z. Yahn, Y. Xu, L. Liu, Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable, arXiv [cs.CR] (2025); http://arxiv.org/abs/2503.00555.

P. Peigné, M. Kniejski, F. Sondej, M. David, J. Hoelscher-Obermaier, C. Schroeder de Witt, E. Kran, Multi-Agent Security Tax: Trading off Security and Collaboration Capabilities in Multi-Agent Systems. Proceedings of the 39th AAAI Conference on Artificial Intelligence 39, 27573–27581 (2025); https://doi.org/10.1609/aaai.v39i26.34970.

* D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, G. Irving, “Fine-Tuning Language Models from Human Preferences” (OpenAI, 2020); http://arxiv.org/abs/1909.08593.

S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, T. T. Wang, S. Marks, C.-R. Segerie, M. Carroll, A. Peng, P. Christoffersen, M. Damani, … D. Hadfield-Menell, Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Transactions on Machine Learning Research (2023); https://openreview.net/forum?id=bx24KpJ4Eb.

M. Glickman, T. Sharot, How Human-AI Feedback Loops Alter Human Perceptual, Emotional and Social Judgements. Nature Human Behaviour 9, 345–359 (2025); https://doi.org/10.1038/s41562-024-02077-2.

K. Kobalczyk, M. Van Der Schaar, “Preference Learning for AI Alignment: A Causal Perspective” in Proceedings of the 42nd International Conference on Machine Learning, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, J. Zhu, Eds. (PMLR, 13–19 Jul 2025) vol. 267 of Proceedings of Machine Learning Research, pp. 31063–31083; https://proceedings.mlr.press/v267/kobalczyk25a.html.

* X. Wen, J. Lou, X. Lu, J. Yang, Y. Liu, Y. Lu, D. Zhang, X. Yu, Scalable Oversight for Superhuman AI via Recursive Self-Critiquing, arXiv [cs.AI] (2025); http://arxiv.org/abs/2502.04675.

D. Dai, M. Liu, A. Li, J. Cao, Y. Wang, C. Wang, X. Peng, Z. Zheng, FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks, arXiv [cs.SE] (2025); http://arxiv.org/abs/2504.06939.

* Z. Wang, J. Zeng, O. Delalleau, H.-C. Shin, F. Soares, A. Bukharin, E. Evans, Y. Dong, O. Kuchaiev, HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages, arXiv [cs.CL] (2025); http://arxiv.org/abs/2505.11475.

N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, X. Lyu, Y. Gu, S. Malik, V. Graf, J. D. Hwang, J. Yang, R. Le Bras, O. Tafjord, … H. Hajishirzi, “Tulu 3: Pushing Frontiers in Open Language Model Post-Training” in Second Conference on Language Modeling (2025); https://openreview.net/forum?id=i1uGbfHHpH#discussion.

M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, D. Hendrycks, HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal, arXiv [cs.LG] (2024); http://arxiv.org/abs/2402.04249.

S. Lee, M. Kim, L. Cherif, D. Dobre, J. Lee, S. J. Hwang, K. Kawaguchi, G. Gidel, Y. Bengio, N. Malkin, M. Jain, Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.18540.

* N. Howe, I. McKenzie, O. Hollinsworth, M. Zajac, T. Tseng, A. Tucker, P.-L. Bacon, A. Gleave, Scaling Trends in Language Model Robustness, arXiv [cs.LG] (2024); http://arxiv.org/abs/2407.18213.

* A. Cuevas, S. Dash, B. K. Nayak, D. Vann, M. I. G. Daepp, Anecdoctoring: Automated Red-Teaming across Language and Place, arXiv [cs.CL] (2025); http://arxiv.org/abs/2509.19143.

Z.-W. Hong, I. Shenfeld, T.-H. Wang, Y.-S. Chuang, A. Pareja, J. Glass, A. Srivastava, P. Agrawal, Curiosity-Driven Red-Teaming for Large Language Models, arXiv [cs.LG] (2024); http://arxiv.org/abs/2402.19464.

T. Yun, P.-L. St-Charles, J. Park, Y. Bengio, M. Kim, Active Attacks: Red-Teaming LLMs via Adaptive Environments, arXiv [cs.LG] (2025); http://arxiv.org/abs/2509.21947.

S. Xhonneux, A. Sordoni, S. Günnemann, G. Gidel, L. Schwinn, “Efficient Adversarial Training in LLMs with Continuous Attacks” in 38th Annual Conference on Neural Information Processing Systems (NeurIPS) (2024); https://openreview.net/pdf?id=8jB6sGqvgQ.

S. Casper, L. Schulze, O. Patel, D. Hadfield-Menell, Defending Against Unforeseen Failure Modes with Latent Adversarial Training, arXiv [cs.CR] (2024); http://dx.doi.org/10.48550/arXiv.2403.05030.

A. Sheshadri, A. Ewart, P. Guo, A. Lynch, C. Wu, V. Hebbar, H. Sleight, A. C. Stickland, E. Perez, D. Hadfield-Menell, S. Casper, Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs, arXiv [cs.LG] (2024); http://arxiv.org/abs/2407.15549.

C. Dékány, S. Balauca, R. Staab, D. I. Dimitrov, M. Vechev, MixAT: Combining Continuous and Discrete Adversarial Training for LLMs, arXiv [cs.LG] (2025); http://arxiv.org/abs/2505.16947.

M. Andriushchenko, F. Croce, N. Flammarion, Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks, arXiv [cs.CR] (2024); http://arxiv.org/abs/2404.02151.

* N. Li, Z. Han, I. Steneker, W. Primack, R. Goodside, H. Zhang, Z. Wang, C. Menghini, S. Yue, LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks yet, arXiv [cs.LG] (2024); http://arxiv.org/abs/2408.15221.

* A. Zou, M. Lin, E. Jones, M. Nowak, M. Dziemian, N. Winter, A. Grattan, V. Nathanael, A. Croft, X. Davies, J. Patel, R. Kirk, N. Burnikell, Y. Gal, D. Hendrycks, J. Z. Kolter, M. Fredrikson, Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition, arXiv [cs.AI] (2025); http://arxiv.org/abs/2507.20526.

D. Lüdke, T. Wollschläger, P. Ungermann, S. Günnemann, L. Schwinn, Diffusion LLMs Are Natural Adversaries for Any LLM, arXiv [cs.LG] (2025); http://arxiv.org/abs/2511.00203.

Anthropic, “System Card: Claude Sonnet 4.5” (Anthropic, 2025); https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf.

A. Souly, J. Rando, E. Chapman, X. Davies, B. Hasircioglu, E. Shereen, C. Mougan, V. Mavroudis, E. Jones, C. Hicks, N. Carlini, Y. Gal, R. Kirk, Poisoning Attacks on LLMs Require a near-Constant Number of Poison Samples, arXiv [cs.LG] (2025); http://arxiv.org/abs/2510.07192.

* M. Wang, T. D. la Tour, O. Watkins, A. Makelov, R. A. Chi, S. Miserendino, J. Wang, A. Rajaram, J. Heidecke, T. Patwardhan, D. Mossing, Persona Features Control Emergent Misalignment, arXiv [cs.LG] (2025); http://arxiv.org/abs/2506.19823.

J. Betley, D. C. H. Tan, N. Warncke, A. Sztyber-Betley, X. Bao, M. Soto, N. Labenz, O. Evans, “Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs” in Forty-Second International Conference on Machine Learning (2025); https://openreview.net/forum?id=aOIJ2gVRWW.

Z. Zhang, S. Cui, Y. Lu, J. Zhou, J. Yang, H. Wang, M. Huang, Agent-SafetyBench: Evaluating the Safety of LLM Agents, arXiv [cs.CL] (2024); http://arxiv.org/abs/2412.14470.

X. Li, R. Wang, M. Cheng, T. Zhou, C.-J. Hsieh, “DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLMs Jailbreakers” in Findings of the Association for Computational Linguistics: EMNLP 2024 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024), pp. 13891–13913; https://doi.org/10.18653/v1/2024.findings-emnlp.813.

M. Andriushchenko, A. Souly, M. Dziemian, D. Duenas, M. Lin, J. Wang, D. Hendrycks, A. Zou, Z. Kolter, M. Fredrikson, E. Winsor, J. Wynne, Y. Gal, X. Davies, AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents, arXiv [cs.LG] (2024); http://arxiv.org/abs/2410.09024.

T. Kuntz, A. Duzan, H. Zhao, F. Croce, Z. Kolter, N. Flammarion, M. Andriushchenko, OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents, arXiv [cs.SE] (2025); http://arxiv.org/abs/2506.14866.

A. Naik, P. Quinn, G. Bosch, E. Gouné, F. J. C. Zabala, J. R. Brown, E. J. Young, AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents, arXiv [cs.AI] (2025); http://arxiv.org/abs/2506.04018.

C. Yu, B. Stroebl, D. Yang, O. Papakyriakopoulos, Safety Devolution in AI Agents, arXiv [cs.CY] (2025); http://arxiv.org/abs/2505.14215.

J. Y. F. Chiang, S. Lee, J.-B. Huang, F. Huang, Y. Chen, Why Are Web AI Agents More Vulnerable than Standalone LLMs? A Security Analysis, arXiv [cs.LG] (2025); http://arxiv.org/abs/2502.20383.

B. Cottier, J. You, N. Martemianova, D. Owen, “How Far Behind Are Open Models?” (Epoch AI, 2024); https://epoch.ai/blog/open-models-report.

N. Maslej, L. Fattorini, R. Perrault, Y. Gil, V. Parli, N. Kariuki, E. Capstick, A. Reuel, E. Brynjolfsson, J. Etchemendy, K. Ligett, T. Lyons, J. Manyika, J. C. Niebles, Y. Shoham, R. Wald, T. Walsh, … S. Oak, “The AI Index 2025 Annual Report” (AI Index Steering Committee, Institute for Human-Centered AI, Stanford University, 2025); https://hai.stanford.edu/assets/files/hai_ai_index_report_2025.pdf.

* A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, … Z. Qiu, Qwen3 Technical Report, arXiv [cs.CL] (2025); http://arxiv.org/abs/2505.09388.

* Meta AI, The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation, Meta AI (2025); https://ai.meta.com/blog/llama-4-multimodal-intelligence/.

* Kimi Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, Z. Chen, J. Cui, H. Ding, M. Dong, A. Du, C. Du, D. Du, … X. Zu, Kimi K2: Open Agentic Intelligence, arXiv [cs.LG] (2025); http://arxiv.org/abs/2507.20534.

* OpenAI, S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, B. Barak, A. Bennett, T. Bertao, N. Brett, E. Brevdo, G. Brockman, … S. Zhao, Gpt-Oss-120b & Gpt-Oss-20b Model Card, arXiv [cs.CL] (2025); https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf.

* GLM-4.5 Team, A. Zeng, X. Lv, Q. Zheng, Z. Hou, B. Chen, C. Xie, C. Wang, D. Yin, H. Zeng, J. Zhang, K. Wang, L. Zhong, M. Liu, R. Lu, S. Cao, X. Zhang, … J. Tang, GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models, arXiv [cs.CL] (2025); http://arxiv.org/abs/2508.06471.

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, … Z. Zhang, DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning. Nature 645, 633–638 (2025); https://doi.org/10.1038/s41586-025-09422-z.

* C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S.-M. Yin, S. Bai, X. Xu, Y. Chen, Y. Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, … Z. Liu, Qwen-Image Technical Report, arXiv [cs.CV] (2025); http://arxiv.org/abs/2508.02324.

* S. Cao, H. Chen, P. Chen, Y. Cheng, Y. Cui, X. Deng, Y. Dong, K. Gong, T. Gu, X. Gu, T. Hang, D. Huang, J. Jiang, Z. Jiang, W. Kong, C. Li, D. Li, … Z. Zhong, HunyuanImage 3.0 Technical Report, arXiv [cs.CV] (2025); http://dx.doi.org/10.48550/arXiv.2509.23951.

* T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, … Z. Liu, Wan: Open and Advanced Large-Scale Video Generative Models, arXiv [cs.CV] (2025); http://dx.doi.org/10.48550/arXiv.2503.20314.

* W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y. Long, A. Wang, A. Wang, C. Li, … C. Zhong, HunyuanVideo: A Systematic Framework for Large Video Generative Models, arXiv [cs.CV] (2024); http://dx.doi.org/10.48550/arXiv.2412.03603.

R. Bommasani, S. Kapoor, K. Klyman, S. Longpre, A. Ramaswami, D. Zhang, M. Schaake, D. E. Ho, A. Narayanan, P. Liang, Considerations for Governing Open Foundation Models. Science 386, 151–153 (2024); https://doi.org/10.1126/science.adp1848.

US National Telecommunications and Information Administration, “Dual-Use Foundation Models with Widely Available Model Weights NTIA Report” (US Department of Commerce, 2024); https://www.ntia.gov/issues/artificial-intelligence/open-model-weights-report.

* E. Seger, B. O’Dell, “Open Horizons: Exploring Nuanced Technical and Policy Approaches to Openness in AI” (Demos and Mozilla, 2024); https://demos.co.uk/wp-content/uploads/2024/08/Mozilla-Report_2024.pdf.

F. Eiras, A. Petrov, B. Vidgen, C. Schroeder, F. Pizzati, K. Elkins, S. Mukhopadhyay, A. Bibi, A. Purewal, C. Botos, F. Steibel, F. Keshtkar, F. Barez, G. Smith, G. Guadagni, J. Chun, J. Cabot, … J. Foerster, Risks and Opportunities of Open-Source Generative AI, arXiv [cs.LG] (2024); http://arxiv.org/abs/2405.08597.

S. Kapoor, R. Bommasani, K. Klyman, S. Longpre, A. Ramaswami, P. Cihon, A. Hopkins, K. Bankston, S. Biderman, M. Bogen, R. Chowdhury, A. Engler, P. Henderson, Y. Jernite, S. Lazar, S. Maffulli, A. Nelson, … A. Narayanan, On the Societal Impact of Open Foundation Models, arXiv [cs.CY] (2024); http://arxiv.org/abs/2403.07918.

C. François, L. Péran, A. Bdeir, N. Dziri, W. Hawkins, Y. Jernite, S. Kapoor, J. Shen, H. Khlaaf, K. Klyman, N. Marda, M. Pellat, D. Raji, D. Siddarth, A. Skowron, J. Spisak, M. Srikumar, … J. Weedon, A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety, arXiv [cs.AI] (2025); http://dx.doi.org/10.48550/arXiv.2506.22183.

Y. Bengio, S. Mindermann, D. Privitera, T. Besiroglu, R. Bommasani, S. Casper, Y. Choi, P. Fox, B. Garfinkel, D. Goldfarb, H. Heidari, A. Ho, S. Kapoor, L. Khalatbari, S. Longpre, S. Manning, V. Mavroudis, … Y. Zeng, “International AI Safety Report” (Department for Science, Innovation and Technology, 2025); https://www.gov.uk/government/publications/international-ai-safety-report-2025.

E. Seger, N. Dreksler, R. Moulange, E. Dardaman, J. Schuett, K. Wei, C. Winter, M. Arnold, S. Ó. hÉigeartaigh, A. Korinek, M. Anderljung, B. Bucknall, A. Chan, E. Stafford, L. Koessler, A. Ovadya, B. Garfinkel, … A. Gupta, “Open-Sourcing Highly Capable Foundation Models: An Evaluation of Risks, Benefits, and Alternative Methods for Pursuing Open-Source Objectives” (Centre for the Governance of AI, 2023); http://arxiv.org/abs/2311.09227.

A. Chan, B. Bucknall, H. Bradley, D. Krueger, Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models, arXiv [cs.LG] (2023); http://arxiv.org/abs/2312.14751.

T. Huang, S. Hu, F. Ilhan, S. F. Tekin, L. Liu, Harmful Fine-Tuning Attacks and Defenses for Large Language Models: A Survey, arXiv [cs.CR] (2024); http://arxiv.org/abs/2409.18169.

IWF, “What Has Changed in the AI CSAM Landscape?” (Internet Watch Foundation, 2024).

W. Hawkins, C. Russell, B. Mittelstadt, Deepfakes on Demand: The Rise of Accessible Non-Consensual Deepfake Image Generators, arXiv [cs.CY] (2025); http://arxiv.org/abs/2505.03859.

E. H. Vaughan, NCMEC Releases New Data: 2024 in Numbers, National Center for Missing & Exploited Children (2025); http://www.ncmec.org/content/ncmec/en/blog/2025/ncmec-releases-new-data-2024-in-numbers.html.

S. Casper, K. O’Brien, S. Longpre, E. Seger, K. Klyman, R. Bommasani, A. Nrusimha, I. Shumailov, S. Mindermann, S. Basart, F. Rudzicz, K. Pelrine, A. Ghosh, A. Strait, R. Kirk, D. Hendrycks, P. Henderson, … D. Hadfield-Menell, Open Technical Problems in Open-Weight AI Model Risk Management, Social Science Research Network (2025); https://doi.org/10.2139/ssrn.5705186.

M. Srikumar, J. Chang, K. Chmielinski, “Risk Mitigation Strategies for the Open Foundation Model Value Chain: Insights from PAI Workshop Co-Hosted with GitHub” (Partnership on AI, 2024); https://partnershiponai.notion.site/1e8a6131dda045f1ad00054933b0bda0?v=dcb890146f7d464a86f11fcd5de372c0.

AI Security Institute, Managing Risks from Increasingly Capable Open-Weight AI Systems, AI Security Institute (2025); https://www.aisi.gov.uk/work/managing-risks-from-increasingly-capable-open-weight-ai-systems.

P. Henderson, E. Mitchell, C. Manning, D. Jurafsky, C. Finn, “Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models” in Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (Association for Computing Machinery, New York, NY, USA, 2023), AIES ’23, pp. 287–296; https://doi.org/10.1145/3600211.3604690.

* C. Fan, J. Jia, Y. Zhang, A. Ramakrishna, M. Hong, S. Liu, Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and beyond, arXiv [cs.LG] (2025); http://arxiv.org/abs/2502.05374.

D. Rosati, J. Wehner, K. Williams, Ł. Bartoszcze, D. Atanasov, R. Gonzales, S. Majumdar, C. Maple, H. Sajjad, F. Rudzicz, Representation Noising Effectively Prevents Harmful Fine-Tuning on LLMs, arXiv [cs.CL] (2024); http://arxiv.org/abs/2405.14577.

R. Tamirisa, B. Bharathi, L. Phan, A. Zhou, A. Gatti, T. Suresh, M. Lin, J. Wang, R. Wang, R. Arel, A. Zou, D. Song, B. Li, D. Hendrycks, M. Mazeika, Tamper-Resistant Safeguards for Open-Weight LLMs, arXiv [cs.LG] (2024); http://arxiv.org/abs/2408.00761.

A. Abdalla, I. Shaheen, D. DeGenaro, R. Mallick, B. Raita, S. A. Bargal, GIFT: Gradient-Aware Immunization of Diffusion Models against Malicious Fine-Tuning with Safe Concepts Retention, arXiv [cs.CR] (2025); http://arxiv.org/abs/2507.13598.

* A. F. Cooper, C. A. Choquette-Choo, M. Bogen, M. Jagielski, K. Filippova, K. Z. Liu, A. Chouldechova, J. Hayes, Y. Huang, N. Mireshghallah, I. Shumailov, E. Triantafillou, P. Kairouz, N. Mitchell, P. Liang, D. E. Ho, Y. Choi, … K. Lee, Machine Unlearning Doesn’t Do What You Think: Lessons for Generative AI Policy, Research, and Practice, arXiv [cs.LG] (2024); http://arxiv.org/abs/2412.06966.

J. Łucki, B. Wei, Y. Huang, P. Henderson, F. Tramèr, J. Rando, An Adversarial Perspective on Machine Unlearning for AI Safety, arXiv [cs.LG] (2024); http://arxiv.org/abs/2409.18025.

S. Hu, Y. Fu, Z. S. Wu, V. Smith, Jogging the Memory of Unlearned LLMs through Targeted Relearning Attacks, arXiv [cs.LG] (2024); http://arxiv.org/abs/2406.13356.

* A. Deeb, F. Roger, Do Unlearning Methods Remove Information from Language Model Weights?, arXiv [cs.LG] (2024); http://arxiv.org/abs/2410.08827.

Z. Che, S. Casper, R. Kirk, A. Satheesh, S. Slocum, L. E. McKinney, R. Gandikota, A. Ewart, D. Rosati, Z. Wu, Z. Cai, B. Chughtai, Y. Gal, F. Huang, D. Hadfield-Menell, Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities, arXiv [cs.CR] (2025); http://arxiv.org/abs/2502.05209.

J. Luo, T. Ding, K. H. R. Chan, D. Thaker, A. Chattopadhyay, C. Callison-Burch, R. Vidal, PaCE: Parsimonious Concept Engineering for Large Language Models, arXiv [cs.CL] (2024); http://arxiv.org/abs/2406.04331.

D. Gottesman, M. Geva, “Estimating Knowledge in Large Language Models without Generating a Single Token” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024), pp. 3994–4019; https://doi.org/10.18653/v1/2024.emnlp-main.232.

O. Aarne, T. Fist, C. Withers, “Secure, Governable Chips: Using On-Chip Mechanisms to Manage National Security Risks from AI & Advanced Computing” (Center for a New American Security, 2024); https://s3.us-east-1.amazonaws.com/files.cnas.org/documents/CNASReport-Tech-Secure-Chips-Jan-24-finalb.pdf.

A. O’Gara, G. Kulp, W. Hodgkins, J. Petrie, V. Immler, A. Aysu, K. Basu, S. Bhasin, S. Picek, A. Srivastava, Hardware-Enabled Mechanisms for Verifying Responsible AI Development, arXiv [cs.CR] (2025); http://arxiv.org/abs/2505.03742.

C. Yueh-Han, N. Joshi, Y. Chen, M. Andriushchenko, R. Angell, H. He, Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors, arXiv [cs.CR] (2025); http://arxiv.org/abs/2506.10949.

D. Brown, M. Sabbaghi, L. Sun, A. Robey, G. J. Pappas, E. Wong, H. Hassani, Benchmarking Misuse Mitigation against Covert Adversaries, arXiv [cs.CR] (2025); http://arxiv.org/abs/2506.06414.

* I. R. McKenzie, O. J. Hollinsworth, T. Tseng, X. Davies, S. Casper, A. D. Tucker, R. Kirk, A. Gleave, STACK: Adversarial Attacks on LLM Safeguard Pipelines, arXiv [cs.CL] (2025); http://arxiv.org/abs/2506.24068.

N. Kirch, C. Weisser, S. Field, H. Yannakoudakis, S. Casper, What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms behind Attacks, arXiv [cs.CR] (2024); https://doi.org/10.48550/ARXIV.2411.03343.

N. Goldowsky-Dill, B. Chughtai, S. Heimersheim, M. Hobbhahn, Detecting Strategic Deception Using Linear Probes, arXiv [cs.LG] (2025); http://arxiv.org/abs/2502.03407.

* B. Baker, J. Huizinga, L. Gao, Z. Dou, M. Y. Guan, A. Madry, W. Zaremba, J. Pachocki, D. Farhi, “Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation” (OpenAI, 2025); https://arxiv.org/abs/2503.11926.

* T. Korbak, M. Balesni, E. Barnes, Y. Bengio, J. Benton, J. Bloom, M. Chen, A. Cooney, A. Dafoe, A. Dragan, S. Emmons, O. Evans, D. Farhi, R. Greenblatt, D. Hendrycks, M. Hobbhahn, E. Hubinger, … V. Mikulik, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, arXiv [cs.AI] (2025); http://arxiv.org/abs/2507.11473.

R. Greenblatt, B. Shlegeris, K. Sachan, F. Roger, AI Control: Improving Safety Despite Intentional Subversion, arXiv [cs.LG] (2023); http://dx.doi.org/10.48550/arXiv.2312.06942.

Y. Bengio, M. K. Cohen, N. Malkin, M. MacDermott, D. Fornasiere, P. Greiner, Y. Kaddar, Can a Bayesian Oracle Prevent Harm from an Agent?, arXiv [cs.AI] (2024); http://arxiv.org/abs/2408.05284.

T. Korbak, J. Clymer, B. Hilton, B. Shlegeris, G. Irving, A Sketch of an AI Control Safety Case, arXiv [cs.AI] (2025); http://arxiv.org/abs/2501.17315.

V. Kovarik, E. O. Chen, S. Petersen, A. Ghersengorin, V. Conitzer, AI Testing Should Account for Sophisticated Strategic Behaviour, arXiv [cs.GT] (2025); http://arxiv.org/abs/2508.14927.

* OpenAI, “Operator System Card” (OpenAI, 2025); https://cdn.openai.com/operator_system_card.pdf.

L. Zhu, Q. Lu, D. Ming, S. U. Lee, C. Wang, Designing Meaningful Human Oversight in AI, Social Science Research Network (2025); https://doi.org/10.2139/ssrn.5501939.

* H. Mozannar, G. Bansal, C. Tan, A. Fourney, V. Dibia, J. Chen, J. Gerrits, T. Payne, M. K. Maldaner, M. Grunde-McLaughlin, E. Zhu, G. Bassman, J. Alber, P. Chang, R. Loynd, F. Niedtner, E. Kamar, … S. Amershi, Magentic-UI: Towards Human-in-the-Loop Agentic Systems, arXiv [cs.AI] (2025); http://arxiv.org/abs/2507.22358.

T. Hua, J. Baskerville, H. Lemoine, M. Hopman, A. Bhatt, T. Tracy, “Combining Cost Constrained Runtime Monitors for AI Safety” in The 39th Annual Conference on Neural Information Processing Systems (2025); https://openreview.net/forum?id=hVR3023UP2.

A. McKenzie, U. Pawar, P. Blandfort, W. Bankes, D. Krueger, E. S. Lubana, D. Krasheninnikov, “Detecting High-Stakes Interactions with Activation Probes” in The 39th Annual Conference on Neural Information Processing Systems (2025); https://openreview.net/forum?id=8YniJnJQ0P.

* OpenAI, “GPT-4o System Card” (OpenAI, 2024); https://cdn.openai.com/gpt-4o-system-card.pdf.

* OpenAI, “ChatGPT Agent System Card” (2025); https://cdn.openai.com/pdf/839e66fc-602c-48bf-81d3-b21eacc3459d/chatgpt_agent_system_card.pdf.

Organisation for Economic Co-Operation and Development, “Towards a Common Reporting Framework for AI Incidents” (OECD, 2025); https://doi.org/10.1787/f326d4ac-en.

N. Yu, V. Skripniuk, S. Abdelnabi, M. Fritz, “Artificial Fingerprinting for Generative Models: Rooting Deepfake Attribution in Training Data” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (IEEE, 2021); https://doi.org/10.1109/iccv48922.2021.01418.

F. Boenisch, A Systematic Review on Model Watermarking for Neural Networks. Frontiers in Big Data 4 (2021); https://www.frontiersin.org/articles/10.3389/fdata.2021.729663/full.

P. Fernandez, G. Couairon, H. Jégou, M. Douze, T. Furon, “The Stable Signature: Rooting Watermarks in Latent Diffusion Models” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023), pp. 22409–22420; https://doi.org/10.1109/ICCV51070.2023.02053.

M. Christ, S. Gunn, T. Malkin, M. Raykova, Provably Robust Watermarks for Open-Source Language Models, arXiv [cs.CR] (2024); http://arxiv.org/abs/2410.18861.

* X. Xu, Y. Yao, Y. Liu, Learning to Watermark LLM-Generated Text via Reinforcement Learning, arXiv [cs.LG] (2024); http://arxiv.org/abs/2403.10553.

G. Pagnotta, D. Hitaj, B. Hitaj, F. Perez-Cruz, L. V. Mancini, TATTOOED: A Robust Deep Neural Network Watermarking Scheme Based on Spread-Spectrum Channel Coding, arXiv [cs.CR] (2022); http://arxiv.org/abs/2202.06091.

P. Lv, P. Li, S. Zhang, K. Chen, R. Liang, H. Ma, Y. Zhao, Y. Li, A Robustness-Assured White-Box Watermark in Neural Networks. IEEE Transactions on Dependable and Secure Computing 20, 5214–5229 (2023); https://doi.org/10.1109/tdsc.2023.3242737.

L. Li, B. Jiang, P. Wang, K. Ren, H. Yan, X. Qiu, “Watermarking LLMs with Weight Quantization” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, K. Bali, Eds. (Association for Computational Linguistics, Singapore, 2023), pp. –3378; https://doi.org/10.18653/v1/2023.findings-emnlp.220.

* A. Block, A. Sekhari, A. Rakhlin, GaussMark: A Practical Approach for Structural Watermarking of Language Models, arXiv [cs.CR] (2025); http://arxiv.org/abs/2501.13941.

T. Gloaguen, N. Jovanović, R. Staab, M. Vechev, Towards Watermarking of Open-Source LLMs, arXiv [cs.CR] (2025); http://arxiv.org/abs/2502.10525.

S. Zhu, A. Ahmed, R. Kuditipudi, P. Liang, Independence Tests for Language Models, arXiv [cs.LG] (2025); http://arxiv.org/abs/2502.12292.

E. Horwitz, A. Shul, Y. Hoshen, Unsupervised Model Tree Heritage Recovery, arXiv [cs.LG] (2024); http://arxiv.org/abs/2405.18432.

E. Horwitz, N. Kurer, J. Kahana, L. Amar, Y. Hoshen, We Should Chart an Atlas of All the World’s Models, arXiv [cs.LG] (2025); http://arxiv.org/abs/2503.10633.

Organisation for Economic Co-Operation and Development, “Sharing Trustworthy AI Models with Privacy-Enhancing Technologies” (OECD, 2025); https://doi.org/10.1787/a266160b-en.

S. Chappidi, J. Cobbe, C. Norval, A. Mazumder, J. Singh, Accountability Capture: How Record-Keeping to Support AI Transparency and Accountability (re)shapes Algorithmic Oversight. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 8, 554–566 (2025); https://doi.org/10.1609/aies.v8i1.36570.

X. Zhao, S. Gunn, M. Christ, J. Fairoze, A. Fabrega, N. Carlini, S. Garg, S. Hong, M. Nasr, F. Tramer, S. Jha, L. Li, Y.-X. Wang, D. Song, SoK: Watermarking for AI-Generated Content, arXiv [cs.CR] (2024); http://arxiv.org/abs/2411.18479.

* L. Cao, Watermarking for AI Content Detection: A Review on Text, Visual, and Audio Modalities, arXiv [cs.CR] (2025); http://arxiv.org/abs/2504.03765.

S. Dathathri, A. See, S. Ghaisas, P.-S. Huang, R. McAdam, J. Welbl, V. Bachani, A. Kaskasoli, R. Stanforth, T. Matejovicova, J. Hayes, N. Vyas, M. A. Merey, J. Brown-Cohen, R. Bunel, B. Balle, T. Cemgil, … P. Kohli, Scalable Watermarking for Identifying Large Language Model Outputs. Nature 634, 818–823 (2024); https://doi.org/10.1038/s41586-024-08025-4.

A. Liu, L. Pan, Y. Lu, J. Li, X. Hu, X. Zhang, L. Wen, I. King, H. Xiong, P. Yu, A Survey of Text Watermarking in the Era of Large Language Models. ACM Computing Surveys 57, 1–36 (2025); https://doi.org/10.1145/3691626.

Z. Yang, G. Zhao, H. Wu, Watermarking for Large Language Models: A Survey. Mathematics 13, 1420 (2025); https://doi.org/10.3390/math13091420.

W. Wan, J. Wang, Y. Zhang, J. Li, H. Yu, J. Sun, A Comprehensive Survey on Robust Image Watermarking. Neurocomputing 488, 226–247 (2022); https://doi.org/10.1016/j.neucom.2022.02.083.

M. S. Uddin, Ohidujjaman, M. Hasan, T. Shimamura, Audio Watermarking: A Comprehensive Review. International Journal of Advanced Computer Science and Applications 15 (2024); https://doi.org/10.14569/IJACSA.2024.01505141.

S. Singhi, A. Yadav, A. Gupta, S. Ebrahimi, P. Hassanizadeh, Provenance Detection for AI-Generated Images: Combining Perceptual Hashing, Homomorphic Encryption, and AI Detection Models, arXiv [cs.CV] (2025); http://arxiv.org/abs/2503.11195.

R. Chen, Y. Wu, J. Guo, H. Huang, Improved Unbiased Watermark for Large Language Models, arXiv [cs.CL] (2025); http://arxiv.org/abs/2502.11268.

C2PA Technical Working Group, “C2PA Content Credentials Explained: Addressing Common Questions and Updates” (C2PA, 2025); https://c2pa.org/wp-content/uploads/sites/33/2025/10/content_credentials_wp_0925.pdf.

A. Knott, D. Pedreschi, R. Chatila, T. Chakraborti, S. Leavy, R. Baeza-Yates, D. Eyers, A. Trotman, P. D. Teal, P. Biecek, S. Russell, Y. Bengio, Generative AI Models Should Include Detection Mechanisms as a Condition for Public Release. Ethics and Information Technology 25, 55 (2023); https://doi.org/10.1007/s10676-023-09728-4.

L. Lin, N. Gupta, Y. Zhang, H. Ren, C.-H. Liu, F. Ding, X. Wang, X. Li, L. Verdoliva, S. Hu, Detecting Multimedia Generated by Large AI Models: A Survey, arXiv [cs.MM] (2024); https://www.techrxiv.org/users/723084/articles/707949-detecting-multimedia-generated-by-large-ai-models-a-survey?commit=17e92ea8d954e6c448a006d4f4e7fd594c9f6f0d.

A. Hans, A. Schwarzschild, V. Cherepanova, H. Kazemi, A. Saha, M. Goldblum, J. Geiping, T. Goldstein, Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text, arXiv [cs.CL] (2024); http://arxiv.org/abs/2401.12070.

* V. Pirogov, M. Artemev, Evaluating Deepfake Detectors in the Wild, arXiv [cs.CV] (2025); http://arxiv.org/abs/2507.21905.

W. Warby, Green Chameleon on a Branch (2024); https://unsplash.com/photos/lJAYYVG2V4Y.

* S. Gowal, R. Bunel, F. Stimberg, D. Stutz, G. Ortiz-Jimenez, C. Kouridi, M. Vecerik, J. Hayes, S.-A. Rebuffi, P. Bernard, C. Gamble, M. Z. Horváth, F. Kaczmarczyck, A. Kaskasoli, A. Petrov, I. Shumailov, M. Thotakuri, … P. Kohli, SynthID-Image: Image Watermarking at Internet Scale, arXiv [cs.CR] (2025); http://arxiv.org/abs/2510.09263.

J. Cao, Q. Li, Z. Zhang, J. Ni, Secure and Robust Watermarking for AI-Generated Images: A Comprehensive Survey, arXiv [cs.CR] (2025); http://arxiv.org/abs/2510.02384.

Y. Chen, Z. Ma, H. Fang, W. Zhang, N. Yu, TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity, arXiv [cs.MM] (2025); http://arxiv.org/abs/2506.23484.

T. South, Ed., “Identity Management for Agentic AI” (OpenID, 2025); https://openid.net/wp-content/uploads/2025/10/Identity-Management-for-Agentic-AI.pdf.

A. Chan, N. Kolt, P. Wills, U. Anwar, C. S. de Witt, N. Rajkumar, L. Hammond, D. Krueger, L. Heim, M. Anderljung, IDs for AI Systems, arXiv [cs.AI] (2024); http://arxiv.org/abs/2406.12137.

T. South, S. Marro, T. Hardjono, R. Mahari, C. D. Whitney, D. Greenwood, A. Chan, A. Pentland, Authenticated Delegation and Authorized AI Agents, arXiv [cs.CY] (2025); http://arxiv.org/abs/2501.09674.

European Commission, The General-Purpose AI Code of Practice. (2025); https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai.

S. Wiener, S. Rubio, SB-53 Artificial Intelligence Models: Large Developers (2025); https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202520260SB53.

G7, OECD, G7 Reporting Framework – Hiroshima AI Process (HAIP) International Code of Conduct for Organizations Developing Advanced AI Systems. (2025); https://transparency.oecd.ai/.

OECD, Launch of the Hiroshima AI Process (HAIP) Reporting Framework, OECD (2025); https://www.oecd.org/en/events/2025/02/launch-of-the-hiroshima-ai-process-reporting-framework.html.

K. Persec, J. Healy, S. F. Esposito, “Shaping Trustworthy AI: Early Insights from the Hiroshima AI Process Reporting Framework” (OECDAI, 2025); https://oecd.ai/en/work/haip-reporting-insights.

OECD, Submitted Reports – HAIP Reporting Framework (2025); https://transparency.oecd.ai/reports.

ASEAN, “Expanded ASEAN Guide on AI Governance and Ethics - Generative AI” (ASEAN, 2025); https://asean.org/book/expanded-asean-guide-on-ai-governance-and-ethics-generative-ai/.

K. Choi, Analyzing South Korea’s Framework Act on the Development of AI, IAPP (2025); https://iapp.org/news/a/analyzing-south-korea-s-framework-act-on-the-development-of-ai.

Ministry of Science and ICT (과학기술정보통신부), Framework Act on the Development of Artificial Intelligence and the Establishment of a Foundation for Trust [인공지능 발전과 신뢰 기반 조성 등에 관한 기본법] (2025); https://www.law.go.kr/%EB%B2%95%EB%A0%B9/%EC%9D%B8%EA%B3%B5%EC%A7%80%EB%8A%A5%20%EB%B0%9C%EC%A0%84%EA%B3%BC%20%EC%8B%A0%EB%A2%B0%20%EA%B8%B0%EB%B0%98%20%EC%A1%B0%EC%84%B1%20%EB%93%B1%EC%97%90%20%EA%B4%80%ED%95%9C%20%EA%B8%B0%EB%B3%B8%EB%B2%95/%2820676,20250121%29.

METR, Frontier AI Safety Policies (2025); https://metr.org/.

Frontier Model Forum, “Risk Taxonomy and Thresholds for Frontier AI Frameworks” (2025); https://www.frontiermodelforum.org/technical-reports/risk-taxonomy-and-thresholds/.

M. D. Buhl, B. Bucknall, T. Masterson, Emerging Practices in Frontier AI Safety Frameworks, arXiv [cs.CY] (2025); http://arxiv.org/abs/2503.04746.

METR, Forecasting the Impacts of AI R&D Acceleration: Results of a Pilot Study (2025); https://metr.org/blog/2025-08-20-forecasting-impacts-of-ai-acceleration/.

J. Wang, K. Huang, K. Klyman, R. Bommasani, Do AI Companies Make Good on Voluntary Commitments to the White House?, arXiv [cs.CY] (2025); http://arxiv.org/abs/2508.08345.

Future of Life Institute, “AI Safety Index: Summer 2025” (Future of Life Institute, 2025); https://futureoflife.org/wp-content/uploads/2025/07/FLI-AI-Safety-Index-Report-Summer-2025.pdf.

S. Campos, H. Papadatos, F. Roger, C. Touzet, O. Quarks, M. Murray, A Frontier AI Risk Management Framework: Bridging the Gap between Current AI Practices and Established Risk Management, arXiv [cs.AI] (2025); http://arxiv.org/abs/2502.06656.

I. Habli, R. Hawkins, C. Paterson, P. Ryan, Y. Jia, M. Sujan, J. McDermid, The BIG Argument for AI Safety Cases, arXiv [cs.CY] (2025); http://arxiv.org/abs/2503.11705.

J. Clymer, J. Weinbaum, R. Kirk, K. Mai, S. Zhang, X. Davies, An Example Safety Case for Safeguards against Misuse, arXiv [cs.LG] (2025); http://arxiv.org/abs/2505.18003.

J. Clymer, N. Gabrieli, D. Krueger, T. Larsen, Safety Cases: How to Justify the Safety of Advanced AI Systems, arXiv [cs.CY] (2024); http://arxiv.org/abs/2403.10462.

A. Goemans, M. D. Buhl, J. Schuett, T. Korbak, J. Wang, B. Hilton, G. Irving, Safety Case Template for Frontier AI: A Cyber Inability Argument, arXiv [cs.CY] (2024); http://arxiv.org/abs/2411.08088.

M. D. Buhl, G. Sett, L. Koessler, J. Schuett, M. Anderljung, Safety Cases for Frontier AI, arXiv [cs.CY] (2024); http://arxiv.org/abs/2410.21572.

M. D. Buhl, J. Pfau, B. Hilton, G. Irving, An Alignment Safety Case Sketch Based on Debate, arXiv [cs.AI] (2025); http://arxiv.org/abs/2505.03989.

B. Hilton, M. D. Buhl, T. Korbak, G. Irving, “Safety Cases: A Scalable Approach to Frontier AI Safety” (AI Security Institute, 2025); https://doi.org/10.48550/arXiv.2503.04744.

* Anthropic, Anthropic’s Responsible Scaling Policy, Version 1.0. (2023); https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf.

* Google DeepMind, Frontier Safety Framework Version 3.0. (2025); https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/strengthening-our-frontier-safety-framework/frontier-safety-framework_3.pdf.

K. Perset, S. Fialho Esposito, “How Are AI Developers Managing Risks?” (OECD, 2025); https://doi.org/10.1787/658c2ad6-en.

L. Staufer, M. Yang, A. Reuel, S. Casper, Audit Cards: Contextualizing AI Evaluations, arXiv [cs.CY] (2025); http://arxiv.org/abs/2504.13839.

