The Singapore Consensus on Global AI Safety Research Priorities
Building a Trustworthy, Reliable and Secure AI Ecosystem
DOI: https://doi.org/10.70777/si.v2i5.15503
Keywords: AI safety, AI governance, artificial general intelligence, superintelligence, mechanistic interpretability, AGI, AI risk, AI benchmarks, AI metrics
Abstract
Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential – it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. This requires policymakers, industry, researchers and the broader public to collectively work toward securing positive outcomes from AI’s development. AI safety research is a key dimension. Given that the state of science today for building trustworthy AI does not fully cover all risks, accelerated investment in research is required to keep pace with commercially driven growth in system capabilities.
Goals: The 2025 Singapore Conference on AI (SCAI): International Scientific Exchange on AI Safety aims to support research in this important space by bringing together AI scientists across geographies to identify and synthesise research priorities in AI safety. The result, The Singapore Consensus on Global AI Safety Research Priorities, builds on the International AI Safety Report-A (IAISR) chaired by Yoshua Bengio and backed by 33 governments. By adopting a defence-in-depth model, this document organises AI safety research domains into three types: challenges with creating trustworthy AI systems (Development), challenges with evaluating their risks (Assessment), and challenges with monitoring and intervening after deployment (Control). Through the Singapore Consensus, we hope to globally facilitate meaningful conversations between AI scientists and AI policymakers for maximally beneficial outcomes. Our goal is to enable more impactful R&D efforts to rapidly develop safety and evaluation mechanisms and foster a trusted ecosystem where AI is harnessed for the public good.
References
[Alaga] Alaga, J., Schuett, J., & Anderljung, M. (2024). A Grading Rubric for AI Safety Frameworks. arXiv preprint arXiv:2409.08751.
[AISES] Hendrycks, D. (2024). Systemic safety. In AI safety, ethics, and society textbook. Center for AI Safety. https://www.aisafetybook.com/textbook/systemic-safety DOI: https://doi.org/10.1201/9781003530336-1
[Ashby] Ashby, W. R. (1956). An introduction to Cybernetics. Chapman & Hall. https://philpapers.org/archive/ASHAIT.pdf DOI: https://doi.org/10.5962/bhl.title.5851
[Anthropic-A] Anthropic. (2024). Anthropic Economic Index. https://www.anthropic.com/economic-index
[Anthropic-B] Anthropic. (2024). Claude 3.7 Sonnet system card. https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf
[Anthropic-C] Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., ... & Hubinger, E. (2024). Alignment faking in large language models. arXiv preprint arXiv:2412.14093.
[Anthropic-D] Chen, Y., Benton, J., Radhakrishnan, A., Denison, J. U. C., Schulman, J., Somani, A., ... & Perez, E. Reasoning Models Don’t Always Say What They Think.
[Anthropic-E] Anthropic. (2024, October 15). Responsible Scaling Policy (Version 2.0). Anthropic. https://www.anthropic.com/responsible-scaling-policy DOI: https://doi.org/10.70777/si.v2i1.13657
[Anthropic-F] Anthropic. (2025). Recommended directions. Anthropic Alignment. https://alignment.anthropic.com/2025/recommended-directions/
[Anthropic-G] Anthropic. (2025, April 24). Modifying LLM beliefs with synthetic document finetuning. Anthropic Alignment Research. https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/
[Anthropic-H] Anthropic. (2024, August 8). Expanding our model safety bug bounty program. https://www.anthropic.com/news/model-safety-bug-bounty
[Anwar] Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., ... & Krueger, D. (2024). Foundational challenges in assuring alignment and safety of large language models. arXiv preprint arXiv:2404.09932.
[Apollo] Hobbhahn, M. (2024, January 22). We need a science of evals. Apollo Research. https://www.apolloresearch.ai/blog/we-need-a-science-of-evals
[Armstrong] Armstrong, S., & Mindermann, S. (2018). Occam’s razor is insufficient to infer the preferences of irrational agents. Advances in Neural Information Processing Systems, 31.
[Barez] Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., ... & Hendrycks, D. (2024). The WMDP Benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218.
[Bateman] Bateman, J., Baer, D., Bell, S. A., Brown, G. O., Cuéllar, M. F. T., Ganguli, D., ... & Zvyagina, P. (2024). Beyond open vs. closed: Emerging consensus and key questions for foundation AI model governance.
[Baum] Baum, S. D. (2020). Social choice ethics in artificial intelligence. AI & Society, 35(1), 165-176. DOI: https://doi.org/10.1007/s00146-017-0760-1
[Bengio-A] Bengio, Y., Hinton, G., Yao, A., Song, D., Abbeel, P., Darrell, T., ... & Mindermann, S. (2024). Managing extreme AI risks amid rapid progress. Science, 384(6698), 842–845. DOI: https://doi.org/10.1126/science.adn0117
[Bengio-B] Bengio, Y., Cohen, M., Fornasiere, D., Ghosn, J., Greiner, P., MacDermott, M., ... & Williams-King, D. (2025). Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?. arXiv preprint arXiv:2502.15657. DOI: https://doi.org/10.70777/si.v2i5.15569
[Berglund] Berglund, L., Stickland, A. C., Balesni, M., Kaufmann, M., Tong, M., Korbak, T., ... & Evans, O. (2023). Taken out of context: On measuring situational awareness in LLMs. arXiv preprint arXiv:2309.00667.
[Bernardi] Bernardi, J., Mukobi, G., Greaves, H., Heim, L., & Anderljung, M. (2024). Societal adaptation to advanced AI. arXiv preprint arXiv:2405.10295.
[Betley] Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., ... & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424.
[Birhane-A] Birhane, A., Steed, R., Ojewale, V., Vecchione, B., & Raji, I. D. (2024, April). AI auditing: The broken bus on the road to AI accountability. In 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) (pp. 612-643). IEEE. DOI: https://doi.org/10.1109/SaTML59370.2024.00037
[Birhane-B] Birhane, A., Prabhu, V., Han, S., & Boddeti, V. N. (2023). On hate scaling laws for data-swamps. arXiv preprint arXiv:2306.13141.
[Bordt] Bordt, S. (2023). Explainable machine learning and its limitations (Doctoral dissertation, Universität Tübingen).
[Bucknall-A] Bucknall, B., Trager, R. F., & Osborne, M. A. (2025). Position: Ensuring mutual privacy is necessary for effective external evaluation of proprietary AI systems. arXiv preprint arXiv:2503.01470.
[Bucknall-B] Bucknall, B., Siddiqui, S., Thurnherr, L., McGurk, C., Harack, B., Reuel, A., ... & Trager, R. (2025). In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?. arXiv preprint arXiv:2504.12914. DOI: https://doi.org/10.70777/si.v2i5.15509
[Buhl] Buhl, M. D., Sett, G., Koessler, L., Schuett, J., & Anderljung, M. (2024). Safety cases for frontier AI. arXiv preprint arXiv:2410.21572.
[Campos] Campos, S., Papadatos, H., Roger, F., Touzet, C., Quarks, O., & Murray, M. (2025). A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management. arXiv preprint arXiv:2502.06656.
[Cao] Cao, L. (2025). Watermarking for AI Content Detection: A Review on Text, Visual, and Audio Modalities. arXiv preprint arXiv:2504.03765.
[CAIS] Hinton, G., Bengio, Y., Hassabis, D., Altman, S., Amodei, D., … Statement on AI Risk. Center for AI Safety. https://safe.ai/work/statement-on-ai-risk
[Casper-A] Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., ... & Hadfield-Menell, D. (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.
[Casper-B] Casper, S., Ezell, C., Siegmann, C., Kolt, N., Curtis, T. L., Bucknall, B., ... & Hadfield-Menell, D. (2024, June). Black-box access is insufficient for rigorous AI audits. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (pp. 2254-2272). DOI: https://doi.org/10.1145/3630106.3659037
[Casper-C] Casper, S., Krueger, D., & Hadfield-Menell, D. (2025). Pitfalls of Evidence-Based AI Policy. arXiv preprint arXiv:2502.09618. DOI: https://doi.org/10.70777/si.v2i2.14611
[Chan] Chan, A., Wei, K., Huang, S., Rajkumar, N., Perrier, E., Lazar, S., ... & Anderljung, M. (2025). Infrastructure for AI Agents. arXiv preprint arXiv:2501.10114.
[Che] Che, Z., Casper, S., Kirk, R., Satheesh, A., Slocum, S., McKinney, L. E., ... & Hadfield-Menell, D. (2025). Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities. arXiv preprint arXiv:2502.05209.
[Cheng] Cheng, P., Wu, Z., Du, W., Zhao, H., Lu, W., & Liu, G. (2023). Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review. arXiv preprint arXiv:2309.06055.
[Clymer] Clymer, J., Gabrieli, N., Krueger, D., & Larsen, T. (2024). Safety cases: How to justify the safety of advanced AI systems. arXiv preprint arXiv:2403.10462.
[Critch] Critch, A., & Krueger, D. (2020). AI research considerations for human existential safety (ARCHES). arXiv preprint arXiv:2006.04948.
[Dalrymple] Dalrymple, D., Skalse, J., Bengio, Y., Russell, S., Tegmark, M., Seshia, S., ... & Tenenbaum, J. (2024). Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. arXiv preprint arXiv:2405.06624.
[DeepSeek] Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., ... & Piao, Y. (2024). DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
[Dekker] Dekker, S. (2019). Foundations of safety science: A century of understanding accidents and disasters. Routledge. https://books.google.co.uk/books?id=dwWSDwAAQBAJ DOI: https://doi.org/10.4324/9781351059794
[Engels] Engels, J., Baek, D. D., Kantamneni, S., & Tegmark, M. (2025). Scaling laws for scalable oversight. arXiv preprint arXiv:2504.18530.
[Engstrom] Engstrom, L., Feldmann, A., & Madry, A. (2024). DsDm: Model-aware dataset selection with datamodels. arXiv preprint arXiv:2401.12926.
[Eriksson] Eriksson, M., Purificato, E., Noroozian, A., Vinagre, J., Chaslot, G., Gomez, E., & Fernandez-Llorca, D. (2025). Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. arXiv preprint arXiv:2502.06559. DOI: https://doi.org/10.1609/aies.v8i1.36595
[EU] European Union. (2024). The EU Artificial Intelligence Act.
[Evans-A] Evans, O., Cotton-Barratt, O., Finnveden, L., Bales, A., Balwit, A., Wills, P., ... & Saunders, W. (2021). Truthful AI: Developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674.
[Evans-B] Evans, O., Saunders, W., & Stuhlmüller, A. (2019). Machine learning projects for iterated distillation and amplification.
[Everitt] Everitt, T., Filan, D., Daswani, M., & Hutter, M. (2016). Self-modification of policy and utility function in rational agents. In Artificial General Intelligence: 9th International Conference, AGI 2016, New York, NY, USA, July 16-19, 2016, Proceedings 9 (pp. 1-11). Springer International Publishing. DOI: https://doi.org/10.1007/978-3-319-41649-6_1
[Fortune] Fortune. (2023, July). Brainstorm Tech 2023: How Anthropic is paving the way for responsible A.I. [Video]. Fortune. https://fortune.com/videos/watch/brainstorm-tech-2023%3A-how-anthropic-is-paving-the-way-for-responsible-a.i./88517d5f-b5c3-40ac-b5bb-8368afc95acd
[Field] Field, S. (2025). Why do Experts Disagree on Existential Risk and P (doom)? A Survey of AI Experts. arXiv preprint arXiv:2502.14870. DOI: https://doi.org/10.1007/s43681-025-00762-0
[Google] Google DeepMind. (2024, April 9). Updating the Frontier Safety Framework. DeepMind. https://deepmind.google/discover/blog/updating-the-frontier-safety-framework/
[Gryz] Gryz, J., & Rojszczak, M. (2021). Black box algorithms and the rights of individuals: No easy solution to the “explainability” problem. Internet Policy Review, 10(2), 1-24. DOI: https://doi.org/10.14763/2021.2.1564
[IAISR] Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., ... & Zeng, Y. (2025). International AI Safety Report. arXiv preprint arXiv:2501.17805.
[Gabriel] Gabriel, I., Manzini, A., Keeling, G., Hendricks, L. A., Rieser, V., Iqbal, H., ... & Manyika, J. (2024). The ethics of advanced AI assistants. arXiv preprint arXiv:2404.16244.
[GDM] Shah, R., Irpan, A., Turner, A. M., Wang, A., Conmy, A., Lindner, D., ... & Dragan, A. (2025). An approach to technical AGI safety and security. Google DeepMind. https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/evaluating-potential-cybersecurity-threats-of-advanced-ai/An_Approach_to_Technical_AGI_Safety_Apr_2025.pdf
[Greenblatt-A] Greenblatt, R., Shlegeris, B., Sachan, K., & Roger, F. (2023). AI control: Improving safety despite intentional subversion. arXiv preprint arXiv:2312.06942.
[Greenblatt-B] Greenblatt, R., Roger, F., Krasheninnikov, D., & Krueger, D. (2024). Stress-testing capability elicitation with password-locked models. arXiv preprint arXiv:2405.19550.
[Griffin] Griffin, C., Thomson, L., Shlegeris, B., & Abate, A. (2024). Games for AI control: Models of safety evaluations of AI deployment protocols. arXiv preprint arXiv:2409.07985.
[Grosse] Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., ... & Bowman, S. R. (2023). Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.
[Hadfield-Menell] Hadfield-Menell, D., Russell, S. J., Abbeel, P., & Dragan, A. (2016). Cooperative inverse reinforcement learning. Advances in Neural Information Processing Systems, 29.
[Hammond] Hammond, L., Chan, A., Clifton, J., Hoelscher-Obermaier, J., Khan, A., McLean, E., ... & Rahwan, I. (2025). Multi-agent risks from advanced AI. arXiv preprint arXiv:2502.14143.
[Hendrycks-A] Hendrycks, D., Carlini, N., Schulman, J., & Steinhardt, J. (2021). Unsolved problems in ML safety. arXiv preprint arXiv:2109.13916.
[Hendrycks-B] Hendrycks, D. (2024). Systemic factors. In Introduction to AI safety, ethics, and society. Center for AI Safety. https://www.aisafetybook.com/textbook/systemic-factors DOI: https://doi.org/10.1201/9781003530336
[Hobbhahn] Apollo Research. (2024, April 9). We need a science of evals. Apollo Research. https://www.apolloresearch.ai/blog/we-need-a-science-of-evals
[Hofstätter] Hofstätter, F., van der Weij, T., Teoh, J., Bartsch, H., & Ward, F. R. (2025). The Elicitation Game: Evaluating Capability Elicitation Techniques. arXiv preprint arXiv:2502.02180.
[Huang] Huang, T., Hu, S., Ilhan, F., Tekin, S. F., & Liu, L. (2024). Harmful fine-tuning attacks and defenses for large language models: A survey. arXiv preprint arXiv:2409.18169.
[Hubinger-A] Hubinger, E. (2020). An overview of 11 proposals for building safe advanced AI. arXiv preprint arXiv:2012.07532.
[Hubinger-B] Benton, J., Wagner, M., Christiansen, E., Anil, C., Perez, E., Srivastav, J., ... & Duvenaud, D. (2024). Sabotage evaluations for frontier models. arXiv preprint arXiv:2410.21514.
[Hubinger-C] Denison, C., MacDiarmid, M., Barez, F., Duvenaud, D., Kravec, S., Marks, S., ... & Hubinger, E. (2024). Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv preprint arXiv:2406.10162.
[Hubinger-D] Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., ... & Perez, E. (2024). Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566.
[Ilyas] Engstrom, L., Feldmann, A., & Madry, A. (2024). DsDm: Model-aware dataset selection with datamodels. arXiv preprint arXiv:2401.12926.
[International AI Safety Report-A] UK Government. (2025). International AI Safety Report 2025 (Accessible version). UK Department for Science, Innovation and Technology. https://assets.publishing.service.gov.uk/media/679a0c48a77d250007d313ee/International_AI_Safety_Report_2025_accessible_f.pdf
[International AI Safety Report-B] Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., ... & Zeng, Y. (2025). International AI Safety Report. arXiv preprint arXiv:2501.17805. DOI: https://doi.org/10.70777/si.v2i2.14755
[Irving] Irving, G., Christiano, P., & Amodei, D. (2018). AI safety via debate. arXiv preprint arXiv:1805.00899.
[Ji] Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., ... & Gao, W. (2023). AI alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852.
[Jin] Jin, H., Hu, L., Li, X., Zhang, P., Chen, C., Zhuang, J., & Wang, H. (2024). JailbreakZoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models. arXiv preprint arXiv:2407.01599.
[Jones] Jones, E., Dragan, A., & Steinhardt, J. (2024). Adversaries can misuse combinations of safe models. arXiv preprint arXiv:2406.14595.
[Jumper] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589. DOI: https://doi.org/10.1038/s41586-021-03819-2
[Khan] Khan, A., Casper, S., & Hadfield-Menell, D. (2025). Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs. arXiv preprint arXiv:2503.08688. DOI: https://doi.org/10.1145/3715275.3732147
[Keep the Future Human] Aguirre, A. (2025). Keep the future human. Future of Life Institute. https://keepthefuturehuman.ai/
[Kirchenbauer] Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023, July). A watermark for large language models. In International Conference on Machine Learning (pp. 17061-17084). PMLR.
[Korbak] Korbak, T., Balesni, M., Shlegeris, B., & Irving, G. (2025). How to evaluate control measures for LLM agents? A trajectory from today to superintelligence. arXiv preprint arXiv:2504.05259.
[Ladish] Bondarenko, A., Volk, D., Volkov, D., & Ladish, J. (2025). Demonstrating specification gaming in reasoning models. arXiv preprint arXiv:2502.13295.
[Lawrence] Lawrence, M., Shipman, M., Janzwood, S., Arnscheidt, C., Donges, J. F., Homer-Dixon, T., ... & Wunderling, N. (2024). Polycrisis Research and Action Roadmap: Gaps, opportunities, and priorities for polycrisis research and action.
[Li-A] Li, B., Qi, P., Liu, B., Di, S., Liu, J., Pei, J., ... & Zhou, B. (2023). Trustworthy AI: From principles to practices. ACM Computing Surveys, 55(9), 1-46. DOI: https://doi.org/10.1145/3555803
[Li-B] Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., ... & Hendrycks, D. (2024). The WMDP Benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218.
[Longpre] Longpre, S., Kapoor, S., Klyman, K., Ramaswami, A., Bommasani, R., Blili-Hamelin, B., ... & Henderson, P. (2024). A safe harbor for AI evaluation and red teaming. arXiv preprint arXiv:2403.04893.
[MacDermott] MacDermott, M., Fox, J., Belardinelli, F., & Everitt, T. (2024). Measuring Goal-Directedness. Advances in Neural Information Processing Systems, 37, 11412-11431.
[Maini] Maini, P., Goyal, S., Sam, D., Robey, A., Savani, Y., Jiang, Y., ... & Kolter, J. Z. (2025). Safety Pretraining: Toward the Next Generation of Safe AI. arXiv preprint arXiv:2504.16980.
[Mallah] Mallah, R. (2017). The landscape of AI safety and beneficence research: Input for brainstorming at Beneficial AI 2017. Future of Life Institute. https://futureoflife.org/landscape/ResearchLandscapeExtended.pdf
[Marks] Marks, S., Treutlein, J., Bricken, T., Lindsey, J., Marcus, J., Mishra-Sharma, S., ... & Hubinger, E. (2025). Auditing language models for hidden objectives. arXiv preprint arXiv:2503.10965.
[Mazeika] Mazeika, M., Yin, X., Tamirisa, R., Lim, J., Lee, B. W., Ren, R., Phan, L., Mu, N., Khoja, A., Zhang, O., & Hendrycks, D. (2025). Utility engineering: Analyzing and controlling emergent value systems in AIs. arXiv preprint arXiv:2502.08640. https://www.emergent-values.ai/
[Michael] Michael, J., Mahdi, S., Rein, D., Petty, J., Dirani, J., Padmakumar, V., & Bowman, S. R. (2023). Debate helps supervise unreliable experts. arXiv preprint arXiv:2311.08702.
[Michaud] Michaud, E. J., Liao, I., Lad, V., Liu, Z., Mudide, A., Loughridge, C., ... & Tegmark, M. (2024). Opening the AI Black Box: Distilling Machine-Learned Algorithms into Code. Entropy, 26(12), 1046. arXiv:2402.05110. DOI: https://doi.org/10.3390/e26121046
[Murray] Murray, M. (2025, April 9). AI risk management can learn a lot from other industries. AI Frontiers. https://www.ai-frontiers.org/articles/ai-risk-management-can-learn-a-lot-from-other-industries
[Nevo] Nevo, S., Lahav, D., Karput, A., Bar-On, Y., Bradley, H. A., & Alstott, J. (2024). Securing AI Model Weights. Technical report, RAND Corporation. https://www.rand.org/content/dam/rand/pubs/research_reports/RRA2800/RRA2849-1/RAND_RRA2849-1.pdf
[Ngo] Ngo, R., Chan, L., & Mindermann, S. (2022). The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626.
[NIST] National Institute of Standards and Technology. (2025, January 15). Updated guidelines for managing misuse risk for dual-use foundation models. https://www.nist.gov/news-events/news/2025/01/updated-guidelines-managing-misuse-risk-dual-use-foundation-models
[Omohundro] Omohundro, S. M. (2018). The basic AI drives. In Artificial intelligence safety and security (pp. 47-55). Chapman and Hall/CRC. DOI: https://doi.org/10.1201/9781351251389-3
[OpenAI-A] OpenAI. (2023, December 18). Preparedness framework (beta). https://cdn.openai.com/openai-preparedness-framework-beta.pdf
[OpenAI-B] OpenAI. (2024). How we think about safety and alignment. https://openai.com/safety/how-we-think-about-safety-alignment/
[OpenAI-C] OpenAI. (2024, December 5). OpenAI o1 system card. https://cdn.openai.com/o1-system-card-20241205.pdf
[OpenAI-D] Baker, B., Huizinga, J., Gao, L., Dou, Z., Guan, M. Y., Madry, A., ... & Farhi, D. (2025). Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926.
[OpenAI-E] OpenAI. (2025, March 10). Detecting misbehavior in frontier reasoning models. https://openai.com/index/chain-of-thought-monitoring/
[OpenAI-F] Baker, B., Huizinga, J., Gao, L., Dou, Z., Guan, M. Y., Madry, A., ... & Farhi, D. (2025). Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926.
[Paullada] Paullada, A., Raji, I. D., Bender, E. M., Denton, E., & Hanna, A. (2021). Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11), 100336. DOI: https://doi.org/10.1016/j.patter.2021.100336
[Phuong] Phuong, M., Aitchison, M., Catt, E., Cogan, S., Kaskasoli, A., Krakovna, V., ... & Shevlane, T. (2024). Evaluating frontier models for dangerous capabilities. arXiv preprint arXiv:2403.13793.
[Qi-A] Qi, X., Zeng, Y., Xie, T., Chen, P. Y., Jia, R., Mittal, P., & Henderson, P. (2023). Fine-tuning aligned language models compromises safety, even when users do not intend to!. arXiv preprint arXiv:2310.03693.
[Qi-B] Qi, X., Wei, B., Carlini, N., Huang, Y., Xie, T., He, L., ... & Henderson, P. (2024). On evaluating the durability of safeguards for open-weight LLMs. arXiv preprint arXiv:2412.07097.
[RAND] Kulp, G., Gonzales, D., Smith, E., Heim, L., Puri, P., Vermeer, M. J. D., & Winkelman, Z. (2024). Hardware-enabled governance mechanisms: Developing technical solutions to exempt items otherwise classified under export control classification numbers 3A090 and 4A090 (RAND Working Paper WR-A3056-1). RAND Corporation. https://www.rand.org/content/dam/rand/pubs/working_papers/WRA3000/WRA3056-1/RAND_WRA3056-1.pdf
[Raji] Raji, I. D., Bender, E. M., Paullada, A., Denton, E., & Hanna, A. (2021). AI and the everything in the whole wide world benchmark. arXiv preprint arXiv:2111.15366.
[Ren-A] Ren, R., Basart, S., Khoja, A., Gatti, A., Phan, L., Yin, X., ... & Hendrycks, D. (2024). Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?. Advances in Neural Information Processing Systems, 37, 68559-68594.
[Ren-B] Ren, R., Agarwal, A., Mazeika, M., Menghini, C., Vacareanu, R., Kenstler, B., ... & Hendrycks, D. (2025). The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems. arXiv preprint arXiv:2503.03750.
[Reuel] Reuel, A., Bucknall, B., Casper, S., Fist, T., Soder, L., Aarne, O., ... & Trager, R. (2024). Open problems in technical AI governance. arXiv preprint arXiv:2407.14981.
[Russell] Russell, S., Dewey, D., & Tegmark, M. (2015). Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4), 105-114. DOI: https://doi.org/10.1609/aimag.v36i4.2577
[Sastry] Sastry, G., Heim, L., Belfield, H., Anderljung, M., Brundage, M., Hazell, J., ... & Coyle, D. (2024). Computing power and the governance of artificial intelligence. arXiv preprint arXiv:2402.08797.
[Savani] Savani, Y., Trockman, A., Feng, Z., Schwarzschild, A., Robey, A., Finzi, M., & Kolter, J. Z. (2025). Antidistillation Sampling. arXiv preprint arXiv:2504.13146.
[Scheurer] Scheurer, J., Balesni, M., & Hobbhahn, M. (2023). Technical report: Large language models can strategically deceive their users when put under pressure. arXiv.
[Shah-A] Russell, S. (2020). Human-compatible AI: A progress report. Paper presented at the NeurIPS 2020 Workshop on Assistance. https://people.eecs.berkeley.edu/~russell/papers/neurips20ws-assistance
[Shah-B] Shah, R., Varma, V., Kumar, R., Phuong, M., Krakovna, V., Uesato, J., & Kenton, Z. (2022). Goal misgeneralization: Why correct specifications aren’t enough for correct goals. arXiv preprint arXiv:2210.01790.
[Sharkey] Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., ... & McGrath, T. (2025). Open Problems in Mechanistic Interpretability. arXiv preprint arXiv:2501.16496.
[Sharma] Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., ... & Perez, E. (2023). Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548.
[Shevlane] Shevlane, T., Farquhar, S., Garfinkel, B., Phuong, M., Whittlestone, J., Leung, J., ... & Dafoe, A. (2023). Model evaluation for extreme risks. arXiv preprint arXiv:2305.15324.
[Slattery] Slater, P., Patel, A., & Rahimi, A. (2024). AI Risk Repository. Massachusetts Institute of Technology. https://airisk.mit.edu/
[Soares] Soares, N., Fallenstein, B., Yudkowsky, E., & Armstrong, S. (2015). Corrigibility. https://cdn.aaai.org/ocs/ws/ws0067/10124-45900-1-PB.pdf
[Solaiman] Solaiman, I., Talat, Z., Agnew, W., Ahmad, L., Baker, D., Blodgett, S. L., ... & Subramonian, A. (2023). Evaluating the social impact of generative AI systems in systems and society. arXiv preprint arXiv:2306.05949.
[Sorensen] Sorensen, T., Moore, J., Fisher, J., Gordon, M., Mireshghallah, N., Rytting, C. M., ... & Choi, Y. (2024). A roadmap to pluralistic alignment. arXiv preprint arXiv:2402.05070.
[South] South, T., Marro, S., Hardjono, T., Mahari, R., Whitney, C. D., Greenwood, D., ... & Pentland, A. (2025). Authenticated Delegation and Authorized AI Agents. arXiv preprint arXiv:2501.09674.
[Thiel] Thiel, D. (2023). Identifying and eliminating CSAM in generative ML training data and models. Stanford Internet Observatory, Cyber Policy Center, December, 23, 3.
[Turpin] Turpin, M., Michael, J., Perez, E., & Bowman, S. (2023). Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 74952-74965.
[UK AISI] UK AI Security Institute. (2025). AISI Challenge Fund: Priority research areas 2025. https://cdn.prod.website-files.com/663bd486c5e4c81588db7a1d/67c99c8261da5261d5553893_AISI%20Challenge%20Fund_Priority%20Research%20Areas%202025%20(1).pdf
[Wallach] Wallach, H., Desai, M., Pangakis, N., Cooper, A. F., Wang, A., Barocas, S., ... & Jacobs, A. Z. (2024). Evaluating Generative AI Systems is a Social Science Measurement Challenge. arXiv preprint arXiv:2411.10939.
[Wang] Wang, S., Zhu, Y., Liu, H., Zheng, Z., Chen, C., & Li, J. (2024). Knowledge editing for large language models: A survey. ACM Computing Surveys, 57(3), 1-37. DOI: https://doi.org/10.1145/3698590
[Wasil-A] Wasil, A., Smith, E., Katzke, C., & Bullock, J. (2024). AI Emergency Preparedness: Examining the federal government’s ability to detect and respond to AI-related national security threats. arXiv preprint arXiv:2407.17347. DOI: https://doi.org/10.2139/ssrn.4896801
[Wasil-B] Wasil, A. (2024, June 7). What AI policy can learn from pandemic preparedness. Georgetown Security Studies Review. https://georgetownsecuritystudiesreview.org/2024/06/07/what-ai-policy-can-learn-from-pandemic-preparedness/
[Wei] Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail?. Advances in Neural Information Processing Systems, 36, 80079-80110.
[Weidinger-A] Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P. S., ... & Gabriel, I. (2021). Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359.
[Weidinger-B] Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hendricks, L. A., Mateos-Garcia, J., ... & Isaac, W. (2023). Sociotechnical safety evaluation of generative AI systems. arXiv preprint arXiv:2310.11986.
[Weld] Weld, D., & Etzioni, O. (2009). The First Law of Robotics: (A Call to Arms). In Safety and Security in Multiagent Systems: Research Results from 2004-2006 (pp. 90-100). Berlin, Heidelberg: Springer Berlin Heidelberg. DOI: https://doi.org/10.1007/978-3-642-04879-1_7
[Wen-A] Wen, J., Zhong, R., Khan, A., Perez, E., Steinhardt, J., Huang, M., ... & Feng, S. (2024). Language models learn to mislead humans via RLHF. arXiv preprint arXiv:2409.12822.
[Wen-B] Wen, X., Lou, J., Lu, X., Yang, J., Liu, Y., Lu, Y., ... & Yu, X. (2025). Scalable Oversight for Superhuman AI via Recursive Self-Critiquing. arXiv preprint arXiv:2502.04675.
[Yuan] Yuan, Y., Jiao, W., Wang, W., Huang, J. T., He, P., Shi, S., & Tu, Z. (2023). GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher. arXiv preprint arXiv:2308.06463.
[Zeng-A] Zeng, Y., Lu, E., & Sun, K. (2025). Principles on symbiosis for natural life and living artificial intelligence. AI Ethics, 5, 81–86. DOI: https://doi.org/10.1007/s43681-023-00364-8
[Zeng-B] Zhao, F., Wang, Y., Lu, E., Zhao, D., Han, B., Tong, H., ... & Zeng, Y. (2025). Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society. arXiv preprint arXiv:2504.17404.
[Ziegler] Ziegler, D., Nix, S., Chan, L., Bauman, T., Schmidt-Nielsen, P., Lin, T., ... & Thomas, N. (2022). Adversarial training for high-stakes reliability. Advances in Neural Information Processing Systems, 35, 9274-9286.
[Zhou] Zhou, Y., Liu, Y., Li, X., Jin, J., Qian, H., Liu, Z., ... & Yu, P. S. (2024). Trustworthiness in retrieval-augmented generation systems: A survey. arXiv preprint arXiv:2409.10102.
[Zou] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
License
Copyright (c) 2025 Yoshua Bengio, Max Tegmark, Stuart Russell, Dawn Song, Sören Mindermann, Lan Xue, Stephen Casper, Luke Ong, Vanessa Wilfred, Tegan Maharaj, Wan Sie Lee, Ya-Qin Zhang

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.