The Singapore Consensus on Global AI Safety Research Priorities

Building a Trustworthy, Reliable and Secure AI Ecosystem

Authors

  • Yoshua Bengio Université de Montréal; Co-President and Scientific Director of LawZero; Founder and Scientific Advisor of Mila – Quebec AI Institute
  • Max Tegmark Massachusetts Institute of Technology (MIT); Tegmark AI Safety Group; Future of Life Institute
  • Stuart Russell University of California Berkeley; Berkeley Artificial Intelligence Research Lab (BAIR); Center for Human Compatible Artificial Intelligence (CHAI)
  • Dawn Song University of California Berkeley
  • Sören Mindermann MILA - Quebec
  • Lan Xue Tsinghua University
  • Stephen Casper Massachusetts Institute of Technology
  • Luke Ong Nanyang Technological University
  • Vanessa Wilfred Infocomm Media Development Authority
  • Tegan Maharaj MILA
  • Wan Sie Lee Infocomm Media Development Authority
  • Ya-Qin Zhang Tsinghua University

DOI:

https://doi.org/10.70777/si.v2i5.15503

Keywords:

ai safety, ai governance, artificial general intelligence, superintelligence, mechanistic interpretability, agi, ai risk, ai benchmarks, ai metrics

Abstract

Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential – it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. This requires policymakers, industry, researchers and the broader public to collectively work toward securing positive outcomes from AI’s development. AI safety research is a key dimension. Given that the state of science today for building trustworthy AI does not fully cover all risks, accelerated investment in research is required to keep pace with commercially driven growth in system capabilities. Goals: The 2025 Singapore Conference on AI (SCAI): International Scientific Exchange on AI Safety aims to support research in this important space by bringing together AI scientists across geographies to identify and synthesise research priorities in AI safety. The result, The Singapore Consensus on Global AI Safety Research Priorities, builds on the International AI Safety Report-A (IAISR) chaired by Yoshua Bengio and backed by 33 governments. By adopting a defence-in-depth model, this document organises AI safety research domains into three types: challenges with creating trustworthy AI systems (Development), challenges with evaluating their risks (Assessment), and challenges with monitoring and intervening after deployment (Control). Through the Singapore Consensus, we hope to globally facilitate meaningful conversations between AI scientists and AI policymakers for maximally beneficial outcomes. Our goal is to enable more impactful R&D efforts to rapidly develop safety and evaluation mechanisms and foster a trusted ecosystem where AI is harnessed for the public good.

Author Biographies

Yoshua Bengio, Université de Montréal; Co-President and Scientific Director of LawZero; Founder and Scientific Advisor of Mila – Quebec AI Institute

Recognized worldwide as one of the leading experts in artificial intelligence, Yoshua Bengio is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of Computing,” with Geoffrey Hinton and Yann LeCun. He is the computer scientist with the largest number of citations and the highest h-index.

He is Full Professor at Université de Montréal, Co-President and Scientific Director of LawZero and Founder and Scientific Advisor of Mila – Quebec AI Institute. He co-directs the CIFAR Learning in Machines & Brains program and acts as Special Advisor and Founding Scientific Director of IVADO.

He has received numerous awards, including the prestigious Killam Prize and Herzberg Gold Medal in Canada, CIFAR’s AI Chair, Spain’s Princess of Asturias Award, and the VinFuture Prize. He is a Fellow of both the Royal Society of London and the Royal Society of Canada, Knight of the Legion of Honor of France, Officer of the Order of Canada, and Member of the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology. In 2024, TIME magazine named him one of the 100 most influential people in the world.

Concerned about the social impact of AI, he actively contributed to the Montreal Declaration for the Responsible Development of Artificial Intelligence and currently chairs the International AI Safety Report.

Max Tegmark, Massachusetts Institute of Technology (MIT); Tegmark AI Safety Group; Future of Life Institute

Max Tegmark is a professor doing AI and physics research at MIT as part of the Institute for Artificial Intelligence & Fundamental Interactions and the Center for Brains, Minds and Machines. He is the author of over 300 publications as well as the New York Times bestsellers “Life 3.0: Being Human in the Age of Artificial Intelligence” and “Our Mathematical Universe: My Quest for the Ultimate Nature of Reality”. His most recent AI safety research focuses on mechanistic interpretability and guaranteed safe AI, and he also researches news bias detection with machine-learning. Max is a Fellow of the American Physical Society and holds a gold medal from the Royal Swedish Academy of Engineering Science. Time Magazine named him one of the 100 Most Influential People in AI 2023. He is a serial founder of non-profits, including the Future of Life Institute, the Foundational Questions Institute and Improve the News Foundation.   

Stuart Russell, University of California Berkeley; Berkeley Artificial Intelligence Research Lab (BAIR); Center for Human Compatible Artificial Intelligence (CHAI)

Stuart Russell received his B.A. with first-class honours in physics from Oxford University in 1982 and his Ph.D. in computer science from Stanford in 1986. He then joined the faculty of the University of California at Berkeley, where he is Professor (and formerly Chair) of Electrical Engineering and Computer Sciences and holder of the Smith-Zadeh Chair in Engineering. He is co-chair of the World Economic Forum Council on AI and the OECD Expert Group on AI Futures, and he is a US representative to the Global Partnership on AI. From 2011 to 2014 he also served as an Adjunct Professor of Neurological Surgery at UC San Francisco. Russell is a recipient of the Presidential Young Investigator Award of the National Science Foundation, the IJCAI Computers and Thought Award, the IJCAI Research Excellence Award, the ACM Allen Newell Award, the AAAI Feigenbaum Prize, the World Technology Award (Policy category), the Mitchell Prize of the American Statistical Association and the International Society for Bayesian Analysis, the ACM Karlstrom Outstanding Educator Award, and the AAAI/EAAI Outstanding Educator Award. In 1998, he gave the Forsythe Memorial Lectures at Stanford University and from 2012 to 2014 he held the Chaire Blaise Pascal in Paris. In 2021 he received an OBE from Her Majesty Queen Elizabeth and gave the BBC Reith Lectures. He is an Honorary Fellow of Wadham College, Oxford, an Andrew Carnegie Fellow, an AI2050 Senior Fellow, and a Fellow of AAAI, ACM, and AAAS. His research covers a wide range of topics in artificial intelligence including machine learning, probabilistic reasoning, knowledge representation, planning, real-time decision making, multitarget tracking, computer vision, computational physiology, global seismic monitoring, and philosophical foundations. His textbook "Artificial Intelligence: A Modern Approach" (with Peter Norvig) is used in over 1,500 universities in 135 countries. 
His current concerns include the threat of autonomous weapons and the long-term future of artificial intelligence and its relation to humanity. The latter topic is the subject of his book, "Human Compatible: AI and the Problem of Control".  

References

[Alaga] Alaga, J., Schuett, J., & Anderljung, M. (2024). A Grading Rubric for AI Safety Frameworks. arXiv preprint arXiv:2409.08751.

[AISES] Hendrycks, D. (2024). Systemic safety. In AI safety, ethics, and society textbook. Center for AI Safety. https://www.aisafetybook.com/textbook/systemic-safety DOI: https://doi.org/10.1201/9781003530336-1

[Ashby] Ashby, W. R. (1956). An introduction to Cybernetics. Chapman & Hall. https://philpapers.org/archive/ASHAIT.pdf DOI: https://doi.org/10.5962/bhl.title.5851

[Anthropic-A] Anthropic. (2024). Anthropic Economic Index. https://www.anthropic.com/economic-index

[Anthropic-B] Anthropic. (2024). Claude 3.7 Sonnet system card. https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf

[Anthropic-C] Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., ... & Hubinger, E. (2024). Alignment faking in large language models. arXiv preprint arXiv:2412.14093.

[Anthropic-D] Chen, Y., Benton, J., Radhakrishnan, A., Denison, J. U. C., Schulman, J., Somani, A., ... & Perez, E. Reasoning Models Don’t Always Say What They Think.

[Anthropic-E] Anthropic. (2024, October 15). Responsible Scaling Policy (Version 2.0). Anthropic. https://www.anthropic.com/responsible-scaling-policy DOI: https://doi.org/10.70777/si.v2i1.13657

[Anthropic-F] Anthropic. (2025). Recommended directions. Anthropic Alignment. https://alignment.anthropic.com/2025/recommended-directions/

[Anthropic-G] Anthropic. (2025, April 24). Modifying LLM beliefs with synthetic document finetuning. Anthropic Alignment Research. https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/

[Anthropic-H] Anthropic. (2024, August 8). Expanding our model safety bug bounty program. https://www.anthropic.com/news/model-safety-bug-bounty

[Anwar] Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., ... & Krueger, D. (2024). Foundational challenges in assuring alignment and safety of large language models. arXiv preprint arXiv:2404.09932.

[Apollo] Hobbhahn, M. (2024, January 22). We need a science of evals. Apollo Research. https://www.apolloresearch.ai/blog/we-need-a-science-of-evals

[Armstrong] Armstrong, S., & Mindermann, S. (2018). Occam’s razor is insufficient to infer the preferences of irrational agents. Advances in neural information processing systems, 31.

[Barez] Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., ... & Hendrycks, D. (2024). The WMDP Benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218.

[Bateman] Bateman, J., Baer, D., Bell, S. A., Brown, G. O., Cuéllar, M. F. T., Ganguli, D., ... & Zvyagina, P. (2024). Beyond open vs. closed: Emerging consensus and key questions for foundation AI model governance.

[Baum] Baum, S. D. (2020). Social choice ethics in artificial intelligence. AI & Society, 35(1), 165-176. DOI: https://doi.org/10.1007/s00146-017-0760-1

[Bengio-A] Bengio, Y., Hinton, G., Yao, A., Song, D., Abbeel, P., Darrell, T., ... & Mindermann, S. (2024). Managing extreme AI risks amid rapid progress. Science, 384(6698), 842–845. DOI: https://doi.org/10.1126/science.adn0117

[Bengio-B] Bengio, Y., Cohen, M., Fornasiere, D., Ghosn, J., Greiner, P., MacDermott, M., ... & Williams-King, D. (2025). Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?. arXiv preprint arXiv:2502.15657. DOI: https://doi.org/10.70777/si.v2i5.15569

[Berglund] Berglund, L., Stickland, A. C., Balesni, M., Kaufmann, M., Tong, M., Korbak, T., ... & Evans, O. (2023). Taken out of context: On measuring situational awareness in LLMs. arXiv preprint arXiv:2309.00667.

[Bernardi] Bernardi, J., Mukobi, G., Greaves, H., Heim, L., & Anderljung, M. (2024). Societal adaptation to advanced AI. arXiv preprint arXiv:2405.10295.

[Betley] Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., ... & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424.

[Birhane-A] Birhane, A., Steed, R., Ojewale, V., Vecchione, B., & Raji, I.D. (2024, April). AI auditing: The broken bus on the road to AI accountability. In 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) (pp. 612-643). IEEE. DOI: https://doi.org/10.1109/SaTML59370.2024.00037

[Birhane-B] Birhane, A., Prabhu, V., Han, S., & Boddeti, V. N. (2023). On hate scaling laws for data-swamps. arXiv preprint arXiv:2306.13141.

[Bordt] Bordt, S. (2023). Explainable machine learning and its limitations (Doctoral dissertation, Universität Tübingen).

[Bucknall-A] Bucknall, B., Trager, R. F., & Osborne, M. A. (2025). Position: Ensuring mutual privacy is necessary for effective external evaluation of proprietary AI systems. arXiv preprint arXiv:2503.01470.

[Bucknall-B] Bucknall, B., Siddiqui, S., Thurnherr, L., McGurk, C., Harack, B., Reuel, A., ... & Trager, R. (2025). In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?. arXiv preprint arXiv:2504.12914. DOI: https://doi.org/10.70777/si.v2i5.15509

[Buhl] Buhl, M. D., Sett, G., Koessler, L., Schuett, J., & Anderljung, M. (2024). Safety cases for frontier AI. arXiv preprint arXiv:2410.21572.

[Campos] Campos, S., Papadatos, H., Roger, F., Touzet, C., Quarks, O., & Murray, M. (2025). A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management. arXiv preprint arXiv:2502.06656.

[Cao] Cao, L. (2025). Watermarking for AI Content Detection: A Review on Text, Visual, and Audio Modalities. arXiv preprint arXiv:2504.03765.

[CAIS] Hinton, G., Bengio, Y., Hassabis, D., Altman, S., Amodei, D., … Statement on AI Risk. Center for AI Safety. https://safe.ai/work/statement-on-ai-risk

[Casper-A] Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., ... & Hadfield-Menell, D. (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.

[Casper-B] Casper, S., Ezell, C., Siegmann, C., Kolt, N., Curtis, T. L., Bucknall, B., ... & Hadfield-Menell, D. (2024, June). Black-box access is insufficient for rigorous AI audits. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (pp. 2254-2272). DOI: https://doi.org/10.1145/3630106.3659037

[Casper-C] Casper, S., Krueger, D., & Hadfield-Menell, D. (2025). Pitfalls of Evidence-Based AI Policy. arXiv preprint arXiv:2502.09618. DOI: https://doi.org/10.70777/si.v2i2.14611

[Chan] Chan, A., Wei, K., Huang, S., Rajkumar, N., Perrier, E., Lazar, S., ... & Anderljung, M. (2025). Infrastructure for AI Agents. arXiv preprint arXiv:2501.10114.

[Che] Che, Z., Casper, S., Kirk, R., Satheesh, A., Slocum, S., McKinney, L. E., ... & Hadfield-Menell, D. (2025). Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities. arXiv preprint arXiv:2502.05209.

[Cheng] Cheng, P., Wu, Z., Du, W., Zhao, H., Lu, W., & Liu, G. (2023). Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review. arXiv preprint arXiv:2309.06055.

[Clymer] Clymer, J., Gabrieli, N., Krueger, D., & Larsen, T. (2024). Safety cases: How to justify the safety of advanced AI systems. arXiv preprint arXiv:2403.10462.

[Critch] Critch, A., & Krueger, D. (2020). AI research considerations for human existential safety (ARCHES). arXiv preprint arXiv:2006.04948.

[Dalrymple] Dalrymple, D., Skalse, J., Bengio, Y., Russell, S., Tegmark, M., Seshia, S., ... & Tenenbaum, J. (2024). Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. arXiv preprint arXiv:2405.06624.

[DeepSeek] Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., ... & Piao, Y. (2024). Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.

[Dekker] Dekker, S. (2019). Foundations of safety science: A century of understanding accidents and disasters. Routledge. https://books.google.co.uk/books?id=dwWSDwAAQBAJ DOI: https://doi.org/10.4324/9781351059794

[Engels] Engels, J., Baek, D. D., Kantamneni, S., & Tegmark, M. (2024). Scaling laws for scalable oversight (arXiv preprint). arXiv. https://arxiv.org/abs/2504.18530

[Engstrom] Engstrom, L., Feldmann, A., & Madry, A. (2024). DsDm: Model-aware dataset selection with datamodels. arXiv preprint arXiv:2401.12926.

[Eriksson] Eriksson, M., Purificato, E., Noroozian, A., Vinagre, J., Chaslot, G., Gomez, E., & Fernandez-Llorca, D. (2025). Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. arXiv preprint arXiv:2502.06559. DOI: https://doi.org/10.1609/aies.v8i1.36595

[EU] European Union. (2024). The EU Artificial Intelligence Act.

[Evans-A] Evans, O., Cotton-Barratt, O., Finnveden, L., Bales, A., Balwit, A., Wills, P., ... & Saunders, W. (2021). Truthful AI: Developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674.

[Evans-B] Evans, O., Saunders, W., & Stuhlmüller, A. (2019). Machine learning projects for iterated distillation and amplification.

[Everitt] Everitt, T., Filan, D., Daswani, M., & Hutter, M. (2016). Self-modification of policy and utility function in rational agents. In Artificial General Intelligence: 9th International Conference, AGI 2016, New York, NY, USA, July 16-19, 2016, Proceedings 9 (pp. 1-11). Springer International Publishing. DOI: https://doi.org/10.1007/978-3-319-41649-6_1

[Fortune] Fortune. (2023, July). Brainstorm Tech 2023: How Anthropic is paving the way for responsible A.I. [Video]. Fortune. https://fortune.com/videos/watch/brainstorm-tech-2023%3A-how-anthropic-is-paving-the-way-for-responsible-a.i./88517d5f-b5c3-40ac-b5bb-8368afc95acd

[Field] Field, S. (2025). Why do Experts Disagree on Existential Risk and P (doom)? A Survey of AI Experts. arXiv preprint arXiv:2502.14870. DOI: https://doi.org/10.1007/s43681-025-00762-0

[Google] Google DeepMind. (2024, April 9). Updating the Frontier Safety Framework. DeepMind. https://deepmind.google/discover/blog/updating-the-frontier-safety-framework/

[Gryz] Gryz, J., & Rojszczak, M. (2021). Black box algorithms and the rights of individuals: No easy solution to the “explainability” problem. Internet Policy Review, 10(2), 1-24. DOI: https://doi.org/10.14763/2021.2.1564

[IAISR] Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., ... & Zeng, Y. (2025). International AI Safety Report. arXiv preprint arXiv:2501.17805.

[Gabriel] Gabriel, I., Manzini, A., Keeling, G., Hendricks, L. A., Rieser, V., Iqbal, H., ... & Manyika, J. (2024). The ethics of advanced AI assistants. arXiv preprint arXiv:2404.16244.

[GDM] Shah, R., Irpan, A., Turner, A. M., Wang, A., Conmy, A., Lindner, D., ... & Dragan, A. (2025). An approach to technical AGI safety and security. Google DeepMind. https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/evaluating-potential-cybersecurity-threats-of-advanced-ai/An_Approach_to_Technical_AGI_Safety_Apr_2025.pdf

[Greenblatt-A] Greenblatt, R., Shlegeris, B., Sachan, K., & Roger, F. (2023). AI control: Improving safety despite intentional subversion. arXiv preprint arXiv:2312.06942.

[Greenblatt-B] Greenblatt, R., Roger, F., Krasheninnikov, D., & Krueger, D. (2024). Stress-testing capability elicitation with password-locked models. arXiv preprint arXiv:2405.19550.

[Griffin] Griffin, C., Thomson, L., Shlegeris, B., & Abate, A. (2024). Games for AI control: Models of safety evaluations of AI deployment protocols. arXiv preprint arXiv:2409.07985.

[Grosse] Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., ... & Bowman, S. R. (2023). Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.

[Hadfield-Menell] Hadfield-Menell, D., Russell, S. J., Abbeel, P., & Dragan, A. (2016). Cooperative inverse reinforcement learning. Advances in neural information processing systems, 29.

[Hammond] Hammond, L., Chan, A., Clifton, J., Hoelscher-Obermaier, J., Khan, A., McLean, E., ... & Rahwan, I. (2025). Multi-agent risks from advanced AI. arXiv preprint arXiv:2502.14143.

[Hendrycks-A] Hendrycks, D., Carlini, N., Schulman, J., & Steinhardt, J. (2021). Unsolved problems in ML safety. arXiv preprint arXiv:2109.13916.

[Hendrycks-B] Hendrycks, D. (2024). Systemic factors. In Introduction to AI safety, ethics, and society. Center for AI Safety. https://www.aisafetybook.com/textbook/systemic-factors DOI: https://doi.org/10.1201/9781003530336

[Hobbhahn] Apollo Research. (2024, April 9). We need a science of evals. Apollo Research. https://www.apolloresearch.ai/blog/we-need-a-science-of-evals

[Hofstätter] Hofstätter, F., van der Weij, T., Teoh, J., Bartsch, H., & Ward, F. R. (2025). The Elicitation Game: Evaluating Capability Elicitation Techniques. arXiv preprint arXiv:2502.02180.

[Huang] Huang, T., Hu, S., Ilhan, F., Tekin, S. F., & Liu, L. (2024). Harmful fine-tuning attacks and defenses for large language models: A survey. arXiv preprint arXiv:2409.18169.

[Hubinger-A] Hubinger, E. (2020). An overview of 11 proposals for building safe advanced AI. arXiv preprint arXiv:2012.07532.

[Hubinger-B] Benton, J., Wagner, M., Christiansen, E., Anil, C., Perez, E., Srivastav, J., ... & Duvenaud, D. (2024). Sabotage evaluations for frontier models. arXiv preprint arXiv:2410.21514.

[Hubinger-C] Denison, C., MacDiarmid, M., Barez, F., Duvenaud, D., Kravec, S., Marks, S., ... & Hubinger, E. (2024). Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv preprint arXiv:2406.10162.

[Hubinger-D] Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., ... & Perez, E. (2024). Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566.

[Ilyas] Engstrom, L., Feldmann, A., & Madry, A. (2024). DsDm: Model-aware dataset selection with datamodels. arXiv preprint arXiv:2401.12926.

[International AI Safety Report-A] UK Government. (2025). International AI Safety Report 2025 (Accessible version). UK Department for Science, Innovation and Technology. https://assets.publishing.service.gov.uk/media/679a0c48a77d250007d313ee/International_AI_Safety_Report_2025_accessible_f.pdf

[International AI Safety Report-B] Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., ... & Zeng, Y. (2025). International AI Safety Report. arXiv preprint arXiv:2501.17805. DOI: https://doi.org/10.70777/si.v2i2.14755

[Irving] Irving, G., Christiano, P., & Amodei, D. (2018). AI safety via debate. arXiv preprint arXiv:1805.00899.

[Ji] Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., ... & Gao, W. (2023). AI alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852.

[Jin] Jin, H., Hu, L., Li, X., Zhang, P., Chen, C., Zhuang, J., & Wang, H. (2024). Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models. arXiv preprint arXiv:2407.01599.

[Jones] Jones, E., Dragan, A., & Steinhardt, J. (2024). Adversaries can misuse combinations of safe models. arXiv preprint arXiv:2406.14595.

[Jumper] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589. DOI: https://doi.org/10.1038/s41586-021-03819-2

[Khan] Khan, A., Casper, S., & Hadfield-Menell, D. (2025). Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs. arXiv preprint arXiv:2503.08688. DOI: https://doi.org/10.1145/3715275.3732147

[Keep the Future Human] Aguirre, A. (2025). Keep the future human. Future of Life Institute. https://keepthefuturehuman.ai/

[Kirchenbauer] Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023, July). A watermark for large language models. In International Conference on Machine Learning (pp. 17061-17084). PMLR.

[Korbak] Korbak, T., Balesni, M., Shlegeris, B., & Irving, G. (2025). How to evaluate control measures for LLM agents? A trajectory from today to superintelligence. arXiv preprint arXiv:2504.05259.

[Ladish] Bondarenko, A., Volk, D., Volkov, D., & Ladish, J. (2025). Demonstrating specification gaming in reasoning models. arXiv preprint arXiv:2502.13295.

[Lawrence] Lawrence, M., Shipman, M., Janzwood, S., Arnscheidt, C., Donges, J. F., Homer-Dixon, T., ... & Wunderling, N. (2024). Polycrisis Research and Action Roadmap-Gaps, opportunities, and priorities for polycrisis research and action.

[Li-A] Li, B., Qi, P., Liu, B., Di, S., Liu, J., Pei, J., ... & Zhou, B. (2023). Trustworthy AI: From principles to practices. ACM Computing Surveys, 55(9), 1-46. DOI: https://doi.org/10.1145/3555803

[Li-B] Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., ... & Hendrycks, D. (2024). The WMDP Benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218.

[Longpre] Longpre, S., Kapoor, S., Klyman, K., Ramaswami, A., Bommasani, R., Blili-Hamelin, B., ... & Henderson, P. (2024). A safe harbor for AI evaluation and red teaming. arXiv preprint arXiv:2403.04893.

[MacDermott] MacDermott, M., Fox, J., Belardinelli, F., & Everitt, T. (2024). Measuring Goal-Directedness. Advances in Neural Information Processing Systems, 37, 11412-11431.

[Maini] Maini, P., Goyal, S., Sam, D., Robey, A., Savani, Y., Jiang, Y., ... & Kolter, J. Z. (2025). Safety Pretraining: Toward the Next Generation of Safe AI. arXiv preprint arXiv:2504.16980.

[Mallah] Mallah, R. (2017). The landscape of AI safety and beneficence research: Input for brainstorming at Beneficial AI 2017. Future of Life Institute. https://futureoflife.org/landscape/ResearchLandscapeExtended.pdf

[Marks] Marks, S., Treutlein, J., Bricken, T., Lindsey, J., Marcus, J., Mishra-Sharma, S., ... & Hubinger, E. (2025). Auditing language models for hidden objectives. arXiv preprint arXiv:2503.10965.

[Mazeika] Mazeika, M., Yin, X., Tamirisa, R., Lim, J., Lee, B. W., Ren, R., Phan, L., Mu, N., Khoja, A., Zhang, O., & Hendrycks, D. (2025). Utility engineering: Analyzing and controlling emergent value systems in AIs. arXiv preprint arXiv:2502.08640. https://www.emergent-values.ai/

[Michael] Michael, J., Mahdi, S., Rein, D., Petty, J., Dirani, J., Padmakumar, V., & Bowman, S. R. (2023). Debate helps supervise unreliable experts. arXiv preprint arXiv:2311.08702.

[Michaud] Michaud, E. J., Liao, I., Lad, V., Liu, Z., Mudide, A., Loughridge, C., ... & Tegmark, M. (2024). Opening the AI Black Box: Distilling Machine-Learned Algorithms into Code. Entropy, 26(12), 1046, arXiv:2402.05110 (2024). DOI: https://doi.org/10.3390/e26121046

[Murray] Murray, M. (2025, April 9). AI risk management can learn a lot from other industries. AI Frontiers. https://www.ai-frontiers.org/articles/ai-risk-management-can-learn-a-lot-from-other-industries

[Nevo] Nevo, S., Lahav, D., Karput, A., Bar-On, Y., Bradley, H. A., & Alstott, J. (2024). Securing AI Model Weights. Technical report, RAND Corporation, 2024. https://www.rand.org/content/dam/rand/pubs/research_reports/RRA2800/RRA2849-1/RAND_RRA2849-1.pdf

[Ngo] Ngo, R., Chan, L., & Mindermann, S. (2022). The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626.

[NIST] National Institute of Standards and Technology. (2025, January 15). Updated guidelines for managing misuse risk for dual-use foundation models. https://www.nist.gov/news-events/news/2025/01/updated-guidelines-managing-misuse-risk-dual-use-foundation-models

[Omohundro] Omohundro, S. M. (2018). The basic AI drives. In Artificial intelligence safety and security (pp. 47-55). Chapman and Hall/CRC. DOI: https://doi.org/10.1201/9781351251389-3

[OpenAI-A] OpenAI. (2023, December 18). Preparedness framework (beta). https://cdn.openai.com/openai-preparedness-framework-beta.pdf

[OpenAI-B] OpenAI. (2024). How we think about safety and alignment. https://openai.com/safety/how-we-think-about-safety-alignment/

[OpenAI-C] OpenAI. (2024, December 5). OpenAI o1 system card. https://cdn.openai.com/o1-system-card-20241205.pdf

[OpenAI-D] Baker, B., Huizinga, J., Gao, L., Dou, Z., Guan, M. Y., Madry, A., ... & Farhi, D. (2025). Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926.

[OpenAI-E] OpenAI. (2025, March 10). Detecting misbehavior in frontier reasoning models. https://openai.com/index/chain-of-thought-monitoring/

[OpenAI-F] Baker, B., Huizinga, J., Gao, L., Dou, Z., Guan, M. Y., Madry, A., ... & Farhi, D. (2025). Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926.

[Paullada] Paullada, A., Raji, I. D., Bender, E. M., Denton, E., & Hanna, A. (2021). Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11), 100336. DOI: https://doi.org/10.1016/j.patter.2021.100336

[Phuong] Phuong, M., Aitchison, M., Catt, E., Cogan, S., Kaskasoli, A., Krakovna, V., ... & Shevlane, T. (2024). Evaluating frontier models for dangerous capabilities. arXiv preprint arXiv:2403.13793.

[Qi-A] Qi, X., Zeng, Y., Xie, T., Chen, P. Y., Jia, R., Mittal, P., & Henderson, P. (2023). Fine-tuning aligned language models compromises safety, even when users do not intend to!. arXiv preprint arXiv:2310.03693.

[Qi-B] Qi, X., Wei, B., Carlini, N., Huang, Y., Xie, T., He, L., ... & Henderson, P. (2024). On evaluating the durability of safeguards for open-weight LLMs. arXiv preprint arXiv:2412.07097.

[RAND] Kulp, G., Gonzales, D., Smith, E., Heim, L., Puri, P., Vermeer, M. J. D., & Winkelman, Z. (2024). Hardware-enabled governance mechanisms: Developing technical solutions to exempt items otherwise classified under export control classification numbers 3A090 and 4A090 (RAND Working Paper WR-A3056-1). RAND Corporation. https://www.rand.org/content/dam/rand/pubs/working_papers/WRA3000/WRA3056-1/RAND_WRA3056-1.pdf

[Raji] Raji, I. D., Bender, E. M., Paullada, A., Denton, E., & Hanna, A. (2021). AI and the everything in the whole wide world benchmark. arXiv preprint arXiv:2111.15366.

[Ren-A] Ren, R., Basart, S., Khoja, A., Gatti, A., Phan, L., Yin, X., ... & Hendrycks, D. (2024). Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?. Advances in Neural Information Processing Systems, 37, 68559-68594.

[Ren-B] Ren, R., Agarwal, A., Mazeika, M., Menghini, C., Vacareanu, R., Kenstler, B., ... & Hendrycks, D. (2025). The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems. arXiv preprint arXiv:2503.03750.

[Reuel] Reuel, A., Bucknall, B., Casper, S., Fist, T., Soder, L., Aarne, O., ... & Trager, R. (2024). Open problems in technical AI governance. arXiv preprint arXiv:2407.14981.

[Russell] Russell, S., Dewey, D., & Tegmark, M. (2015). Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4), 105-114. DOI: https://doi.org/10.1609/aimag.v36i4.2577

[Sastry] Sastry, G., Heim, L., Belfield, H., Anderljung, M., Brundage, M., Hazell, J., ... & Coyle, D. (2024). Computing power and the governance of artificial intelligence. arXiv preprint arXiv:2402.08797.

[Savani] Savani, Y., Trockman, A., Feng, Z., Schwarzschild, A., Robey, A., Finzi, M., & Kolter, J. Z. (2025). Antidistillation Sampling. arXiv preprint arXiv:2504.13146.

[Scheurer] Scheurer, J., Balesni, M., & Hobbhahn, M. (2023). Technical report: Large language models can strategically deceive their users when put under pressure. arXiv.

[Shah-A] Russell, S. (2020). Human-compatible AI: A progress report. Paper presented at the NeurIPS 2020 Workshop on Assistance. https://people.eecs.berkeley.edu/~russell/papers/neurips20ws-assistance

[Shah-B] Shah, R., Varma, V., Kumar, R., Phuong, M., Krakovna, V., Uesato, J., & Kenton, Z. (2022). Goal misgeneralization: Why correct specifications aren’t enough for correct goals. arXiv preprint arXiv:2210.01790.

[Sharkey] Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., ... & McGrath, T. (2025). Open Problems in Mechanistic Interpretability. arXiv preprint arXiv:2501.16496.

[Sharma] Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., ... & Perez, E. (2023). Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548.

[Shevlane] Shevlane, T., Farquhar, S., Garfinkel, B., Phuong, M., Whittlestone, J., Leung, J., ... & Dafoe, A. (2023). Model evaluation for extreme risks. arXiv preprint arXiv:2305.15324.

[Slattery] Slater, P., Patel, A., & Rahimi, A. (2024). AI Risk Repository. Massachusetts Institute of Technology. https://airisk.mit.edu/

[Soares] Soares, N., Fallenstein, B., Yudkowsky, E., & Armstrong, S. (2015). Corrigibility. https://cdn.aaai.org/ocs/ws/ws0067/10124-45900-1-PB.pdf

[Solaiman] Solaiman, I., Talat, Z., Agnew, W., Ahmad, L., Baker, D., Blodgett, S. L., ... & Subramonian, A. (2023). Evaluating the social impact of generative AI systems in systems and society. arXiv preprint arXiv:2306.05949.

[Sorensen] Sorensen, T., Moore, J., Fisher, J., Gordon, M., Mireshghallah, N., Rytting, C. M., ... & Choi, Y. (2024). A roadmap to pluralistic alignment. arXiv preprint arXiv:2402.05070.

[South] South, T., Marro, S., Hardjono, T., Mahari, R., Whitney, C. D., Greenwood, D., ... & Pentland, A. (2025). Authenticated Delegation and Authorized AI Agents. arXiv preprint arXiv:2501.09674.

[Thiel] Thiel, D. (2023). Identifying and eliminating CSAM in generative ML training data and models. Stanford Internet Observatory, Cyber Policy Center, December, 23, 3.

[Turpin] Turpin, M., Michael, J., Perez, E., & Bowman, S. (2023). Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 74952-74965.

[UK AISI] UK AI Security Institute. (2025). AISI Challenge Fund: Priority research areas 2025. https://cdn.prod.website-files.com/663bd486c5e4c81588db7a1d/67c99c8261da5261d5553893_AISI%20Challenge%20Fund_Priority%20Research%20Areas%202025%20(1).pdf

[Wallach] Wallach, H., Desai, M., Pangakis, N., Cooper, A. F., Wang, A., Barocas, S., ... & Jacobs, A. Z. (2024). Evaluating Generative AI Systems is a Social Science Measurement Challenge. arXiv preprint arXiv:2411.10939.

[Wang] Wang, S., Zhu, Y., Liu, H., Zheng, Z., Chen, C., & Li, J. (2024). Knowledge editing for large language models: A survey. ACM Computing Surveys, 57(3), 1-37. DOI: https://doi.org/10.1145/3698590

[Wasil-A] Wasil, A., Smith, E., Katzke, C., & Bullock, J. (2024). AI Emergency Preparedness: Examining the federal government’s ability to detect and respond to AI-related national security threats. arXiv preprint arXiv:2407.17347. DOI: https://doi.org/10.2139/ssrn.4896801

[Wasil-B] Wasil, A. (2024, June 7). What AI policy can learn from pandemic preparedness. Georgetown Security Studies Review. https://georgetownsecuritystudiesreview.org/2024/06/07/what-ai-policy-can-learn-from-pandemic-preparedness/

[Wei] Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 80079-80110.

[Weidinger-A] Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P. S., ... & Gabriel, I. (2021). Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359.

[Weidinger-B] Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hendricks, L. A., Mateos-Garcia, J., ... & Isaac, W. (2023). Sociotechnical safety evaluation of generative AI systems. arXiv preprint arXiv:2310.11986.

[Weld] Weld, D., & Etzioni, O. (2009). The First Law of Robotics: (A Call to Arms). In Safety and Security in Multiagent Systems: Research Results from 2004-2006 (pp. 90-100). Berlin, Heidelberg: Springer Berlin Heidelberg. DOI: https://doi.org/10.1007/978-3-642-04879-1_7

[Wen-A] Wen, J., Zhong, R., Khan, A., Perez, E., Steinhardt, J., Huang, M., ... & Feng, S. (2024). Language models learn to mislead humans via RLHF. arXiv preprint arXiv:2409.12822.

[Wen-B] Wen, X., Lou, J., Lu, X., Yang, J., Liu, Y., Lu, Y., ... & Yu, X. (2025). Scalable Oversight for Superhuman AI via Recursive Self-Critiquing. arXiv preprint arXiv:2502.04675.

[Yuan] Yuan, Y., Jiao, W., Wang, W., Huang, J. T., He, P., Shi, S., & Tu, Z. (2023). GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher. arXiv preprint arXiv:2308.06463.

[Zeng-A] Zeng, Y., Lu, E., & Sun, K. (2025). Principles on symbiosis for natural life and living artificial intelligence. AI Ethics, 5, 81–86. DOI: https://doi.org/10.1007/s43681-023-00364-8

[Zeng-B] Zhao, F., Wang, Y., Lu, E., Zhao, D., Han, B., Tong, H., ... & Zeng, Y. (2025). Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society. arXiv preprint arXiv:2504.17404.

[Ziegler] Ziegler, D., Nix, S., Chan, L., Bauman, T., Schmidt-Nielsen, P., Lin, T., ... & Thomas, N. (2022). Adversarial training for high-stakes reliability. Advances in Neural Information Processing Systems, 35, 9274-9286.

[Zhou] Zhou, Y., Liu, Y., Li, X., Jin, J., Qian, H., Liu, Z., ... & Yu, P. S. (2024). Trustworthiness in retrieval-augmented generation systems: A survey. arXiv preprint arXiv:2409.10102.

[Zou] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Participants of the ‘2025 Singapore Conference on AI: International Scientific Exchange on AI Safety’, 26th April 2025.

Published

2025-08-11

How to Cite

Bengio, Y., Tegmark, M., Russell, S., Song, D., Mindermann, S., Xue, L., … Zhang, Y.-Q. (2025). The Singapore Consensus on Global AI Safety Research Priorities: Building a Trustworthy, Reliable and Secure AI Ecosystem. SuperIntelligence - Robotics - Safety & Alignment, 2(5). https://doi.org/10.70777/si.v2i5.15503