Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

Yoshua Bengio; Michael Cohen; Damiano Fornasiere; Joumana Ghosn; Pietro Greiner; Matt MacDermott; Sören Mindermann; Adam Oberman; Jesse Richardson; Oliver Richardson; Marc-Antoine Rondeau; Pierre-Luc St-Charles; David Williams-King

doi:10.70777/si.v2i5.15569

Authors

Yoshua Bengio MILA-Quebec AI Institute; Université de Montréal https://orcid.org/0000-0002-9322-3515
Michael Cohen University of California, Berkeley
Damiano Fornasiere Mila — Quebec AI Institute
Joumana Ghosn Mila — Quebec AI Institute
Pietro Greiner Mila — Quebec AI Institute
Matt MacDermott Imperial College London; Mila — Quebec AI Institute
Sören Mindermann Mila — Quebec AI Institute
Adam Oberman Mila — Quebec AI Institute; McGill University
Jesse Richardson Mila — Quebec AI Institute
Oliver Richardson Mila — Quebec AI Institute; Université de Montréal
Marc-Antoine Rondeau Mila — Quebec AI Institute
Pierre-Luc St-Charles Mila — Quebec AI Institute
David Williams-King Mila — Quebec AI Institute

DOI:

https://doi.org/10.70777/si.v2i5.15569

Keywords:

narrow AI, AI safety, artificial general intelligence, Bayesian AI, emergent AI risks, instrumental drives, basic AI drives, AI interpretability, explainable AI, reward gaming, specification gaming, AI loopholes, synthetic data generation

Abstract

The leading AI companies are increasingly focused on building generalist AI agents—systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self-preservation. Following the precautionary principle, we see a strong need for safer, yet still useful, alternatives to the current agency-driven trajectory.

Accordingly, we propose as a core building block for further advances the development of a non-agentic AI system that is trustworthy and safe by design, which we call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question-answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of overconfident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, our system can be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. We hope these arguments will motivate researchers, developers, and policymakers to favor this safer path.
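As a rough illustration of the two components described above, the following toy sketch pairs a Bayesian posterior over candidate theories (standing in for the world model) with a question-answering step that averages over those theories (standing in for the inference machine). It is not the authors' implementation: the function names, the two coin "theories", and the observations are all hypothetical, chosen only to show how explicit uncertainty can carry through from theories to answers.

```python
from fractions import Fraction


def posterior_over_theories(prior, likelihood, observations):
    """Apply Bayes' rule over a finite set of candidate theories."""
    unnormalized = {}
    for theory, prior_prob in prior.items():
        weight = prior_prob
        for obs in observations:
            weight *= likelihood(theory, obs)
        unnormalized[theory] = weight
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no candidate theory explains the observations")
    return {theory: w / total for theory, w in unnormalized.items()}


def answer_probability(posterior, likelihood, query):
    """Answer a query by averaging over theories instead of picking one."""
    return sum(p * likelihood(theory, query) for theory, p in posterior.items())


# Hypothetical theories: each fixes the coin's probability of heads.
THEORIES = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}


def coin_likelihood(theory, outcome):
    """Probability of a single outcome ("H" or "T") under a theory."""
    p_heads = THEORIES[theory]
    return p_heads if outcome == "H" else 1 - p_heads


prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
posterior = posterior_over_theories(prior, coin_likelihood, ["H", "H", "T"])
p_next_heads = answer_probability(posterior, coin_likelihood, "H")

print(posterior)
print(p_next_heads)
```

Because the answer is a weighted average over every theory that remains plausible, the sketch reports a probability rather than committing to a single explanation, which is the sense in which uncertainty stays explicit in both steps.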

Author Biography

Yoshua Bengio, MILA-Quebec AI Institute; Université de Montréal

Recognized worldwide as one of the leading experts in artificial intelligence, Yoshua Bengio is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of Computing,” with Geoffrey Hinton and Yann LeCun. He is the computer scientist with the largest number of citations and the highest h-index.

He is a Full Professor at Université de Montréal, Co-President and Scientific Director of LawZero, and Founder and Scientific Advisor of Mila – Quebec AI Institute. He co-directs the CIFAR Learning in Machines & Brains program and acts as Special Advisor and Founding Scientific Director of IVADO.

He has received numerous awards, including the prestigious Killam Prize and Herzberg Gold Medal in Canada, CIFAR’s AI Chair, Spain’s Princess of Asturias Award, and the VinFuture Prize. He is a Fellow of both the Royal Society of London and the Royal Society of Canada, a Knight of the Legion of Honor of France, an Officer of the Order of Canada, and a Member of the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology. In 2024, TIME magazine named him one of the 100 most influential people in the world.

Concerned about the social impact of AI, he actively contributed to the Montreal Declaration for the Responsible Development of Artificial Intelligence and currently chairs the International AI Safety

References

Akhound-Sadegh, Tara et al. (2024). “Iterated denoising energy matching for sampling from Boltzmann densities”. In: Proc. International Conference on Machine Learning. url: https://openreview.net/forum?id=gVjMwLDFoQ.

Al Kuwaiti, Ahmed et al. (2023). “A review of the role of artificial intelligence in healthcare”. In: Journal ofpersonalized medicine 13.6, p. 951. url: https://pmc.ncbi.nlm.nih.gov/articles/PMC10301994/.

Altmeyer, Patrick et al. (2024). “Position: stop making unscientific AGI performance claims”. In: Proc.International Conference on Machine Learning. url: https://dl.acm.org/doi/10.5555/3692070.3692121.

Amodei, Dario (2024). Machines of Loving Grace. Blog post: https://darioamodei.com/machines-ofloving-grace. Accessed: 2025-02-06.

Angelopoulos, Anastasios N. and Stephen Bates (2023). “Conformal prediction: A gentle introduction”. In:Foundations and Trends in Machine Learning 16.4, pp. 494–591. url: https://dl.acm.org/doi/10.1561/2200000101.

Anthropic (May 2023). Claude’s Constitution. Webpage: https://www.anthropic.com/news/claudesconstitution.Accessed: 2025-02-06.

Anthropic (Oct. 2024a). Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku.Webpage:https://www.anthropic.com/news/3-5-models-and-computer-use. Accessed: 2025-02-06.

Anthropic (Mar. 2024b). The Claude 3 Model Family: Opus, Sonnet, Haiku. Tech. rep. Accessed: 2025-02-06.url: https://docs.anthropic.com/en/docs/resources/model-card.

Armstrong, Stuart and Xavier O’Rorke (2017). “Good and safe uses of AI Oracles”. In: ArXiv preprint1711.05541. url: https://arxiv.org/abs/1711.05541.

Armstrong, Stuart, Anders Sandberg, and Nick Bostrom (2012). “Thinking inside the box: Controlling andusing an oracle AI”. In: Minds and Machines 22, pp. 299–324. url: https://link.springer.com/article/10.1007/s11023-012-9282-2.

Aschenbrenner, Leopold (June 2024). The free world must prevail. Blog post: https : / / situational -awareness.ai/the-free-world-must-prevail/. Accessed: 2025-02-06.

Atanackovic, Lazar and Emmanuel Bengio (2024). “Investigating Generalization Behaviours of GenerativeFlow Networks”. In: Proc. Workshop on Structured Probabilistic Inference & Generative Modeling. url:https://openreview.net/forum?id=umFrtGMWaQ.

Augustin, Thomas et al. (2014). Introduction to imprecise probabilities. Vol. 591. John Wiley & Sons. url:https://books.google.ca/books?id=9qXIEAAAQBAJ.

Ayyamperumal, Suriya Ganesh and Limin Ge (2024). “Current state of LLM Risks and AI Guardrails”. In:ArXiv preprint 2406.12934. url: https://arxiv.org/abs/2406.12934.Badue, Claudine et al. (2021). “Self-driving cars: A survey”. In: Expert systems with applications 165. url:https://www.sciencedirect.com/science/article/abs/pii/S095741742030628X.

Bai, Yuntao, Andy Jones, et al. (2022). “Training a Helpful and Harmless Assistant with ReinforcementLearning from Human Feedback”. In: ArXiv preprint 2204.05862. url: https://arxiv.org/abs/2204.05862.

Bai, Yuntao, Saurav Kadavath, et al. (2022). “Constitutional AI: Harmlessness from AI Feedback”. In: ArXivpreprint 2212.08073. url: https://arxiv.org/abs/2212.08073.

Belghazi, Mohamed Ishmael et al. (2018). “Mutual information neural estimation”. In: Proc. InternationalConference on Machine Learning, pp. 531–540. url: https://proceedings.mlr.press/v80/belghazi18a.html.

Bengio, Emmanuel et al. (2021). “Flow network based generative models for non-iterative diverse candidategeneration”. In: Proc. Neural Information Processing Systems. Vol. 34, pp. 27381–27394. url: https://proceedings.neurips.cc/paper/2021/hash/e614f646836aaed9f89ce58e837e2310-Abstract.html.

Bengio, Yoshua (2023). “AI and catastrophic risk”. In: Journal of Democracy 34.4, pp. 111–121. url: https://muse.jhu.edu/pub/1/article/907692/summary.

Bengio, Yoshua, Michael K. Cohen, et al. (2024). “Can a Bayesian Oracle Prevent Harm from an Agent?” In: arXiv preprint 2408.05284. url: https://arxiv.org/abs/2408.05284.

Bengio, Yoshua, Aaron Courville, and Pascal Vincent (2013). “Representation learning: A review and new perspectives”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8, pp. 1798–1828. url: https://dl.acm.org/doi/10.1109/tpami.2013.50.

Bengio, Yoshua, Tristan Deleu, et al. (2020). “A meta-transfer objective for learning to disentangle causal mechanisms”. In: Proc. International Conference on Learning Representations. url: https://openrevie w.net/forum?id=ryxWIgBFPS.

Bengio, Yoshua, Gr´egoire Mesnil, et al. (2013). “Better mixing via deep representations”. In: Proc. International Conference on Machine Learning, pp. 552–560. url: https://proceedings.mlr.press/v28 /bengio13.html.

Bengio, Yoshua, S¨oren Mindermann, et al. (2025). International AI safety report 2025. Tech. rep. Accessed: 2025-02-06. UK Government. url: https://www.gov.uk/government/publications/internationalai- safety-report-2025.

Bereska, Leonard and Efstratios Gavves (2024). “Mechanistic Interpretability for AI Safety – A Review”. In: arXiv preprint 2404.14082. url: https://arxiv.org/abs/2404.14082.

Big Sleep Team (2024). Project Zero: from naptime to big sleep: using large language models to catch vulnerabilities in real-world code. Blog post: https://googleprojectzero.blogspot.com/2024/10/fromnaptime- to-big-sleep.html. Accessed: 2025-02-06.

Bishop, Christopher M. (2013). “Model-based machine learning”. In: Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371.1984. url: https://royalsocietypubl ishing.org/doi/full/10.1098/rsta.2012.0222.

Blanchard, Thomas, Tania Lombrozo, and Shaun Nichols (2018). “Bayesian Occam’s razor is a razor of the people”. In: Cognitive science 42.4, pp. 1345–1359. url: https://onlinelibrary.wiley.com/doi/full/ 10.1111/cogs.12573.

Blundell, Charles et al. (2015). “Weight uncertainty in neural network”. In: Proc. International Conference on Machine Learning, pp. 1613–1622. url: https://proceedings.mlr.press/v37/blundell15.

Bojarski, Mariusz (2016). “End to end learning for self-driving cars”. In: ArXiv preprint 1604.07316. url: https://arxiv.org/abs/1604.07316.

Bostrom, Nick (2012). “The superintelligent will: Motivation and instrumental rationality in advanced artificial agents”. In: Minds and Machines 22, pp. 71–85. url: https://link.springer.com/article/10.1 007/s11023-012-9281-3.

Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press. url: https: //books.google.ca/books?id=7_H8AwAAQBAJ.

Bourguignon, Didier (Dec. 2015). The precautionary principle: definitions, applications and governance. Tech. rep. Accessed: 2025-02-06. European Parliament. url: https://www.europarl.europa.eu/thinktank/ en/document/EPRS_IDA%282015%29573876.

Breum, Simon Martin et al. (2024). “The persuasive power of large language models”. In: Proc. AAAI Conference on Web and Social Media. Vol. 18, pp. 152–163. url: https://ojs.aaai.org/index.php/ ICWSM/article/view/31304.

Bricken, Trenton et al. (Oct. 2023). “Towards monosemanticity: decomposing language models with dictionary learning”. In: Transformer Circuits Thread. Accessed: 2025-02-06. url: https://transformercircuits. pub/2023/monosemantic-features.

Bronstein, J.L. (2015). Mutualism. Oxford University Press. url: https://books.google.ca/books?id= tlIdCgAAQBAJ.

Brown, Tom et al. (2020). “Language models are few-shot learners”. In: Proc. Neural Information Processing Systems 33, pp. 1877–1901. url: https://dl.acm.org/doi/abs/10.5555/3495724.3495883.

Brown-Cohen, Jonah, Geoffrey Irving, and Georgios Piliouras (2023). “Scalable AI safety via doubly-efficient debate”. In: ArXiv preprint 2311.14125. url: https://arxiv.org/abs/2311.14125.

Bubeck, Sebastien et al. (2023). “Sparks of artificial general intelligence: early experiments with GPT-4”. In: ArXiv preprint 2303.12712. url: https://arxiv.org/abs/2303.12712.

Bullock, Justin B et al. (2024). The Oxford handbook of AI governance. Oxford University Press. url: https://academic.oup.com/edited-volume/41989.

Carlsmith, Joseph (2022). “Is power-seeking AI an existential risk?” In: ArXiv preprint 2206.13353. url: https://arxiv.org/abs/2206.13353.

Carter, Sarah R. et al. (2023). The convergence of artificial intelligence and the life sciences. Tech. rep. Accessed: 2025-02-06. NTI. url: https://www.nti.org/analysis/articles/the-convergence-ofartificial- intelligence-and-the-life-sciences/.

Ceballos, Gerardo et al. (June 2018). “Accelerated modern human–induced species losses: Entering the sixth mass extinction”. In: Science advances 4.7. url: https://www.science.org/doi/10.1126/sciadv.140 0253.

Center for AI Safety (May 2023). Statement on AI Risk. Open letter: https : / / www . safe . ai / work / statement-on-ai-risk. Accessed: 2025-02-06.

Chen, Ming-Hui, Qi-Man Shao, and Joseph G Ibrahim (2012). Monte Carlo methods in Bayesian computation. Springer Science & Business Media. url: https://link.springer.com/book/10.1007/978-1-461 2-1276-8.

Christiano, Paul, Ajeya Cotra, and Mark Xu (2021). Eliciting latent knowledge: How to tell if your eyes deceive you. Tech. rep. Accessed: 2025-02-06. Alignment Research Center. url: https://docs.google. com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8.

Christiano, Paul, Jan Leike, et al. (2017). “Deep reinforcement learning from human preferences”. In: Proc. Neural Information Processing Systems, pp. 4302–4310. url: https://dl.acm.org/doi/10.5555/32949 96.3295184.

Clark, Jack and Dario Amodei (2016). Faulty reward functions in the wild. Tech. rep. Accessed: 2025-02-06. OpenAI. url: https://openai.com/index/faulty-reward-functions/.

Clement, Sven (2024). NATO and Artificial Intelligence: Navigating the challenges and opportunities. Tech. rep. Accessed: 2025-02-06. NATO Parliamentary Assembly. url: https://www.nato-pa.int/document/2 024-nato-and-ai-report-clement-058-stc.

Cohen, Michael, Marcus Hutter, and Michael Osborne (2022). “Advanced artificial agents intervene in the provision of reward”. In: AI magazine 43.3, pp. 282–293. url: https://ojs.aaai.org/aimagazine/ index.php/aimagazine/article/view/15084.

Cohen, Michael K et al. (2024). “Regulating advanced artificial agents”. In: Science 384.6691, pp. 36–38. url: https://www.science.org/doi/10.1126/science.adl0625.

Colombatto, Clara and Stephen M Fleming (2024). “Folk psychological attributions of consciousness to large language models”. In: Neuroscience of Consciousness 2024.1. url: https://academic.oup.com/ nc/article/2024/1/niae013/7644104.

Colombo, Pierre, Pablo Piantanida, and Chlo´e Clavel (Aug. 2021). “A Novel Estimator of Mutual Information for Learning to Disentangle Textual Representations”. In: Proc. Association for Computational Linguistics. Association for Computational Linguistics, pp. 6539–6550. url: https://aclanthology.org/2021.acllong. 511/.

Cosmo, Leonardo De (June 2023). “Google engineer claims AI chatbot is sentient: why that matters”. In: Scientific American. Accessed: 2025-02-06. url: https : / / www . scientificamerican . com / article / google-engineer-claims-ai-chatbot-is-sentient-why-that-matters/.

Cottier, Ben et al. (2024). How much does it cost to train frontier AI models? Tech. rep. Accessed: 2025-02-06. Epoch AI. url: https://epoch.ai/blog/how-much-does-it-cost-to-train-frontier-ai-models.

Coupe, Christophe et al. (2019). “Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche”. In: Science advances 5.9. url: https://www.science. org/doi/10.1126/sciadv.aaw2594.

Critch, Andrew and David Krueger (2020). “AI research considerations for human existential safety (ARCHES)”. In: arXiv preprint 2006.04948. url: https://arxiv.org/abs/2006.04948.

Cuzzolin, Fabio (2021). The geometry of uncertainty. Springer Nature. url: https://www.google.co.uk/ books/edition/The_Geometry_of_Uncertainty/jNQPEAAAQBAJ.

Czyz Paweland Grabowski, Frederic et al. (2023). “Beyond Normal: On the Evaluation of Mutual Information Estimators”. In: Proc. Neural Information Processing Systems. Vol. 36, pp. 16957–16990. url: https: //proceedings.neurips.cc/paper_files/paper/2023/file/36b80eae70ff629d667f210e13497edf- Paper-Conference.pdf.

Dafoe, Allan et al. (2020). “Open problems in cooperative AI”. In: ArXiv preprint 2012.08630. url: https: //arxiv.org/abs/2012.08630.

Dalrymple, David ”davidad” et al. (2024). “Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems”. In: ArXiv preprint 2405.06624. url: https://arxiv.org/abs/2405.06624.

Defense, Department of (Feb. 2019). Summary of the 2018 Department of Defense Artificial Intelligence Strategy. Tech. rep. Accessed: 2025-02-06. US Government. url: https://media.defense.gov/2019 /feb/12/2002088963/-1/-1/1/summary-of-dod-ai-strategy.pdf.

Deleu, Tristan, Ant´onio G´ois, et al. (2022). “Bayesian structure learning with generative flow networks”. In: Proc. Uncertainty in Artificial Intelligence, pp. 518–528. url: https://proceedings.mlr.press/v180 /deleu22a.html.

Deleu, Tristan, Mizu Nishikawa-Toomey, et al. (2023). “Joint bayesian inference of graphical structure and parameters with a single generative flow network”. In: Proc. Neural Information Processing Systems. Vol. 36, pp. 31204–31231. url: https://proceedings.neurips.cc/paper_files/paper/2023/hash/63 9a9a172c044fbb64175b5fad42e9a5-Abstract-Conference.html.

Denison, Carson et al. (2024). “Sycophancy to subterfuge: investigating reward-tampering in large language models”. In: ArXiv preprint 2406.10162. url: https://arxiv.org/abs/2406.10162.

Devlin, Jacob et al. (June 2019). “BERT: pre-training of deep bidirectional transformers for language understanding”. In: Proc. Association for Computational Linguistics, pp. 4171–4186. doi: 10.18653/v1/N1 9-1423. url: https://aclanthology.org/N19-1423/.

Di Langosco, Lauro Langosco et al. (2022). “Goal misgeneralization in deep reinforcement learning”. In: Proc. International Conference on Machine Learning, pp. 12004–12019. url: https://proceedings. mlr.press/v162/langosco22a.html.

Dias, Raquel and Ali Torkamani (2019). “Artificial intelligence in clinical and genomic diagnostics”. In: Genome medicine 11.1, p. 70. url: https://genomemedicine.biomedcentral.com/articles/10.1186 /s13073-019-0689-8.

Fang, Richard, Rohan Bindu, Akul Gupta, and Daniel Kang (2024). “LLM agents can autonomously exploit one-day vulnerabilities”. In: arXiv preprint 2404.08144. url: https://arxiv.org/abs/2404.08144.

Fang, Richard, Rohan Bindu, Akul Gupta, Qiusi Zhan, et al. (2024). “LLM Agents can Autonomously Hack Websites”. In: ArXiv preprint 2402.06664. url: https://arxiv.org/abs/2402.06664.

Federal Aviation Administration (Aug. 2024). System Design and Analysis. Tech. rep. AC 25.1309-1B. Accessed: 2025-02-06. US Government. url: https://www.faa.gov/regulations_policies/advisory_ circulars/index.cfm/go/document.information/documentID/1043037.

Finzi, Marc, Max Welling, and Andrew Gordon Wilson (2021). “A practical method for constructing equivariant multilayer perceptrons for arbitrary matrix groups”. In: Proc. International Conference on Machine Learning, pp. 3318–3328. url: https://proceedings.mlr.press/v139/finzi21a.

Galatzer-Levy, Isaac R. et al. (2024). “The cognitive capabilities of generative AI: A comparative analysis with human benchmarks”. In: ArXiv preprint 2410.07391. url: https://arxiv.org/abs/2410.07391.

Garnett, Roman (2023). Bayesian Optimization. Cambridge University Press. url: https://bayesoptbook. com/. Ghahramani, Zoubin (2015). “Probabilistic machine learning and artificial intelligence”. In: Nature 521.7553, pp. 452–459. url: https://www.nature.com/articles/nature14541.

Goan, Ethan and Clinton Fookes (2020). “Bayesian neural networks: An introduction and survey”. In: Case Studies in Applied Bayesian Data Science, pp. 45–87. url: https://link.springer.com/chapter/10.1 007/978-3-030-42553-1_3.

Goel, Shubhangi (Oct. 2024). “Jensen Huang wants Nvidia to have 100 million AI assistants”. In: Business Insider. Accessed: 2025-02-06. url: https://www.businessinsider.com/jensen-huang-wants-nvidia -to-have-100-million-ai-assistants-2024-10.

Good, Irving John (1966). “Speculations concerning the first ultraintelligent machine”. In: Advances in computers 6, pp. 31–88. url: https://www.sciencedirect.com/science/article/abs/pii/S0065245 808604180.

Goodfellow, Ian J, Jonathon Shlens, and Christian Szegedy (2014). “Explaining and harnessing adversarial examples”. In: arXiv preprint 1412.6572. url: https://arxiv.org/abs/1412.6572.

Goodhart, Charles (1984). Problems of monetary management: the UK experience. Springer. url: https: //link.springer.com/chapter/10.1007/978-1-349-17295-5_4.

Google DeepMind (2024). Build AI responsibly to benefit humanity. Webpage: https://deepmind.google/ about/. Accessed: 2025-02-06.

Goswami, Lipichanda, Manoj Kumar Deka, and Mohendra Roy (2023). “Artificial intelligence in material engineering: A review on applications of artificial intelligence in material engineering”. In: Advanced Engineering Materials 25.13. url: [https://onlinelibrary.wiley.com/doi/abs/10.1002/adem.202300104](https://onlinelibrary.wiley.com/doi/abs/10.1002/adem.202300104).

Goyal, Anirudh and Yoshua Bengio (2022). “Inductive biases for deep learning of higher-level cognition”. In: Proceedings of the Royal Society A 478.2266. url: [https://royalsocietypublishing.org/doi/full/10.1098/rspa.2021.0068](https://royalsocietypublishing.org/doi/full/10.1098/rspa.2021.0068).

Grace, Katja et al. (2024). “Thousands of AI authors on the future of AI”. In: ArXiv preprint 2401.02843. url: [https://arxiv.org/abs/2401.02843](https://arxiv.org/abs/2401.02843).

Greenblatt, Ryan, Carson Denison, et al. (2024). “Alignment faking in large language models”. In: ArXiv preprint 2412.14093. url: [https://arxiv.org/abs/2412.14093](https://arxiv.org/abs/2412.14093).

Greenblatt, Ryan, Buck Shlegeris, et al. (July 2024). “AI Control: Improving Safety Despite Intentional Subversion”. In: pp. 16295–16336. url: [https://proceedings.mlr.press/v235/greenblatt24a.html](https://proceedings.mlr.press/v235/greenblatt24a.html).

Hadfield-Menell, Dylan and Gillian K Hadfield (2019). “Incomplete contracting and AI alignment”. In: Proc. AAAI/ACM Conference on AI, Ethics, and Society, pp. 417–422. url: [https://dl.acm.org/doi/abs/10.1145/3306618.3314250](https://dl.acm.org/doi/abs/10.1145/3306618.3314250).

Hadfield-Menell, Dylan, Stuart J Russell, et al. (2016). “Cooperative inverse reinforcement learning”. In: Proc. Neural Information Processing Systems. Vol. 29. url: [https://proceedings.neurips.cc/paper_files/paper/2016/hash/c3395dd46c34fa7fd8d729d8cf88b7a8-Abstract.html](https://proceedings.neurips.cc/paper_files/paper/2016/hash/c3395dd46c34fa7fd8d729d8cf88b7a8-Abstract.html).

Haghighat, Ehsan, Danial Amini, and Ruben Juanes (2022). “Physics-informed neural network simulation of multiphase poroelasticity using stress-split sequential training”. In: Computer Methods in Applied Mechanics and Engineering 397. url: [https://www.sciencedirect.com/science/article/abs/pii/S0045782522003152](https://www.sciencedirect.com/science/article/abs/pii/S0045782522003152).

Hameleers, Michael, Toni G.L.A. van der Meer, and Tom Dobber (2024). “Distorting the truth versus blatant lies: The effects of different degrees of deception in domestic and foreign political deepfakes”. In: Computers in Human Behavior 152. url: [https://www.sciencedirect.com/science/article/pii/S0747563223004478](https://www.sciencedirect.com/science/article/pii/S0747563223004478).

He, Haoran et al. (2024). “Rectifying reinforcement learning for reward matching”. In: ArXiv preprint 2406.02213. url: [https://arxiv.org/abs/2406.02213](https://arxiv.org/abs/2406.02213).

He, Kaiming et al. (2022). “Masked autoencoders are scalable vision learners”. In: Proc. Conf. on Computer Vision and Pattern Recognition, pp. 16000–16009. url: [https://openaccess.thecvf.com/content/CVPR2022/html/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.html](https://openaccess.thecvf.com/content/CVPR2022/html/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.html).

Hejna, Joey et al. (2025). “Robot Data Curation with Mutual Information Estimators”. In: arXiv preprint 2502.08623. url: [https://arxiv.org/abs/2502.08623](https://arxiv.org/abs/2502.08623).

Hendrycks, Dan (2023). “Natural selection favors AIs over humans”. In: ArXiv preprint 2303.16200. url: [https://arxiv.org/abs/2303.16200](https://arxiv.org/abs/2303.16200).

Hendrycks, Dan (2024). “Rogue AIs”. In: AI Safety, Ethics, and Society. Accessed: 2025-02-06. Taylor & Francis. url: [https://www.aisafetybook.com/textbook/rogue-ai](https://www.aisafetybook.com/textbook/rogue-ai).

Hendrycks, Dan, Nicholas Carlini, et al. (2021). Unsolved problems in ML safety. url: [https://arxiv.org/abs/2109.13916](https://arxiv.org/abs/2109.13916).

Hendrycks, Dan, Mantas Mazeika, and Thomas Woodside (2023). “An overview of catastrophic AI risks”. In: ArXiv preprint 2306.12001. url: [https://arxiv.org/abs/2306.12001](https://arxiv.org/abs/2306.12001).

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean (2015). “Distilling the knowledge in a neural network”. In: ArXiv preprint 1503.02531. url: [https://arxiv.org/abs/1503.02531](https://arxiv.org/abs/1503.02531).

Ho, Jonathan, Ajay Jain, and Pieter Abbeel (2020). “Denoising diffusion probabilistic models”. In: Proc. Neural Information Processing Systems. Vol. 33, pp. 6840–6851. url: [https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html](https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html).

Hoffmann, Jordan et al. (2022). “Training compute-optimal large language models”. In: ArXiv preprint 2203.15556. url: [https://arxiv.org/abs/2203.15556](https://arxiv.org/abs/2203.15556).

Hu, Edward J, Moksh Jain, et al. (2024). “Amortizing intractable inference in large language models”. In: Proc. International Conference on Learning Representations. url: [https://openreview.net/forum?id=Ouj6p4ca60](https://openreview.net/forum?id=Ouj6p4ca60).

Hu, Edward J, Nikolay Malkin, et al. (2023). “GFlowNet-EM for learning compositional latent variable models”. In: Proc. International Conference on Machine Learning, pp. 13528–13549. url: [https://proceedings.mlr.press/v202/hu23c.html](https://proceedings.mlr.press/v202/hu23c.html).

Huang, Biwei et al. (2020). “Causal discovery from heterogeneous/nonstationary data”. In: Journal of Machine Learning Research 21.89. url: [https://www.jmlr.org/papers/v21/19-232.html](https://www.jmlr.org/papers/v21/19-232.html).

Hubinger, Evan, Adam Jermyn, et al. (2023). “Conditioning predictive models: risks and strategies”. In: ArXiv preprint 2302.00805. url: [https://arxiv.org/abs/2302.00805](https://arxiv.org/abs/2302.00805).

Hubinger, Evan, Chris van Merwijk, et al. (2019). “Risks from learned optimization in advanced machine learning systems”. In: arXiv preprint 1906.01820. url: [https://arxiv.org/abs/1906.01820](https://arxiv.org/abs/1906.01820).

Hurst, Alexander (2025). “I met the ‘godfathers of AI’ in Paris – here’s what they told me to really worry about”. In: The Guardian. [https://www.iaseai.org/conference/livestream](https://www.iaseai.org/conference/livestream), video Feb 7th 9:30 AM CET, at 57:51. url: [https://www.theguardian.com/commentisfree/2025/feb/14/ai-godfathersparis-industry-dangers-future](https://www.theguardian.com/commentisfree/2025/feb/14/ai-godfathersparis-industry-dangers-future).

Hussein, Ahmed et al. (2017). “Imitation learning: A survey of learning methods”. In: Computing Surveys 50.2, pp. 1–35. url: [https://dl.acm.org/doi/10.1145/3054912](https://dl.acm.org/doi/10.1145/3054912).

Ipsos (2017). Public views of machine learning. Tech. rep. Ipsos MORI Research Institute. url: [https://royalsociety.org/~/media/policy/projects/machine-learning/publications/public-viewsof-machine-learning-ipsos-mori.pdf](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/public-viewsof-machine-learning-ipsos-mori.pdf).

Irving, Geoffrey, Paul Christiano, and Dario Amodei (2018). “AI safety via debate”. In: ArXiv preprint 1805.00899. url: [https://arxiv.org/abs/1805.00899](https://arxiv.org/abs/1805.00899).

Ivanova, Desi R., Marvin Schmitt, and Stefan T. Radev (2024). “Data-efficient variational mutual information estimation via Bayesian self-consistency”. In: Proc. Workshop on Bayesian Decision-Making and Uncertainty. url: [https://openreview.net/forum?id=QfiyElaO1f&noteId=aRvehpmMkK](https://openreview.net/forum?id=QfiyElaO1f&noteId=aRvehpmMkK).

Jain, Moksh, Emmanuel Bengio, et al. (2022). “Biological sequence design with GFlowNets”. In: Proc. International Conference on Machine Learning, pp. 9786–9801. url: [https://proceedings.mlr.press/v162/jain22a.html](https://proceedings.mlr.press/v162/jain22a.html).

Jain, Moksh, Tristan Deleu, et al. (2023). “GFlowNets for AI-driven scientific discovery”. In: Digital Discovery 2.3, pp. 557–577. url: [https://pubs.rsc.org/en/content/articlehtml/2023/dd/d3dd00002h](https://pubs.rsc.org/en/content/articlehtml/2023/dd/d3dd00002h).

Järviniemi, Olli and Evan Hubinger (2024). “Uncovering deceptive tendencies in language models: A simulated company AI assistant”. In: ArXiv preprint 2405.01576. url: [https://arxiv.org/abs/2405.01576](https://arxiv.org/abs/2405.01576).

Jimenez, Carlos E et al. (2024). “SWE-Bench: Can language models resolve real-world Github issues?” In: Proc. International Conference on Learning Representations. url: [https://openreview.net/forum?id=VTF8yNQM66](https://openreview.net/forum?id=VTF8yNQM66).

Jin, Zehao et al. (2025). “Causal discovery in astrophysics: Unraveling supermassive black hole and galaxy coevolution”. In: The Astrophysical Journal 979.2, p. 212. url: [https://iopscience.iop.org/article/10.3847/1538-4357/ad9ded/meta](https://iopscience.iop.org/article/10.3847/1538-4357/ad9ded/meta).

Jumper, John et al. (2021). “Highly accurate protein structure prediction with AlphaFold”. In: Nature 596.7873, pp. 583–589. url: [https://www.nature.com/articles/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2).

Kahneman, Daniel (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux. url: [https://books.google.ca/books?id=ZuKTvERuPG8C](https://books.google.ca/books?id=ZuKTvERuPG8C).

Kakade, Sham Machandranath (2003). “On the sample complexity of reinforcement learning”. PhD thesis. University of London, University College London. url: [https://homes.cs.washington.edu/~sham/papers/thesis/sham_thesis.pdf](https://homes.cs.washington.edu/~sham/papers/thesis/sham_thesis.pdf).

Kaplan, Jared et al. (2020). “Scaling laws for neural language models”. In: ArXiv preprint 2001.08361. url: [https://arxiv.org/abs/2001.08361](https://arxiv.org/abs/2001.08361).

Kashinath, Karthik et al. (2021). “Physics-informed machine learning: case studies for weather and climate modelling”. In: Philosophical Transactions of the Royal Society 379.2194. url: [https://royalsocietypublishing.org/doi/full/10.1098/rsta.2020.0093](https://royalsocietypublishing.org/doi/full/10.1098/rsta.2020.0093).

Kaufmann, Timo et al. (2023). “A survey of reinforcement learning from human feedback”. In: ArXiv preprint 2312.14925. url: [https://arxiv.org/abs/2312.14925](https://arxiv.org/abs/2312.14925).

Kim, Minsu, Sanghyeok Choi, Hyeonah Kim, et al. (2024). “Ant colony sampling with GFlowNets for combinatorial optimization”. In: ArXiv preprint 2403.07041. url: [https://arxiv.org/abs/2403.07041](https://arxiv.org/abs/2403.07041).

Kim, Minsu, Sanghyeok Choi, Taeyoung Yun, et al. (2025). “Adaptive teachers for amortized samplers”. In: Proc. International Conference on Learning Representations. url: [https://openreview.net/forum?id=BdmVgLMvaf](https://openreview.net/forum?id=BdmVgLMvaf).

Kochkov, Dmitrii et al. (2024). “Neural general circulation models for weather and climate”. In: Nature 632.8027, pp. 1060–1066. url: [https://www.nature.com/articles/s41586-024-07744-y](https://www.nature.com/articles/s41586-024-07744-y).

Krakovna, Victoria et al. (2024). Specification gaming: the flip side of AI ingenuity. Tech. rep. Accessed: 2025-02-06. Google DeepMind. url: [https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/](https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/).

Krueger, David (2024). “Safe and trustworthy agents: an oxymoron?” In: Proc. Workshop Towards Safe & Trustworthy Agents. url: [https://neurips.cc/virtual/2024/workshop/84748](https://neurips.cc/virtual/2024/workshop/84748).

Kruger, Justin and David Dunning (1999). “Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments”. In: Journal of Personality and Social Psychology 77.6, p. 1121. url: [https://pubmed.ncbi.nlm.nih.gov/10626367/](https://pubmed.ncbi.nlm.nih.gov/10626367/).

Kuhlau, Frida et al. (2011). “A precautionary principle for dual use research in the life sciences”. In: Bioethics 25.1. url: [https://pubmed.ncbi.nlm.nih.gov/19594724/](https://pubmed.ncbi.nlm.nih.gov/19594724/).

Kunda, Ziva (1990). “The case for motivated reasoning”. In: Psychological Bulletin 108.3, p. 480. url: [https://psycnet.apa.org/record/1991-06436-001](https://psycnet.apa.org/record/1991-06436-001).

Laubscher, Ryno (2021). “Simulation of multi-species flow and heat transfer using physics-informed neural networks”. In: Physics of Fluids 33.8. url: [https://pubs.aip.org/aip/pof/article-abstract/33/8/087101/1080391/Simulation-of-multi-species-flow-and-heat-transfer](https://pubs.aip.org/aip/pof/article-abstract/33/8/087101/1080391/Simulation-of-multi-species-flow-and-heat-transfer).

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep Learning”. In: Nature 521.7553, pp. 436–444. url: [https://www.nature.com/articles/nature14539](https://www.nature.com/articles/nature14539).

Lee, Chang-Shing et al. (2016). “Human vs. computer Go: Review and prospect”. In: IEEE Computational Intelligence Magazine 11.3, pp. 67–72. url: [https://ieeexplore.ieee.org/abstract/document/7515285](https://ieeexplore.ieee.org/abstract/document/7515285).

Lee, Seanie et al. (2025). “Learning diverse attacks on large language models for robust red-teaming and safety tuning”. In: Proc. International Conference on Learning Representations. url: [https://openreview.net/forum?id=1mXufFuv95](https://openreview.net/forum?id=1mXufFuv95).

Leung, YinYee (2015). “Regret minimization and related decision rules”. PhD thesis. Cornell University. url: [https://ecommons.cornell.edu/server/api/core/bitstreams/ef0ef95b-6156-487e-900e-6c33714ed0c3/content](https://ecommons.cornell.edu/server/api/core/bitstreams/ef0ef95b-6156-487e-900e-6c33714ed0c3/content).

Liu, Gary et al. (2023). “Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii”. In: Nature Chemical Biology 19.11, pp. 1342–1350. url: [https://www.nature.com/articles/s41589-023-01349-8](https://www.nature.com/articles/s41589-023-01349-8).

Luo, Fan-Ming et al. (2024). “A survey on model-based reinforcement learning”. In: Science China Information Sciences 67.2. url: [https://link.springer.com/article/10.1007/s11432-022-3696-5](https://link.springer.com/article/10.1007/s11432-022-3696-5).

MacPhee, Ross D.E. (1999). Extinctions in near time. Vol. 2. Springer Science & Business Media. url: [https://link.springer.com/book/10.1007/978-1-4757-5202-1](https://link.springer.com/book/10.1007/978-1-4757-5202-1).

Malkin, Nikolay et al. (2023). “GFlowNets and variational inference”. In: Proc. International Conference on Learning Representations. url: [https://openreview.net/forum?id=uKiE0VIluA-](https://openreview.net/forum?id=uKiE0VIluA-).

Manyika, James and Sissie Hsiao (2024). An overview of the Gemini app. Tech. rep. Accessed: 2025-02-06. Google. url: [https://gemini.google/overview-gemini-app.pdf](https://gemini.google/overview-gemini-app.pdf).

Marzo, Giordano De, Claudio Castellano, and David Garcia (2024). “Large language model agents can coordinate beyond human scale”. In: ArXiv preprint 2409.02822. url: [https://arxiv.org/abs/2409.02822](https://arxiv.org/abs/2409.02822).

Maslej, Nestor et al. (May 2024). Artificial intelligence index report 2024. Tech. rep. Accessed: 2025-02-06. Stanford University. url: [https://aiindex.stanford.edu/report/](https://aiindex.stanford.edu/report/).

McKinney, Scott Mayer et al. (2020). “International evaluation of an AI system for breast cancer screening”. In: Nature 577.7788, pp. 89–94. url: [https://www.nature.com/articles/s41586-019-1799-6](https://www.nature.com/articles/s41586-019-1799-6).

Meinke, Alexander et al. (2024). “Frontier models are capable of in-context scheming”. In: ArXiv preprint 2412.04984. url: [https://arxiv.org/abs/2412.04984](https://arxiv.org/abs/2412.04984).

Metz, Cade (Mar. 2016). “In two moves, AlphaGo and Lee Sedol redefined the future”. In: Wired. Accessed: 2025-02-06. url: [https://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/](https://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/).

Microsoft Corporation (Oct. 2023). Fiscal Year 2024 First Quarter Earnings Conference Call. Webpage: [https://www.microsoft.com/en-us/investor/events/fy-2024/earnings-fy-2024-q1](https://www.microsoft.com/en-us/investor/events/fy-2024/earnings-fy-2024-q1). Accessed: 2025-02-06.

Milli, Smitha et al. (2017). “Should robots be obedient?” In: arXiv preprint 1705.09990. url: [https://arxiv.org/abs/1705.09990](https://arxiv.org/abs/1705.09990).

Mooney, H. A. and E. E. Cleland (2001). “The evolutionary impact of invasive species”. In: Proc. of the National Academy of Sciences 98.10, pp. 5446–5451. url: [https://www.pnas.org/doi/abs/10.1073/pnas.091093398](https://www.pnas.org/doi/abs/10.1073/pnas.091093398).

Murphy, Kevin P (2022). Probabilistic machine learning: an introduction. MIT Press. url: [https://probml.github.io/pml-book/book1.html](https://probml.github.io/pml-book/book1.html).

Neumann, John von, Oskar Morgenstern, and Ariel Rubinstein (1944). Theory of Games and Economic Behavior (60th Anniversary Commemorative Edition). Princeton University Press. url: [https://books.google.ca/books?id=jCN5aNJ-n-0C](https://books.google.ca/books?id=jCN5aNJ-n-0C).

Ngo, Richard, Lawrence Chan, and Sören Mindermann (2024). “The alignment problem from a deep learning perspective”. In: Proc. International Conference on Learning Representations. url: [https://openreview.net/forum?id=fh8EYKFKns](https://openreview.net/forum?id=fh8EYKFKns).

Omohundro, Stephen M (2018). “The basic AI drives”. In: Artificial intelligence safety and security. Chapman and Hall/CRC, pp. 47–55. url: [https://dl.acm.org/doi/10.5555/1566174.1566226](https://dl.acm.org/doi/10.5555/1566174.1566226).

OpenAI (2023). Planning for AGI and beyond. Webpage: [https://openai.com/index/planning-for-agi-and-beyond/](https://openai.com/index/planning-for-agi-and-beyond/). Accessed: 2025-02-06.

OpenAI (2024a). Learning to reason with LLMs. Webpage: [https://openai.com/index/learning-to-reason-with-llms/](https://openai.com/index/learning-to-reason-with-llms/). Accessed: 2025-02-06.

OpenAI (2024b). MLE-bench: evaluating machine learning agents on machine learning engineering. Webpage: [https://openai.com/index/mle-bench/](https://openai.com/index/mle-bench/). Accessed: 2025-02-06.

OpenAI (Dec. 2024c). OpenAI o1 System Card. Tech. rep. Accessed: 2025-02-06. url: [https://openai.com/index/openai-o1-system-card/](https://openai.com/index/openai-o1-system-card/).

OpenAI (2025a). Announcing the Stargate Project. Webpage: [https://openai.com/index/announcing-the-stargate-project/](https://openai.com/index/announcing-the-stargate-project/). Accessed: 2025-02-06.

OpenAI (2025b). Introducing ChatGPT search. Webpage: [https://openai.com/index/introducing-chatgpt-search/](https://openai.com/index/introducing-chatgpt-search/). Accessed: 2025-02-06.

OpenAI (2025c). Introducing deep research. Webpage: [https://openai.com/index/introducing-deep-research/](https://openai.com/index/introducing-deep-research/). Accessed: 2025-02-06.

OpenAI (2025d). Introducing Operator. Webpage: [https://openai.com/index/introducing-operator/](https://openai.com/index/introducing-operator/). Accessed: 2025-02-06.

OpenAI (Jan. 2025e). OpenAI o3-mini System Card. Tech. rep. Accessed: 2025-02-06. url: [https://openai.com/index/o3-mini-system-card/](https://openai.com/index/o3-mini-system-card/).

Ouyang, Long et al. (2022). “Training language models to follow instructions with human feedback”. In: Proc. Neural Information Processing Systems. Vol. 35, pp. 27730–27744. url: [https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf).

Park, Peter S et al. (2024). “AI deception: A survey of examples, risks, and potential solutions”. In: Patterns 5.5. url: [https://www.cell.com/patterns/fulltext/S2666-3899%2824%2900103-X?s=08](https://www.cell.com/patterns/fulltext/S2666-3899%2824%2900103-X?s=08).

Paul Christiano (2016). What does the universal prior actually look like? Blog post: [https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/](https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/). Accessed: 2025-02-06.

Perdomo, Juan et al. (2020). “Performative prediction”. In: Proc. International Conference on Machine Learning, pp. 7599–7609. url: [https://proceedings.mlr.press/v119/perdomo20a.html](https://proceedings.mlr.press/v119/perdomo20a.html).

Perrigo, Billy (Feb. 2024). “Meta’s AI chief Yann LeCun on AGI, open-source, and AI risk”. In: Time. Accessed: 2025-02-06. url: [https://time.com/6694432/yann-lecun-meta-ai-interview/](https://time.com/6694432/yann-lecun-meta-ai-interview/).

Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf (2017). Elements of causal inference: foundations and learning algorithms. The MIT Press. url: [https://library.oapen.org/handle/20.500.12657/26040](https://library.oapen.org/handle/20.500.12657/26040).

Peyrard, Maxime and Kyunghyun Cho (2025). “Meta-Statistical Learning: Supervised Learning of Statistical Inference”. In: ArXiv preprint 2502.12088. url: [https://arxiv.org/abs/2502.12088](https://arxiv.org/abs/2502.12088).

Poole, Ben et al. (June 2019). “On Variational Bounds of Mutual Information”. In: Proc. International Conference on Machine Learning. Vol. 97, pp. 5171–5180. url: [https://proceedings.mlr.press/v97/poole19a.html](https://proceedings.mlr.press/v97/poole19a.html).

Popova, Mariya, Olexandr Isayev, and Alexander Tropsha (2018). “Deep reinforcement learning for de novo drug design”. In: Science Advances 4.7. url: [https://www.science.org/doi/10.1126/sciadv.aap7885](https://www.science.org/doi/10.1126/sciadv.aap7885).

Rai, Daking et al. (2024). “A practical review of mechanistic interpretability for transformer-based language models”. In: ArXiv preprint 2407.02646. url: [https://arxiv.org/abs/2407.02646](https://arxiv.org/abs/2407.02646).

Rainforth, Tom et al. (2024). “Modern Bayesian experimental design”. In: Statistical Science 39.1, pp. 100–114. url: [https://projecteuclid.org/journals/statistical-science/volume-39/issue-1/Modern-Bayesian-Experimental-Design/10.1214/23-STS915.short](https://projecteuclid.org/journals/statistical-science/volume-39/issue-1/Modern-Bayesian-Experimental-Design/10.1214/23-STS915.short).

Ramsey, Frank P. (1926). “Truth and Probability”. In: The Foundations of Mathematics and other Logical Essays. Ed. by R.B. Braithwaite. 1999 electronic edition. London: Kegan, Paul, Trench, Trubner & Co. Chap. VII, pp. 156–198. url: [https://books.google.ca/books?id=1st-3kYOEPQC](https://books.google.ca/books?id=1st-3kYOEPQC).

Reed, Scott et al. (2022). “A generalist agent”. In: Transactions on Machine Learning Research. url: [https://openreview.net/forum?id=1ikK0kHjvj](https://openreview.net/forum?id=1ikK0kHjvj).

Rempe, Davis et al. (2021). “Generating useful accident-prone driving scenarios via a learned traffic prior”. In: Proc. Conf. on Computer Vision and Pattern Recognition, pp. 17284–17294. url: [https://www.computer.org/csdl/proceedings-article/cvpr/2022/694600r7284/1H1k7GlOq9G](https://www.computer.org/csdl/proceedings-article/cvpr/2022/694600r7284/1H1k7GlOq9G).

Richardson, Oliver E. (2022). “Loss as the inconsistency of a probabilistic dependency graph: Choose your model, not your loss function”. In: Proc. International Conference on Artificial Intelligence and Statistics, pp. 2706–2735. url: [https://proceedings.mlr.press/v151/richardson22b.html](https://proceedings.mlr.press/v151/richardson22b.html).

Roose, Kevin (Dec. 2023). “This A.I. subculture’s motto: go, go, go”. In: The New York Times. Accessed: 2025-02-06. url: [https://www.nytimes.com/2023/12/10/technology/ai-acceleration.html](https://www.nytimes.com/2023/12/10/technology/ai-acceleration.html).

Rowe, Luke et al. (2024). “CtRL-Sim: reactive and controllable driving agents with offline reinforcement learning”. In: Conference on Robot Learning. url: [https://openreview.net/forum?id=MfIUKzihC8](https://openreview.net/forum?id=MfIUKzihC8).

Ruan, Yangjun et al. (2023). “Identifying the risks of LM agents with an LM-emulated sandbox”. In: Proc. International Conference on Learning Representations. url: [https://openreview.net/forum?id=GEcwtMk1uA](https://openreview.net/forum?id=GEcwtMk1uA).

Rudner, Tim G.J. and Helen Toner (Dec. 2021). Key concepts in AI safety: specification in machine learning. Tech. rep. Accessed: 2025-02-06. Center for Security and Emerging Technology. url: [https://cset.georgetown.edu/wp-content/uploads/Key-Concepts-in-AI-Safety-Specification-in-Machine-Learning.pdf](https://cset.georgetown.edu/wp-content/uploads/Key-Concepts-in-AI-Safety-Specification-in-Machine-Learning.pdf).

Russell, Stuart (2019). Human compatible: AI and the problem of control. Penguin UK. url: [https://books.google.ca/books/about/Human_Compatible.html?id=VMq_wwEACAAJ](https://books.google.ca/books/about/Human_Compatible.html?id=VMq_wwEACAAJ).

Russell, Stuart (2022). “If we succeed”. In: Daedalus 151.2, pp. 43–57. url: [https://www.amacad.org/publication/daedalus/if-we-succeed](https://www.amacad.org/publication/daedalus/if-we-succeed).

Russell, Stuart J and Peter Norvig (2016). Artificial intelligence: a modern approach. url: [https://aima.cs.berkeley.edu/](https://aima.cs.berkeley.edu/).

Savage, L. J. (1954). Foundations of Statistics. New York: Wiley. url: [https://books.google.ca/books?id=zSv6dBWneMEC](https://books.google.ca/books?id=zSv6dBWneMEC).

Schölkopf, Bernhard et al. (2021). “Toward causal representation learning”. In: Proc. IEEE. Vol. 109. IEEE, pp. 612–634. url: [https://ieeexplore.ieee.org/abstract/document/9363924](https://ieeexplore.ieee.org/abstract/document/9363924).

Schrittwieser, Julian et al. (2020). “Mastering Atari, Go, chess and shogi by planning with a learned model”. In: Nature 588.7839, pp. 604–609. url: [https://www.nature.com/articles/s41586-020-03051-4](https://www.nature.com/articles/s41586-020-03051-4).

Sendera, Marcin et al. (2024). “Improved off-policy training of diffusion samplers”. In: Proc. Neural Information Processing Systems. url: [https://openreview.net/forum?id=vieIamY2Gi](https://openreview.net/forum?id=vieIamY2Gi).

Shah, Rohin et al. (2022). “Goal misgeneralization: why correct specifications aren’t enough for correct goals”. In: ArXiv preprint 2210.01790. url: [https://arxiv.org/abs/2210.01790](https://arxiv.org/abs/2210.01790).

Shepherd, John G (2009). Geoengineering the climate: science, governance and uncertainty. Royal Society. url: [https://royalsociety.org/news-resources/publications/2009/geoengineering-climate/](https://royalsociety.org/news-resources/publications/2009/geoengineering-climate/).

Silver, David, Aja Huang, et al. (2016). “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587, pp. 484–489. url: [https://www.nature.com/articles/nature16961](https://www.nature.com/articles/nature16961).

Silver, David, Thomas Hubert, et al. (2018). “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”. In: Science 362.6419, pp. 1140–1144. url: [https://www.science.org/doi/10.1126/science.aar6404](https://www.science.org/doi/10.1126/science.aar6404).

Silver, David, Julian Schrittwieser, et al. (2017). “Mastering the game of Go without human knowledge”. In: Nature 550.7676, pp. 354–359. url: [https://www.nature.com/articles/nature24270](https://www.nature.com/articles/nature24270).

Skalse, Joar et al. (2022). “Defining and characterizing reward gaming”. In: Proc. Neural Information Processing Systems. Vol. 35, pp. 9460–9471. url: [https://proceedings.neurips.cc/paper_files/paper/2022/hash/3d719fee332caa23d5038b8a90e81796-Abstract-Conference.html](https://proceedings.neurips.cc/paper_files/paper/2022/hash/3d719fee332caa23d5038b8a90e81796-Abstract-Conference.html).

Sohl-Dickstein, Jascha et al. (2015). “Deep unsupervised learning using nonequilibrium thermodynamics”. In: Proc. International Conference on Machine Learning, pp. 2256–2265. url: [http://proceedings.mlr.press/v37/sohl-dickstein15.html](http://proceedings.mlr.press/v37/sohl-dickstein15.html).

Solomonoff, Ray J (1964). “A formal theory of inductive inference. Part I”. In: Information and Control 7.1, pp. 1–22. url: [https://www.sciencedirect.com/science/article/pii/S0019995864902232](https://www.sciencedirect.com/science/article/pii/S0019995864902232).

Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. 2nd ed. MIT Press. url: [https://mitpress.mit.edu/9780262039246/reinforcement-learning/](https://mitpress.mit.edu/9780262039246/reinforcement-learning/).

Tegmark, Max (2018). Life 3.0: Being human in the age of artificial intelligence. Vintage. url: [https://books.google.ca/books/about/Life_3_0.html?id=3_otDwAAQBAJ](https://books.google.ca/books/about/Life_3_0.html?id=3_otDwAAQBAJ).

Thornley, Elliott (2024). “The shutdown problem: an AI engineering puzzle for decision theorists”. In: Philosophical Studies, pp. 1–28. url: [https://link.springer.com/article/10.1007/s11098-024-02153-3](https://link.springer.com/article/10.1007/s11098-024-02153-3).

Turtayev, Rustem et al. (2024). “Hacking CTFs with plain agents”. In: ArXiv preprint 2412.02776. url: [https://arxiv.org/abs/2412.02776](https://arxiv.org/abs/2412.02776).

United Nations (Sept. 2015). Transforming our world: the 2030 Agenda for Sustainable Development. Webpage: [https://sdgs.un.org/2030agenda](https://sdgs.un.org/2030agenda). Accessed: 2025-02-06.

US Government (Oct. 2024). Framework to Advance AI Governance and Risk Management in National Security. White House publication: [https://ai.gov/wp-content/uploads/2024/10/NSM-Framework-to-Advance-AI-Governance-and-Risk-Management-in-National-Security.pdf](https://ai.gov/wp-content/uploads/2024/10/NSM-Framework-to-Advance-AI-Governance-and-Risk-Management-in-National-Security.pdf). Accessed: 2025-02-06.

Vaccari, Cristian and Andrew Chadwick (2020). “Deepfakes and disinformation: Exploring the impact of synthetic political video on deception, uncertainty, and trust in news”. In: Social Media + Society 6.1. url: [https://journals.sagepub.com/doi/10.1177/2056305120903408](https://journals.sagepub.com/doi/10.1177/2056305120903408).

Vargas, Francisco, Will Grathwohl, and Arnaud Doucet (2023). “Denoising diffusion samplers”. In: Proc. International Conference on Learning Representations. url: [https://openreview.net/forum?id=8pvnfTAbu1f](https://openreview.net/forum?id=8pvnfTAbu1f).

Venkatraman, Siddarth et al. (2024). “Amortizing Intractable Inference in Diffusion Models for Bayesian Inverse Problems”. In: Proc. Workshop on Machine Learning and the Physical Sciences. Accessed: 2025-02-06. url: [https://ml4physicalsciences.github.io/2024/files/NeurIPS_ML4PS_2024_188.pdf](https://ml4physicalsciences.github.io/2024/files/NeurIPS_ML4PS_2024_188.pdf).

Villalobos, Pablo et al. (2024). “Position: will we run out of data? Limits of LLM scaling based on human-generated data”. In: Proc. International Conference on Machine Learning. url: [https://openreview.net/forum?id=ViZcgDQjyG](https://openreview.net/forum?id=ViZcgDQjyG).

Vincent, Pascal (2011). “A connection between score matching and denoising autoencoders”. In: Neural Computation 23, pp. 1661–1674. url: [https://ieeexplore.ieee.org/abstract/document/6795935](https://ieeexplore.ieee.org/abstract/document/6795935).

Wang, Hanchen et al. (2023). “Scientific discovery in the age of artificial intelligence”. In: Nature 620.7972, pp. 47–60. url: [https://www.nature.com/articles/s41586-023-06221-2](https://www.nature.com/articles/s41586-023-06221-2).

Wang, Jingkang et al. (2021). “AdvSim: Generating safety-critical scenarios for self-driving vehicles”. In: Proc. Conf. on Computer Vision and Pattern Recognition, pp. 9909–9918. url: [https://ieeexplore.ieee.org/document/9578745](https://ieeexplore.ieee.org/document/9578745).

Wei, Jason et al. (2022). “Emergent abilities of large language models”. In: Transactions on Machine Learning Research. url: [https://openreview.net/forum?id=yzkSU5zdwD](https://openreview.net/forum?id=yzkSU5zdwD).

Wijk, Hjalmar et al. (2024). “RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts”. In: arXiv preprint 2411.15114. url: [https://arxiv.org/abs/2411.15114](https://arxiv.org/abs/2411.15114).

Wynroe, Keith, David Atkinson, and Jaime Sevilla (2023). Literature review of transformative artificial intelligence timelines. Tech. rep. Accessed: 2025-02-06. Epoch AI. url: [https://epoch.ai/blog/literature-review-of-transformative-artificial-intelligence-timelines](https://epoch.ai/blog/literature-review-of-transformative-artificial-intelligence-timelines).

Yudkowsky, Eliezer S. (2002). The AI-Box experiment. Blog post: [https://www.yudkowsky.net/singularity/aibox](https://www.yudkowsky.net/singularity/aibox). Accessed: 2025-02-06.

Zhang, Dinghuai, Ricky TQ Chen, et al. (2022). “Unifying generative models with GFlowNets and beyond”. In: arXiv preprint 2209.02606. url: [https://arxiv.org/abs/2209.02606](https://arxiv.org/abs/2209.02606).

Zhang, Dinghuai, Nikolay Malkin, et al. (2022). “Generative flow networks for discrete probabilistic modeling”. In: Proc. International Conference on Machine Learning, pp. 26412–26428. url: [https://proceedings.mlr.press/v162/zhang22v.html](https://proceedings.mlr.press/v162/zhang22v.html).

Zhang, Qinsheng and Yongxin Chen (2022). “Path integral sampler: a stochastic control approach for sampling”. In: Proc. International Conference on Learning Representations. url: [https://openreview.net/forum?id=_uCb2ynRu7Y](https://openreview.net/forum?id=_uCb2ynRu7Y).

Zou, Andy et al. (2023). “Universal and transferable adversarial attacks on aligned language models”. In: ArXiv preprint 2307.15043. url: [https://arxiv.org/abs/2307.15043](https://arxiv.org/abs/2307.15043).
