Can a Bayesian Oracle Prevent Harm from an Agent?

Authors

  • Yoshua Bengio, Mila, Universite de Montreal
  • Matt McDermott, Imperial College, London
  • Michael K. Cohen, University of California, Berkeley
  • Nikolay Malkin, Chancellor's Fellow @ University of Edinburgh, School of Informatics
  • Damiano Fornasiere, Mila, Universite de Montreal
  • Pietro Greiner, Mila, Universite de Montreal
  • Younesse Kaddar, University of Oxford, https://orcid.org/0000-0001-7366-9889

DOI:

https://doi.org/10.70777/si.v2i1.13799

Keywords:

Bayesian AI, AI oracle, AGI safety, AGI risks, artificial general intelligence, artificial superintelligence, AI agent

Abstract

Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees? With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we consider estimating a context-dependent bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at run-time to provide a guardrail against dangerous actions of an AI. Noting that different plausible hypotheses about the world could produce very different outcomes, and because we do not know which one is right, we derive bounds on the safety violation probability predicted under the true but unknown hypothesis. Such bounds could be used to reject potentially dangerous actions. Our main results involve searching for cautious but plausible hypotheses, obtained by a maximization that involves Bayesian posteriors over hypotheses. We consider two forms of this result, in the i.i.d. case and in the non-i.i.d. case, and conclude with open problems towards turning such theoretical results into practical AI guardrails.
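
As a rough illustration of the guardrail idea described in the abstract, the sketch below computes a cautious harm estimate over a finite set of hypotheses. It is not the paper's bound: the finite hypothesis set, the plausibility threshold `alpha`, the rejection `tolerance`, and the function names `cautious_harm_bound` and `allow_action` are all assumptions made for illustration. Each hypothesis carries a posterior probability and a predicted probability of harm for a proposed action; the sketch takes the largest harm probability among hypotheses whose posterior is within a factor `alpha` of the most probable one, and rejects the action if that value exceeds the tolerance.

```python
from typing import Sequence


def cautious_harm_bound(
    posteriors: Sequence[float],
    harm_probs: Sequence[float],
    alpha: float = 0.1,
) -> float:
    """Largest predicted harm probability among plausible hypotheses.

    Assumption (not taken from the paper): a hypothesis counts as
    plausible when its posterior is at least alpha times the largest
    posterior in the set.
    """
    if len(posteriors) != len(harm_probs) or not posteriors:
        raise ValueError("need one harm probability per hypothesis")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    threshold = alpha * max(posteriors)
    # Maximize harm over the plausible set; the most probable hypothesis
    # always passes the threshold, so the set is never empty.
    return max(h for p, h in zip(posteriors, harm_probs) if p >= threshold)


def allow_action(
    posteriors: Sequence[float],
    harm_probs: Sequence[float],
    tolerance: float = 0.05,
    alpha: float = 0.1,
) -> bool:
    """Reject the action when the cautious estimate exceeds the tolerance."""
    return cautious_harm_bound(posteriors, harm_probs, alpha) <= tolerance


# Hypothetical numbers: three hypotheses about the world.
posteriors = [0.70, 0.25, 0.05]
harm_probs = [0.01, 0.30, 0.90]
bound = cautious_harm_bound(posteriors, harm_probs, alpha=0.1)
```

With these hypothetical numbers, the most probable hypothesis predicts almost no harm, yet the action is rejected because a second plausible hypothesis predicts substantial harm. Lowering `alpha` widens the plausible set and makes the estimate more conservative.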

Author Biographies

Yoshua Bengio, Mila, Universite de Montreal

Recognized worldwide as one of the leading experts in artificial intelligence, Yoshua Bengio is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of Computing,” with Geoffrey Hinton and Yann LeCun. He is the computer scientist with the largest number of citations and the highest h-index.

He is Full Professor at Université de Montréal, and the Founder and Scientific Director of Mila – Quebec AI Institute. He co-directs the CIFAR Learning in Machines & Brains program and acts as Scientific Director of IVADO. 

He has received numerous awards, including the prestigious Killam Prize and Herzberg Gold Medal in Canada, CIFAR’s AI Chair, Spain’s Princess of Asturias Award, and the VinFuture Prize. He is a Fellow of both the Royal Society of London and the Royal Society of Canada, a Knight of the Legion of Honor of France, an Officer of the Order of Canada, and a Member of the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology. In 2024, TIME magazine named Yoshua Bengio one of the 100 most influential people in the world.

Concerned about the social impact of AI, he actively contributed to the Montreal Declaration for the Responsible Development of Artificial Intelligence and currently chairs the International Scientific Report on the Safety of Advanced AI.

Matt McDermott, Imperial College, London

PhD Student at Imperial College London, CDT in Safe and Trusted AI. Working on the Safe AI for Humanity Project. Also involved in the Causal Incentives Working Group. mjm121 at ic.ac.uk

 

Michael K. Cohen, University of California, Berkeley

I'm a postdoc with Stuart Russell at UC Berkeley. My research considers the expected behavior of generally intelligent artificial agents. I am interested in designing agents that we can expect to behave safely, no matter how instrumentally rational they are.

My perspective on the extinction risk posed by AI can be found in my published work on the topic—subject to several assumptions, advanced algorithms that explicitly plan over the long term using a learned model of the world would likely intervene in the provision of certain observations, and outcompete us for resources in an attempt to do so securely. My research mostly aims to find violations of those assumptions, with some success.

Nikolay Malkin, Chancellor's Fellow @ University of Edinburgh, School of Informatics

I work on algorithms for deep-learning-based reasoning and their applications. I am specifically interested in the following subjects:

  •  Machine learning for generative models, in particular, induction of compositional structure in generative models and modeling of posteriors over high-dimensional explanatory variables (including with continuous-time (diffusion) generative models). Much of my recent work is on generative flow networks, which are a path towards inference machines that build structured, uncertainty-aware explanations for observed data.
  •  Applications to natural language processing and reasoning in language: what large language models can do, what they cannot do, and how to overcome their limitations with improved inference procedures. I view human-like symbolic, formal, and mathematical reasoning via Bayesian neurosymbolic methods as a long-term aspiration for artificial intelligence.
  •  Applications to computer vision: notably, my work on AI for remote sensing (land cover mapping and change detection), which can be used for tracking land use patterns over time and monitoring the effects of climate change.

Damiano Fornasiere, Mila, Universite de Montreal

Senior AI Safety Research Scientist, Safe AI for Humanity


Fig. 2a: Overestimate of action harm vs. theoretical lower bound.

Published

2025-03-05

How to Cite

Bengio, Y., McDermott, M., Cohen, M. K., Malkin, N., Fornasiere, D., Greiner, P., & Kaddar, Y. (2025). Can a Bayesian Oracle Prevent Harm from an Agent? SuperIntelligence - Robotics - Safety & Alignment, 2(1). https://doi.org/10.70777/si.v2i1.13799