Can a Bayesian Oracle Prevent Harm from an Agent?
DOI:
https://doi.org/10.70777/si.v2i1.13799
Keywords:
bayesian ai, ai oracle, agi safety, agi risks, artificial general intelligence, artificial superintelligence, ai agent
Abstract
Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees? With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we consider estimating a context-dependent bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at run-time to provide a guardrail against dangerous actions of an AI. Noting that different plausible hypotheses about the world could produce very different outcomes, and because we do not know which one is right, we derive bounds on the safety violation probability predicted under the true but unknown hypothesis. Such bounds could be used to reject potentially dangerous actions. Our main results involve searching for cautious but plausible hypotheses, obtained by a maximization that involves Bayesian posteriors over hypotheses. We consider two forms of this result, in the i.i.d. case and in the non-i.i.d. case, and conclude with open problems towards turning such theoretical results into practical AI guardrails.
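As a rough illustration of the idea summarized above, the following Python sketch computes a Bayesian posterior over a small finite set of hypotheses, keeps the hypotheses that remain plausible, and uses the largest harm probability among them as a cautious bound for rejecting an action. It is a toy under stated assumptions, not the paper's actual theorem or algorithm: the plausibility rule (posterior at least `alpha` times the maximum posterior), the names `posterior`, `cautious_harm_bound`, and `guardrail`, and all numeric values are hypothetical choices made for this example.

```python
import math


def posterior(prior, log_likelihoods):
    """Bayesian posterior over a finite hypothesis set, computed in log space."""
    log_post = [math.log(p) + ll for p, ll in zip(prior, log_likelihoods)]
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def cautious_harm_bound(post, harm_prob, alpha=0.1):
    """Largest predicted harm probability among plausible hypotheses.

    Assumption (illustrative only): a hypothesis is "plausible" when its
    posterior is at least alpha times the maximum posterior.
    """
    cutoff = alpha * max(post)
    return max(h for p, h in zip(post, harm_prob) if p >= cutoff)


def guardrail(post, harm_prob, threshold=0.05, alpha=0.1):
    """Allow the action only if the cautious bound stays under the threshold."""
    return cautious_harm_bound(post, harm_prob, alpha) <= threshold


# Hypothetical numbers: three hypotheses about the world.
prior = [0.5, 0.3, 0.2]
log_lik = [math.log(0.20), math.log(0.10), math.log(0.001)]  # log P(data | hypothesis)
harm_prob = [0.01, 0.30, 0.90]  # P(harm | action, context, hypothesis)

post = posterior(prior, log_lik)
mean_harm = sum(p * h for p, h in zip(post, harm_prob))  # posterior-averaged harm
bound = cautious_harm_bound(post, harm_prob)
allowed = guardrail(post, harm_prob)
```

In this toy setting the posterior-averaged harm estimate is lower than the cautious bound, because the bound is driven by the second hypothesis, which is less likely than the first but still plausible and predicts much more harm. The third hypothesis predicts the most harm but is effectively ruled out by the data, so it does not affect the bound. Taking a maximum over plausible hypotheses rather than an average is what makes the guardrail conservative when the true hypothesis is unknown.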
Superintelligence – Robotics – Safety & Alignment 2025 2(1) Large Language Models I
License
Copyright (c) 2025 Yoshua Bengio, Matt McDermott, Michael K. Cohen, Nikolay Malkin, Damiano Fornasiere, Pietro Greiner, Younesse Kaddar

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.