The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

Authors

  • HyunJin Kim, Microsoft Research Asia
  • Xiaoyuan Yi, Microsoft Research Asia
  • JinYeong Bak, Sungkyunkwan University
  • Jing Yao, Microsoft Research Asia
  • Jianxun Lian, Microsoft Research Asia
  • Muhua Huang, The University of Chicago
  • Shitong Duan, Fudan University, Shanghai
  • Xing Xie, Microsoft Research Asia

DOI:

https://doi.org/10.70777/si.v2i1.13963

Keywords:

agi, agi alignment, agi safety, superintelligence safety, superintelligence alignment, rlhf, reinforcement learning human feedback, weak-to-strong generalization, reinforcement learning from ai feedback, zero-sum debate, ai sandwiching

Abstract

The emergence of large language models (LLMs) has sparked discussion of Artificial Superintelligence (ASI), a hypothetical AI system surpassing human intelligence. Though ASI is still hypothetical and far from current AI capabilities, existing alignment methods struggle to guide such advanced AI and ensure its safety in the future, so it is essential to discuss the alignment of such AI now. Superalignment, the alignment of AI systems at superhuman levels of capability with human values and safety requirements, aims to address two primary goals: scalability in supervision to provide high-quality guidance signals, and robust governance to ensure alignment with human values. In this survey, we review the original scalable oversight problem, the corresponding methods, and potential solutions for superalignment. Specifically, we first introduce the challenges and limitations of current alignment paradigms in addressing the superalignment problem. We then review scalable oversight methods for superalignment. Finally, we discuss the key challenges and propose pathways for the safe and continual improvement of future AI systems. By comprehensively reviewing the current literature, we aim to provide a systematic introduction to existing methods, analyze their strengths and limitations, and discuss potential future directions.

Published

2025-03-16

How to Cite

Kim, H., Yi, X., Bak, J., Yao, J., Lian, J., Huang, M., … Xie, X. (2025). The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment. SuperIntelligence - Robotics - Safety & Alignment, 2(1). https://doi.org/10.70777/si.v2i1.13963