The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment
DOI:
https://doi.org/10.70777/si.v2i1.13963
Keywords:
agi, agi alignment, agi safety, superintelligence safety, superintelligence alignment, rlhf, reinforcement learning human feedback, weak-to-strong generalization, reinforcement learning from ai feedback, zero-sum debate, ai sandwiching
Abstract
The emergence of large language models (LLMs) has sparked discussion of Artificial Superintelligence (ASI), a hypothetical AI system that surpasses human intelligence. Although ASI remains hypothetical and far beyond current AI capabilities, existing alignment methods struggle to guide such advanced AI and ensure its safety in the future, so it is essential to discuss its alignment now. Superalignment, the alignment of AI systems at superhuman levels of capability with human values and safety requirements, aims to address two primary goals: scalability in supervision to provide high-quality guidance signals, and robust governance to ensure alignment with human values. In this survey, we review the original scalable oversight problem and the corresponding methods and potential solutions for superalignment. Specifically, we first introduce the challenges and limitations of current alignment paradigms in addressing the superalignment problem. We then review scalable oversight methods for superalignment. Finally, we discuss the key challenges and propose pathways for the safe and continual improvement of future AI systems. By comprehensively reviewing the current literature, we aim to provide a systematic introduction to existing methods, analyze their strengths and limitations, and discuss potential future directions.
License
Copyright (c) 2025 HyunJin Kim, Xiaoyuan Yi, JinYeong Bak, Jing Yao, Jianxun Lian, Muhua Huang, Shitong Duan, Xing Xie

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.