Deliberative Alignment: Reasoning Enables Safer Language Models
DOI:
https://doi.org/10.70777/si.v2i3.15159Keywords:
Chain-of-Thought (CoT), deliberative alignment, Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), Jailbreak Resistance , Overrefusal Rates, Hard Refusal, AI safety categories, Out-of-Distribution (OOD) Generalization, Encoding-based Jailbreaks, Response Style Guidelines, Alignment Strategies, Deception Monitoring, Critique-and-RefineAbstract
As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI’s o-series models [1], and achieved highly precise adherence to OpenAI’s safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.
References
OpenAI, Learning to reason with LLMs, 2024.
[Online]. Available: https://openai.com/index/learningto- reason-with-llms/.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., “Training language models to follow instructions with human feedback,” in NeurIPS, 2022.
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., “The LLaMA 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024.
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023.
A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training fail?” NeurIPS, 2024.
M. Andriushchenko, F. Croce, and N. Flammarion, “Jailbreaking leading safety-aligned llms with simple adaptive attacks,” arXiv preprint arXiv:2404.02151, 2024.
J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins, “Solving math word problems with process-and outcome-based feedback,” arXiv preprint arXiv:2211.14275, 2022.
C. Snell, D. Klein, and R. Zhong, “Learning by distilling context,” arXiv preprint arXiv:2209.15189, 2022.
A. Askell, Y. Bai, A. Chen, et al., “A general language assistant as a laboratory for alignment,” arXiv preprint arXiv:2112.00861, 2021.
A. Souly, Q. Lu, D. Bowen, et al., “A strongreject for empty jailbreaks,” arXiv preprint arXiv:2402.10260, 2024.
P. R¨ottger, H. R. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy, “Xstest: A test suite for identifying exaggerated safety behaviours in large language models,” arXiv preprint arXiv:2308.01263, 2024.
OpenAI, Introducing the model spec, 2024.
[Online]. Available: https://cdn.openai.com/spec/modelspec- 2024-05-08.html.
W. Zhao, X. Ren, J. Hessel, C. Cardie, Y. Choi, and Y. Deng, “Wildchat: 1m chatgpt interaction logs in the wild,” arXiv preprint arXiv:2405.01470, 2024.
X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “”do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” arXiv preprint arXiv:2308.03825, 2024.
P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, “Jailbreaking black box large language models in twenty queries,” arXiv preprint arXiv:2310.08419, 2024.
P. Chao, E. Debenedetti, A. Robey, et al., “Jailbreakbench: An open robustness benchmark for jailbreaking large language models,” arXiv preprint arXiv:2404.01318, 2024.
P. Kumar, E. Lau, S. Vijayakumar, et al., “Refusal-trained llms are easily jailbroken as browser agents,” arXiv preprint arXiv:2410.13886, 2024.
OpenAI, O1 system card, 2024.
[Online]. Available: https://cdn.openai.com/o1-system-card.pdf.
OpenAI, Gpt-4o system card, 2024.
[Online]. Available: https://cdn.openai.com/gpt-4o-system-card.pdf.
Anthropic, Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet, 2024.
[Online]. Available: https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October- Addendum.pdf.
G. Gemini Team, “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024.
J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus, “Measuring short-form factuality in large language models,” arXiv preprint arXiv:2411.04368, 2024.
A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. R. Bowman, “BBQ: A hand-built bias benchmark for question answering,” arXiv preprint arXiv:2110.08193, 2021.
Y. Bai, S. Kadavath, S. Kundu, et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, 2022.
A. Madaan, N. Tandon, P. Gupta, et al., “Self-refine: Iterative refinement with self-feedback,” arXiv preprint arXiv:2303.17651, 2023.
P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” arXiv preprint arXiv:2305.18290, 2024.
L. Pan, M. Saxon, W. Xu, D. Nathani, X. Wang, and W. Y. Wang, “Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies,” arXiv preprint arXiv:2308.03188, 2023.
S. Welleck, X. Lu, P. West, F. Brahman, T. Shen, D. Khashabi, and Y. Choi, “Generating sequences by learning to self-correct,” in The Eleventh International Conference on Learning Representations, vol. 2, 2023.
I. Schlag, S. Sukhbaatar, A. Celikyilmaz, W.-t. Yih, J. Weston, J. Schmidhuber, and X. Li, “Large language model programs,” arXiv preprint arXiv:2305.05364, 2023.
Y. Zhang, J. Chi, H. Nguyen, K. Upasani, D. M. Bikel, J. Weston, and E. M. Smith, “Backtracking improves generation safety,” arXiv preprint arXiv:2409.14586, 2024.
S. Russell, Human compatible: Artificial intelligence and the problem of control, 1st. USA: Penguin Books, 2019, isbn: 9780525558637.
N. Bostrom, Superintelligence: Paths, Dangers, Strategies, 1st. USA: Oxford University Press, Inc., 2014, isbn: 0199678111.
S. M. Omohundro, “The basic ai drives,” in Proceedings of the 2008 Conference on Artificial General Intelligence 2008: Proceedings of the First AGI Conference, NLD: IOS Press, 2008, pp. 483–492, isbn: 9781586038335.
O. J¨arviniemi and E. Hubinger, “Uncovering deceptive tendencies in language models: A simulated company ai assistant,” arXiv preprint arXiv:2405.01576, 2024.
T. Hagendorff, “Deception abilities emerged in large language models,” Proceedings of the National Academy of Sciences, vol. 121, no. 24, Jun. 2024, issn: 1091-6490.
Downloads
Published
How to Cite
Issue
Section
Categories
License
Copyright (c) 2025 Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex, Amelia Glaese

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.