Deliberative Alignment: Reasoning Enables Safer Language Models

Melody Y. Guan; Manas Joglekar; Eric Wallace; Saachi Jain; Boaz Barak; Alec Helyar; Rachel Dias; Andrea Vallone; Hongyu Ren; Jason Wei; Hyung Won Chung; Sam Toyer; Johannes Heidecke; Alex Beutel; Amelia Glaese

doi:10.70777/si.v2i3.15159

Authors

Melody Y. Guan OpenAI; Stanford University
Manas Joglekar OpenAI
Eric Wallace OpenAI
Saachi Jain OpenAI
Boaz Barak Harvard University; OpenAI
Alec Helyar OpenAI
Rachel OpenAI
Andrea Vallone OpenAI
Hongyu Ren OpenAI
Jason Wei OpenAI
Hyung Won Chung OpenAI
Sam Toyer OpenAI
Johannes Heidecke OpenAI
Alex OpenAI
Amelia Glaese OpenAI

DOI:

https://doi.org/10.70777/si.v2i3.15159

Keywords:

Chain-of-Thought (CoT), deliberative alignment, Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), Jailbreak Resistance , Overrefusal Rates, Hard Refusal, AI safety categories, Out-of-Distribution (OOD) Generalization, Encoding-based Jailbreaks, Response Style Guidelines, Alignment Strategies, Deception Monitoring, Critique-and-Refine

Abstract

As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI’s o-series models [1], and achieved highly precise adherence to OpenAI’s safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.

References

OpenAI, Learning to reason with LLMs, 2024.

[Online]. Available: https://openai.com/index/learningto- reason-with-llms/.

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., “Training language models to follow instructions with human feedback,” in NeurIPS, 2022.

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., “The LLaMA 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.

M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024.

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023.

A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training fail?” NeurIPS, 2024.

M. Andriushchenko, F. Croce, and N. Flammarion, “Jailbreaking leading safety-aligned llms with simple adaptive attacks,” arXiv preprint arXiv:2404.02151, 2024.

J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins, “Solving math word problems with process-and outcome-based feedback,” arXiv preprint arXiv:2211.14275, 2022.

C. Snell, D. Klein, and R. Zhong, “Learning by distilling context,” arXiv preprint arXiv:2209.15189, 2022.

A. Askell, Y. Bai, A. Chen, et al., “A general language assistant as a laboratory for alignment,” arXiv preprint arXiv:2112.00861, 2021.

A. Souly, Q. Lu, D. Bowen, et al., “A strongreject for empty jailbreaks,” arXiv preprint arXiv:2402.10260, 2024.

P. R¨ottger, H. R. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy, “Xstest: A test suite for identifying exaggerated safety behaviours in large language models,” arXiv preprint arXiv:2308.01263, 2024.

OpenAI, Introducing the model spec, 2024.

[Online]. Available: https://cdn.openai.com/spec/modelspec- 2024-05-08.html.

W. Zhao, X. Ren, J. Hessel, C. Cardie, Y. Choi, and Y. Deng, “Wildchat: 1m chatgpt interaction logs in the wild,” arXiv preprint arXiv:2405.01470, 2024.

X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “”do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” arXiv preprint arXiv:2308.03825, 2024.

P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, “Jailbreaking black box large language models in twenty queries,” arXiv preprint arXiv:2310.08419, 2024.

P. Chao, E. Debenedetti, A. Robey, et al., “Jailbreakbench: An open robustness benchmark for jailbreaking large language models,” arXiv preprint arXiv:2404.01318, 2024.

P. Kumar, E. Lau, S. Vijayakumar, et al., “Refusal-trained llms are easily jailbroken as browser agents,” arXiv preprint arXiv:2410.13886, 2024.

OpenAI, O1 system card, 2024.

[Online]. Available: https://cdn.openai.com/o1-system-card.pdf.

OpenAI, Gpt-4o system card, 2024.

[Online]. Available: https://cdn.openai.com/gpt-4o-system-card.pdf.

Anthropic, Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet, 2024.

[Online]. Available: https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October- Addendum.pdf.

G. Gemini Team, “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024.

J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus, “Measuring short-form factuality in large language models,” arXiv preprint arXiv:2411.04368, 2024.

A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. R. Bowman, “BBQ: A hand-built bias benchmark for question answering,” arXiv preprint arXiv:2110.08193, 2021.

Y. Bai, S. Kadavath, S. Kundu, et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, 2022.

A. Madaan, N. Tandon, P. Gupta, et al., “Self-refine: Iterative refinement with self-feedback,” arXiv preprint arXiv:2303.17651, 2023.

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems, vol. 30, 2017.

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” arXiv preprint arXiv:2305.18290, 2024.

L. Pan, M. Saxon, W. Xu, D. Nathani, X. Wang, and W. Y. Wang, “Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies,” arXiv preprint arXiv:2308.03188, 2023.

S. Welleck, X. Lu, P. West, F. Brahman, T. Shen, D. Khashabi, and Y. Choi, “Generating sequences by learning to self-correct,” in The Eleventh International Conference on Learning Representations, vol. 2, 2023.

I. Schlag, S. Sukhbaatar, A. Celikyilmaz, W.-t. Yih, J. Weston, J. Schmidhuber, and X. Li, “Large language model programs,” arXiv preprint arXiv:2305.05364, 2023.

Y. Zhang, J. Chi, H. Nguyen, K. Upasani, D. M. Bikel, J. Weston, and E. M. Smith, “Backtracking improves generation safety,” arXiv preprint arXiv:2409.14586, 2024.

S. Russell, Human compatible: Artificial intelligence and the problem of control, 1st. USA: Penguin Books, 2019, isbn: 9780525558637.

N. Bostrom, Superintelligence: Paths, Dangers, Strategies, 1st. USA: Oxford University Press, Inc., 2014, isbn: 0199678111.

S. M. Omohundro, “The basic ai drives,” in Proceedings of the 2008 Conference on Artificial General Intelligence 2008: Proceedings of the First AGI Conference, NLD: IOS Press, 2008, pp. 483–492, isbn: 9781586038335.

O. J¨arviniemi and E. Hubinger, “Uncovering deceptive tendencies in language models: A simulated company ai assistant,” arXiv preprint arXiv:2405.01576, 2024.

T. Hagendorff, “Deception abilities emerged in large language models,” Proceedings of the National Academy of Sciences, vol. 121, no. 24, Jun. 2024, issn: 1091-6490.

Deliberative Alignment: Reasoning Enables Safer Language Models

Authors

DOI:

Keywords:

Abstract

References

Downloads

Published

How to Cite

Issue

Section

Categories

License

Current Issue

Announcements

Dario Amodei, The Adolescence of Technology: Confronting and Overcoming the Risks of Powerful AI

Steve Omohundro: Regulating AGI: From Liability to Provable Contracts

Joe Rogan Experience #2345 - Roman Yampolskiy

Steve Omohundro Receives 2024 Future of Life Award

Steve Omohundro and Scientists Discuss the AI Alignment Problem with Neil deGrasse Tyson

Information