DarwinLM: Evolutionary Structured Pruning of Large Language Models

Shengkun Tang; Oliver Sieberling; Eldar Kurtic; Dan Alistarh

doi:10.70777/si.v2i3.15171

Authors

Shengkun Tang, Department of Machine Learning, MBZUAI, Abu Dhabi, UAE
Oliver Sieberling, ETH Zurich, https://orcid.org/0009-0008-4682-903X
Eldar Kurtic, ISTA, Vienna; Red Hat AI, Boston
Dan Alistarh, ISTA, Vienna; Red Hat AI, Boston

Abstract

Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. Moreover, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5× less training data during post-compression training. Code and all weights are released at: https://github.com/IST-DASLab/DarwinLM.
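
The search loop described in the abstract can be illustrated with a toy sketch. This is not the released DarwinLM code: the function names (`mutate`, `toy_loss`, `evolve`), the per-layer integer sparsity levels, the quadratic loss, the stage schedule, and the choice to keep the parent in the candidate pool are all assumptions made for illustration. In particular, `toy_loss` is a stand-in for "fine-tune the candidate on a given number of tokens, then evaluate it"; a real run would replace it with actual training and evaluation.

```python
import random


def mutate(levels, rng, max_level):
    """Move one unit of sparsity from one layer to another (total stays fixed)."""
    child = list(levels)
    pairs = [(s, d)
             for s in range(len(child)) if child[s] > 0
             for d in range(len(child)) if d != s and child[d] < max_level]
    if pairs:  # leave the child unchanged if no valid move exists
        src, dst = rng.choice(pairs)
        child[src] -= 1
        child[dst] += 1
    return child


def toy_loss(levels, sensitivity, tokens):
    """Hypothetical stand-in for: fine-tune on `tokens` tokens, then evaluate.

    Pruning damage grows quadratically with the sparsity level of each layer,
    and more training tokens recover part of it. Lower is better.
    """
    damage = sum(s * v * v for s, v in zip(sensitivity, levels))
    return damage / (1.0 + tokens / 1000.0)


def evolve(sensitivity, budget, max_level=4, generations=50, offspring=16,
           stages=((10, 8), (100, 4), (1000, 1)), seed=0):
    """Evolutionary search over per-layer sparsity levels under a fixed budget.

    `stages` is a sequence of (tokens, survivors): each selection stage uses
    more tokens and keeps fewer candidates than the previous one.
    """
    n = len(sensitivity)
    if not 0 <= budget <= n * max_level:
        raise ValueError("budget must fit within n_layers * max_level")
    rng = random.Random(seed)
    # Start from a (near-)uniform allocation that meets the budget exactly.
    parent = [budget // n + (1 if i < budget % n else 0) for i in range(n)]
    for _ in range(generations):
        # Assumption: the parent competes with its offspring (elitism).
        pool = [parent] + [mutate(parent, rng, max_level)
                           for _ in range(offspring)]
        for tokens, survivors in stages:
            pool.sort(key=lambda c: toy_loss(c, sensitivity, tokens))
            pool = pool[:survivors]
        parent = pool[0]
    return parent


if __name__ == "__main__":
    # Two robust layers and two sensitive ones; the search should shift
    # sparsity toward the robust layers.
    print(evolve([1.0, 1.0, 10.0, 10.0], budget=8))
```

Because the toy recovery factor is the same for every candidate, the ranking here does not change between stages. The staged schedule matters in practice when candidates recover differently under training, which is the case the abstract targets.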

Author Biographies

Shengkun Tang, Department of Machine Learning, MBZUAI, Abu Dhabi, UAE

Welcome to my website~ My name is Shengkun Tang. You can call me Bryson for short. Currently, I am a research intern on the Alibaba Qwen Team. I am also a PhD student in Machine Learning at MBZUAI, under the supervision of Prof. Zhiqiang Shen. During my gap year, I had a wonderful time as a research assistant at DASLab at ISTA, working with Prof. Dan Alistarh. I have also collaborated closely with Prof. Dongkuan Xu (NCSU) and Dr. Yaqing Wang (Google DeepMind), working on efficient multi-modal models. I completed my B.E. in Remote Sensing at Wuhan University, under the supervision of Prof. Jian Yao and Prof. Xin Su.

Oliver Sieberling, ETH Zurich

Expertise: Quantization / Model Compression, Deep Learning, Evolutionary Algorithms

Eldar Kurtic, ISTA Vienna; Red Hat AI Boston

Expertise: Pruning, Sparsity, Quantization

Dan Alistarh, ISTA, Vienna; Red Hat AI, Boston

I am a Professor at the Institute of Science and Technology Austria (ISTA), and ML Research Lead at Neural Magic, Inc.

My research focuses on efficient algorithms and systems for machine learning, and spans from algorithms and lower bounds, to practical implementations. Before ISTA, I was a researcher at ETH Zurich and Microsoft Research, Cambridge, UK. Prior to that, I was a Postdoctoral Associate at MIT CSAIL, working with Prof. Nir Shavit. I received my PhD from the EPFL, under the guidance of Prof. Rachid Guerraoui.

During Fall 2023, I was a Visiting Professor at MIT.

My research is supported by the Austrian FWF Center of Excellence BILAI, ERC Proof-of-Concept and Starting Grants, and generous grants from NVIDIA, Google, and Amazon.

Our lab’s code can be found at https://github.com/IST-DASLab

