EFFIBENCH: Benchmarking the Efficiency of Automatically Generated Code

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie Zhang*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

Code generation models have increasingly become integral to aiding software
development. Although current research has thoroughly examined the correctness
of the code produced by code generation models, an aspect that is pivotal to
green computing and sustainability efforts — the efficiency of the generated
code — has often been neglected. This paper presents EFFIBENCH, a benchmark
with 1,000 efficiency-critical coding problems to assess the efficiency of code
generated by code generation models. EFFIBENCH contains a diverse set of
LeetCode coding problems. Each problem is paired with an executable human-written canonical solution that achieves SOTA efficiency on the LeetCode
solution leaderboard. With EFFIBENCH, we empirically examine the ability of 42
large language models (35 open-source and 7 closed-source) to generate efficient
code. Our evaluation results demonstrate that the efficiency of the code generated by
LLMs is generally worse than the efficiency of human-written canonical solutions.
For example, GPT-4-generated code has an average execution time 3.12 times that
of the human-written canonical solutions. In the most extreme cases, the execution
time and total memory usage of GPT-4-generated code are 13.89 and 43.92 times
that of the canonical solutions. The source code of EffiBench is released at
https://github.com/huangd1999/EffiBench. We also provide a leaderboard at
https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
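
As an illustration of the kind of comparison described above, the sketch below times a hypothetical model-generated solution against a canonical one on the same input and reports execution-time and peak-memory ratios. The functions, the test case, and the profiling approach are placeholders for illustration only; they are not the EffiBench evaluation harness.

```python
# Illustrative sketch (not the EffiBench harness): compare execution time and
# peak memory of a model-generated solution against a canonical solution on
# the same input. All names and the test case below are hypothetical.
import time
import tracemalloc

def canonical_two_sum(nums, target):
    # Canonical O(n) hash-map solution.
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []

def generated_two_sum(nums, target):
    # Hypothetical model output: correct but O(n^2).
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]
    return []

def profile(fn, *args):
    # Measure wall-clock time and peak traced memory for one call.
    tracemalloc.start()
    start = time.perf_counter()
    fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

nums, target = list(range(10_000)), 19_997
t_can, m_can = profile(canonical_two_sum, nums, target)
t_gen, m_gen = profile(generated_two_sum, nums, target)
print(f"execution-time ratio (generated / canonical): {t_gen / t_can:.2f}")
print(f"peak-memory ratio    (generated / canonical): {m_gen / m_can:.2f}")
```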
Original language: English
Title of host publication: NeurIPS 2024
Publication status: Accepted/In press - 26 Sept 2024
