EFFIBENCH: Benchmarking the Efficiency of Automatically Generated Code

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie Zhang*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

Code generation models have increasingly become integral to aiding software
development. Although current research has thoroughly examined the correctness
of the code produced by code generation models, an aspect that is pivotal to
green computing and sustainability efforts — the efficiency of the generated
code — has often been neglected. This paper presents EFFIBENCH, a benchmark
with 1,000 efficiency-critical coding problems to assess the efficiency of code
generated by code generation models. EFFIBENCH contains a diverse set of
LeetCode coding problems. Each problem is paired with an executable human-written canonical solution that achieves SOTA efficiency on the LeetCode
solution leaderboard. With EFFIBENCH, we empirically examine the ability of 42
large language models (35 open-source and 7 closed-source) to generate efficient
code. Our evaluation results demonstrate that the efficiency of the code generated by
LLMs is generally worse than the efficiency of human-written canonical solutions.
For example, GPT-4-generated code has an average execution time 3.12 times that
of the human-written canonical solutions. In the most extreme cases, the execution
time and total memory usage of GPT-4-generated code are 13.89 and 43.92 times
that of the canonical solutions. The source code of EffiBench is released at
https://github.com/huangd1999/EffiBench. We also provide a leaderboard at
https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
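
As an illustration of the kind of comparison described above, the sketch below times a hypothetical model-generated solution against a canonical one on the same input and reports execution-time and peak-memory ratios. The functions, the test case, and the profiling approach are placeholders for illustration only; they are not the EffiBench evaluation harness.

```python
# Illustrative sketch (not the EffiBench harness): compare execution time and
# peak memory of a model-generated solution against a canonical solution on
# the same input. All names and the test case below are hypothetical.
import time
import tracemalloc

def canonical_two_sum(nums, target):
    # Canonical O(n) hash-map solution.
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []

def generated_two_sum(nums, target):
    # Hypothetical model output: correct but O(n^2).
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]
    return []

def profile(fn, *args):
    # Measure wall-clock time and peak traced memory for one call.
    tracemalloc.start()
    start = time.perf_counter()
    fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

nums, target = list(range(10_000)), 19_997
t_can, m_can = profile(canonical_two_sum, nums, target)
t_gen, m_gen = profile(generated_two_sum, nums, target)
print(f"execution-time ratio (generated / canonical): {t_gen / t_can:.2f}")
print(f"peak-memory ratio    (generated / canonical): {m_gen / m_can:.2f}")
```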
Original language: English
Title of host publication: NeurIPS 2024
Publication status: Accepted/In press - 26 Sept 2024
