AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Hunyuan Team, Tencent
jasonchou9877@gmail.com; {nickaliu,wigginzhou,faxonlian}@tencent.com
*Equal Contributions. Corresponding Authors.

The dataset and code will be uploaded within the week. Please stay tuned!

Introduction

AutoCodeGen. We propose an automated workflow based on LLM-sandbox interaction, in which LLMs generate test inputs and obtain the corresponding test outputs by executing them in the sandbox, yielding high-quality code generation datasets.

AutoCodeBench. We introduce AutoCodeBench, a large-scale code generation benchmark with 3,920 problems, evenly distributed across 20 programming languages. It features high difficulty, practicality, and diversity, and is designed to measure the absolute multilingual performance of models.

AutoCodeBench-Lite. Based on the evaluation results of over 30 open-source and closed-source models on AutoCodeBench, we select 1,586 problems that were successfully solved by at least two models. This subset, AutoCodeBench-Lite, is used to measure performance differences between models.

AutoCodeBench-Complete. We select 1,000 problems from AutoCodeBench-Lite and use 3-shot prompting to construct AutoCodeBench-Complete, a completion-style code generation benchmark designed to assess the performance of base models.
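
As a concrete illustration of the 3-shot completion setup, the sketch below assembles a completion-style prompt from three solved exemplars. It is a minimal sketch: the field names and exemplar problems are hypothetical placeholders, not the actual AutoCodeBench-Complete few-shot set or prompt template.

```python
# Minimal sketch of a 3-shot completion-style prompt (hypothetical exemplars,
# not the actual AutoCodeBench-Complete few-shot set or prompt template).
FEW_SHOT_EXAMPLES = [
    {"prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
     "solution": "    return a + b\n"},
    {"prompt": 'def is_even(n):\n    """Return True if n is even."""\n',
     "solution": "    return n % 2 == 0\n"},
    {"prompt": 'def last_char(s):\n    """Return the last character of s."""\n',
     "solution": "    return s[-1]\n"},
]

def build_completion_prompt(problem_prompt: str) -> str:
    """Prepend three solved exemplars so a base model can continue the pattern
    by completing the final, unsolved function stub."""
    shots = "".join(ex["prompt"] + ex["solution"] + "\n" for ex in FEW_SHOT_EXAMPLES)
    return shots + problem_prompt
```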

AutoCodeGen


The core innovation of AutoCodeGen lies in having LLMs generate test inputs, execute them in a sandbox to obtain the corresponding test outputs, and then construct the programming problem in reverse from the verified solution and its test cases. Compared with existing methods such as KodCode and CodeI/O, this approach is more efficient and scalable and achieves better test-case coverage.
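
To make the workflow concrete, here is a minimal sketch of one AutoCodeGen step. The `llm` and `sandbox` callables are hypothetical interfaces standing in for an actual model API and the multilingual sandbox, and the prompts are illustrative, not the ones used in the paper.

```python
from typing import Callable, Optional, Tuple

def autocodegen_step(
    solution_code: str,
    language: str,
    llm: Callable[[str], str],                        # hypothetical: prompt -> model output
    sandbox: Callable[[str, str], Tuple[str, bool]],  # hypothetical: (code, lang) -> (stdout, ok)
) -> Optional[dict]:
    """Derive verified test cases from a candidate solution, then write the problem in reverse."""
    # 1. The LLM proposes diverse test inputs that exercise the solution.
    test_inputs = llm(
        f"Write a {language} snippet that calls the code below on diverse, "
        f"edge-case inputs and prints each input:\n{solution_code}"
    )

    # 2. The sandbox executes solution + inputs to obtain ground-truth outputs,
    #    rather than trusting the LLM to predict them.
    test_outputs, ok = sandbox(solution_code + "\n" + test_inputs, language)
    if not ok:
        return None  # discard samples whose tests do not execute cleanly

    # 3. The LLM writes the problem statement "in reverse" from the verified
    #    solution and its input/output pairs.
    problem = llm(
        "Write a self-contained programming problem whose reference solution is the "
        f"code below and whose tests are these input/output pairs:\n{solution_code}\n"
        f"Inputs:\n{test_inputs}\nOutputs:\n{test_outputs}"
    )
    return {"problem": problem, "solution": solution_code,
            "test_inputs": test_inputs, "test_outputs": test_outputs}
```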

AutoCodeBench


Experimental Results


Further Analysis


Performance differences between models are small for popular languages but large for low-resource languages.


The performance of LLMs declines when they face multi-logic programming problems.


LLMs exhibit both parameter scaling and test-time sampling scaling laws on AutoCodeBench.
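
The test-time sampling axis is commonly measured with the unbiased pass@k estimator from the HumanEval paper; the snippet below shows that estimator as an illustration of why performance rises with more samples (an assumption about the scoring protocol, not a quote of the AutoCodeBench evaluation code).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: with c correct completions out of n samples, the
    probability that at least one of k randomly drawn samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 correct out of 16 samples; pass@k grows with k, which is the
# test-time sampling scaling trend referred to above.
for k in (1, 4, 16):
    print(f"pass@{k} = {pass_at_k(16, 3, k):.3f}")
```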


The feedback provided by our multilingual sandbox can guide the model to refine its code.
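
A minimal sketch of such a feedback loop is shown below. The `llm` and `sandbox` callables are hypothetical interfaces (a model API and the multilingual sandbox), and the prompts and three-round budget are illustrative assumptions, not the paper's actual refinement setup.

```python
from typing import Callable, Tuple

def refine_with_sandbox(
    problem: str,
    language: str,
    llm: Callable[[str], str],                        # hypothetical: prompt -> code
    sandbox: Callable[[str, str], Tuple[str, bool]],  # hypothetical: (code, lang) -> (log, passed)
    max_rounds: int = 3,
) -> str:
    """Iteratively feed sandbox execution feedback back to the model until the tests pass."""
    code = llm(f"Solve the following {language} programming problem:\n{problem}")
    for _ in range(max_rounds):
        log, passed = sandbox(code, language)
        if passed:
            break
        # The sandbox's error trace / failed-test log becomes part of the next prompt.
        code = llm(
            f"Your previous {language} solution failed with this sandbox feedback:\n{log}\n"
            f"Please fix the code for the problem:\n{problem}"
        )
    return code
```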