AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Hunyuan Team, Tencent
jasonchou9877@gmail.com; {nickaliu,wigginzhou,faxonlian}@tencent.com
*Equal Contributions. Corresponding Authors.

The dataset and code will be uploaded within the week. Please stay tuned!

Introduction

AutoCodeGen. We propose an automated workflow based on LLM-sandbox interaction, in which LLMs generate test inputs and obtain the corresponding test outputs by executing them in the sandbox, yielding high-quality code generation datasets.

AutoCodeBench. We introduce AutoCodeBench, a large-scale code generation benchmark with 3,920 problems, evenly distributed across 20 programming languages. It features high difficulty, practicality, and diversity, and is designed to measure the absolute multilingual performance of models.

AutoCodeBench-Lite. Based on the evaluation results of over 30 open-source and closed-source models on AutoCodeBench, we select 1,586 problems that were successfully solved by at least two models. This subset, AutoCodeBench-Lite, is used to measure performance differences between models.

AutoCodeBench-Complete. We select 1,000 problems from AutoCodeBench-Lite and use 3-shot prompting to construct AutoCodeBench-Complete, a completion-style code generation benchmark designed to assess the performance of base models.
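
As a concrete illustration of the 3-shot completion setup, the sketch below assembles a completion-style prompt from three solved exemplars. It is a minimal sketch: the field names and exemplar problems are hypothetical placeholders, not the actual AutoCodeBench-Complete few-shot set or prompt template.

```python
# Minimal sketch of a 3-shot completion-style prompt (hypothetical exemplars,
# not the actual AutoCodeBench-Complete few-shot set or prompt template).
FEW_SHOT_EXAMPLES = [
    {"prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
     "solution": "    return a + b\n"},
    {"prompt": 'def is_even(n):\n    """Return True if n is even."""\n',
     "solution": "    return n % 2 == 0\n"},
    {"prompt": 'def last_char(s):\n    """Return the last character of s."""\n',
     "solution": "    return s[-1]\n"},
]

def build_completion_prompt(problem_prompt: str) -> str:
    """Prepend three solved exemplars so a base model can continue the pattern
    by completing the final, unsolved function stub."""
    shots = "".join(ex["prompt"] + ex["solution"] + "\n" for ex in FEW_SHOT_EXAMPLES)
    return shots + problem_prompt
```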

AutoCodeGen


The core innovation of AutoCodeGen lies in having LLMs generate test inputs, execute them in a sandbox to obtain the corresponding test outputs, and then construct the programming problem in reverse from the verified solution and its test cases. Compared with existing methods such as KodCode and CodeI/O, this approach is more efficient and scalable and achieves better test-case coverage.
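
To make the workflow concrete, here is a minimal sketch of one AutoCodeGen step. The `llm` and `sandbox` callables are hypothetical interfaces standing in for an actual model API and the multilingual sandbox, and the prompts are illustrative, not the ones used in the paper.

```python
from typing import Callable, Optional, Tuple

def autocodegen_step(
    solution_code: str,
    language: str,
    llm: Callable[[str], str],                        # hypothetical: prompt -> model output
    sandbox: Callable[[str, str], Tuple[str, bool]],  # hypothetical: (code, lang) -> (stdout, ok)
) -> Optional[dict]:
    """Derive verified test cases from a candidate solution, then write the problem in reverse."""
    # 1. The LLM proposes diverse test inputs that exercise the solution.
    test_inputs = llm(
        f"Write a {language} snippet that calls the code below on diverse, "
        f"edge-case inputs and prints each input:\n{solution_code}"
    )

    # 2. The sandbox executes solution + inputs to obtain ground-truth outputs,
    #    rather than trusting the LLM to predict them.
    test_outputs, ok = sandbox(solution_code + "\n" + test_inputs, language)
    if not ok:
        return None  # discard samples whose tests do not execute cleanly

    # 3. The LLM writes the problem statement "in reverse" from the verified
    #    solution and its input/output pairs.
    problem = llm(
        "Write a self-contained programming problem whose reference solution is the "
        f"code below and whose tests are these input/output pairs:\n{solution_code}\n"
        f"Inputs:\n{test_inputs}\nOutputs:\n{test_outputs}"
    )
    return {"problem": problem, "solution": solution_code,
            "test_inputs": test_inputs, "test_outputs": test_outputs}
```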

AutoCodeBench


Experimental Results


Further Analysis


Performance differences between models are small for popular languages but large for low-resource languages.


The performance of LLMs declines when they face multi-logic programming problems.


LLMs exhibit both parameter scaling and test-time sampling scaling laws on AutoCodeBench.
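
The test-time sampling axis is commonly measured with the unbiased pass@k estimator from the HumanEval paper; the snippet below shows that estimator as an illustration of why performance rises with more samples (an assumption about the scoring protocol, not a quote of the AutoCodeBench evaluation code).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: with c correct completions out of n samples, the
    probability that at least one of k randomly drawn samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 correct out of 16 samples; pass@k grows with k, which is the
# test-time sampling scaling trend referred to above.
for k in (1, 4, 16):
    print(f"pass@{k} = {pass_at_k(16, 3, k):.3f}")
```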


The feedback provided by our multilingual sandbox can guide the model to refine its code.
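
A minimal sketch of such a feedback loop is shown below. The `llm` and `sandbox` callables are hypothetical interfaces (a model API and the multilingual sandbox), and the prompts and three-round budget are illustrative assumptions, not the paper's actual refinement setup.

```python
from typing import Callable, Tuple

def refine_with_sandbox(
    problem: str,
    language: str,
    llm: Callable[[str], str],                        # hypothetical: prompt -> code
    sandbox: Callable[[str, str], Tuple[str, bool]],  # hypothetical: (code, lang) -> (log, passed)
    max_rounds: int = 3,
) -> str:
    """Iteratively feed sandbox execution feedback back to the model until the tests pass."""
    code = llm(f"Solve the following {language} programming problem:\n{problem}")
    for _ in range(max_rounds):
        log, passed = sandbox(code, language)
        if passed:
            break
        # The sandbox's error trace / failed-test log becomes part of the next prompt.
        code = llm(
            f"Your previous {language} solution failed with this sandbox feedback:\n{log}\n"
            f"Please fix the code for the problem:\n{problem}"
        )
    return code
```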