The dataset and code will be released later this week. Please stay tuned!
AutoCodeGen. We propose an automated workflow based on LLM-Sandbox interaction, where LLMs generate test inputs and obtain test outputs through the sandbox to create high-quality code generation datasets.
AutoCodeBench. We introduce AutoCodeBench, a large-scale code generation benchmark with 3,920 problems, evenly distributed across 20 programming languages. It features high difficulty, practicality, and diversity, and is designed to measure the absolute multilingual performance of models.
AutoCodeBench-Lite. Based on the evaluation results of over 30 open-source and closed-source models on AutoCodeBench, we select the 1,586 problems solved by at least two models. This subset, AutoCodeBench-Lite, is used to measure performance differences between models.
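For illustration, a minimal sketch of this filtering step, assuming the evaluation results are available as a mapping from model name to the set of problem IDs it solved (the data layout and function name are hypothetical, not the released tooling):

```python
from collections import Counter
from typing import Dict, Set


def select_lite_subset(results: Dict[str, Set[str]], min_solvers: int = 2) -> Set[str]:
    """Keep problems that at least `min_solvers` of the evaluated models solved."""
    solve_counts = Counter(pid for solved in results.values() for pid in solved)
    return {pid for pid, count in solve_counts.items() if count >= min_solvers}
```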
AutoCodeBench-Complete. We select 1,000 problems from AutoCodeBench-Lite and use 3-shot prompting to construct AutoCodeBench-Complete, a completion-style code generation benchmark designed to assess the performance of base models.
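As a sketch of how such a 3-shot completion prompt could be assembled (the section headers and field names below are assumptions for illustration, not the released prompt format):

```python
from typing import Dict, List


def build_completion_prompt(shots: List[Dict[str, str]], problem: str) -> str:
    """Concatenate three solved examples followed by the new problem, so a base
    model can continue the pattern rather than follow chat instructions."""
    assert len(shots) == 3, "AutoCodeBench-Complete uses 3-shot prompting"
    parts = []
    for shot in shots:
        parts.append(f"### Problem\n{shot['problem']}\n\n### Solution\n{shot['solution']}\n")
    parts.append(f"### Problem\n{problem}\n\n### Solution\n")
    return "\n".join(parts)
```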
The core innovation of AutoCodeGen lies in having LLMs generate test inputs, execute them in a sandbox to obtain test outputs, and then generate the programming problems in reverse. This approach is more efficient and scalable than existing methods such as KodCode and CodeI/O, and it ensures better test-case coverage.
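For illustration, a minimal sketch of this LLM-sandbox loop, where `call_llm`, `Sandbox`, and `autocodegen_step` are hypothetical placeholders rather than the released API:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Problem:
    statement: str
    solution: str
    test_cases: List[Tuple[str, str]] = field(default_factory=list)  # (input, expected output)


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call that returns generated text."""
    raise NotImplementedError


class Sandbox:
    """Placeholder for a multilingual sandbox that executes code on a given input."""

    def run(self, code: str, test_input: str) -> str:
        raise NotImplementedError


def autocodegen_step(seed_solution: str, sandbox: Sandbox, n_inputs: int = 5) -> Problem:
    # 1. The LLM proposes test inputs for the seed solution.
    inputs = [
        call_llm(f"Propose one test input for this code:\n{seed_solution}")
        for _ in range(n_inputs)
    ]
    # 2. The sandbox executes the solution on each input to obtain ground-truth
    #    outputs, so expected outputs never rely on the LLM predicting them.
    cases = [(x, sandbox.run(seed_solution, x)) for x in inputs]
    # 3. The LLM writes the problem statement "in reverse" from the code and tests.
    statement = call_llm(
        "Write a programming problem whose reference solution is:\n"
        f"{seed_solution}\nand whose test cases are:\n{cases}"
    )
    return Problem(statement=statement, solution=seed_solution, test_cases=cases)
```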
Performance differences between models are small for popular languages but large for low-resource languages.
LLM performance declines on multi-logic programming problems.
LLMs exhibit both parameter and test-time sampling scaling laws on AutoCodeBench.
The feedback provided by our multilingual sandbox can guide the model to refine its code.
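To make this concrete, a minimal sketch of such a feedback-driven refinement loop, assuming the sandbox returns diagnostics (compiler errors, failing test cases) on failure; the `generate` callable and the `run_tests` method are hypothetical placeholders, not the released sandbox API:

```python
from typing import Callable, Tuple


class FeedbackSandbox:
    """Placeholder sandbox that runs a candidate solution against hidden tests
    and returns (passed, diagnostics such as compiler errors or failing cases)."""

    def run_tests(self, code: str, language: str) -> Tuple[bool, str]:
        raise NotImplementedError


def refine_with_feedback(
    problem: str,
    generate: Callable[[str], str],  # LLM call: prompt -> code
    sandbox: FeedbackSandbox,
    language: str = "python",
    max_rounds: int = 3,
) -> str:
    code = generate(f"Solve this problem in {language}:\n{problem}")
    for _ in range(max_rounds):
        passed, feedback = sandbox.run_tests(code, language)
        if passed:
            break
        # Feed the sandbox diagnostics back so the model can repair its own code.
        code = generate(
            f"Your previous solution failed with:\n{feedback}\n"
            f"Problem:\n{problem}\nRevise the code:\n{code}"
        )
    return code
```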