Evaluating the Accuracy of AI-Generated Code with EvalPlus

May 4, 2023

A Rigorous Evaluation Framework for Code Synthesis with Large Language Models

Existing code evaluation datasets may not fully assess the functional correctness of code generated by Large Language Models (LLMs).
EvalPlus is a benchmarking framework that uses both LLM-based and mutation-based input generators to rigorously evaluate the functional correctness of LLM-synthesized code.
Evaluating LLMs on HUMANEVAL+ exposes a substantial amount of previously undetected incorrect code, reducing pass@k by 15.1% on average.

The use of Large Language Models (LLMs) for code generation has gained significant attention, but questions remain about the accuracy and correctness of the code these models produce. Current code evaluation datasets may be limited in both quantity and quality, making it difficult to fully assess the functional correctness of LLM-generated code. To address this, researchers have proposed EvalPlus, a code synthesis benchmarking framework designed to rigorously evaluate LLM-synthesized code.

EvalPlus works by taking a base evaluation dataset and using automatic input generation to create and diversify a large number of new test inputs. This is achieved by combining LLM-based and mutation-based input generators. The framework is used to create HUMANEVAL+, an extended version of the popular HUMANEVAL benchmark, featuring 81 times more generated tests.
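The mutation-based half of that idea can be sketched in a few lines. This is a minimal illustration, not EvalPlus's actual implementation: the names `mutate`, `generate_inputs`, and `find_failures` are hypothetical, the mutation rules are simplified guesses at what "type-preserving" changes might look like, and the LLM-based seed generation step is replaced here by a hand-written seed input.

```python
import copy
import random


def mutate(value, rng):
    """Return a slightly altered copy of value that keeps its type (hypothetical rules)."""
    if isinstance(value, bool):          # check bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, float):
        return value + rng.uniform(-1.0, 1.0)
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]            # drop one character
        i = rng.randrange(len(value) + 1)
        return value[:i] + rng.choice("abc ") + value[i:]  # insert one character
    if isinstance(value, list):
        if not value:
            return [0]                   # assumption: empty lists hold ints
        out = list(value)
        i = rng.randrange(len(out))
        roll = rng.random()
        if roll < 1 / 3:
            del out[i]                   # remove an element
        elif roll < 2 / 3:
            out.insert(i, out[i])        # duplicate an element
        else:
            out[i] = mutate(out[i], rng)  # mutate an element recursively
        return out
    return value                         # unknown types are left unchanged


def generate_inputs(seeds, n, rng):
    """Grow a pool of up to n distinct argument tuples by mutating pool members.

    Each seed is a non-empty tuple of positional arguments.
    """
    pool = list(seeds)
    attempts = 0
    while len(pool) < n and attempts < 100 * n:
        attempts += 1
        parent = rng.choice(pool)
        i = rng.randrange(len(parent))
        child = parent[:i] + (mutate(parent[i], rng),) + parent[i + 1:]
        if child not in pool:
            pool.append(child)           # new inputs become parents for later mutations
    return pool


def find_failures(reference, candidate, inputs):
    """Return the inputs on which candidate raises or disagrees with reference."""
    failures = []
    for args in inputs:
        expected = reference(*copy.deepcopy(args))
        try:
            actual = candidate(*copy.deepcopy(args))
        except Exception:
            failures.append(args)
            continue
        if actual != expected:
            failures.append(args)
    return failures


def ref(xs):
    """Reference solution for a toy task: sorted list of distinct values."""
    return sorted(set(xs))


def cand(xs):
    """Plausible but wrong candidate: forgets to remove duplicates."""
    return sorted(xs)
```

With a single seed such as `([3, 1, 2],)`, `cand` agrees with `ref`, so a test suite containing only that input would accept it. Calling `find_failures(ref, cand, generate_inputs([([3, 1, 2],)], 200, random.Random(0)))` gives the mutated inputs, which can include lists with repeated elements, a chance to expose the disagreement.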

When evaluating 14 popular LLMs with HUMANEVAL+, the framework was able to identify significant amounts of previously undetected incorrect code, reducing the pass@k rate by an average of 15.1%. EvalPlus even discovered several incorrect ground-truth implementations in the original HUMANEVAL dataset. This demonstrates that current code synthesis evaluation results may not accurately reflect the true performance of LLMs and highlights the potential for automated test input generation to improve programming benchmarks.
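The article does not define pass@k. A widely used definition is the unbiased estimator introduced alongside the original HUMANEVAL benchmark: draw n samples per problem, count the c samples that pass every test, and estimate the probability that at least one of k randomly chosen samples is correct as 1 - C(n-c, k) / C(n, k). The sketch below assumes that definition; the function name `pass_at_k` and the sample counts in the usage note are illustrative, not figures from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which pass all tests."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the ways to pick k samples that are all incorrect;
    # it is 0 when fewer than k incorrect samples exist, giving pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Stricter tests can only lower c. For example, if 6 of 10 samples pass the base tests but only 4 still pass after extra inputs are added, `pass_at_k(10, 6, 1)` and `pass_at_k(10, 4, 1)` give roughly 0.6 and 0.4 for that problem. A benchmark score averages this value over all problems.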

Future work for EvalPlus includes extending the framework to other code benchmarks, exploring better test generation techniques, and leveraging test suite reduction techniques to maintain test effectiveness while reducing redundancy. The researchers also suggest integrating EvalPlus with formal verification tools to provide stronger evaluation guarantees when applicable. Additionally, the core test generation technique could be used to alert developers to potential flaws in AI-generated code snippets when engaged in AI pair programming, such as with Copilot.


Karel https://neuronad.com

Karel is the founder of Neuronad and a technology enthusiast with deep roots in web development and digital innovation. He launched Neuronad to create a dedicated space for AI news that cuts through the hype and focuses on what truly matters — the tools, research, and trends shaping our future. Karel oversees the editorial direction and technical infrastructure behind the site.


Copyright © 2024 Neuronad.com. All rights reserved.
