
    Evaluating the Accuracy of AI-Generated Code with EvalPlus

    A Rigorous Evaluation Framework for Code Synthesis with Large Language Models

    • Existing code evaluation datasets may not fully assess the functional correctness of code generated by Large Language Models (LLMs).
    • EvalPlus is a benchmarking framework that uses both LLM-based and mutation-based input generators to rigorously evaluate the functional correctness of LLM-synthesized code.
• Extensive evaluation with HUMANEVAL+ uncovers significant amounts of previously undetected incorrect code, reducing pass@k by 15.1% on average.

    The use of Large Language Models (LLMs) for code generation has gained significant attention, but questions remain about the accuracy and correctness of the code these models produce. Current code evaluation datasets may be limited in both quantity and quality, making it difficult to fully assess the functional correctness of LLM-generated code. To address this, researchers have proposed EvalPlus, a code synthesis benchmarking framework designed to rigorously evaluate LLM-synthesized code.

    EvalPlus works by taking a base evaluation dataset and using automatic input generation to create and diversify a large number of new test inputs. This is achieved by combining LLM-based and mutation-based input generators. The framework is used to create HUMANEVAL+, an extended version of the popular HUMANEVAL benchmark, featuring 81 times more generated tests.
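To make the idea concrete, here is a minimal, hypothetical sketch of mutation-based test expansion in the spirit of EvalPlus: seed inputs (which the framework obtains from an LLM) are diversified by small type-aware mutations, kept only if the benchmark's ground-truth implementation accepts them, and then used to differentially test an LLM-generated solution. The function names (`mutate_input`, `expand_tests`, `check_sample`) and the mutation rules are illustrative assumptions, not EvalPlus APIs.

```python
import copy
import random

def mutate_input(value):
    """Apply one small, type-aware mutation to a test input (illustrative only)."""
    if isinstance(value, bool):
        return not value
    if isinstance(value, int):
        return value + random.choice([-1, 1])
    if isinstance(value, float):
        return value * random.uniform(0.5, 1.5)
    if isinstance(value, str):
        if value and random.random() < 0.5:
            i = random.randrange(len(value))
            return value[:i] + value[i + 1:]            # drop one character
        return value + random.choice("abc")             # append a character
    if isinstance(value, list):
        mutated = copy.deepcopy(value)
        if mutated:
            i = random.randrange(len(mutated))
            mutated[i] = mutate_input(mutated[i])       # mutate one element
        else:
            mutated.append(0)                           # grow an empty list
        return mutated
    return value                                        # leave unknown types untouched

def expand_tests(seed_inputs, ground_truth, n_new=1000):
    """Grow a small seed set of argument lists into a much larger one.

    A candidate is kept only if the ground-truth implementation handles it
    without raising, mirroring the idea of validating generated inputs
    against the reference solution.
    """
    pool = [copy.deepcopy(args) for args in seed_inputs]
    while len(pool) < len(seed_inputs) + n_new:
        base = random.choice(pool)
        candidate = [mutate_input(arg) for arg in base]
        try:
            ground_truth(*candidate)                    # reference must accept the input
        except Exception:
            continue                                    # discard invalid inputs
        pool.append(candidate)
    return pool

def check_sample(candidate_fn, ground_truth, inputs):
    """Differentially test an LLM-generated function against the reference."""
    for args in inputs:
        try:
            if candidate_fn(*args) != ground_truth(*args):
                return False                            # wrong output on some input
        except Exception:
            return False                                # a crash counts as a failure
    return True
```

A sample that passes the handful of original tests but fails `check_sample` on the expanded input pool is exactly the kind of previously undetected incorrect code that HUMANEVAL+ is designed to surface.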

    When evaluating 14 popular LLMs with HUMANEVAL+, the framework was able to identify significant amounts of previously undetected incorrect code, reducing the pass@k rate by an average of 15.1%. EvalPlus even discovered several incorrect ground-truth implementations in the original HUMANEVAL dataset. This demonstrates that current code synthesis evaluation results may not accurately reflect the true performance of LLMs and highlights the potential for automated test input generation to improve programming benchmarks.
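The pass@k figure referenced here is the standard unbiased estimator introduced with the original HUMANEVAL benchmark (Chen et al., 2021): for each problem, generate n samples, count the c that pass all tests, and estimate the chance that at least one of k drawn samples is correct. The sketch below shows how stricter tests (fewer passing samples) lower the score; the numbers are illustrative, not taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem:
    n = samples generated, c = samples that pass all tests, k = budget."""
    if n - c < k:
        return 1.0                       # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative example: 200 samples for one problem; 37 pass the original
# HUMANEVAL tests, but only 30 survive the extra HUMANEVAL+ tests.
print(pass_at_k(200, 37, 1))   # ~0.185 with the original tests
print(pass_at_k(200, 30, 1))   # ~0.150 once the extra tests catch wrong samples
```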

    Future work for EvalPlus includes extending the framework to other code benchmarks, exploring better test generation techniques, and leveraging test suite reduction techniques to maintain test effectiveness while reducing redundancy. The researchers also suggest integrating EvalPlus with formal verification tools to provide stronger evaluation guarantees when applicable. Additionally, the core test generation technique could be used to alert developers to potential flaws in AI-generated code snippets when engaged in AI pair programming, such as with Copilot.

    Paper

GitHub
