This is an evaluation harness for the HumanEval problem-solving dataset described in the paper "Evaluating Large Language Models Trained on Code". It is used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
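As a rough illustration of how such functional-correctness scoring works, here is a minimal Python sketch: it executes a candidate completion against a problem's unit tests and applies the unbiased pass@k estimator from the paper. The problem, tests, and function names below are made up for illustration and are not records from the dataset.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n samples were drawn and c of them passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def functionally_correct(candidate_src: str, test_src: str) -> bool:
    """Run a candidate completion against the problem's unit tests.
    (A real harness sandboxes this; exec is shown only for illustration.)"""
    env = {}
    try:
        exec(candidate_src, env)   # define the candidate function
        exec(test_src, env)        # run the asserts for this problem
        return True
    except Exception:
        return False

# Illustrative HumanEval-style problem (not an actual dataset record).
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(functionally_correct(candidate, tests))   # True
print(pass_at_k(n=20, c=7, k=5))                # estimated pass@5
```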
460 PAPERS • 1 BENCHMARK
The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases.
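For intuition, a hypothetical record in this style might look as follows; the field names and the problem itself are assumptions for illustration, not the dataset's actual schema.

```python
# Illustrative record: a task description, a reference solution, and
# three automated test cases (field names are assumptions).
example = {
    "task": "Write a function to return the n-th Fibonacci number.",
    "solution": (
        "def fib(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a\n"
    ),
    "tests": [
        "assert fib(0) == 0",
        "assert fib(1) == 1",
        "assert fib(10) == 55",
    ],
}

# A generated solution is judged correct if every test passes after executing it.
env = {}
exec(example["solution"], env)
for t in example["tests"]:
    exec(t, env)
```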
245 PAPERS • 1 BENCHMARK
WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further split into training (61,297 examples), development (9,145 examples) and test sets (17,284 examples). It can be used for developing natural language interfaces for relational databases.
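A simplified, hypothetical question/SQL pair in this style is sketched below; the field names, table, and query are illustrative assumptions rather than the dataset's exact representation.

```python
# Illustrative WikiSQL-style pair (simplified): a natural language question
# over a single table and the SQL query that answers it.
pair = {
    "question": "How many players are taller than 200 cm?",
    "table": "players",
    "sql": "SELECT COUNT(name) FROM players WHERE height_cm > 200",
}
```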
227 PAPERS • 3 BENCHMARKS
CodeXGLUE is a benchmark dataset and open challenge for code intelligence. It includes a collection of code intelligence tasks and a platform for model evaluation and comparison. CodeXGLUE stands for General Language Understanding Evaluation benchmark for CODE. It includes 14 datasets for 10 diversified code intelligence tasks covering the following scenarios:
161 PAPERS • 15 BENCHMARKS
The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS benchmark attempts to mirror how human programmers are evaluated by posing coding problems in unrestricted natural language and evaluating the correctness of solutions. The problems range in difficulty from introductory to collegiate competition level and measure coding ability as well as problem-solving.
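A minimal sketch of stdin/stdout-style judging, as commonly used for such competition problems, is shown below; the file name, test cases, and time limit are assumptions, and a real harness would add sandboxing and memory limits.

```python
import subprocess
import sys

def judge(solution_path: str, io_cases: list[tuple[str, str]]) -> bool:
    """Run the candidate program on each input and compare its output to
    the expected answer (whitespace-normalised)."""
    for stdin_text, expected in io_cases:
        result = subprocess.run(
            [sys.executable, solution_path],
            input=stdin_text, capture_output=True, text=True, timeout=10,
        )
        if result.stdout.strip() != expected.strip():
            return False
    return True

# Hypothetical test cases for a problem asking for the sum of two integers.
cases = [("1 2\n", "3\n"), ("10 -4\n", "6\n")]
# judge("candidate_solution.py", cases)
```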
100 PAPERS • 1 BENCHMARK
CMU CoNaLa, the Code/Natural Language Challenge dataset, is a joint project from the Carnegie Mellon University NeuLab and Strudel labs. Its purpose is to test the generation of code snippets from natural language. The data comes from StackOverflow questions. There are 2,379 training and 500 test examples that were manually annotated. Every example has a natural language intent and its corresponding Python snippet. In addition to the manually annotated dataset, there are also 598,237 mined intent-snippet pairs. These examples are similar to the hand-annotated ones except that they include a probability indicating whether the pair is valid.
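Illustrative records in this style are sketched below; the field names and examples are assumptions for illustration rather than the exact release schema.

```python
# Illustrative CoNaLa-style records (field names are assumptions).
annotated = {
    "intent": "sort a list of dicts by the 'age' key",
    "snippet": "sorted(people, key=lambda d: d['age'])",
}
mined = {
    "intent": "reverse a string",
    "snippet": "s[::-1]",
    "prob": 0.87,   # estimated probability that the intent/snippet pair is valid
}
```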
64 PAPERS • 1 BENCHMARK
A large dataset with over 100,000 examples consisting of Java classes from online code repositories, accompanied by a new encoder-decoder architecture that models the interaction between the method documentation and the class environment.
35 PAPERS • 1 BENCHMARK
CodeContests is a competitive programming dataset for machine learning. This dataset was used when training AlphaCode.
33 PAPERS • 1 BENCHMARK
DS-1000 is a code generation benchmark with a thousand data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, (3) defends against memorization by perturbing questions.
30 PAPERS • NO BENCHMARKS YET
The Django dataset is a dataset for code generation comprising 16,000 training, 1,000 development, and 1,805 test annotations. Each data point consists of a line of Python code together with a manually created natural language description.
22 PAPERS • 1 BENCHMARK
This dataset contains card descriptions of the card game Hearthstone and the code that implements them. These are obtained from the open-source implementation Hearthbreaker (https://github.com/danielyule/hearthbreaker).
21 PAPERS • NO BENCHMARKS YET
HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.
XLCoST is a benchmark dataset for cross-lingual code intelligence. The dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English), and supports 10 cross-language code tasks.
14 PAPERS • NO BENCHMARKS YET
JuICe is a corpus of 1.5 million examples with a curated test set of 3.7K instances based on online programming assignments. Compared with existing contextual code generation datasets, JuICe provides refined human-curated data, open-domain code, and an order of magnitude more training data.
13 PAPERS • NO BENCHMARKS YET
In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark, ClassEval, of 100 class-level Python code generation tasks, built with approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. From our results, we have the following main findings. First, all existing LLMs show much worse performance on class-level code generation than on standalone method-level code generation benchmarks like HumanEval, and method-level coding ability cannot equivalently reflect class-level coding ability among LLMs. Second, GPT-4 and GPT-3.5 still exhibit dominant superiority over the other LLMs on class-level code generation, and the second-tier models include Instruct-Starcoder, Instruct-Codegen, and Wizardcoder, with very similar performance.
9 PAPERS • NO BENCHMARKS YET
MCoNaLa is a multilingual dataset to benchmark code generation from natural language commands extending beyond English. Modeled on the methodology of the English Code/Natural Language Challenge (CoNaLa) dataset, the authors annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of the notebooks in this benchmark also include data dependencies, so this benchmark can not only test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant for more details about state-of-the-art results and other properties of the dataset.
6 PAPERS • NO BENCHMARKS YET
Automated source code generation is currently a popular machine learning-based task. It can be helpful for software developers to write functionally correct code from a given context. However, just like human developers, a code generation model can produce vulnerable code, which the developers can mistakenly use. For this reason, evaluating the security of a code generation model is a must. In this paper, we describe SecurityEval, an evaluation dataset to fulfill this purpose. It contains 130 samples for 75 vulnerability types, which are mapped to the Common Weakness Enumeration (CWE). We also demonstrate using our dataset to evaluate one open-source (i.e., InCoder) and one closed-source code generation model (i.e., GitHub Copilot).
Lyra is a dataset for code generation that consists of Python code with embedded SQL. This dataset contains 2,000 carefully annotated database manipulation programs from real usage projects. Each program is paired with both a Chinese comment and an English comment.
5 PAPERS • NO BENCHMARKS YET
Shellcode_IA32 is a dataset containing 20 years of shellcodes from a variety of sources; it is the largest collection of shellcodes in assembly available to date.
5 PAPERS • 1 BENCHMARK
CoNaLa Extended With Question Text is an extension of the original CoNaLa dataset, proposed in the NLP4Prog workshop paper "Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation". The key addition is that every example now includes the full question body from its respective StackOverflow question.
4 PAPERS • 1 BENCHMARK
PyTorrent contains 218,814 Python package libraries from PyPI and the Anaconda environment; these sources were chosen because earlier studies have shown that much of the code elsewhere is redundant, while Python packages from these environments are better in quality and well documented. PyTorrent enables users (such as data scientists, students, etc.) to build off-the-shelf machine learning models directly without spending months of effort on large infrastructure.
4 PAPERS • NO BENCHMARKS YET
BioCoder is a benchmark developed to evaluate existing pre-trained models in generating bioinformatics code. In relation to function-code generation, BioCoder covers potential package dependencies, class declarations, and global variables. It incorporates 1026 functions and 1243 methods in Python and Java from GitHub and 253 examples from the Rosalind Project.
3 PAPERS • NO BENCHMARKS YET
To automatically generate Python and assembly programs used for security exploits, we curated a large dataset for feeding NMT (neural machine translation) techniques. A sample in the dataset consists of a snippet of code from these exploits and its corresponding description in English. We collected exploits from publicly available databases (exploitdb, shellstorm), public repositories (e.g., GitHub), and programming guidelines. In particular, we focused on exploits targeting Linux, the most common OS for security-critical network services, running on IA-32 (i.e., the 32-bit version of the x86 Intel Architecture). The dataset is stored in the folder EVIL/datasets and consists of two parts: i) Encoders, a Python dataset that contains Python code used by exploits to encode the shellcode; ii) Decoders, an assembly dataset that includes shellcode and decoders to revert the encoding.
2 PAPERS • NO BENCHMARKS YET
The dataset is specifically constructed for the library-oriented code generation task, as introduced in the paper "CodeGen4Libs: A Two-Stage Approach for Library-Oriented Code Generation".
1 PAPER • NO BENCHMARKS YET
CriticBench is a comprehensive benchmark designed to assess the abilities of Large Language Models (LLMs) to critique and rectify their reasoning across various tasks, spanning five reasoning domains.
DISL: the full dataset report is available at https://arxiv.org/abs/2403.16861.
In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents. We also cover aspects including, but not limited to, the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process that lets LLMs themselves generate and annotate the benchmarks with a "human in the loop". A novel language (i.e., DSEAL) has been proposed, and the four derived benchmarks have significantly improved benchmark scalability and coverage with largely reduced human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects.
This is an assembly dataset built on top of Shellcode_IA32, a dataset for automatically generating assembly from natural language descriptions. Shellcode_IA32 consists of 3,200 assembly instructions, commented in English, which were collected from shellcodes for IA-32 and written for the Netwide Assembler (NASM) for Linux. In order to make the data more representative of the code that we aim to generate (i.e., complete exploits, inclusive of decoders to be delivered in the shellcode), we enriched the dataset with further samples of assembly code, drawn from the exploits that we collected from public databases. Different from the previous dataset, the new one includes assembly code from real decoders used in actual exploits. The final dataset contains 3,715 unique pairs of assembly code snippets/English intents. To better support developers in the automatic generation of assembly programs, we looked beyond a one-to-one mapping between natural language intents and their corresponding snippets.
This dataset contains samples to generate Python code for security exploits. In order to make the dataset representative of real exploits, it includes code snippets drawn from exploits in public databases. Differing from the general-purpose Python code found in previous datasets, the Python code of real exploits entails low-level operations on byte data for obfuscation purposes (i.e., to encode shellcodes). Therefore, real exploits make extensive use of Python instructions for converting data between different encoders, for performing low-level arithmetic and logical operations, and for bit-level slicing, which cannot be found in previous general-purpose Python datasets. In total, we built a dataset that consists of 1,114 original samples of exploit-tailored Python snippets and their corresponding intent in the English language. These samples include complex and nested instructions, as is typical of Python programming.
We introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.
MMCode is a multi-modal code generation dataset designed to evaluate the problem-solving skills of code language models in visually rich contexts (i.e. images). It contains 3,548 questions paired with 6,620 images, derived from real-world programming challenges across 10 code competition websites, with Python solutions and tests provided. The dataset emphasizes the extreme demand for reasoning abilities, the interwoven nature of textual and visual contents, and the occurrence of questions containing multiple images.
Syntax-Aware Fill-in-the-Middle (SAFIM) is a benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. SAFIM has three subtasks: Algorithmic Block Completion, Control-Flow Expression Completion, and API Function Call Completion. SAFIM is sourced from code submitted from April 2022 to January 2023 to minimize the impact of data contamination on evaluation results.
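A minimal sketch of constructing a fill-in-the-middle prompt for a block-completion example follows; the sentinel tokens and the code snippet are assumptions for illustration (actual FIM tokens are model-specific).

```python
# Minimal sketch of building a fill-in-the-middle (FIM) prompt. The sentinel
# tokens below follow the common prefix-suffix-middle (PSM) convention; the
# exact tokens vary by model.
PREFIX, SUFFIX, MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

code = '''def clamp(x, lo, hi):
    if x < lo:
        return lo
    {MASKED_BLOCK}
    return x
'''

prefix, suffix = code.split("{MASKED_BLOCK}")
prompt = f"{PREFIX}{prefix}{SUFFIX}{suffix}{MIDDLE}"
# The model is asked to generate the masked block (here, the branch handling
# the upper bound); the completion is then checked syntactically and/or
# against tests.
```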
1 PAPER • 1 BENCHMARK
The dataset consists of source code and LLVM IR pairs generated from accepted and de-duplicated programming contest solutions. The dataset is divided into language configs and mode splits. The language can be one of C, C++, D, Fortran, Go, Haskell, Nim, Objective-C, Python, Rust and Swift, indicating the source files' languages. The mode split indicates the compilation mode, which can be either Size_Optimized or Perf_Optimized.
We created SOEVAL by mining questions from StackOverflow. Our goal was to create a prompt dataset that reflects the real-life needs of software developers. To build this dataset, we first collected 500 popular and recent questions with Python and Java tags for each. From these 1,000 questions, we applied a set of inclusion and exclusion criteria. The inclusion criteria were that the question has to (1) explicitly ask "how to do X" in Python or Java; (2) include code in its body; and (3) have an accepted answer that includes code. We excluded questions that were (1) open-ended, asking for best practices/guidelines for a specific problem in Python/Java; (2) related to finding a specific API/module for a given task; (3) related to errors due to environment configuration (e.g., a missing dependency library); (4) related to configuring libraries/APIs; or (5) syntax-specific types of questions. By applying the criteria above to these 1,000 questions, we obtained 28 and 42 prompts for Java and Python, respectively.
SGSM contains 20,490 question/answer pairs generated by MATHWELL, a context-free grade school math word problem generator that outputs a word problem and Program of Thought (PoT) solution based solely on an optional student interest. SGSM has two subsets: SGSM Train, comprised of 2,093 question/answer pairs verified by human experts, and SGSM Unannotated, comprised of 18,397 question/answer pairs that have executable code but are not verified by human experts. SGSM is the largest English grade school math QA dataset with PoT rationales.
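An illustrative Program of Thought (PoT) rationale is sketched below: the answer is produced by executable Python rather than free-form text. The word problem and numbers are made up for illustration and are not drawn from SGSM.

```python
# Illustrative Program of Thought (PoT) rationale: the solution to a word
# problem is written as executable Python whose return value is the answer.
def solution():
    """Ava has 4 packs of 6 stickers and gives away 5. How many are left?"""
    packs = 4
    stickers_per_pack = 6
    given_away = 5
    remaining = packs * stickers_per_pack - given_away
    return remaining

print(solution())  # 19
```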
TACO (Topics in Algorithmic COde generation dataset) is a dataset focused on algorithmic code generation, designed to provide a more challenging training dataset and evaluation benchmark for the code generation model field. The dataset consists of programming competition problems that are more difficult and closer to real programming scenarios. It emphasizes improving or evaluating the model's understanding and reasoning abilities in practical application scenarios, rather than just implementing predefined functionality.
The dataset contains more than 100k code patch pairs extracted from open source projects on GitHub. Each pair comes with the erroneous and the fixed version of the corresponding code snippet. Instead of the whole file, the code snippets are extracted to focus on the problematic region (error line + other lines around it). For each sample, the repository name, the commit id, and the file names are provided so that one can access the complete files in case of interest.
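A hypothetical record in this style might look as follows; all field names and values are placeholders for illustration, not the dataset's actual schema.

```python
# Illustrative patch-pair record (field names and values are placeholders):
# the erroneous and fixed snippet around the problematic region, plus enough
# metadata to fetch the complete files from GitHub if needed.
patch_pair = {
    "repo": "example-org/example-project",  # hypothetical repository
    "commit_id": "0123abcd",                # placeholder for the fixing commit
    "file": "src/utils.py",
    "buggy": "if len(items) >= 0:\n    process(items)\n",
    "fixed": "if len(items) > 0:\n    process(items)\n",
}
```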
Turbulence is a new benchmark for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation. Turbulence consists of a large set of natural language question templates, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated test oracle that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a neighbourhood of very similar programming questions, and assess the correctness of the result returned for each question. This new benchmark systematically and automatically identifies cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood. Therefore, this method is effective at highlighting robustness issues.
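A hedged sketch of the idea, with a made-up template, parameters, and oracle, follows; none of these names come from the benchmark itself.

```python
# Sketch of a Turbulence-style setup: one parameterised natural language
# question template instantiated into a neighbourhood of similar questions,
# each judged by the same test oracle. Template and oracle are illustrative.
TEMPLATE = ("Write a Python function f(xs) that returns the {k} largest "
            "elements of the list xs, in descending order.")

def oracle(candidate_f, k: int) -> bool:
    """Test oracle for one instantiation of the template."""
    xs = [5, 1, 9, 3, 7, 2]
    return candidate_f(xs) == sorted(xs, reverse=True)[:k]

neighbourhood = [TEMPLATE.format(k=k) for k in (1, 2, 3, 4)]
# An LLM answers every question in the neighbourhood; robustness issues show
# up when it solves some instantiations but fails on others.
```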
Verified Smart Contracts Code Comments is a dataset of real Ethereum smart contract functions, containing "code, comment" pairs of both Solidity and Vyper source code. The dataset is based on every deployed Ethereum smart contract as of 1 April 2022 that has been verified on Etherscan and has at least one transaction. A total of 1,541,370 smart contract functions are provided, parsed from 186,397 unique smart contracts, filtered down from 2,217,692 smart contracts.
Verified Smart Contracts is a dataset of real Ethereum smart contracts, containing both Solidity and Vyper source code. It consists of every deployed Ethereum smart contract as of 1 April 2022 that has been verified on Etherscan and has at least one transaction. A total of 186,397 unique smart contracts are provided, filtered down from 2,217,692 smart contracts. The dataset contains 53,843,305 lines of code.