10 dataset results for Code Search

CodeSearchNet

The CodeSearchNet Corpus is a large dataset of functions with associated documentation, written in Go, Java, JavaScript, PHP, Python, and Ruby, from open source projects on GitHub. The CodeSearchNet Corpus includes:

* Six million methods overall
* Two million of which have associated documentation (docstrings, JavaDoc, and more)
* Metadata that indicates the original location (repository or line number, for example) where the data was found

255 PAPERS • 12 BENCHMARKS

CodeXGLUE

CodeXGLUE is a benchmark dataset and open challenge for code intelligence. It includes a collection of code intelligence tasks and a platform for model evaluation and comparison. CodeXGLUE stands for General Language Understanding Evaluation benchmark for CODE. It includes 14 datasets for 10 diversified code intelligence tasks.

161 PAPERS • 15 BENCHMARKS

CoNaLa (CMU CoNaLa, the Code/Natural Language Challenge)

The CMU CoNaLa (Code/Natural Language Challenge) dataset is a joint project of the Carnegie Mellon University NeuLab and STRUDEL labs. It is designed for testing the generation of code snippets from natural language. The data comes from Stack Overflow questions. There are 2,379 training and 500 test examples that were manually annotated. Every example has a natural language intent and its corresponding Python snippet. In addition to the manually annotated dataset, there are also 598,237 mined intent-snippet pairs. These examples are similar to the hand-annotated ones, except that each carries a probability that the pair is valid.
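Because each mined pair comes with a validity probability, a common first step is to keep only the more confident pairs. The following is a minimal sketch of that idea, not the project's own tooling: the `MinedPair` class, its field names (`intent`, `snippet`, `prob`), the `filter_mined_pairs` helper, the 0.5 threshold, and the three sample pairs are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MinedPair:
    """Hypothetical record for one mined intent-snippet pair."""
    intent: str   # natural language intent
    snippet: str  # Python snippet paired with the intent
    prob: float   # estimated probability that the pair is valid


def filter_mined_pairs(pairs, min_prob=0.5):
    """Keep pairs with prob >= min_prob, most confident first."""
    kept = [p for p in pairs if p.prob >= min_prob]
    return sorted(kept, key=lambda p: p.prob, reverse=True)


# Invented sample data; these are not real entries from the dataset.
pairs = [
    MinedPair("reverse a list", "xs[::-1]", 0.92),
    MinedPair("open a file", "print('hello')", 0.08),
    MinedPair("sort a dict by value",
              "sorted(d.items(), key=lambda kv: kv[1])", 0.61),
]

confident = filter_mined_pairs(pairs, min_prob=0.5)
for p in confident:
    print(f"{p.prob:.2f}  {p.intent}")
```

The threshold trades quantity for quality: a higher `min_prob` leaves fewer but cleaner pairs, so the right value depends on how noise-tolerant the downstream model is.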

64 PAPERS • 1 BENCHMARK

StaQC

StaQC (Stack Overflow Question-Code pairs) is a large dataset of around 148K Python and 120K SQL domain question-code pairs, which are automatically mined from StackOverflow.

16 PAPERS • NO BENCHMARKS YET

XLCoST (Cross-Lingual Code Snippet)

XLCoST is a benchmark dataset for cross-lingual code intelligence. The dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English), and supports 10 cross-language code tasks.

14 PAPERS • NO BENCHMARKS YET

CoDesc

CoDesc is a large parallel dataset of 4.2M Java source code snippets paired with their descriptions, drawn from code search and code summarization studies.

4 PAPERS • 2 BENCHMARKS

PyTorrent

PyTorrent contains 218,814 Python package libraries from the PyPI and Anaconda environments. These sources were chosen because earlier studies have shown that much of the code is redundant, while Python packages from these environments are of better quality and are well documented. PyTorrent enables users (such as data scientists and students) to build off-the-shelf machine learning models directly, without spending months of effort on large infrastructure.

4 PAPERS • NO BENCHMARKS YET

ISAdetect dataset (ISAdetect binary file and object code dataset)

This repository holds two datasets: one with both the original binaries and the code sections extracted from them (“full dataset”), and one with only the code sections (“only code sections”). The code sections were extracted by carving out sections of the binary that were marked as executable. The binaries were scraped from Debian repositories.
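Carving out executable sections can be sketched as follows. This is an illustrative sketch only, not the repository's extraction code: it assumes the binaries are 64-bit little-endian ELF files, and the function name `carve_executable_sections` is hypothetical. Other ELF classes and byte orders are not handled. The header offsets and the `SHF_EXECINSTR` flag value come from the ELF format itself.

```python
import struct

SHF_EXECINSTR = 0x4  # ELF section flag: section holds executable instructions
SHT_NOBITS = 8       # section type that occupies no bytes in the file


def carve_executable_sections(data: bytes) -> list[bytes]:
    """Return the raw bytes of each executable section in an ELF64 LE image."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")

    # ELF64 header: e_shoff at 0x28, e_shentsize at 0x3A, e_shnum at 0x3C.
    (e_shoff,) = struct.unpack_from("<Q", data, 0x28)
    e_shentsize, e_shnum = struct.unpack_from("<HH", data, 0x3A)

    sections = []
    for i in range(e_shnum):
        base = e_shoff + i * e_shentsize
        # First six fields of an ELF64 section header:
        # sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size.
        _name, sh_type, sh_flags, _addr, sh_offset, sh_size = struct.unpack_from(
            "<IIQQQQ", data, base
        )
        if sh_flags & SHF_EXECINSTR and sh_type != SHT_NOBITS:
            sections.append(data[sh_offset:sh_offset + sh_size])
    return sections
```

Given the contents of a binary read with `open(path, "rb").read()`, the function returns one `bytes` object per executable section, which roughly corresponds to what an "only code sections" variant of a file would contain.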

1 PAPER • NO BENCHMARKS YET

Search4Code

Search4Code is a large-scale, web-query-based dataset of code search queries for C# and Java. The Search4Code data is mined from Microsoft Bing's anonymized search query logs using a weak supervision technique.

1 PAPER • NO BENCHMARKS YET

washed_contract

The dataset contains about 48K contracts that are open source on Etherscan.

1 PAPER • NO BENCHMARKS YET
