The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes: * Six million methods overall * Two million of which have associated documentation (docstrings, JavaDoc, and more) * Metadata that indicates the original location (repository or line number, for example) where the data was found
255 PAPERS • 12 BENCHMARKS
StaQC (Stack Overflow Question-Code pairs) is a large dataset of around 148K Python and 120K SQL domain question-code pairs, which are automatically mined from StackOverflow.
16 PAPERS • NO BENCHMARKS YET
CoDesc is a large dataset of 4.2m Java source code and parallel data of their description from code search, and code summarization studies.
4 PAPERS • 2 BENCHMARKS
Presents a new dataset of code snippets with short descriptions, created using data gathered from Stackoverflow, a popular programming help website. Since access is open and unrestricted, the content is inherently noisy (ungrammatical, non-parsable, lacking content).
3 PAPERS • NO BENCHMARKS YET
The Java dataset introduced in DeepCom (Deep Code Comment Generation), commonly used to evaluate automated code summarization.
2 PAPERS • 1 BENCHMARK
The Java dataset introduced in Hybrid-DeepCom (Deep code comment generation with hybrid lexical and syntactical information), commonly used to evaluate automated code summarization. It is basically a further version of DeepCom-Java.
1 PAPER • 1 BENCHMARK
The Python dataset introduced in the Parallel Corpus paper (A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation), commonly used for evaluating automated code summarization.