7 dataset results for Source Code Summarization

CodeSearchNet

The CodeSearchNet Corpus is a large dataset of functions with associated documentation, written in Go, Java, JavaScript, PHP, Python, and Ruby, from open source projects on GitHub. The CodeSearchNet Corpus includes:

* Six million methods overall
* Two million of which have associated documentation (docstrings, JavaDoc, and more)
* Metadata that indicates the original location (repository or line number, for example) where the data was found

255 PAPERS • 12 BENCHMARKS

StaQC

StaQC (Stack Overflow Question-Code pairs) is a large dataset of around 148K Python and 120K SQL question-code pairs, automatically mined from Stack Overflow.

16 PAPERS • NO BENCHMARKS YET

CoDesc

CoDesc is a large parallel dataset of 4.2M Java source code samples paired with their natural-language descriptions, drawn from code search and code summarization studies.

4 PAPERS • 2 BENCHMARKS

Summarizing Source Code using a Neural Attention Model

Presents a new dataset of code snippets with short descriptions, created using data gathered from Stack Overflow, a popular programming help website. Because access is open and unrestricted, the content is inherently noisy (ungrammatical, non-parsable, lacking content).

3 PAPERS • NO BENCHMARKS YET

DeepCom-Java

The Java dataset introduced in DeepCom (Deep Code Comment Generation), commonly used to evaluate automated code summarization.

2 PAPERS • 1 BENCHMARK

Java scripts

The Java dataset introduced in Hybrid-DeepCom (Deep code comment generation with hybrid lexical and syntactical information), commonly used to evaluate automated code summarization. It is essentially a later version of DeepCom-Java.

1 PAPER • 1 BENCHMARK

ParallelCorpus-Python

The Python dataset introduced in the Parallel Corpus paper (A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation), commonly used for evaluating automated code summarization.

1 PAPER • 1 BENCHMARK
