MATH is a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
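A minimal loading sketch, assuming the commonly distributed layout of one JSON file per problem with "problem", "level", "type", and "solution" fields (verify against your copy of the data):

```python
import json
from pathlib import Path

# Iterate over MATH problems. Assumes one JSON file per problem with
# "problem" and "solution" fields; check the release you downloaded.
def load_math_problems(root: str):
    for path in Path(root).rglob("*.json"):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        yield record["problem"], record["solution"]

for problem, solution in load_math_problems("MATH/train"):  # hypothetical path
    print(problem[:80], "->", solution[:80])
    break
```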
378 PAPERS • 2 BENCHMARKS
A challenge set for elementary-level Math Word Problems (MWPs). An MWP consists of a short natural language narrative that describes a state of the world and poses a question about some unknown quantities.
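As a concrete illustration (the record layout and field names here are hypothetical, not the release format), an MWP reduces to a narrative, a question, and the equation and answer a solver must recover:

```python
# Hypothetical record layout, for illustration only.
mwp = {
    "body": "Allan brought 2 balloons to the park, and Jake brought 4 balloons.",
    "question": "How many balloons did they bring in total?",
    "equation": "2 + 4",
    "answer": 6,
}
assert eval(mwp["equation"]) == mwp["answer"]
```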
212 PAPERS • 1 BENCHMARK
We present ASDiv (Academia Sinica Diverse MWP Dataset), a diverse (in terms of both language patterns and problem types) English math word problem (MWP) corpus for evaluating the capability of various MWP solvers. Existing MWP corpora for studying AI progress remain limited either in language usage patterns or in problem types. We thus present a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem types taught in elementary school. Each MWP is annotated with its problem type and grade level (for indicating the level of difficulty). Furthermore, we propose a metric to measure the lexicon usage diversity of a given MWP corpus, and demonstrate that ASDiv is more diverse than existing corpora. Experiments show that our proposed corpus reflects the true capability of MWP solvers more faithfully.
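The sketch below conveys the idea of a lexicon-usage-diversity measure with a simple pairwise word-overlap stand-in; it is not ASDiv's actual metric, whose definition is given in the paper:

```python
from itertools import combinations

# Simplified stand-in for a lexicon-diversity metric: the fraction of
# problem pairs whose unigram overlap stays below a threshold.
def lexicon_diversity(problems, threshold=0.5):
    def overlap(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, min(len(wa), len(wb)))
    pairs = list(combinations(problems, 2))
    diverse = sum(1 for a, b in pairs if overlap(a, b) < threshold)
    return diverse / max(1, len(pairs))

print(lexicon_diversity([
    "Tom has 3 apples and buys 2 more.",
    "A train travels 60 miles in 2 hours.",
]))  # -> 1.0 (the two problems share almost no vocabulary)
```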
98 PAPERS • 1 BENCHMARK
MathQA significantly enhances the AQuA dataset with fully-specified operational programs.
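A MathQA operation program is a linear sequence of steps such as add, subtract, multiply, and divide, chained with "|". The interpreter below is a minimal sketch, assuming the common convention that nK tokens index numbers from the problem text, #K tokens reference earlier step results, and const_* tokens are literals:

```python
import re

# Sketch of executing a MathQA-style operation program such as
# "add(n0,n1)|divide(#0,const_2)". Conventions hedged from the paper.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def run_program(program: str, numbers: list) -> float:
    results = []
    for step in program.split("|"):
        m = re.match(r"(\w+)\(([^)]*)\)", step)
        op, args = m.group(1), m.group(2).split(",")
        def resolve(tok: str) -> float:
            tok = tok.strip()
            if tok.startswith("n"):       # number from the problem text
                return numbers[int(tok[1:])]
            if tok.startswith("#"):       # result of an earlier step
                return results[int(tok[1:])]
            return float(tok.replace("const_", ""))  # literal constant
        results.append(OPS[op](*[resolve(a) for a in args]))
    return results[-1]

print(run_program("add(n0,n1)|divide(#0,const_2)", [10, 4]))  # -> 7.0
```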
94 PAPERS • 1 BENCHMARK
Math23K is a dataset created for math word problem solving. It contains 23,162 Chinese problems crawled from the Internet and was originally introduced in the paper Deep Neural Solver for Math Word Problems. The original files use a train/test split, while other research efforts (https://github.com/2003pro/Graph2Tree) use a train/dev/test split.
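A minimal reading sketch, hedging on the exact field names (releases commonly expose the problem text, an equation string such as "x=(11-1)*2", and the answer; the file path here is hypothetical):

```python
import json

# Read Math23K records. Field names assumed from common releases:
# "original_text", "equation", "ans"; verify against your copy.
with open("math23k_train.json", encoding="utf-8") as f:  # hypothetical path
    problems = json.load(f)

for p in problems[:3]:
    print(p["original_text"], "|", p["equation"], "|", p["ans"])
```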
89 PAPERS • 1 BENCHMARK
MAWPS is an online repository of Math Word Problems that provides a unified testbed to evaluate different algorithms. MAWPS allows for the automatic construction of datasets with particular characteristics, providing tools for tuning the lexical and template overlap of a dataset as well as for filtering ungrammatical problems from web-sourced corpora. The online nature of this repository facilitates easy community contribution. It amasses 3,320 problems, including the full datasets used in several previous works.
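A rough sketch of the template-overlap notion those tools control: normalize each equation by replacing numbers with slots, then count how often a template repeats (an illustration of the idea, not MAWPS's exact tooling):

```python
import re
from collections import Counter

def template(equation: str) -> str:
    # Replace every number with a slot so "2 + 4 = x" and "3 + 7 = x" match.
    return re.sub(r"\d+(\.\d+)?", "<num>", equation)

equations = ["2 + 4 = x", "3 + 7 = x", "5 * 2 = x"]
counts = Counter(template(e) for e in equations)
overlap = sum(c - 1 for c in counts.values()) / len(equations)
print(overlap)  # 1/3 of problems share a template with an earlier one
```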
84 PAPERS • 2 BENCHMARKS
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems. The same 250 problems from GSM8K are each translated by human annotators into 10 languages. GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems, created to support question answering on basic mathematical problems that require multi-step reasoning.
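GSM8K reference solutions end with a final line of the form "#### <answer>", and scoring typically reduces to exact match on that number; MGSM evaluation follows the same idea across languages. A small extractor sketch:

```python
import re

# Pull the gold numeric answer from a GSM8K-style solution string.
def final_answer(solution: str) -> float:
    match = re.search(r"####\s*([\-0-9.,]+)", solution)
    return float(match.group(1).replace(",", ""))

print(final_answer("Janet sells 16 - 3 - 4 = 9 eggs.\n#### 9"))  # -> 9.0
```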
38 PAPERS • 1 BENCHMARK
MathVista is a consolidated mathematical reasoning benchmark within visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address the missing visual domains and are tailored to evaluate logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. It also incorporates 9 MathQA datasets and 19 VQA datasets from the literature, which significantly enrich the diversity and complexity of visual perception and mathematical reasoning challenges within the benchmark. In total, MathVista includes 6,141 examples collected from 31 different datasets.
26 PAPERS • NO BENCHMARKS YET
Current visual question answering (VQA) tasks mainly consider answering human-annotated questions for natural images in the daily-life context. Icon question answering (IconQA) is a benchmark which aims to highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning in real-world diagram word problems. For this benchmark, a large-scale IconQA dataset is built that consists of three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. Compared to existing VQA benchmarks, IconQA requires not only perception skills like object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning.
23 PAPERS • 1 BENCHMARK
514 algebra word problems and associated equation systems gathered from Algebra.com.
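To make "associated equation systems" concrete, here is a hedged, illustrative pairing (the specific problem is invented, not drawn from ALG514), checked with SymPy:

```python
from sympy import symbols, Eq, solve

# Illustrative only: an ALG514-style problem pairs a word problem with an
# equation system, e.g. "Two numbers sum to 25 and differ by 5."
x, y = symbols("x y")
system = [Eq(x + y, 25), Eq(x - y, 5)]
print(solve(system, [x, y]))  # {x: 15, y: 10}
```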
20 PAPERS • 1 BENCHMARK
A new large-scale geometry problem-solving dataset:
- 3,002 multi-choice geometry problems
- dense annotations in formal language for the diagrams and text
- 27,213 annotated diagram logic forms (literals)
- 6,293 annotated text logic forms (literals)
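The literals pair diagram facts with text facts in a formal language; the snippet below is an illustrative approximation of the style (exact predicate names should be checked against the released annotations):

```python
# Approximate style of the formal-language literals, for illustration only.
diagram_literals = [
    "Triangle(A, B, C)",
    "Equals(LengthOf(Line(A, B)), 8)",
]
text_literals = [
    "Equals(MeasureOf(Angle(A, B, C)), 90)",
    "Find(LengthOf(Line(A, C)))",
]
```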
17 PAPERS • 1 BENCHMARK
GeoS is a dataset for automatic math problem solving, consisting of SAT plane geometry questions where every question has a textual description in English accompanied by a diagram and multiple choices. Questions and answers are compiled from previous official SAT exams and practice exams offered by the College Board. We annotate ground-truth logical forms for all questions in the dataset.
14 PAPERS • 1 BENCHMARK
A new large-scale plane geometry problem-solving dataset called PGPS9K, labeled with both fine-grained diagram annotations and interpretable solution programs.
7 PAPERS • 1 BENCHMARK
DRAW-1K is a dataset consisting of 1,000 algebra word problems, semi-automatically annotated for the evaluation of automatic solvers. DRAW includes gold coefficient alignments that are necessary to uniquely identify the derivation of an equation system.
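A hypothetical sketch of why alignments matter: several equation systems can yield the same answer, so recording which numbers from the text fill which coefficient slots pins down a unique derivation (the field names below are invented for illustration):

```python
# Hypothetical record: template slots explicitly aligned to text numbers.
record = {
    "text": "The sum of two numbers is 25. One number is 5 more than the other.",
    "template": ["a*x + b*y = c", "x - y = d"],
    "alignments": {"a": 1.0, "b": 1.0, "c": 25.0, "d": 5.0},
}
```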
4 PAPERS • 1 BENCHMARK
Provides explanations for three existing benchmark datasets on solving algebraic word problems: ALG514, DRAW-1K, and MAWPS.
3 PAPERS • 1 BENCHMARK
The Mix of Minimal Optimal Sets (MMOS) dataset has two advantages for math reasoning: higher performance and lower construction cost.
1 PAPER • NO BENCHMARKS YET
SGSM contains 20,490 question/answer pairs generated by MATHWELL, a context-free grade school math word problem generator that outputs a word problem and Program of Thought (PoT) solution based solely on an optional student interest. SGSM has two subsets: SGSM Train, comprising 2,093 question/answer pairs verified by human experts, and SGSM Unannotated, comprising 18,397 question/answer pairs that have executable code but are not verified by human experts. SGSM is the largest English grade school math QA dataset with PoT rationales.
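An illustrative PoT rationale in the style SGSM pairs with each question (the problem itself is invented here): executable Python whose return value is the final answer, which is what makes the unannotated subset machine-checkable for executability:

```python
# Illustrative PoT solution; the word problem is invented for this sketch.
def solution():
    # "A student has 4 packs of 12 pencils and gives away 7. How many remain?"
    pencils = 4 * 12
    remaining = pencils - 7
    return remaining

assert solution() == 41
```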