The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. BIG-bench includes more than 200 tasks.
222 PAPERS • 134 BENCHMARKS
CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK.
2 PAPERS • 1 BENCHMARK
The Ethics (per ethics) dataset was created to test knowledge of the basic concepts of morality. The task is to predict human ethical judgments about diverse text situations in a multi-label classification setting. The main objective is to evaluate whether each of five concepts from normative ethics is exemplified positively or negatively in a situation, with 'yes' and 'no' ratings. The included concepts are: virtue, law, moral, justice, and utilitarianism.
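A minimal sketch of what a single instance might look like in this multi-label setting; the field names and the example situation are illustrative assumptions, not the dataset's actual schema.

```java
// Hypothetical sketch of one Ethics instance: field names, label keys, and
// the example situation are assumptions, not the dataset's published schema.
import java.util.Map;

public class EthicsExample {
    // A text situation plus a yes/no judgment for each of the five concepts.
    record EthicsInstance(String situation, Map<String, Boolean> labels) {}

    public static void main(String[] args) {
        EthicsInstance sample = new EthicsInstance(
            "A driver returned a lost wallet to its owner.",
            Map.of("virtue", true, "law", true, "moral", true,
                   "justice", true, "utilitarianism", true));
        // Multi-label setting: a model must predict all five judgments at once.
        sample.labels().forEach((concept, judged) ->
            System.out.println(concept + ": " + (judged ? "yes" : "no")));
    }
}
```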
MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks.
RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts.
RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts.
Recent times have witnessed an increasing number of applications of deep neural networks to tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task (and the associated SMART-101 dataset) for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children of younger age (6-8). Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and solving it requires a mix of several elementary skills, including pattern recognition, algebra, and spatial reasoning, among others. To train deep neural networks, we programmatically augment each puzzle into 2,000 new instances, each varying in appearance.
2 PAPERS • NO BENCHMARKS YET
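A rough sketch of how one augmented SMART-101 puzzle instance could be represented; the field names and the multiple-choice assumption are illustrative, not taken from the dataset's published format.

```java
// Illustrative sketch only: field names, the file path, and the
// multiple-choice assumption are not taken from the SMART-101 release.
import java.util.List;

public class Smart101Example {
    // One visuo-linguistic puzzle instance: an image, a question, candidate
    // answers, and the index of the correct option.
    record PuzzleInstance(String imagePath, String question,
                          List<String> options, int answerIndex) {}

    public static void main(String[] args) {
        PuzzleInstance p = new PuzzleInstance(
            "puzzles/puzzle_017/instance_0042.png",
            "How many triangles are hidden in the figure?",
            List.of("4", "6", "8", "10", "12"),
            2);
        System.out.println(p.question() + " -> " + p.options().get(p.answerIndex()));
    }
}
```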
A deductive reasoning benchmark based on formal logic theory. A model is required to generate a proof that proves or disproves a given hypothesis based on a given set of facts.
1 PAPER • NO BENCHMARKS YET
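A minimal sketch of how an instance of such a deduction task might be structured, assuming a facts/hypothesis/proof layout; the field names and the toy example are illustrative, not the benchmark's actual format.

```java
// Illustrative sketch: the facts/hypothesis/proof layout and field names are
// assumptions about a typical deductive-reasoning instance, not the
// benchmark's published format.
import java.util.List;

public class DeductionExample {
    record DeductionInstance(List<String> facts, String hypothesis,
                             String proof, boolean hypothesisHolds) {}

    public static void main(String[] args) {
        DeductionInstance x = new DeductionInstance(
            List.of("All birds can fly.", "A penguin is a bird."),
            "A penguin can fly.",
            "A penguin is a bird; all birds can fly; therefore a penguin can fly.",
            true);
        System.out.println((x.hypothesisHolds() ? "proved: " : "disproved: ")
                           + x.hypothesis());
    }
}
```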
The ontology files, readme, and statistical information can be found and browsed in the ontology library. Because many of the ontologies make use of imports, we have "localised" the ontologies by parsing them, resolving and parsing all imports, merging the main and imported ontologies together, and re-serialising the result, all using the OWL API. The original main ontology and the imported ontologies are saved in the "sources/" directory.
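A minimal sketch of this localisation step with the OWL API (4/5-style); the file names and the merged ontology IRI below are placeholders, and the exact pipeline the library authors used may differ.

```java
// Sketch of localising an ontology with the OWL API: load the main ontology,
// let the manager resolve and parse its imports, merge everything into one
// ontology, and re-serialise it. File names and the merged IRI are placeholders.
import java.io.File;
import java.io.FileOutputStream;

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.formats.RDFXMLDocumentFormat;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.util.OWLOntologyMerger;

public class LocaliseOntology {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        // Loading also resolves and parses the ontology's imports closure.
        manager.loadOntologyFromOntologyDocument(new File("sources/main.owl"));

        // Merge the main ontology and all imported ontologies into one.
        OWLOntologyMerger merger = new OWLOntologyMerger(manager);
        OWLOntology merged = merger.createMergedOntology(
                manager, IRI.create("urn:example:localised"));

        // Re-serialise the merged ("localised") ontology.
        manager.saveOntology(merged, new RDFXMLDocumentFormat(),
                new FileOutputStream("localised.owl"));
    }
}
```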
Recent advances in large language models have led to the development of multimodal LLMs (MLLMs), which take both image data and text as input. Virtually all of these models have been announced within the past year, leading to a significant need for benchmarks evaluating the abilities of these models to reason truthfully and accurately on a diverse set of tasks. When Google announced Gemini (Gemini Team et al., 2023), they showcased its ability to solve rebuses: wordplay puzzles which involve creatively adding and subtracting letters from words derived from text and images. The diversity of rebuses allows for a broad evaluation of multimodal reasoning capabilities, including image recognition, multi-step reasoning, and understanding the human creator's intent. We present REBUS: a collection of 333 hand-crafted rebuses spanning 13 diverse categories, including hand-drawn and digital images created by nine contributors.
1 PAPER • 1 BENCHMARK
The Winograd schema challenge consists of tasks with referential ambiguity that can be resolved with logic and reasoning.
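A small sketch of what one schema instance could look like, using the classic "city councilmen" example from Levesque's formulation; the field layout is an illustrative assumption rather than any particular release's schema.

```java
// Illustrative sketch of a Winograd schema instance. The field layout is an
// assumption; the sentence is the classic example from Levesque's paper.
import java.util.List;

public class WinogradExample {
    record WinogradInstance(String sentence, String pronoun,
                            List<String> candidates, String answer) {}

    public static void main(String[] args) {
        WinogradInstance w = new WinogradInstance(
            "The city councilmen refused the demonstrators a permit because "
                + "they feared violence.",
            "they",
            List.of("the city councilmen", "the demonstrators"),
            "the city councilmen");
        // Swapping "feared" for "advocated" flips the referent to the
        // demonstrators, which is what makes the schema hard for models.
        System.out.println("\"" + w.pronoun() + "\" refers to " + w.answer());
    }
}
```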