Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer.
258 PAPERS • 2 BENCHMARKS
The Polyvore dataset contains 21,889 outfits from polyvore.com, of which 17,316 are for training, 1,497 for validation, and 3,076 for testing.
55 PAPERS • 3 BENCHMARKS
The Attribution, Relation, and Order (ARO) benchmark systematically evaluates the ability of VLMs to understand different types of relationships, attributes, and order information. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO-Order & Flickr30k-Order, to test for order sensitivity in VLMs. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases.
21 PAPERS • NO BENCHMARKS YET
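To make ARO's order-sensitivity tests concrete, here is a minimal sketch of building a COCO-Order-style test case by shuffling a caption's words; the perturbation strategy and field names are illustrative assumptions, not ARO's exact construction.

```python
import random

def make_order_test_case(caption: str, n_perturbations: int = 4, seed: int = 0):
    """Build an order-sensitivity test case: the true caption plus
    shuffled variants. A VLM should score the true caption highest."""
    rng = random.Random(seed)
    words = caption.split()
    candidates = []
    for _ in range(n_perturbations):
        shuffled = words[:]
        rng.shuffle(shuffled)
        candidates.append(" ".join(shuffled))
    return {"true_caption": caption, "perturbed_captions": candidates}

case = make_order_test_case("a dog chases a ball across the grass")
print(case["true_caption"])
for c in case["perturbed_captions"]:
    print("-", c)
```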
PopQA is an open-domain QA dataset of 14k QA pairs, each annotated with a fine-grained Wikidata entity ID, Wikipedia page views, and relationship type information.
ALCE is a benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations.
18 PAPERS • NO BENCHMARKS YET
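To illustrate the kind of output an ALCE system is evaluated on, the sketch below extracts bracketed citation markers from a generated answer, sentence by sentence, so they can be checked against the retrieved passages. The [k] marker convention and the helper itself are illustrative assumptions, not ALCE's official evaluation code.

```python
import re

def extract_citations(answer: str) -> list[list[int]]:
    """Return, per sentence, the indices of passages cited as [k]."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [[int(m) for m in re.findall(r"\[(\d+)\]", s)] for s in sentences]

answer = ("The Eiffel Tower was completed in 1889 [1]. "
          "It was the tallest man-made structure until 1930 [2][3].")
print(extract_citations(answer))  # [[1], [2, 3]]
```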
InfoSeek is a visual question answering dataset tailored for information-seeking questions that cannot be answered with common sense knowledge alone. Analyses of various pre-trained visual question answering models on InfoSeek reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2) struggle to answer visual information-seeking questions, but fine-tuning on InfoSeek elicits models to use fine-grained knowledge learned during their pre-training.
17 PAPERS • 2 BENCHMARKS
QAMPARI is an open-domain QA (ODQA) benchmark in which answers are lists of entities spread across many paragraphs. It was created by (a) generating questions with multiple answers from Wikipedia's knowledge graph and tables, (b) automatically pairing answers with supporting evidence in Wikipedia paragraphs, and (c) manually paraphrasing questions and validating each answer.
11 PAPERS • NO BENCHMARKS YET
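A hypothetical record layout for a QAMPARI-style example, showing a multi-answer question whose answers are each paired with supporting evidence, plus the list-level recall such a benchmark implies; the field names are assumptions, not the released schema.

```python
# Hypothetical record layout for a list-answer ODQA example (field
# names are illustrative, not QAMPARI's actual schema).
record = {
    "question": "Which novels were written by Jane Austen?",
    "answers": [
        {"entity": "Pride and Prejudice",
         "evidence": "Pride and Prejudice is an 1813 novel by Jane Austen..."},
        {"entity": "Emma",
         "evidence": "Emma is a novel written by Jane Austen, published in 1815..."},
    ],
}

# A system is scored on recovering the full answer list, not a single span.
predicted = {"Pride and Prejudice", "Emma", "Persuasion"}
gold = {a["entity"] for a in record["answers"]}
recall = len(predicted & gold) / len(gold)
print(f"answer recall: {recall:.2f}")
```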
DIOR-RSVG is a large-scale benchmark dataset for visual grounding in remote sensing (RSVG). The task is to localize objects referred to in natural language within remote sensing (RS) images. The dataset provides image/expression/box triplets for training and evaluating visual grounding models.
7 PAPERS • NO BENCHMARKS YET
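As a sketch of the image/expression/box triplets mentioned above, assuming an (xmin, ymin, xmax, ymax) pixel-coordinate box format and illustrative names:

```python
from dataclasses import dataclass

@dataclass
class RSVGTriplet:
    """One visual-grounding example: an RS image, a referring
    expression, and the box of the referred object.
    Box format assumed to be (xmin, ymin, xmax, ymax) in pixels."""
    image_path: str
    expression: str
    box: tuple[float, float, float, float]

sample = RSVGTriplet(
    image_path="images/airport_00042.jpg",
    expression="the large airplane parked near the terminal on the left",
    box=(120.0, 88.0, 310.0, 190.0),
)
print(sample.expression, "->", sample.box)
```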
Marine Video Kit contains single-shot videos taken from moving cameras in underwater environments. The first shard of the dataset is presented to serve video retrieval and other computer vision challenges. In addition to basic metadata statistics, it offers several insights based on low-level features as well as semantic annotations of selected keyframes. The shard comprises 1,379 videos ranging in length from 2 s to 4.95 min, with mean and median durations of 29.9 s and 25.4 s, respectively. The data were captured in 11 different regions and countries between 2011 and 2022.
7 PAPERS • 1 BENCHMARK
The Belgian Statutory Article Retrieval Dataset (BSARD) is a native French corpus for studying statutory article retrieval. BSARD consists of more than 22,600 statutory articles from Belgian law and about 1,100 legal questions posed by Belgian citizens and labeled by experienced jurists with relevant articles from the corpus.
6 PAPERS • 1 BENCHMARK
xCodeEval is one of the largest executable multilingual multitask benchmarks, covering 17 programming languages with execution-level parallelism. It features a total of seven tasks involving code understanding, generation, translation, and retrieval, and it employs execution-based evaluation instead of traditional lexical approaches. It also provides ExecEval, a test-case-based multilingual code execution engine that supports all the programming languages in xCodeEval.
6 PAPERS • NO BENCHMARKS YET
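A hedged sketch of execution-based evaluation in the spirit of ExecEval: a candidate program is accepted only if it reproduces the expected output on every test case. This toy runner handles Python solutions via subprocess; the real engine supports all 17 languages and is not reproduced here.

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(source: str, tests: list[tuple[str, str]], timeout: float = 5.0) -> bool:
    """Run a candidate Python program against (stdin, expected_stdout) pairs."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        for stdin, expected in tests:
            try:
                proc = subprocess.run([sys.executable, path], input=stdin,
                                      capture_output=True, text=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return False
            if proc.returncode != 0 or proc.stdout.strip() != expected.strip():
                return False
        return True
    finally:
        os.unlink(path)

candidate = "a, b = map(int, input().split())\nprint(a + b)\n"
print(passes_tests(candidate, [("1 2", "3"), ("10 -4", "6")]))  # True
```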
ComFact is a benchmark for commonsense fact linking, where models are given contexts and trained to identify situationally relevant commonsense knowledge from knowledge graphs (KGs). The benchmark contains ~293k in-context relevance annotations for commonsense triplets across four stylistically diverse dialogue and storytelling datasets.
5 PAPERS • NO BENCHMARKS YET
ProofNet is a benchmark for autoformalization and formal proving of undergraduate-level mathematics. The benchmark consists of 371 examples, each comprising a formal theorem statement in Lean 3, a natural-language theorem statement, and a natural-language proof. The problems are primarily drawn from popular undergraduate pure mathematics textbooks and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology.
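To make the ProofNet format concrete, here is a toy pairing of a natural-language statement with a Lean 3 formalization; this example is ours and is far easier than the 371 textbook-level benchmark problems.

```lean
-- Natural-language statement: "For all natural numbers a and b,
-- a + b = b + a."  (Toy example; the benchmark's problems come
-- from undergraduate textbooks and are considerably harder.)
theorem my_add_comm (a b : ℕ) : a + b = b + a :=
nat.add_comm a b
```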
DiSCQ is a newly curated question dataset composed of 2,000+ questions paired with the snippets of text (triggers) that prompted each question. The questions are generated by medical experts from 100+ MIMIC-III discharge summaries. This dataset is released to facilitate further research into realistic clinical Question Answering (QA) and Question Generation (QG).
4 PAPERS • NO BENCHMARKS YET
COSIAN is an annotation collection of Japanese popular (J-POP) songs, focusing on the singing styles and expressions of famous solo singers.
2 PAPERS • NO BENCHMARKS YET
RoMQA is a benchmark for robust, multi-evidence, and multi-answer question answering (QA). RoMQA contains clusters of questions that are derived from related constraints mined from the Wikidata knowledge graph. The dataset evaluates robustness of QA models to varying constraints by measuring worst-case performance within each question cluster.
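A small sketch of the worst-case metric RoMQA describes: per-question scores are grouped by cluster, each cluster contributes its minimum, and the benchmark-level number averages these minima. The function and record layout are assumptions, not RoMQA's released evaluation code.

```python
from collections import defaultdict

def worst_case_cluster_score(results: list[dict]) -> float:
    """results: [{'cluster': id, 'score': float}, ...].
    Each cluster contributes its worst (minimum) question score;
    the benchmark-level number averages these minima."""
    by_cluster = defaultdict(list)
    for r in results:
        by_cluster[r["cluster"]].append(r["score"])
    minima = [min(scores) for scores in by_cluster.values()]
    return sum(minima) / len(minima)

results = [
    {"cluster": "c1", "score": 1.0}, {"cluster": "c1", "score": 0.4},
    {"cluster": "c2", "score": 0.9}, {"cluster": "c2", "score": 0.8},
]
print(worst_case_cluster_score(results))  # (0.4 + 0.8) / 2 = 0.6
```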
ShapeTalk contains over half a million discriminative utterances produced by contrasting the shapes of common 3D objects across a variety of object classes and degrees of similarity. The dataset provides discriminative utterances for a total of 36,391 shapes across 30 object classes. Overall, ShapeTalk contains 73,799 distinct contexts and a total of 536,596 utterances.
CoreSearch is a dataset for Cross-Document Event Coreference Search. It consists of two separate passage collections: (1) a collection of passages containing manually annotated coreferring event mentions, and (2) an annotated collection of distractor passages.
1 PAPER • NO BENCHMARKS YET
FanOutQA is a high-quality, multi-hop, multi-document benchmark for large language models that uses English Wikipedia as its knowledge base. Compared to other question-answering benchmarks, FanOutQA requires reasoning over a greater number of documents, with a focus on the titular fan-out style of question. The questions are presented in three tasks -- closed-book, open-book, and evidence-provided -- which measure different abilities of LLM systems, as sketched below.
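A minimal sketch of how the three settings might differ at the prompt level, assuming an illustrative template; the benchmark's actual prompts and harness are defined in its release.

```python
def build_prompt(question: str, mode: str, evidence=None) -> str:
    """Assemble a prompt for one of the three evaluation settings
    (setting names follow the benchmark; the template is illustrative)."""
    if mode == "closed-book":          # model answers from parameters alone
        return f"Question: {question}\nAnswer:"
    if mode == "open-book":            # model may call a retriever or tools
        return f"Question: {question}\nYou may search Wikipedia.\nAnswer:"
    if mode == "evidence-provided":    # gold documents given up front
        docs = "\n\n".join(evidence or [])
        return f"Documents:\n{docs}\n\nQuestion: {question}\nAnswer:"
    raise ValueError(f"unknown mode: {mode}")

print(build_prompt("Which has more employees: the companies founded by "
                   "Bill Gates and Paul Allen, or by Steve Jobs?", "closed-book"))
```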
FewDR is a dataset for few-shot dense retrieval (DR), in which retrievers must generalize to novel search scenarios from only a few examples. FewDR employs class-wise sampling to establish a standardized "few-shot" setting with finely defined classes, reducing variability across multiple sampling rounds.
LLeQA is a native French dataset for studying information retrieval and long-form question answering in the legal domain. It consists of a knowledge corpus of 27,941 statutory articles collected from Belgian legislation, and 1,868 legal questions posed by Belgian citizens and labeled by experienced jurists with a comprehensive answer rooted in relevant articles from the corpus.
PTVD is a plot-oriented multimodal dataset in the TV domain. It is also the first non-English dataset of its kind. Additionally, PTVD contains more than 26 million bullet screen comments (BSCs), powering large-scale pre-training.
PoseScript is a dataset that pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. This dataset is designed for the retrieval of relevant poses from large-scale datasets and synthetic pose generation, both based on a textual pose description.
Spiced is a paraphrase dataset of scientific findings annotated for degree of information change. Spiced contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers.
The StatCan Dialogue Dataset focuses on retrieving data tables through conversations with genuine user intents.
1 PAPER • 1 BENCHMARK