The MSLR-WEB10K dataset consists of 10,000 search queries, each paired with the documents returned in its search results. For every query-document pair, the data provides the values of 136 features and a human-assigned relevance label on a five-point scale (0 to 4). It is a subset of the MSLR-WEB30K dataset.
35 PAPERS • NO BENCHMARKS YET
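As a rough illustration of how a query-document record like those in MSLR-WEB10K could be read, the sketch below parses one line of text. It assumes the LETOR/SVMlight-style layout `label qid:<id> <feature>:<value> ...`; the `RankingRow` type, the `parse_letor_line` helper, and the sample line are hypothetical and not taken from the dataset itself.

```python
from typing import NamedTuple


class RankingRow(NamedTuple):
    """One query-document pair (hypothetical container for illustration)."""
    relevance: int
    query_id: str
    features: dict[int, float]


def parse_letor_line(line: str) -> RankingRow:
    """Parse 'label qid:<id> <idx>:<val> ...' (assumed LETOR-style layout)."""
    # Drop any trailing "# comment" that some LETOR-style files carry.
    body = line.split("#", 1)[0].strip()
    tokens = body.split()
    if len(tokens) < 2:
        raise ValueError(f"too few fields: {line!r}")

    relevance = int(tokens[0])

    # The second token is expected to look like "qid:123".
    key, _, query_id = tokens[1].partition(":")
    if key != "qid" or not query_id:
        raise ValueError(f"missing qid field: {line!r}")

    # Remaining tokens are "<feature index>:<value>" pairs.
    features: dict[int, float] = {}
    for token in tokens[2:]:
        index, _, value = token.partition(":")
        features[int(index)] = float(value)

    return RankingRow(relevance, query_id, features)


# Invented sample line; a real record would list all 136 features.
sample = "2 qid:10 1:3 2:0.5 136:0.25"
row = parse_letor_line(sample)
print(row.relevance, row.query_id, sorted(row.features))
```

Grouping the parsed rows by `query_id` would then give the per-query document lists that a ranking model is trained and evaluated on.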
The MQ2007 dataset consists of queries, corresponding retrieved documents and labels provided by human experts. The possible relevance labels for each document are “relevant”, “partially relevant”, and “not relevant”.
30 PAPERS • NO BENCHMARKS YET
MQ2008 is a Learning to Rank dataset containing 800 queries with labelled documents.
27 PAPERS • NO BENCHMARKS YET
This dataset contains benchmark scores for EQ-Bench, a novel benchmark designed to evaluate aspects of emotional intelligence in Large Language Models (LLMs). We assess the ability of LLMs to understand complex emotions and social interactions by asking them to predict the intensity of emotional states of characters in a dialogue. The benchmark is able to discriminate effectively between a wide range of models. We find that EQ-Bench correlates strongly with comprehensive multi-domain benchmarks like MMLU (Hendrycks et al., 2020) (r=0.97), indicating that we may be capturing similar aspects of broad intelligence. Our benchmark produces highly repeatable results using a set of 60 English-language questions. We also provide open-source code for an automated benchmarking pipeline at https://github.com/EQ-bench/EQ-Bench and a leaderboard at https://www.eqbench.com.
3 PAPERS • 1 BENCHMARK
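The r value quoted in the EQ-Bench description is a correlation coefficient between two sets of per-model scores. The sketch below computes a Pearson correlation as an illustration, on the assumption that r denotes Pearson's coefficient; the `pearson_r` helper and the score lists are made-up placeholders, not EQ-Bench or MMLU results.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")

    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for four imaginary models (NOT real benchmark numbers).
benchmark_a = [30.0, 45.0, 55.0, 70.0]
benchmark_b = [40.0, 52.0, 60.0, 78.0]
print(round(pearson_r(benchmark_a, benchmark_b), 3))
```

A value close to 1 means the two benchmarks order and space the models in nearly the same way.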
WebLINX is a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. It covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios.
2 PAPERS • 1 BENCHMARK
The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. The AQL is the first publicly available query log that combines size, scope, and diversity, enabling research on new retrieval models and search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.
1 PAPER • NO BENCHMARKS YET
About 1M Flickr images from the 20th century, dated from the 1910s to the 1990s. The dataset was introduced by Müller et al. and can be found at https://www.radar-service.eu/radar/en/dataset/tJzxrsYUkvPklBOw
Genre annotations for movies. The file genre2movies.csv contains genre-movie tuples based on Wikidata annotations (https://www.wikidata.org/).
IMDB-WIKI-SbS is a new large-scale dataset for evaluating pairwise comparisons, building on the success of IMDB-WIKI, a well-known benchmark for computer vision systems. This dataset uses the age information offered by IMDB-WIKI as ground truth while providing a balanced distribution of ages and genders of people in photos.
X-Wines is a consistent wine dataset containing 100,646 instances and 21 million real evaluations carried out by users. Data were collected on the open Web in 2022 and pre-processed for wider free use. The evaluations are ratings on a 1–5 scale, made over a period of 10 years (2012–2021), for wines produced in 62 different countries.
0 PAPERS • NO BENCHMARKS YET