Multimodal Emotion Recognition

IEMOCAP

The IEMOCAP dataset consists of 151 videos of recorded dialogues, with 2 speakers per session for a total of 302 videos across the dataset. Each segment is annotated for the presence of 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed and neutral) as well as valence, arousal and dominance. The dataset is recorded across 5 sessions with 5 pairs of speakers.
630 PAPERS • 3 BENCHMARKS
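Each IEMOCAP segment thus carries both a categorical emotion and continuous valence, arousal and dominance ratings. The snippet below is a minimal, hypothetical sketch of how one such annotation record could be represented; the field names, types and example values are assumptions for illustration, not the official release format.

```python
from dataclasses import dataclass

# Categorical labels listed in the description above.
EMOTIONS = {"angry", "excited", "fear", "sad", "surprised",
            "frustrated", "happy", "disappointed", "neutral"}

@dataclass
class IemocapSegment:
    # Hypothetical record layout for one annotated segment.
    session: int        # one of the 5 recording sessions
    speaker: str        # one of the two speakers in the session
    emotion: str        # categorical label from EMOTIONS
    valence: float      # continuous dimensional ratings
    arousal: float
    dominance: float

seg = IemocapSegment(session=1, speaker="F", emotion="frustrated",
                     valence=2.0, arousal=3.5, dominance=2.5)
assert seg.emotion in EMOTIONS
```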
DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 dialogues and validation and test sets with 1000 dialogues each. On average there are around 8 speaker turns per dialogue with around 15 tokens per turn.
367 PAPERS • 2 BENCHMARKS
Multimodal EmotionLines Dataset (MELD) was created by enhancing and extending the EmotionLines dataset. MELD contains the same dialogue instances available in EmotionLines, but it also encompasses the audio and visual modalities along with text. MELD has more than 1,400 dialogues and 13,000 utterances from the Friends TV series, with multiple speakers participating in the dialogues. Each utterance in a dialogue has been labeled with one of seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise, and Fear. MELD also has a sentiment (positive, negative, or neutral) annotation for each utterance.
208 PAPERS • 2 BENCHMARKS
EmoryNLP comprises 97 episodes, 897 scenes, and 12,606 utterances, where each utterance is annotated with one of seven emotions: the six primary emotions from Willcox's (1982) feeling wheel (sad, mad, scared, powerful, peaceful, joyful) plus a default emotion of neutral.
52 PAPERS • 1 BENCHMARK
The SEMAINE video dataset contains spontaneous data capturing the audiovisual interaction between a human and an operator undertaking the role of an agent, a Sensitive Artificial Listener (SAL) avatar with one of four personalities: Poppy (happy), Obadiah (gloomy), Spike (angry) and Prudence (pragmatic). The audiovisual sequences were recorded at a video rate of 25 fps (352 x 288 pixels). SEMAINE video clips have been annotated with epistemic states such as agreement, interested, certain, concentration, and thoughtful, each with a continuous rating in the range [-1, 1], where -1 indicates the most negative rating (e.g., no concentration at all) and +1 the highest (most concentration). Twenty-four recording sessions are used in the Solid SAL scenario, with recordings made of both the user and the operator; each session usually contains four character interactions, one with each character.
EmoContext consists of three-turn English Tweets. The emotion labels include happiness, sadness, anger and other.
42 PAPERS • 1 BENCHMARK
EmotionLines contains a total of 29,245 labeled utterances from 2,000 dialogues. Each utterance is labeled with one of seven emotions: Ekman's six basic emotions plus neutral. Each utterance was labeled by 5 workers, and the emotion category with the most votes was set as the label of the utterance. Utterances voted as more than two different emotions were put into the non-neutral category. The dataset therefore has 8 emotion labels in total: anger, disgust, fear, happiness, sadness, surprise, neutral, and non-neutral.
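As one concrete reading of this aggregation rule, the sketch below maps five annotator votes to a final label: the most frequent emotion wins, and votes spread over more than two different emotions fall back to non-neutral. The function name and tie-breaking behaviour are assumptions for illustration; the dataset authors' exact procedure may differ.

```python
from collections import Counter

def aggregate_votes(votes):
    """Hypothetical EmotionLines-style label aggregation for one utterance.

    votes: list of emotion strings, one per annotator (5 in EmotionLines).
    """
    counts = Counter(votes)
    # Votes spread over more than two different emotions -> non-neutral.
    if len(counts) > 2:
        return "non-neutral"
    # Otherwise take the emotion with the most votes.
    return counts.most_common(1)[0][0]

print(aggregate_votes(["happiness", "happiness", "happiness", "neutral", "happiness"]))  # happiness
print(aggregate_votes(["anger", "fear", "sadness", "neutral", "surprise"]))              # non-neutral
```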
CPED is a dataset constructed from 40 Chinese TV shows. It consists of multi-source knowledge related to empathy and personal characteristics, covering 13 emotions, gender, Big Five personality traits, 19 dialogue acts, and other annotations.
15 PAPERS • 3 BENCHMARKS
The Multimodal Corpus of Sentiment Intensity (CMU-MOSI) dataset is a collection of 2,199 opinion video clips. Each opinion video is annotated with sentiment in the range [-3, 3]. The dataset is rigorously annotated with labels for subjectivity, sentiment intensity, per-frame and per-opinion visual features, and per-millisecond audio features.
9 PAPERS • 2 BENCHMARKS
EmoWOZ is the first large-scale open-source dataset for emotion recognition in task-oriented dialogues. It contains emotion annotations for user utterances in the entire MultiWOZ dataset (10k+ human-human dialogues) and in DialMAGE (1k human-machine dialogues collected in a human trial). Overall, 83k user utterances are annotated. In addition, the emotion annotation scheme is tailored to task-oriented dialogues and considers the valence, the elicitor, and the conduct of the user emotion.
8 PAPERS • 1 BENCHMARK
Emotional Dialogue Acts (EDA) data contains dialogue act labels for two popular multimodal emotion conversation datasets: the Multimodal EmotionLines Dataset (MELD) and the Interactive Emotional dyadic MOtion CAPture database (IEMOCAP). EDAs reveal associations between dialogue acts and emotional states in natural conversational language; for example, Accept/Agree dialogue acts often occur with the Joy emotion, Apology with Sadness, and Thanking with Joy.
3 PAPERS • NO BENCHMARKS YET
The CANDOR corpus is a large, novel, multimodal corpus of 1,656 recorded conversations in spoken English. This 7+ million word, 850-hour corpus totals over 1 TB of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression, along with an extensive survey of speakers' post-conversation reflections.
1 PAPER • NO BENCHMARKS YET
KD-EmoR is a socio-behavioral emotion dataset for emotion recognition in realistic conversation scenarios. It consists of a total of 12,289 sentences from 1,513 scenes of a Korean TV show named 'Three Brothers'. The dataset is split into training and test sets. Each sample consists of sentence_id, person (speaker), sentence, scene_ID, and context (scene description), labeled with one of the following complex emotion labels: euphoria, dysphoria, or neutral. The dataset can be used to study emotion recognition in Korean conversations.
1 PAPER • 1 BENCHMARK
The E-MASAC Dataset is a collection of code-mixed conversations sourced from an Indian TV series, focusing on Hindi-English interactions. It was derived from the MASAC dataset and specifically annotated for Emotion Recognition in Conversations (ERC) tasks. The dataset comprises 8,607 dialogues with 11,440 utterances, containing instances of sarcasm and humor. Emotions such as anger, fear, joy, sadness, surprise, contempt, and neutral are annotated for each utterance by three fluent English and Hindi-speaking linguists, ensuring a high inter-annotator agreement of 0.85.