Volume 11 - Issue 1
A Bimodal Approach for Speech Emotion Recognition using Audio and Text
- Oxana Verkholyak
St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia, ITMO University, Kronverkskiy Prospekt, 49, St Petersburg, Russia, Ulm University, Helmholtzstraße 16, 89081 Ulm, Germany
overkholyak@gmail.com
- Anastasia Dvoynikova
St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia, ITMO University, Kronverkskiy Prospekt, 49, St Petersburg, Russia
dvoynikova.a@iias.spb.su
- Alexey Karpov
St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia, ITMO University, Kronverkskiy Prospekt, 49, St Petersburg, Russia
karpov@iias.spb.su
Keywords: Computational paralinguistics, Speech emotion recognition, Sentiment analysis, Bimodal fusion, Annotation agreement
Abstract
This paper presents a novel bimodal speech emotion recognition system based on analysis of acoustic
and linguistic information. We propose a decision-level fusion strategy that leverages both
emotions and sentiments extracted from audio and text transcriptions of extemporaneous speech utterances.
We perform an experimental study to demonstrate the effectiveness of the proposed methods on the
emotional speech database RAMAS, reporting classification results for 7 emotional states (happy, surprised,
angry, sad, scared, disgusted, neutral) and 3 sentiment categories (positive, negative, neutral).
We compare the relative performance of unimodal and bimodal systems, analyze their effectiveness at
different levels of annotation agreement, and discuss the effect of reducing the training data size on
the overall performance of the systems. We also provide important insights into the contribution of each
modality to the best performance in emotion classification, which reaches UAR=72.01%
at the highest (5th) level of annotation agreement.
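To illustrate the general idea of decision-level (late) fusion mentioned in the abstract, the sketch below combines per-class posteriors from a hypothetical audio classifier and a hypothetical text classifier by weighted averaging. This is a minimal illustrative example, not the authors' actual fusion strategy; the function name, weighting scheme, and probability values are assumptions for demonstration only.

```python
import numpy as np

# The 7 emotional states considered in the paper.
EMOTIONS = ["happy", "surprised", "angry", "sad", "scared", "disgusted", "neutral"]

def fuse_decisions(p_audio, p_text, w_audio=0.5):
    """Hypothetical late fusion: weighted average of the per-class
    posterior probabilities produced by each unimodal classifier."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_text = np.asarray(p_text, dtype=float)
    fused = w_audio * p_audio + (1.0 - w_audio) * p_text
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: the audio model leans toward "angry", the text model toward
# "sad"; the fused decision arbitrates between them.
label, scores = fuse_decisions(
    [0.05, 0.05, 0.40, 0.30, 0.10, 0.05, 0.05],  # assumed audio posteriors
    [0.05, 0.05, 0.25, 0.45, 0.10, 0.05, 0.05],  # assumed text posteriors
)
```

In an actual system the fusion weights would typically be tuned on a development set rather than fixed at 0.5.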