TY - GEN
T1 - Modeling naturalistic affective states via facial and vocal expressions recognition
AU - Caridakis, George
AU - Malatesta, Lori
AU - Kessous, Loic
AU - Amir, Noam
AU - Raouzaiou, Amaryllis
AU - Karpouzis, Kostas
PY - 2006
Y1 - 2006
N2 - Affective and human-centered computing are two HCI-related areas that have attracted considerable attention in recent years. One reason is the plethora of devices able to record and process multimodal user input and adapt their functionality to individual preferences and habits, thus enhancing usability and appealing to users less accustomed to conventional interfaces. In the quest to obtain user feedback unobtrusively, the visual and auditory modalities allow us to infer a user's emotional state by combining information from facial expression recognition and speech prosody feature extraction. In this paper, we describe a multi-cue, dynamic approach to emotion recognition in naturalistic video sequences. In contrast to strictly controlled recording conditions for audiovisual material, the current research focuses on sequences taken from nearly real-world situations. Recognition is performed via a 'Simple Recurrent Network', which lends itself well to modeling dynamic events in both the user's facial expressions and speech. Moreover, this approach differs from existing work in that it models user expressivity using a dimensional representation of activation and valence, instead of detecting the usual 'universal emotions', which are scarce in everyday human-machine interaction. The algorithm is deployed on an audiovisual database that was recorded to simulate human-human discourse and therefore contains less extreme expressivity and subtle variations of a number of emotion labels.
AB - Affective and human-centered computing are two HCI-related areas that have attracted considerable attention in recent years. One reason is the plethora of devices able to record and process multimodal user input and adapt their functionality to individual preferences and habits, thus enhancing usability and appealing to users less accustomed to conventional interfaces. In the quest to obtain user feedback unobtrusively, the visual and auditory modalities allow us to infer a user's emotional state by combining information from facial expression recognition and speech prosody feature extraction. In this paper, we describe a multi-cue, dynamic approach to emotion recognition in naturalistic video sequences. In contrast to strictly controlled recording conditions for audiovisual material, the current research focuses on sequences taken from nearly real-world situations. Recognition is performed via a 'Simple Recurrent Network', which lends itself well to modeling dynamic events in both the user's facial expressions and speech. Moreover, this approach differs from existing work in that it models user expressivity using a dimensional representation of activation and valence, instead of detecting the usual 'universal emotions', which are scarce in everyday human-machine interaction. The algorithm is deployed on an audiovisual database that was recorded to simulate human-human discourse and therefore contains less extreme expressivity and subtle variations of a number of emotion labels.
KW - Affective interaction
KW - Facial expression recognition
KW - Image processing
KW - Multimodal analysis
KW - Naturalistic data
KW - Prosodic feature extraction
KW - User modeling
UR - http://www.scopus.com/inward/record.url?scp=34547153664&partnerID=8YFLogxK
U2 - 10.1145/1180995.1181029
DO - 10.1145/1180995.1181029
M3 - Conference contribution
AN - SCOPUS:34547153664
SN - 159593541X
SN - 9781595935410
T3 - ICMI'06: 8th International Conference on Multimodal Interfaces, Conference Proceeding
SP - 146
EP - 154
BT - ICMI'06
Y2 - 2 November 2006 through 4 November 2006
ER -