THE IMPACT OF F0 EXTRACTION ERRORS ON THE CLASSIFICATION OF PROMINENCE AND EMOTION

Anton Batliner1, Stefan Steidl1, Björn Schuller2, Dino Seppi3, Thurid Vogt4, Laurence Devillers5, Laurence Vidrascu5, Noam Amir6, Loic Kessous6 & Vered Aharonson7
1Lehrstuhl für Mustererkennung, Friedrich-Alexander-Universität Erlangen-Nürnberg; 2Institute for Human-Machine-Communication, Technische Universität München; 3ITC-IRST; 4Multimedia Concepts and their Applications, University of Augsburg; 5LIMSI-CNRS; 6Dep. of Communication Disorders, Sackler Faculty of Medicine, Tel Aviv University; 7Tel Aviv Academic College of Engineering, Tel Aviv

ID 1168

Traditionally, pitch has been assumed to be the most important prosodic feature for marking prominence and other phenomena such as boundaries or emotions. Recent studies have called this role into question. Because larger databases are nowadays processed automatically, it is not clear to what extent the possibly lower relevance of pitch can be attributed to extraction errors or to other factors. We present some ideas on a phenomenological difference between pitch and duration, and compare the performance of automatically extracted F0 values with that of manually corrected F0 values for the automatic recognition of prominence and emotion in spontaneous speech (children giving commands to a pet robot). The difference in classification performance between corrected and automatically extracted pitch features turns out to be consistent but not very pronounced.