Mark Hasegawa-Johnson (1), Karen Livescu (2), Partha Lal (3) & Kate Saenko (2)
(1) University of Illinois at Urbana-Champaign; (2) Massachusetts Institute of Technology; (3) University of Edinburgh

ID 1719

Speech recognition, by both humans and machines, benefits from visual observation of the face. It has often been noticed, however, that the audible and visible correlates of a phoneme may be asynchronous; perhaps for this reason, automatic speech recognizers that allow asynchrony between the audible phoneme and the visible viseme outperform recognizers that allow none. This paper proposes a new explanation for audiovisual asynchrony and tests it using experimental speech recognition systems. We propose that audiovisual asynchrony may result from asynchrony between the gestures implemented by different articulators. The proposed model is tested by implementing an "articulatory-feature model" audiovisual speech recognizer with multiple hidden state variables, each representing the gestures of one articulator. The proposed system performs as well as a standard audiovisual recognizer on a digit recognition task; the best results are achieved by combining the outputs of the two systems.
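The key structural idea — a factored hidden state with one stream per articulator, where streams may drift apart by a bounded amount — can be illustrated with a minimal sketch. This is not the authors' code; the function name, the two-stream setup, and the `max_async` bound are illustrative assumptions. It only shows how bounding inter-stream asynchrony shapes the joint state space that such a recognizer would search.

```python
# Hypothetical sketch (not the paper's implementation): enumerate the joint
# states of a factored-state recognizer in which each articulator stream
# advances through the same sequence of gesture indices, but the streams may
# be out of step by at most `max_async` positions.
from itertools import product

def joint_states(n_gestures, n_streams, max_async):
    """Return all joint states (one gesture index per stream) whose
    most-advanced and least-advanced streams differ by <= max_async."""
    states = []
    for combo in product(range(n_gestures), repeat=n_streams):
        if max(combo) - min(combo) <= max_async:
            states.append(combo)
    return states

# With max_async=0 all streams share one index (a standard synchronous HMM);
# allowing one step of asynchrony enlarges the joint state space.
sync = joint_states(5, 2, 0)    # 5 joint states: (0,0) ... (4,4)
loose = joint_states(5, 2, 1)   # 13 joint states, e.g. (2,3) is now legal
print(len(sync), len(loose))
```

A real articulatory-feature recognizer would attach observation models and transition probabilities to these joint states; the sketch shows only the asynchrony constraint that distinguishes the factored model from a synchronous one.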