Audio-Visual Processing

Fig. 1: Block diagram of Audio-Visual system for Automatic Speech Recognition

In our research we develop technology to explore and capitalize on the correlation between speech and video data. Multimodal signal processing is more than simply "putting together" text, audio, images, and video; it is the integration of and interaction among these different media that creates new systems, new research challenges, and new opportunities. Unimodal signal analysis delivers acceptable performance only under benign conditions; performance degrades rapidly when conditions are not ideal.

In multimodal communication involving human speech, audio-visual interaction is particularly significant. Human speech perception is bimodal: the perception of acoustic speech can be affected by visual cues from lip movement. Because of this bimodality, audio-visual interaction is an important design factor for multimodal communication systems. A prime example of this interaction is lip reading, also called speech reading. Hearing-impaired people use it to improve their understanding of speech, and people with normal hearing also rely on it to some extent, especially in noisy environments.

We are also very excited about our new collaboration with Professor Karen Livescu of the Toyota Technological Institute at Chicago, a specialist in dynamic Bayesian networks, linguistics, and speech recognition. We have explored, and continue to explore, many issues in this diverse area, including the following (with selected publications):

  • Audio-Visual Biometrics
  • Audio-Visual Fusion Techniques for Various Applications
  • Dynamic Bayesian Networks for Audio-Visual Automatic Speech Recognition
  • Dynamic Stream Weighting Based on Audio and Visual Reliability Metrics
  • Facial Expression Recognition
  • Robust Audio-Visual Continuous Speech
  • Robust Facial Tracking
  • Speech-to-Video Synthesis
  • Video-Only Speech Recognition
  • Visual Feature Extraction

Demos and presentations related to our work are also available for download: