Tutorial 6: Reverberant Speech Processing for Human Communication and Automatic Speech Recognition

Presented by

Tomohiro Nakatani, Armin Sehr, and Walter Kellermann

Abstract

This tutorial will focus on recent advances in techniques for handling reverberant speech directed at both humans and computers, including speech dereverberation and automatic speech recognition (ASR) in reverberant environments. When a speech signal is captured by a distant microphone in a room, the reverberation components degrade the quality of the observed speech and often cause serious performance losses in applications such as voice recording, teleconferencing, IPTV, hearing aids, human-computer dialog systems, and meeting recognition. To make such applications viable in real acoustic environments, theory and methodologies for processing reverberant speech have been studied extensively in recent years, and promising new approaches have emerged for both speech enhancement and ASR.

In the tutorial, we will provide both an introductory review of the fundamentals of the field and an overview of promising generic emerging technologies. After explaining the impact of reverberation on distant-microphone speech communication and highlighting the fundamental differences between reverberation and additive interference, we will describe the state of the art in signal dereverberation and reverberation-robust ASR. Two emerging approaches will then be discussed: multi-channel linear prediction-based blind speech dereverberation, and robust ASR based on feature-domain reverberation models. The former not only allows blind estimation of an inverse filter that cancels the reverberation effect from the reverberant speech signal, but can also be integrated with other microphone array signal processing techniques, such as noise reduction and blind source separation. The latter is a class of very flexible and powerful parametric schemes for adapting the acoustic model of an ASR system to reverberant environments, leading to significant improvements in recognition rate. Both technologies are generic and ready to be evaluated in real speech applications, so the tutorial aims to give participants a clear view of how they can be used in many distant-talking speech applications. In addition, live signal processing demonstrations, including the joint realization of speech dereverberation and source separation, will be presented in the tutorial.
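The linear-prediction principle behind blind dereverberation can be illustrated with a minimal single-channel, time-domain sketch: late reverberation in the observed signal is predicted from samples lying at least a few taps in the past and then subtracted. The function name and parameters below are illustrative assumptions, not the tutorial's actual algorithm; practical methods work multi-channel in the short-time Fourier domain with a time-varying speech variance model so that the speech itself is not whitened.

```python
import numpy as np

def delayed_linear_prediction_dereverb(x, order=20, delay=3):
    """Illustrative delayed linear prediction dereverberation.

    Predicts the late reverberation in x[t] from the samples
    x[t-delay], ..., x[t-delay-order+1] by least squares and
    subtracts the prediction, leaving the direct sound (plus
    early reflections within the prediction delay) intact.
    """
    n = len(x)
    # Delayed data matrix: row t holds x[t-delay], ..., x[t-delay-order+1]
    X = np.zeros((n, order))
    for k in range(order):
        lag = delay + k
        X[lag:, k] = x[:n - lag]
    # Least-squares prediction coefficients for the late reverberation
    g, *_ = np.linalg.lstsq(X, x, rcond=None)
    # Dereverberated signal = observation minus predicted late reverberation
    return x - X @ g
```

Because the prediction starts `delay` samples in the past, the direct sound is excluded from the predictor and survives the subtraction; only the slowly decaying tail, which is linearly predictable from the delayed past, is removed.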

The tutorial will be composed of the following items:

  1. Introduction
    1. Selected applications of distant-microphone speech interfaces
    2. Problem description
    3. Approaches
  2. Microphone array signal processing in reverberant environments
    1. Inverse filtering and its problems
    2. Statistical model-based speech dereverberation
    3. Integration with other signal processing technologies
    4. Introduction of real applications
  3. Robust automatic speech recognition in reverberant environments
    1. Feature-based approaches
    2. Model-based approaches
    3. Decoder-based approaches
    4. REMOS
  4. Summary, conclusions and outlook
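
A feature-domain reverberation model of the kind used by model-based approaches such as REMOS can be sketched as follows: in the mel-spectral magnitude domain, reverberation is approximated as a convolution of the clean feature sequence with a frame-level reverberation energy envelope. The function and parameter names are our own illustrative assumptions; in REMOS itself the envelope is modeled statistically rather than as a fixed vector.

```python
import numpy as np

def reverberate_features(clean, rir_env):
    """Feature-domain reverberation model (illustrative sketch).

    clean:   (T, D) array of clean features, e.g. mel-spectral magnitudes
    rir_env: (K,) array of per-frame reverberation energy weights
             (a hypothetical fixed envelope; statistical in practice)

    Returns the reverberant feature sequence, where frame t is the
    weighted sum of the clean frames t, t-1, ..., t-K+1.
    """
    T = clean.shape[0]
    rev = np.zeros_like(clean, dtype=float)
    for k, w in enumerate(rir_env):
        rev[k:] += w * clean[:T - k]   # frame t receives w * clean[t - k]
    return rev
```

Such a model lets an ASR decoder account for the fact that a reverberant feature frame depends on several preceding clean frames, rather than treating reverberation as frame-independent additive noise.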

Each part of the tutorial will be presented by one of the presenters as follows:

Part 1: Walter Kellermann
Part 2: Tomohiro Nakatani
Part 3: Armin Sehr
Part 4: Walter Kellermann

Speaker Biographies

Tomohiro Nakatani (M'03, SM'06) is a senior research scientist (supervisor) at NTT Communication Science Labs, NTT Corporation, Japan. He received his BE, ME, and PhD degrees from Kyoto University, Japan, in 1989, 1991, and 2002, respectively. Since joining NTT Corporation as a researcher in 1991, he has been investigating speech enhancement technologies for intelligent human-machine interfaces. In 2005, he spent a year as a Visiting Scholar at the Georgia Institute of Technology, where he worked with Prof. Biing-Hwang Juang. Since 2008, he has been a Visiting Assistant Professor in the Department of Media Science, Nagoya University. He received the 1997 JSAI Conference Best Paper Award, the 2002 ASJ Poster Award, the 2005 IEICE Best Paper Award, and the 2009 ASJ Technical Development Award. He has been a member of the IEEE SP Audio and Acoustics Technical Committee since 2009, a member of the IEEE CAS Blind Signal Processing Technical Committee since 2007, and chair of the IEEE Kansai Section Technical Program Committee since 2011. He served as an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing from 2008 to 2010, and as Technical Program Chair of IEEE WASPAA-2007. He is a senior member of IEEE and a member of IEICE and ASJ.

Armin Sehr (M'11) is a research associate at the Chair of Multimedia Communications and Signal Processing at the University of Erlangen-Nuremberg. He received the Dipl.-Ing. (FH) degree from the University of Applied Sciences Regensburg, Germany, in 1998, and the Dr.-Ing. and M.Sc. degrees from the University of Erlangen-Nuremberg, Germany, in 2009 and 2010, respectively. From 1998 to 2003, he worked as a senior algorithm designer at Ericsson Eurolab in Nuremberg, Germany, on various projects in speech coding, speech enhancement, and mobile communications. Since joining the Chair of Multimedia Communications and Signal Processing in 2003, he has been working on robust distant-talking speech recognition in reverberant environments. From October 2010 to February 2011, he was a visiting researcher in the Signal Processing Group of NTT Communication Science Labs in Kyoto, Japan, working with Dr. Tomohiro Nakatani on new methods for combining speech enhancement and speech recognition systems.

Walter Kellermann (S'86, M'90, SM'06, F'08) is a professor for communications at the Chair of Multimedia Communications and Signal Processing of the University of Erlangen-Nuremberg, Germany. He received the Dipl.-Ing. (univ.) degree in Electrical Engineering from the University of Erlangen-Nuremberg in 1983, and the Dr.-Ing. degree (with distinction) from the Technical University Darmstadt, Germany, in 1988. From 1989 to 1990, he was a Postdoctoral Member of Technical Staff at AT&T Bell Laboratories, Murray Hill, NJ. In 1990, he joined Philips Kommunikations Industrie, Nuremberg, Germany. From 1993 to 1999, he was a professor at the Fachhochschule Regensburg before joining the University of Erlangen-Nuremberg as a professor and head of the audio research laboratory in 1999. In 1999, he co-founded the consulting firm DSP Solutions. Dr. Kellermann has authored or co-authored 16 book chapters and more than 180 refereed papers in journals and conference proceedings. He has served as an associate editor and guest editor for various journals, e.g., for the IEEE Transactions on Speech and Audio Processing from 2000 to 2004, and for the EURASIP Journal on Signal Processing. Presently, he serves as an associate editor of the EURASIP Journal on Advances in Signal Processing and as a member of the Overview Editorial Board of the IEEE Signal Processing Society. He was the general chair of the 5th International Workshop on Microphone Arrays in 2003 and of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics in 2005, and the general co-chair of the 2nd and 3rd International Workshops on Hands-free Speech Communication and Microphone Arrays (HSCMA) in 2008 and 2011, respectively. He was a Distinguished Lecturer of the IEEE Signal Processing Society for 2007 and 2008, and served as chair of the Technical Committee for Audio and Acoustic Signal Processing of the IEEE Signal Processing Society from 2008 to 2010.
He received the Julius-von-Haast Fellowship Award of the Royal Society of New Zealand in 2012. He is an IEEE Fellow.