Tutorial 2: Speech Modeling and Enhancement Using Diffusion Maps

Presented by

Israel Cohen, Sharon Gannot, and Ronen Talmon

Abstract

Enhancement of speech signals is of great interest in many applications, ranging from speech recognition to hearing aids and hands-free mobile communication. Although this problem has attracted significant research efforts for several decades, many aspects remain open and require further research.

In recent years, there has been a growing effort to develop data analysis methods based on the geometry of the acquired raw data. These modern methods, which are termed kernel methods or manifold learning, aim at discovering the underlying structure in data sets as a precursor to other types of processing. These data analysis approaches have recently become very popular in machine learning and data mining applications. However, their usefulness in signal processing has not yet been fully discovered.

In this tutorial, we introduce a novel approach in signal processing and analysis. We present innovative ways to exploit the geometry of signals using applied harmonic analysis and manifold learning methods. In particular, we explore compact representations of signals, conveying their actual degrees of freedom and intrinsic geometric structure. Finding such representations enables the development of efficient and improved processing algorithms. Our goal is to provide the audience with a comprehensive review of an existing manifold learning approach, and in parallel to introduce its usefulness in audio and speech processing.

We begin by revisiting classical approaches in speech enhancement. We formulate the problem of spectral enhancement, address the time-frequency correlation of spectral coefficients for speech and noise signals, and present statistical models.

Following the introduction, we review the geometric approach for data analysis and processing. In particular, we describe in detail diffusion maps, a data analysis method for structural multiscale geometric organization of raw data and for dimensionality reduction. We introduce the fundamental concepts and ideas, and describe the formulation and analysis.
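As a rough illustration of the kind of construction diffusion maps involve, the following is a minimal generic sketch and not the presenters' implementation. It assumes a Gaussian kernel with a hand-picked scale `epsilon`; the function name `diffusion_map`, its parameters, and the toy two-cluster data are all hypothetical choices made for this example. The sketch builds a kernel matrix from pairwise distances, normalizes it into a Markov (diffusion) matrix, and uses the leading non-trivial eigenvectors, scaled by their eigenvalues, as low-dimensional coordinates.

```python
import numpy as np


def diffusion_map(X, epsilon, n_components=2, t=1):
    """Basic diffusion-map embedding of the rows of X (illustrative sketch).

    X            : (n_samples, n_features) array of data points
    epsilon      : Gaussian kernel scale (assumed chosen by the user)
    n_components : number of non-trivial diffusion coordinates to keep
    t            : integer diffusion time (number of Markov steps)
    """
    # Pairwise squared Euclidean distances, clipped to avoid tiny negatives.
    sq_norms = np.sum(X ** 2, axis=1)
    dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    dist2 = np.maximum(dist2, 0.0)

    # Gaussian affinity kernel and its row sums (degrees).
    K = np.exp(-dist2 / epsilon)
    d = K.sum(axis=1)

    # Symmetric matrix S = D^{-1/2} K D^{-1/2}; it has the same eigenvalues
    # as the Markov matrix P = D^{-1} K, but allows a symmetric eigensolver.
    S = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(S)

    # Sort eigenpairs in descending order of eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]

    # Right eigenvectors of P are D^{-1/2} times the eigenvectors of S.
    psi = eigvecs / np.sqrt(d)[:, None]

    # Drop the trivial pair (eigenvalue 1, constant eigenvector) and scale
    # the remaining eigenvectors by eigenvalue**t.
    keep = slice(1, n_components + 1)
    return eigvals[keep], psi[:, keep] * eigvals[keep] ** t


# Toy data (hypothetical): two well-separated clusters on a line.
cluster = np.linspace(-0.3, 0.3, 10)
X = np.concatenate([cluster, 3.0 + cluster])[:, None]
eigvals, coords = diffusion_map(X, epsilon=1.0, n_components=2)
```

In this toy example, the first diffusion coordinate `coords[:, 0]` takes opposite signs on the two clusters, which is the sense in which the embedding organizes the data according to its geometry. In practice the kernel scale and the number of retained coordinates have to be chosen for the data at hand.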

In this tutorial, a special focus is given to the problem of transient interference suppression. The widespread assumption of stationary noise poses a major limitation on traditional speech enhancement algorithms. In particular, it makes them inadequate in transient interference environments, as transients are usually characterized by a sudden burst of sound. We circumvent this assumption by learning the geometric structure of the transient interference using manifold learning and nonlocal diffusion filtering. We describe statistical approaches, address the transient structure, and propose a corresponding model. Special attention is given to a novel nonlocal filtering approach, which enables the acquired geometric structure to be incorporated efficiently into a processing framework. We review the main concept and provide a probabilistic analysis and a diffusion interpretation of the nonlocal filtering.

Another problem addressed in this tutorial is the modeling of natural and artificial systems. This problem plays a key role in audio and speech processing applications and has long attracted considerable research effort. Traditionally, a predefined model is developed for every type of task or system, and the model parameters are then estimated from observations. We present a completely different approach. Without assuming any specific model, we identify the degrees of freedom of the system and its modes of variability. This approach provides a generic data-driven method for a wide variety of system types. We describe a generic algorithm for parameterization of linear systems using diffusion geometry. The algorithm is based on recent developments of spectral and nonlinear independent component analysis techniques, anisotropic kernels, and classical results from statistical signal processing and Fourier analysis. A given system can be viewed as a black box controlled by several independent parameters. By recovering these parameters, we reveal the actual degrees of freedom of the system and obtain an intrinsic model of it. These attractive features are extremely useful for system design, control and calibration.

We aim to communicate that the key idea is to combine classical statistical estimation with modern data-driven manifold learning. We show that capturing the geometric structure of the signals enriches the a-priori assumed statistical model and enables good performance. The obtained experimental results indicate that such a combination may be more powerful than strictly model-driven approaches.

Speaker Biography

Israel Cohen is an Associate Professor in the Department of Electrical Engineering at the Technion - Israel Institute of Technology. He received the B.Sc. (Summa Cum Laude), M.Sc. and Ph.D. degrees in electrical engineering from the Technion in 1990, 1993 and 1998, respectively.

From 1990 to 1998, he was a Research Scientist with RAFAEL Research Laboratories, Haifa, Israel Ministry of Defense. From 1998 to 2001, he was a Postdoctoral Research Associate with the Computer Science Department, Yale University, New Haven, CT. In 2001 he joined the Electrical Engineering Department of the Technion.

His research interests are statistical signal processing, analysis and modeling of acoustic signals, speech enhancement, noise estimation, microphone arrays, source localization, blind source separation, system identification and adaptive filtering.

He is a coeditor of the Multichannel Speech Processing section of the Springer Handbook of Speech Processing (Springer, 2008), a coauthor of Noise Reduction in Speech Processing (Springer, 2009), a coeditor of Speech Processing in Modern Communication: Challenges and Perspectives (Springer, 2010), and a general co-chair of the 2010 International Workshop on Acoustic Echo and Noise Control (IWAENC).

Dr. Cohen is a recipient of the Alexander Goldberg Prize for Excellence in Research, and the Muriel and David Jacknow award for Excellence in Teaching. He served as Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing and IEEE Signal Processing Letters, and as Guest Editor of a special issue of the EURASIP Journal on Advances in Signal Processing on Advances in Multi-microphone Speech Processing and a special issue of the Elsevier Speech Communication Journal on Speech Enhancement.

Sharon Gannot is an Associate Professor in the School of Engineering at Bar-Ilan University, Israel. He received his B.Sc. degree (summa cum laude) from the Technion - Israel Institute of Technology in 1986, and the M.Sc. (cum laude) and Ph.D. degrees from Tel-Aviv University, Israel, in 1995 and 2000, respectively, all in electrical engineering.

In 2001, he held a postdoctoral position at the Department of Electrical Engineering (SISTA) at K.U.Leuven, Belgium. From 2002 to 2003, he held a research and teaching position at the Faculty of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel.

Dr. Gannot is the recipient of the Bar-Ilan University Outstanding Lecturer Award for the year 2010. He is a coeditor of the Speech Enhancement section of the Springer Handbook of Speech Processing (Springer, 2008), and a coeditor of Speech Processing in Modern Communication: Challenges and Perspectives (Springer, 2010). Dr. Gannot serves as Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing, and is a member of the IEEE Audio and Acoustic Signal Processing (AASP) Technical Committee. From 2003 to 2011, he served as an Associate Editor of the EURASIP Journal on Advances in Signal Processing (JASP). He was also an editor of two special issues on Multi-microphone Speech Processing for JASP, and a guest editor of the Elsevier Speech Communication Journal. He has been a member of the Technical and Steering Committee of the International Workshop on Acoustic Echo and Noise Control (IWAENC) since 2005 and was the general co-chair of IWAENC 2010, held in Tel-Aviv, Israel. Dr. Gannot will serve as the general co-chair of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) in 2013. His research interests include parameter estimation, statistical signal processing, array processing, and speech processing with applications in enhancement, separation, beamforming, dereverberation and localization.

Ronen Talmon is a Gibbs Assistant Professor in the Mathematics Department at Yale University, CT. He received the B.A. degree (cum laude) in mathematics and computer science from the Open University, Ra'anana, Israel, in 2005, and the Ph.D. degree in electrical engineering from the Technion - Israel Institute of Technology, Haifa, Israel, in 2011.

From 2000 to 2005, he was a software developer and researcher at a technological unit of the Israeli Defense Forces. Since 2005, he has been a Teaching Assistant and a Project Supervisor with the Signal and Image Processing Lab (SIPL), Electrical Engineering Department, Technion.

Dr. Talmon is the recipient of the Irwin and Joan Jacobs Fellowship for the year 2011, the Distinguished Project Supervisor Award for the year 2010, and the Excellence in Teaching Award for outstanding teaching assistants for the year 2008. His research interests are statistical signal processing, speech enhancement, system identification, harmonic analysis, and geometric methods for data analysis.