Tutorial 1: Beyond Words in Speech Processing: Speaker States, Traits and Vocal Behavior

Presented by

Björn Schuller and Florian Metze


Tomorrow's speech-based Human-Machine Interaction systems will not only interpret a word string, and Multimedia Retrieval systems will allow more than keyword-based retrieval: speech carries much more information in how we say something, who says it, whom we address, and so on, and this contextual information needs to be picked up by future intelligent machines. Classification and integration of speaker state and trait information and vocal behavior will be essential in order to learn more about a person, (implicitly) adapt to the user, and eventually to really understand what they say - and mean.

These cues comprise long-term, immutable traits such as biological trait primitives (e.g., height, weight, age, gender), group/ethnicity membership (race/culture/social class with fuzzy borders towards other linguistic concepts, such as singing register, dialect, or nativeness), and personality traits (likeability and personality in general). Medium-term traits include more or less temporary states (e.g., sleepiness, medical and alcohol intoxication, health state, mood such as depression) and structural signals (role in groups, friendship and identity, positive/negative attitude), which influence behavior, interaction, etc. Finally, there are short-term states comprising mode (speaking style and voice quality), emotions, and related states such as stress, intimacy, interest, confidence, uncertainty, deception, politeness, frustration, sarcasm, and pain - to mention just a few. Further, vocal behavior comprises non-linguistic information such as laughter, coughing, or sighing.

Following more than a decade of research, strategies have been developed that successfully deal with many of these states and traits, and current systems have reached a level where they are able to face realistic conditions - a situation that this tutorial aims to build on:

Starting from selected important forms of modeling speaker states and traits, it will guide participants along the chain of processing, from data collection and annotation issues to acoustic and linguistic modeling for speaker classification. It will cover robust automatic speech recognition that is independent of state and trait variation and incorporates vocal behavior, detection of paralinguistic events, and synergistic and asynchronous multi-stream fusion of the acquired information. Then, generation, application, and system integration issues will be discussed, including a discourse on ethical issues. After a broad overview of the state of the art, more recent developments such as self-learnt spaces, subject modeling, spectral decomposition-based features, unsupervised learning, multilingual issues, and full open-microphone processing will be presented, and opportunities and challenges for future research will be outlined. The tutorial will consist of a theoretical and a practical part, giving participants hands-on experience with the latest tools and technology in the field.
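The chain of processing sketched above - frame-level acoustic descriptors, utterance-level functionals, and a classifier - can be illustrated with a toy example. The specific features (log-energy, zero-crossing rate), the nearest-centroid classifier, and the synthetic "high/low arousal" task below are illustrative assumptions for the sketch, not the tutorial's actual toolchain:

```python
# Minimal sketch of a paralinguistic classification chain:
# frame-level descriptors -> utterance-level functionals -> classifier.
import math
import random

def frame_features(signal, frame_len=160):
    """Per-frame log-energy and zero-crossing rate (two classic descriptors)."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = math.log(sum(x * x for x in frame) + 1e-10)
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)
        feats.append((energy, zcr))
    return feats

def functionals(feats):
    """Map a variable-length frame sequence to a fixed-length vector (mean, std)."""
    vec = []
    for col in zip(*feats):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        vec.extend((mean, std))
    return vec

def nearest_centroid(train, labels, query):
    """Toy classifier: assign the label of the closest class centroid."""
    by_label = {}
    for vec, lab in zip(train, labels):
        by_label.setdefault(lab, []).append(vec)
    best, best_d = None, float("inf")
    for lab, vecs in by_label.items():
        centroid = [sum(c) / len(c) for c in zip(*vecs)]
        d = math.dist(centroid, query)
        if d < best_d:
            best, best_d = lab, d
    return best

# Synthetic data: "high arousal" utterances simulated as louder noise.
random.seed(0)
def make_utt(loud):
    amp = 0.8 if loud else 0.1
    return [amp * random.uniform(-1, 1) for _ in range(1600)]

train_utts = [make_utt(True) for _ in range(5)] + [make_utt(False) for _ in range(5)]
labels = ["high"] * 5 + ["low"] * 5
train = [functionals(frame_features(u)) for u in train_utts]
print(nearest_centroid(train, labels, functionals(frame_features(make_utt(True)))))
```

Real systems replace each stage with far richer components (thousands of low-level descriptors and functionals, and discriminative or sequence classifiers), but the structure of the chain is the same.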

The objective of this tutorial is thus to give a comprehensive introduction to and general overview of recent algorithms and methodology, and to prepare participants to engage with potential future trends in the analysis of "real-life" speech with all facets of speaker states and traits such as affect, emotion, personality, and behavioral and social signals, including non-linguistic vocalizations. This goal will be achieved through a highly interactive classroom-style presentation, followed by "hands-on" experience covering practical aspects such as current datasets and research tools. While it will not be possible to discuss all aspects of social speech analysis and generation, participants will gain the skills needed to identify the current algorithms and tools required to solve a particular problem from their own field.

The main target audience is thus a broad group of scholars, practitioners, and experts in dialog system engineering, speech recognition and processing, natural language understanding, or general Human-Computer Interaction and Media Retrieval: "real-life" speech touches all of these fields and is influenced by the "voice behind" the words. The tutorial will assume only basic knowledge of signal processing principles (so it will be suitable for the non-specialist), but it will cover many state-of-the-art subjects, so that specialists will also find it interesting. The first three quarters of the tutorial will be theoretical in nature, though with a strongly interactive character. The second, practical part will consist of demonstrations and practical experiments with provided software and data. To this end, participants are encouraged to bring along their laptops.

Speaker Biographies

Björn W. Schuller received his diploma in 1999 and his doctoral degree for his study on Automatic Speech and Emotion Recognition in 2006, both in electrical engineering and information technology from TUM (Munich University of Technology, Germany). Since 2006 he has been tenured as Senior Researcher and Lecturer in Pattern Recognition and Speech Processing, heading the Intelligent Audio Analysis Group at TUM's Institute for Human-Machine Communication. From 2009 to 2010 he lived in Paris, France, and was with the CNRS-LIMSI Spoken Language Processing Group in Orsay, working on affective and social signals in speech. In 2010 he was also a visiting scientist at Imperial College London's Department of Computing, working on audiovisual behaviour recognition. In 2011 he was a guest lecturer at the Università Politecnica delle Marche (UNIVPM) in Ancona, Italy. He is best known for his work advancing Semantic Audio and Audiovisual Processing and Affective Computing. Dr. Schuller is a member of the IEEE, ACM, HUMAINE Association, and ISCA, and has (co-)authored 2 books and 240 publications in peer-reviewed books, journals, and conference proceedings in the fields of signal processing and machine learning, leading to more than 2,000 citations; his current h-index is 24.
He serves as a member and secretary of the steering committee, associate editor, and repeated guest editor of the IEEE Transactions on Affective Computing; associate and repeated guest editor of Computer Speech and Language; associate editor of the IEEE Transactions on Neural Networks and Learning Systems; and guest editor for Speech Communication, Image and Vision Computing, Cognitive Computation, and the EURASIP Journal on Advances in Signal Processing. He is a reviewer for 40 leading journals and multiple conferences in the field, and has served as invited speaker, session and challenge organizer - including the first-of-their-kind INTERSPEECH 2009 Emotion, 2010 Paralinguistic, 2011 Speaker State, and 2012 Speaker Trait Challenges and the 2011 Audio/Visual Emotion Challenge and Workshop - and as chairman and programme committee member of numerous further international workshops and conferences. His steering of and involvement in current and past research projects include the European Community funded ASC-Inclusion STREP project as coordinator, the awarded SEMAINE project, and projects funded by the German Research Foundation (DFG) and companies such as BMW, Continental, Daimler, HUAWEI, Siemens, Toyota, and VDO. Advisory board activities comprise his membership as invited expert in the W3C Emotion Incubator and Emotion Markup Language Incubator Groups, and his repeated election to the Executive Committee of the HUMAINE Association, where he chairs the Special Interest Group Speech.

Florian Metze received his PhD from Universität Karlsruhe (TH) for a thesis on "Articulatory Features for Conversational Speech Recognition" in 2005. He worked as a Senior Research Scientist at Deutsche Telekom Laboratories (T-Labs) from 2006 to 2009, where he led research and development projects involving speech and language technologies with a focus on usability. In 2009, he joined Carnegie Mellon University as research faculty in the Language Technologies Institute, where he is also affiliated with the interACT center. At CMU, he regularly teaches several classes, including "Speech Recognition and Understanding".

Dr. Metze has worked on a wide range of topics in the field of speech processing and user interfaces. In his thesis work, he introduced a multi-stream approach that combines different detectors for evidence of phonological features in the audio signal into a single decision. Since then, the approach has also been shown to be effective for speaker adaptation, multi-lingual ASR, and recognition of "silent" speech using muscle signals, and it was investigated for the synthesis of emotional speech in a 2011 Johns Hopkins University Summer Workshop, which Dr. Metze led together with Dr. Alan Black.

At both T-Labs and CMU, Dr. Metze has also worked on the automatic detection of emotions and personality in speech, and on the use of such information in human-human and human-machine interfaces. His current work shows how personality traits are represented in speech, and how humans and machines can detect them. He has published more than 80 papers in his research areas, and is a member of IEEE, ISCA, the Institute of Semantic Computing, ACM, GI, and other professional associations. He is a member of the IEEE Speech and Language Technical Committee and the INTERSPEECH 2012 Technical Committee, has chaired conferences and workshops such as the IEEE International Conference on Semantic Computing 2010 and Multimodal Audio-based Multimedia Content Analysis 2011, and regularly serves as a reviewer for conferences and journals.