Robust Automatic Speech Recognition: A Bridge to Practical Applications
By Jinyu Li, Li Deng, Reinhold Haeb-Umbach and Yifan Gong
About this ebook
Robust Automatic Speech Recognition: A Bridge to Practical Applications establishes a solid foundation for automatic speech recognition that is robust against acoustic environmental distortion. It provides a thorough overview of classical and modern noise- and reverberation-robust techniques that have been developed over the past thirty years, with an emphasis on practical methods that have been proven successful and that are likely to be further developed for future applications. The strengths and weaknesses of robustness-enhancing speech recognition techniques are carefully analyzed. The book covers noise-robust techniques designed for acoustic models based on both Gaussian mixture models and deep neural networks. In addition, a guide to selecting the best methods for practical applications is provided. The reader will:
- Gain a unified, deep and systematic understanding of the state-of-the-art technologies for robust speech recognition
- Learn the links and relationship between alternative technologies for robust speech recognition
- Be able to use the technology analysis and categorization detailed in the book to guide future technology development
- Be able to develop new noise-robust methods in the current era of deep learning for acoustic modeling in speech recognition
- The first book to provide a comprehensive review of noise- and reverberation-robust speech recognition methods in the era of deep neural networks
- Connects robust speech recognition techniques to machine learning paradigms with rigorous mathematical treatment
- Provides elegant and structural ways to categorize and analyze noise-robust speech recognition techniques
- Written by leading researchers who have been actively working on the subject matter in both industrial and academic organizations for many years
Jinyu Li
Jinyu Li received a Ph.D. degree from Georgia Institute of Technology, U.S. From 2000 to 2003, he was a Researcher at Intel China Research Center and a Research Manager at iFlytek, China. Currently, he is a Principal Applied Scientist at Microsoft, working as a technical lead to design and improve speech modeling algorithms and technologies that ensure industry state-of-the-art speech recognition accuracy for Microsoft products. His major research interests cover several topics in speech recognition and machine learning, including noise robustness, deep learning, discriminative training, and feature extraction. He has authored more than 60 papers and has been awarded more than 10 patents.
Chapter 1
Introduction
Abstract
Automatic speech recognition (ASR) by machine has been a field of research for more than 60 years. The industry has developed a broad range of commercial products where ASR as user interface has become ever more useful and pervasive. Consumer-centric applications increasingly require ASR to be robust to the full range of real-world noise and other acoustic distorting conditions. However, reliably recognizing spoken words in realistic acoustic environments is still a challenge.
We introduce distortion factors that operate in various stages of speech production, from thought to speech signals, leading to the issues of ASR robustness as the focus of this book. We provide an introductory summary of this book in this chapter, covering the ASR robustness problem for acoustic models based on both Gaussian mixture models and deep neural networks. The book goes significantly beyond much of the existing survey literature, and illustrates the research and product development on ASR robustness to noisy acoustic environments that has been progressing for over 30 years.
Finally, we define the mission, goal, and structure of the book in this chapter. We aim to establish a solid, consistent, and common mathematical foundation for robust ASR, emphasizing the methods proven to be successful and expected to sustain or expand their future applicability.
Keywords
Automatic speech recognition
Noise robustness
ASR applications
Survey
Gaussian mixture models
Deep neural networks
Chapter Outline
1.1 Automatic Speech Recognition
1.2 Robustness to Noisy Environments
1.3 Existing Surveys in the Area
1.4 Book Structure Overview
References
1.1 Automatic Speech Recognition
Automatic speech recognition (ASR) is the process, and the related technology, for converting the speech signal into its corresponding sequence of words or other linguistic entities by means of algorithms implemented in a device, a computer, or computer clusters (Deng and O’Shaughnessy, 2003; Huang et al., 2001b). ASR by machine has been a field of research for more than 60 years (Baker et al., 2009a,b; Davis et al., 1952). The industry has developed a broad range of commercial products where speech recognition as user interface has become ever more useful and pervasive.
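This conversion is conventionally formulated as the Bayes decision rule: given acoustic observations $X$, find the most probable word sequence $\hat{W}$,

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X)
        = \operatorname*{arg\,max}_{W} \; p(X \mid W)\, P(W),
```

where $p(X \mid W)$ is the acoustic model and $P(W)$ is the language model. The robustness problem treated in this book arises largely in $p(X \mid W)$: the acoustic conditions seen at deployment may differ from those of the training data.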
Historically, ASR applications have included voice dialing, call routing, interactive voice response, data entry and dictation, voice command and control, gaming, structured document creation (e.g., medical and legal transcriptions), appliance control by voice, computer-aided language learning, content-based spoken audio search, and robotics. More recently, with the exponential growth of big data and computing power, ASR technology has advanced to the stage where more challenging applications are becoming a reality. Examples are voice search, digital assistants and interaction with mobile devices (e.g., Siri on iPhone, Bing voice search and Cortana on Windows Phone and the Windows 10 OS, and Google Now on Android), voice control in home entertainment systems (e.g., Kinect on Xbox), machine translation, home automation, in-vehicle navigation and entertainment, and various speech-centric information processing applications capitalizing on downstream processing of ASR outputs (He and Deng, 2013).
1.2 Robustness to Noisy Environments
New waves of consumer-centric applications increasingly require ASR to be robust to the full range of real-world noise and other acoustic distorting conditions. However, reliably recognizing spoken words in realistic acoustic environments is still a challenge. For such large-scale, real-world applications, noise robustness is becoming an increasingly important core technology since ASR needs to work in much more difficult acoustic environments than in the past (Deng et al., 2002).
Noise refers to any unwanted disturbances superposed upon the intended speech signal. Robustness is the ability of a system to maintain its good performance under varying operating conditions, including those unforeseeable or unavailable at the time of system development.
Speech as observed and digitized is generated by a complex process, from the thoughts to actual speech signals. This process can be described in five stages as shown in Figure 1.1, where a number of variables affect the outcome of each stage. Some major stages in this long chain have been analyzed and modeled mathematically in Deng (1999, 2006).
Figure 1.1 From thoughts to speech.
All of the above could lead to ASR robustness issues. This book addresses challenges mostly in the acoustic channel area where interfering signals lead to ASR performance degradation.
In this area, robustness of ASR to noisy background can be approached from two directions:
• reducing the noise level through hardware that exploits spatial or directional information, building on microphone technology and transducer principles, such as noise-canceling microphones and microphone arrays;
• applying software algorithms that take advantage of the spectral and temporal separation between speech and interfering signals, which is the major focus of this book.
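Classic spectral subtraction is perhaps the simplest illustration of the second direction: it exploits spectral separation by subtracting a noise magnitude estimate from each short-time frame. The sketch below is a minimal toy version (the function name, the constant-noise data, and the floor value are illustrative assumptions, not a specific method from this book):

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from each frame.

    noisy:     (frames, bins) magnitude spectrogram of noisy speech
    noise_est: (bins,) noise magnitude estimate, e.g. averaged over
               speech-free leading frames
    floor:     fraction of the noisy magnitude kept as a spectral floor,
               avoiding negative magnitudes after subtraction
    """
    cleaned = noisy - noise_est                 # per-bin subtraction
    return np.maximum(cleaned, floor * noisy)   # rectify with a floor

# Toy example: a stationary noise spectrum added to a "speech" spectrogram.
rng = np.random.default_rng(0)
speech = np.abs(rng.normal(size=(100, 257)))
noise = np.full(257, 0.5)
noisy = speech + noise
enhanced = spectral_subtraction(noisy, noise)
```

In real systems the noise estimate must be tracked over time, and the floor and over-subtraction factors are tuned; the point here is only the per-frequency-bin arithmetic.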
1.3 Existing Surveys in the Area
Researchers and practitioners have been trying to improve ASR robustness to operating conditions for many years (Huang et al., 2001a; Huang and Deng, 2010). A survey of 1970s speech recognition systems (Lea, 1980) identified, as a primary difficulty, the propensity of the input device to pick up other sounds in the environment that act as interfering noise.
The term robust speech recognition emerged in the late 1980s. Survey papers in the 1990s include (Gong, 1995; Juang, 1991; Junqua and Haton, 1995). By 2000, robust speech recognition had gained significant importance in the speech and language processing fields. Indeed, it was the most popular area at the International Conference on Acoustics, Speech and Signal Processing, at least during 2001-2003 (Gong, 2004). Since 2010, robust ASR has remained one of the most popular areas in the speech processing community, and tremendous and steady progress in noisy speech recognition has been made.
A large number of noise-robust ASR methods, on the order of hundreds, have been proposed and published over the past 30 years or so, and many of them have created significant impact on either research or commercial use. Such accumulated knowledge deserves thorough examination, not only to define the state of the art in this field from a fresh and unifying perspective, but also to point to potentially fruitful future directions. Nevertheless, a well-organized framework for relating and analyzing these methods is conspicuously missing. The existing survey papers (Acero, 1993; Deng, 2011; Droppo and Acero, 2008; Gales, 2011; Gong, 1995; Haeb-Umbach, 2011; Huo and Lee, 2001; Juang, 1991; Kumatani et al., 2012; Lee, 1998) in noise-robust ASR either do not cover all recent advances in the field or focus only on a specific sub-area. Although there are also a few recent books (Kolossa and Haeb-Umbach, 2011; Virtanen et al., 2012), they are collections of topics, with each chapter written by different authors, making it hard to provide a unified view across all topics. Given the importance of noise-robust ASR, the time is ripe to analyze and unify the solutions. The most recent overview paper (Li et al., 2014) elaborates on the basic concepts in noise-robust ASR and develops categorization criteria and unifying themes. Specifically, it hierarchically classifies the major and significant noise-robust ASR methods using a consistent and unifying mathematical language. It establishes their interrelations, differentiates among important techniques, and discusses current technical challenges and future research directions. It also identifies relatively promising, short-term new research areas based on a careful analysis of successful methods, which can serve as a reference for future algorithm development in the field.
Furthermore, in the literature spanning over 30 years on noise-robust ASR, basic concepts and terminology have been used inconsistently by different researchers in the field. This kind of inconsistency is confusing at times, especially for new researchers and students. It is, therefore, important to examine discrepancies in the current literature and redefine a consistent terminology. However, due to page-length restrictions, the overview paper (Li et al., 2014) did not discuss the technologies in depth. More importantly, all the aforementioned books and articles largely assumed that the acoustic models for ASR are based on Gaussian mixture model hidden Markov models (GMM-HMMs).
More recently, a new acoustic modeling technique, referred to as the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) which employs deep learning, has been developed (Deng and Yu, 2014; Yu and Deng, 2011, 2014). This new DNN-based acoustic model has been shown, by many groups, to significantly outperform the conventional state-of-the-art GMM-HMMs in many ASR tasks (Dahl et al., 2012; Hinton et al., 2012). As of the writing of this book, DNN-based ASR has been widely adopted by almost all major speech recognition products and public tools worldwide.
DNNs combine acoustic feature extraction and speech phonetic symbol classification into a single framework. By design, they ensure that both feature extraction and classification are jointly optimized under a discriminative criterion. With their complex non-linear mapping built from successive applications of simple non-linear mappings, DNNs force input features distorted by a variety of noise, channel, and other factors to be mapped to the same output vector of phonetic symbol classes. Such an ability provides the potential for substantial performance improvements in noisy speech recognition.
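As a rough sketch of this pipeline (the layer sizes, the 11-frame context window, and the tiny 100-class senone inventory below are illustrative assumptions, not a specific system described in the book), a feedforward DNN acoustic model maps a window of stacked feature frames through several non-linear layers to a posterior distribution over phonetic classes:

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    # simple elementwise non-linearity used in the hidden layers
    return np.maximum(0.0, x)

def softmax(x):
    # numerically stabilized softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical dimensions: 11 stacked frames of 40 log-mel features in,
# three hidden layers, posteriors over a (tiny) senone set out.
dims = [11 * 40, 512, 512, 512, 100]
weights = [rng.normal(scale=0.05, size=(m, n)) for m, n in zip(dims, dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]

def dnn_posteriors(frames):
    """Forward pass: stacked feature frames -> senone posteriors."""
    h = frames
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                           # hidden layers
    return softmax(h @ weights[-1] + biases[-1])      # output distribution

x = rng.normal(size=(3, 11 * 40))   # a batch of 3 stacked-frame vectors
p = dnn_posteriors(x)
```

The weights here are random; in practice they are trained discriminatively on frame-labeled speech, which is exactly where distorted inputs get pulled toward the same phonetic targets.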
However, while DNNs dramatically reduce the overall word error rate of speech recognition, many new questions arise: How much more robust are DNNs than GMMs? How should a physical model of speech, noise, and channel be introduced into a DNN so that a better DNN can be trained on the same data? Will feature cleaning for a DNN add value to the DNN modeling? Can speech be modeled with a DNN such that complete, expensive retraining can be avoided when the noise changes? To what extent can the noise-robustness methods developed for GMMs enhance the robustness of DNNs? More generally, what the future of noise-robust ASR technologies will hold in the new era of DNNs is a question not addressed in the existing survey literature on noise-robust ASR. One of the main goals of this book is to survey the recent noise-robust methods developed for DNNs as the acoustic models of speech, and to discuss future research directions.
1.4 Book Structure Overview
This book is devoted to providing a summary of the current, fast expanding knowledge and approaches to solving a variety of problems in noise-robust ASR. A more specific purpose is to assist readers in acquiring a structured understanding of the state of the art and to continue to enrich the knowledge.
In this book, we aim to establish a solid, consistent, and common mathematical foundation for noise-robust ASR. We emphasize the methods that are proven to be effective and successful and that are likely to sustain or expand their future applicability. For the methods described in this book, we attempt to present the basic ideas, the assumptions, and the relationships with other methods. We categorize a wide range of noise-robust techniques using different criteria to equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions especially in the era of DNNs and deep learning are carefully analyzed.
This book is organized as follows. We provide the basic concepts and formulations of ASR in Chapter 2. In Chapter 3, we discuss the fundamentals of noise-robust ASR. The impact of noise and channel distortions on clean speech is examined. Then, we build a general framework for noise-robust ASR and define five ways of categorizing and analyzing noise-robust ASR techniques. Chapter 4 is devoted to the first category—feature-domain vs. model-domain techniques. Various feature-domain processing methods are covered in detail, including noise-resistant features, feature moment normalization, and feature compensation, as well as a few of the most prominent model-domain methods. The second category, detailed in Chapter 5, comprises methods that exploit prior knowledge about the signal distortion. Examples of such models are mapping functions between the clean and noisy speech features, and environment-specific models combined during online operation of the noise-robust algorithms. Methods that incorporate an explicit distortion model to predict the distorted speech from a clean one define the third category, covered in Chapter 6. The use of uncertainty constitutes the fourth way to categorize a wide range of noise-robust ASR algorithms, and is covered in Chapter 7. Uncertainty in either the model space or feature space may be incorporated within the Bayesian framework to promote noise-robust ASR. The final, fifth way to categorize and analyze noise-robust ASR techniques exploits joint model training, described in Chapter 8. With joint model training, environmental variability in the training data is removed in order to generate canonical models. After the noise-robust techniques for single-microphone non-reverberant ASR are comprehensively discussed, the book includes two chapters covering reverberant ASR and multi-channel processing for noise-robust ASR, respectively. We conclude this book in Chapter 11, with discussions on future directions for noise-robust ASR.