Robust Automatic Speech Recognition: A Bridge to Practical Applications

About this ebook

Robust Automatic Speech Recognition: A Bridge to Practical Applications establishes a solid foundation for automatic speech recognition that is robust against acoustic environmental distortion. It provides a thorough overview of classical and modern noise- and reverberation-robust techniques that have been developed over the past thirty years, with an emphasis on practical methods that have been proven successful and that are likely to be further developed for future applications. The strengths and weaknesses of robustness-enhancing speech recognition techniques are carefully analyzed. The book covers noise-robust techniques designed for acoustic models based on both Gaussian mixture models and deep neural networks. In addition, a guide to selecting the best methods for practical applications is provided. The reader will:

  • Gain a unified, deep and systematic understanding of the state-of-the-art technologies for robust speech recognition
  • Learn the links and relationship between alternative technologies for robust speech recognition
  • Be able to use the technology analysis and categorization detailed in the book to guide future technology development
  • Be able to develop new noise-robust methods in the current era of deep learning for acoustic modeling in speech recognition

Key features:

  • The first book to provide a comprehensive review of noise- and reverberation-robust speech recognition methods in the era of deep neural networks
  • Connects robust speech recognition techniques to machine learning paradigms with rigorous mathematical treatment
  • Provides elegant and structured ways to categorize and analyze noise-robust speech recognition techniques
  • Written by leading researchers who have been actively working on the subject matter in both industrial and academic organizations for many years

Language: English
Release date: Oct 30, 2015
ISBN: 9780128026168
Author

Jinyu Li

Jinyu Li received a Ph.D. degree from Georgia Institute of Technology, U.S. From 2000 to 2003, he was a Researcher at Intel China Research Center and a Research Manager at iFlytek, China. Currently, he is a Principal Applied Scientist at Microsoft, working as a technical lead to design and improve speech modeling algorithms and technologies that ensure industry state-of-the-art speech recognition accuracy for Microsoft products. His major research interests cover several topics in speech recognition and machine learning, including noise robustness, deep learning, discriminative training, and feature extraction. He has authored over 60 papers and has been awarded over 10 patents.

    Book preview

    Robust Automatic Speech Recognition - Jinyu Li

    Chapter 1

    Introduction

    Abstract

    Automatic speech recognition (ASR) by machine has been a field of research for more than 60 years. The industry has developed a broad range of commercial products in which ASR as a user interface has become ever more useful and pervasive. Consumer-centric applications increasingly require ASR to be robust to the full range of real-world noise and other acoustic distorting conditions. However, reliably recognizing spoken words in realistic acoustic environments is still a challenge.

    We introduce the distortion factors that operate at various stages of speech production, from thought to speech signal, and that give rise to the ASR robustness issues on which this book focuses. In this chapter, we provide an introductory summary of the book, covering the ASR robustness problem for acoustic models based on both Gaussian mixture models and deep neural networks. The book goes significantly beyond much of the existing survey literature, and illustrates the research and product development on ASR robustness to noisy acoustic environments that has been progressing for over 30 years.

    Finally, we define the mission, goal, and structure of the book in this chapter. We aim to establish a solid, consistent, and common mathematical foundation for robust ASR, emphasizing the methods proven to be successful and expected to sustain or expand their future applicability.

    Keywords

    Automatic speech recognition

    Noise robustness

    ASR applications

    Survey

    Gaussian mixture models

    Deep neural networks

    Chapter Outline

    1.1 Automatic Speech Recognition

    1.2 Robustness to Noisy Environments

    1.3 Existing Surveys in the Area

    1.4 Book Structure Overview

    References

    1.1 Automatic Speech Recognition

    Automatic speech recognition (ASR) is the process, and the related technology, for converting the speech signal into its corresponding sequence of words or other linguistic entities by means of algorithms implemented in a device, a computer, or computer clusters (Deng and O’Shaughnessy, 2003; Huang et al., 2001b). ASR by machine has been a field of research for more than 60 years (Baker et al., 2009a,b; Davis et al., 1952). The industry has developed a broad range of commercial products in which speech recognition as a user interface has become ever more useful and pervasive.
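
    In its standard probabilistic formulation (treated formally in Chapter 2; the notation here is generic rather than the book's own), the recognizer searches for the word sequence that best explains the observed acoustic features:

        \hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} p(X \mid W)\, P(W)

    where X denotes the sequence of acoustic feature vectors extracted from the speech signal, p(X | W) is the acoustic model, and P(W) is the language model. Acoustic environmental distortion corrupts X and thereby degrades the acoustic-model scores, which is the robustness problem addressed in this book.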

    Historically, ASR applications have included voice dialing, call routing, interactive voice response, data entry and dictation, voice command and control, gaming, structured document creation (e.g., medical and legal transcriptions), appliance control by voice, computer-aided language learning, content-based spoken audio search, and robotics. More recently, with the exponential growth of big data and computing power, ASR technology has advanced to the stage where more challenging applications are becoming a reality. Examples are voice search, digital assistants and interactions with mobile devices (e.g., Siri on the iPhone, Bing voice search and Cortana on Windows Phone and the Windows 10 OS, and Google Now on Android), voice control in home entertainment systems (e.g., Kinect on Xbox), machine translation, home automation, in-vehicle navigation and entertainment, and various speech-centric information processing applications capitalizing on downstream processing of ASR outputs (He and Deng, 2013).

    1.2 Robustness to Noisy Environments

    New waves of consumer-centric applications increasingly require ASR to be robust to the full range of real-world noise and other acoustic distorting conditions. However, reliably recognizing spoken words in realistic acoustic environments is still a challenge. For such large-scale, real-world applications, noise robustness is becoming an increasingly important core technology since ASR needs to work in much more difficult acoustic environments than in the past (Deng et al., 2002).

    Noise refers to any unwanted disturbance superposed upon the intended speech signal. Robustness is the ability of a system to maintain good performance under varying operating conditions, including those unforeseeable or unavailable at the time of system development.
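
    One common way to make this concrete (a generic formulation anticipating the distortion analysis later in the book; the symbols are chosen here purely for illustration) is to model the observed signal as clean speech passed through an acoustic channel and corrupted by additive noise:

        y[t] = x[t] \ast h[t] + n[t]

    where x[t] is the clean speech, h[t] is the impulse response of the acoustic channel (e.g., room reverberation), n[t] is the additive noise, and \ast denotes convolution. In the log-spectral and cepstral feature domains used by ASR front-ends, this relation becomes non-linear, which is a large part of what makes compensation challenging.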

    Speech as observed and digitized is generated by a complex process, from the thoughts to actual speech signals. This process can be described in five stages as shown in Figure 1.1, where a number of variables affect the outcome of each stage. Some major stages in this long chain have been analyzed and modeled mathematically in Deng (1999, 2006).

    Figure 1.1 From thoughts to speech.

    All of the above could lead to ASR robustness issues. This book addresses challenges mostly in the acoustic channel area where interfering signals lead to ASR performance degradation.

    In this area, robustness of ASR to noisy backgrounds can be approached from two complementary directions:

    • reducing the noise level in hardware, by exploiting spatial or directional information through microphone technology and transducer principles, such as noise-canceling microphones and microphone arrays;

    • software algorithmic processing that takes advantage of the spectral and temporal separation between speech and interfering signals, which is the major focus of this book; a minimal sketch of one such algorithm follows.
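
    The following is an illustrative sketch of one classic algorithmic method of the second kind, magnitude-domain spectral subtraction. It is not a method prescribed at this point in the book; the function name and parameter values are hypothetical, and a practical system would use a proper STFT library and more careful noise tracking.

    import numpy as np

    def spectral_subtraction(noisy, frame_len=400, hop=160, noise_frames=10,
                             over_subtraction=2.0, floor=0.02):
        """Enhance a 1-D noisy waveform by subtracting an estimated noise spectrum."""
        window = np.hanning(frame_len)
        n_frames = 1 + (len(noisy) - frame_len) // hop
        enhanced = np.zeros(len(noisy))

        # Estimate the noise magnitude spectrum from the first few frames,
        # assuming they contain no speech (a deliberately crude estimator).
        noise_mag = np.zeros(frame_len // 2 + 1)
        for i in range(noise_frames):
            frame = noisy[i * hop:i * hop + frame_len] * window
            noise_mag += np.abs(np.fft.rfft(frame))
        noise_mag /= noise_frames

        for i in range(n_frames):
            frame = noisy[i * hop:i * hop + frame_len] * window
            spec = np.fft.rfft(frame)
            mag, phase = np.abs(spec), np.angle(spec)
            # Over-subtract the noise estimate and apply a spectral floor to
            # limit musical-noise artifacts, then resynthesize with the noisy phase.
            clean_mag = np.maximum(mag - over_subtraction * noise_mag, floor * mag)
            enhanced[i * hop:i * hop + frame_len] += np.fft.irfft(
                clean_mag * np.exp(1j * phase), n=frame_len) * window
        return enhanced

    # Hypothetical usage on a 16 kHz waveform stored in a NumPy array:
    #   cleaned = spectral_subtraction(noisy_waveform)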

    1.3 Existing Surveys in the Area

    Researchers and practitioners have been trying to improve ASR robustness to operating conditions for many years (Huang et al., 2001a; Huang and Deng, 2010). A survey of 1970s speech recognition systems (Lea, 1980) identified that a primary difficulty with speech recognition is the tendency of the input device to pick up other sounds in the environment, which act as interfering noise. The term robust speech recognition emerged in the late 1980s. Survey papers in the 1990s include Gong (1995), Juang (1991), and Junqua and Haton (1995). By 2000, robust speech recognition had gained significant importance in the speech and language processing fields; in fact, it was the most popular area at the International Conference on Acoustics, Speech and Signal Processing, at least during 2001-2003 (Gong, 2004). Since 2010, robust ASR has remained one of the most popular areas in the speech processing community, and tremendous and steady progress in noisy speech recognition has been made.

    A large number of noise-robust ASR methods, on the order of hundreds, have been proposed and published over the past 30 years or so, and many of them have had a significant impact on research or commercial use. Such accumulated knowledge deserves thorough examination, not only to define the state of the art in this field from a fresh and unifying perspective, but also to point to potentially fruitful future directions. Nevertheless, a well-organized framework for relating and analyzing these methods is conspicuously missing. The existing survey papers in noise-robust ASR (Acero, 1993; Deng, 2011; Droppo and Acero, 2008; Gales, 2011; Gong, 1995; Haeb-Umbach, 2011; Huo and Lee, 2001; Juang, 1991; Kumatani et al., 2012; Lee, 1998) either do not cover all recent advances in the field or focus only on a specific sub-area. Although there are also a few recent books (Kolossa and Haeb-Umbach, 2011; Virtanen et al., 2012), they are collections of topics, with each chapter written by different authors, which makes it hard to provide a unified view across all topics. Given the importance of noise-robust ASR, the time is ripe to analyze and unify the solutions.

    The most recent overview paper (Li et al., 2014) elaborates on the basic concepts in noise-robust ASR and develops categorization criteria and unifying themes. Specifically, it hierarchically classifies the major and significant noise-robust ASR methods using a consistent and unifying mathematical language, establishes their interrelations, differentiates among important techniques, and discusses current technical challenges and future research directions. It also identifies relatively promising, short-term new research areas based on a careful analysis of successful methods, which can serve as a reference for future algorithm development in the field. Furthermore, in the literature spanning over 30 years on noise-robust ASR, there is inconsistent use of basic concepts and terminology among different researchers in the field. This inconsistency is confusing at times, especially for new researchers and students. It is, therefore, important to examine discrepancies in the current literature and re-define a consistent terminology. However, due to page-length restrictions, the overview paper (Li et al., 2014) did not discuss the technologies in depth. More importantly, all the aforementioned books and articles largely assume that the acoustic models for ASR are based on Gaussian mixture model hidden Markov models (GMM-HMMs).

    More recently, a new acoustic modeling technique, referred to as the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) which employs deep learning, has been developed (Deng and Yu, 2014; Yu and Deng, 2011, 2014). This new DNN-based acoustic model has been shown, by many groups, to significantly outperform the conventional state-of-the-art GMM-HMMs in many ASR tasks (Dahl et al., 2012; Hinton et al., 2012). As of the writing of this book, DNN-based ASR has been widely adopted by almost all major speech recognition products and public tools worldwide.

    DNNs combine acoustic feature extraction and speech phonetic symbol classification into a single framework. By design, they ensure that both feature extraction and classification are jointly optimized under a discriminative criterion. With their complex non-linear mapping built from successive applications of simple non-linear mappings, DNNs force input features distorted by a variety of noise and channel conditions, as well as other factors, to be mapped to the same output vector of phonetic symbol classes. Such an ability provides the potential for substantial performance improvement in noisy speech recognition.
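
    As a concrete illustration of this structure, the following is a minimal NumPy sketch of a feed-forward DNN acoustic model: stacked affine transforms with simple non-linearities, ending in a softmax over phonetic (senone) classes. The layer sizes and names are hypothetical assumptions for illustration, not configurations taken from the book.

    import numpy as np

    rng = np.random.default_rng(0)

    def relu(x):
        return np.maximum(x, 0.0)

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    # Hypothetical dimensions: 440 inputs (e.g., 40 log filter-bank features
    # over an 11-frame context window) and 2,000 senone output classes.
    layer_dims = [440, 1024, 1024, 1024, 2000]
    weights = [rng.normal(0.0, 0.01, (m, n))
               for m, n in zip(layer_dims[:-1], layer_dims[1:])]
    biases = [np.zeros(n) for n in layer_dims[1:]]

    def dnn_posteriors(features):
        """Forward pass: successive simple non-linear mappings, softmax output."""
        h = features
        for W, b in zip(weights[:-1], biases[:-1]):
            h = relu(h @ W + b)                       # hidden layers
        return softmax(h @ weights[-1] + biases[-1])  # senone posteriors

    # Posteriors for a batch of five (random, stand-in) feature frames.
    posteriors = dnn_posteriors(rng.normal(size=(5, 440)))
    assert posteriors.shape == (5, 2000)

    In training, the weights would be optimized discriminatively (e.g., with a cross-entropy criterion against senone labels), which is what jointly tunes the implicit feature extraction in the lower layers and the classification in the upper layers.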

    However, while DNNs dramatically reduce the overall word error rate for speech recognition, many new questions arise: How much more robust are DNNs than GMMs? How should we introduce a physical model of speech, noise, and channel into a DNN so that a better DNN can be trained given the same data? Will feature cleaning add value to DNN modeling? Can we model speech with a DNN such that complete, expensive retraining can be avoided when the noise changes? To what extent can the noise-robustness methods developed for GMMs enhance the robustness of DNNs? More generally, what the future of noise-robust ASR technologies holds in the new era of DNNs for ASR is a question not addressed in the existing survey literature. One of the main goals of this book is to survey the recent noise-robust methods developed for DNNs as the acoustic models of speech, and to discuss future research directions.

    1.4 Book Structure Overview

    This book is devoted to providing a summary of the current, fast-expanding knowledge and approaches for solving a variety of problems in noise-robust ASR. A more specific purpose is to assist readers in acquiring a structured understanding of the state of the art and in continuing to enrich that knowledge.

    In this book, we aim to establish a solid, consistent, and common mathematical foundation for noise-robust ASR. We emphasize the methods that are proven to be effective and successful and that are likely to sustain or expand their future applicability. For the methods described in this book, we attempt to present the basic ideas, the assumptions, and the relationships with other methods. We categorize a wide range of noise-robust techniques using different criteria to equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions especially in the era of DNNs and deep learning are carefully analyzed.

    This book is organized as follows. We provide the basic concepts and formulations of ASR in Chapter 2. In Chapter 3, we discuss the fundamentals of noise-robust ASR: the impact of noise and channel distortions on clean speech is examined, and we then build a general framework for noise-robust ASR and define five ways of categorizing and analyzing noise-robust ASR techniques. Chapter 4 is devoted to the first category, feature-domain vs. model-domain techniques. Various feature-domain processing methods are covered in detail, including noise-resistant features, feature moment normalization, and feature compensation, as well as a few of the most prominent model-domain methods. The second category, detailed in Chapter 5, comprises methods that exploit prior knowledge about the signal distortion; examples of such models are mapping functions between the clean and noisy speech features, and environment-specific models combined during online operation of the noise-robust algorithms. Methods that incorporate an explicit distortion model to predict the distorted speech from clean speech define the third category, covered in Chapter 6. The use of uncertainty constitutes the fourth way to categorize a wide range of noise-robust ASR algorithms and is covered in Chapter 7; uncertainty in either the model space or the feature space may be incorporated within the Bayesian framework to promote noise-robust ASR. The final, fifth way to categorize and analyze noise-robust ASR techniques exploits joint model training, described in Chapter 8; with joint model training, environmental variability in the training data is removed in order to generate canonical models. After this comprehensive discussion of noise-robust techniques for single-microphone, non-reverberant ASR, the book includes two chapters covering reverberant ASR and multi-channel processing for noise-robust ASR, respectively. We conclude this book in Chapter 11 with discussions on future directions for noise-robust ASR.
