
 Section: Multimedia

Audio and Speech Processing for Data Mining


Zheng-Hua Tan
Aalborg University, Denmark

INTRODUCTION

The explosive increase in computing power, network bandwidth and storage capacity has largely facilitated the production, transmission and storage of multimedia data. Compared to alphanumeric databases, non-text media such as audio, image and video are different in that they are unstructured by nature, and although containing rich information, they are not quite as expressive from the viewpoint of a contemporary computer. As a consequence, an overwhelming amount of data is created and then left unstructured and inaccessible, boosting the desire for efficient content management of these data. This has become a driving force of multimedia research and development, and has led to a new field termed multimedia data mining. While text mining is relatively mature, mining information from non-text media is still in its infancy, but holds much promise for the future.

In general, data mining is the process of applying analytical approaches to large data sets to discover implicit, previously unknown, and potentially useful information. This process often involves three steps: data preprocessing, data mining and postprocessing (Tan, Steinbach, & Kumar, 2005). The first step is to transform the raw data into a more suitable format for subsequent data mining. The second step conducts the actual mining, while the last one is implemented to validate and interpret the mining results.

Data preprocessing is a broad area and is the part of data mining where essential techniques are highly dependent on data types. Different from textual data, which is typically based on a written language, image, video and some audio are inherently non-linguistic. Speech as a spoken language lies in between and often provides valuable information about the subjects, topics and concepts of multimedia content (Lee & Chen, 2005). The linguistic nature of speech makes information extraction from speech less complicated yet more precise and accurate than from image and video. This fact motivates content-based speech analysis for multimedia data mining and retrieval, where audio and speech processing is a key enabling technology (Ohtsuki, Bessho, Matsuo, Matsunaga, & Kayashi, 2006). Progress in this area can impact numerous business and government applications (Gilbert, Moore, & Zweig, 2005). Examples are discovering patterns and generating alarms for intelligence organizations as well as for call centers, analyzing customer preferences, and searching through vast audio warehouses.

BACKGROUND

With the enormous, ever-increasing amount of audio data (including speech), the challenge now and in the future becomes the exploration of new methods for accessing and mining these data. Due to the unstructured nature of audio, audio files must be annotated with structured metadata to facilitate the practice of data mining. Although manually labeled metadata to some extent assist in such activities as categorizing audio files, they are insufficient on their own when it comes to more sophisticated applications like data mining. Manual transcription is also expensive and in many cases outright impossible. Consequently, automatic metadata generation relying on advanced processing technologies is required so that more thorough annotation and transcription can be provided.

Technologies for this purpose include audio diarization and automatic speech recognition. Audio diarization aims at annotating audio data through segmentation, classification and clustering, while speech recognition is deployed to transcribe speech. In addition, there is event detection, for example applause detection in sports recordings. After audio is transformed into various symbolic streams, data mining techniques can be applied to the streams to find patterns and associations, and information retrieval techniques can be applied for the purposes of indexing, search and retrieval. The procedure is analogous to video data mining and retrieval (Zhu, Wu, Elmagarmid, Feng, & Wu, 2005; Oh, Lee, & Hwang, 2005).
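
As a rough illustration of what mining and retrieval over symbolic streams can look like, the Python sketch below starts from audio that has already been diarized and transcribed. It is a minimal sketch under stated assumptions, not a method taken from the cited works: the Segment record, its field names, and the helper functions build_index, search and cooccurrence are all hypothetical, and the sample segments are invented for the example. An inverted index stands in for the information retrieval side, and word-pair co-occurrence counting stands in for the pattern and association side.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    """One annotated stretch of audio (hypothetical record layout)."""
    start: float    # segment start time in seconds
    end: float      # segment end time in seconds
    label: str      # class from diarization/event detection, e.g. "speech"
    speaker: str    # speaker cluster id, empty string for non-speech
    words: tuple    # recognized words, empty tuple for non-speech


def build_index(segments):
    """Map each lower-cased word to the set of segment positions containing it."""
    index = defaultdict(set)
    for position, segment in enumerate(segments):
        for word in segment.words:
            index[word.lower()].add(position)
    return index


def search(index, query):
    """Return sorted positions of segments that contain every query word."""
    terms = [term.lower() for term in query.split()]
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


def cooccurrence(segments):
    """Count unordered word pairs that occur within the same segment."""
    counts = Counter()
    for segment in segments:
        vocabulary = sorted({word.lower() for word in segment.words})
        counts.update(combinations(vocabulary, 2))
    return counts


# Invented sample data: two speech segments separated by an applause event.
segments = [
    Segment(0.0, 4.2, "speech", "spk1", ("the", "goal", "was", "scored")),
    Segment(4.2, 9.0, "applause", "", ()),
    Segment(9.0, 13.5, "speech", "spk2", ("a", "great", "goal")),
]

index = build_index(segments)
print(search(index, "goal"))        # [0, 2]
print(search(index, "great goal"))  # [2]
print([s.start for s in segments if s.label == "applause"])  # [4.2]
print(cooccurrence(segments).most_common(3))
```

Keeping the time stamps on each segment means a search hit can be mapped back to a playable position in the recording, and the event labels can be mined alongside the words, for instance to find which words tend to precede applause.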

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
