
THE RELATIONSHIP BETWEEN HIDDEN MARKOV MODELS AND PREDICTION BY PARTIAL MATCHING MODELS

Stuart Andrew Yeates
Computer Science, University of Waikato
s.yeates@cs.waikato.ac.nz

Abstract: Hidden Markov Models (HMMs) are the pre-eminent statistical modelling technique in modern voice recognition systems. Prediction by Partial Matching (PPM) is a state-of-the-art compression algorithm that can be used for statistical modelling of textual information. In the past we have studied the use of PPM models to solve the generalised tag-insertion problem. We show that, in general, PPM-based models are a special case of HMMs with a particular structure, enabling the use of well-researched, mature HMM tools and techniques to solve the tag-insertion problem. Using a state-of-the-art HMM package, we describe how PPM-based modelling of discrete, ordinal, sequential signals can be recast as modelling of continuous, ordinal, temporal signals amenable to processing by HMMs. We examine the statistical algorithms available for use on models in HMM form that are unavailable to models in PPM form.

Keywords: Hidden Markov Models, HMM, Prediction by Partial Matching, PPM, Text Mining, generalised tag-insertion problem

1 INTRODUCTION
Lewis Carroll's poem Jabberwocky opens with these lines:

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

Despite the fact that, of the words having four or more letters, only two appear in the dictionary, fluent speakers of English have no difficulty accepting the text as English or, given a little preparation, reading it aloud. What is it about this text that makes it seem so natural while containing so few English words? The outer structure helps: the line breaks, the capital letters at the start of each line, the rhyme at the end. However, far more can be inferred: "The toves were obviously gyring as well as gimbling; the raths, while engaged in outgrabing (or perhaps outgribing), were in some way related to mome, or had the quality of momeness, and so on. In fact, the markers enable you to place every nonsense word but two in its proper syntactic category." [2]

These inferences exploit syntactic information that native speakers pick up subconsciously from the original poem, based on lexical features such as the framework of function words and word endings. This information could be encoded by marking up the parts of speech in an XML-like notation:

'Twas brillig, and the <adjective>slithy</> <noun>toves</>
Did <verb>gyre</> and <verb>gimble</> in the <noun>wabe</>:
All <adjective>mimsy</> were the <noun>borogoves</>,
And the mome <noun>raths</> <verb>outgrabe</>.

This paper is about inferring markup information, a generalisation of part-of-speech tagging called generalised tag insertion. We have previously used PPM models (PPMMs) [3,6,7,10] to attempt to solve this problem, using a separate PPMM to model the context of each tag: one tag for nouns, one tag for verbs, and so on. PPM works by building up a set of probability distributions conditioned on contexts of length less than or equal to the order of the model. For example, an order 2 PPMM, as shown in Figure 1, has a probability distribution associated with each node in the graph. The probability distribution is a combination of the observed frequencies of characters occurring in the training text. For each character, the appropriate distribution is found by tracing down through the graph using the previous two characters. Tags are mapped to special characters. Viterbi search [8], an efficient algorithm for finding the highest-probability path through a model, was then used to find the most likely place to insert tags.

Figure 1: An order 2 PPMM
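As an illustration of this kind of context counting, the following sketch builds an order-2 frequency table of the sort described above. The class name, the use of '#' as a stand-in for a tag character, and the omission of escape probabilities are assumptions made for the example; this is not the implementation used in our earlier work.

```python
from collections import defaultdict

class Order2Model:
    """Minimal sketch of an order-2 context model: for each pair of
    preceding characters, keep counts of the character that follows."""

    def __init__(self):
        # counts[context][character] = times 'character' followed 'context'
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i in range(2, len(text)):
            context = text[i - 2:i]
            self.counts[context][text[i]] += 1

    def distribution(self, context):
        """Observed probability distribution for a two-character context
        (escape probabilities for unseen characters are omitted here)."""
        seen = self.counts[context]
        total = sum(seen.values())
        return {ch: c / total for ch, c in seen.items()} if total else {}

# Example: '#' stands in for a tag character mapped into the text.
model = Order2Model()
model.train("the #noun# toves did #verb# gyre")
print(model.distribution("e "))
```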



It was noticed that using PPMMs in this manner was, in fact, equivalent to using HMMs to solve the same problem. HMMs represent a well-known and well-studied field of computer science with over 35 years of research and development [1,5]. This results in a solid theoretical groundwork and robust, stable production implementations.


2 HMMS
Hidden Markov Models (HMMs) are widely used in such fields as voice recognition, network modelling and mathematical simulation. They are a variant of Markov Models, which are used widely in computerised statistical modelling. HMMs aim to model the steady state of a system and, unlike the PPMM shown in Figure 1, state is kept between characters.


Figure 2: A small HMM.

Figure 2 shows a small HMM. Most transitions are bi-directional and can emit any character in the alphabet (each with a different probability). State 1 is the initial state and State 8 is the final state. The dashed lines represent transitions that, due to the nature of the graph, can only be traversed once, at the start of the sequence of characters. Similarly, the dotted lines represent transitions that may only be taken at the end of the sequence. Probabilities are assigned to each of the state transitions using a statistical algorithm called the Baum-Welch algorithm [1,5], which adjusts the probabilities when trained on a training corpus. As with PPMMs, HMMs use Viterbi search to combine the individual probabilities for each transition to find the most likely route through the model for a given observed sequence; the sequence of characters emitted by this route is the output. Due to the depth of research into HMMs, a number of algorithms are available to models in HMM form that are unavailable to models in PPM form. These fall into two main categories: probability manipulations, such as the Baum-Welch algorithm, and model-rewriting algorithms, which split commonly used nodes or merge rarely used similar nodes [11].
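The Viterbi search referred to above can be sketched as follows. The transition and emission tables and the two-state example are illustrative assumptions for the sketch, not the structure of the model in Figure 2 or HTK's interface.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Toy Viterbi search: return the most probable state path for a
    sequence of observed characters, working in log space."""
    # best[s] = (log probability, path) of the best path ending in state s
    best = {s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
            for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # pick the predecessor that maximises the path probability
            logp, prev = max(
                (best[r][0] + math.log(trans_p[r][s]) + math.log(emit_p[s][obs]), r)
                for r in states)
            new_best[s] = (logp, best[prev][1] + [s])
        best = new_best
    return max(best.values())[1]

# A two-state example with biased emissions of 'a' and 'b'.
states = ["S1", "S2"]
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}
print(viterbi("aabba", states, start_p, trans_p, emit_p))
```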

3 MAPPING PPMMS TO HMMS

Figure 3 shows the same PPMM as shown in Figure 1, but adapted into an HMM. Again, dashed lines are only used at the start of a sequence, during initialisation. Previously the PPMM was stateless (the distribution was found from the root of the tree each time) whereas the new HMM is stateful (the model remembers the previous characters and finds the offset based on a single character).

Figure 3: An order 2 PPMM adapted into an HMM.
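One minimal way to picture this stateful view, assuming that a state is simply the most recent two characters, is sketched below; the function name and the tiny training string are illustrative only.

```python
from collections import defaultdict

def context_states(text, order=2):
    """Sketch of the stateful view: each state is the last `order`
    characters, and reading one more character moves to a new state."""
    transitions = defaultdict(set)
    for i in range(order, len(text)):
        state = text[i - order:i]              # e.g. the previous two characters
        next_state = text[i - order + 1:i + 1]
        transitions[state].add((text[i], next_state))
    return transitions

# Each arc reads: from state 'xy', emitting character c leads to state 'yc'.
for state, arcs in sorted(context_states("abab").items()):
    print(state, sorted(arcs))
```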

HMMs traditionally have a single final state (State 8 in Figure 2) whereas PPMMs can have several final states, one for each special character, with each special character corresponding to a higher-level transition. The PPMM in Figure 3 could easily be rewritten to have a single final state (by including non-emitting transitions linking each of the old final states to a new final state), but when there are multiple special characters, such merging of states transforms the message 'Change to model X' into 'Change model': there is a quantifiable loss of information. This rewriting is shown in Figure 4.

Figure 4: A PPMM converted to an HMM and rewritten to have a single final state.
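The rewriting shown in Figure 4 can be sketched as a simple graph operation. The dictionary-of-arcs encoding, the state names and the use of None to mark a non-emitting transition are assumptions made for the sketch.

```python
def merge_final_states(arcs, final_states, new_final="F"):
    """Sketch of the Figure 4 rewriting: link every old final state to a
    single new final state with a non-emitting (epsilon) transition,
    marked here with None."""
    rewritten = {state: dict(out) for state, out in arcs.items()}
    for state in final_states:
        rewritten.setdefault(state, {})[None] = new_final
    rewritten.setdefault(new_final, {})        # the new, single final state
    return rewritten

# Toy model with two final states, q3 and q4, reached via different characters.
toy = {"q1": {"a": "q2"}, "q2": {"b": "q3", "a": "q4"}, "q3": {}, "q4": {}}
print(merge_final_states(toy, final_states=["q3", "q4"]))
```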
When the Baum-Welch algorithm is building up the dynamic probabilities in an HMM, it never reduces a probability to zero, but theoretically comes arbitrarily close to it. In practice, in HTK [11], the HMM package we use, all non-zero probabilities below a threshold are raised to the threshold. PPMMs do not reduce probabilities to zero either, but use a different method to handle low probabilities. All probabilities are calculated from the counts of previously seen characters, with a small probability of seeing a novel character. The size of this small probability, and the way it is spread among the possible novel characters, is determined by the escape method [4,9]. The escape method specifies a mapping from a collection of counts to a probability distribution. It is not clear whether the Baum-Welch algorithm has the same or similar effects as commonly used escape methods; determining this is beyond the scope of this paper.
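To make the contrast concrete, the sketch below sets a count-to-probability mapping with an escape allowance (the particular rule shown, one escape count per distinct symbol, is only one of several published escape methods) against a simple probability floor of the kind described above. The floor value and the toy counts are assumptions made for the example.

```python
def escape_distribution(counts, alphabet):
    """Counts -> probabilities with an escape allowance: symbols never seen
    in this context share a small probability mass (a PPMC-like rule,
    shown only to illustrate the idea)."""
    total = sum(counts.values())
    distinct = len(counts)
    escape = distinct / (total + distinct)       # mass reserved for novel symbols
    dist = {s: c / (total + distinct) for s, c in counts.items()}
    novel = [s for s in alphabet if s not in counts]
    for s in novel:                              # spread the escape mass evenly
        dist[s] = escape / len(novel)
    return dist

def floored_distribution(probs, floor=1e-5):
    """Simplified flooring: raise tiny probabilities to a floor and
    renormalise (the floor value here is arbitrary)."""
    raised = {s: max(p, floor) for s, p in probs.items()}
    z = sum(raised.values())
    return {s: p / z for s, p in raised.items()}

print(escape_distribution({"a": 3, "b": 1}, alphabet="abc"))
print(floored_distribution({"a": 0.7, "b": 0.299999, "c": 0.000001}))
```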

It should be stressed that these differences appear to be implementation differences rather than deep differences; there appears to be no reason why standard HMM initialisation, training and evaluation algorithms cannot be used with multiple final states and escape methods. HTK appears to be able to handle medium-sized character sets and models, such as those generated by order 2 or order 3 PPM on Western languages (Eastern languages with very large character sets present a different problem). This may be because, while there are many more states, in PPM-generated models the number of transitions out of a state is independent of the size of the model.

4 USING HTK
The HMM package we are working with is HTK (the HMM ToolKit) [11]. Built for speech recognition, it expects continuous, ordinal, temporal signals from a microphone or similar device. The generalised tag-insertion problem is posed in terms of discrete, cardinal, sequential signals (i.e. characters), so the input must be transformed.




Figure 5: The transformations of a signal

Figure 5 shows the transformation of the signal. Initially the characters are translated to pluses, as shown in (A). A series of assumptions is then made about the input to transform it into the desired signal (D). First, the characters are assumed to be equally spaced with respect to time (B), an assumption of temporality. Secondly, the characters are assumed to represent all intermediate values (C), an assumption of ordinality. Lastly, the characters are assumed to be continuous (D), an assumption of continuity. None of these assumptions is actually true, but none of the functionality of HTK that we depend upon relies on them holding. So long as both the training text (the example text we use to build the model) and the testing text (the actual text we want to insert tags into) undergo the same transformations, everything runs smoothly. Indeed, because the PPMM and the HMM are equivalent and the PPMM does not rely on these assumptions, we can be certain that if the HMM relies on the assumptions then it is an implementation detail rather than an algorithmic assumption. There is a slight chance that some of the improvements and optimisations that come with a stable, mature implementation may be undermined by the assumptions. We will know whether this is the case only after experimentation.
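A minimal sketch of this transformation, assuming that each character is coded as its position in the alphabet and that the samples are evenly spaced, is shown below. The coding and the sample period are illustrative; they are not HTK's actual input format.

```python
def characters_to_samples(text, alphabet, sample_period=1.0):
    """Map each character to a numeric level (its position in the alphabet)
    and treat the levels as evenly spaced samples of a continuous signal."""
    levels = {ch: i for i, ch in enumerate(sorted(set(alphabet)))}
    return [(i * sample_period, float(levels[ch])) for i, ch in enumerate(text)]

# The training text and the text to be tagged must both go through the
# same transformation, as noted above.
for t, value in characters_to_samples("abba", alphabet="ab"):
    print(f"t={t:.1f}  level={value}")
```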


5 CONCLUSION
We have examined the generalised tag-insertion problem and noted that previous work focused on using PPM to solve it. We observed that PPMMs are a subset of HMMs and that the PPMMs built for solving the generalised tag-insertion problem must be transformed into stateful finite automata. We also examined the transformation needed to convert the input from a discrete, cardinal, sequential signal into a continuous, ordinal, temporal signal, and noted that the assumptions made in this transformation were not true, but that this did not matter. Having transformed the PPMM into an HMM, we now have access to the Baum-Welch algorithm and to model-rewriting algorithms, which we hope to use to increase the accuracy of our solutions to the generalised tag-insertion problem.


6 REFERENCES
1 L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37, pages 1554-1563, 1966.

2 Robert Claiborne. The Life and Times of the English Language: The History of Our Marvellous Native Tongue. Bloomsbury, 1983.

3 John G. Cleary and Ian H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4), pages 396-402, 1984.

4 Paul Glor Howard. The Design and Analysis of Efficient Lossless Data Compression Systems. Ph.D. thesis, Department of Computer Science, Brown University, May 1993.

5 Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 1989.

6 William J. Teahan and John G. Cleary. Tag based models of English text. Proceedings of the Data Compression Conference, Snowbird, Utah, 1998.

7 William J. Teahan, Yingying Wen, Roger McNab and Ian H. Witten. A compression-based algorithm for Chinese word segmentation. Computational Linguistics, 26(3), pages 375-393, 2000.

8 Andrew J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13, pages 260-269, 1967.

9 Ian H. Witten and Timothy C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4), pages 1085-1094, 1991.

10 Stuart A. Yeates, Ian H. Witten and David Bainbridge. Tag insertion complexity. Proceedings of the Data Compression Conference, Snowbird, Utah, 2001.

11 Steve Young, Dan Kershaw, Julian Odell, Dave Ollason, Valtcho Valtchev and Phil Woodland. The HTK Book, Version 3.0. July 2000.
