
Hidden Markov Models Applied to Information Extraction

Part I: Concept

HMM Tutorial

Part II: Sample Application

AutoBib: web information extraction


Larry Reeve, INFO629: Artificial Intelligence, Dr. Weber, Fall 2004

Part I: Concept HMM Motivation

The real world has structures and processes which have (or produce) observable outputs:
- Usually sequential (the process unfolds over time)
- Cannot see the events producing the output

Example: speech signals

Problem: how to construct a model of the structure or process given only observations

HMM Background

Basic theory developed and published in 1960s and 70s


No widespread understanding and application until the late 80s. Why?

- Theory was published in mathematics journals that were not widely read by practicing engineers
- Insufficient tutorial material existed for readers to understand and apply the concepts

HMM Uses

Uses

Speech recognition

Recognizing spoken words and phrases

Text processing

Parsing raw records into structured records

Bioinformatics

Protein sequence prediction

Financial

Stock market forecasts (price pattern prediction)

Comparison shopping services

HMM Overview

- Machine learning method
- Makes use of state machines
- Based on probabilistic models
- Useful in problems having sequential steps
- Can only observe output from states, not the states themselves

State machine:

Example: speech recognition


- Observe: acoustic signals
- Hidden states: phonemes


(distinctive sounds of a language)

Observable Markov Model Example


Weather state transition matrix (rows = today's weather, columns = tomorrow's):

          Rainy   Cloudy   Sunny
Rainy      0.4     0.3      0.3
Cloudy     0.2     0.6      0.2
Sunny      0.1     0.1      0.8

Once each day the weather is observed:
- State 1: rainy
- State 2: cloudy
- State 3: sunny

What is the probability the weather for the next 7 days will be:

sun, sun, rain, rain, sun, cloudy, sun

Each state corresponds to a physical observable event
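The 7-day question above is answered by multiplying transition probabilities along the sequence. A minimal sketch in Python, assuming (as in the example) that today is sunny with probability 1:

```python
# Transition matrix from the weather example: rows = today, columns = tomorrow.
A = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

def sequence_probability(seq, start):
    """P(seq | model), starting in state `start` with probability 1."""
    p, prev = 1.0, start
    for s in seq:
        p *= A[prev][s]   # chain rule for a first-order Markov chain
        prev = s
    return p

# "sun, sun, rain, rain, sun, cloudy, sun" for the next 7 days, given today is sunny
obs = ["sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]
print(sequence_probability(obs, "sunny"))
# 0.8 * 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 = 1.536e-4
```

Note that every event here is directly observable; the hidden-state machinery comes in only with the HMM examples that follow.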

Observable Markov Model

Hidden Markov Model Example

Coin toss:
- Heads/tails sequence produced with 2 coins
- You are in a room, with a wall
- A person behind the wall flips a coin and tells you the result

- Coin selection and tossing are hidden
- You cannot observe the events, only the output (heads, tails) from the events

The problem is then to build a model to explain the observed sequence of heads and tails

HMM Components

- A set of states (xs)
- A set of possible output symbols (ys)
- A state transition matrix (as): probability of making a transition from one state to the next
- An output emission matrix (bs): probability of emitting/observing a symbol at a particular state
- An initial probability vector: probability of starting at a particular state (not shown; sometimes assumed to be 1)
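The five components can be written down concretely for the two-coin example. A minimal sketch; all probability values below are illustrative assumptions, not taken from the slides:

```python
# The five HMM components for the hidden two-coin example.
# All numeric values are made-up illustrative assumptions.
states  = ["coin1", "coin2"]   # hidden states (xs)
symbols = ["H", "T"]           # observable output symbols (ys)

# State transition matrix (as): P(next state | current state)
A = {"coin1": {"coin1": 0.7, "coin2": 0.3},
     "coin2": {"coin1": 0.4, "coin2": 0.6}}

# Output emission matrix (bs): P(symbol | state); coin2 is biased toward heads
B = {"coin1": {"H": 0.5, "T": 0.5},
     "coin2": {"H": 0.8, "T": 0.2}}

# Initial probability vector: P(starting state)
pi = {"coin1": 0.6, "coin2": 0.4}

# Sanity check: every probability distribution sums to 1
for dist in [pi, *A.values(), *B.values()]:
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```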


Common HMM Types

Ergodic (fully connected):

Every state of model can be reached in a single step from every other state of the model

Bakis (left-right):

As time increases, states proceed from left to right

HMM Core Problems

Three problems must be solved for HMMs to be useful in real-world applications:
1) Evaluation
2) Decoding
3) Learning

HMM Evaluation Problem

Purpose: score how well a given model matches a given observation sequence

Example (Speech recognition):

Assume HMMs (models) have been built for the words "home" and "work".

Given a speech signal, evaluation can determine the probability each model represents the utterance
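Evaluation is solved efficiently by the forward algorithm, which sums over all hidden paths in O(N²T) time instead of enumerating them. A minimal sketch, reusing the two-coin example; all model parameters are illustrative assumptions:

```python
# Forward algorithm: compute P(observations | model) by summing over all
# hidden state paths. Model parameters are illustrative assumptions.
states = ["coin1", "coin2"]
A  = {"coin1": {"coin1": 0.7, "coin2": 0.3}, "coin2": {"coin1": 0.4, "coin2": 0.6}}
B  = {"coin1": {"H": 0.5, "T": 0.5}, "coin2": {"H": 0.8, "T": 0.2}}
pi = {"coin1": 0.6, "coin2": 0.4}

def forward(obs):
    # alpha[s] = P(obs seen so far, currently in state s)
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * A[r][s] for r in states) * B[s][o]
                 for s in states}
    return sum(alpha.values())

print(forward(["H", "H", "T"]))
```

To score competing word models against one utterance, run `forward` under each model and compare the resulting probabilities.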

HMM Decoding Problem

Given a model and a set of observations, what are the hidden states most likely to have generated the observations?

Useful to learn about internal model structure, determine state statistics, and so forth

HMM Learning Problem

Goal is to learn HMM parameters (training)

- State transition probabilities
- Observation probabilities at each state

Training is crucial:

it allows optimal adaptation of model parameters to training data observed from real-world phenomena

- No known method obtains globally optimal parameters from data; only approximations exist
- Can be a bottleneck in HMM usage

HMM Concept Summary

- Build models representing the hidden states of a process or structure, using only observations
- Use the models to evaluate the probability that a model represents a particular observation sequence
- Use the evaluation information in an application to recognize speech, parse addresses, and many other applications

Part II: Application AutoBib System

Provides a uniform view of several computer science bibliographic web data sources: an automated web information extraction system that requires little human input

- Web pages are designed differently from site to site
- IE requires training samples

HMMs are used to parse unstructured bibliographic records into a structured format (an NLP task)

Web Information Extraction: Converting Raw Records

Approach
1) Provide seed database of structured records 2) Extract raw records from relevant Web pages 3) Match structured records to raw records

To build training samples

4) Train HMM-based parser 5) Parse unmatched raw recs into structured recs 6) Merge new structured records into database

AutoBib Architecture

Step 1 - Seeding

Provide seed database of structured records

- Take a small collection of BibTeX-format records and insert them into the database
- A cleaning step normalizes record fields

Examples:

- Proc. → Proceedings
- Jan → January

Manual step, executed once only

Step 2 - Extract Raw Records

Extract raw records from relevant Web pages

User specifies:
- Web pages to extract from
- How to follow next-page links for multiple pages

Raw records are extracted

Uses record-boundary discovery techniques:
- Subtree of Interest = largest subtree of HTML tags
- Record separators = frequent HTML tags

Tokenized Records

(Replace all HTML tags with ^)
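The tokenization step above can be sketched with a regular expression. The `^` separator comes from the slides; the tag pattern, helper name, and sample record are assumptions for illustration:

```python
import re

def tokenize_record(html):
    """Replace every HTML tag with the '^' separator token, then split
    the record into whitespace-delimited tokens."""
    no_tags = re.sub(r"<[^>]+>", "^", html)
    # Collapse runs of separators and surrounding whitespace into one ' ^ '
    no_tags = re.sub(r"\s*\^\s*(\^\s*)*", " ^ ", no_tags)
    return no_tags.strip().split()

# Hypothetical raw bibliographic record scraped from a results page
raw = "<li><b>J. Smith</b>, <i>An Example Title</i>, 2004.</li>"
print(tokenize_record(raw))
```

The resulting token stream, with `^` marking where tags stood, is what the matching and parsing steps operate on.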

Step 3 - Matching

Match raw records R to structured records S by applying 4 heuristic tests:

1) At least one author in R matches an author in S
2) S.year must appear in R
3) If S.pages exists, R must contain it
4) S.title is approximately contained in R (Levenshtein edit-distance approximate string matching)
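Test 4 relies on Levenshtein edit distance. A standard dynamic-programming sketch (the thresholding used for "approximately contained" is not specified in the slides, so only the distance itself is shown):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Proceedings", "Proceedngs"))  # one deleted character -> 1
```

A title match can then be declared when the distance between S.title and some substring of R falls below a small threshold relative to the title length.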

Step 4 - Parser Training

Train HMM-based parser

- For each pair of R and S that match, annotate tokens in the raw record with field names
- Annotated raw records are fed into the HMM parser in order to learn:
  - State transition probabilities
  - Symbol probabilities at each state
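Because matched records arrive already annotated with field names, the parameters can be estimated by straightforward maximum-likelihood counting over the labeled sequences. A minimal sketch of that idea; the field names and sample records are illustrative, not from the paper:

```python
from collections import Counter, defaultdict

def train_supervised(annotated):
    """Estimate transition and emission probabilities by counting over
    sequences of (token, state-label) pairs."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for seq in annotated:
        prev = "start"
        for token, state in seq:
            trans[prev][state] += 1    # count state-to-state transitions
            emit[state][token] += 1    # count symbol emissions per state
            prev = state
        trans[prev]["end"] += 1
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    A = {s: normalize(c) for s, c in trans.items()}
    B = {s: normalize(c) for s, c in emit.items()}
    return A, B

# Two tiny annotated raw records (illustrative only)
data = [
    [("Smith", "author"), (",", "author-delimiter"), ("HMMs", "title"), ("2004", "year")],
    [("Jones", "author"), (",", "author-delimiter"), ("Parsing", "title"), ("1999", "year")],
]
A, B = train_supervised(data)
print(A["author"])  # every author here is followed by a delimiter
```

In practice smoothing would be added so unseen tokens do not get zero probability, but the counting scheme is the core of supervised HMM training.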

Parser Training, continued

Key consideration is HMM structure for navigating record fields (fields, delimiters)

Special states

start, end

Normal states

author, title, year, etc.

Best structure found:


Have multiple delimiter and tag states, one for each normal state

Example: author-delimiter, author-tag

Sample HMM
(Method 3)

Source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg

Step 5 - Conversion

- Parse unmatched raw records into structured records using the HMM parser
- Matched raw records can be converted directly without parsing, because they were annotated in the matching step

Step 6 - Merging

- Merge new structured records into the database
- The initial seed database has now grown
- New records will be used for improved matching on the next run

Evaluation

Success rate:

    (# of tokens labeled by HMM) / (# of tokens labeled by a person)

- DBLP (Computer Science Bibliography): 98.9%
- CSWD (CompuScience WWW-Database): 93.4%

HMM Advantages / Disadvantages

Advantages

- Effective
- Can handle variations in record structure:
  - Optional fields
  - Varying field ordering

Disadvantages

- Requires training using annotated data:
  - Not completely automatic
  - May require manual markup
  - Size of training data may be an issue

Other methods

Wrappers:
- Specification of areas of interest on a Web page
- Hand-crafted

Wrapper induction:
- Requires manual training
- Not always accommodating to changing structure
- Syntax-based; no semantic labeling

Application to Other Domains

E-Commerce

Comparison shopping sites:
- Extract product/pricing information from many sites
- Convert the information into a structured format and store it
- Provide an interface to look up product information and then display pricing information gathered from many sites

Saves users time: rather than navigating to and searching many sites, users can consult a single site

References

Concept:

Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.

Application:

Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS04), 193-204.
