
Hidden Markov Models Applied to Information Extraction

Part I: Concept

HMM Tutorial

Part II: Sample Application

AutoBib: web information extraction


Larry Reeve, INFO629: Artificial Intelligence, Dr. Weber, Fall 2004

Part I: Concept HMM Motivation

The real world has structures and processes which have (or produce) observable outputs:
- Usually sequential (the process unfolds over time)
- Cannot see the events producing the output

Example: speech signals

Problem: how to construct a model of the structure or process given only observations

HMM Background

Basic theory developed and published in 1960s and 70s


No widespread understanding and application until the late 80s. Why?

- Theory was published in mathematics journals that were not widely read by practicing engineers
- Insufficient tutorial material existed for readers to understand and apply the concepts

HMM Uses

Uses

Speech recognition

Recognizing spoken words and phrases

Text processing

Parsing raw records into structured records

Bioinformatics

Protein sequence prediction

Financial

Stock market forecasts (price pattern prediction)

Comparison shopping services

HMM Overview

- Machine learning method
- Makes use of state machines
- Based on probabilistic models
- Useful in problems having sequential steps
- Can only observe output from states, not the states themselves

State machine:

Example: speech recognition


- Observe: acoustic signals
- Hidden states: phonemes


(distinctive sounds of a language)

Observable Markov Model Example


Weather state transition matrix (rows = today's weather, columns = tomorrow's):

          Rainy   Cloudy   Sunny
Rainy      0.4     0.3      0.3
Cloudy     0.2     0.6      0.2
Sunny      0.1     0.1      0.8

Once each day the weather is observed:
- State 1: rainy
- State 2: cloudy
- State 3: sunny

What is the probability the weather for the next 7 days will be:

sun, sun, rain, rain, sun, cloudy, sun

Each state corresponds to a physical observable event
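The 7-day question above is answered by multiplying transition probabilities along the sequence. A minimal sketch in Python, assuming (as in the example) that today is sunny with probability 1:

```python
# Transition matrix from the weather example: rows = today, columns = tomorrow.
A = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

def sequence_probability(seq, start):
    """P(seq | model), starting in state `start` with probability 1."""
    p, prev = 1.0, start
    for s in seq:
        p *= A[prev][s]   # chain rule for a first-order Markov chain
        prev = s
    return p

# "sun, sun, rain, rain, sun, cloudy, sun" for the next 7 days, given today is sunny
obs = ["sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]
print(sequence_probability(obs, "sunny"))
# 0.8 * 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 = 1.536e-4
```

Note that every event here is directly observable; the hidden-state machinery comes in only with the HMM examples that follow.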

Observable Markov Model

Hidden Markov Model Example

Coin toss:
- Heads/tails sequence produced with 2 coins
- You are in a room, with a wall
- A person behind the wall flips a coin and tells you the result

- Coin selection and tossing are hidden
- You cannot observe the events, only the output (heads, tails) from the events

The problem is then to build a model to explain the observed sequence of heads and tails

HMM Components

- A set of states (xs)
- A set of possible output symbols (ys)
- A state transition matrix (as): probability of making a transition from one state to the next
- An output emission matrix (bs): probability of emitting/observing a symbol at a particular state
- An initial probability vector: probability of starting at a particular state (not shown; sometimes assumed to be 1)
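The five components can be written down concretely for the two-coin example. A minimal sketch; all probability values below are illustrative assumptions, not taken from the slides:

```python
# The five HMM components for the hidden two-coin example.
# All numeric values are made-up illustrative assumptions.
states  = ["coin1", "coin2"]   # hidden states (xs)
symbols = ["H", "T"]           # observable output symbols (ys)

# State transition matrix (as): P(next state | current state)
A = {"coin1": {"coin1": 0.7, "coin2": 0.3},
     "coin2": {"coin1": 0.4, "coin2": 0.6}}

# Output emission matrix (bs): P(symbol | state); coin2 is biased toward heads
B = {"coin1": {"H": 0.5, "T": 0.5},
     "coin2": {"H": 0.8, "T": 0.2}}

# Initial probability vector: P(starting state)
pi = {"coin1": 0.6, "coin2": 0.4}

# Sanity check: every probability distribution sums to 1
for dist in [pi, *A.values(), *B.values()]:
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```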


Common HMM Types

Ergodic (fully connected):

Every state of model can be reached in a single step from every other state of the model

Bakis (left-right):

As time increases, states proceed from left to right

HMM Core Problems

Three problems must be solved for HMMs to be useful in real-world applications:
1) Evaluation
2) Decoding
3) Learning

HMM Evaluation Problem

Purpose: score how well a given model matches a given observation sequence

Example (Speech recognition):

Assume HMMs (models) have been built for the words "home" and "work".

Given a speech signal, evaluation can determine the probability each model represents the utterance
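Evaluation is solved efficiently by the forward algorithm, which sums over all hidden paths in O(N²T) time instead of enumerating them. A minimal sketch, reusing the two-coin example; all model parameters are illustrative assumptions:

```python
# Forward algorithm: compute P(observations | model) by summing over all
# hidden state paths. Model parameters are illustrative assumptions.
states = ["coin1", "coin2"]
A  = {"coin1": {"coin1": 0.7, "coin2": 0.3}, "coin2": {"coin1": 0.4, "coin2": 0.6}}
B  = {"coin1": {"H": 0.5, "T": 0.5}, "coin2": {"H": 0.8, "T": 0.2}}
pi = {"coin1": 0.6, "coin2": 0.4}

def forward(obs):
    # alpha[s] = P(obs seen so far, currently in state s)
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * A[r][s] for r in states) * B[s][o]
                 for s in states}
    return sum(alpha.values())

print(forward(["H", "H", "T"]))
```

To score competing word models against one utterance, run `forward` under each model and compare the resulting probabilities.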

HMM Decoding Problem

Given a model and a set of observations, what are the hidden states most likely to have generated the observations?

Useful to learn about internal model structure, determine state statistics, and so forth

HMM Learning Problem

Goal is to learn HMM parameters (training)

- State transition probabilities
- Observation probabilities at each state

Training is crucial:

it allows optimal adaptation of model parameters to training data observed from real-world phenomena

- No known method obtains globally optimal parameters from data; only approximations exist
- Can be a bottleneck in HMM usage

HMM Concept Summary

- Build models representing the hidden states of a process or structure, using only observations
- Use the models to evaluate the probability that a model represents a particular observation sequence
- Use the evaluation information in an application to recognize speech, parse addresses, and many other applications

Part II: Application AutoBib System

Provides a uniform view of several computer science bibliographic web data sources: an automated web information extraction system that requires little human input

- Web pages are designed differently from site to site
- IE requires training samples

HMMs are used to parse unstructured bibliographic records into a structured format (an NLP task)

Web Information Extraction: Converting Raw Records

Approach
1) Provide seed database of structured records 2) Extract raw records from relevant Web pages 3) Match structured records to raw records

To build training samples

4) Train HMM-based parser 5) Parse unmatched raw recs into structured recs 6) Merge new structured records into database

AutoBib Architecture

Step 1 - Seeding

Provide seed database of structured records

- Take a small collection of BibTeX-format records and insert them into the database
- A cleaning step normalizes record fields

Examples:

- Proc. → Proceedings
- Jan → January

Manual step, executed once only

Step 2 - Extract Raw Records

Extract raw records from relevant Web pages

User specifies:
- Web pages to extract from
- How to follow next-page links for multiple pages

Raw records are extracted

Uses record-boundary discovery techniques:
- Subtree of Interest = largest subtree of HTML tags
- Record separators = frequent HTML tags

Tokenized Records

(Replace all HTML tags with ^)
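The tokenization step above can be sketched with a regular expression. The `^` separator comes from the slides; the tag pattern, helper name, and sample record are assumptions for illustration:

```python
import re

def tokenize_record(html):
    """Replace every HTML tag with the '^' separator token, then split
    the record into whitespace-delimited tokens."""
    no_tags = re.sub(r"<[^>]+>", "^", html)
    # Collapse runs of separators and surrounding whitespace into one ' ^ '
    no_tags = re.sub(r"\s*\^\s*(\^\s*)*", " ^ ", no_tags)
    return no_tags.strip().split()

# Hypothetical raw bibliographic record scraped from a results page
raw = "<li><b>J. Smith</b>, <i>An Example Title</i>, 2004.</li>"
print(tokenize_record(raw))
```

The resulting token stream, with `^` marking where tags stood, is what the matching and parsing steps operate on.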

Step 3 - Matching

Match raw records R to structured records S by applying 4 heuristic tests:

1) At least one author in R matches an author in S
2) S.year must appear in R
3) If S.pages exists, R must contain it
4) S.title is approximately contained in R (Levenshtein edit-distance approximate string matching)
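Test 4 relies on Levenshtein edit distance. A standard dynamic-programming sketch (the thresholding used for "approximately contained" is not specified in the slides, so only the distance itself is shown):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Proceedings", "Proceedngs"))  # one deleted character -> 1
```

A title match can then be declared when the distance between S.title and some substring of R falls below a small threshold relative to the title length.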

Step 4 - Parser Training

Train HMM-based parser

- For each pair of R and S that match, annotate tokens in the raw record with field names
- Annotated raw records are fed into the HMM parser in order to learn:
  - State transition probabilities
  - Symbol probabilities at each state
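Because matched records arrive already annotated with field names, the parameters can be estimated by straightforward maximum-likelihood counting over the labeled sequences. A minimal sketch of that idea; the field names and sample records are illustrative, not from the paper:

```python
from collections import Counter, defaultdict

def train_supervised(annotated):
    """Estimate transition and emission probabilities by counting over
    sequences of (token, state-label) pairs."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for seq in annotated:
        prev = "start"
        for token, state in seq:
            trans[prev][state] += 1    # count state-to-state transitions
            emit[state][token] += 1    # count symbol emissions per state
            prev = state
        trans[prev]["end"] += 1
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    A = {s: normalize(c) for s, c in trans.items()}
    B = {s: normalize(c) for s, c in emit.items()}
    return A, B

# Two tiny annotated raw records (illustrative only)
data = [
    [("Smith", "author"), (",", "author-delimiter"), ("HMMs", "title"), ("2004", "year")],
    [("Jones", "author"), (",", "author-delimiter"), ("Parsing", "title"), ("1999", "year")],
]
A, B = train_supervised(data)
print(A["author"])  # every author here is followed by a delimiter
```

In practice smoothing would be added so unseen tokens do not get zero probability, but the counting scheme is the core of supervised HMM training.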

Parser Training, continued

Key consideration is HMM structure for navigating record fields (fields, delimiters)

Special states

start, end

Normal states

author, title, year, etc.

Best structure found:


Have multiple delimiter and tag states, one for each normal state

Example: author-delimiter, author-tag

Sample HMM
(Method 3)

Source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg

Step 5 - Conversion

- Parse unmatched raw records into structured records using the HMM parser
- Matched raw records can be converted directly without parsing, because they were annotated in the matching step

Step 6 - Merging

- Merge new structured records into the database
- The initial seed database has now grown
- New records will be used for improved matching on the next run

Evaluation

Success rate:

    (# of tokens labeled by HMM) / (# of tokens labeled by a person)

- DBLP (Computer Science Bibliography): 98.9%
- CSWD (CompuScience WWW-Database): 93.4%

HMM Advantages / Disadvantages

Advantages

- Effective
- Can handle variations in record structure:
  - Optional fields
  - Varying field ordering

Disadvantages

- Requires training using annotated data:
  - Not completely automatic
  - May require manual markup
  - Size of training data may be an issue

Other methods

Wrappers:
- Specification of areas of interest on a Web page
- Hand-crafted

Wrapper induction:
- Requires manual training
- Not always accommodating to changing structure
- Syntax-based; no semantic labeling

Application to Other Domains

E-Commerce

Comparison shopping sites:
- Extract product/pricing information from many sites
- Convert the information into a structured format and store it
- Provide an interface to look up product information and then display pricing information gathered from many sites

Saves users time: rather than navigating to and searching many sites, users can consult a single site

References

Concept:

Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.

Application:

Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS04), 193-204.
