
PLSI

Aim :
Given a set of documents d1, d2, ..., dN and a query string q = t1, t2, ..., t|q|,
we want to find the set of documents that are most likely to contain some
or all of the terms in the query string. We call such documents the relevant set
of documents with respect to the given query q.
Traditional approach using VSM
Uses a term-document matrix to represent each document as a vector of
terms.
The query string is also represented as a vector in the same term space.
The similarity between the query vector and each document vector is computed
using a similarity measure, and the most similar documents are returned as the relevant set (see the sketch below).
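
A minimal sketch of this retrieval step in Python using cosine similarity; the small term-document matrix, vocabulary, and query below are made-up examples, not taken from the original slides:

```python
import numpy as np

# Hypothetical term-document matrix: rows = terms, columns = documents,
# entry (i, j) = frequency of term i in document j.
terms = ["cat", "dog", "fish"]
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0]])

# Query "cat fish" represented as a vector in the same term space.
q = np.array([1.0, 0.0, 1.0])

# Cosine similarity between the query and every document column.
sims = (q @ A) / (np.linalg.norm(q) * np.linalg.norm(A, axis=0))

# Documents ranked by similarity; the top-ranked ones form the relevant set.
print(np.argsort(-sims), sims)
```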
Problem with VSM model
As the number of documents increases, the number of distinct terms also grows rapidly.
This increases the number of rows in the term-document matrix; in
other words, the dimension of each document vector in the corpus increases. This is
known as the Curse of Dimensionality problem.
Reducing Dimensions using LSI
LSI finds a low-rank approximation to the term-document matrix.
This is done using a Singular Value Decomposition (SVD) of the term-document
matrix, as sketched below.
The consequence of the low-rank approximation is that some dimensions are
combined and depend on more than one term. This mitigates the
problem of identifying synonymy, as the rank lowering is expected to
merge the dimensions associated with terms that have similar meanings.
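
A sketch of the low-rank approximation with NumPy's SVD; the rank k and the matrix A are placeholders to be supplied by the caller:

```python
import numpy as np

def lsi_embed(A, k):
    """Truncate the SVD of the term-document matrix A to rank k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]   # keep the k largest singular values
    A_k = U_k @ np.diag(s_k) @ Vt_k               # rank-k approximation of A
    doc_coords = np.diag(s_k) @ Vt_k              # documents in the k-dimensional concept space
    return A_k, U_k, s_k, doc_coords

# A query vector q can be folded into the same space with
# q_k = np.diag(1.0 / s_k) @ U_k.T @ q before computing similarities.
```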
Limitations of LSI
LSI cannot capture polysemy (i.e., multiple meanings of a word), because every occurrence of a
word is treated as having the same meaning: the word is represented as a single
point in space.
LSI inherits the limitations of the bag-of-words (BOW) model, where a text is represented as an unordered
collection of words. To address some of these limitations, a
multi-gram dictionary can be used to find direct and indirect associations as well as higher-order
co-occurrences among terms.
LSI assumes that words and documents form a joint Gaussian model, while a Poisson
distribution has been observed in practice. Thus, a newer alternative is probabilistic latent semantic
indexing, based on a multinomial model, which is reported to give better results than standard
LSI.
Probabilistic Latent Semantic Indexing
PLSI is based on a generative probabilistic model.
Documents generate a particular distribution of aspects (topics).
Aspects generate a particular distribution of word usage.

Graphical model: D → Z → W, where the first arrow carries P(Z = z | D = d) and the second carries P(W = w | Z = z).
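
A toy sketch of this generative chain, assuming the two conditional distributions are available as arrays P_z_given_d (K × N, columns sum to 1) and P_w_given_z (M × K, columns sum to 1); the array names are illustrative, not from the original slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_word(d, P_z_given_d, P_w_given_z):
    """Generate one word for document d: draw a topic z from P(Z | D = d),
    then draw a word w from P(W | Z = z)."""
    K = P_z_given_d.shape[0]
    M = P_w_given_z.shape[0]
    z = rng.choice(K, p=P_z_given_d[:, d])
    w = rng.choice(M, p=P_w_given_z[:, z])
    return z, w
```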


PLSI Model
PLSI considers that our data can be expressed in terms of 3 sets of variables:

Documents: d ∈ D = { d1, d2, ..., dN } — observed variables

Words: w ∈ W = { w1, w2, ..., wM } — observed variables
Topics: z ∈ Z = { z1, z2, ..., zK } — latent variables
PLSI Model
The general structure of the PLSI model. This shows the intermediate layer of latent topics that links the documents and the words: each document can be represented as a mixture of concepts weighted by the probability P(z|d), and each word expresses a topic with probability P(w|z).


Defining the model
The model can be completely defined by specifying the joint distribution:

P(d, w) = P(d) P(w|d)    (1)

P(w|d) = Σ_{z ∈ Z} P(w|z) P(z|d)    (2)

P(d, w) = P(d) Σ_{z ∈ Z} P(w|z) P(z|d)    (3)
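
Equation (2) can be read as a matrix factorization: stacking P(w|z) into an M × K matrix and P(z|d) into a K × N matrix, their product gives P(w|d) for every word-document pair. A minimal sketch (array names are illustrative):

```python
import numpy as np

def predictive_probabilities(P_w_given_z, P_z_given_d):
    """Eq. (2) for all pairs at once: P(w|d) = sum_z P(w|z) P(z|d).

    P_w_given_z : (M, K) array, column z holds P(w|z) and sums to 1.
    P_z_given_d : (K, N) array, column d holds P(z|d) and sums to 1.
    Returns an (M, N) array whose column d is the distribution P(w|d).
    """
    return P_w_given_z @ P_z_given_d
```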
Parameters of the model
Equation (2) is the mathematical representation of the mixture model
shown in the last figure.
The parameters of the model are P(w|z) and P(z|d); their numbers are
(M − 1)K and N(K − 1) respectively (the −1 in each factor comes from the
constraint that each distribution sums to one).
The parameters can be estimated via maximum likelihood estimation,
by finding those values of the parameters that maximize the predictive
probability of the observed word occurrences, i.e., P(w|d).
Estimating the parameters
The predictive probability of the PLSI mixture model is denoted by P(w|d),
so the objective function is given by the following expression:

L = Π_d Π_w P(w|d)^{n(d, w)}    (4)

n(d, w): observed frequencies, i.e., the number of times word w appears in
document d.
P(w|d): probability of word w being generated by document d.
Estimating the parameters
Plugging Eq. (2) into Eq. (4) and taking the log on both sides, we get:

log L = Σ_d Σ_w n(d, w) log Σ_{z ∈ Z} P(w|z) P(z|d)

This is a non-convex optimization problem; it can be solved by applying the
Expectation-Maximization (EM) algorithm to the log-likelihood.
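
A sketch of this log-likelihood as code, reusing the matrix shapes above; n_dw is the N × M count matrix, and the eps term is my own guard against log(0):

```python
import numpy as np

def log_likelihood(n_dw, P_w_given_z, P_z_given_d, eps=1e-12):
    """log L = sum_d sum_w n(d, w) * log sum_z P(w|z) P(z|d)."""
    P_w_given_d = P_w_given_z @ P_z_given_d          # (M, N), column d = P(w|d)
    return np.sum(n_dw * np.log(P_w_given_d.T + eps))
```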
Expectation Maximization (EM)
Start with some random values for the parameters P(w|z) and P(z|d).
Then repeat the following two steps until the values of the
parameters converge:
E-step: calculate the posterior probabilities of the latent variables given the observations and the
current parameter estimates.
M-step: update the parameters using the posterior probabilities from the E-step so as to increase log L.
Expectation Maximization (EM)
E-step:

P(z|d, w) = P(w|z) P(z|d) / Σ_{z'} P(w|z') P(z'|d)

M-step: in order to maximize log L, assign:

P(w|z) = Σ_d n(d, w) P(z|d, w) / Σ_{d, w'} n(d, w') P(z|d, w')

P(z|d) = Σ_w n(d, w) P(z|d, w) / n(d),  where n(d) = Σ_w n(d, w)
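
A compact sketch of one EM iteration implementing these updates with NumPy; the array names, shapes, and the eps guard are my own conventions, not the original author's code:

```python
import numpy as np

def em_step(n_dw, P_w_given_z, P_z_given_d, eps=1e-12):
    """One EM iteration for PLSI.

    n_dw        : (N, M) observed counts n(d, w).
    P_w_given_z : (M, K) current estimate of P(w|z).
    P_z_given_d : (K, N) current estimate of P(z|d).
    """
    # E-step: posterior P(z|d, w) proportional to P(w|z) P(z|d), normalized over z.
    num = P_w_given_z.T[:, None, :] * P_z_given_d[:, :, None]   # (K, N, M)
    post = num / (num.sum(axis=0, keepdims=True) + eps)

    # M-step: P(w|z) proportional to sum_d n(d, w) P(z|d, w).
    P_w_given_z = np.einsum('dw,zdw->wz', n_dw, post)
    P_w_given_z /= P_w_given_z.sum(axis=0, keepdims=True) + eps

    # M-step: P(z|d) = sum_w n(d, w) P(z|d, w) / n(d).
    P_z_given_d = np.einsum('dw,zdw->zd', n_dw, post)
    P_z_given_d /= P_z_given_d.sum(axis=0, keepdims=True) + eps

    return P_w_given_z, P_z_given_d
```

Iterating em_step until log L (e.g., from the sketch above) stops improving yields the maximum-likelihood estimates of P(w|z) and P(z|d).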
THANKS
