
GRD Journals | Global Research and Development Journal for Engineering | International Conference on Innovations in Engineering and Technology (ICIET) - 2016 | July 2016
e-ISSN: 2455-5703

Ranking of Document Recommendations from Conversations using Probabilistic Latent Semantic Analysis

1P. Velvizhi  2S. Aishwarya  3R. Bhuvaneswari
1,2,3Department of Computer Science & Engineering
1,2,3K.L.N. College of Engineering, Pottapalayam, Sivagangai 630612, India
Abstract
Information retrieval from documents is traditionally performed through text search; nowadays, efficient search relies on mining techniques. Here, speech is recognized in order to search for documents: a group of conversations is transcribed using Automatic Speech Recognition (ASR), the system converts speech to text using the FISHER tool, and the transcripts are stored in a database. Implicit queries are formulated in two stages, extraction and clustering. The domain of the conversations is structured through topic modeling, and keywords with high topic probability are extracted. Documents are then ranked using Probabilistic Latent Semantic Analysis (PLSA): clustering the keyword set yields subsets that together cover all the recommended topics, so that a precise document recommendation can be made for each topic. PLSA provides a ranking over the retrieved documents using weighted keywords, which reduces noise when searching for a topic. Enforcing both relevance and diversity ensures effective document retrieval. Finally, the text documents are converted back to speech using the eSpeak tool, producing the required conversational output.
Keywords: Keyword Extraction, Topic Modeling, Word Frequency, PLSA, Document Retrieval

I. INTRODUCTION
Much unpredictable information is available as documents, databases, or multimedia resources, and users' current activities do not initiate a search to access it. We therefore adopt a just-in-time retrieval system, which spontaneously recommends documents by analyzing users' current activities. Here, the activities are recorded as conversations, for instance conversations recorded in a meeting. A real-time Automatic Speech Recognition (ASR) system constructs implicit queries for document retrieval, and recommendations are drawn from the web or a local repository. A just-in-time retrieval system must construct implicit queries from conversations, which contain many more words than a typical query. For instance, suppose four people must make a list of all the items required to survive in the mountains: a short fragment of 120 seconds of conversation, containing about 600 words, is recorded and then split according to domains such as Job, Plan, or Business. The multiplicity of topics and speech disfluencies produce ASR noise. The main objective is to maintain multiple records of users' information needs. In this paper, the first step is extracting a relevant and diverse set of keywords; the keywords are then clustered into topic-specific queries ranked by importance. This topic-based clustering technique decreases the impact of ASR errors and increases the diversity of document recommendations. For instance, word frequency alone retrieves Wikipedia pages such as Job, Hiring, and Interview, whereas users would prefer coverage of fragments such as Job, Plan, and Business.
Relevance and diversity are addressed in three stages. First, keyword extraction retrieves the most salient words. Second, one or several implicit queries are built: a single mixed-topic query tends to retrieve irrelevant documents, whereas multiple queries maintain the diversity constraint. Third, the results are ranked, reordering the document list recommended to users. Previous methods for formulating implicit queries rely on word-frequency weights; other methods perform keyword extraction using topical similarity, but they do not enforce a topic-diversity constraint. From the ASR output, users' information needs are satisfied and the number of irrelevant words is reduced: once keywords are extracted, clustering builds topically separated queries, which are run independently rather than as a single topically mixed query. The results are then ranked and organized.
The paper is organized as follows. Section II-A reviews just-in-time retrieval systems and the policies used for query formulation, and Section II-B discusses keyword extraction methods. Section III presents the proposed technique for formulating implicit queries. Section IV introduces the data sets and compares keyword sets for document retrieval using crowdsourcing. Section V presents experimental results on keyword extraction and clustering.

All rights reserved by www.grdjournals.com

133

Ranking of Document Recommendations from Conversations using Probabilistic Latent Semantic Analysis
(GRDJE / CONFERENCE / ICIET - 2016 / 022)

II. JUST-IN-TIME RETRIEVAL

A just-in-time retrieval system marks a spontaneous shift from query-based information retrieval: users' activities are monitored to detect information needs and retrieve relevant documents. For this, implicit queries are extracted from the text of the activity rather than being explicitly issued. Many just-in-time retrieval systems and query-formulation methodologies have been proposed; among them, the Automatic Content Linking Device (ACLD) was designed as a document recommendation system.
A. Query Formulation
The first system to introduce document recommendation was the Fixit system, which monitors background searches on a database, relates symptoms to faults, and provides additional information about the current state. The Remembrance Agent was integrated into a text editor, from which it launched searches; the Watson system examines browsing history for document recommendation, formulating its queries from word frequency and writing style. Before the solutions proposed in this paper, the ACLD modeled users' information needs at regular time intervals, but that method used the entire set of words rather than an implicit query that reduces noise. The use of semantic similarity shows that, although it improves relevance, its high computational cost makes it unfit for just-in-time retrieval from a large repository.
This motivated the design of a keyword extraction methodology to model users' information needs. Even short conversation fragments include words related to several topics; the ASR transcript adds further errors, and poor keyword selection leads to improper queries, failing both relevance and user satisfaction. The extraction method proposed here therefore introduces diversity among keywords, together with a clustering technique that yields topically separated queries.
B. Keyword Extraction
Several methods automatically extract keywords from a text. The oldest technique ranks words by frequency. An alternative counts pairwise co-occurrence frequencies and ranks them, but such counts do not capture word meaning, so low-frequency words are ignored even when they jointly signal an important domain. For instance, the words Car, Wheel, Seat, and Passenger together indicate automobiles as a single domain.
To improve on frequency-based methods, lexical semantic information has been proposed. Semantic relations can be captured by a manual thesaurus such as WordNet or by Wikipedia, while topic models such as LSA, PLSA, and LDA are built automatically. Probabilistic Latent Semantic Analysis (PLSA) has been used to rank the words of a conversation transcript with a weighted pointwise mutual information scoring function, the ranking being based on topical similarity. Dependencies among selected words have been combined with PageRank, WordNet, or topical information, and the word2vec vector-space representation of words has been modeled with a neural network language model. This previous work considered topical similarity and dependencies among words, but did not explicitly reward diversity, so secondary topics in a conversation are lost.
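As a concrete illustration of the frequency-based baselines above, the following sketch ranks words by raw frequency and by windowed pairwise co-occurrence; the stopword list and window size are our own assumptions for illustration, not taken from the paper:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "to", "of", "and", "in", "is", "it",
             "for", "on", "with", "that", "this", "was"}

def tokens(text):
    """Lowercase, strip punctuation, drop stopwords."""
    words = (w.lower().strip(".,?!;:") for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

def frequency_keywords(fragment, k=5):
    """Oldest baseline (WF): rank the words of a fragment by raw frequency."""
    return [w for w, _ in Counter(tokens(fragment)).most_common(k)]

def cooccurrence_pairs(fragment, window=3, k=5):
    """Count pairwise co-occurrences inside a sliding window and rank them.
    As noted above, this still ignores word meaning, so rare but topical
    words tend to be dropped."""
    ws = tokens(fragment)
    pairs = Counter()
    for i, w in enumerate(ws):
        for v in ws[i + 1:i + window]:
            if v != w:
                pairs[tuple(sorted((w, v)))] += 1
    return [p for p, _ in pairs.most_common(k)]
```

Both functions operate on a single fragment; over a corpus the same counters would simply accumulate across fragments.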

III. IMPLICIT QUERY FORMULATION

The formulation of implicit queries is proposed as a two-stage process. The first stage is the extraction of keywords from the ASR transcripts for which documents must be recommended: all the topics covered in the conversation must be captured by the keywords, while irrelevant words are avoided so that ASR mistakes are filtered out. The second stage is a clustering technique that forms topically disjoint queries.
A. Diverse Keyword Extraction
Topic modeling is first used to build a topical representation of the fragment, and content words are then selected by topical similarity while also covering a diverse range of topics, as recommended by summarization techniques. The main advantage is that all the main topics of the conversation are covered: the proposed algorithm selects a smaller number of keywords from each topic in order to include more topics. This is useful for two reasons: it increases the variety of recommended documents, and the algorithm selects fewer noisy words than an algorithm that ignores diversity.
The proposed diverse keyword extraction includes three steps, as viewed in Fig. 1. First, topic modeling represents, for each word w, the distribution over topics z, noted p(z | w); the topics are not represented explicitly but through the topic modeling technique, and the domain is learned over the group of conversations. Second, the modeled topics are weighted for each conversation fragment. Third, from the weights thus determined, a keyword list w = [w1, ..., wk] is selected that covers all the important topics.
Topics are modeled using Probabilistic Latent Semantic Analysis (PLSA), which maintains the distribution p(z | w) over topics z for each word w. For each conversation fragment, the topics are weighted by averaging these probabilities over the fragment's words.
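The PLSA modeling step can be sketched as a standard EM fit on a fragment-by-word count matrix; inverting p(w | z) into p(z | w) below assumes a uniform topic prior, and the fragment topic weights follow the probability-average rule described above. This is an illustrative reconstruction, not the authors' exact training code:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=200, seed=0):
    """Minimal PLSA fit by EM on a fragments-by-vocabulary count matrix.
    Returns p(w | z) (topics-by-words) and p(z | d) (fragments-by-topics)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | d, w) ∝ p(z | d) p(w | z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (d, z, w)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        weighted = counts[:, None, :] * post                # n(d, w) p(z | d, w)
        # M-step: re-estimate both distributions from the responsibilities
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

def p_z_given_w(p_w_z):
    """p(z | w) ∝ p(w | z) p(z), here with a uniform p(z) assumption."""
    return p_w_z / (p_w_z.sum(axis=0, keepdims=True) + 1e-12)

def fragment_topic_weights(p_zw, word_ids):
    """Weight each topic for a fragment as the average p(z | w) of its words."""
    return p_zw[:, word_ids].mean(axis=1)
```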
If a conversation fragment t mentions a set of topics Z, and each word w of t contributes to a subset of those topics, the objective is to find a subset of k unique words that covers all the topics. A greedy algorithm finds a near-optimal solution within a reasonable time, at each step selecting the word that maximizes the coverage reward over all topics. Since keywords introduced by ASR errors tend to belong to topics with low weights, their selection probability is reduced. The keywords are thus extracted under the desired trade-off between relevance and diversity in the keyword set; a parameter is set to balance these constraints.
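One plausible instantiation of this greedy trade-off is sketched below: the reward sums, over topics, the fragment topic weight times the accumulated p(z | w) mass raised to a power lam < 1, so adding a word from an already-covered topic yields a diminishing gain and the set spreads over more topics. The reward form and the names beta, lam, and vocab are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def diverse_keywords(p_zw, beta, vocab, k, lam=0.75):
    """Greedily select k keywords maximizing a diminishing-returns
    topic-coverage reward: sum_z beta[z] * (sum_{w in S} p(z | w)) ** lam."""
    n_topics, n_words = p_zw.shape
    selected = []                      # indices of chosen words
    cover = np.zeros(n_topics)        # accumulated p(z | w) mass per topic

    def reward(c):
        return float(np.dot(beta, c ** lam))

    for _ in range(k):
        best, best_gain = None, -1.0
        for w in range(n_words):
            if w in selected:
                continue
            gain = reward(cover + p_zw[:, w]) - reward(cover)
            if gain > best_gain:
                best, best_gain = w, gain
        selected.append(best)
        cover += p_zw[:, best]
    return [vocab[w] for w in selected]
```

With lam = 1 the reward is linear and diversity is not rewarded; lowering lam toward 0.5 increasingly favors covering new topics, mirroring the D(.5)/D(.75)/D(1) variants evaluated later.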
B. Keyword Clustering
The keywords extracted by the diverse keyword extraction method respond to user needs in the form of topics. Diversity of topics and reduction of noise are maintained by splitting the keyword set into several topically disjoint subsets, every subset forming an implicit query used to retrieve documents. The subsets are formed by clustering topically similar keywords, as follows: for each main topic z of the conversation, keywords are ordered by decreasing values of p(z | w), the keywords with the highest values are assigned to topic z, and the resulting queries are themselves ranked by their topic weights.
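The clustering rule above, assigning each keyword to its highest-probability topic and ordering within a topic by decreasing p(z | w), can be sketched as follows (function and argument names are our own):

```python
import numpy as np

def cluster_keywords(keywords, p_zw, word_index, topic_weights, max_queries=4):
    """Split a keyword set into topically disjoint implicit queries.
    Each keyword goes to its argmax_z p(z | w) topic; within a topic,
    keywords are ordered by decreasing p(z | w); queries themselves are
    ordered by the fragment's topic weights."""
    clusters = {}
    for kw in keywords:
        col = p_zw[:, word_index[kw]]
        z = int(np.argmax(col))
        clusters.setdefault(z, []).append((kw, float(col[z])))
    ordered_topics = sorted(clusters, key=lambda z: -topic_weights[z])
    queries = []
    for z in ordered_topics[:max_queries]:
        queries.append([kw for kw, _ in sorted(clusters[z], key=lambda t: -t[1])])
    return queries
```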
C. Document Recommendation
For document retrieval, implicit queries are prepared for each conversation fragment; formulating multiple implicit queries improves the retrieval results. After ranking, the first d documents of each query are retrieved, and recommendation lists are prepared from the resulting document sets. Ranking based on topical similarity between documents and queries constitutes the baseline algorithm.
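Since the paper does not fix a particular scheme for combining the ranked lists of the multiple implicit queries into one recommendation list, the sketch below uses a simple round-robin merge of the top-d unique documents as one plausible choice:

```python
def merge_recommendations(result_lists, d=5):
    """Merge the ranked result lists of several implicit queries into one
    recommendation list of at most d unique documents, taking one document
    per query per rank (round-robin) and skipping duplicates."""
    merged, seen = [], set()
    for rank in range(max((len(r) for r in result_lists), default=0)):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
                if len(merged) == d:
                    return merged
    return merged
```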

IV. EXPERIMENTAL RESULTS

The diverse keyword extraction technique is compared here with other query formulation and keyword extraction methods; it extracts more relevant keywords and reduces ASR noise. We also compare the retrieval results obtained from the keyword lists when they are split into topically separated queries. The lists generated by the proposed method outperform existing methods.
A. Keyword Extraction Methods
Several versions of the diverse keyword extraction method D(λ) are considered, with λ ∈ {.5, .75, 1}, alongside Word Frequency (WF) and Topical Similarity (TS); the latter two do not ensure diversity. Among the values of λ, D(.5) yields the lowest scores. The values of λ can be grouped into three ranges, λ ≤ .7 represented by λ = .5, .7 < λ ≤ .8 represented by λ = .75, and .8 < λ ≤ 1 represented by λ = 1; we therefore consider only D(.5), D(.75), D(1), and TS.
B. Topical Diversity
Four keyword extraction methods are compared: WF, TS, D(.75), and D(.5). The average diversity values of D(.75) and D(.5) are similarly high, above those of WF and TS. The values for TS are low and increase gradually with the number of keywords, showing that TS ensures topic diversity only for large keyword sets, whereas the values for WF remain uniform throughout, since WF does not consider topics at all.
C. Relevant Keywords
Binary comparisons between the extraction methods (WF vs. TS, D(.5) vs. D(.75), and so on) are performed using crowdsourcing, the goal being to rank the four extraction techniques while excluding redundant comparisons. Human judgments yield the ranking D(.75) > TS > WF > D(.5).
D. Noise Reduction
The comparison values obtained on ASR transcripts are similar to those obtained on manual transcripts, except that WF degrades due to ASR noise; D(.75) still outperforms TS, and the ranking remains D(.75) > TS > WF > D(.5).
The keyword lists of each method are then inspected to detect ASR errors; the noise level varies from 5% to 50% across methods. D(.75) selects fewer noisy keywords than TS and WF: WF selects only high-frequency words regardless of topic, while for both TS and D(.75) the selection probability of noisy words in irrelevant topics is reduced. The advantage of D(.75) over TS is that noisy keyword selection is reduced further.
E. Retrieval Scores
The retrieval results of the queries built from the keyword lists are again compared in binary fashion using crowdsourcing. Two types of implicit queries are considered: single queries and multiple queries. The entire keyword set is used to formulate a single query, while multiple queries are formed by dividing the keyword set into topically independent subsets whose results are combined into a unique document set.
Compared with the previous result, WF outperforms in comparative relevance with 87% vs. 13%. First, single queries are built with the D(.75), TS, and WF keyword extraction methods and compared for relevance. Next, multiple queries are built with the same methods and the resulting document sets are compared. Finally, the best results of multiple queries are compared with the best results of single queries.


F. Single Queries
Single queries built from the D(.75), TS, and WF keyword lists are compared over the fragments, listing the documents suggested for each transcript. Again, D(.75) > WF > TS is the resulting ranking, so when single queries are used the diverse keyword extraction technique yields the most relevant document sets.

Compared Methods (m1 vs. m2)   Relevance m1 (%)   Relevance m2 (%)
TS vs. WF                      40                 60
WF vs. D(.75)                  42                 58
TS vs. D(.75)                  20                 80
Fig. 1: Comparative Relevance of Single Queries, Inferred As: D(.75) > WF > TS.

G. Multiple Queries
Binary comparisons were also performed using multiple queries, which produce topically disjoint sets. Here the TS and D(.75) keyword extraction methods are used; their clustered variants are denoted CTS and CD(.75) (C indicating clustering of the keyword list), derived from the TS and D(.75) keyword lists after clustering. Clustering is not applicable to WF because it is not based on topic modeling. CD(.75) outperforms CTS with 62% vs. 38%.
H. Single versus Multiple
Single and multiple queries built from the same keyword lists are compared for D(.75) and TS. Using multiple queries leads to more relevant documents than single queries, i.e., CD(.75) > D(.75) and CTS > TS, with relevance scores of 65% vs. 35% for both D(.75) and TS.

Compared Methods (m1 vs. m2)   Relevance m1 (%)   Relevance m2 (%)
CD(.75) vs. D(.75)             70                 30
CTS vs. TS                     70                 30
CD(.75) vs. CTS                60                 40
Fig. 2: Comparative Relevance Scores of Document Results using Multiple Queries.

1) Example
The priority of CD(.75) shows that it retrieves the most relevant results. The Appendix gives the conversation fragment submitted to the system. The keyword lists extracted by D(.75), TS, and WF alone do not exploit topical information (Job, Work, Company), so little relevant information is retrieved. The CTS method covers the main topics of the fragment with the keywords Plan, Interview, Company, and Business; Time and Questions are then selected to cover the remaining topics, and Resume, Studies, Mail, and School cover all the others.
The highest-ranked retrieval results are obtained by WF, TS, D(.75), CD(.75), and CTS, and among them the multiple queries (CTS and CD(.75)) retrieve the most relevant documents. WF does not recommend relevant documents; D(.75) retrieves the largest number of topics, but TS and D(.75) do not separate the mixture of topics, so we move to CTS and CD(.75), which in comparison with single queries retrieve the most relevant documents. Moreover, CD(.75) covers more topics than CTS.
WF:      S = {Job, Work, Interview, Plan, Company, School, Studies}
TS:      S = {Job, Hiring, Hire, Resume, Company, Preparation}
D(.75):  S = {Job, Resume, School, Time, Advice, Friends}
Fig. 3: Keyword Sets Obtained by the Three Keyword Extraction Techniques.


CTS:     Q1 = {New job, New year, Studies, Graduation, Time}
         Q2 = {Hiring, Agency}
         Q3 = {Campus, Finance}
         Q4 = {Performance}
CD(.75): Q1 = {Job, New year, Friends, Studies}
         Q2 = {Finance, Campus, Firms, Resume}
         Q3 = {Academic, Advice, Strength, Weakness}
Fig. 4: Abstract Topics of the Conversation Fragment.
WF:      Job, Job offer, Hire, Hiring Company
TS:      Interview, Job, Hiring, Resume
D(.75):  Resume, School, Interview, Campus
CTS:     Job, Interview, Hiring, Company, Agency, Business
CD(.75): Finance, Job, Advice, Academic
Fig. 5: Comparison of Single and Multiple Queries.

V. CONCLUSION
A just-in-time retrieval system for conversational environments recommends documents to users according to their information needs. First, topic modeling supports the formulation of implicit queries from conversation fragments: a novel keyword extraction technique covers the maximum number of important topics in a fragment, and the keywords are then clustered into smaller, topically disjoint subsets from which implicit queries are formulated. Compared with WF and TS, the relevance of the retrieved documents shows that enforcing both relevance and diversity improves keyword extraction; this could be strengthened further by considering n-grams in addition to individual words. Future goals are to process explicit queries, to rank results so as to cover all the information needs of the user, and to integrate these techniques into a working environment with human users in real-life meetings.

VI. APPENDIX
The following transcript of a conversation between two speakers (Nancy and John) was submitted to the document recommendation system. The keyword lists, queries, and retrieved documents are shown in Figs. 3, 4, and 5, respectively.
Nancy: Hi. It is good to see you, John.
John: Same here, Nancy. It has been a long time since I last saw you.
Nancy: Yes, the last time we saw each other was New Year's Eve. How are you doing?
John: I am doing OK. It would be better if I have a new job right now.
Nancy: You are looking for a new job? Why?
John: I already finished my studies and graduated last week. Now, I want to get a job in the
Finance field. Payroll is not exactly Finance.
Nancy: How long have you been looking for a new job?
John: I just started this week.
Nancy: Didn't you have any interviews with those firms that came to our campus last month? I believe quite a few companies
came to recruit students for their Finance departments.
John: I could only get one interview with Fidelity Company because of my heavy work schedule. A month has already gone
by, and I have not heard from them. I guess I did not make it.
Nancy: Don't worry, John. You always did well in school. I know your good grades will help you get a job soon. Besides, the
job market is pretty good right now, and all companies need financial analysts.
John: I hope so.
Nancy: You have prepared a resume, right?
John: Yes.
Nancy: Did you mail your resume to a lot of companies? How about recruiting agencies?
John: I have sent it to a dozen companies already. No, I have not thought about recruiting agencies. But, I do look closely at
the employment ads listed in the newspaper every day.
Nancy: Are there a lot of openings?
John: Quite a few. Some of them require a certain amount of experience and others are willing to train.
Nancy: My friends told me that it helps to do some homework before you go to an interview. You need to know the company
well: what kind of business is it in? What types of products does it sell? How is it doing lately?
John: Yes, I know. I am doing some research on companies that I want to work for. I want to be ready whenever they call me
in for an interview.
Nancy: Have you thought about questions they might ask you during the interview?
John: What types of questions do you think they will ask?
Nancy: Well, they might ask you some questions about Finance theories to test your academic understanding.
John: I can handle that.


Nancy: They might tell you about a problem and want you to come up with a solution.
John: I don't know about that. I hope I will be able to give them a decent response if the need arises.
Nancy: They will want to know you a little bit before they make a hiring decision. So, they may ask you to describe yourself.
For example, what are your strengths and your weaknesses? How do you get along with people?
John: I need to work on that question. How would I describe myself? Huh!
Nancy: Also, make sure you are on time. Nothing is worse than to be late for an interview. You do not want to give them a
bad impression, right from the start.
John: I know. I always plan to arrive about 10 or 15 minutes before the interview starts.
Nancy: Good decision! It seems that you are well prepared for your job search. I am sure you will find a good job in no time.
John: I hope so.
Nancy: I need to run; otherwise, I will be late for school. Good luck in your job search, John.
John: Thank you for your advice. Bye!

