
Paper ID: ISCMI2014-1-031E

Human Aided Text Summarizer


SAAR using Reinforcement Learning

By : Chandra Prakash
ABV-IIITM Gwalior
&

Dr. Anupam Shukla


Professor, ABV-IIITM, Gwalior

2014 Intl. Conference on Soft Computing & Machine Intelligence (ISCMI 2014)

Outline

Problem Definition
Motivation
Literature Survey
Scope of Project
Methodology/Approach
Tools Used
Result

Introduction

Real-time Problems

Imagine:
You have downloaded 1000+ papers and now want a summary of them.
You have a list of emails about a sports event and want a one-paragraph summary of those emails.
You have to study many books for an exam, and the summarizer gives you the key concepts of the books as a few pages of notes.

Value for researchers:
"Get me everything the papers say about Automatic Text Summarization."

Definition

Automatic Summarization

An active research area in which a computer automatically summarizes text from both single and multiple documents.

A summary is a short text that conveys the essence of the document:
It should be less than half the length of the original text.
It can be extraction-based or abstraction-based.
It may be produced from a single document or from multiple documents.

Dipanjan Das, Andre F. T. Martins (2007). A Survey on Automatic Text Summarization. Literature Survey for the Language and Statistics II course at CMU, Pittsburgh.

Problem definition

With the advent of the information revolution (WWW), electronic documents are becoming a principal medium of business and academic information.

Thousands of electronic documents are produced and made available on the internet each day.
It is not easy to read each and every document.

Information Access Agents:

Search engines: Google, Yahoo, etc.

The amount of information retrieved is far greater than a user can handle and manage.
The user has to analyze the search results one by one until satisfied, which is time-consuming and inefficient.

What could be the possible solution, then?

Problem definition (cont.)

Text summarization is not tailored to the user's specification.

Generic summary generation is not possible, as the desired summary changes as the user changes.
Even two humans cannot generate the same summary from a given document.
Internal factors (background, education, etc.) play a vital role in generating a summary.

What could be the possible solution now?

Solution: Human Aided Text Summarization

Benefits of summarization include:
Saves reading time
Value for researchers
Abstracts for scientific and other articles
Facilitates fast literature searches
Facilitates classification of articles and other written data
Improves search engines' indexing efficiency for web pages
Assists in storing the text in much less space
Headings for a given article/document
News summarization
Opinion mining and sentiment analysis
Enables cell phones to access web information
With human feedback, a user-oriented summary

Previous Approaches:
1958: Automatic creation of literature abstracts was proposed by H. P. Luhn at IBM.
Text Mining:
Includes the discovery of patterns and trends in data and associations among entities in a document.
Consists of three steps:
text preparation,
text processing, and
text analysis.
Text Summarization:
Text summarization methods:
Extraction: construct the summary by taking the most important sentences.
Abstraction: construct the summary by paraphrasing sections of the original document.

Types of Techniques:
Statistical techniques:
Based on term frequency.
Stop-word filtering: removes unwanted noise.
Stemming or lemmatization: conflates different forms of the same word.
Determining term importance:
Term Frequency/Inverse Document Frequency (TF-IDF) weighting scheme, etc.
Linguistic techniques:
Look at text semantics.
Extract sentences using parsing and part-of-speech tagging, which are among the starting steps of natural language processing (NLP).

Scope of Project
Problem Definition:

Extractive Text Summarization


Single Document
Fully Automated Summarization (FAS)
Human Aided Machine Summarization (HAMS)

Machine Learning

Reinforcement Learning

Tools used:
Matlab
Java

Earlier Methodology Proposed (FAS)

Chandra Prakash, Anupam Shukla, "Automated summary generation from single document using information gain," Contemporary Computing, Communications in Computer and Information Science, vol. 94, Springer, pp. 152-159, 2010.

Methodology Proposed (HAMS)

Keyword Significance Factor

Solution: Approach to the Problem
Input: a document with text is fed into the system.
Preprocessing:
Tokenization: divides the character sequence into words; sentence splitting further divides sequences of words into sentences.
Stemming or lemmatization
Stop-word filtering
Feature Extraction
Sentence Ranking: machine learning
Human Feedback
Output/Result: generated summary or abstract.


Methodology Steps

The methodology for text summarization involves:

Term selection using pre-processing:

Tokenization or segmentation
Stop-word filtering
Stemming or lemmatization
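The three pre-processing steps can be sketched as follows. This is an illustrative Python stand-in (the project itself used Matlab and Java): the stop-word list is a tiny sample, and the suffix-stripping stem function is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny sample stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def split_sentences(text):
    """Naive sentence splitting on terminal punctuation."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(sentence):
    """Divide a character sequence into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def stem(word):
    """Crude suffix stripping; a stand-in for Porter stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Segmentation -> tokenization -> stop-word filtering -> stemming."""
    return [
        [stem(t) for t in tokenize(sent) if t not in STOP_WORDS]
        for sent in split_sentences(text)
    ]
```

preprocess("The cats are running. Dogs barked loudly!") yields one token list per sentence, with stop words removed and suffixes stripped.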

Term weighting:

Term Frequency (TF):

Wi(Tj) = fij

where fij is the frequency of the jth term in sentence i.

Inverse Sentence Frequency (ISF):

Wi(Tj) = fij log(N / nj)

where N is the number of sentences in the collection and nj is the number of sentences in which term j appears.
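The TF and ISF formulas above can be computed directly from the tokenized sentences. A minimal Python sketch (function and variable names are illustrative, not from the paper):

```python
import math

def tf_isf_weights(sentences):
    """Compute W_i(T_j) = f_ij * log(N / n_j) for each term of each sentence.

    sentences: list of token lists, one per sentence.
    Returns a {term: weight} dict per sentence.
    """
    n = len(sentences)                     # N: sentences in the collection
    sent_freq = {}                         # n_j: sentences containing term j
    for tokens in sentences:
        for term in set(tokens):
            sent_freq[term] = sent_freq.get(term, 0) + 1

    weights = []
    for tokens in sentences:
        w = {}
        for term in set(tokens):
            f_ij = tokens.count(term)      # frequency of term j in sentence i
            w[term] = f_ij * math.log(n / sent_freq[term])
        weights.append(w)
    return weights
```

Note that a term appearing in every sentence gets weight 0, mirroring the behavior of IDF at document level.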

Methodology Steps (cont)

The weight of a term is calculated as:

(TW)i,j = (ISF)i,j

where (TW)i,j is the term weight of the ith sentence and jth term.

Sentence Signature

Sentences that indicate key concepts in a document.

Term-Sentence Matrix

         | W11  W12  ...  W1n |
(TSM) =  | W21  W22  ...  W2n |
         | ...  ...  ...  ... |
         | Wm1  Wm2  ...  Wmn |

Information Gain

Term Frequency Weight score (TFW)
Inverse Sentence Frequency score (ISFS)
Normalized Sentence Length score:

(NSL)i = (number of words occurring in sentence i) / (number of words occurring in the longest sentence in the document)

Sentence Position Score:

(SPS)i = (n - i + 1) / n

where n is the number of sentences in the document and i is the sentence position.

Numerical Data Score:

(PNS)i = (number of numerical data items in sentence i) / (length of sentence i)

Methodology Steps (cont)

Information Gain is calculated as:

Information Gain (IG)i = (TFW)i + ISFS(Tj)i + (NSL)i + (SPS)i + (PNS)i

where i is the sentence and j is the term.

Term-Sentence matrix after IG:

         | IG(W11)  IG(W12)  ...  IG(W1n) |
(TSM) =  | IG(W21)  IG(W22)  ...  IG(W2n) |
         | ...      ...      ...  ...     |
         | IG(Wm1)  IG(Wm2)  ...  IG(Wmn) |
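Putting the five feature scores together, the per-sentence Information Gain can be sketched as below. The slides do not specify how per-term weights are aggregated into the sentence-level TFW and ISFS scores; summing them is an assumption of this sketch.

```python
import math

def information_gain(sentences):
    """IG_i = (TFW)_i + (ISFS)_i + (NSL)_i + (SPS)_i + (PNS)_i
    for each tokenized sentence (assumed aggregation: per-term sums)."""
    n = len(sentences)
    sf = {}                                    # n_j: sentences containing term j
    for toks in sentences:
        for t in set(toks):
            sf[t] = sf.get(t, 0) + 1
    longest = max(len(toks) for toks in sentences)

    scores = []
    for i, toks in enumerate(sentences):       # i is 0-based here
        tfw = sum(toks.count(t) for t in set(toks))        # term-frequency weight
        isfs = sum(toks.count(t) * math.log(n / sf[t]) for t in set(toks))
        nsl = len(toks) / longest                          # normalized sentence length
        sps = (n - i) / n                                  # (n - i + 1)/n with 1-based i
        pns = sum(t.isdigit() for t in toks) / max(len(toks), 1)  # numerical data score
        scores.append(tfw + isfs + nsl + sps + pns)
    return scores
```

For two identical sentences, only the position score differs, so the earlier sentence ranks higher.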

Elements of Reinforcement Learning

Agent, State, Policy, Reward, Action, Environment

Agent: the intelligent program
Environment: the external conditions
Policy:
Defines the agent's behavior at a given time
A mapping from states to actions
Lookup tables or simple functions

An agent learns behavior through trial-and-error interactions with a dynamic environment.

Methodology Steps (cont)

Processing step:

Sentence scoring using reinforcement learning

Action-selection policies:

ε-greedy:

a_t = the greedy action a*_t  with probability 1 - ε
      a random action         with probability ε

In our approach we consider:

State: sentences
Action: updating a term weight
Policy: update the term weights to maximize the sentence rank
Reward: scalar value of the term (IG)
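The ε-greedy policy above picks the highest-scoring action most of the time and explores at random with probability ε. A minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Return an action index: the greedy (highest-Q) action with
    probability 1 - epsilon, a uniformly random action with
    probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit
```

With ε = 0 the policy always exploits; with ε = 1 it always explores.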

Q-Learning

Processing step: the term-sentence matrix after IG serves as matrix Q, the learning matrix:

         | IG(W11)  IG(W12)  ...  IG(W1n) |
(TSM) =  | IG(W21)  IG(W22)  ...  IG(W2n) |
         | ...      ...      ...  ...     |
         | IG(Wm1)  IG(Wm2)  ...  IG(Wmn) |

After learning, the updated term-sentence matrix is:

                | IG(W11)updated  IG(W12)updated  ...  IG(W1n)updated |
(TSM)updated =  | IG(W21)updated  IG(W22)updated  ...  IG(W2n)updated |
                | ...             ...             ...  ...            |
                | IG(Wm1)updated  IG(Wm2)updated  ...  IG(Wmn)updated |
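The slides show the Q matrix before and after learning but not the update rule itself; the textbook Q-learning update is the natural fit, applied with states as sentences (rows), actions as term-weight updates (columns), and the IG value as the reward. A sketch under that assumption:

```python
def q_update(Q, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """One textbook Q-learning step on matrix Q (a list of rows):
    Q[s][a] += alpha * (reward + gamma * max(Q[s_next]) - Q[s][a]).
    alpha is the learning rate, gamma the discount factor."""
    target = reward + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

Repeated updates move each entry toward the reward plus the discounted best value reachable from the next state.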

Summary Generation:

Sentence selection:

Sentences are compared as vectors in Euclidean n-space, P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn).
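The distance between two sentence vectors in Euclidean n-space can be computed as below; how the distances feed into the final selection (e.g. for redundancy removal) is not detailed on the slide.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length sentence vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
```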

Dataset

Article from The Hindu (June 2013)

DUC06 document sets:

12 document sets
25 documents in each set
32 sentences per document on average

300 document summaries in total

Evaluation

Evaluation techniques:

Precision (P) = 100 r / Km

Recall (R) = 100 r / Kh

F-score = 2 P R / (P + R) = 100 (2r) / (Kh + Km)

where r is the number of common sentences, Km is the length of the machine-generated summary, and Kh is the length of the human-generated summary.
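The three measures, computed over the sentence overlap between a machine summary and a human reference, can be sketched as:

```python
def summary_scores(machine, human):
    """Precision, Recall and F-score from sentence overlap:
    P = 100*r/Km, R = 100*r/Kh, F = 100*2r/(Kh + Km), where r is the
    number of common sentences and Km, Kh the summary lengths."""
    r = len(set(machine) & set(human))
    km, kh = len(machine), len(human)
    precision = 100 * r / km
    recall = 100 * r / kh
    f_score = 100 * 2 * r / (kh + km)
    return precision, recall, f_score
```

For example, two two-sentence summaries sharing one sentence give P = R = F = 50.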

Compared with available automated text summarizers:

Open Text Summarizer (OTS),
Pertinence Summarizer (PS), and
Extractor Test Summarizer Software (ETSS).

The compression ratio is 30%.

Result

Comparison of Recall, Precision and F-score for HAMS:

Methods                 Precision (P)   Recall (R)   F-score
SAAR (user feedback)         90             85        87.42
IG summary                   75             65        70.57
OTS                          75             60        66.66
PS                           75             60        66.66
ETSS                         75             60        66.66

[Bar chart: Precision, Recall and F-score for the SAAR-based, IG summary, OTS, PS and ETSS summarizers]

Conclusion and future scope

A novel approach to human-aided text summarization with user feedback from a single document.

This summarization by extraction will be good enough for a reader to understand the main idea of a document, though the understandability might not be as good as that of a summary by abstraction.

As future work, this approach can be extended to multi-document summary extraction using machine learning.

We can introduce the concept of multi-agent systems, which will increase the speed as well as make the summary or abstract more generic.

References
1. Verma R., Chen P., Integrating Ontology Knowledge into a Query-based Information Summarization System, DUC 2007, 2007, Rochester, NY.
2. Luhn H. P., The Automatic Creation of Literature Abstracts, IBM Journal of Research and Development, vol. 2, pp. 159-165, 1958.
3. Edmundson H. P., New Methods in Automatic Extracting, Journal of the ACM (JACM), vol. 16, no. 2, pp. 264-285, 1969.
4. Salton G., Buckley C., Term-Weighting Approaches in Automatic Text Retrieval, Information Processing & Management, vol. 24, pp. 513-523, 1988.
5. Luhn H. P., A Statistical Approach to Mechanized Encoding and Searching of Literary Information, IBM Journal of Research and Development, pp. 309-317, 1957.
6. Kupiec J. et al., A Trainable Document Summarizer, in Proceedings of SIGIR, 1995.
7. Conroy J. M., O'Leary D. P., Text Summarization via Hidden Markov Models, in Proceedings of SIGIR '01, pp. 406-407, New York, NY, USA, 2001.
8. Agarwal N., Ford K. H., Shneider M., Sentence Boundary Detection Using a MaxEnt Classifier.
9. García-Hernández R. A., Ledeneva Y., Word Sequence Models for Single Text Summarization, Second International Conferences on Advances in Computer-Human Interactions, pp. 44-48, 2009.
10. The Hindu [http://www.hinduonnet.com/], accessed on 23rd June 2009.
11. Van Rijsbergen C. J., Information Retrieval, 2nd edition, Dept. of Computer Science, University of Glasgow, 1979.

References (cont.)
12. Yatsko V. A. and Vishnyakov T. N. (2006). A Method for Evaluating Modern Systems of Automatic Text Summarization.
13. Hariharan S. and Srinivasan R. (2008). Investigations in Single Document Summarization by Extraction Method.
14. Kyoomarsi F., Khosravi H., Eslami E., Dehkordy P. K., Tajoddin A., Optimizing Text Summarization Based on Fuzzy Logic, in Proceedings of Computer and Information Science (ICIS '08), 2008.
15. Sparck-Jones K. (1999). Automatic Summarizing: Factors and Directions, in Mani I. and Maybury M. T. (editors), Advances in Automatic Text Summarization, The MIT Press, pp. 1-12.
16. Hovy E. and Lin C.-Y. (1997). Automated Text Summarization in SUMMARIST, in Proceedings of the ACL97/EACL97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain.
17. Mani I. and Maybury M. T. (editors) (1999). Advances in Automatic Text Summarization, MIT Press, Cambridge, MA.
18. Lin C.-Y. and Hovy E. (2000). The Automated Acquisition of Topic Signatures for Text Summarization, in Proceedings of the 18th COLING Conference, Saarbrucken, Germany.
19. Baldwin B., Donaway R., Hovy E., Liddy E., Mani I., Marcu D., McKeown K., Mittal V., Moens M., Radev D., Sparck-Jones K., Sundheim B., Teufel S., Weischedel R., and White M. (2000). An Evaluation Road Map for Summarization Research. http://www-nlpir.nist.gov/projects/duc/papers/summarization.roadmap.doc.

Thank You

