
Paper ID: ISCMI2014-1-031E

Human Aided Text Summarizer


SAAR using Reinforcement Learning

By : Chandra Prakash
ABV-IIITM Gwalior
&

Dr. Anupam Shukla


Professor, ABV-IIITM, Gwalior

2014 Intl. Conference on Soft Computing & Machine Intelligence (ISCMI 2014)

Outline

Problem Definition
Motivation
Literature Survey
Scope of Project
Methodology/Approach
Tools Used
Result

Introduction

Real-time Problems

Imagine:
You have downloaded 1000+ papers and now want a summary of them.
You have a list of emails about a sports event and want a one-paragraph summary of those emails.
You have to study many books for an exam, and the summarizer gives you the key concepts of the books as a few pages of notes.

Value for researchers:
"Get me everything the papers say about Automatic Text Summarization."

Definition

Automatic Summarization

An active research area in which a computer automatically summarizes text from both single and multiple documents.

A summary is a short text that conveys the essence of the document:
It should be less than half the length of the original text.
It can be extraction-based or abstraction-based.
It may be produced from a single document or from multiple documents.

Dipanjan Das, Andre F. T. Martins (2007). A Survey on Automatic Text Summarization. Literature Survey for the Language and Statistics II course at CMU, Pittsburgh.

Problem definition

With the advent of the information revolution (WWW), electronic documents are becoming a principal medium of business and academic information.

Thousands of electronic documents are produced and made available on the internet each day.
It is not easy to read each and every document.

Information Access Agents:

Search engines: Google, Yahoo, etc.

The amount of information retrieved is far greater than a user can handle and manage.
The user has to analyze the search results one by one until satisfied, which is time-consuming and inefficient.

What could be the possible solution, then?

Problem definition (cont.)

Text summarization is not tailored to the user's specification.

Generic summary generation is not possible, as the desired summary changes as the user changes.
Even two humans cannot generate the same summary from a given document.
Internal factors (background, education, etc.) play a vital role in generating a summary.

What could be the possible solution now?

Solution: Human Aided Text Summarization

Benefits of summarization include:
Saves reading time
Value for researchers
Abstracts for scientific and other articles
Facilitates fast literature searches
Facilitates classification of articles and other written data
Improves search engines' indexing efficiency for web pages
Assists in storing the text in much less space
Headings for a given article/document
News summarization
Opinion mining and sentiment analysis
Enables cell phones to access web information
With human feedback, a user-oriented summary

Previous Approaches:
1958: Automatic creation of literature abstracts was proposed by H. P. Luhn at IBM.
Text Mining:
Includes the discovery of patterns and trends in data and associations among entities in a document.
Consists of three steps:
text preparation,
text processing, and
text analysis.
Text Summarization:
Text summarization methods:
Extraction: construct the summary by taking the most important sentences.
Abstraction: construct the summary by paraphrasing sections of the original document.

Types of Techniques:
Statistical techniques:
Based on term frequency.
Stop-word filtering: removes unwanted noise.
Stemming or lemmatization: conflates different forms of the same word.
Determining term importance:
Term Frequency/Inverse Document Frequency (TF-IDF) weighting scheme, etc.
Linguistic techniques:
Look at text semantics.
Extract sentences using parsing and part-of-speech tagging, which are among the starting steps of natural language processing (NLP).

Scope of Project
Problem Definition:

Extractive Text Summarization


Single Document
Fully Automated Summarization (FAS)
Human Aided Machine Summarization (HAMS)

Machine Learning

Reinforcement Learning

Tools used:
Matlab
Java

Earlier Methodology Proposed (FAS)

Chandra Prakash, Anupam Shukla, "Automated summary generation from single document using information gain," Contemporary Computing, Communications in Computer and Information Science, vol. 94, Springer, pp. 152-159, 2010.

Methodology Proposed (HAMS)

Keyword Significance Factor

Solution: Approach to the Problem
Input: a document with text is fed into the system.
Preprocessing:
Tokenization: divides the character sequence into words; sentence splitting further divides sequences of words into sentences.
Stemming or lemmatization
Stop-word filtering
Feature Extraction
Sentence Ranking: machine learning
Human Feedback
Output/Result: generated summary or abstract.


Methodology Steps

The methodology for text summarization involves:

Term selection using pre-processing:

Tokenization or segmentation
Stop-word filtering
Stemming or lemmatization
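The three pre-processing steps can be sketched as follows. This is an illustrative Python stand-in (the project itself used Matlab and Java): the stop-word list is a tiny sample, and the suffix-stripping stem function is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny sample stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def split_sentences(text):
    """Naive sentence splitting on terminal punctuation."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(sentence):
    """Divide a character sequence into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def stem(word):
    """Crude suffix stripping; a stand-in for Porter stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Segmentation -> tokenization -> stop-word filtering -> stemming."""
    return [
        [stem(t) for t in tokenize(sent) if t not in STOP_WORDS]
        for sent in split_sentences(text)
    ]
```

preprocess("The cats are running. Dogs barked loudly!") yields one token list per sentence, with stop words removed and suffixes stripped.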

Term weighting:

Term Frequency (TF):

Wi(Tj) = fij

where fij is the frequency of the jth term in sentence i.

Inverse Sentence Frequency (ISF):

Wi(Tj) = fij log(N / nj)

where N is the number of sentences in the collection and nj is the number of sentences in which term j appears.
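The TF and ISF formulas above can be computed directly from the tokenized sentences. A minimal Python sketch (function and variable names are illustrative, not from the paper):

```python
import math

def tf_isf_weights(sentences):
    """Compute W_i(T_j) = f_ij * log(N / n_j) for each term of each sentence.

    sentences: list of token lists, one per sentence.
    Returns a {term: weight} dict per sentence.
    """
    n = len(sentences)                     # N: sentences in the collection
    sent_freq = {}                         # n_j: sentences containing term j
    for tokens in sentences:
        for term in set(tokens):
            sent_freq[term] = sent_freq.get(term, 0) + 1

    weights = []
    for tokens in sentences:
        w = {}
        for term in set(tokens):
            f_ij = tokens.count(term)      # frequency of term j in sentence i
            w[term] = f_ij * math.log(n / sent_freq[term])
        weights.append(w)
    return weights
```

Note that a term appearing in every sentence gets weight 0, mirroring the behavior of IDF at document level.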

Methodology Steps (cont)

The weight of a term is calculated as:

(TW)i,j = (ISF)i,j

where (TW)i,j is the term weight of the ith sentence and jth term.

Sentence Signature

Sentences that indicate key concepts in a document.

Term-Sentence Matrix

         | W11  W12  ...  W1n |
(TSM) =  | W21  W22  ...  W2n |
         | ...  ...  ...  ... |
         | Wm1  Wm2  ...  Wmn |

Information Gain

Term Frequency Weight score (TFW)
Inverse Sentence Frequency score (ISFS)
Normalized Sentence Length score:

(NSL)i = (number of words occurring in sentence i) / (number of words occurring in the longest sentence in the document)

Sentence Position Score:

(SPS)i = (n - i + 1) / n

where n is the number of sentences in the document and i is the sentence position.

Numerical Data Score:

(PNS)i = (number of numerical data items in sentence i) / (length of sentence i)

Methodology Steps (cont)

Information Gain is calculated as:

Information Gain (IG)i = (TFW)i + ISFS(Tj)i + (NSL)i + (SPS)i + (PNS)i

where i is the sentence and j is the term.

Term-Sentence matrix after IG:

         | IG(W11)  IG(W12)  ...  IG(W1n) |
(TSM) =  | IG(W21)  IG(W22)  ...  IG(W2n) |
         | ...      ...      ...  ...     |
         | IG(Wm1)  IG(Wm2)  ...  IG(Wmn) |
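Putting the five feature scores together, the per-sentence Information Gain can be sketched as below. The slides do not specify how per-term weights are aggregated into the sentence-level TFW and ISFS scores; summing them is an assumption of this sketch.

```python
import math

def information_gain(sentences):
    """IG_i = (TFW)_i + (ISFS)_i + (NSL)_i + (SPS)_i + (PNS)_i
    for each tokenized sentence (assumed aggregation: per-term sums)."""
    n = len(sentences)
    sf = {}                                    # n_j: sentences containing term j
    for toks in sentences:
        for t in set(toks):
            sf[t] = sf.get(t, 0) + 1
    longest = max(len(toks) for toks in sentences)

    scores = []
    for i, toks in enumerate(sentences):       # i is 0-based here
        tfw = sum(toks.count(t) for t in set(toks))        # term-frequency weight
        isfs = sum(toks.count(t) * math.log(n / sf[t]) for t in set(toks))
        nsl = len(toks) / longest                          # normalized sentence length
        sps = (n - i) / n                                  # (n - i + 1)/n with 1-based i
        pns = sum(t.isdigit() for t in toks) / max(len(toks), 1)  # numerical data score
        scores.append(tfw + isfs + nsl + sps + pns)
    return scores
```

For two identical sentences, only the position score differs, so the earlier sentence ranks higher.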

Elements of Reinforcement Learning

Agent, State, Policy, Reward, Action, Environment

Agent: the intelligent program
Environment: the external conditions
Policy:
Defines the agent's behavior at a given time
A mapping from states to actions
Lookup tables or simple functions

An agent learns behavior through trial-and-error interactions with a dynamic environment.

Methodology Steps (cont)

Processing step:

Sentence scoring using reinforcement learning

Action-selection policies:

ε-greedy:

a_t = the greedy action a*_t  with probability 1 - ε
      a random action         with probability ε

In our approach we consider:

State: sentences
Action: updating a term weight
Policy: update the term weights to maximize the sentence rank
Reward: scalar value of the term (IG)
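The ε-greedy policy above picks the highest-scoring action most of the time and explores at random with probability ε. A minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Return an action index: the greedy (highest-Q) action with
    probability 1 - epsilon, a uniformly random action with
    probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit
```

With ε = 0 the policy always exploits; with ε = 1 it always explores.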

Q-Learning

Processing step: the term-sentence matrix after IG serves as matrix Q, the learning matrix:

         | IG(W11)  IG(W12)  ...  IG(W1n) |
(TSM) =  | IG(W21)  IG(W22)  ...  IG(W2n) |
         | ...      ...      ...  ...     |
         | IG(Wm1)  IG(Wm2)  ...  IG(Wmn) |

After learning, the updated term-sentence matrix is:

                | IG(W11)updated  IG(W12)updated  ...  IG(W1n)updated |
(TSM)updated =  | IG(W21)updated  IG(W22)updated  ...  IG(W2n)updated |
                | ...             ...             ...  ...            |
                | IG(Wm1)updated  IG(Wm2)updated  ...  IG(Wmn)updated |
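The slides show the Q matrix before and after learning but not the update rule itself; the textbook Q-learning update is the natural fit, applied with states as sentences (rows), actions as term-weight updates (columns), and the IG value as the reward. A sketch under that assumption:

```python
def q_update(Q, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """One textbook Q-learning step on matrix Q (a list of rows):
    Q[s][a] += alpha * (reward + gamma * max(Q[s_next]) - Q[s][a]).
    alpha is the learning rate, gamma the discount factor."""
    target = reward + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

Repeated updates move each entry toward the reward plus the discounted best value reachable from the next state.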

Summary Generation:

Sentence selection:

Sentences are compared as vectors in Euclidean n-space, P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn).
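The distance between two sentence vectors in Euclidean n-space can be computed as below; how the distances feed into the final selection (e.g. for redundancy removal) is not detailed on the slide.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length sentence vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
```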

Dataset

Article from The Hindu (June 2013)

DUC06 document sets:

12 document sets
25 documents in each set
32 sentences per document on average

300 document summaries in total

Evaluation

Evaluation techniques:

Precision (P) = 100 r / Km

Recall (R) = 100 r / Kh

F-score = 2 P R / (P + R) = 100 (2r) / (Kh + Km)

where r is the number of common sentences, Km is the length of the machine-generated summary, and Kh is the length of the human-generated summary.
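The three measures, computed over the sentence overlap between a machine summary and a human reference, can be sketched as:

```python
def summary_scores(machine, human):
    """Precision, Recall and F-score from sentence overlap:
    P = 100*r/Km, R = 100*r/Kh, F = 100*2r/(Kh + Km), where r is the
    number of common sentences and Km, Kh the summary lengths."""
    r = len(set(machine) & set(human))
    km, kh = len(machine), len(human)
    precision = 100 * r / km
    recall = 100 * r / kh
    f_score = 100 * 2 * r / (kh + km)
    return precision, recall, f_score
```

For example, two two-sentence summaries sharing one sentence give P = R = F = 50.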

Compared with available automated text summarizers:

Open Text Summarizer (OTS),
Pertinence Summarizer (PS), and
Extractor Test Summarizer Software (ETSS).

The compression ratio is 30%.

Result

Comparison of Recall, Precision and F-score for HAMS:

Methods                 Precision (P)   Recall (R)   F-score
SAAR (user feedback)         90             85        87.42
IG summary                   75             65        70.57
OTS                          75             60        66.66
PS                           75             60        66.66
ETSS                         75             60        66.66

[Bar chart: Precision, Recall and F-score for the SAAR-based, IG summary, OTS, PS and ETSS summarizers]

Conclusion and future scope

A novel approach to human-aided text summarization with user feedback from a single document.

This summarization by extraction will be good enough for a reader to understand the main idea of a document, though the understandability might not be as good as that of a summary by abstraction.

As future work, this approach can be extended to multi-document summary extraction using machine learning.

We can introduce the concept of multi-agent systems, which will increase the speed as well as make the summary or abstract more generic.

References
1. Verma R., Chen P., Integrating Ontology Knowledge into a Query-based Information Summarization System, DUC 2007, 2007, Rochester, NY.
2. Luhn H. P., The Automatic Creation of Literature Abstracts, IBM Journal of Research and Development, vol. 2, pp. 159-165, 1958.
3. Edmundson H. P., New Methods in Automatic Extracting, Journal of the ACM (JACM), vol. 16, no. 2, pp. 264-285, 1969.
4. Salton G., Buckley C., Term-Weighting Approaches in Automatic Text Retrieval, Information Processing & Management, vol. 24, pp. 513-523, 1988.
5. Luhn H. P., A Statistical Approach to Mechanized Encoding and Searching of Literary Information, IBM Journal of Research and Development, pp. 309-317, 1957.
6. Kupiec J. et al., A Trainable Document Summarizer, in Proceedings of SIGIR, 1995.
7. Conroy J. M., O'Leary D. P., Text Summarization via Hidden Markov Models, in Proceedings of SIGIR '01, pp. 406-407, New York, NY, USA, 2001.
8. Agarwal N., Ford K. H., Shneider M., Sentence Boundary Detection Using a MaxEnt Classifier.
9. García-Hernández R. A., Ledeneva Y., Word Sequence Models for Single Text Summarization, Second International Conferences on Advances in Computer-Human Interactions, pp. 44-48, 2009.
10. The Hindu [http://www.hinduonnet.com/], accessed on 23rd June 2009.
11. Van Rijsbergen C. J., Information Retrieval, 2nd edition, Dept. of Computer Science, University of Glasgow, 1979.

References (cont.)
12. Yatsko V. A. and Vishnyakov T. N. (2006). A Method for Evaluating Modern Systems of Automatic Text Summarization.
13. Hariharan S. and Srinivasan R. (2008). Investigations in Single Document Summarization by Extraction Method.
14. Kyoomarsi F., Khosravi H., Eslami E., Dehkordy P. K., Tajoddin A., Optimizing Text Summarization Based on Fuzzy Logic, in Proceedings of Computer and Information Science (ICIS '08), 2008.
15. Sparck-Jones K. (1999). Automatic Summarizing: Factors and Directions, in Mani I. and Maybury M. T. (editors), Advances in Automatic Text Summarization, The MIT Press, pp. 1-12.
16. Hovy E. and Lin C.-Y. (1997). Automated Text Summarization in SUMMARIST, in Proceedings of the ACL97/EACL97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain.
17. Mani I. and Maybury M. T. (editors) (1999). Advances in Automatic Text Summarization, MIT Press, Cambridge, MA.
18. Lin C.-Y. and Hovy E. (2000). The Automated Acquisition of Topic Signatures for Text Summarization, in Proceedings of the 18th COLING Conference, Saarbrucken, Germany.
19. Baldwin B., Donaway R., Hovy E., Liddy E., Mani I., Marcu D., McKeown K., Mittal V., Moens M., Radev D., Sparck-Jones K., Sundheim B., Teufel S., Weischedel R., and White M. (2000). An Evaluation Road Map for Summarization Research. http://www-nlpir.nist.gov/projects/duc/papers/summarization.roadmap.doc.

Thank You

