By : Chandra Prakash
ABV-IIITM Gwalior
2014 Intl. Conference on Soft Computing & Machine Intelligence (ISCMI 2014)
Approach
Problem Definition
Motivation
Literature survey
Scope of Project
Methodology/Approach
Tools used
Result
Introduction
summary..
We have a list of emails about a sports event; get the summary
of those emails in one paragraph.
We have to study lots of books for the exam, and the
summarizer gives the key concepts of the books as a few
pages of notes.
Value for researchers
"Get me everything the papers say about Automatic Text
Summarization."
Definition
Automatic Summaries
Dipanjan Das, Andre F.T. Martins (2007). A Survey on Automatic Text Summarization. Literature
Survey for the Language and Statistics II course at CMU, Pittsburgh.
Problem definition
With the advent of the information revolution (WWW), thousands of electronic
documents are produced and made available on the internet each day.
media of business and academic
News summarization
Opinion Mining and Sentiment Analysis
Previous Approach :
1950s : Automatic creation of literature abstracts was proposed by Luhn at IBM.
Text Mining:
Includes the discovery of patterns and trends in data and of associations among
entities in a document.
Consists of three steps:
text preparation,
text processing and
text analysis.
Text Summarization :
Text Summarization Methods.
Extraction: Construct the summary by taking the most important
sentences.
Abstraction: Construct the summary by paraphrasing sections of the
original document.
Type of Techniques:
Statistical techniques :
Based on Term Frequency.
Stop-word filtering: removes unwanted noise words.
Stemming or Lemmatization: maps different forms of the same word to one base form.
Determine term importance
Term Frequency/Inverse-documents-frequency (TF-IDF)
Weighting scheme, etc.
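The statistical steps listed above can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the `crude_stem` helper are hypothetical stand-ins (a real system would use a full stop-word list and a proper stemmer or lemmatizer), and nothing here is taken from the presented system.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "are", "in", "to", "for"}

def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    """Lower-case the text, drop stop words, stem, then count the terms."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(w) for w in words if w not in STOP_WORDS]
    return Counter(stems)

tf = term_frequencies(
    "The summarizer ranks sentences and the ranked sentence is selected."
)
print(tf.most_common())
```

After stemming, "ranks" and "ranked" are counted as the same term, which is the point of the stemming step.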
Linguistic techniques :
Look for text semantics.
Linguistic techniques extract sentences by
parsing, as part of Natural Language Processing (NLP).
Part-of-speech tagging is among the first steps.
Scope of Project
Problem Definition:
Machine Learning
Reinforcement Learning
Tools used:
Matlab
Java
Chandra Prakash, Anupam Shukla. Automated summary generation from single document using information gain.
Springer, Contemporary Computing, Communications in Computer and Information Science, Volume 94, pp 152-159,
2010.
Solution
Approach for the Problem
Input: Document with text is fed into the system.
Preprocessing:
Tokenization: divides the character sequence into words.
Sentence splitting: further divides sequences of words into
sentences, and so on.
Stemming or Lemmatization
Stop word filtering
Feature Extraction :
Sentence Ranking: Machine Learning
Human Feedback
Output/Result: Generated Summary,
an abstract.
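The first two preprocessing steps can be illustrated with a minimal sketch. The regular expressions are naive assumptions (abbreviations such as "Dr." would be split incorrectly), and the function names are hypothetical, not those of the presented system.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Return lower-cased word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", sentence.lower())

document = (
    "Text is fed into the system. It is split into sentences! "
    "Then words are extracted."
)
sentences = split_sentences(document)
tokens = [tokenize(s) for s in sentences]

for sentence, words in zip(sentences, tokens):
    print(len(words), sentence)
```

The token lists produced here are the natural input for the later stemming, stop-word filtering, and term-weighting steps.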
Methodology Steps..
Methodology for text summarization involves
Tokenization or Segmentation
Stop word Filtering
Stemming or Lemmatization
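Term weighting

$$ W_{ij} = f_{ij} \log \frac{N}{n_j} $$

Sentence Signature

$$ (TSM) = \begin{pmatrix} W_{11} & W_{12} & \cdots & W_{1n} \\ W_{21} & W_{22} & \cdots & W_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ W_{m1} & W_{m2} & \cdots & W_{mn} \end{pmatrix} $$

Information Gain

$$ \frac{n - i + 1}{n} $$

$$ (TSM) = \begin{pmatrix} IG(W_{11}) & IG(W_{12}) & \cdots & IG(W_{1n}) \\ IG(W_{21}) & IG(W_{22}) & \cdots & IG(W_{2n}) \\ \vdots & \vdots & \ddots & \vdots \\ IG(W_{m1}) & IG(W_{m2}) & \cdots & IG(W_{mn}) \end{pmatrix} $$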
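As an illustration of the term-weighting step, the sketch below builds a small weight matrix from tokenized sentences using the weight f_ij log(N / n_j). The index convention is an assumption, since the slides do not state it: here row i is a sentence, column j is a term, f_ij is the count of term j in sentence i, N is the number of sentences, and n_j is the number of sentences containing term j. The information-gain step is not covered.

```python
import math

def tfidf_matrix(sentences):
    """Return (vocab, matrix) with matrix[i][j] = f_ij * log(N / n_j).

    sentences: list of token lists, one list per sentence.
    Rows are sentences and columns are terms (assumed convention).
    """
    vocab = sorted({term for sent in sentences for term in sent})
    total = len(sentences)  # N
    # n_j: number of sentences that contain term j
    containing = {t: sum(1 for s in sentences if t in s) for t in vocab}
    matrix = [
        [sent.count(t) * math.log(total / containing[t]) for t in vocab]
        for sent in sentences
    ]
    return vocab, matrix

toy_sentences = [
    ["text", "summary"],
    ["summary", "reward"],
    ["text", "text", "policy"],
]
vocab, weights = tfidf_matrix(toy_sentences)
for row in weights:
    print([round(w, 3) for w in row])
```

A term that occurs in every sentence gets weight 0 under this formula, because log(N / n_j) is then log(1).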
Policy
Reward
Action
Environment
environment.
ε-greedy

$$ a_t = \begin{cases} a_t^{*} & \text{with probability } 1 - \varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases} $$
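The ε-greedy rule can be sketched as follows. This is the generic textbook form, given under the assumption that action values are held in a simple list; the function name and the list representation are illustrative and not taken from the presented system.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon a random action is explored; otherwise the
    action with the highest estimated value is exploited.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

estimates = [0.2, 0.9, 0.5]
print(epsilon_greedy(estimates, epsilon=0.1))
```

With `epsilon=0` the choice is always greedy, and with `epsilon=1` it is always random, which makes the two branches easy to check separately.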
Q-Learning
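For reference, a minimal sketch of the standard tabular Q-learning update, Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)]. The slides do not show how states, actions, or rewards are defined for summarization, so the state names, the action names, and the values of alpha and gamma below are purely hypothetical.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

Q = defaultdict(float)      # unseen (state, action) pairs start at 0.0
actions = ["keep", "drop"]  # hypothetical action names
first = q_update(Q, "s0", "keep", reward=1.0, next_state="s1", actions=actions)
second = q_update(Q, "s0", "keep", reward=1.0, next_state="s1", actions=actions)
print(first, second)
```

Each repeated positive reward moves the estimate a fraction alpha closer to the target, which is how feedback gradually reshapes the stored values.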
Processing Step:
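$$ (TSM) = \begin{pmatrix} IG(W_{11}) & IG(W_{12}) & \cdots & IG(W_{1n}) \\ IG(W_{21}) & IG(W_{22}) & \cdots & IG(W_{2n}) \\ \vdots & \vdots & \ddots & \vdots \\ IG(W_{m1}) & IG(W_{m2}) & \cdots & IG(W_{mn}) \end{pmatrix} $$

$$ \text{updated } (TSM) = \begin{pmatrix} IG(W_{11})_{\text{updated}} & IG(W_{12})_{\text{updated}} & \cdots & IG(W_{1n})_{\text{updated}} \\ IG(W_{21})_{\text{updated}} & IG(W_{22})_{\text{updated}} & \cdots & IG(W_{2n})_{\text{updated}} \\ \vdots & \vdots & \ddots & \vdots \\ IG(W_{m1})_{\text{updated}} & IG(W_{m2})_{\text{updated}} & \cdots & IG(W_{mn})_{\text{updated}} \end{pmatrix} $$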
Summary Generation :
Sentence selection :
Euclidean n-space
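$$ P = (p_1, p_2, \ldots, p_n), \quad Q = (q_1, q_2, \ldots, q_n) $$

$$ d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} $$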
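The distance between two points P and Q in Euclidean n-space can be computed as in the sketch below. The slides do not say which vectors are compared during sentence selection, so the example vectors are arbitrary and the function name is illustrative.

```python
import math

def euclidean_distance(p, q):
    """Return the Euclidean distance between two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of components")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

P = (0.0, 0.0)
Q = (3.0, 4.0)
print(euclidean_distance(P, Q))
```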
Dataset
12 document sets
Number of documents in each set: 25
Average number of sentences: 32
Evaluation
Evaluation Techniques
$$ \text{Precision } (P) = \frac{100\, r}{K_m} $$

$$ \text{Recall } (R) = \frac{100\, r}{K_h} $$

$$ F\text{ score} = \frac{2 P R}{P + R} = \frac{100 \cdot 2r}{K_h + K_m} $$
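The three measures can be computed with a short helper. The meaning of the symbols is an assumption, since the slides do not define them: r is taken as the number of sentences common to the machine and human summaries, K_m as the length of the machine summary, and K_h as the length of the human summary. The counts in the example are made up for illustration.

```python
def evaluate(r, k_m, k_h):
    """Return (precision, recall, f_score) as percentages.

    r   : sentences shared by machine and human summaries (assumed meaning)
    k_m : sentences in the machine summary (assumed meaning)
    k_h : sentences in the human summary (assumed meaning)
    """
    precision = 100 * r / k_m
    recall = 100 * r / k_h
    f_score = 100 * 2 * r / (k_h + k_m)
    return precision, recall, f_score

p, rec, f = evaluate(r=6, k_m=8, k_h=10)
print(p, rec, round(f, 2))
```

The closed form 100 · 2r / (K_h + K_m) gives the same value as 2PR / (P + R), so either can be used.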
Result
Summarizer              Precision value (P)   Recall value (R)   F-score
SAAR (user feedback)    90                    85                 87.42
IG summary              75                    65                 70.57
OTS                     75                    60                 66.66
PS                      75                    60                 66.66
ETSS                    75                    60                 66.66

[Chart: F-Score and Recall Value, on a scale of 0 to 100, for ETSS, PS, OTS, IG Summary and SAAR Based]
This will increase its speed as well as make the summary or abstract more generic.
Thank You