
TEXT SUMMARIZER

Project report submitted in partial fulfilment of


the requirement for the degree of

Bachelor of Engineering
In
Computer Science and Engineering
By
Nilanjan Chakraborty [BE/6175/13]
Abhishek Kumar [BE/6177/13]
Arka Prabho Pramanik [BE/6179/13]
Akash Ratan [BE/6183/13]

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


BIRLA INSTITUTE OF TECHNOLOGY MESRA

May 2017
CERTIFICATE

It is certified that the work contained in the project report titled Text Summarizer
by Nilanjan Chakraborty, Abhishek Kumar, Arka Prabho Pramanik, and Akash
Ratan has been carried out under my supervision, and that this work has not been
submitted elsewhere for a degree of Bachelor of Engineering in Computer Science
and Engineering.

Signature of Supervisor

Rayees Ahamed Khan


BIT Mesra
Off Campus Deoghar
May 2017
DECLARATION

I declare that this written submission represents my ideas in my own words and
where others' ideas or words have been included, I have adequately cited and
referenced the original sources. I also declare that I have adhered to all principles of
academic honesty and integrity and have not misrepresented or fabricated or falsified
any idea/data/fact/source in my submission. I understand that any violation of the
above will be cause for disciplinary action by the Institute and can also evoke penal
action from the sources which have thus not been properly cited or from whom
proper permission has not been taken when needed.

(Signature)
Nilanjan Chakraborty [BE/6175/13]

(Signature)
Abhishek Kumar [BE/6177/13]

(Signature)
Arka Prabho Pramanik [BE/6179/13]

(Signature)
Akash Ratan [BE/6183/13]

Date: __________
APPROVAL SHEET

This project report entitled Text Summarizer by Nilanjan Chakraborty,
Abhishek Kumar, Arka Prabho Pramanik, and Akash Ratan is approved for the
degree of Bachelor of Engineering in Computer Science and Engineering.

Examiner
________________________
________________________

Supervisor
________________________
________________________

I/C of the department


________________________

Date :____________
Place:____________
ACKNOWLEDGEMENT

We would like to express our heartfelt gratitude to our project mentor, Mr. Rayees
Ahamed Khan, for his patient guidance, invaluable advice and supervision. We
would like to thank him for being so accommodating as to adjust his busy schedule to
resolve our queries. He has selflessly imparted his expertise and years of experience
to us throughout the whole duration of the final year project. Without his help, we
would not have been able to complete our project.
We would also like to extend our gratitude to all the faculty members of the Department
of Computer Science and Engineering of Birla Institute of Technology Mesra,
Deoghar Campus, who have taught us throughout our university days, for the
knowledge and guidance imparted to us. Without them, the educational experience in
college would not have been so enjoyable and enriching.
We would also like to express warm appreciation to our colleagues and friends
who have been so accommodating and helpful in our pursuit of academic
excellence.
Last but not least, we would like to extend our heartfelt thanks to our families for their
unconditional love and constant moral support throughout our academic years.

(Signature)
Nilanjan Chakraborty [BE/6175/13]

(Signature)
Abhishek Kumar [BE/6177/13]

(Signature)
Arka Prabho Pramanik [BE/6179/13]

(Signature)
Akash Ratan [BE/6183/13]
CONTENTS

Abstract .......... (i)

List of Figures .......... (ii)

1. Introduction .......... 1

1.1 Objective .......... 2

1.2 Application .......... 3

1.3 Overview of the Project .......... 3

1.4 Organisation of the Thesis .......... 4

2. Literature Review .......... 5

2.1 Frequency Based Approach .......... 5

2.1.1 Term Frequency .......... 5

2.1.2 Keyword Frequency .......... 5

2.1.3 Stop Words Filtering .......... 5

2.2 Term Frequency-Inverse Document Frequency .......... 6

3. Proposed Work and Architecture .......... 7

3.1 Frequency Based Approach .......... 7

3.1.1 Frequency Detection Method .......... 7

3.1.2 Keyword Frequency Method .......... 7

3.2 Feature Based Approach .......... 8-11

3.3 Architecture .......... 12-17

4. Module Description .......... 18

5. Proposed Algorithms and Implementation .......... 19

5.1 Sentence Position .......... 19

5.2 Word Similarity .......... 19

5.3 TF-IDF .......... 20

6. Results and Analysis .......... 21

7. Conclusion and Future Work .......... 26

8. Bibliography and References .......... 27


ABSTRACT

With the increasing amount of information available, it has become difficult to
extract concise information. It is therefore necessary to build systems that can produce
human-quality summaries. Text Summarizer is a tool that provides summaries of a
given document. In this project, three different approaches to text summarization
have been implemented. In all three approaches, sentences are represented as feature
vectors. In the first approach, features such as sentence position, vocabulary
intersection, resemblance to the title, and the inclusion of numerical data are used, and
the model is trained using a Genetic Algorithm. In the second approach, apart from the
features used in the first approach, the structure of the document and the popularity of
content are also used as features. In the third approach, a word-to-vector algorithm is
used to generate the summary.

LIST OF FIGURES

1. Classification of summarization tasks .......... 1

2. Summary extraction .......... 10

3. Steps of text summarization .......... 11

4. Flowchart diagram .......... 12

5. Level 0 DFD for text summarization .......... 14

6. Level 1 DFD for summary generation and synonym replacement .......... 15

7. Level 2 DFD of internal working .......... 16

8. A feature based summarization model .......... 17

9. Word similarity illustration .......... 19

10. TF-IDF illustration .......... 20

11. Text Summarizer GUI .......... 22

12. Text Summarizer with a document .......... 23

13. Text Summarizer with line ranking .......... 24

14. Text Summarizer with summary .......... 25

1. INTRODUCTION

Text summarization is the process of reducing a text document with a computer
program in order to create a summary that retains the most important points of the
original document. As the problem of information overload has grown and the
quantity of data has increased, so has interest in text summarization. It is very
difficult for human beings to manually summarize large documents of text. Text
summarization methods can be classified into extractive and abstractive
summarization. An extractive summarization method consists of selecting important
sentences, paragraphs, etc. from the original document and concatenating them into a
shorter form. The importance of sentences is decided based on statistical and
linguistic features of the sentences. Extractive methods work by selecting a subset of
existing words, phrases, or sentences in the original text to form the summary.
Extractive summarization systems are typically based on techniques for sentence
extraction and aim to cover the set of sentences that are most important for the overall
understanding of a given document.
The summary obtained with the frequency-based technique reads more coherently,
whereas with k-means clustering the summary might not make sense because
sentences are extracted out of order.

Fig 1. Classification of summarization tasks

1.1 OBJECTIVE
With the growing amount of data in the world, interest in the field of text
summarization has been increasing widely, as it reduces the manual effort of a
person working on the text. This thesis focuses on the comparison of various
existing algorithms for the summarization of text passages.

Summarization is a hard problem of Natural Language Processing because, to do it
properly, one has to really understand the point of a text. This requires semantic
analysis, discourse processing, and inferential interpretation (grouping of the content
using world knowledge). The last step, especially, is complex, because systems
without a great deal of world knowledge simply cannot do it. Therefore, attempts so
far at performing true abstraction, that is, creating abstracts as summaries, have not
been very successful.

Fortunately, however, an approximation called extraction is more feasible today. To
create an extract, a system simply needs to identify the most important, topical, or
central topic(s) of the text and return them to the reader. Although the summary is not
necessarily coherent, the reader can form an opinion of the content of the original.
Most automated summarization systems today produce extracts only.

Text Summarizer is an attempt to develop robust extraction technology as far as it can
go and then continue research and development of techniques to perform abstraction.
This work faces the depth vs. robustness tradeoff: either systems analyze and interpret
the input deeply enough to produce good summaries (but are limited to small
application domains), or they work robustly over more or less unrestricted text (but
cannot analyze deeply enough to fuse the input into a true summary, and hence
perform only topic extraction). In particular, symbolic techniques, using parsers,
grammars, and semantic representations, do not scale up to real-world size, while
Information Retrieval and other statistical techniques, being based on word counting
and word clustering, cannot create true summaries because they operate at the word
(surface) level instead of at the concept level.

Multi-document summarization aims to solve the information overload problem by
providing a condensed summary of a given set of documents on a topic. In this
work, we propose a novel multi-objective optimization approach for generic multi-
document summarization. The proposed method is based on two conflicting objective
functions, namely coverage and diversity. The generated summary aims to maximize
the content coverage of the original documents while maximizing the diversity
between sentences within the summary.

1.2 APPLICATION
Text summarization involves reducing a text file into a passage or paragraph that
conveys the main meaning of the text. Searching for important information in a large
text file is a very difficult job for users; hence the need to automatically extract the
important information, or summary, of the text file.

This summary helps users save time compared with reading the whole text file, and
it provides quick information from a large document. In today's world it is very easy
to extract information from the World Wide Web, and this extracted information
forms a huge text repository.

With the rapid growth of the World Wide Web (the internet), information overload is
becoming a problem for an increasingly large number of people. Text summarization
can be an indispensable solution to reduce the information overload problem on the
web.

1.3 OVERVIEW OF THE PROJECT


Generally, there are two approaches to text summarization: extraction and
abstraction. Extractive methods work by selecting a subset of existing words,
phrases, or sentences in the original text to form the summary.

First, the text file is cleaned by removing punctuation and common words
(conjunctions, verbs, adverbs, prepositions, etc.). Then the frequency of each word
is calculated, and the top words with the maximum frequency are selected. This
technique retrieves important sentences, emphasizing high information richness in
the sentence as well as high information retrieval. The sentences with the maximum
scores are clustered to generate the summary of the document. Thus we apply
k-means clustering to these highest-scoring sentences of the document and find
relations to extract the clusters with the most relevant sets in the document, which
helps to produce the summary of the document.

1.4 ORGANISATION OF THE THESIS

In Chapter 1, we give the introduction to our project.

In Chapter 2, we give the literature survey, which includes a review of
extraction based approaches and the existing algorithms in these approaches.

In Chapter 3, we discuss the proposed work, the architecture, and the
approaches used in the implementation.

In Chapter 4, we describe the modules of the system.

In Chapter 5, we present the proposed algorithms and their implementation.

In Chapter 6, the results of the various implemented algorithms are presented
and analysed.

In Chapter 7, the conclusions drawn from the results as well as the scope for
future work are discussed.

2. LITERATURE REVIEW

Yogan Jaya Kumar, Ong Sing Goh, Halizah Basiron, Ngo Hea Choon and Puspalata
C. Suppiah, "A Review on Automatic Text Summarization Approaches", Journal of
Computer Science, 2016, pp. 179-186.

2.1 FREQUENCY BASED APPROACH

2.1.1 TERM FREQUENCY

Term frequency is a very important feature. TF (term frequency) represents
how many times a term appears in the document; usually a compression
function such as square root or logarithm is applied to calculate the term
frequency. Sentence boundaries in a document are identified based on
punctuation such as '.', ',', '[', '{', etc., and the text is split into sentences.
These sentences are nothing but tokens.
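
As a minimal sketch of this step (the class and helper names here are illustrative,
not taken from the project's code), the text can be split into sentences on punctuation
and each term's count compressed with a logarithm:

import java.util.HashMap;
import java.util.Map;

public class TermFrequency {
    // Split text on sentence-ending punctuation and count each term,
    // applying a logarithmic compression to the raw counts.
    public static Map<String, Double> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : text.split("[.!?]")) {            // sentence boundaries
            for (String token : sentence.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        Map<String, Double> tf = new HashMap<>();
        counts.forEach((term, n) -> tf.put(term, 1 + Math.log(n))); // compression
        return tf;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("The cat sat. The cat ran."));
    }
}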

2.1.2 KEYWORD FREQUENCY


The keywords are the top high-frequency words by term frequency. After
cleaning the document, the frequency of each word is calculated, and the words
with the highest frequencies are chosen as keywords. Based on this feature, any
sentence in the document is scored by the number of keywords it contains,
where the sentence receives a score of 0.1 for each keyword.
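
A small sketch of this scoring rule (the class name and example values are
illustrative):

import java.util.Set;

public class KeywordScore {
    // A sentence receives 0.1 for each keyword occurrence it contains,
    // as described above.
    public static double score(String sentence, Set<String> keywords) {
        double score = 0.0;
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (keywords.contains(token)) {
                score += 0.1;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Set<String> keywords = Set.of("summarization", "text");
        // "text" matches twice, "summarization" once: prints roughly 0.3
        System.out.println(score("Text summarization reduces a text.", keywords));
    }
}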

2.1.3 STOP WORD FILTERING

In any document there will be many words that appear regularly but provide
little or no extra meaning to the document. Words such as 'the', 'and', 'is' and
'on' are very frequent in the English language, and most documents will contain
many instances of them. These words are generally not very useful when
searching; they are not normally what users are searching for when entering
queries.
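
A minimal sketch of stop word filtering, assuming a small illustrative stop list
(a real list would contain a few hundred entries):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StopWordFilter {
    // A tiny illustrative stop list; a production list would be much longer.
    private static final Set<String> STOP_WORDS =
            Set.of("the", "and", "is", "on", "a", "of", "to", "in");

    public static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t.toLowerCase())) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(List.of("The", "cat", "is", "on", "the", "mat")));
        // prints [cat, mat]
    }
}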

2.2 TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY
Term frequency-inverse document frequency (tf-idf) has traditionally been used in
information retrieval to deal with frequently occurring terms or words in a corpus
consisting of related documents (Jurafsky and Martin, 2009). Its purpose is to address
the following question: are all content words that frequently appear in documents
equally important? For instance, a collection of news articles reporting on an
earthquake disaster will obviously contain the word "earthquake" in all documents.

Thus the idea of tf-idf is to reduce the weightage of frequently occurring words by
comparing their proportional frequency in the document collection. This property has
made tf-idf one of the universally used techniques in extractive summarization.
Here, the term frequency (tf) of a word $i$ in a document $j$ is defined as:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

Each word count $n_{i,j}$ is divided, or normalized, by the total number of words in
document $j$. This term weight computation is similar to a word probability
computation. Next, the inverse document frequency (idf) of a word $i$ is computed:

$$idf_i = \log\frac{N}{df_i}$$

where the total number of documents in the corpus, $N$, is divided by the number of
documents $df_i$ that contain the word $i$. Based on the above equations, the tf-idf of
word $i$ in document $j$ is computed:

$$tfidf_{i,j} = tf_{i,j} \times idf_i$$
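
A compact sketch that puts the three equations together (the class and variable
names are illustrative, not from the project's code):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {
    // tf(i,j): count of word i in document j, normalized by document length.
    // idf(i): log(N / number of documents containing word i).
    public static Map<String, Double> tfIdf(List<String> docJ, List<List<String>> corpus) {
        Map<String, Integer> counts = new HashMap<>();
        docJ.forEach(w -> counts.merge(w, 1, Integer::sum));

        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = e.getValue() / (double) docJ.size();
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            double idf = Math.log(corpus.size() / (double) df);
            result.put(e.getKey(), tf * idf);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("earthquake", "hits", "city"),
                List.of("earthquake", "damage", "report"),
                List.of("earthquake", "relief", "fund"));
        // "earthquake" appears in every document, so its tf-idf is 0.
        System.out.println(tfIdf(corpus.get(0), corpus));
    }
}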

3. PROPOSED WORK AND ARCHITECTURE

3.1 FREQUENCY BASED APPROACH

First, the document is cleaned, that is, the stop words are removed from it.
Then the frequency of each word in the remaining text is counted by comparing
each selected word with every word in the file. Next, the words with the highest
frequency are selected as keywords. After that, the sentences which contain
these keywords are selected.

3.1.1 FREQUENCY DETECTION METHOD

In this technique, we first eliminate commonly occurring words and then find
keywords according to the frequency of occurrence of the words. This
assumes that, in a given passage, more attention will be paid to the topic on
which it is written, hence increasing the frequency of occurrence of the topic
word and of words similar to it. We then need to extract those lines in which these
words occur, since the other sentences wouldn't be as related to the topic as the
ones containing the keywords. Thus, a summary is generated
containing only the useful sentences.

3.1.2 KEYWORD FREQUENCY METHOD

This algorithm takes the previous algorithm to a further level. It takes into
account facts such as that the first few sentences of an article carry more weight
than the rest, since the first paragraph generally contains a gist of what is said
in the rest of the article. Secondly, it also takes into account the frequency of
occurrence, within a particular sentence, of the keywords obtained by the
previous algorithm. The higher the keyword count within a sentence, the greater
its relevance to the topic at hand.
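
A sketch of how these two ideas might combine; the position boost factor and the
three-sentence window below are assumptions for illustration, not values from the
report:

import java.util.Set;

public class KeywordFrequencyMethod {
    // Sentences early in the article get a boost (the first paragraph
    // usually contains the gist); each keyword occurrence adds to the score.
    // The 2.0 boost and 3-sentence window are illustrative assumptions.
    public static double score(String sentence, int index, Set<String> keywords) {
        double score = 0.0;
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (keywords.contains(token)) {
                score += 1.0;
            }
        }
        return index < 3 ? score * 2.0 : score;
    }

    public static void main(String[] args) {
        Set<String> kw = Set.of("summarizer");
        System.out.println(score("The summarizer ranks sentences.", 0, kw)); // 2.0
        System.out.println(score("The summarizer ranks sentences.", 5, kw)); // 1.0
    }
}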

3.2 FEATURE BASED APPROACH
One of the natural ways to determine the importance of a sentence is to identify the
features that reflect the relevance of that sentence. Edmundson (1969) defined three
features deemed indicative of sentence relevance, i.e., sentence position, presence of
title words, and cue words. For example, the beginning sentences in a document
usually describe the main information concerning the document. Therefore, selecting
sentences based on their position could be a reasonable strategy. The following
features are commonly used to determine sentence relevance:

Sentence Position
The beginning sentences in a document usually describe the main information
concerning the document. The position of a sentence in the text decides its
importance: sentences at the beginning define the theme of the document, whereas
the end sentences conclude or summarize it. The positional value of a sentence
is computed by assigning the highest score value to the first sentence and the last
sentence of the document, and the second-highest score value to the second sentence
from the start and the second-last sentence of the document. The remaining sentences
are assigned a score value of zero, as sketched after the illustration below.

Illustration: a sample document, and a summary of the document including the first
line, which contains the main information.
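
A minimal sketch of the positional scoring scheme just described (the concrete
score values 2.0 and 1.0 are illustrative assumptions):

public class PositionScore {
    // The first and last sentences get the highest score, the second and
    // second-last get the next score, and all remaining sentences get zero.
    public static double[] positionalValues(int sentenceCount) {
        double[] scores = new double[sentenceCount]; // initialized to 0.0
        if (sentenceCount > 0) {
            scores[0] = 2.0;
            scores[sentenceCount - 1] = 2.0;
        }
        if (sentenceCount > 2) {
            scores[1] = 1.0;
            scores[sentenceCount - 2] = 1.0;
        }
        return scores;
    }

    public static void main(String[] args) {
        for (double s : positionalValues(6)) System.out.print(s + " ");
        // prints 2.0 1.0 0.0 0.0 1.0 2.0
    }
}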

Term Frequency
Frequency is the number of times a word occurs in a document. If a word's frequency
in a document is high, then it can be said that this word has a significant effect on the
content of the document. The total frequency value of a sentence is calculated by
summing up the document frequency of every word it contains.

Unique words in a document with their frequency

Sentence Ranking
After the scoring of each sentence, the sentences are arranged in descending order of
their score value, i.e., the sentence with the highest score value is at the top and the
sentence with the lowest score value is at the bottom, as sketched after the
illustration below.

Sentence ranking of a sample document
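
A minimal ranking sketch (the record type and sample scores are illustrative):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SentenceRanking {
    record ScoredSentence(String text, double score) {}

    // Sort sentences in descending order of score, highest first.
    public static List<ScoredSentence> rank(List<ScoredSentence> sentences) {
        List<ScoredSentence> ranked = new ArrayList<>(sentences);
        ranked.sort(Comparator.comparingDouble(ScoredSentence::score).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<ScoredSentence> s = List.of(
                new ScoredSentence("Middle sentence.", 0.4),
                new ScoredSentence("Opening sentence.", 2.1),
                new ScoredSentence("Closing sentence.", 1.3));
        rank(s).forEach(x -> System.out.println(x.score() + "  " + x.text()));
    }
}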

Fig 2 Summary Extraction

Fig 3 Steps of proposed text summarization technique

3.3 ARCHITECTURE

The project is built in Java. The process of summarization begins with the processing
of the input document, which is broken down into sentences and subsequently into
words. The summarizer maintains a list of the sentences of the document, and each
sentence is responsible for storing the words contained in it. A flowchart of the same
is shown in Fig 4.

Fig 4 Flowchart Diagram

3.3.1 DATA FLOW DIAGRAM

A data flow diagram shows how the system processes data. It is used to represent the
relationships between components of the program. It is a technique for modeling a
system in high-level detail by presenting how input data is transformed into
productive results through a series of useful transformations.

DFDs are easy for both technical and non-technical audiences to understand. They
can present a high-level overview of the system, complete with its boundaries and its
relations to other systems, and they can also give a detailed picture of the system's
mechanisms.

PROCESS

A process is the handling that transforms data: performing computations, making
decisions (logic flow), or directing data flows based on business rules. In other
words, a process receives input and generates some output.

The Auto-Text Summarizer also performs processes, which are described in the form
of DFDs.

Following are some processes that show how data is transformed and computed.
Browsing input file

Extracting Text from input file

Extracting Summary

Checking Dictionary

Replacing words with synonyms

The data flow diagram of the auto-text summarizer gives a simple but powerful
graphical description that is straightforward to understand. It gives a viewpoint of
the data movements, consisting of the inputs and outputs that represent the flow
of information. The ability to represent the system at different levels of detail
gives an added advantage.

Working starts with taking a document or user text as input and ends with generating
a summary of the document. After the summary is generated, it is checked for
difficult words that are also present in the dictionary, so that those words can be
replaced with their meanings.

LEVEL 0 DFD FOR TEXT SUMMARIZER

The context-level data flow diagram represents the complete system; the further
workings of the system are represented in the Level 1 and Level 2 diagrams. The
context-level DFD, referred to as the Level 0 DFD, is a data flow diagram of the
scope of a system that shows the system boundaries, the external entities that
interact with the system, and the main information flows between these entities and
the system. The Level 0 diagram is the highest-level view of a system, similar to a
block diagram. System context diagrams help in a better understanding of the
context of the system.

Fig 5 LEVEL 0 DFD for Text Summarization

LEVEL 1 DFD FOR AUTO-TEXT SUMMARIZER

Fig 6 LEVEL 1 DFD for Summary Generation and Synonym Replacement

The system has two main modules. The first major module extracts the summary of a
given input document, and the second module replaces difficult words in the
summary with synonyms that are also present in a dictionary.

LEVEL 2 DFD FOR AUTO-TEXT SUMMARIZER

The Level 2 DFD presents the breakdown of the context-level DFD and represents
the inner working of the system. A data flow diagram at this level can be used to
show clearly the progress of a business process.

Fig 7 LEVEL 2 DFD of Internal Working

Initially, a user browses for a file. The file can be any Word file, the URL of a web
page, or user-specified text. The text from the input file is then extracted for further
processing. A regular expression separates the extracted text into sentences, and the
separated sentences are stored in a list. The system then calculates the priority of the
sentences; priority determines the importance of a sentence. The sentences are then
selected to be added to the summary.
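
A minimal sketch of the sentence-separation step (the regular expression here is a
simple illustrative choice; real text needs more care with abbreviations, decimals,
etc.):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceSplitter {
    // Treat a run of non-terminator characters followed by '.', '!' or '?'
    // as one sentence.
    private static final Pattern SENTENCE = Pattern.compile("[^.!?]+[.!?]");

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            sentences.add(m.group().trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split("First sentence. Second one! A third?"));
        // prints [First sentence., Second one!, A third?]
    }
}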

Fig 8 depicts the general model of a feature based summarizer. The scores for each
feature are computed and combined for sentence scoring. Prior to sentence scoring,
these features are given weights to determine their level of importance. In this case,
feature weighting is applied to determine the weight associated with each feature,
and the sentence score is then computed as the linear combination of each feature
score multiplied by its corresponding weight:

$$\text{score}(s) = \sum_{k} w_k \, f_k(s)$$

where $f_k(s)$ is the score of feature $k$ for sentence $s$ and $w_k$ is its weight.

Fig 8 A feature based summarization model
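
A small sketch of this weighted combination (the feature values and weights are
illustrative, not learned values from the report):

public class FeatureCombiner {
    // score(s) = sum over k of w_k * f_k(s), as in the equation above.
    public static double sentenceScore(double[] featureScores, double[] weights) {
        double score = 0.0;
        for (int k = 0; k < featureScores.length; k++) {
            score += weights[k] * featureScores[k];
        }
        return score;
    }

    public static void main(String[] args) {
        double[] features = {2.0, 0.3, 0.7};  // e.g. position, keyword, tf-idf
        double[] weights  = {0.5, 0.3, 0.2};  // hand-tuned or learned weights
        System.out.println(sentenceScore(features, weights)); // 1.23
    }
}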

4. MODULE DESCRIPTION
We are building Text Summarizer, a system that combines symbolic concept-level
world knowledge with statistical techniques. Text summarization is based on the
following 'equation':

Summarization = Topic Identification + Interpretation + Generation

For each step, the system hybridizes techniques as follows:

Module 1. Topic Identification:

Generalizing word-level IR techniques, and adding additional techniques of topic
spotting, we use lexicons and dictionaries to perform 'concept counting' and
generalization, in order to identify important topics in the text. English, Japanese,
Spanish, Indonesian, and Arabic preprocessing modules and lexicons provide
multilingual capabilities. This is the most developed stage of Text Summarizer at
this time.

Module 2. Interpretation:

After topic identification, the interpretation of the most important lines is done. This
is done by clustering the lines into sub-levels, each with its own topic.

Training on Wall Street Journal and other texts, we employ statistical techniques from
IR (word clustering, tf-idf, chi-squared) and cognitive psychology (latent semantic
analysis, WordNet, etc.), as well as lexicons and dictionaries, to perform 'concept-
based' topic fusion (interpretation) to find true summarizing concepts, and thereby
achieve the robust performance required for general utility.

Module 3. Generation:

After interpretation, that is, subdividing the content into topics, the relevant topics
are chosen that form the summary of the given document. So we are basically
generating the summary of the document.

We will develop two alternatives: a keyword lister and a phrase template generator.
Both will provide hyperlinks from the summary back into the source document.

Prototypes of each portion of the system have already been built and have been
separately evaluated.

5. PROPOSED ALGORITHMS AND IMPLEMENTATION

The summarizer fetches pages from the local hard drive and filters out tags, references
and other meta content from them. The processed input page is then broken down into
sections. Each section maintains a separate list of the sentences contained in it. After
processing and storing the page content, the summarizer calculates feature values for
each sentence. The features used are as follows:

5.1 SENTENCE POSITION: This is a traditional method of providing a score for the
sentences. Each sentence in a section is given a score based on its relative position in
its section; sentences appearing at the beginning of the section are given higher
weightage, as sketched below.
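
The report's scoring equation is not reproduced here; a common choice, assumed
below purely for illustration, is a score that decays linearly with the sentence's
position inside its section:

public class RelativePositionScore {
    // Assumption for illustration: score decays linearly with position,
    // so the first sentence of a section scores 1.0.
    public static double score(int position, int sectionLength) {
        return 1.0 - (double) position / sectionLength;
    }

    public static void main(String[] args) {
        int n = 5;
        for (int i = 0; i < n; i++) {
            System.out.printf("sentence %d -> %.2f%n", i, score(i, n));
        }
        // sentence 0 -> 1.00 ... sentence 4 -> 0.20
    }
}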

5.2 WORD SIMILARITY: This feature provides customizability to the summarizer,
in that a user can select keywords that he wants or doesn't want in the summary. The
summarizer then compares the similarity of each such word with every word in the
sentence using the WordNet dataset. If a sentence contains similar words, it is scored
depending upon the preference of the user: for positive keywords, the score of
sentences containing similar words is incremented; for negative keywords, the score
of sentences containing similar words is decremented. Some examples of word
similarity are shown in Fig 9.
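
A sketch of this adjustment; the similarity function below is only a placeholder
standing in for a WordNet-based measure, and the 0.8 threshold is an assumption:

import java.util.Set;

public class WordSimilarityScore {
    // Hypothetical stand-in for a WordNet-based similarity measure;
    // a real implementation would query the WordNet dataset.
    static double similarity(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0; // placeholder only
    }

    // Raise the score for sentences matching positive keywords and lower it
    // for negative ones, as described above. The 0.8 threshold is assumed.
    public static double adjust(double score, String sentence,
                                Set<String> positive, Set<String> negative) {
        for (String token : sentence.split("\\W+")) {
            for (String kw : positive) {
                if (similarity(token, kw) > 0.8) score += 1.0;
            }
            for (String kw : negative) {
                if (similarity(token, kw) > 0.8) score -= 1.0;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(adjust(0.0, "Solar energy is renewable.",
                Set.of("energy"), Set.of("coal"))); // 1.0
    }
}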

Fig 9 Word Similarity Illustration

5.3 TF-IDF: The TF-IDF of a word in a sentence is inversely proportional to the
number of documents which also contain that word. Words with high TF-IDF values
imply a strong relationship with the sentence they appear in, suggesting that if that
word were to appear in a sentence, it could be of interest to the user. Document
frequency values are constructed using 1000 randomly selected Wikipedia pages;
the higher the frequency of a word in the corpus, the lower its value will be. The
score of a sentence is calculated using the equation

$$\text{score}(S) = \sum_{w \in W} tfidf(w)$$

where $W$ is the list of words contained in the sentence. An illustration is shown in
Fig 10.

Fig 10 TF-IDF Illustration
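
A minimal sketch of this sentence scoring, assuming the tf-idf values have already
been precomputed from the corpus (the map contents here are illustrative):

import java.util.List;
import java.util.Map;

public class TfIdfSentenceScore {
    // score(S) = sum of the tf-idf values of the words W in the sentence,
    // matching the equation above.
    public static double score(List<String> words, Map<String, Double> tfidf) {
        double total = 0.0;
        for (String w : words) {
            total += tfidf.getOrDefault(w.toLowerCase(), 0.0);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> tfidf = Map.of("summarization", 3.2, "text", 1.1);
        // prints approximately 4.3
        System.out.println(score(List.of("Text", "summarization", "works"), tfidf));
    }
}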

6. RESULTS AND ANALYSIS

Two extraction-type summarization tasks often addressed in the literature are keyword
extraction, where the goal is to select individual words to "tag" a document, and
document summarization, where the goal is to select whole sentences to create a short
paragraph summary.
Using the tf-idf technique, we conducted multiple experiments with sentence-
clustering-based document summarization. For each experiment, the input data sets
are preprocessed by removing stop words, but no stemming is applied. Each
experiment deals with a summarization method in which the sentence clustering
algorithm remains the same, but the cluster ordering and representative selection
techniques are replaced with the possible alternatives, to judge whether
summarization performance depends on the cluster ordering and representative
selection techniques.

The summary obtained with the frequency-based technique reads more coherently,
but with k-means clustering the summary might not make sense because sentences
are extracted out of order.

With the frequency-based technique, the extraction of important sentences involving
a keyword is tough, whereas the extraction of keyword-related topics might be one
of the most important strengths of k-means clustering.

The complexity of implementation is low for the frequency-based technique, but very
high for k-means clustering.

In comparison, the keyword frequency based summary generation algorithm has been
found to be very simple where complexity is concerned. It has generally been found
that this method provides a much better summary than the other two methods, though
this will depend on the text at hand. The summary generated by this method usually
reads more meaningfully than one generated by the k-means clustering algorithm,
since the sentences are extracted in the same order as in the given article.

SCREENSHOTS

Fig 11 Text Summarizer GUI

Fig 12 Text Summarizer with a document

Fig 13 Text Summarizer with line ranking

Fig 14 Text Summarizer with summary

7. CONCLUSION AND FUTURE WORK

Text summarization is an old challenge, but the current research direction is shifting
towards emerging domains such as biomedicine, product reviews, education, emails
and blogs. This is due to the fact that there is information overload in these areas,
especially on the World Wide Web. Text summarization is an important area of NLP
(Natural Language Processing) research. It consists of automatically creating a
summary of one or more texts. The purpose of extractive document summarization is
to automatically select a number of indicative sentences, passages, or paragraphs
from the original document. Text summarization approaches based on neural
networks, graph theory, fuzzy logic and clustering have, to an extent, succeeded in
producing an effective summary of a document. Both extractive and abstractive
methods have been researched. Most summarization techniques are based on
extractive methods. Abstractive methods are similar to the summaries made by
humans. Abstractive summarization as of now requires heavy machinery for
language generation and is difficult to replicate in domain-specific areas.

In future work, abstractive methods can be implemented. An abstractive method
builds an internal semantic representation and then uses natural language generation
techniques to create a summary.

8. BIBLIOGRAPHY AND REFERENCES
1. Fang Chen, Kesong Han and Guilin Chen, "An Approach to Sentence Selection
Based Text Summarization", Proceedings of IEEE TENCON '02, pp. 489-493, 2002.

2. A. Kiani B. and M. R. Akbarzadeh T., "Automatic Text Summarization Using
Hybrid Fuzzy GA-GP", IEEE International Conference on Fuzzy Systems, 16-21
July, Vancouver, BC, Canada, pp. 977-983, 2006.

3. C. Jaruskulchai and C. Kruengkrai, "Text Summarization Using Local and Global
Properties", Proceedings of the IEEE/WIC International Conference on Web
Intelligence, 13-17 October, Halifax, Canada: IEEE Computer Society, pp. 201-206,
2003.

4. D. R. Radev, H. Jing and M. Budzikowska, "Centroid-Based Summarization of
Multiple Documents: Sentence Extraction, Utility-Based Evaluation, and User
Studies", ANLP/NAACL Workshop on Summarization, Seattle, April 2000.

5. M. Osborne, "Using Maximum Entropy for Sentence Extraction", ACL Workshop
on Text Summarization, 2002.

