ABSTRACT
This project develops an auto summarization program that creates a short summary of any document or browsed web page. Reading large amounts of electronic text is a daunting task that takes considerable human effort. The objective of this project is therefore to design an automatic text extraction system, i.e., an auto summarization tool, that alleviates, if not totally solves, this problem. The tool uses the statistical approach to auto summarization.
Each sentence is scored using the formula
Sentence Score = (Sum of Word Scores) / (Number of Normal Words),
where the frequency count (score) of a normal word = (number of occurrences of that word in the whole document) / (number of sentences in which that word occurs).
The proposed system is computerized and requires very little data entry. Automatic text summarization is the process of using a computer program to condense a source document into a shorter version of the text while keeping the most significant information.
Following the steps above, the developed tool summarizes text quickly. It is intended to be cross-platform software that produces a summary based on the statistical methods and algorithms used.
The project has been built using C#.NET technology.
Auto Summarization Tool 2010
TABLE OF CONTENTS
ABSTRACT
INTRODUCTION
1.1 BACKGROUND
1.2 PURPOSE
1.3 SCOPE
1.4 OVERVIEW
1.5 ORGANIZATION OF REPORT
2.1 SYSTEM INTERFACES
2.2 USER INTERFACES
2.3 SOFTWARE INTERFACES
2.4 HARDWARE INTERFACE
2.5 REQUIREMENT ANALYSIS
2.6 FEASIBILITY STUDY
2.6.1 ECONOMICAL FEASIBILITY
2.6.2 TECHNICAL FEASIBILITY
2.6.3 OPERATIONAL FEASIBILITY
2.7 TECHNOLOGIES TO BE USED:
2.7.1 INTRODUCTION:
2.7.2 IMPORTANT FACTS ABOUT C#:
2.7.3 VISUAL STUDIO 2008:
2.7.4 THE COMPILER:
2.7.5 C#:
2.7.6 SQL SERVER MANAGEMENT STUDIO:
3.1 MODULE DESCRIPTION
3.1.1 TEXT PRE-PROCESSOR:
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
1.1 BACKGROUND
In the computer age, people are inundated with papers, memos, e-mail messages, reports, web
pages, schedules, reference materials, test results, and so on. All this requires a lot of time and
human effort. Unfortunately, many documents do not begin with summaries. Creation of
summaries is tedious, requiring the author to re-read the document, identify major themes,
and distill the main points of the document into a concise summary.
Summarizing a document is even more difficult and time-consuming for a reader. The reader must first read the entire document (or at least skim it) to understand its contents, and must then separate the document's key points from unimportant details.
The problem is less critical, but still troubling, for individual users who browse the Internet or other networks to find documents on a related topic. With the explosion in the quantity of online text and multimedia information in recent years, there has been renewed interest in automatic summarization.
Readability is also a constraint. Documents are not all written or maintained by the same person, so their format and style differ, which makes them harder to understand.
Sometimes all the reading effort and time is wasted because, after reading the whole document, the reader finds it to be of no use.
1.2 PURPOSE
A summary, or recap, is a shortened version of the original that restates the main ideas of the text in as few words as possible. The main purpose of such a simplification is to highlight the major points of the original (much longer) text. The aim is to help the user get the gist in a short period of time.
The Auto Summarizer analyzes a document and assigns a score to each sentence. The user decides the amount of detail wanted, and the Auto Summarizer uses the scoring system to extract the key points and assemble them.
This project therefore develops a program that creates a short summary of any document or browsed web page. Its applications include summarizing search-engine results and providing briefs of large documents that do not have an abstract.
Automatic text summarization is a process of condensing a source document into a shorter
version of text, while keeping the most significant information using a computer program.
1.3 SCOPE
An abstract or summary at the beginning of a document can help a reader quickly understand the scope of a body of information. The AutoSummarize tool in Microsoft Office Word highlights and assembles the key points of a document.
AutoSummarize analyzes a document and then assigns a score to each sentence. The user decides the amount of detail wanted, and AutoSummarize uses the scoring system to extract the key points and assemble them.
1.4 OVERVIEW
Statistical summarizers operate by finding the important sentences using statistical methods (such as the frequency of a particular word). They normally do not use any linguistic information. In this project, an auto-summarization tool is developed using statistical techniques. The techniques involve finding the frequency of words, scoring the sentences, and ranking the sentences. The summary is obtained by selecting a particular number of sentences (specified by the user) from the top of the ranked list. The tool operates on a single document (but can be made to work on multiple documents by choosing proper algorithms for integration) and provides a summary of the document. The size of the summary can be specified by the user when invoking the tool. Pre-processing interfaces handle the following document types: Plain Text, HTML, and Word Document.
In this project, the technique used for summarizing is the statistical one. It was chosen because, compared with the linguistic method, it is much easier and quicker to develop, while the summaries generated by the two kinds of summarizer are not very different.
We score the sentences in the given text and generate a summary comprising the most important ones. The program takes its input from a text file and writes the summary to a similar text file. The summary is obtained by selecting a particular number of sentences from the top of the ranked list; the size of the summary is specified by the user when invoking the tool. Scoring techniques such as finding the frequency of words, scoring the sentences, and ranking the sentences based on their placement are implemented.
1.5 ORGANIZATION OF REPORT
This report is divided into sections that follow the order in which the project was completed. The abstract gives a brief and simple description of the overall project. Section one is the introduction; it gives an overview of the kinds of summarizers that exist, which of them is used in this project, and why.
Section two covers the product perspective. It highlights the initial requirements for developing the project, such as the software interface, hardware interface, requirement analysis, and feasibility study, and explains the technical terms used.
Section three describes the modules and how they work. Section four describes the database tables used for storing information, such as the basic word table and the stop word table. It is followed by a diagrammatic description of the system using context diagrams, DFDs, and flowcharts of all the project modules.
The last section of this report includes a working example from the developed tool along with screenshots and the code used for the project.
PRODUCT PERSPECTIVE
The user can enter the size of the summary, i.e., the number of sentences to be displayed in the summary. The user can also browse for a file on the system, or for a web page, whose summary is to be generated; such pages are converted to plain text first. This gives the user the flexibility to control the output and have the summary displayed on screen.
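The conversion of a web page to plain text mentioned above can be sketched as follows. This is a minimal Python illustration using only the standard library, not the tool's own C# code. The names PlainTextExtractor and html_to_plain_text are hypothetical, and skipping the content of script and style elements is an assumption about what counts as readable text.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping script and style content."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: these never hold readable text

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by the removed markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_plain_text(html):
    """Return the visible text of an HTML string as a single line."""
    extractor = PlainTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()
```

The resulting plain text could then be passed to the same pipeline that handles text files.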
Platform : C#.net
Front End
Language Used : C#
RAM 2 GB
Table 2.2
Interview
Questionnaires
Record inspection
On-site observation
2.6 FEASIBILITY STUDY
A feasibility study is conducted to select the best system that meets the performance requirements. This entails identifying and describing candidate systems, evaluating them, and selecting the best system for the job. The study requires a statement of constraints, the identification of specific system objectives, and a description of the outputs that define performance. The key considerations in feasibility analysis are:
Economic Feasibility
Technical Feasibility
Operational Feasibility
The back end used for storing other details (words) is also a database, MySQL. Present-day computers are sufficiently capable and do not need extra components to load the software. Hence the tool can be deployed on a new system without any additional expenditure, and it is economically feasible.
Framework – .NET
2.7.1 INTRODUCTION:
C#.NET is Microsoft's technology for creating robust application software. It is built on the Microsoft .NET Framework, a cluster of closely related technologies that change everything from database access to distributed applications. C#.NET is one of the most important components of the .NET Framework: it is the part that enables you to develop high-performance desktop applications and services.
C# on its own was designed as a simple language for building applications quickly; C#.NET, by contrast, is a full platform for developing comprehensive, fast applications.
2.7.3 VISUAL STUDIO 2008:
Visual Studio 2008 and Visual Studio Team System 2008, codenamed Orcas, were released to MSDN subscribers on 19 November 2007 alongside .NET Framework 3.5.
Visual Studio 2008 is focused on development of Windows Vista, 2007 Office system, and Web
applications. For visual design, a new Windows Presentation Foundation visual designer and a
new HTML/CSS editor influenced by Microsoft Expression Web are included. J# is not included.
Visual Studio 2008 requires .NET 3.5 Framework and by default configures compiled assemblies
to run on .NET Framework 3.5, but it also supports multi-targeting which lets the developers
choose which version of the .NET Framework the assembly runs on. Visual Studio 2008 also
includes new code analysis tools, including the new Code Metrics tool (only in Team Edition and
Team Suite Edition). For Visual C++, Visual Studio adds a new version of Microsoft Foundation
Classes (MFC 9.0) that adds support for the visual styles and UI controls introduced with
Windows Vista. For native and managed code interoperability, Visual C++ introduces the
STL/CLR, which is a port of the C++ Standard Template Library (STL) containers and
algorithms to managed code. STL/CLR defines STL-like containers, iterators and algorithms that
work on C++/CLI managed objects.
Visual Studio 2008 features include an XAML-based designer (codenamed Cider), workflow
designer, LINQ to SQL designer (for defining the type mappings and object encapsulation for
SQL Server data), XSLT debugger, JavaScript IntelliSense support, JavaScript Debugging
support, support for UAC manifests, a concurrent build system, among others. It ships with an
enhanced set of UI widgets, both for Windows Forms and WPF. It also includes a multithreaded
build engine (MSBuild) to compile multiple source files (and build the executable file) in a
project across multiple threads simultaneously. It also includes support for
compiling PNG compressed icon resources introduced in Windows Vista. An updated XML
Schema designer will ship separately some time after the release of Visual Studio 2008.
The Visual Studio debugger includes features targeting easier debugging of multi-threaded
applications. In debugging mode, in the Threads window, which lists all the threads, hovering
over a thread will display the stack trace of that thread in tooltips. The threads can directly be
named and flagged for easier identification from that window itself. In addition, in the code
window, along with indicating the location of the currently executing instruction in the current
thread, the currently executing instructions in other threads are also pointed out. The Visual
Studio debugger supports integrated debugging of the .NET 3.5 Framework Base Class Library
(BCL) which can dynamically download the BCL source code and debug symbols and allow
stepping into the BCL source during debugging. As of 2010 a limited subset of the BCL source is
available, with more library support planned for later.
.NET separates these two pieces. That way, every language can use the same design tools.
The .NET language compilers include the following:
• The Visual Basic compiler (vbc.exe)
• The C# compiler (csc.exe)
2.7.5 C#:
C# (pronounced "see sharp") is a multi-paradigm programming language encompassing imperative, functional, generic, object-oriented (class-based), and component-oriented programming disciplines. It was developed by Microsoft within its .NET initiative.
2.7.6 SQL SERVER MANAGEMENT STUDIO:
SQL Server Management Studio is a tool included with Microsoft SQL Server 2005 and later versions for configuring, managing, and administering all components within Microsoft SQL Server. The tool includes both script editors and graphical tools which work with objects and features of the server.
A central feature of SQL Server Management Studio is the Object Explorer, which allows the
user to browse, select, and act upon any of the objects within the server.
MODULES
The system contains several modules. The user interacts with the system through the GUI provided. The modules are as follows:
Text Pre-processor
Sentence Separator
Words Separator
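The sentence separator and word separator modules listed above can be illustrated with a short Python sketch. It is not the tool's C# implementation: the function names are hypothetical, and treating '.', '!' or '?' followed by whitespace as a sentence boundary is a simplifying assumption (abbreviations such as "e.g." would be split incorrectly).

```python
import re


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence):
    """Return the lowercase words of a sentence, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())
```

Lowercasing in the word separator means that "Light" and "light" are later counted as the same word.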
Figure 3.1
3.1.6 SCORING ALGORITHM:
This algorithm determines the score of each sentence. Several possibilities exist. The score can be made proportional to the sum of the frequencies of the different words comprising the sentence (i.e., if a sentence has 3 words A, B and C, then the score is proportional to the sum of how many times A, B and C have occurred in the document). The score can also be made inversely proportional to the number of sentences in the document in which the words of the sentence appear. Many such heuristic rules can be applied to score the sentences.
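The first heuristic above, scoring a sentence by the summed document frequency of its distinct words, can be sketched in Python as follows. The function names and the simple tokenizer are assumptions made for illustration; the report does not show the tool's actual code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase words of a text, ignoring punctuation (illustrative tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def frequency_scores(sentences):
    """Score each sentence by the summed document frequency of its distinct words."""
    document_counts = Counter(word for s in sentences for word in tokenize(s))
    return [
        sum(document_counts[word] for word in set(tokenize(s)))
        for s in sentences
    ]
```

With the sentences "A B C.", "A B." and "A.", the word counts are A = 3, B = 2 and C = 1, so the scores are 6, 5 and 3. A raw sum like this favours long sentences, which is one reason to normalize the score.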
3.1.7 RANKING:
The sentences are ranked according to their scores. Other criteria, such as the position of a sentence in the document, can also be used to control the ranking. For example, consecutive sentences would not both be placed in the summary even if both score highly.
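One possible reading of the ranking rule above is sketched below in Python: sentences are ordered by score, and a sentence is skipped if it is a direct neighbour of one already chosen. This interpretation, and the names rank_sentences and pick_non_adjacent, are assumptions and not taken from the tool itself.

```python
def rank_sentences(scores):
    """Return sentence indices ordered from highest to lowest score."""
    # sorted() is stable, so sentences with equal scores keep document order.
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def pick_non_adjacent(ranked, limit):
    """Take up to `limit` indices in rank order, skipping neighbours of earlier picks."""
    chosen = []
    for index in ranked:
        if len(chosen) == limit:
            break
        if all(abs(index - other) > 1 for other in chosen):
            chosen.append(index)
    return chosen
```

For scores 5, 9, 8, 2, 7 the ranked order of indices is 1, 2, 4, 0, 3; asking for two sentences gives indices 1 and 4, because index 2 is adjacent to index 1. Note that this rule can return fewer sentences than requested when the best candidates cluster together.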
3.1.8 SUMMARIZING:
Based on the user input on the size of the summary, the sentences will be picked from the ranked
list and concatenated. The resulting summary file could be stored with a name like
<originalfilename>_summary.txt.
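The summarizing step can be sketched as follows in Python. The function names are hypothetical, and restoring the selected sentences to document order before joining them is an assumption; the report only states that the sentences are picked from the ranked list and concatenated.

```python
from pathlib import Path


def build_summary(sentences, ranked, size):
    """Join the `size` best-ranked sentences, restored to document order."""
    selected = sorted(ranked[:size])
    return " ".join(sentences[i] for i in selected)


def summary_path(original):
    """Map e.g. 'report.txt' to 'report_summary.txt' in the same folder."""
    path = Path(original)
    return path.with_name(path.stem + "_summary.txt")
```

Keeping document order tends to make an extractive summary easier to read, since later sentences often depend on earlier ones.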
Figure 3.2
On the other hand, the score of a Normal Word is calculated on the following criterion:
“Its score is directly proportional to the total number of occurrences of the word in the whole document and inversely proportional to the number of sentences in which the word occurs.”
NOTE:
Normal words are the words excluding the stop words.
A list of stop words is given at the end of this document.
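The normal-word rule above, combined with the sentence score formula from the abstract (sum of word scores divided by the number of normal words), can be sketched in Python as follows. The stop word set here is a tiny illustrative placeholder, not the report's list, and counting every occurrence of a normal word within a sentence is an assumption about how the sum is taken.

```python
import re
from collections import Counter

# Placeholder stop words for illustration only; not the report's actual list.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "it", "and"}


def normal_words(sentence):
    """Lowercase words of a sentence with stop words removed."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def word_scores(sentences):
    """Word score = occurrences in the document / sentences containing the word."""
    occurrences = Counter()
    containing = Counter()
    for sentence in sentences:
        words = normal_words(sentence)
        occurrences.update(words)
        containing.update(set(words))
    return {w: occurrences[w] / containing[w] for w in occurrences}


def sentence_scores(sentences):
    """Sentence score = sum of word scores / number of normal words."""
    scores = word_scores(sentences)
    result = []
    for sentence in sentences:
        words = normal_words(sentence)
        result.append(sum(scores[w] for w in words) / len(words) if words else 0.0)
    return result
```

As a worked case, take the two sentences "Gravity bends light and gravity bends time." and "Light is fast." The word "gravity" occurs twice in one sentence, so its score is 2/1 = 2; "light" occurs twice across two sentences, so its score is 2/2 = 1. The first sentence has six normal words with scores 2, 2, 1, 2, 2, 1, giving 10/6, and the second has two normal words with scores 1 and 1, giving 1.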
DATABASE DESCRIPTIONS
This table consists of stop words, i.e., common English words such as a, an, and the, which need to be eliminated from the document before each sentence is scored; these words are not counted when calculating the score of a sentence.
FIELD    TYPE
Words    VARCHAR2
Table 4.1
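A one-column stop word table like the one above can be illustrated with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for the project's database, the table name stop_words and the function names are hypothetical, and VARCHAR(50) replaces VARCHAR2, which is an Oracle type.

```python
import sqlite3


def create_stop_word_table(connection, words):
    """Create a one-column stop word table and fill it with lowercase words."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS stop_words (words VARCHAR(50) PRIMARY KEY)"
    )
    connection.executemany(
        "INSERT OR IGNORE INTO stop_words (words) VALUES (?)",
        [(word.lower(),) for word in words],
    )
    connection.commit()


def load_stop_words(connection):
    """Read the table back into a set for fast membership tests."""
    return {row[0] for row in connection.execute("SELECT words FROM stop_words")}


def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop word set (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]
```

Loading the table into a set once, rather than querying it for every word, keeps the filtering step fast during scoring.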
INFORMATION DESCRIPTION
Figure 5.1
Figure 5.2
Figure 5.3
Figure 5.4
5.4.3 SCORING
Figure 5.5
5.4.4 RANKING
Figure 5.6
Figure 5.7
OVERALL DESCRIPTION
Some information is lost when the summary is generated. The amount lost depends on the number of sentences specified by the user.
It is assumed that a few important sentences containing the key words of the given text will give an accurate summary of the text.
Every user should be comfortable working with a computer.
The user must also have basic knowledge of English.
Another advantage of the summarizer stems from the combined statistical and basic word
processing. This dual analysis is beneficial because the statistical component ensures that a
summary will always be produced, and the basic word component improves the quality of
the resulting summary.
SPECIFIC REQUIREMENTS
7.1.1 RELIABILITY
This System is very reliable in the sense that it saves all the summaries generated earlier
in the database.
7.1.2 AVAILABILITY
The tool is available to all users. It has no complex software or hardware requirements; only the input text is needed.
7.1.3 MAINTAINABILITY
The system is very simple to maintain. Information can easily be entered in the database, and the database itself is simple to use and maintain.
7.1.4 PORTABILITY
For a software application to be good and effective, it should run on different platforms. This project is intended to be cross-platform software. The system is basically meant for a stand-alone machine, but the whole database can be transferred to another system along with the source code. Hence portability can be achieved easily.
The checklist that follows provides a set of characteristics that lead to testable software.
7.3.1 OPERABILITY ‘The better it works, the more efficiently it can be tested’
No bugs in the system block the execution of tests.
The product evolves in functional stages (allowing simultaneous development and testing).
7.3.3 CONTROLLABILITY ‘The better we can control the software, the more the
testing can be automated and optimized’
All possible outputs can be generated through some combination of input.
All code is executable through some combination of input
Software and hardware states and variables can be controlled directly by the test
engineer.
Input and output formats are consistent and structured
Tests can be conveniently specified, automated and reproduced
7.3.5 SIMPLICITY ‘The less there is to test, the more quickly we can test it’
Functional simplicity (e.g., the feature set is the minimum necessary to meet requirements).
Structural simplicity (e.g., architecture is modularized to limit the propagation of faults).
Code simplicity (e.g., a coding standard is adopted for ease of inspection and maintenance).
7.3.6 STABILITY ‘The fewer the changes, the fewer the disruptions to testing’
Changes to the software are infrequent
Changes to the software are controlled
Changes to the software do not invalidate existing tests
The software recovers well from failures
7.3.7 UNDERSTANDABILITY ‘The more information we have, the smarter we will test’
The design is well understood
Changes to the design are communicated
Technical documentation is instantly accessible
Technical documentation is well organized
Technical documentation is specific and detailed
Technical documentation is accurate
WORKING (EXAMPLE)
Input text:
“The event horizon is where the force of gravity becomes so strong that even light is pulled
into the black hole. Although the event horizon is part of a black hole, it is not a tangible
object. If you were to fall into a black hole, it would be impossible for you to know when
you hit the event horizon. For a mathematical derivation of the radius of an event horizon
see below.
The singularity is not really a tangible object either. According to the General Theory of
Relativity the Singularity is a point of infinite space time curvature. This means that the
force of gravity has become infinitely strong at the center of a black hole. Everything that
falls into a black hole by passing the event horizon, including light, will eventually reach the
singularity of a black hole. Before something reaches the singularity it is torn apart by
intense gravitational forces. Even the atoms themselves are torn apart by the gravitational
forces.”
Summary generated by the tool:
“According to the General Theory of Relativity the Singularity is a point of infinite space
time curvature. Everything that falls into a black hole by passing the event horizon,
including light, will eventually reach the singularity of a black hole. Before something
reaches the singularity it is torn apart by intense gravitational forces.”
8.1 SCREEN-SHOTS:
Figure 8.1
The splash screen appears before the main GUI and is visible for 5 seconds.
Figure 8.2
This is the main GUI of the software, where all the work takes place.