
Project Electronic Cub Reporter

Web Client-Server Architecture


To Support Advanced Text Search

By Haotian Sun

MSc in Advanced Software Engineering


Supervisor: Rob Gaizauskas
August 2005

This report is submitted in partial fulfillment of the requirements for the degree of MSc in Advanced Software Engineering.

Declaration

All sentences or passages quoted in this dissertation from other people’s work have
been specifically acknowledged by clear cross-referencing to author, work and page(s).
Any illustrations which are not the work of the author of this dissertation have been
used with the explicit permission of the originator and are specifically acknowledged. I
understand that failure to do this amounts to plagiarism and will be considered grounds
for failure in this dissertation and the degree examination as a whole.

Name: Haotian Sun

Signature:

Date:

Abstract
This project concerns the design and construction of a web-based client-server system to support advanced searching over the UK Press Association archive: approximately 20GB of newswire stories constituting the entire PA E-Archive from 1994 to the present. The new system will support journalists in the task of gathering and writing background for breaking news stories. This report includes a literature review, undertaken to gather the necessary information on web-based database-driven systems and the server-side web technologies with which the application could be implemented, and a requirements analysis carried out to determine the requirements of the system to be produced.

The system integrates conventional search engine technology with novel summarization and natural language analysis tools. The technologies used in its implementation include JSP, JavaBeans and servlets, a MySQL relational database, the Apache Lucene search engine, a database connection pooling technique and the Tomcat web server. In addition, a new method was introduced to enhance the performance of this large database-driven web application.

Acknowledgements
Thanks to Rob Gaizauskas, for his valuable guidance, help and feedback. Thanks to
Horacio Saggion for his cooperation on data and technical support. Thanks to Emma
Barker for her requirements analyses. Thanks to Martin Cooke for his useful
e-commerce reading material. Finally, thanks to my parents, my girlfriend and my
friends for their love and support.

Contents

Chapter 1: Introduction
    1.1 What is the CubReporter Project?
    1.2 Aims and Objectives
    1.3 Looking Ahead
Chapter 2: Literature Review
    2.1 Overview of Web-based Database Systems
    2.2 System Architecture Survey
        2.2.1 Tiered Architecture
        2.2.2 Available Technologies
    2.3 Web Application Models
        2.3.1 The J2EE Model
        2.3.2 The MVC Design Model
    2.4 Possible Architecture Scenarios
        2.4.1 Pure JSP implementing MVC
        2.4.2 JSP & Servlets implementing MVC
        2.4.3 JSP, Servlets & EJB implementing MVC
    2.5 Review of the PA Digital Text Library
Chapter 3: Requirements and Analysis
    3.1 Functional Requirements
        3.1.1 Search Function
        3.1.2 Data Rendering
        3.1.3 User Personalization
        3.1.4 System Administration
    3.2 Non-Functional Requirements
    3.3 Constraints
Chapter 4: Design
    4.1 Overview of Technology Choices
    4.2 Presentation Tier Design
    4.3 Business Tier Design
    4.4 Database Tier Design
Chapter 5: Implementation and Testing
    5.1 Overview
    5.2 Implementation
    5.3 Testing
        5.3.1 Functional Testing
        5.3.2 Performance Testing
Chapter 6: Results and Discussion
    6.1 Findings
    6.2 Goals Achieved
    6.3 Further Work Needed
Chapter 7: Conclusion
References
    Web References
Appendix 1: Selected Code
Appendix 2: Screenshots

Chapter 1: Introduction

1.1 What is the CubReporter Project?


Suppose a journalist from the UK Press Association (PA) is sent to cover the 2005 General Election. He writes background for breaking news stories and sends them back to headquarters over a wireless connection. As an experienced journalist he knows how many elections Tony Blair has won, but he might have forgotten how many seats Labour took in the previous election. So he opens his browser and connects to the CubReporter website, where he can log into the advanced search system and query the PA database, which contains approximately 20GB of newswire stories from 1993 to the present. He can enter either keywords such as "election Labour seats 2001" or a complete question such as "How many seats did the Labour Party get in 2001?" The system then shows him all the stories related to his query. In short, the journalist can efficiently pick out what he needs and save it in his record portfolio for further tracking.

The above scenario illustrates the purpose of the CubReporter project. The project is a web-based, client-server, database-driven system, and its key issue is how to realize a high-performance search function over the PA's existing large database via web pages.

The CubReporter Project is a large and complex project funded by the UK EPSRC for three years; it is a collaboration between researchers in the Departments of Computer Science and Journalism together with the Press Association Group. The project has been studied by other researchers for more than two years, and the content introduced here is the latest stage of its implementation. Since the project was team work, my task was to help the CubReporter researchers in the NLP Group to realize functions of the system. Table 1.1 shows the role of each member in this project.

Role                    Member
Principal Investigator  Prof. Robert Gaizauskas
Requirements Analyst    Dr. Emma Barker
Technical Support       Dr. Horacio Saggion
Database Designer       Dr. Angelo Dalli
System Developer        Mr. Haotian Sun
Table 1.1
In short, the CubReporter Project is a research project that investigates how language technologies might help journalists to access information in a news archive in the context of a background-writing task, as introduced by Gaizauskas et al. (2005).

1.2 Aims and Objectives

Currently, the UK Press Association Group has a system called the PA Digital Text Library (WWW1) to support journalists in the task of gathering and writing background for breaking news stories. The problem with this system is that it suffers from slow web-page loading and provides only limited document-search functions.

The new system under development should have many advantages over the existing one. In short, it will integrate as many functions as possible in order to make the daily work of PA journalists more convenient. For example, we may use the existing system to search for certain kinds of stories, or even use a general search engine (e.g. Google) to find some information, but these approaches suffer from either limited functionality or inexact results. What we are doing is helping the PA Group to reorganize their data archive and to develop a new system that is not only practical but also offers powerful, multiple functions.

This project aims to design and build a web-based client-server application to support
searching over the PA archive, as indexed and analyzed by the language analysis tools.
This includes the following aspects:

• Designing and writing web server side "business logic" code to access
databases and assemble content for dynamic page creation
• Developing a flexible presentation layer framework for rendering content in the
client browser
• Developing an appropriate session management framework to support collection
of search results during iterative search and navigation of search history

1.3 Looking Ahead

Firstly, a literature review was undertaken in order to gain appropriate information and knowledge on the available technologies and processes that could be used to develop the application. Since the application is a web-based system, a system architecture survey for web applications was carried out, in which the tiered architecture is introduced, followed by a more detailed description of each tier and the technologies available for its development. In addition, there is a review of available web application models and an introduction to the Model-View-Controller (MVC) design model. After that, possible architecture scenarios are raised, focusing on combinations involving JSP technology. Finally there is a review of the PA Electronic Text Library website, where some functions similar to those of the new system are introduced.

In Chapter 3 “Requirements & Analysis”, a full list of requirements is presented.


These requirements include both functional and non-functional requirements, as well as some constraints on the system. Furthermore, a detailed analysis of
these requirements was made in order to understand them and to make the correct
decisions during design and implementation. After that, in Chapter 4 “Design”, an
overview of chosen technologies is firstly introduced and then the designs for each tier
are illustrated with detailed discussion and appropriate figures. In Chapter 5
“Implementation & Testing”, there is a detailed description of the implementation
process, including a more in-depth review of the significant functions of the application, such as index-based searching. Functional testing was then carried out to evaluate the quality of the system. Moreover, a test script was designed to simulate several realistic cases in order to evaluate the novel method that enhances the performance of the search function.

In Chapter 6, the findings and results achieved during the development of this project are discussed from both the team and the individual point of view. Lastly, Chapter 7 summarizes all the work completed in this project and reviews the project as a whole.

Chapter 2: Literature Review

2.1 Overview of Web-based Database Systems

Over the past few years, most enterprises have become used to managing their information via electronic databases. Now, due to the popularity and accessibility of the World Wide Web, both companies and individuals increasingly retrieve information from web-based applications. This trend has resulted in databases being used to store all types of information. Since databases provide the ability to store information in an organized fashion with easy access and retrieval, more and more web-based systems rely on a database server for the storage of their information. Finally, with the application of distributed-systems theory, the different parts of such a system can run on multiple computers, which provides an easy and convenient way to enhance development and maintainability.

2.2 System Architecture Survey

The CubReporter project is a collaboration of the researchers introduced in Chapter 1, who built the language analysis tools and studied journalists' information-seeking behavior. My job was to work together with them to carry out the system. The first critical task was to decide on the system architecture, so that we could identify each member's tasks and work together efficiently.

2.2.1 Tiered Architecture

A review of current websites shows that most of them are based on a tiered architecture, especially those used in e-commerce and other large web-based applications. A tiered architecture is also called a distributed architecture, meaning that the system is composed of programs running on multiple hosts, such as a web server and a database server. A tier is usually one of those hosts, but it can also be a virtual, distributed application running on a single host; a tier may therefore be a logical as well as a physical layer of a system. In this project, the word 'tier' refers to logical layers. Generally, several tiers represent different kinds of logic, such as the presentation logic, which determines how information is presented to clients, and the business logic, which is the collection of objects and methods that differ from business to business. The advantage of such a design is the separation of concerns: the coding paradigms and required skill sets of each tier are different. Typically, there are the following kinds of tiered systems:

One-tier systems are either applications running on a single physical computer or isolated systems that are disconnected from any network but function correctly from a logical point of view. Applications like word processors and image viewers belong to this category. The major advantages of one-tier systems are their simplicity and high performance: since there is no need to communicate with remote services over a network, the application can achieve the maximum performance its hardware allows.

Two-tier systems run on two computers, and are composed of a client application and a server application. For example, a typical web application developed with CGI runs on a client's web browser and a web server. But, as mentioned before, a tier can be a logical layer of a system, so it is possible to have a two-tier application running on one machine divided into two layers. There are usually three parts in a two-tier system: a client, a server and a protocol. The protocol, which in most cases is HTTP, is used to exchange requests and data between the client and the server. The advantage of this architecture is that it separates presentation logic from business logic: the presentation logic lives on the client side while the business logic lives on the server side, so the majority of the data processing can be done on the server and the client only needs to receive and present the data derived from the server. However, this kind of architecture has little potential for resource sharing, which is a big problem for a large database-driven system.

Three-tier systems have one more layer, and usually represent presentation logic, business logic and data logic, the last of which ensures the persistence, security and integrity of the data. This architecture allows concurrent data access via transactions: it is common for more than one client to retrieve data simultaneously, and transactions prevent data corruption. For instance, the middle-tier server requests or stores data in the database server after it receives a request from a client, and then returns the information from the database to the client.

N-tier systems are systems with more than three tiers. This type offers great performance and scalability, increases the number of services the system can provide, and allows greater maintainability. The disadvantage of an n-tier system is that it is less efficient in most cases, more expensive, and more complex.

In short, the most widely used architecture is the three-tier type, which offers a separation of presentation logic, business logic and data logic. This architecture is argued by Chaffee (2000) to be the best solution for a web-based database-driven system. It is therefore necessary to discuss next the available technologies to employ.

2.2.2 Available Technologies

Since the three-tier architecture separates presentation logic, business logic and data logic, we can discuss the technologies that could be employed from the point of view of each tier.

• Presentation Tier
The presentation tier determines how the web pages interact with clients. Since most of the content rendered to end users is generated from the database, this project is a dynamic web system. The currently available and popular technologies for producing dynamic content in a web page are Active Server Pages, Java Server Pages and the Velocity Template Language.

Active Server Pages (ASP) is Microsoft's contribution to the development of dynamic pages. The technology has gone through two generations, classic ASP and the more recent ASP.NET. Classic ASP uses server-side scripting languages, supporting VBScript and JScript embedded in the web pages. ASP.NET supports any .NET-compliant language, such as C#. Since ASP.NET code is compiled rather than interpreted on the server, it runs faster than classic ASP. Though both technologies are widely used and easy to learn, they are heavily dependent on Microsoft's server and framework. Because this project is being developed in an open source environment, they are not an appropriate solution.

Java Server Pages (JSP) is another technology, provided by Sun, for developing web pages that include dynamic content. It enables web developers and designers to rapidly develop and easily maintain information-rich, dynamic web pages. Unlike a plain HTML page, a JSP page can change its content based on any number of variable items, including the identity of the user, the user's browser type, information provided by the user, and selections made by the user. [4] A JSP file has a ".jsp" extension and contains both standard markup language elements (e.g. HTML tags) and special JSP elements that allow the server to insert dynamic content into the page. For instance, the following shows a piece of JSP code.
<HTML>
<HEAD><TITLE>Hello World!</TITLE></HEAD>
<BODY>
<%
int i = 0;
while (i++ < 10) {
%>
<p>Do Loop # <%= i %></p>
<%
}
%>
</BODY>
</HTML>
The code enclosed in <% %> tags is embedded Java, which obeys exactly the programming rules of an ordinary Java class. When a JSP is requested for the first time, it is translated into a servlet and compiled on the server side, which allows the server to handle subsequent requests for the page much faster.

Since JSP is based on Java technology and open to use, it is supported by many open
source web servers, such as Apache Tomcat, and can implement complex web
applications via the collaboration with Java Beans and Java Servlets (discussed later).

Velocity is one of the Apache Jakarta projects: a Java-based template engine. It permits anyone to use a simple yet powerful template language to reference objects defined in Java code. [WWW1] Velocity pages are written as HTML with embedded Velocity code and end with a ".vm" extension. The following shows an example of the Velocity code style.
<HTML>
<BODY>
#set ($foo = "Velocity")
Hello $foo World!
</BODY>
</HTML>
The embedded Velocity code above produces a web page that prints "Hello Velocity World!". Generally speaking, references begin with $ and are used to get something, while directives begin with # and are used to do something. Code in this style is called the Velocity Template Language (VTL). Velocity's greatest attraction is that it fully supports the Model-View-Controller (MVC) pattern (discussed later) for organizing websites, which is regarded as a clean way for web designers and Java programmers to work in parallel: page designers can focus solely on creating an attractive site, and programmers can focus solely on writing top-notch code. Since Velocity separates Java code from the web pages, the site is easier to maintain over its lifespan, so Velocity is a valid alternative to JSP. However, JSP plus JavaBeans can also apply the MVC design pattern to websites. According to Martin Cooke (2005), the only problem with JSP is that the presentation tier and business tier may become mixed if a system is large and complex, but this does not mean JSP is unsuitable for presentation. Considering our valuable experience with JSP and Java technologies and the limited development time, JSP, in proper cooperation with JavaBeans and servlets, is sufficient to carry out the presentation tier. JSP is therefore the preferred technology for the presentation tier in this project.

• Business Tier
The business tier is the collection of objects and methods that differ from business to business; it represents how the business logic of the project is to be performed. In this tier, the available technologies are JavaBeans and Java servlets.

(1) JavaBeans

JavaBeans is a platform-independent component architecture for Java, and one of the most interesting and powerful ways to cooperate with JSP.

JavaBeans are reusable Java classes whose methods and variables follow specific naming conventions that give them added abilities. They can be embedded directly in a JSP page using the <jsp:useBean> tag. A JavaBean component can perform a well-defined task and make its resulting information available to the JSP page through simple accessor methods. The difference between a JavaBean component and a normal third-party class used by the generated servlet is that the web server can give JavaBeans embedded in a page special treatment. For instance, a server can automatically set a bean's properties from the parameter values in a client's request. In other words, if the request includes a name parameter and the server detects through introspection (a technique whereby the methods and variables of a Java class can be determined programmatically at runtime) that the bean has a name property and a setName(String name) method, the server can automatically call setName() with the value of the name parameter, so there is no need to call getParameter().
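This introspection-based matching can be sketched in plain Java using the standard java.beans package. The sketch below is only an illustration of the mechanism, not container code; the UserBean class, its userName property and the simulated parameter map are invented for the example.

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.HashMap;
import java.util.Map;

public class BeanPopulator {
    // A hypothetical bean following the JavaBeans naming conventions.
    public static class UserBean {
        private String userName;
        public String getUserName() { return userName; }
        public void setUserName(String userName) { this.userName = userName; }
    }

    // Match request-parameter names to bean property names via introspection,
    // roughly as a JSP container does for <jsp:setProperty property="*"/>.
    public static void populate(Object bean, Map<String, String> params) throws Exception {
        BeanInfo info = Introspector.getBeanInfo(bean.getClass(), Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            String value = params.get(pd.getName());
            if (value != null && pd.getWriteMethod() != null
                    && pd.getPropertyType() == String.class) {
                pd.getWriteMethod().invoke(bean, value);  // e.g. calls setUserName(value)
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> request = new HashMap<>();
        request.put("userName", "Rob");   // simulated request parameter
        UserBean bean = new UserBean();
        populate(bean, request);
        System.out.println(bean.getUserName());
    }
}
```

A real container also performs type conversion for non-String properties; this sketch deliberately handles only the String case.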

JavaBeans are embedded in a JSP page using the <jsp:useBean> action, which has the following syntax, where case is sensitive and the quotes are mandatory:
<jsp:useBean id="name" scope="page|request|session|application"
    class="classname" type="typename">
</jsp:useBean>

The attributes of the tag are explained in Table 2.1 below.

Attribute  Specification                                           Examples

id         Defines the name of the bean. This is the key under     id="potentialUser"
           which the bean is saved if its scope extends beyond     id="newStory"
           the page. If a bean instance saved under this name
           already exists in the given scope, that instance is
           used within the page; otherwise a new bean is
           created.

scope      Specifies the scope of the bean's visibility. The       scope="session"
           default value is page, which means the variable is      scope="request"
           created essentially as an instance variable. If the
           value is request, the variable is stored as a
           request attribute; if it is session, the bean is
           stored in the user's session; if application, the
           bean is stored in the servlet context.

class      Specifies the class name of the bean, used when         class="com.cubreporter.user.userBean"
           initially constructing it. The class name must be
           fully qualified.

type       Specifies the type of the bean as it should be held     type="com.cubreporter.user.userInterface"
           by the system, used for casting when the object is
           retrieved from the request, session or context. The
           value should be the fully qualified name of a
           superclass or an interface of the actual class.

Table 2.1
Furthermore, the <jsp:setProperty> action allows request parameters to automatically set properties on the beans embedded within a page. This lets beans access the parameters of the request without having to call getParameter(). Though there are four ways to use this feature, the common usage is as follows:

<jsp:setProperty name="beanName" property="*"/>

where any parameter with the same name and type as a property of the given bean is used to set that property on the bean. For example, if a bean representing the user has a setUserName(String userName) method and the request has a parameter userName with the value Rob, the server will automatically call userBean.setUserName("Rob") at the beginning of request handling. A parameter is ignored if its value cannot be converted to the proper type.

Alternatively:

<jsp:setProperty name="beanName" property="paramName"/>

where the given property is set if there is a request parameter with the same name and type. For instance,

<jsp:setProperty name="beanName" property="userName"/>.

Finally, the <jsp:getProperty> action provides a mechanism for retrieving property values without using Java code in the page. Its usage is as follows:

<jsp:getProperty name="beanName" property="paramName"/>

This includes, at this location, the value of the given property of the given bean.

(2) Java Servlets

The Java Servlet API allows a software developer to add dynamic content to a web server using the Java platform. Servlets can maintain state across many server transactions using HTTP cookies, session variables or URL rewriting. A Java servlet is a generic server extension: a Java class that can be loaded dynamically to expand the functionality of a server. Since servlets are written in the highly portable Java language and follow a standard framework, they provide a means to create sophisticated server extensions in a server- and operating-system-independent way.

The Java Servlet API makes use of the Java standard extension classes in two packages, javax.servlet and javax.servlet.http. The javax.servlet package contains classes and interfaces to support generic, protocol-independent servlets. These classes are extended by the classes in the javax.servlet.http package to add HTTP-specific functionality.

Every servlet must implement the javax.servlet.Servlet interface. Most servlets implement this interface by extending one of two special classes: javax.servlet.GenericServlet or javax.servlet.http.HttpServlet. A protocol-independent servlet should be a subclass of GenericServlet, while an HTTP servlet should be a subclass of HttpServlet, which is itself a subclass of GenericServlet with specific HTTP functionality added. Unlike a regular Java class, a servlet does not have a main() method. Instead, certain methods of a servlet are invoked by the server in the process of handling requests. Each time the server dispatches a request to a servlet, it invokes the servlet's service() method. A generic servlet should override its service() method to handle requests as appropriate. The service() method accepts two parameters: a request object, which tells the servlet about the request, and a response object, which is used to return a response. In contrast, an HTTP servlet usually does not override the service() method; instead, it overrides doGet() to handle GET requests and doPost() to handle POST requests, and it can override either or both. The service() method of HttpServlet handles the setup and dispatching to all the doXXX() methods, which is why it usually should not be overridden (Hunter & Crawford, 2001).
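The dispatch pattern just described, with service() routing to doGet() or doPost() according to the HTTP method, can be pictured with a plain-Java sketch. This is not the javax.servlet API, only a minimal analogue of its shape; the class names mirror HttpServlet for clarity, and request/response are simplified to strings.

```java
// Minimal analogue of HttpServlet's service()/doGet()/doPost() dispatch.
// NOT the real javax.servlet API; request and response are plain strings.
abstract class MiniServlet {
    public final String service(String method, String request) {
        if ("GET".equals(method))  return doGet(request);
        if ("POST".equals(method)) return doPost(request);
        return "405 Method Not Allowed";
    }
    // Default behaviour when a subclass does not override a handler.
    protected String doGet(String request)  { return "405 Method Not Allowed"; }
    protected String doPost(String request) { return "405 Method Not Allowed"; }
}

// A concrete "servlet" overrides only the handlers it supports.
public class HelloServlet extends MiniServlet {
    @Override
    protected String doGet(String request) {
        return "Hello, " + request + "!";
    }

    public static void main(String[] args) {
        MiniServlet s = new HelloServlet();
        System.out.println(s.service("GET", "world"));   // Hello, world!
        System.out.println(s.service("POST", "world"));  // 405 Method Not Allowed
    }
}
```

Making service() final in the sketch echoes the advice above: subclasses customize the doXXX() handlers, not the dispatching itself.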

The remaining classes in these two packages are largely support classes. For example, the HttpServletRequest and HttpServletResponse classes in the javax.servlet.http package provide access to HTTP requests and responses. The javax.servlet.http package also contains an HttpSession class that provides built-in session tracking and a Cookie class for quickly setting up and processing HTTP cookies, so session management for users of this project can also be implemented in a servlet.
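Conceptually, session tracking amounts to a server-side map from a session id (carried in a cookie or rewritten URL) to per-user state. A toy version in plain Java, to illustrate the idea only; this is not the HttpSession API, and the attribute names are invented:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Toy session registry: maps a session id to per-user attributes.
// Illustrates what HttpSession provides; not the servlet API itself.
public class SessionRegistry {
    private final Map<String, Map<String, Object>> sessions = new ConcurrentHashMap<>();

    // Create a new session and return its id, as a container does when it
    // first sees a client that presents no session cookie.
    public String createSession() {
        String id = UUID.randomUUID().toString();
        sessions.put(id, new ConcurrentHashMap<>());
        return id;
    }

    public void setAttribute(String sessionId, String key, Object value) {
        sessions.get(sessionId).put(key, value);
    }

    public Object getAttribute(String sessionId, String key) {
        Map<String, Object> attrs = sessions.get(sessionId);
        return attrs == null ? null : attrs.get(key);
    }
}
```

In the real API the container generates the id, stores it in a JSESSIONID cookie, and expires idle sessions; this sketch omits expiry entirely.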

Apart from the technologies mentioned above, the search engine is the most important part of the business tier. Apache Lucene is the preferred search engine because our application requires full-text search and Lucene is an open source, high-performance, full-featured text search engine (WWW2). The Lucene API has many powerful features, including a number of accurate and efficient search algorithms: ranked searching (best results returned first), field searching (e.g. title, author), many powerful query types (e.g. phrase, wildcard, proximity and range queries), date-range searching, sorting by any field, multiple-index searching with merged results, and so on.

The fundamental concepts in Lucene are the index, document, field and term. An index
contains a sequence of documents; a document is a sequence of fields; a field is a
named sequence of terms; a term is a string. The index files are built from the
enterprise's own data and generated by the enterprise itself, and it is these index files
that make efficient full-text search possible.
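The index/document/field/term model can be illustrated with a toy in-memory inverted index. This is plain Java, not the actual Lucene API; Lucene's on-disk index format is far more sophisticated:

```java
import java.util.*;

// Toy inverted index illustrating Lucene's concepts: an index holds documents,
// a document holds named fields, a field holds terms, and a term is a string.
class ToyIndex {
    // term -> sorted set of document ids containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, Map<String, String> fields) {
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Split each field's text into lower-cased terms.
            for (String term : field.getValue().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
        }
    }

    // Look up which documents contain a single term.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument(1, Map.of("headline", "I'M NOTHING LIKE MY SEXY ROLE"));
        index.addDocument(2, Map.of("headline", "New role for actress"));
        System.out.println(index.search("role"));   // prints [1, 2]
    }
}
```

A real Lucene search additionally scores and ranks the matching documents rather than returning a bare set of ids.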

• Data Tier
Today, the most popular and efficient ways to store data in a web application are a
database or XML files. The original PA data archive contains more than 8.5 million
stories totaling 20GB of data. The raw corpus has been processed and encoded in
XML following a strict Document Type Definition (DTD) specification, which
includes elements such as story date, category and topic, and structural information
such as headlines, bylines and paragraphs. One example is shown in Figure 2.1.
Obviously, in order to access the XML-formatted data, it is necessary to review the
available XML parsers. Since later work will transfer the XML data to a relational
database, available technologies for operating on databases are also introduced.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE HSA SYSTEM "../../../../dtds/HSA.DTD">
<HSA DATE="01011994" DAY="01" YEAR="1994" MONTH="01" ID="HSA1868" PRIORITY="4"
CATEGORY="HHH" COUNT="222" MSGINFO="PA" TOPIC="1 SHOWBIZ Ward" TIMEDATE="010442 JAN
94">
<HEADLINE>I'M NOTHING LIKE MY SEXY ROLE, SAYS SOPHIE</HEADLINE>
<BODY>
<BYLINE>By Rob Scully, Press Association Showbusiness Correspondent</BYLINE>
<PARAGRAPH NRO="1">Actress Sophie Ward, who plays a sex siren for her latest TV role, insists she is nothing
like that in real life. </PARAGRAPH>
<PARAGRAPH NRO="2">Sophie, 28, plays Eden in a new Ruth Rendell chiller, A Dark Adapted Eye, starting
tomorrow night on BBC1.</PARAGRAPH>
<PARAGRAPH NRO="3">``It was fun to play a glamorous part because I am not like that and it is quite hard work ...
it sometimes took three hours just to get the hair right,'' she said.</PARAGRAPH>
<PARAGRAPH NRO="4">end mf </PARAGRAPH>
</BODY>
</HSA>

Figure 2.1 Corpus Encoding


XML (Extensible Markup Language) is a plain-text, portable data format. The data in
an XML file is stored in custom tags within the markup, similar to HTML. XML is
good at storing complicated information unambiguously and is easily readable by
both humans and machines. It is fully supported by Java, so there are many APIs that
can be employed to parse an XML file, such as SAX and JDOM.

As seen in Figure 2.1, the first line indicates that the document is marked up in XML,
and the second line specifies the validating DTD and the root element. The document
proper starts from the third line, where the single root element HSA carries several
attributes, such as the date of the story and the category code "HHH", which stands
for UK Home News only. Next comes the story's headline, which is at the same level
as the body. The body has two kinds of children: the byline and the paragraphs. Each
paragraph has its own number attribute, which can be used in searches.

(1) XML Parser

In order to retrieve information from an XML file, an XML parser is needed; it
handles the important task of taking a raw XML document as input and making sense
of it. Since there are no hard and fast rules for selecting an XML parser, choosing the
right one for an application is not easy. According to Brett McLaughlin (2001), two
main criteria are typically used. The first is the speed of the parser: as XML
documents are used more frequently and their complexity grows, parsing speed
becomes extremely important to the overall performance of an application. The
second is conformity to the XML specification: some parsers skip the finer points of
the specification in order to squeeze out additional speed, so care must be taken.
Balancing these factors, JDOM is an ideal choice according to Cook (2005), because
it combines the high performance of SAX (rapid parsing and output) with the
random-access document model of DOM, without DOM's memory problems.
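Although JDOM was chosen, the same kind of extraction can be sketched with the SAX parser that ships with the JDK. The element names follow the DTD fragment in Figure 2.1; the small inline document is a cut-down example, not a real archive story:

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Event-driven SAX handler that pulls the headline and the paragraph count
// out of an HSA story document.
class HsaHandler extends DefaultHandler {
    private final StringBuilder headline = new StringBuilder();
    private boolean inHeadline = false;
    private int paragraphs = 0;

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        if (qName.equals("HEADLINE")) inHeadline = true;
        if (qName.equals("PARAGRAPH")) paragraphs++;
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equals("HEADLINE")) inHeadline = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // characters() may fire several times per element, so we append.
        if (inHeadline) headline.append(ch, start, length);
    }

    public String getHeadline()      { return headline.toString(); }
    public int getParagraphCount()   { return paragraphs; }

    public static void main(String[] args) throws Exception {
        String xml = "<HSA DATE=\"01011994\" ID=\"HSA1868\">"
                   + "<HEADLINE>I'M NOTHING LIKE MY SEXY ROLE</HEADLINE>"
                   + "<BODY><PARAGRAPH NRO=\"1\">text</PARAGRAPH>"
                   + "<PARAGRAPH NRO=\"2\">more</PARAGRAPH></BODY></HSA>";
        HsaHandler handler = new HsaHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println(handler.getHeadline() + " / "
                           + handler.getParagraphCount() + " paragraphs");
    }
}
```

A JDOM version would instead build an in-memory tree and navigate it with getChild()/getChildren(), which is more convenient when random access to the document is needed.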

Though XML has many advantages for storing data, it also has some important
disadvantages: XML files require a large amount of storage compared to databases,
and search performance is poor when the source data is large. Because the original PA
archive stored as XML is ~20GB, which is too large to search efficiently, we decided
to reorganize the archive and store it in a relational database.

(2) Relational Database


A relational database is a database system in which the data is organized and accessed
according to the relationships between data items, without any consideration of
physical orientation and relationship (WWW4). It stores all its data inside tables, and
nothing more; all operations on the data are performed on the tables themselves or
produce another table as the result. Relational database systems are widely used in
industry because of their built-in ability to organize data into tables. There are many
well-known relational database management systems today, such as MySQL and
Microsoft SQL Server. All these products offer a number of advanced features that
provide easy access to data and high retrieval performance where search functions are
applied in an application, especially multiple-factor searches.

In order to communicate with relational database systems, Structured Query
Language (SQL) is used as the standard language to perform tasks such as a search or
an update. Generally, the standard SQL commands "Select", "Insert", "Update",
"Delete", "Create" and "Drop" can be used to accomplish almost everything one
needs to do with a relational database, though there may be additional commands that
are proprietary extensions in particular database management systems. When a
database is created, all its tables must be created using SQL syntax, following the
various rules of SQL; at the same time, the relationships between the tables must be
set up and the column types declared. Filling data into the tables can be done either
manually through a command prompt or with SQL embedded in Java or other kinds
of programs. Lastly, data retrieval can likewise be done manually at a command
prompt or through embedded SQL, always using the SQL command 'SELECT' with
the proper syntax. In order to communicate with the database from Java, all that is
needed is the JDBC API (Java Database Connectivity).

The JDBC API is the industry standard for database-independent connectivity between
the Java programming language and a wide range of databases (WWW5). It acts as a
bridge between the application and the database server and provides access to an
SQL-based database.

The JDBC API makes it possible to do three things:

• Establish a connection with a database or access any tabular data source
• Send SQL statements
• Process the results
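The three steps can be sketched in a minimal JDBC fragment. The table and column names and the MySQL connection URL are assumptions for illustration, and the code uses try-with-resources from a modern JDK for brevity:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch of the three JDBC steps: connect, send an SQL statement, process results.
class JdbcSketch {
    // Building the statement separately keeps the SQL visible; the '?'
    // placeholder lets JDBC escape the user's input safely.
    static String buildSearchSql() {
        return "SELECT id, headline FROM documents WHERE headline LIKE ?";
    }

    static void search(String url, String user, String password, String keyword)
            throws SQLException {
        // 1. Establish a connection (url would be e.g. jdbc:mysql://host/pa_archive)
        try (Connection conn = DriverManager.getConnection(url, user, password);
             // 2. Send an SQL statement
             PreparedStatement stmt = conn.prepareStatement(buildSearchSql())) {
            stmt.setString(1, "%" + keyword + "%");
            // 3. Process the results
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + ": " + rs.getString("headline"));
                }
            }
        }
    }
}
```

The same three steps underlie every database operation in the system, whatever the specific query.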

JDBC technology makes application development easy and economical. For instance,
programmers can save a great deal of time and effort because complex data access
tasks are hidden behind the JDBC API. Furthermore, no configuration is needed on
the client side, since the JDBC driver can be defined on the server.


Another thing I had to take care of in this project is the opening and closing of
database connections. One approach is to have each use of the database open and
close a connection explicitly. One problem with this approach is that every class that
talks to the database must have enough information available to establish a connection.
Moreover, opening and closing database connections is just about the most expensive
thing an application can do: for short queries, it can take much longer to open the
connection than to perform the actual retrieval. So another way of handling database
connections is needed.

Connection pooling is a technique that was pioneered by database vendors to allow
multiple clients to share a cached set of connection objects providing access to a
database resource. A connection pool class should be able to perform the following
actions:
• allocate the connections in advance
• manage the available connections
• keep new connections ready to be allocated
• make a client wait for a connection when none is available
• close the connections when they are no longer required
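The actions listed above can be sketched as a minimal generic pool. This is a simplification of what Turbine's pooling facility provides; real pools add connection validation, timeouts and growth policies:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Supplier;

// Minimal connection pool: connections are created in advance, borrowed,
// returned, and callers block until one becomes available.
class SimplePool<C> {
    private final BlockingQueue<C> available;

    public SimplePool(int size, Supplier<C> factory) {
        available = new LinkedBlockingQueue<>();
        for (int i = 0; i < size; i++) {
            available.add(factory.get());   // allocate connections in advance
        }
    }

    // Borrow a connection, waiting if none is currently available.
    public C acquire() throws InterruptedException {
        return available.take();
    }

    // Return a connection to the pool so other clients can reuse it.
    public void release(C connection) {
        available.offer(connection);
    }

    public int idleCount() {
        return available.size();
    }
}
```

A caller would wrap acquire()/release() in try/finally so connections always return to the pool; closing every pooled connection at shutdown completes the last action in the list.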

Normally, I would have to write my own connection pooling code. Luckily, however,
Apache Turbine was found as third-party support, and its database connection pooling
facility can be used in this project.

Apache Turbine is a servlet-based framework that allows experienced Java developers
to quickly build secure web applications (WWW3). Turbine allows us to personalize
web sites and to use user logins to restrict access to parts of the application. It
provides good support for working with JDBC through the Torque object-relational
tool and its connection pooling facilities. Before the Turbine facilities can be used, I
need to set up my application for Turbine and create a property file containing
specific initialization values for the database pool. More details will be introduced
later.

2.3 Web Application Models

Since this project is actually a web-based application, it is important to review some
popular and useful web application models so that an appropriate model can be chosen.

2.3.1 The J2EE Model

The Java 2 Enterprise Edition (J2EE) Model is a three-tiered web application model
comprising a Client Tier, a Middle Tier and an Enterprise Information System (EIS)
Tier. J2EE is a compilation of various Java APIs, including JSP, Java Servlets,
Enterprise JavaBeans (EJB) and JDBC. These APIs represent different component
technologies, which are managed by what the J2EE documentation calls containers. A
web container provides the runtime environment for servlets and JSP components,
translating requests and responses into standard Java objects. Similarly, an EJB
container manages EJB components. One important thing to keep in mind is that an
EJB is quite different from a JavaBean: a JavaBeans component is a regular Java class
that follows a few simple naming conventions and can be used by any other Java
class, whereas an Enterprise JavaBean must be developed in compliance with a whole
set of strict rules and works only in the environment provided by an EJB container.
Components in both types of containers can use other APIs (e.g. JDBC) to access
databases.

Figure 2.2 shows an overview of all the components and their relationship, which was
introduced by Pawlan (2001). The client tier includes browsers as well as some GUI
applications. A browser usually uses HTTP to communicate with the web container.
Client services are supported by the middle tier through the web container and the EJB
container. Many applications can be implemented solely as web container components.
However in some complex applications, the web components just act as an interface to
the application logic implemented by EJB components. Both of these components can
access databases via other J2EE APIs such as JDBC. The Enterprise Information
System (EIS) tier holds the application's business data and usually comprises one or
more relational database management servers.

Figure 2.2 The J2EE Model (Client Tier, Middle Tier, EIS Tier)


2.3.2 The MVC Design Model

The Model-View-Controller (MVC) design model was first introduced by Xerox in a
number of papers published in the late 1980s in conjunction with the Smalltalk
language. Since then it has been widely used in all popular programming languages
because it reduces the effort needed in software development. As its name states, this
design pattern is based on three entities: the Model, the View, and the Controller.
"The Model represents pure business data and the rules for how to use this data; it
knows nothing about how the data is displayed or the user interface for modifying the
data. The View, on the other hand, knows all about the user interface details. It also
knows about the public Model interface for reading its data, so that it can render it
correctly, and it knows about the Controller interface, so it can ask the Controller to
modify the Model" (Bergsten 2002). Before MVC, user interface designs tended to
mix these entities together. MVC decouples them to increase flexibility and reuse,
and enforces the separation of responsibilities into different tiers. Figure 2.3 gives a
brief view of how the different pieces communicate in the MVC model.

Figure 2.3 The MVC Model (the Controller receives requests, triggers state changes
in the Model, and forwards to the View, which reads Model data to generate the
response)

In practice, there are several ways to assign the MVC roles to the different types of
J2EE components, depending on the scenario, the types of clients supported, and
whether or not EJB is used. The next subsection describes possible architecture
scenarios in which JSP plays an important role.

2.4 Possible Architecture Scenarios

In a J2EE application, there are three approaches to assigning the MVC roles to the
different components. Because the architecture is critical to the system and each
approach has its own advantages, it is very important to choose the right scenario
before implementation. The following subsections show the different architectures,
each implemented with different components.


2.4.1 Pure JSP implementing MVC

In this approach, separate JSP pages are used for presentation (the View) and request
processing (the Controller), while all business logic is placed in beans (the Model). As
shown in Figure 2.4, Controller pages initialize the beans, and View pages then
generate the response by reading their properties. A pure JSP approach is suitable for
testing new ideas and prototyping because it is the fastest way to reach a visible result.
However, when evaluating such a prototype, you will find that a pure JSP application
is hard to maintain, since redesign and reuse can become a nightmare.

Figure 2.4 Pure JSP implementing MVC (View: Search.html, ResultList.jsp; Model:
JavaBeans and database; Controller: Find.jsp, Delete.jsp)

2.4.2 JSP & Servlets implementing MVC

Today, the combination of JSP and servlets is a powerful tool for developing
well-structured applications that are easy to maintain and reuse. When the two are
combined, the MVC roles are assigned as shown in Figure 2.5. In this case, all
requests are sent to a servlet acting as the Controller, with a parameter or a part of the
URI path indicating what is to be done. As in the pure JSP scenario, JavaBeans are
used to represent the Model. Depending on the outcome of the processing, the
Controller servlet forwards to an appropriate JSP page to generate the response to the
user (the View). For example, if a request to add a new document is executed
successfully, the servlet picks a JSP page that shows the updated list of documents; if
the request fails, the servlet picks another JSP page that shows why it failed.
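The controller's dispatch decision reduces to a mapping from an action and its outcome to a view page, as in this standalone sketch (the action names and JSP page names are invented for illustration):

```java
import java.util.Map;

// Sketch of a controller servlet's routing logic: the request carries an
// action parameter, and the outcome decides which JSP renders the view.
class ControllerSketch {
    private static final Map<String, String> SUCCESS_VIEWS = Map.of(
        "search", "ResultList.jsp",
        "add",    "DocumentList.jsp");

    // Choose the view page; a real servlet would then forward the request to it
    // via RequestDispatcher rather than returning a name.
    static String chooseView(String action, boolean succeeded) {
        if (!succeeded) {
            return "Error.jsp";   // failure page explains why the request failed
        }
        return SUCCESS_VIEWS.getOrDefault(action, "Home.jsp");
    }
}
```

For example, chooseView("add", true) selects the updated document list, while chooseView("add", false) selects the error page, matching the scenario described above.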


Figure 2.5 JSP and servlets implementing MVC (View: Search.html, ResultList.jsp;
Model: JavaBeans and database; Controller: Java servlets)

2.4.3 JSP, Servlets & EJB implementing MVC

Nowadays, an application based on Enterprise JavaBeans (EJB) is often regarded as a
heavyweight development, because it has the most complex structure; in return it
provides transaction management and a client-type-independent component model.
This structure also enforces the MVC design pattern and makes an application easy to
extend and maintain. Compared with the JSP & Servlets scenario, there are two
additional component types: entity EJBs and session EJBs. An entity EJB represents a
specific piece of business data, such as a customer; each entity EJB has a unique
identity, and all clients that need access to the entity it represents use the same bean
instance. Session EJBs are intended to handle business logic and are used only by the
client who created them.

As seen in Figure 2.6, a user sends a request to a servlet as in the JSP & Servlets
scenario. However, instead of processing the request itself, the servlet asks an EJB
session bean to do so, so the Controller role spans the servlet and the EJB session
bean. The Model role, in turn, is shared by the EJB entity beans and the web-tier
JavaBeans components. As Bergsten (2002) stated in his book, "Typically, JavaBeans
component in the web tier mirror the data maintained by EJB entity beans to avoid
expensive communication between the web tier and the EJB tier. The session EJBs
may update a number of the EJB entity beans as a result of processing the request.
The JavaBeans components in the web tier get notified so they can refresh their state,
and are then used in a JSP page to generate a response."


Figure 2.6 JSP, servlets and EJB implementing MVC (View: Search.html,
ResultList.jsp; Model: JavaBeans, entity EJBs and database; Controller: servlets and
session EJBs)

2.5 Review of the PA Digital Text Library

The PA Group currently runs a digital text library for its journalists to collect stories.
In order to understand the new system's requirements at a high level, it was necessary
to explore the existing system and learn how journalists habitually use it. Three main
interfaces from the PA Digital Text Library are shown below.

As seen in Figure 2.7, there are fields where users enter their queries according to
different conditions. Users can also customize their search date range and sort order.

Figure 2.7


Figure 2.8 shows the matched-results page, which includes a limited number of items
from the total results and a pagination function. Each result item shows the basic
information of a document: the document id, topic, a short paragraph of the content,
the group, the word count and the date. The page also lets users save their preferred
documents to an online portfolio.

Figure 2.8

In Figure 2.9, the complete information of a document is shown with the query terms
highlighted.

Figure 2.9


Apart from this review of the interface and its functions, it is known that the current
PA online Digital Text Library is powered by PHP with an XML data store.

In our experience of using the existing system, its performance is unreliable; for
example, it responds very slowly to client requests when network traffic is heavy.


Chapter 3: Requirements and Analysis


This project aims to design and build a web-based client-server application to
support searching over the PA archive, as indexed and analyzed by the language
analysis tools. The original requirements of this project involve the following
aspects.

(1) Designing and writing web server side "business logic" code to access
databases and assemble content for dynamic page creation
(2) Developing a flexible presentation layer framework for rendering content in the
client browser
(3) Developing an appropriate session management framework to support the
collection of search results during iterative search and the navigation of search history

In order to understand these requirements in more detail, it is necessary to divide
them into functional and non-functional requirements, with some additional
constraints.

3.1 Functional Requirements

3.1.1 Search Function

• The advanced search facility is the heart of this project and offers the main
functions of the system.
• The search functions should be divided into two steps: the first is called
generic search and the second refined search.
• In generic search, the input query can be any word(s), sentences or questions.
Users can specify a date range (e.g. after or before a certain date, or between
two specified days) and sort the results by date or weighting.
• In refined search, users can select their search type in a more detailed way
according to their queries in the generic search. For instance, a list of the
entities and events related to their keywords, or even some similar events,
should be offered for users to consider before making a further decision.

3.1.2 Data Rendering

• The search results returned from the database should be presented in a form
that follows the document structure. The information included in one item
should indicate the document ID, title, headline, byline (if any), word count,
date and the first paragraph of the content if there is more than one.
• Each result list page should present no more than ten result items. If there are
more than ten search results, the system should also offer a pagination
function that allows the user to go to the next page, the previous page or a
specified page. Furthermore, each result list page should indicate the total
number of results, the current page, and the number of items shown on the
current page.
• When users find the document(s) they want, they can open the full document
in a new window by clicking on its title. The full document window shows the
complete document content together with all the other information included in
the document structure.
• Since the system offers a tracking function whereby users can review their
preferred documents in their portfolios, there should be a save button attached
to each item on both the result list page and the full content page.
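The ten-items-per-page rule above reduces to simple pagination arithmetic; the class and method names in this sketch are illustrative, not part of the specification:

```java
// Pagination arithmetic for the result list: at most PAGE_SIZE items per page.
class Pagination {
    static final int PAGE_SIZE = 10;

    // Total number of pages needed for a given result count (ceiling division).
    static int pageCount(int totalResults) {
        return (totalResults + PAGE_SIZE - 1) / PAGE_SIZE;
    }

    // Number of items actually shown on a 1-based page number.
    static int itemsOnPage(int totalResults, int page) {
        int firstIndex = (page - 1) * PAGE_SIZE;
        return Math.max(0, Math.min(PAGE_SIZE, totalResults - firstIndex));
    }
}
```

For example, 25 results yield three pages, with five items on the last page; these values are exactly what each result list page must display alongside the results.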

3.1.3 User Personalization

• Valid users should use a login interface to enter the system with a username
and password.
• The system should offer users a function to modify their password and update
their personal information.
• The system allows users to review their search records and save preferred
results in their portfolio for future use.

3.1.4 System Administration

• According to the client's requirement, the system will only be available to
authorized subscribers, and users cannot register themselves. There should
therefore be an administrator function that allows the system administrator to
allocate an account to a valid user.
• The actions an administrator can take include creating, deleting, enabling and
disabling a user.

3.2 Non-Functional Requirements

System Loading Performance
Since the database in this system is quite large, search results may take up a lot of
memory on the server side, so it is necessary to estimate the system's loading
performance according to the number of simultaneous queries from different users.
The web server used for testing has 6GB of physical memory, of which approximately
4GB is assumed to be always free and available to this system. The number of
potential users should also be estimated: the system will only be used by the
journalists (around 100 people out of 1400) at the PA, and it is assumed that five
people send queries to the server at the same time, which reflects the real situation in
news gathering.


Availability
The system will be running on a web server host 24 hours a day, 7 days a week. This
excludes regular maintenance of the web host and abnormal situations (e.g. a hard
disk failure on the host); such exceptions are outside the scope of the system itself.

Reliability
The system must be reliable in all aspects of its use. The data it uses should be
reliable, because it is stored in the MySQL database and a backup mechanism
prevents any loss of the original data.

Security
The system must be sufficiently secure because of the confidential data in the PA
archive. Any unauthorized entry to or use of the system should be regarded as a threat
to the safety of the database. It is therefore essential to have an authentication
mechanism that allows only authorized users to enter the system and carry out
searches.

Recoverability
The system should be able to recover quickly if the web host experiences a disaster
(e.g. a hardware crash). The database should have a secure backup on the database
server, and the system should be simple to redeploy to the web server.

3.3 Constraints

• Hardware Constraints
The only hardware constraint is that the system must be able to be installed and run
on the machine that will act as the web host. The host machine used for testing is the
DON server in the NLP Group, which has 6GB of memory.

• Software Constraints
The software constraints concern two sides, the server and the user. On the server
side, Java technologies should be employed to support the Tomcat web server and
any other third-party software running on the server. The data should be stored in a
relational database, while some configuration files should be saved as XML files. On
the user side, the browsers in the users' operating systems should support the basic
HTTP protocol.


Chapter 4: Design

4.1 Overview of Technology choices

After a review of all the possible technologies for this project in Chapter 2 and having
in mind the requirements from Chapter 3, a selection of technologies appropriate for
the development of the system was made carefully.

After several discussions among the team members, it was decided that the system
should follow a three-tiered web application structure. The MVC model was chosen
as the chief design pattern because of its clear separation of presentation, business and
data logic. This decision made each member's role clear and the teamwork more
efficient. Figure 4.1 below shows the blueprint of the system, which was produced by
the team. In brief, all actions in the system are controlled by the server-side controller,
which receives a request from a user and dispatches it to the related event handler for
processing.

Figure 4.1 System blueprint. A controller (web servlets) in the business tier receives
requests from the browser client and dispatches them to components for word-based
text search (Lucene), result rendering, query/snap analysis, document management,
event-based search, and user authentication/session management. The database tier
holds the PA archive, the Lucene text index built by word-based text indexing, and
the NL analysis outputs (event structures and summaries).

- 25 -
Chapter 4 Design

For the presentation tier, JSP technology was chosen, as the system needs to provide
dynamic data according to the users' requests. The reason for choosing JSP over other
technologies is that it offers many options for implementing the MVC model, and the
Tomcat web server available in the department has good compatibility with JSP and
other Java technologies. Furthermore, open-source platforms and technologies are
preferred by our client.

JavaBeans and Java servlets are the solutions for the business logic tier. Both are
programmed in the Java language, which is the preferred programming language in
our team because most of us have solid experience with Java. Furthermore, Sun and
its Java technologies are among the pioneers of large database-driven web systems,
and many enterprise solutions introduced in the past have proved their value and high
performance. Moreover, the core search engine (the Lucene API) in this project is
also written in Java. Balancing all these factors, Java technologies were chosen for
the implementation of the system.

The main function of this system is to provide a search service, so it is critical to store
the data in a way that allows searches to be carried out easily and efficiently. Before
the project, the PA archive was stored in the form of XML files. For the data tier, a
relational database was chosen as a new store to replace the old XML files, because
the structure of the old data was found to be unsatisfactory by our requirements
analyst and we decided to reorganize the data in a more logical way; a relational
database also provides high-performance search and robust management. In the
database tier, the technologies employed are those used for operations between an
application and a relational database: Structured Query Language (SQL), the JDBC
API (Java Database Connectivity), and database connection pooling.

4.2 Presentation Tier Design


The presentation tier determines how information is presented to the clients; in effect,
it is the GUI design. The requirements analysis produced a set of specifications for
the front end of the system. JSP is the main technology used to render content in the
client browser, but a small number of HTML pages are also used to initiate some
functions. Because the presentation tier deals only with the display of data, the JSP
pages are mostly used to retrieve data from the servlet and render it on the page,
which reinforces the chosen MVC design pattern. In order to control the development
process, it was necessary to keep in mind what kinds of web pages there are and how
many, and a convenient way to do so is to produce a website flow showing the
functions of each page: User Authentication, Make New Search, Search History, User
Profile, Refined Search, Search Result Listing, and the Full Document View.


In detail, ordinary users first enter the system via the Welcome Page, where they type
their account name and password for authentication. They can then choose the actions
to be taken on the Home Page, such as making a new search or reviewing their search
history. If they start a new search, they can enter either some keywords or a question.
The system will then refine their query on a new page offering more accurate search
functions, such as a list of the entities and events related to the keywords, or even
some similar events. Users indicate their preferences by ticking boxes beside the
items on this page and submitting it. In the next step, a list of matched results appears
on the screen, showing the titles of the relevant stories and their summaries; here
users can customize the view style and save the items they need. Lastly, users can
view the full content of a story by clicking on its title, and from there they can also
save the document to their portfolio. Valid users can review and edit their search
history whenever they like after logging in to the system, in a way very similar to a
shopping cart.

According to the requirements in Chapter 3, users are not allowed to register
themselves, so there is an administration page where an administrator allocates
accounts to valid users. The administrator can create, delete, enable or disable a user.

Figure 4.2 shows the website flow for the entire system, illustrating exactly which
functions are provided on each page of the web application.


Figure 4.2

4.3 Business Tier Design

Based on the three-tiered system architecture, the MVC model was applied to the
business tier. As seen in Figure 4.3, clients send requests via HTTP using HTML
forms, and the Java servlets decide which actions to take. Once the actions are
confirmed, the servlets call the relevant JavaBeans to process and retrieve the
information. The main objects in this project (e.g. documents) are represented by
JavaBeans, which hold the objects' attributes and provide methods to implement the
search tasks. After the data has been retrieved, the beans and servlets co-operate to
pass the information to the JSP pages; in this way, the dynamic content is rendered to
the client's browser.

[Figure 4.3 diagrams the business tier: requests arrive via HTML forms at the
controller (Java Servlets); the servlets forward to the view (JSP pages) and call the
model (JavaBeans); the beans update and retrieve information through the word-based
index search with query/snap analysis (text index), the event-based search (event
structures), and the MySQL database of summaries and documents; responses are
returned to the client via the JSP pages.]

Figure 4.3

The process of the text search is the core of this project, offering the main search
function over the PA archive. As seen in Figure 4.4, the text search is composed of
four stages: query/snap analysis, events/entities search, word-based index search, and
database-based search. When the method in the document bean is called to process a
text search, the query sent from the user side is first passed to the Query Analyzer
class, which filters out the predefined stop words (around 250 in total) that are
regarded as unimportant words in the query. The refined query, which is in effect a set
of keywords, is then used to carry out the index search. As introduced in Chapter 2,
the index files are the fundamental data storage for the Lucene index search, so the
word-based index search is implemented over these index files via the Lucene APIs.
The Lucene functions employed in this project include ranked searching, field
searching, date range searching, sorting by any field, and multiple-index searching
with merged results. Combining these functions, the IDs of the matched documents
can be retrieved from the index files and used for the database search. Lastly, a SQL
query over the document table is executed via JDBC with the IDs in the WHERE
clause. At this point, all the information for the matched documents has been
retrieved and can be saved in a data structure (e.g. a LinkedList) for further use.

For the events-based search, a Question Answering (QA) subsystem is employed to
provide the journalist with short text units that answer their specific, well-formed
natural language questions. When the query analyzer recognizes the query from a user
as a question, it produces an analysis which includes the expected answer type (EAT)
and, depending on the question, a special attribute created to refer to the
attribute-value to be extracted from the answer entity. Each candidate answer gets a
preliminary score based on its semantic proximity to the EAT and the number of
relations the candidate answer has with elements of the question. An overall score is
computed for each entity as a function of its preliminary score plus a similarity value
between the question and the sentence the entity comes from (Gaizauskas et al.,
2005). This score is then used to retrieve data from the database as a reference. The
remaining process is the same as for the text search.
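The scoring just described can be illustrated with a small sketch. All names and
numbers here are hypothetical; the actual scoring function belongs to the QA
subsystem of Gaizauskas et al. (2005) and is not reproduced here:

```java
import java.util.*;

// Hypothetical illustration of the candidate scoring described above:
// overall score = preliminary score + question/sentence similarity.
class AnswerCandidate {
    final String text;
    final double preliminaryScore;    // proximity to EAT plus relation count
    final double sentenceSimilarity;  // similarity of question to source sentence

    AnswerCandidate(String text, double preliminaryScore, double sentenceSimilarity) {
        this.text = text;
        this.preliminaryScore = preliminaryScore;
        this.sentenceSimilarity = sentenceSimilarity;
    }

    double overallScore() {
        return preliminaryScore + sentenceSimilarity;
    }

    // Rank candidates so the highest overall score comes first.
    static List rank(List candidates) {
        List sorted = new ArrayList(candidates);
        Collections.sort(sorted, new Comparator() {
            public int compare(Object a, Object b) {
                double sa = ((AnswerCandidate) a).overallScore();
                double sb = ((AnswerCandidate) b).overallScore();
                return sa < sb ? 1 : (sa > sb ? -1 : 0);
            }
        });
        return sorted;
    }
}
```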
[Figure 4.4 diagrams the text search process: the users' query goes to the Query
Analyzer, which filters the predefined stop words; the query logic form goes to the
Events/Entities Searcher (which scores answer candidates and returns matched
events/entities), while the keywords go to the Index Searcher (word-based index
search via the Lucene API), which returns the matched document IDs; the Database
Searcher (JDBC implementing the SQL script) then loads all the information needed
for the matched documents into a data structure that is held in memory for use.]

Figure 4.4

4.4 Database Tier Design

As the example PA document in Chapter 2 shows, the data in PA's archive was
stored as XML files with special parameters for search purposes. The database tier
design is critical to the system, since the performance of the system is linked to how
fast search results can be produced. Furthermore, it was very necessary to reorganize
the data storage in an efficient way, because some unreasonable rules were found in
the old storage, such as the same ID being usable for different documents, which
would cause ambiguous search results. Moreover, it is convenient and reliable to
manage the data in a relational database because of the powerful functions and
facilities that accompany one. For example, the MySQL database server is a very
popular relational database, which offers many advanced functions for data
management.


During the database design, it is important to note that the type of each attribute
(column) in every table should be considered carefully, because any inappropriate
choice of type would cause either a failure to load the data or wasted storage. The
new database design therefore had to be based on the attributes of the documents in
the old PA archive, and required further careful study by the database designer.

Again as part of the teamwork, the new database design was carried out by Dr. Angelo
Dalli, who produced Figure 4.5 below.

Figure 4.5

Since the data is too large to convert manually, an auto-import facility is needed to
convert the XML files into the MySQL database, where SQL can be used to support
advanced text search, so an XML parser is the basic tool employed for this task. As
the conversion of the data was not my task, how it works is not covered here; my job
focused on how to retrieve the data from the database and render it to the client.


Chapter 5: Implementation and Testing

5.1 Overview

The CubReporter Project is a large and complex project funded by the EPSRC for
three years, and the other researchers have been working on it for more than two
years. Though I tried my best to carry out all the functions during the project period,
my time and effort were obviously limited. Firstly, this is a cooperative team project,
so my progress could not move on until other workmates passed me the necessary
data and methods. Unfortunately, there was a delay, caused by unforeseen factors, in
populating the full database, which meant that I could only obtain test data from the
document table (the green part in Figure 4.3). However, this data should be a
sufficient sample for building the system prototype. In the end I finished most of the
functions according to the system design and proposed a method to improve the
system performance. My contributions to this project cover the following parts of the
system design (Figure 5.1).

[Figure 5.1 outlines the implemented components: a browser/web client talks to the
controller (search servlets), which coordinates the word-based text search (Lucene),
result rendering, query/snap analysis, and user authentication/session management;
word-based text indexing (Lucene) builds the text index from the PA archive.]

Figure 5.1
As seen from the diagram above, the main word-based text search and the user
management have been completed, and the results can be rendered to the client in a
clean interface. An e-commerce-standard user authentication mechanism was
designed to implement the login and session management. The detailed methods are
explained later.

In principle, the text search includes three steps. When a search request is received
from the user via an HTTP form, the first step is query analysis: the query is broken
into individual words and the stop words (e.g. the, a, is, are, what, where) are
discarded, so that only the keywords in a query are passed on to the search. In the
second step, the Lucene index search engine is employed to get the matching
document IDs from the text index archive. Lastly, the document IDs returned by the
index search are used in an SQL query to load the remaining document information
from the document table in our database.

5.2 Implementation

• Database Pooling
One of the important tasks in this project was to find a way to improve the
performance of loading from the database and to make efficient use of the available
resources. After the review of technologies in Chapter 2, database connection pooling
was determined to be the most efficient way to load data from a database server,
because it addresses the most time-consuming problem for a database-driven
application: opening and closing database connections. With database pooling, the
time for opening and closing database connections becomes negligible, according to
a case study by Marty and Brown (2003).

To make use of Turbine as a database pooling facility, the Turbine jar must first be
placed in the application's library directory (e.g. WEB-INF\lib). Then a servlet called
TurbineInit (Appendix 1) is created and added to the web.xml file; its function is to
initialize the DB connection pool. When the web server starts, it reads the settings in
an external file called TurbineResources.properties, which includes all the
information needed to define the classes for the Turbine service and the database
settings (see Appendix 1). Any code that performs a database interaction needs to
import the following two classes from the Turbine package.

import org.apache.turbine.services.db.TurbineDB;
import org.apache.turbine.util.db.pool.DBConnection;

Obtaining a DBConnection as in the following code sets up a connection from the
Turbine database pool, giving us a connection to the database.

DBConnection dbConn = TurbineDB.getConnection();

The next step is to read the prepared statement from an external file, which stores the
SQL queries that need to be executed. Before any statement can be read, the
ResourceBundle class from the java.util package is used to load the external file.
Then a SQL query from the SQLQueries.properties file can be applied to the
PreparedStatement class in the java.sql package. For example, given a SQL query
such as “findDocuments=SELECT * FROM documents WHERE docid = ?”, the
program will pass the conditional parameter to this SQL statement before it is
executed. The following code shows how a SQL query is executed.

// The base name "SQLQueries" resolves to the SQLQueries.properties file
private static ResourceBundle sql_bundle = ResourceBundle.getBundle("SQLQueries");

PreparedStatement pstmt =
        dbConn.prepareStatement(sql_bundle.getString("findDocuments"));
pstmt.setString(1, "HSA1923");        // pass the parameter to the SQL statement
ResultSet rs = pstmt.executeQuery();  // execute the statement and save the result set

• Text Search
First, a Document bean is created to represent the document object. It holds all the
attributes of a document (e.g. document ID, title, headline, date, content), with getters
and setters for each attribute so that the values can be retrieved externally from the
JSP pages. Furthermore, a LinkedList is employed to store the matched search results,
because a linked list is well suited to holding a large, sequentially processed result set
in memory. A text search involves three main steps.

(1) Query analysis
A query sent by a user should be analyzed before it is passed into the search engine.
In the CubReporter project, we have a collection of stop words, stored separately in an
external file. These stop words are used to filter the users' query: any word that
appears in the stop word list is removed from the search query, for example words
like "the, an, a, what, where, they, me". Specifically, the Lucene class StopFilter is
used to remove the stop words from a query. Before the stop words are filtered, a
separate servlet loads the stopwords file from the hard disk and passes it to the
StemAndStopAnalyzer class (Appendix 1).
StemAndStopAnalyzer analyzer=new StemAndStopAnalyzer(stopWords);
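As a purely illustrative sketch of what the stop-word filtering stage does (the real
StemAndStopAnalyzer in Appendix 1 builds on Lucene's StopFilter and also applies
stemming), a plain-Java version might look like this:

```java
import java.util.*;

class StopWordSketch {
    // Illustrative subset of the roughly 250 predefined stop words,
    // which in the real system are loaded from an external file.
    private static final Set STOP_WORDS = new HashSet(Arrays.asList(
            new String[] {"the", "a", "an", "is", "are", "what", "where",
                          "they", "me", "and"}));

    // Return the query keywords with all stop words removed.
    static List extractKeywords(String query) {
        List keywords = new ArrayList();
        StringTokenizer st = new StringTokenizer(query);
        while (st.hasMoreTokens()) {
            String token = st.nextToken().toLowerCase();
            if (!STOP_WORDS.contains(token)) {
                keywords.add(token);
            }
        }
        return keywords;
    }
}
```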

(2) Index-based search
The text index archive must be ready for use before any index-based search is carried
out. The index files are organized according to the Lucene index rules; all the index
files used in this project were provided by Dr. Saggion. The Lucene class
IndexSearcher carries out the index search. Before searching, the QueryParser class
in the Lucene package parses the query with the help of the stop-word analyzer. The
following code shows how it works.
Searcher searcher = new IndexSearcher(indexFile); // searches over a single index
Query query_head    = QueryParser.parse(qstr,  PAConstants.HEADLINE, analyzer);
Query query_topic   = QueryParser.parse(qstr,  PAConstants.TOPIC,    analyzer);
Query query_content = QueryParser.parse(qstr,  PAConstants.CONTENT,  analyzer);
Query query_date    = QueryParser.parse(dates, PAConstants.DATE,     analyzer);

// required/prohibited flags: the date clause is required, the text clauses optional
BooleanClause dClause = new BooleanClause(query_date,    true,  false);
BooleanClause hClause = new BooleanClause(query_head,    false, false);
BooleanClause tClause = new BooleanClause(query_topic,   false, false);
BooleanClause cClause = new BooleanClause(query_content, false, false);

BooleanQuery tQuery = new BooleanQuery(); // headline OR topic OR content
tQuery.add(hClause); tQuery.add(tClause); tQuery.add(cClause);
BooleanQuery bQuery = new BooleanQuery();
bQuery.add(tQuery, true, false); // a text match is required
bQuery.add(dClause);             // the date range is required
Hits hits = searcher.search(bQuery); // ranked list of matching documents
(3) Database-based search
As mentioned before, a LinkedList holds the matched results, so a loop is made over
the Hits collection to retrieve the result items. Within the loop, each item from the
Hits is converted to a Lucene Document object, whose get() method loads the further
information.
DocumentsBean doc = null;
LinkedList records = new LinkedList();
String[] lines = new String[hits.length()];
for (int dn = 0; dn < hits.length(); dn++) {
    Document d = hits.doc(dn);
    lines[dn] = d.get(PAConstants.DOC_ID); // get() already returns a String
    PreparedStatement pstmt =
            dbConn.prepareStatement(sql_bundle.getString("findDocuments"));
    pstmt.setString(1, lines[dn]);
    ResultSet rs = pstmt.executeQuery();
    while (rs.next()) {
        doc = new DocumentsBean();
        doc.setDID(rs.getString("docid"));
        doc.setDCategory(rs.getString("category"));
        doc.setDHeadline(rs.getString("headline"));
        doc.setDByline(rs.getString("byline"));
        doc.setDPcount(rs.getInt("pcount"));
        doc.setDKeywords(rs.getString("keywords"));
        doc.setDTopic(rs.getString("topic"));
        doc.setDContent(rs.getString("content"));
    }
    rs.close();    // release the JDBC resources before the next iteration
    pstmt.close();
    records.add(dn, doc);
}
return records;
At this point, the complete text search is implemented; what remains is to render the
results in the users' browsers via dynamic JSP pages.

• System Performance Improvement

Considering that this system will be used by journalists in the PA Group to collect
information for breaking news, there is a good chance that a number of journalists
will make similar or identical searches within a certain period. For example, while an
Olympic Games is on, they may be more interested in sports stories, and during this
period journalists will search more frequently on sport. This brings the system many
"hot" searches on one topic in a short period. Therefore, a mechanism that manages
these hot search items over a certain period and makes maximal use of the available
resources can improve the system performance considerably. After careful research,
the following method was devised to improve the system performance.
As discussed in the subsection above, JavaBeans represent the business-logic objects
in this project. Any search is made by calling the related Java object, which holds the
basic properties of that object (e.g. document, story) and has a method
findObject(String query) that returns a LinkedList holding all the objects built from
the data found in the DB. This LinkedList is stored in physical memory on the server
side once the first search is activated and is not destroyed until the server goes down.
Since loading data from memory is much faster than loading it from either the
database server or the XML files on the hard disk, response time improves greatly if
potentially matching search results are kept in memory rather than anywhere else.
Based on the previous discussion of the text search, the following method was
introduced to enhance the system performance. Two questions need to be answered:
how often should the memory be checked for expired data, and how long may an
object live in memory? The method is described below by following the running
process of the system:
(1) Tomcat starts and initializes the DB pooling (setting up a virtual connection with
    the MySQL server).
(2) When the first search is made, a starting time is set to control a loop of memory
    checks, i.e. how often the system should check the memory and release expired
    data. For example, suppose the first search is made at 9:00am on July 1st 2005
    and the duration of the checking loop is set to 2 days (depending on the memory
    space). The system will then check the memory for expired data (data stored in
    memory but not used for a certain period, e.g. 30 days) at 9:00am on July 3rd
    2005, and repeat the same loop afterwards.
(3) When a user makes a search, the system first checks the memory to see whether
    this query has been requested before. If an object for that query already exists in
    memory, its timestamp is renewed; this way, the hot queries are always kept in
    memory. If it is not in memory, the system gets a DB connection and creates new
    objects keyed by that query. The new data is saved in memory so that it need not
    be fetched from the DB again if the same query is repeated. In other words, this
    speeds up the search, since responding from the web host's memory is much
    faster than from MySQL.
(4) The objects for that query are created and saved in a LinkedList. Then a new
    object called ObjectTimeStamp is created with two parameters: the LinkedList
    and a long value representing the current time. Lastly, the query and the
    ObjectTimeStamp are saved in a TreeMap, whose get(Object key) method can be
    used to check for a duplicate search.
(5) The other question is how long an object may live in memory. A method called
    timeExpired() checks for expired data: the recorded timestamp in the
    ObjectTimeStamp is subtracted from the current timestamp, and the entry is
    removed from the TreeMap if the difference is greater than the set duration. Data
    that is not frequently requested is thus cleared from memory in time, the physical
    memory of the web host is used efficiently, and the performance of this large
    database-driven system is improved.
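The steps above can be sketched in code. The names ObjectTimeStamp and
timeExpired() follow the description, but everything else here is a simplified
reconstruction under assumptions, not the actual Appendix 1 code:

```java
import java.util.*;

// Simplified reconstruction of the in-memory query cache from steps (3)-(5).
class ObjectTimeStamp {
    final LinkedList results; // the matched objects for one query
    long timeStamp;           // last time this query was requested

    ObjectTimeStamp(LinkedList results, long timeStamp) {
        this.results = results;
        this.timeStamp = timeStamp;
    }
}

class QueryCache {
    private final TreeMap cache = new TreeMap(); // query string -> ObjectTimeStamp
    private final long maxIdleMillis;            // e.g. 30 days in the text

    QueryCache(long maxIdleMillis) {
        this.maxIdleMillis = maxIdleMillis;
    }

    // Step (3): return cached results and renew the timestamp, or null on a miss.
    synchronized LinkedList lookup(String query, long now) {
        ObjectTimeStamp entry = (ObjectTimeStamp) cache.get(query);
        if (entry == null) {
            return null; // caller must query the database and call put()
        }
        entry.timeStamp = now; // hot queries stay in memory
        return entry.results;
    }

    // Step (4): store freshly loaded results under the query key.
    synchronized void put(String query, LinkedList results, long now) {
        cache.put(query, new ObjectTimeStamp(results, now));
    }

    // Step (5): drop every entry that has been idle longer than maxIdleMillis.
    synchronized void timeExpired(long now) {
        Iterator it = cache.values().iterator();
        while (it.hasNext()) {
            ObjectTimeStamp entry = (ObjectTimeStamp) it.next();
            if (now - entry.timeStamp > maxIdleMillis) {
                it.remove();
            }
        }
    }

    synchronized int size() {
        return cache.size();
    }
}
```

In the real system the timestamps would come from System.currentTimeMillis() and
timeExpired() would run on the periodic checking loop of step (2); passing the time in
explicitly here simply keeps the sketch easy to exercise.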

In short, this method makes optimal use of system resources on the server side, and it
offers a new approach to enhancing the performance of a database-driven web
application. The complete code in Appendix 1 shows how this method works.

• Data Rendering
When the user sends a query to the servlet via an HTML form, the servlet handles it
and passes it to the beans; the beans then process the query and perform the search.
After that, all the data retrieved from the database is held in memory, and what
remains is to output it in the web pages dynamically.

As stated in the requirements in Chapter 3, each page displays no more than ten result
items, with a page change function to view more items. Moreover, each page has a
guide showing the number of matched items and pages and the current page number.
All these functions are realized in the JSP pages. Because all the parameters needed
to calculate which items should be displayed on the current page can be retrieved
from the servlets, outputting the data is straightforward. Figure 5.2 shows an example
of the results list interface.

Figure 5.2
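The arithmetic behind the ten-items-per-page listing can be sketched as follows; the
class and method names are illustrative rather than taken from the actual JSP code:

```java
// Illustrative pagination arithmetic for the results list:
// at most 10 items per page, plus totals for the page guide.
class Pagination {
    static final int PAGE_SIZE = 10;

    // Total number of pages needed for the matched items.
    static int totalPages(int totalItems) {
        return (totalItems + PAGE_SIZE - 1) / PAGE_SIZE;
    }

    // Index of the first item shown on a 1-based page number.
    static int firstIndex(int page) {
        return (page - 1) * PAGE_SIZE;
    }

    // Index one past the last item shown on the page.
    static int lastIndex(int page, int totalItems) {
        return Math.min(firstIndex(page) + PAGE_SIZE, totalItems);
    }
}
```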
When the user finds a document he prefers, he can view its complete content by
clicking on its topic, which leads him to another page showing the whole document.
For more details, please see Figures A2.3 to A2.7 in Appendix 2. The figures show
the entire search interface of the text search system except the interface for the
refinement page, which appears in Figure 4.2 but was not completed for the reasons
mentioned at the beginning of this chapter. The user interfaces shown are the main
search page, the search results list page, and the full document page respectively.

• User Authentication

Here, a userBean is created to represent the user object in the project, including all the
attributes and functions related to the behavior of a user. Database connection pooling
is carried out via the Turbine DB facility: information such as the JDBC driver and
the username and password for the MySQL database is stored in
TurbineResources.properties, which is loaded by the TurbineConfig class when the
Tomcat server starts. All the methods in the userBean, such as createUser() and
findUser(), can get a connection to the database by importing
org.apache.turbine.util.db.pool.DBConnection. Using the DB connection pool also
makes it possible to store all the SQL queries in an independent properties file,
separated from the Java code, which facilitates later deployment and modification of
the tables in the database.

As required, the administrator can register a valid user by filling in a username and
password. The registration module satisfies the following features:
(1) A username must match one and only one record in the database.
(2) A password must be more than 6 and less than 20 characters, and is encrypted on
both the client and server sides.
(3) The registration interface prompts the administrator about any mistake made
while filling in the forms, e.g. the password entered twice must be the same and a
new username must differ from existing ones.
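A sketch of the server-side checks behind these rules might look like the following;
the method name, the message texts, and the existingUser flag (standing in for the
database uniqueness check) are all assumptions, not the actual code:

```java
// Illustrative server-side validation for the registration form.
class RegistrationValidator {

    // Returns an error message, or null if the input is acceptable.
    // existingUser stands in for the database uniqueness check.
    static String validate(String username, String password,
                           String repassword, boolean existingUser) {
        if (username == null || username.length() == 0) {
            return "Username is required";
        }
        if (password == null || password.length() == 0) {
            return "Password is required";
        }
        if (password.length() <= 6 || password.length() >= 20) {
            return "Password must be more than 6 and less than 20 characters";
        }
        if (!password.equals(repassword)) {
            return "The two passwords entered are different";
        }
        if (existingUser) {
            return "This username is already in use";
        }
        return null; // registration may proceed
    }
}
```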

Figure 5.3 below shows the interface of the registration model in this project. For the
system's responses to different potential inputs, please see Figures A2.8 to A2.10 in
Appendix 2 for further details.


Figure 5.3

Since the login module is like a gate to this project, it should be not only robust but
also smart and friendly. The finished login module supports the following features:
(1) Any input mistake is identified and shown to the user in a friendly fashion, e.g. if
a user enters an invalid username, the login interface prompts that the username
does not exist. If a user enters a valid username but a wrong password, the
interface tells him/her that an invalid password has been entered.
(2) Once the identity of a user is accepted, the system creates a cookie object for this
user and records it in his/her browser for a period, unless the user logs out
manually. The system checks the user's browser automatically every time the
user logs in, so that the protected pages are not open to anonymous users and the
user's behavior can be recorded and tracked.

Figure 5.4 below shows the page for the login model in this project. For the various
system responses to different potential inputs, please see Figures A2.1 to A2.2 in
Appendix 2 for further details.

In order to track valid users' information after they log in, cookies are used in the user
management. What needs to be done is to add some JSP code to the login page to
send a cookie back to the browser. One thing worth mentioning here is that it is not
enough to store just the username in the cookie, because a potential hacker could gain
unauthorized access to a user's account by faking the cookie. A trick to prevent this is
to store both the username and the password; that way, the hacker cannot create the
cookie by knowing the username alone.
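A sketch of this idea follows. Rather than placing the raw password in the cookie
value, the sketch stores a digest of it, which achieves the same effect (a faked cookie
built from the username alone will not verify); the value layout and the helper names
are assumptions, not the actual login JSP code:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative construction of the authentication cookie value: the username
// together with a digest of the password, so that a cookie faked from a known
// username alone fails verification.
class AuthCookieSketch {

    static String cookieValue(String username, String password) {
        return username + "|" + md5Hex(password);
    }

    static boolean verify(String value, String username, String password) {
        return value.equals(cookieValue(username, password));
    }

    private static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes());
            StringBuffer hex = new StringBuffer();
            for (int i = 0; i < digest.length; i++) {
                int b = digest[i] & 0xff;
                if (b < 16) hex.append('0');
                hex.append(Integer.toHexString(b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e.toString());
        }
    }
}
```

In a servlet, the resulting string would be wrapped in a javax.servlet.http.Cookie and
added to the response; the verification runs when a request carries the cookie back.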

Figure 5.4

5.3 Testing

The testing process was broken down into two parts: functional testing and
performance testing. The functional testing covered the functionality of the system;
its aim was to check whether the system actually works as intended and to find any
errors and bugs. The other area of testing was the performance of the system, used to
demonstrate the improvement delivered by the novel method introduced in this
chapter.

5.3.1 Functional Testing

As introduced in the Functional Requirements in Chapter 3, the system includes the
following functions: Administrator Management (user registration; viewing, enabling
and disabling an existing user), User Authentication, the Search Function
(word-based text search and events-based search), and User Personalization
(updating password and personal information, tracking users' search history and
saved documents). Due to the time constraints and the scale of the project, the
finished functions are User Registration, User Authentication, and part of the Search
Function (the word-based text search). The functional testing was therefore carried
out on these finished functions.

Table 5.1 below shows the test cases for the User Registration Model. It is assumed
that the usernames “James” and “Horacio” do not exist in the database and that the
username “Robot” already exists.
Ref 1
  Input: Username: James; Password: james007; Repassword: james007
  Action Taken: Submit
  Expected Outcome: A message on another page shows the success of the registration.
  Actual Outcome: A new page shows the message that the new user “James” has been
  registered successfully.
  Success: Y

Ref 2
  Input: Username: Robot; Password: robot001; Repassword: robot001
  Action Taken: Submit
  Expected Outcome: A message on the same page shows that the username entered
  already exists in the database.
  Actual Outcome: A message is presented at the top of the username text field to show
  that the name entered is in use. (See Figure A2.8 in Appendix 2)
  Success: Y

Ref 3
  Input: Username: Horacio; Password: saggion007; Repassword: saggion001
  Action Taken: Submit
  Expected Outcome: A message shows that the passwords the user entered twice are
  different.
  Actual Outcome: A message is shown at the top of the password text field to prompt
  that the admin has entered two different passwords. (See Figure A2.9 in Appendix 2)
  Success: Y

Ref 4
  Input: Username, Password and Repassword all left blank
  Action Taken: Submit
  Expected Outcome: A message shows that the username is required.
  Actual Outcome: A message is shown at the top of the username text field to show
  that the username is required. (See Figure A2.10 in Appendix 2)
  Success: Y

Ref 5
  Input: Username: Horacio; Password and Repassword left blank
  Action Taken: Submit
  Expected Outcome: A message shows that the password is required.
  Actual Outcome: As expected. (See Figure A2.9 in Appendix 2)
  Success: Y

Table 5.1


Table 5.2 shows the test cases for the User Authentication Model. It is assumed that
“James” is a valid username with the valid password “james007”, and that the
username “Ben” is invalid.
Ref 1
  Input: Username: James; Password: james007
  Action Taken: Login
  Expected Outcome: The page redirects to the main search page.
  Actual Outcome: As expected.
  Success: Y

Ref 2
  Input: Username: Ben; Password: benben01
  Action Taken: Login
  Expected Outcome: A message shows that the username entered does not exist in the
  system.
  Actual Outcome: A message is presented at the top of the page to show that no such
  user is found. (See Figure A2.1 in Appendix 2)
  Success: Y

Ref 3
  Input: Username: James; Password: james001
  Action Taken: Login
  Expected Outcome: A message shows that the password entered is wrong.
  Actual Outcome: A message is presented at the top of the page to prompt that the
  user has entered a wrong password. (See Figure A2.2 in Appendix 2)
  Success: Y

Ref 4
  Input: Username and Password left blank
  Action Taken: Login
  Expected Outcome: A message shows that the username and password cannot be
  blank.
  Actual Outcome: As expected.
  Success: Y

Table 5.2
Next is the test of the text-based search function. In the search text field, users can
input anything related to their task of collecting background information for breaking
news, such as several separate words or a natural language sentence. Users also have
the option to specify a date range, which ran from 1994 to 1997 at the time the test
was done. The expected outcome of a search is a page showing the entire matched
results list, with a page navigation function at the bottom of each page (see Figure
A2.6 in Appendix 2). When users want to view the complete content of a document,
they can navigate to the full document page by clicking on the topic of that document;
the expected outcome is a page showing all the information that document has in the
database (an example is given in Figure A2.7 in Appendix 2). Finally, users can make
a new search at any time by entering a query at the top of each page, and the results
list should be the same. This additional function makes searching more convenient
from any page.


5.3.2 Performance Testing

In order to test the performance and reliability of the search function, a real case was
carried out by a group of five users from different IP addresses, with a response
timestamp used to record the server's response time; the response time of each search
is shown at the top right corner of each page. The system was deployed to the Don
server in the NLP Group and the test was run from the Lewin Lab in the Department.
(Because of the firewall the Department runs, access to the Don server is denied to
outside visitors, so a simulation from outside clients could not be done.)
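The response timestamp can be implemented by bracketing the search call with
wall-clock readings; this minimal sketch (the names are illustrative, and the real
measurement lives in the servlet with the display in the JSP) shows the idea:

```java
// Minimal sketch of measuring a search's server-side response time.
class ResponseTimer {
    private long start;

    void begin() {
        start = System.currentTimeMillis(); // taken when the request arrives
    }

    // Milliseconds elapsed since begin(); shown at the top right of the page.
    long elapsedMillis() {
        return System.currentTimeMillis() - start;
    }
}
```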

The simulation test covered the following cases:

(1) Five users made different searches at the same time.
(2) Three of them entered three different queries, while the other two entered queries
identical to two of those three; all actions were taken at the same time.
(3) Three of them sent three different queries, and the other two again requested
queries identical to two of those three, but this time the two users acted after the
three.

The test results show that in case 1 the five users got their search results with different
response times, due to the different numbers of matched items. In case 2, the two
users who entered the same queries as the others received their search results with a
lower response time, though it still took some time (e.g. 5 milliseconds). In case 3, the
two users who entered the same queries later got the results list with an effectively
zero response time. This means the same results were loaded directly from memory
rather than from the database, so the performance of the search function is enhanced
for searches that are similar and carried out within a short duration.

The above test aimed to validate the novel method mentioned before. Obviously,
further research and testing are needed to prove it reliable; note that because this
method was designed specifically for this project, it is not guaranteed to work in other
similar projects.


Chapter 6: Results and Discussion

6.1 Findings

With testing complete, the results show that the original aims listed in Chapter 1.2
were achieved successfully, and a working prototype has been produced for the
CubReporter Project. During development, much of the academic knowledge gained from
the Java E-commerce module proved very useful in practice. In particular, both the
three-tiered web architecture and the MVC design pattern were successfully applied
to the system, and the technologies suited to each tier were put to practical use in
implementing the required functions. For instance, database connection pooling was
found to be the best solution for database access, and JSP together with JavaBeans
served well for data rendering. Following a careful study of web application
performance, a novel method was introduced to enhance the system's performance. All
of these findings are valuable experience for my future study.
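The benefit of connection pooling can be shown with a minimal sketch. This is not the Turbine implementation the project uses; all names here are illustrative, and a placeholder Resource type stands in for a JDBC connection.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Minimal illustration of pooling: resources are created once up front and
 * reused, so callers avoid the cost of opening a new database connection on
 * every request.
 */
public class SimplePool {
    static class Resource { final int id; Resource(int id) { this.id = id; } }

    private final BlockingQueue<Resource> idle;

    public SimplePool(int size) {
        idle = new ArrayBlockingQueue<Resource>(size);
        for (int i = 0; i < size; i++) idle.add(new Resource(i)); // pre-create
    }

    /** Borrow a resource, waiting up to timeoutMs (cf. connectionWaitTimeout). */
    public Resource acquire(long timeoutMs) throws InterruptedException {
        Resource r = idle.poll(timeoutMs, TimeUnit.MILLISECONDS);
        if (r == null) throw new IllegalStateException("pool wait timed out");
        return r;
    }

    /** Return the resource so other callers can reuse it. */
    public void release(Resource r) { idle.add(r); }

    public static void main(String[] args) throws Exception {
        SimplePool pool = new SimplePool(2);
        Resource a = pool.acquire(1000);
        Resource b = pool.acquire(1000);
        pool.release(a);                 // return a to the pool
        Resource c = pool.acquire(1000); // no new resource is created...
        System.out.println(a == c);      // ...the released one is reused
        pool.release(b);
        pool.release(c);
    }
}
```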

6.2 Goals Achieved

A number of goals and objectives, both project-based and personal, were achieved by
the end of the project development lifecycle.

After evaluating the prototype, it is clear that a fully working system has been
produced that meets the original objectives laid out for the project (see Chapter
1.2). The system partly demonstrates the feasibility of the research hypotheses
introduced by Gaizauskas et al. (2005). Looking back at the requirements (see Chapter
3), most of the functions have been fulfilled to a reasonable level. The major goal
attained was the successful implementation of the text-based search requirement. Put
simply, this function draws on multiple data sources (both the text index and the
MySQL database) and is implemented with two search mechanisms (the Lucene APIs and
the JDBC APIs). Because the system relies heavily on database retrieval, connection
pooling is used to address the fact that opening a database connection is a
time-consuming process. Furthermore, a novel method was introduced to enhance the
performance of the search function under certain conditions (see Chapter 5). The
method was devised during the development of this project, and initial tests show
that it speeds up the response time for similar searches significantly. Moreover,
robust User Registration and User Authentication models were implemented and fully
tested; both will be suitable for their roles when the system is integrated in the
future. Lastly, the major user interfaces have been completed and are easily
extensible for future work. For example, the page combination technique is used for
each page, so the header and footer of each page can be reused directly in new pages
(e.g. the refined page). This avoids repeated coding and saves development time.
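As an illustration, the page combination approach amounts to something like the following JSP fragment (the file names are hypothetical; the real pages are part of the project source):

```jsp
<%-- newpage.jsp: a new page reusing the shared header and footer --%>
<%@ include file="header.jsp" %>
<%-- page-specific content goes here --%>
<%@ include file="footer.jsp" %>
```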

On a personal level, the major achievement was the knowledge and understanding gained
of the reviewed technologies, especially those chosen for implementation. The
practical experience of developing such a large web-based application has been very
important in testing and extending my academic knowledge of web-related subjects.
Although the technologies employed in this project were used correctly, I believe
that further use of the more advanced features they support would yield better
system performance. Another personal goal achieved during the development process was
a deeper understanding of research in Natural Language Processing, which has given me
an interest in pursuing future research in this area. The working system also gives
our research group a solid basis for future work. Lastly, the experience gained in
teamwork throughout the project, including time management, communication and
cooperation with other team members, was another important personal achievement. All
of this knowledge and experience leaves me with very pleasant memories of my period
of study.

6.3 Further Work Needed

As mentioned before, the CubReporter Project is a large and complex project. Other
researchers have worked on it for more than two years, and my project period covers
the later implementation stage. The project is a team effort, so I could not proceed
until the other members had the necessary data ready for testing. Unfortunately,
there is an outstanding data-population problem for the Events and Objects tables in
the database design, which should soon be solved by our database designer; this is
why the event-based search function could not be implemented by me. Furthermore,
other work left to do includes the Administrator Management functions (viewing and
enabling/disabling an existing user) and User Personalization (updating passwords and
personal information, tracking users' search history and saved documents). These
features have to be implemented after the full search function is completed, but they
should not be difficult because the design is already available.

Apart from the remaining tasks, any bugs found during testing must be fixed in order
to reach enterprise deployment quality. Furthermore, some additional caching methods
are worth applying to the project to obtain higher performance. For example, the
Jakarta Cache Taglib could be used to cache fragments of our JSP pages; the default
cache lifetime and size are then configurable, which would improve page loading
performance on the client side.
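As a rough sketch only (the taglib URI and attribute names should be checked against the Jakarta Taglibs documentation before use), caching a JSP fragment might look like:

```jsp
<%@ taglib uri="http://jakarta.apache.org/taglibs/cache-1.0" prefix="cache" %>
<%-- cache the rendered fragment, e.g. for 10 minutes --%>
<cache:cache time="600">
    <%-- expensive fragment, such as a rendered results list --%>
</cache:cache>
```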

Chapter 7: Conclusion
The aim of the project was to help the other researchers in the NLP Group produce a
prototype for the research hypotheses introduced in the earlier stage of the
CubReporter Project. The major task was to design and write the server-side business
logic code to access a large database and retrieve information for dynamic page
creation.

A set of system requirements was agreed with the requirements analyst, who has been
working on this project for more than two years. The requirements were broken down
into two categories, functional and non-functional. The core functional requirements
were the search function, data rendering, user authentication and administrator
management. Further analysis and investigation of the requirements produced a set of
specifications for the system to be designed. Before the design process, however, a
literature review was undertaken in order to gain appropriate knowledge of the
technologies available for use.

The literature review focused on technologies and approaches for web-based
database-driven systems. A careful review of tiered web architecture was carried out,
followed by the technologies available for each tier. There was also a review of the
MVC design pattern for the system to follow during design, and possible architectural
scenarios were introduced with detailed comparisons. Finally, a review of the current
PA Digital Library website was made to gather valuable information, since the
existing system provides some similar functions that are needed in the new system.

After the literature review, the design of the system was carried out based on the
knowledge obtained about the technologies available for implementation. The design
process was broken down into three parts: the presentation tier, the business tier
and the data storage tier. Each part of the design describes how the technologies
would be used to develop the system. The design pattern chosen was
Model-View-Controller (MVC), because it provides a good separation of the
presentation, the business logic and the data. For the presentation, JSP was chosen
as the main technology for producing the dynamic pages. For the business logic,
JavaBeans and Java Servlets were chosen as the most suitable technologies,
particularly in cooperation with JSP. Finally, for the data tier, both a text index
and a relational database were used to support the complex text-based search
function.

The next stage was the implementation process, in which the technologies chosen
during the design stage were put into practice. The fact that a working prototype was
achieved confirms that the technological choices were sound. During implementation,
particular attention was paid to improving the performance of the system, especially
the response speed of the search function; to this end, a novel method tailored to
this project was introduced.

Finally, functional tests were carried out on the different models developed in this
project, and the results show that they all work correctly. To demonstrate the
improvement in system performance, a real-life test was carried out by a small group
of users; the results show that the method does speed up data retrieval to a certain
degree.

In conclusion, the tests uncovered no major errors and the resulting system was found
to be operational. The project requirements were fulfilled, and the overall project
was therefore implemented successfully, using the technologies correctly and
following the design as intended.

References
Gaizauskas, R., Saggion, H., Barker, E. & Foster, J. (2005). Overview of Recent
Advanced Natural Language Processing (RAMP 2005). To appear.

Cooke, M.P. (2003). E-Commerce lecture notes, Department of Computer Science,
Sheffield University.

Chaffee, A. (2000). "One, two, three, or n tiers?". JavaWorld [Online], January 2000.
Available: http://www.javaworld.com/javaworld/jw-01-2000/jw-01-ssj-tiers_p.html

Pawlan, M. (2001). "Java 2 Enterprise Edition Technology Center". Sun Developer
Network [Online], March 2001.
Available: http://java.sun.com/developer/technicalArticles/J2EE/Intro/

Bergsten, H. (2002). JavaServer Pages (2nd Edition). O'Reilly.

Hunter, J. & Crawford, W. (2001). Java Servlet Programming (2nd Edition). O'Reilly.

McLaughlin, B. (2001). Java & XML (2nd Edition). O'Reilly.

Hall, M. & Brown, L. (2000). Core Servlets and JavaServer Pages. Prentice Hall PTR.

Web References

(WWW1) http://socks5host.pa.press.net:8082/cgi-bin/PAText4

(WWW2) http://lucene.apache.org/java/docs/

(WWW3) http://jakarta.apache.org/turbine/

(WWW4) http://www.orafaq.com/glossary/faqglosr.htm

(WWW5) http://docs.sun.com/app/docs/doc/805-4368/6j450e60j?a=view

Appendix 1 Selected Code:


/**
 * Turbine Database Pooling Initialization
 * @author Haotian Sun
 */
public class TurbineInit extends HttpServlet {
    private static boolean inited = false;

    public TurbineInit() {
        super();
    }

    public void init() {
        if (inited) return;
        System.out.println("Turbine starts!");
        inited = true;
        String prefix = getServletContext().getRealPath("/");
        String dir = getInitParameter("turbine-resource-directory");
        TurbineConfig tc = new TurbineConfig(dir, "TurbineResources.properties");
        tc.init();
    }

    protected void doGet(HttpServletRequest arg0, HttpServletResponse arg1)
            throws ServletException, IOException {
        super.doGet(arg0, arg1);
    }
}

TurbineResources.properties
# -------------------------------------------------------------------
# TURBINE S E R V I C E S
# -------------------------------------------------------------------
services.PoolBrokerService.classname=org.apache.turbine.services.db.TurbinePoolBrokerService
services.MapBrokerService.classname=org.apache.turbine.services.db.TurbineMapBrokerService

# -------------------------------------------------------------------
# DATABASE SETTINGS
# -------------------------------------------------------------------
database.default.driver=com.mysql.jdbc.Driver
database.default.url=jdbc:mysql://localhost:3306/cubreporter
database.default.username=root
database.default.password=root

# The number of database connections to cache per ConnectionPool
# instance (specified per database).
database.default.maxConnections=50

# The amount of time (in milliseconds) that database connections will be
# cached (specified per database).
# Default: one hour = 60 * 60 * 1000
database.default.expiryTime=3600000

# The amount of time (in milliseconds) a connection request will have to wait
# before a time out occurs and an error is thrown.
# Default: ten seconds = 10 * 1000
database.connectionWaitTimeout=10000

# The interval (in milliseconds) between which the PoolBrokerService logs
# the status of its ConnectionPools.
# Default: No logging = 0 = 0 * 1000
database.logInterval=0

# These are the supported JDBC drivers and their associated Turbine
# adaptor. These properties are used by the DBFactory. You can add all
# the drivers you want here.
database.adaptor=DBMM
database.adaptor.DBMM=com.mysql.jdbc.Driver

# Determines if the quantity column of the IDBroker's id_table should
# be increased automatically if requests for ids reaches a high
# volume.
database.idbroker.cleverquantity=true

/**
 * Query Analyzer, which removes the defined stop words from a query.
 */
public class StemAndStopAnalyzer extends Analyzer {
    String[] stopWords = null;

    public StemAndStopAnalyzer(String[] stopWords) {
        this.stopWords = stopWords;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        if (stopWords != null)
            result = new StopFilter(result, stopWords); // remove the stop words from the query
        result = new PorterStemFilter(result); // transform the token stream per the Porter stemming algorithm
        return result;
    }
}

public static LinkedList SearchDocumentsViaDB(String indexFile, String qstr,
        String stopWordsFile, String dates) {
    Date date = new Date();
    long current = date.getTime();
    // Cache hit: return the stored result list instead of querying the database
    if (documents.get(qstr) != null) {
        System.out.println("Not load from DB");
        DocsTimeStamp dt = (DocsTimeStamp) documents.get(qstr);
        dt.time = date.getTime();
        return dt.docs;
    }
    Hits hits = null;
    List stopList = IndexPA.stopWords(stopWordsFile);
    String[] stopWords = (String[]) stopList.toArray(new String[0]);
    Searcher searcher = null;
    try {
        searcher = new IndexSearcher(indexFile);
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
    Query query_head = null;
    Query query_topic = null;
    Query query_content = null;
    Query query_date = null;
    StemAndStopAnalyzer analyser = new StemAndStopAnalyzer(stopWords);
    BooleanQuery bQuery = new BooleanQuery();
    BooleanQuery dQuery = new BooleanQuery();
    BooleanQuery tQuery = new BooleanQuery();
    try {
        query_head = QueryParser.parse(qstr, PAConstants.HEADLINE, analyser);
        query_topic = QueryParser.parse(qstr, PAConstants.TOPIC, analyser);
        query_content = QueryParser.parse(qstr, PAConstants.CONTENT, analyser);
        query_date = QueryParser.parse(dates, PAConstants.DATE, analyser);
        BooleanClause dClause = new BooleanClause(query_date, true, false);
        BooleanClause hClause = new BooleanClause(query_head, false, false);
        BooleanClause tClause = new BooleanClause(query_topic, false, false);
        BooleanClause cClause = new BooleanClause(query_content, false, false);
        tQuery.add(hClause);
        tQuery.add(tClause);
        tQuery.add(cClause);
        bQuery.add(tQuery, true, false);
        bQuery.add(dClause);
        System.out.println("testing " + bQuery.toString());
    } catch (ParseException pe) {
        pe.printStackTrace();
    }
    try {
        hits = searcher.search(bQuery);
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
    DBConnection dbConn = null;
    DocumentsBean doc = null;
    LinkedList records = new LinkedList();
    String[] lines = new String[hits.length()];
    String snippet;
    try {
        dbConn = TurbineDB.getConnection();
        if (dbConn == null) {
            throw new DocumentsException();
        }
        for (int dn = 0; dn < hits.length(); dn++) {
            Document d = hits.doc(dn);
            lines[dn] = d.get(PAConstants.DOC_ID).toString();
            PreparedStatement pstmt =
                dbConn.prepareStatement(sql_bundle.getString("findDocuments"));
            pstmt.setString(1, lines[dn]);
            ResultSet rs = pstmt.executeQuery();
            while (rs.next()) {
                doc = new DocumentsBean();
                doc.setDID(rs.getString("docid"));
                doc.setDCategory(rs.getString("category"));
                doc.setDHeadline(rs.getString("headline"));
                doc.setDByline(rs.getString("byline"));
                doc.setDPcount(rs.getInt("pcount"));
                doc.setDKeywords(rs.getString("keywords"));
                doc.setDTopic(rs.getString("topic"));
                doc.setDPage(rs.getInt("page"));
                doc.setDContent(rs.getString("content"));
                records.add(dn, doc);
            }
        }
        // Cache the result list with a timestamp for later identical queries
        DocsTimeStamp dt = new DocsTimeStamp(records, date.getTime());
        documents.put(qstr, dt);
    } catch (Exception ioe) {
        ioe.printStackTrace();
    } finally {
        try {
            TurbineDB.releaseConnection(dbConn);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    if (current - lastCheckTime > 86400000) { // time to start a new check (1 day)
        timeExpired();
        lastCheckTime = current;
        System.out.println("Renew time");
    }
    return records;
}
public static void timeExpired() {
    LinkedList keys = new LinkedList(documents.keySet());
    Iterator it = keys.iterator();
    while (it.hasNext()) {
        String key = (String) it.next();
        DocsTimeStamp dt = (DocsTimeStamp) documents.get(key);
        long current = new Date().getTime();
        long record = dt.time;
        if (current - record > 259200000) { // time to expire an entry (3 days)
            documents.remove(key);
            System.out.println(key + " has been expired");
        }
    }
}

class DocsTimeStamp {
    public LinkedList docs = new LinkedList();
    public long time;

    public DocsTimeStamp(LinkedList docs, long time) {
        this.docs = docs;
        this.time = time;
    }
}

Appendix 2 Screenshots:

Figure A2.1 Login Page


Figure A2.1 shows the system's response when a user enters an invalid username at
login.

Figure A2.2 Login Page


Figure A2.2 shows the system's response when a valid user enters a wrong password.

Figure A2.3 Main Search Page

Figure A2.4 Results List Page 1


Figure A2.4 shows the list of matched results returned by the system for the query
"Sheffield" on 1st January 1994. The circle in the top right corner shows the first
response time from the server to the client.

Figure A2.5 Results List Page 2


Figure A2.5 is another page of the results list mentioned above. Note that the circle
in the top right corner shows a second response time of zero, because the data is
loaded directly from the physical memory of the remote server.

Figure A2.6 Results List Page 3


Figure A2.6 shows the pagination function used on each results list page.

Figure A2.7 Full Document Page

Figure A2.8 Registration Page 1


Figure A2.8 shows the system's response when an already existing username is
registered.

Figure A2.9 Registration Page 2


Figure A2.9 shows the system's response when the administrator enters two different
passwords, or leaves the password fields empty, for a new user registration.

Figure A2.10 Registration Page 3


Figure A2.10 shows the system's response when nothing is entered for registration.
