By Haotian Sun
Declaration
All sentences or passages quoted in this dissertation from other people’s work have
been specifically acknowledged by clear cross-referencing to author, work and page(s).
Any illustrations which are not the work of the author of this dissertation have been
used with the explicit permission of the originator and are specifically acknowledged. I
understand that failure to do this amounts to plagiarism and will be considered grounds
for failure in this dissertation and the degree examination as a whole.
Signature:
Date:
Abstract
This project concerns the design and construction of a web-based client-server system to support advanced searching over the UK Press Association archive: ~20GB of newswire stories constituting the entire PA E-Archive from 1994 to the present. The new system will support journalists in the task of gathering and writing background for breaking news stories. This report includes a literature review, undertaken to gather the necessary information on web-based database-driven systems and server-side web technologies for implementing the web application, and a requirements analysis carried out to determine the requirements of the system to be produced.
The system integrates conventional search engine technology with novel summarization and natural language analysis tools. The technologies used in its implementation include JSP, JavaBeans and servlets, a MySQL relational database, the Apache Lucene search engine, a database connection pooling technique and the Tomcat web server. In addition, a new method was introduced to enhance the performance of this large database-driven web application.
Acknowledgements
Thanks to Rob Gaizauskas, for his valuable guidance, help and feedback. Thanks to
Horacio Saggion for his cooperation on data and technical support. Thanks to Emma
Barker for her requirements analysis. Thanks to Martin Cooke for his useful
e-commerce reading material. Finally, thanks to my parents, my girlfriend and my
friends for their love and support.
Contents
Chapter 1: Introduction
The scene above illustrates the purpose of the CubReporter project. The project is a web-based, client-server, database-driven system, and its key issue is how to realize a high-performance search function over the PA's existing large database via web pages.
The CubReporter Project is a large and complex project funded by the UK EPSRC for three years, and is a collaboration between researchers in the Departments of Computer Science and Journalism and the Press Association Group. The project has been worked on by other researchers for more than two years, and the content presented here covers the later implementation stage of the project. Since the project was team work, my task was to help the CubReporter researchers in the NLP Group to realize the functions of the system. Table 1.1 shows the role of each member in this project.
Role                    Member
Principal Investigator  Prof. Robert Gaizauskas
Requirements Analyst    Dr. Emma Barker
Technical Support       Dr. Horacio Saggion
Database Designer       Dr. Angelo Dalli
System Developer        Mr. Haotian Sun
Table 1.1
In short, the CubReporter Project is a research project which investigates how language technologies might help journalists to access information in a news archive in the context of a background-writing task, as introduced by Gaizauskas et al. (2005).
Currently, the UK Press Association Group has a system called the PA Digital Text Library (WWW1) to support journalists in the task of gathering and writing background for breaking news stories. The problem with this system is that it suffers from slow page loading and provides limited document-searching functions.
The new system under development should have many advantages over the existing one. In short, it will integrate as many functions as possible in order to make the daily work of the PA Group's journalists more convenient. For example, we may use the existing system to search for certain kinds of stories, or even use a general search engine (e.g. Google) to find information, but these available approaches suffer from either limited functionality or inexact results. What we are doing is helping the PA Group to reorganize their data archive and to develop a new system that is not only practical but also offers powerful, multiple functions.
This project aims to design and build a web-based client-server application to support
searching over the PA archive, as indexed and analyzed by the language analysis tools.
This includes the following aspects:
• Designing and writing web server side "business logic" code to access
databases and assemble content for dynamic page creation
• Developing a flexible presentation layer framework for rendering content in the
client browser
• Developing an appropriate session management framework to support collection
of search results during iterative search and navigation of search history
Firstly, a literature review was undertaken in order to gain the appropriate information and knowledge about the available technologies and processes that could be used to develop the application. Since the application is a web-based system, a survey of system architectures for web applications was carried out, in which tiered architecture is introduced, followed by a more detailed description of each tier and of the technologies available for its development. In addition, there is a review of available web application models and an introduction to the Model-View-Controller (MVC) design pattern. After that, possible architecture scenarios are raised, with a focus on combinations involving JSP technology. Finally there is a review of the PA Electronic Text Library website, where some functions similar to those of the new system are introduced.
the quality of the system developed. Moreover, a testing script was designed to simulate several real cases for evaluating the novel method that enhances the performance of the search function.
In Chapter 6, all the findings and results achieved during the development of this project are discussed from both the team and the individual point of view. Lastly, Chapter 7 gives a summary of all the work completed in this project, together with a review of the project.
Chapter 2 Literature Review
Over the past few years, most enterprises have become used to managing their information via electronic databases. Now, due to the popularity and accessibility of the World Wide Web, people increasingly retrieve information from web-based applications, whether run by companies or by individuals. This trend has resulted in the use of databases for the storage of all types of information. Since databases provide the ability to store information in an organized fashion with easy access and retrieval, more and more web-based systems rely on a database server for the storage of their information. Finally, with the application of distributed-systems theory, the different parts of such a system can run on multiple computers, which provides an easy and convenient way to enhance development and maintainability.
Reviewing current web site systems, most are based on a tiered architecture, especially those used in e-commerce and other large web-based applications. A tiered architecture is also called a distributed architecture, meaning that the system is composed of programs running on multiple hosts, such as a web server and a database server. A tier is often one of those hosts, but it can also be one of several virtual distributed applications running on a single host; so a tier may be a logical as well as a physical layer of a system. In this project, the word 'tier' refers to logical layers. Generally, the tiers represent different kinds of logic, such as the presentation logic, which determines how information is presented to the clients, and the business logic, the collection of objects and methods that differ from business to business. The advantage of such a design is the separation of concerns, since the coding paradigms and required skill sets of the tiers differ. Typically, there are the following kinds of tiered system:
Two-tier systems run on two computers, composed of a client application and a server application. For example, a typical web application developed with CGI runs on a client's web browser and a web server. But, as mentioned before, a tier can be a logical layer of a system, so it is possible to have a two-tier application running on one machine, divided into two layers. There are usually three parts in a two-tier system: a client, a server and a protocol. The protocol, which in most cases is HTTP, is used to send requests and data between the client and the server. The advantage of this architecture is that it separates presentation logic from business logic: the presentation logic is on the client side, while the business logic is on the server side. So the majority of the data processing can be done on the server, and the client just needs to receive and present the data derived from the server side. However, this kind of architecture has little potential for resource sharing, which is a big problem for a large database-driven system.
Three-tier systems have one more layer and usually separate presentation logic, business logic and data logic, the last of which concerns ensuring the persistence, security and integrity of the data. This architecture allows concurrent data access via transactions: it is common for more than one client to retrieve data simultaneously, and transactions avoid data corruption. For instance, a server requests or stores data in the database server after receiving a request from a client, and then returns the information from the database to the client.
N-tier systems are those systems having more than three tiers. This type offers great performance capability, increases the number of services the system can provide and allows greater maintainability. The disadvantage of an n-tier system is that it is inefficient in many cases, expensive and more complex.
In short, the most widely used architecture is the three-tier type, which offers a separation of presentation logic, business logic and data logic. This architecture is argued by Chaffee (2000) to be the best solution for a web-based database-driven system. So it is necessary to discuss next the available technologies to employ.
• Presentation Tier
The presentation tier determines how the web pages interact with the clients. Since most of the content rendered to the end users is generated from the database, this project is a dynamic web system. The currently available and popular technologies for producing dynamic content in a web page are Active Server Pages, Java Server Pages and the Velocity Template Language.
Active Server Pages (ASP) are Microsoft's contribution to the development of dynamic pages. This technology has gone through two generations: classic ASP and the more recent ASP.NET. Classic ASP uses scripting languages on the server side, with built-in support for VBScript and JScript embedded in the web pages. ASP.NET supports any .NET-compliant language, such as C#. Since code in ASP.NET is compiled rather than interpreted, it runs faster than classic ASP. Though both of these technologies are widely used and easy to learn, they are heavily dependent on Microsoft's server and framework. Because this project is being developed in an open source environment, they are not the proper solution.
Java Server Pages (JSP) is another technology, provided by Sun, for developing web pages that include dynamic content. It enables web developers and designers to rapidly develop, and easily maintain, information-rich, dynamic web pages. Unlike a plain HTML page, a JSP page can change its content based on any number of variable items, including the identity of the user, the user's browser type, information provided by the user, and selections made by the user [4]. A JSP file has a ".jsp" extension and contains both standard markup language elements (e.g. HTML tags) and special JSP elements that allow the server to insert dynamic content into the page. For instance, the following shows a piece of JSP code.
<HTML>
<HEAD><TITLE>Hello World!</TITLE></HEAD>
<BODY>
<%
int i=0;
while(i++<10){
%>
<p>Do Loop #
<% out.println(i); %>
</p>
<%
}
%>
</BODY>
</HTML>
The code enclosed in <% %> tags is embedded Java code, which obeys exactly the programming rules of a typical Java class. When a JSP page is requested for the first time, it is compiled into a servlet on the server side; subsequent requests are then handled much faster.
Since JSP is based on Java technology and open to use, it is supported by many open
source web servers, such as Apache Tomcat, and can implement complex web
applications via the collaboration with Java Beans and Java Servlets (discussed later).
Velocity, one of the Apache Jakarta projects, is a Java-based template engine. It permits anyone to use a simple yet powerful template language to reference objects defined in Java code (WWW1). Velocity pages are written in the form of HTML with embedded Velocity code and end with a ".vm" extension. The following shows an example of the Velocity code style.
<HTML>
<BODY>
#set ($foo = "Velocity")
Hello $foo World!
</BODY>
</HTML>
The line beginning with #set is the embedded Velocity code, and the result is a web page that prints "Hello Velocity World!". Generally speaking, references begin with $ and are used to get something, and directives begin with # and are used to do something. Code in this style is called the Velocity Template Language (VTL). Velocity's main attraction is that it fully supports the Model-View-Controller (MVC) pattern (discussed later) for organizing web sites, which is regarded as a clean way for web designers and Java programmers to work in parallel: web page designers can focus solely on creating a nice site, and programmers can focus solely on writing top-notch code. Since Velocity separates Java code from the web pages, it is easier to maintain the web site over its lifespan. So Velocity is a valid alternative to JSP. However, JSP plus JavaBeans can also apply the MVC design pattern to web sites. According to Cooke (2005), the only problem with JSP is that the presentation tier and the business tier may become mixed if a system is large and complex; but this does not mean JSP is unsuitable for presentation. Considering our valuable experience with JSP and Java technologies, and the limited development time, JSP is sufficient to implement the presentation tier in proper cooperation with JavaBeans and servlets. So JSP is the preferred technology for the presentation tier in this project.
• Business Tier
The business tier comprises the collection of objects and methods that differ from business to business; that is, it represents how the business logic of your project is to be performed. In this tier, the available technologies are JavaBeans and Java servlets.
JavaBeans are reusable Java classes whose methods and variables follow specific naming conventions to give them added abilities. They can be embedded directly in a JSP page using the <jsp:useBean> tag. A JavaBean component can perform a well-defined task and make its resulting information available to the JSP page through simple accessor methods. The difference between a JavaBean component and a normal third-party class used by the generated servlet is that the web server can give JavaBeans embedded in a page special treatment. For instance, a server can automatically set a bean's properties using the parameter values in the client's request. In other words, if the request includes a name parameter and the server detects through introspection (a technique whereby the methods and variables of a Java class can be determined programmatically at runtime) that the bean has a name property and a setName(String name) method, the server can automatically call setName() with the value of the name parameter, so there is no need to call getParameter().
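As a concrete illustration, a bean following these conventions might look like the following minimal sketch (the class name UserBean is a hypothetical example, not part of the CubReporter code):

```java
// A minimal JavaBean: a public no-argument constructor plus
// getXxx/setXxx accessor methods define a "name" property that
// a server can discover through introspection.
public class UserBean {
    private String name = "";

    public UserBean() { }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }
}
```

Because setName follows the naming convention, a request parameter called name can be mapped onto this property automatically.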
JavaBeans are embedded in a JSP page using the <jsp:useBean> action, which has the following syntax, where case is sensitive and the quotes are mandatory:
<jsp:useBean id="name" scope="page|request|session|application"
    class="classname" type="typename">
</jsp:useBean>
The attributes of the tag are explained in Table 2.1 as follows:
Attribute  Specification                                            Examples
id         Defines the name of the bean. This is the key under      id="potentialUser"
           which the bean is saved if its scope extends beyond      id="newStory"
           the page. If a bean instance saved under this name
           already exists in the given scope, that instance is
           used within this page; otherwise a new bean is
           created.
scope      Specifies the scope of the bean's visibility. The        scope="session"
           default value is page, which means the variable is       scope="request"
           created essentially as an instance variable. If the
           value is request, the variable is stored as a request
           attribute; if it is session, the bean is stored in
           the user's session; if application, the bean is
           stored in the servlet context.
class      Specifies the class name of the bean, which is used      class="com.cubreporter.user.userBean"
           when initially constructing the bean.
Though there are four ways to use this feature, the common usages, expressed with the standard <jsp:setProperty> and <jsp:getProperty> actions, are the following.

<jsp:setProperty name="beanName" property="*"/>

sets, on the given bean, every property for which there is a request parameter with the same name and type. For example, if a bean representing the user has a setUserName(String userName) method and the request has a parameter userName with a value of Rob, the server will automatically call userBean.setUserName("Rob") at the beginning of the request handling. A parameter is ignored if its value cannot be converted to the proper type.

<jsp:setProperty name="beanName" property="propertyName"/>

sets the given property if there is a request parameter with the same name and type.

<jsp:getProperty name="beanName" property="propertyName"/>

includes at this location the value of the given property of the given bean.
The Java Servlet API allows a software developer to add dynamic content to a web server using the Java platform. It has the ability to maintain state across many server transactions, which is done using HTTP cookies, session variables or URL rewriting. A Java servlet is a generic server extension: a Java class that can be loaded dynamically to expand the functionality of a server. Since servlets are written in the highly portable Java language and follow a standard framework, they provide a means of creating sophisticated server extensions in a server- and operating-system-independent way.
The Java Servlet API makes use of the Java standard extension classes in two packages, javax.servlet and javax.servlet.http. The javax.servlet package contains classes and interfaces that support generic, protocol-independent servlets. These classes are extended by the classes in the javax.servlet.http package to add HTTP-specific functionality.
The remaining classes in these two packages are largely support classes. For example, the HttpServletRequest and HttpServletResponse classes in the javax.servlet.http package provide access to HTTP requests and responses. The javax.servlet.http package also contains an HttpSession class that provides built-in session-tracking functionality and a Cookie class that allows the developer to quickly set up and process HTTP cookies. So the session management for the user in this project could also be implemented by a servlet.
Apart from the technologies mentioned above, the search engine is the most important part of the business tier. Apache Lucene is the preferred search engine for this project.
The fundamental concepts in Lucene are index, document, field and term. An index contains a sequence of documents; a document is a sequence of fields; a field is a named sequence of terms; a term is a string. The index files are built from the enterprise's own data and generated by the enterprise itself, and can then be used for efficient full-text search.
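To illustrate how these concepts relate, the term-to-document mapping at the heart of any such index can be sketched as a toy inverted index in plain Java. This is not Lucene's actual API or implementation; the class and method names below are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration of the index/document/field/term hierarchy
// (not Lucene's real implementation): the index maps each term
// to the set of ids of the documents whose fields contain it.
public class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextId = 0;

    // A "document" is modelled as a map from field names
    // (e.g. "headline", "body") to the field's text.
    public int add(Map<String, String> document) {
        int id = nextId++;
        for (String fieldText : document.values()) {
            for (String term : fieldText.toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new HashSet<>()).add(id);
                }
            }
        }
        return id;
    }

    // Return the ids of all documents containing the given term.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(),
                Collections.emptySet());
    }
}
```

A real Lucene index additionally stores term positions and frequencies so that results can be ranked, but the term-to-document mapping above is the core idea behind efficient full-text search.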
• Data Tier
Today, the most popular and efficient ways to store data in a web application are to use a database or XML. The original PA data archive contains more than 8.5 million stories totaling 20GB of data. The raw corpus has been processed and encoded in XML following a strict Document Type Definition (DTD) specification, which includes elements such as story date, category and topic, and structural information such as headlines, bylines and paragraphs. One example is shown in Figure 2.1. Obviously, in order to access the XML-formatted data, it is necessary to review the available XML parsers. Since later work will transfer the XML data to a relational database, the available technologies for operating on databases are also introduced.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE HSA SYSTEM "../../../../dtds/HSA.DTD">
<HSA DATE="01011994" DAY="01" YEAR="1994" MONTH="01" ID="HSA1868" PRIORITY="4"
CATEGORY="HHH" COUNT="222" MSGINFO="PA" TOPIC="1 SHOWBIZ Ward" TIMEDATE="010442 JAN
94">
<HEADLINE>I'M NOTHING LIKE MY SEXY ROLE, SAYS SOPHIE</HEADLINE>
<BODY>
<BYLINE>By Rob Scully, Press Association Showbusiness Correspondent</BYLINE>
<PARAGRAPH NRO="1">Actress Sophie Ward, who plays a sex siren for her latest TV role, insists she is nothing
like that in real life. </PARAGRAPH>
<PARAGRAPH NRO="2">Sophie, 28, plays Eden in a new Ruth Rendell chiller, A Dark Adapted Eye, starting
tomorrow night on BBC1.</PARAGRAPH>
<PARAGRAPH NRO="3">``It was fun to play a glamorous part because I am not like that and it is quite hard work ...
it sometimes took three hours just to get the hair right,'' she said.</PARAGRAPH>
<PARAGRAPH NRO="4">end mf </PARAGRAPH>
</BODY>
</HSA>
XML is the abbreviation of Extensible Markup Language, which is simply text and therefore portable data. The data in an XML file is stored in custom tags within the markup, similar to HTML. XML has the advantage of storing complicated information unambiguously, and it is easily readable by both humans and machines. XML is fully supported by Java, so there are many APIs that can be employed to parse an XML file, such as SAX and JDOM.
As seen in Figure 2.1, the first line indicates that this document is marked up in XML. The second line specifies the validation DTD and the root element. The document proper starts from the third line, where the single root element HSA carries several attributes, such as the date of the story and the category code "HHH", which stands for UK Home News only. Next comes the story's headline, which is at the same level as the body. The body has two kinds of children: the byline and the paragraphs. Each paragraph has its own number attribute, which could be used for searching.
In order to retrieve certain information from an XML file, an XML parser is needed, which handles the important task of taking a raw XML document as input and making sense of it. Since there are no hard and fast rules for selecting an XML parser, it is not an easy task to choose the right one for your application. According to Brett McLaughlin (2001), two main criteria are typically used. The first is the speed of the parser: as XML documents are used more frequently and their complexity grows, the speed of an XML parser becomes extremely important to the overall performance of an application. The second is conformity to the XML specification: you have to be careful when selecting a parser, since some XML parsers fail to conform to the finer points of the XML specification in order to squeeze out additional speed. Balancing these factors, JDOM is an ideal choice according to Cooke (2005), because it combines the high performance of SAX (e.g. rapid parsing and output) with the document model of DOM (e.g. random access), without DOM's memory problems.
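For illustration, extracting data from a story like the one in Figure 2.1 can be sketched with the JDK's built-in DOM parser (JDOM usage is analogous but more convenient). The class and method names here are hypothetical, and the DOCTYPE line is assumed absent so the sketch needs no external DTD file:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class StoryParser {
    // Parse an HSA story document and return "DATE: HEADLINE".
    public static String summarize(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.ISO_8859_1)));
        // Read an attribute of the root HSA element and the HEADLINE text.
        String date = doc.getDocumentElement().getAttribute("DATE");
        String headline = doc.getElementsByTagName("HEADLINE")
                .item(0).getTextContent();
        return date + ": " + headline;
    }
}
```

In the real system a parser would walk all paragraphs and attributes, but the pattern of loading the tree and then querying elements by name is the same.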
Though XML has many advantages for storing data, it also has some very important disadvantages. For example, XML files require a large amount of storage compared to databases, and searching performs badly when the source data is large. Because the original PA archive, saved in XML form, is ~20GB, which is too large and hurts search performance, we decided to reorganize the archive and store it in a relational database.
The JDBC API is the industry standard for database-independent connectivity between the Java programming language and a wide range of databases (WWW5). It acts as a bridge between the application and the database server and provides access to an SQL-based database.
With JDBC technology, application development becomes easy and economical. For instance, programmers can save a lot of time and effort because they do not need to care about the complex data-access tasks, which are hidden inside the JDBC API. Furthermore, there is no need for configuration by the client, since we can define the JDBC driver ourselves.
Another thing I had to take care of in this project is the opening and closing of database connections. One way to handle this is to have each use of the database open and close a connection explicitly. One problem with this approach is that it requires every class that talks to the database to have enough information available to establish a connection. Moreover, opening and closing database connections is just about the most expensive thing an application can do: for short queries, it can take much longer to open the connection than to perform the actual database retrieval. So it is necessary to handle database connections in another way.
Normally, I would have to write my own connection pooling servlet. Luckily, however, Apache Turbine was found as third-party support, and its database connection pooling facility can be used in this project.
Apache Turbine is a servlet-based framework that allows experienced Java developers to quickly build secure web applications (WWW3). Turbine allows us to personalize web sites and to use user logins to restrict access to parts of the application. Turbine provides good support for working with JDBC through the Torque object-relational tool and its connection pooling facilities. Before the Turbine facilities can be used, I need to set up my application for Turbine and create a property file containing specific initialization values for the database pool. More details will be introduced later.
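The idea behind connection pooling can be sketched generically as follows. This is not Turbine's implementation; the class SimplePool and its methods are hypothetical. Connections are created once up front, then borrowed and returned, instead of being opened and closed for every query.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Generic object pool illustrating the connection-pooling idea.
// C would be java.sql.Connection in a real database pool.
public class SimplePool<C> {
    private final BlockingQueue<C> idle;

    // Create all pooled objects once, at start-up.
    public SimplePool(int size, Supplier<C> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Borrow an object, blocking until one is free.
    public C borrow() throws InterruptedException {
        return idle.take();
    }

    // Return a borrowed object to the pool for reuse.
    public void release(C conn) {
        idle.offer(conn);
    }
}
```

A query then borrows a connection, uses it, and releases it in a finally block, so the expensive open and close happens once per pooled connection rather than once per query.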
The Java 2 Enterprise Edition (J2EE) model is actually a three-tiered web application model, comprising a Client Tier, a Middle Tier and an Enterprise Information System (EIS) Tier. J2EE is a collection of various Java APIs, including JSP, Java servlets, Enterprise JavaBeans (EJB) and JDBC. These APIs stand for different component technologies and are managed by what the J2EE documentation calls containers. A web container provides the runtime environment for servlet and JSP components, translating requests and responses into standard Java objects. Similarly, an EJB container manages EJB components. One important thing we should keep in mind is that an EJB is quite different from a JavaBean. On the one hand, a JavaBean component is a regular Java class, which follows a few simple naming conventions and can be used by any other Java class. On the other hand, an Enterprise JavaBean must be developed in compliance with a whole set of strict rules and works only in the environment provided by an EJB container. Furthermore, components in both types of container can use other APIs (e.g. JDBC) to access databases.
Figure 2.2 shows an overview of all the components and their relationships, as introduced by Pawlan (2001). The client tier includes browsers as well as some GUI applications. A browser usually uses HTTP to communicate with the web container. Client services are supported by the middle tier through the web container and the EJB container. Many applications can be implemented solely as web container components; in some complex applications, however, the web components act only as an interface to the application logic implemented by EJB components. Both kinds of component can access databases via other J2EE APIs such as JDBC. The Enterprise Information System (EIS) tier holds the application's business data and usually comprises one or more relational database management servers.
[Figure: the MVC pattern — a request reaches the Controller, which triggers state-change events in the Model (backed by the database); the Controller forwards to the View, which reads Model data and generates the response.]
In practice, there are several ways to assign the MVC roles to the different types of J2EE components, depending on the scenario, the types of clients supported, and whether or not EJB is used. The next subsection describes possible architecture scenarios in which JSP plays an important role.
In a J2EE application, there are three approaches to assigning the MVC roles to the different components. Because the architecture is critical to the system and each approach has its own advantages, it is very important to choose the right scenario before implementation. The following shows the different architectures in a J2EE application as implemented by different components.
In this approach, separate JSP pages are used for presentation (the View) and request processing (the Controller), and all business logic is placed in beans (the Model). As shown in Figure 2.4, Controller pages initialize the beans, and View pages then generate the response by reading the beans' properties. A pure JSP approach is a suitable model for testing new ideas and prototyping, because it is the fastest way to reach a visible result. However, when you go beyond the prototype, you will find that a pure JSP application is hard to maintain, since it can be a nightmare to redesign and reuse.
[Figure 2.4: pure-JSP MVC — Controller pages (Find.jsp, Delete.jsp) initialize JavaBeans (the Model, backed by the database) and redirect to View pages (Search.html, ResultList.jsp).]
Today, the combination of JSP and servlets is a powerful way to develop well-structured applications that are easy to maintain and reuse. When the two are combined, the MVC roles are assigned as shown in Figure 2.5. In this case, all requests are sent to a servlet acting as the Controller, with a parameter or a part of the URI path indicating what is to be done. As in the pure JSP scenario, JavaBeans are used to represent the Model. Depending on the outcome of the processing, the Controller servlet picks an appropriate JSP page to generate the response to the user (the View). For example, if a request to add a new document is executed successfully, the servlet picks a JSP page that shows the updated list of documents; if the request fails, the servlet picks another JSP page which shows why it failed.
[Figure 2.5: JSP + servlet MVC — a Java servlet (the Controller) dispatches to View pages (Search.html, ResultList.jsp); JavaBeans (the Model) access the database.]
As seen in Figure 2.6, a user sends a request to a servlet, as in the JSP-and-servlet scenario. However, the servlet asks an EJB session bean to process the request instead of doing so itself, so the Controller role spans the servlet and the EJB session bean. Likewise, the Model role is shared by the EJB entity beans and the web-tier JavaBeans components. As Bergsten (2002) stated in his book: "Typically, JavaBeans components in the web tier mirror the data maintained by EJB entity beans to avoid expensive communication between the web tier and the EJB tier. The session EJBs may update a number of the EJB entity beans as a result of processing the request. The JavaBeans components in the web tier get notified so they can refresh their state, and are then used in a JSP page to generate a response."
[Figure 2.6: MVC with JSP, servlets and EJBs. View: Search.html, ResultList.jsp; Model: JavaBeans, database, entity EJBs; Controller: servlets, session EJBs.]
The PA Group currently runs a digital text library in which its journalists collect
stories. To understand the new system's requirements in depth, it was necessary to
experiment with the existing system and learn how journalists habitually use it.
Three main interfaces from the PA Digital Text Library are shown below.
As seen in Figure 2.7, there are fields in which users enter their queries under
different conditions. Users can also customize the date range of their search and the
sort order.
Figure 2.7
- 19 -
Chapter 2 Literature Review
Figure 2.8 shows the matched-results page, which includes a fixed number of items
from the total results and a pagination function. Each result item shows the basic
information of a document: the document ID, topic, a short paragraph of the
content, the group, the word count and the date. The page also lets users save their
preferred documents in their online portfolio.
Figure 2.8
In Figure 2.9, the complete information of a document is shown with the query terms
highlighted.
Figure 2.9
Apart from this review of the interface and its functions, it is known that the current PA
online Digital Text Library is powered by PHP with an XML database.
In our experience of using the existing system, its performance is unreliable; for
example, it responds very slowly to client requests when network traffic is heavy.
Chapter 3 Requirements and Analysis
(1) Designing and writing web server side "business logic" code to access
databases and assemble content for dynamic page creation
(2) Developing a flexible presentation layer framework for rendering content in the
client browser
(3) Developing an appropriate session management framework to support collection
of search results during iterative search and navigation of search history
• The advanced search engine facility is the heart of this project and offers the
main functions of the system.
• The search function should be divided into two steps: the first is called
generic search and the second refined search.
• In generic search, the input query may be any word(s), sentence or question.
Users can specify a date range (e.g. after or before a certain date, or
between two specified days), and sort the results by date or weighting.
• In refined search, users can narrow their search type in more detail
according to their queries in the generic search. For instance, a list of the
entities and events related to their keywords, or even some similar events,
should be offered for users to consider before making a further decision.
• The search results returned from the database should be presented in a form
that follows the document structure. Each item should indicate the document
ID, title, headline, byline (if any), word count, date, and the first paragraph
of the content if there is more than one.
• Each results list page should present no more than ten result items. If there
are more than ten results, the system should also offer a pagination
function that allows the user to go to the next page, the previous page or a
specified page. Furthermore, each results list page should also indicate the
following information: the number of results, the current page, and the
number of items shown on the current page.
• When users find the document(s) they want, they can open the full document in
a new window by clicking on its title. The full-document window will show
the complete document content along with all the other information in the
document structure.
• Since the system offers a tracking function that lets users review their
preferred documents in their portfolios, there should be a save button attached
to each item on both the results list page and the full-content page.
• Valid users should enter the system through a login interface with a username
and password.
• The system should let users change their password and update their personal
information.
• The system allows users to review their search record and save preferred
results in their portfolio for future use.
• According to the client's requirements, the system will only be available to
authorized subscribers, and users cannot register themselves. There should
therefore be an administrator function that allows the system administrator
to allocate an account to a valid user.
• The actions an administrator can take include creating, deleting, enabling or
disabling a user.
Availability
The system will run on a web server host 24 hours a day, 7 days a week, excluding
regular maintenance of the web host and abnormal situations (e.g. a disaster affecting
the host's hard disk). These exceptions are outside the scope of the system itself.
Reliability
The system must be reliable in all aspects of its use. The data it uses should be
reliable because it is stored in the MySQL database, and a backup mechanism
prevents any loss of the original data.
Security
The system must be secure because of the confidential data in the PA archive. Any
unauthorized entry to, or use of, the system should be regarded as a threat to the
database's safety. It is therefore essential for the system to have an authentication
mechanism, which will allow only authorized users to enter the system and carry out
searches.
Recoverability
The system should be guaranteed to recover quickly once the web host experiences a
disaster (e.g. a hardware crash). The database should have a secure backup on the
database server, and there should be a simple method to redeploy the system to the
web server.
3.3 Constraints
• Hardware Constraints
The only hardware constraint is that the system must be able to be installed and run
on the machine that will act as the web host. The host machine used for testing is
the DON server in the NLP Group, which has 6GB of memory.
• Software Constraints
The software constraints can be discussed from two sides: the server side and the
user side. On the server side, Java technologies should be employed to support the
Tomcat web server and any other third-party software running on the server. The data
should be stored in a relational database, while some configuration should be saved
as XML files. On the user side, the browsers in users' operating systems should
support the basic HTTP protocol.
Chapter 4: Design
After reviewing all the candidate technologies for this project in Chapter 2, and with
the requirements from Chapter 3 in mind, the technologies appropriate for developing
the system were selected carefully.
After several discussions among the team members, it was decided that the system
should follow a 3-tiered web application structure. The MVC model was chosen as the
chief design pattern because of its clear separation of presentation, business and
data logic. This decision made each member's role clear and the teamwork more
efficient. Figure 4.1 below shows the blueprint of the system, which the team
produced together. In brief, all actions in this system are controlled by the server-side
controller, which receives a request from a user and dispatches it to the related
event handler for processing.
[Figure 4.1: blueprint of the system. Presentation tier: browser/client. Business tier: controller (web servlets), search and search-history logic. Database tier: user IDs and search history, word-based text index (Lucene), text indexing, the PA archive, NL analysis, and event structures/summaries.]
For the presentation tier, JSP was chosen because the system needs to provide
dynamic data in response to users' requests. The reasons for choosing JSP over
other technologies are that it offers many ways to implement the MVC model and that
the Tomcat web server available in the department is highly compatible with JSP and
the other Java technologies. Furthermore, an open-source platform and technologies
are preferable to our client.
JavaBeans and Java Servlets are the solutions for the business logic tier. Both are
programmed in Java, the preferred language in our team because most of us have
substantial experience with it. Furthermore, Sun and its Java technologies are
pioneers of large database-driven web systems, and many enterprise solutions
introduced in the past have proved their value and high performance. Moreover, the
core search engine in this project (the Lucene APIs) is also written in Java.
Balancing all these factors, Java technologies were chosen for the implementation
of the system.
The main function of this system is to provide a search service, so it is critical to
have sensible data storage so that searches can be carried out easily and
efficiently. Before the project, the PA archive was stored as XML files.
For the data tier, a relational database was chosen as new storage to replace
the old XML files. The reason for the new storage is that our requirements analyst
found the structure of the old data unreasonable, and we decided
to reorganize it in a more logical way. Since a relational database provides
high-performance search and robust management, it was decided to design a new
relational database for the PA archive. The technologies employed in the database
tier are therefore those that implement the interaction between an application and a
relational database: the Structured Query Language (SQL), the Java Database
Connectivity (JDBC) API, and database connection pooling.
In detail, ordinary users first enter the system via the Welcome Page, where they type
their account name and password for authentication. They can then choose the
actions to take on the Home Page, such as making a new search or reviewing their
search history. For a new search, they can enter either some keywords or a question.
The system will then refine their query on a new page where more precise search
functions are provided, such as a list of entities and events related to the keywords,
or even some similar events. Users choose their preferences by ticking boxes beside
the items on this page and submitting it. In the next step, a list of matched results
appears on the screen, showing the titles of the relevant stories and their summaries.
Here users can customize the view style and save the items they need. Lastly, users
can view the full content of a story by clicking on its title, from where they can also
save the document to their portfolio. Valid users can review and edit their search
history whenever they like after logging in to the system, in a way very similar to
a shopping cart.
According to the requirements in Chapter 3, users are not allowed to register
themselves, so there is an administration page where an administrator allocates
accounts to valid users. The administrator can create, delete, enable or disable a
user.
Figure 4.2 on the next page shows the website flow for the entire system,
illustrating exactly which functions are provided on each page of the web application.
Figure 4.2
Based on the three-tiered system architecture, the MVC model was applied to the
business tier. As seen in Figure 4.3, clients send requests via HTTP using
HTML forms, and the Java Servlets decide which actions to take. Once the actions
are confirmed, the Servlets call the relevant JavaBeans to process the request and
retrieve the information. The main objects in this project (e.g. documents) are
represented by JavaBeans, which hold the objects' attributes and have methods that
implement the search tasks. After the data is retrieved, beans and servlets
co-operate to pass the information to the JSP pages. In this way, the dynamic content is
rendered to the client’s browser.
[Figure 4.3: MVC in the business tier. The client exchanges requests and responses with the JSP pages (View); query/snap analysis and the word-based index search over the text index retrieve the requested data.]
The text search process is the core of this project, providing the main search
function over the PA archive. As seen in Figure 4.4 on the next page, the text search
process is composed of four stages: query/snap analysis, events/entities search,
word-based index search and database search.
When the method in the document bean is called to process a text search, the query
sent from the user side is first passed to the Query Analyzer class, which filters out
the predefined stop words, around 250 in total, regarded as unimportant words in the
query. The refined query, which is in effect a set of keywords, is then used to carry
out the index search. As introduced in Chapter 2, index files are the fundamental data
storage for Lucene index search, so the word-based index search is carried out over
these index files via the Lucene APIs. The Lucene functions employed in this project
include ranked searching, field searching, date range searching, sorting by any field,
and multiple-index searching with merged results. Combining these functions, the
matched documents' IDs can be retrieved from the index files and used for the
database search. Lastly, a SQL query over the document table is executed through
JDBC with the IDs as the WHERE clause. At this point, all the information for the
matched documents has been fetched and can be saved in a data structure (e.g. a
LinkedList) for further use.
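The last step, turning the document IDs returned by the index search into a SQL query, can be sketched as follows. The table and column names mirror the example query given later in Chapter 5 but are otherwise illustrative, and in practice the statement would be executed through JDBC.

```java
import java.util.List;

// Sketch: building the follow-up SQL query from the document IDs
// returned by the Lucene index search. Table/column names are
// illustrative, not the project's actual schema.
public class DocQueryBuilder {

    public static String buildSql(List<String> docIds) {
        StringBuilder sb = new StringBuilder(
            "SELECT * FROM documents WHERE docid IN (");
        for (int i = 0; i < docIds.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append('\'').append(docIds.get(i)).append('\'');
        }
        return sb.append(')').toString();
    }
}
```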
natural language questions. When the query analyzer recognizes a user's query as a
question, it produces an analysis which includes the expected answer type (EAT) and,
depending on the question, a special attribute created to refer to the attribute-value to
be extracted from the answer entity. Each candidate answer gets a preliminary score
based on its semantic proximity to the EAT and the number of relations the candidate
answer has with elements of the question. An overall score is computed for each entity
as a function of its preliminary score plus a similarity value between the question and
the sentence the entity comes from (Gaizauskas et al., 2005). The score is then used
to retrieve data from the database as a reference. The subsequent process is the same
as for the text search.
[Figure 4.4: the text search process. The Query Analyzer filters the predefined stop words from the query. A logic form is passed to the Events/Entities Searcher, which scores answer candidates and returns matched events/entities; the users' keywords are passed to the Index Searcher (word-based index search via the Lucene API). The Database Searcher (JDBC API executing the SQL script) then loads all the information for the matched documents into a data structure (an ArrayList holding the matched document objects) kept in memory for use.]
As the example PA document in Chapter 2 showed, the data in the PA archive was
stored as XML files with special parameters for search purposes. The database tier
design is critical to the system, since the system's performance depends on how fast
search results can be produced. Furthermore, it was necessary to reorganize the data
storage efficiently, because the old storage had some unreasonable rules, such as
allowing the same ID to be used for different documents, which would cause
ambiguous search results. Moreover, it is convenient and reliable to manage the data
in a relational database because of the powerful functions and facilities that
accompany one. For example, the MySQL database server is a very popular relational
database offering many advanced data-management functions.
During database design, it is important to consider the type of each attribute
(column) in every table carefully, because an inappropriate type would cause either a
failure to load the data or a waste of storage. The new database design should
therefore be based on the attributes of the documents in the old PA archive, and a
further careful study was needed by the database designer.
Again as part of the teamwork, the new database design was carried out by Dr. Angelo
Dalli, who produced Figure 4.5 below.
Figure 4.5
Since the data is too large to convert manually, an automatic import facility is
necessary to convert the XML files into the MySQL database, where SQL can support
advanced text search. An XML parser is the basic tool for this task. Since the data
conversion is not my task, how it works is not covered here; my job focused on
retrieving the data from the database and rendering it to the client.
Chapter 5: Implementation and Testing
5.1 Overview
The CubReporter Project is a large and complex project funded by the EPSRC for
three years, and the other researchers have been working on it for more than
two years. Though I tried my best to implement all the functions during the project
period, my time and effort were obviously limited. Firstly, this is a cooperative team
project, so my progress could not move on until other team members passed me the
necessary data and methods. Unfortunately, unforeseen factors delayed the feeding of
the full database, with the result that I could obtain test data only from the document
table (the green part in figure 4.3). However, these data are enough to serve as a
sample for building the system prototype. In the end I finished most of the functions
in the system design and proposed a method to improve the system performance. My
contributions to this project cover the following part of the system design
(Figure 5.1).
[Figure 5.1: the implemented part of the design. Browser/client; controller (web servlets); word-based text search; text indexing (Lucene); the PA archive.]
As the diagram above shows, the main word-based Text Search and the user
management have been completed, and the results can be rendered to the client in a
pleasant interface. A user authentication mechanism of e-commerce standard was
designed to implement login and session management. The detailed methods are
explained later.
In principle, the Text Search comprises three steps. When a search request is received
from the user via an HTTP form, the first step is query analysis, which means to
break the query into words and ignore the so-called stopwords (e.g. the, a, is, are,
what, where, etc.), so that the keywords in a query are selected before any search is
carried out. In the second step, the Lucene index search engine is employed to get the
document IDs from the Text Index archive. Lastly, the document IDs found by the
index search are used in an SQL query to load the rest of the document information
from the document table in our database.
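The first of these steps can be sketched as a small query analyzer. The class name is assumed, and the stopword set below is only a six-word sample of the ~250 predefined stopwords.

```java
import java.util.*;

// Sketch of the query-analysis step: split the query into words and
// drop stopwords. Only a few of the ~250 predefined stopwords are
// listed here for illustration.
public class QueryAnalyzer {

    private static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("the", "a", "is", "are", "what", "where"));

    /** Return the keywords of a query, stopwords removed. */
    public static List<String> keywords(String query) {
        List<String> result = new LinkedList<>();
        for (String word : query.toLowerCase().split("\\s+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }
}
```

The keywords returned here are what would be handed to the Lucene index search in the second step.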
5.2 Implementation
• Database Pooling
One of the important tasks in this project was to find a way to improve loading
performance from the database and to use the available resources efficiently. After the
review of technologies in Chapter 2, database connection pooling was judged the most
efficient way to load data from a database server, because it removes the most
time-consuming part of a database-driven application: opening and closing database
connections. With connection pooling, the time for opening and closing database
connections can be disregarded, according to a case study by Marty and Brown (2003).
To use Turbine as a database pooling facility, the Turbine jar must first be copied into
the system's library directory (e.g. WEB-INF\lib). Then a servlet called TurbineInit
(Appendix 1) is created and added to the web.xml file; its function is to initialize
the DB connection pool. When the web server starts, it reads the settings in an
external file called TurbineResources.properties, which includes all the information
needed to define the classes for the Turbine service and the database settings
(see Appendix 1). Any code that performs a database interaction needs to import
the following two classes from the Turbine package.
import org.apache.turbine.services.db.TurbineDB;
import org.apache.turbine.util.db.pool.DBConnection;
The next step is to read the prepared statement from an external file which stores the
SQL queries to be executed. Before any statement can be read, the ResourceBundle
class from the java.util package is used to load the external file. A SQL query from
the SQLQueries.properties file can then be applied to the PreparedStatement class in
the java.sql package. For example, given a SQL query such as
"findDocuments=SELECT * FROM documents WHERE docid = ? ", the
program will pass the conditional parameter to this SQL query before it is executed.
The following code shows how a SQL query is executed.
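The lookup of a named query can be sketched as below. To keep the sketch self-contained, the properties text is passed in as a string rather than loaded with ResourceBundle, and the PreparedStatement parameter binding over a pooled Turbine DBConnection is elided.

```java
import java.io.StringReader;
import java.util.Properties;

// Sketch of looking up a named SQL query, as stored in
// SQLQueries.properties. The project loads the file itself with
// java.util.ResourceBundle and binds the "?" placeholder through
// java.sql.PreparedStatement on a pooled Turbine DBConnection;
// both are omitted here so the example stays runnable on its own.
public class SqlQueryLookup {

    public static String findQuery(String properties, String name) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(properties));
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
        return p.getProperty(name);
    }
}
```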
• Text Search
First, a Document bean is created to represent the document object; it holds all the
attributes of a document (e.g. document ID, title, headline, date, content, etc.).
There are also getters and setters for each attribute, so that the value of each
attribute can be retrieved externally from the JSP pages. Furthermore, a LinkedList is
employed to store the matched search results, as it is well suited to holding large
result sets in memory. A text search has three main steps.
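A minimal sketch of such a bean is shown below, with only two of the attributes the real bean holds and a simulated result list in place of the database query.

```java
import java.util.LinkedList;
import java.util.List;

// Minimal sketch of the Document bean: two attributes with getters and
// setters, and a simulated result list (the real search would fill the
// LinkedList from the database).
public class DocumentBean {
    private String docId;
    private String title;

    public String getDocId()          { return docId; }
    public void   setDocId(String id) { this.docId = id; }
    public String getTitle()          { return title; }
    public void   setTitle(String t)  { this.title = t; }

    /** Matched results are collected in a LinkedList, as in the project. */
    public static List<DocumentBean> resultList(String... titles) {
        List<DocumentBean> results = new LinkedList<>();
        for (String t : titles) {
            DocumentBean d = new DocumentBean();
            d.setTitle(t);
            results.add(d);
        }
        return results;
    }
}
```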
Considering that this system will be used by the journalists in the PA Group to collect
information for breaking news, a number of journalists are likely to make similar or
identical searches within a certain period. For example, during an Olympic Games
they may be more interested in sports stories, and so will search more frequently for
sport. This brings the system many "hot" searches on one topic in a short period.
Therefore, if we
have a mechanism that manages the hot search items over a certain period and makes
maximum use of the available resources, we will improve the system performance
considerably. After careful research, the following method was found to improve the
system performance.
As discussed in the subsection above, JavaBeans represent each business logic object
in this project. Any search is made by calling the related Java object, which holds the
basic properties of that object (e.g. document, story, etc.) and whose
findObject(String query) method returns a LinkedList holding all the objects built
from the data found in the database. This LinkedList is stored in physical memory on
the server side once the first search is activated, and is not destroyed until the server
shuts down. Since loading data from memory is much faster than loading it from
either the database server or XML files on the hard disk, response time improves
greatly if potentially matching search results are kept in memory rather than
anywhere else. Building on the earlier discussion of text search, the following novel
method was introduced to enhance system performance. If we settle the two questions
discussed below, we improve the performance in a reasonable way. The method is
described by following the system's running process:
(1) Tomcat starts and initializes the DB pool (setting up virtual connections to the
MySQL server).
(2) When the first search is made, a starting time is recorded to control a loop of
memory checks, i.e. how often the system should check memory to release
expired data. For example, if the first search is made at 9:00am on July 1st 2005
and the duration of the checking loop is set to 2 days (depending on memory
space), the system will check memory at 9:00am on July 3rd 2005 for expired
data, i.e. data stored in memory but not used for a certain period (e.g. 30 days),
and will repeat the same loop afterwards.
(3) When a user makes a search, the system checks memory first to see whether this
query has been requested before. If the object for that query already exists in
memory, its timestamp is renewed, so hot queries are always kept in memory.
If it is not in memory, the system gets a DB connection and creates new objects
keyed by that query. The new data is then saved in memory, so that it need not
be fetched from the DB again if the same query is repeated. In other words, this
speeds up searches, since responding from the web host's memory is much
faster than from MySQL.
(4) The objects for the query are created and saved in a LinkedList. Then a new
object called ObjectTimeStamp is created with two parameters: the LinkedList
and a long value representing the current time. Lastly, the query and its
ObjectTimeStamp are saved in a TreeMap, whose get(Object key) method can
be used to check for a duplicate search.
(5) The other question is how long an object may exist in memory. A method
called timeExpired() is created to check for expired data. The main idea is to use
the current timestamp minus the timestamp recorded in the ObjectTimeStamp.
The entry in the TreeMap is removed from memory if the difference is greater
than the set duration, so that data which is not frequently requested does not
linger in memory. In this way, the web host's physical memory is used
efficiently, and the performance of this large database-driven system is clearly
improved.
In summary, this method makes optimal use of system resources on the server side,
and offers a new approach to enhancing the performance of database-driven web
applications. The complete code in Appendix 1 shows how this method works.
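The mechanism described in steps (2)-(5) can be sketched as below. The names ObjectTimeStamp and timeExpired() follow the description above; the database lookup is simulated, and the timestamps and expiry duration are passed in explicitly so the behaviour is easy to follow, rather than being read from the system clock.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.TreeMap;

// Sketch of the in-memory result cache: a TreeMap keyed by query,
// holding results plus a last-used timestamp, with a periodic sweep
// that evicts entries unused for longer than maxAgeMillis.
public class ResultCache {

    static class ObjectTimeStamp {
        final List<String> results;
        long lastUsed;
        ObjectTimeStamp(List<String> results, long lastUsed) {
            this.results = results;
            this.lastUsed = lastUsed;
        }
    }

    private final TreeMap<String, ObjectTimeStamp> cache = new TreeMap<>();
    private final long maxAgeMillis;

    public ResultCache(long maxAgeMillis) { this.maxAgeMillis = maxAgeMillis; }

    /** Return cached results for a query, or load and cache them. */
    public List<String> search(String query, long now) {
        ObjectTimeStamp entry = cache.get(query);
        if (entry != null) {
            entry.lastUsed = now;   // renew: hot queries stay in memory
            return entry.results;
        }
        List<String> fresh = loadFromDatabase(query);
        cache.put(query, new ObjectTimeStamp(fresh, now));
        return fresh;
    }

    /** The periodic memory check: drop entries unused for too long. */
    public void timeExpired(long now) {
        cache.values().removeIf(e -> now - e.lastUsed > maxAgeMillis);
    }

    public boolean isCached(String query) { return cache.containsKey(query); }

    private List<String> loadFromDatabase(String query) {
        List<String> l = new LinkedList<>();  // stand-in for the JDBC call
        l.add("result-for-" + query);
        return l;
    }
}
```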
• Data Rendering
When the user sends a query to the servlet via an HTML form, the servlet handles it
and passes it to the beans, which process the query and perform the search. All the
data retrieved from the database is then held in memory; what remains is to output it
dynamically in the web pages.
As stated in the requirements in Chapter 3, each page displays no more than ten
result items, and a page-change function lets the user view more. Moreover, each
page has a guide showing the number of matched items and pages and the current
page number. All these functions are realized in the JSP pages; because the
parameters for calculating which items to display on the current page can be
retrieved from the servlets, outputting the data is not difficult.
Figure 5.2 shows an example of the results list interface.
Figure 5.2
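The paging arithmetic behind these functions can be sketched as follows; the class and method names are illustrative.

```java
// Sketch of the paging arithmetic for the results list: at most ten
// items per page, plus the figures shown in the page guide.
public class Paging {
    public static final int PAGE_SIZE = 10;

    public static int totalPages(int totalResults) {
        return (totalResults + PAGE_SIZE - 1) / PAGE_SIZE;
    }

    /** Number of items shown on the given 1-based page. */
    public static int itemsOnPage(int totalResults, int page) {
        int first = (page - 1) * PAGE_SIZE;
        return Math.max(0, Math.min(PAGE_SIZE, totalResults - first));
    }
}
```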
When users find a document they prefer, they can view its complete content by
clicking on its topic, which leads to another page showing the
whole document. For more details, please see Figures A2.3 to A2.7 in Appendix 2,
which show the entire search interface of the text search system: the main search
page, the results list page and the full document page. The exception is the
refined-search page, which is shown in Figure 4.2 but has not been built yet, for the
reasons mentioned at the beginning of this chapter.
• User Authentication
A userBean is created to represent the user object in the project; it includes all the
attributes and functions related to a user's behaviour. Database connection pooling is
then carried out via the Turbine DB facility. For example, information such as the
JDBC driver and the username and password for the MySQL database is stored in
TurbineResources.properties, which is loaded by the TurbineConfig class when the
Tomcat server starts. Methods such as createUser() and findUser() in the userBean
obtain a connection to the database by importing
org.apache.turbine.util.db.pool.DBConnection. Using the connection pool also makes
it possible to keep all the SQL queries in a separate properties file, apart from the
Java code, which facilitates later deployment and modification of the database tables.
As required, the administrator registers a valid user by filling in a username and
password. The registration module satisfies the following requirements:
(1) The username must be unique in the database.
(2) The password must be more than 6 and fewer than 20 characters, and is
encrypted on both the client and server sides.
(3) The registration interface prompts the administrator about any mistake made
while filling in the form, e.g. the password entered twice must be the same and
the new username must differ from existing ones.
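These checks can be sketched as a single validation method. The method name and error messages are illustrative, and the password encryption mentioned in (2) is left out of the sketch.

```java
import java.util.Set;

// Sketch of the registration checks: username uniqueness, password
// length (more than 6, fewer than 20 characters), and matching
// re-entry. Messages are illustrative; encryption is omitted.
public class RegistrationCheck {

    /** Return null if the form is valid, otherwise an error message. */
    public static String validate(String user, String pass,
                                  String repass, Set<String> existing) {
        if (existing.contains(user))
            return "username already exists";
        if (pass.length() <= 6 || pass.length() >= 20)
            return "password must be more than 6 and fewer than 20 characters";
        if (!pass.equals(repass))
            return "passwords do not match";
        return null;
    }
}
```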
Figure 5.3 on the next page shows the interface of the registration module. For the
system's responses to different inputs, please see Figures A2.8 to A2.10 in
Appendix 2.
Figure 5.3
Since the login module is like a gate to this project, it should be not only robust but
also smart and friendly. The finished login module supports the following features:
(1) Any input mistake is identified and shown to the user in a friendly fashion; e.g.
if a user enters an invalid username, the login interface reports that the
username does not exist, and if a user enters a valid username but a wrong
password, the interface reports that an invalid password has been entered.
(2) Once a user's identity is accepted, the system creates a cookie object for the
user and records it in his/her browser for a period, unless the user logs out
manually. The system checks the user's browser automatically every time the
user logs in, so that the protected pages are not open to anonymous users and
the user's behaviour can be recorded and tracked.
Figure 5.4 on the next page shows the login page. For the system's responses to
different inputs, please see Figures A2.1 to A2.2 in Appendix 2.
To track valid users' information after they log in, cookies are used in the user
management; some JSP is added to the login page to send a cookie back to the
browser. One thing worth mentioning is that it is not enough to store just the
username in the cookie, because a potential hacker could gain unauthorized access to
a user's account by faking the cookie. A trick to prevent this is
- 39 -
Chapter 5 Implementation and Testing
to store both the username and the password. In this way, the hacker can not create the
cookie by just knowing the username.
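The cookie scheme described above can be sketched as follows. This is a hypothetical helper, not the project's actual code; storing a digest of the username-plus-password pair, rather than the raw password, is an added assumption here, but it preserves the key property that knowing the username alone is not enough to forge the cookie.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative helper for the login cookie scheme described above. */
class AuthCookie {

    /**
     * Builds the cookie value from both the username and a digest that
     * depends on the password, so the username alone cannot forge it.
     */
    public static String buildValue(String username, String password) {
        return username + ":" + sha1Hex(username + ":" + password);
    }

    /** Verifies a cookie value presented by the browser. */
    public static boolean isValid(String cookieValue, String username, String password) {
        return buildValue(username, password).equals(cookieValue);
    }

    private static String sha1Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            // SHA-1 yields 20 bytes, i.e. 40 hex characters
            return String.format("%040x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In a servlet, the returned value would be wrapped in a `javax.servlet.http.Cookie` and added to the response; on later visits the server recomputes the value and compares.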
Figure 5.4: The login page
5.3 Testing
The testing process was broken down into two parts: functional testing and performance testing. Functional testing covered the system's functionality; its aim was to check whether the system actually works as intended and to find any errors or bugs. Performance testing was used to demonstrate the improvement delivered by the novel method introduced in this chapter.
Table 5.1 below shows the test cases for the User Registration Model. Here it is assumed that the usernames "James" and "Horacio" do not exist in the database and that the username "Robot" already exists.
Test case 1
  Input:            Username: James; Password: james007; Repassword: james007
  Action taken:     Submit
  Expected outcome: A message confirming successful registration appears on another page.
  Actual outcome:   A new page shows the message that the new user "James" has been registered successfully.
  Success (Y/N):    Y
Table 5.1
Table 5.2 shows the test cases for the User Authentication Model. Here it is assumed that "James" is a valid username with the valid password "james007", and that the username "Ben" is invalid.
Test case 1
  Input:            Username: James; Password: james007
  Action taken:     Login
  Expected outcome: The page redirects to the main search page.
  Actual outcome:   As expected
  Success (Y/N):    Y
Table 5.2
Next comes the test of the text-based search function. In the search text field, users can enter anything related to their task of collecting background information for breaking news, such as several separate words or a natural-language sentence. Users also have the option of specifying a date range, which ran from 1994 to 1997 when the test was done. The expected outcome of a search is a page showing the full list of matched results, with pagination at the bottom of each page (see Figure A2.6 in Appendix 2). When users want to view the complete content of a document, they can navigate to the full-document page by clicking on the topic of that document. The expected outcome is a page showing all the information that the document has in the database; an example is given in Figure A2.7 in Appendix 2. Finally, users can start a new search at any time by entering a query at the top of any page, and the results list should be the same. This additional function makes searching more convenient from any page.
In order to test the performance and reliability of the search function, a live test was carried out by a group of five users from different IP addresses. A response timestamp was used to record the server's response time, which is shown at the top right corner of each results page. The system was deployed to the Don server in the NLP Group, and the test was run from the Lewin Lab in the Department. (Because of the Department's firewall, outside visitors are denied access to the Don server, so a simulation with outside clients could not be carried out.)
The test results show that, in case 1, the five users received their search results with different response times, owing to the different numbers of matched items. In case 2, two users who entered the same queries as the others received their results with a shorter response time, though it still took some time (e.g. 5 milliseconds). In case 3, two users who entered the same queries later received the results list with a response time of zero. This means the same results were loaded directly from memory rather than from the database, so the performance of the search function is enhanced for similar searches carried out within a short period.
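The caching behavior observed in cases 2 and 3 can be sketched as a simple in-memory query cache: results fetched from the database for one query are served from memory when an identical query arrives later. The class below is an illustrative sketch, not the project's implementation; the least-recently-used eviction policy and the size cap are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative in-memory cache of search results keyed by query string. */
class QueryResultCache {

    private final Map<String, List<String>> cache;

    QueryResultCache(final int maxEntries) {
        // an access-ordered LinkedHashMap gives a simple LRU eviction policy
        this.cache = new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns cached results for a query, or null on a cache miss. */
    synchronized List<String> get(String query) {
        return cache.get(query);
    }

    /** Stores results fetched from the database for later identical queries. */
    synchronized void put(String query, List<String> results) {
        cache.put(query, results);
    }
}
```

A cache hit explains the zero response time seen in case 3: no database round trip is made at all for a repeated query.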
The above test was aimed at validating the novel method mentioned earlier. It clearly needs further research and testing before it can be considered reliable; since the method was designed specifically for this project, it is not guaranteed to work in other, similar projects.
Chapter 6 Results and Discussion
6.1 Findings
After testing was completed, the results showed that the original aims listed in Chapter 1.2 were met successfully, and a working prototype was obtained for the CubReporter Project. During development, much of the academic knowledge learned from the Java e-commerce model proved very useful in practice. In particular, both the three-tiered web architecture and the MVC design pattern were successfully applied to the system, and the technologies suitable for each tier were used to implement the required functions. For instance, database connection pooling was found to be the best solution for database connections, and JSP plus JavaBeans served well for data rendering. Following a careful study of web-application performance, a novel method was introduced to enhance the system's performance. All of these findings are valuable experience for my future study.
A number of goals and objectives, both project-based and personal, were achieved by the end of the project development lifecycle. After evaluating the prototype, it is clear that a fully working system has been produced that meets the original objectives (see Chapter 1.2) laid out for the project. The completed system partly demonstrates the feasibility of the research hypotheses introduced by Gaizauskas et al. (2005). Looking back on the requirements (see Chapter 3), it is clear that most of the functions have been fulfilled to a certain level. The major goal achieved was the successful implementation of the text-based search function. Put simply, this function draws on multiple data sources (both the text index and the MySQL database) and is implemented with two search APIs (the Lucene APIs and the JDBC APIs). Because the system relies heavily on database retrieval, connection pooling is used to address the fact that opening a database connection is a time-consuming process. Furthermore, a novel method was introduced to enhance the performance of the search function under certain conditions (see Chapter 5). The method was devised during the development of this project, and initial tests show that it speeds up the response time for similar searches significantly. Moreover, a robust User Registration Model and User Authentication Model were implemented and fully tested; both will be suitable for their roles when the system is integrated in the future. Lastly, the major user interfaces in the project have been completed and are easily extensible for future work. For example, each page is composed from shared fragments, so the header and footer can be reused directly in new pages (e.g. the refined page) in the future; this avoids repeated coding and saves development time.
On a personal level, the major achievement was the knowledge and comprehension gained of the reviewed technologies, especially those chosen for implementation. The practical experience of developing such a large web-based application is very valuable for examining and extending my academic knowledge of web-related subjects. Although the technologies employed in this project were used correctly, I believe that further use of the more advanced features they support would yield better system performance. Another personal goal achieved during development was a deeper understanding of research in Natural Language Processing, which has given me an interest in pursuing future research in this area. The working system also gives our research group a strong foundation for future work. Lastly, the experience gained in teamwork throughout the project, including time management, communication and cooperation with other team members, was another important achievement for me. All of this knowledge and experience leaves me with very pleasant memories of my period of study.
As mentioned before, the CubReporter Project is a large and complex project. Other researchers have spent more than two years working on it, and my project covers the later implementation stage. The project is team work, so I could not proceed until the other members had the necessary data ready for testing. Unfortunately, there is a database-population problem affecting the Events and Objects tables in the database design; this should be solved soon by our database designer, and it is why the event-based search function could not be implemented by me. Furthermore, the remaining work includes Administrator Management (viewing and enabling/disabling an existing user) and User Personalization (updating passwords and personal information, and tracking users' search history and saved documents). These features must wait until the full search function is completed, but they should not be difficult to implement because the design is already available.
Apart from the remaining tasks, any bugs found during testing must be fixed in order to reach enterprise deployment quality. Furthermore, some additional caching methods are worth applying to the project in order to obtain higher performance. For example, the Jakarta Cache Taglib could be used to cache fragments of our JSP pages; its default cache lifetime and size are configurable, which would improve page-loading performance on the client side.
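For illustration, a cached JSP fragment might look like the following sketch; the taglib URI, tag name and attribute names here are assumptions to be checked against the taglib's own documentation and release version.

```jsp
<%@ taglib uri="http://jakarta.apache.org/taglibs/cache-1.0" prefix="cache" %>

<%-- Cache the rendered results fragment; the key and the lifetime (in
     seconds) are illustrative values, not the project's configuration. --%>
<cache:cache key="searchResults" time="600">
    <%-- expensive fragment: the rendered results list would go here --%>
</cache:cache>
```

Fragment caching of this kind complements the server-side result cache: one avoids re-rendering the page markup, the other avoids re-querying the database.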
Chapter 7: Conclusion
The aim of the project was to help the other researchers in the NLP Group build a prototype for the research hypotheses introduced in the earlier stage of the CubReporter Project. The major task was to design and write the server-side business-logic code that accesses a large database and retrieves information for dynamic page creation.
A set of system requirements was agreed with the requirements analyst, who has been working on this project for more than two years. The requirements were broken down into two categories, functional and non-functional. The core functional requirements of the system were the search function, data rendering, user authentication and an administrator management function. Further analysis and investigation of the requirements produced a set of specifications for the system to be designed. Before the design process, however, a literature review was undertaken in order to gain the appropriate knowledge of the technologies available for use.
The literature review focused on technologies and approaches for web-based, database-driven systems. A careful review of tiered web architectures was carried out, followed by the technologies available for each tier. There was also a review of the MVC design pattern for the system to follow during design, and possible architecture scenarios were introduced with detailed comparisons. Finally, the current PA Digital Library website was reviewed to gain valuable information, since the existing system has some functions similar to those needed in the new system.
After the literature review, the system was designed based on the information and knowledge obtained about the technologies available for implementation. The design process was broken down into three parts: the presentation tier, the business tier and the data-storage tier. Each part of the design describes how the technologies would be used to develop the system. The design pattern chosen was Model-View-Controller (MVC), because MVC provides a good separation of the presentation, the business logic and the data. For the presentation, JSP was chosen as the main technology for producing dynamic pages. For the business logic, Java Beans and Java Servlets were chosen as the most suitable technologies, especially in cooperation with JSP. Finally, for the data tier, both a text index and a relational database were used to support the complex text-based search function.
The next stage was implementation, where the technologies chosen during design were put into practice. The fact that a working prototype was achieved shows that the technology choices were sound. During implementation, the focus was on improving the performance of the system, especially the response speed of the search function; hence the novel method introduced against the background of this project. Finally, functional tests were carried out on the different models developed in the project, and the results show that they all work correctly. To demonstrate the improvement in system performance, a real-life test was carried out by a small group of users; the results show that the method does speed up data retrieval to a certain extent.
In conclusion, the tests uncovered no major errors and the completed system was found to be operational. The project requirements were fulfilled, and the overall project was therefore implemented successfully, using the technologies correctly and following the design as intended.
References
Gaizauskas, R., Saggion, H., Barker, E. & Foster, J. (2005): Overview of Recent Advanced Natural Language Processing (RAMP 2005). To appear.
Chaffee, A.(2000). “One, two, three, or n tiers?”. JavaWorld [Online], January 2000.
Available: http://www.javaworld.com/javaworld/jw-01-2000/jw-01-ssj-tiers_p.html
Hall, M. & Brown, L. (2000). Core Servlets and JavaServer Pages. Prentice Hall PTR.
Web References
(WWW1) http://socks5host.pa.press.net:8082/cgi-bin/PAText4
(WWW2) http://lucene.apache.org/java/docs/
(WWW3) http://jakarta.apache.org/turbine/
(WWW4) http://www.orafaq.com/glossary/faqglosr.htm
(WWW5) http://docs.sun.com/app/docs/doc/805-4368/6j450e60j?a=view
Appendix 1 Selected Code
public TurbineInit() {
super();
}
TurbineResources.properties
# -------------------------------------------------------------------
# TURBINE S E R V I C E S
# -------------------------------------------------------------------
services.PoolBrokerService.classname=org.apache.turbine.services.db.TurbinePoolBrokerService
services.MapBrokerService.classname=org.apache.turbine.services.db.TurbineMapBrokerService
# -------------------------------------------------------------------
#DATABASE SETTINGS
# -------------------------------------------------------------------
database.default.driver=com.mysql.jdbc.Driver
database.default.url=jdbc:mysql://localhost:3306/cubreporter
database.default.username=root
database.default.password=root
# The amount of time (in milliseconds) a connection request will have to wait
# before a time out occurs and an error is thrown.
# Default: ten seconds = 10 * 1000
database.connectionWaitTimeout=10000
# These are the supported JDBC drivers and their associated Turbine
# adaptor. These properties are used by the DBFactory. You can add all
# the drivers you want here.
database.adaptor=DBMM
database.adaptor.DBMM=com.mysql.jdbc.Driver
/**
 * Query analyzer, which removes the defined stop words from a query.
 */
public class StemAndStopAnalyzer extends Analyzer
{
    String[] stopWords = null;

    public StemAndStopAnalyzer(String[] stopWords)
    {
        this.stopWords = stopWords;
    }

    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        if (stopWords != null)
            result = new StopFilter(result, stopWords); // remove the stop words from the query
        result = new PorterStemFilter(result); // transform the token stream per the Porter stemming algorithm
        return result;
    }
}
    tQuery.add(hClause);
    tQuery.add(tClause);
    tQuery.add(cClause);
    bQuery.add(tQuery, true, false);
    bQuery.add(dClause);
    System.out.println("testing " + bQuery.toString());
} catch (ParseException pe) {
    pe.printStackTrace();
}

try {
    hits = searcher.search(bQuery);
} catch (IOException ioe) {
    ioe.printStackTrace();
}

DBConnection dbConn = null;
DocumentsBean doc = null;
LinkedList records = new LinkedList();
String[] lines = new String[hits.length()];
String snippet;

try {
    dbConn = TurbineDB.getConnection();
    if (dbConn == null) {
        throw new DocumentsException();
    }
    for (int dn = 0; dn < hits.length(); dn++) {
        Document d = hits.doc(dn);
        lines[dn] = d.get(PAConstants.DOC_ID).toString();
        PreparedStatement pstmt =
            dbConn.prepareStatement(sql_bundle.getString("findDocuments"));
        pstmt.setString(1, lines[dn]);
        ResultSet rs = pstmt.executeQuery();
        while (rs.next()) {
            doc = new DocumentsBean();
            doc.setDID(rs.getString("docid"));
            doc.setDCategory(rs.getString("category"));
            doc.setDHeadline(rs.getString("headline"));
            doc.setDByline(rs.getString("byline"));
            doc.setDPcount(rs.getInt("pcount"));
            doc.setDKeywords(rs.getString("keywords"));
            doc.setDTopic(rs.getString("topic"));
            doc.setDPage(rs.getInt("page"));
            doc.setDContent(rs.getString("content"));
            records.add(dn, doc);
        }
    }
Appendix 2 Screenshots: