
EXAMENSARBETE INOM TEKNIK,

GRUNDNIVÅ, 15 HP
STOCKHOLM, SVERIGE 2017

Alternative Information
Gathering on Mobile Devices
EDIN JAKUPOVIC

KTH
SKOLAN FÖR INFORMATIONS- OCH KOMMUNIKATIONSTEKNIK
Abstract
Searching for and gathering information about specific topics is a time-consuming but vital practice. With mobile usage continuing to grow and having surpassed that of desktop devices, the mobile market is becoming an increasingly important area to consider. Due to the portability of mobile devices, certain tasks are more difficult to perform than on a desktop device. Searching for information online is generally slower on mobile devices than on desktop devices, even though the majority of searches are performed on mobile devices.

The largest challenges with searching for information online using mobile devices are the smaller screen sizes and the time spent jumping between sources and search results in a browser. These challenges could be addressed by an application that focuses on the relevance of search results, summarizes their content, and presents them on a single screen.

The aim of this study was to find an alternative data gathering method with a
faster and simpler searching experience. This data gathering method was able to
quickly find and gather data requested through a search term by a user. The data
was then analyzed and presented to the user in a summarized form, to eliminate the
need to visit the source of the content.

A survey was performed by having a small target group of users answer a questionnaire. The results showed that the method was quick, the results were often relevant, and the summaries reduced the need to visit the source page. However, while the method has potential for future development, it is hindered by ethical issues related to the use of web scrapers.

Keywords: Data collection, Mobile devices, Web scraping, Summarization methods, User-centered design

Abstract (Swedish)
Searching for and gathering information about specific topics is a time-consuming but necessary practice. With continuous growth that has surpassed the share of desktop devices, the mobile market is becoming an important area to consider. Given the mobility of portable devices, certain tasks become harder to perform compared to desktop devices. Searching for information on the Internet is generally slower on mobile devices than on desktop devices.

The largest challenges with searching for information on the Internet using mobile devices are the smaller screen sizes and the time spent moving between sources and search results in a web browser. These challenges can be solved by using an application that focuses on relevant search results, summarizes their content, and presents them in a single view.

The purpose of this study is to find an alternative data collection method to create a faster and simpler search experience. This data collection method will be able to quickly find and collect data requested through a search term by a user. The data is then analyzed and presented to the user in a summarized form, to eliminate the need to visit the source of the content.

A survey was carried out by having a smaller target group of users answer a questionnaire. The results showed that the method was fast, the results were often relevant, and the summaries reduced the need to visit the source page. However, while the method had potential for future development, it is hindered by the ethical problems associated with the use of web scrapers.

Keywords: Data collection, Mobile devices, Web scraping, Text summarization methods, User-centered design

Acknowledgements
We would like to thank our advisers Fadil Galjic and Leif Lindback at the Royal
Institute of Technology. The feedback and help we received during this project
proved invaluable for this thesis.

Contents

1 Introduction 11
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Theoretical Background 15
2.1 Web Search engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Web Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Web Scraping . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Asynchronous Programming . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Concurrent Programming . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Multithreaded Android Programming . . . . . . . . . . . . . . 16
2.2.3 AsyncTask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Managing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 PHP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 User Interface Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Colour Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.2 User Interactivity . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Text Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Natural Language Processing . . . . . . . . . . . . . . . . . . 19
2.5.2 Automatic Summarization . . . . . . . . . . . . . . . . . . . . 20
2.5.3 Generic Summarization . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Web Browsers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6.1 HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6.2 CSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.1 Web Scraping . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.2 Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.3 Similar Applications . . . . . . . . . . . . . . . . . . . . . . . 22

3 Methods 25
3.1 Research Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 Research Methods . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2 Research Process . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Literature Study . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Interview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Design and Implementation of Prototype . . . . . . . . . . . . . . . . 28
3.3.1 Design of Prototype . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 Implementation of Prototype . . . . . . . . . . . . . . . . . . 29
3.3.3 Development Environment . . . . . . . . . . . . . . . . . . . . 30
3.4 Evaluation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.1 Formative Evaluation . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.2 Heuristic Evaluation . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.3 Summative Evaluation . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Evaluating Performance . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . 31
3.5.2 Methods of Evaluation . . . . . . . . . . . . . . . . . . . . . . 32

4 Collecting and Presenting Information: Challenges and Possibilities 33
4.1 Issues with Using Search Engines . . . . . . . . . . . . . . . . . . . . 33
4.1.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.2 Data Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Getting Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 Types of Information Requirements . . . . . . . . . . . . . . . 34
4.2.2 Presentation of Data . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Improving Information Gathering . . . . . . . . . . . . . . . . . . . . 35
4.3.1 Reducing Data Usage . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.2 Time to Resolution . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.3 Showing Relevant Data . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Challenges and Possibilities: Summary . . . . . . . . . . . . . . . . . 36
4.4.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4.2 Possibilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5 Information Gathering Application: Design and Implementation 39


5.1 Design of the Application . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.1 Application Functionality . . . . . . . . . . . . . . . . . . . . 39
5.1.2 Webscraping for Information . . . . . . . . . . . . . . . . . . . 40
5.1.3 Application Structure . . . . . . . . . . . . . . . . . . . . . . . 40
5.1.4 Application Flowchart . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.1 Web scraping for Data . . . . . . . . . . . . . . . . . . . . . . 45
5.2.2 Storing and Updating of Data . . . . . . . . . . . . . . . . . . 46
5.2.3 Creating a Summary . . . . . . . . . . . . . . . . . . . . . . . 47

6 Information Gathering Application: Evaluation 49


6.1 Presentation of The Results . . . . . . . . . . . . . . . . . . . . . . . 49
6.1.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . 49
6.2 App Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2.1 App Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2.2 Relevance Results . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.3 Survey Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

6.3.1 How Relevant were the Summaries? . . . . . . . . . . . . . . . 54
6.3.2 Did the Swipe Functionality Positively Impact the Experience? 54

7 Discussion 55
7.1 Methodology and Consequences of the Study . . . . . . . . . . . . . . 55
7.1.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.1.2 Consequences of the Study . . . . . . . . . . . . . . . . . . . . 56
7.2 Problem Statement Revisited . . . . . . . . . . . . . . . . . . . . . . 57
7.2.1 Design Decisions . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.3 Ethical Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.3.1 Lost Clicks and Ad Revenue . . . . . . . . . . . . . . . . . . . 59
7.3.2 Information and Copyright issues . . . . . . . . . . . . . . . . 59
7.3.3 Anti web scraping Industry . . . . . . . . . . . . . . . . . . . 59
7.4 Sustainability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.4.1 Effect on Environment . . . . . . . . . . . . . . . . . . . . . . 60
7.4.2 Economical Sustainability . . . . . . . . . . . . . . . . . . . . 60

8 Conclusions 61
8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
8.2 Future research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Chapter 1

Introduction

With the exponential growth of online data [1] accessed through mobile devices, it is becoming more difficult to search for and find desired information about a topic. Time is often wasted sifting through data that is either irrelevant or duplicates information that has already been collected. Search engines today rely on each individual user to sift through the returned links in order to get to the desired information. The time and number of page traversals it takes to find the desired data could be reduced by having the search application do the work of finding and presenting the information. The objective of such a system would be to reduce the bandwidth, and the time
spent searching for relevant information.

Improving on the current methods for collecting data requires the information searched for to be presented faster, while maintaining relevance to the desired topic. Improving how data is collected in a way that benefits the user over traditional methods introduces the concern of presentation. How should the data be presented to the user in a way that both saves time and helps them find the desired information? This thesis presents the task of developing an information gathering method for Android devices, which finds and presents relevant data to the user, and explores how to apply certain methods in Android application development. The rest of this chapter introduces the specific problems that define and motivate the focus and purpose of this thesis.

1.1 Background
Finding and gathering data online is mostly done through search engines, such as Google and Bing. The companies that offer these search services use programmable bots, known as web crawlers, that traverse the World Wide Web and create indexes for each site they gain access to.

The information gathered by the web crawler is then used to present the searcher with links to the most relevant websites. Presenting the most relevant sites first is done by analysing each page against a range of criteria to determine its relevance. Web crawlers can also be used to fetch specific data from a web page, and are then referred to as web scrapers.

1.2 Problem
There is a need to reduce the work required by a user to gather data on a specific subject on mobile devices. Different options must be explored for gathering data without the use of traditional search engines. One option that could be viable is using web scrapers to scrape the web for data and analyze what content best characterizes the desired data. The data then has to be presented to the user in a coherent manner.

All this is needed to produce an alternative to the desktop-friendly search engines that are more difficult to use on mobile devices. This thesis explores ways that web crawlers and web scrapers can be used, and discusses how to implement them in a smart environment in the form of an Android application. Since the data collected by the web scraper has to be processed by the application, the thesis also discusses methods of storing and processing the data.

Another problem that arises is the issue of presenting the collected data to a user in a clear way. When trying to improve upon search engines, there must be a thought-out design plan when developing for the Android platform. The data presented to the user must not only be simple to read and understand, but also summarize the content without leaving out important details. This means there are challenges in both the technical and aesthetic aspects of presenting data.

The task of improving upon existing methods for gathering data is a difficult one for many reasons. A successful implementation of a smart information gathering tool would need to reduce both search time and bandwidth. While looking up a short description or a wanted link is easy to do in a smart device's browser using existing search engines, gathering data from several sources becomes more difficult the more data one needs on the subject.

1.3 Problem Statement


This thesis aims to investigate the following questions:

In which way can a web scraper be used to collect relevant data on a subject?

How can the collected data be stored and analyzed?

In which way can an Android application use a web scraper for data gathering?

How can the collected data be presented to promote easy access to the desired information?

1.4 Purpose
This thesis aims to find a search solution for mobile devices that reduces the bandwidth and time spent finding information relevant to the user. The experiences from this study could also lay a foundation for others who wish to develop Android applications that make use of multithreading, summarization methods and databases.

Android developers who want to gather data from the web using their software can use this thesis to determine whether web scrapers are a viable option to accomplish that. There are also different problems that arise when developing for the Android OS concerning data gathering and presentation. These include issues such as how to find, store and analyze data using web scrapers. A further issue is the user-friendliness of such an application.

1.5 Delimitations
Creating an application of this type can range from hundreds of lines of code to millions, with varying complexity. Since the returned data could span hundreds of different file types and extensions, the decision was made to limit the gathering to raw text only. Local and server-side caching was also excluded from this project due to uncertainties regarding legal aspects. Furthermore, the UI design of the application was kept simple, with a main focus on functionality.

1.6 Thesis Outline


The thesis is structured as follows.

In chapter 2 the thesis presents the necessary background information along with its related sources. The chapter provides the technical background needed to fully understand the document.

In chapter 3 our research strategies and methods are presented and briefly discussed. The chapter gives an overview of which research strategies were chosen and why.

Chapter 4 covers the challenges and possibilities that arise when performing a web search using a mobile device.

Chapter 5 provides an overview of the application's implementation, design and all of its functionality.

Chapter 6 presents the results gathered from user feedback received through questionnaires, and the statistical results generated from the data contained in the database.

Chapter 7 discusses the design decisions made when implementing the application and what motivated these decisions. The problem statement is revisited and reflected upon. Furthermore, ethical aspects of the thesis are discussed.

Chapter 8 ends the thesis with conclusions, future uses and possible future research within the thesis topic.

Chapter 2

Theoretical Background

Understanding the content of this thesis requires a basic understanding of a wide range of technologies and design principles. The following chapter covers the fundamental concepts and topics required to fully understand the work and research in this document.

2.1 Web Search engines


The search engines used today rely on different technologies to find and present
what links are most relevant to a search term.

2.1.1 Web Indexing


Performing a web search using a search engine is achieved by entering a search term relevant to the desired topic. The result of the search is returned as the websites that the search engine deems most relevant. Search engines such as Google and Bing rely on various methods for discovering the websites that make up the Internet, in order to present the user with relevant search results. The data that search engines use for displaying relevant results is obtained through a method called web indexing [2].

Web indexing refers to various methods of indexing either a set of web pages or the whole Internet. Indexing is achieved using web crawlers that recursively visit each link on a web page. When a search engine finds a website, it takes a snapshot of the content of the website and saves it in a database. Once the search engine has a website's contents, it can quickly match the website with a user's search query.
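The recursive link-following described above can be sketched as a breadth-first traversal. The example below is a self-contained illustration, not code from the thesis prototype: the "web" is an in-memory map from page to outgoing links, where a real crawler would fetch pages over HTTP and snapshot their content.

```java
import java.util.*;

// Minimal sketch of how a web crawler discovers pages to index by following
// links. The class and method names are invented for this illustration.
public class CrawlerSketch {
    // Returns the set of pages reachable from startUrl, i.e. the pages a
    // crawler would snapshot and index, in the order they were discovered.
    public static Set<String> crawl(String startUrl, Map<String, List<String>> links) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(startUrl);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) continue; // skip pages already indexed
            for (String out : links.getOrDefault(url, List.of())) {
                if (!visited.contains(out)) frontier.add(out);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "a.html", List.of("b.html", "c.html"),
            "b.html", List.of("c.html"));
        System.out.println(crawl("a.html", links)); // [a.html, b.html, c.html]
    }
}
```

A real crawler would additionally respect robots.txt, limit crawl depth, and store a snapshot of each visited page for later query matching.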

2.1.2 Web Scraping


Web scraping is a data gathering technique where data is harvested from the web. After loading a website, the web scraper software can extract the available website data and repackage it into a desired format [3]. The data can then be stored locally and used without having to access the web. As many websites do not offer the option to save specific data from their pages, web scraping can be used to automate the manual work of copying the desired data by hand.

Because the content and structure of websites vary, each website requires a different solution for fetching content. The desired data is found by identifying the elements or attributes where the data resides. Web scrapers can also be combined with web crawlers to gather information across many links.
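A minimal sketch of the extraction step might look as follows. Production scrapers would use an HTML parsing library (jsoup is a common choice in Java) rather than regular expressions; the regex here only keeps the example dependency-free, and the class name is invented for illustration.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of a scraper's extraction step: pull the text out of the HTML
// elements where the desired data resides. Illustrative only -- real-world
// HTML should be handled with a proper parser, not a regex.
public class ScraperSketch {
    private static final Pattern PARAGRAPH =
        Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL);

    // Extracts the text content of every <p> element in the given HTML.
    public static List<String> extractParagraphs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) result.add(m.group(1).trim());
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>News</h1>"
                    + "<p>First story.</p><p>Second story.</p></body></html>";
        System.out.println(extractParagraphs(html)); // [First story., Second story.]
    }
}
```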

2.2 Asynchronous Programming


Unlike synchronous programs, where code is executed sequentially from top to bottom, asynchronous programs use a non-blocking model where processes in blocked states don't hinder the rest of the program [4].

2.2.1 Concurrent Programming


Concurrent programming deals with programs where instructions can be run either in parallel or without blocking. Concurrency depends on having several threads of execution.

CPU Cores
CPUs found in mobile and desktop devices usually have several cores of execution. A core can only perform a single instruction at a time, while maintaining a pointer to the next instruction and a small amount of memory known as registers. By splitting up the workload between the CPU's cores, parallelism can be achieved, which can provide a speedup to a certain degree [5].

Threads
A thread of execution is a sequence of instructions which a CPU core can perform [6]. Threads are spawned by processes, which are programs running on a device. Threads spawned by a process contain instructions that can be executed independently of other code, and do not need to be run sequentially.

A single core can have several threads running at the same time, but they are not run in parallel. By switching between different threads on a single CPU core, the CPU can provide the illusion of concurrency. This prevents a process from blocking other operations and thus making the program unresponsive.

2.2.2 Multithreaded Android Programming


By making use of several threads, code that has to wait for something no longer blocks the rest of the program; the task scheduler can simply assign the CPU cores new instructions. Built-in Java APIs such as Executor, ThreadPoolExecutor and FutureTask make multithreaded programs easier to write and keep track of [7].
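As a rough sketch of this idea, the example below submits several simulated fetch tasks to a fixed thread pool via the Executor framework and collects the results through Futures, so no single slow task blocks the others. The class name and the simulated "fetch" are illustrative, not taken from the prototype.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of fetching several sources concurrently with the Executor
// framework. In an information gathering app, each task could scrape one
// website; here the fetch is simulated so the example is self-contained.
public class ConcurrentFetchSketch {
    public static List<String> fetchAll(List<String> urls) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Submit one task per URL; all tasks may run in parallel.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> "content of " + url)); // simulated fetch
            }
            // Collect results, waiting for each task to finish.
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    results.add(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting new tasks
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(List.of("a.com", "b.com")));
        // [content of a.com, content of b.com]
    }
}
```

On Android, work like this must happen off the UI thread; the results are then posted back to the UI thread for display.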

2.2.3 AsyncTask
AsyncTask is a class in the Android OS package which provides a simple way of performing background operations and handling the result on the UI thread [8]. The UI thread is the main thread of an Android application, which updates the graphical interface. Blocking the UI thread prevents the application from re-rendering the screen, and thus gives the impression that the application is frozen. By handling tasks on a separate thread using the AsyncTask class, the UI thread is not blocked. Unlike thread handlers such as Executors, AsyncTasks are made for shorter, less CPU-intensive operations.

2.3 Managing Data


Modern software applications depend on both local and server-side data to provide users with content. Managing data requires knowledge of both database design and how to access and update data.

2.3.1 SQL
SQL stands for Structured Query Language and is a programming language created by IBM in the 1970s to help developers manage databases more easily [9]. SQL being a query language means that users can create queries that hold the information needed for the DBMS (database management system) to accomplish a specific task on the database. While many query languages have been created, SQL became the most popular and is the most used query language today. When a user wants to manage their database, e.g. adding an entry to a table, a query has to be created and handled by the DBMS.
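As an illustration, the queries below insert a scraped result into a hypothetical results table and fetch the most recent entries for a search term. All table and column names are invented for this example and do not come from the thesis prototype.

```sql
-- Hypothetical schema: results(search_term, source_url, summary, created_at)
INSERT INTO results (search_term, source_url, summary)
VALUES ('android threading', 'https://example.com/article', 'Short summary of the page...');

-- Fetch the ten most recent results for a search term
SELECT source_url, summary
FROM results
WHERE search_term = 'android threading'
ORDER BY created_at DESC
LIMIT 10;
```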

2.3.2 PHP
PHP is a scripting language that is primarily used on web servers. It is often used in web development to provide interaction between a client and data stored on a server. PHP code can also be embedded directly into HTML documents to perform various functions, such as generating dynamic content. PHP code that is embedded in HTML is executed on the server, and the generated HTML is sent to the client [10].

2.4 User Interface Design
Designing a mobile application provides certain challenges not found on a desktop
device. The smaller screen and touch controls require extra care to ensure the
application is easy to use.

2.4.1 Colour Theory


Designing an application that is visually appealing requires a fundamental understanding of colour theory. Colour theory looks at different colour combinations and how they are perceived. One way of thinking about complementary colours is by looking at a colour wheel [11]. The colour wheel puts complementary colours on opposite sides and analogous colours close by, as seen in figure 2.1.

Figure 2.1: Colour Wheel (image from Wikimedia Commons [12]).

Analogous colours: Colours that lie next to each other on the colour wheel are often found in nature and are considered harmonious and pleasing to the eye [13]. Because analogous colours don't create a high contrast, they are commonly used for deciding the overall colour theme of a design.

Complementary colours: Colours that lie on opposite sides of the colour wheel create a high contrast, and are commonly used when something needs to stand out. Unlike analogous colours, complementary colours can be quite jarring and should thus not be used for the overall design colour palette.

2.4.2 User Interactivity
Due to their size and mobility, mobile devices have vastly different interactions than desktop devices, and thus need to be designed accordingly. Contrary to desktop devices, there are many ways of holding and interacting with a touchscreen. Accounting for all kinds of devices requires the design to be responsive and simple. Accomplishing this requires following certain principles [14].

Consistency: Each page of the application should keep consistency in regards to design elements, such as font and colour. Furthermore, the design of the pages should be shaped by usability testing.

Readability: Text should always have a high contrast to make it easy to read. The font size should be large enough and scale with the size of the screen. Important labels such as button texts should be extra large and visible to convey functionality.

Simplicity: The simplicity of a design requires balance between functionality and ease of use. Using the features of an application should be easy enough to do without reading instructions, while still accomplishing the task.

Visibility: Everything the user needs to navigate and use the application should be available without distractions. Navigation should always be presented in a way that's clear and natural. The user should never have to guess or spend large amounts of time navigating between pages.

Feedback: The design should never have the user guessing what is happening. The state and condition of an application should always be visible, so the user does not think the application has frozen when it's loading.

When designing a UI, the placement of objects needs extra consideration. Because the interface is interacted with through touch controls, parts of the screen will be covered by fingers.

2.5 Text Processing


Text processing involves different ways of manipulating text to either extract or change parts of it.

2.5.1 Natural Language Processing


Natural Language Processing (NLP) is a field spanning computer science, computational linguistics and artificial intelligence that studies the relationship between human language and computers. NLP looks at how computers can analyze human language and draw conclusions about the content of a text. There are many uses for NLP algorithms, depending on the desired output.

2.5.2 Automatic Summarization
The process of summarizing texts using software is an applied method of NLP known as automatic summarization. The goal of summarization algorithms is often to generate a summary from a text that is not predefined [15]. There are many varied methods of achieving this, but they all rely on identifying a list of keywords that define the topic of the text.

2.5.3 Generic Summarization


Generic summarization is a type of automatic summarization focused on producing a summary from a collection of data [16]. The goal of generic summarization is to condense a number of sentences down to a smaller amount, while keeping the most relevant data. One way this can be achieved is by giving each sentence a relevancy score [17]. Determining the relevance of a sentence requires checking it for different factors, including the occurrence of keywords. Keywords are words commonly found in the original text that are not stopwords. Stopwords are common words in English, such as "the" or "and", that do not describe the content. A sentence is assigned a relevancy score through the following steps, also depicted in figure 2.2.

How many words in the sentence were also found in the search term. The search term is split up and a set of search-specific keywords is identified. If a sentence contains one or more keywords from the search term, it's more likely to be relevant.

How long the sentence is compared to an ideal sentence. The ideal length of a sentence tends to be around 15-20 words according to most writing guides. A sentence is given a weighted length score based on how close it comes to the ideal sentence length.

Where the sentence was found in the text. A lot of articles and reports follow a text structure where general topics are introduced in the beginning and concluded at the end, while more specific subjects are discussed in the middle. A summary consists of more general information and does not go into the details. Sentences found in the beginning and end of the original text are thus given a higher weighted position score. This style of writing is known as the Hourglass Model [18].

The sentence's keyword density. A sentence is given a weighted keyword density score based on how many keywords it contains. If a sentence contains many keywords, it's more likely to contain descriptive information.

A score based on how common the keywords found in the sentence are. For each keyword that occurs in a sentence, a weighted frequency score is given based on how frequent the keyword is in the original text. Keywords that are found more often in the original text give a higher score.

Finally, the weighted scores are combined and the five most relevant sentences are selected to compose a summary. The sentences are combined in the same order as they occurred in the text.
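The steps above can be sketched in simplified form. The example below scores sentences using only the keyword-overlap and keyword-frequency factors (the length and position weights described above are omitted for brevity), then emits the top sentences in their original order. The class name and the tiny stopword list are illustrative, not taken from the thesis prototype.

```java
import java.util.*;

// Simplified sketch of keyword-based sentence scoring for generic
// summarization. Scores each sentence by the search-term keywords it
// contains, weighted by how frequent those keywords are in the full text.
public class SummarizerSketch {
    private static final Set<String> STOPWORDS = Set.of("the", "and", "a", "of", "to", "is");

    public static List<String> summarize(String text, String searchTerm, int maxSentences) {
        List<String> sentences = Arrays.asList(text.split("(?<=[.!?])\\s+"));

        // Keywords = search-term words that are not stopwords.
        Set<String> keywords = new HashSet<>();
        for (String w : searchTerm.toLowerCase().split("\\W+")) {
            if (!w.isEmpty() && !STOPWORDS.contains(w)) keywords.add(w);
        }

        // How often each keyword occurs in the original text.
        Map<String, Integer> freq = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (keywords.contains(w)) freq.merge(w, 1, Integer::sum);
        }

        // Score each sentence by its (frequency-weighted) keyword content.
        double[] scores = new double[sentences.size()];
        for (int i = 0; i < sentences.size(); i++) {
            for (String w : sentences.get(i).toLowerCase().split("\\W+")) {
                if (keywords.contains(w)) scores[i] += freq.getOrDefault(w, 0);
            }
        }

        // Pick the indices of the highest-scoring sentences...
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) order.add(i);
        order.sort((a, b) -> Double.compare(scores[b], scores[a]));
        List<Integer> chosen =
            new ArrayList<>(order.subList(0, Math.min(maxSentences, order.size())));
        Collections.sort(chosen); // ...and keep them in original text order.

        List<String> summary = new ArrayList<>();
        for (int i : chosen) summary.add(sentences.get(i));
        return summary;
    }

    public static void main(String[] args) {
        String text = "Web scraping extracts data from websites. The weather was nice. "
                    + "Scraping tools automate data collection.";
        System.out.println(summarize(text, "web scraping data", 2));
        // [Web scraping extracts data from websites., Scraping tools automate data collection.]
    }
}
```

A full implementation would add the length and position weights before combining the scores, as in the steps above.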

Figure 2.2: Summarization Flowchart.

2.6 Web Browsers


2.6.1 HTML
HTML, which stands for Hypertext Markup Language, is the standard language used for creating documents that represent web pages [19]. A document created with HTML consists of nested elements, with different tags that describe the content they contain. The elements that make up an HTML document are interpreted by the browser to decide how the content should be displayed, but are not displayed themselves. An HTML document starts with a Document Type Declaration tag that declares which version of HTML the document should be rendered as. The content between the <html> and </html> tags describes the web page as a whole, but the visual content is placed only between the <body> and </body> tags. Programs can be run in the browser to offer dynamic content by writing JavaScript between <script> tags, or by including separate files through the src attribute.

2.6.2 CSS
Cascading Style Sheets(CSS) is a style language used to describe the presentation
of structured documents, such as HTML or XML. While HTML documents can
be styled using style attributes for each element, a style sheet makes it possible to
separate the content of a document from the presentation. The style of a HTML
element is declared by having a keyword called a selector, which is a part of the
stylesheet that specifies the tag name of the element.

The properties of the selector such as colour, font and many more are then ap-
plied to each element matching the tag. By using attribute selectors, it is possible
to target specific elements that have matching id or class attributes in the target
document. In addition to specifying the colour and font of elements, CSS is also
used to design the layout of a web page or document[20].
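A small, hypothetical style sheet illustrating tag, class and id selectors:

```css
/* Tag selector: applies to every <p> element */
p {
  color: #333;
  font-family: sans-serif;
}

/* Class and id selectors target specific elements */
.highlight { background-color: yellow; }
#header    { font-size: 2em; }
```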

2.7 Related Work
This section covers some of the related works found throughout the thesis work. It
presents the issues and possibilities found in similar applications.

2.7.1 Web Scraping


Web scraping deals with the extraction of information that a user is likely to find
useful or interesting. A web scraper is simply a tool that automates information
gathering online and returns the result to the user. Web scraping can be applied
for many different purposes, but is most often used by companies to monitor
competitors' prices or to collect information from people's profiles. Any text or
media that is found online can generally be web scraped, so the uses of web
scraping are varied. Web scraping is common in most larger e-commerce
businesses, which use web scrapers to track the prices of products sold by their
competitors. Research companies pull a lot of data from different websites and
make use of web scrapers because the process is automated. Web scraping is
generally done either as a one-time scraping, where data is fetched and used in
a large batch, or continuously over a long time to keep track of changes.

2.7.2 Summarization
Automatic summarization techniques are used to automatically create summaries,
with little or no human intervention. Summarizers are useful for getting an
overview of a complete text in a shorter time. Automatic summarization is not
commonly used in business, because the technology is not mature enough and
the results can vary in quality. Summarization tools can be used to summarize
any medium, including text or video, but are mostly used for text. Software that
performs summarization is often written in machine learning courses as an
exercise, but the business applications are still rather unexplored.

2.7.3 Similar Applications


Sensebot: Semantic Engines LLC is a company that has developed services
related to finding information online through web scrapers. The product,
called SenseBot[21], is described as a search engine that produces summaries
from search terms. According to the website, it uses text mining and multi-
document summarization to produce a coherent summary. While the web
scraping part, which is what they refer to as text mining, probably works as
described, the summarization is very poorly implemented and clearly does not
use several documents for each summary, contrary to what is stated. The
website claims to present users with summaries, but when tested it would
only provide a single sentence, which was often found to be irrelevant. From
testing the website, it became clear that a summary has to contain more than
one sentence to properly describe the content.

SMMRY: SMMRY (http://smmry.com/) is a website[22] that offers a tool for
summarizing text. The text to summarize can be provided through a file, a
URL, or by typing in the text manually, and the length of the summary can
be specified as a number of sentences. The website does not offer any
information regarding the owner, but sells a service provided through its API.
When tested, five sentences seemed to be the best compromise between length
and content quality. Both the SMMRY website and the developed application
summarize content, but the application allows users to create summaries
without providing a source.

Chapter 3

Methods

This chapter covers the research strategies and methodologies that were used in this
study. Furthermore, the research process is outlined, and the data collection and
result gathering methods used to answer the problem statement are presented.

3.1 Research Strategy


A literature study and an interview were conducted to gather the required knowledge
needed to answer the problem statement. This section outlines how the research
methods were approached and why they were chosen.

3.1.1 Research Methods


The research methodologies were chosen and conducted in order to gain the knowl-
edge required to assess and find a solution to the problem statement.

Literature Study
A literature study is a process of gathering information about a subject from
various sources, such as articles, books and research papers. The gathered
information can then be processed and summarized to help gain an under-
standing of a researched topic. A literature study is performed with either
a quantitative or a qualitative method[23]. The qualitative method looks at
thoughts and opinions; this uncovers new problems and possibilities and al-
lows one to delve deeper into the problem. The quantitative method looks at
measured or deduced data. The initial literature study is often conducted
with a qualitative research approach, in order to find new thoughts and trends
about a researched topic. The qualitative study is usually then followed by a
quantitative study, where measurable data is evaluated and interpreted to for-
mulate facts. In order to assure the validity of the material, literature studies
require one to critically evaluate the information and sources, to determine the
legitimacy of the content. Without a critical analysis, gathered information
cannot be used in summaries or integrated into one's work. A literature study
was chosen to get a better understanding of the fields in which the thesis was
conducted.

Interview
Due to the involvement of many different technologies and challenges, an inter-
view was deemed useful in order to learn from the experiences of someone who
has worked with similar systems. The desired information was collected by
performing a general interview with some predetermined questions, conducted
with a software engineer who had work experience in many of the technologies
used. The purpose of the interview was to learn the best practices for storing
and analyzing data collected from an Android application. Additionally, it
gave insight into common problems that can occur during the development
process, and how to avoid them.

3.1.2 Research Process


The work of this thesis was divided into several phases, where different types of
studies and development practices were conducted, as depicted in figure 3.1. During
the first phase of the thesis, research was performed to create a strong foundation
of background knowledge in the relevant fields. The information gathered was ana-
lyzed, a hypothesis was made and conclusions were drawn. When the background
knowledge was deemed strong enough, the development phase was started, where
the data from the research was used to plan and implement a prototype. The de-
velopment phase was conducted in an iterative manner, in which the prototype was
heuristically and formatively evaluated, then redesigned to improve upon the cur-
rent version. The last phase of the thesis was the evaluation phase, where different
evaluation methods were used to test and analyze the result of the prototype.

Figure 3.1: The Process Outline.

3.2 Data Collection
This section contains the methods used to gather and summarize data, and how
they were applied.

3.2.1 Literature Study


Before a literature study could be conducted, an assessment was made to create an
overview of the prerequisites needed to fully understand and approach the problem.
By breaking down the problem statement into smaller subproblems, the concepts
and technologies required to conduct the study were identified. With an overview
of the project in mind, academic search engines such as Google Scholar and the
KTH library search tools were used to search for relevant information. Further
results were acquired by studying the official documentation for the programming
languages Java and PHP, and for the IDE Android Studio. The process of the
literature study is depicted in figure 3.2.

Figure 3.2: Literature Study Process.

Analyzing the Knowledge Requirement


Because the prerequisites of this thesis required background knowledge in
many different fields, it was important to identify the key concepts related to
the problem statement. Certain topics, such as Android development and web
scraping, were deemed more important for the project and were thus prioritized.
With the delimitations in mind, certain topics required less in-depth studying
in order to attain adequate knowledge.

Acquiring Relevant Source Material


Finding source material, whether it is a published article, a book or a blog,
requires searching through different sources of varying depth and complexity.
Without varying the search sources, it would have been difficult to gain enough
knowledge about the topics without having to spend a large amount of time
on large case studies.

As more knowledge was obtained and applied, new topics arose that needed
to be studied. Because the technologies used, such as web scraping, were
rather new, most sources had to be fetched from online articles, documentation
and studies.

Evaluating the Content and Sources
A crucial aspect of the literature study was determining the validity and
relevance of a source. Because the concepts and technologies used in this thesis
are new and ever evolving, it was important to assure that the sources were up
to date in addition to being trustworthy. When possible, official documentation
was used for the literature study.

Reading, Studying and Understanding the Chosen Sources


By gathering and studying material of different levels of complexity and depth
from multiple sources, a better understanding of the problem statement was
attained. This led to the discovery of new subproblems and solutions that
needed further studying.

Applying Knowledge to Update Design and Delimitations

With the discovery of new subproblems and challenges, the design of the
prototype had to be updated to match the new delimitations and possible
solutions. By finding new solutions and problems, further studies were required.

Evaluating Result Gathering Methods

Finally, the literature study was used to decide the methods that would be
used to obtain the data required to answer and evaluate the problem statement.

3.2.2 Interview
After performing and summarizing the results from the literature study, it was clear
that not all questions had been answered, or could be reliably answered online or
in books. The purpose and aim of the interview was to get a better understanding
of the development process, and of how to develop applications specifically targeting
mobile devices. The interview was held as part informal and part general interview,
meaning that while there were some general questions, the interview was kept rather
open for discussion. This was done in order to get answers to some questions, and
to come up with new questions that had previously not been thought of. The
interview was held with Eric Von Knorring, a software engineer. He was chosen due
to his vast experience in many of the technologies surrounding the application.

3.3 Design and Implementation of Prototype


The prototype was intended to provide a tool that could be used to investigate the
problem statement and measure a result. When the research phase of the thesis was
completed, the resulting data was gathered to be evaluated and used in the design
of the prototype. This section contains the method used to design and implement
the prototype.

3.3.1 Design of Prototype
Due to the complexity of the application, the design process can be split into several
different parts that need to be analyzed and designed.
Web Scraping Design
The gathering of data through web scraping was independently designed to fit
the specific requirements of the application.
Database Design
Designing the database refers to the implementation of a separate system used
to store and update values through function calls from the mobile application.
The design and choice of technology were decided based on previous knowledge
of database paradigms, and on the requirement of handling concurrent writes
to the same database rows.
UI Design
From the literature study, baselines for creating a simple and aesthetic UI were
established and documented. Implementing a proper design required several
iterations, in order to reduce any distractions from the core purpose of the
application.
Information Design
The aspect of the application that required the most research was the presenta-
tion of the results. Developing the solutions required to promote better access
to the desired information resulted in branching out into researching Natural
Language Processing.

3.3.2 Implementation of Prototype


Designing and implementing the application was a process that was not done se-
quentially, but rather iteratively and in parts. This was done to keep the project
scope flexible in terms of features and quality. Because there was no end user to
evaluate each iteration of the project, self-evaluations were held after each work
phase was complete, where the design was analyzed and adjusted. The iterative
workflow of implementing smaller parts of the whole application reduced the risk
of not completing the program on time. Furthermore, the MVC design pattern
was used to write different parts of the application independently.

The implementation of the prototype was done in two parts. The first was the
front-end of the prototype, which consists of the different views of the Android
application. The second was the back-end, which includes the database, the model
classes and the PHP code used to query the results. During the implementation
phase, both the front-end and the back-end were developed in parallel. This was
needed to be able to test and evaluate parts of the prototype before continuing
with the development. The front-end of the prototype was from the beginning very
dependent on the back-end, for example on the database, to test that certain
features worked.

The parallel implementation of the front-end and back-end was used to create the
resulting prototype. The prototype was then used to generate the results needed for
the evaluation of the thesis.

3.3.3 Development Environment
This project used the IDE Android Studio to develop the application, as it is the
official development environment for Android and is supported by Google. The
benefits of using Android Studio were the included tools required to develop
Android applications, such as emulators, built-in version control and dependency
management. The database and back-end were developed using the program
XAMPP, to set up a local server and design the database schemas. The back-end
PHP code was written using the text editor Atom. The mobile application was
developed to run on Android version 4.1 (Jelly Bean) and newer.

3.4 Evaluation Methods


Determining whether the application was able to achieve its purpose of investigating
and possibly solving the problem statement required evaluations to be made.
Different methods of evaluation were needed throughout the development to ensure
the work was moving towards its target.

3.4.1 Formative Evaluation


The iterative process of designing, redesigning and implementing the project proto-
type made use of formative evaluation, where the prototype is modified all through
the implementation stage. While there were no stakeholders with whom to discuss
the current iterations, self-reflections and iterative cycles made it possible to
evaluate different versions of the project. The application was evaluated after each
milestone was achieved. The progress and issues that occurred during the implemen-
tation were discussed, and the resulting discussion was used to make adjustments
where applicable.

3.4.2 Heuristic Evaluation


Heuristic evaluation refers to the method of identifying and solving problems related
to user interfaces. By applying different heuristics when creating an interactive and
responsive user interface, the design of an application can achieve its purpose of
directing the user without being distracting. The evaluated heuristics were taken
from the Nielsen set[24] of heuristics:

Visibility of System Status: Give the user feedback on what is going on.

Consistency and Standards: Ensure that system behaviour is consistent
throughout the application.

Aesthetic and Minimalist Design: Only present relevant information and
options.

The heuristics apply not only to the aesthetics and design of the UI, but also to
the functionality of the underlying system.

3.4.3 Summative Evaluation
While formative evaluation methods were applied during the development process,
a summative evaluation was conducted after the application had been implemented.
The evaluation was performed to measure various metrics of the application. The
summative evaluation strived to answer how viable the application is, with perfor-
mance and functionality in mind[25]:

Were the performance targets met in terms of speed, RAM and data
usage?

Could the application serve as an alternative method for information
gathering?

Were the summaries relevant?

3.5 Evaluating Performance


In order to evaluate the application's performance, relevant metrics needed to be
collected, analyzed and presented. The literature study provided the basis for which
metrics needed to be tested in order to assess the possibility of solving the problem
statement.

3.5.1 Performance Metrics


To assess the viability of the application prototype, certain key properties that
were deemed necessary to fulfill were identified. The metrics were chosen as follows.

Speed: How long it takes from entering a search term until information is
visible.

Data usage: How much network data is used from that a search is started
until the information is fully loaded.

RAM usage: How much RAM is used at the peak of RAM usage.

Relevancy: How many of the searches return relevant results, and what share
of the results are relevant to the search term.

The results need to be relevant to the search term, or else the application does not
fulfill its purpose. The application must display the result at least as quickly as a
regular search engine. The application also cannot use more network data than a
regular search. Lastly, the application cannot leak memory or use significantly more
memory than a regular search engine.

3.5.2 Methods of Evaluation
In order to obtain results for the performance metrics, methods that accurately and
reliably measure data had to be used. Furthermore, a range of devices with different
performance levels needed to be tested.

Speed
Measuring the speed of the application requires checking how long it takes from
hitting the search button until the result has fully loaded. Furthermore, each
major function was measured to identify possible bottlenecks.

The time measurements were obtained by calling a built-in function for saving
the system time before and after the desired measurement. The elapsed time was
then obtained by taking the difference between the time stamps.

long startTime = System.nanoTime();
// code being measured
long elapsedTime = System.nanoTime() - startTime;

Data Usage
Measuring the network data was done using the built-in tool Android Device
Manager. By selecting a process, the total amount of network data could be
measured over a time period.

RAM Usage
The amount of RAM used was measured by taking snapshots during runtime.
This was done using the built-in monitor, measuring at peak RAM usage.

Relevancy
The relevancy was measured through user feedback from a target group that tested
different search inputs and gave feedback on the results. The feedback was collected
through a questionnaire in which the users were prompted to answer questions
regarding the application. The feedback was received in the form of numerical
scores and text for each question. The share of the content that was relevant was
measured by statistically evaluating the database data.

Chapter 4

Collecting and Presenting Information: Challenges and Possibilities

This chapter presents and analyzes the issues with searching on mobile devices. The
result of the literature study was used to design a plan for implementing a prototype
and test its viability.

4.1 Issues with Using Search Engines


This section covers the technical issues that were found and how they affect the
gathering of information when using mobile devices.

4.1.1 Performance
Since the first web page was created in 1990, websites have become a lot larger,
with the addition of various features such as images, videos, fonts, CSS and
JavaScript, to name a few. What started as a simple way of sharing information
in the form of text has evolved into often fully fledged web applications with
complex features and, as a consequence, often large JavaScript files.

This increase in size and required performance has been a noticeable issue for desk-
top users, as websites have been trending towards implementing the feature sets of
web apps[26]. The issues are further magnified on mobile devices, since factors such
as bandwidth and power draw are less of an issue on connected devices. Not only
are mobile devices limited by their batteries and data plans, their network connec-
tions, processors and RAM are often significantly slower, which affects the
performance.

Performing a search on a mobile device has not become much slower due to the
search engines themselves, but rather due to the loading of the found data and the
navigation of the results. The heavy use of JavaScript in modern websites intro-
duces features that are often unwanted when trying to find information quickly,
such as animations and ads that load in dynamically. Trying to find information
on a mobile device often takes a significant amount of time, especially if the desired
information is hard to find and requires the user to check several links.

4.1.2 Data Usage
The average web page more than doubled in size from 2012 to 2016, when it passed
2.3 MB[26]. The trend is moving towards larger websites, mostly due to the in-
creased use of images, video and other media to become more visually appealing.
The use of JavaScript has also increased, with most websites using one or more
large frameworks in addition to any other code. This increase in website size has
had a greater impact on mobile device users, as faster mobile network connections,
such as the 4G network, are not always available[27].

Trying to find information online while using a mobile device on a data plan can be
costly in terms of data used. Search engines provide only a sentence or two below
each result link, which makes it difficult to assess the content quality of a website
before loading it. This issue is further magnified in nations where the network
infrastructure is weaker and mobile devices are slower.

4.2 Getting Information


This section covers the types of information needs that exist when using mobile
devices, and the issues with how the results are presented.

4.2.1 Types of Information Requirements


The types of searches and decisions made when using mobile devices are often dif-
ferent from desktop searches[28]. Searches made on mobile devices are often done
to help make a quick decision while on the move. Topics that require deeper
research are often delegated to desktop devices, where searching is faster and
often simpler.

4.2.2 Presentation of Data


Search engines present the results from a web search by providing a link to where
the desired information can be found. Users who perform searches on mobile devices
are usually presented with a list of links, each with a descriptive sentence or
two taken from the web page. Compared to a desktop environment, this list is
too large for the smaller screens of mobile devices.

Due to the smaller screens of mobile devices, fewer links can be seen at once, which
further inhibits the user experience. The links are presented in order, with the most
relevant links placed at the top according to whatever algorithm the search engine
uses for ranking. This still requires the user either to trust the search engine
with the top link, or to manually visit sites until the information they are searching
for is found.

The desired information is always at least two clicks away, and even if it is found,
it often comes with additional undesired information. Mobile searches are mostly
made to get quick and convenient answers. Finding the desired information in a
long article or web page is time consuming, which is detrimental to the goal of most
mobile searches.

4.3 Improving Information Gathering


This section covers the solutions that were found from the literature study and how
they could be applied to solve issues.

4.3.1 Reducing Data Usage


As of April 2017[29], images and scripts account for over 88% of the average web-
site's size, and they grow larger every year. HTML documents, which contain the
content, are on average less than 2% of the size of a website. Unless additional
HTML document content is loaded after the static page is fetched, the desired
information of a website accounts for less than 50 kB on average.

By only fetching the content that is required to extract the desired information,
less time is spent waiting for images, fonts and CSS to load, and data usage
can be reduced. This extraction of data can be done manually, or be automated
using web scrapers. Libraries made for extracting information from websites,
such as BeautifulSoup[30], PhantomJS[31] and jsoup[32], are available in many
languages, which reduces the need to implement a custom solution.
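To illustrate the extraction step, the sketch below keeps only the text content of a fetched HTML document. It is a deliberately simplified, regex-based stand-in: a real scraper would use a proper parser such as jsoup, and the class name is hypothetical.

```java
// Illustrative sketch: strip an HTML document down to its text content.
// A production scraper should use a real HTML parser (e.g. jsoup) instead
// of regular expressions; this version only shows the idea.
public class TextExtractor {
    public static String extractText(String html) {
        return html
                // Drop non-content elements (scripts, style sheets) entirely.
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                // Drop all remaining tags, keeping the text between them.
                .replaceAll("(?s)<[^>]+>", " ")
                // Collapse the leftover whitespace.
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

Only the HTML document itself needs to be downloaded for this to work; images, fonts and style sheets referenced by the page are never fetched, which is where the data savings described above come from.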

4.3.2 Time to Resolution


The number of clicks and the time it takes to find the desired information vary from
search term to search term, but whenever a user waits for a page to load, the impact
is large. According to data collected in 2016[33], over half of mobile users will leave
a site if the page has not loaded within 3 seconds, which impacts the time it takes
to answer the user's information requirement. Furthermore, 77% of web pages take
more than 10 seconds to load on the 3G network[33]. The time to resolution is one
of the big issues with mobile searching, and can often be attributed to large ads,
slow sequential requests and over-stylised websites.

4.3.3 Showing Relevant Data


Finding relevant data is the main goal of search engines. It can be difficult for
search engines to know what exactly the user wants to find when performing a
specific search. That is why it is useful to get user feedback on which results are
considered relevant. Applying user feedback further improves the search engine
algorithms[34] that are used to rank websites on their relevancy to a search term.
When each website has been given an individual relevance rank, the websites are
presented on the result page with the most relevant result at the top.

The search links only have a sentence or two of content, which does not reveal
enough about the information behind the link. When presenting the search results
on a result page, introducing an abstract helps the user make a decision on the
website's relevance, and could be enough to solve the information need.

4.4 Challenges and Possibilities: Summary


This section covers a summary of the challenges and possibilities found during the
literature study.

4.4.1 Challenges
Of the many different challenges discovered, special focus was given to the two
most difficult. The performance aspect had to be prioritized in order for the ap-
plication to be considered an alternative to regular search methods. In particular,
the time to resolution was a key performance metric to keep in mind. Displaying
relevant data was the second difficult challenge, due to the subjectivity of different
summary methods.

In the case of performance, it was discovered that the mobile network speed had the
largest impact on web scraping. And while it does not take very long to
scrape just a single web page, the time it would take to scrape several pages after
each other would add up to an unacceptable amount. As the application
needs to present several results from many websites, this challenge had to be solved.

In the case of showing relevant data, finding and extracting the correct content
from web pages proved to be an issue. Because there were no preset rules for a spe-
cific page, the web scraper had to be configured to work on all kinds of pages. Due
to the smaller screen sizes of mobile devices, choosing which sentences to present
is important in order to make good use of the screen space. Choosing the best
sentences is not easy, as the most relevant sentences can be located anywhere on
the page, across many different page layouts.

Other, smaller challenges, such as how data should be stored and how to design a
good UI, were also discovered, but did not require as much in-depth research.

4.4.2 Possibilities
Along with discovering challenges, a number of possibilities were also found.

As web scraping can be slow when scraping several pages after each other,
solutions for this problem were researched. One possible solution that was found
was the use of threads. By taking advantage of a mobile device's different CPU
cores (if it has more than one), threads can be used to scrape several web pages
simultaneously. This could reduce the time to resolution if implemented
effectively.
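A thread-pool sketch of this idea is shown below. It is not the thesis's implementation: fetchPage is a stand-in for the real scraping call, and the class name is hypothetical. The pool size is tied to the number of available CPU cores, as suggested above.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of scraping several pages in parallel with a thread pool.
// fetchPage is a placeholder for the real download-and-scrape call.
public class ParallelScraper {
    static String fetchPage(String url) {
        // In a real application this would download and scrape the page;
        // here it is simulated so the structure can be shown.
        return "content of " + url;
    }

    public static List<String> scrapeAll(List<String> urls) {
        // One worker per available CPU core.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            // Submit every page as an independent task.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetchPage(url)));
            }
            // Collect the results; iterating the futures in submission
            // order preserves the order the urls were given in.
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    results.add(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Since the tasks are dominated by network waits rather than CPU work, even a single-core device could benefit from running several fetches concurrently.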

Another challenge was to decide which sentences in a web page's content to present
to the user. A solution that was found for this challenge was the use of automati-
cally created summaries. To be able to create these summaries, research had to be
conducted in the field of summarization methods. By using generic summarization
to give sentences a weighted score based on a scoring algorithm, the more relevant
sentences can be separated from the less relevant ones.

Possible solutions to the smaller problems include creating a database to store all
necessary data, and creating a UI based on the research made on colour theory and
user interactivity.

Chapter 5

Information Gathering Application: Design and Implementation

This chapter provides an overview of the created application and all of its function-
ality. Furthermore, flowcharts and diagrams are presented. The implementation of
each component of the application is described.

5.1 Design of the Application


This section covers the structure and functionality of the Android application, which
is used to gather the results needed to answer the problem statement.

5.1.1 Application Functionality


The Android application is an information gathering tool, developed for this thesis,
that finds and returns information instead of links as the result. The application
was specifically developed with mobile devices in mind, where different issues appear
compared to connected devices. The app focuses on reducing data usage and on
presenting summaries of links that give an insight into the full content.

The main functionality of the Android application is to provide the user with a
search bar where they can enter a search term. After a user has entered a search,
they are presented with a number of search results, ranked based on how relevant
they are. Each search result consists of a text summary and a link to the web page
from which the summary was generated. The user can then either expand the
summary to read more, visit the link, or affect the relevance ranking by swiping
left or right on the result. From this page, the user can either perform a new search
or get a larger summary, which consists of all the results that were swiped right.

5.1.2 Web Scraping for Information
The implementation of this search application makes use of the already indexed
web provided by search engines such as Google. In order to gather the relevant
information from the found links, the links were web scraped to fetch the HTML
documents and retrieve the text from the desired elements.

5.1.3 Application Structure


An overview of the package structure is depicted in figure 5.1. The ViewActivity
package contains the classes that represent the different pages in the application.
These classes implement the functionality of both a view and a controller, because
the code that describes the user interface and the code that handles user interaction
lie in the same class. The model package contains a package called DatabaseHandler,
which is responsible for connecting to a database, and for fetching and updating
data. The model also contains the WebScraping package, which contains the classes
that scrape links for data and generate the summaries that are presented in the view.
When a search is performed, the ThreadSearch class performs multithreaded web
scraping, and summaries are generated in the ThreadScrapeResult class. Lastly, the
DTO package contains classes which are used to gather and transfer data between
classes.

Figure 5.1: Package Overview of the Application.

5.1.4 Application Flowchart
Start Screen
When the application is started, it fills a list with stop words that will be used by
the summarization algorithm. Entering a search term switches from the MainActivity
to the ResultPage activity, as depicted in figure 5.2.

Figure 5.2: Starting Screen Overview.
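As a rough sketch of this start-up step (the loader class, the file format, and the word list are assumptions, since the thesis does not specify them), the stop words can be read into a set for constant-time lookups during summarization:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopWords {

    // Hypothetical loader: in the app this list would come from a file
    // bundled with the APK; the file name and format are not specified
    // in the thesis, so whitespace-separated words are assumed here.
    public static Set<String> load(String fileContent) {
        Set<String> stopWords = new HashSet<>();
        for (String word : fileContent.split("\\s+")) {
            if (!word.isEmpty()) {
                stopWords.add(word.toLowerCase(Locale.ROOT)); // case-insensitive lookups
            }
        }
        return stopWords;
    }
}
```

Using a HashSet rather than a list makes the later per-word "is this a common word?" check O(1), which matters when every word of every scraped page is tested against the list.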

Before the page loads, it fetches the search term and creates an AsyncTask to perform
tasks on a background thread. Relevant links containing information are web scraped
from a search engine and used to update the database. A thread is spawned for each
link, and data is web scraped and summarized before being collected. The gathered
result is sorted and displayed for the user. A new search can be made by pressing
the New Search button, and a summary of the relevant results is presented via the
Continue button. The flowchart is depicted in figure 5.3.

Figure 5.3: Result Screen Overview.

The relevant summaries are fetched, and the database is called on a background
thread to update which links are relevant. A new search is made by pressing the
New Search button. The final result is depicted in figure 5.4 and the overall
flowchart in figure 5.5.

Figure 5.4: Summary Result Page.

Figure 5.5: Overall Application flowchart.

In App Views
Screenshots from the application are depicted in figures 5.6, 5.7 and 5.8. Further-
more, the colour palette is depicted in figure 5.9.

First of all, depicted in figure 5.6, is the start screen of the application. This is
the initial view presented to the user after starting the application. It includes a
search bar where the user can enter their desired search term, and a button to
initiate a search.

Depicted in figure 5.7 is the result view that the user reaches after initiating a
search. A list of search results, consisting of summaries, is presented. Beneath
each summary there is a link to the source page from which the summary was
generated. The user can swipe left or right on a search result to decide its relevance:
a left swipe declares the result irrelevant, and a right swipe declares it relevant.
The complementary colours red and green are used to represent the two choices.
The view includes a button that returns the user to the start screen for the purpose
of making a new search. After one or more results have been swiped as relevant, a
button allows continuing on to view the collected summaries that were deemed
relevant.

In figure 5.8, the summary view of the application is depicted. This is the view
that is reached when continuing from the result screen. This view presents the cho-
sen summaries from the search results on a scrollable page. There is also a new
search button to initiate a new search.

Lastly, the chosen colours are presented in figure 5.9. The application has a theme
that consists of shades of green that are analogous on the colour wheel. The text is
black on white, which has high contrast and is easy to read. To clarify which swipe
direction indicates a relevant or irrelevant result without providing instructions, the
complementary colours red and green are used.

Figure 5.6: Application Start Screen. Figure 5.7: Application Result Screen.

Figure 5.8: Application Final Page. Figure 5.9: Colour Palette.

5.2 Implementation
This section covers the implementation of the mobile application.

5.2.1 Web scraping for Data


To gather data requested by a user, a data gathering method had to be chosen.
The use of a web scraper was an obvious choice, but some consideration was
required when it came to implementing it. Writing a web scraper from scratch was
considered, but was deemed outside the scope of this thesis. Instead, the web
scraping library jsoup was chosen, because it is written in Java and thus simple
to incorporate.

In order to display relevant data, Google was used as a source for finding relevant
links. This was achieved by fetching the links from a Google search query, as
depicted in listing 5.1. By inspecting the HTML source code, the links were found
to be children of <h3> elements with the class name r.
Document doc = Jsoup.connect("https://www.google.se/search?q="+searchTerm)
.get();
Elements searchLinks = doc.select("h3.r > a");
Listing 5.1: Code for fetching relevant links

Links that contain relevant data could now be identified and scraped for their con-
tent. Scraping the content of a website required the application to wait for a TCP
connection to be established before a GET request could be made. The time it
takes to establish a connection and start downloading content from a web server is
substantial, so fetching data from several sources sequentially was not an option.
To circumvent this issue, a thread was created for each instance where HTTP
requests were made. The handling of threads and data was achieved by using an
ExecutorService from the java.util.concurrent package.

An ExecutorService provides methods for setting the number of threads to be run
and for invoking functions on threads. Example code showing how to initiate the
ExecutorService is depicted in listing 5.2, where numOfThreads is equal to the
number of links to web scrape.
ExecutorService executor = Executors.newFixedThreadPool(numOfThreads);
Listing 5.2: Setting the number of threads to use.

By providing a list of Callable tasks to the ExecutorService, the executor can
invoke all methods and gather the results once the threads are done. The results of
the callable tasks are saved in a list and can then be further processed. This
simplifies thread synchronization and allows all threads to finish before proceeding.
The results from the executor are saved in a list as depicted in listing 5.3.
The results from the executor are saved in a list as depicted in listing 5.3.
List<Future<ThreadScrapeResult>> futures = executor.invokeAll(callableTasks);
Listing 5.3: Gathering the results from multiple threads.
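The thesis does not list how the Callable tasks themselves are built, so the following is a sketch under that assumption: the per-link work is simulated by a placeholder method standing in for ThreadScrapeResult, while the pool setup and the invokeAll/Future collection mirror listings 5.2 and 5.3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScrapePool {

    // Stand-in for the per-link work done by ThreadScrapeResult in the
    // thesis; the real task web scrapes the link and builds a summary.
    static String scrapeAndSummarize(String link) {
        return "summary of " + link;
    }

    // One Callable per link, executed by a fixed pool sized to the
    // number of links, mirroring the thread-per-link design.
    public static List<String> scrapeAll(List<String> links) {
        List<Callable<String>> callableTasks = new ArrayList<>();
        for (String link : links) {
            callableTasks.add(() -> scrapeAndSummarize(link));
        }
        ExecutorService executor =
                Executors.newFixedThreadPool(Math.max(1, links.size()));
        List<String> results = new ArrayList<>();
        try {
            // invokeAll blocks until every task has finished, which is
            // what lets the app wait for all threads before sorting.
            for (Future<String> f : executor.invokeAll(callableTasks)) {
                results.add(f.get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
        return results;
    }
}
```

Note that invokeAll returns the futures in the same order as the submitted tasks, so the summaries come back in the order of the scraped links regardless of which thread finishes first.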

5.2.2 Storing and Updating of Data
In order to keep track of which results were deemed relevant, a database was used
to store the relevancies related to each search result. By keeping track of which
summaries were swiped left or right, each summary's relevancy score was updated,
and results could be presented in order of relevance with the highest first. Performing
a complete search requires two connections to the database.

Using the database required a web server hosting PHP files. The PHP files handle
requests from the application and were used to perform the queries. Setting up
the connection was done by including a PHP file with the configuration for con-
necting to the database, as depicted in listing 5.4. The queries were performed
through PHP instead of by calling the database directly, for security reasons: mobile
applications can be decompiled with readily available software, which would expose
a bundled database configuration file.
$host = "localhost";
$dbname = "kandidat";
$username = "root";
$password = ******;
$connection = new PDO("mysql:host=$host; dbname=$dbname",$username,$password);

Listing 5.4: Database configuration for PHP.

When Entering a Search


When a new search was made, there were a couple of terms that needed to be saved.
The search term was needed to identify separate searches. The domains and full
URLs from the initial web scrape were required to identify search terms that give
several results from the same domain. The links from the resulting web scrape were
used to prepare a query, as depicted in listing 5.5. The queries were issued by sending
a POST request to the web server with the information to upload.
for (int i = 0; i < links.size(); i++) {
    builder.appendQueryParameter("searchUrl" + i, links.get(i));
    builder.appendQueryParameter("domainUrl" + i, domains.get(i));
}

Listing 5.5: Query used to update the database with URLs and domains.
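Android's Uri.Builder, used in listing 5.5, handles the URL encoding internally. As a plain-Java illustration of what the resulting POST body looks like (the helper class and method are hypothetical; only the searchUrlN/domainUrlN parameter names come from the listing), the same body can be assembled with java.net.URLEncoder:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

public class PostBody {

    // Builds an application/x-www-form-urlencoded body with one
    // searchUrlN/domainUrlN pair per scraped link, as in listing 5.5.
    public static String build(List<String> links, List<String> domains) {
        StringBuilder body = new StringBuilder();
        try {
            for (int i = 0; i < links.size(); i++) {
                if (body.length() > 0) body.append('&');
                body.append("searchUrl").append(i).append('=')
                    .append(URLEncoder.encode(links.get(i), "UTF-8"));
                body.append("&domainUrl").append(i).append('=')
                    .append(URLEncoder.encode(domains.get(i), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        return body.toString();
    }
}
```

Encoding every value is what prevents characters such as "/", ":" and "&" inside a scraped URL from being misread as parameter separators by the PHP side.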

The web server would then perform an SQL query and update the database with the
new information. In order to prevent security issues such as SQL injection, user
input was validated using built-in functions such as parameter binding, as depicted
in listing 5.6.
$stmt->bindParam(':search', $_POST['searchUrl'.$x]);
$stmt->execute();

Listing 5.6: Binding input parameters to prevent SQL injections.

After the information was uploaded and updated, the relevance of each link was
fetched and returned from the web server.

After Choosing which Links are Relevant
Following a successful search, users were able to swipe either left or right to rank
the relevancy of a summary. To update this information, the database was queried
with the information required to identify which summaries were ranked and what
score they got. This was used to update how relevant a summary was for each
search term, which would change the order of future results, based on relevancy.
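The thesis does not give the exact relevance formula. Given the numOfHits and noOfSearches counters in the database, one plausible sketch (an assumption, not the application's actual code) is the ratio of right swipes to total swipes, with never-ranked summaries treated as neutral:

```java
public class Relevance {

    // Assumed scoring: rightSwipes and totalSwipes stand in for the
    // per-(URL, search term) counters stored in the database; the real
    // formula is not listed in the thesis.
    public static double score(int rightSwipes, int totalSwipes) {
        if (totalSwipes == 0) {
            return 0.5; // never-ranked summaries treated as neutral
        }
        return (double) rightSwipes / totalSwipes;
    }
}
```

Under this reading, the roughly even split of swipes reported in section 6.2 (92 relevant against 91 irrelevant) would produce average scores near 0.5, which is consistent with the reported average of 0.499213.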

Database Design
The database was designed to contain all the information necessary to identify and
score each summary. The url, domain and searchterm columns were used to identify
a unique summary, without having to store the actual text. Furthermore, additional
columns such as numOfHits and noOfSearches were used to collect user data, as
depicted in figure 5.10.

Figure 5.10: Database Design.

5.2.3 Creating a Summary


Displaying relevant information required the result to contain text related to the
search term. Furthermore, the results had to be short enough to be useful while on
the move, while still containing the necessary information. To achieve this, an algo-
rithm was created that extracts the most relevant sentences from a text. Identifying
the most relevant sentences was achieved by assigning scores and identifying key-
words. These keywords were identified by taking the most used words from the
text that were not part of the list of common stop words. After the text had been
fully analysed and scored, the highest rated sentences were returned, in the order
found in the text. Each sentence was given a weighted score based on:
How many words in the sentence were also found in the search term.
How long the sentence was compared to an ideal sentence length.
Where the sentence was found in the text.
The sentence's keyword density.
A score based on how common the keywords found in the sentence were.
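As a sketch of how such a weighted score could be computed, the factors above can be combined linearly. The weights, the ideal length of 20 words, and the folding of the keyword-commonness factor into the density term are all assumptions; the thesis lists the factors but not their values.

```java
import java.util.Locale;
import java.util.Set;

public class SentenceScore {

    // Weighted sentence score combining the factors listed above.
    // All numeric constants here are illustrative assumptions.
    public static double score(String[] sentenceWords,
                               Set<String> searchTermWords,
                               Set<String> keywords,
                               double positionInText) { // 0.0 = start, 1.0 = end
        int total = sentenceWords.length;
        if (total == 0) {
            return 0.0;
        }
        int termHits = 0;
        int keywordHits = 0;
        for (String word : sentenceWords) {
            String w = word.toLowerCase(Locale.ROOT);
            if (searchTermWords.contains(w)) termHits++;
            if (keywords.contains(w)) keywordHits++;
        }
        double termScore = (double) termHits / searchTermWords.size();
        double lengthScore = Math.max(0.0, 1.0 - Math.abs(total - 20) / 20.0);
        double positionScore = 1.0 - positionInText; // earlier sentences rank higher
        double densityScore = (double) keywordHits / total;
        return 0.3 * termScore + 0.2 * lengthScore
             + 0.2 * positionScore + 0.3 * densityScore;
    }
}
```

Scoring every sentence this way and keeping the top-ranked ones, emitted in their original text order, yields the extractive summary described above.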

Chapter 6

Information Gathering
Application: Evaluation

This chapter covers the application's performance as well as user feedback.

6.1 Presentation of the Results


The following section presents the performance metrics gathered from using the
application. The tests were performed using an emulator on three different mobile
devices, with different levels of performance. The search terms were chosen to test
results that consist of large and word dense pages, as well as smaller and more
compact texts.

What is Brexit?: This search term was the third most searched on Google
during 2016, and the pages were on average 1.78 MB in size across the tested
devices.

Theory of Relativity: This search term was chosen to test the perfor-
mance on more mobile-friendly web pages, with an average page size of 0.63 MB
across the tested devices.

6.1.1 Performance Metrics


Measuring the performance of the application was done by testing several different
metrics.

Performance: Speed
The speed of the application was measured for each function, from entering a search
term until the result is displayed. The time that passes from pressing the search
button until retrieving the result consists of four major functions that gather and
operate on data. The performance metrics display these individual functions and
how much of the total time they account for. Times were measured in nanoseconds
and are displayed in milliseconds.

GoogleScrape: The function that performs a web scrape on Google to fetch
relevant links for a search term.
QueryTime: The time it takes to update the database with the new results
and receive an answer.
ThreadSearch: The function that splits the workload into threads and gathers
the summarized data from each web scraped link.
Avg Summarize: The average time it took for a device to create a summariza-
tion.
Max Summarize: The longest time it took for a device to create a summariza-
tion.
In figure 6.1, the results for the search term What is Brexit? are displayed. The
difference between the results of the smartphones was small. On average, 89.356%
of the time was spent web scraping, while creating the summary accounted for less
than 5%.

Figure 6.1: Time metrics for search example 1.


In figure 6.2, the results for the search term Theory of Relativity are displayed.
Once again there were no significant differences in performance. On average, 92.63%
of the time was spent web scraping, with less than 5% used for summarizing the
results.

Figure 6.2: Time metrics for search example 2.

Performance: Single vs Multiple Threads
Two different implementations of the application were tested, to see how long it
takes from performing a search to receiving the result. One version made use
of multiple threads to web scrape and create the summaries, while the other used
a single thread. The average time was computed by taking the average of 30
searches with caching disabled. The single threaded version took on average 8485 ms,
while the multithreaded version took on average 2381 ms. The single threaded
application was on average 3.36 times slower, as depicted in figure 6.3. The depicted
time is the sum of the functions shown in figure 6.1.

Figure 6.3: Performance difference between single and multithreaded application.

Performance: RAM
The amount of RAM used by the application was measured by performing a search
and recording the peak RAM usage, as depicted in figures 6.4 and 6.5. The RAM
usage was low throughout the testing and no memory leaks were discovered.

Figure 6.4: Allocated used RAM.

Figure 6.5: Unallocated available Memory.

Performance: Data
Measuring the amount of network data used was achieved with the IDE's built-in
network monitor tool. The average data used for a search was measured both
through the application and through a normal web browser. Browser data was
measured by searching for a search term, clicking a link and letting the page fully
load; this was repeated for the same links scraped by the application, and the
average data usage was computed. The application data was measured by performing
a search and retrieving the summaries. On average, the application reduced the
data amount by 60-80%, depending on the search term and mobile device, as
depicted in figures 6.6 and 6.7.

Figure 6.6: Network Data saved for Search Example 1.

Figure 6.7: Network Data saved for Search Example 2.

6.2 App Usage


This section covers the data gathered from tests performed by people taking a sur-
vey. The survey had 18 participants who filled out a questionnaire, and an additional
number of people who tried the application. The testing group consisted mostly of
students between the ages of 20 and 30, with decent to good knowledge of mobile
applications.

6.2.1 App Usage
From the data collected, the following results were obtained:

750 unique summaries were created.

183 summaries were ranked.

91 summaries were deemed irrelevant.

92 summaries were deemed relevant.

80 unique search terms were searched for.

440 unique domains were found.

6.2.2 Relevance Results


From the pool of people partaking in the survey, relevance ratings were associated
with URLs for specific search terms. There were 80 unique search terms entered,
and the average search term was entered 2.8250 times. Out of the 750 unique
summaries, 24.4% were swiped. This resulted in the average summary having a
relevance score of 0.499213.

6.3 Survey Results


A survey was conducted in order to get user feedback and search data. While the
results were too biased to be of scientific use, they showed which parts of the
application could be improved and which parts were good. The bias was due to the
survey group consisting of friends and acquaintances, who would most likely give
higher scores. The questions had both a scoring system from 0-5 and a text field
for answers.

6.3.1 How Relevant were the Summaries?


The feedback from the survey indicated that the summaries were good for the most
part, with 55.6% of the votes giving the summaries a 5/5 and 33.3% giving a 4/5.
The mean score was 4.28, which indicates that the summaries were mostly good.
The standard deviation of the answers was 0.75593.

6.3.2 Did the Swipe Functionality Positively Impact the Experience?

The swipe functionality had a larger spread of scores, with 33.3% giving the score 2
and 44.4% giving the score 3. The standard deviation was 1 and the mean score
was 3. The result indicates that the swipe functionality could improve the results
over time, but had flaws.

Chapter 7

Discussion

This chapter covers the discussion of the different methods, solutions and prob-
lems encountered during this project. The problem statement is revisited and
discussed. Furthermore, the repercussions of the application are discussed from a
sustainability and ethics viewpoint.

7.1 Methodology and Consequences of the Study


This section covers the applied methodologies used to evaluate and answer the prob-
lem statement. It also covers the consequences of the study.

7.1.1 Methods
The following methods were used during the project.

Literature study
Investigating possible solutions for the problem statement required background
knowledge in a wide set of topics. Because there was a lack of experience with
some of the technologies required to develop the Android application, a literature
study was performed to gain the required knowledge. Furthermore, presenting the
data in a way that would promote faster access to the desired information was
difficult, because no obvious solutions existed. Performing the literature study was
more a necessity than a choice.

Interview
The reason for doing the interview was partly to gain insight into what best practices
exist, but also to settle some uncertainties that arose during the early stages of
designing the application. The interview did not cover many questions regarding
Android applications, but it proved invaluable for managing the project in terms of
time and scope.

Design and implementation
Some of the minor issues were UI related. While most modern web pages are de-
signed to work well on mobile devices, they are often harder to navigate than their
desktop counterparts. The more major issues appeared due to performance and
network data limits. Because most modern websites make use of images and large
JavaScript frameworks, loading web pages consumes a lot of data and processing
power. We chose to apply the MVC architectural pattern and evaluate our work
formatively. The reason for choosing this design pattern was mostly previous ex-
perience developing with MVC. Because Android development was a new concept
that had to be learnt for this thesis, it was beneficial to apply already known con-
cepts. By using the MVC pattern, and evaluating the work formatively, it became
easier to modify and rewrite a layer of the application without affecting other
layers. This was essential for the development cycle, where the design often had to
be changed. One issue with using MVC was that it was difficult to define what
the control layer was. The view of an Android application acts more as a
view-controller hybrid, which reduces cohesion and the amount of code that can
be reused. Other design patterns were considered, such as MVVM and more
Android specific patterns.

Evaluation
Formative Evaluation: Formative evaluation was used during the design
and implementation process. This proved very useful for developing an appli-
cation that was constantly changing, as decisions to change the application's
functionality were made based on new incoming information.

Heuristic Evaluation: Presenting the data to the user required us to con-
sider many aspects of what makes a design presentable without being dis-
tracting. We considered many different implementations, and applied differ-
ent heuristics to promote a faster and better understanding of the application's
results. We chose to apply heuristic evaluation because we found that poorly
designed visuals could affect how well the information was received.

Summative Evaluation: In order to decide if the application could be used
to solve the problem statement, the results had to be evaluated both from
the users' experience and from collected data. We chose to apply a
summative evaluation in order to decide if the application was able to achieve
the performance metrics we desired, and if the results were relevant. The
evaluation was performed by measuring data and asking users for feedback.
The evaluation gave us some clear answers on which aspects worked and what
could be improved.

7.1.2 Consequences of the Study


From the results collected and the feedback received from the target group, we found
that presenting information in this way, could certainly be an attractive alternative
for some information needs.

7.2 Problem Statement Revisited
The following questions were the problem statements of this thesis:

In which way can a web scraper be used to collect relevant data on a subject?
How can the collected data be stored and analyzed?

In which way can an Android application use a web scraper for data gather-
ing? How can the collected data be presented to promote easy access to the
desired information?

Each of these questions is answered individually below.

In which way can a web scraper be used to collect relevant data on
a subject? The solution we came up with was to make use of the already
indexed web provided by the search engine Google. By using an already
established search engine for indexing and finding relevant links, we could
use those links as a source for extracting information. This was done in
order to reduce the scope of the project, and because a better way of finding
relevant links was deemed unachievable.

How can the collected data be stored and analyzed? A decision was
made to store and update data on a web server. This was done to get easier
access to metrics, such as which search terms were popular and what was
deemed relevant. By having users update the same information, a collective
effort was made to push more relevant summaries to the top. From the results
gathered through the questionnaire, most users found that this could improve
the search results over time. Preventing vote manipulation was considered,
but would have expanded the scope of the project too much. By collecting
various data regarding searches, it was made simple to gather and analyze
how well different aspects of the application worked.

In which way can an Android application use a web scraper for data
gathering? Gathering data in an effective manner was achieved by utilizing
threads, which reduced the time spent waiting for HTTP requests. Since
Android applications are built using Java, several web scraping libraries were
available. The web scrape was performed with an existing library, rather than
our own implementation, due to the improved performance. The initial
implementation, which only used a single thread, proved to be much slower
than a search made on a regular search engine. Since most Android devices
can utilize several cores, multithreading became a good solution for improving
the search performance, as shown by the measured results.

How can the collected data be presented to promote easy access
to the desired information? By sorting all the gathered links by their
relevancy score, the most relevant result was presented at the top of the list
of results. With the most relevant results at the top, the time to resolve a
search could be shortened, hiding irrelevant results further down. An issue
with finding information online is that the text often has filler sentences or is
too long. In order to make the information easier and quicker to understand,
we decided to present the user with summaries. These summaries were long
enough to solve the information need, but short enough to be understood
quickly. In case a summary wasn't detailed enough, each summary had a link
to the source page. According to the user feedback, the summaries generated
by the application were overall rather good. Most of the negative feedback
was due to the swipe functionality being considered too sensitive. Overall,
the feedback collected indicated that the application could definitely see some
use, but would require more work in order to smooth out some minor issues.

7.2.1 Design Decisions


Designing and implementing the application came with many tough decisions. While
developing the application, compromises were made between performance and size,
but some issues arose for other reasons.

Caching Summaries
A decision was made not to cache the summaries. Caching the summaries would
have reduced the time it takes to search for the same term from a couple of seconds
to less than a second. Still, caching was not included, due to the possible copyright
issues of storing website data on our server. Even though the content that would
have been saved consists of summaries, we decided not to take any risks.

Choice of Platform
The reasons for developing this application on the mobile platform were many. The
problems in the problem statement are amplified on mobile devices compared to
desktop devices, due to the shortcomings of mobile technology, such as limited battery
and network data. Mobile devices account for more than 50% of searches, and that
share is only increasing. Thus it made sense to us to target the platform where the
benefits would be seen most clearly.

7.3 Ethical Aspects
During the literature study, certain issues regarding the unethical behaviour of web
scraping software were discovered. This section covers the discussion of these
ethical topics.

7.3.1 Lost Clicks and Ad Revenue


As hosting websites can be very expensive, the use of ads on websites is a widespread
phenomenon. Many websites also rely on their ad revenue as their main source of
income. When a website wants to gather statistics on how many users browse its
pages, it usually counts the number of clicks made by visitors. When a web scraper
gathers the data directly from a website, the website does not get any user
information or clicks. The web scraper also bypasses any ads presented on the
website, as there is no browser to run scripts or show the ads. Websites gain little,
if any, revenue from being scraped by a web scraper, compared to a user visiting
the website.

And while the website gets no revenue from the web scraper, it still has to pro-
vide the scraper with files, which puts a load on the website's server. This creates
the ethical issue of not giving anything back to the creator of the content. If all
search engines tried to present the data of a search result better than the actual
website the data is hosted on, there would be less incentive for creators to publish
their data on ad-driven free websites. A consequence of this could be a decrease in
overall free content on the internet and a growth in content behind paywalls. To
try to help content creators, we make sure to always link to the source page when
presenting a search result.

7.3.2 Information and Copyright issues


Throughout history there have been many legal cases involving companies suing
or being sued over the use of web scrapers. Web scrapers are legal in most countries,
but come with rules that have to be followed in others. Web scraping remains in
a legal grey area, because the technology is difficult to interpret in terms of
jurisdiction. The information that is gathered by web scrapers can be subject to
copyright laws, depending on how it is used. When the data is presented in a
transformative manner, as in our application, there are no grounds for copyright
infringement, but the practice can still be considered unethical.

7.3.3 Anti web scraping Industry


Due to the legal grey area in which web scraping lies, a whole industry has been
created just to combat it. There are many different ways a website can try to
protect itself from web scrapers, but this requires the ability to differentiate
between a program and a human. This has proven to be a difficult challenge, and is
one of the reasons large companies still pursue legal cases.

7.4 Sustainability
This section covers the possible effects the thesis could have on sustainability if the
application is used or the ideas applied.

7.4.1 Effect on Environment


The positive effect our application would have on the environment, if people started
using it, might seem negligible at first, but it could potentially reduce power and
data consumption by a large amount. The samples used in our results reduced
the data usage by more than 50% compared to visiting a website. In an ideal
scenario, this would reduce the amount of battery a mobile device uses and the
workload of servers. To have any real effect, it would require a large part of the
population switching from getting their information through interactive and heavy
websites to the rather lightweight and minimal application. Overall, the application
does not have any impact on the environment in its current form, but the ideas
could be applied by larger companies to make an impact on energy usage.

7.4.2 Economic Sustainability

The economic sustainability of websites could be harmed by applications such as
ours, which use resources without giving anything back. The application itself was
not designed to be monetized; thus it could not scale to provide service to many
users without costing a lot to run.

Chapter 8

Conclusions

This chapter concludes what was achieved throughout the thesis and what future
implications the results could hold.

8.1 Summary
This study set out to investigate how problems related to information gathering on
mobile devices could be solved using technologies such as web scraping. Based on
the results of our literature study, issues that occur when searching for infor-
mation on mobile devices were identified. These issues include long loading times,
small and difficult to use interfaces, large website sizes, and the difficulty of
presenting data to users in a mobile friendly way.

With these issues in mind, an application was designed to improve the information
gathering process on mobile devices, using web scrapers to gather the information
and rethinking the way it is presented. An application implementing these ideas
was built, and the results gathered from it were presented and analyzed. The user
feedback indicated that the application achieved its goal of quickly presenting
relevant information, but that it was inconsistent and would require further work
before it could replace regular search engines.
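The summarization step described above can be illustrated with a small, self-contained sketch. The class and method names here are hypothetical, and the scoring is deliberately naive (a sentence's score is the sum of its word frequencies, with no length normalization); this is not the application's actual implementation:

```java
import java.util.*;

// Hypothetical sketch of frequency-based extractive summarization:
// score each sentence by the frequency of the words it contains and
// keep the n highest-scoring sentences, in their original order.
public class FrequencySummarizer {

    public static List<String> summarize(String text, int n) {
        String[] sentences = text.split("(?<=[.!?])\\s+");

        // Count how often each lower-cased word occurs in the whole text.
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum);
            }
        }

        // Score each sentence as the sum of its word frequencies.
        double[] scores = new double[sentences.length];
        for (int i = 0; i < sentences.length; i++) {
            for (String word : sentences[i].toLowerCase().split("\\W+")) {
                scores[i] += freq.getOrDefault(word, 0);
            }
        }

        // Pick the indices of the n highest-scoring sentences.
        Integer[] order = new Integer[sentences.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        Set<Integer> keep = new TreeSet<>(
                Arrays.asList(order).subList(0, Math.min(n, order.length)));

        // Emit the kept sentences in their original order.
        List<String> summary = new ArrayList<>();
        for (int i : keep) summary.add(sentences[i]);
        return summary;
    }

    public static void main(String[] args) {
        String text = "Web scraping collects raw text from pages. "
                + "Mobile screens are small. "
                + "Summaries reduce the need to visit the source page of the raw text.";
        // Prints the single highest-scoring sentence of the sample text.
        System.out.println(summarize(text, 1));
    }
}
```

Because the score is a plain frequency sum, longer sentences tend to win; a production summarizer would normalize by sentence length and filter stop words, but the sketch conveys the core idea of ranking and extracting sentences rather than generating new text.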

The different results of the thesis were discussed, and the authors concluded
that the technical aspects of the application could be improved to a point where
it could potentially serve a specific target group with a need for raw text
gathering. But while the technical side showed potential, the application
remained hindered by the ethical problems of web scraping.

8.2 Future Research


While this thesis gives an introduction to the implications of using web scraping
on mobile devices, and to web scraping in general, there is much more to investi-
gate. Firstly, the ethical aspects of web scraping are only briefly touched upon
in this thesis, yet they are a major topic of discussion in the IT industry.
Further research is needed before any agreement can be reached on what is and is
not allowed. If legal grounds cannot be established because opinions differ too
widely, an official list of guidelines on how to use web scrapers should be
created.

Secondly, the presentation of the data could be improved. With a team of
designers working together with the developers, the presentation of the search
results and their summaries could become both attractive and easy to read. If
implemented well, this approach could in the future compete with traditional
methods of data gathering.

Lastly, the storage of the data would have to be expanded and improved if the
application were to be made available to, and used by, a large number of people.
With a server architecture that can handle many concurrent users, the issues of
storing and managing data would be minimal, since each client-to-server
interaction is small.
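The small client-to-server interaction argued for above can be sketched with the JDK's built-in HTTP server: the client sends only a short search term and receives a compact, already-summarized payload. The `/search` endpoint path and the JSON response shape are illustrative assumptions, not the application's actual API:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a stateless search endpoint where each
// request and response stays small, keeping per-user server load low.
public class SearchEndpoint {

    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/search", exchange -> {
            // The query string carries only the search term, e.g. ?q=term
            String query = exchange.getRequestURI().getQuery();
            String term = (query != null && query.startsWith("q="))
                    ? query.substring(2) : "";
            // A real implementation would scrape and summarize here;
            // the response stays small either way.
            byte[] body = ("{\"term\":\"" + term + "\",\"summary\":\"...\"}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Since every exchange is a short term in and a condensed payload out, the server holds no per-client state, which is what would make scaling to many concurrent users plausible.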

If all these issues were solved, the application could have potential as a
research tool, collecting and condensing data on more scientific topics and
presenting it to the user as, for example, a summarized report.

TRITA-ICT-EX-2017:61

www.kth.se
