
Int. J. Web Science, Vol. X, No. Y, xxxx

How to benefit from small samples of web queries?

Karl Petrič*, Teodor Petrič and
Vladislav Rajkovič
Ministry of the Interior,
Special Library,
1000 Ljubljana, Slovenia
and
Department of German Studies,
Faculty of Arts,
2000 Maribor, Slovenia
and
Department of Informatics,
Faculty of Organisation Sciences,
1000 Ljubljana, Slovenia
E-mail: karl.petric@gov.si
E-mail: teodor.petric@um.si
E-mail: vladislav.rajkovic@ijs.si
*Corresponding author

Abstract: The basic goal of this article is to demonstrate the process of
detecting useful ideas from small web query samples by combining different
software tools. Our sample contained web queries made by users of the
Special Library of the Ministry of the Interior (hereinafter SLOMI). The
process of extracting useful ideas from our sample was partially guided by
statistical analysis, partially by rather intuitive and collective descriptive
evaluation methods. Several rich visualisation techniques were used to
stimulate creative data interpretation. The collected data on the queried persons
and their areas of professional interest in our sample were implemented into a
customised thesaurus and a mind map in order to extract useful ideas.

Keywords: web usage mining; information retrievals; persons; idea discovery;
social networks; thesauri; mind mapping.

Reference to this paper should be made as follows: Petrič, K., Petrič, T. and
Rajkovič, V. (xxxx) ‘How to benefit from small samples of web queries?’,
Int. J. Web Science Vol. X, No. Y, pp.xxx–xxx.

Biographical notes: Karl Petrič is an Information Librarian-Scientist at the
Special Library of the Ministry of the Interior, Slovenia. He received his
Master’s degree in 2005, and PhD in 2008, both from the University
of Ljubljana, Faculty of Computer and Information Science, Slovenia. His
research interests include knowledge management, data mining, text mining,
knowledge discovery, and educational information systems. He is cooperating
in projects on e-archiving, knowledge intranet portal, and in the preparation of
university teaching materials for German phonology.

Copyright © 200x Inderscience Enterprises Ltd.



Teodor Petrič is an Associate Professor of German Linguistics at the University
of Maribor, Faculty of Arts. He received his Master’s in German Linguistics in
1990, and PhD in Linguistics in 1995, both from the University of Ljubljana,
Faculty of Slovenia. His research interests include naturalness theory, first and
second language acquisition theory, translation tools, study of child language,
text mining, and educational information systems. He is cooperating in projects
on language acquisition, natural linguistics, phraseology, and translation
documentation systems.

Vladislav Rajkovič is a Professor of Management Information Systems, Faculty
of Organisational Sciences, University of Maribor and research fellow,
Department of Intelligent Systems, Jozef Stefan Institute. He received his BSc
from the Faculty of Electrical Engineering, University of Ljubljana in 1970 and
MSc from the Faculty of Electrical Engineering, University of Ljubljana in
1975. In 1987, he received his PhD from the Faculty of Electrical Engineering
and Computer Science, University of Ljubljana. His main research interests are
information systems and their application for decision support, artificial
intelligence methods for decision support, knowledge management for decision
support, etc.

1 Introduction

Web queries contain conscious and unconscious messages about collective information
needs or desires, either globally or within certain organised groups, and these messages
can be revealed by a web traffic analyst. Anyone who can read these aspirations has the
ability to create collective intelligence and business intelligence. Essentially, exploring
users’ activities with web usage mining methods and visualisation techniques is a process
of transforming data into information and subsequently into knowledge. Methods in the
field of web usage mining usually require large amounts of data so that the researcher can
realise specific purposes. The purposes of these researchers, or rather developers of useful
ideas, can be very different (e.g., developing software algorithms, improving
services). In this paper we try to detect useful ideas from a very small sample of web
queries which were entered by web users on different search engines and then directed
the users to the websites of SLOMI. In our opinion, classifying data is essential for
complex analysis. Complex analysis without a classification system leaves
a relatively large gap in how the environment is observed, with its entities (e.g.,
persons, organisations) and activities (e.g., searching for solutions, making mistakes).
Classifying things, human beings, animals and plants is a basic cognitive
process (just like analysing) for learning about nature and society. We classified the
data in two ways (by entity and by user intention). It is very
difficult to extract useful ideas from a small sample of data. This problem cannot be
solved acceptably by using only statistical methods, so we also used qualitative methods
(e.g., thesaurus construction and mind mapping). We focused our research on queried
persons with the main intention of deriving useful data about them. Surnames and names gave
us the possibility to extract hidden ideas which could be appropriate for improving
services and applications.
This study has two alternative hypotheses and five research questions:

Hypotheses
1 small samples of web queries are sufficient for the extraction of useful ideas
2 big samples of web queries are necessary for the extraction of useful ideas.

Research questions
1 Is it possible to develop useful ideas from small samples of web queries?
2 What kind of useful ideas can we obtain from small samples of web queries?
3 Which techniques can be effectively used to extract useful ideas?
4 Under which circumstances can we benefit from small samples of web queries?
5 Is it possible to derive a key idea about an application system which could improve
an existing working organisation?

Motivation of the study


It is challenging to obtain results from small samples of web queries which are equivalent
to those from big samples. With appropriate methodological tools and knowledge sharing
techniques a lot of time and money could be saved. We are aware of the fact that this is
not always feasible, so it is crucial to find answers to the research questions, especially
the third one.

Objectives
a web query analysis using statistical methods and visualisation techniques
b extracting the professional interest areas of the queried persons in the sample
c developing useful ideas from the creation and detailed analysis of a thesaurus and
mind map.

1.1 Background
Our sample was obtained during the period from 27/11/2008 to 10/01/2010 and consists
of 1,421 different information retrievals with frequencies measured by Google Analytics
and exported to Excel. The research method, influenced by Agosti et al. (2011), includes
the total observation of randomly chosen web users in the above-mentioned time period.
First we sorted the queries intellectually according to our own classification (hereinafter CU,
ranging from 1 to 7). For every query, we also determined the assumed intentions of the
users (hereinafter IU), which were likewise assigned intellectually. For easy presentation
of both schemes we created a table.
Table 1 presents the description of both classification schemes CU and the assumed
IU. The main purpose of CU is to create different networks which will give us knowledge
about the user interests to get information about organisations, persons, intellectual
cultural work, items and materials, sciences, activities, events, processes, etc., and
(important) questions (Alguliev et al., 2007; Aula, 2005). The assumed IU is defined as a
user motive to perform information retrievals on an external search engine (e.g., Google)
with the desired result of choosing the appropriate web page of SLOMI (e.g., to obtain
some factual knowledge). For further research we prepared the data in the following form.
Table 1 Fast presentation of CU and IU

CU                                            IU
CU 1: Sociological systems, organisations,    Desires for general information
      departments, etc.                       (e.g., about persons, organisations, services)
CU 2: Persons in form of names, gender,       Desires for access to different web pages
      functions and status                    (e.g., http://www.mnz.gov.si)
CU 3: Intellectual cultural work comprising   Desires to get information on resources
      books, technical systems, databases,    (e.g., books, journals, bibliographies, databases)
      formulary, innovations, etc.
CU 4: Items, materials, prices, flats, etc.   Desires to get access to factual knowledge
                                              (e.g., crime, terrorism, forensics)
CU 5: Sciences, arts, professions, sports,    Desires of accurate information
      etc.                                    (e.g., exact part of a law, procedure)
CU 6: Activities, processes, procedures,      Desires of professional information
      events and states                       (e.g., standards, reports)
CU 7: Questions (e.g., how to become a        Desires of special information
      member of the Police, how to publish    (e.g., polls, warrants, births, deaths)
      an act)
                                              Desires of empirical knowledge
                                              (e.g., analysis, statistics, researches)
                                              Desire of non-factual knowledge (e.g., tricks,
                                              new methods, useful ideas, intuitive knowledge)

Table 2 shows a part of the prepared data which includes parameters like UIR, F, CU and
IU. Information retrievals on persons are notated in form of acronyms (see Table 2,
e.g., P1).
Table 2 Part of the prepared data including information retrievals of users (UIR), frequencies
(F), CU and IU

UIR                    F      CU  IU
MOI                    1,431  1   General information
Ministry of interior   1,015  1   General information
Special library MOI    272    1   General information
Journal of security    242    3   Resources
e-publications         136    3   Factual knowledge
Cult of victims        136    3   Factual knowledge
P1, MOI                98     2   General information
Library MOI            73     1   General information
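
The prepared data lend themselves to simple aggregation before any network is built. The following is a minimal sketch in Python (not part of the original tool chain, which relied on Excel and ORA Casos) that sums the frequencies F of the rows shown in Table 2 by CU and by IU. The names PREPARED_DATA and total_frequency_by are illustrative assumptions.

```python
from collections import defaultdict

# Rows copied from Table 2: (UIR, F, CU, IU).
PREPARED_DATA = [
    ("MOI", 1431, 1, "General information"),
    ("Ministry of interior", 1015, 1, "General information"),
    ("Special library MOI", 272, 1, "General information"),
    ("Journal of security", 242, 3, "Resources"),
    ("e-publications", 136, 3, "Factual knowledge"),
    ("Cult of victims", 136, 3, "Factual knowledge"),
    ("P1, MOI", 98, 2, "General information"),
    ("Library MOI", 73, 1, "General information"),
]


def total_frequency_by(rows, column):
    """Sum the frequency F (index 1) grouped by the value at `column`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += row[1]
    return dict(totals)


by_cu = total_frequency_by(PREPARED_DATA, 2)  # grouped by classification CU
by_iu = total_frequency_by(PREPARED_DATA, 3)  # grouped by assumed intention IU
```

For the eight rows shown, queries classified as CU 1 account for most of the traffic, which is only a statement about this excerpt and not about the full sample of 1,421 retrievals.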

1.2 Use of software packages, algorithms and visualisation techniques for processing and analysing data
1 Google Analytics: we measured the web user activities on external search engines
which led them to the website of SLOMI.
2 ORA Casos (version 2.0.5) (Wei et al., 2001; Carley et al., 2011): after importing
the prepared data for the parameters UIR, F, CU and IU into the software package
ORA Casos, the data were processed in order to define the network with UIR as
resource, F as action, CU as knowledge and finally IU as belief. In addition, we
identified the sources, targets and attributes of the network for further processing
with two algorithms (correlation similarity and centrality information). The
correlation similarity algorithm computed the similarity of different nodes
(shown in different colours), whereas the centrality information algorithm
computed the affinity of different nodes (presented in the network as larger or
smaller circles). In the next stage we executed the command ‘visualise’ and thus
included a module for the visualisation of the established network. It was further
necessary to determine the strength and orientation of links and the colour and size of nodes.
We presented the networks with organic layouts. Finally, to establish
the abstracted network we used a combination of two algorithms, namely correlation
similarity and eigenvector centrality (which is intended to compute the importance of
different nodes).
3 Ontogen 2.0: we developed the main concepts for data preparation in order to build a
sociological knowledge network.
4 Midos Thesaurus 2000 demo: we constructed a customised thesaurus based on the
professional interest areas of the queried persons.
5 EDRAW 6.5.3: with respect to the thesaurus terms we mapped and extracted the
useful ideas into different groups.
6 Apes tool v2.1.2: we evaluated the different groups of ideas in the form of a complex
matrix.
7 Tableau public edition (version 7): in-depth analysis and visualisation of the matrix.
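
To make the two node measures named in item 2 more concrete, the following minimal sketch computes a correlation similarity and an eigenvector centrality on a toy network. It is a generic textbook formulation and an assumption on our part: the exact formulas implemented in ORA Casos are not described here and may differ. The four-node adjacency matrix ADJACENCY is hypothetical, and similarity is taken as the Pearson correlation between two rows of that matrix.

```python
import math


def correlation_similarity(x, y):
    """Pearson correlation between two equally long connection profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / (norm_x * norm_y)


def eigenvector_centrality(adjacency, iterations=200, tolerance=1e-12):
    """Power iteration on (A + I); the identity shift avoids oscillation."""
    n = len(adjacency)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = [
            scores[i] + sum(adjacency[i][j] * scores[j] for j in range(n))
            for i in range(n)
        ]
        length = math.sqrt(sum(value * value for value in updated))
        updated = [value / length for value in updated]
        converged = max(abs(a - b) for a, b in zip(updated, scores)) < tolerance
        scores = updated
        if converged:
            break
    return scores


# Hypothetical undirected network: A is linked to B, C and D; B is linked to C.
NODES = ["A", "B", "C", "D"]
ADJACENCY = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

centrality = eigenvector_centrality(ADJACENCY)
similarity_a_d = correlation_similarity(ADJACENCY[0], ADJACENCY[3])
```

In this toy network node A receives the highest centrality and the peripheral node D the lowest, while the connection profiles of A and D are perfectly negatively correlated because each is linked exactly to the nodes the other is not.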

2 Web query analysis using statistical methods and visualisation techniques

This section first presents the different networks which were created on the basis of the
preceding classification work for CU 1 to CU 7, with the primary aim of illustrating the
numerous possibilities of discovering interesting patterns and rules within each
conceptual network for further research on all or specific CU (Alai, 2004; Bollen and
Van de Sompel, 2006; Carmagola et al., 2009). The focus of our research interest, however, lies
inside the CU 2 network. Therefore this section is intended to be a starting point for
exploring useful ideas on the basis of a small sample of specific web queries (Broder,
2002).

Figure 1 shows the aforementioned networks in ORA Casos. We split the whole
hybrid network (All CU) into seven separate networks, which are as follows:
a the organisational network of web queries (CU 1)
b the sociological network of web queries (CU 2)
c the network of information resources of web queries (CU 3)
d the network of materials, prices, etc., of web queries (CU 4)
e the science area network of web queries (CU 5)
f the event/process and activity network of web queries (CU 6)
g the network of web user questions (CU 7)

Figure 1 Structure of networks for all CU and CU 1 to CU 7 as computed with ORA Casos

In Figure 1, the focus of our research interest in this article (i.e., the network of the
queried persons) is marked with a circle (see Figure 1: CU 2). The sociological
network of web queries CU 2 consists of persons who may or may not be employed at the MOI
(cf., Miao et al., 2010 on opinion mining using user queries). This kind of query also
encompasses the specific functions of a person (e.g., general secretary for MOI,
consultants, security guard, personnel) or even the social role played by some people in
society (e.g., criminal personality, serial killer, illegal migrant, asylum seeker). Based on
this sociological network we extracted different data on the most frequently searched persons
(e.g., contact information, organisation, scientific/professional area, title and/or function
of persons in their organisation) and (by processing correlation similarity and centrality
information in ORA Casos) connected them to the whole sociological knowledge
network, which will be described in the next subsection. Inspired by Papadimitriou et al.
(2011) we developed the idea that the sociological knowledge network (with an
intelligent semantic search engine) could serve well as a dynamic and adaptive
application for the presentation of different people, organisations and useful knowledge,
which could be elaborated in different (research) projects. Before discovering further
useful ideas, we are going to prepare the appropriate concept for the parameters to be
included.
Figure 2 shows a part of the parameter concept. The concept specifies which data
will be collected for the creation of our sociological knowledge network. The main root
concept dealt with in this article is ‘person’, consisting of the following sub-concepts:
• title and/or function of a person (see Figure 2: e.g., F1, F2, F3)
• organisation (see Figure 2: e.g., OR1, OR2, OR3)
• location (see Figure 2: e.g., TE1, TE2, TE3)
• frequency of a web query on a specific person, which will be used as a weight inside
the aforementioned network and finally
• interest, which means the scientific and/or professional area of the persons
(see Figure 2: e.g., marketing HRM, public administration, police crime,
informatics).
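
The root concept ‘person’ and its sub-concepts can be pictured as a simple record type. The sketch below is only an illustration in Python and not the Ontogen representation; the class name QueriedPerson, its field names and the helper shared_interests are assumptions. The example values for E7, E9 and E13 are taken from Table 3, with interest areas split at commas and compared in lower case, which is our own simplification.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class QueriedPerson:
    """One instance of the root concept 'person' with its sub-concepts."""
    acronym: str          # anonymised person, e.g. "E13"
    frequency: int        # F, number of web queries on this person
    organisation: str     # e.g. "OR2"
    title_function: str   # e.g. "F7"
    location: str         # e.g. "TE2"
    interests: List[str] = field(default_factory=list)


def shared_interests(a: QueriedPerson, b: QueriedPerson) -> Set[str]:
    """Interest areas two persons have in common, compared in lower case."""
    return {i.lower() for i in a.interests} & {i.lower() for i in b.interests}


e7 = QueriedPerson("E7", 26, "OR2", "F5", "TE2",
                   ["Police", "police law", "analytics"])
e9 = QueriedPerson("E9", 15, "OR2", "F6", "TE2",
                   ["Domestic violence", "police"])
e13 = QueriedPerson("E13", 9, "OR2", "F7", "TE2",
                    ["Informatics", "knowledge discovery", "text mining"])

common = shared_interests(e7, e9)
```

Keeping the frequency on the person record makes it straightforward to reuse it later as a link weight.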

Figure 2 Part of the parameter concept

This type of ontology (cf., Jurisica et al., 2004 on ontology for knowledge management)
could be used for discovering the similarity between information retrievals on persons
mentioned on different websites (e.g., public administration, universities). The open
source software package Ontogen 2.0 could be very useful in this respect. This kind of
analysis could give us better insight into the expert knowledge needed by web users
(Yang et al., 2007).

2.1 Connecting the data with concern to persons


The next step involves the creation of a table containing all parameters and collected data
on queried persons with the intention to visualise a sociological knowledge network and
to gain useful information from it (cf., Westerski et al., 2010).
Table 3 presents a part of the collected and prepared data on queried persons working
in different organisations, at different locations, with different titles/functions, different
popularity levels (F of queried persons) and different scientific and/or professional
interests.
Table 3 Part of the collected data on queried persons

Person  F   Organisation  Interest                                       Title/function  Location
E1      1   OR1           Marketing, HRM                                 F1              TE1
E2      1   OR2           Public administration                          F2              TE2
E3      1   OR3           Police, crime                                  F3              TE2
E4      4   OR4           Economy                                        F3              TE2
E5      14  OR6           International terrorism                        F3              TE2
E6      20  OR1           Organisational behaviour                       F4              TE1
E7      26  OR2           Police, police law, analytics                  F5              TE2
E8      5   OR7           Jurisprudence                                  F1              TE2
E9      15  OR2           Domestic violence, police                      F6              TE2
E10     98  OR2           Law                                            F2              TE2
E11     7   OR8           Penology                                       F1              TE2
E12     2   OR9           Economy                                        F1              TE2
E13     9   OR2           Informatics, knowledge discovery, text mining  F7              TE2
E14     1   OR10          Economy, army logistics                        F1              TE2

In the next step we processed these data again with the ORA Casos software package.
First we determined the network relations (e.g., person to area, person to organisation,
person to title/function, person to interest) and the weights in the form of frequencies of the
queried persons. After this we determined the link and node properties (e.g., link weight,
link colour, correlation similarity and centrality information of a node).
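
The relations and weights described above can be sketched as a weighted edge list. The following Python fragment is an illustration under stated assumptions, not the ORA Casos procedure itself: every person is linked to their organisation, title/function, location and each interest area, and every link carries the person's query frequency F as its weight. The rows are taken from Table 3; splitting the interest column at commas and lower-casing the terms is our simplification, and build_edges and node_strength are hypothetical helper names.

```python
from collections import defaultdict

# (person, F, organisation, interests, title/function, location) from Table 3.
TABLE_3 = [
    ("E1", 1, "OR1", ["Marketing", "HRM"], "F1", "TE1"),
    ("E2", 1, "OR2", ["Public administration"], "F2", "TE2"),
    ("E3", 1, "OR3", ["Police", "crime"], "F3", "TE2"),
    ("E4", 4, "OR4", ["Economy"], "F3", "TE2"),
    ("E5", 14, "OR6", ["International terrorism"], "F3", "TE2"),
    ("E6", 20, "OR1", ["Organisational behaviour"], "F4", "TE1"),
    ("E7", 26, "OR2", ["Police", "police law", "analytics"], "F5", "TE2"),
    ("E8", 5, "OR7", ["Jurisprudence"], "F1", "TE2"),
    ("E9", 15, "OR2", ["Domestic violence", "police"], "F6", "TE2"),
    ("E10", 98, "OR2", ["Law"], "F2", "TE2"),
    ("E11", 7, "OR8", ["Penology"], "F1", "TE2"),
    ("E12", 2, "OR9", ["Economy"], "F1", "TE2"),
    ("E13", 9, "OR2",
     ["Informatics", "knowledge discovery", "text mining"], "F7", "TE2"),
    ("E14", 1, "OR10", ["Economy", "army logistics"], "F1", "TE2"),
]


def build_edges(rows):
    """Return (person, target, relation, weight) links, weighted by F."""
    edges = []
    for person, freq, organisation, interests, title, location in rows:
        edges.append((person, organisation, "organisation", freq))
        edges.append((person, title, "title/function", freq))
        edges.append((person, location, "location", freq))
        for interest in interests:
            edges.append((person, interest.lower(), "interest", freq))
    return edges


def node_strength(edges):
    """Sum of incoming link weights for every non-person node."""
    strength = defaultdict(int)
    for _person, target, _relation, weight in edges:
        strength[target] += weight
    return dict(strength)


edges = build_edges(TABLE_3)
strength = node_strength(edges)
```

In this excerpt the organisation OR2 and the interest area police accumulate the largest weights, which reflects the frequencies of the persons attached to them.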
Figure 3 shows a part of a real world model of a sociological knowledge network.
The concepts of person, title/function, location and organisation are described with
acronyms (e.g., for person E1, E2, etc., title/function F1, F2, etc., location TE1, TE2, etc.,
and for organisation OR1, OR2, etc.). The concept Interest is visible as text (e.g., police
crime, marketing HRM). We found that the sociological knowledge network shows us
similar and different characteristics of queried people in form of scientific/professional
interests, geographical locations and organised communities where employees are treated
as participants. In short, this type of network gives us new information about the location
of different knowledge and skills (biological, sociological and geographical location)
(cf., Cappelin and Post, 2009; Eggers and Sing, 2009).

Figure 3 Part of the sociological knowledge network

3 Discovering useful ideas by means of thesaurus construction and mind mapping

This section primarily intends to explore useful ideas and knowledge from the
sociological knowledge network based on rather intuitive methods (i.e., the construction
of a thesaurus on the basis of given terms and mind mapping), while taking into account the
findings of the previous sections of this article. First we constructed a customised
thesaurus from the sociological knowledge network with the existing data of the scientific
and/or professional areas of the queried persons (see Table 3). Essentially we obtained a
thesaurus containing the relations between terms which were extracted from the interest
areas of the queried persons. After exporting the thesaurus data into Excel, the different
terms were evaluated on a scale ranging from 1 to 5 (e.g., 1 means not so important,
5 means the most important). These weights were determined on the basis of the
sociological knowledge network where the scientific/professional areas are in closest
connection with the frequencies of the queried persons. A part of the customised
thesaurus will be shown on the next pages in form of a table and a network graph.
Afterwards we will introduce the mind mapping technique for the extraction and
presentation of useful ideas (cf., Pearson and Somekh, 2003 on mind mapping).
Figure 4 shows a part of the customised thesaurus extracted from the
sociological knowledge network. The terms of the thesaurus were extracted from
scientific/professional interest areas of the queried persons. According to the
classification (CC) in the customised thesaurus, we can see different relations between the
descriptor ‘informatics’ and other descriptors which are narrower (NT), equivalent
(UF, USE) and related terms (RT). For our mind mapping purposes the NT and RT
relations are very important because they indirectly show the key areas and relations for
developing useful ideas inside the MOI (e.g., NT Digital libraries, Information systems,
Information solutions, knowledge portals, software, text mining, RT crime, forensics,
international terrorism, migrations, police, sociology, statistics). The aforementioned
table of the thesaurus with the evaluated descriptors is displayed below.
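
The thesaurus relations for the descriptor ‘informatics’ can be written down as a small data structure. The sketch below is an illustration in Python and not the Midos Thesaurus format; the dictionary layout and the function add_reciprocal_relations are assumptions. It relies on the usual thesaurus convention that every narrower term (NT) implies a reciprocal broader term (BT) and that related terms (RT) are symmetric. The NT and RT values are those listed above.

```python
THESAURUS = {
    "informatics": {
        "NT": ["digital libraries", "information systems",
               "information solutions", "knowledge portals",
               "software", "text mining"],
        "RT": ["crime", "forensics", "international terrorism",
               "migrations", "police", "sociology", "statistics"],
    },
}


def add_reciprocal_relations(thesaurus):
    """Return a copy in which each NT has a BT and each RT is symmetric."""
    result = {
        term: {rel: list(targets) for rel, targets in relations.items()}
        for term, relations in thesaurus.items()
    }
    for term, relations in thesaurus.items():
        for narrower in relations.get("NT", []):
            broader_terms = result.setdefault(narrower, {}).setdefault("BT", [])
            if term not in broader_terms:
                broader_terms.append(term)
        for related in relations.get("RT", []):
            related_terms = result.setdefault(related, {}).setdefault("RT", [])
            if term not in related_terms:
                related_terms.append(term)
    return result


full_thesaurus = add_reciprocal_relations(THESAURUS)
```

After this step every term mentioned under ‘informatics’ has its own entry, so the structure can be traversed from either end of a relation when grouping ideas.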

Figure 4 Part of the customised thesaurus extracted from the sociological knowledge network

Table 4 presents a part of the weighted (hereinafter: W) descriptors (hereinafter: DE) and
relations of the customised thesaurus (e.g., TT = top term, CC = classification column).
For a more detailed view of the relations between descriptors we computed a network
graph with the ORA Casos software package. First we determined the link appearance (e.g.,
link scale to weight) and node appearance (e.g., we scaled the importance of nodes by
eigenvector centrality and the node similarity by correlation similarity). Afterwards we
hid the isolated nodes and links with weights less than 2.1, and as a result we obtained a
big picture network with several relations between the terms (e.g., DE, NT, RT). Finally
we explored the most important terms with their stronger or weaker connections to each
other. With this procedure we obtained the relevant concepts which could enable us to
develop useful ideas. A small part of the visualised thesaurus will be presented
in Figure 5.
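
The pruning step (hiding links with weights below 2.1 and the nodes that become isolated) can be sketched in a few lines. This is an illustrative Python fragment and not the ORA Casos implementation; the function name filter_network and the link weights in EXAMPLE_EDGES are hypothetical, since the way link weights were derived from the descriptor weights is not spelled out here.

```python
def filter_network(edges, min_weight=2.1):
    """Keep links with weight >= min_weight and only the nodes they touch."""
    kept = [(a, b, w) for a, b, w in edges if w >= min_weight]
    nodes = sorted({node for a, b, _w in kept for node in (a, b)})
    return nodes, kept


# Hypothetical weighted links between thesaurus terms.
EXAMPLE_EDGES = [
    ("informatics", "text mining", 4.0),
    ("informatics", "crime", 5.0),
    ("anthropology", "ethnology", 1.0),
    ("army logistics", "acquisition logistics", 2.0),
]

visible_nodes, visible_edges = filter_network(EXAMPLE_EDGES)
```

With these example weights only the links around ‘informatics’ remain visible, while the weakly connected terms drop out of the picture.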
Table 4 Part of the weighted descriptors and relations in the customised thesaurus

DE                            W  TT CC BT NT RT
Anthropology                  1  Ethnology
Acquisition logistics         2  Army logistics  Army logistics
Army logistics                3  355:005  Logistics, etc.  Economy, etc.
Artificial intelligence       4  Informatics  Informatics
Borders                       4  Migrations
Business information systems  3  Informatics  Information systems
Business modelling            4  Informatics  Informatics
Chemistry forensics           3  Forensics  Forensics
Communication technology      4  Informatics  Informatics
Computer crime                5  Jurisprudence  Crime
Computer forensics            5  Forensics  Forensics
Courts                        3  Jurisprudence  Law
Crime                         5  Jurisprudence  343.3/.7  Penalty law  Computer crime, etc.  Economy, etc.

Figure 5 Part of the network extracted from the thesaurus with included weights

Figure 5 shows a small part of the network extracted from the customised thesaurus with
included weights (see Table 4). On the basis of several terms and their relations we
extracted many ideas which could be useful for the MOI. We found that the most
important terms and connections inside the whole network of terms arise from crime
(343.3/.7), forensics (340.6), informatics (004), police (351.74) and public administration
(35).
We summarised the main groups of ideas based on the weighted terms of the
thesaurus and created a mind map (cf., Pennington, 2011). Due to the size of the mind
map we cannot show the whole picture. We divided the collected useful ideas into four
main groups.

Figure 6 Mind map of four main groups of useful ideas

Figure 6 presents a section of the mind map containing four main groups of useful
ideas, which are as follows:
1 Business intelligence: inside this group we sorted e-legal products (articles,
automatic comparison of different laws), e-meetings (projects, colleges, interest
groups), the application of common and related loan, e-video conferencing,
knowledge mapping services to MOI, dynamic models of business processes on the
intranet (monitoring and analysis of work flow, inventory of new and elimination of
outdated business processes, automated manufacturing business reports, analytical
dashboards), expert system (systems for automatic responses, intelligent Pandora
bots for e.g., police and penalty law, crime profiling).
2 Digital library of the public administration: this group contains issues and answers,
prepared information on the public administration (statistical/mathematical visual
applications), e-communication, online research on the population, a digital library
of crime (the organisation of coherent statistical reports, case studies/best practices,
etc.), special library – police, the digital library of the picture – HTML image maps
(HTML image maps for administrative matters, charts with links).
3 Simple IT solutions: this group contains prepared queries with specific areas of
knowledge and/or discipline, query extractions, a knowledge portal with related
organised discussion groups, a specialised database for investigation and research,
specific correlation matrices for various organised groups, a useful relational
database, a network of people and concepts, useful visualisations of web links, a
book cover gallery (equipped with web links), a prepared list of important
documents, a list of the most important magazines/journals for the MOI, a list of
relevant conferences/events for the MOI, prepared and published bibliographies for
persons and topics inside the MOI, a portal of forms and model contracts.
4 Specific e-services: inside this group of useful ideas we sorted the forum for
information literacy, flash-video guides, e-counselling and online courses and
trainings.
In the next step we created an evaluation to increase the effectiveness of the scheme.
Figure 7 presents a part of the matrix with evaluated domains. We divided the basic
ideas into three domains (IT, semantic cognitive and social domain) and evaluated
them with weights from 1 to 3 (see Figure 7: triangle = 3 –> very important idea, black
circle = 2 –> important idea, grey circle = 1 –> less important idea). The ideas were
evaluated on the basis of four orientations (collection of collective intelligence, location
determination of knowledge, connection of departments and appropriate IT) and
12 crucial actions inside the working organisation (solving information problems, making
effective decisions, building semantic networks, creating classifications, discovering
knowledge, preventing conflicts, preventing collective idiotism, developing
organisations, creating effective organisation structures, accessing content
management systems, social software and intranet). These orientations and actions are
important for the development of key ideas. On the qualitative level the highest scores
were given inside the social domain. For a more detailed analysis we exported the
‘orientation – action matrix’ into Excel to point out exactly which ideas inside the
domains are dominant. We analysed the values with the Tableau public software.
Figure 8 shows a part of the analysed domains. The results we got from this analysis
confirm the first of the two alternative hypotheses and enable us to answer the first
research question. The extracted useful ideas (cf., Figures 7 and 8) display the following
sums of measure values:
• expert systems (social domain): 64
• organised groups (social domain): 36
• social networks (social domain): 36
• interest groups (social domain): 34
• online courses (social domain): 34
• projects (social domain): 34
• monitoring and analysing (semantic cognitive domain): 33
• best practices (semantic cognitive domain): 32
• case studies (semantic cognitive domain): 32
• investigation and research (semantic cognitive domain): 32
• knowledge mapping services (social domain): 32
• prepared information (semantic cognitive domain): 32
• discussion groups (social domain): 31
• persons (social domain): 31
• e-meetings (social domain): 30
• e-video conferencing (social domain): 30
• knowledge portal (IT domain): 30
• prepared bibliographies (semantic cognitive domain): 30
• e-counselling (social domain): 29
• forum for information literacy (social domain): 29
• digital library (categorised inside IT domain): 28
• e-legal products (semantic cognitive domain): 27
• colleges (social domain): 26
• online research (semantic cognitive domain): 26
• e-communication (social domain): 25
• models of business processes (semantic cognitive domain): 25
• concepts (semantic cognitive domain): 24
• analytical dashboards (IT domain): 23
• articles (semantic cognitive domain): 23
• outdated business processes (semantic cognitive domain): 21
• HTML image maps (semantic cognitive domain): 20
• intelligent Pandora bots (IT domain): 20
• list of relevant conferences (semantic cognitive domain): 20
• prepared queries (semantic cognitive domain): 20
• automatic business reports (IT domain): 19
• flash video guides (semantic cognitive domain): 19
• issues and answers (semantic cognitive domain): 19
• systems for automatic responses (IT domain): 19
• automatic comparisons (IT domain): 18
• databases (IT domain): 18
• query extractions (semantic cognitive domain): 18
• charts with links (semantic cognitive domain): 16
• list of important magazines (semantic cognitive domain): 15
• book cover gallery (IT domain): 14.

Figure 7 Matrix of evaluated domains



Figure 8 Part of the analysed domains

The key idea can be derived from the measured useful ideas on the basis of the
aforementioned four orientations and 12 actions, which display the following sums of
measure values:
• making effective decisions: 116
• discovering knowledge: 116
• accessing of content management systems: 115
• solving information problems: 107
• accessing of intranet: 103
• preventing of collective idiotism: 102
• accessing of social software: 101
• developing of organisations: 93
• preventing conflicts: 93
• creation of effective organisation structures: 86
• building of semantic networks: 76
• creating classifications: 66.
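
The sums of measure values reported in the two lists can be understood as row and column totals of the evaluation matrix. The following Python sketch illustrates that arithmetic only. The scores in IDEA_SCORES are hypothetical and far smaller than the real matrix, whose individual cells are not reproduced here; the function names are likewise assumptions.

```python
# Hypothetical excerpt of an evaluation matrix: idea -> action -> weight (1-3).
IDEA_SCORES = {
    "expert systems": {
        "making effective decisions": 3,
        "discovering knowledge": 3,
        "creating classifications": 2,
    },
    "knowledge portal": {
        "making effective decisions": 2,
        "discovering knowledge": 3,
        "creating classifications": 1,
    },
    "book cover gallery": {
        "making effective decisions": 1,
        "discovering knowledge": 1,
        "creating classifications": 1,
    },
}


def sums_per_idea(scores):
    """Row totals: how strongly each idea is rated across all actions."""
    return {idea: sum(row.values()) for idea, row in scores.items()}


def sums_per_action(scores):
    """Column totals: how strongly each action is supported by all ideas."""
    totals = {}
    for row in scores.values():
        for action, value in row.items():
            totals[action] = totals.get(action, 0) + value
    return totals


idea_totals = sums_per_idea(IDEA_SCORES)
action_totals = sums_per_action(IDEA_SCORES)
ranking = sorted(idea_totals, key=idea_totals.get, reverse=True)
```

Sorting the row totals gives a ranking of ideas, and the column totals give the corresponding ranking of actions, which mirrors how the two lists above are ordered.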
With our method relying on crucial actions, orientations and already known IT, we were
able to obtain useful ideas from small samples of web queries. These ideas comprise the
development of collective intelligence, the prevention of conflicts, knowledge discovery,
the enhancement of knowledge organisation and raising the efficiency of the working
organisation (cf., second research question).

In this research we also learned which techniques can be effectively used to extract
useful ideas (cf., the third research question):
1 analysing and visualising social and semantic networks with appropriate software
tools
2 creating and analysing a customised thesaurus consisting of the interests of queried
persons
3 mind mapping to allow the grouping of useful ideas
4 constructing a complex matrix to evaluate the groups of useful ideas
5 analysing and visualising the matrix with appropriate analytical software.
The essential discovery of this article lies in the fact that we reached our aim with a small
sample of web queries. This is not always possible, because some environments or
systems in our societies are very unstable and change very often. Similar discoveries with
small samples can be expected when the environment or system is relatively stable and
does not essentially change over a longer period of time (cf., the fourth research
question).

4 Related and future work

Although we searched the most significant and well-known scientific databases (e.g.,
SpringerLink, Web of Science, INSPEC, LISA, ERIC), we did not find related or similar
works dealing with small web samples; therefore this paper often cites loosely connected
references. Thus the first question could be whether large web samples are necessary for
studies trying to extract useful ideas for the improvement of web applications and web
services. On the one hand it could be claimed that the results of a small web sample cannot
be replicated, but on the other hand there are many systems, processes and procedures in
our world which do not change over a very long period of time. In such cases it seems
that large data samples are not really necessary to find out how certain entities work. For
example, if library users are interviewed about stress in libraries, it will soon become
evident that a small sample of their opinions will suffice and that a larger sample will not
essentially change the final result. As another example, in social networks one can
identify many people with similar or different interests, but if one knows the main
authority, one will be able to extract the basic directions, knowledge and aims of many
people. In many situations people make decisions on first impressions, i.e., on the
basis of very small samples of data or small sets of criteria. Crime investigators often
have only small amounts of data, which they call indications, at the beginning of the
investigation, but with sophisticated research methods they can build an appropriate
profile of a criminal. Perhaps the primary question should be “When should small
samples of statistical data be taken?” in order to research different relevant entities
(organisations, people, etc.). Nowadays it is quite easy to obtain millions of web data
items in just a few seconds, but due to the high entropy of the data it is very challenging
to point out the cases or situations in which all these data are necessary. This kind of
research could serve as an excellent platform for fast evaluations of environments,
processes, procedures, working programmes, people, positions, situations, etc., with the
primary aim
of making fast, high-quality decisions. Recently we conducted a study on a
similar topic, but with a larger sample of web data (over 57,500 information retrievals
were taken into account). However, we realised that we had captured the content
mainstream (with methods such as scientific mapping, clustering and mind mapping) on the
basis of a relatively small range of data, so that we were able to implement some of the
solutions to improve the web pages and services of MOI. We found that the information
behaviour of the web users does not change over a long period of time. This means that we
can perform web log analyses without large amounts of data. From our point of view we
now have enough possibilities to implement the improvements indicated by our web
users.
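Whether a small sample already captures the content mainstream can be checked with a simple stability test. The sketch below, offered as one possible approach rather than as the procedure used in our studies, compares the k most frequent query terms in an initial portion of a log with those in the full log. The function names and the toy query list are hypothetical.

```python
from collections import Counter


def top_terms(queries, k):
    """Return the set of the k most frequent terms in a list of queries."""
    counts = Counter(term for query in queries for term in query.lower().split())
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {term for term, _ in ranked[:k]}


def top_k_overlap(sample, full_log, k=3):
    """Share of the full log's top-k terms that also appear in the sample's top-k."""
    full_top = top_terms(full_log, k)
    if not full_top:
        return 0.0
    return len(top_terms(sample, k) & full_top) / len(full_top)


# Hypothetical toy log (NOT real SLOMI queries).
toy_log = [
    "police law",
    "traffic law",
    "police training",
    "police law",
    "labour law",
    "traffic police",
    "library hours",
    "police law",
]
small_sample = toy_log[:4]
overlap = top_k_overlap(small_sample, toy_log, k=2)
```

An overlap close to 1 would suggest that the small sample reflects the same dominant terms as the full log, whereas a low value would indicate a less stable environment in which a larger sample is advisable.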
Having derived the key idea about an application system, we are now able to deal
with the fifth research question and to identify possible future work. The desired
application system must satisfy different conditions in order to improve the performance
of the working organisation and its employees. It must be able:
• to gather collective intelligence (e.g., relevant documents, empirical analyses, case
studies, best practices, empirical knowledge, important studies, user and expert opinions,
solved information problems, knowledge organisation)
• to make effective decisions
• to discover hidden and new knowledge
• to prevent collective idiotism and conflicts (e.g., complicated administrative
procedures or rules which take too much time and money, laws which constrain research
and labour progress)
• to connect employees, users, departments, working organisations, etc., inside a social
network
• to provide access to a content management system (e.g., organisation of knowledge)
and social software
• to provide access to the intranet (for employees), an extranet (e.g., for other working
organisations) and the internet (for individual external users, e.g., scholars, retired
persons)
• to provide the possibility of different automation processes (e.g., analytical
dashboards with automatic business reports, automatic comparisons of laws)
• to provide the possibility to measure and analyse differences in dynamics (e.g., users,
events, discussions, semantic networks, social networks).
The above described application system will be referred to as ‘social analytical
knowledge collector and connector’. The name indicates that it enables the collection and
connection of different social networks (expert systems, research networks, etc.),
dynamic analytical applications (e.g., user tracking, online models of business processes,
dashboards, data and text mining, knowledge discovery) and the organisation of
knowledge (e.g., empirical, intuitive, facts, private, institutional). Building such an
application system does not pose a major problem from the IT and financial points of view,
since we can use open source IT (especially software). At the beginning we need an
excellent content management system like Joomla (e.g., for documents, case studies,
analyses, plans, best practices, development of useful applications), social network
software (e.g., JomSocial, Maltego, Twitter, Facebook), different open source web
applications (e.g., Ether Titan Pad, Box.com, Wise Mapping) and some knowledge about
different Joomla and JomSocial plugins (e.g., J4age, CC newsletter, question answering
systems, Tagging, search engine plugin for Joomla and JomSocial). On the basis of
such building blocks we should be able to create a flexible and dynamic application
system which serves the needs of many different working organisations. In short, it could
be a system for the development of collective intelligence which could more easily solve
information problems, improve decisions and prevent collective idiotism and different
conflicts.

5 Conclusions

On the basis of a relatively small sample of data (including web queries on 27 persons) we
were able to extract useful ideas on different IT solutions which could be employed to
meet the information needs of the citizens and the MOI employees. In this research
we used different methods and methodological tools ranging from statistical methods and
different visualisation techniques to rather intuitive methods of collecting and organising
concepts into a customised thesaurus and mind mapping scheme. We extracted several
useful ideas which were classified into four main groups. Essentially, we learned
which IT solutions and vital information were not present on the MOI web pages,
although some of the information needs identified in our sample cannot be realised in
the short term. The identified information needs led us to the idea of
implementing a special intranet knowledge portal with a content management system,
expert discussion groups and a system for the survey and analysis of social and semantic
networks. This special application system could be built with compatible open source
tools or inexpensive software (e.g., Joomla, JomSocial, LimeSurvey and the SONIVIS tool).
In the past an internal working group of MOI constructed a similar application system
and implemented ideas identified in our sample. However, the working group also
acknowledged that the development and implementation of ideas and knowledge is
hampered by the insufficient number of new special working profiles (e.g., information
specialists, information mechanics, information architects, knowledge analysts,
knowledge/application developers) who would be able to create and organise
valuable knowledge to the benefit of the citizens and the MOI employees. Until now, our
universities have not educated the above mentioned special working profiles; therefore
progress in this field will be slow. For example, we need information preparers and
developers of applications for the laws that are important for the MOI and the Police,
enabling the translation and automatic comparison of laws, procedures and court practices
in different countries (cf., Moens, 2001). Laws are for the most part very complex texts
which are more or less strongly connected with other laws. Therefore we need knowledge
builders who are able to connect different paragraphs from different laws (the Police law
is in a causal relation with the labour law, penal law, road traffic regulations, etc.). At
this time we could claim that the difficulty of raising effectiveness and developing useful
ideas and knowledge inside MOI and other working organisations is not caused by IT itself
but arises from missing special knowledge working profiles. The shortage of such
professionals ultimately means an insufficient usage of innovative ideas and knowledge in
our societies.

References
Agosti, M., Crivellari, F. and Di Nunzio, G.M. (2011) ‘Web log analysis: a review of a decade of
studies about information acquisition, inspection and interpretation of user interaction’,
Data Mining and Knowledge Discovery, pp.1–34 [online] (accessed 6 September).
Alai, M. (2004) ‘A.I., scientific discovery and realism’, Minds and Machines, Vol. 14, No. 1,
pp.21–42.
Alguliev, R.M., Alyguliev, R.M. and Yusifov, F.F. (2007) ‘Automatic identification of the interests
of web users’, Automatic Control and Computer Sciences, Vol. 41, No. 6, pp.320–331.
Aula, A. (2005) ‘User study on older adults’ use of the web and search engines’, Universal Access
in the Information Society, Vol. 4, No. 1, pp.67–81.
Bollen, J. and Van de Sompel, H. (2006) ‘Mapping the structure of science through usage’,
Scientometrics, Vol. 69, No. 2, pp.227–258.
Broder, A. (2002) ‘A taxonomy of web search’, ACM SIGIR Forum, Vol. 36, No. 2 [online]
http://www.sigir.org/forum/F2002/broder.pdf (accessed 24 May 2011).
Bystryakova, A.Y. and Mizintseva, V.V. (2011) ‘Innovation informatization for state financial
control bodies’, Scientific and Technical Information Processing, Vol. 38, No. 2, pp.10–17.
Cantador, I., Konstas, I. and Jose, J.M. (2011) ‘Categorising social tags to improve
folksonomy-based recommendations’, Web Semantics: Science, Services and Agents on the
World Wide Web, Vol. 9, No. 1, pp.1–15.
Cappelin, R. and Post, R. (2009) International Knowledge and Innovation Networks: Knowledge
Creation and Innovation and Medium-Technology Clusters, Vol. VI, p.275 (new horizons and
regional science), Edward Elgar, Cheltenham, Northampton.
Carley, K.M. et al. (2011) ORA User’s Guide 2011, Carnegie Mellon University, School of
Computer Science, Institute for Software Research, Technical Report, CMU-ISR-11-107.
Carmagnola, F. et al. (2011) ‘Supporting content discovery and organization in networks of
contents and users’, Multimedia Systems, Vol. 17, No. 3, pp.199–218.
Courtesy of Tableau Software 7.0 (2012) [online]
http://www.tableausoftware.com/public/community (accessed 27 March 2012).
DiGiacomo, J. (2003) Implementing Knowledge Management as a Strategic Initiative, Thesis,
J. DiGiacomo, Monterey, p.99.
Edraw MindMap Free Version 3.6.5 (2004–2012) [online] http://www.edrawsoft.com/freemind.php
(accessed 25 June 2012).
Eggers, W.D. and Singh, S.K. (2009) The Public Innovator’s Playbook: Nurturing Bold Ideas in
Government, p.164, Harvard Kennedy School, Massachusetts [online]
http://www.deloitte.com/assets/Dcom-
Global/Local%20Assets/Documents/dtt_ps_innovatorsplaybook_100409.pdf (accessed
12 August 2011).
Glennisson, P., Glänzel, W. and Persson, O. (2005) ‘Combining full-text analysis and bibliometric
indicators: a pilot study’, Scientometrics, Vol. 63, No. 1, pp.163–180.
Goh, O.S., Fung, C.C. and Wong, W. (2008) ‘Query based intelligent web interaction with real
world knowledge’, New Generation Computing, Vol. 26, No. 1, pp.3–22.
Jurisica, I., Mylopoulos, J. and Yu, E. (2004) ‘Ontologies for knowledge management: an
information systems perspective’, Knowledge and Information Systems, Vol. 6, No. 4,
pp.380–401.
Keßler, C. (2011) ‘What is the difference? A cognitive dissimilarity measure for information
retrieval result sets’, Knowledge and Information Systems, pp.1–22 [online]
(accessed 12 February).
Landrin-Schweitzer, Y., Collet, P. and Lutton, E. (2006) ‘Introducing lateral thinking in search
engines’, Genetic Programming and Evolvable Machines, Vol. 7, No. 1, pp.9–31.
Miao, Q., Li, Q. and Zeng, D. (2010) ‘Fine-grained opinion mining by integrating multiple sources
review’, Journal of the American Society for Information Science and Technology, Vol. 61,
No. 11, pp.2288–2299.
Moens, M.F. (2001) ‘Innovative techniques for legal text retrieval’, Artificial Intelligence and Law,
Vol. 9, No. 1, pp.29–57.
Ovchenkova, E.A. (2010) ‘The internet as a global search system for scientific articles on
information and communication’, Scientific and Technical Information Processing, Vol. 37,
No. 3, pp.178–186.
Papadimitriou, A., Symeonidis, P. and Manolopoulos, Y. (2011) ‘A generalized taxonomy of
explanations styles for traditional and social recommender systems’, Data Mining
and Knowledge Discovery, [online] http://delab.csd.auth.gr/papers/PSM2011DAMI.pdf
(accessed 27 March 2011).
Pearson, M. and Somekh, B. (2003) ‘Concept-mapping as a research tool: a study of primary
children’s representations of information and communication technologies (ICT)’, Education
and Information Technologies, Vol. 8, No. 1, pp.5–22.
Pennington, D.D. (2011) ‘Bridging the disciplinary divide: co-CREATING research ideas in
eScience teams’, Computer Supported Cooperative Work, Vol. 20, No. 3, pp.165–196.
Polanco, X., Roche, I. and Besagni, D. (2006) ‘User science indicators in the web context and
co-usage analysis’, Scientometrics, Vol. 66, No. 1, pp.171–182.
Progris Midos Thesaurus (2012) [online] http://www.progris.de/index.html?/midost.htm (accessed
19 February 2011).
Serdült, U., Vögeli, C., Hirschi, C. and Widmer, T. (2005) APES – Actor-Process-Event
Scheme, Zurich, Switzerland: IPZ, University of Zurich [online] http://www.apes-tool.ch/
(accessed 2012-10-09).
Smith, G. (2004) ‘Folksonomy: social classification’, [online]
http://atomiq.org/archives/2004/08/folksonomy_social_classification.html (accessed
19 February 2011).
Soller, A. (2004) ‘Computational modeling and analysis of knowledge sharing in collaborative
distance learning’, User Modeling and User-Adapted Interaction, Vol. 14, No. 4, pp.351–381.
Srivastava, A.N. and Sahrami, M. (2009) Text Mining: Classification, Clustering, and Applications,
p.290, CRC Press, Boca Raton, London, New York, (Chapman and Hall/CRC Data Mining
and Knowledge Discovery Series).
Thelwall, M.A., Wilkinson, D. and Uppal, S. (2010) ‘Data mining emotion and social network
communication: gender differences and MySpace’, Journal of the American Society for
Information Science and Technology, Vol. 61, No. 1, pp.190–199.
Tolman, E.C. (1948) ‘Cognitive maps in rats and men’, Psychological Review, Vol. 55, No. 4,
pp.189–208.
Wei, W. et al. (2011) Handling Weighted, Asymmetric, Self-Looped, and Disconnected Networks in
ORA, Carnegie Mellon University, School of Computer Science, Institute for Software
Research, Technical Report CMU-ISR-11-113.
Westerski, A., Iglesias, C.A. and Rico, T. (2010) ‘A model for integration and interlinking of idea
management systems’, Metadata and Semantic Research 4th International Conference, MTSR
2010, Alcalá de Henares, Spain, October 20–22, Proceedings, CCIS 108, pp.183–194.
Wissmann, J. and Bahr, G.S. (2007) ‘Bilingual mapping visualizations as tools for Chinese
language acquisition’, Human-Computer Interaction, Part II, HCII 2007, LNCS, Vol. 4551,
pp.171–180.
Yang, B., Song, W. and Xu, Z. (2007) ‘New construction for expert system based on innovative
knowledge discovery technology’, Science in China Series F: Information Sciences, Vol. 50,
No. 1, pp.29–40.
Zhao, L. and Zhang, Q. (2011) ‘Mapping knowledge domains of Chinese digital library research
output, 1994–2010’, Scientometrics, Vol. 89, No. 1, pp.51–87.
