Proceedings of the WebMedia & LA-Web 2004 Joint Conference 10th Brazilian Symposium on Multimedia and the Web 2nd Latin American Web Congress
(LA-Webmedia’04)
0-7695-2237-8/04 $20.00 © 2004 IEEE
2. Similarity Processing Framework

2.1. Weight Mapping

This technique exploits the fact that ontologies and their instances carry much more information than what is explicitly stated, as there is much “hidden” information entailed by the relations (i.e., a semantically-based linking structure). In traditional ontologies, it is only possible to indicate the presence or absence of a relation between two concept instances. In many situations, however, it would be desirable to also express some strength associated with the relation. The classical way is to associate a numerical value with the corresponding link. One of the ideas in this work is to extract knowledge from the ontology and its instances in order to obtain a numerical weight for each existing relation instance in the model. A similar idea was presented in [8], to provide a novel approach for ranking the results of ontology-based searching in the Semantic Web, with good results. We call “Weight Mapping” the technique of calculating a numerical weight value for each relation instance, based on the analysis of the link structure of the knowledge base.

Different ideas were tested in devising a calculation that can generate a strength formula for each existing relation instance in the knowledge base. In [7] we proposed three different measures - cluster, specificity and combined - which we found very useful in developing our system. We are aware that the choice of these measures is totally application and task dependent. Here we will just briefly present the three proposed measures. For deeper information on the motivations behind them and explanations of the formulas the reader should refer to [7].

The first measure tries to establish the degree of similarity between two related concept instances in a relation. The similarity measure used is very similar to the cluster function used in [2], obtained by specializing that function for concepts that relate to each other. Formula (1) indicates the similarity between concept instances $C_j$ and $C_k$. The value $n_{ij}$ represents that concept $C_j$ is related to concept $C_i$ (it is 1 if the concepts are related and 0 otherwise). The value $n_{ijk}$ represents the fact that both concepts $C_j$ and $C_k$ are related to concept $C_i$ (1 if both concepts $C_j$ and $C_k$ are related to $C_i$ and 0 otherwise).

\[ W(C_j, C_k) = \frac{\sum_{i=1}^{n} n_{ijk}}{\sum_{i=1}^{n} n_{ij}} \tag{1} \]

The second measure is similar to the idf (inverse document frequency) measure [9] widely used in Information Retrieval (although in I.R. the log function is normally used). It is useful when the user wants to give the relation a semantics of specificity or differentiation. Formula (2) was used for the specificity measure. The value $n_k$ is equal to the number of instances of the given relation type that have $k$ as their destination node.

\[ W(C_j, C_k) = \frac{1}{n_k} \tag{2} \]

The third measure is the combined measure, obtained as the product of the two previous ones. Its calculation resembles the tf-idf strategy [9] that is commonly used in classic Information Retrieval, since it combines a similarity measure with a specificity measure. In the general case this combined measure proved to be the best one in our applications. Other similarity and specificity measures might be used in the future to achieve better results.

2.2. Hybrid Spread Activation

The other strategy we use to calculate this similarity measure employs spread activation techniques. Such techniques are among the most used processing frameworks for semantic networks, having been successfully applied in several fields, particularly in Information Retrieval applications [3,4]. Given an initial set of activated concepts and some restrictions, activation flows through the network, reaching other concepts which are closely related to the initial concepts. This is a powerful way to perform proximity searches, where given an initial set of concepts the algorithm returns other concepts which are strongly connected to them. Inferences occur naturally in this process, since the result set may contain nodes that are not directly linked to the initial set of nodes. An overview of spread activation techniques is presented in [4].

Usually spread activation techniques are used either on semantic networks, where each edge in the network has only a label associated with it, or on association networks, where each edge has only a numeric weight associated with it. In [7] we showed how to use the weight mapping techniques to construct a hybrid instances network, where each relation instance has both a semantic label and a numerical weight, and how to use spread activation on this network. The intuition behind this idea is that better results in the search process can be achieved using the semantic information together with sub-symbolic (numerically encoded) information extracted from the instances. Several works in the literature present spread activation algorithms either in semantic [3] or in associative nets [2]. However, there are few works that use both approaches together.

The algorithm has as a starting point an initial set of instances in the ontology, henceforth called nodes, which have an initial activation value; in the functionalities proposed in this paper this value will be 1.0. All nodes not in the initial set have their initial activations set to zero. The initial nodes are put in a priority queue, ordered non-increasingly with respect to the node’s activation value.
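As a rough illustration of the measures of Section 2.1 and of the initialization step just described, the following Python sketch computes the cluster, specificity and combined weights on a toy knowledge base and builds the initial priority queue. It is only a sketch under stated assumptions: the triples, the function names and the treatment of relations as symmetric are hypothetical, and the specificity measure is taken in the reciprocal form $1/n_k$; the exact definitions are those given in [7].

```python
import heapq
from collections import defaultdict


def build_neighbors(edges):
    """Map each node to the set of nodes it is related to.

    Assumption: relations are treated as symmetric for the cluster measure.
    """
    neighbors = defaultdict(set)
    for source, _label, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    return neighbors


def cluster_weight(neighbors, cj, ck):
    """Formula (1): nodes related to both cj and ck, over nodes related to cj."""
    related_to_cj = neighbors[cj]
    if not related_to_cj:
        return 0.0
    return len(related_to_cj & neighbors[ck]) / len(related_to_cj)


def specificity_weight(edges, label, ck):
    """Formula (2), assumed here to be 1 / n_k, where n_k counts the
    instances of relation `label` whose destination node is ck."""
    n_k = sum(1 for _s, lbl, target in edges if lbl == label and target == ck)
    return 1.0 / n_k if n_k else 0.0


def combined_weight(edges, neighbors, cj, label, ck):
    """Combined measure: product of the cluster and specificity measures."""
    return cluster_weight(neighbors, cj, ck) * specificity_weight(edges, label, ck)


def initial_queue(all_nodes, start_nodes, initial_activation=1.0):
    """Activation 1.0 for the start nodes, 0.0 elsewhere; start nodes queued."""
    activation = {node: 0.0 for node in all_nodes}
    for node in start_nodes:
        activation[node] = initial_activation
    # heapq is a min-heap, so activations are negated to pop the highest first.
    queue = [(-activation[node], node) for node in start_nodes]
    heapq.heapify(queue)
    return activation, queue


# Hypothetical toy knowledge base: (source, relation label, destination).
edges = [
    ("p1", "author", "ann"),
    ("p1", "area", "ir"),
    ("p2", "author", "ann"),
    ("p2", "area", "ir"),
    ("ann", "works_in", "ir"),
]
neighbors = build_neighbors(edges)
weight = combined_weight(edges, neighbors, "p1", "area", "ir")
activation, queue = initial_queue(neighbors, ["ann"])
```

Negating the activation is simply a way to obtain the non-increasing order required by the text from Python's min-heap; any priority queue with a custom ordering would serve equally well.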
The node with the highest activation value is then taken out of the queue and processed. If it satisfies all the

[Figure: instances graph. Recoverable node labels: Publication “Web Services Patterns”, Publication “Hybrid Approach for Searching”, Area “Information Retrieval”.]
concepts that have some relation among them, this is only true if the links provided are meaningful. The process of adding relationships among concepts is usually done by human beings in a totally manual way. This is a hard task that consumes a lot of time and requires great knowledge of the specific domain of the application.

For example, consider the PUC-Rio Department of Informatics website. There are approximately 1,600 publications stored in the website. For each stored publication, the website includes a list of relations to its authors, the areas in which the publication is relevant, the projects which are related to the publication, etc., requiring a tremendous amount of work. Many times, as in this case, this task is shared among various users, in order to make it a little easier. For example, each professor is responsible for entering the information regarding his own publications. In many cases, there is decentralized input of information, and various inconsistencies can arise from this process, mainly due to incomplete knowledge on the part of the user entering the information.

In various knowledge bases there exists redundancy. For example, in the research domain a professor who has several publications in a specific area has a great chance of being related to that area. If this information is not explicit in the knowledge base by a direct edge connecting the professor to the given area, this might be an inconsistency or error in the knowledge base.

The idea behind the proposed functionality of detecting and suggesting relations is to identify, for the user or the administrator of the system, possible relations among concepts which are not explicit in the knowledge base but have a great possibility of existing. That is, the system detects possible new relations which were not previously in the knowledge base. Not every detected relation comes from an inconsistency in the knowledge base. Sometimes, a relation might not exist at an initial moment but, over time, that relation becomes latent due to the modifications that are happening in the knowledge base.

This kind of functionality can also benefit other processes. In particular, it can be very helpful in the process of updating the knowledge base. Most hypermedia applications have their knowledge bases updated constantly (that is, some concepts are inserted in the base, other concepts are deleted; some relations among the existing concepts are added, and others are deleted). This functionality can help the user in the task of inserting new concepts and relations among concepts by using the pre-existing knowledge in the base.

In this scenario, it would be very useful if, as the user starts to input and insert new relations, the system could suggest other probable relations. For instance, the user could begin by inserting the relation with the professor who wrote the publication. After that, the system would automatically suggest, as areas related to the publication, the main research areas of the given professor. After the confirmation of the areas related to the publication by the user, the system could suggest as probable co-authors of the publication the students and professors who typically write publications together with the given professor in those areas, and so on.

To suggest a new relation, the user provides a starting node, which the system uses as the input node for the spread activation algorithm. To prevent suggestion of links already present in the knowledge base, the spread activation must be configured by adding a restriction rejecting all nodes to which the given node already has a relation. The nodes obtained from the spread activation algorithm are then presented to the user as possible related nodes to the given node; the user has the option of immediately inserting any suggested relation in the knowledge base.

In addition to suggesting relations, the system also associates with each suggestion a numeric weight that indicates its strength. The analysis of this weight is difficult, and varies from relation to relation. Naturally, a suggested relation instance with a higher value than another suggested instance of the same relation has a higher likelihood of being true.

Next, an example will be presented to clarify the use of this functionality. Considering the instances graph shown in Figure 1, it is possible to observe that professor “Schwabe” has a relation with three distinct areas (“Web Services”, “Hypermedia” and “Software Engineering”). If the system were asked to propose new relations of the type Professor-Area, it could suggest the relation with the area of “Information Retrieval”, since professor “Schwabe” has a publication in this area, and also advises a student in it.

The absence of this relation in the knowledge base could be due to an error. In this case, professor “Schwabe” is indeed related to the area of “Information Retrieval”, but this relation was not stored in the knowledge base due to errors in the information input process. It is also possible that, when the database started to be populated, this was really not one of his areas; after he started publishing papers and advising students in that area it became true, but the relation was never actually inserted in the knowledge base.

Another possibility is that professor “Schwabe” only has direct relations to his main research areas, and since “Information Retrieval” is not one of those, he has no direct relation to it in the knowledge base. In this case, the lack of this relation in the base is not an error. Even if this is the case, this inference is still very useful in various contexts. If a search for professors in this area is done in the system, it might be interesting to show professor Schwabe as one of the results, since he has at least some experience in the area, even though no explicit relation is actually stored. In any case, it is important to observe that the decision on whether or not to insert a suggested relation in the knowledge base is taken by the user(s).
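The suggestion procedure described above can be sketched in Python as follows. This is a simplified illustration, not the algorithm of [7]: the propagation rule (activation times edge weight times a decay factor), the firing cutoff, the choice to fire each node only once, and all node names and weights are assumptions made for the example. The restriction that rejects nodes already related to the starting node is applied here when collecting results, so that activation can still flow through those nodes.

```python
import heapq


def suggest_relations(graph, start, target_type, node_types,
                      decay=0.5, min_activation=0.05):
    """Return [(node, activation)] for nodes of `target_type` that are not
    yet directly related to `start`, best suggestion first.

    graph: {node: {neighbor: weight}} (hypothetical weighted instance graph).
    """
    activation = {start: 1.0}
    queue = [(-1.0, start)]          # negated so the highest activation pops first
    processed = set()
    while queue:
        _neg, node = heapq.heappop(queue)
        if node in processed:
            continue                  # stale queue entry
        processed.add(node)
        output = activation[node]
        if output < min_activation:
            continue                  # too weak to spread further
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor in processed:
                continue
            # Assumed propagation rule: accumulate decayed, weighted activation.
            activation[neighbor] = activation.get(neighbor, 0.0) + output * weight * decay
            heapq.heappush(queue, (-activation[neighbor], neighbor))

    already_related = set(graph.get(start, {}))
    suggestions = [
        (node, value) for node, value in activation.items()
        if node != start
        and node not in already_related
        and node_types.get(node) == target_type
    ]
    return sorted(suggestions, key=lambda item: item[1], reverse=True)


# Hypothetical fragment inspired by the "Schwabe" example; weights are invented.
graph = {
    "schwabe": {"web_services": 0.5, "hypermedia": 0.5,
                "pub_hybrid": 0.8, "student_x": 0.6},
    "pub_hybrid": {"schwabe": 0.8, "information_retrieval": 0.9},
    "student_x": {"schwabe": 0.6, "information_retrieval": 0.7},
    "information_retrieval": {"pub_hybrid": 0.9, "student_x": 0.7},
    "web_services": {"schwabe": 0.5},
    "hypermedia": {"schwabe": 0.5},
}
node_types = {
    "schwabe": "Professor", "pub_hybrid": "Publication", "student_x": "Student",
    "web_services": "Area", "hypermedia": "Area", "information_retrieval": "Area",
}
result = suggest_relations(graph, "schwabe", "Area", node_types)
```

In this toy graph the only Professor-Area suggestion is the Information Retrieval area, reached through the publication and the advised student; the areas already linked to the professor are filtered out.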
There are several types of relations in this application, and evaluating the functionality for all of them would be too expensive. Some specific relations were evaluated. Relations involving Laboratories and Students were good candidates since they had far fewer instances than they should have in practice. A balanced analysis of this functionality should also test relations that were thoroughly filled - those relations where most of the actually occurring relations were already explicit in the knowledge base. The intuition we had and wanted to confirm was that for these relations the precision of the proposed functionality would be lower. We divided the relations into 3 distinct groups - strong, medium, and weak - based on the average number of instances the relation had, relative to the expected number of instances, using the semantics of each relation (e.g. a paper must have at least one author).

The algorithm suggests various new relation instances, ordered for each type of relation, and a real number - its weight - associated with each suggestion. To be of practical value, it is necessary to establish a threshold for the weights, to filter out meaningless suggestions. The difficulty is that no single threshold value can be established, since it should be different for each type of relation, because the semantics of each relation type is completely different. The approach used was to use existing relations in the database as a “training set”, to collect the weights assigned to them by the algorithm. The threshold value is obtained as a function of these collected weights, which in our case was min {min weight, (avg weight - std. dev.)}. Several other possibilities were considered, and for other domains this function may have to be adjusted.

To analyze the precision of the results obtained it was necessary to manually evaluate each suggestion proposed by the system, using the help of domain experts. They classified the suggestions as correct or incorrect. For instances where there was any doubt, we classified them as incorrect. The precision was calculated as the percentage of correct relation suggestions. Recall could not be calculated since we did not have a list of all correct instances which were missing in the base (indeed, this is the reason why the functionality was very useful).

4.1. Tests and Results

Several tests were made to evaluate the suggestion of relations functionality in the PUC-Rio Department of Informatics website application. The basic methodology of the tests consisted of choosing a set of relations, and for each one of them asking the spread activation system to suggest new instances for them. After that, domain experts evaluated the proposed new relation instances, and decided whether or not they should exist in the knowledge base. Based on the hits and misses we calculated the precision rate of the functionality. The goal of the tests was to analyze whether the proposed system suggests new relations with an acceptable precision, where the meaning of acceptable varies from application to application. In the positive case, it could be employed by the users of an application, either for error and inconsistency identification, or for aiding in the insertion of new instances in the knowledge base.

The graph in Figure 3 presents the change in the precision as a function of the number of suggested relations. The horizontal axis represents the number of suggested relations for a given relation type, and the vertical axis represents the precision value at a given point. The graph was constructed as follows. For each relation type, we use the spread activation algorithm to obtain a list with all the relation suggestions for that type. This list was sorted from the best suggestion to the worst.

Table 2 presents the results obtained. As expected, the precision value diminishes as more relation instances are suggested. Also, the precision was much higher for the relations in the weak and medium groups. This is due to the fact that in these relations there are more missing relation instances, and therefore the level of correct suggestions tends to be higher. Overall, 329 relation instances above the threshold were suggested with an average precision rate of 78.7%. Ignoring the threshold, the system proposed 834 new relation instances with an average precision of 75.9%. Both results are very encouraging, given the fact that the main goal of the functionality is not to automatically generate new relation instances but to identify them for the user, who has the option of accepting the suggestion or not. The correctly suggested relation instances were responsible for an increase of 10% in the number of relations in the base.

We also tested this functionality in the Portinari Project website. This application is interesting because its database is highly consistent, and the domain model is less redundant, in the sense that there are fewer semantically meaningful transitive paths to be explored by the spread activation algorithm. Given these characteristics, as expected, the suggestion of relations was not very effective. We can conclude that the utility of this functionality grows with the level of inconsistency of the knowledge base, and with the redundancy of the semantic domain of the application.

We also developed an interface where the inferred links are presented together with the existing links. We used different colors so the user could differentiate them and offered an easy one-click solution for the user to insert the inferred link, turning it into an explicit link in the database, if he wishes and has the appropriate permissions. Users greatly appreciated this functionality.

The same ideas proposed here can also be used to suggest links for relations that do not exist in the conceptual model of the application, as opposed to
Table 2. Evaluation of the suggestion of relations functionality (PUC-Rio Informatics Dept. website)

Relation           Group    Instances   Suggestions   Precision     Suggestions   Precision
                            average     above thr.    above thr.    (all)         (all)
Publication-Lab    Weak     0.007       16            100.0%        305           81.0%
Student-Lab        Weak     0.0513      20            80.0%         119           67.2%
Lab-Area           Medium   0.369       0             0.0%          11            81.8%
Student-Area       Medium   0.61        113           98.2%         158           90.5%
Publication-Area   Strong   1.04        175           64.6%         175           64.6%
Product-Area       Strong   1.26        5             60.0%         66            62.1%
Total                                   329           78.7%         834           75.9%
[Figure 3: precision (vertical axis, 0% to 100%) as a function of the number of suggested instances (horizontal axis, 1 to 300), with one curve per relation type: Publication-Lab, Student-Lab, Lab-Area, Student-Area, Publication-Area, Product-Area.]
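The threshold rule and the precision computation described in this section can be sketched in Python as follows. This is a minimal sketch under assumptions: the function names and sample weights are hypothetical, and the population standard deviation is used because the text does not say whether the population or the sample form was intended.

```python
from statistics import mean, pstdev


def relation_threshold(training_weights):
    """min{min weight, (avg weight - std. dev.)} over the weights that the
    algorithm assigns to relation instances already in the knowledge base."""
    weights = list(training_weights)
    if not weights:
        raise ValueError("at least one training weight is required")
    # Assumption: population standard deviation (pstdev), not the sample form.
    return min(min(weights), mean(weights) - pstdev(weights))


def above_threshold(suggestions, threshold):
    """Keep the (node, weight) suggestions whose weight reaches the threshold."""
    return [(node, weight) for node, weight in suggestions if weight >= threshold]


def precision(judgements):
    """Fraction of suggestions judged correct; doubtful cases count as incorrect."""
    judgements = list(judgements)
    return sum(1 for correct in judgements if correct) / len(judgements)


# Hypothetical weights collected for existing instances of one relation type.
threshold = relation_threshold([0.2, 0.4, 0.6, 0.8])
kept = above_threshold([("a", 0.5), ("b", 0.1), ("c", 0.2)], threshold)
rate = precision([True, True, False, True])
```

Because the threshold is computed per relation type from that type's own training weights, each relation gets its own cutoff, which is the point made in the text about a single global value being inadequate.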
relation instances for existing ones. For example, it is possible to suggest links between professors, even though such relations are not present in the conceptual model.

In the case of the Portinari Project application the benefits of such a strategy become clearer. For example, in this domain, an exhibition is related to the paintings exhibited in it. A painting is related to the techniques used to paint it, and to its themes. An interesting suggestion of links would be to propose, for a given exhibition node, the techniques and themes most closely related to it, even though such relations do not exist in the conceptual model. This functionality works as an inference machine trying to do proximity search for node instances that are close to a particular node instance.

Some qualitative tests were done for this kind of relation suggestion in both applications, and the results obtained seemed to be very good. We intend to further explore this particular use in future works.

5. Related Work

An interesting system which uses spread activation techniques is the WebSCSA (Web Search by Constrained Spreading Activation) system proposed by Crestani [5]. This system searches for relevant Web pages by autonomously navigating through the Web using associations between pages. The navigation is processed and controlled by means of a Constrained Spreading Activation model. The first big difference to our work is
that the spread activation is carried out on the Web (not in a particular application) and therefore no domain information is available. Also, in our spread activation the similarity of pages is calculated using semantic information from the domain model, while in WebSCSA the textual contents of the web page are considered.

ONTOCOPI [6] presents an approach similar to ours for processing ontology-based information through spread activation techniques for suggesting relations. It is applied to identifying communities of practice (COPs) in an organization. The system tries to suggest persons who are closely related and therefore have common interests. ONTOCOPI attempts to uncover informal COP relations (those which are often indeterminate and expensive to establish and monitor) by spotting patterns in the formal relations represented in ontologies, traversing the ontology from instance to instance via selected relations. The activation in their system is propagated through a semantic network only, and there is no notion of extracting semantics from the link structure such as the weight mapping techniques proposed in our work. Their work uses the spread activation system and the suggestion of relations in a much narrower scope than the system proposed in this paper. We believe that our system could be successfully used for the same task as ONTOCOPI.

As previously mentioned, link ordering has been used for Adaptive Hypermedia applications; the main difference with respect to the one presented in this paper is that the type of information and the algorithms used in Adaptive systems are based on a model of the individual user and the context of use. In our case, we use semantic information from the node instances and their relations, which is the same for all users. We envision the use of both technologies together as being an even more powerful method for ordering the presentation of links in hypermedia applications.

6. Conclusions

In this paper, we showed how a similarity processing engine can be used to provide some new functionality in model-based applications. The proposed engine uses semantic information from the model and its instances to explore the instances graph using a hybrid spread activation algorithm. The proposed engine proved to perform well in presenting links to related information in an order that reflects the semantic closeness of the corresponding information. It was also successfully used to suggest new relation instances to the user of an application, helping users in inserting new information in the database and also in identifying possible inconsistencies and errors in it.

As previously mentioned, we plan on integrating the proposed engine with adaptive hypermedia applications so we can use both semantic information from the domain and the user profile to perform adaptations. We also intend to investigate in more detail the applicability and utility of suggesting links for relations that do not exist in the conceptual model of the application and the results provided by it. In addition, we are also working on additional refinements in the proposed engine, experimenting with alternative functions, and other forms of exploiting semantic information.

Acknowledgement. The research presented in this paper was partly funded by scholarships from PUC-Rio and CNPq, and research grants from CNPq and FAPERJ. We also want to thank LES/LAC, TecWeb and Milestone laboratories for providing the necessary infrastructure for developing this work.

7. References

[1] Brusilovsky, P.: Efficient techniques for adaptive hypermedia. Intelligent Hypertext: Adaptive techniques for the World Wide Web. C. Nicholas and J. Mayfield, Eds., Lecture Notes in Computer Science, vol. 1326, Berlin: Springer-Verlag, 1997, pp. 12-30.
[2] Chen, H., and Ng, T.: An Algorithmic Approach to Concept Exploration in a Large Knowledge Network (Automatic Thesaurus Consultation): Symbolic Branch-and-Bound vs. Connectionist Hopfield Net Activation. Journal of the American Society for Information Science, 46(5):348-369, 1995.
[3] Cohen, P., and Kjeldsen, R.: Information Retrieval by Constrained Spreading Activation on Semantic Networks. Information Processing and Management, 23(4):255-268, 1987.
[4] Crestani, F.: Application of Spreading Activation Techniques in Information Retrieval. Artificial Intelligence Review, 11(6):453-482, 1997.
[5] Crestani, F., and Lee, P.L.: Searching the Web by Constrained Spreading Activation. Information Processing & Management, 36(4):585-605, 2000.
[6] O’Hara, K., Alani, H., and Shadbolt, N.: Identifying Communities of Practices: Analyzing Ontologies as Networks to Support Community Recognition. IFIP-WCC 2002, Montreal, 2002, Kluwer.
[7] Rocha, C., Schwabe, D., and Poggi, M.: A Hybrid Approach for Searching in the Semantic Web. To appear in Proceedings of the WWW2004 Conference, NY, May 2004. Available at http://www2004.org/proceedings/docs/1p374.pdf.
[8] Stojanovic, N., Studer, R., and Stojanovic, L.: An Approach for the Ranking of Query Results in the Semantic Web. Proc. of ISWC '03 (Sanibel Island, FL, October 2003), Springer-Verlag, pp. 500-516.
[9] Baeza-Yates, R., and Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, New York, USA, 1999.