
Subj3ct – A Subject Identity Resolution Service

Graham Moore and Khalil Ahmed

Networked Planet Ltd, Innovation House, Mill Street, Oxford, UK


{graham.moore,khalil.ahmed}@networkedplanet.com

Abstract. Subj3ct is an online service that provides a resolution capability
between identifiers for subjects and the web addresses of encoded
statements about a subject. This paper explains the underlying concepts
and protocols of the service and discusses different usage scenarios.

1 Introduction
Subj3ct [1] is an online Subject Identity Resolution service. Subject Identifiers [2]
are URIs that are used to unambiguously identify some ‘notion’, ‘concept’ or
‘subject’ about which people want to make statements.
The service stores identifiers for subjects and the web addresses of repre-
sentations of statements about a subject. Clients of the service can look up a
subject identity record so that they can choose an appropriate identifier to use
in an application or to look up where on the web they can find more information
about a given subject. All information stored in Subj3ct comes from some source
and that source is recorded. Clients can make use of this provenance information
to decide what identifiers they want to trust and use.
Subj3ct is a registry service for subject identifiers for the Semantic Web. The
goal is to provide the functionality required to realize the vision of the Semantic
Web or Web 3.0. Using the Subj3ct portal and APIs it is possible to register
XML feeds that provide information and identifiers for the concepts that people
or applications are interested in and to connect identifiers with other identifiers
created by other people for the same concepts. By enabling the sharing and
publication of subject identifiers on the web, Subj3ct is a platform that enables
communication between applications and human-beings across the Web.

2 What are Subjects?


Subjects are the corner-stone of the Semantic Web or Web 3.0 as it has been
called. Quite simply, a Subject is absolutely anything you want to talk about –
a person, a place, a piece of music, a feeling . . . anything at all. What makes
a subject special is that every subject has its own identifier. An identifier is
simply a unique URL (a web address) that Web 3.0 applications use when they
want to refer to that Subject.
Although the basic concept is very simple, using subjects successfully in Web 3.0
involves three significant challenges:

Maicher, L.; Garshol, L. M. (Eds.): Linked Topic Maps. Fifth International Conference on Topic
Maps Research and Applications, TMRA 2009 Leipzig, Germany, November 12–13, 2009 Revised
Selected Papers. Leipziger Beiträge zur Informatik. ISBN 978-3-941608-06-1

1. Any number of different identifiers could be created for the same subject. The
creators of identifiers need some way to agree that their different identifiers
actually identify the same Subject, and users of identifiers need to be able to
see all the identifiers that identify a given Subject in one place. Subj3ct allows
users to assert that an identifier has another equivalent identifier and uses
trust metrics to give an indication of how much those declarations should be
trusted.
2. Identifiers by themselves are not useful. Just being able to identify a subject
is not quite enough for an application to be useful. To be truly useful, it
should be possible to use the identifier to find resources that talk about that
subject. Subj3ct allows users to specify the web pages, pictures, RDF files,
topic maps or other Internet resources that provide information relevant to
a particular subject.
3. Identifiers need to be trusted. Before using an identifier, assuming that two
identifiers really do talk about the same Subject, or following a suggested
resource for an identifier, a user needs to have some level of trust in the
identifiers being used and the information they are given about those identifiers.
Subj3ct provides users with a simple system of trust metrics based on
identifying the users who can create identifiers under specific web domains and
on the correlation of statements made by one user with those made by other users.

3 Subj3ct Glossary
The following is a short glossary of some key terms that are central to the Subj3ct
service.

Feed
A Feed is a web resource that provides information about subjects that can be
crawled and parsed by the Subj3ct crawler. Currently we support syndication
feeds that use the ATOM [3] syntax with some Subj3ct-specific extensions to it
and SKOS [4] RDF/XML. As Subj3ct develops we will gradually add support for
other types of information resources. Any user can register a feed with Subj3ct -
feeds that have been suggested anonymously are given the lowest level of trust;
feeds that are registered by a user signed in with a Subj3ct account are given
higher trust; and feeds registered by a Registered Publisher are given the highest
trust level.

Registered Publisher
A Registered Publisher is an individual or organization that has registered with
Subj3ct and claimed control over a particular Namespace. Feeds registered with
Subj3ct by a Registered Publisher that are contained under the Namespace that
the publisher owns and that provide information about the identifiers that are
contained under the same Namespace are accorded the highest level of trust.

Equivalent Subject Identifier


Any Subject Identifier can have any number of equivalent subject identifiers.
Equivalence here simply means that the two identifiers are used to identify
the same subject. To prevent accidental or intentional hijacking of identifiers,
Subj3ct only records statements of equivalence that come from the same feed
that provided the subject identifier. When a feed declares that identifier X is
equivalent to identifier Y, the feed that provided identifier Y may reciprocate
with a declaration that identifier Y is equivalent to identifier X. Alternatively
the owner of identifier Y may not agree that X and Y identify the same concept,
in which case they are under no obligation to make the reciprocal statement.
The statement that identifier X is equivalent to identifier Y is seen as an
outgoing equivalence for identifier X, meaning that X is the subject that the
equivalence statement is about. The same statement is seen as an incoming
equivalence for identifier Y. When you view a record for a subject, the equivalence
statements it participates in are divided into incoming and outgoing equivalences.
Note that while the person who registered the identifier has
control over the outgoing equivalence statements, they have no control over the
incoming statements (this is a bit like a page on the Web - the creator of the
page has control over where that page links out to, but has no control over who
links back to the page).
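The distinction between outgoing and incoming equivalences can be illustrated with a small sketch. This is not Subj3ct's internal data model: the `Equivalence` tuple, the function names and the `example.org` identifiers are all hypothetical, chosen only to show how one set of statements is viewed from either side.

```python
from collections import namedtuple

# Hypothetical representation of one equivalence statement:
# "source is equivalent to target", as declared by the given feed.
Equivalence = namedtuple("Equivalence", ["source", "target", "feed"])


def split_equivalences(identifier, statements):
    """Return (outgoing, incoming) equivalence statements for an identifier."""
    outgoing = [s for s in statements if s.source == identifier]
    incoming = [s for s in statements if s.target == identifier]
    return outgoing, incoming


def is_reciprocated(statement, statements):
    """True if some statement declares the reverse equivalence."""
    return any(
        s.source == statement.target and s.target == statement.source
        for s in statements
    )


# Illustrative identifiers only; they are not registered with Subj3ct.
X = "http://example.org/psi/x"
Y = "http://example.net/psi/y"
STATEMENTS = [Equivalence(X, Y, "http://example.org/psi/feed.atom")]
```

With only the statement X to Y recorded, it is outgoing for X, incoming for Y, and not reciprocated until the feed that provides Y declares the reverse.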

Subject
A subject is any concept that you want to talk about or share between people
or applications. Subjects are identified using Subject Identifiers.

Subject Identifier
A Subject Identifier is a unique URI that is assigned to a particular Subject.
A Subject Identifier can have any number of Equivalent Subject Identifiers and
any number of Subject Web Addresses associated with it.

Subject Web Address


A Subject Web Address is a resource on the web that provides information
relevant to a particular subject. A Subject Web Address is always provided
in connection with a Subject Identifier which identifies the Subject that the
resource is related to. A single resource may be relevant to many subjects, but
the best resources are those that are related only to a single subject (such as
an encyclopedia entry that describes that subject in some detail, or a series of
RDF statements about the subject) - however, because a single Subject may
have many Subject Identifiers, it is possible that the same Subject Web Address
is provided for a number of Subject Identifiers. Any user can provide a Subject
Web Address for a Subject Identifier, but a reasonable level of trust is assigned
only when the user is a Registered Publisher of either the Subject Identifier or
the Subject Web Address.

Subject Identity Record


A subject identity record is the data structure returned by Subj3ct for a given
subject identifier. This record can be accessed by people via the portal or by
machine using the API. The record contains information about the subject in-
cluding the identifier, equivalent identifiers and links to web resources that are
said to contain more information about the subject. It is intended that these
records can be cached and maintained by intermediate resolution services similar to
Subj3ct.

4 The Subj3ct Feed Processing Service


Subj3ct acts as a registry and clearinghouse service for subjects. It periodically
crawls sources of subject information that have been registered either by the
publisher of that information or by other interested parties and imports the
identifiers used for subjects, any statements of equivalence between subjects and
any links provided to related resources for subjects.
The main format for publishers to expose subject information is to publish
it in an ATOM feed. The following fragment shows a single ATOM entry being
used to express the identity of a subject and also linking to equivalent subjects
and to online resources about the subject.
<entry>
  <title>Hierarchical Relation Type</title>
  <id>http://www.techquila.com/psi/hierarchy/index.atom/1</id>
  <updated>2003-12-04T14:39:10Z</updated>
  <summary>
    A type of association type. Associations which are typed by
    a topic which is an instance of this type represent
    a parent-child relationship between two or more topics.
  </summary>
  <link rel="SubjectIdentifier"
        href="http://www.techquila.com/psi/hierarchy/#hierarchical-relation-type" />
  <link rel="SubjectRepresentation"
        href="http://www.techquila.com/tmsinia4.html" />
  <link rel="SubjectEquivalence"
        href="http://www.networkedplanet.com/psi/hierarchy/hierarchical-relation-type" />
</entry>
Subj3ct doesn’t require identifier publishers to use any ATOM XML extensions;
instead the semantics of each link are indicated through the ‘rel’ attribute. The
allowed values are ‘SubjectIdentifier’, ‘SubjectRepresentation’, and ‘Subject-
Equivalence’.
A feed can contain many entries and each entry results in information about
a subject being stored in the Subj3ct database.
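As an illustration of how a consumer might read such a feed, the following minimal Python sketch groups the links of each entry by their ‘rel’ value. It is not the Subj3ct crawler: the function name `read_subjects` and the sample data are hypothetical, and it assumes the enclosing feed declares the standard Atom namespace of RFC 4287, which the fragment above does not show.

```python
import xml.etree.ElementTree as ET

# Assumption: the feed uses the standard Atom namespace (RFC 4287).
ATOM = "{http://www.w3.org/2005/Atom}"
# The three 'rel' values that carry Subj3ct semantics.
SUBJ3CT_RELS = ("SubjectIdentifier", "SubjectRepresentation", "SubjectEquivalence")

SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Hierarchical Relation Type</title>
    <link rel="SubjectIdentifier"
          href="http://www.techquila.com/psi/hierarchy/#hierarchical-relation-type" />
    <link rel="SubjectRepresentation"
          href="http://www.techquila.com/tmsinia4.html" />
    <link rel="SubjectEquivalence"
          href="http://www.networkedplanet.com/psi/hierarchy/hierarchical-relation-type" />
  </entry>
</feed>"""


def read_subjects(feed_xml):
    """Return one dict per entry, with link hrefs grouped by 'rel' value."""
    root = ET.fromstring(feed_xml)
    subjects = []
    for entry in root.iter(ATOM + "entry"):
        links = {rel: [] for rel in SUBJ3CT_RELS}
        for link in entry.findall(ATOM + "link"):
            rel, href = link.get("rel"), link.get("href")
            # Links with any other 'rel' value are ignored.
            if rel in links and href:
                links[rel].append(href)
        subjects.append({
            "title": entry.findtext(ATOM + "title", default=""),
            "links": links,
        })
    return subjects
```

Calling `read_subjects(SAMPLE_FEED)` yields a single record whose `links` dictionary holds one URI under each of the three ‘rel’ values.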

This imported information is then given a trust score based on the identity of
the person who registered the information source, the URI of the subject infor-
mation source itself and the URIs used for the identifiers and related resources.
This provenance information is a key aspect of Subj3ct as it allows for webs
of mutual trust to be asserted between co-operating parties while still exposing
identifiers and equivalences in an open world environment.

4.1 Trust Scores and Identifiers in Subj3ct

In Subj3ct, all statements about identifiers are assigned a basic Trust Score. The
Trust Score provides a rough measure of how likely it is that the creator of an
identifier agrees with any statement being made about the identifier. NOTE:
The Trust Score does not tell you if you should trust the creator of an identifier
or the individual who creates statements about that identifier, nor does it assert
anything about the content that might be found if you follow the URLs contained
in those statements.
The Trust Score for a statement is calculated based on the URL of the feed
that contains the statement, and the URLs contained in the statement itself. The
highest score is assigned if all the URLs contained in the statement are contained
within the same parent directory as the URL of the feed itself. A lower score
is assigned if the URLs contained in the statement are all contained within the
same DNS sub-domain as the feed itself. The lowest score is assigned if the URLs
contained in the statement are from different domains or sub-domains from the
feed. In addition, much higher trust scores are assigned to statements from feeds
that are registered through an active Subj3ct user account. At the other end
of the scale, zero trust is assigned to statements that come from a feed suggested
anonymously and that refer to URLs in a different domain or sub-domain than the
feed itself.
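The URL-based part of this rule can be sketched as follows. This is only an approximation under stated assumptions: "same parent directory" is read as a path-prefix match and "same DNS sub-domain" as an identical host name, the tier labels are hypothetical, and no numeric scores are computed because the actual values are not given here. The adjustments for registered accounts and anonymous suggestions are not modelled.

```python
import posixpath
from urllib.parse import urlsplit


def trust_tier(feed_url, statement_urls):
    """Classify a statement by where its URLs sit relative to the feed URL.

    Returns one of three hypothetical labels, ordered from most to least
    trusted: "same-directory", "same-subdomain", "different-domain".
    """
    feed = urlsplit(feed_url)
    feed_host = (feed.hostname or "").lower()
    # Directory that contains the feed document, always ending in "/".
    prefix = posixpath.dirname(feed.path).rstrip("/") + "/"

    parts = [urlsplit(u) for u in statement_urls]
    same_host = all((p.hostname or "").lower() == feed_host for p in parts)
    if same_host and all(p.path.startswith(prefix) for p in parts):
        return "same-directory"
    if same_host:
        return "same-subdomain"
    return "different-domain"
```

For a feed assumed to live at `http://www.techquila.com/psi/hierarchy/index.atom`, an identifier under `/psi/hierarchy/` on the same host falls in the top tier, while a statement that also mentions a `networkedplanet.com` URL falls in the bottom tier.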

5 The Subj3ct API

As well as a portal that can be used by humans to search for identifiers, Subj3ct
also provides an API for accessing information about subjects.
The API consists of the following operations:

Get Subject Identity Record By Subject Identifier

The format for this request is

http://api.subj3ct.com/subjects?
identifier={identifierUri}
&format={format}

The request will return a SubjectIdentityRecord for the given subject identifier
in a format specified in the optional format parameter.

Search for Subject Identity Records by Web Resource

The format for this request is

http://api.subj3ct.com/subjects/webaddresses?
uri={representationUri}
&skip={skip}
&take={take}
&format={format}

The request will return a search result. Subjects are included in the result if they
have a corresponding web resource whose URI matches the URI specified. The
search result is returned in the format specified in the optional format parameter.

Search for Subject Identity Records by Identifier URI

The format for this request is

http://api.subj3ct.com/subjects/identifiers?
uri={uri}
&skip={skip}
&take={take}
&format={format}

The request will return a search result. Subjects are included in the result if the
start of their subject identifier matches the query term provided. The take and
skip parameters allow for accessing paged results.

Search for Subject Identity Records by Text Query

The format for this request is

http://api.subj3ct.com/subjects/search?
query={query}
&skip={skip}
&take={take}
&format={format}

The request will return a search result. Subjects are included in the result if the
name, description or any part of their subject identifier matches the query term
provided. The take and skip parameters allow for accessing paged results.
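The four request formats above can be assembled with a few lines of Python. The sketch below only builds URLs and makes no network calls; the helper names are hypothetical, and since the accepted values of the format parameter are not listed here, any value passed as `fmt` is an assumption.

```python
from urllib.parse import urlencode

API_BASE = "http://api.subj3ct.com"


def _url(path, params):
    """Build a request URL, leaving out optional parameters that are None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return "%s%s?%s" % (API_BASE, path, query)


def record_url(identifier, fmt=None):
    """Get Subject Identity Record By Subject Identifier."""
    return _url("/subjects", {"identifier": identifier, "format": fmt})


def by_web_address_url(uri, skip=None, take=None, fmt=None):
    """Search for Subject Identity Records by Web Resource."""
    return _url("/subjects/webaddresses",
                {"uri": uri, "skip": skip, "take": take, "format": fmt})


def by_identifier_prefix_url(uri, skip=None, take=None, fmt=None):
    """Search for Subject Identity Records by Identifier URI (prefix match)."""
    return _url("/subjects/identifiers",
                {"uri": uri, "skip": skip, "take": take, "format": fmt})


def by_text_url(query, skip=None, take=None, fmt=None):
    """Search for Subject Identity Records by Text Query."""
    return _url("/subjects/search",
                {"query": query, "skip": skip, "take": take, "format": fmt})
```

Paging through results then amounts to repeating a search with an increasing `skip` value and a fixed `take`, for example `by_text_url("Borge", skip=10, take=10)` for a second page of ten.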
The structure of the subject identity record is defined using XML Schema;
the following is an example record:

<Subject
    xmlns="http://subj3ct.com/schema/data"
    xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
  <Identifier>http://psi.ontopedia.net/Axel_Borge</Identifier>
  <Provenance>
    http://www.garshol.priv.no/tmphoto/subj3ct.atom
  </Provenance>
  <Name>Axel Borge</Name>
  <RecordLink>
    http://api.subj3ct.com/subjects?identifier=http%3a%2f%2fpsi.ontopedia.net%2fAxel_Borge
  </RecordLink>
  <Trust>25</Trust>
  <EquivalenceStatements />
  <RepresentationStatements>
    <RepresentationStatement>
      <RepresentationUri>
        http://psi.ontopedia.net/Axel_Borge
      </RepresentationUri>
      <Provenance>
        http://psi.ontopedia.net/ontopedia/subject/atom_stream
      </Provenance>
      <Trust>20</Trust>
    </RepresentationStatement>
    <RepresentationStatement>
      <RepresentationUri>
        http://www.garshol.priv.no/tmphoto/person.jsp?id=t83420
      </RepresentationUri>
      <Provenance>
        http://www.garshol.priv.no/tmphoto/subj3ct.atom
      </Provenance>
      <Trust>125</Trust>
    </RepresentationStatement>
  </RepresentationStatements>
</Subject>
The subject identity record (which can also be retrieved in JSON and SKOS
format) can then be used by client applications to mash together semantic data
that is referenced by the representation links based on identifiers for similar
subjects.
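A client might, for instance, read the representation statements from a record and prefer those with the highest trust. The sketch below assumes the XML structure and namespace of the example record; the function name `representations` and the `min_trust` threshold are hypothetical, and treating a higher number as more trusted is an interpretation of the trust score, not a documented contract.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example record.
NS = {"s": "http://subj3ct.com/schema/data"}

# Abridged copy of the example record.
RECORD_XML = """<Subject xmlns="http://subj3ct.com/schema/data">
  <Identifier>http://psi.ontopedia.net/Axel_Borge</Identifier>
  <Name>Axel Borge</Name>
  <Trust>25</Trust>
  <EquivalenceStatements />
  <RepresentationStatements>
    <RepresentationStatement>
      <RepresentationUri>
        http://psi.ontopedia.net/Axel_Borge
      </RepresentationUri>
      <Provenance>http://psi.ontopedia.net/ontopedia/subject/atom_stream</Provenance>
      <Trust>20</Trust>
    </RepresentationStatement>
    <RepresentationStatement>
      <RepresentationUri>http://www.garshol.priv.no/tmphoto/person.jsp?id=t83420</RepresentationUri>
      <Provenance>http://www.garshol.priv.no/tmphoto/subj3ct.atom</Provenance>
      <Trust>125</Trust>
    </RepresentationStatement>
  </RepresentationStatements>
</Subject>"""


def representations(record_xml, min_trust=0):
    """Return (trust, uri, provenance) tuples, highest trust first."""
    root = ET.fromstring(record_xml)
    found = []
    path = "s:RepresentationStatements/s:RepresentationStatement"
    for stmt in root.findall(path, NS):
        # Element text may be padded with whitespace, so strip it.
        uri = stmt.findtext("s:RepresentationUri", default="", namespaces=NS).strip()
        prov = stmt.findtext("s:Provenance", default="", namespaces=NS).strip()
        trust = int(stmt.findtext("s:Trust", default="0", namespaces=NS).strip())
        if trust >= min_trust:
            found.append((trust, uri, prov))
    return sorted(found, key=lambda item: item[0], reverse=True)
```

Filtering with a threshold, as in `representations(RECORD_XML, min_trust=100)`, keeps only the statements a client is prepared to follow; what counts as a sensible threshold is left to the client.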

6 Embedding Subjects
One important way to build the habit of using subject identifiers is to give
content authors and systems developers very simple tools. The API is one such
tool; another is the set of ‘embeddable’ mark-up that Subj3ct provides for
inclusion in things such as blog posts and other published content. The ability
to embed references to subjects as part of content provides authors with a
powerful way to classify content using common vocabulary terms.

The following is an RDFa fragment, using Dublin Core, that can be included
in a page to indicate that the page is about a given subject.
<span xmlns:dc="http://purl.org/dc/elements/1.1/"
      rel="dc:subject"
      href="http://psi.ontopedia.net/Axel_Borge">Axel Borge</span>
The powerful thing about this is that Google and other search engines are
beginning to recognise and index semantic mark-up. This reinforces the cycle of
identifiers, content and usage.
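A publishing tool could generate this mark-up from an identifier and a display name. The following is a minimal sketch, not code provided by Subj3ct; the function name `subject_span` is hypothetical, and the only care taken is to escape both values so that they are safe to place inside HTML.

```python
from html import escape


def subject_span(identifier, label):
    """Return an RDFa span stating that the page is about the subject."""
    return (
        '<span xmlns:dc="http://purl.org/dc/elements/1.1/" '
        'rel="dc:subject" href="%s">%s</span>'
        % (escape(identifier, quote=True), escape(label))
    )
```

Given the identifier `http://psi.ontopedia.net/Axel_Borge` and the label `Axel Borge`, the function reproduces the fragment used in this section on a single line.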

7 Applications for Subjects and Subj3ct


Subjects are the corner-stone of the Semantic Web. They can be used to classify
resources; to identify key entities in text; or to discover relevant information
quickly. Below are three categories of usage.

Resource Classification
The identifiers for subjects are much better ways to say what a piece of content
is about than simple textual tags. Tags are limited because they are usually
language-specific and because it is too easy for the same tag to refer to very
different concepts (consider the tag "Paris", for example: is it referring to a place
or a person?). By comparison, Subject identifiers are URIs, which are inherently
language-neutral and always refer to a particular subject regardless of the
language used to name or describe that subject. Subject identifiers are also more
specific than tags: each one always refers to the subject it was made for, and
because subject identifiers are unique URIs, one identifier cannot be confused
with another.
Identifiers can be used anywhere where traditional tags or controlled vocab-
ularies are used, including in HTML documents; in tagging applications such as
your favourite online photo sharing application; and even in desktop applications
such as word processors.
Classification has traditionally been a manual activity, requiring a human
being to view the content before then picking the terms used to classify that
content from some controlled list. More recently classification has started to be
done (with varying levels of accuracy) by computers. Subject identifiers can be
used to record the results of either manual or automated classification.

Entity Identification
As well as classification of entire resources (such as the tagging of a document
with the key terms that it is ’about’), Subject identifiers can also be used to
record the results of identifying the key entities (things, people, places, events)
discussed in the content of a document. Online services such as OpenCalais [5]
attempt to automatically find these entities in text content and return identifiers
for the entities they find. By registering these identifiers with Subj3ct and
connecting the identifiers returned by these services to other identifiers created
through other means, it is possible not only to label key entities within text but
also to link to many other relevant resources.

Information Discovery
Subj3ct provides an online search form and APIs that allow users and applications
to find subjects either using an identifier gleaned from somewhere (maybe
sent in an email to a user, or attached to a document as classification metadata,
or embedded in content as a result of entity identification); or by searching from
a resource to find the subjects that claim to be relevant to that resource; or even
by searching against the name and description registered for the subject by the
creator of the subject identifier.
The resources related to an identifier can be HTML pages or other human-readable
content, but they can equally be resources that provide information in
RDF [6] or Topic Maps syntax [7]. Semantic Web applications can make use
of the Subj3ct API to quickly discover if there are any useful data sources in
their preferred syntax and use this to enrich the application experience for the
end-user.

8 Current Limitations
A lack of multilingual names for subjects was a major limitation of earlier
versions. At the time of publication this limitation has been addressed, and the
Subj3ct service now allows subject identifier authors to register multiple names
for a subject.
The ability to remove an identifier is not currently offered. This is partly
because we think that once an identifier comes into existence it may be in use,
so it makes no sense to retract it. Its description and name can be updated.
A general limitation is that while Subj3ct now contains many identifiers and
equivalences, it still lacks tools for easily using the service in editorial or
information-finding contexts. This is a focus for future work.

9 Relationship to Other Work


One obvious comparison can be made with the ‘sameAs’ service [8]. This service
aggregates identifiers for resources, and equivalent ones as defined using
owl:sameAs [9]. The main differences between sameas.org and Subj3ct are:
1. Subj3ct distinguishes identifiers for subjects and identifiers for the addresses
of statements about those subjects.
2. Subj3ct records who registered the identifier, allowing clients to filter
equivalent identifiers.
3. Subj3ct puts a trust score onto an identifier based on the domain from which
it was registered.
4. Subj3ct works at two levels: identifiers, and addresses of resources that
contain statements about the identified subject. In this sense it provides a
DNS [10] for the semantic web, whereas sameAs only provides unregulated resource
equivalence capabilities.

10 Conclusion
Subj3ct is a key service in providing a resolution mechanism between subject
identifiers, equivalent identifiers and web resources that contain further
statements about a subject.

References
1. Subj3ct.com, http://www.subj3ct.com
2. Graham Moore, Lars Marius Garshol (eds), Topic Maps Data Model,
http://www.isotopicmaps.org/sam/sam-model
3. The Atom Syndication Format, RFC 4287, http://atompub.org/rfc4287.html
4. SKOS (Simple Knowledge Organization System),
http://www.w3.org/2004/02/skos
5. OpenCalais, http://www.opencalais.com
6. RDF/XML Syntax Specification, http://www.w3.org/TR/rdf-syntax-grammar
7. XTM Topic Maps Syntax, http://www.isotopicmaps.org/sam/sam-xtm
8. sameAs service, http://sameas.org
9. OWL Web Ontology Language, http://www.w3.org/TR/owl-ref
10. DNS, http://en.wikipedia.org/wiki/Domain_Name_System
